Soft clipped reads are not correctly deduplicated #10

Closed
grahamgower opened this issue Dec 21, 2016 · 6 comments

Comments

@grahamgower
Contributor

grahamgower commented Dec 21, 2016

Hi Mikkel,

We've noticed a problem with deduplication after trying out bwa-mem on some of our data. The underlying reason seems to be that PCR duplicates can exhibit different sequencing errors, and thus be soft-clipped at different positions. Deduplication based on the alignment length and CIGAR string is thus problematic. Further, if the reads are soft-clipped at the beginning of a forward-mapping sequence (or the end of a reverse-mapping sequence), the reported mapping positions will also be different. The attached file shows 4 reads which should be marked as duplicates but are not.
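To illustrate the position shift described above, here is a minimal sketch (not the code from the attached patch) that recovers the unclipped 5' coordinate of a read from its SAM `POS` and CIGAR string. The function names are hypothetical, and counting hard clips (`H`) alongside soft clips (`S`) is an assumption of this sketch.

```python
import re

# One CIGAR element: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference
CLIPPING = set("SH")          # assumption: hard clips are counted like soft clips


def parse_cigar(cigar):
    """Split a CIGAR string such as '5S45M' into [(5, 'S'), (45, 'M')]."""
    return [(int(length), op) for length, op in CIGAR_RE.findall(cigar)]


def clipped_prefix(ops):
    """Total number of clipped bases at the start of a list of CIGAR ops."""
    total = 0
    for length, op in ops:
        if op not in CLIPPING:
            break
        total += length
    return total


def unclipped_five_prime(pos, cigar, is_reverse):
    """Return the 1-based 5' coordinate the read would have without clipping.

    pos is the SAM POS field: the leftmost aligned (unclipped) base.
    """
    ops = parse_cigar(cigar)
    if not is_reverse:
        # Forward strand: the 5' end is on the left, so subtract leading clips.
        return pos - clipped_prefix(ops)
    # Reverse strand: the 5' end is on the right, so take the alignment end
    # and add the trailing clips.
    ref_length = sum(length for length, op in ops if op in REF_CONSUMING)
    return pos + ref_length - 1 + clipped_prefix(ops[::-1])
```

With this, two forward reads reported at `POS` 100 (`50M`) and 103 (`3S47M`) both resolve to the same 5' coordinate, even though their reported positions and CIGAR strings differ.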

-Graham

z.txt

@grahamgower
Contributor Author

Maybe something like the attached patch will do the trick (it needs testing)...

softclip-dedup.patch.txt

@grahamgower
Contributor Author

Oops, try this one instead.

softclip-dedup.patch2.txt

@MikkelSchubert
Owner

Hey Graham,

Thank you very much for the report and for the patches!

While I see the problem, I am not sure that treating clipped bases as part of the alignment (which they are not, by definition) is something that you can rely on. The possibility of clipped reads extending past the contig termini also poses some questions, since the clipped bases could, potentially, map to different contigs (or not at all).

I have investigated this briefly, and between Picard MarkDuplicates and SAMTools rmdup, Picard appears to follow your strategy while SAMTools does not. I intend to look into this further, as time permits, and potentially add your patch in an upcoming update.
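As a rough sketch of what grouping on unclipped coordinates involves, and of the contig-terminus concern raised above, the snippet below groups reads by contig, strand, and unclipped 5' coordinate, and reports keys whose coordinate falls outside the contig. It is neither Picard's nor PALEOMIX's implementation; the tuple layout, function name, and `contig_lengths` mapping are hypothetical.

```python
from collections import defaultdict


def group_by_unclipped(reads, contig_lengths):
    """Group reads that share contig, strand, and unclipped 5' coordinate.

    reads: iterable of (name, contig, is_reverse, unclipped_five_prime),
           with 1-based coordinates (hypothetical layout for this sketch).
    contig_lengths: mapping of contig name to length in bases.

    Returns (groups, off_contig), where off_contig holds the keys whose
    unclipped coordinate lies before the first or after the last base.
    """
    groups = defaultdict(list)
    for name, contig, is_reverse, five_prime in reads:
        groups[(contig, is_reverse, five_prime)].append(name)

    off_contig = {
        key for key in groups
        if key[2] < 1 or key[2] > contig_lengths[key[0]]
    }
    return dict(groups), off_contig


reads = [
    ("r1", "chr1", False, 100),
    ("r2", "chr1", False, 100),  # same key as r1: duplicate candidate
    ("r3", "chr1", False, -2),   # clipped bases extend past the contig start
    ("r4", "chr1", True, 100),   # same coordinate, opposite strand
]
groups, off_contig = group_by_unclipped(reads, {"chr1": 1000})
```

Keys in `off_contig` are the cases where the clipped bases cannot be assumed to belong to the same contig, so a tool may want to handle them separately rather than rely on the extrapolated coordinate.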

Best,
Mikkel

@MikkelSchubert
Owner

Hey Graham,
I apologize for taking so long, but unfortunately I have been busy wrapping up several projects these last few months.

I have been working on an updated version of the script, inspired by your patch, which I intend to finalize by next week. The current version is attached, in case you want to look at it now.

Best,
Mikkel

rmdup_collapsed_softclip.txt

@grahamgower
Contributor Author

Hi Mikkel,

I've looked over your new code and it looks right. I've not tested it much though. Do you have a test framework that you use for such things?

-G

@MikkelSchubert
Owner

Hi Graham,
Unfortunately I do not yet have a framework for automatically testing that, though it is something I am interested in implementing. So I have been manually carrying out tests on various datasets, big and small, to ensure that the new version of rmdup_collapsed performs as expected. Long story short, I have now released a version of PALEOMIX (v1.2.9) that includes the improved script, which should address this issue.

Once again thank you for reporting this issue, and apologies for taking so long. Do not hesitate to open additional issues if you should run into other problems with PALEOMIX.

Cheers
