umi_tools dedup : Run before salmon to dedup counts #576

Closed
jryge opened this issue Mar 5, 2021 · 5 comments
Labels
bug Something isn't working
Comments

@jryge

jryge commented Mar 5, 2021

No description provided.

@jryge jryge added the bug Something isn't working label Mar 5, 2021
@jryge
Author

jryge commented Mar 5, 2021

Additional description of issue
I have single-end reads, where the UMIs are part of the index. With bcl2fastq I get one fastq file with the reads and one with the UMIs. To make this compatible with umi_tools in the rnaseq pipeline, I added the UMI sequences to the beginning of the reads. This seems to work: the pipeline completes, and umi_tools extracts and dedups the reads (though "umi_tools dedup ... *.bam" takes a VERY long time, ~24h). The issue for me is that the quantification of the reads with salmon seems to be done on the original bam files (from the STAR alignment) and not the de-duplicated ones. I compared the gene counts to a run on the same data without activating the umi part, and they are practically identical (apart from some occasional minor rounding errors)...
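For anyone with the same bcl2fastq layout, here is a minimal sketch of the "add the UMI to the beginning of the read" step described above. It is not the script used in this report: the function names are hypothetical, and it assumes plain 4-line FASTQ records, both files in the same order, and read names that match up to the first space.

```python
from itertools import islice, zip_longest
from typing import Iterable, Iterator, List


def _records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group an iterable of FASTQ lines into 4-line records (hypothetical helper)."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def prepend_umis(read_lines: Iterable[str], umi_lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTQ lines in which every read sequence starts with its UMI."""
    for read, umi in zip_longest(_records(read_lines), _records(umi_lines)):
        if read is None or umi is None:
            raise ValueError("read and UMI files hold different numbers of records")
        header, seq, plus, qual = read
        umi_header, umi_seq, _, umi_qual = umi
        # Compare read names only (text before the first space) to catch out-of-sync files.
        if header.split(" ", 1)[0] != umi_header.split(" ", 1)[0]:
            raise ValueError(f"record mismatch: {header!r} vs {umi_header!r}")
        # The quality string has to grow by the same amount as the sequence.
        yield from (header, umi_seq + seq, plus, umi_qual + qual)
```

The two arguments can be open file handles (for example from gzip.open(path, "rt")), so the files are streamed rather than loaded whole. The combined reads can then go through umi_tools extract with a --bc-pattern made of as many N characters as the UMI is long.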

Solution
It seems like the salmon read count quantification is done on the STAR alignments prior to deduplication, while it should be done on the de-duplicated bam file (from umi_tools dedup). A "simple" reorder of the workflow should do the trick.

@drpatelh drpatelh added this to the 3.1 milestone Apr 11, 2021
@drpatelh
Member

This turned into quite a big job 😅 Salmon wants a BAM sorted by read name and umitools needs a BAM sorted by co-ordinate with an index. STAR produces the former which was handy to plug directly into Salmon. If we want to use umitools I am going to have to co-ordinate sort the transcriptome BAM from STAR, index, run umitools and then sort it again by name before running Salmon! Will mean more intermediate BAM files when using UMIs but no way around it I'm afraid 😏

Salmon takes the transcriptome BAM to perform the quantification so ideally we need to umi dedup that BAM file before the counting. However, the genome BAM is used by most downstream steps for the QC so we have to UMI dedup both BAMs separately.
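The order of steps described in this comment (co-ordinate sort, index, umi_tools dedup, name sort, then Salmon) can be sketched as a list of commands. This is only an illustration of the ordering, not what the pipeline actually runs: the function name, the output file names and the "-l A" library type are assumptions, and the same sort/index/dedup steps would be applied separately to the genome BAM.

```python
from typing import List


def dedup_then_quant_commands(transcriptome_bam: str, transcripts_fasta: str, prefix: str) -> List[List[str]]:
    """Build the commands, in order, for UMI dedup of a transcriptome BAM before Salmon.

    All output names are hypothetical and derived from `prefix`.
    """
    coord_bam = f"{prefix}.coord.bam"
    dedup_bam = f"{prefix}.dedup.bam"
    name_bam = f"{prefix}.dedup.namesorted.bam"
    return [
        # umi_tools dedup needs a co-ordinate sorted, indexed BAM.
        ["samtools", "sort", "-o", coord_bam, transcriptome_bam],
        ["samtools", "index", coord_bam],
        ["umi_tools", "dedup", f"--stdin={coord_bam}", f"--stdout={dedup_bam}"],
        # Salmon wants the alignments ordered by read name again.
        ["samtools", "sort", "-n", "-o", name_bam, dedup_bam],
        # Alignment-based Salmon quantification on the de-duplicated BAM.
        ["salmon", "quant", "-t", transcripts_fasta, "-l", "A", "-a", name_bam, "-o", f"{prefix}_salmon"],
    ]
```

Each entry can be passed to subprocess.run(cmd, check=True) in turn. Every step writes a new BAM, which is where the extra intermediate files mentioned above come from.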

@drpatelh
Member

Fixed in #593

Tricky to test this because I don't have any UMI data @jryge, but it would be great if you could make sure that the UMIs are being de-duplicated as expected when passed through the various steps in the pipeline.
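One rough way to run that check is the same comparison used in the original report: quantify the data once with and once without UMI handling and see how many genes change. The sketch below assumes the per-gene counts have already been loaded into dictionaries; the function name and the tolerance of 0.5 (to ignore rounding differences) are assumptions, not part of the pipeline.

```python
from typing import Dict


def fraction_of_genes_changed(
    counts_without_dedup: Dict[str, float],
    counts_with_dedup: Dict[str, float],
    tolerance: float = 0.5,
) -> float:
    """Return the fraction of genes whose count differs by more than `tolerance`.

    Genes missing from one run are treated as having a count of 0 there.
    """
    genes = set(counts_without_dedup) | set(counts_with_dedup)
    if not genes:
        return 0.0
    changed = sum(
        1
        for gene in genes
        if abs(counts_without_dedup.get(gene, 0.0) - counts_with_dedup.get(gene, 0.0)) > tolerance
    )
    return changed / len(genes)
```

A value close to 0, as in the original report, suggests that quantification is still running on alignments that were not de-duplicated. A clearly non-zero value does not prove the dedup is correct, only that it reached the counts.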

@jryge
Author

jryge commented Apr 15, 2021

Great, I'll give it a spin to see if it works. These seemingly trivial issues often turn out to be more complicated due to all the dependencies of the different tools...

Thanks for taking the time!

@drpatelh
Member

Awesome. Thanks!
