Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter BAM output of UMI-tools dedup before passing to Salmon quant #828

Closed
lars-work-sund opened this issue May 20, 2022 · 3 comments
Closed
Labels
bug Something isn't working
Milestone

Comments

@lars-work-sund
Copy link

lars-work-sund commented May 20, 2022

Description of the bug

When using UMIs salmon reports millions of:
WARNING: Detected suspicious pair --- The names are different:

This is because UMI-tools includes unpaired and chimeric reads. This leads to orphan reads.

I believe setting the flags:
--unpaired-reads=discard
and
--chimeric-pairs=discard

Should solve the issue

Command used and terminal output

No response

Relevant files

No response

System information

No response

@lars-work-sund lars-work-sund added the bug Something isn't working label May 20, 2022
@lars-work-sund lars-work-sund changed the title UMI-tools renaming breaks name sorting UMI-tools breaks salmon quant May 20, 2022
@lars-work-sund
Copy link
Author

The problem seems more complicated than that. It seems to be the issue discussed here

@drpatelh
Copy link
Member

Running this command as mentioned in the linked issue above seems to fix the problem:

umi_tools prepare-for-rsem -I WT_REP1.umi_dedup.transcriptome.bam --stdout=test.bam

The log files for Salmon no longer chucks out warnings for unpaired reads so I assume the reads are being handled correctly:

[2022-05-20 11:34:37.622] [jointLog] [info] setting maxHashResizeThreads to 2
[2022-05-20 11:34:37.622] [jointLog] [info] Fragment incompatibility prior below threshold.  Incompatible fragments will be ignored.
[2022-05-20 11:34:37.622] [jointLog] [info] numQuantThreads = 2
[2022-05-20 11:34:38.114] [fileLog] [info] quantification processed 0 fragments so far

[2022-05-20 11:34:38.114] [fileLog] [info] quantification processed 53169 fragments so far

[2022-05-20 11:34:37.990] [jointLog] [warning] Transcript Gfp_transgene_gene appears in the reference but did not appear in the BAM
[2022-05-20 11:34:37.990] [jointLog] [info] replaced 0 non-ACGT nucleotides with random nucleotides
[2022-05-20 11:34:38.714] [jointLog] [info] Thread saw mini-batch with a maximum of 2.20% zero probability fragments
[2022-05-20 11:34:38.716] [jointLog] [info] Thread saw mini-batch with a maximum of 2.00% zero probability fragments
[2022-05-20 11:34:38.716] [jointLog] [info] 


Completed first pass through the alignment file.
Total # of mapped reads : 53169
# of uniquely mapped reads : 47590
# ambiguously mapped reads : 5579



[2022-05-20 11:34:38.722] [jointLog] [info] Computed 123 rich equivalence classes for further processing
[2022-05-20 11:34:38.722] [jointLog] [info] Counted 52389 total reads in the equivalence classes 
[2022-05-20 11:34:38.723] [jointLog] [warning] Only 53169 fragments were mapped, but the number of burn-in fragments was set to 5000000.
The effective lengths have been computed using the observed mappings.

[2022-05-20 11:34:38.723] [jointLog] [info] starting optimizer
[2022-05-20 11:34:38.724] [jointLog] [info] Marked 0 weighted equivalence classes as degenerate
[2022-05-20 11:34:38.724] [jointLog] [info] iteration = 0 | max rel diff. = 444.782
[2022-05-20 11:34:38.728] [jointLog] [info] iteration = 100 | max rel diff. = 1.47645e-16
[2022-05-20 11:34:38.728] [jointLog] [info] finished optimizer
[2022-05-20 11:34:38.728] [jointLog] [info] writing output
[2022-05-20 11:34:39.127] [jointLog] [info] Computing gene-level abundance estimates
[2022-05-20 11:34:39.135] [jointLog] [warning] Feature has no GFF ID
[2022-05-20 11:34:39.135] [jointLog] [info] There were 124 transcripts mapping to 124 genes
[2022-05-20 11:34:39.135] [jointLog] [info] NOTE: We recommend using tximport (https://bioconductor.org/packages/release/bioc/html/tximport.html) for aggregating transcript-level salmon abundance estimates to the gene level.  It is more versatile, exposes more features, and allows considering multi-sample information during aggregation.
[2022-05-20 11:34:39.135] [jointLog] [info] Aggregating expressions to gene level
[2022-05-20 11:34:39.136] [jointLog] [info] done

@drpatelh drpatelh changed the title UMI-tools breaks salmon quant Filter BAM output of UMI-tools dedup before passing to Salmon quant May 23, 2022
drpatelh added a commit to drpatelh/nf-core-rnaseq that referenced this issue May 23, 2022
drpatelh added a commit to drpatelh/nf-core-rnaseq that referenced this issue May 23, 2022
@drpatelh drpatelh added this to the 3.8 milestone May 23, 2022
@drpatelh
Copy link
Member

drpatelh commented May 24, 2022

As a temporary solution, I have copied the prepare-for-rsem.py script directly from the UMI-tools repo into the bin/ directory of the pipeline. I have attempted to provide the appropriate credits at the top of the script. This will allow us to get a release of the pipeline out without waiting for UMI-tools and containers to be released. When this does happen we can remove the script from the pipeline and use it directly from the official package.

I have created a separate issue to track this for future releases of the pipeline: #831

Fixed in drpatelh@2e3e668 drpatelh@d5034fa

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants