Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add UMI tools #435

Merged
merged 17 commits into from
Jul 27, 2020
Merged

Add UMI tools #435

merged 17 commits into from
Jul 27, 2020

Conversation

grst
Copy link
Member

@grst grst commented Jul 6, 2020

Adds a umi_tools extract step before trimming and a umi_tools dedup step before QC/quantification.

PR checklist

  • PR is to dev rather than master
  • This comment contains a description of changes (with reason)
  • If you've fixed a bug or added code that should be tested, add tests!
  • If necessary, also make a PR on the nf-core/rnaseq branch on the nf-core/test-datasets repo
  • Ensure the test suite passes (nextflow run . -profile test,docker).
  • Make sure your code lints (nf-core lint .).
  • Documentation in docs is updated
  • CHANGELOG.md is updated
  • README.md is updated

Learn more about contributing: https://github.com/nf-core/rnaseq/tree/master/.github/CONTRIBUTING.md

@apeltzer
Copy link
Member

apeltzer commented Jul 7, 2020

Could you have a look at this here too while you're on it ? ;-)

#432

grst added 2 commits July 8, 2020 17:34
The pipeline used to fail with `--skip_rsem` and gzipped reference data.
The reason for this was that the entire code block for decompressing
GTF and FASTA files was only executed because of RSEM requiring
them for building its reference.

By using `--skip_rsem` the references were not extracted
and the process combining normal+additional references
failed with a missing channel.

This commit ensures that the references are unzipped whenever
an additional fasta file is used. I added `--skip_rsem` as
additional test case to the CI script.
*/
if(params.with_umi) {
// preseq does not work on deduplicated BAM file. Pass it the raw BAM file.
bam.into {bam_umitools_dedup; bam_preseq}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit unsure which of the QC tools should be executed on the deduplicated BAM and which on the raw BAM.

@grst grst marked this pull request as ready for review July 9, 2020 11:32
@grst grst requested review from ewels and a team July 9, 2020 11:55
@apeltzer apeltzer requested review from a team and removed request for a team and ewels July 9, 2020 12:34
@ewels
Copy link
Member

ewels commented Jul 9, 2020

Looking great! We've been planning to look into something similar for a while, though our UMIs are part of the barcode read so would be in a third FastQ file per sample. I guess this code is for inline UMIs that are part of the main read - do you see any overlap between this PR and what we will need to implement for barcode UMIs? Just trying to think about future-proofing 😄

@grst
Copy link
Member Author

grst commented Jul 9, 2020

I don't know how exactely you'd run UMI tools with an index read. Possibly one would have to run the extract step twice, once with R1+index and once with R2+index. In any case it should only affect the extract step, the dedup step should remain identical.

@ewels
Copy link
Member

ewels commented Jul 9, 2020

Yeah, I figured we might need a separate step to get the UMI into the read name in the FastQ files. But it's great that you think that the dedup step should work the same.

@amayer21
Copy link
Contributor

This is great!!! I need the UMI tools and was planning to try to add them into the RNAseq pipeline this week :-)
I've cloned this branch and made the corresponding Singularity container to test it on my data hopefully today

@kkamieniecka
Copy link

This is great!!! I need the UMI tools and was planning to try to add them into the RNAseq pipeline this week :-)
I've cloned this branch and made the corresponding Singularity container to test it on my data hopefully today

I have same plan for this week, hope I can make it work on my data

Copy link
Member

@TomKellyGenetics TomKellyGenetics left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks for doing this.

@apeltzer
Copy link
Member

@kmurat1 @amayer21 - once you've successfully tested, can you ping here?

@amayer21
Copy link
Contributor

Sorry I commented on the test I did on the slack channel and didn't copy the comment here... Here it is:

I've given it a try on small fastq file and end up with an error message coming from the use of RSEM and not from UMI tools...

Error executing process > 'merge_rsem_genes (GEO-417-CTRL24_NGS20-N105_AHCVGYDSXY_S443_L003_R1_001_umi_extracted_deduplicated.genes)'

(full job log here: nfcorernaseq_withUMItools_5335509.log)

This message doesn't appear and all goes well when using the option --skip_rsem.

I'm planning to look into it (I have never used RSEM before) but also trying to solve some majors problems with our cluster with our sys-admin, so haven't been able to put enough time on this yet...

@amayer21
Copy link
Contributor

amayer21 commented Jul 15, 2020

So, the error message from the merge_rsem_genes step was coming from the fact that I was processing only one fastq file. It ran without problems when I just duplicated my fastq file and ran it on both together. I guess it's pretty unusual to run a pipeline on only 1 sample... but maybe we still want to fix that just in case somebody needs to do so?

@grst
Copy link
Member Author

grst commented Jul 15, 2020

it should absolutely work with only one sample... do you want to try to fix it?

@amayer21
Copy link
Contributor

I will have a look tomorrow morning. I'm a newbie in NextFlow but really keen to learn and put it into practice.

@amayer21
Copy link
Contributor

How do I need to proceed? I don't think I can commit anything in the PR, can I? Do I need to fork grst:add-umi-tools then ask for a PR on that repository? Or shall I let you know what to change? There are only a couple of changes in one section of main.nf script

@grst
Copy link
Member Author

grst commented Jul 16, 2020

Hi @amayer21,

yes a PR against my fork should work.

@ewels
Copy link
Member

ewels commented Jul 16, 2020

Pasted from Slack for future googlers:

If it's just for small changes, it's best to propose them in comments on the pull request. See the GitHub docs

Basically, make a comment on one or more specific lines, then click the +- icon in the toolbar and write the code that you would like to see. The comment will show a diff which @gregor Sturm or anyone can then easily accept to commit into the PR.

This is what it looks like:

image

@amayer21
Copy link
Contributor

Thanks Phil. As explained on slack, I realised when trying to add comments to the code of this PR that the changes are in codes that hasn't been modified in this PR, so I can't comment it. This mean that the problem I saw when trying this branch is already present on the dev branch of the pipeline.

To summarise:
(1) I've tested the UMI-tools on single-end data and it worked when processing several fastq files
(2) when I was testing with only one input fastq file, it would fail with the default option but work with the--skip_rsem option. This can be fixed by 2 minors changes in the code of the merge_rsem_genes process.
(3) while looking at this section of the code, I've noticed that the outputs of "merge_rsem_genes" is called "rsem_tpm_gene.txt" while it doesn't contain the TPM (transcript per million) column of RSEM output, but the "expected counts" one. I propose to change the name of these outputs (need to be done in the code and in documentation)

So the UMI-tools seems to work to me (at least for single-end reads, I could test paired-end ones tomorrow if needed).
And for the RSEM issues, I can fork the dev branch and make the corrections there. NB: in the meantime I've also made a PR to Gregor's fork, so that could be an alternative (but I did change the name of rsem output without changing the documentation yet so maybe I should finish that first)?

Sorry I'm very new to collaborating on GitHub so I made things more complicated than they were...

Thank you very much for your help!

merge_rsem_gene was crashing when only 1 sample as input of the pipeline => now fixed  
I've also corrected the name of the output file of that same process as the column of RSEM output we keep is the "expected count" column and not the "TPM" one.  
I think it was confusing to call the output "rsem_tpm_gene.txt" as this wasn't "transcripts per million"
@grst
Copy link
Member Author

grst commented Jul 17, 2020

while looking at this section of the code, I've noticed that the outputs of "merge_rsem_genes" is called "rsem_tpm_gene.txt" while it doesn't contain the TPM (transcript per million) column of RSEM output

Oh really!? 😕
It should be the TPM column though, we need to fix that.
(or even store both)

@amayer21
Copy link
Contributor

It's easy to fix if we want to get the TPMs (just need to change -f 5 into -f 6).

$ cat GEO_160_TCR24_S162_L001_R1_umi_extracted_deduplicated.genes.results | head -n 1
gene_id	transcript_id(s)	length	effective_length	expected_count	TPM	FPKM
$ cat GEO_160_TCR24_S162_L001_R1_umi_extracted_deduplicated.genes.results | cut -f 5 | head -n 1
expected_count

Having said that, I've never used RSEM but it look to me that TPMs are easy to compute from counts, and several tools will start from raw counts rather than counts per millions and perform their own normalisation.

In the tutorial (https://github.com/bli25broad/RSEM_tutorial#single), they explain (looking at .isoforms.results):

The sixth column gives the expression level for each isoform in TPM (Transcript per Million). TPM is a relative measure of expression levels. It represents the number of copies each isoform should have supposing the whole transcriptome contains exactly 1 million transcripts. The fifth column provides the expected read count in each transcript, which can be utilized by tools like EBSeq, DESeq and edgeR for differential expression analysis. The format of gene-level result file, LPS_6h.genes.results, is very similar.

@amayer21
Copy link
Contributor

If we keep the expected counts in the final output (instead of or both with TPMs), we also need to update this page:
https://github.com/grst/rnaseq/blob/add-umi-tools/docs/output.md#rsem

@grst
Copy link
Member Author

grst commented Jul 17, 2020

I would keep both.
If I need counts rather than TPM, I would just go with featureCounts instead.

@amayer21
Copy link
Contributor

Hello,

I've made changes in my fork to keep both RSEM counts and TPMs.

Before to make a PR, I wanted to compare RSEM outputs before and after making the changes.
Surprisingly I found some difference in the counts for a few genes (see tsv table and plot). But when I ran it twice on identical samples and compared RSEM outputs, that also gave a difference (see plot with original or final pipeline). So it looks to me that it's more likely to come from the way RSEM estimate these corrected counts than from the modification I did to the pipeline.
My test file is really small (100,000 reads and only 70% mapping).
I've never used RSEM, don't need to use it and don't have time to go deeper into that now...
If you think it's ok, I'll do a PR on Gregor's fork.

@amayer21
Copy link
Contributor

Also, I said I was going to test on paired end data but my only paired end dataset with UMIs has only UMIs on read2. To make it go through the pipeline, I had to swap name between R1 and R2 (don't think it would be difficult to allow UMI on read2 in pipeline though). But then my "pseudo-read1" only had the UMI (so end up with empty reads after extraction of UMIs), so STAR didn't want to process them... The possibility to start with paired-end reads and move to single-end after UMI extraction seems more complicated to implement byt may be something we could think of at some point...

@apeltzer
Copy link
Member

Maybe you can do a PR on Gregors fork and we all have a look - looks like you invested quite some time now :-)

@amayer21
Copy link
Contributor

done :-)

@apeltzer
Copy link
Member

Are we good here now? :-)

@amayer21
Copy link
Contributor

I think so (as I said in previous comment, I'm not sure about the fact RSEM output is sligthly different each time I run it on a given input file. But it seems to me it's coming from RSEM itself and not from any modification done to the pipeline here).

@grst can you confirm?

Copy link
Member

@apeltzer apeltzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good to me - @grst please merge if you#re happy too .-)

@grst
Copy link
Member Author

grst commented Jul 27, 2020

tbh, I don't know if RSEM results are expected to have some randomness - in any case, I can't see how they pipeline (and in particular your PR) would be causing this. Therefore, I'm merging this now.

@grst grst merged commit 4a0eeb4 into nf-core:dev Jul 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants