Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add in ReadGroups for QualiMap compatibility #241

Merged
merged 6 commits into from
Jul 8, 2019
Merged

Add in ReadGroups for QualiMap compatibility #241

merged 6 commits into from
Jul 8, 2019

Conversation

apeltzer
Copy link
Member

Adding the sequencing center was done a while ago to have properly set ReadGroups in the BAM files created by HiSAT2 or STAR. However, this isn't enough when using any tool or library based on the picard/GATK htsjdk, that checks for certain required readgroup information to be present in BAM files and corresponding headers (not only when sequencing center is set, but in all cases!). This caused QualiMap to fail unfortunately, which @olgabot found out here: #238

This PR adds the SM information to both HISAT2 and STAR and adds in the defaults without sequencing centers to be added in all cases, otherwise QualiMap will also fail when seqCenter isn't set by the user!

PR checklist

  • This comment contains a description of changes (with reason)
  • If you've fixed a bug or added code that should be tested, add tests!
  • If necessary, also make a PR on the nf-core/rnaseq branch on the nf-core/test-datasets repo
  • Ensure the test suite passes (nextflow run . -profile test,docker).
  • Make sure your code lints (nf-core lint .).
  • Documentation in docs is updated
  • CHANGELOG.md is updated
  • README.md is updated

@drpatelh
Copy link
Member

Sorry @apeltzer It was supposed to be seq_center and not seqCenter because I have changed all of the variable names that require a value to this format.

You are absolutely right about the additional validation that will be required for tools using picard. I also think its a bit pointless adding in just the seq_center field because it defeats the purpose of having a properly formatted read group. This is why I create the entire read group for the chipseq pipeline and add in seq_center only if its set:
https://github.com/nf-core/chipseq/blob/333bba334512625775157aa0087c023cf0bde397/main.nf#L555-L559

Having said that, because rnaseq has multiple alignment options I think this should be added in a separate process if seq_center is set e.g. process addReadGroup using picard AddOrReplaceReadGroups. It will be much easier to maintain this way and at least its guaranteed to work with picard.

@olgabot
Copy link
Contributor

olgabot commented Jun 25, 2019

It's working for me now!! No mass failures at qualimap anymore:

2019-06-27_nextflow_rnaseq_scslam

Copy link
Contributor

@olgabot olgabot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working for me now! The one thing to change is to add params.seqCenter = false in nextflow.config for the default values.

@apeltzer
Copy link
Member Author

I will add general read group addition in this PR so we won't need to have another process AddReadGroups - STAR and HISAT2 can add that.

@apeltzer
Copy link
Member Author

Okay, giving this a better-detailed look I think we should have some of the more important/required options in the output by default, following this here:

https://gatkforums.broadinstitute.org/gatk/discussion/6472/read-groups

  • ID (which we add now)
  • PU
  • SM
  • PL
  • LB
  • CN

I would therefore also think about making the options a bit different now, e.g.

params.rg_seqCenter
params.rg_sampleID
params.rg_libraryID

That should be a.) easier to document and b.) follow best-practices for BAM files from GATK/Broad. Thoughts on this @olgabot @drpatelh @maxulysse @ewels ?

@drpatelh
Copy link
Member

@apeltzer I think we will need a separate process for this eventually anyway because rather annoyingly RSEM doesnt allow you to pass the read group ids to STAR when it wraps it to do the alignment. In my local pipeline, I MarkDuplicates and AddOrReplaceReadGroups after the mapping step. I add the read groups to run RNA-SeQC because its picard based and requires them.

As I suggested, if we arent overly fussed about some of the other ids then you should be able to just extract most of the information from the sample name as I have done for the chipseq and atacseq pipelines:
https://github.com/nf-core/chipseq/blob/333bba334512625775157aa0087c023cf0bde397/main.nf#L558

Which also means fewer parameters and checks...

@apeltzer
Copy link
Member Author

Okay that rsem is an entire different thing - then I’ll add that with your code!

@drpatelh
Copy link
Member

drpatelh commented Jun 26, 2019

I mean the other option is that we remove the seq_center parameter altogether. I think its only there for aesthetic purposes anyway because the read group is normally set when you map multiple libraries individually and then merge them together later in the pipeline using that information. With rnaseq we merge at the fastq-level so not really applicable...but some picard-based tools require the read group to be set regardless so maybe better to have it in.

@olgabot
Copy link
Contributor

olgabot commented Jul 1, 2019

FYI somehow the gtf_qualimap variable got lost if one provides only a GFF file, so this was added back in

@apeltzer apeltzer merged commit ad7e743 into nf-core:salmon Jul 8, 2019
@apeltzer apeltzer deleted the salmon-readgroups branch October 17, 2019 21:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants