Add in ReadGroups for QualiMap compatibility #241

apeltzer · 2019-06-23T12:04:10Z

Adding the sequencing center was done a while ago to have properly set ReadGroups in the BAM files created by HiSAT2 or STAR. However, this isn't enough when using any tool or library based on the picard/GATK htsjdk, that checks for certain required readgroup information to be present in BAM files and corresponding headers (not only when sequencing center is set, but in all cases!). This caused QualiMap to fail unfortunately, which @olgabot found out here: #238

This PR adds the SM information to both HISAT2 and STAR and adds in the defaults without sequencing centers to be added in all cases, otherwise QualiMap will also fail when seqCenter isn't set by the user!

PR checklist

This comment contains a description of changes (with reason)
If you've fixed a bug or added code that should be tested, add tests!
If necessary, also make a PR on the nf-core/rnaseq branch on the nf-core/test-datasets repo
Ensure the test suite passes (nextflow run . -profile test,docker).
Make sure your code lints (nf-core lint .).
Documentation in docs is updated
CHANGELOG.md is updated
README.md is updated

drpatelh · 2019-06-23T20:34:01Z

Sorry @apeltzer It was supposed to be seq_center and not seqCenter because I have changed all of the variable names that require a value to this format.

You are absolutely right about the additional validation that will be required for tools using picard. I also think its a bit pointless adding in just the seq_center field because it defeats the purpose of having a properly formatted read group. This is why I create the entire read group for the chipseq pipeline and add in seq_center only if its set:
https://github.com/nf-core/chipseq/blob/333bba334512625775157aa0087c023cf0bde397/main.nf#L555-L559

Having said that, because rnaseq has multiple alignment options I think this should be added in a separate process if seq_center is set e.g. process addReadGroup using picard AddOrReplaceReadGroups. It will be much easier to maintain this way and at least its guaranteed to work with picard.

olgabot · 2019-06-25T11:37:00Z

It's working for me now!! No mass failures at qualimap anymore:

olgabot

Working for me now! The one thing to change is to add params.seqCenter = false in nextflow.config for the default values.

apeltzer · 2019-06-25T12:34:49Z

I will add general read group addition in this PR so we won't need to have another process AddReadGroups - STAR and HISAT2 can add that.

apeltzer · 2019-06-25T13:21:17Z

Okay, giving this a better-detailed look I think we should have some of the more important/required options in the output by default, following this here:

https://gatkforums.broadinstitute.org/gatk/discussion/6472/read-groups

ID (which we add now)
PU
SM
PL
LB
CN

I would therefore also think about making the options a bit different now, e.g.

params.rg_seqCenter
params.rg_sampleID
params.rg_libraryID

That should be a.) easier to document and b.) follow best-practices for BAM files from GATK/Broad. Thoughts on this @olgabot @drpatelh @maxulysse @ewels ?

drpatelh · 2019-06-25T20:26:57Z

@apeltzer I think we will need a separate process for this eventually anyway because rather annoyingly RSEM doesnt allow you to pass the read group ids to STAR when it wraps it to do the alignment. In my local pipeline, I MarkDuplicates and AddOrReplaceReadGroups after the mapping step. I add the read groups to run RNA-SeQC because its picard based and requires them.

As I suggested, if we arent overly fussed about some of the other ids then you should be able to just extract most of the information from the sample name as I have done for the chipseq and atacseq pipelines:
https://github.com/nf-core/chipseq/blob/333bba334512625775157aa0087c023cf0bde397/main.nf#L558

Which also means fewer parameters and checks...

apeltzer · 2019-06-25T20:39:39Z

Okay that rsem is an entire different thing - then I’ll add that with your code!

drpatelh · 2019-06-26T14:40:45Z

I mean the other option is that we remove the seq_center parameter altogether. I think its only there for aesthetic purposes anyway because the read group is normally set when you map multiple libraries individually and then merge them together later in the pipeline using that information. With rnaseq we merge at the fastq-level so not really applicable...but some picard-based tools require the read group to be set regardless so maybe better to have it in.

olgabot · 2019-07-01T16:11:59Z

FYI somehow the gtf_qualimap variable got lost if one provides only a GFF file, so this was added back in

Add in ReadGroups for QualiMap compatibility

7effd36

apeltzer requested review from drpatelh and olgabot June 23, 2019 12:04

apeltzer added 3 commits June 23, 2019 14:19

Fix typo

d55179e

Fix seqCenter

a741755

HISAT2 seq_center

b0b7be4

olgabot requested changes Jun 25, 2019

View reviewed changes

Add gtf_qualimap

5c13cc0

Remove trailing slash

d1b6d34

apeltzer merged commit ad7e743 into nf-core:salmon Jul 8, 2019

apeltzer deleted the salmon-readgroups branch October 17, 2019 21:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add in ReadGroups for QualiMap compatibility #241

Add in ReadGroups for QualiMap compatibility #241

apeltzer commented Jun 23, 2019

drpatelh commented Jun 23, 2019

olgabot commented Jun 25, 2019

olgabot left a comment

apeltzer commented Jun 25, 2019

apeltzer commented Jun 25, 2019

drpatelh commented Jun 25, 2019

apeltzer commented Jun 25, 2019

drpatelh commented Jun 26, 2019 •

edited

Loading

olgabot commented Jul 1, 2019

Add in ReadGroups for QualiMap compatibility #241

Add in ReadGroups for QualiMap compatibility #241

Conversation

apeltzer commented Jun 23, 2019

PR checklist

drpatelh commented Jun 23, 2019

olgabot commented Jun 25, 2019

olgabot left a comment

Choose a reason for hiding this comment

apeltzer commented Jun 25, 2019

apeltzer commented Jun 25, 2019

drpatelh commented Jun 25, 2019

apeltzer commented Jun 25, 2019

drpatelh commented Jun 26, 2019 • edited Loading

olgabot commented Jul 1, 2019

drpatelh commented Jun 26, 2019 •

edited

Loading