Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combine all VCFs across different variants caller, such as mutect2, strelka2 #738

Closed
Githubguanxudong opened this issue Sep 8, 2022 · 26 comments
Assignees
Labels
enhancement New feature or request

Comments

@Githubguanxudong
Copy link

Description of feature

We usually combine multiple software results in tumor variants calling, so I think it is necessary to use appropriate methods to combine them, such as union and consistency. I am looking forward to this feature

@Githubguanxudong Githubguanxudong added the enhancement New feature or request label Sep 8, 2022
@maxulysse
Copy link
Member

So do I
<3

@asp8200 asp8200 self-assigned this Oct 10, 2022
@asp8200
Copy link
Contributor

asp8200 commented Oct 10, 2022

What is the right tool for this? bcftools concat? We already have a module for that:

https://github.com/nf-core/modules/tree/master/modules/nf-core/bcftools/concat

@FriederikeHanssen
Copy link
Contributor

@Githubguanxudong could you elaborate a bit more? We are trying to tackle this during the hackathon. Which tools did you have in mind? Do you have some papers/examples on which level should be merged?

@asp8200
Copy link
Contributor

asp8200 commented Oct 10, 2022

As far as I could tell, we need to run bcftools concat with the option --allow-overlaps?

https://samtools.github.io/bcftools/bcftools.html

If one doesn't use --allow-overlaps, the results from bcftools concat a.vcf b.vcf and bcftools concat b.vcf a.vcf may be different.

When running without --allow-overlaps, one may get a warning like

The chromosome block chr1 is not contiguous, consider running with -a. 

but the tools still produces an output-file (which is seemingly missing some variants from the input-files!?).

It is not clear to me what the option --allow-overlaps actually does! 😬

@asp8200
Copy link
Contributor

asp8200 commented Nov 10, 2022

@FriederikeHanssen and @maxulysse : The vcf-file resulting from the concatenation may contain the same variant several times:

chr22   2123    .       C       G       28      PASS    SNVHPOL=3;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL    0/1:30:12:3:2:1,2:0,1:1,1:-6.9:PASS:63,0,27
chr22   3140    .       A       G       42      LowGQX;LowDepth;NoPassedVariantGTs      SNVHPOL=2;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL    1/1:5:3:2:0:0,2:0,1:0,1:-8.9:LowGQX;LowDepth:78,6,0
[...]
chr22   2123    .       C       G       33.97   PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=0.967;CNN_1D=3.270;DP=3;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=11.32;ReadPosRankSum=0.967;SOR=0.223  GT:AD:DP:GQ:PL  0/1:1,2:3:1:41,0,1

Such duplicates could perhaps be removed by the option --remove-duplicates for bcftools concat, but is that what we want?

Also, the vcf-file is not sorted. Should it perhaps be sorted?

@FriederikeHanssen
Copy link
Contributor

Hi @Githubguanxudong ! How do you usually deal with the above? :)

@amasplund
Copy link

@FriederikeHanssen
I was tagged by @asp8200 to look at this. We implemented a version of combining VCFs from structural variant callers before. We wanted to concatenate SV-calls from Manta, Cnvnator, Lumpy and Delly. So four different structural variant callers. The way our interpreters wanted it was that no variant calls were merged. One reason being that it can be hard to come to an agreement on which variants to merge. E.g. how large the overlap should be to merge. And for translocations how close the break-points should be. They also wanted to know from which variant caller the data came. We solved the concatenation using svdb --merge
[https://github.com/J35P312/SVDB]
and tweaked it in a way that it did not merge any calls :). Not even the duplicated calls were merged as they wanted to keep the calls separate, being able to filter in VarSeq on specific callers.
The vcf-files were sorted bcftools sort -O z before and after merging and indexed using tabix -p vcf.
Hope this is of any help. Just ping me if you have any questions.

@asp8200
Copy link
Contributor

asp8200 commented Nov 28, 2022

We've decided that - for now - we just do concatenation of germline vcf-files.

@asp8200
Copy link
Contributor

asp8200 commented Nov 28, 2022

@maxulysse @FriederikeHanssen @amasplund : I'll run bcftools sort z followed by tabix -p vcf on the concatenated vcf-file, ok?

@asp8200
Copy link
Contributor

asp8200 commented Nov 28, 2022

So here I ran deepvariant, mpileup, strelka and freebayes, and I get the same variant called three times:

##bcftools_concatCommand=concat --output testN.vcf.gz --threads 1 testN.deepvariant.vcf.gz testN.bcftools.vcf.gz testN.strelka.variants.vcf.gz testN.freebayes.vcf.gz; Date=Mon Nov 28 15:50:15
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  testN
chr22   1982    .       A       G       0.1     RefCall .       GT:GQ:DP:AD:VAF:PL      ./.:16:286:238,48:0.167832:0,15,26
chr22   1982    .       A       G       220.18  .       DP=251;VDB=9.63142e-13;SGB=-0.692831;RPBZ=0.633339;MQBZ=0;MQSBZ=0;BQBZ=0.628639;NMBZ=-1.07033;SCBZ=-0.471405;FS=0;MQ0F=0;AC=1;AN=2;DP4=61,47,11,13;MQ=60        GT:PL   0/1:255,0,255
chr22   1982    .       A       G       459.724 .       AB=0.167832;ABP=277.102;AC=1;AF=0.5;AN=2;AO=48;CIGAR=1X;DP=286;DPB=286;DPRA=0;EPP=3.0103;EPPR=3.0103;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=105.855;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=2017;QR=9863;RO=238;RPL=0;RPP=107.241;RPPR=519.821;RPR=48;RUN=1;SAF=24;SAP=3.0103;SAR=24;SRF=119;SRP=3.0103;SRR=119;TYPE=snp;technology.illumina=1   GT:DP:AD:RO:QR:AO:QA:GL 0/1:286:238,48:238:9863:48:2017:-95.4031,0,-799.898

I wonder if that is what the users would want? 🤔

@maxulysse
Copy link
Member

So here I ran deepvariant, mpileup, strelka and freebayes, and I get the same variant called three times:

##bcftools_concatCommand=concat --output testN.vcf.gz --threads 1 testN.deepvariant.vcf.gz testN.bcftools.vcf.gz testN.strelka.variants.vcf.gz testN.freebayes.vcf.gz; Date=Mon Nov 28 15:50:15
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  testN
chr22   1982    .       A       G       0.1     RefCall .       GT:GQ:DP:AD:VAF:PL      ./.:16:286:238,48:0.167832:0,15,26
chr22   1982    .       A       G       220.18  .       DP=251;VDB=9.63142e-13;SGB=-0.692831;RPBZ=0.633339;MQBZ=0;MQSBZ=0;BQBZ=0.628639;NMBZ=-1.07033;SCBZ=-0.471405;FS=0;MQ0F=0;AC=1;AN=2;DP4=61,47,11,13;MQ=60        GT:PL   0/1:255,0,255
chr22   1982    .       A       G       459.724 .       AB=0.167832;ABP=277.102;AC=1;AF=0.5;AN=2;AO=48;CIGAR=1X;DP=286;DPB=286;DPRA=0;EPP=3.0103;EPPR=3.0103;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=105.855;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=2017;QR=9863;RO=238;RPL=0;RPP=107.241;RPPR=519.821;RPR=48;RUN=1;SAF=24;SAP=3.0103;SAR=24;SRF=119;SRP=3.0103;SRR=119;TYPE=snp;technology.illumina=1   GT:DP:AD:RO:QR:AO:QA:GL 0/1:286:238,48:238:9863:48:2017:-95.4031,0,-799.898

I wonder if that is what the users would want? thinking

So this is just the germline indels?
If I understand well, in that case they're all concatenated together, so that's what I would expect.
But I'm thinking this is just the first step, and people might also want the variants to be merged, so in your case, having just a final one variant instead of 3

@FriederikeHanssen
Copy link
Contributor

yeah, is the meta info also the same across all?

@asp8200
Copy link
Contributor

asp8200 commented Nov 28, 2022

yeah, is the meta info also the same across all?

Not sure I understand what you mean 🤔 The different variant-callers produces vcf-files with different INFO- and FORMAT-fields.

@FriederikeHanssen
Copy link
Contributor

yeah maybe this info is interesting to keep? Not sure if we want to just throw it out to only keep one line per variant or if there is a way to merge these fields

@asp8200
Copy link
Contributor

asp8200 commented Nov 28, 2022

Here is the concatenated (and sorted) vcf-file resulting from the following cmd:

nextflow run main.nf -profile test,singularity  --input mapped_joint_bam.fixed.csv -dump-channels -ansi-log false --step variant_calling --concatenate_vcfs --tools deepvariant,freebayes,strelka,mpileup

testN.germline.vcf.gz

It is basically the result of running bcftools concat. I suggest that we just go with that for now.

@FriederikeHanssen
Copy link
Contributor

yes sounds good. can always add more later

@asp8200
Copy link
Contributor

asp8200 commented Nov 28, 2022

Just for the record: the cnvkit doesn't produce any vcf-files, so the concatenated vcf-file doesn't contain variants from the cnvkit.

@amasplund
Copy link

Some comments:

  • My experience is that CNVkit works really bad without a reference dataset as it uses coverage distributions (log2ratios I think) to determine if something is out of the ordinary.
  • As the programs have different strengths (in regards to calling specific types of structural variants) and different ways of annotating the variants with meta data (e.g. quality), our users were interested in keeping all the info.
  • I can't find info about the origin of the variants (which variant caller) in the example Anders sent https://github.com/nf-core/sarek/files/10105274/testN.germline.vcf.gz. The experienced user can probably tell from the INFO field, but our users wanted this info. Is there a way of adding this info using bcftools concat.

@asp8200
Copy link
Contributor

asp8200 commented Nov 30, 2022

Some comments:

  • My experience is that CNVkit works really bad without a reference dataset as it uses coverage distributions (log2ratios I think) to determine if something is out of the ordinary.
  • As the programs have different strengths (in regards to calling specific types of structural variants) and different ways of annotating the variants with meta data (e.g. quality), our users were interested in keeping all the info.
  • I can't find info about the origin of the variants (which variant caller) in the example Anders sent https://github.com/nf-core/sarek/files/10105274/testN.germline.vcf.gz. The experienced user can probably tell from the INFO field, but our users wanted this info. Is there a way of adding this info using bcftools concat.

@amasplund : Thanks for your feedback. It is much appreciated.

The vcf-file testN.germline.vcf.gz just contains the following line which shows which input-vcf-files were used:

##bcftools_concatCommand=concat --output testN.vcf.gz --threads 1 testN.deepvariant.vcf.gz testN.strelka.variants.vcf.gz testN.bcftools.vcf.gz testN.freebayes.vcf.gz; Date=Mon Nov 28 16:05:37 2022

I could imagine that it would be useful to have an INFO-field for each variant stating which variant-caller (or perhaps more realistically which vcf-file) found the given variant. Something like, for instance, VC=strelka. However, as far as I can see, bcftools concat doesn't support something like that.

@amasplund
Copy link

amasplund commented Nov 30, 2022

@asp8200 Can you ad "VC=strelka" manually before the concat? Or is that to hacky?? :)
Using svdb e.g. "set=delly" was added to the INFO field.

@amasplund
Copy link

@asp8200 Maybe the best is to send the example to some "real" interpreters and see what they think :). With the risk of getting divergent replies :).

@FriederikeHanssen
Copy link
Contributor

@asp8200 you could also ping the sarek community on slack to get some input. But as @amasplund mentioned quite possible we get some divergent opinions there, but that is always the case 😆 I like the idea of keeping all this INFO. Better a little more meta info than to little and to see how all of them compare this would be quite useful.

@amasplund
Copy link

A suggestion for CNVkit (our users like it a lot), which is more work demanding though, is to have a CNVkit option where you add a reference .cnn file.
https://cnvkit.readthedocs.io/en/stable/pipeline.html
I haven't done this myself so I am not into the details.
Users could then ad this for getting nice variant calls from CNVkit. I don't know if there is a NF-core pipeline for creating a CNVkit reference dataset??
Maybe for a future release :).

@FriederikeHanssen
Copy link
Contributor

@amasplund a bit of topic here: but with the newest sarek release you can submit your own reference.cnn. In addition, Maxime has just proposed a new workflow to create all these cohort-of-normal like references, like: pon for mutect2, one for msisensorpro and cnvkit

@asp8200
Copy link
Contributor

asp8200 commented Dec 1, 2022

Coming back to the point about adding an INFO-field to the vcf-files before concatenating them:

@amasplund come up with a nice, little bash+awk-script for adding some "constant" INFO-field to a vcf-file. How would Sarek run that script before doing the concatenation? As far as I can tell, there are two options:

  1. I make a "local" module with the bash-script, or
  2. I try to get the module (with the bash-script) enrolled in github.com/nf-core/modules

Neither of those solutions seem ideal. Anybody else have any suggestions on how to do this?

@asp8200
Copy link
Contributor

asp8200 commented Dec 7, 2022

Closed by #792

@asp8200 asp8200 closed this as completed Dec 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Archived in project
Archived in project
Development

No branches or pull requests

5 participants