Remove duplicate variants called by ARTIC ONT pipeline #232

Closed
peterk87 opened this issue Jul 30, 2021 · 4 comments

@peterk87

There's an unresolved issue with the ARTIC pipeline where variants present in both pools may not be merged/deduplicated properly (see artic-network/fieldbioinformatics#53 and will-rowe/artic-tools#3). This can inflate the number of variants that BCFTools counts for these samples and that is reported in the viralrecon MultiQC report.

For example, in some samples I tested, the following variant calls are duplicated in the <sample>.pass.vcf.gz:

MN908947.3	28881	.	G	A	14025.3	PASS	TotalReads=985;SupportFraction=0.95057;SupportFractionByStrand=0.924196,0.976891;BaseCalledReadsWithVariant=902;BaseCalledFraction=0.902;AlleleCount=1;StrandSupport=37,11,455,482;StrandFisherTest=40;SOR=0.203982;RefContext=CAGTAGGGGAA;Pool=nCoV-2019_1	GT	1
MN908947.3	28881	.	G	A	14657.3	PASS	TotalReads=962;SupportFraction=0.946487;SupportFractionByStrand=0.910959,0.981867;BaseCalledReadsWithVariant=894;BaseCalledFraction=0.919753;AlleleCount=1;StrandSupport=43,9,437,473;StrandFisherTest=65;SOR=0.199358;RefContext=CAGTAGGGGAA;Pool=nCoV-2019_2	GT	1
MN908947.3	28882	.	G	A	14025.3	PASS	TotalReads=985;SupportFraction=0.870563;SupportFractionByStrand=0.780772,0.960173;BaseCalledReadsWithVariant=880;BaseCalledFraction=0.88;AlleleCount=1;StrandSupport=108,20,384,473;StrandFisherTest=176;SOR=0.440152;RefContext=AGTAGGGGAAC;Pool=nCoV-2019_1	GT	1
MN908947.3	28882	.	G	A	14657.3	PASS	TotalReads=962;SupportFraction=0.86984;SupportFractionByStrand=0.774981,0.964305;BaseCalledReadsWithVariant=864;BaseCalledFraction=0.888889;AlleleCount=1;StrandSupport=108,17,372,465;StrandFisherTest=191;SOR=0.462109;RefContext=AGTAGGGGAAC;Pool=nCoV-2019_2	GT	1
MN908947.3	28883	.	G	C	14025.3	PASS	TotalReads=985;SupportFraction=0.973349;SupportFractionByStrand=0.984946,0.961775;BaseCalledReadsWithVariant=884;BaseCalledFraction=0.884;AlleleCount=1;StrandSupport=7,19,485,474;StrandFisherTest=14;SOR=0.20202;RefContext=GTAGGGGAACT;Pool=nCoV-2019_1	GT	1
MN908947.3	28883	.	G	C	14657.3	PASS	TotalReads=962;SupportFraction=0.967308;SupportFractionByStrand=0.976838,0.957817;BaseCalledReadsWithVariant=885;BaseCalledFraction=0.910494;AlleleCount=1;StrandSupport=11,20,469,462;StrandFisherTest=8;SOR=0.303134;RefContext=GTAGGGGAACT;Pool=nCoV-2019_2	GT	1
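
To check whether a given VCF is affected, a quick Python sketch along these lines could help. It assumes a "duplicate" means two records sharing CHROM, POS, REF and ALT, and that fields are tab-separated; the function names are illustrative and not part of ARTIC or viralrecon.

```python
import gzip
from collections import Counter


def open_vcf(path):
    """Open a plain-text or gzip/bgzip-compressed VCF in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_duplicate_variants(lines):
    """Return {(chrom, pos, ref, alt): n} for every variant seen more than once.

    Assumption: two records are duplicates when CHROM, POS, REF and ALT match.
    """
    counts = Counter()
    for line in lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        counts[(chrom, int(pos), ref, alt)] += 1
    return {key: n for key, n in counts.items() if n > 1}


# Usage (hypothetical file name):
#   with open_vcf("sample.pass.vcf.gz") as handle:
#       for variant, n in count_duplicate_variants(handle).items():
#           print(variant, n)
```

Run on the six records above, this should report each of the three sites (28881, 28882, 28883) with a count of 2.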

The ARTIC pipeline handles variant calling for each PCR pool separately and then merges the VCF files, by default, with artic_vcf_merge, which doesn't appear to do any deduplication of variants (see fieldbioinformatics/artic/vcf_merge.py).

If ARTIC is run with --strict, then artic-tools v0.2.6 (which artic_minion v1.2.1 uses) keeps all copies of duplicate variants encountered as well:

https://github.com/will-rowe/artic-tools/blob/98532314bc2345ed64e3c8864e2a1b22b47b794f/artic/vcfCheck.cpp#L167

The code changed between artic-tools v0.2.6 and v0.3.0:

will-rowe/artic-tools@v0.2.6...v0.3.0#diff-df805cd5df2f5f2218fd8b5ba16f89171028315340e591b941f6fe135e7cf8d9L167

However, artic-tools v0.3.0 still keeps the duplicate variant calls and logs that the variant was found in an amplicon overlap region:

[12:29:37] [artic-tools::check_vcf] variant at pos 28881: G->A
[12:29:37] [artic-tools::check_vcf]     located within an amplicon overlap region
[12:29:37] [artic-tools::check_vcf] variant at pos 28881: G->A
[12:29:37] [artic-tools::check_vcf]     located within an amplicon overlap region
[12:29:37] [artic-tools::check_vcf] variant at pos 28882: G->A
[12:29:37] [artic-tools::check_vcf]     located within an amplicon overlap region
[12:29:37] [artic-tools::check_vcf] variant at pos 28882: G->A
[12:29:37] [artic-tools::check_vcf]     located within an amplicon overlap region
[12:29:37] [artic-tools::check_vcf] variant at pos 28883: G->C
[12:29:37] [artic-tools::check_vcf]     located within an amplicon overlap region
[12:29:37] [artic-tools::check_vcf] variant at pos 28883: G->C
[12:29:37] [artic-tools::check_vcf]     located within an amplicon overlap region

Possible solution

One possible solution would be to add a process and script to keep only the first duplicate variant called by the ARTIC pipeline.
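
A minimal sketch of what such a script might look like, in Python. It assumes duplicates are records with identical CHROM, POS, REF and ALT, and it keeps whichever copy appears first in the file. The function names and the plain-text output are illustrative choices, not an existing viralrecon module.

```python
import gzip


def keep_first_variants(lines):
    """Yield VCF lines, dropping records whose (CHROM, POS, REF, ALT) was already seen."""
    seen = set()
    for line in lines:
        # Header lines are always passed through unchanged.
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            # Not a well-formed record (e.g. a blank line); pass it through.
            yield line
            continue
        key = (fields[0], fields[1], fields[3], fields[4])
        if key in seen:
            continue  # a later copy of a variant we already emitted
        seen.add(key)
        yield line


def dedup_vcf(in_path, out_path):
    """Read a (possibly gzipped) VCF and write a de-duplicated plain-text VCF."""
    opener = gzip.open if str(in_path).endswith(".gz") else open
    with opener(in_path, "rt") as src, open(out_path, "wt") as dst:
        dst.writelines(keep_first_variants(src))
```

The output here is written uncompressed on purpose: Python's gzip module does not produce bgzip blocks, so a file meant for tabix indexing would still need to go through bgzip afterwards.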

@peterk87 added the bug label (Something isn't working) Jul 30, 2021
@drpatelh added this to the 2.3 milestone Jan 7, 2022
@drpatelh
Member

Hi @peterk87! Hope you are well and Happy New Year! Prepping for another release soon. I think we would need to write a small Python script or the like to deduplicate the VCF file. I wonder whether we need to prioritise keeping one variant over another based on the entries in the final column? Are there any existing tools that can deduplicate VCF files 🤔 Will take a look.

Also, it would be great if you have a full VCF file generated by the pipeline handy that I could use to test this.
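
If prioritising one copy over another turns out to matter, one option would be to rank the duplicates by some field. The Python sketch below uses the QUAL column purely as an example criterion; that choice is an assumption, not something decided in this thread, and the function name is hypothetical.

```python
def keep_best_variants(lines):
    """Return VCF lines with one record per (CHROM, POS, REF, ALT).

    Assumption: among duplicates, the record with the highest QUAL is kept.
    Each kept record stays at the position of its first occurrence, so the
    original sort order of the file is preserved.
    """
    header, best, order = [], {}, []
    for line in lines:
        if line.startswith("#"):
            header.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        key = (fields[0], fields[1], fields[3], fields[4])
        # A missing QUAL (".") ranks below any numeric value.
        qual = float(fields[5]) if fields[5] != "." else float("-inf")
        if key not in best:
            order.append(key)
            best[key] = (qual, line)
        elif qual > best[key][0]:
            best[key] = (qual, line)
    return header + [best[key][1] for key in order]
```

For the 28881 G>A example this would keep the nCoV-2019_2 record (QUAL 14657.3) rather than the nCoV-2019_1 record (QUAL 14025.3).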

@drpatelh changed the title from "Nanopore variant calling stats may be inflated due to duplicate variants from ARTIC pipeline" to "Duplicate variants called by ARTIC ONT pipeline" Jan 14, 2022
@drpatelh
Member

drpatelh commented Jan 14, 2022

Ok. This will be fixed in #252, and a new file called <sample>.pass.unique.vcf.gz will be written to the results directory. All downstream processes will use the de-duplicated VCF file, which will also fix the reporting.

I ended up wiring in a module for vcflib/vcfuniq which worked just fine.

@drpatelh changed the title from "Duplicate variants called by ARTIC ONT pipeline" to "Remove duplicate variants called by ARTIC ONT pipeline" Jan 17, 2022
@peterk87
Author

Thanks @drpatelh! This will be great for future analyses. vcflib looks very useful.
