Remove duplicate variants called by ARTIC ONT pipeline #232
Comments
Hi @peterk87! Hope you are well and Happy New Year! Prepping for another release soon. I think we would need to write a small Python script or the like to deduplicate the VCF file. I wonder whether we need to prioritise keeping one variant over another based on the entries in the final column? Are there any existing tools that allow you to deduplicate VCF files 🤔 Will take a look. Also, it would be great if you have a full VCF file generated by the pipeline handy that I could use to test this.
This would be worth testing: Is on Bioconda:
Ok. This will be fixed in #252 and a new file called I ended up wiring in a module for
Thanks @drpatelh! This will be great for future analyses. vcflib looks very useful.
There's an unresolved issue with the ARTIC pipeline where variants present in both pools may not be merged/deduplicated properly (see artic-network/fieldbioinformatics#53 and will-rowe/artic-tools#3), which may result in inflated numbers of variants found in these samples with BCFTools and reported in the viralrecon MultiQC report.
For example, in some samples I tested, the following variant calls are duplicated in the `<sample>.pass.vcf.gz`:
The ARTIC pipeline handles variant calling for each PCR pool separately and then merges the VCF files, by default, with `artic_vcf_merge`, which doesn't appear to do any deduplication of variants (see fieldbioinformatics/artic/vcf_merge.py).
If ARTIC is run with `--strict`, then `artic-tools` v0.2.6 (which `artic_minion` v1.2.1 uses) keeps all copies of duplicate variants encountered as well: https://github.com/will-rowe/artic-tools/blob/98532314bc2345ed64e3c8864e2a1b22b47b794f/artic/vcfCheck.cpp#L167
Although the code has changed in `artic-tools` v0.3.0 compared to v0.2.6: will-rowe/artic-tools@v0.2.6...v0.3.0#diff-df805cd5df2f5f2218fd8b5ba16f89171028315340e591b941f6fe135e7cf8d9L167
`artic-tools` v0.3.0 also keeps the duplicate variant calls and logs that the variant was found in an amplicon overlap region:
Possible solution
One possible solution would be to add a process and script to keep only the first duplicate variant called by the ARTIC pipeline.
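As a rough illustration of the "keep only the first duplicate" idea, here is a minimal Python sketch. It assumes that two records count as duplicates when they share CHROM, POS, REF and ALT, and that the first record encountered is the one to keep. The function names `dedup_vcf_lines` and `dedup_vcf` are hypothetical, and this is not necessarily the fix that was adopted in the pipeline.

```python
import gzip
from typing import Iterable, Iterator


def dedup_vcf_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield VCF lines, keeping only the first record per (CHROM, POS, REF, ALT).

    Assumption: identical CHROM/POS/REF/ALT means "duplicate"; other columns
    (QUAL, INFO, sample fields) are ignored when comparing records.
    """
    seen = set()
    for line in lines:
        # Header lines and blank lines pass through unchanged.
        if line.startswith("#") or not line.strip():
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        key = (fields[0], fields[1], fields[3], fields[4])
        if key in seen:
            continue  # later copy of a variant already written: drop it
        seen.add(key)
        yield line


def dedup_vcf(in_path: str, out_path: str) -> None:
    """Read a plain or gzip-compressed VCF and write a deduplicated plain-text VCF."""
    opener = gzip.open if in_path.endswith(".gz") else open
    with opener(in_path, "rt") as fin, open(out_path, "w") as fout:
        fout.writelines(dedup_vcf_lines(fin))
```

The output is written uncompressed, so it would need to be compressed and indexed again before tools that expect a `.vcf.gz` can read it. If one copy should be preferred over another (for example by QUAL), the `seen` set would have to become a dictionary keyed on the same tuple, at the cost of holding records in memory until the end of the file.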