Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with Pathogen Profiler combine_vcf_variants.py script #345

Closed
taranewman opened this issue Apr 23, 2024 · 16 comments
Closed

Issue with Pathogen Profiler combine_vcf_variants.py script #345

taranewman opened this issue Apr 23, 2024 · 16 comments

Comments

@taranewman
Copy link

Hello,

I came across an error with the Pathogen Profiler combine_vcf_variants.py script that seems to occur in approximately half of my samples with TBProfiler v6.2.0. The same samples previously ran successfully using v4.3.0. The samples causing this error don't appear to have a clear lineage/QC pattern.

An SRA sample that produces this error is SRR10869015

If line 171 is commented out, then everything appears to run fine.

System specifications: conda, Linux HPC, SLURM

Command Failed:
/bin/bash -c set -o pipefail; bcftools view -c 1 -a <>targets_for_profile.vcf.gz | bcftools view -v snps | combine_vcf_variants.py --ref tbprofiler//tbdb.fasta --gff tbdb.gff --bam <>.bam |  snpEff ann -dataDir snpeff-5.2-0/data -noLog -noStats Mycobacterium_tuberculosis_h37rv -  | bcftools sort -Oz -o <>.vcf.gz && bcftools index <>vcf.gz
stderr:
Writing to ...
Traceback (most recent call last):
  File "<path to conda env> bin/combine_vcf_variants.py", line 171, in <module>
    variant.info.update({'AF':count/dp})
  File "pysam/libcbcf.pyx", line 2798, in pysam.libcbcf.VariantRecordInfo.update
  File "pysam/libcbcf.pyx", line 2621, in pysam.libcbcf.VariantRecordInfo.__setitem__
  File "pysam/libcbcf.pyx", line 698, in pysam.libcbcf.bcf_info_set_value
KeyError: 'unknown INFO: AF'
[E::bcf_hdr_read] Input is not detected as bcf or vcf format
Could not read VCF/BCF headers from -
Cleaning
@jodyphelan
Copy link
Owner

Hi @taranewman ,

Thanks for reporting this. I'll take a look now.

@jodyphelan
Copy link
Owner

@taranewman , can you give me the command you used to download the reads and run tb-profiler? I ran it through without any error on my system but maybe there might be some differences in our commands

@taranewman
Copy link
Author

Hi @jodyphelan ,

Thanks so much for taking a look.

These are the commands:

Download the reads:
curl -L http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR108/015/SRR10869015/SRR10869015_1.fastq.gz -o SRR10869015_DNAseq_of_clinical_M._tubeculosis_isolates_1.fastq.gz

curl -L http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR108/015/SRR10869015/SRR10869015_2.fastq.gz -o SRR10869015_DNAseq_of_clinical_M._tubeculosis_isolates_2.fastq.gz

Run fastp
fastp --cut_tail -i SRR10869015_DNAseq_of_clinical_M._tubeculosis_isolates_1.fastq.gz -I SRR10869015_DNAseq_of_clinical_M._tubeculosis_isolates_2.fastq.gz -o SRR10869015_trimmed_R1.fastq.gz -O SRR10869015_trimmed_R2.fastq.gz

Run tb-profiler:

tb-profiler profile --threads 8 --platform illumina --mapper bwa --caller bcftools --depth 10 --af 0.1 --read1 SRR10869015_trimmed_R1.fastq.gz --read2 SRR10869015_trimmed_R1.fastq.gz --prefix SRR10869015 --csv --call_whole_genome

Please let me know if you need any other info!

@jodyphelan
Copy link
Owner

Thanks, running those commands still seems to work for me to it might be the versions of one of the packages?
If you are using conda can you export the environemtent with conda list --explicit and attach the file here?

@taranewman
Copy link
Author

tbprofiler-conda-env.txt

Thanks for testing that out! Attached is the conda environment.

This has bcftools v1.20. I've tried downgrading this environment to use bcftools v1.12 which was the version we were using prior to the update but didn't have luck there.

@jodyphelan
Copy link
Owner

Oddly enough it still seems to run fine for me

image

Could you run it with --no_clean and send the targets_for_profile.vcf.gz file?

@taranewman
Copy link
Author

targets_for_profile.vcf.gz

Good to know, thank you for your time checking this!

I am running this within a nextflow wrapper so I'll look further on my side if there is something with nextflow that could be causing this.

@taranewman
Copy link
Author

Hi again,

I manually ran

bcftools view -c 1 -a targets_for_profile.vcf.gz | bcftools view -v snps

to produce
BCFTOOLS_INPUT_FOR_SCRIPT.vcf.gz

I then used this vcf file as input into the combine_vcf_variants.py script within a jupyter notebook.

I'm not sure why the error only occurred when running with the nextflow wrapper, but it looks like 'AF' is missing from the variant info here:

image

Do you think adding a line if 'AF' in variant.info: , similar to the 'DP4' line, could be appropriate here?

@taranewman
Copy link
Author

One thing I did notice was that the vcf file produced when running tbprofiler on this SRA sample outside of nextflow was that the VCF file was empty

image

Whereas the file produced when running the same command within nextflow had many more SNPs:

image

In this issue here it seems the AF column may need to be calculated first by the AN and AC columns? samtools/bcftools#1060

@jodyphelan
Copy link
Owner

jodyphelan commented Apr 27, 2024

Hi again, just double checking - are you using bcftools to variant call or the freebayes default?
If you are using bcftools then it might explain the error as the AF attribute isn't generated and as a consequence when the combine_vcf_variants.py script tries to add it, it fails due to it not being defined in the vcf header. A fix which should work is to add the definition if it doesnt exist.

if "AF" not in vcf.header.info.keys():
    ##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">
    vcf.header.add_line('##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">')

Seems to work on my end.

Bcftools also seems to be giving an empty vcf for me so I'll investigate that.

@taranewman
Copy link
Author

Thanks Jody! Yes, I am using bcftools.

@jodyphelan
Copy link
Owner

Thanks, I think we're getting closer.

When bcftools doesn't call any variants then the combine_vcf_variants.py script doesn't have any variants to analyse and doesn't cause any error. Which might explain why you sometimes get the error and sometimes not.

Why bcftools isn't working I'm not entirely sure yet. I noticed in your commands you seem to be using the forward read twice, is this a typo?

tb-profiler profile --threads 8 --platform illumina --mapper bwa --caller bcftools --depth 10 --af 0.1 --read1 SRR10869015_trimmed_R1.fastq.gz  --read2 SRR10869015_trimmed_R1.fastq.gz  --prefix SRR10869015  --csv  --call_whole_genome

This might explain why bcftools doesn't call any variants.

jodyphelan added a commit to jodyphelan/pathogen-profiler that referenced this issue Apr 27, 2024
@taranewman
Copy link
Author

Oops so sorry! Yes that was a typo, thanks for noticing that!

When I fix the command to use the reverse read, I'm getting the same variants in the vcf file and AF error I got when running within the nextflow wrapper. Mystery solved :)

@jodyphelan
Copy link
Owner

Great, I'll release the patch this week

@taranewman
Copy link
Author

Great, I'll release the patch this week

Thank you Jody! Will this be a pathogen-profiler v4.1.1 patch release?

@jodyphelan
Copy link
Owner

jodyphelan commented May 10, 2024

I've made a release as v4.2.0 as there were are few bigger things I changed in pathogen-profiler. This will be paired with tb-profiler v6.2.1. Should be available on conda tomorrow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants