
Error in reading vcf file #91

Closed
janedanes opened this Issue Feb 4, 2013 · 14 comments

5 participants

@janedanes

I'm a newbie to Python and am having trouble understanding why I can't parse my VCF file. Any help would be greatly appreciated!! Thanks so much.

for record in vcf.Reader(open('all.mergedSNPs.vcf', 'r')):
    print record

Here is the output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/PyVCF-0.6.0-py2.7-macosx-10.7-intel.egg/vcf/parser.py", line 216, in __init__
    self._parse_metainfo()
  File "/Library/Python/2.7/site-packages/PyVCF-0.6.0-py2.7-macosx-10.7-intel.egg/vcf/parser.py", line 254, in _parse_metainfo
    key, val = parser.read_meta(line.strip())
  File "/Library/Python/2.7/site-packages/PyVCF-0.6.0-py2.7-macosx-10.7-intel.egg/vcf/parser.py", line 166, in read_meta
    return self.read_meta_hash(meta_string)
  File "/Library/Python/2.7/site-packages/PyVCF-0.6.0-py2.7-macosx-10.7-intel.egg/vcf/parser.py", line 161, in read_meta_hash
    val = dict(item.split("=") for item in hashItems)
ValueError: dictionary update sequence element #4 has length 1; 2 is required
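[Editor's note] The ValueError comes from the metadata-parsing step shown in the last frame: `dict(item.split("=") for item in hashItems)` assumes every comma-separated piece of a `##key=<...>` header line contains exactly one `=`. A minimal sketch with hypothetical header pieces (not PyVCF's actual parser or this user's data) of how a stray comma inside a value reproduces this exact error:

```python
# Hypothetical pieces of a ##INFO=<...> header body after splitting on commas.
# A comma inside an unquoted Description value leaves a piece with no '=' at all:
bad_items = ['ID=X', 'Number=1', 'Type=String',
             'Description=reads with MQ 255', 'or bad mates filtered']

try:
    dict(item.split("=") for item in bad_items)
except ValueError as exc:
    # piece #4 ('or bad mates filtered') splits into a 1-element list,
    # so dict() raises: "... element #4 has length 1; 2 is required"
    print(exc)
```
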

@seandavi
seandavi commented Feb 4, 2013

You might try updating to the newest PyVCF version, 0.6.3, to start. I think there were a few bugs fixed between your version (0.6.0) and the current version.

@janedanes

thanks!

ok I did that and I got this error

import vcf
vcf_reader = vcf.Reader(open('/Users/sgbiofuels/Desktop/all.mergedSNPs.vcf','r'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/PyVCF-0.6.3-py2.7-macosx-10.7-intel.egg/vcf/parser.py", line 225, in __init__
    self._parse_metainfo()
  File "/Library/Python/2.7/site-packages/PyVCF-0.6.3-py2.7-macosx-10.7-intel.egg/vcf/parser.py", line 267, in _parse_metainfo
    key, val = parser.read_meta(line)
  File "/Library/Python/2.7/site-packages/PyVCF-0.6.3-py2.7-macosx-10.7-intel.egg/vcf/parser.py", line 166, in read_meta
    return self.read_meta_hash(meta_string)
  File "/Library/Python/2.7/site-packages/PyVCF-0.6.3-py2.7-macosx-10.7-intel.egg/vcf/parser.py", line 161, in read_meta_hash
    val = OrderedDict(item.split("=") for item in hashItems)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/collections.py", line 74, in __init__
    self.__update(*args, **kwds)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/_abcoll.py", line 499, in update
    for key, value in other:
ValueError: need more than 1 value to unpack

@jamescasbon
Owner

It's falling over on the metadata.

Can you share the file headers?

@amkenney
amkenney commented Apr 2, 2014

Hello, I am getting a very similar error, but I can't figure out why. I have PyVCF 0.6 installed, which I used to successfully parse a VCF file recently.

This is the error I get:

Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    v = vcf.Reader(filename='CACid.MIM.snps.q40.3.26.14.vcf.gz')
  File "vcf/parser.py", line 216, in __init__
    self._parse_metainfo()
  File "vcf/parser.py", line 254, in _parse_metainfo
    key, val = parser.read_meta(line.strip())
  File "vcf/parser.py", line 166, in read_meta
    return self.read_meta_hash(meta_string)
  File "vcf/parser.py", line 161, in read_meta_hash
    val = dict(item.split("=") for item in hashItems)
ValueError: dictionary update sequence element #4 has length 4; 2 is required

I didn't use to get this error. It seems to happen only with some new VCF files I made using the newest version of GATK.

Thoughts?
Thanks,
Amanda

@amkenney
amkenney commented Apr 2, 2014

This is what my header looks like. None of the format lines have 4 # signs in them.

##fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##GATKCommandLine=<ID=UnifiedGenotyper,Version=3.0-0-g6bad1c6,Date="Wed Mar 26 14:12:27 EDT 2014",Epoch=1395857547192,CommandLineOptions="analysis_type=UnifiedGenotyper input_file=[AHQT1G_q29.paired.nodup.realign.bam, BOG10G_q29.paired.nodup.realign.bam, DUNG_q29.paired.nodup.realign.bam, IM62.JGI_q29.paired.nodup.realign.bam, REM8G_q29.paired.nodup.realign.bam, SLP9G_q29.paired.nodup.realign.bam, CAC9N_q29.paired.nodup.realign.bam, KOOTN_q29.paired.nodup.realign.bam, MEN104_q29.paired.nodup.realign.bam, Mdent_q29.paired.nodup.realign.bam, LVR.bwamem.q29.paired.nodups.realign.bam, Mlac.WLF47.q29.nodup.paired.realign.bam, Mtil.BAG3.q29.nodup.paired.realign.bam, Mtil.STV115.q29.nodups.paired.realign.bam] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=mg.new.ref.fa nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=250 baq=OFF baqGapOpenPenalty=40.0 fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 allow_bqsr_on_reduced_bams_despite_repeated_warnings=false validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 logging_level=INFO log_to_file=null help=false version=false genotype_likelihoods_model=SNP pcr_error_rate=1.0E-4 computeSLOD=false annotateNDA=false pair_hmm_implementation=LOGLESS_CACHING min_base_quality_score=17 max_deletion_fraction=0.05 allSitePLs=false min_indel_count_for_genotyping=5 min_indel_fraction_per_sample=0.25 indelGapContinuationPenalty=10 indelGapOpenPenalty=45 indelHaplotypeSize=80 indelDebug=false ignoreSNPAlleles=false allReadsSP=false ignoreLaneInfo=false reference_sample_calls=(RodBinding name= source=UNBOUND) reference_sample_name=null sample_ploidy=2 min_quality_score=1 max_quality_score=40 site_quality_prior=20 min_power_threshold_for_calling=0.95 min_reference_depth=100 exclude_filtered_reference_sites=false output_mode=EMIT_ALL_CONFIDENT_SITES heterozygosity=0.001 indel_heterozygosity=1.25E-4 genotyping_mode=DISCOVERY standard_min_confidence_threshold_for_calling=40.0 standard_min_confidence_threshold_for_emitting=40.0 alleles=(RodBinding name= source=UNBOUND) max_alternate_alleles=6 input_prior=[] contamination_fraction_to_filter=0.0 contamination_fraction_per_sample_file=null p_nonref_model=EXACT_INDEPENDENT exactcallslog=null dbsnp=(RodBinding name= source=UNBOUND) comp=[] out=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub no_cmdline_in_header=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub sites_only=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub bcf=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub onlyEmitSamples=[] debug_file=null annotation=[] excludeAnnotation=[] filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=RPA,Number=.,Type=Integer,Description="Number of times tandem repeat unit is repeated, for each allele (including reference)">
##INFO=<ID=RU,Number=1,Type=String,Description="Tandem repeat unit (bases)">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=STR,Number=0,Type=Flag,Description="Variant is a short tandem repeat">
##contig=<ID=scaffold_1,length=13654788>
##contig=<ID=scaffold_2,length=19086486>
##contig=<ID=scaffold_3,length=18883600>
##contig=<ID=scaffold_4,length=21212587>
##contig=<ID=scaffold_5,length=24294111>
##contig=<ID=scaffold_6,length=19424543>
##contig=<ID=scaffold_7,length=18155354>
##contig=<ID=scaffold_8,length=24938873>
##contig=<ID=scaffold_9,length=15137105>
##contig=<ID=scaffold_10,length=19333236>
##contig=<ID=scaffold_11,length=25550741>
##contig=<ID=scaffold_12,length=26648731>
##contig=<ID=scaffold_13,length=21248885>
##contig=<ID=scaffold_14,length=26119285>
##contig=<ID=scaffold_15,length=2017172>
##contig=<ID=scaffold_16,length=1459161>
##contig=<ID=scaffold_17,length=1210954>
##contig=<ID=scaffold_23,length=393100>
##contig=<ID=scaffold_34,length=217123>
##contig=<ID=scaffold_38,length=241004>
##contig=<ID=scaffold_43,length=184681>
##contig=<ID=scaffold_45,length=162397>
##contig=<ID=scaffold_47,length=159104>
##contig=<ID=scaffold_52,length=130840>
##contig=<ID=scaffold_53,length=120786>
##contig=<ID=scaffold_56,length=106712>
##contig=<ID=scaffold_57,length=132043>
##contig=<ID=scaffold_61,length=86401>
##contig=<ID=scaffold_63,length=86991>
##contig=<ID=scaffold_66,length=79613>
##contig=<ID=scaffold_68,length=137247>
...deleted a bunch of lines...
##reference=file:///panfs/pstor.storage/mim.bwa.bams.proper.only/mg.new.ref.fa
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  AHQT1G  BOG10G  CAC9N   DUNG    IM62.JGI        KootN   MEN104  Mdent   Mlac.WLF47      Mtil.BAG3       Mtil.STV115     REM8G   SLP9G   til.LVR
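[Editor's note] The DP FORMAT line above is a likely culprit: its quoted Description embeds an `=` (`MQ=255`), exactly the pattern that trips a naive `split("=")`. A simplified sketch (not PyVCF's real parser, which must also cope with commas inside quoted values) of why splitting on only the first `=` is the robust choice, which is how newer PyVCF releases avoid the crash:

```python
# The actual DP FORMAT line from the header above:
line = ('##FORMAT=<ID=DP,Number=1,Type=Integer,Description='
        '"Approximate read depth (reads with MQ=255 or with bad mates are filtered)">')

body = line[len('##FORMAT=<'):-1]   # text between '<' and '>'
items = body.split(',')             # safe here: this Description contains no commas

# Naive split breaks on the '=' inside the quoted value:
try:
    dict(item.split('=') for item in items)
except ValueError as exc:
    print('naive split fails:', exc)

# Splitting on the first '=' only keeps the value intact:
meta = dict(item.split('=', 1) for item in items)
print(meta['Description'])
```
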
@amkenney
amkenney commented Apr 2, 2014

Nothing looks wrong with my = signs... I have no idea why.

@martijnvermaat
Collaborator

I have no problems parsing that header with the latest PyVCF version (0.6.7). Could you try updating PyVCF? If you installed with pip, a simple pip install -U PyVCF should do it.

(I added fenced code blocks to your comments so they correctly show the example code.)

@amkenney
amkenney commented Apr 3, 2014

I will try the new version and let you know. Thanks!

@amkenney
amkenney commented Apr 3, 2014

Forgive my ignorance, but I only see version 0.6 available for download under releases. I haven't downloaded many programs from GitHub (my computer cluster people usually do it). How can I get 0.6.7 or the most recent update?

Thanks!

@martijnvermaat
Collaborator

How did you previously install PyVCF?

If you installed from PyPI (either using easy_install or pip), you can upgrade with one of the following:

easy_install -U PyVCF
pip install -U PyVCF

If you just want to download the source code, you can do so from the PyPI page, or use the "Download ZIP" button at the bottom-right of the GitHub page.

(You are right that releases after 0.6 were not tagged, so they are not listed on the GitHub releases page.)

@amkenney
amkenney commented Apr 3, 2014

I don't think I ever truly "installed" PyVCF. I downloaded the latest release file (0.6), then copied the 'vcf' folder into the directory where I needed to use PyVCF. Then I start Python by calling idle from that directory and import vcf at the beginning of my script. I was introduced to PyVCF by someone else who did it that way, but I think he only did it that way because of a single-use need.

But, I would rather actually install PyVCF correctly so that I can use it from any directory without having to copy the 'vcf' folder every time. This is sort of a testament to my lack of understanding of how computer programs get installed/integrated on a computer.

I use a Mac. I don't know what 'easy_install' or 'pip' are.

@martijnvermaat
Collaborator

I don't have a mac, but you could probably just run the following command to install PyVCF (needs administrative rights):

sudo easy_install PyVCF
@amkenney
amkenney commented Apr 3, 2014

Updating to 0.6.7 worked.
Thanks!
