Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reading/writing vcf breaks VCF header #210

Closed
jrderuiter opened this issue Oct 8, 2015 · 3 comments
Closed

Reading/writing vcf breaks VCF header #210

jrderuiter opened this issue Oct 8, 2015 · 3 comments

Comments

@jrderuiter
Copy link

I am trying to use PyVCF to write a simple script that filters VCF records based on certain criteria. However, this script seems to break the header of my VCF file, as bcftools no longer parses the resulting VCF, but has no problem with the VCF that is used as input.

def main(args):
    in_path = args.input
    out_path = args.output or path.splitext(in_path)[0] + '.som.vcf'

    with open(in_path, 'r') as vcf_in:
        reader = vcf.Reader(vcf_in)
        records = iter(reader)

        # Filter records here.

        # Write records to output VCF.
        with open(out_path, 'w') as vcf_out:
            writer = vcf.Writer(vcf_out, template=reader)
            for record in records:
                writer.write_record(record)

Input vcf header: https://gist.github.com/jrderuiter/ca373a8d4b4652379244
Output vcf header: https://gist.github.com/jrderuiter/e1a4b2d5e437f7652e8c

Bcftools errors on output vcf:

Could not parse the header line: "##GATKCommandLine.VariantFiltration=<ID=VariantFiltration,Version=3.4-0-g7e26428,Date="Wed Oct 07 09:11:47 CEST 2015",Epoch=1444201907400,CommandLineOptions="analysis_type=VariantFiltration input_file=[] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[] disable_read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=./references/homo_sapiens/b37/dna/bwa/human_g1k_v37_decoy.fa nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 logging_level=INFO log_to_file=null help=false version=false variant=(RodBinding name=variant source=./out/joint_calls.snps.vcf) mask=(RodBinding name= source=UNBOUND) out=./out/joint_calls.snps.filt.vcf.tmp/joint_calls.snps.filt.vcf filterExpression=[QD >
##GVCFBlock0-1=minGQ=0(inclusive),maxGQ=1(exclusive)
##GVCFBlock1-2=minGQ=1(inclusive),maxGQ=2(exclusive)
##GVCFBlock10-11=minGQ=10(inclusive),maxGQ=11(exclusive)
##GVCFBlock11-12=minGQ=11(inclusive),maxGQ=12(exclusive)
##GVCFBlock12-13=minGQ=12(inclusive),maxGQ=13(exclusive)
##GVCFBlock13-14=minGQ=13(inclusive),maxGQ=14(exclusive)
##GVCFBlock14-15=minGQ=14(inclusive),maxGQ=15(exclusive)
##GVCFBlock15-16=minGQ=15(inclusive),maxGQ=16(exclusive)
##GVCFBlock16-17=minGQ=16(inclusive),maxGQ=17(exclusive)
##GVCFBlock17-18=minGQ=17(inclusive),maxGQ=18(exclusive)
##GVCFBlock18-19=minGQ=18(inclusive),maxGQ=19(exclusive)
##GVCFBlock19-20=minGQ=19(inclusive),maxGQ=20(exclusive)
##GVCFBlock2-3=minGQ=2(inclusive),maxGQ=3(exclusive)
##GVCFBlock20-21=minGQ=20(inclusive),maxGQ=21(exclusive)
##GVCFBlock21-22=minGQ=21(inclusive),maxGQ=22(exclusive)
##GVCFBlock22-23=minGQ=22(inclusive),maxGQ=23(exclusive)
##GVCFBlock23-24=minGQ=23(inclusive),maxGQ=24(exclusive)
##GVCFBlock24-25=minGQ=24(inclusive),maxGQ=25(exclusive)
##GVCFBlock25-26=minGQ=25(inclusive),maxGQ=26(exclusive)
##GVCFBlock26-27=minGQ=26(inclusive),maxGQ=27(exclusive)
##GVCFBlock27-28=minGQ=27(inclusive),maxGQ=28(exclusive)
##GVCFBlock28-29=minGQ=28(inclusive),maxGQ=29(exclusive)
##GVCFBlock29-30=minGQ=29(inclusive),maxGQ=30(exclusive)
##GVCFBlock3-4=minGQ=3(inclusive),maxGQ=4(exclusive)
##GVCFBlock30-31=minGQ=30(inclusive),maxGQ=31(exclusive)
##GVCFBlock31-32=minGQ=31(inclusive),maxGQ=32(exclusive)
##GVCFBlock32-33=minGQ=32(inclusive),maxGQ=33(exclusive)
##GVCFBlock33-34=minGQ=33(inclusive),maxGQ=34(exclusive)
##GVCFBlock34-35=minGQ=34(inclusive),maxGQ=35(exclusive)
##GVCFBlock35-36=minGQ=35(inclusive),maxGQ=36(exclusive)
##GVCFBlock36-37=minGQ=36(inclusive),maxGQ=37(exclusive)
##GVCFBlock37-38=minGQ=37(inclusive),maxGQ=38(exclusive)
##GVCFBlock38-39=minGQ=38(inclusive),maxGQ=39(exclusive)
##GVCFBlock39-40=minGQ=39(inclusive),maxGQ=40(exclusive)
##GVCFBlock4-5=minGQ=4(inclusive),maxGQ=5(exclusive)
##GVCFBlock40-41=minGQ=40(inclusive),maxGQ=41(exclusive)
##GVCFBlock41-42=minGQ=41(inclusive),maxGQ=42(exclusive)
##GVCFBlock42-43=minGQ=42(inclusive),maxGQ=43(exclusive)
##GVCFBlock43-44=minGQ=43(inclusive),maxGQ=44(exclusive)
##GVCFBlock44-45=minGQ=44(inclusive),maxGQ=45(exclusive)
##GVCFBlock45-46=minGQ=45(inclusive),maxGQ=46(exclusive)
##GVCFBlock46-47=minGQ=46(inclusive),maxGQ=47(exclusive)
##GVCFBlock47-48=minGQ=47(inclusive),maxGQ=48(exclusive)
##GVCFBlock48-49=minGQ=48(inclusive),maxGQ=49(exclusive)
##GVCFBlock49-50=minGQ=49(inclusive),maxGQ=50(exclusive)
##GVCFBlock5-6=minGQ=5(inclusive),maxGQ=6(exclusive)
##GVCFBlock50-51=minGQ=50(inclusive),maxGQ=51(exclusive)
##GVCFBlock51-52=minGQ=51(inclusive),maxGQ=52(exclusive)
##GVCFBlock52-53=minGQ=52(inclusive),maxGQ=53(exclusive)
##GVCFBlock53-54=minGQ=53(inclusive),maxGQ=54(exclusive)
##GVCFBlock54-55=minGQ=54(inclusive),maxGQ=55(exclusive)
##GVCFBlock55-56=minGQ=55(inclusive),maxGQ=56(exclusive)
##GVCFBlock56-57=minGQ=56(inclusive),maxGQ=57(exclusive)
##GVCFBlock57-58=minGQ=57(inclusive),maxGQ=58(exclusive)
##GVCFBlock58-59=minGQ=58(inclusive),maxGQ=59(exclusive)
##GVCFBlock59-60=minGQ=59(inclusive),maxGQ=60(exclusive)
##GVCFBlock6-7=minGQ=6(inclusive),maxGQ=7(exclusive)
##GVCFBlock60-70=minGQ=60(inclusive),maxGQ=70(exclusive)
##GVCFBlock7-8=minGQ=7(inclusive),maxGQ=8(exclusive)
##GVCFBlock70-80=minGQ=70(inclusive),maxGQ=80(exclusive)
##GVCFBlock8-9=minGQ=8(inclusive),maxGQ=9(exclusive)
##GVCFBlock80-90=minGQ=80(inclusive),maxGQ=90(exclusive)
##GVCFBlock9-10=minGQ=9(inclusive),maxGQ=10(exclusive)
##GVCFBlock90-99=minGQ=90(inclusive),maxGQ=99(exclusive)
##GVCFBlock99-2147483647=minGQ=99(inclusive),maxGQ=2147483647(exclusive)
##reference=file://./references/homo_sapiens/b37/dna/bwa/human_g1k_v37_decoy.fa
##source=SelectVariants
##bcftools_normVersion=1.2+htslib-1.2.1
##bcftools_normCommand=norm -m-both --output ./out/joint_calls.merged.filt.pass.norm_tmp.vcf.tmp/joint_calls.merged.filt.pass.norm_tmp.vcf ./out/joint_calls.merged.filt.pass.vcf
##bcftools_normCommand=norm -f ./references/homo_sapiens/b37/dna/bwa/human_g1k_v37_decoy.fa --output ./out/joint_calls.merged.filt.pass.norm.vcf.tmp/joint_calls.merged.filt.pass.norm.vcf ./out/joint_calls.merged.filt.pass.norm_tmp.vcf
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">"
[W::vcf_parse] contig '1' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] INFO 'AC' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'AF' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'AN' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'BaseQRankSum' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'ClippingRankSum' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'DB' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'DP' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'FS' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'InbreedingCoeff' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'MLEAC' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'MLEAF' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'MQ' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'MQRankSum' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'QD' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'ReadPosRankSum' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'SOR' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'ANNOVAR_DATE' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'Func.ensGene' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'Gene.ensGene' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'GeneDetail.ensGene' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'ExonicFunc.ensGene' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'AAChange.ensGene' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'SIFT_score' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'SIFT_pred' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'Polyphen2_HDIV_score' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'Polyphen2_HDIV_pred' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'Polyphen2_HVAR_score' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'Polyphen2_HVAR_pred' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'LRT_score' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'LRT_pred' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'MutationTaster_score' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'MutationTaster_pred' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'MutationAssessor_score' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'MutationAssessor_pred' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'FATHMM_score' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'FATHMM_pred' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'RadialSVM_score' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'RadialSVM_pred' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'LR_score' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'LR_pred' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'VEST3_score' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'CADD_raw' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'CADD_phred' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'GERP++_RS' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'phyloP46way_placental' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'phyloP100way_vertebrate' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'SiPhy_29way_logOdds' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'snp138NonFlagged' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO '1000g2015aug_eur' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'nci60' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'cosmic70' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'clinvar_20150629' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'ALLELE_END' is not defined in the header, assuming Type=String
[W::vcf_parse] contig '2' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig '3' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig '4' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig '5' is not defined in the header. (Quick workaround: index the file with tabix.)
...
@martijnvermaat
Copy link
Collaborator

Thanks @jrderuiter for reporting this.

A quick diff shows two things:

colordiff -ruN
  <(curl https://gist.githubusercontent.com/jrderuiter/ca373a8d4b4652379244/raw/35c4464de809fade8a46d7eb8f61916bdf51fb31/gistfile1.txt
    | sort)
  <(curl https://gist.githubusercontent.com/jrderuiter/e1a4b2d5e437f7652e8c/raw/50af1092976905ac290a56b46407c757a9bbf2d4/gistfile1.txt
    | sort)
  1. The assembly fields in the contig headers are lost.
  2. The GATKCommandLine.VariantFiltration header is truncated in the CommandLineOptions field at the position where the original contains a < character.

The former is too bad, but not the cause of your issue. The latter is due to PyVCF not properly parsing the header lines (see also #90). I'll try to improve this.

@jrderuiter
Copy link
Author

Thank you for looking into this. The two differences you mention are indeed
the differences that I also noticed myself. The assembly information would
be nice to keep, but is as you say not critical. A fix for 2 would be great
though!
On Thu 8 Oct 2015 at 11:39 Martijn Vermaat notifications@github.com wrote:

Thanks @jrderuiter https://github.com/jrderuiter for reporting this.

A quick diff shows two things:

colordiff -ruN
<(curl https://gist.githubusercontent.com/jrderuiter/ca373a8d4b4652379244/raw/35c4464de809fade8a46d7eb8f61916bdf51fb31/gistfile1.txt
| sort)
<(curl https://gist.githubusercontent.com/jrderuiter/e1a4b2d5e437f7652e8c/raw/50af1092976905ac290a56b46407c757a9bbf2d4/gistfile1.txt
| sort)

  1. The assembly fields in the contig headers are lost.
  2. The GATKCommandLine.VariantFiltration header is truncated in the
    CommandLineOptions field at the position where the original contains a
    < character.

The former is too bad, but not the cause of your issue. The latter is due
to PyVCF not properly parsing the header lines (see also #90
#90). I'll try to improve
this.


Reply to this email directly or view it on GitHub
#210 (comment).

@martijnvermaat
Copy link
Collaborator

You could try again with the master branch:

pip install -e git+https://github.com/jamescasbon/PyVCF.git#egg=PyVCF

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants