# File format

In this section, we introduce some genomic file formats and the programs that can handle them.

In [1]:
import pandas as pd
from basic_tools import *

## 1. `.vcf` and `pysam`

[pysam](https://pysam.readthedocs.io/en/latest/index.html) is a Python module for reading and manipulating genomic data files, including VCF/BCF files.

In [21]:
from pysam import VariantFile

# vcf_fn is the path to the VCF file (presumably defined in basic_tools)
vcf_in = VariantFile(vcf_fn)

[E::idx_find_and_load] Could not retrieve index file for '../data/ALL.chr22.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz'


The `header` attribute, a `VariantHeader` object, provides access to the information stored in the VCF header.

In [22]:
# vcf file header
print(vcf_in.header)

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=LDAF,Number=1,Type=Float,Description="MLE Allele Frequency Accounting for LD">
##INFO=<ID=AVGPOST,Number=1,Type=Float,Description="Average posterior probability from MaCH/Thunder">
##INFO=<ID=RSQ,Number=1,Type=Float,Description="Genotype imputation quality from MaCH/Thunder">
##INFO=<ID=ERATE,Number=1,Type=Float,Description="Per-marker Mutation rate from MaCH/Thunder">
##INFO=<ID=THETA,Number=1,Type=Float,Description="Per-marker Transition rate from MaCH/Thunder">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=HOMLEN,Number=.,Type=Integer,Description="Length of base pair identical micro-homology at event breakpoints">
##INF

Individual contents such as _contigs, info fields, samples, formats_ can be retrieved as attributes of the header:

In [23]:
print(vcf_in.header.contigs)  # direct access -> container object
print(list(vcf_in.header.contigs))
print(list(vcf_in.header.filters))
print(list(vcf_in.header.info))
print(list(vcf_in.header.samples))


<pysam.libcbcf.VariantHeaderContigs object at 0x7fec609d3bb0>
[]
['PASS']
['LDAF', 'AVGPOST', 'RSQ', 'ERATE', 'THETA', 'CIEND', 'CIPOS', 'END', 'HOMLEN', 'HOMSEQ', 'SVLEN', 'SVTYPE', 'AC', 'AN', 'AA', 'AF', 'AMR_AF', 'ASN_AF', 'AFR_AF', 'EUR_AF', 'VT', 'SNPSOURCE']
['HG00096', 'HG00097', 'HG00099', 'HG00100', 'HG00101', 'HG00102', 'HG00103', 'HG00104', 'HG00106', 'HG00108', 'HG00109', 'HG00110', 'HG00111', 'HG00112', 'HG00113', 'HG00114', 'HG00116', 'HG00117', 'HG00118', 'HG00119', 'HG00120', 'HG00121', 'HG00122', 'HG00123', 'HG00124', 'HG00125', 'HG00126', 'HG00127', 'HG00128', 'HG00129', 'HG00130', 'HG00131', 'HG00133', 'HG00134', 'HG00135', 'HG00136', 'HG00137', 'HG00138', 'HG00139', 'HG00140', 'HG00141', 'HG00142', 'HG00143', 'HG00146', 'HG00148', 'HG00149', 'HG00150', 'HG00151', 'HG00152', 'HG00154', 'HG00155', 'HG00156', 'HG00158', 'HG00159', 'HG00160', 'HG00171', 'HG00173', 'HG00174', 'HG00176', 'HG00177', 'HG00178', 'HG00179', 'HG00180', 'HG00182', 'HG00183', 'HG00185', 'HG0018
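
To make the link between header lines and these accessors concrete, here is a minimal sketch that does not use pysam. It collects the `ID` values of structured `##KEY=<ID=...>` lines from a small in-memory header using only the standard library. The helper name `header_ids` and the `HEADER_TEXT` subset are illustrative assumptions, and the regular expression only handles the simple layout shown here.

```python
import re
from collections import defaultdict

# A few meta lines copied from the header printed above (illustrative subset).
HEADER_TEXT = """\
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=LDAF,Number=1,Type=Float,Description="MLE Allele Frequency Accounting for LD">
##INFO=<ID=AVGPOST,Number=1,Type=Float,Description="Average posterior probability from MaCH/Thunder">
"""

# Matches structured lines such as ##INFO=<ID=LDAF,...> and captures key and ID.
STRUCTURED_LINE = re.compile(r"^##(\w+)=<ID=([^,>]+)")


def header_ids(text):
    """Group ID values by header key (hypothetical helper, not part of pysam)."""
    ids = defaultdict(list)
    for line in text.splitlines():
        match = STRUCTURED_LINE.match(line)
        if match:  # unstructured lines such as ##fileformat are skipped
            key, id_ = match.groups()
            ids[key].append(id_)
    return dict(ids)


ids_by_key = header_ids(HEADER_TEXT)
print(ids_by_key["FILTER"])
print(ids_by_key["INFO"])
```

The grouping by key loosely mirrors what `header.filters` and `header.info` expose, although pysam returns richer metadata objects rather than plain strings.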

Alternatively, it is possible to iterate through all records in the header, which returns objects of type `VariantHeaderRecord`:

In [24]:
for x in vcf_in.header.records:
    print(x)
    print(x.type)
    print(x.key)

##fileformat=VCFv4.1

GENERIC
fileformat
##FILTER=<ID=PASS,Description="All filters passed">

FILTER
FILTER
##INFO=<ID=LDAF,Number=1,Type=Float,Description="MLE Allele Frequency Accounting for LD">

INFO
INFO
##INFO=<ID=AVGPOST,Number=1,Type=Float,Description="Average posterior probability from MaCH/Thunder">

INFO
INFO
##INFO=<ID=RSQ,Number=1,Type=Float,Description="Genotype imputation quality from MaCH/Thunder">

INFO
INFO
##INFO=<ID=ERATE,Number=1,Type=Float,Description="Per-marker Mutation rate from MaCH/Thunder">

INFO
INFO
##INFO=<ID=THETA,Number=1,Type=Float,Description="Per-marker Transition rate from MaCH/Thunder">

INFO
INFO
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">

INFO
INFO
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">

INFO
INFO
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">

INFO
INFO
##I

`pysam.VariantFile.fetch()` iterates over `VariantRecord` objects, which provide access to simple variant attributes such as _contig, pos, ref_:

In [29]:
for rec in vcf_in.fetch():
    print(dir(rec))
    break

['__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', 'alleles', 'alts', 'chrom', 'contig', 'copy', 'filter', 'format', 'header', 'id', 'info', 'pos', 'qual', 'ref', 'rid', 'rlen', 'samples', 'start', 'stop', 'translate']
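
As a rough illustration of where these attributes come from, the sketch below splits one tab-separated VCF data line into its fixed columns without using pysam. The `SimpleRecord` type, the `parse_record` helper, and the example line (including its position and ID) are made up for illustration. In pysam, `pos` is the 1-based position from the file and `start` is the corresponding 0-based coordinate, which the sketch mimics.

```python
from typing import NamedTuple, Optional, Tuple


class SimpleRecord(NamedTuple):
    """Hypothetical stand-in for a few VariantRecord attributes."""
    contig: str
    pos: int                          # 1-based position, as written in the file
    id: Optional[str]
    ref: str
    alts: Optional[Tuple[str, ...]]

    @property
    def start(self):
        # 0-based coordinate of the same position
        return self.pos - 1


def parse_record(line):
    """Parse the first five fixed columns of a VCF data line."""
    chrom, pos, id_, ref, alt = line.rstrip("\n").split("\t")[:5]
    return SimpleRecord(
        contig=chrom,
        pos=int(pos),
        id=None if id_ == "." else id_,                       # "." means missing
        ref=ref,
        alts=None if alt == "." else tuple(alt.split(",")),   # ALT may list several alleles
    )


# Made-up example line: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
line = "22\t16050408\texample_id\tT\tC\t100\tPASS\t."
rec = parse_record(line)
print(rec.contig, rec.pos, rec.start, rec.id, rec.ref, rec.alts)
```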


In [34]:
# collect the ID of every record in the file
variants_id = [rec.id for rec in vcf_in.fetch()]

len(variants_id)

494328

In [37]:
len(list(vcf_in.header.samples))

1092
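
The two counts above can also be cross-checked without pysam. This is a minimal sketch, assuming a plain-text VCF: sample names are the columns after `FORMAT` on the `#CHROM` line, and every non-header line is one record. The tiny in-memory VCF and the `count_samples_and_records` helper are illustrative assumptions. For a `.vcf.gz` file, `gzip.open(path, "rt")` could supply the lines instead of `io.StringIO`.

```python
import io

# Tiny made-up VCF with two samples and three records.
TINY_VCF = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG00096\tHG00097\n"
    "22\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|1\n"
    "22\t200\t.\tC\tT\t.\tPASS\t.\tGT\t1|1\t0|0\n"
    "22\t300\t.\tG\tA\t.\tPASS\t.\tGT\t0|1\t0|0\n"
)


def count_samples_and_records(handle):
    """Count sample columns and data lines in a VCF text stream (hypothetical helper)."""
    n_samples = 0
    n_records = 0
    for line in handle:
        if line.startswith("##"):
            continue  # meta-information line
        if line.startswith("#CHROM"):
            # the first nine columns are fixed; the rest are sample names
            n_samples = len(line.rstrip("\n").split("\t")[9:])
        elif line.strip():
            n_records += 1
    return n_samples, n_records


n_samples, n_records = count_samples_and_records(io.StringIO(TINY_VCF))
print(n_samples, n_records)
```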