## Big Data for Biologists: Decoding Genomic Function - Class 15


## Learning Objectives
***Students should be able to***
 <ol>
 <li><a href=#geneticVariation>Identify different types of genetic variation that can occur across individuals of a species.</a></li>
 <li><a href=#geneticVariation>Describe the goals of the 1000 Genomes Project.</a></li>
 <li><a href=#vcf>Understand how to use data in the variant call format (VCF).</a></li>
 <li><a href=#tabix>Use the tabix tool to query a VCF file.</a></li>
 </ol>


In tutorial 4, we learned how to use the [Burrows-Wheeler aligner](http://bio-bwa.sourceforge.net/) to map FASTQ reads to a reference genome. The resulting alignment can serve as a starting point for identifying genetic variants in the genomic sequence data. We have followed the workflow below to identify variants in a yeast dataset: 
![pipeline](../Images/pipeline.png)

## Working with Variant Call Format (VCF) files <a name='vcf'></a>

A whole-genome sequencing experiment was performed on yeast cells. The sequencing was paired-end, with output FASTQ files **y1.fastq** and **y2.fastq**. These reads were aligned to the yeast reference genome, stored in the file **yeast.fasta**, and variants were called following the pipeline shown above. The resulting variant file, in VCF format, is **yeast_vars.vcf.gz**.

In [None]:
!zcat yeast_vars.vcf.gz | head -n 50

The columns in the VCF file can be interpreted as described [here](https://faculty.washington.edu/browning/beagle/intro-to-vcf.html).
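
As a minimal sketch of how those columns map onto a single record, the snippet below splits one tab-delimited VCF data line into the eight fixed columns defined by the VCF specification. The example line and the helper name `parse_vcf_line` are made up for illustration; they are not taken from **yeast_vars.vcf.gz**.

```python
# Fixed columns of a VCF data line, in order (per the VCF specification).
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Split one VCF data line into a dict (hypothetical helper for illustration)."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    # QUAL is "." when the caller did not report a quality score.
    record["QUAL"] = None if record["QUAL"] == "." else float(record["QUAL"])
    # INFO holds semicolon-separated key=value pairs; flags have no "=value" part.
    info = {}
    if record["INFO"] != ".":
        for item in record["INFO"].split(";"):
            key, sep, value = item.partition("=")
            info[key] = value if sep else True
    record["INFO"] = info
    # Anything after the eighth column is FORMAT plus one column per sample.
    record["SAMPLES"] = fields[8:]
    return record


# Made-up example line, not real data from this tutorial.
example_line = "II\t1234\t.\tA\tG\t48.5\t.\tDP=20;AF1=1\tGT:PL\t1/1:80,60,0"
parsed = parse_vcf_line(example_line)
print(parsed["CHROM"], parsed["POS"], parsed["REF"], parsed["ALT"], parsed["QUAL"])
```

Header lines (those starting with `#`) are not handled here and would need to be skipped before calling the helper.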

We use pysam's **tabix_index** function to generate an index of the VCF file for rapid querying.

In [None]:
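import pysam
# force=True overwrites an existing index; preset="vcf" tells tabix which columns hold chromosome and position
pysam.tabix_index("yeast_vars.vcf.gz", force=True, preset="vcf")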

Additionally, you may find it helpful to prepare graphs and statistics to assist you in filtering your variants:



In [None]:
!bcftools stats -F yeast.fasta -s - yeast_vars.vcf.gz > yeast_vars.vcf.stats

Print the statistics:

In [None]:
!cat yeast_vars.vcf.stats
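
The stats file is long, so it can help to pull out just the summary numbers. The sketch below assumes the summary lines are tab-separated as `SN`, a set id, a key, and a value, which is how recent bcftools versions write them; the example text and the helper name `summary_numbers` are made up for illustration.

```python
def summary_numbers(stats_text):
    """Collect the 'SN' summary lines of bcftools stats output into a dict.

    Assumes each summary line looks like: SN <tab> id <tab> key: <tab> value
    """
    summary = {}
    for line in stats_text.splitlines():
        if not line.startswith("SN\t"):
            continue  # skip comments and all other sections
        _tag, _set_id, key, value = line.split("\t")[:4]
        summary[key.rstrip(":")] = int(value)
    return summary


# Made-up excerpt in the assumed format, not real output for the yeast data.
example_stats = (
    "# SN\t[2]id\t[3]key\t[4]value\n"
    "SN\t0\tnumber of records:\t1200\n"
    "SN\t0\tnumber of SNPs:\t1000\n"
    "SN\t0\tnumber of indels:\t200\n"
)
numbers = summary_numbers(example_stats)
print(numbers)
```

To apply it to the real file, read **yeast_vars.vcf.stats** into a string with `open(...).read()` and pass that string to the helper.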

The statistics can also be rendered as summary plots (for example, with the `plot-vcfstats` script distributed with bcftools). Of most interest to us is the tally of base substitutions and insertions/deletions (indels) observed in the data.

Substitutions:
![substitutions tally](../Images/substitutions.0.png)
Indels: 
![indels tally](../Images/indels.0.png)
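
To make the distinction between the two tallies concrete, here is a simplified sketch that classifies each alternate allele by comparing the lengths of the REF and ALT strings. This rule is an assumption chosen for illustration and is not necessarily how bcftools classifies variants; symbolic alleles such as `<DEL>` are not handled, and the REF/ALT pairs are made up.

```python
from collections import Counter


def classify_allele(ref, alt):
    """Simplified rule: single base swapped = substitution; length change = indel."""
    if len(ref) == 1 and len(alt) == 1:
        return "substitution"
    if len(ref) != len(alt):
        return "indel"
    return "other"  # e.g. multi-base replacements of equal length


def tally_variants(ref_alt_pairs):
    """Count allele classes; a comma-separated ALT field lists several alleles."""
    counts = Counter()
    for ref, alt_field in ref_alt_pairs:
        for alt in alt_field.split(","):
            counts[classify_allele(ref, alt)] += 1
    return counts


# Made-up (REF, ALT) pairs, not taken from the yeast data.
example_pairs = [("A", "G"), ("C", "T"), ("AT", "A"), ("G", "GCC"), ("T", "C,TA")]
tally = tally_variants(example_pairs)
print(dict(tally))
```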

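Not all variants are high quality. We want to filter the VCF file so that only variants with high quality scores (i.e. QUAL > 10) pass. We can do this by passing filter arguments to **bcftools**. Note that with `-s LOWQUAL`, records that fail the expression are kept in the output and labeled LOWQUAL in the FILTER column rather than removed.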

In [None]:
!bcftools filter -O z -o yeast_vars.filtered.vcf.gz -s LOWQUAL -i'%QUAL>10' yeast_vars.vcf.gz 
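
The difference between labeling low-quality records (a soft filter) and dropping them (a hard filter) can be sketched in plain Python. This is an illustration of the idea only, not a reimplementation of bcftools: the records are made up, the function names are hypothetical, and marking passing records as `PASS` is a simplifying assumption.

```python
def soft_filter(records, min_qual=10.0, label="LOWQUAL"):
    """Keep every record, but set FILTER to `label` when QUAL is not above the cutoff."""
    tagged = []
    for rec in records:
        new_rec = dict(rec)  # copy so the input records are left unchanged
        new_rec["FILTER"] = "PASS" if rec["QUAL"] > min_qual else label
        tagged.append(new_rec)
    return tagged


def hard_filter(records, min_qual=10.0):
    """Drop every record whose QUAL is not above the cutoff."""
    return [rec for rec in records if rec["QUAL"] > min_qual]


# Made-up records, not taken from the yeast data.
example_records = [
    {"CHROM": "II", "POS": 100, "QUAL": 48.5, "FILTER": "."},
    {"CHROM": "II", "POS": 250, "QUAL": 4.2, "FILTER": "."},
    {"CHROM": "II", "POS": 900, "QUAL": 10.0, "FILTER": "."},
]
soft = soft_filter(example_records)
hard = hard_filter(example_records)
print([rec["FILTER"] for rec in soft])
print([rec["POS"] for rec in hard])
```

Because the comparison is strict, a record with QUAL exactly equal to 10 does not pass in this sketch.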

## tabix <a name='tabix'></a>

The tabix tool can be used to query an indexed VCF file and select the variants that fall within a region of interest. For example:

In [None]:
# Open the indexed VCF file with pytabix (this is the unfiltered file that was indexed earlier)
import tabix
tb = tabix.open("yeast_vars.vcf.gz")

In [None]:
# A query returns an iterator over the results.
records = tb.query("II",1,325188)
for record in records: 
    print(record)

A file must first be indexed (here with `pysam.tabix_index`, as shown earlier) before it can be queried with pytabix.