# Genomics Data Analysis with Azure Jupyter Notebooks: Genome in a Bottle (GIAB)

Jupyter notebooks are a useful tool for data scientists working on genomics data analysis. This notebook demonstrates how to use GATK and Picard in an Azure Jupyter notebook with data from Azure Open Datasets.

**This notebook covers:**

1. Creating an index file for a VCF file
2. Converting the VCF file to a table

**Dependencies:**

This notebook requires the following libraries:

- Azure Storage SDK: `pip install azure-storage-blob==2.1.0`. See [this page](https://github.com/Azure/azure-storage-python/wiki) for frequently encountered problems with this SDK.


- Genome Analysis Toolkit (GATK) (*Download GATK from the Broad Institute's releases page into the same compute environment as this notebook: https://github.com/broadinstitute/gatk/releases*)

**Important information: This notebook uses the Python 3.6 kernel.**


# 1. Getting the GIAB Genomics data from Azure Open Dataset

Several public genomics datasets are available as Azure Open Datasets [here](https://azure.microsoft.com/services/open-datasets/catalog/). We create a blob service client linked to one of these open datasets. The cells below show how to retrieve data from the `Genome in a Bottle (GIAB)` dataset:

**1.a. Install the Azure Blob Storage SDK**

In [None]:
pip install azure-storage-blob==2.1.0

**1.b. Download the target file**

In [None]:
from azure.storage.blob import BlockBlobService

# Connect to the public GIAB storage account using its SAS token.
blob_service_client = BlockBlobService(account_name='datasetgiab', sas_token='sv=2019-02-02&se=2050-01-01T08%3A00%3A00Z&si=prod&sr=c&sig=7qp%2BxGLGc%2BO2MIVzzDZY7GSqEwthyGnhXJ566KoH7As%3D')

# Arguments: container and folder path, blob name, local destination path.
blob_service_client.get_blob_to_path('dataset/data/NA12878/analysis/GIAB_integration', 'NIST_RTG_PlatGen_merged_highconfidence_v0.2_Allannotate.vcf.gz', './NIST_RTG_PlatGen_merged_highconfidence_v0.2_Allannotate.vcf.gz')
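
Before running GATK, it can help to confirm that the download is a readable, compressed VCF. The following is a minimal sketch using only the Python standard library; the helper name `read_vcf_header` is hypothetical and not part of the original notebook.

```python
import gzip


def read_vcf_header(path, max_lines=5):
    """Return up to max_lines leading header lines (starting with '#') of a gzipped VCF."""
    header = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Stop at the first data record or once enough lines are collected.
            if not line.startswith("#") or len(header) >= max_lines:
                break
            header.append(line.rstrip("\n"))
    return header
```

Calling `read_vcf_header('NIST_RTG_PlatGen_merged_highconfidence_v0.2_Allannotate.vcf.gz')` should return lines beginning with `##` if the file downloaded intact.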

# 2. Create an index for a feature file, e.g. a VCF or BED file

This tool creates an index file for the various kinds of feature-containing files supported by GATK (such as VCF and BED files). An index allows querying features by a genomic interval.


In [None]:
!./gatk IndexFeatureFile -I NIST_RTG_PlatGen_merged_highconfidence_v0.2_Allannotate.vcf.gz 
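
To check that the index was written, one option is to look for the index file next to the input. The sketch below assumes that GATK writes a Tabix index (`.tbi`) for block-compressed inputs and a Tribble index (`.idx`) for uncompressed ones. Verify this convention against the IndexFeatureFile documentation for your GATK version. The helper name `expected_index_path` is hypothetical.

```python
import os


def expected_index_path(feature_file):
    """Guess the index path GATK would write for a feature file.

    Assumption: '.tbi' for block-compressed (.gz) inputs, '.idx' otherwise.
    """
    suffix = ".tbi" if feature_file.endswith(".gz") else ".idx"
    return feature_file + suffix


def index_exists(feature_file):
    """Return True if the assumed index file is present on disk."""
    return os.path.exists(expected_index_path(feature_file))
```

For the file used in this notebook, `index_exists('NIST_RTG_PlatGen_merged_highconfidence_v0.2_Allannotate.vcf.gz')` would look for a file with the same name plus `.tbi`.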

# 3. Extract fields from a VCF file to a tab-delimited table 

This tool extracts the specified fields for each variant in a VCF file and writes them to a tab-delimited table, which can be easier to work with than the VCF itself.


**INFO/site-level fields**

Use the `-F` argument to extract INFO fields; each field will occupy a single column in the output file. The field can be any standard VCF column (e.g. CHROM, ID, QUAL) or any annotation name in the INFO field (e.g. AC, AF). The tool also supports the following additional fields:

- EVENTLENGTH (length of the event)
- TRANSITION (1 for a bi-allelic transition (SNP), 0 for bi-allelic transversion (SNP), -1 for INDELs and multi-allelics)
- HET (count of het genotypes)
- HOM-REF (count of homozygous reference genotypes)
- HOM-VAR (count of homozygous variant genotypes)
- NO-CALL (count of no-call genotypes)
- TYPE (type of variant, possible values are NO_VARIATION, SNP, MNP, INDEL, SYMBOLIC, and MIXED)
- VAR (count of non-reference genotypes)
- NSAMPLES (number of samples)
- NCALLED (number of called samples)
- MULTI-ALLELIC (is this variant multi-allelic? true/false)


**FORMAT/sample-level fields**

Use the `-GF` argument to extract FORMAT/sample-level fields. The tool will create a new column per sample with the name "SAMPLE_NAME.FORMAT_FIELD_NAME" e.g. NA12877.GQ, NA12878.GQ.
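
When many fields are requested, assembling the `-F` and `-GF` arguments by hand gets error-prone. The sketch below builds the argument list in Python. The helper name `build_variants_to_table_cmd` and its parameters are hypothetical; only the `-V`, `-F`, `-GF`, and `-O` flags come from the tool usage described in this notebook.

```python
import shlex


def build_variants_to_table_cmd(vcf, output, site_fields=(), sample_fields=(), gatk="./gatk"):
    """Assemble a VariantsToTable command as a list of arguments."""
    cmd = [gatk, "VariantsToTable", "-V", vcf]
    for field in site_fields:
        cmd += ["-F", field]    # one output column per site-level field
    for field in sample_fields:
        cmd += ["-GF", field]   # one output column per sample for each field
    cmd += ["-O", output]
    return cmd


cmd = build_variants_to_table_cmd(
    "input.vcf.gz", "outputtable.table",
    site_fields=["CHROM", "POS", "TYPE"], sample_fields=["GQ"],
)
# Print a shell-quoted form that can be pasted into a notebook cell after '!'.
print(" ".join(shlex.quote(part) for part in cmd))
```

The returned list can also be passed directly to `subprocess.run(cmd, check=True)` instead of using the `!` shell escape.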



**Input**

A VCF file to convert to a table

**Output**

A tab-delimited file containing the values of the requested fields in the VCF file.


In [None]:
!./gatk VariantsToTable -V NIST_RTG_PlatGen_merged_highconfidence_v0.2_Allannotate.vcf.gz -F CHROM -F POS -F TYPE -O outputtable.table
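
Once the table is written, it can be inspected with the Python standard library. The sketch below assumes the first row of the output is a header naming the requested fields (here `CHROM`, `POS`, `TYPE`) and tallies the values in the `TYPE` column. The helper name `count_variant_types` is hypothetical.

```python
import csv
from collections import Counter


def count_variant_types(table_path):
    """Count rows per value of the TYPE column in a tab-delimited table."""
    with open(table_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row["TYPE"] for row in reader)
```

For example, `count_variant_types('outputtable.table')` returns a `Counter` mapping each variant type, such as SNP or INDEL, to its row count.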

# References

1. IndexFeatureFile: https://gatk.broadinstitute.org/hc/en-us/articles/360037069832-IndexFeatureFile
2. VariantsToTable: https://gatk.broadinstitute.org/hc/en-us/articles/360036882811-VariantsToTable
3. Genome in a Bottle: https://www.nist.gov/programs-projects/genome-bottle 

