# 'uBAM to Machine Learning ready table' pipeline with Cromwell on Azure 

Jupyter Notebook is a useful tool for data scientists working on genomics data analysis. In this notebook we demonstrate the `germline alignment and variant calling pipeline` on Cromwell on Azure (CoA), followed by downstream analysis of its output with GATK and Picard.

**Here is the coverage of this notebook:**

**1.** Download the CoA client

**2.** Deploy your instance of Cromwell on Azure

**3.** Upload the sample wdl, input.json and trigger.json files to the storage account of your CoA instance

**4.** Download the output GVCF file to the notebook compute instance

**5.** Annotate genotypes using VariantFiltration

**6.** Select Specific Variants

**7.** Filter the relevant variants- no calls OR specific regions

**8.** Perform concordance analysis

**9.** Merge GVCF files

**10.** Convert the final VCF files to a table 

**11.** Export variant table to Blob Storage for further PowerBI visualization

**Dependencies:**

This notebook requires the following libraries:

- Azure CLI 

- AzCopy: Please install the latest release of `AzCopy`: https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10

- Cromwell on Azure: Please download the latest release of `CoA` from: https://github.com/microsoft/CromwellOnAzure/releases

- Picard: Please download the latest release of the tool from https://broadinstitute.github.io/picard/

- Genome Analysis Toolkit (GATK) (*Users need to download `GATK` from the Broad Institute's releases page into the same compute environment as this notebook: https://github.com/broadinstitute/gatk/releases*)

- Users need a reference genome to use this notebook in their environment: 

[Homo_sapiens_assembly38.fasta](https://datasettoaexample.blob.core.windows.net/publicsample/Homo_sapiens_assembly38.fasta)

[Homo_sapiens_assembly38.dict](https://datasettoaexample.blob.core.windows.net/publicsample/Homo_sapiens_assembly38.dict)

[Homo_sapiens_assembly38.fasta.fai](https://datasettoaexample.blob.core.windows.net/publicsample/Homo_sapiens_assembly38.fasta.fai)

**Important information: This notebook is using Python 3.6 kernel**


# 1. Download the deployment executable of Cromwell on Azure

Download the required executable from [Releases](https://github.com/microsoft/CromwellOnAzure/releases). Choose the runtime of your choice from `win-x64`, `linux-x64`, `osx-x64`. *On Windows machines, we recommend using the `win-x64` runtime (deployment using the `linux-x64` runtime via the Windows Subsystem for Linux is not supported).*<br/>

In [None]:
!wget https://github.com/microsoft/CromwellOnAzure/releases/download/4.5.0/deploy-cromwell-on-azure-linux.tar.gz

# 2. Deploy your instance of Cromwell on Azure

### Prerequisites

1. REMINDER! You will need an [Azure Subscription](https://portal.azure.com/) to deploy Cromwell on Azure.

2. You must have the proper [Azure role assignments](https://docs.microsoft.com/en-us/azure/role-based-access-control/overview) to deploy Cromwell on Azure.  To check your current role assignments, please follow [these instructions](https://docs.microsoft.com/en-us/azure/role-based-access-control/check-access).  You must have one of the following combinations of [role assignments](https://docs.microsoft.com/en-us/azure/role-based-access-control/built-in-roles):
   1. `Owner` of the subscription<br/>
   2. `Contributor` and `User Access Administrator` of the subscription
   3. `Owner` of the resource group.
      *Note: this level of access will result in a warning during deployment, and will not use the latest VM pricing data. [Learn more](/docs/troubleshooting-guide.md/#How-are-Batch-VMs-selected-to-run-tasks-in-a-workflow?). Also, you must specify the resource group name during deployment with this level of access (see below).*
   4.  Note: if you only have `Service Administrator` as a role assignment, please assign yourself as `Owner` of the subscription.
3. Install the [Azure Command Line Interface (az cli)](https://docs.microsoft.com/en-us/cli/azure/?view=azure-cli-latest), a command line experience for managing Azure resources.
4. Run `az login` to authenticate with Azure, then start the deployment with the following command: 


In [None]:
!./deploy-cromwell-on-azure-linux --SubscriptionId <Subscription ID> --RegionName westus2 --MainIdentifierPrefix <Identifier of CoA> test

# 3. Upload the `sample wdl, input.json and trigger.json` files to the storage account of your Cromwell on Azure instance

In this notebook we will execute the `Germline alignment and variant calling` pipeline on CoA. You can find the full repo of the pipeline at: https://github.com/microsoft/gatk4-genome-processing-pipeline-azure.

## gatk4-genome-processing-pipeline
Workflows used for germline processing in whole genome sequence data.

### WholeGenomeGermlineSingleSample :
This WDL pipeline implements data pre-processing and initial variant calling (GVCF
generation) according to the GATK Best Practices (June 2016) for germline SNP and
Indel discovery in human whole-genome sequencing data.

#### Requirements/expectations
- Human whole-genome paired-end sequencing data in unmapped BAM (uBAM) format
- One or more read groups, one per uBAM file, all belonging to a single sample (SM)
- Input uBAM files must additionally comply with the following requirements:
  - filenames all have the same suffix (we use ".unmapped.bam")
  - files must pass validation by ValidateSamFile
  - reads are provided in query-sorted order
  - all reads must have an RG tag
- Reference genome must be Hg38 with ALT contigs
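Before submitting a run, the filename requirement above can be checked with a few lines of Python. This is a minimal sketch: the helper name `check_ubam_suffix` and the file names are hypothetical, and it only covers the suffix rule, not ValidateSamFile or read ordering.

```python
def check_ubam_suffix(filenames, suffix=".unmapped.bam"):
    """Return the file names that do NOT end with the expected suffix."""
    return [name for name in filenames if not name.endswith(suffix)]


# Hypothetical file names, for illustration only.
ubams = [
    "NA12878_A.unmapped.bam",
    "NA12878_B.unmapped.bam",
    "NA12878_C.bam",
]

# Any name listed here should be renamed before it is used as pipeline input.
print(check_ubam_suffix(ubams))
```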

#### Outputs 
- Cram, cram index, and cram md5 
- GVCF and its gvcf index 
- BQSR Report
- Several Summary Metrics 

Users can upload the sample wdl and json files with the AzCopy commands below:

* For more information on creating a resource URL with a SAS token: https://docs.microsoft.com/en-us/azure/storage/common/storage-sas-overview?toc=/azure/storage/blobs/toc.json
* The shared access signature (SAS) must have the "Write" permission. 

### Upload your wdl file

In [None]:
! ./azcopy copy './WholeGenomeGermlineSingleSample.wdl' '<input folder`s URL+SAS token>' --recursive --s2s-preserve-access-tier=false

In [None]:
!az login

### Upload your input.json file

In [None]:
!./azcopy copy './WholeGenomeGermlineSingleSample.inputs.json' '<input folder`s URL+SAS token>' --recursive --s2s-preserve-access-tier=false

### Upload trigger.json

Users need to upload their trigger file to `https://<YOUR CoA STORAGE ACCOUNT NAME>.blob.core.windows.net/workflows/new/` to initiate the pipeline.
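For readers who have not written a trigger file before, the sketch below builds one in Python. The field names (`WorkflowUrl`, `WorkflowInputsUrl`, `WorkflowOptionsUrl`, `WorkflowDependenciesUrl`) are assumed from the Cromwell on Azure documentation and should be checked against the release you deployed. The `inputs` container and the helper name `build_trigger` are hypothetical.

```python
import json


def build_trigger(workflow_url, inputs_url, options_url=None, dependencies_url=None):
    """Assemble the trigger document as a plain dict (field names are assumptions)."""
    return {
        "WorkflowUrl": workflow_url,
        "WorkflowInputsUrl": inputs_url,
        "WorkflowOptionsUrl": options_url,
        "WorkflowDependenciesUrl": dependencies_url,
    }


# Placeholder URLs: replace the account name and container with your own.
base = "https://<YOUR CoA STORAGE ACCOUNT NAME>.blob.core.windows.net/inputs"
trigger = build_trigger(
    base + "/WholeGenomeGermlineSingleSample.wdl",
    base + "/WholeGenomeGermlineSingleSample.inputs.json",
)

# Serialize to JSON text; save it as WholeGenomeGermlineSingleSample.trigger.json.
trigger_text = json.dumps(trigger, indent=2)
print(trigger_text)
```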

In [None]:
!./azcopy copy './WholeGenomeGermlineSingleSample.trigger.json' '<CoA`s new folder`s URL+SAS token>' --recursive --s2s-preserve-access-tier=false

# 4. Download the GVCF file from the storage account of CoA

Pipeline runs must be monitored manually. After a run finishes successfully, users can download the final GVCF file (`NA12878.g.vcf.gz`) from the storage account of their CoA instance.

In [None]:
!./azcopy copy '<result file`s URL+SAS token>' '<result file`s name>' --recursive --s2s-preserve-access-tier=false

**Now users can analyze the GVCF file with the sample code below:**

# 5. Annotate genotypes using VariantFiltration

**Important note: Please check that GATK runs on your system.**

To filter heterozygous genotypes, we use VariantFiltration's `--genotype-filter-expression isHet == 1` option. The `--genotype-filter-name` option specifies the label that the tool applies to the heterozygous genotypes. Here, this parameter's value is set to `isHetFilter`. In our first example, we use `NA12878.g.vcf.gz (chr1)` from the pipeline outputs. Users need to create an index file before running the GATK tools.

In [None]:
!./gatk IndexFeatureFile -I NA12878.g.vcf.gz

In [None]:
!./gatk VariantFiltration -V NA12878.g.vcf.gz -O outputannot.vcf --genotype-filter-expression "isHet == 1" --genotype-filter-name "isHetFilter"
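To confirm that the genotype filter was applied, one can count the genotypes whose FORMAT-level `FT` field carries the `isHetFilter` label. The sketch below does this in plain Python on a few invented VCF lines. The helper name `count_filtered_genotypes` and the records are hypothetical; for the real output, pass the lines of `outputannot.vcf` instead.

```python
def count_filtered_genotypes(vcf_lines, filter_name="isHetFilter"):
    """Count sample genotypes whose FT subfield contains the given filter name."""
    count = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns on this record
        keys = fields[8].split(":")
        if "FT" not in keys:
            continue
        ft_index = keys.index("FT")
        for sample in fields[9:]:
            values = sample.split(":")
            if ft_index < len(values) and filter_name in values[ft_index].split(";"):
                count += 1
    return count


# Invented records, for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA12878",
    "chr1\t1000\t.\tA\tG\t50\t.\t.\tGT:DP:FT\t0/1:30:isHetFilter",
    "chr1\t2000\t.\tC\tT\t60\t.\t.\tGT:DP:FT\t1/1:28:PASS",
    "chr1\t3000\t.\tG\tA\t45\t.\t.\tGT:DP:FT\t0/1:12:isHetFilter",
]
print(count_filtered_genotypes(example_vcf))
```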

# 6. Select Specific Variants

This tool makes it possible to select a subset of variants based on various criteria in order to facilitate certain analyses. Examples of such analyses include comparing and contrasting cases vs. controls, extracting variant or non-variant loci that meet certain requirements, or troubleshooting some unexpected results, to name a few.

There are many different options for selecting subsets of variants from a larger call set:

- Extract one or more samples from a callset based on either a complete sample name or a pattern match.
- Specify criteria for inclusion that place thresholds on annotation values, **e.g. "DP > 1000" (depth of coverage greater than 1000x), "AF < 0.25" (sites with allele frequency less than 0.25)**. These criteria are written as "JEXL expressions", which are documented in the article about using JEXL expressions.
- Provide concordance or discordance tracks in order to include or exclude variants that are also present in other given callsets.
- Select variants based on criteria like their type (e.g. INDELs only), evidence of mendelian violation, filtering status, allelicity, etc.
- There are also several options for recording the original values of certain annotations which are recalculated when one subsets the new callset, trims alleles, etc.
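To make the threshold criteria concrete, the sketch below applies the two example conditions (`DP > 1000` and `AF < 0.25`) to INFO strings in plain Python. It is not JEXL and not part of GATK. It only mimics what such expressions test, and it combines the two conditions with a logical AND purely for illustration. The helper names and INFO values are hypothetical.

```python
def parse_info(info):
    """Turn a VCF INFO string such as 'DP=1200;AF=0.10' into a dict of strings."""
    result = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, _, value = item.partition("=")
        result[key] = value
    return result


def passes_thresholds(info, min_dp=1000, max_af=0.25):
    """True when DP is above min_dp and the first AF value is below max_af."""
    fields = parse_info(info)
    if "DP" not in fields or "AF" not in fields:
        return False  # a missing annotation cannot satisfy the criteria
    depth = int(fields["DP"])
    allele_freq = float(fields["AF"].split(",")[0])
    return depth > min_dp and allele_freq < max_af


# Invented INFO strings, for illustration only.
for info in ["DP=1200;AF=0.10", "DP=900;AF=0.10", "DP=1500;AF=0.50"]:
    print(info, passes_thresholds(info))
```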

**Input**

A variant call set in VCF format from which a subset can be selected.

**Output**

A new VCF file containing the selected subset of variants.

In [None]:
!./gatk SelectVariants -R Homo_sapiens_assembly38.fasta -V outputannot.vcf --select-type-to-include SNP --select-type-to-include INDEL -O selective.vcf

# 7. Transform filtered genotypes to no call 

Running SelectVariants with `--set-filtered-gt-to-nocall` further transforms the flagged genotypes into null (no-call) genotypes. 

This conversion is necessary because downstream tools do not parse the FORMAT-level filter field.

The command below converts the filtered genotypes to **'No call'**:


In [None]:
!./gatk SelectVariants -V outputannot.vcf --set-filtered-gt-to-nocall -O outputnocall.vcf
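The effect on a single genotype can be illustrated in Python. This sketch assumes a diploid sample and rewrites only the `GT` subfield to `./.` when the `FT` subfield carries a filter. How GATK treats the remaining subfields may differ between versions. The helper name is hypothetical.

```python
def set_filtered_gt_to_nocall(format_keys, sample_value):
    """Return the sample column with GT replaced by './.' if FT marks it as filtered."""
    keys = format_keys.split(":")
    values = sample_value.split(":")
    if "GT" not in keys or "FT" not in keys:
        return sample_value
    gt_index, ft_index = keys.index("GT"), keys.index("FT")
    if gt_index >= len(values) or ft_index >= len(values):
        return sample_value  # trailing subfields were omitted
    if values[ft_index] in ("", ".", "PASS"):
        return sample_value  # genotype is not filtered
    values[gt_index] = "./."
    return ":".join(values)


# Invented genotype columns, for illustration only.
print(set_filtered_gt_to_nocall("GT:DP:FT", "0/1:30:isHetFilter"))
print(set_filtered_gt_to_nocall("GT:DP:FT", "1/1:28:PASS"))
```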

# 8. Check the Concordance of VCF file with Ground Truth

Evaluate site-level concordance of an input VCF against a truth VCF.
This tool evaluates two variant callsets against each other and produces a six-column summary metrics table. 

**This function will:**

1. Stratify SNP and INDEL calls
2. Report true-positive, false-positive and false-negative calls
3. Calculate sensitivity and precision
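The two summary statistics follow directly from the counts: sensitivity is TP / (TP + FN) and precision is TP / (TP + FP). The sketch below computes both in Python. The counts are invented for illustration and are not results from this pipeline.

```python
def sensitivity(tp, fn):
    """Fraction of truth variants that were called: TP / (TP + FN)."""
    total = tp + fn
    return tp / total if total else float("nan")


def precision(tp, fp):
    """Fraction of called variants that are in the truth set: TP / (TP + FP)."""
    total = tp + fp
    return tp / total if total else float("nan")


# Invented counts, for illustration only.
tp, fp, fn = 90, 10, 30
print("sensitivity:", sensitivity(tp, fn))
print("precision:", precision(tp, fp))
```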

The tool assumes all records in the --truth VCF are passing truth variants. For the -eval VCF, the tool uses only unfiltered passing calls.

Optionally, the tool can be set to produce VCFs of the following variant records, annotated with each variant's concordance status:

True positives and false negatives (i.e. all variants in the truth VCF): useful for calculating sensitivity

True positives and false positives (i.e. all variants in the eval VCF): useful for obtaining a training data set for machine learning classifiers of artifacts

**These output VCFs can be passed to VariantsToTable to produce a TSV file for statistical analysis in R or Python.**

In [None]:
!./gatk Concordance -R Homo_sapiens_assembly38.fasta -eval outputannot.vcf --truth outputnocall.vcf  --summary summary.tsv 

# 9. Merge GVCF files

**Inputs**

One or more input files in VCF format (can be gzipped, i.e. ending in ".vcf.gz", or binary compressed, i.e. ending in ".bcf"). Optionally, a sequence dictionary file (typically with a name ending in .dict) if the input VCF does not contain a complete contig list and if the output index is to be created (true by default). The input variant data must adhere to the following rules:

If there are samples, those must be the same across all input files. Input file headers must contain compatible declarations for common annotations (INFO, FORMAT fields) and filters. Variant records in the input files must be sorted by their contig and position following the sequence dictionary provided or the header contig list. You can either directly specify the list of files by specifying INPUT multiple times, or provide a list in a file with a name ending in ".list" to INPUT.
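The sorting rule is easy to get wrong because contigs follow the order of the sequence dictionary, not alphabetical order (`chr2` comes before `chr10`). The sketch below checks a list of (contig, position) pairs against a given contig order. The helper name and the records are hypothetical, and a contig missing from `contig_order` raises a `KeyError`.

```python
def is_sorted(records, contig_order):
    """True when (contig, position) pairs follow the dictionary order, then position."""
    rank = {contig: index for index, contig in enumerate(contig_order)}
    keys = [(rank[contig], position) for contig, position in records]
    return all(first <= second for first, second in zip(keys, keys[1:]))


# Contig order as it would appear in a sequence dictionary (shortened here).
order = ["chr1", "chr2", "chr10"]

# Invented records, for illustration only.
print(is_sorted([("chr1", 100), ("chr2", 5), ("chr10", 1)], order))
print(is_sorted([("chr10", 1), ("chr2", 5)], order))
```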

**Outputs**

A VCF sorted (i) according to the dictionary and (ii) by coordinate.

**Important Note:** Users need at least two GVCF files to use this function. Therefore, we recommend running two CoA jobs for two different uBAM files. You can download a second sample GVCF file from: https://storeshare.blob.core.windows.net/quickstartblobs/Second_sample.g.vcf.gz


In [None]:
!./gatk MergeVcfs -I NA12878.g.vcf.gz -I Second_sample.g.vcf.gz -O merge.vcf.gz

# 10. VariantsToTable

Extract fields from a VCF file to a tab-delimited table.

This tool extracts specified fields for each variant in a VCF file to a tab-delimited table, which may be easier to work with than a VCF. By default, the tool only extracts PASS or . (unfiltered) variants in the VCF file. Filtered variants may be included in the output by adding the `--show-filtered` flag. The tool can extract both INFO (i.e. site-level) fields and FORMAT (i.e. sample-level) fields.


**INFO/site-level fields**

Use the `-F` argument to extract INFO fields; each field will occupy a single column in the output file. The field can be any standard VCF column (e.g. CHROM, ID, QUAL) or any annotation name in the INFO field (e.g. AC, AF). The tool also supports the following additional fields:

- EVENTLENGTH (length of the event)
- TRANSITION (1 for a bi-allelic transition (SNP), 0 for bi-allelic transversion (SNP), -1 for INDELs and multi-allelics)
- HET (count of het genotypes)
- HOM-REF (count of homozygous reference genotypes)
- HOM-VAR (count of homozygous variant genotypes)
- NO-CALL (count of no-call genotypes)
- TYPE (type of variant, possible values are NO_VARIATION, SNP, MNP, INDEL, SYMBOLIC, and MIXED)
- VAR (count of non-reference genotypes)
- NSAMPLES (number of samples)
- NCALLED (number of called samples)
- MULTI-ALLELIC (is this variant multi-allelic? true/false)
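As an illustration of how the TRANSITION value is defined, the sketch below reproduces the 1 / 0 / -1 convention in Python: a transition swaps two purines (A, G) or two pyrimidines (C, T), and a transversion swaps a purine with a pyrimidine. This is a re-implementation of the definition for teaching purposes, not GATK's code, and the helper name is hypothetical.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def transition_flag(ref, alts):
    """Return 1 for a bi-allelic transition, 0 for a transversion, -1 otherwise."""
    if len(alts) != 1:
        return -1  # multi-allelic site
    ref, alt = ref.upper(), alts[0].upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        return -1  # INDEL, symbolic allele, or not a substitution
    same_class = (ref in PURINES) == (alt in PURINES)
    return 1 if same_class else 0


print(transition_flag("A", ["G"]))       # purine to purine
print(transition_flag("A", ["C"]))       # purine to pyrimidine
print(transition_flag("AT", ["A"]))      # deletion
print(transition_flag("A", ["G", "T"]))  # multi-allelic
```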


**FORMAT/sample-level fields**

Use the `-GF` argument to extract FORMAT/sample-level fields. The tool will create a new column per sample with the name "SAMPLE_NAME.FORMAT_FIELD_NAME" e.g. NA12877.GQ, NA12878.GQ.



**Input**

A VCF file to convert to a table

**Output**

A tab-delimited file containing the values of the requested fields in the VCF file.


In [None]:
!./gatk VariantsToTable -V NA12878.g.vcf.gz -F CHROM -F POS -F TYPE -F AC -F AD -F AF -GF DP -GF AD -O outputtable.table
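Once the table exists, it can be loaded with Python's standard `csv` module for further analysis. The sketch below reads a small in-memory table so that it runs on its own. The rows are invented, and the sample-level column name `NA12878.DP` follows the "SAMPLE_NAME.FORMAT_FIELD_NAME" convention described above. To read the real output, replace the `StringIO` object with `open("outputtable.table", newline="")`.

```python
import csv
import io


def read_variant_table(handle):
    """Read a tab-delimited table into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Invented rows, for illustration only.
sample_table = io.StringIO(
    "CHROM\tPOS\tTYPE\tAC\tAF\tNA12878.DP\n"
    "chr1\t1000\tSNP\t1\t0.5\t30\n"
    "chr1\t2000\tINDEL\t2\t1.0\t28\n"
)

rows = read_variant_table(sample_table)
snps = [row for row in rows if row["TYPE"] == "SNP"]
print(len(rows), "rows,", len(snps), "SNP")
```

All values are read as strings, so convert columns such as `POS` or `NA12878.DP` with `int()` before doing arithmetic on them.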

# 11. Export variant table to Blob Storage for further PowerBI visualization

The last step of this notebook is to move the variant table to Blob Storage, so that `Microsoft's data visualization solution: PowerBI` can visualize the relevant results. Users can import their data into PowerBI from Blob Storage and share their results or add interactive queries on the variant table. 

Because of the nature of notebooks, we cannot open a PowerBI dashboard from here, but you can download the PowerBI template of the variant table of NA12877 from this [link](https://datasettoaexample.blob.core.windows.net/publicsample/sample_variant_table_dashboard.pbix).

In [None]:
from IPython.display import Image, display
display(Image('https://datasettoaexample.blob.core.windows.net/publicsample/dashboard_ss.JPG', width=1000, unconfined=True))

# References

1. Cromwell on Azure: https://github.com/microsoft/CromwellOnAzure/releases
2. VariantFiltration: https://gatk.broadinstitute.org/hc/en-us/articles/360036827111-VariantFiltration 
3. Select Variants: https://gatk.broadinstitute.org/hc/en-us/articles/360037052272-SelectVariants
4. Concordance: https://gatk.broadinstitute.org/hc/en-us/articles/360041851651-Concordance
5. Variants to table: https://gatk.broadinstitute.org/hc/en-us/articles/360036882811-VariantsToTable 
6. Illumina Platinum Genomes: https://www.illumina.com/platinumgenomes.html 
7. Picard: https://broadinstitute.github.io/picard/ 
8. Az Copy: https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10
9. Power BI: https://powerbi.microsoft.com/en-us/ 

    For questions: ercosgun@microsoft.com


**END OF NOTEBOOK**