<img src="https://i2.wp.com/transmartfoundation.org/wp-content/uploads/2014/04/I2B2-TRANSMART-web-banner-1-600x200_c.jpg" width= "450px">


<img src="https://hms.harvard.edu/themes/harvardmedical/logo.svg" width= "250px"> 


---

# <img src="https://hail.is/docs/devel/hail-logo-cropped.png" width= "50px"> **Workshop**

This notebook gives a broad overview of Hail's functionality, with an emphasis on manipulating and querying a genetic dataset.

# **Module 1**

## Introduction to `Hail`

In [None]:
import hail as hl
import hail.expr.aggregators as agg
hl.init()

In [None]:
from pprint import pprint
from bokeh.io import output_notebook, show
from bokeh.layouts import gridplot
from bokeh.models import Span
from bokeh.plotting import figure, output_file
import pandas as pd
import os, sys, time
output_notebook()

To learn more about Bokeh, see https://bokeh.pydata.org/en/latest/

In [None]:
local_path=os.getcwd()
sys.path.append(local_path)
import plotting

---

In [None]:
# Download a subset of the 1000 Genomes Project data into data/
hl.utils.get_1kg('data/')

In [None]:
# Read data into a matrix table 
mt = hl.read_matrix_table('data/1kg.mt/')

Hail has its own internal data representation, called a [MatrixTable](https://hail.is/docs/0.2/tutorials/09-matrixtable.html).


In [None]:
type(mt)

The `MatrixTable.describe()` method prints all fields in the table and their types, as well as the keys.

In [None]:
# the describe function displays the variables in the matrix table and the corresponding data type
mt.describe()
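
As a rough mental model (not Hail's actual implementation), a MatrixTable can be pictured as a row table of variants, a column table of samples, and one entry per (variant, sample) pair. The sketch below uses plain Python only; the class `ToyMatrixTable`, the sample names and all field values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMatrixTable:
    """Toy stand-in for a MatrixTable: plain dicts, no Hail involved."""
    rows: dict = field(default_factory=dict)     # row key -> row fields (one per variant)
    cols: dict = field(default_factory=dict)     # column key -> column fields (one per sample)
    entries: dict = field(default_factory=dict)  # (row key, column key) -> entry fields

    def count(self):
        """Return (number of rows, number of columns), like a matrix shape."""
        return len(self.rows), len(self.cols)


toy = ToyMatrixTable()

# Row key: (locus, alleles). All values here are invented for illustration.
toy.rows[("1:904165", ("G", "A"))] = {"qual": 52346.4}
toy.rows[("1:909917", ("G", "A"))] = {"qual": 1576.9}

for sample in ("sample_A", "sample_B", "sample_C"):
    toy.cols[sample] = {}
    for row_key in toy.rows:
        # One entry per (variant, sample) pair
        toy.entries[(row_key, sample)] = {"GT": "0/0", "GQ": 30}

print(toy.count())       # (2, 3)
print(len(toy.entries))  # 6
```

The keys reported by `describe()` play the role of the dictionary keys here: rows are looked up by locus and alleles, columns by sample ID.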

In [None]:
list(mt.row)

In [None]:
print('Samples: %d  Variants: %d' % (mt.count_cols(), mt.count_rows()))

To get the number of variants per chromosome and the breakdown of allele types, we can use `summarize_variants()`.

In [None]:
hl.summarize_variants(mt)

In [None]:
mt.qual.show()

The [rows](https://hail.is/docs/devel/hail.MatrixTable.html#hail.MatrixTable.rows) method can be used to get a table with all the row fields in our MatrixTable.  
Individual fields can be displayed with the `show` method: the cell above shows the row field `qual`, and the cell below shows the entry field `AD`.

In [None]:
mt.AD.show()

To look at the first few genotype calls, we can use [entries](https://hail.is/docs/devel/hail.MatrixTable.html#hail.MatrixTable.entries) along with `select` and `take`. The `take` method collects the first n rows into a list. Alternatively, the `show` method prints the first n rows to the console in a table format.

The cell below uses `show`. Try changing it to `take`.

In [None]:
mt.entry.show(5)

In [None]:
hl.summarize_variants(mt)

In [None]:
mt.aggregate_rows(hl.agg.count_where(mt.alleles==['A','T']))

In [None]:
snp_counts = mt.aggregate_rows(
    hl.array(hl.agg.counter(mt.alleles)))
snp_counts

In [None]:
type(snp_counts)

In [None]:
sorted(snp_counts, key=lambda x: x[1])

In [None]:
mt.aggregate_entries(hl.agg.stats(mt.GQ))

In [None]:
mt.aggregate_entries(
    hl.agg.filter(mt.GT.is_hom_ref(),hl.agg.stats(mt.GQ)))

In [None]:
hl.agg.stats?


In [None]:
mt.aggregate_entries(
    hl.agg.filter(~mt.GT.is_hom_ref(),hl.agg.stats(mt.GQ)))

In [None]:
mt.aggregate_entries(
    hl.agg.filter(mt.GT.is_het(),hl.agg.stats(mt.GQ)))

In [None]:
p=hl.plot.histogram(mt.GQ, bins=100)

In [None]:
show(p)

In [None]:
p=hl.plot.histogram(mt.filter_entries(mt.GT.is_hom_ref()).GQ, bins=100)

In [None]:
show(p)

In [None]:
p=hl.plot.histogram(
    mt.filter_entries(mt.GT.is_het_ref()).GQ, 
    bins=100)
show(p)

In [None]:
p=hl.plot.histogram(
    mt.filter_entries((mt.DP == 10 ) & mt.GT.is_het_ref()).GQ, 
    bins=100)
show(p)

---

# **Module 2**

## GWAS in 5 steps

---

# **Module 3**

## Annotation, PCA and variant discovery

Annotations are important in any genetic study. Column fields are where you store information about samples, such as phenotypes, ancestry, sex, and covariates. Let's annotate the columns in our MatrixTable.

In [None]:
table = hl.import_table('data/1kg_annotations.txt',
                       impute=True,
                       key='Sample')

In [None]:
table.describe()

To peek at the first few values, use the `show` method:

In [None]:
# Show the first 10 rows of the table
table.show(10)

Notice that `show` only works this way on Tables. On a MatrixTable we must specify what to show: row fields, column fields or entry fields:

`table.show()` --> Table

`mt.row.alleles.show()` --> MatrixTable

In [None]:
%%sh
# Not common or recommended, but local data can be previewed with a shell command
head data/1kg_annotations.txt

In [None]:
mt.describe()

We use the `annotate_cols` method to join the table with the MatrixTable containing our dataset.

In [None]:
mt = mt.annotate_cols(pheno=table[mt.s])
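
Conceptually, `table[mt.s]` is a keyed lookup: each sample ID in the MatrixTable is matched against the table's key, and samples without a match receive a missing value. A minimal plain-Python sketch of that idea, with invented sample IDs and values (only the field names `SuperPopulation` and `PurpleHair` come from this notebook):

```python
# Hypothetical annotation table keyed by sample ID (values invented)
annotations = {
    "sample_A": {"SuperPopulation": "EUR", "PurpleHair": False},
    "sample_B": {"SuperPopulation": "AFR", "PurpleHair": True},
}

# Column keys of a toy matrix table; sample_C has no annotation row
sample_ids = ["sample_A", "sample_B", "sample_C"]

# Left join: every sample is kept, unmatched samples get a missing value (None)
pheno = {s: annotations.get(s) for s in sample_ids}

# Number of samples whose annotation is missing
n_missing = sum(1 for value in pheno.values() if value is None)

print(pheno["sample_A"]["SuperPopulation"])  # EUR
print(n_missing)                             # 1
```

This is why it is worth counting samples with a missing `pheno` after the join: a non-zero count means some sample IDs did not appear in the annotation file.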

In [None]:
mt.describe()

In [None]:
mt.col.pheno.show()

The `aggregate_cols` method aggregates over the columns (samples) of the MatrixTable.
`counter` is an aggregation function that counts the number of occurrences of each unique element.

In [None]:
pprint(mt.aggregate_cols(hl.agg.counter(mt.pheno.SuperPopulation)))
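
For intuition, the result of `counter` behaves much like Python's `collections.Counter` applied to one value per sample. The labels below are made up for illustration and are not taken from the dataset:

```python
from collections import Counter

# One hypothetical super-population label per sample
super_populations = ["AFR", "EUR", "EUR", "EAS", "AFR", "EUR"]

# Count the occurrences of each unique label
counts = Counter(super_populations)

print(counts["EUR"])        # 3
print(sum(counts.values())) # 6, the number of samples
```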

In [None]:
mt.aggregate_cols(hl.agg.count_where(hl.is_missing(mt.pheno)))

In [None]:
mt = hl.sample_qc(mt)

In [None]:
mt.describe()

`stats` is an aggregation function that produces some useful statistics about numeric collections. 
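
A plain-Python sketch of the kind of summary `stats` returns (mean, standard deviation, minimum, maximum, count and sum). The helper `toy_stats` is hypothetical, and two choices are assumptions of this sketch rather than a statement about Hail's internals: missing values are skipped, and the standard deviation divides by n.

```python
import math


def toy_stats(values):
    """Summary statistics over the defined (non-None) values.

    Expects at least one defined value.
    """
    defined = [v for v in values if v is not None]
    n = len(defined)
    total = sum(defined)
    mean = total / n
    # Population standard deviation (divides by n): an assumption of this sketch
    stdev = math.sqrt(sum((v - mean) ** 2 for v in defined) / n)
    return {
        "mean": mean,
        "stdev": stdev,
        "min": min(defined),
        "max": max(defined),
        "n": n,
        "sum": total,
    }


# Invented read depths, with one missing value
depths = [4, 8, None, 6, 2]
result = toy_stats(depths)
print(result["mean"], result["n"])  # 5.0 4
```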

In [None]:
mt = mt.filter_cols(mt.sample_qc.dp_stats.mean >= 4)
mt = mt.filter_cols(mt.sample_qc.call_rate > 0.97)

In [None]:
mt.count_cols()

In [None]:
mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT)))

In [None]:
ab = mt.AD[1] / hl.sum(mt.AD)

filter_condition_ab = ((mt.GT.is_hom_ref() & (ab <= 0.1)) | 
                       (mt.GT.is_het() & (ab >= 0.25) & (ab<=0.75)) | 
                        (mt.GT.is_hom_var() & (ab >= 0.9)))
                       
mt = mt.filter_entries(filter_condition_ab)
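
The allele balance filter above keeps a genotype call only when the fraction of reads supporting the alternate allele agrees with the call. The same rule can be sketched for a single call in plain Python; the function `passes_allele_balance`, the string labels for genotypes and the read depths are hypothetical, and dropping calls with no reads is an assumption of this sketch.

```python
def passes_allele_balance(gt, ad):
    """gt: 'hom_ref', 'het' or 'hom_var'; ad: (ref_depth, alt_depth)."""
    total = sum(ad)
    if total == 0:
        # Assumption of this sketch: no reads means no evidence, so drop the call
        return False
    ab = ad[1] / total  # fraction of reads carrying the alternate allele
    if gt == "hom_ref":
        return ab <= 0.1
    if gt == "het":
        return 0.25 <= ab <= 0.75
    if gt == "hom_var":
        return ab >= 0.9
    return False


# Invented calls: (genotype, (ref reads, alt reads))
calls = [
    ("het", (10, 9)),      # balanced het, kept
    ("het", (18, 2)),      # only 10% alt reads, dropped
    ("hom_ref", (20, 1)),  # about 5% alt reads, kept
    ("hom_var", (1, 19)),  # 95% alt reads, kept
    ("hom_ref", (10, 5)),  # a third of reads are alt, dropped
]
kept = [passes_allele_balance(gt, ad) for gt, ad in calls]
print(kept)  # [True, False, True, True, False]
```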

In [None]:
mt = hl.variant_qc(mt)

In [None]:
mt = mt.filter_rows(hl.min(mt.variant_qc.AF)>0.01)

In [None]:
mt = mt.filter_rows(mt.variant_qc.p_value_hwe > 1e-6)

In [None]:
mt.count()

### PCA

The `pca` method produces eigenvalues as a list and sample PCs as a Table, and can also produce variant loadings when asked. The `hwe_normalized_pca` method does the same, using HWE-normalized genotypes for the PCA.

In [None]:
pca_eigenvalues, pcs_scores, pca_loadings = hl.hwe_normalized_pca(mt.GT)
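
To give a feel for what "HWE-normalized" means, the sketch below centers each variant's alternate allele counts by twice the allele frequency and scales them by the standard deviation expected under Hardy-Weinberg equilibrium. The scaling follows the normalization described in the Hail documentation for `hwe_normalized_pca`, but treat the details (the extra factor for the number of variants, dropping monomorphic variants, setting missing calls to zero) as assumptions of this sketch. The function `hwe_normalize` and the genotypes are hypothetical, and the PCA step itself is not shown.

```python
import math


def hwe_normalize(genotypes):
    """genotypes: list of variants, each a list of alt allele counts
    (0, 1, 2 or None for missing), one per sample.

    Returns the normalized matrix; variants with zero variance are skipped.
    """
    kept = []
    for calls in genotypes:
        defined = [g for g in calls if g is not None]
        if not defined:
            continue
        p = sum(defined) / (2 * len(defined))  # alternate allele frequency
        if p == 0.0 or p == 1.0:
            continue  # monomorphic variant: expected variance is zero
        kept.append((calls, p))

    m = len(kept)  # number of variants that remain
    normalized = []
    for calls, p in kept:
        scale = math.sqrt(2 * p * (1 - p) * m)
        # Missing calls become 0, which equals mean imputation after centering
        normalized.append(
            [0.0 if g is None else (g - 2 * p) / scale for g in calls]
        )
    return normalized


toy_genotypes = [
    [0, 1, 2, 1],     # allele frequency 0.5
    [0, 0, 1, None],  # allele frequency 1/6, one missing call
    [0, 0, 0, 0],     # monomorphic, dropped
]
matrix = hwe_normalize(toy_genotypes)
print(len(matrix))  # 2
```

After normalization each remaining variant has mean zero across samples, so common and rare variants contribute on a comparable scale to the principal components.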

In [None]:
pca_eigenvalues

### Annotate the columns of matrix table `mt` with the PCA scores

A common problem in genetic studies, [population stratification](https://en.wikipedia.org/wiki/Population_stratification), can be tackled by including ancestry as a covariate in our regression. We will account for genetic ancestry by including the computed principal components in our model.


In [None]:
mt = mt.annotate_cols(pca=pcs_scores[mt.s])

In [None]:
mt.pca.scores.dtype

### Plot the first two PCs

In [None]:
pca  = plotting.scatter_plot(mt.pca.scores[0],
                  mt.pca.scores[1],
                  label_fields={
                      'Population': mt.pheno.SuperPopulation},
                  title='PCA, first two principal components', 
                  xlabel='PC1', ylabel='PC2')

show(pca)

### Variant discovery

In [None]:
# Extract entries table
entries = mt.entries()

Group by super population and chromosome, then count heterozygous genotype calls.

In [None]:
results = (entries.group_by(pop = entries.pheno.SuperPopulation, chromosome = entries.locus.contig)
      .aggregate(n_het = hl.agg.count_where(entries.GT.is_het())))

In [None]:
results.show(40)
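
The `group_by(...).aggregate(...)` pattern used above can be pictured as a dictionary keyed by the grouping fields, where `count_where` adds one only when its condition holds. A minimal plain-Python sketch with invented entries:

```python
from collections import defaultdict

# Hypothetical flattened entries: (super population, chromosome, is_het)
toy_entries = [
    ("EUR", "1", True),
    ("EUR", "1", False),
    ("EUR", "2", True),
    ("AFR", "1", True),
    ("AFR", "1", True),
]

n_het = defaultdict(int)
for pop, chromosome, is_het in toy_entries:
    # Like count_where: the group total grows only when the predicate is true
    n_het[(pop, chromosome)] += int(is_het)

print(dict(n_het))
# {('EUR', '1'): 1, ('EUR', '2'): 1, ('AFR', '1'): 2}
```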

### Rare variants

In [None]:
# Bin variants as rare, low frequency or common using info.AF[0] (the alternate allele frequency, used here as a proxy for MAF)
entries = entries.annotate(maf = hl.cond(entries.info.AF[0]<0.01, "<1%",
                             hl.cond(entries.info.AF[0]<0.05, "1%-5%", ">5%")))
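
The nested conditional above is a three-way binning rule. The sketch below writes the same thresholds as a plain-Python function, and also shows one way to obtain a true minor allele frequency from an alternate allele frequency. Both helper names are hypothetical, and the frequencies are invented.

```python
def af_bin(af):
    """Label an allele frequency with the same thresholds as the cell above."""
    if af < 0.01:
        return "<1%"
    if af < 0.05:
        return "1%-5%"
    return ">5%"


def minor_allele_frequency(alt_af):
    """The minor allele is whichever of ref/alt is less frequent (biallelic case)."""
    return min(alt_af, 1 - alt_af)


print(af_bin(0.004))  # <1%
print(af_bin(0.01))   # 1%-5%
print(af_bin(0.3))    # >5%

# When the alternate allele is the common one, the two approaches differ
print(af_bin(0.97))                          # >5%
print(af_bin(minor_allele_frequency(0.97)))  # 1%-5%
```

Binning on the alternate allele frequency directly is fine when the alternate allele is the rarer one, which is the usual case, but the two differ for variants where the reference allele is the minor allele.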

In [None]:
# Group by minor allele frequency and hair color
results2 = (entries.group_by(af_bin = entries.maf, purple_hair = entries.pheno.PurpleHair)
      .aggregate(mean_gq = hl.agg.stats(entries.GQ).mean,
                 mean_dp = hl.agg.stats(entries.DP).mean))

In [None]:
results2.show()

In [None]:
# Filter rare variants only
rare_vars = entries.filter(entries.maf=="<1%")

In [None]:
rare_vars.count()

In [None]:
# Why does this statement work...
rare_vars.aggregate((hl.agg.stats(rare_vars.DP)))

In [None]:
# ...while this one does not?
rare_vars.aggregate((hl.agg.stats(rare_vars.s)))
# (answer in the next cell)

In [None]:
rare_count_per_sample = rare_vars.aggregate((hl.agg.counter(rare_vars.s)))

In [None]:
print(type(rare_count_per_sample))
print(str(len(rare_count_per_sample)) + " samples")

In [None]:
rare_count_per_sample