<img src="https://dbmi.hms.harvard.edu/sites/g/files/mcu781/files/hero-images/PM19.png" width= "450px">


<img src="https://hms.harvard.edu/themes/harvardmedical/logo.svg" width= "250px"> 


---

# <img src="https://hail.is/docs/devel/hail-logo-cropped.png" width= "50px"> **Workshop**

This notebook gives a broad overview of Hail's functionality, with an emphasis on manipulating and querying a genetic dataset.

# **Module 1**

## Introduction to `Hail`

In [1]:
import hail as hl
import hail.expr.aggregators as agg
hl.init()

using hail jar at /Users/ines_admin/anaconda3/envs/hail/lib/python3.7/site-packages/hail/hail-all-spark.jar
Running on Apache Spark version 2.4.1
SparkUI available at http://dyn205162.shef.ac.uk:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.14-8dcb6722c72a
LOGGING: writing to /Users/ines_admin/Documents/i2b2Conference/hail-workshop-2019/notebooks/hail-20190612-1436-0.2.14-8dcb6722c72a.log


In [2]:
from pprint import pprint
from bokeh.io import output_notebook, show
from bokeh.layouts import gridplot
from bokeh.models import Span
from bokeh.plotting import figure, output_file  # `show` is already imported from bokeh.io
import pandas as pd
import os, sys, time
output_notebook()

To learn more about Bokeh, see https://bokeh.pydata.org/en/latest/

In [3]:
local_path=os.getcwd()
sys.path.append(local_path)
import plotting

---

In [4]:
# Download a subset of the 1000 Genomes Project data into data/
hl.utils.get_1kg('data/')

2019-06-12 14:36:54 Hail: INFO: downloading 1KG VCF ...
  Source: https://storage.googleapis.com/hail-tutorial/1kg.vcf.bgz
2019-06-12 14:36:56 Hail: INFO: importing VCF and writing to matrix table...
2019-06-12 14:36:58 Hail: INFO: Coerced sorted dataset
2019-06-12 14:37:01 Hail: INFO: wrote matrix table with 10961 rows and 284 columns in 16 partitions to data/1kg.mt
2019-06-12 14:37:01 Hail: INFO: downloading 1KG annotations ...
  Source: https://storage.googleapis.com/hail-tutorial/1kg_annotations.txt
2019-06-12 14:37:01 Hail: INFO: downloading Ensembl gene annotations ...
  Source: https://storage.googleapis.com/hail-tutorial/ensembl_gene_annotations.txt
2019-06-12 14:37:02 Hail: INFO: Done!


In [5]:
# Read data into a matrix table 
mt = hl.read_matrix_table('data/1kg.mt/')

Hail has its own internal data representation, called a [MatrixTable](https://hail.is/docs/0.2/tutorials/09-matrixtable.html).


In [6]:
type(mt)

hail.matrixtable.MatrixTable
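
As a rough mental model, a MatrixTable can be pictured as three linked collections: row fields (one record per variant), column fields (one record per sample), and entry fields (one record per variant and sample pair). The sketch below is plain Python, not Hail's actual storage format; the dictionary layout is an assumption made purely for illustration, and the values are copied from the tables shown later in this notebook.

```python
# Toy picture of a MatrixTable. This is NOT how Hail stores data internally.

# Row fields: one record per variant, keyed by (locus, alleles).
rows = {
    ("1:904165", ("G", "A")): {"qual": 5.23e4},
    ("1:909917", ("G", "A")): {"qual": 1.58e3},
}

# Column fields: one record per sample, keyed by the sample ID `s`.
cols = {
    "HG00096": {},
    "HG00099": {},
}

# Entry fields: one record per (variant, sample) pair.
entries = {
    (("1:904165", ("G", "A")), "HG00096"): {"GT": "0/0", "AD": [4, 0], "DP": 4, "GQ": 12},
    (("1:904165", ("G", "A")), "HG00099"): {"GT": "0/0", "AD": [8, 0], "DP": 8, "GQ": 24},
}

n_rows, n_cols = len(rows), len(cols)
print(f"Variants: {n_rows}  Samples: {n_cols}  Possible entries: {n_rows * n_cols}")
```

Each entry is addressed by a row key plus a column key, which is why entry-level output in Hail always shows `locus`, `alleles`, and `s` next to the entry fields.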

The `MatrixTable.describe()` method prints all fields in the table and their types, as well as the keys.

In [53]:
# describe() lists the fields in the matrix table and their data types
mt.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
    'pheno': struct {
        Population: str, 
        SuperPopulation: str, 
        isFemale: bool, 
        PurpleHair: bool, 
        CaffeineConsumption: int32
    }
    'sample_qc': struct {
        dp_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        gq_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        call_rate: float64, 
        n_called: int64, 
        n_not_called: int64, 
        n_filtered: int64, 
        n_hom_ref: int64, 
        n_het: int64, 
        n_hom_var: int64, 
        n_non_ref: int64, 
        n_singleton: int64, 
        n_snp: int64, 
        n_insertion: int64, 
        n_deletion: int64, 
        n_transition: int64, 
        n_transv

In [62]:
list(mt.row)

['locus', 'alleles', 'rsid', 'qual', 'filters', 'info', 'variant_qc']

In [63]:
print('Samples: %d  Variants: %d' % (mt.count_cols(), mt.count_rows()))

Samples: 250  Variants: 7837


We have 250 samples and 7,837 variants. To see how many variants fall on each chromosome and which allele types they carry, use `hl.summarize_variants()`.

In [64]:
hl.summarize_variants(mt)

Number of variants: 7837
Alleles per variant
-------------------
  2 alleles: 7837 variants
Variants per contig
-------------------
   1: 659 variants
   2: 611 variants
   3: 556 variants
   4: 486 variants
   5: 478 variants
   6: 517 variants
   7: 427 variants
   8: 384 variants
   9: 313 variants
  10: 413 variants
  11: 392 variants
  12: 420 variants
  13: 268 variants
  14: 251 variants
  15: 240 variants
  16: 268 variants
  17: 222 variants
  18: 251 variants
  19: 221 variants
  20: 226 variants
  21: 111 variants
  22: 110 variants
   X: 13 variants
Allele type distribution
------------------------
  SNP: 7837 alternate alleles


In [7]:
mt.qual.show()

+---------------+------------+----------+
| locus         | alleles    |     qual |
+---------------+------------+----------+
| locus<GRCh37> | array<str> |  float64 |
+---------------+------------+----------+
| 1:904165      | ["G","A"]  | 5.23e+04 |
| 1:909917      | ["G","A"]  | 1.58e+03 |
| 1:986963      | ["C","T"]  | 3.98e+02 |
| 1:1563691     | ["T","G"]  | 1.09e+03 |
| 1:1707740     | ["T","G"]  | 9.35e+04 |
| 1:2252970     | ["C","T"]  | 7.36e+02 |
| 1:2284195     | ["T","C"]  | 1.42e+05 |
| 1:2779043     | ["T","C"]  | 2.89e+05 |
| 1:2944527     | ["G","A"]  | 1.24e+05 |
| 1:3761547     | ["C","A"]  | 1.61e+03 |
+---------------+------------+----------+
showing top 10 rows



The [rows](https://hail.is/docs/devel/hail.MatrixTable.html#hail.MatrixTable.rows) method can be used to get a table with all the row fields in our MatrixTable.  
You can use the `show` method to display the variants.

In [8]:
mt.AD.show()

+---------------+------------+-----------+--------------+
| locus         | alleles    | s         | AD           |
+---------------+------------+-----------+--------------+
| locus<GRCh37> | array<str> | str       | array<int32> |
+---------------+------------+-----------+--------------+
| 1:904165      | ["G","A"]  | "HG00096" | [4,0]        |
| 1:904165      | ["G","A"]  | "HG00099" | [8,0]        |
| 1:904165      | ["G","A"]  | "HG00105" | [8,0]        |
| 1:904165      | ["G","A"]  | "HG00118" | [7,0]        |
| 1:904165      | ["G","A"]  | "HG00129" | [5,0]        |
| 1:904165      | ["G","A"]  | "HG00148" | [4,0]        |
| 1:904165      | ["G","A"]  | "HG00177" | [2,0]        |
| 1:904165      | ["G","A"]  | "HG00182" | [5,0]        |
| 1:904165      | ["G","A"]  | "HG00242" | [5,0]        |
| 1:904165      | ["G","A"]  | "HG00254" | [13,0]       |
+---------------+------------+-----------+--------------+
showing top 10 rows



To look at the first few genotype calls, we can use [entries](https://hail.is/docs/devel/hail.MatrixTable.html#hail.MatrixTable.entries) along with `select` and `take`. The `take` method collects the first n rows into a list. Alternatively, the `show` method prints the first n rows to the console in a table format.

The cell below uses `show`; try changing it to `take`.

In [12]:
mt.entry.show(5)

+---------------+------------+-----------+------+--------------+-------+-------+
| locus         | alleles    | s         | GT   | AD           |    DP |    GQ |
+---------------+------------+-----------+------+--------------+-------+-------+
| locus<GRCh37> | array<str> | str       | call | array<int32> | int32 | int32 |
+---------------+------------+-----------+------+--------------+-------+-------+
| 1:904165      | ["G","A"]  | "HG00096" | 0/0  | [4,0]        |     4 |    12 |
| 1:904165      | ["G","A"]  | "HG00099" | 0/0  | [8,0]        |     8 |    24 |
| 1:904165      | ["G","A"]  | "HG00105" | 0/0  | [8,0]        |     8 |    23 |
| 1:904165      | ["G","A"]  | "HG00118" | 0/0  | [7,0]        |     7 |    21 |
| 1:904165      | ["G","A"]  | "HG00129" | 0/0  | [5,0]        |     5 |    15 |
+---------------+------------+-----------+------+--------------+-------+-------+

+--------------+
| PL           |
+--------------+
| array<int32> |
+--------------+
| [0,12,147]   |
| [0,2

In [13]:
hl.summarize_variants(mt)

Number of variants: 10961
Alleles per variant
-------------------
  2 alleles: 10961 variants
Variants per contig
-------------------
   1: 919 variants
   2: 848 variants
   3: 745 variants
   4: 641 variants
   5: 650 variants
   6: 676 variants
   7: 546 variants
   8: 512 variants
   9: 420 variants
  10: 535 variants
  11: 579 variants
  12: 569 variants
  13: 338 variants
  14: 341 variants
  15: 345 variants
  16: 384 variants
  17: 354 variants
  18: 301 variants
  19: 346 variants
  20: 314 variants
  21: 153 variants
  22: 172 variants
   X: 273 variants
Allele type distribution
------------------------
  SNP: 10961 alternate alleles


In [14]:
mt.aggregate_rows(hl.agg.count_where(mt.alleles==['A','T']))

76

In [15]:
snp_counts = mt.aggregate_rows(
    hl.array(hl.agg.counter(mt.alleles)))
snp_counts

[(['A', 'C'], 454),
 (['A', 'G'], 1944),
 (['A', 'T'], 76),
 (['C', 'A'], 496),
 (['C', 'G'], 150),
 (['C', 'T'], 2436),
 (['G', 'A'], 2387),
 (['G', 'C'], 112),
 (['G', 'T'], 480),
 (['T', 'A'], 79),
 (['T', 'C'], 1879),
 (['T', 'G'], 468)]

In [16]:
type(snp_counts)

list

In [17]:
sorted(snp_counts, key=lambda x: x[1])

[(['A', 'T'], 76),
 (['T', 'A'], 79),
 (['G', 'C'], 112),
 (['C', 'G'], 150),
 (['A', 'C'], 454),
 (['T', 'G'], 468),
 (['G', 'T'], 480),
 (['C', 'A'], 496),
 (['T', 'C'], 1879),
 (['A', 'G'], 1944),
 (['G', 'A'], 2387),
 (['C', 'T'], 2436)]
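
The per-allele counts above can be grouped into transitions (A<->G, C<->T) and transversions (all other substitutions). The following plain-Python sketch does this with the numbers copied from the `snp_counts` output; the variable names are illustrative and not part of Hail.

```python
# Counts copied from the `snp_counts` output above.
snp_counts = [
    (("A", "C"), 454), (("A", "G"), 1944), (("A", "T"), 76),
    (("C", "A"), 496), (("C", "G"), 150), (("C", "T"), 2436),
    (("G", "A"), 2387), (("G", "C"), 112), (("G", "T"), 480),
    (("T", "A"), 79), (("T", "C"), 1879), (("T", "G"), 468),
]

# Transitions swap purine with purine (A, G) or pyrimidine with pyrimidine (C, T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

n_transitions = sum(n for alleles, n in snp_counts if alleles in TRANSITIONS)
n_transversions = sum(n for alleles, n in snp_counts if alleles not in TRANSITIONS)
ti_tv = n_transitions / n_transversions

print(n_transitions, n_transversions, round(ti_tv, 2))  # 8646 2315 3.73
```

The two groups add up to 10,961, the number of variants reported by `summarize_variants` before filtering.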

In [18]:
mt.aggregate_entries(hl.agg.stats(mt.GQ))

Struct(mean=31.61537554844648, stdev=26.43319099327883, min=0.0, max=99.0, n=3071175, sum=97096351.00000012)

In [19]:
mt.aggregate_entries(
    hl.agg.filter(mt.GT.is_hom_ref(),hl.agg.stats(mt.GQ)))

Struct(mean=21.37505398600503, stdev=12.816819465406763, min=0.0, max=99.0, n=1741192, sum=37218073.000000075)

In [20]:
hl.agg.stats?


In [21]:
mt.aggregate_entries(
    hl.agg.filter(~mt.GT.is_hom_ref(),hl.agg.stats(mt.GQ)))

Struct(mean=45.02183712122629, stdev=32.884305900975264, min=0.0, max=99.0, n=1329983, sum=59878277.9999999)

In [22]:
mt.aggregate_entries(
    hl.agg.filter(mt.GT.is_het(),hl.agg.stats(mt.GQ)))

Struct(mean=64.17925188032766, stdev=30.950981843666693, min=0.0, max=99.0, n=747875, sum=47998058.000000045)
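
To make the pattern in the last few cells concrete, here is a minimal plain-Python sketch of what combining a filter with a `stats`-style aggregation computes: the statistics are taken only over the entries that satisfy the condition. The `stats` helper and the toy entries are hypothetical, and the divide-by-n (population) standard deviation is an assumption rather than a statement about Hail's implementation.

```python
import math

def stats(values):
    """Summary statistics for a non-empty list of numbers (population stdev assumed)."""
    n = len(values)
    total = sum(values)
    mean = total / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return {"mean": mean, "stdev": math.sqrt(variance),
            "min": min(values), "max": max(values), "n": n, "sum": total}

# Hypothetical (GT, GQ) entries, made up for illustration.
entries = [("0/0", 12), ("0/0", 24), ("0/1", 99), ("1/1", 60), ("0/1", 80)]

# Unfiltered: statistics over every entry.
all_gq = stats([gq for _, gq in entries])

# Filtered: statistics over heterozygous entries only.
het_gq = stats([gq for gt, gq in entries if gt == "0/1"])

print(all_gq["mean"], het_gq["mean"])  # 55.0 89.5
```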

In [23]:
p=hl.plot.histogram(mt.GQ, bins=100)

In [24]:
show(p)

In [25]:
p=hl.plot.histogram(mt.filter_entries(mt.GT.is_hom_ref()).GQ, bins=100)

In [26]:
show(p)

In [27]:
p=hl.plot.histogram(
    mt.filter_entries(mt.GT.is_het_ref()).GQ, 
    bins=100)
show(p)

In [28]:
p=hl.plot.histogram(
    mt.filter_entries((mt.DP == 10 ) & mt.GT.is_het_ref()).GQ, 
    bins=100)
show(p)

# **Module 3**

## Annotation, PCA and variant discovery

Annotations are important in any genetic study. Column fields store information about samples, such as phenotypes, ancestry, sex, and covariates. Let's annotate the columns in our MatrixTable.

In [29]:
table = hl.import_table('data/1kg_annotations.txt',
                       impute=True,
                       key='Sample')

2019-06-12 14:45:44 Hail: INFO: Reading table to impute column types
2019-06-12 14:45:44 Hail: INFO: Finished type imputation
  Loading column 'Sample' as type 'str' (imputed)
  Loading column 'Population' as type 'str' (imputed)
  Loading column 'SuperPopulation' as type 'str' (imputed)
  Loading column 'isFemale' as type 'bool' (imputed)
  Loading column 'PurpleHair' as type 'bool' (imputed)
  Loading column 'CaffeineConsumption' as type 'int32' (imputed)


In [30]:
table.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'Sample': str 
    'Population': str 
    'SuperPopulation': str 
    'isFemale': bool 
    'PurpleHair': bool 
    'CaffeineConsumption': int32 
----------------------------------------
Key: ['Sample']
----------------------------------------


To peek at the first few values, use the `show` method:

In [31]:
# Show the first 10 rows of the table
table.show(10)

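+-----------+------------+-----------------+----------+------------+---------------------+
| Sample    | Population | SuperPopulation | isFemale | PurpleHair | CaffeineConsumption |
+-----------+------------+-----------------+----------+------------+---------------------+
| str       | str        | str             |     bool |       bool |               int32 |
+-----------+------------+-----------------+----------+------------+---------------------+
| "HG00096" | "GBR"      | "EUR"           |    False |      False |                   4 |
| "HG00097" | "GBR"      | "EUR"           |     True |       True |                   4 |
| "HG00098" | "GBR"      | "EUR"           |    False |      False |                   5 |
| "HG00099" | "GBR"      | "EUR"           |     True |      False |                   4 |
| "HG00100" | "GBR"      | "EUR"           |     True |      False |                   5 |
| "HG00101" | "GBR"      | "EUR"           |    False |       True |                   1 |
| "HG00102" | "GBR"      | "EUR"           |     True |       True |                   6 |
| "HG00103" | "GBR"      | "EUR"           |    False |       True |                   5 |
| "HG00104" | "GBR"      | "EUR"           |     True |      False |                   5 |
| "HG00105" | "GBR"      | "EUR"           |    False |      False |                   4 |
+-----------+------------+-----------------+----------+------------+---------------------+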


Note that `show` can be called directly only on a Table. For a MatrixTable, you must first choose which part to show: the rows, the columns, or the entries:

`table.show()` --> Table

`mt.row.alleles.show()` --> MatrixTable

In [32]:
%%sh
# Not common or recommended, but local data can be previewed with a shell command.
# The %%sh cell magic must be the first line of the cell.
head data/1kg_annotations.txt

In [33]:
mt.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AC: array<int32>, 
        AF: array<float64>, 
        AN: int32, 
        BaseQRankSum: float64, 
        ClippingRankSum: float64, 
        DP: int32, 
        DS: bool, 
        FS: float64, 
        HaplotypeScore: float64, 
        InbreedingCoeff: float64, 
        MLEAC: array<int32>, 
        MLEAF: array<float64>, 
        MQ: float64, 
        MQ0: int32, 
        MQRankSum: float64, 
        QD: float64, 
        ReadPosRankSum: float64, 
        set: str
    }
----------------------------------------
Entry fields:
    'GT': call
    'AD': array<int32>
    'DP': int32
    'GQ': int32
    'PL': array<int32>
----------------------------------------
Colu

We use the `annotate_cols` method to join the table with the MatrixTable containing our dataset.

In [45]:
mt = mt.annotate_cols(pheno=table[mt.s])
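
Conceptually, `table[mt.s]` is a keyed lookup: for each sample ID in the matrix table, fetch the table row whose key matches. The plain-Python sketch below mirrors that idea with dictionaries. The sample `NA00000` is a made-up ID used only to show what happens when no row matches; the other values come from the table preview above.

```python
# Annotation table keyed by 'Sample' (two rows from the preview above).
table = {
    "HG00096": {"Population": "GBR", "SuperPopulation": "EUR", "isFemale": False},
    "HG00097": {"Population": "GBR", "SuperPopulation": "EUR", "isFemale": True},
}

# Column keys of the matrix table. "NA00000" is hypothetical and has no annotation.
sample_ids = ["HG00096", "NA00000"]

# In spirit, annotate_cols(pheno=table[mt.s]) performs one keyed lookup per sample.
# A sample without a matching row gets a missing value (None here).
pheno = {s: table.get(s) for s in sample_ids}
n_missing = sum(1 for value in pheno.values() if value is None)

print(pheno["HG00096"]["Population"], n_missing)  # GBR 1
```

Counting missing annotations after a join, as the notebook does with `hl.is_missing(mt.pheno)`, is a quick way to confirm that every sample found a match.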

In [46]:
mt.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
    'pheno': struct {
        Population: str, 
        SuperPopulation: str, 
        isFemale: bool, 
        PurpleHair: bool, 
        CaffeineConsumption: int32
    }
    'sample_qc': struct {
        dp_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        gq_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        call_rate: float64, 
        n_called: int64, 
        n_not_called: int64, 
        n_filtered: int64, 
        n_hom_ref: int64, 
        n_het: int64, 
        n_hom_var: int64, 
        n_non_ref: int64, 
        n_singleton: int64, 
        n_snp: int64, 
        n_insertion: int64, 
        n_deletion: int64, 
        n_transition: int64, 
        n_transv

In [47]:
mt.col.pheno.show()

+-----------+------------------+-----------------------+----------------+
| s         | pheno.Population | pheno.SuperPopulation | pheno.isFemale |
+-----------+------------------+-----------------------+----------------+
| str       | str              | str                   |           bool |
+-----------+------------------+-----------------------+----------------+
| "HG00096" | "GBR"            | "EUR"                 |          false |
| "HG00099" | "GBR"            | "EUR"                 |           true |
| "HG00105" | "GBR"            | "EUR"                 |          false |
| "HG00118" | "GBR"            | "EUR"                 |           true |
| "HG00129" | "GBR"            | "EUR"                 |          false |
| "HG00148" | "GBR"            | "EUR"                 |          false |
| "HG00254" | "GBR"            | "EUR"                 |           true |
| "HG00271" | "FIN"            | "EUR"                 |          false |
| "HG00332" | "FIN"            | "EUR"

The `aggregate_cols` method aggregates over the columns (samples) of the MatrixTable.
`counter` is an aggregation function that counts the number of occurrences of each unique element.

In [48]:
pprint(mt.aggregate_cols(hl.agg.counter(mt.pheno.SuperPopulation)))

{'AFR': 64, 'AMR': 32, 'EAS': 61, 'EUR': 39, 'SAS': 54}


2019-06-12 14:48:23 Hail: INFO: Coerced sorted dataset
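
For readers who know the standard library, `counter` behaves much like `collections.Counter`. The sketch below is plain Python, not Hail: it counts a short list of made-up labels and then checks that the super-population counts printed above add up to the number of samples.

```python
from collections import Counter

# Hypothetical labels, one per sample, made up for illustration.
labels = ["EUR", "EUR", "AFR", "EAS", "AFR", "EUR"]
counts = Counter(labels)
print(dict(counts))  # {'EUR': 3, 'AFR': 2, 'EAS': 1}

# Counts copied from the aggregate_cols output above.
super_population_counts = {"AFR": 64, "AMR": 32, "EAS": 61, "EUR": 39, "SAS": 54}
total = sum(super_population_counts.values())
fractions = {pop: n / total for pop, n in super_population_counts.items()}
print(total)  # 250
```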


In [38]:
mt.aggregate_cols(hl.agg.count_where(hl.is_missing(mt.pheno)))

2019-06-12 14:46:18 Hail: INFO: Coerced sorted dataset


0

In [39]:
mt = hl.sample_qc(mt)

In [40]:
mt.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
    'pheno': struct {
        Population: str, 
        SuperPopulation: str, 
        isFemale: bool, 
        PurpleHair: bool, 
        CaffeineConsumption: int32
    }
    'sample_qc': struct {
        dp_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        gq_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        call_rate: float64, 
        n_called: int64, 
        n_not_called: int64, 
        n_filtered: int64, 
        n_hom_ref: int64, 
        n_het: int64, 
        n_hom_var: int64, 
        n_non_ref: int64, 
        n_singleton: int64, 
        n_snp: int64, 
        n_insertion: int64, 
        n_deletion: int64, 
        n_transition: int64, 
        n_transv

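`stats` is an aggregation function that produces some useful statistics about numeric collections. The `dp_stats` and `gq_stats` fields added by `sample_qc` hold the same kind of summary (mean, stdev, min, max). Below, we keep only samples with a mean depth of at least 4 and a call rate above 0.97.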

In [49]:
mt = mt.filter_cols(mt.sample_qc.dp_stats.mean >= 4)
mt = mt.filter_cols(mt.sample_qc.call_rate > 0.97)

In [50]:
mt.count_cols()

250
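
The two sample filters can be sketched in plain Python as follows. The per-sample numbers are made up, the field names `dp_mean`, `n_called`, and `n_total` are hypothetical, and call rate is taken here to mean the fraction of genotypes that were called, which is an assumption about the definition rather than a quote from the Hail documentation.

```python
# Hypothetical per-sample QC summaries (made-up numbers).
sample_qc = {
    "S1": {"dp_mean": 6.2, "n_called": 990, "n_total": 1000},
    "S2": {"dp_mean": 3.1, "n_called": 995, "n_total": 1000},  # depth too low
    "S3": {"dp_mean": 8.0, "n_called": 950, "n_total": 1000},  # call rate too low
}

def passes_sample_qc(qc, min_dp_mean=4, min_call_rate=0.97):
    """Same thresholds as the filter_cols calls above: mean DP >= 4, call rate > 0.97."""
    call_rate = qc["n_called"] / qc["n_total"]
    return qc["dp_mean"] >= min_dp_mean and call_rate > min_call_rate

kept = [s for s, qc in sample_qc.items() if passes_sample_qc(qc)]
print(kept)  # ['S1']
```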

In [51]:
mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT)))

0.9924605419213576

In [52]:
ab = mt.AD[1] / hl.sum(mt.AD)

filter_condition_ab = ((mt.GT.is_hom_ref() & (ab <= 0.1)) |
                       (mt.GT.is_het() & (ab >= 0.25) & (ab <= 0.75)) |
                       (mt.GT.is_hom_var() & (ab >= 0.9)))

mt = mt.filter_entries(filter_condition_ab)
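
The allele-balance condition is easier to check on a single genotype at a time. The plain-Python sketch below applies the same thresholds to one call. The function name and the tuple encoding of genotypes are hypothetical, and treating zero total depth as a failure is an assumption made so the sketch never divides by zero.

```python
def allele_balance_ok(gt, ad):
    """gt: pair of allele indices, e.g. (0, 1); ad: [ref_depth, alt_depth]."""
    total = sum(ad)
    if total == 0:
        return False  # allele balance is undefined; failing it is an assumption
    ab = ad[1] / total  # fraction of reads supporting the alternate allele

    is_hom_ref = gt == (0, 0)
    is_het = gt[0] != gt[1]
    is_hom_var = gt[0] == gt[1] and gt[0] != 0

    return ((is_hom_ref and ab <= 0.1) or
            (is_het and 0.25 <= ab <= 0.75) or
            (is_hom_var and ab >= 0.9))

print(allele_balance_ok((0, 0), [4, 0]))   # True: hom-ref with no alt reads
print(allele_balance_ok((0, 1), [9, 1]))   # False: het call with only 10% alt reads
print(allele_balance_ok((0, 1), [5, 5]))   # True: balanced het call
print(allele_balance_ok((1, 1), [3, 7]))   # False: hom-var with 30% ref reads
```

Entries that fail the condition are the ones whose read evidence disagrees with the genotype call, which is why they are dropped before variant QC.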

In [53]:
mt = hl.variant_qc(mt)

In [54]:
mt = mt.filter_rows(hl.min(mt.variant_qc.AF)>0.01)
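
For a biallelic variant, `AF` holds one frequency per allele, so taking the minimum gives the minor allele frequency. The plain-Python sketch below computes it from genotype counts. The function name and the counts are made up for illustration.

```python
def allele_frequencies(n_hom_ref, n_het, n_hom_var):
    """Return [ref_frequency, alt_frequency] for a biallelic variant."""
    n_alleles = 2 * (n_hom_ref + n_het + n_hom_var)  # two alleles per called genotype
    alt = (n_het + 2 * n_hom_var) / n_alleles
    return [1 - alt, alt]

# Hypothetical genotype counts for two variants in 250 called samples.
common = allele_frequencies(n_hom_ref=240, n_het=9, n_hom_var=1)  # alt AF = 11/500
rare = allele_frequencies(n_hom_ref=248, n_het=2, n_hom_var=0)    # alt AF = 2/500

# Same rule as the filter above: keep variants whose minor allele frequency exceeds 1%.
keep_common = min(common) > 0.01
keep_rare = min(rare) > 0.01
print(keep_common, keep_rare)  # True False
```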

In [55]:
mt = mt.filter_rows(mt.variant_qc.p_value_hwe > 1e-6)

In [56]:
mt.count()

(7837, 250)

### PCA

The `pca` method produces eigenvalues as a list and sample PCs as a Table, and can also produce variant loadings when asked. The `hwe_normalized_pca` method does the same, using HWE-normalized genotypes for the PCA.

In [57]:
pca_eigenvalues, pcs_scores, pca_loadings = hl.hwe_normalized_pca(mt.GT)

2019-06-12 14:49:36 Hail: INFO: hwe_normalized_pca: running PCA using 7837 variants.
2019-06-12 14:49:38 Hail: INFO: pca: running PCA with 10 components...
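
As a hedged illustration of what "HWE-normalized" means, the sketch below centers and scales the alternate-allele counts of one variant. It follows my reading of the Hail documentation: each genotype g becomes (g - 2p) / sqrt(2p(1 - p)m), where p is the alternate allele frequency and m is the number of variants, and missing genotypes become 0. The function name is hypothetical, and the code assumes a polymorphic variant (0 < p < 1).

```python
import math

def hwe_normalize(genotypes, n_variants):
    """genotypes: alt-allele counts (0, 1, 2), or None if missing, for one variant."""
    called = [g for g in genotypes if g is not None]
    p = sum(called) / (2 * len(called))  # alternate allele frequency
    scale = math.sqrt(2 * p * (1 - p) * n_variants)
    # Missing genotypes become 0, which is the same as imputing them to the mean 2p.
    return [0.0 if g is None else (g - 2 * p) / scale for g in genotypes]

normalized = hwe_normalize([0, 1, 2, None, 1], n_variants=1)
print([round(x, 3) for x in normalized])  # [-1.414, 0.0, 1.414, 0.0, 0.0]
```

After this step every variant has mean zero, and common and rare variants contribute on a comparable scale to the principal components.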


In [58]:
pca_eigenvalues

[18.030685502441372,
 9.993820694739123,
 3.54003460013884,
 2.6590434101706717,
 1.5947740104203803,
 1.5413396254154588,
 1.5033548038676905,
 1.47064631730658,
 1.4674512050076023,
 1.4481437343258543]

### Annotate the columns of matrix table `mt` with the PCA scores

[Population stratification](https://en.wikipedia.org/wiki/Population_stratification), a common problem in genetic studies, can be addressed by including ancestry as a covariate in a regression. We will represent genetic ancestry with the principal components computed above.


In [59]:
mt = mt.annotate_cols(pca=pcs_scores[mt.s])

In [60]:
mt.pca.scores.dtype

dtype('array<float64>')

### Plot the first two PCs

In [61]:
pca = plotting.scatter_plot(
    mt.pca.scores[0],
    mt.pca.scores[1],
    label_fields={'Population': mt.pheno.SuperPopulation},
    title='PCA, first two principal components',
    xlabel='PC1', ylabel='PC2')

show(pca)