# Hail Tutorial GWAS

### 1. Setting up work environment
The first step in performing a GWAS is to load our dependencies and set up our work environment.

In [1]:
"""
Import statements allow us to reuse code written previously by ourselves or others.
Here we import the "Hail" library, the core tool we will use to organize our data and, eventually, to perform statistical analyses.
"""
import hail as hl
from hail.plot import show
from pprint import pprint
hl.plot.output_notebook()
hl.init(quiet=True)

2023-01-18 18:44:16 WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


### 2. Loading in the data
After we finish loading our dependencies, we can go ahead and start loading the data, starting with our genotype data (stored in a folder called "1kg.mt") and our phenotype data (stored in a file called "1kg_annotations.txt").

In [3]:
# Loading in the genotype data from our "data" folder and storing it in a variable called "mt", short for "MatrixTable" (one of the key innovations of the Hail library)
mt = hl.read_matrix_table('data/1kg.mt')

# Loading in the phenotype data from our "data" folder and storing it in a variable called "table"
table = hl.import_table('data/1kg_annotations.txt', impute=True).key_by('Sample')

Now that our data is loaded in, we can combine the two to form a consolidated dataset containing all the relevant information we are going to use for our analyses.

In [7]:
# We can use the "annotate_cols" method to attach the phenotype data in "table" to the columns (samples) of "mt", matching on the sample ID "s"
mt = mt.annotate_cols(pheno = table[mt.s])

It is always a good idea to take a look at our data to see what format we are working with and the available information we have. One way to do this is by using the "describe" method. An example is shown below:

In [8]:
# Describing the format of the "mt" variable using an interactive widget
mt.describe(widget = True)

After running the cell above, we can interact with the four main components of our dataset (globals, rows, cols, and entries). In Hail, each row corresponds to one specific genetic variant and each column corresponds to one specific individual. An entry is the intersection of a row and a column and contains information about a specific variant for a particular individual (such as the genotype call).
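
To make the row, column, and entry layout concrete, here is a minimal sketch in plain Python. It is not the Hail API and says nothing about how a MatrixTable is stored internally; the variant names, sample IDs, and genotype values are all made up for illustration.

```python
# Toy stand-in for a MatrixTable (plain Python, NOT the Hail API).
# All names and values below are hypothetical.
variants = ["variant_1", "variant_2"]            # one row per genetic variant
samples = ["sample_A", "sample_B", "sample_C"]   # one column per individual

# One entry per (variant, sample) pair. Here an entry is simply the number of
# alternate alleles carried (0, 1, or 2), or None when the call is missing.
entries = {
    ("variant_1", "sample_A"): 0,
    ("variant_1", "sample_B"): 1,
    ("variant_1", "sample_C"): 2,
    ("variant_2", "sample_A"): 1,
    ("variant_2", "sample_B"): None,
    ("variant_2", "sample_C"): 0,
}


def get_entry(variant, sample):
    """Look up the entry at the intersection of one row and one column."""
    return entries[(variant, sample)]


def row_entries(variant):
    """Return all entries for one variant, in column (sample) order."""
    return [entries[(variant, sample)] for sample in samples]


print(get_entry("variant_1", "sample_B"))  # prints 1
print(row_entries("variant_2"))            # prints [1, None, 0]
```

A real MatrixTable also carries row fields, column fields (such as "pheno"), and globals alongside the entries; the dictionary above only mimics the entry grid.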

Additionally, from the interactive widget above, we can see that the "col" tab has two fields that we can access: "s" and "pheno". If we expand the "pheno" field, we can see what information is available for each individual. In this dataset, we have access to the following variables for each individual: "Population", "SuperPopulation", "isFemale", "PurpleHair", and "CaffeineConsumption".

Feel free to explore the "row" and "entry" tabs to learn more about those parts of our dataset.

### 3. Quality Control

After we load and explore our dataset, the next step is to perform some quality control (QC) so that we have a clean dataset prior to statistical analysis. In a GWAS, there are quite a few quality control measures we have to do. In order to organize ourselves, let's split these quality control measures into two categories: QC on the genotype data and QC on the phenotype data.

Let's begin with QC on the phenotype data:
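1. Remove individuals with high levels of missingness (people for whom we do not have enough data); here we keep only samples with sample_qc.call_rate >= 0.98
2. Remove or correct individuals who have sex discrepancies (a mismatch between the recorded sex and the sex inferred from the genotypes)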
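
As a rough illustration of the first measure, the sketch below computes a per-sample call rate in plain Python and applies the 0.98 cutoff. It is a toy version of the idea, not Hail's `sample_qc` implementation; the sample IDs and genotype lists are hypothetical, and `None` is assumed to mark a missing call.

```python
def call_rate(genotypes):
    """Fraction of genotype calls that are present (None marks a missing call)."""
    if not genotypes:
        raise ValueError("no genotypes supplied")
    called = sum(g is not None for g in genotypes)
    return called / len(genotypes)


# Hypothetical samples, each with 100 variant calls.
samples = {
    "sample_A": [0] * 99 + [None],      # 1 missing call  -> call rate 0.99
    "sample_B": [1] * 95 + [None] * 5,  # 5 missing calls -> call rate 0.95
}

# Keep only samples whose call rate meets the cutoff.
CALL_RATE_CUTOFF = 0.98
kept = {s: g for s, g in samples.items() if call_rate(g) >= CALL_RATE_CUTOFF}

print(sorted(kept))  # prints ['sample_A']
```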

Let's continue on to QC on the genotype data:
1. Missingness of SNPs (remove variants for which we do not have adequate data)
2. Minor allele frequency (MAF) (remove variants that are too rare)
3. Hardy-Weinberg equilibrium (HWE) (remove variants that deviate from Hardy-Weinberg equilibrium)
4. Heterozygosity
5. Relatedness
6. Population stratification
7. Ancestry
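
To show what the MAF and HWE measures in the list above are checking, here is a small plain-Python sketch that works from the three genotype counts of a single biallelic variant. The function names and the example counts are hypothetical. The HWE check uses a textbook chi-square test with one degree of freedom, swapped in here for simplicity; Hail computes its `p_value_hwe` field with its own test, so its values may differ. Note also that the filter in this tutorial uses `variant_qc.AF[1]`, the alternate allele frequency, which equals the MAF only when the alternate allele is the rarer one.

```python
import math


def alt_allele_frequency(n_hom_ref, n_het, n_hom_var):
    """Frequency of the alternate allele, from the three genotype counts."""
    n = n_hom_ref + n_het + n_hom_var
    if n == 0:
        raise ValueError("no called genotypes")
    # Each individual carries two alleles; heterozygotes carry one alternate copy.
    return (n_het + 2 * n_hom_var) / (2 * n)


def minor_allele_frequency(n_hom_ref, n_het, n_hom_var):
    """Frequency of whichever allele (reference or alternate) is rarer."""
    p = alt_allele_frequency(n_hom_ref, n_het, n_hom_var)
    return min(p, 1 - p)


def hwe_chi_square_p_value(n_hom_ref, n_het, n_hom_var):
    """Chi-square (1 df) p-value for deviation from Hardy-Weinberg proportions."""
    n = n_hom_ref + n_het + n_hom_var
    p = alt_allele_frequency(n_hom_ref, n_het, n_hom_var)
    q = 1 - p
    if p == 0 or q == 0:
        return 1.0  # monomorphic variant: nothing to test
    expected = (n * q * q, 2 * n * p * q, n * p * p)
    observed = (n_hom_ref, n_het, n_hom_var)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


# Counts that match Hardy-Weinberg proportions: p-value close to 1.
print(minor_allele_frequency(81, 18, 1), hwe_chi_square_p_value(81, 18, 1))
# No heterozygotes at all: strong deviation, so a very small p-value.
print(hwe_chi_square_p_value(50, 0, 50))
```

With cutoffs like the ones used in this tutorial, the first variant would pass both filters (MAF above 0.05, HWE p-value above 1e-6), while the second would be removed by the HWE filter.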

In [9]:
import ipywidgets as widgets
from ipywidgets import HBox, VBox
from IPython.display import display
%matplotlib inline

mt.describe(widget=True)

In [10]:
# Compute per-sample QC metrics, then keep samples with mean depth >= 4 and call rate >= 0.98
mt = hl.sample_qc(mt)
mt = mt.filter_cols((mt.sample_qc.dp_stats.mean >= 4) & (mt.sample_qc.call_rate >= 0.98))
#ab = mt.AD[1] / hl.sum(mt.AD)
#filter_condition_ab = ((mt.GT.is_hom_ref() & (ab <= 0.1)) |
                        #(mt.GT.is_het() & (ab >= 0.25) & (ab <= 0.75)) |
                        #(mt.GT.is_hom_var() & (ab >= 0.9)))
#mt = mt.filter_entries(filter_condition_ab)
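# Compute per-variant QC metrics and keep an unfiltered copy so the interactive filter can restart from it
mt = hl.variant_qc(mt)
original = mt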
# Minor allele frequency cutoff
# mt = mt.filter_rows(mt.variant_qc.AF[1] > 0.05)
# Hardy-Weinberg equilibrium (HWE) cutoff
# mt = mt.filter_rows(mt.variant_qc.p_value_hwe > 1e-6)

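# FloatLogSlider takes min/max as exponents of the base, while value is the actual cutoff (1e-6)
@widgets.interact(
    MAF=widgets.FloatSlider(min=0.01, max=0.05, step=0.01, value=0.05,
                            layout=widgets.Layout(width='500px')),
    HWE=widgets.FloatLogSlider(value=1e-6, base=10, min=-10, max=-6, step=1, readout_format='.2e',
                               layout=widgets.Layout(width='500px')))
def variant_qc_interactive(MAF=0.05, HWE=1e-6):
    global mt
    # Start from the unfiltered table each time so slider changes do not compound
    mt = original
    # Keep variants whose alternate allele frequency exceeds the MAF cutoff
    mt = mt.filter_rows(mt.variant_qc.AF[1] > MAF)
    # Keep variants that do not deviate strongly from Hardy-Weinberg equilibrium
    mt = mt.filter_rows(mt.variant_qc.p_value_hwe > HWE)
    print('Samples: %d  Variants: %d' % (mt.count_cols(), mt.count_rows()))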

Let's quickly perform a GWAS and visualize the results!

In [5]:
print('Samples: %d  Variants: %d' % (mt.count_cols(), mt.count_rows()))

Samples: 224  Variants: 8003


In [6]:
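# Regress caffeine consumption on the number of alternate alleles at each variant,
# with an intercept (1.0) and sex as covariates
gwas = hl.linear_regression_rows(
    y=mt.pheno.CaffeineConsumption,
    x=mt.GT.n_alt_alleles(),
    covariates=[1.0, mt.pheno.isFemale])
# Q-Q plot of the p-values
p = hl.plot.qq(gwas.p_value)
show(p)
# Manhattan plot of the p-values by genomic position
p = hl.plot.manhattan(gwas.p_value)
show(p)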
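
To give a feel for what the regression above computes for each variant, here is a minimal plain-Python sketch of a simple linear regression of a phenotype on the alternate allele count. It is a simplification under stated assumptions, not Hail's implementation: it fits an intercept only (the sex covariate is left out), it skips the p-value because the standard library has no Student's t distribution, and the genotype and phenotype values are made up. The returned names mirror the `beta`, `standard_error`, and `t_stat` fields of the Hail result.

```python
import math


def simple_linear_regression(x, y):
    """Fit y = intercept + beta * x by least squares; return beta, its SE, and t."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation (monomorphic variant)")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    # Residual sum of squares, with n - 2 degrees of freedom (intercept and slope).
    rss = sum((yi - (intercept + beta * xi)) ** 2 for xi, yi in zip(x, y))
    standard_error = math.sqrt(rss / (n - 2) / sxx)
    t_stat = beta / standard_error if standard_error > 0 else math.nan
    return beta, standard_error, t_stat


# Hypothetical data: alternate allele counts and a quantitative phenotype.
n_alt_alleles = [0, 1, 2, 0, 1, 2]
phenotype = [1.0, 2.0, 3.0, 1.5, 2.5, 3.5]

beta, standard_error, t_stat = simple_linear_regression(n_alt_alleles, phenotype)
print(beta, standard_error, t_stat)
```

In this toy data each extra alternate allele raises the phenotype by one unit, so `beta` comes out as 1.0. A GWAS repeats this kind of fit once per variant and then compares the resulting p-values across the genome.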

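Next, let's run a principal component analysis (PCA) on the genotypes and add the first three principal components as covariates, so that our regression controls for population structure.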

In [7]:
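# Run PCA on the HWE-normalized genotypes and attach each sample's scores to the columns
eigenvalues, pcs, _ = hl.hwe_normalized_pca(mt.GT)
mt = mt.annotate_cols(scores = pcs[mt.s].scores)
# Repeat the regression, now with the first three principal components as extra covariates
gwas = hl.linear_regression_rows(
    y=mt.pheno.CaffeineConsumption,
    x=mt.GT.n_alt_alleles(),
    covariates=[1.0, mt.pheno.isFemale, mt.scores[0], mt.scores[1], mt.scores[2]])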
p = hl.plot.qq(gwas.p_value)
show(p)
p = hl.plot.manhattan(gwas.p_value)
show(p)
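
The "HWE-normalized" part of the PCA step can be sketched as follows. This is a plain-Python illustration of the usual Hardy-Weinberg scaling for one variant, assuming alternate allele counts of 0, 1, or 2 and `None` for a missing call. It is not Hail's code: the actual function may apply an additional constant factor, and the principal component computation itself is omitted here.

```python
import math


def hwe_normalize(genotypes):
    """Center and scale one variant's alternate allele counts.

    Each called genotype g becomes (g - 2p) / sqrt(2p(1 - p)), where p is the
    alternate allele frequency. Missing calls (None) become 0.0, which is the
    same as imputing them to the variant's mean.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("variant has no called genotypes")
    p = sum(called) / (2 * len(called))
    if p == 0.0 or p == 1.0:
        raise ValueError("monomorphic variant cannot be normalized")
    scale = math.sqrt(2 * p * (1 - p))
    return [0.0 if g is None else (g - 2 * p) / scale for g in genotypes]


# Hypothetical variant with one missing call.
print(hwe_normalize([0, None, 2, 1, 1]))
```

After this scaling every variant has mean zero and a comparable spread, so common and rare variants contribute on a similar footing when the principal components are computed.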

In [8]:
gwas.show(width=100)

locus,alleles,n,sum_x,y_transpose_x,beta,standard_error,t_stat,p_value
locus<GRCh37>,array<str>,int32,float64,float64,float64,float64,float64,float64
1:904165,"[""G"",""A""]",224,53.0,229.0,0.013,0.2,0.0648,0.948
1:1707740,"[""T"",""G""]",224,74.0,334.0,0.202,0.183,1.11,0.27
1:2284195,"[""T"",""C""]",224,139.0,636.0,-0.107,0.151,-0.707,0.48
1:2779043,"[""T"",""C""]",224,332.0,1470.0,0.282,0.158,1.79,0.0754
1:2944527,"[""G"",""A""]",224,100.0,447.0,-0.278,0.181,-1.53,0.127
1:3803755,"[""T"",""C""]",224,323.0,1420.0,-0.0211,0.143,-0.148,0.882
1:4121584,"[""A"",""G""]",224,140.0,651.0,0.0229,0.136,0.168,0.867
1:4170048,"[""C"",""T""]",224,111.0,502.0,0.401,0.161,2.5,0.0132
1:4180842,"[""C"",""T""]",224,130.0,606.0,0.132,0.161,0.822,0.412
1:6053630,"[""T"",""G""]",224,169.0,695.0,-0.165,0.14,-1.18,0.241


In [9]:
gwas.order_by("p_value").to_pandas().head(10)

Unnamed: 0,locus,alleles,n,sum_x,y_transpose_x,beta,standard_error,t_stat,p_value
0,8:19600329,"[A, G]",224,240.0,1149.0,0.71094,0.129726,5.480321,1.16789e-07
1,8:19619751,"[G, A]",224,102.0,509.0,0.837023,0.157132,5.326892,2.483578e-07
2,8:19826373,"[G, A]",224,239.0,1076.0,0.657047,0.137838,4.766797,3.419653e-06
3,8:19651161,"[T, C]",224,189.0,906.0,0.55844,0.125995,4.432244,1.477245e-05
4,8:19943027,"[G, A]",224,97.0,462.0,0.780023,0.194232,4.015934,8.148142e-05
5,12:4702230,"[G, A]",224,23.103139,103.721973,1.210201,0.331844,3.646897,0.000332022
6,17:71113564,"[G, A]",224,41.0,151.0,-0.840535,0.230491,-3.64671,0.0003322499
7,4:108530885,"[C, T]",224,165.73991,778.699552,0.525955,0.147482,3.566238,0.0004450898
8,7:33081018,"[T, C]",224,310.38565,1405.928251,0.477917,0.136499,3.501242,0.0005615725
9,14:49135134,"[G, A]",224,49.0,158.0,-0.77471,0.224183,-3.455707,0.0006595956
