### Hail Tutorial GWAS

The first step in performing a GWAS is to load our dependencies and data (both genotype and phenotype data).

In [1]:
import hail as hl
from hail.plot import show
from pprint import pprint
hl.plot.output_notebook()
hl.init(quiet=True)


table = hl.import_table('data/1kg_annotations.txt', impute=True).key_by('Sample')

mt = hl.read_matrix_table('data/1kg.mt')
mt = mt.annotate_cols(pheno = table[mt.s])



Next, we perform quality control (QC) on the loaded dataset. The usual QC steps are listed below; the ones marked **COMPLETE** are implemented in this notebook.

1. Missingness of SNPs and individuals
2. Sex discrepancy
3. Minor allele frequency (MAF) - **COMPLETE**
4. Hardy–Weinberg equilibrium (HWE) - **COMPLETE**
5. Heterozygosity
6. Relatedness
7. Population stratification
8. Ancestry
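
The two completed checks can be illustrated without Hail. The following is a minimal sketch, assuming biallelic sites and plain genotype counts per variant. The function names are hypothetical, and the chi-square goodness-of-fit test is a simple stand-in for illustration, not necessarily the test Hail uses to produce `p_value_hwe`.

```python
import math

def minor_allele_frequency(n_hom_ref, n_het, n_hom_var):
    """Frequency of the less common allele at a biallelic site."""
    n_alleles = 2 * (n_hom_ref + n_het + n_hom_var)
    alt_freq = (n_het + 2 * n_hom_var) / n_alleles
    # The minor allele may be either the reference or the alternate allele.
    return min(alt_freq, 1.0 - alt_freq)

def hwe_chi_square_p_value(n_hom_ref, n_het, n_hom_var):
    """Chi-square (1 degree of freedom) test of Hardy-Weinberg proportions."""
    n = n_hom_ref + n_het + n_hom_var
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    observed = [n_hom_ref, n_het, n_hom_var]
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

# Illustrative (made-up) genotype counts: hom-ref, het, hom-var.
in_equilibrium = (25, 50, 25)
no_heterozygotes = (90, 0, 10)
print(minor_allele_frequency(*in_equilibrium), hwe_chi_square_p_value(*in_equilibrium))
print(minor_allele_frequency(*no_heterozygotes), hwe_chi_square_p_value(*no_heterozygotes))
```

A variant would pass the QC step when its minor allele frequency is above the MAF cutoff and its HWE p-value is above the HWE cutoff. Note that a site with counts such as (10, 0, 90) has an alternate allele frequency of 0.9 but a minor allele frequency of 0.1.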

In [2]:
import ipywidgets as widgets
from ipywidgets import HBox, VBox
from IPython.display import display
%matplotlib inline

#mt = hl.sample_qc(mt)
#mt = mt.filter_cols((mt.sample_qc.dp_stats.mean >= 4) & (mt.sample_qc.call_rate >= 0.97))
#ab = mt.AD[1] / hl.sum(mt.AD)
#filter_condition_ab = ((mt.GT.is_hom_ref() & (ab <= 0.1)) |
#                        (mt.GT.is_het() & (ab >= 0.25) & (ab <= 0.75)) |
#                        (mt.GT.is_hom_var() & (ab >= 0.9)))
#mt = mt.filter_entries(filter_condition_ab)
mt = hl.variant_qc(mt)
original = mt
# Minor allele frequency cutoff
# mt = mt.filter_rows(mt.variant_qc.AF[1] > 0.05)
# Hardy-Weinberg equilibrium (HWE) cutoff
# mt = mt.filter_rows(mt.variant_qc.p_value_hwe > 1e-6)

@widgets.interact(
    MAF=widgets.FloatSlider(min=0.01, max=0.05, step=0.01, value=0.05,
                            layout=widgets.Layout(width='500px')),
    # FloatLogSlider takes min/max as exponents of `base`, but `value` as the actual value
    HWE=widgets.FloatLogSlider(value=1e-6, base=10, min=-10, max=-6, step=1,
                               readout_format='.2e', layout=widgets.Layout(width='500px')))
def variant_qc_interactive(MAF = 0.05, HWE=1e-6):
    global mt
    global original
    mt = original
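    # AF[1] is the alternate-allele frequency, used here as a proxy for MAF;
    # it does not exclude sites where the reference allele is the rare one.
    mt = mt.filter_rows(mt.variant_qc.AF[1] > MAF)
    # Keep variants whose HWE p-value is above the cutoff
    mt = mt.filter_rows(mt.variant_qc.p_value_hwe > HWE)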
    print('Samples: %d  Variants: %d' % (mt.count_cols(), mt.count_rows()))


Let's quickly perform a GWAS and visualize the results!

In [3]:
print('Samples: %d  Variants: %d' % (mt.count_cols(), mt.count_rows()))

Samples: 284  Variants: 7403


In [4]:
gwas = hl.linear_regression_rows(
    y=mt.pheno.CaffeineConsumption,
    x=mt.GT.n_alt_alleles(),
    covariates=[1.0, mt.pheno.isFemale])
p = hl.plot.qq(gwas.p_value)
show(p)
p = hl.plot.manhattan(gwas.p_value)
show(p)
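
To make the regression step concrete: each variant is tested by regressing the phenotype on the number of alternate alleles per sample. The following is a minimal sketch of that idea for a single variant in plain Python, assuming an intercept-only model (it omits the `isFemale` covariate used above). The function name and the toy data are hypothetical and made up for illustration; this is not Hail's implementation.

```python
import math

def simple_association_test(alt_counts, phenotype):
    """Ordinary least squares of phenotype on alt-allele count, with intercept.

    Returns the slope (effect size), its standard error, and the t-statistic.
    """
    n = len(alt_counts)
    mean_x = sum(alt_counts) / n
    mean_y = sum(phenotype) / n
    sxx = sum((x - mean_x) ** 2 for x in alt_counts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(alt_counts, phenotype))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    # Residual sum of squares, with n - 2 degrees of freedom (intercept + slope).
    rss = sum((y - (intercept + beta * x)) ** 2 for x, y in zip(alt_counts, phenotype))
    standard_error = math.sqrt(rss / (n - 2) / sxx)
    return beta, standard_error, beta / standard_error

# Made-up toy data: alt-allele counts (0, 1 or 2) and a quantitative phenotype.
toy_alt_counts = [0, 1, 2, 0, 1, 2]
toy_phenotype = [1.0, 2.1, 2.9, 1.1, 1.9, 3.0]
beta, standard_error, t_stat = simple_association_test(toy_alt_counts, toy_phenotype)
print(f"beta={beta:.3f} se={standard_error:.4f} t={t_stat:.2f}")
```

A GWAS repeats a test of this kind for every variant, which is why the results are summarized with QQ and Manhattan plots of the per-variant p-values.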

Let's compute principal components with PCA and add the first three as covariates to control for population structure in our regression.

In [5]:
eigenvalues, pcs, _ = hl.hwe_normalized_pca(mt.GT)
mt = mt.annotate_cols(scores = pcs[mt.s].scores)
gwas = hl.linear_regression_rows(
    y=mt.pheno.CaffeineConsumption,
    x=mt.GT.n_alt_alleles(),
    covariates=[1.0, mt.pheno.isFemale, mt.scores[0], mt.scores[1], mt.scores[2]])
p = hl.plot.qq(gwas.p_value)
show(p)
p = hl.plot.manhattan(gwas.p_value)
show(p)
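
As the name `hwe_normalized_pca` suggests, genotypes are standardized per variant before PCA. The following is a minimal sketch of that standardization for one variant, assuming the common form (g - 2p) / sqrt(2p(1 - p)), where p is the alternate allele frequency, and assuming missing calls are set to 0 after centering. The function name is hypothetical, and Hail's exact scaling and handling of missing data may differ.

```python
import math

def hwe_normalize(alt_counts):
    """Standardize one variant's alt-allele counts under Hardy-Weinberg assumptions.

    `alt_counts` holds 0, 1 or 2 per sample, or None for a missing call.
    """
    called = [g for g in alt_counts if g is not None]
    p = sum(called) / (2 * len(called))  # alternate allele frequency
    if p == 0.0 or p == 1.0:
        # Monomorphic variant: no variance to normalize, so contribute nothing.
        return [0.0] * len(alt_counts)
    scale = math.sqrt(2 * p * (1 - p))  # expected standard deviation under HWE
    return [0.0 if g is None else (g - 2 * p) / scale for g in alt_counts]

# Made-up genotypes for five samples, the last one missing.
normalized = hwe_normalize([0, 1, 2, 1, None])
print(normalized)
```

PCA on the matrix of standardized genotypes then yields per-sample scores, and the leading scores can be passed as covariates in the same way as `mt.scores[0]`, `mt.scores[1]`, and `mt.scores[2]` above.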