# Hardy-Weinberg Equilibrium

In [1]:
import hail as hl
hl.init()

Running on Apache Spark version 2.4.6
SparkUI available at http://hms-beagle-5466c684ff-d8mgh:4043
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.58-3f304aae6ce2
LOGGING: writing to /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/notebooks/hail-20201124-1255-0.2.58-3f304aae6ce2.log


In [2]:
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
output_notebook()

In [3]:
mt = hl.read_matrix_table('/home/olavur/experiments/2020-11-13_fargen1_exome_analysis/data/mt/variants.mt')

In [21]:
distinct_allele_counts = mt.aggregate_rows(hl.agg.counter(hl.len(mt.alleles)))
distinct_allele_counts

{5: 2520, 6: 1015, 2: 902052, 7: 524, 3: 35254, 4: 7184}
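
To put these counts in perspective, the share of biallelic sites can be worked out directly from the dictionary above. This is a minimal sketch in plain Python (no Hail required); the variable names `n_sites`, `n_biallelic` and `n_multiallelic` are illustrative and not part of the original analysis.

```python
# Allele-count distribution copied from the output above.
# Key: number of alleles at a site (reference included). Value: number of sites.
distinct_allele_counts = {5: 2520, 6: 1015, 2: 902052, 7: 524, 3: 35254, 4: 7184}

n_sites = sum(distinct_allele_counts.values())
n_biallelic = distinct_allele_counts[2]
n_multiallelic = n_sites - n_biallelic

print(f"total sites:         {n_sites}")
print(f"biallelic sites:     {n_biallelic} ({n_biallelic / n_sites:.1%})")
print(f"multi-allelic sites: {n_multiallelic} ({n_multiallelic / n_sites:.1%})")
```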

Remove multi-allelic sites, keeping only the 902052 sites with exactly two alleles (one reference, one alternate).

**NOTE:** it's possible to perform HWE tests on all alleles in multi-allelic variants by splitting the variants using [split_multi()](https://hail.is/docs/0.2/methods/genetics.html#hail.methods.split_multi).

In [32]:
mt_biallelic = mt.filter_rows(hl.len(mt.alleles) == 2)
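
As an illustration of the splitting mentioned in the note above, here is a rough sketch in plain Python of what turning one multi-allelic site into several biallelic records could look like. The function `split_multiallelic` is hypothetical, and the downcoding rule (the allele of interest becomes 1, every other allele becomes 0) is an assumption modelled on how Hail's `split_multi_hts` is documented to treat `GT`; it ignores details such as allele trimming, so check the Hail documentation before relying on it.

```python
def split_multiallelic(alleles, genotype):
    """Split one site into biallelic records, downcoding the genotype.

    alleles:  list of allele strings, reference first, e.g. ['A', 'C', 'T']
    genotype: tuple of allele indices for one sample, e.g. (1, 2),
              or None if the call is missing
    Returns a list of (ref, alt, downcoded_genotype), one per alternate allele.
    """
    ref = alleles[0]
    records = []
    for alt_index in range(1, len(alleles)):
        if genotype is None:
            downcoded = None
        else:
            # The allele of interest becomes 1; all other alleles count as reference (0).
            downcoded = tuple(sorted(1 if a == alt_index else 0 for a in genotype))
        records.append((ref, alleles[alt_index], downcoded))
    return records


# A sample carrying both alternate alleles at a tri-allelic site.
for record in split_multiallelic(['A', 'C', 'T'], (1, 2)):
    print(record)
```

Under this convention, an HWE test on a split record compares one alternate allele against all other alleles combined.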

Perform the HWE test and annotate the rows of the `MatrixTable` with the result. The new row field `hwe` holds the expected heterozygote frequency under HWE (`het_freq_hwe`) and the p-value from the test (`p_value`).

In [33]:
mt_biallelic = mt_biallelic.annotate_rows(hwe=hl.agg.hardy_weinberg_test(mt_biallelic.GT))
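
To make the two quantities concrete, here is a plain-Python sketch of the textbook calculation for a single biallelic site: the expected heterozygote frequency 2pq, and a two-sided exact test that sums the probabilities of all heterozygote counts no more likely than the observed one. The function names are hypothetical. This is not Hail's implementation: as far as I understand the Hail documentation, `hardy_weinberg_test` applies a mid-p-value correction that this sketch omits, and `het_freq_hwe` may differ slightly from 2pq, so the numbers will not match exactly.

```python
from math import exp, lgamma, log


def expected_het_freq(n_hom_ref, n_het, n_hom_var):
    """Textbook expected heterozygote frequency 2pq (requires at least one call)."""
    n = n_hom_ref + n_het + n_hom_var
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p                            # alternate allele frequency
    return 2 * p * q


def _log_factorial(k):
    return lgamma(k + 1)


def het_probabilities(n_hom_ref, n_het, n_hom_var):
    """Probability of each possible heterozygote count, given the allele counts."""
    n = n_hom_ref + n_het + n_hom_var
    n_a = 2 * n_hom_ref + n_het  # copies of the reference allele
    n_b = 2 * n - n_a            # copies of the alternate allele
    probs = {}
    # The heterozygote count always has the same parity as the allele counts.
    for h in range(n_a % 2, min(n_a, n_b) + 1, 2):
        hom_a = (n_a - h) // 2
        hom_b = (n_b - h) // 2
        log_p = (_log_factorial(n) + _log_factorial(n_a) + _log_factorial(n_b)
                 + h * log(2)
                 - _log_factorial(hom_a) - _log_factorial(h) - _log_factorial(hom_b)
                 - _log_factorial(2 * n))
        probs[h] = exp(log_p)
    return probs


def hwe_exact_p_value(n_hom_ref, n_het, n_hom_var):
    """Two-sided exact p-value, without mid-p correction."""
    probs = het_probabilities(n_hom_ref, n_het, n_hom_var)
    observed = probs[n_het]
    # Small relative tolerance so that ties with the observed outcome are included.
    total = sum(p for p in probs.values() if p <= observed * (1 + 1e-9))
    return min(1.0, total)


counts = (25, 50, 25)  # hom-ref, het, hom-var
print(expected_het_freq(*counts), hwe_exact_p_value(*counts))
```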

For each biallelic site, count the heterozygous and homozygous calls, and compute the observed proportion of heterozygotes among those calls.

In [43]:
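# Count heterozygous calls, then all calls that are not heterozygous (homozygotes at diploid sites).
mt_biallelic = mt_biallelic.annotate_rows(n_het=hl.agg.count_where(mt_biallelic.GT.is_het()))
mt_biallelic = mt_biallelic.annotate_rows(n_hom=hl.agg.count_where(~mt_biallelic.GT.is_het()))
# Observed proportion of heterozygotes among the counted calls.
mt_biallelic = mt_biallelic.annotate_rows(het_freq=mt_biallelic.n_het / (mt_biallelic.n_het + mt_biallelic.n_hom))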
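
The same per-site bookkeeping can be sketched in plain Python for a handful of genotypes. The helper `het_summary` is hypothetical, and it assumes that missing calls are left out of both counts, which is how I understand `hl.agg.count_where` to treat a missing predicate.

```python
def het_summary(genotypes):
    """Count heterozygous and homozygous calls and the observed het proportion.

    genotypes: iterable of (allele, allele) index pairs, with None for a missing call.
    """
    called = [gt for gt in genotypes if gt is not None]
    n_het = sum(1 for a, b in called if a != b)
    n_hom = len(called) - n_het
    # With no called genotypes the proportion is undefined.
    het_freq = n_het / len(called) if called else float('nan')
    return n_het, n_hom, het_freq


# Five samples at one site, one of them with a missing call.
print(het_summary([(0, 0), (0, 1), (1, 1), None, (0, 1)]))
```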

Plot the p-values from the HWE test as a Q-Q plot. The result looks badly off, which suggests that something went wrong. One possible explanation is that the dataset simply has too few samples.

In [54]:
p = hl.plot.qq(mt_biallelic.hwe.p_value, title='Q-Q plot of HWE p-values')
p.plot_height = 500
p.plot_width = 500
show(p)

2020-11-24 13:32:55 Hail: INFO: Ordering unsorted dataset with network shuffle
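
For reference, a Q-Q plot of p-values compares the sorted observed values with the quantiles expected if the p-values were uniformly distributed, both on a -log10 scale; points on the diagonal mean the two agree. The sketch below computes such coordinates in plain Python. The function `qq_points` is hypothetical, and the plotting position rank / (n + 1) is one common choice, not necessarily the one `hl.plot.qq` uses. Keep in mind that p-values from an exact test are discrete, so with few samples they need not be uniform even when every site is in equilibrium.

```python
from math import log10


def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs for a Q-Q plot.

    Assumes uniform p-values under the null hypothesis and uses
    rank / (n + 1) as the expected quantile of the rank-th smallest value.
    Missing (None) and zero p-values are dropped.
    """
    observed = sorted(p for p in p_values if p is not None and p > 0)
    n = len(observed)
    return [(-log10(rank / (n + 1)), -log10(p))
            for rank, p in enumerate(observed, start=1)]


# Evenly spread p-values fall exactly on the diagonal.
for expected, obs in qq_points([0.5, 0.25, 0.75]):
    print(f"expected={expected:.3f} observed={obs:.3f}")
```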


In [55]:
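# Observed heterozygote proportion (x) against the frequency expected under HWE (y).
p = hl.plot.scatter(mt_biallelic.het_freq, mt_biallelic.hwe.het_freq_hwe,
                    xlabel='Observed het. frequency', ylabel='Expected het. frequency (HWE)')
show(p)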