# Inbreeding and structure

Can population structure explain inbreeding?

In [1]:
import hail as hl
hl.init()

Running on Apache Spark version 2.4.6
SparkUI available at http://hms-beagle-5466c684ff-2l8nm:4043
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.58-3f304aae6ce2
LOGGING: writing to /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/fargen-1-exome/notebooks/hail-20201203-0908-0.2.58-3f304aae6ce2.log


In [2]:
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
from bokeh.models.scales import LogScale
output_notebook()

Load the LD-pruned *diallelic* variant file.

In [13]:
mt = hl.read_matrix_table('/home/olavur/experiments/2020-11-13_fargen1_exome_analysis/data/mt/ld_pruned_diallelic.mt')

Perform PCA on the genotype matrix, computing the eigenvalues, the scores, and the loadings. Keep 4 components.

In [14]:
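# PCA on alternate allele counts (0, 1 or 2); assumes missing calls have been filtered out or imputed beforehand.
eigenvalues, scores, loadings = hl.pca(mt.GT.n_alt_alleles(), k=4, compute_loadings=True)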

2020-12-03 15:05:35 Hail: INFO: pca: running PCA with 4 components...
2020-12-03 15:05:38 Hail: INFO: Coerced sorted dataset


In [15]:
mt = mt.annotate_cols(scores = scores[mt.s].scores)
mt = mt.annotate_rows(loadings = loadings[mt.locus, mt.alleles].loadings)

**TODO:** LD pruning.

**TODO:** I need to capture population structure. Get more samples and get the region of each sample.

**NOTE:** I did not use `hwe_normalized_pca` here because I suspect the normalization might remove any correlation between $F$ and the principal components.
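To make the concern in the note concrete, here is a minimal pure-Python sketch of the usual HWE standardization of a single variant, $(g - 2p) / \sqrt{2p(1-p)}$, where $g$ is the alternate allele count and $p$ the alternate allele frequency. The function name `hwe_normalize` and the toy genotypes are hypothetical, and Hail's `hwe_normalized_pca` may apply a slightly different scaling (for example an extra constant factor), so treat this as an illustration rather than as what Hail does internally. The point is that the divisor is built from the same $2p(1-p)$ term that appears in the denominator of $F$.

```python
import math

def hwe_normalize(genotypes):
    """Standardize one variant's alt-allele counts (0, 1, 2; None = missing)."""
    called = [g for g in genotypes if g is not None]
    # Alternate allele frequency estimated from the called genotypes only.
    p = sum(called) / (2 * len(called))
    if p == 0.0 or p == 1.0:
        raise ValueError("monomorphic variant cannot be normalized")
    # Standard deviation of the allele count expected under HWE.
    sd = math.sqrt(2 * p * (1 - p))
    # Missing genotypes become 0, i.e. they are imputed to the mean after centering.
    return [0.0 if g is None else (g - 2 * p) / sd for g in genotypes]

# Hypothetical toy variant: five called genotypes and one missing call.
normalized = hwe_normalize([0, 1, 2, 1, None, 0])
print(normalized)
```

After this transformation every variant has mean zero and is scaled by its expected HWE spread, which is why it could plausibly mask a relationship with $F$.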

In [25]:
p = hl.plot.scatter(mt.scores[0], mt.scores[1], title='PCA', xlabel='PC1', ylabel='PC2', hover_fields={'Sample': mt.s})
show(p)

In [34]:
mt.loadings[0].summarize()

| Statistic | Value |
|---|---|
| Non-missing | 54738 (100.00%) |
| Missing | 0 |
| Minimum | 0.00 |
| Maximum | 0.00 |
| Mean | 0.00 |
| Std Dev | 0.00 |


In [None]:
p = hl.plot.histogram(mt.variant_qc.dp_stats.mean, range=(0,100), legend='Mean DP per variant histogram')
show(p)

In [22]:
mt_small = mt.sample_rows(0.001)

In [26]:
p = hl.plot.scatter(mt_small.loadings[0], mt_small.loadings[1], title='PCA loadings', xlabel='PC1 loading', ylabel='PC2 loading', hover_fields={'RSID': mt_small.rsid})
show(p)

## Heterozygosity

Perform a Hardy-Weinberg equilibrium (HWE) test on each variant, annotating the `MatrixTable` with the expected heterozygosity $2 f_A f_a$.

In [19]:
# Perform HWE test.
mt = mt.annotate_rows(hwe=hl.agg.hardy_weinberg_test(mt.GT))

Calculate actual heterozygosity $f_{Aa}$.

In [20]:
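# Calculate the heterozygosity rate among called genotypes, so that missing calls do not count as non-heterozygous.
mt = mt.annotate_rows(het_freq=hl.agg.filter(hl.is_defined(mt.GT), hl.agg.fraction(mt.GT.is_het())))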

Compute the reduction in heterozygosity $F = \frac{2f_A f_a - f_{Aa}}{2f_A f_a}$.

In [21]:
mt = mt.annotate_rows(
    hwe_inbreeding=(mt.hwe.het_freq_hwe - mt.het_freq) / mt.hwe.het_freq_hwe)
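As a sanity check on the formula, the same quantity can be worked out by hand from genotype counts. The sketch below is plain Python with a hypothetical helper `inbreeding_coefficient` and made-up counts; it uses the textbook $2 f_A f_a$ for the expected heterozygosity, whereas Hail's `het_freq_hwe` may differ slightly from that value in small samples.

```python
def inbreeding_coefficient(n_hom_ref, n_het, n_hom_alt):
    """F = (2 f_A f_a - f_Aa) / (2 f_A f_a) from genotype counts at one variant."""
    n = n_hom_ref + n_het + n_hom_alt
    # Allele frequencies: each heterozygote carries one alt allele, each hom-alt two.
    f_alt = (n_het + 2 * n_hom_alt) / (2 * n)
    f_ref = 1 - f_alt
    expected_het = 2 * f_ref * f_alt   # 2 f_A f_a
    observed_het = n_het / n           # f_Aa
    # Undefined (division by zero) for monomorphic variants.
    return (expected_het - observed_het) / expected_het

f_hwe = inbreeding_coefficient(25, 50, 25)      # exact HWE proportions
f_deficit = inbreeding_coefficient(40, 20, 40)  # too few heterozygotes
f_excess = inbreeding_coefficient(10, 80, 10)   # too many heterozygotes
print(f_hwe, f_deficit, f_excess)
```

With these counts the allele frequency is 0.5 in all three cases, so only the observed heterozygosity changes. $F$ is zero at HWE, positive for a heterozygote deficit, and negative for a heterozygote excess.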

Subsample the rows to avoid plotting every variant.

In [22]:
mt_small = mt.sample_rows(0.001)

Plot the reduction in heterozygosity against the variant loadings of the first four principal components.

Sadly, nothing interesting shows up here. However, we probably have too few samples to detect population structure, so a larger sample might reveal something.

In [23]:
plot_list = []
for comp in range(4):
    # F can be zero or negative, so it is plotted on a linear axis rather than a log axis.
    p = hl.plot.scatter(mt_small.hwe_inbreeding, mt_small.loadings[comp],
                        xlabel='Inbreeding (F)', ylabel='PC{comp} loading'.format(comp=comp+1),
                        title='Reduction in heterozygosity as explained by principal components',
                        hover_fields={'rsid': mt_small.rsid})
    plot_list.append(p)

In [24]:
show(gridplot(plot_list, ncols=2, plot_width=500, plot_height=400))
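A scatter plot can hide a weak trend, so one way to back up the visual impression is to compute a correlation coefficient between $F$ and each component's loadings. The sketch below is a plain-Python Pearson correlation; the helper `pearson` and the toy values are hypothetical and are not taken from this data set. In practice the two lists would hold the per-variant `hwe_inbreeding` and loading values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values standing in for per-variant F and PC1 loadings.
f_values = [-0.2, 0.0, 0.1, 0.3, 0.5]
pc1_loadings = [0.01, -0.02, 0.00, 0.02, -0.01]
r = pearson(f_values, pc1_loadings)
print(r)
```

A value of `r` close to zero for every component would support the reading that the loadings carry no linear signal about $F$, although with few variants or few samples the estimate is noisy.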