# Inbreeding and structure

Can population structure explain inbreeding?

In [1]:
import hail as hl
hl.init()

Running on Apache Spark version 2.4.6
SparkUI available at http://hms-beagle-5466c684ff-d8mgh:4045
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.58-3f304aae6ce2
LOGGING: writing to /home/olavur/experiments/2020-11-13_fargen1_exome_analysis/notebooks/hail-20201125-1340-0.2.58-3f304aae6ce2.log


In [2]:
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
from bokeh.models.scales import LogScale
output_notebook()

In [3]:
mt = hl.read_matrix_table('/home/olavur/experiments/2020-11-13_fargen1_exome_analysis/data/mt/variants.mt')

Use only chromosome 22, just to make things run faster.

In [4]:
mt = mt.filter_rows(mt.locus.contig == 'chr22')

In [5]:
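# Note: the entry expression is 1 where a genotype call is present and 0 where it is
# missing, so these components summarize call missingness rather than genotype dosage.
eigenvalues, scores, loadings = hl.pca(hl.int(hl.is_defined(mt.GT)), k=4, compute_loadings=True)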

2020-11-25 13:41:10 Hail: INFO: pca: running PCA with 4 components...
2020-11-25 13:41:12 Hail: INFO: Coerced sorted dataset


In [13]:
mt = mt.annotate_cols(scores = scores[mt.s].scores)
mt = mt.annotate_rows(loadings = loadings[mt.locus, mt.alleles].loadings)

**TODO:** To capture population structure, I need more samples, along with the region of each sample.

**NOTE:** I did not use `hwe_normalized_pca` here because I suspect that the HWE normalization could make any correlation between $F$ and the principal components disappear.
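
For reference, the HWE normalization is commonly written as below. This is a sketch of the formula as I understand the Hail documentation for `hwe_normalized_pca` to describe it, and it is worth verifying against the documentation for the installed version. Here $C_{ij}$ is the number of alternate alleles of sample $i$ at variant $j$, $p_j$ is the alternate allele frequency of variant $j$, and $m$ is the number of variants.

$$
M_{ij} = \frac{C_{ij} - 2 p_j}{\sqrt{2 p_j (1 - p_j)\, m}}
$$

The denominator contains $2 p_j (1 - p_j)$, the expected heterozygosity under HWE, which is the same quantity that appears in the denominator of $F$. That shared term is why the normalization could plausibly change how $F$ relates to the components.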

In [7]:
p = hl.plot.scatter(mt.scores[0],
                    mt.scores[1],
                    title='PCA', xlabel='PC1', ylabel='PC2')
show(p)

## Heterozygosity

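Run a Hardy-Weinberg equilibrium (HWE) test on each variant, annotating the `MatrixTable` with the result. The result includes the expected heterozygosity $2 f_A f_a$ in the field `het_freq_hwe`.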

In [15]:
# Filter multi-allelic sites out.
mt_biallelic = mt.filter_rows(hl.len(mt.alleles) == 2)

# Perform HWE test.
mt_biallelic = mt_biallelic.annotate_rows(hwe=hl.agg.hardy_weinberg_test(mt_biallelic.GT))

Calculate the observed heterozygosity $f_{Aa}$, the fraction of called genotypes that are heterozygous.

In [16]:
# Calculate heterozygosity rate.
# Number of heterozygotes.
mt_biallelic = mt_biallelic.annotate_rows(n_het=hl.agg.count_where(mt_biallelic.GT.is_het()))
# Number of homozygotes.
mt_biallelic = mt_biallelic.annotate_rows(n_hom=hl.agg.count_where(~mt_biallelic.GT.is_het()))
# Heterozygote frequency.
mt_biallelic = mt_biallelic.annotate_rows(het_freq=mt_biallelic.n_het / (mt_biallelic.n_het + mt_biallelic.n_hom))

Compute the reduction in heterozygosity $F = \frac{2f_A f_a - f_{Aa}}{2f_A f_a}$.

In [17]:
mt_biallelic = mt_biallelic.annotate_rows(
    hwe_inbreeding=(mt_biallelic.hwe.het_freq_hwe - mt_biallelic.het_freq) / mt_biallelic.hwe.het_freq_hwe)
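
To make the three steps above concrete, here is a minimal sketch in plain Python that computes the expected heterozygosity, the observed heterozygosity, and $F$ for a single biallelic site. The function name and the genotype counts are hypothetical and only serve as an illustration. It uses the textbook value $2 f_A f_a$; Hail's `het_freq_hwe` may apply a finite-sample correction, so its values can differ slightly.

```python
def inbreeding_coefficient(n_hom_ref, n_het, n_hom_alt):
    """Return (expected_het, observed_het, F) for one biallelic site.

    F is None when the expected heterozygosity is zero (monomorphic site),
    because the ratio is undefined there.
    """
    n = n_hom_ref + n_het + n_hom_alt
    # Allele frequencies from genotype counts: each sample carries two alleles.
    f_ref = (2 * n_hom_ref + n_het) / (2 * n)
    f_alt = 1.0 - f_ref
    # Expected heterozygote frequency under HWE: 2 * f_A * f_a.
    expected_het = 2 * f_ref * f_alt
    # Observed heterozygote frequency f_Aa.
    observed_het = n_het / n
    if expected_het == 0:
        return expected_het, observed_het, None
    f_stat = (expected_het - observed_het) / expected_het
    return expected_het, observed_het, f_stat


# Hypothetical counts: 30 hom-ref, 20 het, 50 hom-alt samples.
expected_het, observed_het, f_stat = inbreeding_coefficient(30, 20, 50)
print(expected_het, observed_het, f_stat)
```

A positive $F$ means fewer heterozygotes than HWE predicts, $F = 0$ means the site matches HWE, and a negative $F$ means an excess of heterozygotes.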

Randomly keep about 5% of the rows, just to avoid plotting a million variants.

In [18]:
mt_small = mt_biallelic.sample_rows(0.05)

Plot each variant's reduction in heterozygosity against its loadings on the first four principal components.

Sadly, nothing interesting shows up here. The sample is probably too small to detect population structure, though, so increasing the sample size might reveal something.

In [19]:
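plot_list = []
for comp in range(0, 4):
    # One scatter plot per component: per-variant F against the variant's loading.
    p = hl.plot.scatter(mt_small.hwe_inbreeding, mt_small.loadings[comp],
                        xlabel='Inbreeding (F) (log10 scale)', ylabel='PC {comp}'.format(comp=comp+1),
                        title='Reduction in heterozygosity as explained by principal components')
    # Note: F is zero or negative for variants without a heterozygote deficit;
    # those points cannot be drawn on a log axis.
    p.x_scale = LogScale()
    plot_list.append(p)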

2020-11-25 13:45:17 Hail: INFO: reading 1 of 5 data partitions
2020-11-25 13:45:21 Hail: INFO: reading 1 of 5 data partitions
2020-11-25 13:45:24 Hail: INFO: reading 1 of 5 data partitions
2020-11-25 13:45:27 Hail: INFO: reading 1 of 5 data partitions


In [62]:
show(gridplot(plot_list, ncols=2, plot_width=500, plot_height=400))
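
As a numerical complement to the scatter plots, one option is to compute a correlation coefficient between $F$ and each component's loadings. The sketch below is plain Python and assumes the per-variant values have already been collected into two lists. The function name and the example values are hypothetical and not taken from this dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two sequences, skipping pairs with None or NaN."""
    pairs = [(x, y) for x, y in zip(xs, ys)
             if x is not None and y is not None
             and not math.isnan(x) and not math.isnan(y)]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    # The correlation is undefined when either variable is constant.
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-variant values, for illustration only.
f_values = [0.10, -0.05, 0.30, float('nan'), 0.00]
pc1_loadings = [0.02, -0.01, 0.06, 0.03, 0.00]
print(pearson_r(f_values, pc1_loadings))
```

Unlike the log-scaled plots, this keeps variants with zero or negative $F$, and it skips variants where $F$ is undefined.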