# Institute for Behavioral Genetics International Statistical Genetics 2023 Workshop 

## Genetic relatedness exploration with Hail

In this practical, we will learn how to:

1) Simulate random mating in Hail

2) Run various relatedness estimation methods in Hail

3) Use Hail to investigate how these methods work on a structured and admixed dataset

# 1. Import and initialize Hail

We import Hail, initialize it, and import some plotting tools as well.

In [None]:
import hail as hl
hl.init()
from hail.plot import output_notebook, show
from bokeh.models import Slope
output_notebook()

# 2. Read HGDP data, add population information, and run PCA

This bit is a sped-up review of the first notebook. We want the principal component plot as a reference for the relatedness exploration below!

In [None]:
mt = hl.read_matrix_table('resources/hgdp.mt')
sd = hl.import_table('resources/HGDP_sample_data.tsv',
                     key='sample_id',
                     impute=True)
mt = mt.annotate_cols(sample_idx = hl.int(hl.scan.count()), sample_data = sd[mt.s])

In [None]:
_, scores, _ = hl.hwe_normalized_pca(mt.GT)

In [None]:
show(hl.plot.scatter(scores.scores[0], scores.scores[1], 
                     label=mt.index_cols(scores.key).sample_data.continental_pop,
                     size=8,
                     xlabel='PC1',
                     ylabel='PC2'))

# 3. Simulate 3 generations of random mating.

Hail contains a simple random mating simulator that can be a great tool for exploring relatedness and admixture in populations.

The starting dataset is highly **structured** but not very **admixed**. We can see distinct ancestral clusters with gaps between them. If we were to make the same plot from a random sample of the human population, the clusters would be connected by individuals falling along the clines between them.

Here we simulate three rounds of random mating, where each generation creates a number of pairs equal to half its generation size (with replacement, but no self-pairs), and each pair has two children. Each pair is within a single generation; pairs do not span generations. There are no restrictions on pair selection other than that a sample cannot pair with itself.
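
The pairing scheme described above can be sketched in plain Python. This is a toy illustration of the rules as stated in the text, not Hail's implementation: the helper `mate_one_round`, the integer sample IDs, and the 10-founder starting generation are all assumptions made for the example.

```python
import random

def mate_one_round(generation, next_id, rng,
                   pairs_per_generation_multiplier=0.5, children_per_pair=2):
    """Return {child_id: (parent_1, parent_2)} for one round of random mating."""
    n_pairs = int(len(generation) * pairs_per_generation_multiplier)
    parents_of = {}
    for _ in range(n_pairs):
        # Two distinct members of this generation: no self-pairs, but the same
        # individual may be drawn again for a later pair.
        parent_1, parent_2 = rng.sample(generation, 2)
        for _ in range(children_per_pair):
            parents_of[next_id] = (parent_1, parent_2)
            next_id += 1
    return parents_of

rng = random.Random(5)
founders = list(range(10))          # hypothetical founder IDs 0..9
generation, pedigree = founders, {}
for _ in range(3):
    children = mate_one_round(generation, len(founders) + len(pedigree), rng)
    pedigree.update(children)
    generation = list(children)     # the next round pairs only the newest generation

print(len(pedigree), "simulated children over 3 rounds")
```

With a multiplier of 0.5 and two children per pair, every generation has the same size as the one before it (for an even generation size), which is why those two values are a natural default pairing.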

In [None]:
mt1 = hl.simulate_random_mating(mt,
                                pairs_per_generation_multiplier=0.5,
                                children_per_pair=2,
                                n_rounds=3,
                                seed=5).key_cols_by('sample_idx')
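# Look up each sample's original annotations by sample_idx and splat them into
# top-level column fields; persist() caches the result for the cells below.
mt1 = mt1.annotate_cols(**mt.key_cols_by('sample_idx').index_cols(mt1.sample_idx).sample_data).persist()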
mt1.count()

In [None]:
_, scores1, _ = hl.hwe_normalized_pca(mt1.GT)

#### Take a few moments to explore and discuss the PC plot of the new dataset, and where the new samples fall. What structure is present?

In [None]:
show(hl.plot.scatter(scores1.scores[0], scores1.scores[1],
                     label=mt1.index_cols(scores1.key).continental_pop,
                     size=6,
                     xlabel='PC1',
                     ylabel='PC2'))

# 4. Relatedness estimation

Estimation of relatedness between individuals is a core tool in statistical genetics. It is commonly used to check reported relationships, catching both unreported relatives (false negatives) and reported relatives that the genotypes do not support (false positives), and it is an important input to downstream statistical analysis.

Some statistical analyses (for instance, GWAS using a simple linear regression) assume that samples are independent and identically distributed. Related samples violate that assumption, so an estimated relatedness graph can be used to filter samples down to an unrelated set before running these methods.

Other methods, like SAIGE, explicitly account for sample covariance to produce statistically sound effect estimates for datasets with related samples.
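
To make the filtering step concrete, here is a minimal greedy sketch in plain Python: repeatedly drop the sample with the most remaining relatives until no related pair is left. The function name `unrelated_set` and the toy inputs are assumptions for illustration, and the greedy rule is a simple heuristic rather than a guaranteed largest set. On real data, Hail provides `hl.maximal_independent_set` for this job.

```python
def unrelated_set(samples, related_pairs):
    """Greedily remove samples until no related pair remains."""
    neighbors = {s: set() for s in samples}
    for a, b in related_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    kept = set(samples)
    while kept:
        # Sample with the most relatives still in the kept set
        # (ties go to the smallest ID because we scan in sorted order).
        worst = max(sorted(kept), key=lambda s: len(neighbors[s] & kept))
        if not neighbors[worst] & kept:
            break                   # nobody left has a relative in the set
        kept.remove(worst)
    return kept

# Toy example: sample 2 is related to both 1 and 3; sample 4 is unrelated.
print(sorted(unrelated_set([1, 2, 3, 4], [(1, 2), (2, 3)])))
```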

## Coefficient of kinship versus coefficient of relationship.

The kinship coefficient estimated by the methods below is defined as the probability that two homologous alleles, one drawn at random from each of two individuals, are identical by descent. For diploid humans, the kinship coefficient for monozygotic twins is 0.5. The similar "coefficient of relationship", defined as the fraction of genetic material shared identically by descent, is equal to twice the kinship coefficient (1.0 for monozygotic twins).

### Exercise

What is the kinship coefficient for parent-child pairs?

For full sibling pairs?

2nd degree relatives (for example, half siblings)? First cousins (3rd degree relatives)?
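
Once you have worked out your answers by hand, one way to check them is the standard recursive pedigree definition of kinship, sketched below in plain Python. The helper `make_kinship` and the toy pedigree are assumptions for illustration. The sketch also assumes that founders are unrelated and non-inbred, and that every child has a larger ID than both of its parents.

```python
from functools import lru_cache

def make_kinship(parents):
    """parents maps child_id -> (parent_1, parent_2); founders are absent."""
    @lru_cache(maxsize=None)
    def kinship(i, j):
        if i == j:
            if i not in parents:
                return 0.5                      # non-inbred founder
            p1, p2 = parents[i]
            return 0.5 * (1 + kinship(p1, p2))  # (1 + inbreeding) / 2
        if i < j:
            i, j = j, i                         # recurse through the larger ID
        if i not in parents:
            return 0.0                          # founder vs. a non-descendant
        p1, p2 = parents[i]
        return 0.5 * (kinship(p1, j) + kinship(p2, j))
    return kinship

# Toy pedigree: 0, 1, 4, 5 are founders.
pedigree = {
    2: (0, 1), 3: (0, 1),   # 2 and 3 are full siblings
    6: (2, 4), 7: (3, 5),   # 6 and 7 are first cousins
    8: (0, 4),              # 8 is a half sibling of 2 and 3
}
kinship = make_kinship(pedigree)
for pair in [(2, 0), (2, 3), (2, 8), (6, 7)]:
    print(pair, kinship(*pair))
```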

# 5. The KING estimator

Description of the model goes here.
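
As a rough guide to what the estimator computes, here is a minimal plain-Python sketch of the KING-robust between-family kinship statistic, written from the published formula (Manichaikul et al. 2010) and not taken from Hail's implementation. The function name `king_robust_kinship` and the 0/1/2 genotype encoding are assumptions for illustration.

```python
def king_robust_kinship(g_i, g_j):
    """KING-robust between-family kinship estimate for two samples.

    g_i, g_j: equal-length sequences of alternate-allele counts (0, 1 or 2),
    with sites where either genotype is missing already removed.
    Raises ZeroDivisionError if either sample has no heterozygous sites.
    """
    squared_diff = sum((a - b) ** 2 for a, b in zip(g_i, g_j))
    n_het_i = sum(1 for a in g_i if a == 1)
    n_het_j = sum(1 for b in g_j if b == 1)
    # phi = 1/2 - sum((g_i - g_j)^2) / (4 * min(N_het_i, N_het_j))
    return 0.5 - squared_diff / (4 * min(n_het_i, n_het_j))

print(king_robust_kinship([0, 1, 2, 1], [0, 1, 2, 1]))  # identical genotypes
print(king_robust_kinship([0, 0, 1, 1], [2, 2, 1, 1]))  # opposite homozygotes
```

Note that the statistic uses only the two samples' own genotypes: no population allele frequencies enter the formula. Opposite homozygotes contribute the largest squared differences and push the estimate down.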

In [None]:
king = hl.king(mt1.GT).entries()
king = king.filter(king.sample_idx != king.sample_idx_1) # remove self-comparisons

The code below uses the relatedness graph produced by `simulate_random_mating` to look up the true kinship for a given pair, and adds a little Gaussian noise (standard deviation 0.005) to make the plots look more realistic.

In [None]:
rel = mt1.index_globals().relatedness
def get_rel(s1, s2):
    return (hl.case()
            .when(s1 == s2, 0.5)
            .when(s1 > s2, hl.coalesce(rel.get(s1).get(s2), 0.0))
            .default(hl.coalesce(rel.get(s2).get(s1), 0.0))) + hl.rand_norm(0, 0.005) # jitter
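
The lookup logic in `get_rel` can be mirrored with an ordinary nested dictionary. This sketch assumes, as the Hail expression above implies, that each related pair is stored once under the larger sample index and then the smaller one, and that pairs absent from the graph are unrelated. The name `lookup_kinship` and the toy graph are hypothetical, and the jitter is left out.

```python
def lookup_kinship(rel, s1, s2):
    """Return the stored kinship for a pair, or 0.0 if the pair is not in the graph."""
    if s1 == s2:
        return 0.5                   # same convention as get_rel for self-pairs
    larger, smaller = max(s1, s2), min(s1, s2)
    return rel.get(larger, {}).get(smaller, 0.0)

toy_rel = {3: {1: 0.25}}             # samples 1 and 3 have kinship 0.25
print(lookup_kinship(toy_rel, 1, 3), lookup_kinship(toy_rel, 3, 1))
print(lookup_kinship(toy_rel, 2, 1))
```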

Now we'll make a plot!

In [None]:
fig = hl.plot.scatter(get_rel(king.sample_idx, king.sample_idx_1),
                      king.phi, 
                      xlabel='True kinship', 
                      ylabel='King phi (estimated kinship)',
                      hover_fields={'id1': king.sample_idx, 'id2': king.sample_idx_1})

fig.add_layout(Slope(gradient=1, y_intercept=0, line_color='red', line_dash='dashed'))
show(fig)

### Exercise: discuss -- what's wrong with this plot?

# 6. The PC-Relate Estimator

Model info here.
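
As a rough guide to the statistic, here is a minimal plain-Python sketch of the PC-Relate kinship formula (Conomos et al. 2016), given genotypes and individual-specific allele frequencies that have already been predicted from the principal components. It is written from the published formula and is not Hail's implementation. The function name `pc_relate_kinship` and the toy inputs are assumptions, and the sketch omits the per-pair filtering of sites by minimum individual-specific minor allele frequency.

```python
from math import sqrt

def pc_relate_kinship(g_i, g_j, mu_i, mu_j):
    """PC-Relate kinship estimate for one pair of samples.

    g_i, g_j:   alternate-allele counts (0, 1 or 2) at each site.
    mu_i, mu_j: individual-specific allele frequencies at each site,
                strictly between 0 and 1.
    """
    numerator = sum((a - 2 * p) * (b - 2 * q)
                    for a, b, p, q in zip(g_i, g_j, mu_i, mu_j))
    denominator = 4 * sum(sqrt(p * (1 - p) * q * (1 - q))
                          for p, q in zip(mu_i, mu_j))
    return numerator / denominator

freqs = [0.5, 0.5, 0.5, 0.5]         # toy individual-specific frequencies
print(pc_relate_kinship([0, 1, 2, 1], [0, 1, 2, 1], freqs, freqs))
print(pc_relate_kinship([0, 1, 2, 1], [2, 1, 0, 1], freqs, freqs))
```

Unlike the KING sketch, the estimate here depends on the allele frequencies, so the quality of the frequencies predicted from the principal components directly affects the result.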

In [None]:
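# k is the number of principal components used to predict individual-specific allele frequencies
pcrel = hl.pc_relate(mt1.GT, min_individual_maf=0.01, k=2)
# Re-key by the integer sample indices so pairs can be passed to get_rel
pcrel = pcrel.key_by(i=pcrel.i.sample_idx, j=pcrel.j.sample_idx)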

In [None]:
pcrel.describe()

In [None]:
fig = hl.plot.scatter(get_rel(pcrel.i, pcrel.j),
                      pcrel.kin, xlabel='True kinship', ylabel='PC-Relate Kinship',
                      hover_fields={'id1': pcrel.i, 'id2': pcrel.j})
fig.add_layout(Slope(gradient=1, y_intercept=0, line_color='red', line_dash='dashed'))
show(fig)

### Exercise: detective work. Investigate the relationship between the individuals with kinship coefficient ~0.375.

Some useful information is encoded in the column fields of `mt1`. You can show specific samples by editing the code below.

If you've got time after finishing this, do the same for a pair in the cluster with relatedness ~0.185!

In [None]:
mt1.filter_cols(hl.literal([50, 100, 1000]).contains(mt1.sample_idx)).cols().show()

### Exercise: model interrogation. 

PC-Relate uses an explicit `k` term to control how many principal components are used in the individual allele frequency predictions. Your job is to rerun the above, starting from the section **6** header, with various `k` values to interrogate how the number of principal components included affects the results.

What **k** seems best? What happens when **k** is small? Large?