# Practical 3: Unmasking ancestry labels

Objectives:
- Plot principal components obtained by running PCA on our dataset.
- Use principal components to identify population clusters.
- Reidentify the populations of samples with missing labels using the population clusters.

First, we initialize Hail and set up plots to display inline in the notebook.

In [None]:
import hail as hl
import bokeh.layouts  # gridplot lives in the bokeh.layouts submodule
from hail.plot import show, output_notebook
hl.init()
output_notebook()

## Read in QC'ed data and PCA scores

First, we'll need to read back in the sample annotations and the Principal Components Analysis (PCA) scores from the previous practical.

In [None]:
pca_scores = hl.read_table('resources/pca_scores.ht')

sa = hl.import_table('resources/1kg_annotations.txt', 
                     impute=True, 
                     key='s')

Now, we'll take the first 4 PCs from the PCA table, and add the population information for each sample from our dataset.

In [None]:
ht = pca_scores.select(PC1=pca_scores.scores[0],
                       PC2=pca_scores.scores[1],
                       PC3=pca_scores.scores[2],
                       PC4=pca_scores.scores[3])
ht = ht.annotate(pheno = sa[ht.s])

The five populations present in this dataset are `AFR`, `AMR`, `EAS`, `EUR`, and `SAS`. These are three-letter codes from the 1000 Genomes Project denoting the [super population of each sample](https://www.internationalgenome.org/category/population/).

## Visualize!

Let's plot several combinations of the first four principal components (PCs) against each other. This will help us visualize the population structure of our dataset and let us try to match our masked samples to population clusters. Note that since the plots generated by the `hl.plot` module use the `bokeh` plotting library internally, we can use `bokeh` functions like `gridplot` to arrange our plots.

In [None]:
p1 = hl.plot.scatter(ht.PC1, ht.PC2, xlabel='PC1', ylabel='PC2', label=ht.pheno.super_population, size=6)
p2 = hl.plot.scatter(ht.PC1, ht.PC3, xlabel='PC1', ylabel='PC3', label=ht.pheno.super_population, size=6)
p3 = hl.plot.scatter(ht.PC2, ht.PC4, xlabel='PC2', ylabel='PC4', label=ht.pheno.super_population, size=6)


show(bokeh.layouts.gridplot([[p1], [p2], [p3]]))

## Reidentify samples with missing ancestry based on PCA scores

Now that we can see how the populations are decomposed by the PCs, let's try to reidentify the masked samples.

First, we'll define a grading function that checks your assignments against the true population of each masked sample. (The `check` function reports how many masked samples you have correctly identified.)

In [None]:
_true_labels = hl.import_table('resources/true_pops.txt', key='s').cache()
def check(ht):
    """Report how many masked samples were correctly reidentified in `ht.unmasked`."""
    ht = ht.annotate(true_pop = _true_labels[ht.s].real_super_population)
    # Count (assigned, true) label pairs, restricted to samples whose label was masked.
    c = ht.aggregate(hl.agg.filter(hl.is_missing(ht.pheno.super_population), 
                                   hl.agg.counter((ht.unmasked, ht.true_pop))))
    n_correct = sum(count for k, count in c.items() if k[0] == k[1])
    n_wrong = sum(count for k, count in c.items() if k[0] != k[1])
    print(f'Correctly identified {n_correct} / {n_correct + n_wrong} masked samples.')
    print()
    
    # Break down the mistakes: wrong label vs. no label at all.
    for (unm, true), n in c.items():
        if unm != true:
            if unm is not None:
                print(f'Incorrectly assigned {n} {true} samples as {unm}.')
            else:
                print(f'Left {n} {true} samples unassigned.')
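
To make the grading arithmetic concrete, here is a small plain-Python sketch of the tally that `check` performs on the counter result. The dictionary below is made up for illustration (it is not output from this dataset), and the name `example_counts` is hypothetical. It assumes the counter comes back as a dictionary keyed by `(assigned label, true label)` pairs, as the code above treats it.

```python
# Hypothetical counter result: keys are (assigned label, true label) pairs,
# values are sample counts. These numbers are invented for illustration only.
example_counts = {
    ('EAS', 'EAS'): 3,   # assigned correctly
    ('EUR', 'AMR'): 1,   # assigned, but to the wrong population
    (None, 'AFR'): 2,    # left unassigned (missing label)
}

# A sample counts as correct only when the assigned label equals the true label.
n_correct = sum(n for (assigned, true_pop), n in example_counts.items()
                if assigned == true_pop)
n_wrong = sum(n for (assigned, true_pop), n in example_counts.items()
              if assigned != true_pop)

print(f'Correctly identified {n_correct} / {n_correct + n_wrong} masked samples.')
```

Note that unassigned samples (a `None` label) count against you in the same way as wrongly assigned ones.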

## Fill in the code below

Your job is to expand the code below to reidentify the population labels. One of the populations has already been filled in as an example.

### `case().when()` in Hail

The `case` / `when` / `default` pattern you see below is Hail's way of writing `if` / `else if` / `else`. The resulting `unmasked` field takes the value of the first `when` whose predicate is true, or the `default` value if none of them match.
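
As a rough plain-Python analogy (this is not Hail code), the first-match behaviour can be sketched as follows. The `first_match` helper, the PC values, and the labels are all made up for illustration. Because the values here are ordinary Python floats, the sketch uses `and` rather than the operators Hail expressions require.

```python
def first_match(rules, default):
    """Return the label of the first rule whose predicate is true, else the default."""
    for predicate, label in rules:
        if predicate:
            return label
    return default

# Made-up PC values, chosen only so that both predicates below are true.
pc1, pc2 = -0.05, 0.3

rules = [
    (pc2 > 0.2 and pc1 < 0, 'first'),   # true, so this label wins
    (pc2 > 0.1, 'second'),              # also true, but never reached
]
result = first_match(rules, default=None)
print(result)  # first

# When no predicate is true, the default is returned instead.
fallback = first_match([(pc1 > 1.0, 'never')], default='unassigned')
print(fallback)  # unassigned
```

The practical consequence is that the order of your `when` clauses matters whenever two predicates overlap: a sample is labelled by the earliest clause it satisfies.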

### A note on `&` and `|`

Python uses `and` and `or` for logical operators. Hail expressions use `&` for 'and' and `|` for 'or'.

This can lead to some confusion, because in Python `&` and `|` bind more tightly than the comparison operators `>`, `<`, `==`, and `!=`. Whenever a comparison appears as an operand of `&` or `|`, you will need to wrap the comparison in parentheses.

Suppose we want to write code that returns true when "PC1 is greater than 0.1 or PC2 is less than 0.2":

**correct**:

```
(ht.PC1 > 0.1) | (ht.PC2 < 0.2)
```

**incorrect**:
```
ht.PC1 > 0.1 or ht.PC2 < 0.2
ht.PC1 > 0.1 | ht.PC2 < 0.2
(ht.PC1 > 0.1) or (ht.PC2 < 0.2)
```
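
To see why the parentheses matter, the sketch below uses Python's standard `ast` module to show how Python itself parses the two spellings. It does not involve Hail at all; `pc1` and `pc2` are just placeholder names standing in for `ht.PC1` and `ht.PC2`.

```python
import ast

# Parse both spellings as plain Python expressions (nothing is evaluated).
unparenthesized = ast.parse('pc1 > 0.1 | pc2 < 0.2', mode='eval').body
parenthesized = ast.parse('(pc1 > 0.1) | (pc2 < 0.2)', mode='eval').body

# Without parentheses: one chained comparison, pc1 > (0.1 | pc2) < 0.2,
# because `|` binds more tightly than `>` and `<`.
print(type(unparenthesized).__name__)                 # Compare
print(type(unparenthesized.comparators[0]).__name__)  # BinOp

# With parentheses: a single `|` joining two separate comparisons.
print(type(parenthesized).__name__)       # BinOp
print(type(parenthesized.left).__name__)  # Compare
```

In the unparenthesized form, Python tries to compute `0.1 | pc2` first, which is not the comparison we intended.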

### To think about

Which population is hardest to reidentify? Why?

In [None]:
check(ht.annotate(
    unmasked = hl.case()
        .when((ht.PC2 > 0.2) & (ht.PC1 < 0), 'EAS')
#         .when(..., 'AFR')
#         .when(..., 'AMR')
#         .when(..., 'EUR')
#         .when(..., 'SAS')
        .default(ht.pheno.super_population)
))