### Notebook 01, Exercise 1
Take a few moments to explore the interactive representation of the matrix table above.

* Where is the variant information (`locus` and `alleles`)? 
* Where is the sample identifier (`s`)?
* Where is the genotype quality `GQ`?

<font color='green'><b>Solution: a MatrixTable has a row component, a column component, and an entry component. We represent sequencing data with variants as the rows, samples as the columns, and genotypes as the entries. Thus, the variant information 'locus' and 'alleles' are in the row component, the sample identifier 's' is in the column component, and the genotype quality GQ is in the entry component.</b></font>

### Notebook 01, Exercise 2

There is a fourth value seen above, other than `0/0`, `0/1`, `1/1`. What is it?

<font color='green'><b>Solution: the fourth value is NA, or missing. Hail supports missing values anywhere, and many QC and statistical applications require handling missingness correctly to generate correct results.</b></font>
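
To make the point about missingness concrete, here is a minimal pure-Python sketch (not Hail code). The genotype list and both function names are hypothetical, chosen only for illustration. It shows how treating missing calls as homozygous reference biases an allele-frequency estimate.

```python
# Hypothetical genotypes for one variant: alt-allele counts, None = missing call.
genotypes = [0, 1, None, 2, None, 1]

def alt_allele_frequency(calls):
    """Alt-allele frequency over non-missing diploid calls only."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return None  # the frequency is itself missing when no call is observed
    return sum(observed) / (2 * len(observed))

def naive_alt_allele_frequency(calls):
    """Incorrect version: silently treats missing calls as homozygous reference."""
    return sum(g or 0 for g in calls) / (2 * len(calls))

call_rate = sum(g is not None for g in genotypes) / len(genotypes)

print(alt_allele_frequency(genotypes))        # missing calls excluded
print(naive_alt_allele_frequency(genotypes))  # biased towards zero
print(call_rate)
```

The naive version underestimates the frequency because the two missing calls are counted as four reference alleles.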

### Notebook 01, Exercise 3

In the empty cell below, summarize some of the other fields on the matrix table. You can use the interactive widget above to find the names of some of the other fields.

<font color='green'><b>Solution: mt.GQ.summarize(), mt.alleles.summarize(), mt.locus.summarize()</b></font>

### Notebook 01, Exercise 4

There's a lot of information in the above output. Take a moment to look through it, and remember, these are **bad-quality variants**. Why do these variants have such low HWE p-values? *Hint: scroll all the way to the right to the variant_qc output*.

<font color='green'><b>Solution: If we scroll all the way to the right, we can see that these variants contain only heterozygous calls, with no homozygous reference or homozygous alternate calls. Hardy-Weinberg equilibrium predicts a mix of all three genotypes at any allele frequency, so these sites are almost certainly affected by mapping errors.</b></font>
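
As a rough illustration of why an all-heterozygous site fails HWE so badly, the sketch below computes a one-degree-of-freedom chi-square HWE test from genotype counts. This is a textbook approximation used as a stand-in, not Hail's implementation, and the function name is hypothetical. It assumes at least one genotype call is present.

```python
import math

def hwe_chi_square_p(n_hom_ref, n_het, n_hom_alt):
    """Chi-square (1 df) HWE p-value from genotype counts; assumes n > 0."""
    n = n_hom_ref + n_het + n_hom_alt
    p_ref = (2 * n_hom_ref + n_het) / (2 * n)
    p_alt = 1.0 - p_ref
    # Expected genotype counts under Hardy-Weinberg equilibrium
    expected = (n * p_ref ** 2, 2 * n * p_ref * p_alt, n * p_alt ** 2)
    observed = (n_hom_ref, n_het, n_hom_alt)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2))

print(hwe_chi_square_p(0, 100, 0))    # only heterozygous calls: tiny p-value
print(hwe_chi_square_p(25, 50, 25))   # counts match HWE expectations exactly
```

With 100 samples that are all heterozygous, the allele frequency is 0.5, so HWE expects 25, 50, and 25 calls of each genotype. Observing 0, 100, and 0 gives a chi-square statistic of 100.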


### Notebook 01, Exercise 5

**Is this GWAS well controlled? Discuss with your group.**

Wikipedia has a good description of [genomic control estimation](https://en.wikipedia.org/wiki/Genomic_control) (lambda GC) to read later.

<font color='green'><b>Solution: This GWAS is NOT well controlled! The p-values are extremely inflated.</b></font>
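
One way to put a number on the inflation is lambda GC: the median observed chi-square statistic divided by the median of a chi-square distribution with one degree of freedom. The sketch below is a minimal standard-library version, assuming two-sided p-values strictly between 0 and 1. The function name and the synthetic p-values are hypothetical, not taken from the notebook.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom (about 0.4549)
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2

def lambda_gc(p_values):
    """Genomic inflation factor: median observed chi-square over expected median."""
    std_normal = NormalDist()
    # A two-sided p-value p corresponds to the 1-df chi-square statistic z**2,
    # where z is the standard normal quantile at p / 2.
    chi2 = [std_normal.inv_cdf(p / 2) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN

n = 1000
null_p = [(i + 0.5) / n for i in range(n)]   # evenly spread, as under the null
inflated_p = [p ** 2 for p in null_p]        # synthetic, systematically too small

print(lambda_gc(null_p))       # close to 1 for a well-controlled test
print(lambda_gc(inflated_p))   # well above 1 for inflated p-values
```

A value near 1 suggests the test statistics follow the null distribution for most variants, while values well above 1 point to confounding such as population stratification.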

### Notebook 01, Exercise 6

Change the "gwas2" cell to experiment with how many principal components are needed as covariates to properly control this GWAS. How many are needed here in our tiny simulated example? How many are needed in a typical GWAS?

<font color='green'><b>Here the students need to manually edit the covariates used for the cell starting with 'gwas2' to see how many of scores[0], scores[1], and scores[2] are necessary to control the GWAS. In this case, 2 PCs as covariates control the GWAS perfectly -- this is because we used 2 PCs to simulate the phenotypes! In real world studies, many more PCs are necessary to control for population stratification.</b></font>

### Notebook 02, Exercise 1

Is this a dense (mostly non-zero) or sparse (mostly zero) matrix? Is this expected? How many variants are in our dataset, and how many genes are there?

<font color='green'><b>There are a variety of ways to interrogate this. A few are listed below.</b></font>

In [None]:
xx = burden_mt
xx.aggregate_entries(hl.agg.fraction(xx.n_variants == 0))

In [None]:
xx = burden_mt
xx = xx.annotate_rows(frac_zero = hl.agg.fraction(xx.n_variants == 0))
xx.aggregate_rows(hl.agg.hist(xx.frac_zero, 0, 1, 20))

In [None]:
xx = burden_mt
xx = xx.annotate_rows(n_zero_variants = hl.agg.count_where(xx.n_variants == 0))
xx.aggregate_rows(hl.agg.counter(xx.n_zero_variants))

### Notebook 02, Exercise 2

1. Explore these annotations using `show` and aggregations.
2. Use a numeric annotation as a weight or compute a new numeric annotation from a non-numeric annotation (you might need [`hl.case`](https://hail.is/docs/0.2/functions/core.html#hail.expr.functions.case)).
3. Perform a new burden test using `mt.group_rows_by(...).aggregate(...)`, aggregators, `hl.linear_regression_rows`, and your new weight annotation. Do not use `burden_mt` again!

<font color='green'><b>Again, there are many ways to approach this. We provide a few options below.</b></font>

In [None]:
mt.describe(widget=True)

In [None]:
mt.splice_ai.summarize()

In [None]:
mt.cadd.summarize()

In [None]:
mt.aggregate_rows(hl.agg.counter(mt.vep.most_severe_consequence))

In [None]:
mt = mt.annotate_rows(
    weight1 = (hl.case()
               .when(mt.vep.most_severe_consequence == "synonymous_variant", 2)
               .when(mt.vep.most_severe_consequence == "intron_variant", 3)
               .when(mt.vep.most_severe_consequence == "missense_variant", 5)
               .default(1)),
    weight2 = mt.cadd.phred
)

mt = mt.annotate_cols(pca = pca_scores[mt.s])


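# Weighted burden: sum weight * alt-allele count over each gene's variants,
# so that the weight annotation actually contributes to the score.
xx = mt.group_rows_by(mt.gene_name).aggregate(
    weighted_score = hl.agg.sum(mt.weight2 * mt.GT.n_alt_alleles())
)
# Drop genes with no weighted alternate alleles in any sample.
xx = xx.filter_rows(hl.agg.sum(xx.weighted_score) > 0)

# Regress the phenotype on the weighted score, with an intercept and 3 PCs.
weighted_burden = hl.linear_regression_rows(
    y=xx.pheno1,
    x=xx.weighted_score,
    covariates=[1.0,
                xx.pca.scores[0],
                xx.pca.scores[1],
                xx.pca.scores[2]])
ht = weighted_burden
ggplot(ht) + geom_col(aes(x=ht.gene_name, y=-hl.log(ht.p_value, base=10)))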
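
For intuition about what a weighted burden score is, here is a small pure-Python sketch that does not use Hail. The toy variants, gene names, and function name are all hypothetical. For each gene and sample it sums weight times alt-allele count over the gene's variants and skips missing calls.

```python
from collections import defaultdict

# Hypothetical toy data: (gene, weight, alt-allele counts per sample; None = missing)
variants = [
    ("GENE_A", 5.0, [0, 1, 2]),
    ("GENE_A", 1.0, [1, 0, None]),
    ("GENE_B", 3.0, [0, 0, 1]),
]

def weighted_burden_scores(variant_records, n_samples):
    """Sum weight * alt-allele count per (gene, sample), skipping missing calls."""
    scores = defaultdict(lambda: [0.0] * n_samples)
    for gene, weight, calls in variant_records:
        for sample_idx, n_alt in enumerate(calls):
            if n_alt is not None:
                scores[gene][sample_idx] += weight * n_alt
    return dict(scores)

scores = weighted_burden_scores(variants, n_samples=3)
for gene, per_sample in scores.items():
    print(gene, per_sample)
```

Unlike a plain count of variant sites, this score grows with both the weight and the number of alternate alleles, so a heavily weighted variant contributes more than a lightly weighted one.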

### Notebook 03, Exercise 1: Coefficient of kinship versus coefficient of relationship.

The kinship coefficient estimated by the methods below is defined as the probability that two homologous alleles, one drawn at random from each of two individuals, are identical by descent. The closely related "coefficient of relationship", defined as the fraction of genetic material shared identically by descent, is equal to twice the kinship coefficient in the absence of inbreeding.

| Relationship | Kinship Coefficient ($\phi$) | Coefficient of Relationship ($r$) |
| :--- | --- | --- |
| Self | 1/2 = 0.5 | 1.0 |
| Parent-Child | 0.25 | 0.5 |
| Full Sibling | 0.25 | 0.5 |
| Grandparent-Grandchild | 1/8 = 0.125 | 0.25 |
| Avuncular Pair | 0.125 | 0.25 |
| First Cousin | 0.0625 | 0.125 |
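
The values in the table can be reproduced with the standard path-counting rule: for relatives whose common ancestors are not inbred, the kinship coefficient is the sum of (1/2)^(L+1) over the paths connecting them, where L is the number of meioses on a path. The sketch below is a minimal illustration. The function name and the path lists are assumptions written out by hand, not something computed from the dataset.

```python
def kinship_from_paths(path_lengths):
    """Kinship for non-inbred relatives: sum of (1/2)**(L + 1) over connecting paths.

    Each entry is the number of meioses L on one path linking the two
    individuals through a distinct common ancestor.
    """
    return sum(0.5 ** (length + 1) for length in path_lengths)

# Path lengths for each relationship in the table (one entry per common ancestor).
relationships = {
    "Self": [0],
    "Parent-Child": [1],
    "Full Sibling": [2, 2],
    "Grandparent-Grandchild": [2],
    "Avuncular Pair": [3, 3],
    "First Cousin": [4, 4],
}

for name, paths in relationships.items():
    phi = kinship_from_paths(paths)
    print(f"{name}: phi = {phi}, r = {2 * phi}")
```

Full siblings, for example, are linked through two parents by paths of two meioses each, giving 2 × (1/2)^3 = 0.25.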



### Notebook 03, Exercise 2: discuss -- How does the estimate deviate from true relatedness? Why?

<font color='green'><b>The IBD estimator is extremely inflated in the presence of structure and admixture. We can hover over the blob at "true relatedness" equal to zero and see that IBD estimates many unrelated founders as highly related!</b></font>

### Notebook 03, Exercise 3: discuss -- How does the estimate deviate from true relatedness? Why?

<font color='green'><b>The KING estimator is deflated in the presence of admixture.</b></font>

### Notebook 03, Exercise 4: detective work. Investigate the relationship between the individuals with kinship coefficient ~0.375.

Some useful information is encoded in the column fields of `mt`. You can show specific samples by editing the code below.

If you've got time after finishing this, do the same for a pair in the cluster with relatedness ~0.185!

To get you started, here's an example of how to interrogate three random samples, sample 50, 100, and 1000.

<font color='green'><b>The students should start with a pair from the cluster, put that pair's sample numbers into the list below (which initially reads [50, 100, 1000]), and then successively add the parents from the printout to the list until there are enough individuals to observe their relationship. The individuals of interest are the descendants of a pair of full siblings.</b></font>
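
To check that such a pedigree is consistent with a kinship coefficient of 0.375, the sketch below applies the classic recursive kinship calculation to a small pedigree. The pedigree and all names in it are hypothetical, not the actual samples in `mt`. It assumes founders are unrelated and that parents are listed before their children.

```python
from functools import lru_cache

# Hypothetical pedigree: individual -> (father, mother); founders have no parents.
PEDIGREE = {
    "grandfather": (None, None),
    "grandmother": (None, None),
    "sib_a": ("grandfather", "grandmother"),
    "sib_b": ("grandfather", "grandmother"),
    "child_1": ("sib_a", "sib_b"),
    "child_2": ("sib_a", "sib_b"),
}
# Parents are listed before children, so a larger index is never an ancestor
# of a smaller one.
ORDER = {name: i for i, name in enumerate(PEDIGREE)}

@lru_cache(maxsize=None)
def kinship(a, b):
    """Recursive pedigree kinship; unknown parents are treated as unrelated."""
    if a is None or b is None:
        return 0.0
    if a == b:
        father, mother = PEDIGREE[a]
        return 0.5 * (1.0 + kinship(father, mother))
    # Recurse on the individual listed later, who cannot be an ancestor of the other.
    if ORDER[a] < ORDER[b]:
        a, b = b, a
    father, mother = PEDIGREE[a]
    return 0.5 * (kinship(father, b) + kinship(mother, b))

print(kinship("sib_a", "sib_b"))        # ordinary full siblings
print(kinship("child_1", "child_2"))    # siblings whose parents are full siblings
print(kinship("child_1", "child_1"))    # self-kinship of an inbred individual
```

In this toy pedigree the two children of the sibling pair have kinship 0.375 with each other, compared with 0.25 for ordinary full siblings, because their parents already share alleles identical by descent.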

### Notebook 03, Exercise 5: model interrogation. 

PC-Relate uses an explicit `k` term to control how many principal components are used in individual allele frequency predictions. Your job is to copy the code cell starting with `pcrel = `, and run it with different `k` values to interrogate how the number of principal components included affects the results.

What **k** seems best? What happens when **k** is small? Large?

<font color='green'><b>The PC-Relate model here does best when k is small (k = 1!) and becomes deflated when k is large. Interesting... We'll discuss as a group and update the solutions afterwards :)</b></font>