# Rare Variant Analysis

Objective:

- Understand basic principles behind simple variant aggregation and burden tests.

GWAS is a great tool for finding associations between **common variants** and disease, but it is underpowered to detect rare-variant associations, because by definition only a small number of samples carry any given rare variant.

It is possible to find associations between rare variants and disease by **grouping variants of similar effect**, and testing each group.

One possible solution is to sum variant counts over some genomic interval (for instance, a gene), and then test for association with these intervals. A version of this kind of test is called a burden test.
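
To make the aggregation idea concrete, here is a minimal plain-Python sketch (no Hail involved). The variant names, sample names, and gene labels are all hypothetical toy data, and the helper `burden_by_gene` is an illustrative name, not part of any library.

```python
from collections import defaultdict

# Hypothetical toy data: alternate-allele counts (0, 1, or 2) per variant, per sample.
genotypes = {
    "var1": {"s1": 0, "s2": 1, "s3": 0},
    "var2": {"s1": 1, "s2": 0, "s3": 0},
    "var3": {"s1": 0, "s2": 0, "s3": 2},
}
# Hypothetical mapping from each variant to the gene whose interval contains it.
variant_gene = {"var1": "GENE_A", "var2": "GENE_A", "var3": "GENE_B"}


def burden_by_gene(genotypes, variant_gene):
    """Sum alternate-allele counts across the variants in each gene, per sample."""
    burden = defaultdict(lambda: defaultdict(int))
    for variant, calls in genotypes.items():
        gene = variant_gene[variant]
        for sample, n_alt in calls.items():
            burden[gene][sample] += n_alt
    return {gene: dict(per_sample) for gene, per_sample in burden.items()}


burden = burden_by_gene(genotypes, variant_gene)
```

Each gene now has one number per sample, and it is this per-gene, per-sample value that gets tested against the phenotype.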

We'll do a burden test that associates rare variant burden with our `caffeine_consumption` phenotype. We shouldn't hope to find anything here -- especially because we've only got a few thousand rare variants!

In [None]:
import hail as hl
from hail.plot import output_notebook, show

Now we initialize Hail and set up plotting to display inline in the notebook.

In [None]:
hl.init()
# make plots display inline, rather than creating files
output_notebook()

# <font color="#1a0dab">Step 1:</font> Import variant data

First, we'll need to start again from the QC'ed matrix table on disk -- `mt` has been filtered to include only common variants.

In [None]:
mt = hl.read_matrix_table('resources/post_qc.mt')

Next, we will keep variants with an allele frequency of under 1%. Including common variants will only reduce the power of a burden test.

In [None]:
mt = mt.filter_rows(hl.agg.call_stats(mt.GT, mt.alleles).AF[1] < 0.01)
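
For intuition about the filter above, here is a rough plain-Python sketch of an alternate-allele frequency at a biallelic, diploid site: alternate alleles observed, divided by the number of called alleles. The function name and the toy calls are hypothetical, and missing calls are represented as `None` purely for illustration.

```python
def alt_allele_frequency(n_alt_per_sample):
    """Alternate-allele frequency at a biallelic diploid site.

    Each entry is the number of alternate alleles (0, 1, or 2) in one sample,
    or None when the genotype call is missing.
    """
    called = [n for n in n_alt_per_sample if n is not None]
    if not called:
        return None
    # Each called diploid sample contributes two alleles to the denominator.
    return sum(called) / (2 * len(called))


# Hypothetical site: 198 homozygous-reference samples, 2 heterozygotes, 1 missing call.
calls = [0] * 198 + [1, 1] + [None]
af = alt_allele_frequency(calls)
is_rare = af < 0.01  # same 1% threshold as the filter above
```

Here two alternate alleles among 400 called alleles give a frequency of 0.5%, so this toy site would be kept as rare.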

# <font color="#1a0dab">Step 2:</font> Group by gene


To assign variants to genes, we'll use a tab-separated file that contains genomic intervals and corresponding genes.

You can also use our VEP annotation tool, which works like magic with the correct Google Cloud Platform (GCP) settings. More information is available [here](https://hail.is/docs/0.2/annotation_database_ui.html).

In [None]:
gene_ht = hl.import_table('resources/ensembl_gene_annotations.txt', impute=True)

In [None]:
gene_ht.show()

How many intervals (genes) are there?

In [None]:
gene_ht.count()

## Annotate variants with genes

In order to join our two tables, we need to create a field of type `interval` so that Hail knows how to execute a join.

We'll use the [transmute](https://hail.is/docs/0.2/hail.Table.html?highlight=transmute#hail.Table.transmute) function, which is like `annotate`, but drops any fields referenced in the computation.

In [None]:
print('before transmute')
gene_ht.describe()

gene_ht = gene_ht.transmute(
    interval = hl.locus_interval(gene_ht.chromosome,
                                 gene_ht.start,
                                 gene_ht.end))

print('')
print('after transmute')
gene_ht.describe()

This field needs to be the key of the table, so we will use [key_by](https://hail.is/docs/0.2/hail.Table.html?highlight=key_by#hail.Table.key_by) to assign this computed field as the table key:

In [None]:
keyed_gene_table = gene_ht.key_by('interval')

keyed_gene_table.describe()

Recall how we annotated sample phenotypes earlier in the common variant tutorial -- this join looks very similar:

In [None]:
mt = mt.annotate_rows(gene = keyed_gene_table[mt.locus].gene_name)
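
Conceptually, the interval join above asks, for every variant locus, which gene interval contains it. A minimal plain-Python sketch of that lookup follows. The intervals and gene names are hypothetical, and the start-inclusive, end-exclusive convention is an assumption made for this illustration.

```python
# Hypothetical gene intervals: (chromosome, start, end, gene_name).
gene_intervals = [
    ("1", 1000, 5000, "GENE_A"),
    ("1", 8000, 9000, "GENE_B"),
    ("2", 1000, 2000, "GENE_C"),
]


def gene_for_locus(chromosome, position, intervals):
    """Return the first gene whose interval contains the locus, or None."""
    for chrom, start, end, gene in intervals:
        # Assumption for this sketch: start is inclusive, end is exclusive.
        if chrom == chromosome and start <= position < end:
            return gene
    return None


inside = gene_for_locus("1", 1500, gene_intervals)      # falls within GENE_A
intergenic = gene_for_locus("1", 6000, gene_intervals)  # no interval contains it
```

A locus that falls outside every interval gets no gene at all, which is why some variants end up with a missing `gene` annotation.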

Let's `show` the resulting annotations on the matrix table. How do they differ?

In [None]:
mt.gene.show()

# <font color="#1a0dab">Step 3:</font> Aggregate by gene

Hail's modularity makes it easy to perform non-kernel-based burden tests.

We'll compose two general tools:
 - [group_rows_by](https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.group_rows_by) / [aggregate](https://hail.is/docs/0.2/hail.GroupedMatrixTable.html#hail.GroupedMatrixTable.aggregate)
 - [hl.linear_regression_rows](https://hail.is/docs/0.2/methods/stats.html#hail.methods.linear_regression_rows).
 
This means that you can flexibly specify the way genotypes are summarized per gene. Using other tools, you may have a few ways to aggregate, but if you want to do something different you are out of luck!
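
To illustrate what "different ways to summarize genotypes per gene" can mean, here is a small plain-Python sketch of three possible per-sample summaries. The genotype values are hypothetical, and the variable names are illustrative only.

```python
# Hypothetical alternate-allele counts for ONE sample across the rare variants of ONE gene.
n_alt_in_gene = [0, 1, 0, 2, 0]

# Summary 1: number of variants at which the sample carries at least one alternate allele.
n_variants_carried = sum(1 for n in n_alt_in_gene if n > 0)

# Summary 2: total number of alternate alleles, so a homozygous call counts twice.
n_alt_alleles = sum(n_alt_in_gene)

# Summary 3: a yes/no indicator of carrying any rare alternate allele in the gene.
is_carrier = any(n > 0 for n in n_alt_in_gene)
```

The first summary mirrors the `hl.agg.count_where(mt.GT.n_alt_alleles() > 0)` aggregation used in this section. The other two are alternatives you could swap in by changing the aggregator expression.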

In [None]:
mt.describe(widget=True)

In [None]:
burden_mt = (
    mt
    .group_rows_by('gene')
    .aggregate(n_variants = hl.agg.count_where(mt.GT.n_alt_alleles() > 0))
)

# filter to genes with at least one rare variant!
burden_mt = burden_mt.filter_rows(hl.agg.sum(burden_mt.n_variants) > 0)

In [None]:
burden_mt.describe(widget=True)

In [None]:
burden_mt.show()

# <font color="#1a0dab">Step 4:</font> Run linear regression per gene

This should look familiar! We can reuse the same modular components (like `linear_regression_rows`) for many different purposes.

In [None]:
pca_eigenvalues, pca_scores, pca_loadings = hl.hwe_normalized_pca(mt.GT, compute_loadings=True)

In [None]:
burden_mt = burden_mt.annotate_cols(pca = pca_scores[burden_mt.s])

burden_results = hl.linear_regression_rows(
    y=burden_mt.pheno.caffeine_consumption, 
    x=burden_mt.n_variants,
    covariates=[1.0, 
                burden_mt.pheno.is_female, 
                burden_mt.pca.scores[0], 
                burden_mt.pca.scores[1], 
                burden_mt.pca.scores[2]])
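
As a rough picture of what is being fit for each gene, here is a plain-Python sketch of ordinary least squares with a single predictor and an intercept. It is a simplification: the covariates (sex and principal components) are left out, the data are hypothetical, and `simple_linear_regression` is an illustrative helper rather than a Hail function.

```python
import math


def simple_linear_regression(x, y):
    """Ordinary least squares for y = intercept + beta * x.

    Returns (beta, standard_error, t_stat). Assumes x is not constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    residual_ss = sum((yi - intercept - beta * xi) ** 2 for xi, yi in zip(x, y))
    # Residual variance uses n - 2 degrees of freedom (intercept and slope).
    standard_error = math.sqrt(residual_ss / (n - 2) / sxx)
    return beta, standard_error, beta / standard_error


# Hypothetical data for one gene: per-sample burden and phenotype values.
burden = [0, 0, 1, 0, 2, 1, 0, 0]
caffeine = [3.0, 4.0, 5.0, 2.0, 7.0, 4.0, 3.0, 4.0]
beta, se, t_stat = simple_linear_regression(burden, caffeine)
```

The slope `beta` estimates how much the phenotype changes per additional unit of burden, and the t-statistic is what a p-value would be derived from. A gene whose burden is identical in every sample has no variance to regress on, which is one reason to drop genes without any rare variant calls beforehand.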

## Sorry, no `hl.plot.manhattan` for genes!

Manhattan plots are really only useful for standard GWAS. Instead, we can simply sort by p-value using [order_by](https://hail.is/docs/0.2/hail.Table.html#hail.Table.order_by), and print:

In [None]:
burden_results.order_by(burden_results.p_value).show()

Can a QQ plot help us check whether the results match what we expect from our data?

In [None]:
p = hl.plot.qq(burden_results.p_value)
show(p)

With relatively few points, it'll be a little unstable.

RVAS QQ plots tend to sit a bit lower than GWAS QQ plots for the same sample size.

Deflation would imply an underpowered study, and this RVAS is definitely underpowered.
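
To see what a QQ plot compares, here is a minimal plain-Python sketch that pairs each observed p-value with the value expected under a uniform null. The `rank / (n + 1)` expected quantile is one common convention and is an assumption here, not necessarily what `hl.plot.qq` uses. The p-values are hypothetical.

```python
import math


def qq_points(p_values):
    """Pair expected and observed -log10(p), ordered from the smallest p-value."""
    n = len(p_values)
    points = []
    for rank, p in enumerate(sorted(p_values), start=1):
        # Assumed convention: the k-th smallest of n uniform p-values is about k / (n + 1).
        expected_p = rank / (n + 1)
        points.append((-math.log10(expected_p), -math.log10(p)))
    return points


# Hypothetical p-values from five gene-level tests.
p_values = [0.8, 0.02, 0.5, 0.3, 0.65]
points = qq_points(p_values)
```

Points above the diagonal (observed larger than expected) suggest inflation, and points below it suggest deflation. With only a handful of genes, as in this toy example, the pattern is too noisy to interpret.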

## Any questions, team?


### What other covariates can you think of that could possibly clean up this analysis? It's the same dataset that we played with a few weeks ago.

#### Zoom Breakout rooms Activity

We will assign you into TWO breakout rooms. 

**Team/Room _Purple Hair_**

Create a model with **purple hair** as the outcome


**Team/Room _Polydactylism_**

Create a model with **six toes** as the outcome

## What do you have to do?

1) Introduce yourselves! 

2) Identify a note-taker (and a back up, just in case). This person will also share their screen with the group for code reviewing.

3) Identify a reporter who will share your group’s responses with the larger group.
  
Your assignment:

1) What is the distribution of people who have the phenotype? A simple list from `count()` or `show()` will do!

2) Create a logistic model with the given phenotype outcome using [Hail documentation](https://hail.is/docs/0.2/methods/stats.html#hail.methods.logistic_regression_rows). Use the search function at the top of the documentation page if you need more information!  

3) Which genes are ranked highest? What do you think of the results? 

&emsp; Kumar and Arcturus will pop in and out of your rooms to check in; please use the “Ask for Help” button to bring Kumar or Arcturus into your group as and when needed 

# If you have questions, ask them! We may have answers :)