# GWAS Tutorial

### 1. Setting up work environment
The first step in performing a GWAS is to load our dependencies and set up our work environment.

In [None]:
"""
Import statements allow us to reuse code written previously by ourselves or others. 
Here we are importing the "Hail" library, which is the core tool we are going to use to organize our data and eventually perform statistical analyses.
"""
import hail as hl
from hail.plot import show
from pprint import pprint
import ipywidgets as widgets
from IPython.display import display, clear_output
import time
%matplotlib inline
start = time.time()
hl.stop()
hl.plot.output_notebook()
hl.init()

### 2. Loading in the data
After we finish loading our dependencies, we can go ahead and start loading the data, starting with our genotype data (stored in a folder called "1kg.mt") and our phenotype data (stored in a file called "1kg_annotations.txt").

In [None]:
# Loading in the genotype data from our "data" folder and storing it in a variable called "mt", short for "MatrixTable" (one of the key innovations of the Hail library)
mt = hl.read_matrix_table('data/1kg.mt')

# Loading in the phenotype data from our "data" folder and storing it in a variable called "table"
table = hl.import_table('data/1kg_annotations.txt', impute=True).key_by('Sample')

Now that our data is loaded in, we can combine the two to form a consolidated dataset containing all the relevant information we are going to use for our analyses.

In [None]:
# We can use the "annotate_cols" method to add the phenotype data stored in the "table" variable to the columns (samples) of our MatrixTable
mt = mt.annotate_cols(pheno = table[mt.s])

It is always a good idea to take a look at our data to see what format we are working with and the available information we have. One way to do this is by using the "describe" method. An example is shown below:

In [None]:
# Describing the format of the "mt" variable using an interactive widget
mt.describe(widget = True)

After running the cell above, we can now interact with the four main components of our dataset (globals, rows, cols, and entries). In Hail, each row consists of one specific genetic variant and each column consists of one specific individual. An entry is the intersection of a row and a column and contains information about a specific variant for a particular individual (such as the genotype call).

Additionally from the interactive cell above, we can see that in the "col" tab, we have two fields that we can access: "s" and "pheno". If we expand the "pheno" field, we can see what information we have available for each individual. In this dataset, we have access to the following variables for each individual: "Population", "SuperPopulation", "isFemale", "PurpleHair", and "CaffeineConsumption". 

Feel free to explore the "row" and "entry" tabs to learn more about those parts of our dataset.
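As a mental model for the rows, columns, and entries described above, here is a minimal sketch that uses plain Python containers rather than Hail objects. The variant names, sample names, and genotype calls are all made up for illustration; a real MatrixTable stores far more information and is distributed across machines.

```python
# Toy stand-in for a MatrixTable using plain Python containers (not Hail objects).
# All identifiers and genotype calls below are hypothetical.
variants = ["variant_1", "variant_2"]            # rows
samples = ["sample_A", "sample_B", "sample_C"]   # columns

# One entry per (variant, sample) pair; here each entry only stores a genotype call.
entries = {
    ("variant_1", "sample_A"): "0/0",
    ("variant_1", "sample_B"): "0/1",
    ("variant_1", "sample_C"): "1/1",
    ("variant_2", "sample_A"): "0/1",
    ("variant_2", "sample_B"): "0/0",
    ("variant_2", "sample_C"): "0/0",
}


def calls_for_variant(variant):
    """Return the genotype call of every sample for one variant (one row)."""
    return [entries[(variant, sample)] for sample in samples]


def calls_for_sample(sample):
    """Return the genotype call at every variant for one sample (one column)."""
    return [entries[(variant, sample)] for variant in variants]


print(calls_for_variant("variant_1"))
print(calls_for_sample("sample_A"))
```

Reading along a row gives every individual's call for one variant, while reading down a column gives one individual's calls across all variants.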

### 3. Quality Control

After we load and explore our dataset, the next step is to perform some quality control (QC) so that we have a clean dataset prior to statistical analysis. In a GWAS, there are quite a few quality control measures we have to do. For this tutorial, we will focus on just a few QC measures. In order to organize ourselves, let's split these quality control measures into two categories: QC on the sample data and QC on the variant data.

Let's begin with QC on the sample (phenotype) data:
1. Remove individuals with high levels of missingness (people who we do not have enough data for)

Let's continue on to QC on the variant (genotype) data:
1. Remove variants with low minor allele frequency (MAF)
2. Remove variants that deviate from Hardy–Weinberg equilibrium (HWE)

Hail has a few QC methods that can help us get started. These methods, **sample_qc** and **variant_qc**, compute quality-related statistics from our MatrixTable and store them as new fields (named "sample_qc" and "variant_qc") that we can reference later. While we are at it, let's also create another variable (filtered_mt) that will hold the filtered version of our MatrixTable after QC.

If you want more information on QC on GWAS data, including definition and best practices of the above QC terms, please check out the following paper by [Marees et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6001694/)

In [None]:
# Calling Hail's built in QC functions for both the samples and the variants
mt = hl.sample_qc(mt)
mt = hl.variant_qc(mt)
filtered_mt = mt

We can now use Hail's filter_rows and filter_cols methods to filter out bad samples and bad variants from our MatrixTable. To make this easier to follow, running the cell below creates sliders that you can drag to select different threshold values for the QC conditions above. Feel free to change the values and then click on the button to apply the QC filters. Also take some time to read the commented portions of the code to dive deeper into the syntax Hail uses to perform filtering. Notice how the number of samples and variants changes depending on our threshold values.

When you are done experimenting, configure the sliders to the following QC values and click on the button: 
1. **Sample Call Rate = 0.97** (Call rate refers to the fraction of called SNPs over the total number of SNPs in the dataset. Samples with a low call rate are missing a large share of their SNP data and should be removed from the final analysis.)
2. **Minor Allele Frequency = 0.01** (Minor Allele Frequency (MAF) refers to the frequency at which the second most common allele occurs in a given population. A lower MAF indicates that a SNP is rare in a population. As we are focusing on more common variants for this tutorial, it is important to remove the really rare ones.)
3. **Hardy Weinberg Equilibrium = 1.00e-6** (Hardy Weinberg Equilibrium (HWE) is a fundamental concept in population genetics and refers to the idea that genotype frequencies in a population remain constant between generations in the absence of disturbance by outside factors. We can use HWE to test for genotyping errors by comparing the genotype frequencies observed in our dataset with those expected under HWE. Variants whose HWE test p-value falls below the threshold are removed.)
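To make the three quantities above more concrete, here is a minimal sketch in plain Python on a made-up genotype matrix. It is not how Hail computes its QC statistics; the function names and the toy data are assumptions for illustration only. Each entry is the number of alternate alleles (0, 1, or 2), and `None` marks a missing call.

```python
# Toy genotype matrix (hypothetical data): rows are variants, columns are samples.
# Each entry is the number of alternate alleles (0, 1, 2) or None for a missing call.
genotypes = [
    [0, 1, 2, 0],
    [1, None, 1, 0],
    [0, 0, None, 0],
]


def sample_call_rate(matrix, sample_index):
    """Fraction of variants with a non-missing call for one sample."""
    calls = [row[sample_index] for row in matrix]
    return sum(call is not None for call in calls) / len(calls)


def minor_allele_frequency(row):
    """Frequency of the less common allele among the called genotypes of one variant."""
    called = [g for g in row if g is not None]
    alt_freq = sum(called) / (2 * len(called))  # each sample carries two alleles
    return min(alt_freq, 1 - alt_freq)


def hwe_expected_counts(row):
    """Genotype counts (hom-ref, het, hom-alt) expected under Hardy-Weinberg equilibrium."""
    called = [g for g in row if g is not None]
    n = len(called)
    q = sum(called) / (2 * n)  # alternate allele frequency
    p = 1 - q                  # reference allele frequency
    return (p * p * n, 2 * p * q * n, q * q * n)


print([sample_call_rate(genotypes, i) for i in range(4)])
print([minor_allele_frequency(row) for row in genotypes])
print(hwe_expected_counts(genotypes[0]))
```

A HWE test then asks how surprising the observed genotype counts are compared with these expected counts; a very small p-value suggests a problem such as a genotyping error.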

The code cell below is fairly large because it handles both the creation of the interactive widgets and the actual filtering. Focus mostly on the commented lines rather than trying to understand the entire code cell at once.

In [None]:
# The code below creates the different sliders that you can interact with. We have a slider for each of the QC variables and finally a button to run the QC steps.
call_rate_slider = widgets.FloatSlider(min=0.90, max=1.00, step=0.01, value=0.97, layout = widgets.Layout(width='500px'), description = "Sample Call Rate:", style=dict(description_width='initial'))
maf_slider = widgets.FloatSlider(min=0.01, max=0.10, step=0.01, value=0.01, layout = widgets.Layout(width='500px'), description = "Minor Allele Frequency:", style=dict(description_width='initial'))
hwe_slider = widgets.FloatLogSlider(value=1e-6, base=10, min=-10, max=-6, step=1, readout_format='.2e', layout = widgets.Layout(width='500px'), description = "Hardy Weinberg Equilibrium:", style=dict(description_width='initial'))
output = widgets.Output()
button = widgets.Button(description = "Apply QC filter", button_style = "primary")

display(call_rate_slider, maf_slider, hwe_slider)
display(button)

# This method is what takes the values from the sliders once the button is pressed and passes them into Hail's filtering methods.
def on_button_click(a):
    global mt
    global filtered_mt
    filtered_mt = mt
    with output:
        clear_output()
        call_rate_value = call_rate_slider.value
        maf_value = maf_slider.value
        hwe_value = hwe_slider.value
        # The line below filters the columns of our MatrixTable and removes samples whose mean read depth is below 4 or whose call rate is below the specified value (in our case 0.97).
        filtered_mt = filtered_mt.filter_cols((filtered_mt.sample_qc.dp_stats.mean >= 4) & (filtered_mt.sample_qc.call_rate >= call_rate_value))
        # The lines below compute the allele balance (fraction of reads supporting the alternate allele) for each entry
        # and keep only the genotype calls whose allele balance is consistent with the call that was made.
        ab = filtered_mt.AD[1] / hl.sum(filtered_mt.AD)
        filter_condition_ab = ((filtered_mt.GT.is_hom_ref() & (ab <= 0.1)) |
                        (filtered_mt.GT.is_het() & (ab >= 0.25) & (ab <= 0.75)) |
                        (filtered_mt.GT.is_hom_var() & (ab >= 0.9)))
        filtered_mt = filtered_mt.filter_entries(filter_condition_ab)
        # The line below filters the rows of our MatrixTable and keeps only variants whose alternate allele frequency (AF[1], used here as a stand-in for the minor allele frequency) is above the specified value (in our case 0.01)
        filtered_mt = filtered_mt.filter_rows(filtered_mt.variant_qc.AF[1] > maf_value)
        # The line below filters the rows of our MatrixTable and keeps only variants whose Hardy-Weinberg equilibrium p-value is above the specified value (in our case 1e-6)
        filtered_mt = filtered_mt.filter_rows(filtered_mt.variant_qc.p_value_hwe > hwe_value)
        print('After filtering: Samples: %d  Variants: %d' % (filtered_mt.count_cols(), filtered_mt.count_rows()))
button.on_click(on_button_click)
display(output)
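The allele-balance condition inside the filtering function above packs three rules into one expression. As a hedged illustration, the same thresholds can be written for a single entry in plain Python. The function name, the string labels for the genotype, and the choice to reject entries with zero reads are assumptions of this sketch, not part of Hail.

```python
def passes_allele_balance(genotype, allele_depths):
    """Check whether the read evidence agrees with a genotype call.

    genotype: one of "hom_ref", "het", "hom_var" (hypothetical labels for this sketch)
    allele_depths: (reference_reads, alternate_reads)
    """
    total_reads = sum(allele_depths)
    if total_reads == 0:
        # Assumption of this sketch: an entry with no reads cannot support any call.
        return False
    ab = allele_depths[1] / total_reads  # fraction of reads carrying the alternate allele
    if genotype == "hom_ref":
        return ab <= 0.1
    if genotype == "het":
        return 0.25 <= ab <= 0.75
    if genotype == "hom_var":
        return ab >= 0.9
    return False


print(passes_allele_balance("het", (10, 10)))   # balanced reads support a heterozygous call
print(passes_allele_balance("het", (18, 2)))    # too few alternate reads for a heterozygous call
```

The idea is that a heterozygous call should be backed by a roughly even mix of reference and alternate reads, while homozygous calls should be backed almost entirely by one allele.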

### 4. Initial GWAS

Now that we are done filtering our data, we can go ahead and perform the actual GWAS! In Hail, a GWAS can be performed using the **hl.linear_regression_rows** or **hl.logistic_regression_rows** methods. Which method to use depends on the kind of phenotype we want to investigate. For this tutorial, let us investigate the CaffeineConsumption phenotype (a synthetic phenotype created just for the purpose of this tutorial). As the CaffeineConsumption phenotype is an integer and not a boolean (True/False), we will use the **hl.linear_regression_rows** method for our analysis.
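Conceptually, a linear-regression GWAS fits one regression per variant, with the phenotype as the outcome and the number of alternate alleles as the predictor. The sketch below shows that idea for a single variant in plain Python, using made-up numbers and no covariates other than the intercept. It does not compute p-values and is not Hail's implementation.

```python
def simple_linear_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data for one variant: alternate allele counts and a phenotype for six people.
alt_allele_counts = [0, 0, 1, 1, 2, 2]
caffeine_consumption = [2, 4, 4, 6, 6, 8]

intercept, slope = simple_linear_regression(alt_allele_counts, caffeine_consumption)
print("intercept:", intercept, "slope:", slope)
```

The slope estimates how much the phenotype changes per additional copy of the alternate allele. A GWAS repeats this fit for every variant and asks, for each one, whether the slope differs significantly from zero.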

In [None]:
# Here is where we actually run our GWAS!
gwas = hl.linear_regression_rows(
    y=filtered_mt.pheno.CaffeineConsumption,
    x=filtered_mt.GT.n_alt_alleles(),
    covariates=[1.0, filtered_mt.pheno.isFemale])
p = hl.plot.manhattan(gwas.p_value)
show(p)

The image above is called a Manhattan plot, named after the city skyline of Manhattan, NY. Each point represents one particular variant in our dataset. The y-axis shows the negative base-10 logarithm of the p-value, so variants with higher y-values are more statistically significant. The dashed horizontal line represents our significance threshold (5e-8). Examples of Manhattan plots from well-controlled GWAS in published studies are provided below:

<img src="https://ars.els-cdn.com/content/image/3-s2.0-B9780123742797130206-f13020-01-9780123742797.jpg" alt="Manhattan Image1"/>
<img src="https://ars.els-cdn.com/content/image/3-s2.0-B9780128209516000132-f08-03-9780128209516.jpg" alt="Manhattan Image2"/>

Source: https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/manhattan-plot

In both of the examples above, most of the variants are not significant and are tightly packed together. However, this is not what we see in our GWAS. What's wrong? Is there something in our dataset that is adding noise to our GWAS? We can check for potential confounding effects by using a Quantile-Quantile plot.

In [None]:
p = hl.plot.qq(gwas.p_value)
show(p)

A Quantile-Quantile plot (or QQ-plot) is a graph that shows how far the observed p-values deviate from the values expected under the null hypothesis. The GWAS p-values for each SNP are sorted from largest to smallest and plotted against expected values from a theoretical χ²-distribution. ***If the observed values correspond to the expected values, all points are on or near the diagonal line midway between the x-axis and the y-axis (null hypothesis).*** In our case, we see that our points deviate greatly from the diagonal line. This is evidence of potential confounding in the dataset.
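To see where the points of a QQ-plot come from, here is a minimal sketch in plain Python. It assumes one common convention for the expected values, namely that the i-th smallest of n null p-values is about i / (n + 1); plotting libraries, including Hail, may use a slightly different convention. The function name and the p-values are made up for illustration.

```python
import math


def qq_points(p_values):
    """Pair expected and observed -log10(p) values, as plotted in a QQ-plot."""
    n = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    # Assumption: under the null hypothesis the i-th smallest p-value is about i / (n + 1).
    expected = sorted(-math.log10(i / (n + 1)) for i in range(1, n + 1))
    return list(zip(expected, observed))


# Hypothetical well-behaved p-values: evenly spread between 0 and 1.
null_like = [i / 100 for i in range(1, 100)]
# Hypothetical inflated p-values: systematically smaller than expected.
inflated = [p ** 2 for p in null_like]

for expected, observed in qq_points(inflated)[-3:]:
    print(round(expected, 3), round(observed, 3))
```

For the well-behaved p-values every point lies on the diagonal, whereas for the inflated ones the observed values sit above the expected values across the whole range, which is the pattern we see in our first GWAS.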

### 4. Ancestry, Population Stratification and Principal Component Analysis (PCA)

One's ancestry plays a large role in determining the genetic variants present in one's genome. Depending on the ancestry, variants can have differing influence on the phenotype being explored. In genomic studies, researchers often control for ancestry in their statistical analyses. We do this in order to account for population stratification (differences in allele frequencies between cases and controls due to systematic differences in ancestry rather than association of genes with disease). One approach to controlling for ancestry is to perform Principal Component Analysis (PCA).

Hail allows one to easily perform PCA using a built in function. Let's perform PCA and plot the results to see if we notice anything.
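The built-in function used in the next cell is named hwe_normalized_pca, which indicates that genotypes are rescaled using Hardy-Weinberg expectations before PCA is run. The sketch below shows one standardization commonly used for this purpose, applied to a single made-up variant in plain Python. It is an illustration under that assumption, not a reproduction of Hail's implementation, which also handles missing calls and applies further scaling.

```python
import math


def hwe_normalize(alt_allele_counts):
    """Center and scale one variant's genotypes using its alternate allele frequency.

    Assumes no missing calls and a variant that is not monomorphic.
    """
    n = len(alt_allele_counts)
    p = sum(alt_allele_counts) / (2 * n)  # alternate allele frequency
    if p == 0 or p == 1:
        raise ValueError("monomorphic variants cannot be normalized")
    # Under Hardy-Weinberg equilibrium the genotype has mean 2p and variance 2p(1 - p).
    scale = math.sqrt(2 * p * (1 - p))
    return [(g - 2 * p) / scale for g in alt_allele_counts]


# Hypothetical alternate allele counts for one variant in four samples.
print(hwe_normalize([0, 1, 2, 1]))
```

Rescaling every variant this way puts rare and common variants on a comparable footing, so that the principal components are not dominated by the most common variants.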

In [None]:
eigenvalues, pcs, _ = hl.hwe_normalized_pca(filtered_mt.GT)
filtered_mt = filtered_mt.annotate_cols(scores = pcs[filtered_mt.s].scores)

p = hl.plot.scatter(filtered_mt.scores[0],
                    filtered_mt.scores[1],
                    label=filtered_mt.pheno.SuperPopulation,
                    title='PCA', xlabel='PC1', ylabel='PC2')
show(p)

The scatter plot above plots each sample according to its first two principal components and colors the samples by their ancestry SuperPopulation (AFR = African, AMR = Admixed American, EAS = East Asian, EUR = European, and SAS = South Asian). As we can see, several clusters emerge. Without going into too much detail, to account for ancestry potentially playing a part in our linear regression, we simply add our calculated principal components as covariates in our statistical test.
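To illustrate why adding principal components as covariates helps, here is a minimal sketch in plain Python with made-up data for two ancestry groups. Within each group the variant has no effect on the phenotype, but both the allele counts and the phenotype differ between groups. Instead of fitting a multiple regression directly, the sketch removes the part of both variables explained by a hypothetical PC score and then regresses the leftovers on each other, which yields the same slope for the variant as including the PC as a covariate.

```python
def ols(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(x, y):
    """Part of y that is not explained by a linear fit on x."""
    intercept, slope = ols(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical data: the first four samples belong to one ancestry group, the last four to another.
pc1 = [-1, -1, -1, -1, 1, 1, 1, 1]              # made-up PC score separating the groups
alt_allele_counts = [0, 0, 1, 1, 1, 1, 2, 2]    # second group carries more alternate alleles
phenotype = [2, 4, 2, 4, 6, 8, 6, 8]            # second group also has a higher phenotype

naive_slope = ols(alt_allele_counts, phenotype)[1]
adjusted_slope = ols(residuals(pc1, alt_allele_counts), residuals(pc1, phenotype))[1]

print("slope without adjusting for ancestry:", naive_slope)
print("slope after adjusting for ancestry:", adjusted_slope)
```

Without adjustment the variant appears to be associated with the phenotype, but once the ancestry-related component is accounted for the apparent effect disappears. This is the kind of spurious signal that population stratification can create across many variants at once.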

### 5. Final GWAS controlling for Population Stratification

Let us add the first few principal components (stored in a field called "filtered_mt.scores") as covariates in our statistical test and re-run our GWAS.

In [None]:
# Here is where we actually add our principal components (which capture ancestry) to our GWAS function call. Take a look at the commented line below to see how we modify our covariates.
gwas = hl.linear_regression_rows(
    y=filtered_mt.pheno.CaffeineConsumption,
    x=filtered_mt.GT.n_alt_alleles(),
    # This is where we added in additional covariates (in our case the principal components). In Python, array numbering starts at 0, so we include filtered_mt.scores[0], filtered_mt.scores[1], filtered_mt.scores[2] to account for the first three PCs.
    covariates=[1.0, filtered_mt.pheno.isFemale, filtered_mt.scores[0], filtered_mt.scores[1], filtered_mt.scores[2]]) 
p = hl.plot.manhattan(gwas.p_value)
show(p)
p = hl.plot.qq(gwas.p_value)
show(p)

Now that's a much better looking skyline. From the QQ-plot, we see that most points now fall on the diagonal line. From the Manhattan plot, we see that most variants are not significant, but we do have some on Chromosome 8 that have met our significance threshold. Remember that this is a synthetic phenotype, so there probably are not any variants on Chromosome 8 that are truly associated with CaffeineConsumption, but it is always nice to see results!

Feel free to use your cursor and hover over the variants to find more information about their chromosomal position and p-value. If you simply want a table with that information, you can run the cell below.

In [None]:
# Prints out top 10 GWAS results sorted by p-value
gwas.order_by("p_value").show(10)
# Simply for evaluation purposes, getting time taken for tutorial
end = time.time()
print("Total time (seconds): " + str(end - start))

### 6. Congratulations!

You have just completed an entire GWAS from start to finish! To summarize:
1. We first set up our work environment and loaded in our data
2. We then performed some sample and variant QC to filter our dataset
3. We performed an initial GWAS and were met with noisy results
4. We applied PCA to capture the role that ancestry plays in our phenotype
5. We performed a final GWAS accounting for ancestry

Feel free to explore some of the other notebooks available or rerun this experiment with your own data!