## Big Data for Biologists: Replication of GWAS studies in different populations? - Class 16
##  Learning Objectives
***Students should be able to***
 <ol>
 <li> <a href=#LD>Find variants in linkage disequilibrium (LD) with a target variant using tabix and PLINK.</a></li>
 <li> <a href=#GeneCards>Use GeneCards to find out information about a gene.</a></li>
 <li><a href=#projectFiles>Use reference datasets of genome and epigenome information to investigate the function of coding and non-coding variants.</a></li>
 </ol>
 


## Linkage Disequilibrium Example with Tabix and PLINK <a name='LD'></a>

[An article in the New England Journal of Medicine](http://www.nejm.org/doi/full/10.1056/NEJMoa1502214?rss=searchAndBrowse&#t=article) presented a GWAS in 52 participants who were homozygous for the risk allele of the tag variant rs1558902. This variant occurs in an intron of the FTO gene, which has previously been linked to obesity. However, since a strong GWAS association of a variant with a phenotype is insufficient to determine causation, the authors checked whether other variants were in strong linkage disequilibrium with rs1558902 and were thus also potentially causal variants in obesity. 

We will examine below how the PLINK tool can be used to perform such a linkage disequilibrium analysis. 


We have downloaded variant files for the 1000 Genomes Project in the PLINK binary format: 
    
* **/opt/data/1kg_phase1_all.bed** -- binary encoding of subject genotypes (do not be fooled by the file extension, this is NOT the 4-column bed file format we have been using). 

* **/opt/data/1kg_phase1_all.bim** -- list of all variants in the subject population 
* **/opt/data/1kg_phase1_all.fam** -- list of all subject IDs in the 1000 Genomes Project
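
The `.bim` and `.fam` files are plain whitespace-delimited text, so they can be inspected without PLINK. The sketch below is a minimal illustration, assuming the standard PLINK 1 column order (six columns in each file). The helper name `read_plink_table` and the toy records are made up for this example and are not taken from `/opt/data`.

```python
import io

# Column layouts defined by the PLINK 1 binary fileset format.
BIM_COLUMNS = ["chrom", "variant_id", "cm", "bp", "allele1", "allele2"]
FAM_COLUMNS = ["family_id", "sample_id", "father_id", "mother_id", "sex", "phenotype"]

def read_plink_table(handle, columns):
    """Parse a whitespace-delimited PLINK text file into a list of dicts."""
    rows = []
    for line in handle:
        fields = line.split()
        if fields:  # skip blank lines
            rows.append(dict(zip(columns, fields)))
    return rows

# Toy records (hypothetical values, for illustration only).
toy_bim = io.StringIO(
    "16\tvariant_a\t0\t1000\tA\tG\n"
    "16\tvariant_b\t0\t2500\tC\tT\n"
)
toy_fam = io.StringIO("FAM1 SAMPLE1 0 0 1 -9\nFAM2 SAMPLE2 0 0 2 -9\n")

variants = read_plink_table(toy_bim, BIM_COLUMNS)
samples = read_plink_table(toy_fam, FAM_COLUMNS)
print(len(variants), "variants;", len(samples), "samples")
print(variants[0]["variant_id"], variants[0]["bp"])
```

To try this on the course data, pass `open("/opt/data/1kg_phase1_all.bim")` in place of the toy handle. The `.bed` file is binary and cannot be read this way.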

In [None]:
#This syntax will identify all variants that are in linkage disequilibrium with our tagged SNP rs1558902
!plink --bfile /opt/data/1kg_phase1_all --r  --ld-snp rs1558902 --threads 10 --out r.for.rs1558902 


The SNPs that are in linkage disequilibrium with our tagged SNP were saved to the file **r.for.rs1558902.ld**. Let's examine the contents of this file: 

In [None]:
!cat r.for.rs1558902.ld

PLINK also allows us to compute the r^2 value for linkage disequilibrium. The command is the same as the one we ran above, but with the `--r` flag replaced by `--r2`.

In [None]:
!plink  --bfile /opt/data/1kg_phase1_all --r2  --ld-snp rs1558902 --threads 10 --out r2.for.rs1558902


In [None]:
!cat r2.for.rs1558902.ld
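
To make the relationship between r and r^2 concrete, here is a small worked example. It is a sketch under a simplifying assumption: the haplotype frequencies are known exactly, whereas PLINK has to estimate the correlation from genotype data. The function name `ld_r` and the frequencies are hypothetical.

```python
import math

def ld_r(p_ab, p_a, p_b):
    """Correlation r between two biallelic variants.

    p_ab: frequency of haplotypes carrying allele A at the first variant
          and allele B at the second variant
    p_a:  frequency of allele A
    p_b:  frequency of allele B
    """
    d = p_ab - p_a * p_b  # the LD coefficient D
    return d / math.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))

# Toy frequencies (made up for illustration).
r_pos = ld_r(p_ab=0.4, p_a=0.5, p_b=0.5)  # alleles A and B tend to occur together
r_neg = ld_r(p_ab=0.1, p_a=0.5, p_b=0.5)  # alleles A and B tend to occur apart

print(round(r_pos, 3), round(r_pos ** 2, 3))
print(round(r_neg, 3), round(r_neg ** 2, 3))
```

The two toy pairs have opposite signs of r but the same r^2. Squaring discards the direction of the correlation and keeps only its strength, which is why r^2 is the usual measure when ranking variants by LD.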

The New England Journal of Medicine article reports that variant rs1421085 is in linkage disequilibrium with rs1558902. The authors found that rs1421085 disrupted an ARID5B repressor motif, and was thus the most likely causal variant.


In [None]:
## Create a scatterplot with BP_B along the x-axis and R along the y-axis. 
## Include labels from column SNP_B so we know which SNPs have the highest LD with our target 

## Load the LD file from PLINK:
## We pass sep=r"\s+" so that any run of whitespace counts as a column separator, 
## since PLINK pads its columns with a variable number of spaces. 
import pandas as pd
from plotnine import * 
data=pd.read_table("r.for.rs1558902.ld",sep=r"\s+") 
data.head()

x=data['BP_B']
y=data['R']
label=data['SNP_B']
print(qplot(x=x,
      y=y,
      label=label,
      geom=["point","text"],
      xlab="BP_B",
      ylab="R"))
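
A scatterplot with many labels can be hard to read, so it can help to rank the variants by the strength of their LD with the target. The sketch below uses only the standard library and a toy table that mimics the column layout of the PLINK `.ld` output. All SNP names, positions, and R values in it are made up for illustration.

```python
import io

# Toy PLINK-style .ld table (hypothetical values, for illustration only).
toy_ld = io.StringIO(
    " CHR_A  BP_A  SNP_A   CHR_B  BP_B  SNP_B    R\n"
    " 16     5000  target  16     4000  snp_1    0.35\n"
    " 16     5000  target  16     4600  snp_2   -0.92\n"
    " 16     5000  target  16     5000  target   1\n"
    " 16     5000  target  16     5900  snp_3    0.71\n"
)

# str.split() with no arguments splits on any run of whitespace.
header = toy_ld.readline().split()
rows = [dict(zip(header, line.split())) for line in toy_ld if line.strip()]

# Drop the pairing of the target with itself, if present,
# then rank by |R| so strongly negative correlations are not overlooked.
partners = [row for row in rows if row["SNP_B"] != row["SNP_A"]]
partners.sort(key=lambda row: abs(float(row["R"])), reverse=True)

for row in partners:
    print(row["SNP_B"], row["BP_B"], row["R"])
```

Sorting by the absolute value matters for the `--r` output because R is signed. For the `--r2` output the column is named `R2` and is never negative.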

## Use GeneCards to find out information about a gene <a name='GeneCards'></a>

[GeneCards](http://www.genecards.org/) is a database of information about human genes. It provides information about gene function and tissue-specific expression, as well as journal articles in which a given gene is mentioned. 

Look up the following genes in GeneCards. What is the function of each gene? 

 * IRX5 
 * FTO 

## Overview of Course Project Files <a name='projectFiles'></a>

All project files can be found in the folder **/opt/data**

* /opt/data/1kg_phase1_all*   -- binary variant files
* /opt/data/gene_coords_hg19.txt -- bed file of gene coordinates 
* /opt/data/gencode.hg19.annotations.gtf -- gene annotation file 
* /opt/data/motifs.bed -- coordinates of all transcription factor-binding motifs in the genome. 
* /opt/data/active_promoters_across_cell_type.bed 
* /opt/data/active_enhancers_across_cell_type.bed 


### Binary variant files

* /opt/data/1kg_phase1_all*   -- binary variant files

These files store the genotypes of all subjects in phase 1 of the 1000 Genomes Project in a compressed binary format. You can use these with the PLINK tool to identify variants in linkage disequilibrium with your variant of interest. 

In [None]:
## Use the plink --r command to identify all the variants in linkage disequilibrium with target variant rs150021059. 

## How many such variants are there? 


In [None]:
#! cat r.for.150021059.ld

### Gene coordinate file 
* /opt/data/gene_coords_hg19.txt -- bed file of gene coordinates 

Use this file to find the closest gene to a variant of interest. 
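
One way to think about "closest gene" is as the gene whose interval lies the fewest base pairs from the variant, with a distance of zero when the variant falls inside the gene. The sketch below illustrates that idea on toy data. It assumes each record holds chromosome, start, end, and gene name in that order, which should be checked against the real file. The function names and gene records are hypothetical, and the one-base difference between 0-based and 1-based coordinates is ignored.

```python
def distance_to_gene(pos, start, end):
    """Distance in bp from a position to the interval [start, end]; 0 if inside."""
    if pos < start:
        return start - pos
    if pos > end:
        return pos - end
    return 0

def closest_gene(chrom, pos, genes):
    """Return the gene record on `chrom` nearest to `pos`, or None if there is none."""
    same_chrom = [g for g in genes if g[0] == chrom]
    if not same_chrom:
        return None
    return min(same_chrom, key=lambda g: distance_to_gene(pos, g[1], g[2]))

# Toy gene records (made up): (chrom, start, end, name)
toy_genes = [
    ("chr16", 1000, 2000, "GENE_A"),
    ("chr16", 5000, 9000, "GENE_B"),
    ("chr1", 1500, 1600, "GENE_C"),
]

print(closest_gene("chr16", 4200, toy_genes))  # variant between GENE_A and GENE_B
print(closest_gene("chr16", 1500, toy_genes))  # variant inside GENE_A
```

For genome-scale files, the `bedtools closest` command performs the same kind of lookup far more efficiently than a linear scan.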

In [None]:
!head -n20 /opt/data/gene_coords_hg19.txt

In [None]:
## What are the coordinates of gene 'FTO'? 
##Hint: use the grep command. 

### Gene annotation file 
* /opt/data/gencode.hg19.annotations.gtf -- gene annotation file 

Use this file to identify exons and transcription start sites of genes. 
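
A GTF file has nine tab-separated columns: the third holds the feature type (for example `gene`, `transcript`, or `exon`) and the ninth holds attributes such as `gene_name`. The sketch below shows one possible way to count exon records for a gene in Python. The function names and the toy records are hypothetical. It reports two numbers because the same exon is listed once for every transcript that contains it.

```python
import re

def gene_name(attributes):
    """Extract the gene_name value from a GTF attribute string, if present."""
    match = re.search(r'gene_name "([^"]+)"', attributes)
    return match.group(1) if match else None

def count_exons(gtf_lines, target_gene):
    """Return (exon records, unique exon intervals) for target_gene."""
    records = 0
    unique = set()
    for line in gtf_lines:
        if line.startswith("#"):  # header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        if gene_name(fields[8]) != target_gene:
            continue
        records += 1
        unique.add((fields[0], fields[3], fields[4]))  # chrom, start, end
    return records, len(unique)

def gtf_line(feature, start, end, gene, transcript):
    """Build one toy GTF record (hypothetical helper for this example)."""
    attrs = f'gene_id "{gene}"; transcript_id "{transcript}"; gene_name "{gene}";'
    return "\t".join(["chr16", "toy", feature, str(start), str(end), ".", "+", ".", attrs])

# Toy annotation (made up): GENE_A has two transcripts that share one exon.
toy_gtf = [
    "##description: toy example",
    gtf_line("gene", 100, 600, "GENE_A", "NA"),
    gtf_line("exon", 100, 200, "GENE_A", "T1"),
    gtf_line("exon", 300, 400, "GENE_A", "T1"),
    gtf_line("exon", 100, 200, "GENE_A", "T2"),
    gtf_line("exon", 500, 600, "GENE_A", "T2"),
    gtf_line("exon", 700, 800, "GENE_AB", "T3"),
]

print(count_exons(toy_gtf, "GENE_A"))
```

Matching the exact `gene_name` value avoids counting genes whose names merely contain the target as a substring (`GENE_AB` in the toy data), which a plain text search for the gene name can pick up.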

In [None]:
!head -n20  /opt/data/gencode.hg19.annotations.gtf 

In [None]:
## How many exons are there in the FTO gene? 
## Hint: grep is useful here too! 

### Motif coordinates file

* /opt/data/motifs.bed 

Use this file to find the motif that is present at a particular region in the genome.

In [None]:
!head -n20 /opt/data/motifs.bed


In [None]:
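## What motif is present at coordinates chr1	53495	53504	? 
## printf (unlike echo) expands \t into a tab in every shell, so the file is properly tab-delimited. 
!printf "chr1\t53495\t53504\n" > region_for_motif.bed 

## Hint: use the bedtools intersect command to find the motif 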
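
Conceptually, an intersection reports every feature that shares at least one base with the query region. The sketch below illustrates the overlap test on toy intervals. It is not how bedtools is implemented, and the function names and motif records are made up. BED intervals are 0-based and half-open, so an interval ending at 110 does not overlap one starting at 110.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end

def intersect(query, features):
    """Return the features on the query's chromosome that overlap the query."""
    chrom, start, end = query
    return [f for f in features if f[0] == chrom and overlaps(start, end, f[1], f[2])]

# Toy motif records (hypothetical): (chrom, start, end, name)
toy_motifs = [
    ("chr1", 100, 110, "MOTIF_X"),  # identical to the query
    ("chr1", 105, 120, "MOTIF_Y"),  # partial overlap
    ("chr1", 110, 130, "MOTIF_Z"),  # starts exactly where the query ends
    ("chr2", 100, 110, "MOTIF_W"),  # same coordinates, different chromosome
]

hits = intersect(("chr1", 100, 110), toy_motifs)
print([name for _, _, _, name in hits])
```

With bedtools, the query file is passed with `-a` and the feature file with `-b`.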

### Active promoters and enhancers across cell type 
* /opt/data/active_promoters_across_cell_type.bed 
* /opt/data/active_enhancers_across_cell_type.bed 

Use these files to determine whether your variant of interest is in an active enhancer or promoter region. 

In [None]:
!head -n20 /opt/data/active_enhancers_across_cell_type.bed 

From our linkage disequilibrium analysis above, we know that variant rs1558902 is at position: 

    chr16 53803574

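We generate a BED file for this SNP. BED intervals are 0-based and half-open, so the 1-based position 53803574 becomes start 53803573 and end 53803574: 


In [None]:
!printf "chr16\t53803573\t53803574\n" > rs1558902.bed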

In [None]:
!cat rs1558902.bed

We can now use bedtools intersect to check whether the variant falls into an active promoter or enhancer in any cell type. 

In [None]:
## Use the bedtools intersect command to check whether the variant above falls into an active promoter or enhancer region. 