## Big Data for Biologists: Decoding Genomic Function

##  Learning Objectives
***Students should be able to***
<ol>
<li> <a href=#Workflow>Participate in a collaborative programming project to gain insights into the workflow for computational projects.</a></li>
<li> <a href=#Roles>Experience different roles in a computational project including code implementer and documentation provider.</a></li>
<li> <a href=#Dataanalysis> Apply data analysis methods from the course to new problems </a></li> 
</ol>


## Introduction to Course Project 

The course project will have three main steps.  

I.  You will be given the SNP identifiers (rs ids) for two variants in the human genome. For each of these variants you will: 

* write the code using the notebook guidelines below. 
* analyze the output of the code 
* create an initial draft of a report including:  
   <ol>
    <li> the likely causal variants </li>
    <li> an explanation of why they are the likely causal variants</li>
    <li> what you have learned about how the variant acts (i.e., through what type of mutation in a coding region or what type of element in a non-coding region)</li>
   </ol>

II. You will proofread and check your teammate's work. 
* Switch variants and run through each other's code.
* Your outputs should be identical to your teammate's.
* Add comments to annotate any code that may need clarification. 

III. Writing up the report (see rubric for guidelines). 

Deliverables: 
* Jupyter notebook with code 
* Writeup document with summary

### Grading Rubric: 

The rubric will be posted on Canvas with the assignment.

### Project files: 
All project files can be found in the folder **/data/project**

* /data/project/1kg_phase1_all*   -- binary variant files
* /data/project/gene_coords_hg19.bed.gz -- bed file of gene coordinates 
* /data/project/gencode.hg19.annotation.gtf -- gene annotation file 
* /data/project/motifs.bed.gz -- coordinates of all transcription factor-binding motifs in the genome. 
* /data/project/active_promoters_across_cell_type.bed.gz
* /data/project/active_enhancers_across_cell_type.bed.gz

### General Suggestions: 

You will be working with some large files for your course project. Please use the !head command to examine these files rather than the !cat command to avoid printing very large amounts of text to your notebook. 

Some of the files we have provided are compressed in the gzip format (these end with the .gz suffix). To examine these files, combine the zcat and head commands, as below: 

```
!zcat /data/project/gene_coords_hg19.bed.gz | head 
    
```
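If you prefer to stay in Python, the same kind of peek can be sketched with the standard `gzip` module. The helper name `head_gz` is hypothetical, and the demo writes a tiny throwaway file of made-up rows rather than reading the project data.

```python
import gzip
import itertools
import os
import tempfile

def head_gz(path, n=10):
    """Return the first n lines of a gzip-compressed text file."""
    with gzip.open(path, "rt") as handle:
        # islice stops after n lines, so the rest of the file is never read
        return [line.rstrip("\n") for line in itertools.islice(handle, n)]

# Demo on a small temporary file (made-up rows, not the project data)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bed.gz")
with gzip.open(demo_path, "wt") as out:
    for i in range(100):
        out.write(f"chr1\t{i * 10}\t{i * 10 + 5}\tgene{i}\n")

preview = head_gz(demo_path, n=3)
print("\n".join(preview))
```

Like `head`, this reads only the lines it needs, so it stays fast on large files.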

To make your code easier to follow, you may find it helpful to add additional comments in the code boxes. You can decide which comments will be helpful to include. 


## STEP 1:  Is either of the candidate causal variants in a protein coding region?  <a name='Dataanalysis'></a>

Your first task is to determine whether any of the candidate variants are in protein coding regions. That is, do they overlap a known protein coding region? 

We have provided an hg19 gene annotation file here: 

* **/data/project/gencode.hg19.annotation.gtf**

The annotations for CDS regions in this file include the text "CDS". Use the "grep" command to extract the CDS regions from this file, with a flag that limits the output to lines containing "CDS" as a whole word. Otherwise, lines with CDS embedded in other fields may also appear (see !grep --help for a list of flags). 
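To see why whole-word matching matters, here is a small Python sketch of the same idea. The three lines are made-up, simplified GTF-like records (not taken from the project file), and the `\b` word boundaries in the regular expression play the role that a whole-word flag plays in grep.

```python
import re

# Made-up, simplified GTF-like lines for illustration only
lines = [
    'chr1\tHAVANA\tCDS\t100\t200\t.\t+\t0\tgene_name "GENE_A";',
    'chr1\tHAVANA\texon\t100\t250\t.\t+\t.\tgene_name "GENE_A"; tag "CCDS";',
    'chr1\tHAVANA\tUTR\t201\t250\t.\t+\t.\tgene_name "GENE_A";',
]

# Plain substring search: also catches "CDS" hidden inside "CCDS"
substring_hits = [line for line in lines if "CDS" in line]

# Whole-word search: "CDS" must be bounded by non-word characters
whole_word_hits = [line for line in lines if re.search(r"\bCDS\b", line)]

print(len(substring_hits))
print(len(whole_word_hits))
```

The exon line is picked up by the substring search only because its attribute field happens to contain "CCDS", which is exactly the kind of false hit the whole-word restriction avoids.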

In [9]:
#BEGIN SOLUTION
#END SOLUTION

Next, you will convert the output from above into a file in bed format. Examine the output to decide which columns you need, then write them out in bed column order. 
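One detail to keep in mind when choosing columns: GTF coordinates are 1-based with an inclusive end, while BED coordinates are 0-based with an exclusive end. Below is a minimal sketch of that conversion, using a made-up GTF line and a hypothetical helper named `gtf_line_to_bed`.

```python
def gtf_line_to_bed(gtf_line):
    """Convert one GTF record to a (chrom, start, end) BED-style tuple."""
    fields = gtf_line.rstrip("\n").split("\t")
    chrom = fields[0]
    gtf_start = int(fields[3])  # 1-based, inclusive
    gtf_end = int(fields[4])    # 1-based, inclusive
    # BED start is 0-based, so subtract 1; the exclusive BED end equals the GTF end
    return (chrom, gtf_start - 1, gtf_end)

# Made-up record for illustration only
example = 'chr1\tHAVANA\tCDS\t100\t200\t.\t+\t0\tgene_name "GENE_A";'
bed_record = gtf_line_to_bed(example)
print("\t".join(str(value) for value in bed_record))
```

Whether the one-base shift changes your result depends on where a variant falls relative to the region boundary, so it is worth checking on your own data.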

In [20]:
#BEGIN SOLUTION
#END SOLUTION

Now, you will use one of the bedtools commands we have discussed to overlap the CDS file with the coordinates of your assigned variants. 
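Conceptually, an overlap test asks whether two half-open intervals on the same chromosome share at least one base. The sketch below illustrates that logic in plain Python on made-up coordinates. It is not a replacement for bedtools, and the function name `overlaps` is hypothetical.

```python
def overlaps(a, b):
    """True if BED-style intervals a and b (chrom, start, end) share at least one base."""
    a_chrom, a_start, a_end = a
    b_chrom, b_start, b_end = b
    return a_chrom == b_chrom and a_start < b_end and b_start < a_end

# Made-up CDS intervals and single-base variants in BED coordinates
cds_regions = [("chr1", 99, 200), ("chr2", 500, 650)]
variants = {
    "snp_inside": ("chr1", 150, 151),
    "snp_outside": ("chr1", 200, 201),  # starts exactly at the exclusive end
}

# For each variant, check whether it overlaps any CDS interval
coding = {
    name: any(overlaps(pos, cds) for cds in cds_regions)
    for name, pos in variants.items()
}
print(coding)
```

The second variant sits one base past the region and is reported as non-overlapping, which shows why the exclusive end coordinate matters.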

In [None]:
#BEGIN SOLUTION  
#END SOLUTION 

You should find that one of your variants is in a coding region. For this variant, you do NOT need to investigate its linked variants, because the variant likely directly affects the sequence of the protein that the gene encodes. You only need to complete STEP 2 for the coding variant. The coding variant should also be considered in STEP 11. 

For the variant that is in a non-coding portion of the genome, its linked SNPs must be examined to determine the variant's most likely mechanism of action. Proceed to STEPS 3 through 11 for the variant that is in a non-coding region. 

## STEP 2:  Has the coding variant been linked to a disease? If so, which one?  What is known about how the variant could affect transcription or translation?


The [GWAS Catalog](https://www.ebi.ac.uk/gwas/) is a curated database of GWAS studies. Look up known variant-phenotype associations in the GWAS Catalog. 

We also recommend looking up the variant in the [Global Biobank Engine](https://biobankengine.stanford.edu/).

Analyze what you have observed in the GWAS Catalog and Global Biobank Engine. Include any diseases that the variant has been linked to and how it could affect transcription or translation. 

**ANSWER HERE:**



## STEP 3: Given a target non-coding variant that has been linked to a disease, what are the candidate causal variants in LD with the target variant?   

To ensure the highest likelihood of discovering the causal SNP, please investigate multiple variants in LD with your non-coding SNP and discuss them in your writeup. If you find that your non-coding SNP has high LD with more than five SNPs, please investigate and discuss the five SNPs with the highest  r^2 LD score. 
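Choosing the five strongest LD partners amounts to sorting by r^2 and keeping the top of the list. Here is a small sketch with invented SNP names and r^2 values; the real values come from your own LD calculation.

```python
# Invented (snp_id, r2) pairs for illustration only
ld_results = [
    ("snp_a", 0.91), ("snp_b", 0.42), ("snp_c", 0.99), ("snp_d", 0.85),
    ("snp_e", 0.97), ("snp_f", 0.88), ("snp_g", 0.63),
]

# Sort from highest to lowest r^2, then keep at most five entries
top_five = sorted(ld_results, key=lambda pair: pair[1], reverse=True)[:5]

for snp_id, r2 in top_five:
    print(f"{snp_id}\t{r2:.2f}")
```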
    

In [None]:
#Find all single nucleotide polymorphisms (SNPs) in linkage disequilibrium (LD) with your target variant.

#BEGIN SOLUTION 
#END SOLUTION 

Generate a Manhattan plot with SNP position along the x-axis and r^2 of all LD SNPs along the y-axis. 

In [None]:
#BEGIN SOLUTION 
#END SOLUTION

## STEP 4:  Are any of the variants in high LD with the non-coding SNP located within an exon? 

Repeat the step 1 analysis, but this time search for EXONS and examine the SNPs in high LD with your non-coding SNP. 

In [None]:
## BEGIN SOLUTION 
## END SOLUTION 

## STEP 5:  Are any variants from Step 4 in protein coding regions? If so, have they been linked to a disease? Which one?  What is known about how the variant could affect transcription or translation?

In [None]:
#Determine if any of the SNPs in LD with the non-coding variant are in protein coding regions
#BEGIN SOLUTION 
#END SOLUTION 

As in STEP 2 above, it may help to look up any protein coding variants in the [GWAS Catalog](https://www.ebi.ac.uk/gwas/) and the [Global Biobank Engine](https://biobankengine.stanford.edu/). 

Analyze what you have observed in the GWAS Catalog and Global Biobank Engine. Include any diseases that the variant has been linked to and how it could affect transcription or translation. 

**ANSWER HERE:**

## STEP 6: Is the original non-coding variant, or any of the SNPs in high LD with it, located within a promoter region? If so, what are the relevant cell types?  

You may find the file **/data/project/active_promoters_across_cell_type.bed.gz** useful in performing the tasks below. 
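Once you have the overlapping intervals, listing cell types is a matter of collecting one column per SNP. The sketch below assumes, purely for illustration, that each overlap row carries a cell-type label as its last field. Check the actual layout of the project file with `!zcat /data/project/active_promoters_across_cell_type.bed.gz | head` before relying on this. All rows and names here are made up.

```python
from collections import defaultdict

# Made-up overlap output: (snp_id, chrom, start, end, cell_type)
# ASSUMPTION: a cell-type label is available as a column; the real file may differ
overlap_rows = [
    ("snp_a", "chr1", 1000, 1800, "liver"),
    ("snp_a", "chr1", 1000, 1800, "heart"),
    ("snp_c", "chr1", 5200, 5900, "liver"),
    ("snp_a", "chr1", 1000, 1800, "liver"),  # duplicate row
]

cell_types_by_snp = defaultdict(set)
for snp_id, _chrom, _start, _end, cell_type in overlap_rows:
    cell_types_by_snp[snp_id].add(cell_type)  # a set drops duplicates

for snp_id in sorted(cell_types_by_snp):
    print(snp_id, sorted(cell_types_by_snp[snp_id]))
```

The same grouping idea applies to any element type where one variant can overlap regions active in several cell types.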

In [None]:
#Determine if any of the candidate variants are in promoter regions  
#BEGIN SOLUTION 
#END SOLUTION 

In [None]:
#List the cell types where the candidate variants are in active promoters
#BEGIN SOLUTION 
#END SOLUTION 

## STEP 7: Is the original non-coding variant, or any of the SNPs in high LD with it, located in an enhancer? If so, what are the relevant cell types? 

You may find the file **/data/project/active_enhancers_across_cell_type.bed.gz** useful in performing the tasks below. 

In [None]:
#Determine if any of the candidate variants are in enhancer regions  
#BEGIN SOLUTION 
#END SOLUTION 

In [None]:
#List the cell types where the candidate variants are in active enhancers
#BEGIN SOLUTION 
#END SOLUTION 

## STEP 8: Which transcription factor motifs overlap the original non-coding variant or any SNPs of interest in high LD with the non-coding variant?

You may find the  file **/data/project/motifs.bed.gz** useful for performing the task below. 

Note: If you don't find any transcription factor motifs that overlap a specific SNP, you can expand the search to include motifs that overlap the active promoters/enhancers from STEPS 6 and 7. 

In [None]:
#Determine which transcription factor motifs overlap with the SNPs.
#BEGIN SOLUTION 
#END SOLUTION 

## STEP 9: Look up the Transcription Factor(s) identified in Step 8 in Gene Cards or another browser. What is known about the transcription factor(s)?  

### Introduction to Gene Cards 

[Gene Cards](http://www.genecards.org/) is a database of information about human genes. It provides information about gene function, tissue-specific expression, as well as journal articles where a given gene is mentioned. 
Look up the relevant genes in Gene Cards. What is the function of each gene? 


## STEP 10: Identify candidate target genes (genes that are in the vicinity of the variant).


You may find the file **/data/project/gene_coords_hg19.bed.gz** useful. 
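One simple way to define "in the vicinity" is the distance from the variant to the nearest edge of each gene, with zero meaning the variant lies inside the gene. Here is a sketch on made-up gene coordinates. The window of 100,000 bases is an arbitrary illustrative choice, not a course requirement, and `distance_to_gene` is a hypothetical helper.

```python
def distance_to_gene(snp_pos, gene_start, gene_end):
    """Distance in bases from a 0-based SNP position to a half-open gene interval."""
    if snp_pos < gene_start:
        return gene_start - snp_pos
    if snp_pos >= gene_end:
        # The last base of the gene is at gene_end - 1
        return snp_pos - gene_end + 1
    return 0  # SNP lies inside the gene

# Made-up genes on one chromosome: (name, start, end)
genes = [("GENE_A", 1000, 5000), ("GENE_B", 60000, 90000), ("GENE_C", 400000, 450000)]
snp_pos = 52000
window = 100_000  # arbitrary illustrative cutoff

# Keep genes within the window, sorted from nearest to farthest
nearby = sorted(
    (distance_to_gene(snp_pos, start, end), name)
    for name, start, end in genes
    if distance_to_gene(snp_pos, start, end) <= window
)
print(nearby)
```

Proximity alone does not establish a target gene, so treat the resulting list as candidates to follow up with the lookups below.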

In [None]:
## BEGIN SOLUTION 
## END SOLUTION 

Look up the function of these genes in [Gene Cards](http://www.genecards.org/).

Visualize your variant in the [WashU Browser](http://epigenomegateway.wustl.edu/browser/) to determine which genes lie nearby. How near is each SNP to a candidate gene? 

To export screenshots from the WashU Browser, go to **Tracks** in the menu bar and select **Screenshot**.
![WashU Screenshot](../Images/15_BrowserScreenshot.png)

Select **show track name** and click on **Take screenshot**.
![Browser Screenshot 2](../Images/15_BrowserScreenshot2.png)

## STEP 11: Using all of the information together, select your top 5 most likely causal variants. 