# Big Data for Biologists: Decoding Genomic Function- Class 10

## What is GO term enrichment analysis?

##  Learning Objectives
***Students should be able to***

<ol>
<li> <a href=#catwc> Use the Unix command wc (word count) to count the lines in a file.</a> </li>
 <li> <a href=#GOtermIntro> Describe how the Gene Ontology is organized and what a "GO term" means. </a> </li>
 <li> <a href=#GOtermenrichment> Explain what GO term enrichment is </a></li>
 <li> <a href=#GeneIDtoName> Convert GeneIDs to gene names using the unix grep command </a></li>
 <li> <a href=#GOrilla> Use the GOrilla website to identify GO terms enriched in a set of genes </a></li>
 </ol>

### Load data and import helper functions

In [None]:
%%capture

#Imports helper functions for loading RNA_Seq data and kmeans algorithm 

%matplotlib inline
%load_ext autoreload
%autoreload 2
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import sys
sys.path.append('../helpers')
from kmeans_helpers import * 
from RNAseq_helpers import * 

## RNA-seq analysis workflow 
<img src="../Images/10-RNApipeline.png" alt="RNA pipeline" width="300" height="200"/>


## Use the Unix command wc (word count) to count the lines in a file.<a name='catwc' />

In the previous session, we ran K-means clustering on the gene-by-sample RNA-seq matrix. Here, we examine the functional enrichment of Gene Ontology terms in each cluster.

First, we calculate the number of genes in each cluster. Each cluster file lists one gene per line, so we can use the `wc` (word count) command to quickly count the lines in a file.
The `wc -l` command prints the number of newline characters in the file. Other flags can be used to print the number of words or characters.

We can use the `--help` argument to learn how a Unix command is used:

In [None]:
!wc --help

In [None]:
#Count the number of genes in each cluster 
#Cluster 0 
!wc -l 0.txt

In [None]:
#Count the number of genes in clusters 1, 2 and 3
!wc -l 1.txt
!wc -l 2.txt 
!wc -l 3.txt
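
The same count can be obtained in Python, which is convenient when there are many cluster files. The sketch below is only an illustration: the helper name `count_lines` is hypothetical, and it assumes the cluster files `0.txt` to `3.txt` sit in the current working directory, as in the `wc` cells above.

```python
from pathlib import Path


def count_lines(path):
    """Count newline characters in a file, which is what `wc -l` reports."""
    total = 0
    with open(path, "rb") as handle:
        # Read in 1 MB chunks so that large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            total += chunk.count(b"\n")
    return total


# Report the number of genes (lines) in each cluster file that is present.
for name in ["0.txt", "1.txt", "2.txt", "3.txt"]:
    if Path(name).exists():
        print(name, count_lines(name))
```

Like `wc -l`, this counts newline characters, so a final line that lacks a trailing newline is not counted.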

## How is the gene ontology organized and what is a GO term? <a name='GOtermIntro' />

An ontology represents knowledge about some subject domain. An ontology consists of two parts: 
* Well-defined terms 
* Relationships between the terms. 

[The Gene Ontology](http://www.geneontology.org/) provides a way to annotate known information about genes. The Gene Ontology seeks to answer three questions about each gene: 


* Which functions does the gene product exert? (**Molecular Function**) 

* With which biological process is the gene product associated? (**Biological Process**) 

* Where in the cell is the gene product located or active (for example, a cellular compartment or a protein complex)? (**Cellular Component**)


![GO Explanation](../Images/9-GOexplanation.png)

(figure credit: Rachel Huntley, "Introduction to the Gene Ontology and GO annotation resources", http://slideplayer.com/slide/7009132/)

Gene Ontology terms are organized as a graph, where each GO term is a node and the relationships between the terms are edges between the nodes. GO is loosely hierarchical, with 'child' terms being more specialized than their 'parent' terms. Unlike a strict hierarchy, however, a term may have more than one parent term, so the number of levels between a term and the root of the ontology depends on the path taken. 

For example: 
![GO Example](../Images/9-GOexample.png)
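
The graph structure described above can be sketched with a small Python dictionary. This is a toy illustration rather than real GO data: the four terms and their parent links were chosen by hand, and the names `PARENTS` and `ancestors` are hypothetical. It shows how a term with two parents inherits every term above both of them.

```python
# Toy graph: each term maps to the list of its parent terms.
# These entries are for illustration only and were not loaded from the Gene Ontology.
PARENTS = {
    "biological_process": [],
    "cellular process": ["biological_process"],
    "metabolic process": ["biological_process"],
    "cellular metabolic process": ["cellular process", "metabolic process"],
}


def ancestors(term, parents=PARENTS):
    """Return every term that can be reached by following parent edges upward."""
    found = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found


# The child term has two parents, and both lead back to the same root.
print(sorted(ancestors("cellular metabolic process")))
```

A gene annotated with a specific term is implicitly annotated with all of that term's ancestors, which is why enrichment results often list several related terms together.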

## What is GO term enrichment?  <a name='GOtermenrichment' />

In [None]:
from IPython.display import HTML
HTML('<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vRqyb_exm8Yzfe8_PfhCLGl5FwFLNerBoYJD7JVIsfnbNbEhu2_F8efs8UJCY9jTyB9SOTaw6a7eJWn/embed?start=false&loop=false&delayms=60000" frameborder="0" width="800" height="749" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>')

[GOrilla](http://cbl-gorilla.cs.technion.ac.il/) is a tool for identifying and visualizing enriched GO terms in ranked lists of genes.
It can be run in one of two modes:

*    Searching for enriched GO terms that appear densely at the top of a ranked list of genes, or
*    Searching for enriched GO terms in a target list of genes compared to a background list of genes. We will use this mode to identify GO terms that are enriched in the set of 1543 differentially expressed genes identified in our four tissues of interest, compared with the background of all genes in the hg19 reference genome. 

First, we will run all 1543 differentially expressed genes through GOrilla to determine whether any GO terms are significantly enriched relative to the background of all genes in the hg19 reference genome. We have written the differentially expressed genes to an output file called 'differential_gene_ids.txt'. The file 'hg19.txt' contains a list of all gene IDs and gene names in the hg19 reference genome.  

GOrilla uses a hypergeometric test (equivalent to a one-sided Fisher's exact test) to identify enriched GO terms: it asks whether a GO term annotates more genes in the target set than would be expected by chance, given how often that term occurs in the background. 
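
To make the hypergeometric test concrete, here is a minimal sketch that uses only the Python standard library. It is not GOrilla's implementation. The function name and the symbols are assumptions made for this example: `N` is the number of background genes, `B` the number of background genes annotated with the GO term, `n` the size of the target set, and `b` the number of target genes annotated with the term. The p-value is the probability of seeing `b` or more annotated genes when `n` genes are drawn at random without replacement.

```python
from math import comb


def hypergeom_enrichment_pvalue(N, B, n, b):
    """Return P(X >= b) for a hypergeometric draw of n genes from N, B of them annotated."""
    upper = min(n, B)  # the target set cannot contain more annotated genes than this
    total = comb(N, n)  # number of ways to choose any target set of size n
    # Number of target sets that contain k annotated genes, summed over k = b..upper.
    tail = sum(comb(B, k) * comb(N - B, n - k) for k in range(b, upper + 1))
    return tail / total


# Toy numbers, chosen only for illustration: 20 background genes, 5 annotated,
# and a target set of 4 genes of which 2 are annotated.
print(hypergeom_enrichment_pvalue(N=20, B=5, n=4, b=2))
```

In a real analysis this test is repeated for every GO term, so the resulting p-values need a multiple-testing correction before they are interpreted.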

In [None]:
!head differential_gene_ids.txt

## Convert GeneIDs to Gene Names <a name='GeneIDtoName' />

GOrilla expects gene names rather than gene IDs as input, so we must convert the Ensembl IDs to the official gene symbols. The code below performs this conversion. 

In [None]:
!wc -l gene_id_to_gene_name.txt

In [None]:
#The file "gene_id_to_gene_name.txt" maps gene IDs to gene names. Examine the contents of this file: 
!head gene_id_to_gene_name.txt

In [None]:
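#Use grep to find the lines of the mapping file that contain any of the gene IDs in our list.
#-f reads the patterns from a file, -F treats them as fixed strings, and -w matches whole words only.
!grep -F -w -f differential_gene_ids.txt gene_id_to_gene_name.txt > tmp.txt
#Let's examine the output of the grep command: 
!head tmp.txt 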

In [None]:
#select the second column from the grep output
!cut -f2 tmp.txt > differential_gene_names.txt

#Examine the first 10 lines in the resulting file.
!head -n10 differential_gene_names.txt
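
The `grep` and `cut` steps above can also be written in Python with a dictionary lookup. This is a minimal sketch, assuming that `gene_id_to_gene_name.txt` is tab-separated with the gene ID in column 1 and the gene name in column 2 (the layout that `cut -f2` relies on). The function names are hypothetical.

```python
def load_id_to_name(mapping_path):
    """Read a tab-separated file of gene ID and gene name into a dictionary."""
    id_to_name = {}
    with open(mapping_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                id_to_name[fields[0]] = fields[1]
    return id_to_name


def convert_ids(ids_path, id_to_name):
    """Return the gene name for every gene ID in ids_path that has a mapping."""
    names = []
    with open(ids_path) as handle:
        for line in handle:
            gene_id = line.strip()
            if gene_id in id_to_name:
                names.append(id_to_name[gene_id])
    return names


# Example usage with the files from this notebook (left commented out so that
# the cell also runs where the files are not available):
# id_to_name = load_id_to_name("gene_id_to_gene_name.txt")
# names = convert_ids("differential_gene_ids.txt", id_to_name)
```

Unlike `grep -f`, which prints matches in the order of the mapping file, this version returns names in the order of the ID list and matches each ID exactly.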

In [None]:
# Cut the second column from the file hg19.txt to get the names (rather than the Ensembl IDs) of all genes in hg19. 
!cut -f2 hg19.txt > hg19.names.txt 
!head -n10 hg19.names.txt

In the next section, we will be using a web browser for the GO term analysis. 

To access the differential_gene_names.txt and hg19.names.txt files from the browser you will need to download these files from the cloud server to your computer. 

To download the files, click File under the Jupyter icon and then Download. Make sure to download each file from within the Jupyter notebook window and not from the browser window. If you use the browser window, the file will download as an .html file instead of a .txt file, which will not work for the next steps. 

## Use the GOrilla website to identify GO terms enriched in a set of genes <a name='GOrilla' />

We are now ready to use GOrilla to check for enriched GO terms. 
First, navigate to the GOrilla portal: <a href="http://cbl-gorilla.cs.technion.ac.il/">http://cbl-gorilla.cs.technion.ac.il/</a>

Follow these steps in the GOrilla portal: 
* Select "Homo sapiens" for "Choose organism" 
* Select "Two unranked lists of genes" from "Choose running mode" 
* Upload the file "differential_gene_names.txt" for the Target set. 
* Upload the file "hg19.names.txt" for the Background set. 
* Select "All" under "Choose an ontology". 
* Click on "Search Enriched GO terms"  
* Examine the output by clicking on "Process", "Function", and "Cellular Component" tabs.  



The top hits for Process should be: 
![Process](../Images/10_process_go_allgenes.png)

The top hits for Function should be: 
![Function](../Images/10_function_go_allgenes.png)

The top hits for Cellular Component should be:
![Cellular_Component](../Images/10_component_go_allgenes.png)

GOrilla also generates the graph of inter-related GO terms, color-coding them by significance.

Re-run the GOrilla analysis with each cluster of genes: upload 0.txt, 1.txt, 2.txt, and 3.txt to GOrilla in turn, and compare the significant Process, Function, and Cellular Component GO terms that are returned.

## Cluster 0 
From the heatmap, Cluster 0 appears to contain genes that are downregulated in Blood and upregulated in Embryonic cells. 

### Process: 
![0_process](../Images/10_0_process.png)
### Function: 
![0_function](../Images/10_0_function.png)
### Cellular Component:
![0_component](../Images/10_0_component.png)

## Cluster 1 
From the heatmap, Cluster 1 contains genes that are moderately upregulated in Blood and the Immune system. 

### Process: 
![1.process](../Images/10_1_process.png)
### Function: 
![1.function](../Images/10_1_function.png)
### Cellular Component:
![1.component](../Images/10_1_component.png)

## Cluster 2 
From the heatmap, Cluster 2 contains genes that are strongly upregulated in Blood and the Immune system. 


### Process: 
![2.process](../Images/10_2_process.png)
### Function: 
![2.function](../Images/10_2_function.png) 
### Cellular Component:
![2.component](../Images/10_2_component.png) 

## Cluster 3

From the heatmap, cluster 3 contains genes that are upregulated in the Respiratory system and Embryonic samples. 

### Process: 
![3.process](../Images/10_3_process.png) 
### Function: 
![3.function](../Images/10_3_function.png) 
### Cellular Component:
![3.component](../Images/10_3_component.png)


## Use the goatools Python module to run GO term enrichment <a name='Goattools' />

Web-based tools such as GOrilla are convenient for small numbers of queries. However, you will often need to perform a more complex analysis, with dozens of clusters instead of just 4. 

For multiple queries, it can be helpful to perform the analysis programmatically. Python has a library called [goatools](https://github.com/tanghaibao/goatools) that can be used for this purpose. 

The goatools library provides a script called **find_enrichment.py** to find enriched GO terms in a list of genes. Let's examine the syntax of this script: 

In [None]:
from goatools import * 

In [None]:
!find_enrichment.py

There are 3 required arguments for the script: 

1. The dataset (list of genes in differential_gene_names.txt) 
2. The background (list of genes in hg19.names.txt) 
3. A file that associates gene names with GO terms (hg19.assocs). We generated this file in advance. It contains gene names in column 1 and the GO terms associated with each gene in column 2. 

There are also 2 optional arguments to the script that we will find useful: 

1. `--outfile`: the output file in which to save the enriched GO terms. 
2. `--pval_field=fdr_bh`: tells the script to filter results on the Benjamini-Hochberg false discovery rate, so that only GO terms enriched with a false discovery rate < 0.05 are returned. 
    
The module can be run using the following command. 
    

In [None]:
! find_enrichment.py 0.txt hg19.names.txt ../Weekly\ Assignments/Week_5/hg19.assocs --outfile differential_genes_enrichments.tsv --pval_field=fdr_bh
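
Because the aim of a programmatic analysis is to handle many clusters, the command above can be generated once per cluster file. The sketch below only builds and prints the commands. It reuses the arguments and paths from the cell above, while the helper name `build_enrichment_command` and the output naming scheme (`0_enrichments.tsv`, and so on) are assumptions made for this example.

```python
import shlex

# Paths taken from the command above; adjust them if your files live elsewhere.
BACKGROUND = "hg19.names.txt"
ASSOCIATIONS = "../Weekly Assignments/Week_5/hg19.assocs"


def build_enrichment_command(cluster_file, background=BACKGROUND, associations=ASSOCIATIONS):
    """Return the find_enrichment.py argument list for one cluster file."""
    # Hypothetical naming scheme: 0.txt -> 0_enrichments.tsv
    outfile = cluster_file.rsplit(".", 1)[0] + "_enrichments.tsv"
    return [
        "find_enrichment.py",
        cluster_file,
        background,
        associations,
        "--outfile",
        outfile,
        "--pval_field=fdr_bh",
    ]


# Print one shell-quoted command per cluster.
for cluster in ["0.txt", "1.txt", "2.txt", "3.txt"]:
    print(shlex.join(build_enrichment_command(cluster)))
```

To execute the commands instead of printing them, each list can be passed to `subprocess.run(command, check=True)`, provided that goatools is installed and the input files exist.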

### ***This concludes our analysis of RNA-seq data. Next, we will transition to discussing the mechanisms that regulate gene expression, starting with transcription factors, which are proteins that control the rate of gene expression.***