# 3.3 Finding enriched GO terms in the dataset with GOrilla

### IMPORTANT: Please make sure that you are using the bash kernel to run this notebook. ###


In [None]:
### Set up variables storing the location of our data
### Normally these variables would be loaded by sourcing ~/.bashrc, but that is very slow in IPython
export SUNETID="$(whoami)"
export WORK_DIR="/scratch/${SUNETID}"
export DATA_DIR="${WORK_DIR}/data"
[[ ! -d "${DATA_DIR}" ]] && mkdir -p "${DATA_DIR}"
export SRC_DIR="${WORK_DIR}/src"
[[ ! -d ${WORK_DIR}/src ]] && mkdir -p "${WORK_DIR}/src"
export METADATA_DIR="/metadata"
export AGGREGATE_DATA_DIR="/data"
export AGGREGATE_ANALYSIS_DIR="/outputs"
export YEAST_DIR="/saccer3"
export TMP="${WORK_DIR}/tmp"
export TEMP=$TMP
export TMPDIR=$TMP
[[ ! -d ${TMP} ]] && mkdir -p "${TMP}"



In [None]:
cd $WORK_DIR

In [None]:
ls

In the previous tutorial (3.1) we calculated the contribution of each peak in the dataset to principal components 1, 2, and 3.

Now, we sort the pc_contribs\*.txt files. 

Then, we map the peaks to their nearest genes. 
 
Finally, we pass the resulting ranked list to software such as GOrilla (http://cbl-gorilla.cs.technion.ac.il/), which accepts a ranked list of genes and outputs GO terms that are overrepresented in the data.



In [None]:
head pc1_contribs.txt

In [None]:
# Sort the pc1_contribs.txt file by chromosome and start position with bedtools sort
bedtools sort -i pc1_contribs.txt > pc1_contribs.sorted.txt

In [None]:
head pc1_contribs.sorted.txt

To map peaks to their nearest gene, we need the sacCer3 gene coordinates. These are stored in the file **$YEAST_DIR/yeast_tss_coords.bed**

In [None]:
head $YEAST_DIR/yeast_tss_coords.bed

In [None]:
# Running bedtools closest without arguments prints its usage message
bedtools closest

In [None]:
cd $WORK_DIR
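# Map the sorted peaks to their nearest genes.
# -D a appends a signed distance column, with upstream/downstream defined relative to the features in -a.
bedtools closest -D a -a pc1_contribs.sorted.txt -b "${YEAST_DIR}/yeast_tss_coords.bed" > pc1_contribs.togene.txt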


In [None]:
head pc1_contribs.togene.txt
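
When reading the `bedtools closest` output it helps to know which column number holds which value, since later steps select columns by position. The following is a minimal sketch on a made-up row; the layout (four peak columns, four gene columns, one distance column) and all values are assumptions for illustration, so check them against your own `head` output.

```bash
# Toy row imitating `bedtools closest -D a` output. All values are invented,
# and the 4 + 4 + 1 column layout is an assumption, not taken from the real files.
tmpdir="$(mktemp -d)"
printf 'chrI\t100\t200\t0.42\tchrI\t350\t351\tYAL001C\t151\n' > "${tmpdir}/toy.togene.txt"

# Print every field of the first row next to its column number.
head -n 1 "${tmpdir}/toy.togene.txt" \
  | awk -F'\t' '{ for (i = 1; i <= NF; i++) print i ": " $i }' > "${tmpdir}/columns.txt"
cat "${tmpdir}/columns.txt"
```

Replacing the toy file with `pc1_contribs.togene.txt` shows the numbering for the real data.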

In [None]:
# Sort the pc1_contribs.togene.txt file numerically by column 4, in ascending and descending order.
sort -n -k4,4 pc1_contribs.togene.txt > pc1_contribs.ascending.togene.txt
sort -nr -k4,4 pc1_contribs.togene.txt > pc1_contribs.descending.togene.txt
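
Because column 4 holds numbers, the sort key has to be compared numerically for the ranking to be meaningful. The sketch below, on invented values, contrasts a plain `sort -k4,4` with `sort -n -k4,4`; the file name and numbers are made up for illustration.

```bash
# Toy file with invented values in column 4.
tmpdir="$(mktemp -d)"
printf 'chrI\t1\t2\t10\nchrI\t3\t4\t9\nchrI\t5\t6\t-2\nchrI\t7\t8\t100\n' > "${tmpdir}/toy.txt"

# Without -n the key is compared as text (LC_ALL=C makes the byte order explicit).
lexical="$(LC_ALL=C sort -k4,4 "${tmpdir}/toy.txt" | cut -f4 | paste -sd' ' -)"

# With -n the key is compared as a number.
numeric="$(sort -n -k4,4 "${tmpdir}/toy.txt" | cut -f4 | paste -sd' ' -)"

echo "lexical: ${lexical}"
echo "numeric: ${numeric}"
```

In the text comparison, 100 lands before 9 because the character `1` sorts before `9`. If the values are written in scientific notation, GNU sort's `-g` (general numeric) option may be needed in place of `-n`.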


In [None]:
head pc1_contribs.ascending.togene.txt 

In [None]:
head pc1_contribs.descending.togene.txt 

GOrilla expects a ranked list of genes, so we use the cut command to extract just the gene name column (column 8) from the sorted files.

In [None]:
cut -f8 pc1_contribs.ascending.togene.txt > pc1_contribs.ascending.geneonly.txt
cut -f8 pc1_contribs.descending.togene.txt > pc1_contribs.descending.geneonly.txt
head pc1_contribs.ascending.geneonly.txt
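
Several peaks can share the same nearest gene, so a gene name may appear more than once in the ranked list. If a list with one entry per gene is more convenient, the following sketch keeps only the first (highest-ranked) occurrence of each name while preserving the order. The toy file and gene names are invented, and whether this step is needed for GOrilla is an assumption rather than something this tutorial states.

```bash
# Toy ranked gene list in which YAL001C appears twice (names are illustrative).
tmpdir="$(mktemp -d)"
printf 'YAL001C\nYBR020W\nYAL001C\nYCR012W\n' > "${tmpdir}/toy.geneonly.txt"

# awk prints a line only the first time it is seen, so rank order is preserved.
awk '!seen[$0]++' "${tmpdir}/toy.geneonly.txt" > "${tmpdir}/toy.geneonly.unique.txt"
cat "${tmpdir}/toy.geneonly.unique.txt"
```

Unlike `sort -u`, this approach does not reorder the list, which matters because the rank is the information being passed on.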

Exercise: Copy and paste the lists of genes into GOrilla (http://cbl-gorilla.cs.technion.ac.il/) to discover overrepresented GO terms.

Exercise: Repeat the above analysis for genes contributing to PC2 and PC3.