# 3.3 Finding enriched GO terms in the dataset with GOrilla

### IMPORTANT: Please make sure that you are using the bash kernel to run this notebook. ###


In [1]:
### Set up variables storing the location of our data
### The proper way to load your variables is by sourcing the ~/.bashrc file, but this is very slow in IPython
export SUNETID="$(whoami)"
export WORK_DIR="/srv/scratch/training_camp/work/${SUNETID}"
export DATA_DIR="${WORK_DIR}/data"
[[ ! -d ${WORK_DIR}/data ]] && mkdir "${WORK_DIR}/data"
export SRC_DIR="${WORK_DIR}/src"
[[ ! -d ${WORK_DIR}/src ]] && mkdir -p "${WORK_DIR}/src"
export METADATA_DIR="/srv/scratch/training_camp/metadata"
export AGGREGATE_DATA_DIR="/srv/scratch/training_camp/data"
export AGGREGATE_ANALYSIS_DIR="/srv/scratch/training_camp/aggregate_analysis"
export YEAST_DIR="/srv/scratch/training_camp/saccer3"
export TMP="${WORK_DIR}/tmp"
export TEMP=$TMP
export TMPDIR=$TMP
[[ ! -d ${TMP} ]] && mkdir -p "${TMP}"





In [2]:
cd $WORK_DIR



In [3]:
ls

all.fc.bigwig	      all.tagAlign.files.txt   pc3_contribs.txt
all.fc.txt	      data		       pgoddard_rtt109_YPD_1_out
all_merged.peaks.bed  hrosenbl_asf1_YPD_1_out  src
all.peaks.bed	      narrowPeak_files.txt     tmp
all.peaks.sorted.bed  pc1_contribs.txt	       WT-SCD-0_6MNaCl-Rep1_out
all.readcount.txt     pc2_contribs.txt	       WT-SCD-0_6MNaCl-Rep2_out


In the previous tutorial (3.1) we calculated the contribution of each peak in the dataset to principal components 1, 2, and 3.

Now, we sort the pc\*_contribs.txt files by each peak's contribution.

Then, we map the peaks to their nearest genes. 
 
Finally, we submit the resulting ranked gene list to a tool such as GOrilla (http://cbl-gorilla.cs.technion.ac.il/), which accepts a ranked list of genes and outputs the GO terms that are overrepresented near the top of that list.



In [4]:
head pc1_contribs.txt

chrIV	434072	434393	-0.0616686185905887
chrII	443796	445325	-0.0588197973192415
chrXVI	125756	125981	-0.0577292944721955
chrVII	110196	110840	-0.0566929315383855
chrIV	946172	946834	-0.0557534137561471
chrXIII	480480	480670	-0.0543690085795371
chrIV	1165638	1165873	-0.0533390655801499
chrVII	828251	828932	-0.0532967343341985
chrXIV	102595	103081	-0.0508382428149161
chrV	61653	62071	-0.0507213730808103


In [8]:
#sort the pc1_contribs.txt file by the fourth column, which contains each peak's contribution to the PC.
sort -g -k4,4 pc1_contribs.txt > pc1_contribs.ascending.txt

#get the descending peak list by sorting in reverse 
sort -gr -k4,4 pc1_contribs.txt > pc1_contribs.descending.txt
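
The `-g` flag asks `sort` for a general numeric comparison. The sketch below uses a made-up three-row peak table (not taken from the dataset) to show why that matters if any contribution happens to be written in scientific notation: `-g` understands `1e-05`, whereas the plain numeric flag `-n` reads it as `1`. It assumes GNU `sort`, as found on most Linux systems.

```bash
# Toy four-column peak table (tab-separated); the values are invented for illustration.
toy=$(printf 'chrI\t100\t200\t0.002\nchrI\t300\t400\t1e-05\nchrII\t50\t150\t-0.03\n')

# -g (general numeric) parses scientific notation, so 1e-05 sorts below 0.002.
general=$(printf '%s\n' "$toy" | LC_ALL=C sort -g -k4,4 | cut -f4 | paste -sd' ' -)

# -n stops reading at the "e", treats 1e-05 as 1, and ranks it above 0.002.
plain=$(printf '%s\n' "$toy" | LC_ALL=C sort -n -k4,4 | cut -f4 | paste -sd' ' -)

echo "sort -g: $general"
echo "sort -n: $plain"
```

For files that only contain plain decimals, such as the rows shown by `head pc1_contribs.txt`, both flags give the same order; `-g` is simply the safer choice.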




In [9]:
head pc1_contribs.ascending.txt

chrIV	434072	434393	-0.0616686185905887
chrII	443796	445325	-0.0588197973192415
chrXVI	125756	125981	-0.0577292944721955
chrVII	110196	110840	-0.0566929315383855
chrIV	946172	946834	-0.0557534137561471
chrXIII	480480	480670	-0.0543690085795371
chrIV	1165638	1165873	-0.0533390655801499
chrVII	828251	828932	-0.0532967343341985
chrXIV	102595	103081	-0.0508382428149161
chrV	61653	62071	-0.0507213730808103


In [None]:
head pc1_contribs.descending.txt

To map peaks to their nearest gene, we need to know the sacCer3 gene coordinates. The transcription start site (TSS) of each gene is listed in the file **$YEAST_DIR/yeast_tss_coords.bed**

In [10]:
head $YEAST_DIR/yeast_tss_coords.bed

chrI	130798	130799	YAL012W	0	+
chrI	334	335	YAL069W	0	+
chrI	537	538	YAL068W-A	0	+
chrI	2168	2169	YAL068C	0	-
chrI	2479	2480	YAL067W-A	0	+
chrI	9015	9016	YAL067C	0	-
chrI	10090	10091	YAL066W	0	+
chrI	11950	11951	YAL065C	0	-
chrI	12045	12046	YAL064W-B	0	+
chrI	13742	13743	YAL064C-A	0	-


In [11]:
bedtools closest


Tool:    bedtools closest (aka closestBed)
Version: v2.17.0
Summary: For each feature in A, finds the closest 
	 feature (upstream or downstream) in B.

Usage:   bedtools closest [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf>

Options: 
	-s	Req. same strandedness.  That is, find the closest feature in
		B that overlaps A on the _same_ strand.
		- By default, overlaps are reported without respect to strand.

	-S	Req. opposite strandedness.  That is, find the closest feature
		in B that overlaps A on the _opposite_ strand.
		- By default, overlaps are reported without respect to strand.

	-d	In addition to the closest feature in B, 
		report its distance to A as an extra column.
		- The reported distance for overlapping features will be 0.

	-D	Like -d, report the closest feature in B, and its distance to A
		as an extra column. Unlike -d, use negative distances to report
		upstream features.
		The options for defining which orientation is "upstream" are:
		- "ref"

In [12]:
cd $WORK_DIR
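#Map the sorted peaks to their nearest gene TSS.
#-D appends the distance to the closest TSS as an extra column, with negative values for upstream features.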
bedtools closest -D a -a pc1_contribs.ascending.txt -b $YEAST_DIR/yeast_tss_coords.bed > pc1_contribs.ascending.togene.txt
bedtools closest -D a -a pc1_contribs.descending.txt -b $YEAST_DIR/yeast_tss_coords.bed > pc1_contribs.descending.togene.txt



In [13]:
head pc1_contribs.ascending.togene.txt

chrIV	434072	434393	-0.0616686185905887	chrIV	433496	433497	YDL008W	0	+	-576
chrII	443796	445325	-0.0588197973192415	chrII	444692	444693	YBR101C	0	-	0
chrXVI	125756	125981	-0.0577292944721955	chrXVI	126005	126006	YPL225W	0	+	25
chrVII	110196	110840	-0.0566929315383855	chrVII	112004	112005	YGL204C	0	-	1165
chrIV	946172	946834	-0.0557534137561471	chrIV	946806	946807	YDR242W	0	+	0
chrXIII	480480	480670	-0.0543690085795371	chrXIII	480189	480190	YMR106C	0	-	-291
chrIV	1165638	1165873	-0.0533390655801499	chrIV	1164659	1164660	YDR345C	0	-	-979
chrVII	828251	828932	-0.0532967343341985	chrVII	828624	828625	YGR164W	0	+	0
chrXIV	102595	103081	-0.0508382428149161	chrXIV	102231	102232	YNL284C-B	0	-	-364
chrXIV	102595	103081	-0.0508382428149161	chrXIV	102231	102232	YNL284C-A	0	-	-364
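
The last two rows above describe the same peak: when several features in B are tied for closest, `bedtools closest` reports each of them on its own line, so the mapped file can have more rows than the peak file. As a rough sketch, the tied peaks can be listed with `awk`. The three sample rows are copied from the output above so the snippet runs on its own; in practice the `awk` command would read `pc1_contribs.ascending.togene.txt` instead.

```bash
# Three rows copied from the output above (tab-separated).
sample=$(printf '%s\n' \
  $'chrVII\t828251\t828932\t-0.0532967343341985\tchrVII\t828624\t828625\tYGR164W\t0\t+\t0' \
  $'chrXIV\t102595\t103081\t-0.0508382428149161\tchrXIV\t102231\t102232\tYNL284C-B\t0\t-\t-364' \
  $'chrXIV\t102595\t103081\t-0.0508382428149161\tchrXIV\t102231\t102232\tYNL284C-A\t0\t-\t-364')

# Key each row by its peak coordinates (columns 1-3), count the rows per peak,
# and collect the gene names (column 8). Print only peaks mapped to more than one gene.
ties=$(printf '%s\n' "$sample" | awk -F'\t' '
  {
    key = $1 ":" $2 "-" $3
    count[key]++
    if (key in genes) genes[key] = genes[key] "," $8
    else genes[key] = $8
  }
  END {
    for (key in count)
      if (count[key] > 1) print key, count[key], genes[key]
  }')

echo "$ties"
```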


In [14]:
head pc1_contribs.descending.togene.txt

chrXIII	458005	458409	0.0722876214884511	chrXIII	458407	458408	YMR096W	0	+	0
chrVI	19579	20032	0.0703114703344558	chrVI	17003	17004	YFL055W	0	+	-2576
chrXV	1058531	1059974	0.0683952492488027	chrXV	1059530	1059531	YOR382W	0	+	0
chrX	265103	265858	0.0634861880255522	chrX	265050	265051	YJL090C	0	-	-53
chrVII	1066859	1067088	0.0619424875138834	chrVII	1068990	1068991	YGR287C	0	-	1903
chrX	278556	279386	0.0615391834653188	chrX	278840	278841	YJL083W	0	+	0
chrVIII	379297	380235	0.0600915029232951	chrVIII	379198	379199	YHR139C	0	-	-99
chrXV	164309	165752	0.0596787268589957	chrXV	165713	165714	YOL083W	0	+	0
chrIX	426903	427901	0.0596772234690976	chrIX	424512	424513	YIR038C	0	-	-2391
chrXII	747273	747828	0.0596709512130742	chrXII	747936	747937	YLR308W	0	+	109


The GOrilla software expects a ranked list of genes, so we use the cut command to extract just the gene name column (column 8) from the files.

In [15]:
cut -f8 pc1_contribs.ascending.togene.txt > pc1_contribs.ascending.geneonly.txt
cut -f8 pc1_contribs.descending.togene.txt > pc1_contribs.descending.geneonly.txt
head pc1_contribs.ascending.geneonly.txt

YDL008W
YBR101C
YPL225W
YGL204C
YDR242W
YMR106C
YDR345C
YGR164W
YNL284C-B
YNL284C-A
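
Because several peaks can share the same nearest gene, a gene name may appear more than once in these lists. The notebook does not say whether the lists need to be de-duplicated before upload, so the following is only an optional sketch: it keeps the first (best-ranked) occurrence of each gene and preserves the ranking order. The repeated entries in the toy list are hypothetical and exist only to demonstrate the filter.

```bash
# Toy ranked list with hypothetical repeats (YDL008W and YBR101C appear twice).
ranked=$(printf '%s\n' YDL008W YBR101C YDL008W YPL225W YBR101C)

# seen[$0]++ is 0 (false) the first time a line is met, so !seen[$0]++ is true
# only for the first occurrence; awk prints those lines in their original order.
deduped=$(printf '%s\n' "$ranked" | awk '!seen[$0]++')

echo "$deduped"
```

The same filter could be applied to a real list, for example `awk '!seen[$0]++' pc1_contribs.ascending.geneonly.txt`, redirecting the output to a new file of your choosing.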


In [16]:
#Create symbolic links to the sorted gene lists in the notebooks directory so you can load them directly into GOrilla.

ln -s $WORK_DIR/pc1_contribs.ascending.geneonly.txt ~/training_camp/workflow_notebooks
ln -s $WORK_DIR/pc1_contribs.descending.geneonly.txt ~/training_camp/workflow_notebooks



In [17]:
ls ~/training_camp/workflow_notebooks

0.0 Introduction to Jupyter notebooks.ipynb
1.0 Big Ideas.ipynb
1.1 Unix Basics.ipynb
1.2 Getting ready to run code on the cluster.ipynb
2.0 The metadata file and analysis overview.ipynb
2.1_Sequencing_Data_Analysis.ipynb
3.1 Clustering analysis and PCA on fold change data .ipynb
3.2 Clustering analysis and PCA on count data.ipynb
3.4 Calling differentially expressed peaks with DESeq2 and limma .ipynb
3.4 Finding enriched GO Terms  with  GORilla.ipynb
3.5 GO Term enrichment for differentially accessible chromatin regions.ipynb
3.6 Finding TF motifs.ipynb
4.0 Introduction to shell scripts.ipynb
4.1 Introduction to SCG and Sherlock compute clusters.ipynb
4.2 Install Missing R packages.ipynb
atac.bds.20180911_232450_810.dag.js
atac.bds.20180911_232450_810.report.html
atac.bds.20180911_232452_967.dag.js
atac.bds.20180911_232452_967.report.html
hrosenbl_asf1_YPD_1_out
images
pc1_contribs.ascending.geneonly.txt
pc1_contribs.descending.geneonly.txt
pgoddard_rtt109_YPD_1

Exercise: copy & paste the lists of genes into GOrilla (http://cbl-gorilla.cs.technion.ac.il/) to discover over-represented GO terms.

Exercise: Repeat the above analysis for genes contributing to PC2 and PC3.
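
One possible way to approach the second exercise is to wrap the three steps (sort, map, cut) in a small shell function and call it once per component. This is a sketch rather than the only solution. The function name `rank_genes_for_pc` is made up for illustration, the commands and file-name pattern are copied from the PC1 cells above, and the function assumes that `pc2_contribs.txt` and `pc3_contribs.txt` sit in the current directory and that `$YEAST_DIR` is set as in the first cell.

```bash
# Hypothetical helper: build ascending and descending gene lists for one PC.
rank_genes_for_pc() {
    local pc="$1"
    local contribs="pc${pc}_contribs.txt"
    local tss="${YEAST_DIR:-}/yeast_tss_coords.bed"

    # Skip quietly if the inputs are not where we assume they are.
    if [[ ! -f "$contribs" || ! -f "$tss" ]]; then
        echo "skipping PC${pc}: cannot find $contribs or $tss" >&2
        return 0
    fi

    local direction flag prefix
    for direction in ascending descending; do
        flag="-g"
        [[ "$direction" == "descending" ]] && flag="-gr"
        prefix="pc${pc}_contribs.${direction}"

        # 1. Sort peaks by their contribution (column 4).
        sort "$flag" -k4,4 "$contribs" > "${prefix}.txt"
        # 2. Map each peak to its nearest TSS (same options as the PC1 cell).
        bedtools closest -D a -a "${prefix}.txt" -b "$tss" > "${prefix}.togene.txt"
        # 3. Keep only the gene name (column 8).
        cut -f8 "${prefix}.togene.txt" > "${prefix}.geneonly.txt"
    done
}

for pc in 2 3; do
    rank_genes_for_pc "$pc"
done
```

The resulting `pc2_contribs.*.geneonly.txt` and `pc3_contribs.*.geneonly.txt` files can then be pasted into GOrilla in the same way as the PC1 lists.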