# 3.2 Calling differentially expressed genes with GO

### IMPORTANT: Please make sure that you are using the bash kernel to run this notebook. ###


In [6]:
### Set up variables storing the location of our data
### The proper way to load your variables is to source ~/.bashrc, but this is very slow in IPython
export SUNETID="$(whoami)"
export WORK_DIR="/srv/scratch/training_camp/work/${SUNETID}"
export DATA_DIR="${WORK_DIR}/data"
[[ ! -d ${WORK_DIR}/data ]] && mkdir -p "${WORK_DIR}/data"
export SRC_DIR="${WORK_DIR}/src"
[[ ! -d ${WORK_DIR}/src ]] && mkdir -p "${WORK_DIR}/src"
export METADATA_DIR="/srv/scratch/training_camp/metadata"
export AGGREGATE_DATA_DIR="/srv/scratch/training_camp/data"
export AGGREGATE_ANALYSIS_DIR="/srv/scratch/training_camp/aggregate_analysis"
export YEAST_DIR="/srv/scratch/training_camp/saccer3"
export TMP="${WORK_DIR}/tmp"
export TEMP=$TMP
export TMPDIR=$TMP
[[ ! -d ${TMP} ]] && mkdir -p "${TMP}"





In [None]:
cd $WORK_DIR

In the previous tutorial (3.1) we calculated the contribution of each peak in the dataset to principal components 1, 2, and 3.

Now, we sort the pc_contribs\*.txt files. 

Then, we map the peaks to their nearest genes. 
 
Finally, we pass the resulting ranked list to software such as GOrilla (http://cbl-gorilla.cs.technion.ac.il/), which accepts a ranked list of genes and outputs the GO terms that are overrepresented toward the top of that list.



In [8]:
#sort the pc1_contribs file by the fourth column, which contains the contribution to the PC.
sort -n -k4,4 pc1_contribs.txt > pc1_contribs.ascending.txt

#get the descending peak list by sorting in reverse 
sort -nr -k4,4 pc1_contribs.txt > pc1_contribs.descending.txt




In [9]:
head pc1_contribs.ascending.txt

chrXI	381701	382552	-9.00607341480353e-05
chrXV	196157	196924	-8.92315277797249e-05
chrXIII	880107	881290	-7.30785285107545e-05
chrV	241887	243164	-6.348727749738e-05
chrXIII	523247	523662	-5.36550742122443e-05
chrXI	554628	554937	-4.63722176043807e-05
chrXV	924864	925718	-4.0911430098774e-05
chrXI	437379	437791	-3.09810708327273e-05
chrIV	212030	212256	-3.07049025609486e-06
chrIV	1164630	1165943	-0.0767684058707863


In [10]:
head pc1_contribs.descending.txt

chrIV	1445158	1446115	7.50936387031513e-05
chrXV	1003066	1004648	5.61800065935932e-05
chrXII	931669	933147	5.19492539880198e-05
chrV	86062	87064	2.70898543069069e-05
chrVII	1022307	1022842	1.64906830135266e-05
chrXII	28871	30024	0.0678804358542824
chrXII	746924	747874	0.0669235095057691
chrVIII	38277	39700	0.0660310452695733
chrXV	244410	244812	0.0604821622051676
chrVIII	556266	556857	0.0583396391306409
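In both listings above, the values written in scientific notation (for example 7.50936387031513e-05) rank ahead of larger-magnitude values such as 0.0678804358542824. This is because `sort -n` stops reading a number at the `e` and so compares only the mantissa. The sketch below, which assumes GNU `sort`, reuses four values from the listings to contrast `-n` with `-g` (general numeric sort), which parses the exponent as well. The file `demo` is a temporary file created only for this illustration.

```bash
# Toy input built from four rows of the listings above (tab-separated).
demo="$(mktemp)"
printf 'chrIV\t1445158\t1446115\t7.50936387031513e-05\n'  > "$demo"
printf 'chrXII\t28871\t30024\t0.0678804358542824\n'      >> "$demo"
printf 'chrIV\t1164630\t1165943\t-0.0767684058707863\n'  >> "$demo"
printf 'chrXI\t381701\t382552\t-9.00607341480353e-05\n'  >> "$demo"

# -n stops parsing at the "e", so -9.006e-05 is treated as -9.006.
# LC_ALL=C keeps "." as the decimal point regardless of the locale.
n_order="$(LC_ALL=C sort -n -k4,4 "$demo" | cut -f4)"

# -g parses the full floating-point value, exponent included.
g_order="$(LC_ALL=C sort -g -k4,4 "$demo" | cut -f4)"

echo "sort -n:"; echo "$n_order"
echo "sort -g:"; echo "$g_order"
```

With `-g`, the ascending order starts at -0.0767684058707863 and ends at 0.0678804358542824. If the ranking should reflect the true magnitude of each contribution, `sort -g -k4,4` (and `sort -gr -k4,4` for the descending list) may be the safer choice.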


To map peaks to their nearest gene, we need to know the SacCer3 gene coordinates. The transcription start site (TSS) of each gene is listed as a 1-bp interval in the file **$YEAST_DIR/yeast_tss_coords.bed**

In [11]:
head $YEAST_DIR/yeast_tss_coords.bed

chrI	130798	130799	YAL012W	0	+
chrI	334	335	YAL069W	0	+
chrI	537	538	YAL068W-A	0	+
chrI	2168	2169	YAL068C	0	-
chrI	2479	2480	YAL067W-A	0	+
chrI	9015	9016	YAL067C	0	-
chrI	10090	10091	YAL066W	0	+
chrI	11950	11951	YAL065C	0	-
chrI	12045	12046	YAL064W-B	0	+
chrI	13742	13743	YAL064C-A	0	-


In [13]:
cd $WORK_DIR
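#We map the sorted peaks to their nearest genes.
#-D a appends the distance (in bp) from each peak to its nearest TSS as the last column:
#0 means the peak overlaps the TSS; the value is negative when the TSS lies upstream of the peak.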
bedtools closest -D a -a pc1_contribs.ascending.txt -b $YEAST_DIR/yeast_tss_coords.bed > pc1_contribs.ascending.togene.txt
bedtools closest -D a -a pc1_contribs.descending.txt -b $YEAST_DIR/yeast_tss_coords.bed > pc1_contribs.descending.togene.txt



In [14]:
head pc1_contribs.ascending.togene.txt

chrXI	381701	382552	-9.00607341480353e-05	chrXI	382497	382498	YKL030W	0	+	0
chrXV	196157	196924	-8.92315277797249e-05	chrXV	196506	196507	YOL071W	0	+	0
chrXIII	880107	881290	-7.30785285107545e-05	chrXIII	881158	881159	YMR306W	0	+	0
chrV	241887	243164	-6.348727749738e-05	chrV	243179	243180	YER046W	0	+	16
chrXIII	523247	523662	-5.36550742122443e-05	chrXIII	523344	523345	YMR127C	0	-	0
chrXI	554628	554937	-4.63722176043807e-05	chrXI	554986	554987	YKR059W	0	+	50
chrXV	924864	925718	-4.0911430098774e-05	chrXV	925039	925040	YOR324C	0	-	0
chrXI	437379	437791	-3.09810708327273e-05	chrXI	437777	437778	YKL002W	0	+	0
chrIV	212030	212256	-3.07049025609486e-06	chrIV	212045	212046	YDL139C	0	-	0
chrIV	1164630	1165943	-0.0767684058707863	chrIV	1164659	1164660	YDR345C	0	-	0


In [15]:
head pc1_contribs.descending.togene.txt

chrIV	1445158	1446115	7.50936387031513e-05	chrIV	1445466	1445467	YDR497C	0	-	0
chrXV	1003066	1004648	5.61800065935932e-05	chrXV	1003224	1003225	YOR354C	0	-	0
chrXII	931669	933147	5.19492539880198e-05	chrXII	932966	932967	YLR407W	0	+	0
chrV	86062	87064	2.70898543069069e-05	chrV	86936	86937	YEL032W	0	+	0
chrVII	1022307	1022842	1.64906830135266e-05	chrVII	1022655	1022656	YGR266W	0	+	0
chrXII	28871	30024	0.0678804358542824	chrXII	30108	30109	YLL055W	0	+	85
chrXII	746924	747874	0.0669235095057691	chrXII	747110	747111	YLR307C-A	0	-	0
chrVIII	38277	39700	0.0660310452695733	chrVIII	39073	39074	YHL030W-A	0	+	0
chrXV	244410	244812	0.0604821622051676	chrXV	244139	244140	YOL046C	0	-	-271
chrVIII	556266	556857	0.0583396391306409	chrVIII	557041	557042	YHR217C	0	-	185
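The last column (column 11) of the output above is the distance in bp between each peak and its nearest TSS. As a minimal sketch, one could use it to keep only peaks that lie close to a TSS before building the gene list. The cutoff `max_dist`, the temporary file `togene_demo`, and the row naming `FAKE_GENE` are assumptions introduced for this illustration; they are not part of the original workflow.

```bash
max_dist=1000   # hypothetical cutoff in bp, chosen only for illustration

# Two rows copied from the listings above, plus one invented row
# (FAKE_GENE) whose TSS is far from its peak.
togene_demo="$(mktemp)"
printf 'chrXV\t244410\t244812\t0.0604821622051676\tchrXV\t244139\t244140\tYOL046C\t0\t-\t-271\n' > "$togene_demo"
printf 'chrV\t241887\t243164\t-6.348727749738e-05\tchrV\t243179\t243180\tYER046W\t0\t+\t16\n'   >> "$togene_demo"
printf 'chrI\t100\t200\t0.01\tchrI\t5200\t5201\tFAKE_GENE\t0\t+\t5001\n'                        >> "$togene_demo"

# Keep rows whose absolute distance (column 11) is at most max_dist.
near="$(awk -F'\t' -v max="$max_dist" '
    { d = $11; if (d < 0) d = -d }   # absolute value of the signed distance
    d <= max                         # print the row when it passes the cutoff
' "$togene_demo")"

echo "$near"
```

Because `awk` prints rows in input order, the ranking by PC contribution is preserved after filtering.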


The GOrilla software expects a ranked list of genes, so we use the cut command to extract just the gene name column (column 8) from the files.

In [17]:
cut -f8 pc1_contribs.ascending.togene.txt > pc1_contribs.ascending.geneonly.txt
cut -f8 pc1_contribs.descending.togene.txt > pc1_contribs.descending.geneonly.txt
head pc1_contribs.ascending.geneonly.txt

YKL030W
YOL071W
YMR306W
YER046W
YMR127C
YKR059W
YOR324C
YKL002W
YDL139C
YDR345C
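Several peaks can share the same nearest gene, so the gene-only list may contain a gene more than once. If you prefer a list in which each gene appears only at its highest-ranked position, the following sketch removes later repeats while keeping the original order. The toy list and the variable names are made up for illustration.

```bash
# Toy ranked list with repeated gene names (invented ordering).
genes_demo="$(mktemp)"
printf '%s\n' YKL030W YOL071W YKL030W YMR306W YOL071W > "$genes_demo"

# seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++
# is true only for first occurrences; awk prints those lines in order.
unique_genes="$(awk '!seen[$0]++' "$genes_demo")"

echo "$unique_genes"
```

The same `awk` filter can be applied to `pc1_contribs.ascending.geneonly.txt` and `pc1_contribs.descending.geneonly.txt`, redirecting the output to a new file of your choosing.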


In [18]:
#we create symbolic links to the sorted gene files so you can load them directly into GOrilla.

ln -s $WORK_DIR/pc1_contribs.ascending.geneonly.txt ~/training_camp/workflow_notebooks
ln -s $WORK_DIR/pc1_contribs.descending.geneonly.txt ~/training_camp/workflow_notebooks
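Running `ln -s` a second time fails because the link already exists. A small sketch of a re-runnable variant, assuming GNU `ln`, uses `-f` to replace an existing link. The directories and the file `genes.txt` below are temporary stand-ins created only for this demonstration.

```bash
# Temporary stand-ins for $WORK_DIR and the notebooks directory.
link_src="$(mktemp -d)"
link_dest="$(mktemp -d)"
touch "$link_src/genes.txt"

# -s makes a symbolic link; -f replaces a link of the same name if present.
ln -sf "$link_src/genes.txt" "$link_dest"
ln -sf "$link_src/genes.txt" "$link_dest"   # second run also succeeds

ls -l "$link_dest"
```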



In [19]:
ls ~/training_camp/workflow_notebooks

0.0 Introduction to Jupyter notebooks.ipynb
1.0 Big Ideas.ipynb
1.1 Unix Basics.ipynb
1.2 Shell scripts and job submission.ipynb
1.3 Getting ready to run code on the cluster.ipynb
2.0 The metadata file and analysis overview.ipynb
2.1_Sequencing_Data_Analysis.ipynb
3.1 Clustering analysis and PCA.ipynb
3.2 Differential gene expression analysis with GORilla.ipynb
3.3 Calling differentially expressed peaks with DESeq2.ipynb
3.4 GO Term Enrichment.ipynb
3.5 Install Missing R packages.ipynb
images
pc1_contribs.ascending.geneonly.txt
pc1_contribs.descending.geneonly.txt
peaks2genes.bed
qc_analysis


Exercise: copy & paste the lists of genes into GOrilla (http://cbl-gorilla.cs.technion.ac.il/) to discover over-represented GO terms.

Exercise: Repeat the above analysis for genes contributing to PC2 and PC3.