# 3.4 GO Term Enrichment with Foreground and Background Sets

### IMPORTANT: Please make sure that you are using the bash kernel to run this notebook. ###


In [1]:
### Set up variables storing the location of our data
### The usual way to load these variables is to source ~/.bashrc, but that is very slow in IPython
export SUNETID="$(whoami)"
export WORK_DIR="/srv/scratch/training_camp/work/${SUNETID}"
export DATA_DIR="${WORK_DIR}/data"
[[ ! -d "${DATA_DIR}" ]] && mkdir -p "${DATA_DIR}"
export SRC_DIR="${WORK_DIR}/src"
[[ ! -d ${WORK_DIR}/src ]] && mkdir -p "${WORK_DIR}/src"
export METADATA_DIR="/srv/scratch/training_camp/metadata"
export AGGREGATE_DATA_DIR="/srv/scratch/training_camp/data"
export AGGREGATE_ANALYSIS_DIR="/srv/scratch/training_camp/aggregate_analysis"
export YEAST_DIR="/srv/scratch/training_camp/saccer3"
export TMP="${WORK_DIR}/tmp"
export TEMP=$TMP
export TMPDIR=$TMP
[[ ! -d ${TMP} ]] && mkdir -p "${TMP}"





In this tutorial, we will focus on GO term enrichment analysis: 
![Analysis pipeline](images/part5.png)

In the previous tutorial, we identified differential peaks between pairs of strains and between pairs of media. The results were stored in `$WORK_DIR` as the following files:

* Media_SCD_vs_SCE.txt  
* Media_SCD_vs_SCE.differential.txt  


* Strain_WT_vs_cln3.txt  
* Strain_WT_vs_cln3.differential.txt

* Strain_WT_vs_whi5.txt  
* Strain_WT_vs_whi5.txt.sigPeakNames  



We'll perform GO Term enrichment for the Media variable. You can look at term enrichment between the different strains as an exercise. 

In [11]:
cd $WORK_DIR
head Media_SCD_vs_SCE.txt

baseMean	log2FoldChange	lfcSE	stat	pvalue	padj
chrI	0	781	71.4247594771033	0.362542651434563	0.191033277028369	1.8977984206423	0.057722641357702	0.412486273145827
chrI	6332	6549	391.519300192043	-0.0903453826781279	0.122585917923755	-0.736996420211335	0.461124526325371	0.696855927778159
chrI	9138	9609	103.122109647538	0.170848189475769	0.152245402286897	1.12218948427629	0.261781882791821	0.589369470704246
chrI	20611	21197	116.903606874811	0.456142166962172	0.206984101656355	2.20375460391389	0.0275416067274062	0.303433319639802
chrI	28155	29092	17.918683845665	0.354579784474876	0.269386440193253	1.31624956408536	0.188090293004623	0.543158882236564
chrI	29173	30197	72.0299111447025	-0.218158510802214	0.148134090467285	-1.47270969237424	0.140829331980554	0.50517128958626
chrI	31527	31972	17.6993060460414	0.0878247958481518	0.272882234659731	0.321841383180054	0.74757286164616	0.8584568746176
chrI	32456	36256	25.4848246195161	-0.249976827324697	0.276448051484265	-0.904245213458939	0

In [12]:
head Media_SCD_vs_SCE.differential.txt

chrI	60701	61907
chrI	166120	166935
chrI	189485	192687
chrII	306681	307593
chrII	623263	623436
chrIII	90707	91411
chrIV	47634	48030
chrIV	73358	73585
chrIV	124897	125699
chrIV	524555	525548


We will map the differential peaks to their nearest genes, as we did in tutorial 3.2, and search for GO term enrichment. The genes nearest to differential peaks form the foreground set. The genes nearest to any peak, differential or not, form the background set.
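To make the role of the two sets concrete, here is a sketch of the statistic that enrichment tools of this kind typically compute for each GO term, a one-sided hypergeometric test. The symbols are notation introduced here for illustration and do not appear elsewhere in the tutorial: $N$ is the number of background genes, $M$ the number of background genes annotated to the term, $n$ the number of foreground genes, and $k$ the number of foreground genes annotated to the term. The exact procedure and multiple-testing correction used by any particular tool may differ.

$$
P(X \ge k) = \sum_{i=k}^{\min(n, M)} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}
$$

A small p-value means the term appears among the foreground genes more often than expected if the $n$ foreground genes had been drawn at random from the background. This is why the background should contain only genes near peaks rather than every gene in the genome.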



In [16]:
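# Foreground mapping: find the nearest TSS for each differential peak.
# -D a appends a signed distance column; negative values mark a TSS upstream of the peak.
bedtools closest -D a -a Media_SCD_vs_SCE.differential.txt -b "$YEAST_DIR/yeast_tss_coords.bed" > Media_SCD_vs_SCE.differential.togene.txt
# Background mapping: skip the header line, keep the chrom/start/end columns, then find the nearest TSS for every peak.
tail -n +2 Media_SCD_vs_SCE.txt | cut -f1,2,3 | bedtools closest -D a -a stdin -b "$YEAST_DIR/yeast_tss_coords.bed" > Media_SCD_vs_SCE.togene.txt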
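`bedtools closest` expects both inputs to be sorted by chromosome and then by start position. The peak files in this tutorial appear to be in that order already, but if you adapt the commands to your own data, a check like the following may help. This is a minimal sketch on a toy file: the temporary directory, the file names `peaks.bed` and `peaks.sorted.bed`, and the deliberately shuffled row order are assumptions made for illustration.

```shell
# Toy BED intervals, deliberately out of order (coordinates borrowed from the peaks above).
tmpdir="$(mktemp -d)"
printf 'chrII\t623263\t623436\nchrI\t166120\t166935\nchrI\t60701\t61907\n' > "${tmpdir}/peaks.bed"

# Sort by chromosome name (column 1), then numerically by start (column 2).
sort -k1,1 -k2,2n "${tmpdir}/peaks.bed" > "${tmpdir}/peaks.sorted.bed"

# sort -c only checks the order; it exits non-zero if the file is not sorted.
if sort -c -k1,1 -k2,2n "${tmpdir}/peaks.sorted.bed" 2>/dev/null; then
    sorted_status="sorted"
else
    sorted_status="unsorted"
fi
echo "${sorted_status}"

# Keep the first line of the sorted file for inspection.
first_line="$(head -n 1 "${tmpdir}/peaks.sorted.bed")"
echo "${first_line}"
```

The same `sort -c` check can be run on the TSS annotation file before calling `bedtools closest`.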




In [10]:
head Media_SCD_vs_SCE.differential.togene.txt

chrI	60701	61907	chrI	61315	61316	YAL042W	0	+	0
chrI	166120	166935	chrI	165865	165866	YAR010C	0	-	-255
chrI	189485	192687	chrI	192618	192619	YAR042W	0	+	0
chrII	306681	307593	chrII	306954	306955	YBR035C	0	-	0
chrII	623263	623436	chrII	623575	623576	YBR201W	0	+	140
chrIII	90707	91411	chrIII	91323	91324	YCL018W	0	+	0
chrIV	47634	48030	chrIV	48030	48031	YDL227C	0	-	1
chrIV	73358	73585	chrIV	73917	73918	YDL215C	0	-	333
chrIV	124897	125699	chrIV	125615	125616	YDL186W	0	+	0
chrIV	524555	525548	chrIV	525439	525440	YDR037W	0	+	0


In [17]:
head Media_SCD_vs_SCE.togene.txt

chrI	0	781	chrI	537	538	YAL068W-A	0	+	0
chrI	6332	6549	chrI	9015	9016	YAL067C	0	-	2467
chrI	9138	9609	chrI	9015	9016	YAL067C	0	-	-123
chrI	20611	21197	chrI	21565	21566	YAL064W	0	+	369
chrI	28155	29092	chrI	27967	27968	YAL063C	0	-	-188
chrI	29173	30197	chrI	27967	27968	YAL063C	0	-	-1206
chrI	31527	31972	chrI	31566	31567	YAL062W	0	+	0
chrI	32456	36256	chrI	35154	35155	YAL060W	0	+	0
chrI	39017	39243	chrI	39045	39046	YAL056C-A	0	-	0
chrI	42035	42993	chrI	42176	42177	YAL055W	0	+	0
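The last column of the mapped files is the signed distance from each peak to its nearest TSS, so "close" can be made explicit by filtering on that column. The tutorial itself keeps every peak; the sketch below only illustrates the idea on the first four rows shown above. The 1000 bp cutoff (`max_dist`) and the temporary file name are hypothetical choices, not values from the tutorial.

```shell
# First four rows of the background mapping shown above, written to a scratch file.
tmpdir="$(mktemp -d)"
{
    printf 'chrI\t0\t781\tchrI\t537\t538\tYAL068W-A\t0\t+\t0\n'
    printf 'chrI\t6332\t6549\tchrI\t9015\t9016\tYAL067C\t0\t-\t2467\n'
    printf 'chrI\t9138\t9609\tchrI\t9015\t9016\tYAL067C\t0\t-\t-123\n'
    printf 'chrI\t20611\t21197\tchrI\t21565\t21566\tYAL064W\t0\t+\t369\n'
} > "${tmpdir}/togene.txt"

max_dist=1000   # hypothetical cutoff in base pairs

# Column 10 is the signed distance; take its absolute value and keep
# the gene name (column 7) only when the peak lies within the cutoff.
nearby_genes="$(awk -F'\t' -v max="${max_dist}" \
    '{ d = ($10 < 0) ? -$10 : $10; if (d <= max) print $7 }' \
    "${tmpdir}/togene.txt")"
echo "${nearby_genes}"
```

With this cutoff, the peak at `chrI:6332-6549` is dropped because its nearest TSS is 2467 bp away, while the other three rows are kept.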


In [18]:
#As before, we want a list of genes to use in GO Term enrichment, so we extract column 7, which contains the gene names
cut -f7 Media_SCD_vs_SCE.differential.togene.txt > Media.foreground.txt
cut -f7 Media_SCD_vs_SCE.togene.txt > Media.background.txt 
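Several peaks can map to the same gene (for example, `YAL067C` appears twice in the background mapping above), so the extracted lists may contain duplicates. Before uploading, it can be worth de-duplicating both lists and confirming that every foreground gene also appears in the background. The following is a minimal sketch on toy lists: the file names `fg.txt` and `bg.txt` are hypothetical stand-ins for `Media.foreground.txt` and `Media.background.txt`, and the toy contents were assembled from gene names shown above.

```shell
tmpdir="$(mktemp -d)"

# Toy gene lists; the repeated entries are included on purpose.
printf 'YAL042W\nYAR010C\nYAL042W\n' > "${tmpdir}/fg.txt"
printf 'YAL068W-A\nYAL067C\nYAL067C\nYAL042W\nYAR010C\n' > "${tmpdir}/bg.txt"

# sort -u sorts and removes duplicate lines; LC_ALL=C keeps the
# collation identical for sort and comm.
LC_ALL=C sort -u "${tmpdir}/fg.txt" > "${tmpdir}/fg.uniq.txt"
LC_ALL=C sort -u "${tmpdir}/bg.txt" > "${tmpdir}/bg.uniq.txt"

# comm -23 prints lines found only in the first file, i.e. foreground
# genes that are missing from the background (ideally none).
missing="$(LC_ALL=C comm -23 "${tmpdir}/fg.uniq.txt" "${tmpdir}/bg.uniq.txt")"

fg_count="$(wc -l < "${tmpdir}/fg.uniq.txt" | tr -d ' ')"
bg_count="$(wc -l < "${tmpdir}/bg.uniq.txt" | tr -d ' ')"
echo "foreground: ${fg_count} genes; background: ${bg_count} genes; missing from background: ${missing:-none}"
```

Because the differential peaks are a subset of all peaks, the foreground list should be contained in the background list; a non-empty `missing` value would suggest the two files came from different runs.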



In [19]:
#Add symbolic links to the gene lists in the folder where this notebook is stored.
ln -s $WORK_DIR/*foreground* ~/training_camp/workflow_notebooks
ln -s $WORK_DIR/*background* ~/training_camp/workflow_notebooks





Now use the Saccharomyces Genome Database (SGD) GO Term Finder tool (http://www.yeastgenome.org/cgi-bin/GO/goTermFinder.pl) to check for GO term enrichment. Upload your differential gene list (`Media.foreground.txt`) as the foreground set and the full gene list (`Media.background.txt`) as the background set.