# 3.5 GO Term Enrichment for Differentially Accessible Chromatin Regions #

### IMPORTANT: Please make sure that you are using the bash kernel to run this notebook. ###


In [None]:
### Set up variables storing the location of our data
### The proper way to load your variables is to source ~/.bashrc, but this is very slow in IPython
export SUNETID="$(whoami)"
export WORK_DIR="/srv/scratch/training_camp/work/${SUNETID}"
export DATA_DIR="${WORK_DIR}/data"
[[ ! -d ${WORK_DIR}/data ]] && mkdir -p "${WORK_DIR}/data"
export SRC_DIR="${WORK_DIR}/src"
[[ ! -d ${WORK_DIR}/src ]] && mkdir -p "${WORK_DIR}/src"
export METADATA_DIR="/srv/scratch/training_camp/metadata"
export AGGREGATE_DATA_DIR="/srv/scratch/training_camp/data"
export AGGREGATE_ANALYSIS_DIR="/srv/scratch/training_camp/aggregate_analysis"
export YEAST_DIR="/srv/scratch/training_camp/saccer3"
export TMP="${WORK_DIR}/tmp"
export TEMP=$TMP
export TMPDIR=$TMP
[[ ! -d ${TMP} ]] && mkdir -p "${TMP}"



In this tutorial, we will focus on GO term enrichment analysis: 
![Analysis pipeline](images/part5.png)

In the previous tutorial, we identified differential peaks between pairs of strains and media. These were stored in $WORK_DIR as the following files: 

* Media_YPD_vs_YPGE.txt  
* Media_YPD_vs_YPGE.differential.txt  


* Strain_WT_vs_asf1.txt  
* Strain_WT_vs_asf1.differential.txt


* Strain_WT_vs_rtt109.txt  
* Strain_WT_vs_rtt109.differential.txt  


* Strain_asf1_vs_rtt109.txt
* Strain_asf1_vs_rtt109.differential.txt



We'll perform GO Term enrichment for the Media variable. You can look at term enrichment between the different strains as an exercise. 

In [26]:
cd $WORK_DIR
head Media_YPD_vs_YPGE.differential.txt

chrI	0	857
chrI	2415	2586
chrI	6315	6556
chrI	28570	28931
chrI	29238	29452
chrI	29729	30050
chrI	31624	35831
chrI	43206	43632
chrI	44949	46685
chrI	48249	48767


In [24]:
#Select columns 1, 2, 3 and 5; cut silently skips fields that a line does not have
cut -f1,2,3,5 Media_YPD_vs_YPGE.differential.txt > tmp.txt
head tmp.txt

chrI	0	857
chrI	2415	2586
chrI	6315	6556
chrI	28570	28931
chrI	29238	29452
chrI	29729	30050
chrI	31624	35831
chrI	43206	43632
chrI	44949	46685
chrI	48249	48767


In [None]:
#Save the lines that contain a minus sign to the file "down"
grep -e '-' tmp.txt > down
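
The grep above matches a hyphen anywhere in a line, so it only works as a sign filter when no other column can contain one. A minimal sketch of a stricter alternative is to test a single numeric column with awk. The toy table, its fourth column (a hypothetical log2 fold change), and the toy_split directory are assumptions made for illustration; they are not part of the tutorial data.

```bash
# Toy input: three peaks with a hypothetical log2 fold change in column 4.
demo="${TMPDIR:-/tmp}/toy_split"
mkdir -p "${demo}"
printf 'chrI\t0\t857\t1.8\nchrI\t2415\t2586\t-2.1\nchrI\t6315\t6556\t-0.7\n' > "${demo}/toy.txt"

# Route each row by the sign of column 4 only, ignoring hyphens elsewhere.
awk -F'\t' -v up="${demo}/up.txt" -v down="${demo}/down.txt" \
    '$4 < 0 { print > down; next } { print > up }' "${demo}/toy.txt"

# Number of rows in each output file.
wc -l < "${demo}/down.txt"
wc -l < "${demo}/up.txt"
```

With real data, the column number would have to match whichever column of the differential table holds the signed statistic.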

We will map the differentially accessible peaks to their nearest genes, as we did in tutorial 3.2, and search for GO term enrichment. The genes close to differential peaks will be the foreground set. The full set of genes near peaks will be the background set. 



In [14]:
#Foreground mapping: assign each differential peak to its nearest TSS
bedtools closest -D a -a Media_YPD_vs_YPGE.differential.txt -b $YEAST_DIR/yeast_tss_coords.bed > Media_YPD_vs_YPGE.differential.togene.txt

#Background mapping: skip the first line, keep the three coordinate columns, and assign every peak to its nearest TSS
tail -n +2 Media_YPD_vs_YPGE.txt | cut -f1,2,3 | bedtools closest -D a -a stdin -b $YEAST_DIR/yeast_tss_coords.bed > Media_YPD_vs_YPGE.togene.txt




In [15]:
head Media_YPD_vs_YPGE.differential.togene.txt

chrI	0	857	chrI	537	538	YAL068W-A	0	+	0
chrI	2415	2586	chrI	2479	2480	YAL067W-A	0	+	0
chrI	6315	6556	chrI	9015	9016	YAL067C	0	-	2460
chrI	28570	28931	chrI	27967	27968	YAL063C	0	-	-603
chrI	29238	29452	chrI	27967	27968	YAL063C	0	-	-1271
chrI	29729	30050	chrI	31566	31567	YAL062W	0	+	1517
chrI	31624	35831	chrI	35154	35155	YAL060W	0	+	0
chrI	43206	43632	chrI	42176	42177	YAL055W	0	+	-1030
chrI	44949	46685	chrI	45898	45899	YAL053W	0	+	0
chrI	48249	48767	chrI	48563	48564	YAL051W	0	+	0


In [16]:
head Media_YPD_vs_YPGE.togene.txt

chrI	0	857	chrI	537	538	YAL068W-A	0	+	0
chrI	2415	2586	chrI	2479	2480	YAL067W-A	0	+	0
chrI	6315	6556	chrI	9015	9016	YAL067C	0	-	2460
chrI	14706	14936	chrI	13742	13743	YAL064C-A	0	-	-964
chrI	20592	21210	chrI	21565	21566	YAL064W	0	+	356
chrI	28570	28931	chrI	27967	27968	YAL063C	0	-	-603
chrI	29238	29452	chrI	27967	27968	YAL063C	0	-	-1271
chrI	29729	30050	chrI	31566	31567	YAL062W	0	+	1517
chrI	31624	35831	chrI	35154	35155	YAL060W	0	+	0
chrI	42233	42693	chrI	42176	42177	YAL055W	0	+	-57
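
In the rows above, column 7 holds the gene name and column 10 holds the signed distance that `-D a` adds: 0 when the peak overlaps the TSS, and negative when the TSS lies at lower coordinates than the peak. If distant assignments are a concern, one option is to keep only peaks within some distance of a TSS before building the gene lists. The sketch below uses four rows copied from the output above; the 1000 bp cutoff and the toy_distance directory are assumptions for illustration, not values used in this tutorial.

```bash
# Four rows copied from the bedtools closest output shown above.
demo="${TMPDIR:-/tmp}/toy_distance"
mkdir -p "${demo}"
{
  printf 'chrI\t6315\t6556\tchrI\t9015\t9016\tYAL067C\t0\t-\t2460\n'
  printf 'chrI\t28570\t28931\tchrI\t27967\t27968\tYAL063C\t0\t-\t-603\n'
  printf 'chrI\t29729\t30050\tchrI\t31566\t31567\tYAL062W\t0\t+\t1517\n'
  printf 'chrI\t31624\t35831\tchrI\t35154\t35155\tYAL060W\t0\t+\t0\n'
} > "${demo}/togene.txt"

# Keep the gene name (column 7) when the absolute distance (column 10)
# is at most max_dist base pairs. max_dist=1000 is a hypothetical cutoff.
awk -F'\t' -v max_dist=1000 \
    '{ d = $10; if (d < 0) d = -d; if (d <= max_dist) print $7 }' \
    "${demo}/togene.txt" > "${demo}/near_genes.txt"

cat "${demo}/near_genes.txt"
```

Only YAL063C (603 bp away) and YAL060W (overlapping) pass this cutoff; the other two peaks sit more than 1 kb from their nearest TSS.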


In [17]:
#As before, we want a list of genes to use in GO Term enrichment, so we extract column 7, which contains the gene names
cut -f7 Media_YPD_vs_YPGE.differential.togene.txt > Media.foreground.txt
cut -f7 Media_YPD_vs_YPGE.togene.txt > Media.background.txt 



In [18]:
head Media.background.txt

YAL068W-A
YAL067W-A
YAL067C
YAL064C-A
YAL064W
YAL063C
YAL063C
YAL062W
YAL060W
YAL055W
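
Note that YAL063C appears twice in the output above, because two peaks share the same nearest gene. If a list of distinct genes is wanted, `sort -u` removes the repeats. The sketch below reuses the ten names shown above; the toy_dedup directory and the `.unique.txt` file name are hypothetical.

```bash
# The ten gene names printed by head above, one per line.
demo="${TMPDIR:-/tmp}/toy_dedup"
mkdir -p "${demo}"
printf '%s\n' YAL068W-A YAL067W-A YAL067C YAL064C-A YAL064W \
              YAL063C YAL063C YAL062W YAL060W YAL055W > "${demo}/background.txt"

# sort -u sorts the names and keeps one copy of each.
sort -u "${demo}/background.txt" > "${demo}/background.unique.txt"

# Line counts before and after deduplication.
wc -l < "${demo}/background.txt"
wc -l < "${demo}/background.unique.txt"
```

Here the count drops from 10 to 9. The same command could be applied to Media.foreground.txt and Media.background.txt so that the line counts reflect distinct genes rather than peaks.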


In [19]:
#Add symbolic links to the gene lists in the folder where this notebook is stored.
#The -f flag replaces links left over from an earlier run instead of failing with "File exists".
ln -sf $WORK_DIR/*foreground* ~/training_camp/workflow_notebooks
ln -sf $WORK_DIR/*background* ~/training_camp/workflow_notebooks


In [20]:
wc -l Media.foreground.txt

1619 Media.foreground.txt


In [21]:
wc -l Media.background.txt

3287 Media.background.txt
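
Because the foreground is meant to be drawn from the background, a quick sanity check before uploading is to confirm that no foreground gene is missing from the background list. A minimal sketch with `comm` follows; the toy gene lists and the toy_fg_bg directory are illustrative assumptions, and the real check would use Media.foreground.txt and Media.background.txt instead.

```bash
# Toy gene lists (names borrowed from this tutorial, contents illustrative).
demo="${TMPDIR:-/tmp}/toy_fg_bg"
mkdir -p "${demo}"
printf '%s\n' YAL067C YAL063C YAL063C YAL060W > "${demo}/fg.txt"
printf '%s\n' YAL068W-A YAL067C YAL064W YAL063C YAL063C YAL060W > "${demo}/bg.txt"

# comm needs sorted input; LC_ALL=C keeps sort and comm on the same ordering.
LC_ALL=C sort -u "${demo}/fg.txt" > "${demo}/fg.sorted.txt"
LC_ALL=C sort -u "${demo}/bg.txt" > "${demo}/bg.sorted.txt"

# -23 keeps only lines unique to the first file:
# foreground genes that are absent from the background.
LC_ALL=C comm -23 "${demo}/fg.sorted.txt" "${demo}/bg.sorted.txt" > "${demo}/fg_not_in_bg.txt"

# 0 means every foreground gene is also in the background.
wc -l < "${demo}/fg_not_in_bg.txt"
```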


Now use the Saccharomyces Genome Database (SGD) GO Term Finder tool (http://www.yeastgenome.org/cgi-bin/GO/goTermFinder.pl) to check for GO term enrichment. Upload your differential gene list (Media.foreground.txt) as the foreground set and the full gene list (Media.background.txt) as the background set. 
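
For orientation, enrichment tools of this kind typically score each GO term with a hypergeometric test. Assuming that model (the symbols below are introduced here for illustration and do not come from the SGD interface): with $N$ genes in the background, $K$ of them annotated to a term, $n$ genes in the foreground, and $k$ of those annotated to the term, the p-value is the probability of seeing $k$ or more annotated genes in the foreground by chance:

$$
P(X \ge k) = \sum_{i=k}^{\min(n,K)} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}
$$

A small p-value indicates that the term is over-represented in the foreground relative to the background. This is also why the choice of background matters: the same foreground can look enriched against one background and unremarkable against another.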