# 3.6 Finding TF motifs # 

### IMPORTANT: Please make sure that you are using the bash kernel to run this notebook. ###


In [None]:
### Set up variables storing the location of our data
### The proper way to load these variables is to source your ~/.bashrc file, but this is very slow in IPython
export SUNETID="$(whoami)"
export WORK_DIR="/scratch/${SUNETID}"
export DATA_DIR="${WORK_DIR}/data"
[[ ! -d ${WORK_DIR}/data ]] && mkdir -p "${WORK_DIR}/data"
export SRC_DIR="${WORK_DIR}/src"
[[ ! -d ${WORK_DIR}/src ]] && mkdir -p "${WORK_DIR}/src"
export METADATA_DIR="/metadata"
export AGGREGATE_DATA_DIR="/data"
export AGGREGATE_ANALYSIS_DIR="/outputs"
export YEAST_DIR="/saccer3"
export TMP="${WORK_DIR}/tmp"
export TEMP=$TMP
export TMPDIR=$TMP
[[ ! -d ${TMP} ]] && mkdir -p "${TMP}"



In this tutorial, we will focus on identifying motifs in the ATAC-seq peaks: 
![Analysis pipeline](images/part6.png)

In [None]:
cd $WORK_DIR

We will look for TF motifs in the differentially open chromatin regions we have identified. We have very few differential peaks in our samples, so we will do this exercise with the pilot dataset.


We will use HOMER (http://homer.ucsd.edu/homer/) to search for enriched motifs. First, we load the HOMER module:

In [None]:
module load homer 

In [None]:
module list

The specific HOMER command we will use is `findMotifsGenome.pl`. Let's see the inputs and outputs needed by this command:

In [None]:
findMotifsGenome.pl --help

The **pos** file is our list of differential peaks. 

**genome** is the yeast genome, given either as the name of a genome installed with HOMER (we use `sacCer3`) or as the path to a FASTA file.

**output dir** is the output directory where HOMER outputs will be stored. 

**background** is the `all_merged.peaks.bed` file containing all called peaks for the input datasets; it is passed with the `-bg` flag.

We leave all other values at their defaults. 
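
Before launching HOMER, it can help to confirm that the input files exist and are not empty. The following is a minimal sketch, not part of HOMER; the helper name `check_input` is hypothetical, and the example paths in the comments are the ones used in the `findMotifsGenome.pl` call in this notebook.

```bash
# Hypothetical helper: report whether an input file exists and is non-empty.
# Prints "ok: <path> (<n> lines)" on success; prints a message to stderr
# and returns 1 otherwise.
check_input() {
    local path="$1"
    if [ -s "$path" ]; then
        # awk prints only the line count, without padding or the file name
        echo "ok: $path ($(awk 'END { print NR }' "$path") lines)"
    else
        echo "missing or empty: $path" >&2
        return 1
    fi
}

# Example calls (paths taken from this notebook):
#   check_input "${WORK_DIR}/SKN7_0min_vs_45min.negative.pilot.txt"
#   check_input "${AGGREGATE_ANALYSIS_DIR}/all_merged.peaks.bed"
```

If either call reports a missing or empty file, fix the path before running the motif search, since the search itself takes much longer to fail.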


In [None]:
findMotifsGenome.pl $WORK_DIR/SKN7_0min_vs_45min.negative.pilot.txt \
                    sacCer3 \
                    homer_SKN7_0min_vs_45min_negative \
                    -bg $AGGREGATE_ANALYSIS_DIR/all_merged.peaks.bed



We can examine the contents of the **homer_SKN7_0min_vs_45min_negative** folder in the browser (it's located within your folder on `http://1.gentc.net/scratch/`).


## Finding all occurrences of a motif within a peak set 

We can also tell HOMER to scan for all instances of a specific motif in the peak set. This is useful for the footprinting and V-plot analyses in the subsequent tutorials. 
We will find all instances of the top *de novo* hit in SKN7 0min vs 45min. We will also scan for the REB1 TF, which has been shown in prior work to play an important regulatory role in yeast. 

![top_hits](images/top_hits_homer_SKN7.png)

Note: you can click on the "motif file matrix" link in the right-most column of the homerResults.html results file to get the input motif file for scanning: 


```
>GGGCGGCACAAG	1-GGGCGGCACAAG,BestGuess:POL011.1_XCPE1/Jaspar(0.681)	10.848594	-40.855667	0	T:9.0(5.70%),B:1.0(0.03%),P:1e-17
0.001	0.001	0.997	0.001
0.125	0.250	0.624	0.001
0.001	0.001	0.997	0.001
0.001	0.997	0.001	0.001
0.125	0.125	0.749	0.001
0.001	0.001	0.874	0.124
0.001	0.749	0.249	0.001
0.749	0.001	0.125	0.125
0.124	0.874	0.001	0.001
0.874	0.001	0.124	0.001
0.997	0.001	0.001	0.001
0.125	0.125	0.749	0.001
```
This motif is located in the output folder: 
```
/scratch/[YOUR USERNAME]/homer_SKN7_0min_vs_45min_negative/homerResults/motif1.motif
```
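
To make the layout of the motif file concrete: after the `>` header line, each row is one motif position and the four columns are the probabilities of A, C, G and T, in that order. The sketch below picks the most probable base in each row, which for the matrix shown above spells out the consensus in the header (`GGGCGGCACAAG`). The helper name `motif_consensus` is hypothetical, and the sketch assumes the file holds a single motif.

```bash
# Hypothetical helper: print the most probable base at each motif position.
# Assumes HOMER's motif layout: one ">" header line, then one row per
# position with four probabilities in the order A, C, G, T.
motif_consensus() {
    awk '
        BEGIN { split("A C G T", base, " ") }
        /^>/ { next }                 # skip the header line
        NF == 4 {
            best = 1
            for (i = 2; i <= 4; i++)
                if ($i + 0 > $best + 0) best = i   # numeric comparison
            printf "%s", base[best]
        }
        END { print "" }              # finish with a newline
    ' "$1"
}

# Example call (path from this notebook):
#   motif_consensus homer_SKN7_0min_vs_45min_negative/homerResults/motif1.motif
```

Positions where the top probability is well below 1 (for example the second row above, 0.624 for G) are degenerate, so the consensus string alone understates the variability of the motif.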

We use the `scanMotifGenomeWide.pl` HOMER command to find all instances of these two motifs (the top *de novo* hit and REB1) in the genome. We then intersect the resulting BED files with the peak files for 0min_SKN7 and 45min_SKN7, which gives us a list of motif instances within peaks for downstream analysis. 


In [None]:
scanMotifGenomeWide.pl homer_SKN7_0min_vs_45min_negative/homerResults/motif1.motif sacCer3 -bed  > denovo1.genomewide.bed

In [None]:
head  denovo1.genomewide.bed

In [None]:
scanMotifGenomeWide.pl /data/motif_pfm/reb1.motif sacCer3 -bed  > reb1.genomewide.bed 

In [None]:
head reb1.genomewide.bed
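
The `head` calls above show only the first few hits. To get a quick overview of how the hits are distributed, one option is to count them per chromosome. This is a minimal sketch that assumes only that column 1 of the tab-delimited BED output holds the chromosome name; the helper name `hits_per_chrom` is hypothetical.

```bash
# Hypothetical helper: count motif hits per chromosome in a BED file.
# Prints "<chromosome><TAB><count>", sorted by descending count and then
# by chromosome name.
hits_per_chrom() {
    awk -F'\t' '{ n[$1]++ } END { for (c in n) printf "%s\t%d\n", c, n[c] }' "$1" |
        sort -t "$(printf '\t')" -k2,2nr -k1,1
}

# Example calls (files created in the cells above):
#   hits_per_chrom denovo1.genomewide.bed
#   hits_per_chrom reb1.genomewide.bed
```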

In [None]:
# Now, use bedtools to intersect motif positions with peak calls
bedtools intersect -a denovo1.genomewide.bed -b $AGGREGATE_ANALYSIS_DIR/croo_pilot/0min_SKN7/peak/idr_reproducibility/optimal_peak.narrowPeak | bedtools sort -i - | uniq > denovo1.in.0min_SKN7.bed
bedtools intersect -a denovo1.genomewide.bed -b $AGGREGATE_ANALYSIS_DIR/croo_pilot/45min_SKN7/peak/idr_reproducibility/optimal_peak.narrowPeak | bedtools sort -i - | uniq > denovo1.in.45min_SKN7.bed
bedtools intersect -a reb1.genomewide.bed -b $AGGREGATE_ANALYSIS_DIR/croo_pilot/0min_SKN7/peak/idr_reproducibility/optimal_peak.narrowPeak | bedtools sort -i - | uniq > REB1.in.0min_SKN7.bed
bedtools intersect -a reb1.genomewide.bed -b $AGGREGATE_ANALYSIS_DIR/croo_pilot/45min_SKN7/peak/idr_reproducibility/optimal_peak.narrowPeak | bedtools sort -i - | uniq > REB1.in.45min_SKN7.bed

In [None]:
head denovo1.in.0min_SKN7.bed

In [None]:
#let's count how many motif hits we have in each peak set 
wc -l denovo1.in.0min_SKN7.bed
wc -l denovo1.in.45min_SKN7.bed
wc -l REB1.in.0min_SKN7.bed
wc -l REB1.in.45min_SKN7.bed

In [None]:
wc -l $AGGREGATE_ANALYSIS_DIR/croo_pilot/0min_SKN7/peak/idr_reproducibility/optimal_peak.narrowPeak

In [None]:
wc -l $AGGREGATE_ANALYSIS_DIR/croo_pilot/45min_SKN7/peak/idr_reproducibility/optimal_peak.narrowPeak
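
Because the two peak sets may differ in size, the raw hit counts are easier to compare after normalizing by the number of peaks. The sketch below divides one line count by the other; the helper name `hits_per_100_peaks` is hypothetical. Note that it counts motif instances, so several hits inside the same peak are all counted, and the result is not the fraction of peaks that contain a motif.

```bash
# Hypothetical helper: motif hits per 100 peaks, computed from line counts.
#   $1 = BED file of motif hits within peaks
#   $2 = peak file (one peak per line)
# Prints "NA" if the peak file is empty.
hits_per_100_peaks() {
    local hits peaks
    hits=$(awk 'END { print NR }' "$1")
    peaks=$(awk 'END { print NR }' "$2")
    awk -v h="$hits" -v p="$peaks" 'BEGIN {
        if (p == 0) { print "NA"; exit }
        printf "%.1f\n", 100 * h / p
    }'
}

# Example call (files from this notebook):
#   hits_per_100_peaks REB1.in.0min_SKN7.bed \
#       "$AGGREGATE_ANALYSIS_DIR/croo_pilot/0min_SKN7/peak/idr_reproducibility/optimal_peak.narrowPeak"
```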

It looks like we have too few motif-peak intersections for the top *de novo* HOMER hit, but we have a good number of REB1 hits for footprinting. 