# 3.6 Finding TF motifs # 

### IMPORTANT: Please make sure that you are using the bash kernel to run this notebook. ###


In [1]:
### Set up variables storing the location of our data
### The proper way to load these variables is to source ~/.bashrc, but that is very slow in IPython
export SUNETID="$(whoami)"
export WORK_DIR="/srv/scratch/training_camp/work/${SUNETID}"
export DATA_DIR="${WORK_DIR}/data"
[[ ! -d "${DATA_DIR}" ]] && mkdir -p "${DATA_DIR}"
export SRC_DIR="${WORK_DIR}/src"
[[ ! -d ${WORK_DIR}/src ]] && mkdir -p "${WORK_DIR}/src"
export METADATA_DIR="/srv/scratch/training_camp/metadata"
export AGGREGATE_DATA_DIR="/srv/scratch/training_camp/data"
export AGGREGATE_ANALYSIS_DIR="/srv/scratch/training_camp/aggregate_analysis"
export YEAST_DIR="/srv/scratch/training_camp/saccer3"
export TMP="${WORK_DIR}/tmp"
export TEMP=$TMP
export TMPDIR=$TMP
[[ ! -d ${TMP} ]] && mkdir -p "${TMP}"





In this tutorial, we will focus on identifying motifs in the ATAC-seq peaks: 
![Analysis pipeline](images/part6.png)

In [2]:
cd $WORK_DIR



We will look for TF motifs in the differentially open chromatin regions we have identified. Pick one of the following files to check for motif enrichment:

* Media_YPD_vs_YPGE.differential.txt  
* Strain_WT_vs_asf1.differential.txt
* Strain_WT_vs_rtt109.differential.txt  
* Strain_asf1_vs_rtt109.differential.txt
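
Before running HOMER, it can help to store the chosen file in a variable and confirm it is actually present. The following is a minimal sketch: the helper `differential_file` and the variable `DIFF_FILE` are hypothetical names, and the files are assumed to sit directly in `$WORK_DIR`, as in the HOMER command used later in this notebook.

```shell
# Hypothetical helper: build the path of a differential-peak file from a
# comparison name, following the "<comparison>.differential.txt" pattern above.
differential_file() {
    comparison="$1"
    dir="${2:-${WORK_DIR:-.}}"   # assumption: files live directly in $WORK_DIR
    printf '%s/%s.differential.txt\n' "$dir" "$comparison"
}

# Example: choose the media comparison and confirm the file is present.
DIFF_FILE="$(differential_file Media_YPD_vs_YPGE)"
if [ -f "$DIFF_FILE" ]; then
    wc -l "$DIFF_FILE"        # number of lines in the file
    head -n 3 "$DIFF_FILE"    # peek at the first few lines
else
    echo "File not found: $DIFF_FILE" >&2
fi
```

Swapping in another comparison name from the list is then a one-word change.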




We will use HOMER (http://homer.ucsd.edu/homer/) to search for enriched motifs. First, we load the HOMER module:

In [3]:
module load homer 



In [4]:
module list

Currently Loaded Modulefiles:
  1) homer/default


The specific HOMER command we will use is `findMotifsGenome.pl`. Let's see the inputs and outputs needed by this command:

In [5]:
findMotifsGenome.pl --help


	Program will find de novo and known motifs in regions in the genome

	Usage: findMotifsGenome.pl <pos file> <genome> <output directory> [additional options]
	Example: findMotifsGenome.pl peaks.txt mm8r peakAnalysis -size 200 -len 8

	Possible Genomes:
			-- or --
		Custom: provide the path to genome FASTA files (directory or single file)
			Heads up: will create the directory "preparsed/" in same location.

	Basic options:
		-mask (mask repeats/lower case sequence, can also add 'r' to genome, i.e. mm9r)
		-bg <background position file> (genomic positions to be used as background, default=automatic)
			removes background positions overlapping with target positions
			-chopify (chop up large background regions to the avg size of target regions)
		-len <#>[,<#>,<#>...] (motif length, default=8,10,12) [NOTE: values greater 12 may cause the program
			to run out of memory - in these cases decrease the number of sequences analyzed (-N),
			or try analyzing shorter sequenc

The **pos** file is our list of differential peaks. 

**genome** is the FASTA file containing the yeast genome. 

**output dir** is the directory where HOMER will write its results. 

We leave all other values at their defaults. 
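
Since the first two arguments are input files, a quick pre-flight check can catch a wrong path before HOMER starts. This is only a sketch: `require_file`, `POS_FILE`, and `GENOME_FASTA` are hypothetical names, and the paths simply mirror the ones used in this notebook.

```shell
# Hypothetical helper: succeed only if the file exists and is not empty.
require_file() {
    if [ ! -s "$1" ]; then
        echo "Missing or empty input: $1" >&2
        return 1
    fi
}

# Paths mirror the ones used in this notebook; the variable names are illustrative.
POS_FILE="${WORK_DIR:-.}/Media_YPD_vs_YPGE.differential.txt"
GENOME_FASTA="${YEAST_DIR:-.}/sacCer3.fa"

if require_file "$POS_FILE" && require_file "$GENOME_FASTA"; then
    echo "Inputs look fine; ready to run findMotifsGenome.pl"
else
    echo "Fix the inputs listed above before running HOMER" >&2
fi
```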


In [None]:
findMotifsGenome.pl $WORK_DIR/Media_YPD_vs_YPGE.differential.txt $YEAST_DIR/sacCer3.fa ~/training_camp/workflow_notebooks/homer_output
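
The run above uses the defaults. The usage example in the help text also passes `-size` and `-len`; the sketch below shows one way to wrap such a run. The function name, the output directory name, and the option values (copied from the help example, not tuned for yeast) are assumptions.

```shell
# Hypothetical wrapper: print the command line, then run it only if HOMER is
# on the PATH and the position file exists.
run_homer_with_options() {
    pos_file="$1"
    genome="$2"
    out_dir="$3"
    size="${4:-200}"      # value from the help example, not a recommendation
    motif_len="${5:-8}"   # value from the help example, not a recommendation
    echo "findMotifsGenome.pl $pos_file $genome $out_dir -size $size -len $motif_len"
    if command -v findMotifsGenome.pl >/dev/null 2>&1 && [ -f "$pos_file" ]; then
        findMotifsGenome.pl "$pos_file" "$genome" "$out_dir" -size "$size" -len "$motif_len"
    fi
}

# Example call; the separate output directory (an illustrative name) keeps
# the default run's results intact.
run_homer_with_options \
    "${WORK_DIR:-.}/Media_YPD_vs_YPGE.differential.txt" \
    "${YEAST_DIR:-.}/sacCer3.fa" \
    "${WORK_DIR:-.}/homer_output_len8" \
    200 8
```

Writing to a separate output directory makes it easy to compare the two runs side by side.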

We can examine the contents of the homer_output folder in the browser. 
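
The results can also be inspected from the command line. The sketch below assumes that the output folder contains a tab-delimited table named `knownResults.txt`, as HOMER runs typically produce; the helper `preview_table` and the variable `OUT_DIR` are hypothetical names.

```shell
# Hypothetical helper: show the header plus the first N data rows of a
# tab-delimited table, keeping only the first three columns.
preview_table() {
    table="$1"
    n_rows="${2:-5}"
    head -n "$((n_rows + 1))" "$table" | cut -f 1-3
}

# Output directory used in the HOMER command above.
OUT_DIR="${HOME}/training_camp/workflow_notebooks/homer_output"

# Assumption: knownResults.txt is the known-motif table written by HOMER.
if [ -f "${OUT_DIR}/knownResults.txt" ]; then
    ls "$OUT_DIR"
    preview_table "${OUT_DIR}/knownResults.txt" 5
else
    echo "No knownResults.txt in ${OUT_DIR} yet" >&2
fi
```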