# 2.4 Creating count coverage tracks #

### IMPORTANT: Please make sure that you are using the bash kernel to run this notebook.

In [1]:
### Set up variables storing the location of our data
### The proper way to load these variables is to source ~/.bashrc, but this is very slow in IPython
export SUNETID="$(whoami)"
export WORK_DIR="/srv/scratch/training_camp/tc2016/${SUNETID}"
export DATA_DIR="${WORK_DIR}/data"
export FASTQ_DIR="${DATA_DIR}/fastq/"
export SRC_DIR="${WORK_DIR}/src/training_camp/src/"

export ANALYSIS_DIR="${WORK_DIR}/analysis/"
export TRIMMED_DIR="${ANALYSIS_DIR}trimmed"
export ALIGNMENT_DIR="${ANALYSIS_DIR}aligned/"
export TAGALIGN_DIR="${ANALYSIS_DIR}tagAlign/"
export PEAKS_DIR="${ANALYSIS_DIR}peaks/"

export YEAST_DIR="/srv/scratch/training_camp/saccer3/seq"
export YEAST_INDEX="/srv/scratch/training_camp/saccer3/bowtie2_index/saccer3"
export YEAST_CHR="/srv/scratch/training_camp/saccer3/sacCer3.chrom.sizes"

export TMP="${WORK_DIR}/tmp"
export TEMP=$TMP 
export TMPDIR=$TMP



Before running the scripts here, make sure your environment variables for the temp folder are set to something other than the default of /tmp, or you may get an out-of-space error:

In [2]:
echo $TMP 
echo $TEMP
echo $TMPDIR 

/srv/scratch/training_camp/tc2016/ubuntu/tmp
/srv/scratch/training_camp/tc2016/ubuntu/tmp
/srv/scratch/training_camp/tc2016/ubuntu/tmp
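
The check above can also be scripted. The following is a minimal sketch, assuming a hypothetical helper named `check_tmp_var` (it is not part of the course scripts), that warns when a temp variable is unset or still points under /tmp:

```bash
# Hypothetical helper (not part of the course scripts): succeed only if the
# given value is non-empty and does not point at /tmp or a path beneath it.
check_tmp_var() {
    name="$1"
    value="$2"
    case "$value" in
        ""|/tmp|/tmp/*)
            echo "WARNING: $name is '$value' (unset or under /tmp)"
            return 1
            ;;
        *)
            echo "OK: $name=$value"
            return 0
            ;;
    esac
}

# Check the three variables that the export cell at the top sets.
for var_name in TMP TEMP TMPDIR; do
    eval "var_value=\${$var_name:-}"
    check_tmp_var "$var_name" "$var_value" || true
done
```

If a warning appears, re-run the export cell at the top of the notebook. Running `mkdir -p "$TMP"` afterwards creates the directory if it does not exist yet.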



We will compute the per-base count coverage for each sample, defined here as the number of read starts at each base in the genome. A read start is the 5’ end of a read, determined in a strand-specific manner, and we count read starts from both strands at each base. This gives us the frequency of cuts at each base.
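
To make the counting rule concrete, here is a small sketch on made-up tagAlign records. It only illustrates the idea with `awk`; it is not the course script, and the five toy reads are invented. In BED-style coordinates (0-based, half-open), the 5’ end of a + strand read is its start and the 5’ end of a - strand read is its end minus one.

```bash
# Toy tagAlign records (tab-separated: chrom, start, end, name, score, strand).
# Coordinates are 0-based and half-open, as in BED. All values are made up.
toy_reads=$(printf 'chrI\t100\t150\tN\t1000\t+\nchrI\t100\t140\tN\t1000\t+\nchrI\t60\t101\tN\t1000\t-\nchrI\t200\t250\tN\t1000\t-\nchrI\t249\t300\tN\t1000\t+\n')

# The 5' end is the start for a + read and (end - 1) for a - read.
# Tally read starts per base and print them as bedGraph intervals.
counts=$(printf '%s\n' "$toy_reads" | awk 'BEGIN { FS = OFS = "\t" }
    { pos = ($6 == "+") ? $2 : $3 - 1; n[$1 OFS pos]++ }
    END { for (k in n) { split(k, f, OFS); print f[1], f[2], f[2] + 1, n[k] } }' \
    | sort -k1,1 -k2,2n)

printf '%s\n' "$counts"
```

The result has two lines: three read starts at position 100 (two + reads and one - read) and two read starts at position 249.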

Note that this coverage is unnormalized, i.e. you can’t compare the per-base values across samples: a sample with more reads overall (greater sequencing depth) can have higher coverage values simply because of that depth. The normalized signal tracks that we will generate with the peak caller MACS2 are more comparable.
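
The effect of sequencing depth can be seen with a little arithmetic. The sketch below scales raw counts to counts per million reads. This is a simple illustration with made-up numbers and a hypothetical helper `per_million`; it is not the normalization that MACS2 performs.

```bash
# Hypothetical helper: scale a raw count to counts per million reads.
per_million() {
    awk -v count="$1" -v depth="$2" 'BEGIN { printf "%.2f\n", count * 1000000 / depth }'
}

# Made-up numbers: sample B was sequenced twice as deeply as sample A.
sample_a=$(per_million 10 2000000)
sample_b=$(per_million 20 4000000)

echo "raw counts at one base:  A=10 B=20"
echo "per million reads:       A=$sample_a B=$sample_b"
```

The raw count in sample B is twice that of sample A, yet after scaling by depth the two values are identical, so the raw difference reflects depth alone.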

Look at the script **$SRC_DIR/create_countCoverageTracks.sh**. It uses the genomeCoverageBed utility to create the count coverage files. You can see the usage instructions for genomeCoverageBed by running genomeCoverageBed -h.

In [3]:
genomeCoverageBed -h


Tool:    bedtools genomecov (aka genomeCoverageBed)
Version: v2.17.0
Summary: Compute the coverage of a feature file among a genome.

Usage: bedtools genomecov [OPTIONS] -i <bed/gff/vcf> -g <genome>

Options: 
	-ibam		The input file is in BAM format.
			Note: BAM _must_ be sorted by position

	-d		Report the depth at each genome position (with one-based coordinates).
			Default behavior is to report a histogram.

	-dz		Report the depth at each genome position (with zero-based coordinates).
			Reports only non-zero positions.
			Default behavior is to report a histogram.

	-bg		Report depth in BedGraph format. For details, see:
			genome.ucsc.edu/goldenPath/help/bedgraph.html

	-bga		Report depth in BedGraph format, as above (-bg).
			However with this option, regions with zero 
			coverage are also reported. This allows one to
			quickly extract all regions of a genome with 0 
			coverage by applying: "grep -w 0$" to the output.

	-split		Treat "split" BAM o

Additional documentation on this and other bed utilities can be found at:

BEDTools software: https://code.google.com/p/bedtools/

BEDTools manual: http://bedtools.readthedocs.org/en/latest/
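
As an illustration of how these options can be combined into a count track, here is a hedged sketch on toy data. It is not necessarily what create_countCoverageTracks.sh does, and the file names are made up. It uses the `-5` option of `bedtools genomecov`, which restricts coverage to the 5’ position of each interval, together with `-bg` for bedGraph output. The bedtools step runs only if bedtools is installed.

```bash
# Toy inputs with made-up names, written to a scratch directory.
demo=$(mktemp -d)
printf 'chrI\t1000\n' > "$demo/toy.chrom.sizes"
printf 'chrI\t100\t150\tN\t1000\t+\nchrI\t60\t101\tN\t1000\t-\n' \
    | gzip -c > "$demo/toy.tagAlign.gz"

if command -v bedtools >/dev/null 2>&1; then
    # Decompress, group by chromosome, then count 5' ends per base.
    # -5  : use only the 5' position of each read (strand-aware)
    # -bg : report the result in bedGraph format
    gzip -dc "$demo/toy.tagAlign.gz" \
        | sort -k1,1 -k2,2n \
        | bedtools genomecov -5 -bg -i stdin -g "$demo/toy.chrom.sizes" \
        > "$demo/toy.count.bedgraph"
    cat "$demo/toy.count.bedgraph"
else
    echo "bedtools not found; skipping the coverage step"
fi
```

For real data, the toy chromosome sizes file would be replaced by `$YEAST_CHR` and the toy reads by one of the files in `$TAGALIGN_DIR`.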

We will perform the required operations in batch mode using **$SRC_DIR/batch_countCoverage.sh**, which will submit a series of jobs to the queue (each job takes several minutes to run).


In [4]:
$SRC_DIR/batch_countCoverage.sh

Your job 69 ("jobWT-1h_S7_L001_R1_001.trimmed.nodup.tagAlign.gz") has been submitted
Your job 70 ("jobWT-3h_S14_L001_R1_001.trimmed.nodup.tagAlign.gz") has been submitted
Your job 71 ("jobCu-1h_S4_L001_R1_001.trimmed.nodup.tagAlign.gz") has been submitted
Your job 72 ("jobDMSO-1h_S6_L001_R1_001.trimmed.nodup.tagAlign.gz") has been submitted
Your job 73 ("jobKz-1h_S1_L001_R1_001.trimmed.nodup.tagAlign.gz") has been submitted
Your job 74 ("jobCt-3h_S12_L001_R1_001.trimmed.nodup.tagAlign.gz") has been submitted
Your job 75 ("jobKz-3h_S8_L001_R1_001.trimmed.nodup.tagAlign.gz") has been submitted
Your job 76 ("jobCu-3h_S11_L001_R1_001.trimmed.nodup.tagAlign.gz") has been submitted
Your job 77 ("jobCt-1h_S5_L001_R1_001.trimmed.nodup.tagAlign.gz") has been submitted
Your job 78 ("jobMz-3h_S10_L001_R1_001.trimmed.nodup.tagAlign.gz") has been submitted
Your job 79 ("jobCz-1h_S2_L001_R1_001.trimmed.nodup.tagAlign.gz") has been submitted
Your job 80 ("jobDMSO-3h_S13_L001_R1_001.trimmed
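
The submission messages above are in the format printed by Grid Engine's `qsub`. A batch script of this kind can be sketched as a loop that submits one job per tagAlign file. The version below is an assumption about the general shape, not the contents of batch_countCoverage.sh: the function `submit_coverage_jobs`, the script name `per_sample.sh`, and the one-file-per-job argument convention are all hypothetical. It defaults to a dry run that prints the commands instead of submitting them.

```bash
# Hypothetical sketch of a batch submitter. With DRY_RUN=1 it prints each
# qsub command instead of submitting it.
submit_coverage_jobs() {
    tagalign_dir="$1"   # directory holding *.tagAlign.gz files
    job_script="$2"     # per-sample script; assumed to take one file argument
    for f in "$tagalign_dir"/*.tagAlign.gz; do
        [ -e "$f" ] || continue    # the glob matched nothing
        # Prefix the name with "job" so it never starts with a digit
        job_name="job$(basename "$f")"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "qsub -N $job_name -cwd -V $job_script $f"
        else
            # -N: job name, -cwd: run in current directory, -V: export environment
            qsub -N "$job_name" -cwd -V "$job_script" "$f"
        fi
    done
}

# Demo on empty placeholder files with made-up names.
demo=$(mktemp -d)
touch "$demo/WT-1h.tagAlign.gz" "$demo/WT-3h.tagAlign.gz"
DRY_RUN=1
submitted=$(submit_coverage_jobs "$demo" ./per_sample.sh)
printf '%s\n' "$submitted"
```

After a real submission, `qstat` lists the jobs that are still queued or running. Wait until none of these jobs remain before using their output files.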

Let's create a new "signal" directory to store the counts and fold change bigWig files. 

In [6]:
# Create a directory to store the signal data (mkdir -p is a no-op if it already exists)
SIGNAL_DIR="${ANALYSIS_DIR}signal/"
mkdir -p "$SIGNAL_DIR"

# Create a directory to store the fold change data
FOLDCHANGE_DIR="${SIGNAL_DIR}foldChange/"
mkdir -p "$FOLDCHANGE_DIR"

# Create a directory to store the counts data
COUNTS_DIR="${SIGNAL_DIR}counts/"
mkdir -p "$COUNTS_DIR"



In [None]:
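# Move the count coverage files out of the tagAlign directory
# (run this after the batch jobs submitted above have finished)
cd "$TAGALIGN_DIR"
mv *.count.bedgraph.gz *.count.bigWig "$COUNTS_DIR"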

Convert the fold change files from bedGraph to bigWig format and write them to $FOLDCHANGE_DIR.

In [None]:
cd "$PEAKS_DIR"
for fold_change_file in *FE.bdg
do
    # Swap the .bdg suffix for .bigWig and write the result into $FOLDCHANGE_DIR
    fold_change_bigwig_file="${FOLDCHANGE_DIR}$(basename "$fold_change_file" .bdg).bigWig"
    bedGraphToBigWig "$fold_change_file" "$YEAST_CHR" "$fold_change_bigwig_file"
done
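
bedGraphToBigWig expects its input to be sorted by chromosome and then by start position, and it stops with an error otherwise. Whether the fold change files here need sorting is not something this notebook states, so treat the following as a sketch to fall back on if the conversion complains. The toy file name and values are made up.

```bash
# Toy fold-enrichment bedGraph with made-up values, deliberately out of order.
demo=$(mktemp -d)
printf 'chrII\t50\t60\t1.5\nchrI\t200\t210\t3.0\nchrI\t10\t20\t2.0\n' > "$demo/toy_FE.bdg"

# Sort by chromosome name in byte order (hence LC_ALL=C), then numerically by start.
LC_ALL=C sort -k1,1 -k2,2n "$demo/toy_FE.bdg" > "$demo/toy_FE.sorted.bdg"

cat "$demo/toy_FE.sorted.bdg"
```

The sorted file can then be passed to bedGraphToBigWig in place of the original, with `$YEAST_CHR` as the chromosome sizes file.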