# 1.3 Getting ready to run code on the cluster #

Now that you can submit jobs like any self-respecting Unix ninja, you are ready to start analyzing data! Here you will learn how to organize your research directory and set up the cluster environment so that you can access all the software you wish to use.

## Organizing your research like a pro ##

This is a really nice paper with guidelines on organizing computational projects in a tidy and snazzy fashion: (http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424). Let's see this in action!

First, define a variable with the training camp directory:

In [1]:
%%bash 
TCDIR=/tc2016/$(whoami)

Organize your folder into subdirectories like a pro:

In [2]:
%%bash 
TCDIR=/tc2016/$(whoami)  # each %%bash cell starts a new shell, so define the variable again
cd "${TCDIR}"
mkdir -p data src
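
Keep in mind that each %%bash cell runs in its own, fresh shell process, so a variable defined in one cell is not visible in the next. The sketch below illustrates this with two separate bash -c calls standing in for two notebook cells; the value /tc2016/demo is only a placeholder.

```shell
# Make sure the variable is not inherited from the current environment
unset TCDIR

# "Cell 1": the variable exists only inside this shell process
bash -c 'TCDIR=/tc2016/demo; echo "cell 1: ${TCDIR}"'

# "Cell 2": a new shell process, so the variable is no longer defined
bash -c 'echo "cell 2: ${TCDIR:-not set}"'
```

In practice this means a variable should be defined in every cell that uses it.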

## Preparing to run code on the cluster ## 

Our data processing will use multiple software tools. To access them, we add their paths to our session by loading their respective modules.

To load a module, you can type
**module load [desiredModule]**

Once a module is loaded, you can use the code associated with that module directly. For instance, let's say you want to load a module for BEDTools (a software package we will be using in this training camp). If you run:

In [4]:
%%bash 
module load bedtools/2.25.0

bash: line 1: module: command not found


it loads the bedtools code, so that when you are ready to use it, you can call its commands directly, e.g.:


In [6]:
%%bash 
bedtools -h

bedtools: flexible tools for genome arithmetic and DNA sequence analysis.
usage:    bedtools <subcommand> [options]

The bedtools sub-commands include:

[ Genome arithmetic ]
    intersect     Find overlapping intervals in various ways.
    window        Find overlapping intervals within a window around an interval.
    closest       Find the closest, potentially non-overlapping interval.
    coverage      Compute the coverage over defined intervals.
    map           Apply a function to a column for each overlapping interval.
    genomecov     Compute the coverage over an entire genome.
    merge         Combine overlapping/nearby intervals into a single interval.
    cluster       Cluster (but don't merge) overlapping/nearby intervals.
    complement    Extract intervals _not_ represented by an interval file.
    subtract      Remove intervals based on overlaps b/w two files.
    slop          Adjust the size of intervals.
    flank         Create new intervals from the flanks of exi
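
Under the hood, loading a module mostly amounts to adjusting environment variables such as PATH so that the shell can find the tool. The sketch below imitates that effect without the module command. Note that mytool and its temporary directory are made-up stand-ins, not part of the cluster setup.

```shell
# Made-up tool living in a temporary "install" directory
tool_dir="$(mktemp -d)"
printf '%s\n' '#!/bin/sh' 'echo "hello from mytool"' > "${tool_dir}/mytool"
chmod +x "${tool_dir}/mytool"

# Before: the shell cannot find the tool
command -v mytool >/dev/null 2>&1 || echo "mytool: not on PATH yet"

# Roughly what loading a module does: put the tool's directory on PATH
export PATH="${tool_dir}:${PATH}"

# After: the tool can be called directly by name
mytool
```

The same check, command -v bedtools, is a quick way to confirm that a module really made its tool available before you start a long analysis.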

Don't worry, you do not need to know off the top of your head the names of the modules you want. To see all software modules available on the AWS cluster, type:

In [8]:
%%bash 
module avail

bash: line 1: module: command not found
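
A message such as "module: command not found" typically means the cell ran in a shell where the module command has not been set up, for example on a machine that is not part of the cluster. The following sketch guards against that and searches the module list for one tool. The helper name search_modules is hypothetical, and the 2>&1 redirection assumes that module avail prints its listing to standard error, which is the usual behavior but may differ on your system.

```shell
# Hypothetical helper: search the list of available modules for a tool name
search_modules() {
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not available in this shell" >&2
        return 2
    fi
    # module avail usually prints to stderr, hence the 2>&1 before the pipe
    module avail 2>&1 | grep -i -- "$1"
}

search_modules bedtools || echo "no matching module found (or not on the cluster)"
```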


## The .bashrc file (=your friend) ##

Wouldn't it be nice to have everything ready to run when you log into the cluster?
To avoid having to run module load commands every time you log in, you can add these commands to a .bashrc file, located in your home directory. The .bashrc file contains a set of commands that get executed every time you log into the server. In this way, every time you log in, you will be all set to run all code you wish.

Note: Technically, the ~/.bashrc file is not what's executed on login; it's ~/.bash_profile, which in turn calls ~/.bashrc. If your .bash_profile does not call .bashrc, put the line source ~/.bashrc in your .bash_profile. The difference between the two files is explained here: http://www.joshstaiger.org/archives/2005/07/bash_profile_vs.html
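
To see this hand-over in action without touching your real files, the sketch below builds a scratch home directory, gives it a .bashrc and a .bash_profile that sources it, and starts a login shell there. The variable DEMO_SHORTCUT and the scratch directory are made up for the demonstration.

```shell
demo_home="$(mktemp -d)"   # scratch directory standing in for your home directory

# A .bashrc that defines a made-up shortcut variable
echo 'export DEMO_SHORTCUT="set by .bashrc"' > "${demo_home}/.bashrc"

# A .bash_profile that hands over to .bashrc
cat > "${demo_home}/.bash_profile" <<'EOF'
if [ -f ~/.bashrc ]; then
    source ~/.bashrc
fi
EOF

# Start a login shell that uses the scratch home and print the variable
HOME="${demo_home}" bash -l -c 'echo "DEMO_SHORTCUT=${DEMO_SHORTCUT:-not set}"'
```

The three lines inside the here-document are what you would place in your own ~/.bash_profile if it does not already call ~/.bashrc.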

Let's add all our desired module loading commands into a .bashrc file. 

In [19]:
%%bash

bowtie_load='module load bowtie/2.0.5'
bedtools_load='module load bedtools/2.19.1'
fastqc_load='module load fastqc/0.11.2'
java_load='module load java/latest'
picard_load='module load picard-tools/1.92'
r_load='module load r/3.0.1'
samtools_load='module load samtools/0.1.19'
ucsc_load='module load ucsc_tools/2.7.2'
macs2_load='module load MACS2/2.1.0'
homer_load='module load homer/default'

# only add the module load commands to the ~/.bashrc file if they don't exist in this file already
# (-x matches whole lines and -F treats the text literally, so the dots are not regex wildcards)
grep -xF "$bowtie_load" ~/.bashrc || echo "$bowtie_load" >> ~/.bashrc
grep -xF "$bedtools_load" ~/.bashrc || echo "$bedtools_load" >> ~/.bashrc
grep -xF "$fastqc_load" ~/.bashrc || echo "$fastqc_load" >> ~/.bashrc
grep -xF "$java_load" ~/.bashrc || echo "$java_load" >> ~/.bashrc
grep -xF "$picard_load" ~/.bashrc || echo "$picard_load" >> ~/.bashrc
grep -xF "$r_load" ~/.bashrc || echo "$r_load" >> ~/.bashrc
grep -xF "$samtools_load" ~/.bashrc || echo "$samtools_load" >> ~/.bashrc
grep -xF "$ucsc_load" ~/.bashrc || echo "$ucsc_load" >> ~/.bashrc
grep -xF "$macs2_load" ~/.bashrc || echo "$macs2_load" >> ~/.bashrc
grep -xF "$homer_load" ~/.bashrc || echo "$homer_load" >> ~/.bashrc



module load bowtie/2.0.5
module load bedtools/2.19.1
module load fastqc/0.11.2
module load java/latest
module load picard-tools/1.92
module load r/3.0.1
module load samtools/0.1.19
module load ucsc_tools/2.7.2
module load MACS2/2.1.0
module load homer/default
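
The check-then-append pattern used above can also be wrapped in a small helper and driven by a loop. This is only a sketch: the function name append_once is made up, and it writes to a temporary file rather than your real ~/.bashrc so that it is safe to try.

```shell
# Hypothetical helper: append a line to a file only if that exact line is missing
append_once() {
    line="$1"
    file="$2"
    touch "${file}"
    grep -qxF -- "${line}" "${file}" || printf '%s\n' "${line}" >> "${file}"
}

demo_rc="$(mktemp)"   # stand-in for ~/.bashrc

for mod in bowtie/2.0.5 bedtools/2.19.1 samtools/0.1.19; do
    append_once "module load ${mod}" "${demo_rc}"
done

# Running it again for an existing entry leaves the file unchanged
append_once "module load bowtie/2.0.5" "${demo_rc}"

cat "${demo_rc}"
```

Because the helper is safe to call repeatedly, a cell built on it can be re-run without piling up duplicate lines.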


## Defining shortcuts in the .bashrc file ##

Why stop here? You can make all your dreams come true in the .bashrc file!
For instance, you can add shortcuts to your directories of interest to the .bashrc file, which you can then use seamlessly. Add the following to your .bashrc.

Note: run the cell below only once; otherwise the entries are added to your .bashrc file multiple times. You might want to comment out this cell after you have run the commands (i.e. add # in front of each line).

In [27]:
%%bash 
echo 'export SUNETID="$(whoami)"' >> ~/.bashrc 
echo 'export WORK_DIR="/tc2016/${SUNETID}"' >> ~/.bashrc
echo 'export DATA_DIR="${WORK_DIR}/data"' >> ~/.bashrc
echo '[[ ! -d ${DATA_DIR} ]] && mkdir "${DATA_DIR}"' >> ~/.bashrc
echo 'export SRC_DIR="${WORK_DIR}/src/training_camp/src"' >> ~/.bashrc
echo '[[ ! -d ${WORK_DIR}/src ]] && mkdir -p "${WORK_DIR}/src"' >> ~/.bashrc
echo 'export AK_DATA_DIR="/srv/scratch/trainingCamp/2016/data/2016"' >> ~/.bashrc
echo 'export AK_TOOL_DIR="/srv/scratch/trainingCamp/2016/tools"' >> ~/.bashrc
echo 'export YEAST_DIR="${AK_TOOL_DIR}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence"' >> ~/.bashrc 
echo 'export YEAST_INDEX="${YEAST_DIR}/Bowtie2Index/genome"' >> ~/.bashrc
echo 'export YEAST_CHR="${YEAST_DIR}/Chromosomes"' >> ~/.bashrc
echo 'export TMP="${WORK_DIR}/data/tmp"' >> ~/.bashrc
echo 'export TEMP=$TMP' >> ~/.bashrc
echo 'export TMPDIR=$TMP' >> ~/.bashrc
echo 'mkdir -p "$TMP"' >> ~/.bashrc

export AK_DATA_DIR="/srv/scratch/trainingCamp/2016/data/2016"
export AK_TOOL_DIR="/srv/scratch/trainingCamp/2016/tools"


**\$\{WORK\_DIR\}** is your main work directory

**\$\{SRC\_DIR\}** is the src/ directory of the training\_camp code, located inside your own src/ directory

**\$\{DATA\_DIR\}** is your data/ directory

**\$\{AK\_DATA\_DIR\}** is the directory with the fastq files (the data you will use in this project)

To see your ~/.bashrc and ~/.bash_profile files in action, log out and log in again. All modules should be loaded and all shortcuts should be set!
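
Before relying on the shortcuts, it can help to check that they are actually set, because an empty variable silently turns a path such as ${DATA_DIR}/fastq into /fastq. The sketch below shows one way to do that; the function name check_vars and the placeholder value are assumptions made for illustration.

```shell
# Hypothetical helper: report every named variable that is unset or empty
check_vars() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name
        eval "value=\${${name}:-}"
        if [ -z "${value}" ]; then
            echo "${name} is not set; was ~/.bashrc read by this shell?" >&2
            missing=1
        fi
    done
    return "${missing}"
}

WORK_DIR="/tc2016/demo"   # placeholder value for the demonstration
unset DATA_DIR            # simulate a shell in which ~/.bashrc was not read

check_vars WORK_DIR DATA_DIR || echo "environment incomplete, stopping here"
```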

## Data and code for this project ##

Make symbolic links to the data (symbolic links are pointers to other files; they behave like copies of the original file but without actually duplicating the data):

In [32]:
%%bash 
mkdir -p "${DATA_DIR}/fastq" # makes a directory to contain the symbolic links
ln -s -r ${AK_DATA_DIR}/* "${DATA_DIR}/fastq/"  # creates the symbolic links (-r is GNU ln's --relative option: it stores the link targets as relative paths)

mkdir: cannot create directory ‘/fastq’: Permission denied
ln: target ‘/fastq/’ is not a directory: No such file or directory
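
If you would like to try symbolic links somewhere harmless first, the sketch below repeats the same two steps with temporary directories standing in for ${AK_DATA_DIR} and ${DATA_DIR}/fastq. The file name sample.fastq and its content are made up. It uses plain ln -s with absolute paths, so it does not depend on the GNU-only -r option.

```shell
src_dir="$(mktemp -d)"          # stands in for ${AK_DATA_DIR}
dest_dir="$(mktemp -d)/fastq"   # stands in for ${DATA_DIR}/fastq

# A made-up data file in the source directory
echo "ACGT" > "${src_dir}/sample.fastq"

# Same two steps as in the cell above: make the directory, then link into it
mkdir -p "${dest_dir}"
ln -s "${src_dir}"/* "${dest_dir}/"

# The link is listed with an arrow pointing at the original file
ls -l "${dest_dir}"

# Reading through the link returns the original content
cat "${dest_dir}/sample.fastq"
```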


Get the code we provide for this project:

In [33]:
%%bash 
cd ${WORK_DIR}/src
git clone https://github.com/kundajelab/training_camp

bash: line 1: cd: /src: No such file or directory
Cloning into 'training_camp'...


## Writing code and code version control ##

Your code is precious! For this reason, you want to: 1) make sure it is backed up, and 2) keep a "diary" of all versions of your code so that you can track its progress and revert to previous versions if need be. Both of these goals can be achieved with version control software such as Git, combined with a hosting service such as GitHub.

In this training camp you will become familiar with GitHub.

First, go to GitHub, make an account, and create a repository. Let's assume you named the repository myExampleRepo.

Then, to get used to GitHub, you can try to make a test script and add it to your repository on GitHub.

First, we will get a copy of your GitHub repository in your directory:

In [34]:
%%bash 
cd ${WORK_DIR}
mkdir testGithub
cd testGithub
export githubusername="FILL IN YOUR USERNAME HERE!"
git clone https://github.com/$githubusername/myExampleRepo
ls

bash: line 4: githubusername: No such file or directory


You should now see a directory named myExampleRepo in the current working directory. We will now make a new file and add it to the repository. We add some text to this script using the "echo" command.

In [37]:
%%bash 
cd myExampleRepo
touch ${SUNETID}_testscript.sh
echo "Here is some example text" >> ${SUNETID}_testscript.sh

bash: line 1: cd: myExampleRepo: No such file or directory


Now, you will learn how to "push" this script to GitHub.

First, get the most current version of the code:

In [38]:
%%bash 
git pull 

Already up-to-date.


You will probably be told that your repository is already up-to-date. However, if you are collaborating with other people, or have modified the code from a different computer, it is very important to always do a git pull before pushing your changes.

Next, stage your script so that it becomes part of the next commit:

In [None]:
%%bash 
git add ${SUNETID}_testscript.sh

Record the staged change in your local repository (a so-called commit):

In [None]:
%%bash 
git commit -m "my first script on github"

Then, push your commit to GitHub:

In [None]:
%%bash 
git push

When you visit the GitHub page for your repository, you should be able to see your code!
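
To rehearse the whole add, commit, and push cycle without a GitHub account, you can point Git at a local "remote" instead. The sketch below assumes git is installed; the bare repository in a temporary directory plays the role of GitHub, and the file name, user name, and e-mail address are placeholders.

```shell
remote="$(mktemp -d)/remote.git"     # local stand-in for the GitHub repository
work="$(mktemp -d)/myExampleRepo"    # your working copy

git init --quiet --bare "${remote}"
git clone --quiet "${remote}" "${work}"
cd "${work}"

# Placeholder identity, stored only in this demo repository
git config user.name "Demo User"
git config user.email "demo@example.com"

# Create a script, then stage, commit, and push it
echo "Here is some example text" > demo_testscript.sh
git add demo_testscript.sh
git commit --quiet -m "my first script on github"
git push --quiet origin HEAD

# The commit is now part of the history
git log --oneline
```

With a real GitHub repository the commands are the same; only the clone URL differs.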

These are only the most basic commands. For more on Git and GitHub, see https://training.github.com/kit/downloads/github-git-cheat-sheet.pdf

Let the analysis begin!