# 1.3 Getting ready to run code on the cluster #

### IMPORTANT: Please make sure that you are using the bash kernel to run this notebook.

Now that you can submit jobs like any self-respecting Unix ninja, you are ready to start analyzing data! Here you will learn how to organize your research directory and set up the cluster environment to access all the software you wish to use.

## Organizing your research as a pro ##

This is a really nice paper with guidelines on organizing computational projects in a clean and snazzy fashion: (http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424). Let's see this in action!

First, define a variable with the training camp directory:

In [16]:
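# $(...) and backticks are two equivalent forms of command substitution; either line alone is enough
export WORKDIR=/srv/scratch/training_camp/work/$(whoami)
export WORKDIR=/srv/scratch/training_camp/work/`whoami`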

echo $WORKDIR

/srv/scratch/training_camp/work/ubuntu
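
The two `export` lines above use the two spellings of command substitution. As a small sketch of why the `$(...)` form is usually preferred, the example below nests one substitution inside another. The path is the example path from the output above; `work_path`, `parent_name` and `parent_name_bt` are made-up variable names.

```bash
# Example path taken from the output above
work_path="/srv/scratch/training_camp/work/ubuntu"

# $(...) nests cleanly: the inner command runs first and feeds the outer one
parent_name="$(basename "$(dirname "${work_path}")")"
echo "${parent_name}"      # prints: work

# The same thing with backticks needs the inner backticks escaped
parent_name_bt=`basename \`dirname "${work_path}"\``
echo "${parent_name_bt}"   # prints: work
```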


In [17]:
whoami

ubuntu


Organize your folder into subdirectories as a pro: 

In [18]:
cd ${WORKDIR}
mkdir -p data src  # -p: no error if the directories already exist
ls

all.fc.bigwig		cln3-SCE-0_6MNaCl-Rep1	WT-SCD-Rep1
all.fc.txt		cln3-SCE-0_6MNaCl-Rep2	WT-SCD-Rep1_out
all_merged.peaks.bed	cln3-SCE-Rep1		WT-SCD-Rep1_R1_001.fc.signal
all_merged.peaks.bed~	cln3-SCE-Rep2		WT-SCD-Rep1_R1_001.tagAlign
all.peaks.bed		data			WT-SCD-Rep2
all.peaks.sorted.bed	narrowPeak_files.txt	WT-SCD-Rep2_out
all.pval.bigwig		src			WT-SCD-Rep2_R1_001.fc.signal
all.pval.txt		tmp			WT-SCD-Rep2_R1_001.tagAlign
all.tagAlign		whi5-cln3-SCE-Rep1	WT-SCE-0_6MNaCl-Rep1
all.tagAlign.files.txt	whi5-cln3-SCE-Rep2	WT-SCE-0_6MNaCl-Rep2
cln3-SCD-0_6MNaCl-Rep1	whi5-SCE-Rep1		WT-SCE-Rep1
cln3-SCD-0_6MNaCl-Rep2	whi5-SCE-Rep2		WT-SCE-Rep2
cln3-SCD-Rep1		WT-SCD-0_6MNaCl-Rep1
cln3-SCD-Rep2		WT-SCD-0_6MNaCl-Rep2


In [19]:
cd $WORKDIR



In [20]:
pwd

/srv/scratch/training_camp/work/ubuntu
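
As a minimal sketch of setting up such a layout from scratch, the loop below creates a project skeleton inside a temporary directory. `data` and `src` are the subdirectories used in this tutorial; `results` and `doc`, as well as the name `project_root`, are illustrative assumptions rather than part of the training camp setup.

```bash
# Hypothetical project root; a temporary directory keeps this sketch harmless
project_root="$(mktemp -d)"

# data/ and src/ come from the tutorial; results/ and doc/ are illustrative additions
for sub in data src results doc; do
    mkdir -p "${project_root}/${sub}"   # -p: create if missing, stay silent if present
done

# List the freshly created skeleton
ls "${project_root}"
```

Because of `-p`, the loop can be repeated any number of times without raising an error.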


## Preparing to run code on the cluster ## 

Our data processing will use multiple software tools. To access them, we load the corresponding modules, which adds each tool's location to our session's search path.

To load a module, you can type
**module load [desiredModule]** - this modifies your environment, most importantly your PATH, so that the shell can find the tool
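
As a rough picture of what "modifying your path" means, the sketch below fakes a tool in a temporary directory and makes it callable by prepending that directory to `PATH`. `mytool` and `tool_dir` are made-up names used only for illustration; a real modulefile may set other environment variables as well.

```bash
# Hypothetical directory standing in for a module's install location
tool_dir="$(mktemp -d)"
printf '#!/bin/sh\necho "hello from mytool"\n' > "${tool_dir}/mytool"
chmod +x "${tool_dir}/mytool"

# Before: the shell cannot find the tool by name
command -v mytool || echo "mytool not found"

# Roughly what loading a module does: put the tool's directory on PATH
export PATH="${tool_dir}:${PATH}"

# After: the tool can be called directly
command -v mytool
mytool
```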

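Once a module is loaded, you can use the code associated with that module directly. For instance, let's say you want to load a module for BEDTools (a software package we will be using in this training camp). If you run the following (the first command, which sources /etc/profile.d/modules.sh, defines the module command in this session):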

In [21]:
source /etc/profile.d/modules.sh



In [22]:
module load bedtools/2.17.0



In [23]:
module list

Currently Loaded Modulefiles:
  1) bedtools/2.17.0


It loads the bedtools code, so that you can call its commands directly. Note that the -h or --help arguments can often be used to get help on a particular tool.


In [24]:
bedtools -h

bedtools: flexible tools for genome arithmetic and DNA sequence analysis.
usage:    bedtools <subcommand> [options]

The bedtools sub-commands include:

[ Genome arithmetic ]
    intersect     Find overlapping intervals in various ways.
    window        Find overlapping intervals within a window around an interval.
    closest       Find the closest, potentially non-overlapping interval.
    coverage      Compute the coverage over defined intervals.
    map           Apply a function to a column for each overlapping interval.
    genomecov     Compute the coverage over an entire genome.
    merge         Combine overlapping/nearby intervals into a single interval.
    cluster       Cluster (but don't merge) overlapping/nearby intervals.
    complement    Extract intervals _not_ represented by an interval file.
    subtract      Remove intervals based on overlaps b/w two files.
    slop          Adjust the size of intervals.
    flank         Create new intervals from 
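
Before relying on a tool in a script, it can help to check that the shell can actually find it. The helper below is one possible sketch; `require_tool` is a hypothetical function name, not something provided by the module system.

```bash
# require_tool is a hypothetical helper name, not part of the module system
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at: $(command -v "$1")"
    else
        echo "$1 not found on PATH; was its module loaded?" >&2
        return 1
    fi
}

# ls is available on any system; bedtools only once its module is loaded
require_tool ls
require_tool bedtools || echo "load the bedtools module first"
```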

Don't worry, you do not need to memorize the names of the modules you want. To see all software modules available on the AWS cluster, type:

In [25]:
module avail


------------------------- /usr/share/modules/versions --------------------------
3.2.10

------------------------ /usr/share/modules/modulefiles ------------------------
bamtools/default  homer/default     modules           ucsc_tools/2.7.2
bedtools/2.17.0   java/latest       null              use.own
bowtie/2.1.0      MACS2/2.0.9       picard-tools/1.95
dot               module-git        r/3.0.2
fastqc/0.10.1     module-info       samtools/0.1.19


## The .bashrc file (=your friend) ##

In [26]:
#Where is .bashrc?
#In our home directory
cd ~
pwd
ls
#But this doesn't show .bashrc...
ls -ah #this does (-a lists hidden files, whose names start with a dot)
#.bash_logout holds commands that run automatically when you log out

/home/ubuntu
jupyterhub  populate_users.sh  sge_setup  training_camp
.		.cache	    .gitconfig	 .login.old	    sge_setup
..		.conda	    .gnupg	 .oracle_jre_usage  .ssh
.bash_history	.config     .ipython	 populate_users.sh  training_camp
.bash_logout	.continuum  .jupyter	 .profile	    .viminfo
.bashrc		.cshrc	    jupyterhub	 .profile.old	    .Xauthority
.bashrc~	.cshrc.old  .kshenv	 .python-eggs
.bashrc_backup	.dbus	    .kshenv.old  .python_history
.bashrc.old	.emacs.d    .local	 .Rprofile
.bds		.gconf	    .login	 .selected_editor


Wouldn't it be nice to have everything ready to run when you log into the cluster?
To avoid having to run module load commands every time you log in, you can add these commands to a .bashrc file, located in your home directory. The .bashrc file contains a set of commands that get executed every time you start a new shell on the server. In this way, every time you log in, you will be all set to run all the code you wish.

Note: Technically, a login shell does not read ~/.bashrc directly; it reads ~/.bash_profile (or, if that file does not exist, ~/.bash_login or ~/.profile), which usually sources ~/.bashrc in turn. If your login file does not do so, add the line source ~/.bashrc to it. The difference between the two files is explained here: http://www.joshstaiger.org/archives/2005/07/bash_profile_vs.html
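
The guard that a login file typically uses to pull in .bashrc can be sketched as follows. To stay harmless, this example writes both files into a temporary directory (`demo_home`, a made-up name) instead of your real home directory, and `DEMO_FROM_BASHRC` is a made-up variable; in a real ~/.bash_profile the path would simply be ~/.bashrc.

```bash
# Temporary stand-in for the home directory
demo_home="$(mktemp -d)"

# A tiny .bashrc that sets one (made-up) variable
echo 'export DEMO_FROM_BASHRC=yes' > "${demo_home}/.bashrc"

# A .bash_profile that sources .bashrc only if that file exists
cat > "${demo_home}/.bash_profile" <<EOF
if [ -f "${demo_home}/.bashrc" ]; then
    . "${demo_home}/.bashrc"
fi
EOF

# Simulate what a login shell does: read the profile, which reads .bashrc
. "${demo_home}/.bash_profile"
echo "DEMO_FROM_BASHRC=${DEMO_FROM_BASHRC}"
```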

Let's add all our desired module loading commands into a .bashrc file. 

In [27]:
bedtools_load='module load bedtools/2.17.0'
#naive thing:
#NOTE: ~ is a shortcut for your home directory
#echo "$bedtools_load" >> ~/.bashrc #this might clutter up our bashrc if we run it a bunch of times

#only add the module load command to the ~/.bashrc file if it doesn't exist in this file already
#The || acts like an OR; it executes the command on the right if the command on the left fails (here: grep finds no match)
#grep -F searches for a fixed string (no regular-expression interpretation) and prints any matching line
#reminder: "$bedtools_load" expands to 'module load bedtools/2.17.0'
grep -F "$bedtools_load" ~/.bashrc || echo "$bedtools_load" >> ~/.bashrc


module load bedtools/2.17.0
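
The "add only if missing" pattern above can be wrapped in a small reusable function. This is a sketch under assumptions: `append_once` is a hypothetical helper name, and a temporary file (`rc_file`) stands in for ~/.bashrc so that nothing in your home directory is touched.

```bash
# append_once LINE FILE: add LINE to FILE unless an identical line is already there
append_once() {
    # -q: quiet, -x: match whole lines only, -F: fixed string rather than a regex
    grep -qxF -- "$1" "$2" 2>/dev/null || printf '%s\n' "$1" >> "$2"
}

# Temporary stand-in for ~/.bashrc
rc_file="$(mktemp)"

# Calling it twice still leaves a single copy of the line
append_once 'module load bedtools/2.17.0' "${rc_file}"
append_once 'module load bedtools/2.17.0' "${rc_file}"
cat "${rc_file}"
```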


## Defining shortcuts in the .bashrc file ##

Why stop here? You can make all your dreams come true in the .bashrc file!
For instance, you can add to the .bashrc file some shortcuts to your directories of interest, which you can then seamlessly use. Add the following to your .bashrc:

In [28]:
grep -E "shortcuts_defined" ~/.bashrc || 
(echo '#shortcuts_defined:' >> ~/.bashrc &&

echo 'export SUNETID="$(whoami)"' >> ~/.bashrc &&
echo 'export WORK_DIR="/srv/scratch/training_camp/work/${SUNETID}"' >> ~/.bashrc &&

echo 'export DATA_DIR="${WORK_DIR}/data"' >> ~/.bashrc &&
echo '[[ ! -d ${WORK_DIR}/data ]] && mkdir "${WORK_DIR}/data"' >> ~/.bashrc &&

echo 'export SRC_DIR="${WORK_DIR}/src"' >> ~/.bashrc &&
echo '[[ ! -d ${WORK_DIR}/src ]] && mkdir -p "${WORK_DIR}/src"' >> ~/.bashrc &&

echo 'export METADATA_DIR="/srv/scratch/training_camp/metadata"' >> ~/.bashrc &&
echo 'export AGGREGATE_DATA_DIR="/srv/scratch/training_camp/data"' >> ~/.bashrc &&
echo 'export AGGREGATE_ANALYSIS_DIR="/srv/scratch/training_camp/aggregate_analysis"' >> ~/.bashrc &&
echo 'export YEAST_DIR="/srv/scratch/training_camp/saccer3"' >> ~/.bashrc &&

echo 'export TMP="${WORK_DIR}/tmp"' >> ~/.bashrc &&
echo 'export TEMP=$TMP' >> ~/.bashrc && 
echo 'export TMPDIR=$TMP' >> ~/.bashrc && 
echo '[[ ! -d ${TMP} ]] && mkdir -p "${TMP}"' >> ~/.bashrc )

#shortcuts_defined:
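
The chain of `echo ... >> ~/.bashrc` commands can also be written as a single here-document. The sketch below shows that alternative with a shortened set of the same variables; it writes to a temporary file (`rc_file`, a made-up name) rather than your real ~/.bashrc. Quoting the delimiter (`'EOF'`) plays the same role as the single quotes in the `echo` version: the text is stored literally, and the variables are only expanded later, when the file is sourced.

```bash
# Temporary stand-in for ~/.bashrc
rc_file="$(mktemp)"

# Append the block only if the marker comment is not already present
grep -q 'shortcuts_defined' "${rc_file}" || cat >> "${rc_file}" <<'EOF'
#shortcuts_defined:
export SUNETID="$(whoami)"
export WORK_DIR="/srv/scratch/training_camp/work/${SUNETID}"
export DATA_DIR="${WORK_DIR}/data"
export SRC_DIR="${WORK_DIR}/src"
EOF

# The file holds the literal, unexpanded text
cat "${rc_file}"
```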


**\$\{WORK\_DIR\}** is your main work directory.

**\$\{DATA\_DIR\}** is your data/ directory -- used for storing the subset of the data you will be working with.

**\$\{SRC\_DIR\}** is your src/ directory -- used for storing code.

**\$\{METADATA\_DIR\}** is the directory with the metadata file for this year's training camp.

**\$\{AGGREGATE\_ANALYSIS\_DIR\}** is the directory where we will store the aggregate analysis results for all samples, for common use by everyone.

**\$\{AGGREGATE\_DATA\_DIR\}** is the shared data/ directory -- this is where we store all the raw sequencer data generated by the group.

**\$\{YEAST\_DIR\}** is the directory with the yeast reference genome files.

**\$\{TMP\}** (also exported as **\$\{TEMP\}** and **\$\{TMPDIR\}**) is the directory where your temporary files will be stored when you execute code.



To see your ~/.bashrc and ~/.bash_profile files in action, log out and log in again. All modules should be loaded and all shortcuts should be set!

Since logging out and back in would disrupt this tutorial, we execute the same commands directly in this notebook:

In [29]:
export SUNETID="$(whoami)"
export WORK_DIR="/srv/scratch/training_camp/work/${SUNETID}"
export DATA_DIR="${WORK_DIR}/data"
[[ ! -d ${WORK_DIR}/data ]] && mkdir "${WORK_DIR}/data"
export SRC_DIR="${WORK_DIR}/src"
[[ ! -d ${WORK_DIR}/src ]] && mkdir -p "${WORK_DIR}/src"
export METADATA_DIR="/srv/scratch/training_camp/metadata"
export AGGREGATE_DATA_DIR="/srv/scratch/training_camp/data"
export AGGREGATE_ANALYSIS_DIR="/srv/scratch/training_camp/aggregate_analysis"
export YEAST_DIR="/srv/scratch/training_camp/saccer3"
export TMP="${WORK_DIR}/tmp"
export TEMP=$TMP
export TMPDIR=$TMP
[[ ! -d ${TMP} ]] && mkdir -p "${TMP}"
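
One way to confirm that the shortcuts really are set is to loop over their names and ask `printenv` for each one. The sketch below uses stand-in values (a temporary directory for `WORK_DIR`, and `SRC_DIR` deliberately unset) so that it runs anywhere and shows both outcomes; in the tutorial the real values come from the commands above.

```bash
# Stand-in values so the sketch runs anywhere
export WORK_DIR="$(mktemp -d)"
export DATA_DIR="${WORK_DIR}/data"
unset SRC_DIR   # deliberately unset, to show the "not set" branch

# printenv exits with a non-zero status when a variable is not in the environment
for name in WORK_DIR DATA_DIR SRC_DIR; do
    if value="$(printenv "${name}")"; then
        echo "${name}=${value}"
    else
        echo "${name} is not set"
    fi
done
```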




## Data and code for this project ##

It's generally good practice to always keep a backup copy of your raw data files in case you unintentionally delete or modify these files when performing your analysis. 

For this reason, you will copy the two samples you generated from the **\$AGGREGATE_DATA_DIR** folder to your personal **\$DATA_DIR** folder. 

In [15]:
#backup data files from 2017
#cp $AGGREGATE_DATA_DIR/tc2017/WT-SCD-0_6MNaCl-Rep1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/tc2017/WT-SCD-0_6MNaCl-Rep2* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/tc2017/WT-SCD-Rep1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/tc2017/WT-SCD-Rep2* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/tc2017/WT-SCE-0_6MNaCl-Rep1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/tc2017/WT-SCE-0_6MNaCl-Rep2* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/tc2017/WT-SCE-Rep1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/tc2017/WT-SCE-Rep2* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/tc2017/cln3-SCD-0_6MNaCl-Rep1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/tc2017/cln3-SCD-0_6MNaCl-Rep2* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/tc2017/cln3-SCD-Rep1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/tc2017/cln3-SCD-Rep2* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/tc2017/cln3-SCE-0_6MNaCl-Rep1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/tc2017/cln3-SCE-0_6MNaCl-Rep2* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/tc2017/cln3-SCE-Rep1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/tc2017/cln3-SCE-Rep2* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/tc2017/whi5-SCE-Rep1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/tc2017/whi5-SCE-Rep2* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/tc2017/whi5-cln3-SCE-Rep1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/tc2017/whi5-cln3-SCE-Rep2* $DATA_DIR




In [None]:
#new data from 2018 
#cp $AGGREGATE_DATA_DIR/asf1_glucose_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/WT_ethanol_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/asf1_ethanol_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/rtt109_glucose_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/rtt109_ethanol_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/WT_glucose_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/asf1_glucose_2* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/rtt109_ethanol_2* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/rtt109_glucose_2* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/WT_ethanol_2* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/asf1_ethanol_2* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/WT_glucose_2* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/asf1_ethanol_3* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/rtt109_glucose_3* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/rtt109_ethanol_3* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/WT_glucose_3* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/asf1_glucose_3* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/WT_ethanol_3* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/rtt109_ethanol_4* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/WT_glucose_4* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/asf1_glucose_4* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/rtt109_glucose_4* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/asf1_ethanol_4* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/WT_ethanol_4* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/asf1_glucose_5* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/WT_glucose_5* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/rtt109_glucose_5* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/asf1_ethanol_5* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/rtt109_ethanol_5* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/WT_ethanol_5* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/asf1_glucose_6* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/rtt109_ethanol_6* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/WT_ethanol_6* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/rtt109_glucose_6* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/WT_glucose_6* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/asf1_ethanol_6* $DATA_DIR
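
The copy step can also be written as a loop over sample names, followed by a check that each copy is identical to its original. This is only a sketch: temporary directories stand in for `$AGGREGATE_DATA_DIR` and `$DATA_DIR`, and `sampleA` and `sampleB` are made-up sample names with made-up contents.

```bash
# Temporary stand-ins for $AGGREGATE_DATA_DIR and $DATA_DIR
aggregate_dir="$(mktemp -d)"
data_dir="$(mktemp -d)"

# Made-up sample files so the sketch has something to copy
for sample in sampleA sampleB; do
    echo "reads for ${sample}" > "${aggregate_dir}/${sample}_R1.fastq"
done

# Copy every file that starts with each sample name (-p keeps timestamps and permissions)
for sample in sampleA sampleB; do
    cp -p "${aggregate_dir}/${sample}"* "${data_dir}/"
done

# Confirm that each copy is byte-for-byte identical to its original
for original in "${aggregate_dir}"/*; do
    copy="${data_dir}/$(basename "${original}")"
    cmp -s "${original}" "${copy}" && echo "verified: $(basename "${original}")"
done
```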



In [None]:
ls $DATA_DIR

Let the analysis begin!