# 1.2 Getting ready to run code on the cluster #

### IMPORTANT: Please make sure that you are using the bash kernel to run this notebook.

Now that you can submit jobs like any self-respecting Unix ninja, you are ready to start analyzing data! Here you will learn how to organize your research directory and set up the cluster environment so that you can access all the software you wish to use.

## Organizing your research as a pro ##

This is a really nice paper with guidelines on keeping computational projects organized (and snazzy): http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424
![Analysis Pipeline](images/journal.pcbi.1000424.g001.png)

Let's see this in action!


First, let's set up our working directory (also known as the "scratch directory"):

In [None]:
export WORKDIR=/scratch/$(whoami)

echo $WORKDIR

In [None]:
whoami

Organize your folder into subdirectories as a pro: 

In [None]:
cd ${WORKDIR}
mkdir -p data src #-p: no error if the directories already exist
ls

In [None]:
cd $WORKDIR

In [None]:
pwd
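
The layout from the figure above can be sketched as a small script. This is a minimal sketch, not part of the course setup: a throwaway directory created with `mktemp -d` stands in for your scratch directory, the project name `demo_project` is made up, and the extra `results` and `doc` subdirectories are illustrative choices inspired by the paper (this tutorial itself only creates `data` and `src`).

```bash
# Throwaway stand-in for /scratch/$(whoami), so nothing real is touched
demo_root="$(mktemp -d)"
demo_project="${demo_root}/demo_project"   # hypothetical project name

# mkdir -p creates missing parent directories and does not fail
# if a directory already exists, so this loop is safe to re-run
for sub in data src results doc; do
    mkdir -p "${demo_project}/${sub}"
done

ls "${demo_project}"
```

Keeping raw data, code, and results in separate subdirectories makes it easier to tell at a glance which files are inputs and which can be regenerated.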

## Preparing to run code on the cluster ## 

Our data processing will use multiple software tools. To be able to access them, we can load their paths into our session, by loading their respective modules.

To load a module, type **module load [desiredModule]**. This modifies your path so that the shell can find the programs that belong to the module.

Once a module is loaded, you can use the code associated with that module directly. For instance, let's say you want to load the module for BEDTools (a software package we will be using in this training camp). Run the following cells, comparing your path and your list of loaded modules before and after `module load bedtools`:

In [None]:
source /etc/profile.d/modules.sh

In [None]:
echo $PATH

In [None]:
module list 

In [None]:
module load bedtools

In [None]:
module list

In [None]:
echo $PATH

Loading the module puts the bedtools code on your path, so that when you are ready to use it, you can call its commands directly. Note that the -h or --help arguments can often be used to get help about a particular tool.


In [None]:
module unload bedtools

In [None]:
module load bedtools 

In [None]:
which bedtools
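
To build some intuition for what happens to your path, the effect of loading and unloading can be imitated by hand. The sketch below does not use the `module` command at all; it assumes that the relevant part of a module's job is putting a directory on `PATH`, and it uses a made-up executable called `demo_tool` in a throwaway directory.

```bash
# Throwaway directory standing in for a module's bin/ directory
demo_bin="$(mktemp -d)"

# A tiny hypothetical executable, just so there is something to find
printf '#!/bin/sh\necho "hello from demo_tool"\n' > "${demo_bin}/demo_tool"
chmod +x "${demo_bin}/demo_tool"

old_path="$PATH"

# Roughly what "module load" does: put the tool's directory on PATH
export PATH="${demo_bin}:${PATH}"
command -v demo_tool
demo_output="$(demo_tool)"
echo "$demo_output"

# Roughly what "module unload" does: take the directory off PATH again
export PATH="$old_path"
command -v demo_tool || echo "demo_tool is no longer on PATH"
```

Real modules can also set other environment variables, so treat this only as a picture of the `PATH` part.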

Don't worry, you do not need to know off the top of your head the names of the modules you want. To see all software modules available on the AWS cluster, type:

In [None]:
module avail
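
The full listing can be long, so it may help to filter it. Below is a small sketch with a hypothetical helper named `find_module`; it assumes only that the `module` command has been set up in your shell (as done earlier with `source /etc/profile.d/modules.sh`) and it reports a message instead of failing if that is not the case.

```bash
# Search the module listing for a name (case-insensitive).
# "2>&1" merges stderr into stdout, because many module installations
# print their listings on stderr, which a plain pipe would not capture.
find_module() {
    if ! type module >/dev/null 2>&1; then
        echo "the module command is not available in this shell"
        return 127
    fi
    if ! module avail 2>&1 | grep -i -- "$1"; then
        echo "no module matching '$1' was found"
        return 1
    fi
}

find_module bedtools || true
```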

## The .bashrc file (=your friend) ##

In [None]:
#Where is .bashrc?
#In our home directory
cd ~
pwd
ls


In [None]:
#But this doesn't show bashrc...
ls -ah #this does (shows all hidden files)
#.bash_logout automatically runs things when you log out

Wouldn't it be nice to have everything ready to run when you log into the cluster?
To avoid having to run module load commands every time you log in, you can add these commands to a .bashrc file, located in your home directory. The .bashrc file contains a set of commands that get executed every time you log into the server. In this way, every time you log in, you will be all set to run all code you wish.

Note: Technically, the ~/.bashrc file is not what's executed on login; it's ~/.bash_profile, which in turn calls ~/.bashrc. If your .bash_profile does not call .bashrc, put the line source ~/.bashrc in your .bash_profile. The difference between the two files is explained here: http://www.joshstaiger.org/archives/2005/07/bash_profile_vs.html
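
The hand-off from .bash_profile to .bashrc described in the note can be tried out safely. This is a minimal sketch that uses a throwaway directory as a pretend home directory, so your real dotfiles are not touched; the variable name `DEMO_FROM_BASHRC` is made up for the demonstration, and a real login is imitated by sourcing the pretend .bash_profile by hand.

```bash
# Pretend home directory, so that your real dotfiles are left alone
fake_home="$(mktemp -d)"

# A pretend .bashrc that just sets one marker variable
echo 'export DEMO_FROM_BASHRC="loaded"' > "${fake_home}/.bashrc"

# These three lines are what you would put in a real ~/.bash_profile
cat > "${fake_home}/.bash_profile" <<'EOF'
if [ -f "${HOME}/.bashrc" ]; then
    source "${HOME}/.bashrc"
fi
EOF

# Imitate a login: a fresh bash reads .bash_profile, which pulls in .bashrc
demo_value="$(HOME="$fake_home" bash -c 'source "${HOME}/.bash_profile"; echo "${DEMO_FROM_BASHRC}"')"
echo "$demo_value"
```

The `[ -f ... ]` test keeps the login from printing an error if the .bashrc file does not exist.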

Let's add all our desired module loading commands into a .bashrc file. 

In [None]:
bedtools_load='module load bedtools/2.26.0'
#naive thing:
#NOTE: ~ is a shortcut for your home directory
#echo "$bedtools_load" >> ~/.bashrc #this might clutter up our bashrc if we run it a bunch of times

#only add the module load command to the ~/.bashrc file if it isn't in this file already
#The || acts like an OR; it executes the command on the right if the command on the left fails
#(grep exits with a non-zero status when it finds no match)
#grep searches the file; -F treats the pattern as a literal string, so the dots in the version are not wildcards
#reminder: "$bedtools_load" expands to 'module load bedtools/2.26.0'
grep -F "$bedtools_load" ~/.bashrc || echo "$bedtools_load" >> ~/.bashrc
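
If you end up adding several lines this way, the "append only if missing" pattern can be wrapped in a function. The sketch below uses a hypothetical helper named `append_once` and demonstrates it on a temporary file rather than on your real ~/.bashrc.

```bash
# Append a line to a file only if that exact line is not already there.
# grep flags: -q quiet, -x match the whole line, -F literal string (no regex)
append_once() {
    local line="$1" file="$2"
    touch "$file"   # make sure the file exists so grep does not complain
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Demo on a temporary file standing in for ~/.bashrc
demo_rc="$(mktemp)"
append_once 'module load bedtools/2.26.0' "$demo_rc"
append_once 'module load bedtools/2.26.0' "$demo_rc"   # second call adds nothing

# Count how many times the line appears in the file
demo_count="$(grep -cxF -- 'module load bedtools/2.26.0' "$demo_rc")"
echo "$demo_count"
```

Because the second call finds the line already present, the command can be re-run as often as you like without cluttering the file.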


## Defining shortcuts in the .bashrc file ##

Why stop here? You can make all your dreams come true in the .bashrc file!
For instance, you can add to the .bashrc file some shortcuts to your directories of interest, which you can then seamlessly use. Add the following to your .bashrc:

In [None]:
grep -E "shortcuts_defined" ~/.bashrc || 
(echo '#shortcuts_defined:' >> ~/.bashrc &&

echo 'export SUNETID="$(whoami)"' >> ~/.bashrc &&
echo 'export WORK_DIR="/scratch/${SUNETID}"' >> ~/.bashrc &&

echo 'export DATA_DIR="${WORK_DIR}/data"' >> ~/.bashrc &&
echo '[[ ! -d ${WORK_DIR}/data ]] && mkdir -p "${WORK_DIR}/data"' >> ~/.bashrc &&

echo 'export SRC_DIR="${WORK_DIR}/src"' >> ~/.bashrc &&
echo '[[ ! -d ${WORK_DIR}/src ]] && mkdir -p "${WORK_DIR}/src"' >> ~/.bashrc &&

echo 'export METADATA_DIR="/metadata"' >> ~/.bashrc &&
echo 'export AGGREGATE_DATA_DIR="/data"' >> ~/.bashrc &&
echo 'export AGGREGATE_ANALYSIS_DIR="/outputs"' >> ~/.bashrc &&
echo 'export YEAST_DIR="/saccer3"' >> ~/.bashrc &&

echo 'export TMP="${WORK_DIR}/tmp"' >> ~/.bashrc &&
echo 'export TEMP=$TMP' >> ~/.bashrc && 
echo 'export TMPDIR=$TMP' >> ~/.bashrc && 
echo '[[ ! -d ${TMP} ]] && mkdir -p "${TMP}"' >> ~/.bashrc )

**\$\{WORK\_DIR\}** is your main work directory.

**\$\{DATA\_DIR\}** is your data/ directory, used for storing the subset of the data you will be working with.

**\$\{SRC\_DIR\}** is your src/ directory, used for storing code.

**\$\{METADATA\_DIR\}** is the directory with the metadata file for this year's training camp.

**\$\{AGGREGATE\_DATA\_DIR\}** is the shared /data directory, where we store all the raw sequencer data generated by the group.

**\$\{AGGREGATE\_ANALYSIS\_DIR\}** is the directory where we will store the aggregate analysis results for all samples, for common use by everyone.

**\$\{YEAST\_DIR\}** is the directory with the yeast reference genome files.

**\$\{TMP\}** (also exported as **\$\{TEMP\}** and **\$\{TMPDIR\}**) is the directory where your temporary files will be stored when you execute code.
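
Once these variables are in place, a quick sanity check can confirm that each one is set and points at a real directory. This is a minimal sketch with a hypothetical helper named `check_dir_var`; it relies on the variables being exported (as they are in the commands above) and simply reports any that are missing.

```bash
# Report whether an exported variable is set and names an existing directory
check_dir_var() {
    local name="$1" value
    # printenv prints the value of an exported variable, or fails if unset
    value="$(printenv "$name")" || value=""
    if [ -z "$value" ]; then
        echo "${name} is not set (or not exported)"
        return 1
    fi
    if [ ! -d "$value" ]; then
        echo "${name}=${value} (directory does not exist)"
        return 1
    fi
    echo "${name}=${value}"
}

for name in WORK_DIR DATA_DIR SRC_DIR METADATA_DIR AGGREGATE_DATA_DIR \
            AGGREGATE_ANALYSIS_DIR YEAST_DIR TMP; do
    check_dir_var "$name" || true   # keep going even if one check fails
done
```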



To see your ~/.bashrc and ~/.bash_profile files in action, log out and log in again. All modules should be loaded and all shortcuts should be set!

Since logging out and back in would disrupt this tutorial, we execute the same commands directly in this notebook:

In [None]:
export SUNETID="$(whoami)"
export WORK_DIR="/scratch/${SUNETID}"
export DATA_DIR="${WORK_DIR}/data"
[[ ! -d ${WORK_DIR}/data ]] && mkdir -p "${WORK_DIR}/data"
export SRC_DIR="${WORK_DIR}/src"
[[ ! -d ${WORK_DIR}/src ]] && mkdir -p "${WORK_DIR}/src"
export METADATA_DIR="/metadata"
export AGGREGATE_DATA_DIR="/data"
export AGGREGATE_ANALYSIS_DIR="/outputs"
export YEAST_DIR="/saccer3"
export TMP="${WORK_DIR}/tmp"
export TEMP=$TMP
export TMPDIR=$TMP
[[ ! -d ${TMP} ]] && mkdir -p "${TMP}"
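
The reason for exporting `TMPDIR` is that many programs consult it when deciding where to write temporary files. As a small illustration, assuming the GNU `mktemp` typically found on Linux clusters, the sketch below points `TMPDIR` at a throwaway directory (standing in for `${WORK_DIR}/tmp`) for a single command and shows where the temporary file ends up.

```bash
# Throwaway directory standing in for ${WORK_DIR}/tmp
demo_tmp="$(mktemp -d)"

# With TMPDIR set for this one command, mktemp creates its file in there
demo_file="$(TMPDIR="$demo_tmp" mktemp)"

echo "temporary directory: ${demo_tmp}"
echo "temporary file:      ${demo_file}"
```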


## Data and code for this project ##

It's generally good practice to always keep a backup copy of your raw data files in case you unintentionally delete or modify these files when performing your analysis. 

For this reason, you will copy the two samples you generated from the **\$AGGREGATE_DATA_DIR** folder to your personal **\$DATA_DIR** folder. 

In [None]:
#backup data files from 2018
#cp $AGGREGATE_DATA_DIR/hrosenbl_WT_YPGE_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/pgoddard_asf1_YPGE_1* $DATA_DIR
cp $AGGREGATE_DATA_DIR/yiuwong_rtt109_YPGE_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/kjngo_WT_YPGE_2* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/marinovg_asf1_YPGE_2* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/ambenj_rtt109_YPGE_2* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/mkoska_WT_YPGE_3* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/dcotter1_asf1_YPGE_3* $DATA_DIR
cp $AGGREGATE_DATA_DIR/jkcheng_rtt109_YPGE_3* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/rpatel7_WT_YPGE_4* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/jarod_asf1_YPGE_5* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/ktomins_rtt109_YPGE_5* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/gamador_WT_YPGE_6* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/raungar_asf1_YPGE_6* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/rosaxma_rtt109_YPD_4* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/dmaghini_WT_YPD_5* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/egreenwa_asf1_YPD_6* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/kjhanson_rtt109_YPD_6* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/hrosenbl_asf1_YPD_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/pgoddard_rtt109_YPD_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/yiuwong_WT_YPD_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/kjngo_rtt109_YPD_2* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/marinovg_WT_YPD_2* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/ambenj_asf1_YPD_2* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/mkoska_asf1_YPD_3* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/dcotter1_rtt109_YPD_3* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/jkcheng_WT_YPD_3* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/rpatel7_asf1_YPGE_4* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/jarod_rtt109_YPGE_4* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/ktomins_WT_YPGE_5* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/gamador_rtt109_YPGE_6* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/raungar_WT_YPD_4* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/rosaxma_asf1_YPD_4* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/dmaghini_asf1_YPD_5* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/egreenwa_rtt109_YPD_5* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/kjhanson_WT_YPD_6* $DATA_DIR

In [None]:
ls $DATA_DIR
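
Since the point of the copy is to have a trustworthy backup, it can be worth confirming that each copied file is identical to its original. The sketch below uses a hypothetical helper named `copy_and_verify` and demonstrates it on throwaway directories with made-up file names; it is an optional extra, not part of the course pipeline.

```bash
# Copy every file whose name starts with a prefix, then compare each copy
# byte-for-byte with its original
copy_and_verify() {
    local src_dir="$1" prefix="$2" dest_dir="$3" f
    for f in "${src_dir}/${prefix}"*; do
        # If the glob matched nothing, the pattern itself is left in $f
        if [ ! -e "$f" ]; then
            echo "no files match ${prefix}* in ${src_dir}"
            return 1
        fi
        cp "$f" "${dest_dir}/" || return 1
        # cmp -s is silent and exits non-zero if the two files differ
        if ! cmp -s "$f" "${dest_dir}/$(basename "$f")"; then
            echo "copy of ${f} differs from the original"
            return 1
        fi
    done
    echo "copied and verified files matching ${prefix}*"
}

# Demo on throwaway directories with made-up sample files
demo_src="$(mktemp -d)"
demo_dest="$(mktemp -d)"
echo "ACGT" > "${demo_src}/demo_WT_1_R1.fastq"
echo "TGCA" > "${demo_src}/demo_WT_1_R2.fastq"
copy_and_verify "$demo_src" "demo_WT_1" "$demo_dest"
ls "$demo_dest"
```

On the cluster, the equivalent call for one of the samples above would be `copy_and_verify "$AGGREGATE_DATA_DIR" yiuwong_rtt109_YPGE_1 "$DATA_DIR"`.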

In [None]:
#new data files from 2019 
#Note: Copy the 2 replicates for a given strain/timepoint to your data directory. Pick any set of 2 that you like.

#srstern
#cp $AGGREGATE_DATA_DIR/genegra2_0h_WT_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/srstern_0h_WT_2* $DATA_DIR

#ajberg5
#cp $AGGREGATE_DATA_DIR/lstrand_4h_WT_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/ajberg5_4h_WT_2* $DATA_DIR

#cvduffy
#cp $AGGREGATE_DATA_DIR/sierrasb_0h_tf1_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/cvduffy_0h_tf1_2* $DATA_DIR

#genegra2
#cp $AGGREGATE_DATA_DIR/genegra2_4h_tf1_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/mihayes_4h_tf1_2* $DATA_DIR

#subkc
#cp $AGGREGATE_DATA_DIR/subkc_0h_tf2_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/annlin_0h_tf2_2* $DATA_DIR

#clin5
#cp $AGGREGATE_DATA_DIR/srstern_4h_tf2_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/clin5_4h_tf2_2* $DATA_DIR

#lstrand
#cp $AGGREGATE_DATA_DIR/lstrand_0h_tf3_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/clin5_0h_tf3_2* $DATA_DIR

#miao1
#cp $AGGREGATE_DATA_DIR/miao1_4h_tf3_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/makena_4h_tf3_2* $DATA_DIR

#ajberg5
#cp $AGGREGATE_DATA_DIR/ajberg5_0h_tf4_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/miao1_0h_tf4_2* $DATA_DIR

#courtrun
#cp $AGGREGATE_DATA_DIR/jarhodes_4h_tf4_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/courtrun_4h_tf4_2* $DATA_DIR

#jarhodes
#cp $AGGREGATE_DATA_DIR/makena_0h_tf5_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/jarhodes_0h_tf5_2* $DATA_DIR

#sierrasb
#cp $AGGREGATE_DATA_DIR/sierrasb_4h_tf5_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/marinovg_4h_tf5_2* $DATA_DIR

#annlin
#cp $AGGREGATE_DATA_DIR/annlin_4h_tf9_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/zahoor_4h_tf9_2* $DATA_DIR

#myhayes
#cp $AGGREGATE_DATA_DIR/myhayes_0h_tf7_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/soumyak_0h_tf7_2* $DATA_DIR


#### If you want to run an additional sample, pick from the ones below; otherwise, the TAs will run these ####

#annashch
#cp $AGGREGATE_DATA_DIR/courtrun_0h_tf6_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/annashch_0h_tf6_2* $DATA_DIR

#soumyak
#cp $AGGREGATE_DATA_DIR/cvduffy_4h_tf6_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/soumyak_4h_tf6_2* $DATA_DIR

#surag
#cp $AGGREGATE_DATA_DIR/surag_4h_tf7_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/albalsubr_4h_tf7_2* $DATA_DIR

#abalsubr
#cp $AGGREGATE_DATA_DIR/abalsubr_0h_tf8_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/zahoor_0h_tf8_2* $DATA_DIR

#zahoor
#cp $AGGREGATE_DATA_DIR/subkc_4h_tf8_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/annashch_4h_tf8_2* $DATA_DIR

#lakss
#cp $AGGREGATE_DATA_DIR/marinovg_0h_tf9_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/surag_0h_tf9_2* $DATA_DIR

In [None]:
ls $DATA_DIR

Let the analysis begin!