# Getting ready to run code on the cluster #

### IMPORTANT: Please make sure that you are using the bash kernel to run this notebook.

Now that you can navigate file systems and manipulate directories and files like any self-respecting Unix ninja, you are ready to start analyzing data! Here you will learn how to organize your research directory and set up the cluster environment to access all the software you wish to use.

## Organizing your research as a pro ##

This is a really nice paper with guidelines for organizing computational projects in a tidy and snazzy fashion: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424
![Analysis Pipeline](images/journal.pcbi.1000424.g001.png)

Let's see this in action!


First, let's set up our working directory (also known as a "scratch directory"):

In [None]:
whoami

In [None]:
export WORKDIR=/scratch/$(whoami)

echo $WORKDIR

Organize your folder into subdirectories as a pro: 

In [None]:
cd $WORKDIR
mkdir data src
ls

In [None]:
pwd
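
The same layout can also be created with `mkdir -p`, which makes the command safe to re-run. The sketch below is only an illustration: it works in a throwaway directory from `mktemp -d` rather than in `/scratch`, and `PROJECT_ROOT` is a name chosen for this example.

```bash
# Create a throwaway project root so the example does not touch /scratch.
PROJECT_ROOT="$(mktemp -d)"

# -p creates missing parent directories and does not complain
# if a directory already exists.
mkdir -p "${PROJECT_ROOT}/data" "${PROJECT_ROOT}/src"

# Running the same command again is harmless.
mkdir -p "${PROJECT_ROOT}/data" "${PROJECT_ROOT}/src"

ls "${PROJECT_ROOT}"
```

Plain `mkdir data src`, by contrast, reports an error if either directory already exists.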

## Preparing to run code on the cluster ## 

Our data processing will use multiple software tools. To be able to access them, we can load their paths into our session, by loading their respective modules.

To load a module, type
**module load [desiredModule]**. This modifies your `$PATH` so that the shell can find the programs that belong to the module.

Once a module is loaded, you can use the programs associated with that module directly. For instance, let's say you want to sum all the columns in a file (let's use the file `number_cols.txt` in your home directory as an example). We can do this with the `addCols` tool from the ucsc_tools software package. First, look at the file and try running `addCols` before loading anything:

In [None]:
cat ~/number_cols.txt

In [None]:
addCols ~/number_cols.txt

The above cell throws an error because we have not loaded the software package that `addCols` is part of. We can look at our `$PATH` before and after loading the software module to confirm that the loading is successful (we should see the path grow), and we can also use `module list` to see the currently loaded modules:

In [None]:
echo $PATH

In [None]:
source /etc/profile.d/modules.sh

In [None]:
module list

In [None]:
module load ucsc_tools

In [None]:
module list

In [None]:
echo $PATH

Now the ucsc_tools code is loaded! When you are ready to use commands from this package, you can just directly call them. Note that the `-h` or `--help` arguments can often be used to give help about a particular command.

In [None]:
addCols ~/number_cols.txt

In [None]:
#You can use the `which` command to get the location of the addCols tool 
which addCols 

In [None]:
#to remove the tool from your path, run: 
module unload ucsc_tools
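
To see why loading a module makes a command available, it can help to mimic the effect by hand. The sketch below is not how the `module` command is implemented; it only assumes, as described above, that loading a module adds a directory to `$PATH`. The tool name `hello_tool` and the variable `FAKE_MODULE_BIN` are made up for the example.

```bash
# A temporary directory standing in for a module's bin/ directory (hypothetical).
FAKE_MODULE_BIN="$(mktemp -d)"

# A tiny stand-in tool; hello_tool is not a real program.
printf '#!/bin/sh\necho "hello from the tool"\n' > "${FAKE_MODULE_BIN}/hello_tool"
chmod +x "${FAKE_MODULE_BIN}/hello_tool"

# Before: the shell cannot find the tool.
command -v hello_tool || echo "hello_tool: not found"

# "Load": prepend the directory to PATH, then look the tool up and run it.
OLD_PATH="${PATH}"
export PATH="${FAKE_MODULE_BIN}:${PATH}"
command -v hello_tool
hello_tool

# "Unload": restore the previous PATH; the tool is no longer found.
export PATH="${OLD_PATH}"
command -v hello_tool || echo "hello_tool: not found"
```

The tool itself never moves; only the list of directories the shell searches changes, which is why `echo $PATH` is a useful check before and after `module load`.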

Don't worry, you do not need to know off the top of your head the names of the modules you want. To see all software modules available on the AWS cluster, type:

In [None]:
module avail

## Exporting Needed Variables

Before we start analyzing data, we have one last setup task. We have set up shared directories that hold all of the data for this project, as well as intermediate outputs of the analysis we're going to perform (in case something goes wrong and you want a fresh start without going back to the very beginning). To keep our Unix commands consistent across everyone, we've also fixed the directory structure for your analysis outputs. The full set of directory variables we need to assign is:

**\$\{WORK\_DIR\}** is your main work directory.

**\$\{DATA\_DIR\}** is your data/ directory, used for storing the subset of the data you will be working with.

**\$\{SRC\_DIR\}** is your src/ directory, used for storing code.

**\$\{METADATA\_DIR\}** is the directory with the metadata file for this year's training camp.

**\$\{AGGREGATE\_ANALYSIS\_DIR\}** is the directory where we will store the aggregate analysis results for all samples, for common use by everyone.

**\$\{AGGREGATE\_DATA\_DIR\}** is the shared data/ directory, where we store all the raw data from the sequencer generated by the group.

**\$\{YEAST\_DIR\}** is the directory with the sacCer3 reference genome files.

**\$\{TMP\}** (also exported as **\$\{TEMP\}** and **\$\{TMPDIR\}**) is the directory where your temporary files will be stored when you execute code.



To set all these variables, run the following commands in the notebook:

In [None]:
# Your username doubles as the name of your scratch directory.
export SUNETID="$(whoami)"
export WORK_DIR="/scratch/${SUNETID}"

# Personal data and code directories (mkdir -p does nothing if they already exist).
export DATA_DIR="${WORK_DIR}/data"
export SRC_DIR="${WORK_DIR}/src"
mkdir -p "${DATA_DIR}" "${SRC_DIR}"

# Shared directories.
export METADATA_DIR="/metadata"
export AGGREGATE_DATA_DIR="/data"
export AGGREGATE_ANALYSIS_DIR="/outputs"
export YEAST_DIR="/saccer3"

# Temporary files; TEMP and TMPDIR point to the same place as TMP.
export TMP="${WORK_DIR}/tmp"
export TEMP="${TMP}"
export TMPDIR="${TMP}"
mkdir -p "${TMP}"


In [None]:
echo $DATA_DIR
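
Because later commands rely on these variables, a quick check that each one is set and points at an existing directory can save some confusion. The following is a minimal sketch, not part of the course setup: the function name `check_dir_vars` is made up here, and it uses bash indirect expansion (`${!name}`), so it assumes the bash kernel.

```bash
# Report, for each variable name given, whether it is set and names an
# existing directory. Returns non-zero if any check fails.
# check_dir_vars is a hypothetical helper written for this example.
check_dir_vars() {
    local name value status=0
    for name in "$@"; do
        # Indirect expansion: the value of the variable whose name is in $name.
        value="${!name:-}"
        if [[ -z "${value}" ]]; then
            echo "${name} is not set"
            status=1
        elif [[ ! -d "${value}" ]]; then
            echo "${name}=${value} (directory missing)"
            status=1
        else
            echo "${name}=${value}"
        fi
    done
    return "${status}"
}
```

After running the export cell, a call such as `check_dir_vars WORK_DIR DATA_DIR SRC_DIR TMP` would print one line per variable. Note that the function takes variable names, not values, so the names are written without a leading `$`.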

## Data and code for this project ##

It's generally good practice to always keep a backup copy of your raw data files in case you unintentionally delete or modify these files when performing your analysis. 

For this reason, you will copy the two samples you generated from the **\$AGGREGATE_DATA_DIR** folder to your personal **\$DATA_DIR** folder. 

In [None]:
ls $DATA_DIR

In [None]:
# Student data files
# Note: Uncomment the two lines under your name (remove the leading #) to copy both replicates for your strain/timepoint to your data directory.

# meenakshi
#cp $AGGREGATE_DATA_DIR/0min_HOG1_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/0min_HOG1_2* $DATA_DIR

# alanna
#cp $AGGREGATE_DATA_DIR/45min_HOG1_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/45min_HOG1_2* $DATA_DIR

# ben
#cp $AGGREGATE_DATA_DIR/0min_HOT1_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/0min_HOT1_2* $DATA_DIR

# raeline
#cp $AGGREGATE_DATA_DIR/45min_HOT1_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/45min_HOT1_2* $DATA_DIR

# usman
#cp $AGGREGATE_DATA_DIR/0min_MSN1_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/0min_MSN1_2* $DATA_DIR

# ronghao
#cp $AGGREGATE_DATA_DIR/45min_MSN1_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/45min_MSN1_2* $DATA_DIR

# mingxin
#cp $AGGREGATE_DATA_DIR/0min_MSN2_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/0min_MSN2_2* $DATA_DIR

# miriam
#cp $AGGREGATE_DATA_DIR/45min_MSN2_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/45min_MSN2_2* $DATA_DIR

# tanner
#cp $AGGREGATE_DATA_DIR/0min_MSN4_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/0min_MSN4_2* $DATA_DIR

# ali
#cp $AGGREGATE_DATA_DIR/45min_MSN4_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/45min_MSN4_2* $DATA_DIR

# yannik
#cp $AGGREGATE_DATA_DIR/0min_SKN7_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/0min_SKN7_2* $DATA_DIR

# sherry
#cp $AGGREGATE_DATA_DIR/45min_SKN7_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/45min_SKN7_2* $DATA_DIR

# vincent
#cp $AGGREGATE_DATA_DIR/0min_WT_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/0min_WT_2* $DATA_DIR

# kcochran
#cp $AGGREGATE_DATA_DIR/45min_WT_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/45min_WT_2* $DATA_DIR

# michael
#cp $AGGREGATE_DATA_DIR/0min_YAP1_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/0min_YAP1_2* $DATA_DIR

# caleb
#cp $AGGREGATE_DATA_DIR/45min_YAP1_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/45min_YAP1_2* $DATA_DIR

# rahul
#cp $AGGREGATE_DATA_DIR/0min_YAP6_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/0min_YAP6_2* $DATA_DIR

# soumyak
#cp $AGGREGATE_DATA_DIR/45min_YAP6_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/45min_YAP6_2* $DATA_DIR

# micah
#cp $AGGREGATE_DATA_DIR/0min_YAP7_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/0min_YAP7_2* $DATA_DIR

# akundaje
#cp $AGGREGATE_DATA_DIR/45min_YAP7_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/45min_YAP7_2* $DATA_DIR



In [None]:
ls $DATA_DIR
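
Listing the directory shows that the files arrived, but not that they arrived intact. One way to check is to compare each copy with its source byte for byte. The sketch below is illustrative only: it runs on made-up files in temporary directories (`RAW_DIR`, `MY_DATA_DIR`, and the `sample_*` file names are invented for the example), not on the real sequencing data.

```bash
# Stand-ins for the shared raw-data directory and a personal data directory.
RAW_DIR="$(mktemp -d)"
MY_DATA_DIR="$(mktemp -d)"

# Two small made-up "replicate" files.
echo "replicate 1" > "${RAW_DIR}/sample_1.txt"
echo "replicate 2" > "${RAW_DIR}/sample_2.txt"

# Copy with a glob, in the same style as the cp commands above.
cp "${RAW_DIR}"/sample_* "${MY_DATA_DIR}"

# cmp -s prints nothing and exits 0 only when the two files are identical.
for src in "${RAW_DIR}"/sample_*; do
    copy="${MY_DATA_DIR}/$(basename "${src}")"
    if cmp -s "${src}" "${copy}"; then
        echo "OK       $(basename "${src}")"
    else
        echo "MISMATCH $(basename "${src}")"
    fi
done
```

The same loop could be pointed at `$AGGREGATE_DATA_DIR` and `$DATA_DIR` with your own file prefix in place of `sample_`.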

Let the analysis begin!