# 1.2 Getting ready to run code on the cluster #

### IMPORTANT: Please make sure that you are using the bash kernel to run this notebook.

Now that you can submit jobs like any self-respecting Unix ninja, you are ready to start analyzing data! Here you will learn how to organize your research directory and set up the cluster environment to access all the software you wish to use.

## Organizing your research like a pro ##

This is a really nice paper with guidelines on keeping computational projects tidy and snazzy: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424
![Analysis Pipeline](images/journal.pcbi.1000424.g001.png)

Let's see this in action!


First, let's set up our working directory (also known as the "scratch directory"):

In [3]:
export WORKDIR=/scratch/$(whoami)

echo $WORKDIR

/scratch/annashch


In [4]:
whoami

annashch


Organize your folder into subdirectories like a pro: 

In [5]:
cd ${WORKDIR}
mkdir data src
ls

mkdir: cannot create directory ‘data’: File exists
mkdir: cannot create directory ‘src’: File exists
Sample_0min_MSN4_vs_45min_MSN4.differential.negative.txt
Sample_0min_MSN4_vs_45min_MSN4.differential.positive.txt
Sample_0min_MSN4_vs_45min_MSN4.txt
Sample_45min_HOT1_vs_45min_WT.differential.negative.txt
Sample_45min_HOT1_vs_45min_WT.differential.positive.txt
Sample_45min_HOT1_vs_45min_WT.txt
Sample_45min_MSN4_vs_45min_WT.differential.negative.txt
Sample_45min_MSN4_vs_45min_WT.differential.positive.txt
Sample_45min_MSN4_vs_45min_WT.txt
Sample_45min_SKN7_vs_45min_WT.differential.negative.txt
Sample_45min_SKN7_vs_45min_WT.differential.positive.txt
Sample_45min_SKN7_vs_45min_WT.txt
Strain_WT_vs_HOG1.differential.negative.txt
Strain_WT_vs_HOG1.differential.positive.txt
Strain_WT_vs_HOG1.txt
Strain_WT_vs_HOT1.differential.negative.txt
Strain_WT_vs_HOT1.differential.positive.txt
Strain_WT_vs_HOT1.txt
Strain_WT_vs_MSN1.differential.negative.txt
Strain_WT_vs_MSN1.differential.positive.txt
Stra

In [6]:
cd $WORKDIR

In [7]:
pwd

/scratch/annashch
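
The directory setup above can be made safe to re-run. This is a minimal sketch, assuming a throwaway directory created with `mktemp -d` stands in for `/scratch/$(whoami)` (the name `project_root` is hypothetical). With `-p`, `mkdir` creates whatever is missing and stays silent when a directory already exists, so the "File exists" messages seen in the output above do not appear:

```shell
# Stand-in for /scratch/$(whoami); swap in your real scratch path on the cluster.
project_root=$(mktemp -d)

# -p: create parent directories as needed and do not complain if a directory exists.
mkdir -p "$project_root/data" "$project_root/src"

# Running the same command a second time is harmless.
mkdir -p "$project_root/data" "$project_root/src"
second_run_status=$?

ls "$project_root"
```

On the cluster you would replace `$project_root` with `$WORKDIR`.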


## Preparing to run code on the cluster ## 

Our data processing will use multiple software tools. To access them, we add their locations to our session's path by loading their respective modules.

To load a module, you can type
**module load [desiredModule]** - this modifies your path.

Once a module is loaded, you can call the software associated with that module directly. For instance, let's say you want to load the module for BEDTools (a software package we will be using in this training camp). Run the following cells:

In [8]:
source /etc/profile.d/modules.sh

In [9]:
echo $PATH

/opt/anaconda3/bin:/opt/anaconda3/condabin:/opt/anaconda3/bin:/opt/anaconda3/condabin:/usr/bin:/bin:/opt/homer/bin:/opt/gs/ghostscript-9.19-linux-x86_64:/opt/weblogo/weblogo:/opt/homer/bin:/opt/gs/ghostscript-9.19-linux-x86_64:/opt/weblogo/weblogo


In [10]:
module list 

Currently Loaded Modulefiles:
  1) /bedtools/2.26.0


In [11]:
module load bedtools

In [12]:
module list

Currently Loaded Modulefiles:
  1) /bedtools/2.26.0


In [13]:
echo $PATH

/opt/anaconda3/bin:/opt/anaconda3/condabin:/opt/anaconda3/bin:/opt/anaconda3/condabin:/usr/bin:/bin:/opt/homer/bin:/opt/gs/ghostscript-9.19-linux-x86_64:/opt/weblogo/weblogo:/opt/homer/bin:/opt/gs/ghostscript-9.19-linux-x86_64:/opt/weblogo/weblogo


Loading the module makes the bedtools commands available, so when you are ready to use them you can call them directly. Note that the -h or --help arguments can often be used to get help about a particular tool.
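
To see why a tool becomes callable once its directory is on the `PATH`, the sketch below imitates the effect by hand. It assumes a hypothetical tool called `mytool` living in a throwaway directory. This is not how the `module` command is implemented, only an illustration of the kind of change it makes to your environment:

```shell
# Create a fake "installation" directory containing one executable script.
tool_dir=$(mktemp -d)
printf '#!/bin/sh\necho "mytool says hello"\n' > "$tool_dir/mytool"
chmod +x "$tool_dir/mytool"

# Before: the shell cannot find mytool, so command -v prints nothing.
before=$(command -v mytool || true)

# Prepend the directory to PATH, roughly what loading a module does for you.
PATH="$tool_dir:$PATH"
export PATH

# After: the tool resolves to our directory and can be called by name.
after=$(command -v mytool)
greeting=$(mytool)
echo "$after"
echo "$greeting"
```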


In [14]:
module unload bedtools

In [15]:
module load bedtools 

In [16]:
which bedtools

/opt/anaconda3/bin/bedtools
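
Before starting a long analysis it can help to confirm that every tool you need is actually reachable. Below is a small sketch of such a check; the function name `require_tools` is hypothetical and the tool names are placeholders (on the cluster you might pass `bedtools samtools` instead):

```shell
# Return 0 if every named tool is on the PATH; report the missing ones otherwise.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing tool: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# ls exists on any Unix system; the second name is deliberately bogus.
require_tools ls && echo "ls is available"

rc=0
require_tools ls no_such_tool_xyz 2>/dev/null || rc=$?
echo "exit status with a missing tool: $rc"
```

`command -v` is used here rather than `which` because it is built into the shell and behaves the same across systems.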


Don't worry, you do not need to know off the top of your head the names of the modules you want. To see all software modules available on the AWS cluster, type:

In [17]:
module avail


------------------ /software/env_module/default/modulefiles/ -------------------
MACS2/2.1.1       fastqc/0.11.5     module-info       r/3.4.4
bedtools/2.26.0   homer/default     modules           samtools/1.7
bowtie/2.3.4.1    java/latest       null              ucsc_tools/latest
dot               module-git        picard-tools/1.95 use.own


## The .bashrc file (=your friend) ##

In [18]:
#Where is .bashrc?
#In our home directory
cd ~
pwd
ls


/homes/annashch
'0.0 Introduction to Jupyter notebooks.ipynb'
'1.0 Big Ideas.ipynb'
'1.1 Unix Basics.ipynb'
'1.2 Getting ready to run code on the cluster.ipynb'
'2.0 The metadata file and analysis overview.ipynb'
 2.1_Sequencing_Data_Analysis.ipynb
'3.1 Clustering analysis and PCA on fold change data .ipynb'
'3.2 Clustering analysis and PCA on count data.ipynb'
'3.3 Finding enriched GO Terms with GORilla.ipynb'
'3.4 Calling differentially expressed peaks with DESeq2 and limma .ipynb'
'3.5 GO Term enrichment for differentially accessible chromatin regions.ipynb'
'3.6 Finding TF motifs.ipynb'
'4.0 Introduction to shell scripts.ipynb'
'4.1 Introduction to SCG and Sherlock compute clusters.ipynb'
'4.2 Install Missing R packages.ipynb'
 Untitled.ipynb
 atac.out_def.json
 cromwell_input_template.json
 frag_len.png
 images
 test_file.txt


In [19]:
#But this doesn't show bashrc...
ls -ah #this does (shows all hidden files)
#.bash_logout automatically runs things when you log out

 .
 ..
 .bash_history
 .bash_logout
 .bashrc
 .cache
 .caper
 .caper_tmp
 .config
 .emacs.d
 .gnupg
 .ipynb
 .ipynb_checkpoints
 .ipython
 .jupyter
 .local
 .profile
 .python_history
 .ssh
 .sudo_as_admin_successful
'0.0 Introduction to Jupyter notebooks.ipynb'
'1.0 Big Ideas.ipynb'
'1.1 Unix Basics.ipynb'
'1.2 Getting ready to run code on the cluster.ipynb'
'2.0 The metadata file and analysis overview.ipynb'
 2.1_Sequencing_Data_Analysis.ipynb
'3.1 Clustering analysis and PCA on fold change data .ipynb'
'3.2 Clustering analysis and PCA on count data.ipynb'
'3.3 Finding enriched GO Terms with GORilla.ipynb'
'3.4 Calling differentially expressed peaks with DESeq2 and limma .ipynb'
'3.5 GO Term enrichment for differentially accessible chromatin regions.ipynb'
'3.6 Finding TF motifs.ipynb'
'4.0 Introduction to shell scripts.ipynb'
'4.1 Introduction to SCG and Sherlock compute clusters.ipynb'
'4.2 Install Missing R packages.ipynb'
 Untitled.ipynb
 atac.out_def.json
 cromwell_input_template

Wouldn't it be nice to have everything ready to run when you log into the cluster?
To avoid having to run module load commands every time you log in, you can add these commands to a .bashrc file, located in your home directory. The .bashrc file contains a set of commands that get executed every time you log into the server, so each new session starts with everything you need already set up.

Note: Technically, ~/.bashrc is not the file bash executes on login. A login shell reads ~/.bash_profile (or, if that file is absent, ~/.bash_login or ~/.profile), which in turn usually calls ~/.bashrc. If your .bash_profile does not call .bashrc, put the line `source ~/.bashrc` in your .bash_profile. The difference between the two files is explained here: http://www.joshstaiger.org/archives/2005/07/bash_profile_vs.html
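
A common way to make the profile hand over to `.bashrc` is a guarded `source`. The sketch below demonstrates it in a throwaway directory standing in for your home directory (`fake_home` and `DEMO_SETTING` are hypothetical names), so it does not touch your real startup files:

```shell
fake_home=$(mktemp -d)

# A stand-in .bashrc that sets one variable.
echo 'export DEMO_SETTING="loaded from bashrc"' > "$fake_home/.bashrc"

# A stand-in .bash_profile: read .bashrc only if it exists.
cat > "$fake_home/.bash_profile" <<'EOF'
if [ -f "$HOME/.bashrc" ]; then
    . "$HOME/.bashrc"
fi
EOF

# Simulate a login by reading the profile in a child shell whose HOME is the demo directory.
result=$(HOME="$fake_home" sh -c '. "$HOME/.bash_profile"; echo "$DEMO_SETTING"')
echo "$result"
```

In a real `~/.bash_profile`, only the three-line `if` block is needed.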

Let's add all our desired module loading commands into a .bashrc file. 

In [20]:
bedtools_load='module load bedtools/2.26.0'
#naive thing:
#NOTE: ~ is a shortcut for your home directory
#echo "$bedtools_load" >> ~/.bashrc #this might clutter up our bashrc if we run it a bunch of times

#only add the module load command to the ~/.bashrc file if it doesn't exist in this file already 
#The || acts like an OR; it executes the command on the right only if the command on the left fails (returns a non-zero exit status)
#grep -F searches the file for a fixed string and fails if there is no match
#reminder: "$bedtools_load" expands to 'module load bedtools/2.26.0'
grep -F "$bedtools_load" ~/.bashrc || echo "$bedtools_load" >> ~/.bashrc


module load bedtools/2.26.0
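
The "append only if missing" pattern is handy enough to wrap in a function. The following is a sketch, with `append_once` as a hypothetical helper name; it is demonstrated on a scratch file rather than the real `~/.bashrc`:

```shell
# Append a line to a file only if that exact line is not already present.
# -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string.
append_once() {
    line=$1
    file=$2
    touch "$file"
    grep -qxF -e "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Demonstrate on a scratch file; calling it twice still leaves a single line.
demo_rc=$(mktemp)
append_once 'module load bedtools/2.26.0' "$demo_rc"
append_once 'module load bedtools/2.26.0' "$demo_rc"
line_count=$(grep -c 'module load' "$demo_rc")
echo "$line_count"   # prints 1
```

Matching the whole line with `-x` avoids being fooled by a commented-out copy of the same command.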


## Defining shortcuts in the .bashrc file ##

Why stop here? You can make all your dreams come true in the .bashrc file!
For instance, you can add to the .bashrc file some shortcuts to your directories of interest, which you can then seamlessly use. Add the following to your .bashrc:

In [21]:
grep -E "shortcuts_defined" ~/.bashrc || 
(echo '#shortcuts_defined:' >> ~/.bashrc &&

echo 'export SUNETID="$(whoami)"' >> ~/.bashrc &&
echo 'export WORK_DIR="/scratch/${SUNETID}"' >> ~/.bashrc &&

echo 'export DATA_DIR="${WORK_DIR}/data"' >> ~/.bashrc &&
echo '[[ ! -d ${WORK_DIR}/data ]] && mkdir "${WORK_DIR}/data"' >> ~/.bashrc &&

echo 'export SRC_DIR="${WORK_DIR}/src"' >> ~/.bashrc &&
echo '[[ ! -d ${WORK_DIR}/src ]] && mkdir -p "${WORK_DIR}/src"' >> ~/.bashrc &&

echo 'export METADATA_DIR="/metadata"' >> ~/.bashrc &&
echo 'export AGGREGATE_DATA_DIR="/data"' >> ~/.bashrc &&
echo 'export AGGREGATE_ANALYSIS_DIR="/outputs"' >> ~/.bashrc &&
echo 'export YEAST_DIR="/saccer3"' >> ~/.bashrc &&

echo 'export TMP="${WORK_DIR}/tmp"' >> ~/.bashrc &&
echo 'export TEMP=$TMP' >> ~/.bashrc && 
echo 'export TMPDIR=$TMP' >> ~/.bashrc && 
echo '[[ ! -d ${TMP} ]] && mkdir -p "${TMP}"' >> ~/.bashrc )

#shortcuts_defined:


**\$\{WORK\_DIR\}** is your main work directory

**\$\{DATA\_DIR\}** is your data/ directory -- used for storing the subset of the data you will be working with.  

**\$\{SRC\_DIR\}** is your src/ directory -- used for storing code. 

**\$\{METADATA\_DIR\}** is the directory with the metadata file for this year's training camp.  


**\$\{AGGREGATE\_ANALYSIS\_DIR\}** is the directory where we will store the aggregate analysis results for all samples, for common use by everyone. 

**\$\{AGGREGATE\_DATA\_DIR\}** is the shared /data directory -- this is where we store all the raw sequencer data generated by the group.

**\$\{YEAST\_DIR\}** is the directory with the yeast reference genome files. 

**\$\{TMP\}** (also exported as **\$\{TEMP\}** and **\$\{TMPDIR\}**) is the directory where your temporary files will be stored when you execute code. 
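
Many programs, `mktemp` among them, consult `TMPDIR` when deciding where to create temporary files, which is why the shortcuts export it. Here is a small sketch, assuming a throwaway directory (`demo_tmp`, a hypothetical name) in place of your real temporary directory:

```shell
# Stand-in for the temporary directory defined in your .bashrc.
demo_tmp=$(mktemp -d)

# Point TMPDIR at it for this one command; mktemp creates its file there.
scratch_file=$(TMPDIR="$demo_tmp" mktemp)
echo "$scratch_file"

# Confirm where the file landed, then clean up after ourselves.
parent_dir=$(dirname "$scratch_file")
rm -f "$scratch_file"
```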



To see your ~/.bashrc and ~/.bash_profile files in action, log out and log in again. All modules should be loaded and all shortcuts should be set!

Since logging out and back in would disrupt this tutorial, we run the same commands directly in this notebook:

In [22]:
export SUNETID="$(whoami)"
export WORK_DIR="/scratch/${SUNETID}"
export DATA_DIR="${WORK_DIR}/data"
[[ ! -d ${WORK_DIR}/data ]] && mkdir "${WORK_DIR}/data"
export SRC_DIR="${WORK_DIR}/src"
[[ ! -d ${WORK_DIR}/src ]] && mkdir -p "${WORK_DIR}/src"
export METADATA_DIR="/metadata"
export AGGREGATE_DATA_DIR="/data"
export AGGREGATE_ANALYSIS_DIR="/outputs"
export YEAST_DIR="/saccer3"
export TMP="${WORK_DIR}/tmp"
export TEMP=$TMP
export TMPDIR=$TMP
[[ ! -d ${TMP} ]] && mkdir -p "${TMP}"
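
A quick way to confirm that the shortcuts took effect is to loop over their names and print each value. The helper `check_vars` and the `DEMO_` variables below are hypothetical, used so the sketch runs anywhere:

```shell
# Report which of the named variables are unset or empty; return 1 if any are.
check_vars() {
    rc=0
    for name in "$@"; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            rc=1
        else
            echo "$name=$value"
        fi
    done
    return "$rc"
}

# Example values for illustration; on the cluster these come from your ~/.bashrc.
DEMO_WORK_DIR="/scratch/demo_user"
DEMO_DATA_DIR="${DEMO_WORK_DIR}/data"

check_vars DEMO_WORK_DIR DEMO_DATA_DIR
all_set=$?

missing_status=0
check_vars DEMO_UNSET_VAR >/dev/null || missing_status=$?
```

In this notebook you could call it as `check_vars WORK_DIR DATA_DIR SRC_DIR TMP`.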



## Data and code for this project ##

It's generally good practice to always keep a backup copy of your raw data files in case you unintentionally delete or modify these files when performing your analysis. 

For this reason, you will copy the two samples you generated from the **\$AGGREGATE_DATA_DIR** folder to your personal **\$DATA_DIR** folder. 

In [None]:
#PILOT datasets
#Note: Copy the 2 replicates for a given strain/timepoint to your data directory. Pick any set of 2 that you like. 

#srstern
#cp $AGGREGATE_DATA_DIR/2019_pilot/genegra2_0min_WT_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/2019_pilot/srstern_0min_WT_2* $DATA_DIR

#ajberg5
#cp $AGGREGATE_DATA_DIR/2019_pilot/lstrand_45min_WT_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/2019_pilot/ajberg5_45min_WT_2* $DATA_DIR

#cvduffy
#cp $AGGREGATE_DATA_DIR/2019_pilot/sierrasb_0min_MSN1_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/2019_pilot/cvduffy_0min_MSN1_2* $DATA_DIR

#genegra2
#cp $AGGREGATE_DATA_DIR/2019_pilot/genegra2_45min_MSN1_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/2019_pilot/mihayes_45min_MSN1_2* $DATA_DIR

#subkc
#cp $AGGREGATE_DATA_DIR/2019_pilot/subkc_0min_MSN2_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/2019_pilot/annlin_0min_MSN2_2* $DATA_DIR

#clin5
#cp $AGGREGATE_DATA_DIR/2019_pilot/srstern_45min_MSN2_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/2019_pilot/clin5_45min_MSN2_2* $DATA_DIR

#lstrand
#cp $AGGREGATE_DATA_DIR/2019_pilot/lstrand_0min_MSN4_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/2019_pilot/clin5_0min_MSN4_2* $DATA_DIR

#miao1
#cp $AGGREGATE_DATA_DIR/2019_pilot/miao1_45min_MSN4_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/2019_pilot/makena_45min_MSN4_2* $DATA_DIR

#ajberg5
#cp $AGGREGATE_DATA_DIR/2019_pilot/ajberg5_0min_HOG1_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/2019_pilot/miao1_0min_HOG1_2* $DATA_DIR

#courtrun
#cp $AGGREGATE_DATA_DIR/2019_pilot/jarhodes_45min_HOG1_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/2019_pilot/courtrun_45min_HOG1_2* $DATA_DIR

#jarhodes
#cp $AGGREGATE_DATA_DIR/2019_pilot/makena_0min_SKN7_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/2019_pilot/jarhodes_0min_SKN7_2* $DATA_DIR

#sierrasb
#cp $AGGREGATE_DATA_DIR/2019_pilot/sierrasb_45min_SKN7_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/2019_pilot/marinovg_45min_SKN7_2* $DATA_DIR

#annlin
#cp $AGGREGATE_DATA_DIR/2019_pilot/annlin_45min_YAP7_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/2019_pilot/zahoor_45min_YAP7_2* $DATA_DIR

#mihayes
#cp $AGGREGATE_DATA_DIR/2019_pilot/mihayes_0min_HOT1_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/2019_pilot/soumyak_0min_HOT1_2* $DATA_DIR


#### If you want to run an additional sample, pick from the ones below; otherwise, the TAs will run these #### 

#annashch
#cp $AGGREGATE_DATA_DIR/2019_pilot/courtrun_0min_YAP1_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/2019_pilot/annashch_0min_YAP1_2* $DATA_DIR

#soumyak
#cp $AGGREGATE_DATA_DIR/2019_pilot/cvduffy_45min_HOT1_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/2019_pilot/soumyak_45min_HOT1_2* $DATA_DIR

#surag
#cp $AGGREGATE_DATA_DIR/2019_pilot/surag_45min_HOT1_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/2019_pilot/albalsubr_45min_HOT1_2* $DATA_DIR

#abalsubr
#cp $AGGREGATE_DATA_DIR/2019_pilot/abalsubr_0min_YAP6_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/2019_pilot/zahoor_0min_YAP6_2* $DATA_DIR

#zahoor
#cp $AGGREGATE_DATA_DIR/2019_pilot/subkc_45min_YAP6_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/2019_pilot/annashch_45min_YAP6_2* $DATA_DIR

#lakss
#cp $AGGREGATE_DATA_DIR/2019_pilot/marinovg_0min_YAP7_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/2019_pilot/surag_0min_YAP7_2* $DATA_DIR

In [23]:
ls $DATA_DIR

In [24]:
#Student data files from 2019 
#Note: Copy the 2 replicates for a given strain/timepoint to your data directory. Pick any set of 2 that you like. 
#srstern
#cp $AGGREGATE_DATA_DIR/genegra2_0min_WT_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/srstern_0min_WT_2* $DATA_DIR

#ajberg5
#cp $AGGREGATE_DATA_DIR/lstrand_45min_WT_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/ajberg5_45min_WT_2* $DATA_DIR

#cvduffy
#cp $AGGREGATE_DATA_DIR/sierrasb_0min_MSN1_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/cvduffy_0min_MSN1_2* $DATA_DIR

#genegra2
#cp $AGGREGATE_DATA_DIR/genegra2_45min_MSN1_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/mihayes_45min_MSN1_2* $DATA_DIR

#subkc
#cp $AGGREGATE_DATA_DIR/subkc_0min_MSN2_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/annlin_0min_MSN2_2* $DATA_DIR

#clin5
#cp $AGGREGATE_DATA_DIR/srstern_45min_MSN2_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/clin5_45min_MSN2_2* $DATA_DIR

#lstrand
#cp $AGGREGATE_DATA_DIR/lstrand_0min_MSN4_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/clin5_0min_MSN4_2* $DATA_DIR

#miao1
#cp $AGGREGATE_DATA_DIR/miao1_45min_MSN4_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/makena_45min_MSN4_2* $DATA_DIR

#ajberg5
#cp $AGGREGATE_DATA_DIR/ajberg5_0min_HOG1_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/miao1_0min_HOG1_2* $DATA_DIR

#courtrun
#cp $AGGREGATE_DATA_DIR/jarhodes_45min_HOG1_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/courtrun_45min_HOG1_2* $DATA_DIR

#jarhodes
#cp $AGGREGATE_DATA_DIR/makena_0min_SKN7_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/jarhodes_0min_SKN7_2* $DATA_DIR

#sierrasb
#cp $AGGREGATE_DATA_DIR/sierrasb_45min_SKN7_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/marinovg_45min_SKN7_2* $DATA_DIR

#annlin
#cp $AGGREGATE_DATA_DIR/annlin_45min_YAP7_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/zahoor_45min_YAP7_2* $DATA_DIR

#mihayes
#cp $AGGREGATE_DATA_DIR/mihayes_0min_HOT1_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/soumyak_0min_HOT1_2* $DATA_DIR


#### If you want to run an additional sample, pick from the ones below; otherwise, the TAs will run these #### 

#annashch
cp $AGGREGATE_DATA_DIR/courtrun_0min_YAP1_1* $DATA_DIR
cp $AGGREGATE_DATA_DIR/annashch_0min_YAP1_2* $DATA_DIR

#soumyak
#cp $AGGREGATE_DATA_DIR/cvduffy_45min_HOT1_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/soumyak_45min_HOT1_2* $DATA_DIR

#surag
#cp $AGGREGATE_DATA_DIR/surag_45min_HOT1_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/albalsubr_45min_HOT1_2* $DATA_DIR

#abalsubr
#cp $AGGREGATE_DATA_DIR/abalsubr_0min_YAP6_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/zahoor_0min_YAP6_2* $DATA_DIR

#zahoor
#cp $AGGREGATE_DATA_DIR/subkc_45min_YAP6_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/annashch_45min_YAP6_2* $DATA_DIR

#lakss
#cp $AGGREGATE_DATA_DIR/marinovg_0min_YAP7_1* $DATA_DIR
#cp $AGGREGATE_DATA_DIR/surag_0min_YAP7_2* $DATA_DIR

In [25]:
ls $DATA_DIR

annashch_0min_YAP1_2_R1.fastq.gz  courtrun_0min_YAP1_1_R1.fastq.gz
annashch_0min_YAP1_2_R2.fastq.gz  courtrun_0min_YAP1_1_R2.fastq.gz
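
After copying, it is worth checking that the copies match the originals and then protecting them from accidental edits. This sketch uses throwaway directories and tiny placeholder files (not real FASTQ data) in place of `$AGGREGATE_DATA_DIR` and `$DATA_DIR`; the file names are hypothetical:

```shell
# Throwaway directories stand in for $AGGREGATE_DATA_DIR and $DATA_DIR.
source_dir=$(mktemp -d)
dest_dir=$(mktemp -d)
printf 'fake reads\n' > "$source_dir/sample_R1.fastq.gz"
printf 'more fake reads\n' > "$source_dir/sample_R2.fastq.gz"

cp "$source_dir"/sample_R* "$dest_dir"

# cmp -s exits 0 only when the two files are byte-for-byte identical.
mismatches=0
for original in "$source_dir"/sample_R*; do
    copy="$dest_dir/$(basename "$original")"
    cmp -s "$original" "$copy" || mismatches=$((mismatches + 1))
done
echo "mismatched files: $mismatches"

# Remove write permission so the working copies cannot be edited by accident.
chmod a-w "$dest_dir"/sample_R*
```

For large sequencing files a checksum tool gives the same assurance and leaves a record you can keep alongside the data.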


Let the analysis begin!