# 1.2 Shell scripts and job submission #

After a basic introduction to Unix, we will get our feet (keyboards?) wet actually submitting jobs (ok, maybe not keyboards). As a reminder, you can use the `nano` command to create or edit files (e.g. `nano nameOfMyFile.txt` will open nameOfMyFile.txt for writing).

## Submitting jobs on the SGE ##

When many users are trying to do data processing on a server and there are limited computing resources, how is the distribution of the computing resources handled? On scg3 and many other servers, this is handled using the Sun Grid Engine (abbreviated SGE). The basic idea is that the user puts the commands they want to run in a shell script, and then submits that script to the SGE using qsub. If you do NOT use this approach, and instead execute the commands directly in your login shell, you will be running the commands on what is called the "head node", which is NOT equipped to do heavy lifting. If your commands require a lot of computing resources, you will end up making the head node slow for ALL users who log in, and everyone will hate you even if they may not know who you are.

In short: always use qsub for computationally intensive things.

We will walk through an example. Here is a reference for using SGE from the scg3 website: https://web.stanford.edu/group/scgpm/cgi-bin/informatics/wiki/index.php/Using_Sun_Grid_Engine. Anshul also has a summary of key commands and options on his website: https://sites.google.com/site/anshulkundaje/inotes/programming/clustersubmit/sun-grid-engine

By default, jobs have a memory limit of 3.7GB (per slot; parallel processing jobs may use multiple slots) and jobs in the standard queue have a runtime limit of 6 hours (wallclock, not CPU time).

In order to run jobs, you have to put the commands in what is called a "shell script". Create a shell script called myFirstShellScript.sh with the following contents.

In [4]:
#you can ignore this line -- it checks to see if the file exists, and removes it if it does, so we don't end up writing the same information to the file multiple times.
!if [ -f myFirstShellScript.sh ] ; then rm myFirstShellScript.sh; fi

! touch myFirstShellScript.sh 
! echo "#!/bin/sh" >> myFirstShellScript.sh
! echo "#this line is a comment; it is ignored during execution" >> myFirstShellScript.sh
! echo "#you can put any commands that you would normally type into the command line in here." >>myFirstShellScript.sh
! echo "#for example, this shell script just creates an empty file" >> myFirstShellScript.sh
! echo "touch thisFileCreatedFromShellScript.txt" >> myFirstShellScript.sh

The **#!/bin/sh** at the beginning tells the operating system what software to use to interpret the script (in this case, it uses the program located at **/bin/sh**). Don't worry if you don't understand; just make sure your scripts begin with it.
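As an alternative to appending one line at a time with echo, a here-document can write a whole script in a single step. The following is a minimal sketch for illustration only; the file name `shebangDemo.sh` is hypothetical and is not part of the exercise.

```shell
# Write a small script in one step with a here-document.
# Quoting the delimiter ('EOF') stops the shell from expanding anything
# inside the block, so the text is written to the file exactly as typed.
# shebangDemo.sh is a hypothetical file name used only for this illustration.
cat > shebangDemo.sh <<'EOF'
#!/bin/sh
# this line is a comment; it is ignored during execution
echo "hello from a shell script"
EOF

# The first line of the file is the interpreter line discussed above.
head -n 1 shebangDemo.sh

# Passing the file to sh explicitly runs it even before it is made executable.
sh shebangDemo.sh
```

The rest of this section continues with myFirstShellScript.sh.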

Once you have created the script, make it executable:

In [6]:
#this command makes the script executable 
!chmod a+x myFirstShellScript.sh

Then, run it:

In [9]:
#this command runs the shell script
!./myFirstShellScript.sh
#the ls command indicates that "thisFileCreatedFromShellScript.txt" was created 
!ls

1.0 Big Ideas.ipynb
1.1 Unix Basics.ipynb
1.2 Shell scripts and job submission.ipynb
1.3 Getting ready to run code on the cluster.ipynb
2.0_Sequencing_Data_Analysis.ipynb
2.4 Creating count coverage tracks.ipynb
3.1 Clustering analysis and PCA.ipynb
3.2 Calling differentially expressed peaks.ipynb
3.3 GO Term Enrichment.ipynb
3.4 Finding TF motifs.ipynb
myFirstShellScript.sh
thisFileCreatedFromShellScript.txt


Shell scripts can also accept arguments (a fancy word for the extra values or options that you type after the script's name when you run it). \$1, \$2, \$3, ... refer to the first, second, third, etc. arguments passed into the shell script. Create another shell script called myFirstShellScriptWithArguments.sh with the following contents:

In [25]:
#you can ignore this line -- it checks to see if the file exists, and removes it if it does, so we don't end up writing the same information to the file multiple times.
!if [ -f myFirstShellScriptWithArguments.sh ] ; then rm myFirstShellScriptWithArguments.sh; fi

! touch myFirstShellScriptWithArguments.sh
! echo "#!/bin/sh" >> myFirstShellScriptWithArguments.sh
! echo touch "$"1 >> myFirstShellScriptWithArguments.sh
! echo mkdir "$"2 >> myFirstShellScriptWithArguments.sh




Once again, make the shell script executable: 

In [26]:
!chmod a+x myFirstShellScriptWithArguments.sh

Now run the following:

In [27]:
!./myFirstShellScriptWithArguments.sh customFileName.txt customDirectoryName
! ls

1.0 Big Ideas.ipynb
1.1 Unix Basics.ipynb
1.2 Shell scripts and job submission.ipynb
1.3 Getting ready to run code on the cluster.ipynb
2.0_Sequencing_Data_Analysis.ipynb
2.4 Creating count coverage tracks.ipynb
3.1 Clustering analysis and PCA.ipynb
3.2 Calling differentially expressed peaks.ipynb
3.3 GO Term Enrichment.ipynb
3.4 Finding TF motifs.ipynb
customDirectoryName
customFileName.txt
myFirstShellScript.sh
myFirstShellScriptWithArguments.sh
thisFileCreatedFromShellScript.txt


This was just an example, but hopefully you can see the potential power of using scripts like these to launch complicated bioinformatics processing jobs.
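Scripts that take arguments are easier to use safely if they check that the expected number of arguments was supplied. The sketch below is one way to do that, not the only one; the names `checkArguments.sh`, `demoFile.txt` and `demoDirectory` are hypothetical and chosen only for this illustration.

```shell
# Hypothetical script name, used only for this illustration.
cat > checkArguments.sh <<'EOF'
#!/bin/sh
# $# holds the number of arguments passed to the script.
if [ "$#" -ne 2 ]; then
    # $0 is the name the script was called with; >&2 sends the message to stderr.
    echo "usage: $0 fileName directoryName" >&2
    exit 1
fi
# Quote the arguments so that names containing spaces still work.
touch "$1"
# -p means mkdir does not complain if the directory already exists.
mkdir -p "$2"
echo "created file $1 and directory $2"
EOF
chmod a+x checkArguments.sh

# Correct call: two arguments.
./checkArguments.sh demoFile.txt demoDirectory

# Incorrect call: the script prints a usage message and exits with status 1.
./checkArguments.sh onlyOneArgument || echo "script refused to run"
```

Failing early with a usage message is especially helpful for submitted jobs, where a missing argument might otherwise only show up in a log file hours later.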

Once you have a shell script that contains the commands, you can submit it to the SGE using the qsub command. The general format looks something like this: 
```
qsub [-flags -flags -flags] path/To/Shell/Script.sh
```
The various flags specify options about how the job submission will work. Here is a list of some relevant flags:

-V : pass all environment variables to the job (environment variables are like settings specific to your login session; you often want to pass these settings to the job so that the commands will behave the same way they do when you type them into your login session).

-cwd : Execute the job in the current working directory (you also usually want this option set)

-wd [dir] : Set working directory for this job (don't use this if you've specified -cwd)

-w e : verify options and abort if there is an error

-N [jobname] : name of the job

-m ea : send an email when the job ends or aborts

-M emailAddress@stanford.edu : whom to email

-o [output_logfile] : specifies the file to write the output that would (in the absence of qsub) be printed to the screen (technically "stdout" or "standard output")

-e [error_logfile]: specifies the file to write error messages to (in the absence of qsub, these would also be printed to the screen; technically "stderr" or "standard error")

-q [queue] : set the queue (we won't use this one but it may be useful on scg3)

-l h_vmem=[size] : specify the amount of memory required (e.g. `-l h_vmem=4G`)

-l h_rt=[hh:mm:ss] : specify the maximum run time (hours, minutes and seconds)

-pe shm [n_processors] : run a job that uses parallelisation using pthreads or other shared-memory API

-b y : allow direct command or binary file instead of a text script

A template qsub command might look something like this:
```
qsub -V -w e -N [job_name] -l h_vmem=[memory] -l h_rt=[time] -pe shm [n_processors] -o [outputlogfile] -e [errorlogfile] [pathtoScript] [arg1] [arg2]
```
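To make the template concrete, the sketch below fills in the placeholders and prints the resulting command instead of submitting it, so it can be inspected on a machine that does not have SGE installed. All of the values (job name, memory, run time, number of slots, log file names) are made-up examples rather than recommended settings, and `-cwd` is added on the assumption that the script and log files live in the current directory.

```shell
# All values below are illustrative assumptions, not recommended settings.
job_name="myFirstJob"
memory="4G"            # memory request, passed to -l h_vmem
run_time="01:00:00"    # wallclock limit, passed to -l h_rt (hh:mm:ss)
n_processors=2         # number of slots, passed to -pe shm
script="./myFirstShellScriptWithArguments.sh"

# Build the command as a single string so it can be inspected first.
qsub_command="qsub -V -cwd -w e -N ${job_name} -l h_vmem=${memory} -l h_rt=${run_time} -pe shm ${n_processors} -o ${job_name}.stdout -e ${job_name}.stderr ${script} customFileName.txt customDirectoryName"

# Dry run: print the command instead of executing it.
echo "$qsub_command"
```

On a cluster where qsub is available, the printed line could be copied into the terminal to submit the job.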
You can also set some commonly used flags in the shell script itself. Here is a template shell script:
```
#!/bin/sh
#$ -V
#$ -N jobname
#$ -m ea
#$ -M youremail@stanford.edu
#$ -cwd
#$ -o /path/to/jobname.stdout
#$ -e /path/to/jobname.stderr
#$ -w e
[your job commands go here]
```
In general, it is best to use absolute paths in shell scripts submitted through qsub. You never know when relative paths will get you in trouble, even when using -cwd.
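One way to follow this advice is to generate the job script so that the absolute paths are filled in for you. The sketch below assumes you want the log files in a `logs` directory under the current directory; the job name `absolutePathDemo`, the `logs` directory and the file `fileCreatedByJob.txt` are all hypothetical names chosen for illustration.

```shell
# The job name and directory names are hypothetical, chosen for illustration.
job_name="absolutePathDemo"
log_dir="$(pwd)/logs"          # $(pwd) expands to an absolute path
mkdir -p "$log_dir"

# Unquoted EOF: $job_name and $log_dir are expanded while the file is written,
# so the finished script contains literal absolute paths.
# \$ writes a literal $ so that the SGE directive prefix "#$" is kept as is.
cat > "$job_name.sh" <<EOF
#!/bin/sh
#\$ -V
#\$ -N $job_name
#\$ -cwd
#\$ -o $log_dir/$job_name.stdout
#\$ -e $log_dir/$job_name.stderr
#\$ -w e
touch "$log_dir/fileCreatedByJob.txt"
EOF

# Show the generated directive lines (those starting with "#$").
grep '^#\$' "$job_name.sh"
```

Because the paths are expanded when the script is written, the job no longer depends on which directory it happens to start in.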

## Interactive jobs ##

Say you cannot put all the commands you have to execute in a shell script that will run on its own; in other words, you want to run the job interactively. You can use `qlogin [resource options]` to get an interactive shell (e.g. `qlogin -l h_vmem=4G -pe shm 4`). Note that on scg3, you will be charged for all the time you spend logged into a qlogin shell regardless of whether you are actually running computationally intensive things on it, so be warned.

If you want a job to keep running on a qlogin shell even after you close your computer, you can use what is called a screen session. Create a new screen with `screen -S [screenName]`. Launch your commands as desired. Leave the screen temporarily using Ctrl+a d (called "detaching"). Obtain a list of running screens with `screen -list`. Resume a screen with `screen -r [screenName]`. Close a screen forever by typing `exit` inside it. (Disclaimer: I (avanti) have never actually used screen with qlogin - I have used screen and qlogin independently but never together. No promises that this works as advertised).

