# 4.1 Introduction to SCG and Sherlock compute clusters #

## Submitting jobs with SLURM ##

When many users are trying to process data on a server with limited computing resources, how are those resources distributed? On SCG (https://srcc.stanford.edu/scg-genomics-cluster-genomics-scale), Sherlock (https://www.sherlock.stanford.edu/), and many other servers, this is handled by Slurm, an open-source resource manager and job scheduler. The basic idea is that the user puts the commands they want to run in a shell script and then submits that script to the scheduler using qsub. (Note that qsub and the flags shown in this section follow Grid Engine conventions; Slurm's native submission command is sbatch, so check the scheduler references linked below for the exact syntax on your cluster.) If you do NOT use this approach, and instead execute the commands directly in your login shell, you will be running the commands on what is called the "head node", which is NOT equipped to do heavy lifting; if your commands require a lot of computing resources, you will end up making the head node slow for ALL users who log in, and everyone will hate you even if they may not know who you are. Or even worse -- your job will simply be killed by the sysadmins.

In short: always use qsub for computationally intensive things.

We will walk through an example. Here is a reference for using SLURM from the SCG website: 
*  https://docs.scg.stanford.edu/scg-user-guide/wiki/SchedulerReference

And a similar reference from the Sherlock website -- Sherlock is another widely used compute cluster at Stanford; you interact with it in the same way as you do with SCG. We will refer to SCG in further discussion, but much of this information is transferable to working on Sherlock. 
* https://www.sherlock.stanford.edu/docs/getting-started/submitting/


By default, for SCG, jobs have a memory limit of 3.7GB (per slot; parallel processing jobs may use multiple slots) and jobs in the standard queue have a runtime limit of 6 hours (wallclock, not CPU time).
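
Because the memory limit applies per slot, the total memory available to a multi-slot job is the per-slot request multiplied by the number of slots. As a rough sketch, the hypothetical helper below (`per_slot_mem_mb` is an illustrative name, not part of any scheduler) works out what per-slot value to request for a given total, rounding up.

```shell
#!/bin/sh
# Hypothetical helper: given the total memory a job needs (in MB) and the
# number of slots, print the per-slot request in MB, rounded up.
per_slot_mem_mb() {
    total_mb=$1
    slots=$2
    # Integer ceiling division: (a + b - 1) / b
    echo $(( (total_mb + slots - 1) / slots ))
}

# Example: a job needing 14000 MB in total, spread over 4 slots
per_slot_mem_mb 14000 4
```

Under these example numbers the per-slot value comes out to 3500 MB, which stays below the 3.7GB default.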

To run jobs, you have to put the commands in a shell script (see section 1.2).


Once you have a shell script that contains the commands, you can submit it to the scheduler using the qsub command. The general format looks something like this: 
```
qsub [-flags -flags -flags] path/To/Shell/Script.sh
```
The flags specify options for how the job is submitted. The list below shows some commonly used ones; see the qsub manual page (`man qsub`) for more details.
```
-N name --- set the name of the job
-l h_vmem=size --- specify the amount of memory required (e.g. 3G or 3500M)
-l h_rt=hh:mm:ss --- specify the maximum run time (hours, minutes and seconds)
-pe shm slots --- run a parallel job using pthreads or other shared-memory API
-R y --- reserve all requested resources
-t n --- run an array job with n instances
-cwd --- run the job in the current working directory
-wd dir --- set the working directory for the job
-o path --- define the path for saving the standard output stream of the command
-e path --- define the path for saving the standard error stream of the command
-j y --- merge the standard error stream into the standard output stream
-m ea --- send mail when the job ends or aborts
-P project --- set the job's project
-q queue --- set the queue
-b y --- allow command to be a binary file instead of a script
-w e --- verify options and abort if there is an error
```


A template qsub command might look something like this (the `-V` flag, not listed above, exports your current environment variables to the job):
```
qsub -V -w e -N [job_name] -l h_vmem=[memory] -l h_rt=[time] -pe shm [n_processors] -o [outputlogfile] -e [errorlogfile] [pathtoScript] [arg1] [arg2]
```
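
One way to check the options before submitting anything is to print the command first. The sketch below is a hypothetical helper (the name `print_qsub_cmd`, the example job name, and the example script path are all illustrative) that fills in the template above and echoes the result instead of running it. It assumes the log files should go in the current directory and writes them as absolute paths.

```shell
#!/bin/sh
# Hypothetical helper: print the qsub command for a job instead of running it,
# so the options can be checked before anything is submitted.
# Arguments: job name, memory per slot, run time, number of slots, script path.
print_qsub_cmd() {
    name=$1; mem=$2; rt=$3; slots=$4; script=$5
    # Log files are named after the job and placed in the current directory
    echo "qsub -V -w e -N $name -l h_vmem=$mem -l h_rt=$rt -pe shm $slots" \
         "-o $PWD/$name.out -e $PWD/$name.err $script"
}

# Example values (all illustrative)
print_qsub_cmd align_reads 4G 02:00:00 4 /path/to/align.sh
```

Once the printed command looks right, it can be copied and run as is.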

You can also set some commonly used flags in the shell script itself. Here is a template shell script:
```
#!/bin/sh
#
# set the name of the job
#$ -N example_job
#
# set the maximum memory usage (per slot)
#$ -l h_vmem=3G
#
# set the maximum run time
#$ -l h_rt=12:00:00
#
# send mail when job ends or aborts
#$ -m ea
#
# specify an email address
#$ -M $USER@stanford.edu
#
# check for errors in the job submission options
#$ -w e
[your job commands go here]
```
In general, it is best to use absolute paths in shell scripts submitted through qsub. You never know when relative paths will get you in trouble, even when using -cwd.
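
If you only have a relative path at hand, you can convert it before writing it into a job script. The following is a minimal sketch of one way to do that in plain shell; `abs_path` is a hypothetical helper name, and it assumes the file or directory already exists.

```shell
#!/bin/sh
# Hypothetical helper: print the absolute path of an existing file or directory.
abs_path() {
    target=$1
    if [ -d "$target" ]; then
        # Directory: enter it in a subshell and print where we ended up
        (cd "$target" && pwd -P)
    else
        # File: resolve the containing directory, then re-attach the file name
        dir=$(cd "$(dirname "$target")" && pwd -P) || return 1
        echo "$dir/$(basename "$target")"
    fi
}

# Example: print the absolute path of the current directory
abs_path .
```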

## Interactive jobs ##

Say you cannot put all the commands you have to execute in a shell script that will run on its own; in other words, you want to run the job interactively. You can use `qlogin [resource options]` to get an interactive shell (e.g., `qlogin -l h_vmem=4G -pe shm 4`). Note that on SCG you will be charged for all the time you spend logged into a qlogin shell, regardless of whether you are actually running computationally intensive things on it, so be warned.
More information for SCG is available here: https://web.stanford.edu/group/scgpm/cgi-bin/informatics/wiki/index.php/Qlogin 


If you want a job to keep running on a qlogin shell even after you close your computer, you can use what is called a screen session. Create a new screen with `screen -S [screenName]`. Launch your commands as desired. Leave the screen temporarily using `Ctrl+a d` (called "detaching"). Obtain a list of running screens with `screen -list`. Resume a screen with `screen -r [screenName]`. Close a screen forever by typing `exit` inside it.


## SCG tips ##

You will probably end up using SCG at some point. Here are things to keep in mind when you do.

### Temporary files ###

On SCG, the local nodes often do not have a large amount of temporary space. So you should make sure your code is using a temporary directory with sufficient disk space. SCG3 has 100TB of scratch space at **/srv/gsfs0/scratch** that you can use for temporary files.

You can usually set the TMP environment variable in your ~/.bashrc or in your job submission script. (The ~/.bashrc file contains shell commands that are executed on login. Technically, it is ~/.bash_profile that is executed on login, but your ~/.bash_profile should contain the line `source ~/.bashrc`, which runs ~/.bashrc.) The difference between ~/.bash_profile and ~/.bashrc is explained here: http://www.joshstaiger.org/archives/2005/07/bash_profile_vs.html

Create a directory for yourself in scratch using:
```
mkdir /srv/gsfs0/scratch/<yourusername>
# Set $TMP to point to this directory:
export TMP=/srv/gsfs0/scratch/<yourusername>
```
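
The two steps can also be wrapped in a small function for your ~/.bashrc or job script. This is only a sketch: `use_scratch_tmp` is a hypothetical name, the scratch root defaults to the SCG path given above, and the optional argument exists so the function can be tried on a machine without that path.

```shell
#!/bin/sh
# Hypothetical helper: create a per-user directory under a scratch root and
# point TMP at it. The root defaults to the SCG scratch path mentioned above;
# pass a different root as the first argument to try this elsewhere.
use_scratch_tmp() {
    root=${1:-/srv/gsfs0/scratch}
    user_dir="$root/$(id -un)"
    # -p: no error if the directory already exists
    mkdir -p "$user_dir" || return 1
    TMP=$user_dir
    export TMP
}

# On SCG, call it with no arguments near the top of a job script:
# use_scratch_tmp
```

Whether a given program honors TMP (rather than, say, TMPDIR) depends on the program, so check its documentation.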

### Job queues ###

There are a number of job scheduling queues, each configured with different resource restrictions. In many cases the job scheduler will automatically select the appropriate queue based on the resources required by your job, but you can also specifically request a queue using qsub's "-q" option.

* The test.q queue has a runtime limit of one hour and you can only run one job at a time. However, there is a dedicated node for these jobs, so generally they will be dispatched quickly.
* The standard queue has a runtime limit of six hours.
* The extended queue has a runtime limit of seven days. Jobs in the extended queue may have to wait longer to be scheduled.
* The large queue is a special queue for large-memory jobs (see Large-Memory Jobs).
* The seq_pipeline queue is a special queue for jobs related to the Center's sequencing pipeline.
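
As an illustration of how the runtime limits above map to queues, here is a hypothetical helper that suggests a queue name from the expected run time in whole hours. The function name `suggest_queue` is made up, and the queue name `extended` is assumed from the description above. It ignores the special-purpose queues and the one-job-at-a-time restriction on test.q, so treat it as a rough guide only.

```shell
#!/bin/sh
# Hypothetical helper: suggest a queue from the expected run time in whole
# hours, using the limits listed above (1 hour, 6 hours, 7 days = 168 hours).
suggest_queue() {
    hours=$1
    if [ "$hours" -le 1 ]; then
        echo "test.q"
    elif [ "$hours" -le 6 ]; then
        echo "standard"
    elif [ "$hours" -le 168 ]; then
        echo "extended"   # queue name assumed from the list above
    else
        echo "job exceeds the 7-day limit; split it into smaller jobs" >&2
        return 1
    fi
}

# Example: a job expected to run for about 12 hours
suggest_queue 12
```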

You can force a job to run on a particular node by:
* Specifying both a queue name and a node name: `qsub -q standard@scg1-2-10 myscript`
* Specifying a node name via the -l hostname= option: `qsub -l hostname=scg1-2-10 myscript`

### Make sure your job completed ###

Jobs can fail for many reasons. When they fail due to errors, it is easy to catch that by looking at the .err file. However, sometimes jobs just stop running due to memory issues, etc. For this reason it is SUPER SUPER IMPORTANT to end your scripts with a printout that says "Done" (or anything you want for that matter). Then, if you see that printout in your .out file you know the job successfully completed. If you do not see it, then you know that there was a problem with the job, even if there is no explicit error message. Such a "Done" print statement is present in the code we provide and should serve as an important example for ANY script you write.
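
A check like this is easy to automate. The sketch below assumes your job script ends with `echo "Done"` as its very last command; `job_completed` is a hypothetical helper name, and the file name in the usage note is illustrative.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the last line of a job's .out file is
# exactly "Done", i.e. the script reached its final echo.
job_completed() {
    out_file=$1
    [ -f "$out_file" ] && [ "$(tail -n 1 "$out_file")" = "Done" ]
}
```

For example, `job_completed example_job.out || echo "example_job did not finish"` prints a warning when the marker is missing or the file does not exist.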