### Jupyter notebook basics:
 * Navigate through cells with UP/DOWN arrow.
 * Code cells have brackets `[]:` on the left.
 * You run them with SHIFT-ENTER.
 * The brackets change to `[*]:` while running.
 * They change to `[number]:` after finishing.

# Lecture: pre-processing and quality control with fastp

Here we will demo running fastp on a dataset I have already downloaded. The following steps make up a typical workflow.

1. Create a working directory for this exercise.
2. Locate data.
3. Set up _fastp_ through a container and make an alias.
4. Run the program on a single dataset. Inspect.
5. Write a script to run the other datasets in a batch job (we'll do this from the terminal).

First, let's set aside some space in the working directory. 

In [None]:
# Step 1: create a workspace
cd /scratch/summit/$USER
mkdir DSCI512_RNAseq
cd DSCI512_RNAseq
pwd

In the upper left-hand corner, go to `File` -> `Open from path...` and paste in the output from the previous command. This will open the file browser to the current location.
***
Now we're going to link a data directory into the current directory. We do this because the data is very large and would take too long to copy during a demonstration. 
The link will reside in the present directory and act like any other directory, except you won't be able to change its contents. It is __read-only.__
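
To get a feel for how such a link behaves before making the real one, here is a minimal sketch that runs in a throwaway temporary directory. The names `demo`, `data`, and `sample.txt` are made up for illustration and are not part of the course data.

```bash
# Sketch only: uses a throwaway temp directory, not the course paths.
demo=$(mktemp -d)
mkdir "$demo/data"                      # stands in for the shared data directory
echo "ACGT" > "$demo/data/sample.txt"   # hypothetical file

# Create the link: ln -s TARGET LINK_NAME
ln -s "$demo/data" "$demo/01_input"

# The link can be listed like a directory...
ls "$demo/01_input"

# ...but it only stores the path it points to.
link_target=$(readlink "$demo/01_input")
echo "01_input -> $link_target"
```

A symbolic link does not add write protection by itself. Whether you can modify the contents depends on the permissions of the directory it points to.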

In [None]:
# make directories to use through processing
# skipping 01_input - we will make that with a link below
mkdir 02_output
mkdir 03_scripts
mkdir 04_logs


In [None]:
#link to the data directory (I have already downloaded everything)
ln -sv /scratch/summit/dcking@colostate.edu/DSCI512/2019/data 01_input

In [None]:
# Look at your directory structure.
ls -l

***

# Running fastp

We will run this through a singularity container:

 * Load the singularity module
 * Test the container with the full command (long)
 * Make an alias for the long command
 

In [None]:
# Step 3: load the module that works with containers
module load singularity
module list

&#9935; remember `module load singularity` for the script exercise below.

The following command:
`singularity exec /projects/dcking@colostate.edu/containers/Summit_RNAseq_container.sif fastp`

breaks down like this:

 * __singularity__ - A program that reads a container.
 * __exec__ - verb: execute
 * ___[path to container image]___: The container itself, called an image.
 * __fastp__: The program you want to execute.

In [None]:
# Step 4: Run fastp through the container without arguments- gives catalog of available flags
singularity exec /projects/dcking@colostate.edu/containers/Summit_RNAseq_container.sif fastp

__Note__: The warning <font color=orange>WARNING: Non existent 'bind path' source: '/rc_scratch'</font> is due to the configuration and is not a problem.

In [None]:
# Make a shortcut for fastp: store the long command in a shell variable
fastp='singularity exec /projects/dcking@colostate.edu/containers/Summit_RNAseq_container.sif fastp'

&#9935; remember the above command for the script.

You will now be able to type `$fastp` in place of the long command.
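
The shortcut is an ordinary shell variable rather than a true `alias`, which is why it is used with a dollar sign. The sketch below shows roughly what happens when the variable is expanded. It prepends `echo` so the command is printed instead of executed, and the names `fastp_demo`, `fastp_fn`, and the `reads_*.fastq` files are hypothetical.

```bash
# Sketch only: 'echo' is prepended so the command is printed, not executed.
fastp_demo='echo singularity exec /projects/dcking@colostate.edu/containers/Summit_RNAseq_container.sif fastp'

# Unquoted expansion splits the string into words: program first, then arguments.
$fastp_demo -i reads_1.fastq -I reads_2.fastq

# Alternative: a shell function forwards its arguments with "$@",
# which also preserves arguments that contain spaces.
fastp_fn() {
    echo singularity exec /projects/dcking@colostate.edu/containers/Summit_RNAseq_container.sif fastp "$@"
}
fastp_fn -x -p
```

Both forms work in scripts. The variable form is the one used in the rest of this lecture.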

In [None]:
# Test the alias- same output.
$fastp

## fastp command usage

__For paired-end data, the usage message tells us:__

`-i readfile1.fastq -I readfile2.fastq`

`-o outputfile1.fastq -O outputfile2.fastq`

`[options]`

For the options:

 * __-x__: remove polyX (polyAs polyCs polyGs polyTs)
 * __-p__: overrepresentation analysis
 * __--thread__: We only have 1 on jupyterhub. We can use more in our script.
 * __-h,-j__: The report filenames in html, json (javascript object notation).

***
The command below runs on the smallest dataset. The backslashes `\` allow the command to wrap onto multiple lines.

In [None]:
time $fastp -i 01_input/SRR5832199_1.fastq       -I 01_input/SRR5832199_2.fastq \
           -o 02_output/SRR5832199_trim_1.fastq -O 02_output/SRR5832199_trim_2.fastq \
           -h 02_output/SRR5832199_report.html  -j 02_output/SRR5832199_report.json \
           --thread 1 \
           -x -p 


The command is still running while you see <font color="blue">`[*]`</font> with the asterisk. Give it 1-2 minutes. You'll also see <font color=orange>WARNING: Non existent 'bind path' source: '/rc_scratch'</font> again before the output appears.

_Check the output_ of the command by navigating to `02_output` in the file browser.
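
Besides looking in the file browser, you can check the output from the command line. The sketch below is one possible approach, assuming standard uncompressed FASTQ with four lines per read. It builds a tiny example file in a temporary directory so it runs anywhere. The helper `count_reads` and the example filenames are made up for illustration.

```bash
# Sketch only: builds a tiny FASTQ in a temp directory so it runs anywhere.
demo=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n' > "$demo/example_1.fastq"

# A standard FASTQ record is 4 lines, so reads = lines / 4.
count_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

n_reads=$(count_reads "$demo/example_1.fastq")
echo "example_1.fastq: $n_reads reads"

# Report whether each expected file exists and is non-empty.
for f in "$demo/example_1.fastq" "$demo/missing.fastq"
do
    if [ -s "$f" ]; then echo "ok       $f"; else echo "MISSING  $f"; fi
done
```

On the real data, comparing `count_reads 01_input/SRR5832199_1.fastq` with `count_reads 02_output/SRR5832199_trim_1.fastq` would show how many reads were removed by filtering.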

***
_Now_, let's use a variable to simplify this command:

In [None]:
SRRID=SRR5832199
time $fastp -i 01_input/${SRRID}_1.fastq       -I 01_input/${SRRID}_2.fastq \
           -o 02_output/${SRRID}_trim_1.fastq -O 02_output/${SRRID}_trim_2.fastq \
           -h 02_output/${SRRID}_report.html  -j 02_output/${SRRID}_report.json \
           --thread 1 \
           -x -p 


&#9935; remember the above command for the script.

Let's modify the value of `SRRID` to run the command on another dataset.
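
The curly braces in `${SRRID}_1.fastq` matter. The short sketch below shows what the shell builds with and without them. It only prints filenames and runs nothing, and it assumes no variable named `SRRID_1` has been set in your session.

```bash
SRRID=SRR5832199

# With braces, the shell knows the variable name ends at 'SRRID'.
with_braces="01_input/${SRRID}_1.fastq"

# Without braces, the shell looks for a variable named 'SRRID_1'.
# That variable is unset here, so it expands to nothing.
without_braces="01_input/$SRRID_1.fastq"

echo "with braces:    $with_braces"
echo "without braces: $without_braces"

# Changing the value once updates every filename built from it.
SRRID=SRR5832198
echo "next dataset:   01_input/${SRRID}_1.fastq"
```

The underscore is a legal character in variable names, which is why the braces are needed to mark where the name stops.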

# Step 5. Scripting and running a batch job

Now we're going to set up the full version of this. 

1. Go back to `File`--> `New` --> `Terminal`. This will open a web-based terminal in a new browser tab.
2. Do `cd /scratch/summit/$USER/DSCI512_RNAseq` (your current directory).
3. Using nano, copy the template script below into a new file called `fastp.sbatch`. Complete the unfilled sections using the commands marked with the tool &#9935; above.

## A template SBATCH script

```bash
#!/usr/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=0:05:00
#SBATCH --qos=normal
#SBATCH --partition=shas
#SBATCH --output=04_logs/fastp.%j.out

#
# Use the commands marked with the tool symbol in the notebook to complete the following
#

# 1)setup:
#    a) Load modules
#    b) make shortcut

# 2) add IDs to SRRs like
SRRs="SRR5832188 SRR5832189 SRR5832190"

# 3) run the command in a loop for each file prefix
for SRRID in $SRRs
do
  # paste fastp command here
done
```
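
Before filling in the real command, it can help to do a dry run of the loop. This sketch uses `echo` in place of fastp, so it only prints the input filenames each iteration would use. The counter `n` is added for illustration.

```bash
SRRs="SRR5832188 SRR5832189 SRR5832190"

# Unquoted $SRRs is split on whitespace, giving one loop iteration per ID.
n=0
for SRRID in $SRRs
do
    n=$(( n + 1 ))
    # 'echo' prints the filenames instead of running fastp (dry run).
    echo "run $n: 01_input/${SRRID}_1.fastq 01_input/${SRRID}_2.fastq"
done
echo "$n datasets"
```

If the printed filenames match what `ls 01_input` shows, the loop is ready for the real command.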

***

### SBATCH directives

* must come directly beneath the `#!/usr/bin/bash` line, before any commands
* __--nodes:__ Number of compute nodes. We only use 1.
* __--ntasks:__ Number of cores (CPUs/threads). This can be up to 24. The requested value is available in the script as `$SLURM_NTASKS`.
* __--time:__ Requested amount of time. Defaults to 1 hour.
* __--qos:__ quality of service. See rc help for the breakdown.
* __--partition:__ Such as shas (haswell). Others are hi memory nodes, GPU, and testing nodes.
* __--output:__ The format of the output file. %j refers to the job id and is set at run time.
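
`$SLURM_NTASKS` only exists inside a running job. If you also want to test the script by hand, one option is to fall back to a default when the variable is missing. This is a sketch of that idea, not something the template requires. The names `threads` and `simulated` are made up for illustration.

```bash
# ${VAR:-default} expands to the default when VAR is unset or empty.
threads=${SLURM_NTASKS:-1}
echo "current shell: --thread $threads"

# Simulate the value SLURM would set for '#SBATCH --ntasks=6'.
# The assignment happens inside $( ), so it does not leak into this shell.
simulated=$(SLURM_NTASKS=6; echo "${SLURM_NTASKS:-1}")
echo "inside a job:  --thread $simulated"
```

In the fastp command this would appear as `--thread ${SLURM_NTASKS:-1}`.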

### Submitting the job

```sbatch --reservation number fastp.sbatch```

### Checking the job status

```squeue -u $USER```

***

## My finished SBATCH script

```bash
#!/usr/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=6
#SBATCH --time=0:20:00
#SBATCH --qos=normal
#SBATCH --partition=shas
#SBATCH --output=04_logs/fastp.%j.out


# 1) setup:
#  a) Load modules
#  b) make alias
module load singularity

# 'alias' doesn't work in scripts. Here's an alternative syntax to 'alias':
fastp='singularity exec /projects/dcking@colostate.edu/containers/Summit_RNAseq_container.sif fastp'
# use like:
#  $fastp arg1 arg2 ...


# 2) figure out the non-redundant list of IDs to loop over
SRRs="SRR5832182 SRR5832183 SRR5832184"


# 3) run the command in a loop for each file
for SRRID in $SRRs
do
    time $fastp -i 01_input/${SRRID}_1.fastq       -I 01_input/${SRRID}_2.fastq \
           -o 02_output/${SRRID}_trim_1.fastq -O 02_output/${SRRID}_trim_2.fastq \
           -h 02_output/${SRRID}_report.html  -j 02_output/${SRRID}_report.json \
           --thread ${SLURM_NTASKS} \
           -x -p  
done
```

***
_More advanced, but it doesn't require loop syntax and is faster._

## My ARRAY SBATCH script

SLURM will submit these jobs in parallel, and so I requested fewer resources per job. You just have to make sure to match the array parameter to the way the files are named.

The files are named SRR5832182 through SRR5832199, so I'll set the array to go from 82 to 99 and attach each index to the shared prefix `SRR58321`.

Run like:

`sbatch -a 82-99 fastp_array.sbatch`
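
To see how the array index turns into a filename, here is a sketch that simulates the array with an ordinary loop. In a real array job SLURM sets `SLURM_ARRAY_TASK_ID` for each task. Here the loop sets it by hand, and `count` is added for illustration.

```bash
# Sketch only: simulate the array indices 82-99 with a loop.
count=0
for SLURM_ARRAY_TASK_ID in $(seq 82 99)
do
    SRRID="SRR58321${SLURM_ARRAY_TASK_ID}"
    echo "task ${SLURM_ARRAY_TASK_ID} -> 01_input/${SRRID}_1.fastq"
    count=$(( count + 1 ))
done
echo "$count tasks"
```

This simple concatenation works because every ID shares the same prefix and a two-digit suffix. A range that changes its number of digits (for example 99 to 100) would need a different way to build the name.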


***
```bash
#!/usr/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --time=0:03:00
#SBATCH --qos=normal
#SBATCH --partition=shas
#SBATCH --output=04_logs/fastp.%j.%a.out

# run like
# sbatch -a 82-99 fastp_array.sbatch

# 1) setup:
#  a) Load modules
#  b) make alias
module load singularity

# 'alias' doesn't work in scripts. Here's an alternative syntax to 'alias':
fastp='singularity exec /projects/dcking@colostate.edu/containers/Summit_RNAseq_container.sif fastp'
# use like:
#  $fastp arg1 arg2 ...


# 2) Figure out the file root from the job array id
# This script must be run like:
#  sbatch --array=82-99 fastp_array.sbatch
# in order for the IDs to match up to the filenames properly.

SRRID="SRR58321${SLURM_ARRAY_TASK_ID}"

time $fastp -i 01_input/${SRRID}_1.fastq -I 01_input/${SRRID}_2.fastq \
    -o 02_output/${SRRID}_trim_1.fastq   -O 02_output/${SRRID}_trim_2.fastq \
    -h 02_output/${SRRID}_report.html    -j 02_output/${SRRID}_report.json \
    --thread ${SLURM_NTASKS} \
    -x -p  

```