# Lecture: pre-processing and quality control with fastp

In this demo we will run fastp on a dataset I have already downloaded. The following steps make up a typical workflow.

1. Create a working directory for this exercise.
2. Locate data.
3. Set up _fastp_ through a container and make an alias.
4. Run the program on a single dataset.
5. Look at the HTML result.
6. Write a script to run the other datasets in a batch job (we'll do this from the terminal).

First, let's set aside some space in the working directory. 

In [None]:
# Step 1: create a workspace
cd /scratch/summit/$USER
mkdir -p DSCI512_RNAseq
cd DSCI512_RNAseq
pwd

Now we're going to link a data directory into the current directory. We do this because the data is very large, and copying or downloading it would take too long for a demonstration.
The link will reside in the present directory and act like any other directory, except you won't be able to change its contents: the target files are mine, so for you they are __read-only.__

In [None]:
# make directories to use through processing
# (-p means no error if a directory already exists, so this cell can be re-run)
# skipping 01_input - we will make that with a link below
mkdir -p 02_output
mkdir -p 03_scripts
mkdir -p 04_logs


In [None]:
#link to the data directory (I have already downloaded everything)
ln -sv /scratch/summit/dcking@colostate.edu/DSCI512/2019/data 01_input
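
To see how a symbolic link behaves without touching the course data, here is a small self-contained sketch. It works in a throwaway temporary directory, and the names `real_data` and `reads.txt` are made up for illustration.

```shell
# Self-contained demo in a temporary directory (all names here are hypothetical).
demo=$(mktemp -d)
mkdir "$demo/real_data"
echo "ACGT" > "$demo/real_data/reads.txt"

# Create the link: ln -s TARGET LINK_NAME
ln -s "$demo/real_data" "$demo/01_input"

# The link shows up as a small pointer (note the "->"), not a copy of the data.
ls -l "$demo"

# Files are reachable through the link as if they lived there.
content=$(cat "$demo/01_input/reads.txt")
echo "$content"

# readlink prints where the link points.
link_target=$(readlink "$demo/01_input")
echo "$link_target"

# Clean up the throwaway directory.
rm -rf "$demo"
```

Removing a link with `rm` deletes only the pointer; the target directory is left alone.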

In [None]:
# look at your directory structure
ls -l

Mine looks like this:

Notice that the files are very large, many close to 2 gigabytes. Data of this size should be kept in your scratch directory (_or in this case, mine_).

We've set up our workspace and located the data. &#10003; 


***

Run the following command to get a link to a graphical interface for this directory. 

## Open the file browser to your current location
&#8681;&#8681;&#8681;&#8681;&#8681;

In [None]:
echo https://jupyter.rc.colorado.edu/user/$USER/tree$PWD


&#8679;&#8679;&#8679;&#8679;&#8679;<font size="3">__This link opens the file browser at the current directory__</font> &#8679;&#8679;&#8679;&#8679;&#8679;
***
***
***
# Running fastp

We will run this through a singularity container:

 * Load the singularity module
 * Test the container with the full command (long)
 * Make an alias for the long command

In [None]:
# Step 3: load the module that works with containers
module load singularity
module list

The full command
`singularity exec /projects/dcking@colostate.edu/containers/Summit_RNAseq_container.sif fastp`
breaks down as follows:
 * __singularity__: the program that reads and runs a container.
 * __exec__: the subcommand (verb) meaning "execute a program inside the container".
 * ___[path to container image]___: the container itself, stored as a file called an image.
 * __fastp__: the program you want to execute.

In [None]:
# Step 4: Run fastp through the container without arguments. This prints a catalog of available flags.
singularity exec /projects/dcking@colostate.edu/containers/Summit_RNAseq_container.sif fastp

__Note__: The warning <font color=orange>WARNING: Non existent 'bind path' source: '/rc_scratch'</font> is due to the configuration and is not a problem.

In [None]:
# Make an alias for fastp
alias fastp='singularity exec /projects/dcking@colostate.edu/containers/Summit_RNAseq_container.sif fastp'

You will now be able to type _fastp_ in place of the long command.

In [None]:
# Test the alias- same output.
fastp
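
An alias is convenient at the prompt, but bash does not expand aliases in non-interactive scripts by default, which matters for the batch script in step 6. One alternative is a shell function. The following is a sketch, not the method used in this course; the names `FASTP_SIF` and `run_fastp` are made up for illustration. The function is deliberately not called `fastp`, because the alias defined above would be expanded inside the function definition.

```shell
# Path to the container image used above; FASTP_SIF is a hypothetical variable name.
FASTP_SIF=/projects/dcking@colostate.edu/containers/Summit_RNAseq_container.sif

# A function works in scripts as well as at the prompt.
# "$@" passes every argument through to fastp unchanged.
run_fastp() {
    singularity exec "$FASTP_SIF" fastp "$@"
}
```

With the singularity module loaded, calling `run_fastp` with no arguments should print the same usage message as the alias.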

__The usage message tells us that for paired-end data we need:__

`-i readfile1.fastq -I readfile2.fastq`

`-o outputfile1.fastq -O outputfile2.fastq`

`[options]`

For the options:

 * __-x__: trim polyX tails (runs of A, C, G, or T) from the 3' ends of reads.
 * __-p__: enable overrepresented sequence analysis.
 * __--thread__: the number of worker threads. We only have 1 on JupyterHub. We can use more in our script.
 * __-h, -j__: the report filenames in HTML and JSON (JavaScript Object Notation) format.
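
To check how many cores the current session actually has before choosing a `--thread` value, one option is `nproc` from GNU coreutils. This is a minimal sketch that assumes a Linux system; the variable names `cores` and `threads` are made up for illustration, and the last line only prints text rather than running fastp.

```shell
# Number of CPU cores available to this shell (nproc is part of GNU coreutils).
cores=$(nproc)
echo "cores available: $cores"

# Hypothetical variable holding the value to pass to fastp's --thread option.
threads=$cores
echo "would run: fastp --thread $threads ..."
```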

***
The backslashes `\` below allow me to wrap the command onto multiple lines.
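
A harmless way to see line continuation at work is to try it with `echo` first. The variable name `joined` is made up for this sketch.

```shell
# Each trailing backslash joins the next line onto this one,
# so the shell sees a single echo command with three arguments.
joined=$(echo one \
              two \
              three)
echo "$joined"   # prints: one two three
```

The backslash must be the very last character on the line. A space after it breaks the continuation, and the shell treats the next line as a separate command.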

In [None]:
time fastp -i 01_input/SRR5832199_1.fastq       -I 01_input/SRR5832199_2.fastq \
           -o 02_output/SRR5832199_trim_1.fastq -O 02_output/SRR5832199_trim_2.fastq \
           -h 02_output/SRR5832199_report.html  -j 02_output/SRR5832199_report.json \
           --thread 1 \
           -x -p


You'll see <font color=orange>WARNING: Non existent 'bind path' source: '/rc_scratch'</font> again before the output appears.

It is running while you still see <font color="blue">`In [*]`</font> with the asterisk.
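
Every file name in the fastp command above is built from the sample ID, so the command can be generated from that ID alone. This is the idea behind the batch script in step 6. The sketch below is one way to do it, not the script used in this course: the function names `sample_id` and `fastp_command` are made up for illustration, and the code only prints the command instead of running fastp.

```shell
# Derive the sample ID from a forward-read file name,
# e.g. 01_input/SRR5832199_1.fastq -> SRR5832199
sample_id() {
    basename "$1" _1.fastq
}

# Print (do not run) the fastp command for one sample ID.
fastp_command() {
    sample=$1
    echo "fastp -i 01_input/${sample}_1.fastq -I 01_input/${sample}_2.fastq" \
         "-o 02_output/${sample}_trim_1.fastq -O 02_output/${sample}_trim_2.fastq" \
         "-h 02_output/${sample}_report.html -j 02_output/${sample}_report.json" \
         "--thread 1 -x -p"
}

id=$(sample_id 01_input/SRR5832199_1.fastq)
fastp_command "$id"
```

Wrapping the last two lines in a loop such as `for f in 01_input/*_1.fastq` would print one command per sample, which is a safe dry run before launching the real jobs.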

# Preparing and running a batch job

Now we're going to set up the full version of this run as a batch job. Open your terminal emulator and log on to Summit.