# HW06: Quality Control and Trimming

This notebook will go through the workflow for read quality control and trimming. 

1. Write the run script to check the quality of the reads BEFORE trimming (using fastqc)
2. Write the run script to trim and filter low-quality reads with [Trimmomatic](https://carpentries-lab.github.io/metagenomics-analysis/03-trimming-filtering/index.html).
3. Write the run script to check the quality of the reads AFTER trimming (using fastqc)
4. Launch each of the run scripts using the launcher script.


## Getting Started

Before we get started you will need to set several variables that we will use throughout this notebook. 

In [None]:
# set the variables for your netid and xfile
# note that each person has 8 SRA accession ids in the xfile.
netid = "MY_NETID"
xfile = "MY_XFILE"

In [None]:
# Go into the working directory
work_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/06_qc_trimming"
%cd $work_dir

In [None]:
# Set the fastq directory. This is where we have our raw fastq files.
fastq_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/05_getting_data"

## Creating a config file
Each of the run scripts below relies on certain variables being set. Rather than editing the code in each script, we will define all of these variables in one config file. When a script needs the variables, it will "source" the config file to set them. This is generally good practice when writing scripts for the HPC, because you only need to modify the config file (rather than each individual run script).

In [None]:
# create a config file with all of the variables you need
!echo "export NETID=$netid" > config.sh
!echo "export XFILE=$xfile" >> config.sh
!echo "export FASTQC=/contrib/singularity/shared/bhurwitz/fastqc-0.11.9.sif" >> config.sh
!echo "export TRIMMOMATIC=/contrib/singularity/shared/bhurwitz/trimmomatic:0.39--hdfd78af_2.sif" >> config.sh
!echo "export WORK_DIR=$work_dir" >> config.sh
!echo "export FASTQ_DIR=$fastq_dir" >> config.sh

In [None]:
# check the config file to be sure it is correct
# Is your netid and xfile correct? Do you have the right working directory?
!cat config.sh
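
If you would like an automated check in addition to reading the file by eye, the idea can be sketched as below. This is only an illustration: the helper names `parse_config` and `missing_settings` are made up for this example, and the parser assumes every line has the simple `export NAME=value` form written by the cell above.

```python
# Names the run scripts expect to find in config.sh (taken from the cell above).
REQUIRED = ["NETID", "XFILE", "FASTQC", "TRIMMOMATIC", "WORK_DIR", "FASTQ_DIR"]


def parse_config(text):
    """Parse lines of the form `export NAME=value` into a dictionary."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("export "):
            continue  # skip blank lines and anything that is not an export
        name, _, value = line[len("export "):].partition("=")
        settings[name] = value
    return settings


def missing_settings(settings):
    """Return the required names that are absent or set to an empty value."""
    return [name for name in REQUIRED if not settings.get(name)]


# Hypothetical, deliberately incomplete config text used only for this demo.
example = """export NETID=MY_NETID
export XFILE=MY_XFILE
export WORK_DIR=
"""
example_settings = parse_config(example)
print(missing_settings(example_settings))
```

To check your real file, you could pass `open("config.sh").read()` to `parse_config` instead of the example text. An empty list means every expected variable has a value.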

### Step 1: Assessing Read Quality for the Raw Reads from the SRA

Now that we have all of our raw sequence data downloaded, we are ready to start the quality control process. We will use a tool called fastqc that generates a report about the quality of our sequence data. First, we create reports showing us the quality of the reads from each accession BEFORE trimming. That way we can see how well our trimming step works.

Let's create a run script to run the fastqc program.

In [None]:
# Create a script to run fastqc on each of our accessions
# A few important points:
# 1. We are using the variables from the config file via the `source ./config.sh` command in the script.
# 2. fastqc runs on each of the fastq files in the $FASTQ_DIR
# 3. We will copy the reports to our home directory so you can visualize these on the HPC (via Jupyter)

my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=10:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-7                      
#SBATCH --output=06A_fastqc-%a.out
#SBATCH --cpus-per-task=1                  
#SBATCH --mem=4G                           

pwd; hostname; date

source ./config.sh
names=($(cat $FASTQ_DIR/$XFILE))

apptainer run ${FASTQC} fastqc $FASTQ_DIR/${names[${SLURM_ARRAY_TASK_ID}]}_*.fastq*

TRIM_DIR="${WORK_DIR}/before_qc_trimming"
if [[ ! -d "$TRIM_DIR" ]]; then
  echo "$TRIM_DIR does not exist. Directory created"
  mkdir -p $TRIM_DIR
fi

mv $FASTQ_DIR/${names[${SLURM_ARRAY_TASK_ID}]}_*_fastqc.html $TRIM_DIR
mv $FASTQ_DIR/${names[${SLURM_ARRAY_TASK_ID}]}_*_fastqc.zip $TRIM_DIR
cp -r $TRIM_DIR ~/be487-fall-2024/assignments/06_qc_trimming
 
'''

with open('06A_run_fastqc.sh', mode='w') as file:
    file.write(my_code)
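
The way `--array=0-7` picks one accession per task can be sketched in Python. This is an illustration only, not part of the pipeline: the function name `accession_for_task` and the accession IDs are made-up placeholders, and the whitespace split roughly mirrors what `names=($(cat $FASTQ_DIR/$XFILE))` does in the run script.

```python
def accession_for_task(xfile_text, task_id):
    """Return the accession that a given array task would process."""
    # Splitting on whitespace gives one entry per accession in the xfile.
    names = xfile_text.split()
    if not 0 <= task_id < len(names):
        raise IndexError(
            f"task {task_id} has no accession; the xfile lists {len(names)}"
        )
    return names[task_id]


# Placeholder xfile contents with 8 made-up accession IDs.
example_xfile = "\n".join(f"SRR000000{i}" for i in range(1, 9))

# Tasks 0 through 7 each select a different accession.
for task_id in range(8):
    print(task_id, accession_for_task(example_xfile, task_id))
```

Because the array indices start at 0, an xfile with 8 accessions needs exactly the range `0-7`. A larger range would leave some tasks without an accession to process.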

### Step 2: Creating a run script to trim and filter bad reads from the .fastq files

To run Trimmomatic in PE (paired-end) mode, we need two input files for each accession from the SRA: `*_1.fastq.gz` and `*_2.fastq.gz`. You downloaded these in 05_getting_data.

### Initial Data Management

Trimmomatic will give us 4 output files (forward paired, forward unpaired, reverse paired, and reverse unpaired). To keep our data organized, let's create output directories so the script can organize our data as it runs.


In [None]:
# Create the trimmed and unpaired directories
import os

trim_dir = work_dir + "/trimmed_reads"
unpair_dir = work_dir + "/unpaired_reads"

if os.path.isdir(trim_dir):
    print("trim_dir exists")
else:
    os.mkdir(trim_dir)

if os.path.isdir(unpair_dir):
    print("unpair_dir exists")
else:
    os.mkdir(unpair_dir)
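
To make the four outputs concrete, here is a small sketch of where each file ends up for one accession, following the directory layout used in this notebook. The function name `trimmomatic_outputs`, the work directory, and the accession ID are placeholders invented for the example.

```python
def trimmomatic_outputs(work_dir, accession):
    """Return the four Trimmomatic PE output paths for one accession."""
    trim_dir = f"{work_dir}/trimmed_reads"
    unpair_dir = f"{work_dir}/unpaired_reads"
    return {
        # Reads whose mate also survived trimming
        "forward_paired": f"{trim_dir}/{accession}_1.fastq.gz",
        "reverse_paired": f"{trim_dir}/{accession}_2.fastq.gz",
        # Reads whose mate was discarded
        "forward_unpaired": f"{unpair_dir}/{accession}_1.fastq.gz",
        "reverse_unpaired": f"{unpair_dir}/{accession}_2.fastq.gz",
    }


# Placeholder values, not a real directory or accession.
outputs = trimmomatic_outputs("/path/to/work_dir", "SRR0000001")
for label, path in outputs.items():
    print(f"{label:17s} {path}")
```

Note that the paired and unpaired files share the same file names, so keeping them in separate directories is what prevents one from overwriting the other.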

In [None]:
# we need to copy the adapter file into your current working directory
!cp ~/be487-fall-2024/assignments/06_qc_trimming/TruSeq3-PE-2.fa .

In [None]:
# Let's create a run script that runs trimmomatic on all of our fastq files
# you can only run this after the *.fastq files are gzipped (be sure you checked via 05_getting_data_check)
my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=10:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-7
#SBATCH --output=Job-trim-%a.out
#SBATCH --cpus-per-task=1                   
#SBATCH --mem=4G                   
 
pwd; hostname; date
source ./config.sh
names=($(cat ${FASTQ_DIR}/${XFILE}))

TRIM_DIR="${WORK_DIR}/trimmed_reads"
UNPAIR_DIR="${WORK_DIR}/unpaired_reads"

apptainer run ${TRIMMOMATIC} trimmomatic PE -phred33 \
    ${FASTQ_DIR}/${names[${SLURM_ARRAY_TASK_ID}]}_1.fastq.gz ${FASTQ_DIR}/${names[${SLURM_ARRAY_TASK_ID}]}_2.fastq.gz \
    ${TRIM_DIR}/${names[${SLURM_ARRAY_TASK_ID}]}_1.fastq.gz ${UNPAIR_DIR}/${names[${SLURM_ARRAY_TASK_ID}]}_1.fastq.gz \
    ${TRIM_DIR}/${names[${SLURM_ARRAY_TASK_ID}]}_2.fastq.gz ${UNPAIR_DIR}/${names[${SLURM_ARRAY_TASK_ID}]}_2.fastq.gz \
    ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 SLIDINGWINDOW:4:20
'''

with open('06B_run_trimmomatic.sh', mode='w') as file:
    file.write(my_code)
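
The `SLIDINGWINDOW:4:20` setting scans each read with a 4-base window and cuts the read once the average quality in the window drops below 20. The sketch below illustrates that idea on a list of quality scores. It is a simplified model and not Trimmomatic's actual implementation: the function name `sliding_window_keep_length` is hypothetical, and keeping reads shorter than the window unchanged is an assumption made for this example.

```python
def sliding_window_keep_length(quals, window=4, required=20):
    """Return how many bases to keep from the start of the read.

    quals    : list of per-base quality scores (Phred scale)
    window   : number of bases averaged together
    required : minimum average quality a window must have
    """
    if len(quals) < window:
        # Assumption for this sketch: too short to evaluate, keep as-is.
        return len(quals)
    for start in range(len(quals) - window + 1):
        current = quals[start:start + window]
        if sum(current) / window < required:
            # Cut at the start of the first window that fails.
            return start
    return len(quals)


# Made-up quality scores that decline toward the end of the read.
example_quals = [30, 32, 31, 30, 28, 25, 12, 10, 8, 5]
keep = sliding_window_keep_length(example_quals)
print("bases kept:", keep)
print("trimmed read qualities:", example_quals[:keep])
```

In this example the first four windows average above 20, but the window starting at the fifth base (28, 25, 12, 10) averages 18.75, so only the first four bases are kept.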

## Step 3: QC Final Check

Create a run script that performs a final quality control check, using fastqc, on the trimmed fastq files.

This script will use the fastqc tool and is similar to the script in Step 1, but it will check the reads that are in the "trimmed_reads" directory. The results should be output to the "after_qc_trimming" directory.

If you have any doubts about the trimming process, you can always run fastqc on the trimmed data and double check that you see all "green". You can open the fastqc reports in Jupyter to look for any failures or other warnings.

In [None]:
# Create a script to run fastqc on each of our accessions
# Round 2! This will check the fastq files after screening and cleaning with trimmomatic

my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=10:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-7                      
#SBATCH --output=06C_fastqc-%a.out
#SBATCH --cpus-per-task=1                  
#SBATCH --mem=4G                           

pwd; hostname; date

source ./config.sh
names=($(cat $FASTQ_DIR/$XFILE))

apptainer run ${FASTQC} fastqc ${WORK_DIR}/trimmed_reads/${names[${SLURM_ARRAY_TASK_ID}]}_*.fastq*

TRIM_DIR="${WORK_DIR}/after_qc_trimming"
if [[ ! -d "$TRIM_DIR" ]]; then
  echo "$TRIM_DIR does not exist. Directory created"
  mkdir -p $TRIM_DIR
fi

mv ${WORK_DIR}/trimmed_reads/${names[${SLURM_ARRAY_TASK_ID}]}_*_fastqc.html $TRIM_DIR
mv ${WORK_DIR}/trimmed_reads/${names[${SLURM_ARRAY_TASK_ID}]}_*_fastqc.zip $TRIM_DIR
cp -r $TRIM_DIR ~/be487-fall-2024/assignments/06_qc_trimming
 
'''

with open('06C_run_fastqc.sh', mode='w') as file:
    file.write(my_code)

## Step 4: Putting it all together

Once you have created the run scripts, you are ready to put them together in a pipeline that runs each of the steps one by one. Notice which steps are dependent on the others.

In [None]:
# Let's create the launcher script to kick off our pipeline.

my_code = '''#! /bin/bash

# 06A_run_fastqc: first job - no dependencies
job1=$(sbatch 06A_run_fastqc.sh)
jid1=$(echo $job1 | sed 's/^Submitted batch job //')
echo $jid1

# 06B_run_trimmomatic: jid2 depends on jid1
job2=$(sbatch --dependency=afterok:$jid1 06B_run_trimmomatic.sh)
jid2=$(echo $job2 | sed 's/^Submitted batch job //')
echo $jid2

# 06C_run_fastqc: jid3 depends on jid2
job3=$(sbatch --dependency=afterok:$jid2 06C_run_fastqc.sh)
jid3=$(echo $job3 | sed 's/^Submitted batch job //')
echo $jid3

'''

with open('06_launch_pipeline.sh', mode='w') as file:
    file.write(my_code)
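
The launcher relies on two small ideas: pulling the job ID out of the text that `sbatch` prints, and passing that ID to the next job with `--dependency=afterok:`. The Python sketch below shows the same logic without submitting anything. The helper names and the job ID are hypothetical and only illustrate what the `sed` commands in the launcher do.

```python
import re


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from a 'Submitted batch job N' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"no job ID found in: {sbatch_output!r}")
    return match.group(1)


def dependent_sbatch_command(job_id, script):
    """Build the sbatch command that waits for job_id to finish successfully."""
    return ["sbatch", f"--dependency=afterok:{job_id}", script]


# Made-up sbatch message; a real job ID will differ.
jid1 = parse_job_id("Submitted batch job 1234567")
print(jid1)
print(" ".join(dependent_sbatch_command(jid1, "06B_run_trimmomatic.sh")))
```

With `afterok`, the dependent job starts only if the earlier job completes without errors, which is why a failure in one step keeps the later steps from running.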

In [None]:
!chmod +x *sh

In [None]:
# now let's run it!
!./06_launch_pipeline.sh

In [None]:
# You can check if it is running using the squeue command
# Check for all jobs under your netid
# Notice that the 06B jobs depend on the 06A jobs finishing, and so on.
!squeue --user=$netid

### What happens next?

Your code will take a little time to get "picked up" by the HPC and move from PD (pending) to R (running). Come back in about a day to double check your results using the hw06_check.ipynb notebook. But, for now, relax and enjoy your day!

### Final Step
Copy your notebook to the current working directory.

In [None]:
!cp ~/be487-fall-2024/assignments/06_qc_trimming/hw06_qc_trimming.ipynb $work_dir