# Processing Raw Sequences

Scripts used to trim raw sequences, check read quality, and map reads to the reference.

## 1. Trimming

Reads are trimmed with Trim Galore ([documentation](https://github.com/FelixKrueger/TrimGalore)).

The script runs as a Slurm job array: each array task trims one sample, so the samples are processed in parallel.
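
The trimming script reads one sample directory per array task from `sample_dirs.txt`, but the step that creates that file is not shown here. A minimal sketch of how such a list could be built and how a task picks its line is given below. The temporary directory and the `sampleA`/`sampleB`/`sampleC` names are hypothetical stand-ins, and storing absolute paths in the list is an assumption.

```shell
#!/bin/bash
# Hypothetical demo layout; on the cluster PARENT_DIR would be the real 01.RawData path
WORKDIR=$(mktemp -d)
PARENT_DIR="$WORKDIR/01.RawData"
mkdir -p "$PARENT_DIR/sampleA" "$PARENT_DIR/sampleB" "$PARENT_DIR/sampleC"

# One sample directory per line, sorted so line numbers stay the same between runs
find "$PARENT_DIR" -mindepth 1 -maxdepth 1 -type d | sort > "$WORKDIR/sample_dirs.txt"

# The line count gives the upper bound for --array=1-N
N_SAMPLES=$(wc -l < "$WORKDIR/sample_dirs.txt" | tr -d ' ')
echo "Use: #SBATCH --array=1-${N_SAMPLES}"

# Slurm sets SLURM_ARRAY_TASK_ID inside each array task; it is set by hand here
SLURM_ARRAY_TASK_ID=2
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "$WORKDIR/sample_dirs.txt")
echo "Task ${SLURM_ARRAY_TASK_ID} would process: $SAMPLE"
```

Matching the `--array` range to the number of lines in the list avoids tasks that start with an empty `$SAMPLE`.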

In [None]:
#!/bin/bash
#SBATCH --job-name=trim_galore_array
#SBATCH -c 4
#SBATCH --mem=16G
#SBATCH -p cpu
#SBATCH -t 12:00:00
#SBATCH --array=1-120
#SBATCH -o slurm-%A_%a.out
#SBATCH --mail-type=END,FAIL

#-----------------modules-----------------#
module load conda/latest

conda activate cutadapt
# trim-galore is already installed in this env 

#---------------change wd----------------#

# to scratch workspace with downloaded seqs

cd /scratch4/workspace/julia_mcdonough_student_uml_edu-novogene_dwnld

#-----------------commands----------------#

# parent directory containing sample subdirectories
PARENT_DIR="/scratch4/workspace/julia_mcdonough_student_uml_edu-novogene_dwnld/01.RawData"

# output dir for all trimmed files
OUTDIR="/scratch4/workspace/julia_mcdonough_student_uml_edu-novogene_dwnld/trimmed_all"

SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" /scratch4/workspace/julia_mcdonough_student_uml_edu-novogene_dwnld/sample_dirs.txt)

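R1=("$SAMPLE"/*_1*.fq.gz)
R2=("$SAMPLE"/*_2*.fq.gz)

echo "Running Trim Galore on: $SAMPLE"

# --paired expects the files as R1 R2 pairs (a_1 a_2 b_1 b_2 ...),
# so interleave the two lists in case a sample has more than one pair
PAIRS=()
for i in "${!R1[@]}"; do
    PAIRS+=("${R1[$i]}" "${R2[$i]}")
done

trim_galore --paired --fastqc -j 4 -o "$OUTDIR" "${PAIRS[@]}"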
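
With 120 array tasks, a failed task is easy to miss. The sketch below shows one way to check that every trimmed R1 file has a matching R2 file. It assumes the raw files end in `_1.fq.gz` and `_2.fq.gz`, so that Trim Galore's paired-end output ends in `_1_val_1.fq.gz` and `_2_val_2.fq.gz`. The temporary directory and sample names are hypothetical.

```shell
#!/bin/bash
# Hypothetical demo: fake an output directory where sampleB is missing its R2 file
OUTDIR=$(mktemp -d)
touch "$OUTDIR/sampleA_1_val_1.fq.gz" "$OUTDIR/sampleA_2_val_2.fq.gz"
touch "$OUTDIR/sampleB_1_val_1.fq.gz"

# For every trimmed R1 file, check that the matching R2 file exists
MISSING=0
for r1 in "$OUTDIR"/*_1_val_1.fq.gz; do
    r2="${r1%_1_val_1.fq.gz}_2_val_2.fq.gz"
    if [ ! -e "$r2" ]; then
        echo "Missing mate for: $(basename "$r1")"
        MISSING=$((MISSING + 1))
    fi
done
echo "Samples with a missing R2 file: $MISSING"
```

On the cluster, `OUTDIR` would point at the real `trimmed_all` directory and the `touch` lines would be dropped.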


## 2. Check quality
FastQC and MultiQC are used to check read quality after adapter trimming.

**2a. FastQC**

In [None]:
#!/bin/bash
#SBATCH --job-name=fastqc
#SBATCH -c 4
#SBATCH --mem=16G
#SBATCH -p cpu
#SBATCH -t 12:00:00
# no --array here: this script already runs FastQC on every file in a single job
#SBATCH -o slurm-%j.out
#SBATCH --mail-type=END,FAIL

#-----------------modules-----------------#
module load conda/latest

conda activate fastqc

#---------------change wd----------------#

# to the directory with the trimmed seqs

cd /scratch4/workspace/julia_mcdonough_student_uml_edu-novogene_dwnld/trimmed_all

#-----------------commands----------------#

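OUTPUT_DIR="/scratch4/workspace/julia_mcdonough_student_uml_edu-novogene_dwnld/fastqc"

# FastQC does not create the output directory, so make sure it exists
mkdir -p "$OUTPUT_DIR"

# -t 4 lets FastQC process four files at a time
fastqc -t 4 -o "$OUTPUT_DIR" *fq.gz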
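
The array version of this step reads file names from `fq_files.txt` in the trimmed directory, but the step that creates that list is not shown here. A minimal sketch of building it and working out how many array tasks are needed follows. The temporary directory and file names are hypothetical stand-ins for `trimmed_all`.

```shell
#!/bin/bash
# Hypothetical demo directory standing in for trimmed_all
INPUT_DIR=$(mktemp -d)
cd "$INPUT_DIR"
for s in A B C D E; do
    touch "sample${s}_1_val_1.fq.gz" "sample${s}_2_val_2.fq.gz"
done

# One file name per line, sorted so the line order is reproducible
printf '%s\n' *.fq.gz | sort > fq_files.txt

FILES_PER_TASK=4
N_FILES=$(wc -l < fq_files.txt | tr -d ' ')

# Round up so a final, partly filled task still gets an array index
N_TASKS=$(( (N_FILES + FILES_PER_TASK - 1) / FILES_PER_TASK ))
echo "${N_FILES} files -> #SBATCH --array=1-${N_TASKS}"
```

Sizing the array from the file count avoids launching tasks that have nothing to process.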

In [None]:
#!/bin/bash
#SBATCH --job-name=fastqc_array
#SBATCH -c 4                 # cores per task
#SBATCH --mem=16G             # memory per node
#SBATCH -p cpu
#SBATCH -t 12:00:00
#SBATCH --array=1-120         # number of array tasks
#SBATCH -o slurm-%A_%a.out
#SBATCH --mail-type=END,FAIL

# Load conda and activate environment
module load conda/latest
conda activate fastqc

# Set input and output directories
INPUT_DIR="/scratch4/workspace/julia_mcdonough_student_uml_edu-novogene_dwnld/trimmed_all"
OUTPUT_DIR="/scratch4/workspace/julia_mcdonough_student_uml_edu-novogene_dwnld/fastqc"
mkdir -p "$OUTPUT_DIR"  # FastQC does not create the output directory
cd "$INPUT_DIR"

# Number of files per task
FILES_PER_TASK=4

# Compute which lines (files) this array task will process
START=$(( (SLURM_ARRAY_TASK_ID - 1) * FILES_PER_TASK + 1 ))
END=$(( SLURM_ARRAY_TASK_ID * FILES_PER_TASK ))

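# Collect the non-empty lines (files) assigned to this task
mapfile -t FILES < <(sed -n "${START},${END}p" fq_files.txt | grep -v '^$')

# FastQC's -t is the number of files processed at once, so pass all of
# this task's files in one call to use the 4 allocated cores
if [ "${#FILES[@]}" -gt 0 ]; then
    echo "Processing: ${FILES[*]}"
    fastqc -t 4 -o "$OUTPUT_DIR" "${FILES[@]}"
fi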
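
The `START` and `END` arithmetic maps each array task to a block of lines in `fq_files.txt`. A small dry run, using a hypothetical helper `task_range` that is not part of the script above, shows which lines a few task IDs would read:

```shell
#!/bin/bash
FILES_PER_TASK=4

# Print the line range an array task would read from fq_files.txt
task_range() {
    local task_id=$1
    local start=$(( (task_id - 1) * FILES_PER_TASK + 1 ))
    local end=$(( task_id * FILES_PER_TASK ))
    echo "${start}-${end}"
}

for id in 1 2 120; do
    echo "task ${id}: lines $(task_range "$id")"
done
```

With `--array=1-120` and four files per task, the tasks cover lines 1 to 480. Any line number past the end of the list comes back empty, so those tasks do no work.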

**2b. MultiQC** combines the FastQC reports so that all 120 samples can be viewed at once