# Processing Raw Sequences

Scripts used to trim raw sequences, check read quality, and align reads to the reference genome.

## 1. trimming 

using `trim-galore` - [documentation](https://github.com/FelixKrueger/TrimGalore)

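This script uses a Slurm job array to trim samples in parallel: each array task reads one sample directory from `sample_dirs.txt` (the line matching its `SLURM_ARRAY_TASK_ID`).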

In [None]:
#!/bin/bash
#SBATCH --job-name=trim_galore_array
#SBATCH -c 4
#SBATCH --mem=16G
#SBATCH -p cpu
#SBATCH -t 12:00:00
#SBATCH --array=1-120
#SBATCH -o slurm-%A_%a.out
#SBATCH --mail-type=END,FAIL

#-----------------modules-----------------#
module load conda/latest

conda activate cutadapt
# trim-galore is already installed in this env 

#---------------change wd----------------#

# to scratch workspace with downloaded seqs

cd /scratch4/workspace/julia_mcdonough_student_uml_edu-novogene_dwnld

#-----------------commands----------------#

# parent directory containing sample subdirectories
PARENT_DIR="/scratch4/workspace/julia_mcdonough_student_uml_edu-novogene_dwnld/01.RawData"

# output dir for all trimmed files
OUTDIR="/scratch4/workspace/julia_mcdonough_student_uml_edu-novogene_dwnld/trimmed_all"

SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" /scratch4/workspace/julia_mcdonough_student_uml_edu-novogene_dwnld/sample_dirs.txt)

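# anchor the glob on the file ending so "_1"/"_2" inside a sample name cannot match
R1=("$SAMPLE"/*_1.fq.gz)
R2=("$SAMPLE"/*_2.fq.gz)

if [[ ${#R1[@]} -ne ${#R2[@]} || ! -f "${R1[0]}" ]]; then
    echo "R1/R2 files missing or unequal in: $SAMPLE" >&2
    exit 1
fi

# trim_galore --paired expects files in the order R1 R2 [R1 R2 ...], so interleave the lists
PAIRS=()
for i in "${!R1[@]}"; do
    PAIRS+=("${R1[$i]}" "${R2[$i]}")
done

echo "Running Trim Galore on: $SAMPLE"

# note: per the Trim Galore help text, -j 4 starts more processes than the 4 cores requested with -c
trim_galore --paired --fastqc -j 4 -o "$OUTDIR" "${PAIRS[@]}"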
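The script reads `sample_dirs.txt`, but the notebook does not show how that file was made. A minimal sketch, assuming each sample has its own subdirectory directly under `01.RawData`; the function names and the demo directory are hypothetical:

```shell
#!/bin/bash
# Write one sample directory per line (sorted, so line numbers are stable).
build_sample_list() {
    local raw_dir=$1 list=$2
    find "$raw_dir" -mindepth 1 -maxdepth 1 -type d | sort > "$list"
}

# Print line N of the list: the same lookup the array task does
# with sed and SLURM_ARRAY_TASK_ID.
sample_for_task() {
    local task_id=$1 list=$2
    sed -n "${task_id}p" "$list"
}

# Demo on a throwaway directory (hypothetical sample names).
demo=$(mktemp -d)
mkdir -p "$demo/01.RawData/S01" "$demo/01.RawData/S02" "$demo/01.RawData/S03"
build_sample_list "$demo/01.RawData" "$demo/sample_dirs.txt"
echo "samples: $(wc -l < "$demo/sample_dirs.txt")"
echo "task 2:  $(sample_for_task 2 "$demo/sample_dirs.txt")"
```

The number of lines in `sample_dirs.txt` should match the upper bound of `--array` (120 here); a task whose ID is past the end of the file gets an empty `SAMPLE`.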


## 2. check quality
using FastQC and MultiQC to check quality after trimming adapters

**2a. FastQC** to generate quality assessment files

In [None]:
#!/bin/bash
#SBATCH --job-name=fastqc_array
#SBATCH -c 4                 # cores per task
#SBATCH --mem=16G             # memory per node
#SBATCH -p cpu
#SBATCH -t 12:00:00
#SBATCH --array=1-120         # number of array tasks
#SBATCH -o slurm-%A_%a.out
#SBATCH --mail-type=END,FAIL

# Load conda and activate environment
module load conda/latest
conda activate fastqc

# Set working directories
INPUT_DIR="/scratch4/workspace/julia_mcdonough_student_uml_edu-novogene_dwnld/trimmed_all"
OUTPUT_DIR="/scratch4/workspace/julia_mcdonough_student_uml_edu-novogene_dwnld/fastqc"
mkdir -p "$OUTPUT_DIR"   # FastQC does not create the output directory itself
cd "$INPUT_DIR"

# Number of files per task
FILES_PER_TASK=4

# Compute which lines (files) this array task will process
START=$(( (SLURM_ARRAY_TASK_ID - 1) * FILES_PER_TASK + 1 ))
END=$(( SLURM_ARRAY_TASK_ID * FILES_PER_TASK ))

# Loop over assigned files
for i in $(seq $START $END); do
    FILE=$(sed -n "${i}p" fq_files.txt)
    if [ -n "$FILE" ]; then  # skip if line is empty
        echo "Processing $FILE"
        fastqc -t 2 -o "$OUTPUT_DIR" "$FILE"
    fi
done
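The chunking arithmetic above can be checked outside Slurm. A small sketch with hypothetical helper names: `chunk_range` prints the first and last line of `fq_files.txt` that a task would read, and `tasks_needed` gives the array size required to cover a given number of files.

```shell
#!/bin/bash
# Print "START END" (1-based line numbers) for one array task.
chunk_range() {
    local task_id=$1 per_task=$2
    echo "$(( (task_id - 1) * per_task + 1 )) $(( task_id * per_task ))"
}

# Number of array tasks needed to cover n_files (ceiling division).
tasks_needed() {
    local n_files=$1 per_task=$2
    echo $(( (n_files + per_task - 1) / per_task ))
}

chunk_range 1 4      # first task
chunk_range 120 4    # last task
tasks_needed 480 4   # array size for 480 files, 4 per task
```

With 4 files per task, 120 tasks cover lines 1 to 480 of `fq_files.txt`, which is assumed to list one trimmed file per line.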

**2b. MultiQC** to view all 120 samples (480 files) at once

> MultiQC runs quickly, so it can be run directly in the terminal without submitting a job. Run it from the FastQC output directory; the `.html` and `.zip` files for each sample need to be in the same directory.

In [None]:
multiqc .
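Since the FastQC `.html` and `.zip` files are expected side by side, a quick check before running MultiQC can catch an incomplete FastQC run. A sketch, assuming FastQC's usual `*_fastqc.html` / `*_fastqc.zip` naming; the function name and demo files are hypothetical:

```shell
#!/bin/bash
# List FastQC reports whose .html has no matching .zip in the same directory.
missing_zips() {
    local dir=$1 html
    for html in "$dir"/*_fastqc.html; do
        [ -e "$html" ] || continue          # glob matched nothing
        [ -f "${html%.html}.zip" ] || basename "$html"
    done
}

# Demo with empty placeholder files (hypothetical sample names).
qc=$(mktemp -d)
touch "$qc/A_1_fastqc.html" "$qc/A_1_fastqc.zip" "$qc/B_1_fastqc.html"
missing_zips "$qc"
```

No output means every report has its archive.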

## 3. alignment

using `hisat2` ([manual](https://daehwankimlab.github.io/hisat2/manual/)) - following the pipeline from the [how-to page](https://daehwankimlab.github.io/hisat2/howto/)

(using hisat2 instead of bowtie2 because bowtie2 isn't splice-aware)

#### 3a. build genome index with exons and splice sites
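download reference genome (FASTA, `.fna`) and annotation file (GFF3 from NCBI; used to make the exon and splice site files)

In [None]:
# reference genome (used by hisat2-build)
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/022/765/GCF_002022765.2_C_virginica-3.0/GCF_002022765.2_C_virginica-3.0_genomic.fna.gz
gunzip GCF_002022765.2_C_virginica-3.0_genomic.fna.gz

# annotation
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/022/765/GCF_002022765.2_C_virginica-3.0/GCF_002022765.2_C_virginica-3.0_genomic.gff.gz
gunzip GCF_002022765.2_C_virginica-3.0_genomic.gff.gz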

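make exon and splice site files from the annotation. The hisat2 extraction scripts parse GTF-style attributes (`gene_id`, `transcript_id`), so the GFF3 file is converted to GTF first.

In [None]:
# Convert GFF3 to GTF (assumes gffread is installed in the active environment)
gffread GCF_002022765.2_C_virginica-3.0_genomic.gff -T -o GCF_002022765.2_C_virginica-3.0_genomic.gtf

# Extract splice sites
hisat2_extract_splice_sites.py GCF_002022765.2_C_virginica-3.0_genomic.gtf > oyster.ss

# Extract exons
hisat2_extract_exons.py GCF_002022765.2_C_virginica-3.0_genomic.gtf > oyster.exon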
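A quick sanity check before building the index: confirm that both extracted files exist and are non-empty. This is only a sketch, and the helper name `check_nonempty` is hypothetical.

```shell
#!/bin/bash
# Report whether each file exists and is non-empty; return 1 if any is not.
check_nonempty() {
    local f status=0
    for f in "$@"; do
        if [ -s "$f" ]; then
            echo "ok    $f ($(wc -l < "$f" | tr -d ' ') lines)"
        else
            echo "EMPTY $f"
            status=1
        fi
    done
    return "$status"
}

# Usage in the ref_files directory (file names from the step above):
#   check_nonempty oyster.ss oyster.exon
```

An empty file here usually points at the annotation format rather than at hisat2 itself.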


Build HGFM index with exon and splice site info using the files above

(HGFM = hierarchical graph FM index; the hisat2 how-to page uses this name for an index built with transcript information)

In [None]:
#!/bin/bash
#SBATCH --job-name=hisat2_build
#SBATCH --cpus-per-task=16    # cores per task (matches hisat2-build -p 16)
#SBATCH --mem=64G             # memory per node
#SBATCH -p cpu
#SBATCH -t 1:00:00
#SBATCH -o hisat2_build.log
#SBATCH --mail-type=END,FAIL

#-----------------modules-----------------#
module load conda/latest

conda activate hisat2-env

#---------------change wd----------------#

cd /work/pi_sarah_gignouxwolfsohn_uml_edu/julia_mcdonough_student_uml_edu/ref_files

#-----------------commands----------------#

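# write the index into hisat2_index/, where the alignment script looks for it
mkdir -p hisat2_index

hisat2-build -p 16 \
--exon oyster.exon \
--ss oyster.ss \
GCF_002022765.2_C_virginica-3.0_genomic.fna hisat2_index/oyster_index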


#### 3b. align sequences to genome index

In [None]:
#!/bin/bash
#SBATCH --job-name=hisat2_align
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G
#SBATCH -p cpu
#SBATCH -t 4:00:00
#SBATCH -o hisat2_align_%j.log
#SBATCH --mail-type=END,FAIL

#-----------------modules-----------------#

module load conda/latest

conda activate hisat2-env

#-----------------set paths----------------#

INDEX="/work/pi_sarah_gignouxwolfsohn_uml_edu/julia_mcdonough_student_uml_edu/ref_files/hisat2_index"
INPUT_DIR="/scratch4/workspace/julia_mcdonough_student_uml_edu-novogene_dwnld/trimmed_all"
OUTPUT_DIR="/scratch4/workspace/julia_mcdonough_student_uml_edu-novogene_dwnld/hisat2-align"

#-----------------commands----------------#

mkdir -p "$OUTPUT_DIR"   # make sure the output directory exists

for R1 in "${INPUT_DIR}"/*_gi_1_val_1.fq.gz; do
    # Remove _gi_1_val_1.fq.gz to get sample base name
    SAMPLE=$(basename "$R1" _gi_1_val_1.fq.gz)
    
    # Construct the matching R2 file name (gi_2, val_2)
    R2="${INPUT_DIR}/${SAMPLE}_gi_2_val_2.fq.gz"
    
    # Skip the sample if R2 is missing
    if [[ ! -f "$R2" ]]; then
        echo "Warning: R2 file not found: $R2"
        continue
    fi
    
    echo "Aligning $SAMPLE..."
    echo "R1: $R1"
    echo "R2: $R2"
    
    # SAM goes to OUTPUT_DIR; hisat2's alignment summary (stderr) goes to a per-sample log
    hisat2 -p 16 \
        -x "$INDEX/oyster_index" \
        -1 "$R1" \
        -2 "$R2" \
        -S "${OUTPUT_DIR}/${SAMPLE}.sam" \
        2> "${OUTPUT_DIR}/${SAMPLE}.log"
done
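Each per-sample `.log` written by the loop holds the hisat2 alignment summary, so the overall rates can be collected into one table. A sketch, assuming each log contains a line of the form `NN.NN% overall alignment rate`; the function name is hypothetical and the demo log is a placeholder, not a real result:

```shell
#!/bin/bash
# Print "sample<TAB>rate" for every hisat2 log in a directory.
summarize_rates() {
    local dir=$1 log rate
    for log in "$dir"/*.log; do
        [ -e "$log" ] || continue           # glob matched nothing
        rate=$(awk '/overall alignment rate/ {print $1}' "$log")
        printf '%s\t%s\n' "$(basename "$log" .log)" "${rate:-NA}"
    done
}

# Demo with a placeholder log (made-up numbers, for illustration only).
out=$(mktemp -d)
printf '1000 reads; of these:\n90.00%% overall alignment rate\n' > "$out/demo_sample.log"
summarize_rates "$out"
```

Pointing `summarize_rates` at the `hisat2-align` directory gives one line per sample, with `NA` for any log that lacks a summary line (for example, a run that failed partway).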