# QC and trimming COL_032024 seq data 

In [None]:
#INSTALLATION
module load conda/latest
conda create -n qc
conda activate qc
conda install -c bioconda trim-galore

In [None]:
sbatch Col_qc.sh

In [None]:
#!/bin/bash
#SBATCH -c 24  # Number of Cores per Task
#SBATCH --mem=50G  # Requested Memory
#SBATCH -p cpu  # Partition
#SBATCH -t 24:00:00  # Job time limit
#SBATCH --mail-type=ALL
#SBATCH -o /work/pi_sarah_gignouxwolfsohn_uml_edu/nikea/COL/slurm-qc-%j.out  # %j = job ID

module load conda/latest

# Run qc with trim galore and fastqc
conda activate qc

# Define the paths and variables
FILEPATH='/project/pi_sarah_gignouxwolfsohn_uml_edu/nikea/COL_files/10162024fastq_data'
OUTPUT_RESULTS='/work/pi_sarah_gignouxwolfsohn_uml_edu/nikea/COL/trimmed'
NSLOTS=4  # cores passed to trim_galore via -j

# Build the list of sample IDs (one per R1/R2 pair) from the FASTQ file names
SAMPLE_NAMES_FILE="$OUTPUT_RESULTS/032024_sampleids.txt"
mkdir -p "$OUTPUT_RESULTS"
ls -1 "$FILEPATH" | sed 's/_R[12]_001\.fastq\.gz$//' | sort -u > "$SAMPLE_NAMES_FILE"

# Check if the file exists
if [ ! -e "$SAMPLE_NAMES_FILE" ]; then
    echo "Error: $SAMPLE_NAMES_FILE does not exist."
    exit 1
fi

# Read each line from the file and perform actions
while IFS= read -r sample_id; do
    # Form the full file names
    input_r1="$FILEPATH/${sample_id}_R1_001.fastq.gz"
    input_r2="$FILEPATH/${sample_id}_R2_001.fastq.gz"
    
    # Ensure the input files exist before running the tools
    if [ ! -e "$input_r1" ] || [ ! -e "$input_r2" ]; then
        echo "Warning: input files do not exist for sample $sample_id; skipping" >&2
        continue
    fi

    # Run trim_galore
    trim_galore -j "$NSLOTS" -q 20 --phred33 --length 20 --paired --fastqc -o "$OUTPUT_RESULTS" "$input_r1" "$input_r2"


done < "$SAMPLE_NAMES_FILE"

# Bash script file name: Col_qc.sh
# Job ID: 26388440
# Trimmed reads are in: /work/pi_sarah_gignouxwolfsohn_uml_edu/nikea/COL/trimmed
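
The sample-ID step in the script strips the read suffix from each FASTQ file name and de-duplicates the result. Here is a minimal sketch of that step as a standalone function, assuming files are named `<sample>_R1_001.fastq.gz` and `<sample>_R2_001.fastq.gz`. The function name `derive_sample_ids` is hypothetical and is not part of Col_qc.sh.

```shell
#!/bin/bash
# Hypothetical helper: print one sample ID per R1/R2 pair found in a directory.
# Assumes Illumina-style names: <sample>_R1_001.fastq.gz and <sample>_R2_001.fastq.gz.
derive_sample_ids() {
    local dir="$1"
    local f
    for f in "$dir"/*_R[12]_001.fastq.gz; do
        [ -e "$f" ] || continue          # skip the unexpanded pattern when nothing matches
        basename "$f" | sed -E 's/_R[12]_001\.fastq\.gz$//'
    done | sort -u
}

# Example (placeholder path):
# derive_sample_ids /path/to/fastq_dir > sampleids.txt
```

Globbing on the read suffix rather than listing the whole directory means stray files (notes, checksums) never end up in the sample list.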

This script ran well, but I maxed out workspace storage!

I removed the --dont_gzip flag, moved the sequence data to project/COL_files, and will delete the raw fastq files once Trim Galore is finished.
**These changes are reflected in the above bash script (Col_qc.sh)**.
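
To avoid filling the workspace again, a rough pre-flight check can compare the size of the input directory with the free space at the output location. This is a sketch only: the function name `has_room_for` is hypothetical, and it assumes the gzipped trimmed output is no larger than the raw input. `df` reports free space on the filesystem, not a per-user quota, so on a quota-managed cluster the site's own quota tool is the better check.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the filesystem holding OUT_DIR has at least
# as many free kilobytes as IN_DIR currently occupies.
# Assumption: trimmed, gzipped output is roughly no larger than the raw input.
has_room_for() {
    local in_dir="$1" out_dir="$2"
    local need_kb free_kb
    need_kb=$(du -sk "$in_dir" | awk '{print $1}')          # size of input, in KB
    free_kb=$(df -Pk "$out_dir" | awk 'NR==2 {print $4}')   # available KB (POSIX format)
    echo "need ~${need_kb} KB, free ${free_kb} KB"
    [ "$free_kb" -ge "$need_kb" ]
}

# Example (placeholder paths):
# has_room_for /path/to/raw_fastq /path/to/trimmed || echo "Not enough space"
```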

Output files: 

#SAMPLEID_R1_001.fastq.gz_trimming_report.txt \
#SAMPLEID_R1_001_val_1.fq.gz \
#SAMPLEID_R1_001_val_1_fastqc.html \
#SAMPLEID_R1_001_val_1_fastqc.zip \
#SAMPLEID_R2_001.fastq.gz_trimming_report.txt \
#SAMPLEID_R2_001_val_2.fq.gz \
#SAMPLEID_R2_001_val_2_fastqc.html \
#SAMPLEID_R2_001_val_2_fastqc.zip \
*kept all for multiqc*
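
Before running MultiQC, it can help to confirm that every sample produced all eight files listed above. Here is a minimal sketch, assuming the sample ID file written by Col_qc.sh and the naming pattern shown in the list. The function name `missing_outputs` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list expected Trim Galore / FastQC outputs that are missing.
# Usage: missing_outputs <sample_ids_file> <output_dir>
# Prints one missing path per line; returns non-zero if anything is missing.
missing_outputs() {
    local ids_file="$1" out_dir="$2"
    local sample_id n suffix status=0
    while IFS= read -r sample_id; do
        [ -n "$sample_id" ] || continue          # ignore blank lines
        for n in 1 2; do
            for suffix in \
                "_R${n}_001.fastq.gz_trimming_report.txt" \
                "_R${n}_001_val_${n}.fq.gz" \
                "_R${n}_001_val_${n}_fastqc.html" \
                "_R${n}_001_val_${n}_fastqc.zip"; do
                if [ ! -e "$out_dir/${sample_id}${suffix}" ]; then
                    echo "$out_dir/${sample_id}${suffix}"
                    status=1
                fi
            done
        done
    done < "$ids_file"
    return "$status"
}

# Example (placeholder paths):
# missing_outputs sampleids.txt /path/to/trimmed
```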

## MultiQC documentation
https://github.com/MultiQC/MultiQC/

In [None]:
# Installation
module load conda/latest
conda create --name multiqc python=3.11
conda activate multiqc
conda install -c bioconda multiqc 

In [None]:
conda activate multiqc 
# Run from the directory with the fastqc output
cd /work/pi_sarah_gignouxwolfsohn_uml_edu/nikea/COL/trimmed
multiqc .
conda deactivate

Output files:

#multiqc_data \
#multiqc_report.html (download this file)

### For 032024 samples

Sequence data looks good. 032024_COL_SAN_T5_158_DLAB_S15_R1_001_val_1/2 has the lowest read count (48.2 M) and 032024_COL_SAN_T5_130_MCAV_S37_R1_001_val_1/2 has the highest read count (88.6 M).
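
The read counts reported by MultiQC can be cross-checked directly against a trimmed file. Here is a minimal sketch, assuming standard four-line FASTQ records. The function name `count_reads` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count reads in a gzipped FASTQ file.
# Assumes standard 4-line FASTQ records (no wrapped sequence lines).
count_reads() {
    local fq="$1"
    gzip -dc "$fq" | awk 'END { printf "%d\n", NR / 4 }'
}

# Example (placeholder file name):
# count_reads SAMPLEID_R1_001_val_1.fq.gz
```

For a properly paired run, the R1 and R2 files of a sample should give the same count.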

Saved MultiQC General Statistics Table in COL_files.

Moved QC reports (zip, html, and txt files) to /project/pi_sarah_gignouxwolfsohn_uml_edu/nikea/COL_files/032024_fastqc_reports/

Moved trimmed seq data into respective coral species directory for next step: OFAV, PSTR, DLAB, MCAV
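
The move into species directories can be scripted. This sketch assumes the species code appears as an underscore-delimited token in each file name, as in 032024_COL_SAN_T5_158_DLAB_S15_R1_001_val_1.fq.gz. The function name `sort_by_species` and the directory layout (one subdirectory per code) are assumptions, not a record of how the files were actually moved.

```shell
#!/bin/bash
# Hypothetical helper: move trimmed reads into one subdirectory per species code.
# Assumes the species code is an underscore-delimited token in the file name,
# e.g. 032024_COL_SAN_T5_158_DLAB_S15_R1_001_val_1.fq.gz -> DLAB/
sort_by_species() {
    local src="$1" dest="$2"
    local sp f
    for sp in OFAV PSTR DLAB MCAV; do
        mkdir -p "$dest/$sp"
        for f in "$src"/*_"$sp"_*.fq.gz; do
            [ -e "$f" ] || continue      # no files for this species
            mv -- "$f" "$dest/$sp/"
        done
    done
}

# Example (placeholder paths):
# sort_by_species /path/to/trimmed /path/to/by_species
```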