🧠 Goal of Lesson 3
Assess the quality of your raw reads with FastQC

Summarize results with MultiQC

Trim adapters and low-quality sequences with fastp

✅ 0. Process all .sra files inside the raw_data folder:

In [None]:
bash extract_all_fastq_parallel.sh
# or, for the non-parallel version:
bash extract_all_fastq.sh
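
Before moving on to QC, it can help to confirm that every sample folder really contains both mates of a read pair. The following is a minimal sketch, assuming one subfolder per sample holding *_1.fastq.gz and *_2.fastq.gz files (the layout the trimming script later in this lesson relies on). The function name check_pairs is hypothetical and not part of any tool.

```shell
#!/bin/bash
# Sketch: list sample folders that lack a complete read pair.
# Assumption: one subfolder per sample with *_1.fastq.gz and *_2.fastq.gz.

# check_pairs is a hypothetical helper name chosen for this example.
check_pairs() {
    local raw_dir="${1:-.}"
    local folder r1 r2 status=0
    while IFS= read -r folder; do
        r1=$(find "$folder" -name "*_1.fastq.gz" | head -n1)
        r2=$(find "$folder" -name "*_2.fastq.gz" | head -n1)
        if [[ -z "$r1" || -z "$r2" ]]; then
            echo "Incomplete pair: $(basename "$folder")"
            status=1
        fi
    done < <(find "$raw_dir" -mindepth 1 -maxdepth 1 -type d | sort)
    # Non-zero exit status if at least one folder is incomplete
    return "$status"
}
```

Run from inside raw_data, `check_pairs .` prints nothing when every folder holds a complete pair.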

✅ 1. Quality Check with FastQC
You have already used fastqc; here is an efficient way to run it on every file at once, assuming your FASTQ files are in raw_data and organized in subfolders.
If you have not done so already, create and run this script:

In [None]:
#!/bin/bash
# File: run_fastqc.sh
# Location: rnaseq_project/raw_data/

mkdir -p ../fastqc_results

# IFS= and -r keep file names with spaces or backslashes intact
find . -name "*.fastq.gz" | while IFS= read -r fq; do
    echo "Running FastQC on $fq"
    fastqc "$fq" -o ../fastqc_results
done

In [None]:
bash run_fastqc.sh
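
After the run, a quick way to see whether FastQC produced a report for every input is to compare file names. This is a minimal sketch that relies on FastQC's default naming, where sample_1.fastq.gz yields sample_1_fastqc.html. The function name missing_fastqc_reports and its default paths are assumptions for this example.

```shell
#!/bin/bash
# Sketch: print every FASTQ file that has no matching FastQC HTML report.
# Assumption: FastQC default naming (<name>.fastq.gz -> <name>_fastqc.html).

# missing_fastqc_reports is a hypothetical helper name.
missing_fastqc_reports() {
    local raw_dir="${1:-.}"
    local results_dir="${2:-../fastqc_results}"
    local fq base
    while IFS= read -r fq; do
        base=$(basename "$fq" .fastq.gz)
        if [[ ! -f "$results_dir/${base}_fastqc.html" ]]; then
            echo "$fq"
        fi
    done < <(find "$raw_dir" -name "*.fastq.gz" | sort)
}
```

Called from raw_data as `missing_fastqc_reports . ../fastqc_results`, it prints nothing when all reports are present.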

✅ 2. Summarize with MultiQC
Install MultiQC if not done yet:

conda install -c bioconda multiqc

Then run:

In [None]:
cd ../fastqc_results
multiqc .

This generates a multiqc_report.html file summarizing all FastQC results. Open it in your browser or in JupyterLab.

✅ 3. Trim Reads with fastp
Install:

conda install -c bioconda fastp

Now create a script that automatically trims all read pairs:

In [None]:
#!/bin/bash
# File: trim_all_fastq_parallel.sh
# Location: rnaseq_project/raw_data/
# Runs 4 fastp jobs in parallel with 2 threads each (8 threads total),
# sized for a machine with 4 cores / 8 threads

# Create output directory
mkdir -p ../trimmed_data

# Function to process a single sample
process_sample() {
    local folder="$1"
    local r1=$(find "$folder" -name "*_1.fastq.gz" | head -n1)
    local r2=$(find "$folder" -name "*_2.fastq.gz" | head -n1)
    local sample=$(basename "$folder")

    if [[ -f "$r1" && -f "$r2" ]]; then
        echo "Trimming $sample"
        
        fastp \
          -i "$r1" \
          -I "$r2" \
          -o ../trimmed_data/"${sample}"_1.trimmed.fastq.gz \
          -O ../trimmed_data/"${sample}"_2.trimmed.fastq.gz \
          --detect_adapter_for_pe \
          --thread 2 \
          --html ../trimmed_data/"${sample}".html \
          --json ../trimmed_data/"${sample}".json
    fi
}

export -f process_sample

# Find all sample directories and process them in parallel (requires GNU parallel)
find . -maxdepth 1 -type d ! -path . | \
    parallel --jobs 4 --load 100% process_sample

echo "All trimming jobs completed!"
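
The script above hard-codes 4 parallel jobs with 2 fastp threads each, which adds up to 8 threads. On a machine with a different CPU count, the job number could be derived instead of fixed. The following is a rough sketch of that arithmetic; the function name jobs_for_cores is hypothetical, and the idea of dividing available threads by threads per job is an assumption about how you want to size the run, not a fastp or GNU parallel recommendation.

```shell
#!/bin/bash
# Sketch: derive the number of parallel jobs from the available CPU threads.
# jobs = available threads / threads per fastp job, with a minimum of 1.

# jobs_for_cores is a hypothetical helper name.
jobs_for_cores() {
    local threads_available="$1"
    local threads_per_job="${2:-2}"
    local jobs=$(( threads_available / threads_per_job ))
    if (( jobs < 1 )); then
        jobs=1
    fi
    echo "$jobs"
}
```

In the script, the last line of the pipeline could then read `parallel --jobs "$(jobs_for_cores "$(nproc)" 2)" --load 100% process_sample`, where nproc reports the number of processing units available.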

In [None]:
bash trim_all_fastq_parallel.sh

Now all the reads are trimmed :)
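
To double-check that claim, the output names used by the trimming script (<sample>_1.trimmed.fastq.gz and <sample>_2.trimmed.fastq.gz in ../trimmed_data) can be compared against the sample folders. This is a minimal sketch; the function name missing_trimmed_samples is hypothetical.

```shell
#!/bin/bash
# Sketch: print every sample folder without a non-empty pair of trimmed files.
# Assumption: outputs follow the naming used by trim_all_fastq_parallel.sh.

# missing_trimmed_samples is a hypothetical helper name.
missing_trimmed_samples() {
    local raw_dir="${1:-.}"
    local out_dir="${2:-../trimmed_data}"
    local folder sample
    while IFS= read -r folder; do
        sample=$(basename "$folder")
        # -s is true only for files that exist and are not empty
        if [[ ! -s "$out_dir/${sample}_1.trimmed.fastq.gz" ||
              ! -s "$out_dir/${sample}_2.trimmed.fastq.gz" ]]; then
            echo "$sample"
        fi
    done < <(find "$raw_dir" -mindepth 1 -maxdepth 1 -type d | sort)
}
```

Run from raw_data as `missing_trimmed_samples . ../trimmed_data`. Folders that the trimming script skipped because a mate was missing will also show up in this list.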

////////////////////////////////////////////////////
Useful command (run in a terminal) to monitor fasterq-dump, pigz, gzip, and fastp processes:

while true; do date -u "+%Y-%m-%d %H:%M:%S"; echo "User: $USER"; echo "----------------------------------------"; ps aux | grep -E "fasterq-dump|pigz|gzip|fastp" | grep -v "grep"; echo; sleep 30; done