# QC of raw GPS library ONT data

Goal: assess the quality of the raw ONT data.

Approach: run FastQC and NanoPlot on the Cambridge HPC, as the raw FASTQ file is very large (~200 GB):

```bash
#!/bin/bash

#SBATCH -A JNATHAN-SL2-CPU
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH -p cclake
#SBATCH -D /home/nw416/rds/rds-jan_1-tpuFdqHBAEk/gps_ont
#SBATCH -o ont_qc.log
#SBATCH -c 20
#SBATCH -t 08:00:00
#SBATCH -J ont_qc

# Initialize Conda for script usage
source "/home/nw416/miniforge3/etc/profile.d/conda.sh"
conda activate ont

DATA_DIR=/home/nw416/rds/rds-jan_1-tpuFdqHBAEk/gps_ont
RAW_FASTQ="$DATA_DIR/284LHC_1_GPS-S1S6-L1L6.fastq.gz"

# QC of raw data
# ---------------------------------------
# Output directories are created relative to the working directory set by -D
mkdir -p fastqc_raw nanoplot_raw

# Note: FastQC's -t is the number of files processed simultaneously,
# so a single input file is still read by one thread
fastqc -t 20 "$RAW_FASTQ" -o fastqc_raw 2>&1 | tee fastqc_raw.log
NanoPlot --fastq "$RAW_FASTQ" -o nanoplot_raw --threads 20 2>&1 | tee nanoplot_raw.log
```
File name: 01_qc.sh

------

**Note: all code is run in the following Conda environment:**

```yaml
name: ont
channels:
    - conda-forge
    - bioconda
    - nodefaults
dependencies:
    - conda-forge::ipykernel
    - conda-forge::nbconvert
    - bioconda::cutadapt=5.2
    - bioconda::chopper=0.12.0
    - bioconda::nanoplot=1.46.2
    - bioconda::fastqc=0.12.1
    - conda-forge::biopython=1.86
    - conda-forge::r-tidyverse=2.0.0
    - conda-forge::r-cowplot=1.2.0
    - bioconda::minimap2=2.30
    - bioconda::samtools=1.23
```

```python
from IPython.display import IFrame

IFrame(src='fastqc/284LHC_1_GPS-S1S6-L1L6_fastqc.html', width=1400, height=800)
```

```python
from IPython.display import IFrame

IFrame(src='NanoPlot/NanoPlot-report.html', width=1400, height=800)
```

------
Gemini summary of NanoPlot results:

## NanoPlot Run Summary: ONT PromethION (R10.4.1)

### 📊 Key Performance Metrics
| Metric | Your Run | Typical "Good" PromethION | Status |
| :--- | :--- | :--- | :--- |
| **Total Yield** | **192.4 Gb** | 100–150 Gb | 🚀 **Exceptional** |
| **Median Quality** | **Q20.2** | Q18–Q20 | ✅ **High End** |
| **Read Length N50** | **10.6 kb** | 8–15 kb | ✅ **Healthy** |
| **% Reads > Q20** | **52.2%** | 30–50% | ✅ **Excellent** |
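For reference, the read length N50 is the length L such that reads of length L or longer contain at least half of all sequenced bases. A minimal sketch of that definition follows; the function name and the toy lengths are illustrative only and are not taken from this run or from NanoPlot's code.

```python
def n50(read_lengths):
    """Return the N50 of a collection of read lengths.

    The N50 is the length L such that reads of length >= L
    together contain at least half of all bases.
    """
    if not read_lengths:
        raise ValueError("read_lengths must not be empty")
    ordered = sorted(read_lengths, reverse=True)  # longest reads first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example with made-up lengths (not data from this run)
toy_lengths = [2_000, 4_000, 6_000, 8_000, 10_000]
print(n50(toy_lengths))
```

Because the N50 is weighted by bases rather than by read count, it is usually higher than the median read length.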


### 🎯 Quality Distribution
* **>Q15 (~96.8% accuracy):** 89.5% of reads (~178 Gb).
* **>Q20 (99% accuracy):** 52.2% of reads (~119 Gb).
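The accuracy figures attached to each Q cutoff follow from the Phred definition, Q = -10 · log10(P_error). A small sketch of the conversion is given below; the helper name is hypothetical and only illustrates the arithmetic.

```python
def phred_to_accuracy(q):
    """Convert a Phred quality score to the expected per-base accuracy."""
    error_probability = 10 ** (-q / 10)
    return 1 - error_probability


# Print the accuracy implied by the cutoffs and the median quality reported above
for q in (15, 20, 20.2):
    print(f"Q{q}: {phred_to_accuracy(q):.2%} expected per-base accuracy")
```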

### 💡 Analysis & Insights
This run is performing significantly above the average for a PromethION flow cell. Crossing the **190 Gb** threshold while maintaining a **median Q20.2** suggests an optimal library prep and high-quality DNA input.




# Next step

Trim the backbone sequences with cutadapt in the next step.

Optimise the mismatch tolerance used for trimming (cutadapt's maximum error rate, `-e`) to balance sensitivity and specificity.
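One way to explore the mismatch setting is to run cutadapt at several error rates on a subsample and compare the reports. The sketch below only builds the command lines. The backbone sequence, the choice of a 5' adapter (`-g`), the file names, and the error-rate grid are all placeholder assumptions and not the values used in this project.

```python
import shlex

# Hypothetical placeholder: replace with the real backbone sequence
BACKBONE_5P = "ACGTACGTACGTACGTACGT"


def cutadapt_command(error_rate, fastq_in, adapter=BACKBONE_5P, cores=20):
    """Build a cutadapt command line for one maximum error rate."""
    tag = f"e{error_rate:.2f}"
    return [
        "cutadapt",
        "-g", adapter,                      # assumed 5' backbone sequence
        "-e", str(error_rate),              # maximum allowed error rate
        "--revcomp",                        # ONT reads come from either strand
        "-j", str(cores),
        "--json", f"cutadapt_{tag}.json",   # machine-readable report
        "-o", f"trimmed_{tag}.fastq.gz",
        fastq_in,
    ]


# Illustrative grid of error rates; "subsample.fastq.gz" is a hypothetical input
for rate in (0.05, 0.10, 0.15, 0.20):
    print(shlex.join(cutadapt_command(rate, "subsample.fastq.gz")))
```

Comparing the fraction of reads with a detected backbone across the JSON reports gives a first view of how sensitivity changes with `-e`. Specificity would still need a separate check, for example by inspecting the matched regions.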