# Taxonomic Profiling Pipeline
This notebook runs a read preprocessing and taxonomic classification workflow:
- QC (FastQC, fastp, MultiQC)
- Host read removal (Kneaddata)
- Taxonomic profiling (Kraken2 + Bracken)
- Visualization (Krona)


In [None]:
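# Install dependencies from bioconda. This assumes mamba is already available in the runtime;
# a stock Colab runtime does not provide it, so adjust to your environment.
!mamba install -y -c conda-forge -c bioconda fastqc fastp multiqc kneaddata kraken2 bracken krona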


## 1. Set configuration
Define input reads, databases, and output directory.


In [None]:
import os

READ1 = "/content/data/sample_R1.fastq.gz"
READ2 = "/content/data/sample_R2.fastq.gz"

OUTPUT_DIR = "/content/output"
OUTPUT_PREFIX = "sample"

KNEADDATA_DB = "/content/databases/kneaddata/human_genome"
# Database name as listed by `kneaddata_database --available`
KNEADDATA_DB_TYPE = "human_genome"

KRAKEN2_DB = "/content/databases/kraken2"
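# Check this URL against the current Kraken2 prebuilt index listing before running;
# archive names change between releases.
KRAKEN2_DB_URL = "https://genome-idx.s3.amazonaws.com/kraken/k2_standard_8gb_202310.tgz"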

BRACKEN_READ_LEN = 150
BRACKEN_KMER = 35

THREADS = 4

os.makedirs(OUTPUT_DIR, exist_ok=True)
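Before starting the longer steps, it can help to confirm that the configured input files exist. The following is a minimal sketch; `missing_inputs` is a hypothetical helper, not part of any tool used in this notebook.

```python
import os


def missing_inputs(paths):
    """Return the paths from `paths` that do not exist on disk, in order."""
    return [p for p in paths if not os.path.exists(p)]
```

In this notebook it could be called as `missing_inputs([READ1, READ2])`; an empty list means both read files were found.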


## 2. Database setup
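Download the Kneaddata and Kraken2 databases if they are missing, and build the Bracken k-mer distribution only if the Kraken2 database does not already contain one for the chosen read length.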


In [None]:
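# Kneaddata DB: download into KNEADDATA_DB itself, the same path later passed to -db
if not (os.path.isdir(KNEADDATA_DB) and os.listdir(KNEADDATA_DB)):
    os.makedirs(KNEADDATA_DB, exist_ok=True)
    !kneaddata_database --download {KNEADDATA_DB_TYPE} bowtie2 {KNEADDATA_DB}

# Kraken2 DB: extract into KRAKEN2_DB (assumes the archive unpacks without a top-level directory)
if not os.path.exists(os.path.join(KRAKEN2_DB, "hash.k2d")):
    os.makedirs(KRAKEN2_DB, exist_ok=True)
    !wget {KRAKEN2_DB_URL} -O /tmp/k2_db.tgz
    !tar -xvzf /tmp/k2_db.tgz -C {KRAKEN2_DB}

# Bracken DB: prebuilt Kraken2 indexes usually ship the k-mer distribution files already;
# bracken-build only works when the database still contains its library sequences
bracken_file = os.path.join(KRAKEN2_DB, f"database{BRACKEN_READ_LEN}mers.kmer_distrib")
if not os.path.exists(bracken_file):
    !bracken-build -d {KRAKEN2_DB} -t {THREADS} -k {BRACKEN_KMER} -l {BRACKEN_READ_LEN}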
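A quick check of the database directory can catch an incomplete download before classification starts. The sketch below looks for the three index files a Kraken2 database needs (`hash.k2d`, `opts.k2d`, `taxo.k2d`) and for the Bracken distribution file matching the read length. `kraken2_db_status` is a hypothetical helper name.

```python
import os

# Files Kraken2 needs in order to load a database
REQUIRED_K2_FILES = ("hash.k2d", "opts.k2d", "taxo.k2d")


def kraken2_db_status(db_dir, read_len):
    """Return (missing_index_files, has_bracken_distribution) for db_dir."""
    missing = [
        name for name in REQUIRED_K2_FILES
        if not os.path.isfile(os.path.join(db_dir, name))
    ]
    kmer_distrib = os.path.join(db_dir, f"database{read_len}mers.kmer_distrib")
    return missing, os.path.isfile(kmer_distrib)
```

With the configuration above, `kraken2_db_status(KRAKEN2_DB, BRACKEN_READ_LEN)` should return an empty list and `True` when both Kraken2 and Bracken are ready to run.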


## 3. Quality Control (QC)
Run FastQC, trimming with fastp, and MultiQC reports.


In [None]:
# FastQC raw reads
!fastqc -o {OUTPUT_DIR} {READ1} {READ2}

# fastp trimming
!fastp \
    -i {READ1} -I {READ2} \
    -o {OUTPUT_DIR}/{OUTPUT_PREFIX}_R1.clean.fastq.gz \
    -O {OUTPUT_DIR}/{OUTPUT_PREFIX}_R2.clean.fastq.gz \
    --detect_adapter_for_pe \
    --cut_front --cut_tail --cut_mean_quality 20 \
    --length_required 50 \
    --trim_poly_g \
    --thread {THREADS} \
    --html {OUTPUT_DIR}/{OUTPUT_PREFIX}_fastp.html \
    --json {OUTPUT_DIR}/{OUTPUT_PREFIX}_fastp.json

# FastQC cleaned reads
!fastqc -o {OUTPUT_DIR} \
    {OUTPUT_DIR}/{OUTPUT_PREFIX}_R1.clean.fastq.gz \
    {OUTPUT_DIR}/{OUTPUT_PREFIX}_R2.clean.fastq.gz

# MultiQC summary
!multiqc {OUTPUT_DIR} -o {OUTPUT_DIR}
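The fastp JSON report records read counts before and after filtering, which gives a quick measure of how much data the trimming step discarded. The following sketch reads those two counts from the `summary` section of the report; `fastp_read_retention` is a hypothetical helper name.

```python
import json


def fastp_read_retention(json_path):
    """Return (reads_before, reads_after, fraction_kept) from a fastp JSON report."""
    with open(json_path) as fh:
        summary = json.load(fh)["summary"]
    before = summary["before_filtering"]["total_reads"]
    after = summary["after_filtering"]["total_reads"]
    # Guard against an empty input file
    fraction = after / before if before else 0.0
    return before, after, fraction
```

Here it could be called on `f"{OUTPUT_DIR}/{OUTPUT_PREFIX}_fastp.json"`. A low fraction suggests revisiting the quality and length thresholds passed to fastp.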


## 4. Host read removal with Kneaddata
Remove host reads from the trimmed reads. Kneaddata writes its filtered read pairs as uncompressed FASTQ files in the output directory.


In [None]:
!kneaddata \
    -i1 {OUTPUT_DIR}/{OUTPUT_PREFIX}_R1.clean.fastq.gz \
    -i2 {OUTPUT_DIR}/{OUTPUT_PREFIX}_R2.clean.fastq.gz \
    -db {KNEADDATA_DB} \
    -o {OUTPUT_DIR}/kneaddata_cleaned \
    -t {THREADS} \
    --output-prefix {OUTPUT_PREFIX}_cleaned
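To see how many reads host removal discarded, the read counts of the FASTQ files before and after this step can be compared. The sketch below counts records in a plain or gzip-compressed FASTQ file, assuming the usual four lines per record; `count_fastq_reads` is a hypothetical helper name.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a FASTQ file, assuming four lines per record."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        n_lines = sum(1 for _ in fh)
    return n_lines // 4
```

Comparing the count for the fastp output `_R1.clean.fastq.gz` with the count for the Kneaddata `_paired_1.fastq` file gives the number of read pairs removed as host or left unpaired.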


## 5. Taxonomic classification with Kraken2


In [None]:
!kraken2 \
    --db {KRAKEN2_DB} \
    --threads {THREADS} \
    --paired \
    {OUTPUT_DIR}/kneaddata_cleaned/{OUTPUT_PREFIX}_cleaned_paired_1.fastq \
    {OUTPUT_DIR}/kneaddata_cleaned/{OUTPUT_PREFIX}_cleaned_paired_2.fastq \
    --report {OUTPUT_DIR}/{OUTPUT_PREFIX}.kraken2.report \
    --output {OUTPUT_DIR}/{OUTPUT_PREFIX}.kraken2.out
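The Kraken2 report can be inspected directly before running Bracken. The sketch below assumes the default six-column, tab-separated report layout (percentage, clade reads, direct reads, rank code, taxid, indented name) and lists the most abundant taxa at one rank. `top_taxa` is a hypothetical helper name.

```python
def top_taxa(report_path, rank="S", n=10):
    """Return up to n (name, taxid, clade_reads, percent) tuples for one rank code."""
    rows = []
    with open(report_path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 6:
                # Skip lines that do not match the default report layout
                continue
            percent, clade_reads, _direct, rank_code, taxid, name = fields
            if rank_code == rank:
                rows.append((name.strip(), int(taxid), int(clade_reads), float(percent)))
    # Most reads first
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:n]
```

For example, `top_taxa(f"{OUTPUT_DIR}/{OUTPUT_PREFIX}.kraken2.report", rank="G")` would list the top genera, and `rank="U"` returns the unclassified line.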


## 6. Abundance estimation with Bracken


In [None]:
!bracken \
    -d {KRAKEN2_DB} \
    -i {OUTPUT_DIR}/{OUTPUT_PREFIX}.kraken2.report \
    -o {OUTPUT_DIR}/{OUTPUT_PREFIX}.bracken.species \
    -r {BRACKEN_READ_LEN} -l S
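The Bracken output is a tab-separated table with a header row, so the re-estimated abundances can be ranked with the standard library alone. The sketch below assumes the usual Bracken column names `name`, `new_est_reads` and `fraction_total_reads`; `top_bracken_species` is a hypothetical helper name.

```python
import csv


def top_bracken_species(path, n=10):
    """Return up to n (name, estimated_reads, fraction) tuples, highest fraction first."""
    with open(path, newline="") as fh:
        rows = list(csv.DictReader(fh, delimiter="\t"))
    rows.sort(key=lambda row: float(row["fraction_total_reads"]), reverse=True)
    return [
        (row["name"], int(row["new_est_reads"]), float(row["fraction_total_reads"]))
        for row in rows[:n]
    ]
```

In this notebook the input would be `f"{OUTPUT_DIR}/{OUTPUT_PREFIX}.bracken.species"`.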


## 7. Visualization with Krona


In [None]:
# Krona taxonomy update (if needed)
!ktUpdateTaxonomy.sh

# Prepare and visualize
!cut -f2,3 {OUTPUT_DIR}/{OUTPUT_PREFIX}.kraken2.out > {OUTPUT_DIR}/{OUTPUT_PREFIX}.krona.input
!ktImportTaxonomy {OUTPUT_DIR}/{OUTPUT_PREFIX}.krona.input -o {OUTPUT_DIR}/{OUTPUT_PREFIX}.krona.html
