This notebook is structured as follows:
1) Load the fastq.gz files, assign relevant information, and cut/trim the raw sequences
2) Inspect the nature and quality of all your samples and decide the parameters for further analysis
3) Assign taxa with a reference classifier, then calculate core diversity metrics
4) Visualise alpha/beta diversity
5) Calculate PERMANOVA/betadisper (which helps explain the results of the diversity metrics)
6) Identify differential taxa (specific microbes that differ between subsets of your samples)
7) Create a cladogram

Section 1: we must create a "manifest.tsv" (which provides paths to the actual sequence files) and a "metadata.tsv" (which describes the experiment and gives context to each sample in the analysis). These two files guide the entire analysis pipeline. Later cells assume that metadata.tsv has the columns Group, MainType, and Modifier.
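As a rough illustration of what metadata.tsv could contain, the sketch below builds the three columns used later in this notebook (Group, MainType, Modifier) from a sample-to-group mapping. The sample IDs, the `split_group` helper, and the rule that the first letter of Group is the main type are all assumptions made for this example; adapt them to your own design.

```python
import csv
import io


def split_group(group):
    """Split a Group label such as 'CH' into (MainType, Modifier).

    Assumed convention (hypothetical): the first letter is the main type and
    any remaining letters are the modifier; a bare 'C' or 'E' gets 'none'.
    """
    main_type = group[0]
    modifier = group[1:] or "none"
    return main_type, modifier


def build_metadata_rows(sample_to_group):
    """Turn {sample_id: Group} into one row per sample for metadata.tsv."""
    rows = []
    for sample_id, group in sorted(sample_to_group.items()):
        main_type, modifier = split_group(group)
        rows.append({
            "sample-id": sample_id,
            "Group": group,
            "MainType": main_type,
            "Modifier": modifier,
        })
    return rows


# Hypothetical sample IDs; use the IDs that appear in your manifest.tsv
example = {"S01": "C", "S02": "CH", "S03": "EL"}
rows = build_metadata_rows(example)

# Written to memory here; swap the buffer for
# open("metadata.tsv", "w", newline="") to write a real file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]),
                        delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The sample IDs in the first column must match the sampleid column of manifest.tsv exactly.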

In [None]:
# Import all necessary packages
import os
import sys
import re
import pandas as pd
from pathlib import Path
import biom
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
from matplotlib.patches import Ellipse
import glob
import csv
import json
import qiime2
import seaborn as sns
from skbio.stats.ordination import OrdinationResults
import shutil

In [None]:

'''This cell creates a manifest file for QIIME2 from a directory of FASTQ files. 
Alternatively, you can create the manifest file manually if you have specific naming conventions.'''

# Get notebook directory
project_dir = Path(os.getcwd())

# Define input/output dirs
raw_dir = project_dir / "raw"
qiime_dir = project_dir / 'qiime'
manifest_file = project_dir / "manifest.tsv"

# Recursively list all FASTQ files (both .fq.gz and .fastq.gz)
fastqs = sorted(list(raw_dir.rglob("*.fq.gz")) + list(raw_dir.rglob("*.fastq.gz")))

if not fastqs:
    raise FileNotFoundError(f"No FASTQ files found in {raw_dir}")

# Helper: normalize sample names
def extract_sample_name(fname: Path) -> str:
    """
    Remove read direction markers (_R1, _R2, _1, _2) and extensions
    to infer the sample name.
    """
    name = fname.stem  # strip .gz
    if name.endswith(".fastq"):
        name = name[:-6]
    elif name.endswith(".fq"):
        name = name[:-3]

    # Remove one trailing read direction marker.
    # A single substitution avoids over-stripping names such as "sample_1_R1".
    name = re.sub(r"[-_.](R1|R2|1|2)$", "", name)

    return name

# Group by sample name
samples = {}
for f in fastqs:
    sample = extract_sample_name(f)
    if sample not in samples:
        samples[sample] = {"R1": None, "R2": None}

    if re.search(r"(R1|_1)\.f(ast)?q\.gz$", f.name):
        samples[sample]["R1"] = f.resolve()
    elif re.search(r"(R2|_2)\.f(ast)?q\.gz$", f.name):
        samples[sample]["R2"] = f.resolve()

# Build manifest DataFrame
records = []
for sample, files in samples.items():
    if files["R1"] and files["R2"]:
        records.append({
            "sampleid": sample,
            "forward-absolute-filepath": str(files["R1"]),
            "reverse-absolute-filepath": str(files["R2"]),
        })
    else:
        print(f"⚠️ Skipping {sample}: missing R1 or R2")

manifest = pd.DataFrame(records)

# Save manifest
manifest.to_csv(manifest_file, sep="\t", index=False)

print(f"✅ Manifest written to: {manifest_file.resolve()}")
print(f"Samples included: {len(manifest)}")
print(manifest.head())
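Because the manifest and the metadata drive everything downstream, it can help to confirm that both files list the same sample IDs before importing. The sketch below is one way to do that with the standard library; the in-memory example files and the `read_ids` and `compare_ids` helpers are hypothetical.

```python
import csv
import io


def read_ids(handle):
    """Return the IDs in the first column of a TSV.

    Skips the header row and any '#'-prefixed rows (for example a
    '#q2:types' row in a QIIME 2 metadata file).
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # header
    return {row[0] for row in reader if row and not row[0].startswith("#")}


def compare_ids(manifest_ids, metadata_ids):
    """Return (IDs missing from metadata, IDs missing from manifest)."""
    return sorted(manifest_ids - metadata_ids), sorted(metadata_ids - manifest_ids)


# Toy in-memory files; in the notebook you would pass
# open("manifest.tsv") and open("metadata.tsv") instead
manifest_text = (
    "sampleid\tforward-absolute-filepath\treverse-absolute-filepath\n"
    "S01\t/data/S01_1.fq.gz\t/data/S01_2.fq.gz\n"
    "S02\t/data/S02_1.fq.gz\t/data/S02_2.fq.gz\n"
)
metadata_text = (
    "sample-id\tGroup\n"
    "#q2:types\tcategorical\n"
    "S01\tC\n"
    "S03\tE\n"
)

missing_in_metadata, missing_in_manifest = compare_ids(
    read_ids(io.StringIO(manifest_text)),
    read_ids(io.StringIO(metadata_text)),
)
print("In manifest but not metadata:", missing_in_metadata)
print("In metadata but not manifest:", missing_in_manifest)
```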

Section 2: we start working with the sequences themselves. We inspect read quality, denoise and truncate with DADA2, and then inspect the results again.

In [None]:
# This imports the fastq files and creates a QIIME2 artifact
!mkdir -p qiime
# Select the manifest 
!qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-format PairedEndFastqManifestPhred33V2 \
  --input-path "manifest.tsv" \
  --output-path "qiime/demux.qza"

In [None]:
# This creates a QIIME2 visualisation artefact; open it in QIIME2 View to check the quality of the reads
# Based on the QC plots, you can decide the length at which to truncate the reads in the next step
!qiime demux summarize \
    --i-data "qiime/demux.qza" \
    --o-visualization "qiime/demux.qzv"
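One possible way to turn the quality plot into truncation lengths is sketched below. It assumes you have read the median quality score per position off the plot (the numbers here are made up), uses an arbitrary quality threshold of 25, and checks that the truncated forward and reverse reads would still overlap. The amplicon length of 460 is purely hypothetical; DADA2 needs at least 12 overlapping bases by default to merge a pair.

```python
def suggest_trunc_len(median_quality, min_quality=25):
    """Return how many leading positions keep a median quality >= min_quality."""
    for position, quality in enumerate(median_quality):
        if quality < min_quality:
            return position
    return len(median_quality)


def expected_overlap(trunc_len_f, trunc_len_r, amplicon_len):
    """Bases shared by forward and reverse reads after truncation.

    amplicon_len is the full amplicon length as sequenced, including any
    primer bases still present at the start of the reads.
    """
    return trunc_len_f + trunc_len_r - amplicon_len


# Made-up median qualities for a very short read, for illustration only
toy_medians = [38, 38, 37, 30, 24, 20]
print("Suggested truncation length:", suggest_trunc_len(toy_medians))

# Check the overlap left by the 240/240 truncation used in the DADA2 cell,
# for a hypothetical 460 bp amplicon
overlap = expected_overlap(240, 240, 460)
print("Expected overlap:", overlap, "bases")
if overlap < 12:
    print("Too little overlap for merging; truncate less aggressively")
```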

In [None]:
# DADA2 is a popular denoising algorithm that corrects amplicon errors while also removing chimeras
# It will return artefacts which can be converted to visualisation to see if we trimmed appropriately
!qiime dada2 denoise-paired \
    --i-demultiplexed-seqs "qiime/demux.qza" \
    --p-trim-left-f 10 \
    --p-trim-left-r 10 \
    --p-trunc-len-f 240 \
    --p-trunc-len-r 240 \
    --o-table "qiime/table.qza" \
    --o-representative-sequences "qiime/rep-seqs.qza" \
    --o-denoising-stats "qiime/stats.qza" \
    --p-n-threads 0

In [None]:
# Now we convert the dada2 artefacts in the previous cell to visualisations
# If it passes the QC, we can proceed to assign taxonomy
!qiime feature-table summarize \
    --i-table "qiime/table.qza" \
    --o-visualization "qiime/table.qzv" \
    --m-sample-metadata-file "metadata.tsv"

!qiime feature-table tabulate-seqs \
    --i-data "qiime/rep-seqs.qza" \
    --o-visualization "qiime/rep-seqs.qzv"

!qiime metadata tabulate \
    --m-input-file "qiime/stats.qza" \
    --o-visualization "qiime/stats.qzv"
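When reading the denoising stats, a quick check is the fraction of input reads that survive to the non-chimeric step. The sketch below flags samples that fall under a threshold. The read counts are invented and the 50% cut-off is an arbitrary assumption; the keys follow the column names shown in the DADA2 stats table ("input", "non-chimeric").

```python
def retention_report(stats_rows, min_fraction=0.5):
    """Return ({sample: fraction retained}, [samples below min_fraction])."""
    report = {}
    for row in stats_rows:
        reads_in = row["input"]
        fraction = row["non-chimeric"] / reads_in if reads_in else 0.0
        report[row["sample-id"]] = round(fraction, 3)
    flagged = sorted(s for s, f in report.items() if f < min_fraction)
    return report, flagged


# Invented numbers, for illustration only
toy_stats = [
    {"sample-id": "S01", "input": 50000, "non-chimeric": 40000},
    {"sample-id": "S02", "input": 48000, "non-chimeric": 12000},
]

report, flagged = retention_report(toy_stats)
for sample, fraction in report.items():
    print(f"{sample}: {fraction:.1%} of input reads retained")
print("Samples to look at more closely:", flagged)
```

A sharp drop at the merging step often points to truncation lengths that leave too little overlap, whereas a drop at the filtering step points to quality.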

In [None]:
# Point this to your actual classifier
classifier_path = "/home/patwuch/projects/microbiome/reference/silva-138-99-nb-classifier.qza"

!qiime feature-classifier classify-sklearn \
    --i-classifier {classifier_path} \
    --i-reads "qiime/rep-seqs.qza" \
    --o-classification "qiime/taxonomy.qza"

# Exhaustive list of taxonomic assignments with confidence scores
!qiime metadata tabulate \
    --m-input-file "qiime/taxonomy.qza" \
    --o-visualization "qiime/taxonomy.qzv"


In [None]:
# Taxa bar plots are a common way to visualise taxonomic composition across samples
!qiime taxa barplot \
    --i-table "qiime/table.qza" \
    --i-taxonomy "qiime/taxonomy.qza" \
    --m-metadata-file "metadata.tsv" \
    --o-visualization "qiime/taxa-bar-plots.qzv"

# Krona is a helpful way to interactively explore taxonomic composition
# But it is not great for static figures in publications
!qiime krona collapse-and-plot \
    --i-table "qiime/table.qza" \
    --i-taxonomy "qiime/taxonomy.qza" \
    --o-krona-plot "qiime/krona.qzv"



In [None]:
# If you want static plots of alpha and beta diversity as png, you must use the following commands
# These export qiime artefacts into more standard bioinformatics file formats
!qiime tools export \
  --input-path qiime/table.qza \
  --output-path exported/exported-feature-table

!qiime tools export \
  --input-path qiime/taxonomy.qza \
  --output-path exported/exported-taxonomy

In [None]:
output_dir = "taxa_barplots"
os.makedirs(output_dir, exist_ok=True)

# ===========================
# USER SETTINGS
# ===========================
feature_table_biom = "exported/exported-feature-table/feature-table.biom"
taxonomy_tsv = "exported/exported-taxonomy/taxonomy.tsv"
top_n = 20            # Number of taxa to show in legend

# ===========================
# LOAD DATA
# ===========================
table = biom.load_table(feature_table_biom)
df = pd.DataFrame(table.matrix_data.toarray().T, 
                  index=table.ids(axis='sample'), 
                  columns=table.ids(axis='observation'))

taxonomy = pd.read_csv(taxonomy_tsv, sep='\t', index_col=0)

level_dict = {
    "Kingdom": 0,
    "Phylum": 1,
    "Class": 2,
    "Order": 3,
    "Family": 4,
    "Genus": 5,
}

# ===========================
# LOOP THROUGH TAXONOMIC LEVELS
# ===========================
for tax_level, level_index in level_dict.items():
    print(f"Processing {tax_level}...")

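    # Strip whitespace so "; "-separated taxonomy strings give clean labels
    taxonomy[tax_level] = taxonomy['Taxon'].str.split(';').str[level_index].str.strip()
    taxonomy[tax_level] = taxonomy[tax_level].fillna("Unassigned")

    # Sum features per taxon (transpose instead of the deprecated groupby(axis=1))
    df_tax = df.T.groupby(taxonomy[tax_level]).sum().T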
    df_tax_norm = df_tax.div(df_tax.sum(axis=1), axis=0)

    mean_abundance = df_tax_norm.mean(axis=0)
    top_taxa = mean_abundance.sort_values(ascending=False).head(top_n).index
    df_top = df_tax_norm[top_taxa].copy()
    df_top['Other'] = df_tax_norm.drop(columns=top_taxa).sum(axis=1)

    # Reverse columns for correct stacking/legend
    df_top_plot = df_top[df_top.columns[::-1]]  

    # ===========================
    # CREATE A CLEAN COLOR SCHEME
    # ===========================
    n_colors = df_top_plot.shape[1]
    cmap = plt.get_cmap('tab20', n_colors)  # tab20 colormap (cm.get_cmap was removed in recent matplotlib)
    colors = [cmap(i) for i in range(n_colors)]

    # Plot
    plt.figure(figsize=(12,6))
    df_top_plot.plot(kind='bar', stacked=True, width=0.8, ax=plt.gca(), color=colors)
    plt.ylabel("Relative abundance")
    plt.xlabel("Samples")
    plt.xticks(rotation=90)
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.title(f"Relative abundance at {tax_level} level")
    plt.tight_layout()

    output_png = os.path.join(output_dir, f"taxa_barplot_{tax_level}.png")
    plt.savefig(output_png, dpi=300)
    plt.close()


Section 3: we build a phylogenetic tree from the representative sequences, then use it together with the feature table to calculate the core diversity metrics.

In [None]:
!qiime phylogeny align-to-tree-mafft-fasttree \
    --i-sequences "qiime/rep-seqs.qza" \
    --o-alignment "qiime/aligned-rep-seqs.qza" \
    --o-masked-alignment "qiime/masked-aligned-rep-seqs.qza" \
    --o-tree "qiime/unrooted-tree.qza" \
    --o-rooted-tree "qiime/rooted-tree.qza"


In [None]:
# This creates artefacts (qza) that store metrics as well as visualisation artefacts (qzv)
# The latter of which you can play around with on QIIME2 View interactively
# But here we will be creating static graphs using the qza
!qiime diversity core-metrics-phylogenetic \
    --i-phylogeny "qiime/rooted-tree.qza" \
    --i-table "qiime/table.qza" \
    --p-sampling-depth 30000 \
    --m-metadata-file "metadata.tsv" \
    --output-dir "core-metrics-results"
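The sampling depth rarefies every sample to the same number of reads, and any sample with fewer reads than the depth is dropped from the diversity results. A small sketch of that trade-off follows; the per-sample read counts are invented, and in practice you would take them from the table.qzv summary.

```python
def samples_retained(read_counts, depth):
    """Return (kept, dropped) sample IDs for a given sampling depth."""
    kept = sorted(s for s, n in read_counts.items() if n >= depth)
    dropped = sorted(s for s, n in read_counts.items() if n < depth)
    return kept, dropped


# Invented per-sample read counts, for illustration only
toy_counts = {"S01": 45210, "S02": 31877, "S03": 28950}

for depth in (20000, 30000, 40000):
    kept, dropped = samples_retained(toy_counts, depth)
    print(f"depth {depth}: keep {len(kept)} samples, drop {dropped}")
```

A higher depth keeps more reads per sample but loses more samples, so the usual aim is the highest depth that still retains the samples you need.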


Section 4: we create static alpha and beta diversity plots from the core-metrics artefacts.

In [None]:
# ---- Load alpha diversity artifacts ----
shannon = qiime2.Artifact.load("core-metrics-results/shannon_vector.qza").view(pd.Series)
evenness = qiime2.Artifact.load("core-metrics-results/evenness_vector.qza").view(pd.Series)
faith = qiime2.Artifact.load("core-metrics-results/faith_pd_vector.qza").view(pd.Series)
observed = qiime2.Artifact.load("core-metrics-results/observed_features_vector.qza").view(pd.Series)

# ---- Load metadata ----
metadata = pd.read_csv("metadata.tsv", sep="\t", index_col=0)

# ---- Combine alpha diversity into one DataFrame ----
alpha_df = pd.concat([
    shannon.rename("Shannon"),
    evenness.rename("Evenness"),
    faith.rename("Faith_PD"),
    observed.rename("Observed_Features")
], axis=1)

# ---- Merge alpha diversity with metadata ----
merged = alpha_df.join(metadata)

# ---- Output folder ----
os.makedirs("alpha_diversity_plots", exist_ok=True)
sns.set(style="whitegrid")

# ---- Define comparisons in flexible style ----
comparisons = [
    ("Group", None),                # all groups
    ("Group", ["C", "CH", "CL"]),  # only C subgroups
    ("Group", ["E", "EH", "EL"]),  # only E subgroups
    ("MainType", ["C", "E"])       # main type comparison
]

# ---- Function to plot alpha diversity ----
def plot_alpha_diversity(df, x_col, levels=None, title=None, outfile=None):
    # Subset if levels provided
    if levels is not None:
        df = df[df[x_col].isin(levels)]
    
    # Melt for seaborn
    melted = df.melt(
        id_vars=[x_col],
        value_vars=["Shannon", "Evenness", "Faith_PD", "Observed_Features"],
        var_name="Metric",
        value_name="Diversity"
    )
    
    # Plot
    g = sns.catplot(
        data=melted,
        x=x_col, y="Diversity",
        col="Metric",
        kind="box",
        col_wrap=2,
        sharey=False,
        height=4, aspect=1.2
    )
    g.map_dataframe(sns.stripplot, x=x_col, y="Diversity", color="black", alpha=0.5)
    plt.subplots_adjust(top=0.85)
    
    if title:
        g.figure.suptitle(title)
    if outfile:
        g.savefig(outfile, dpi=300, bbox_inches="tight")
    plt.close(g.figure)

# ---- Loop over comparisons ----
for x_col, levels in comparisons:
    name = f"{x_col}_{'_'.join(levels) if levels else 'all'}"
    title = f"Alpha Diversity - {name.replace('_', ' ')}"
    outfile = f"alpha_diversity_plots/alpha_diversity_{name}.png"
    
    plot_alpha_diversity(
        df=merged,
        x_col=x_col,
        levels=levels,
        title=title,
        outfile=outfile
    )
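To make the plotted metrics less abstract, here is a small worked sketch of three of them on toy count vectors: Shannon entropy (log base 2), observed features, and Pielou's evenness (Shannon divided by the log of the number of observed features). The counts are made up, and Faith's PD is left out because it needs the phylogenetic tree.

```python
import math


def observed_features(counts):
    """Number of features with a non-zero count."""
    return sum(1 for c in counts if c > 0)


def shannon(counts, base=2):
    """Shannon entropy of the relative abundances."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in proportions)


def pielou_evenness(counts, base=2):
    """Shannon entropy divided by its maximum, log(observed features)."""
    richness = observed_features(counts)
    if richness < 2:
        return float("nan")  # undefined with fewer than two features
    return shannon(counts, base) / math.log(richness, base)


even_sample = [10, 10, 10, 10]   # four equally abundant features
skewed_sample = [97, 1, 1, 1]    # same richness, one dominant feature

for name, counts in (("even", even_sample), ("skewed", skewed_sample)):
    print(name,
          "observed:", observed_features(counts),
          "shannon:", round(shannon(counts), 3),
          "evenness:", round(pielou_evenness(counts), 3))
```

Both samples have the same number of observed features, so only Shannon and evenness separate them.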


In [None]:

# Load PCoAResults directly from subfolder to plot beta diversity
pcoa_results = {
    "Bray-Curtis": qiime2.Artifact.load("core-metrics-results/bray_curtis_pcoa_results.qza").view(OrdinationResults),
    "Jaccard": qiime2.Artifact.load("core-metrics-results/jaccard_pcoa_results.qza").view(OrdinationResults),
    "Unweighted UniFrac": qiime2.Artifact.load("core-metrics-results/unweighted_unifrac_pcoa_results.qza").view(OrdinationResults),
    "Weighted UniFrac": qiime2.Artifact.load("core-metrics-results/weighted_unifrac_pcoa_results.qza").view(OrdinationResults),
}

metadata = pd.read_csv("metadata.tsv", sep="\t", index_col=0)

# Number of PCs to rename
n_pcs = 30

# Define comparisons
comparisons = [
    ("Group", None),                
    ("Group", ["C", "CH", "CL"]),
    ("Group", ["E", "EH", "EL"]),
    ("MainType", ["C", "E"])
]

# Output folder
output_dir = "beta_diversity_plots"
os.makedirs(output_dir, exist_ok=True)

# Loop through comparisons
for col, filter_values in comparisons:
    # Create figure with 2x2 grid
    fig, axes = plt.subplots(2, 2, figsize=(14, 12))
    axes = axes.flatten()

    for ax, (distance_metric, pcoa_res) in zip(axes, pcoa_results.items()):
        coords = pcoa_res.samples
        df = coords.merge(metadata, left_index=True, right_index=True)
        df.rename(columns={i: f'PC{i+1}' for i in range(n_pcs)}, inplace=True)

        df_subset = df.copy()
        if filter_values is not None:
            df_subset = df_subset[df_subset[col].isin(filter_values)]

        unique_groups = df_subset[col].unique()
        palette = sns.color_palette(n_colors=len(unique_groups))
        group_to_color = dict(zip(unique_groups, palette))

        # Scatter plot
        sns.scatterplot(
            x="PC1", y="PC2", hue=col, data=df_subset,
            s=100, alpha=0.8, palette=group_to_color, ax=ax
        )

        # Draw ellipses
        for group, data_subset in df_subset.groupby(col):
            if len(data_subset) < 2:
                continue
            centroid_x = data_subset["PC1"].mean()
            centroid_y = data_subset["PC2"].mean()

            width = data_subset["PC1"].std() * 2
            height = data_subset["PC2"].std() * 2

            ellipse = Ellipse(
                (centroid_x, centroid_y),
                width=width, height=height,
                edgecolor=group_to_color[group],
                facecolor='none', lw=2, alpha=0.7
            )
            ax.add_patch(ellipse)
            ax.scatter(centroid_x, centroid_y, marker='x', color='black', s=120, zorder=10)

        ax.set_title(distance_metric)
        ax.set_xlabel("PC1")
        ax.set_ylabel("PC2")

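    # Collect legend entries from the first panel, then drop the per-panel
    # legends so that only one combined legend is shown beside the grid
    handles, labels = axes[0].get_legend_handles_labels()
    for ax in axes:
        if ax.get_legend() is not None:
            ax.get_legend().remove()
    # Anchor inside the margin reserved by tight_layout below so it is not cropped
    fig.legend(handles, labels, title=col, bbox_to_anchor=(0.86, 0.5), loc='center left')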

    # Main title
    subset_label = "all" if filter_values is None else "_".join(filter_values)
    fig.suptitle(f"PCoA comparison ({col} = {subset_label})", fontsize=16)

    plt.tight_layout(rect=[0, 0, 0.85, 0.95])

    # Save figure
    filename = f"PCoA_comparison_{col}_{subset_label}.png"
    filepath = os.path.join(output_dir, filename)
    plt.savefig(filepath, dpi=300)
    plt.close()
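The PCoA plots are built from pairwise distances between samples. As a worked sketch, the two non-phylogenetic metrics can be computed by hand on toy count vectors: Bray-Curtis uses abundances, while Jaccard uses only presence/absence. The vectors below are invented; the UniFrac metrics are omitted because they need the tree.

```python
def bray_curtis(a, b):
    """Sum of absolute count differences divided by the total count."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator if denominator else 0.0


def jaccard(a, b):
    """1 - (shared features / features present in either sample)."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, y in enumerate(b) if y > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1 - len(present_a & present_b) / len(union)


# Invented counts for three features in two samples
sample_a = [1, 0, 3]
sample_b = [1, 2, 0]

print("Bray-Curtis:", round(bray_curtis(sample_a, sample_b), 3))
print("Jaccard:", round(jaccard(sample_a, sample_b), 3))
```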


In [None]:
# Export rooted tree and taxonomy for use in R or other software
!qiime tools export \
    --input-path "qiime/rooted-tree.qza" \
    --output-path "exported/rooted-tree"
!qiime tools export \
    --input-path "qiime/taxonomy.qza" \
    --output-path "exported/taxonomy"


Section 5: we create the PERMANOVA and betadisper tables using a combination of Python and R, because R produces more legible results than QIIME2 in this case.

!!! The R cell will not run unless you first run the `%load_ext rpy2.ipython` cell to initiate R magic !!!

In [None]:
core_metrics_dir = "core-metrics-results"
qza_files = [f for f in os.listdir(core_metrics_dir) if f.endswith("_distance_matrix.qza")]

distance_names = []
distance_paths = []

for qza in qza_files:
    name = qza.replace("_distance_matrix.qza", "")
    out_path = os.path.join("exported", f"{name}_distance_matrix.tsv")

    # Make sure export dir exists
    os.makedirs("exported", exist_ok=True)

    # Use a temp dir for the raw export
    tmp_dir = f"_tmp_export_{name}"
    os.makedirs(tmp_dir, exist_ok=True)

    # Export with QIIME2
    !qiime tools export --input-path "{os.path.join(core_metrics_dir, qza)}" --output-path "{tmp_dir}"

    # Move the exported file; only register the matrix if the export succeeded
    raw_exported = os.path.join(tmp_dir, "distance-matrix.tsv")
    if os.path.exists(raw_exported):
        shutil.move(raw_exported, out_path)
        distance_names.append(name)
        distance_paths.append(out_path)
    else:
        print(f"Export failed for {qza}; skipping")

    # Clean up temp dir
    shutil.rmtree(tmp_dir)

# Define comparisons
comparison_names = ["all", "E_group", "C_group", "E_vs_C"]
comparison_columns = ["Group", "Group", "Group", "MainType"]


In [None]:
%load_ext rpy2.ipython

In [None]:
%%R -i distance_names -i distance_paths -i comparison_names -i comparison_columns 

library(vegan)
library(dplyr)
library(tibble)

# Build named lists in R
distance_files <- setNames(as.list(distance_paths), distance_names)
comparisons <- setNames(as.list(comparison_columns), comparison_names)

load_distance <- function(file) {
  mat <- as.matrix(read.table(file, header=TRUE, row.names=1, sep="\t", check.names=FALSE))
  as.dist(mat)
}

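# Load metadata (kept whole; it is subset per comparison inside the loop)
meta_full <- read.table("metadata.tsv", header=TRUE, sep="\t", row.names=1)

# Define allowed levels per comparison to prevent nonsensical outputs
# Comparisons not listed here use every sample
allowed_levels <- list(
  C_group = c("C", "CH", "CL"),
  E_group = c("E", "EH", "EL") # Add other comparisons here as needed
)

results <- list()

for (dname in names(distance_files)) {
  d_full <- load_distance(distance_files[[dname]])
  
  for (comp_name in names(comparisons)) {
    col <- comparisons[[comp_name]]
    
    # Keep samples present in both the distance matrix and the metadata
    # (rarefaction can drop samples), in the same order for both,
    # and restrict them to the allowed levels for this comparison
    ids <- intersect(labels(d_full), rownames(meta_full))
    keep_levels <- allowed_levels[[comp_name]]
    if (!is.null(keep_levels)) {
      ids <- ids[meta_full[ids, col] %in% keep_levels]
    }
    d <- as.dist(as.matrix(d_full)[ids, ids])
    meta <- meta_full[ids, , drop=FALSE]
    
    # Pairwise tests are run only over the allowed levels for this comparison
    factor_levels <- intersect(unique(meta[[col]]), keep_levels)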
    
    ## ---- Global PERMANOVA ----
    ad_global <- adonis2(d ~ meta[[col]], permutations=9999)
    ad_row <- as.data.frame(ad_global[1, ])
    ad_row <- rownames_to_column(ad_row, "Term")
    ad_row$distance <- dname
    ad_row$comparison <- comp_name
    ad_row$scope <- "global"
    ad_row$pair <- NA
    ad_row$test <- "permanova"
    results[[length(results)+1]] <- ad_row

    ## ---- Global betadisper ----
    bd_global <- betadisper(d, meta[[col]])
    bd_global_anova <- anova(bd_global)
    bd_row <- as.data.frame(bd_global_anova[1, ])
    bd_row <- rownames_to_column(bd_row, "Term")
    bd_row$distance <- dname
    bd_row$comparison <- comp_name
    bd_row$scope <- "global"
    bd_row$pair <- NA
    bd_row$test <- "betadisper"
    results[[length(results)+1]] <- bd_row

    ## ---- Pairwise ----
    if (length(factor_levels) > 1) {  # pairwise only makes sense if >=2 levels
      pairs <- combn(factor_levels, 2, simplify = FALSE)
      for (p in pairs) {
        idx <- meta[[col]] %in% p
        d_sub <- as.dist(as.matrix(d)[idx, idx])
        meta_sub <- meta[idx, ]
        
        # PERMANOVA pair
        ad_pair <- adonis2(d_sub ~ meta_sub[[col]], permutations=9999)
        adp_row <- as.data.frame(ad_pair[1, ])
        adp_row <- rownames_to_column(adp_row, "Term")
        adp_row$distance <- dname
        adp_row$comparison <- comp_name
        adp_row$scope <- "pairwise"
        adp_row$pair <- paste(p, collapse="_vs_")
        adp_row$test <- "permanova"
        results[[length(results)+1]] <- adp_row
        
        # betadisper pair
        bd_pair <- betadisper(d_sub, meta_sub[[col]])
        bd_pair_anova <- anova(bd_pair)
        bdp_row <- as.data.frame(bd_pair_anova[1, ])
        bdp_row <- rownames_to_column(bdp_row, "Term")
        bdp_row$distance <- dname
        bdp_row$comparison <- comp_name
        bdp_row$scope <- "pairwise"
        bdp_row$pair <- paste(p, collapse="_vs_")
        bdp_row$test <- "betadisper"
        results[[length(results)+1]] <- bdp_row
      }
    }
  }
}

# Bind all results with clean column ordering
permanova_permdisp_results <- bind_rows(results) %>%
  select(distance, comparison, scope, pair, test, everything())

write.table(
  permanova_permdisp_results, 
  file = "permanova_permdisp_results.tsv", 
  sep = "\t", 
  quote = FALSE, 
  row.names = FALSE
)
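For intuition about what the PERMANOVA in the R cell above reports, the sketch below computes a pseudo-F statistic for a single grouping factor from a distance matrix and estimates a p-value by shuffling the group labels. It is a simplified, pure-Python illustration on made-up data, not a replacement for adonis2, and it assumes there are more samples than groups.

```python
import random


def pseudo_f(dist, groups):
    """Pseudo-F for one factor; dist is a square symmetric list of lists."""
    n = len(groups)
    labels = sorted(set(groups))
    n_groups = len(labels)
    # Total sum of squares from all pairwise squared distances
    ss_total = sum(dist[i][j] ** 2
                   for i in range(n) for j in range(i + 1, n)) / n
    # Within-group sum of squares, one term per group
    ss_within = 0.0
    for label in labels:
        members = [i for i, g in enumerate(groups) if g == label]
        pair_sum = sum(dist[i][j] ** 2
                       for k, i in enumerate(members) for j in members[k + 1:])
        ss_within += pair_sum / len(members)
    ss_among = ss_total - ss_within
    return (ss_among / (n_groups - 1)) / (ss_within / (n - n_groups))


def permanova(dist, groups, permutations=999, seed=0):
    """Return (observed pseudo-F, permutation p-value)."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, groups)
    shuffled = list(groups)
    at_least_as_large = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (permutations + 1)


# Made-up example: six samples on a line, two well-separated groups
positions = [0, 1, 2, 10, 11, 12]
toy_groups = ["A", "A", "A", "B", "B", "B"]
toy_dist = [[abs(x - y) for y in positions] for x in positions]

f_stat, p_value = permanova(toy_dist, toy_groups)
print(f"pseudo-F = {f_stat:.1f}, p = {p_value:.3f}")
```

PERMANOVA can also come out significant when groups differ only in spread, which is why the R cell pairs it with betadisper.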


Section 6: to identify differential taxa, we use either ANCOM-BC (ancombc) or ANCOM-BC2 (ancombc2).
If you want full results and static plots, follow the ANCOM-BC2 route.
If you want easy visualisation in QIIME2 View, use the ANCOM-BC approach.
The two routes should yield broadly similar results as long as your parameters remain consistent.

In [None]:
# ANCOMBC approach
# -------------------------------
# 1. Compare all 6 groups (Group)
# -------------------------------
!qiime composition ancombc \
  --i-table "qiime/table.qza" \
  --m-metadata-file "metadata.tsv" \
  --p-formula Group \
  --o-differentials "qiime/ancombc-Group-results.qza"

!qiime composition da-barplot \
  --i-data "qiime/ancombc-Group-results.qza" \
  --p-significance-threshold 0.001 \
  --o-visualization "qiime/ancombc-Group-results.qzv"


# -------------------------------
# 2. Compare MainType (E vs C) across all samples
# -------------------------------
!qiime composition ancombc \
  --i-table "qiime/table.qza" \
  --m-metadata-file "metadata.tsv" \
  --p-formula MainType \
  --o-differentials "qiime/ancombc-E-vs-C-results.qza"

!qiime composition da-barplot \
  --i-data "qiime/ancombc-E-vs-C-results.qza" \
  --p-significance-threshold 0.001 \
  --o-visualization "qiime/ancombc-E-vs-C-results.qzv"


# -------------------------------
# 3. Compare Modifier within MainType E
# -------------------------------
!qiime feature-table filter-samples \
  --i-table "qiime/table.qza" \
  --m-metadata-file "metadata.tsv" \
  --p-where "MainType='E'" \
  --o-filtered-table "qiime/table-E.qza"

!qiime composition ancombc \
  --i-table "qiime/table-E.qza" \
  --m-metadata-file "metadata.tsv" \
  --p-formula Modifier \
  --o-differentials "qiime/ancombc-E-Modifier-results.qza"

!qiime composition da-barplot \
  --i-data "qiime/ancombc-E-Modifier-results.qza" \
  --p-significance-threshold 0.001 \
  --o-visualization "qiime/ancombc-E-Modifier-results.qzv"


# -------------------------------
# 4. Compare Modifier within MainType C
# -------------------------------
!qiime feature-table filter-samples \
  --i-table "qiime/table.qza" \
  --m-metadata-file "metadata.tsv" \
  --p-where "MainType='C'" \
  --o-filtered-table "qiime/table-C.qza"

!qiime composition ancombc \
  --i-table "qiime/table-C.qza" \
  --m-metadata-file "metadata.tsv" \
  --p-formula Modifier \
  --o-differentials "qiime/ancombc-C-Modifier-results.qza"

!qiime composition da-barplot \
  --i-data "qiime/ancombc-C-Modifier-results.qza" \
  --p-significance-threshold 0.001 \
  --o-visualization "qiime/ancombc-C-Modifier-results.qzv"


In [None]:
# -------------------------------
# ANCOMBC2 APPROACH
# -------------------------------
# 1. Compare all 6 groups (Group)
# Note: for a categorical variable, ancombc2 compares each level against a reference level
# (typically the alphabetically first one), not ONE vs. REST
# -------------------------------
!qiime composition ancombc2 \
  --i-table "qiime/table.qza" \
  --m-metadata-file "metadata.tsv" \
  --p-fixed-effects-formula Group \
  --o-ancombc2-output "qiime/ancombc2-Group-results.qza"


# -------------------------------
# 2. Compare MainType (E vs C) across all samples
# -------------------------------
!qiime composition ancombc2 \
  --i-table "qiime/table.qza" \
  --m-metadata-file "metadata.tsv" \
  --p-fixed-effects-formula MainType \
  --o-ancombc2-output "qiime/ancombc2-E-vs-C-results.qza"


# -------------------------------
# 3. Compare Modifier within MainType E
# -------------------------------
!qiime feature-table filter-samples \
  --i-table "qiime/table.qza" \
  --m-metadata-file "metadata.tsv" \
  --p-where "MainType='E'" \
  --o-filtered-table "qiime/table-E.qza"

!qiime composition ancombc2 \
  --i-table "qiime/table-E.qza" \
  --m-metadata-file "metadata.tsv" \
  --p-fixed-effects-formula Modifier \
  --o-ancombc2-output "qiime/ancombc2-E-Modifier-results.qza"


# -------------------------------
# 4. Compare Modifier within MainType C
# -------------------------------
!qiime feature-table filter-samples \
  --i-table "qiime/table.qza" \
  --m-metadata-file "metadata.tsv" \
  --p-where "MainType='C'" \
  --o-filtered-table "qiime/table-C.qza"

!qiime composition ancombc2 \
  --i-table "qiime/table-C.qza" \
  --m-metadata-file "metadata.tsv" \
  --p-fixed-effects-formula Modifier \
  --o-ancombc2-output "qiime/ancombc2-C-Modifier-results.qza"

In [None]:
# To create static plots we have to export the ancombc2 artefacts to jsonl files
!qiime tools export \
  --input-path qiime/ancombc2-Group-results.qza \
  --output-path exported/ancombc2-Group-results

!qiime tools export \
  --input-path qiime/ancombc2-E-vs-C-results.qza \
  --output-path exported/ancombc2-E-vs-C-results

!qiime tools export \
  --input-path qiime/ancombc2-E-Modifier-results.qza \
  --output-path exported/ancombc2-E-Modifier-results

!qiime tools export \
  --input-path qiime/ancombc2-C-Modifier-results.qza \
  --output-path exported/ancombc2-C-Modifier-results


In [None]:

# --- DIRECTORIES TO PROCESS ---
# List all your exported ANCOM-BC2 result directories
EXPORTED_FOLDERS = [
    'exported/ancombc2-Group-results',
    'exported/ancombc2-E-vs-C-results',
    'exported/ancombc2-E-Modifier-results',
    'exported/ancombc2-C-Modifier-results'
]
# ------------------------------
def jsonl_to_tsv(input_filename, output_filename):
    """
    Parses a specific JSONL format (with a schema header) and converts it to TSV.
    """
    try:
        with open(input_filename, 'r', encoding='utf-8') as infile:
            # 1. Read the schema line (first line)
            schema_line = infile.readline()
            if not schema_line:
                print("Error: Input file is empty.")
                return

            schema = json.loads(schema_line)
            
            # Extract the header/field names from the 'fields' array
            # This ensures the correct order for the TSV output
            header = [field['name'] for field in schema.get('fields', [])]

            if not header:
                print("Error: Could not extract header from the schema.")
                return

            # 2. Open the output file for writing TSV
            with open(output_filename, 'w', newline='', encoding='utf-8') as outfile:
                # Use the csv module with the tab character ('\t') as the delimiter
                writer = csv.writer(outfile, delimiter='\t')
                
                # Write the header row
                writer.writerow(header)
                
                # 3. Process the remaining data lines
                for line in infile:
                    if not line.strip(): # Skip empty lines
                        continue
                        
                    try:
                        data_record = json.loads(line)
                        
                        # Extract values in the order defined by the header
                        row_data = [data_record.get(field_name, '') for field_name in header]
                        
                        # Write the data row
                        writer.writerow(row_data)
                        
                    except json.JSONDecodeError as e:
                        print(f"Skipping malformed JSON line: {line.strip()}. Error: {e}", file=sys.stderr)
                        continue
                        
        print(f"Successfully converted {input_filename} to {output_filename}")

    except FileNotFoundError:
        print(f"Error: The file '{input_filename}' was not found.")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        
# Run the conversion for all specified directories
for folder in EXPORTED_FOLDERS:
    # Check if the directory exists before trying to convert
    if os.path.isdir(folder):
        print(f"Processing folder: {folder}")
        
        # Use glob to find all files ending with .jsonl directly inside the current folder
        # (this pattern is not recursive; subfolders are not searched)
        jsonl_files = glob.glob(os.path.join(folder, '*.jsonl'))

        if not jsonl_files:
            print(f"No .jsonl files found in '{folder}'.")

        for input_file in jsonl_files:
            # 1. Get the base filename (e.g., 'data_01') without the extension
            base_name = os.path.splitext(os.path.basename(input_file))[0]
            
            # 2. Construct the output TSV filename
            # The output file will be placed in the same directory as the input file
            output_file = os.path.join(folder, f"{base_name}.tsv")
            
            # 3. Run the conversion function
            print(f"  Converting '{os.path.basename(input_file)}' to '{os.path.basename(output_file)}'...")
            jsonl_to_tsv(input_file, output_file)
            
    else:
        print(f"Folder not found: '{folder}'. Please check your path.")

print("\n JSONL to .tsv conversion complete. ")
print("\n Combining all .tsv files...")

# File names you expect in each folder (change if yours are named differently)
EXPECTED_FILES = ["diff.tsv", "lfc.tsv", "p.tsv", "q.tsv", "se.tsv", "W.tsv", "passed_ss.tsv"]

for folder in EXPORTED_FOLDERS:
    if not os.path.isdir(folder):
        print(f" Folder not found: {folder}")
        continue

    print(f"\n Processing folder: {folder}")

    dfs = {}  # dictionary to store all loaded TSVs
    for fname in EXPECTED_FILES:
        file_path = os.path.join(folder, fname)
        if os.path.exists(file_path):
            print(f"   Found: {fname}")
            dfs[fname.replace(".tsv", "")] = pd.read_csv(file_path, sep="\t", index_col=0)
        else:
            print(f"   Missing: {fname}")

    if not dfs:
        print("   No TSV files found — skipping this folder.")
        continue

    # Combine all TSVs into one DataFrame
    combined = pd.concat(dfs, axis=1)

    # Optional: flatten multi-index columns (e.g., diff:ComparisonA)
    combined.columns = [f"{stat}:{col}" for stat, col in combined.columns]

    # Save the final combined file
    output_path = os.path.join(folder, "ancombc2_combined.tsv")
    combined.to_csv(output_path, sep="\t")
    print(f"   Combined table saved: {output_path}")

print("\n All comparison sets processed. ")
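The q columns in the combined table are p-values adjusted for multiple testing. As an illustration of what such an adjustment does, the sketch below implements the Holm step-down method, which is assumed here because it is the default in the ANCOMBC R package; check the settings of your own run before relying on that. The p-values are invented.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, index in enumerate(order):
        # Smallest p-value is multiplied by m, the next by m - 1, and so on
        value = min(1.0, (m - rank) * p_values[index])
        # Enforce that adjusted values never decrease along the ranking
        running_max = max(running_max, value)
        adjusted[index] = running_max
    return adjusted


# Invented p-values for three taxa
toy_p = [0.01, 0.04, 0.03]
toy_q = holm_adjust(toy_p)
print("adjusted:", [round(q, 3) for q in toy_q])

for threshold in (0.05, 0.1, 0.2):
    n_significant = sum(q < threshold for q in toy_q)
    print(f"q < {threshold}: {n_significant} taxa")
```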


In [None]:
# ----------------- CONFIG -----------------
EXPORTED_FOLDERS = [
    'exported/ancombc2-Group-results',
    'exported/ancombc2-E-vs-C-results',
    'exported/ancombc2-E-Modifier-results',
    'exported/ancombc2-C-Modifier-results'
]

OUTPUT_DIR = "differentials_histograms_barplots"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# ----------------- LOAD TAXONOMY -----------------
tax_df = pd.read_csv("exported/exported-taxonomy/taxonomy.tsv", sep="\t", index_col=0)  # index is Feature ID

# ----------------- LOOP THROUGH COMPARISONS -----------------
for folder in EXPORTED_FOLDERS:
    comparison_name = os.path.basename(folder).replace("ancombc2-", "").replace("-results", "")
    print(f"\n📊 Processing {comparison_name}")

    combined_path = os.path.join(folder, "ancombc2_combined.tsv")
    if not os.path.exists(combined_path):
        print(f"❌ Missing {combined_path}, skipping.")
        continue

    df = pd.read_csv(combined_path, sep="\t", index_col=0)

    # --- Map Feature IDs to readable taxa names ---
    df["Taxon"] = df.index.map(lambda x: tax_df.loc[x, "Taxon"] if x in tax_df.index else "Unassigned")

    # --- Find q-value columns ---
    q_cols = [c for c in df.columns if c.startswith("q:")]

    # --- Combined q-value histograms ---
    thresholds = [0.05, 0.1, 0.2]
    if q_cols:
        max_cols = 3
        n_q = len(q_cols)
        n_rows = (n_q + max_cols - 1) // max_cols
        n_cols = min(n_q, max_cols)

        fig, axes = plt.subplots(n_rows, n_cols, figsize=(6*n_cols, 4*n_rows), squeeze=False)

        for i, q_col in enumerate(q_cols):
            row = i // max_cols
            col = i % max_cols
            ax = axes[row, col]

            for t in thresholds:
                sig_count = (df[q_col] < t).sum()
                print(f"  {q_col}: q < {t:.2f}: {sig_count} taxa")

            sns.histplot(df[q_col].dropna(), bins=40, kde=False, ax=ax)
            # Draw one dashed line per threshold, reusing the list defined above
            for t, color in zip(thresholds, ["red", "orange", "green"]):
                ax.axvline(t, color=color, linestyle="--", label=f"{t:.2f}")
            ax.set_title(f"{q_col}")
            ax.set_xlabel("q-value")
            ax.set_ylabel("Count")
            ax.legend()

        # Turn off empty subplots
        total_plots = n_rows * n_cols
        if total_plots > n_q:
            for j in range(n_q, total_plots):
                row = j // max_cols
                col = j % max_cols
                axes[row, col].axis("off")

        plt.suptitle(f"Q-value distributions - {comparison_name}")
        plt.tight_layout(rect=[0, 0, 1, 0.95])
        plt.savefig(os.path.join(OUTPUT_DIR, f"{comparison_name}_qval_hist_combined.png"), dpi=300)
        plt.close()
        print(f"✅ Combined Q-value histogram saved: {comparison_name}_qval_hist_combined.png")

    # --- Identify relevant columns for barplot ---
    lfc_cols = [c for c in df.columns if c.startswith("lfc:")]
    se_cols = [c for c in df.columns if c.startswith("se:")]
    import textwrap
    # --- Build barplot data ---
    barplot_data = []
    for lfc_col in lfc_cols:
        comparison = lfc_col.replace("lfc:", "")
        se_col = f"se:{comparison}"
        q_col = f"q:{comparison}"

        lfc = df[lfc_col]
        se = df[se_col] if se_col in df.columns else pd.Series([None]*len(df), index=df.index)
        q = df[q_col] if q_col in df.columns else pd.Series([1.0]*len(df), index=df.index)

        sig = q < 0.1
        for idx in df.index[sig]:
            barplot_data.append({
                "Taxon": df.loc[idx, "Taxon"],  # readable taxon name
                "Comparison": comparison,
                "lfc": lfc.loc[idx],
                "se": se.loc[idx],
                "q": q.loc[idx]
            })

    barplot_df = pd.DataFrame(barplot_data)

    if not barplot_df.empty:
        # Wrap long taxa names at 20 characters
        barplot_df["Taxon_wrapped"] = barplot_df["Taxon"].apply(lambda x: "\n".join(textwrap.wrap(x, 20)))

        # Adjust figure width based on number of unique taxa
        fig_width = max(12, len(barplot_df["Taxon"].unique()) * 0.6)
        plt.figure(figsize=(fig_width, 6))

        # Create the barplot and get axes object
        ax = sns.barplot(
            data=barplot_df,
            x="Taxon_wrapped", y="lfc", hue="Comparison",
            dodge=True, palette="coolwarm", errorbar=None
        )

        # Add an error bar to each bar, matching rows to bars by height.
        # np.isclose avoids exact float comparison, and bars are not filtered by x
        # because the first category's bars start at a negative x-position.
        # Assumes each (Taxon, Comparison) pair occurs once; seaborn averages duplicates.
        for _, row in barplot_df.iterrows():
            if pd.notna(row["se"]):
                bars = [b for b in ax.patches if np.isclose(b.get_height(), row["lfc"])]
                if bars:
                    bar = bars[0]  # take first matching bar
                    ax.errorbar(
                        bar.get_x() + bar.get_width() / 2,  # center of bar
                        bar.get_height(),
                        yerr=row["se"],
                        fmt='none', c='black', capsize=3
                    )

        plt.xticks(rotation=80, ha="right")
        plt.ylabel("Log fold change (lfc)")
        plt.title(f"Differential taxa barplot - {comparison_name} (q < 0.1)")
        plt.tight_layout()
        plt.savefig(os.path.join(OUTPUT_DIR, f"{comparison_name}_barplot.png"), dpi=300)
        plt.close()
        print(f"✅ Barplot saved for {comparison_name}")
    else:
        print("⚠️ No significant taxa for barplot (q < 0.1).")



print("\n🎉 Finished! Results in:", OUTPUT_DIR)
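
As an alternative to the row-by-row loop that fills `barplot_data`, the same long-format table could be built with vectorized pandas operations. The following is a sketch on made-up data, not the notebook's own method: the taxa names, the `GroupB` comparison, and the `Q_CUTOFF` constant are assumptions for illustration.

```python
import pandas as pd

# Hypothetical combined table with the "stat:comparison" column layout
df = pd.DataFrame(
    {
        "lfc:GroupB": [1.5, -0.2, 0.8],
        "se:GroupB": [0.3, 0.1, 0.4],
        "q:GroupB": [0.01, 0.50, 0.09],
        "Taxon": ["g__Alpha", "g__Beta", "g__Gamma"],
    },
    index=["f1", "f2", "f3"],
)

Q_CUTOFF = 0.1  # assumed to match the cutoff used in the barplot cell

frames = []
for lfc_col in [c for c in df.columns if c.startswith("lfc:")]:
    comparison = lfc_col.split(":", 1)[1]
    # reindex(columns=...) fills a missing se/q column with NaN instead of raising
    sub = df.reindex(columns=["Taxon", lfc_col, f"se:{comparison}", f"q:{comparison}"])
    sub.columns = ["Taxon", "lfc", "se", "q"]
    # NaN < cutoff is False, so comparisons without q-values yield no rows
    sub = sub[sub["q"] < Q_CUTOFF].assign(Comparison=comparison)
    frames.append(sub)

barplot_df = pd.concat(frames, ignore_index=True)
print(barplot_df)
```

Note that `pd.concat` raises a `ValueError` on an empty list, so a folder with no `lfc:` columns would need a guard before the final concatenation.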


In Section 7, we create cladograms from the trees we constructed earlier. We will once again be using R, so run the cell below to load the R magic if you haven't already loaded it in the current session.

In [1]:
%load_ext rpy2.ipython

In [2]:
import rpy2
print(rpy2.__file__)

/home/patwuch/miniforge3/envs/qiime2-2025.7/lib/python3.10/site-packages/rpy2/__init__.py


In [20]:
%%R

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install ggtree if not present
if (!requireNamespace("ggtree", quietly = TRUE))
    BiocManager::install("ggtree", update=FALSE, ask=FALSE)

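# Install ggplot2, dplyr and readr only if they are missing
cran_pkgs <- c("ggplot2", "dplyr", "readr")
missing_pkgs <- cran_pkgs[!sapply(cran_pkgs, requireNamespace, quietly = TRUE)]
if (length(missing_pkgs) > 0)
    install.packages(missing_pkgs)

# ----------------- LOAD LIBRARIES -----------------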
library(ggtree)
library(ggplot2)
library(dplyr)
library(readr)
library(treeio)

# ----------------- CONFIG -----------------
tree_file <- "exported/rooted-tree/tree.nwk"
output_dir <- "cladograms"
dir.create(output_dir, showWarnings = FALSE)

# ----------------- READ & VISUALIZE TREE -----------------
# Load tree
tree <- read.tree(tree_file)

# Generate circular cladogram
p <- ggtree(tree, layout="circular") +
    geom_tippoint(color="black", size=2) +
    theme(legend.position="none") +
    ggtitle("Circular Cladogram - Full Phylogenetic Tree")

# Save PNG
ggsave(filename=file.path(output_dir, "full_cladogram_R.png"),
       plot=p, width=8, height=8, dpi=300)

cat("✅ Full cladogram saved to", file.path(output_dir, "full_cladogram_R.png"), "\n")





✅ Full cladogram saved to cladograms/full_cladogram_R.png 
