# Phase 3: ARG Distribution Analysis

This notebook analyzes the distribution of antibiotic resistance genes across:
- Individual species and genera
- Phylogenetic clades
- Estimated ecological environments

## Metrics Calculated
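- ARG prevalence (% of genomes in a species carrying at least one ARG)
- ARG diversity (unique ARGs per species)
- Hotspot scores (combined metric of prevalence, diversity, and mean genes per genome, used to rank species)
- Statistical tests to flag species that are significant outliers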

In [None]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Load ARG annotations from Phase 2 (uncomment once the Phase 2 output file exists)
# arg_df = pd.read_csv('../data/arg_annotations.csv')

print("Phase 3: ARG Distribution Analysis")

## Compute ARG Prevalence by Species

In [None]:
# TODO: Run this query against BERDL to calculate, per species:
# - Total genomes
# - Genomes with at least one ARG
# - Unique ARGs
# - Prevalence = (genomes_with_ARG / total_genomes) * 100
# Note: This assumes ARGs have already been identified and saved to a temporary table or dataframe

prevalence_query = """
SELECT 
    s.GTDB_species,
    COUNT(DISTINCT g.genome_id) as total_genomes,
    COUNT(DISTINCT CASE WHEN arg_genes.gene_id IS NOT NULL THEN g.genome_id END) as genomes_with_arg,
    COUNT(DISTINCT arg_genes.gene_id) as unique_args,
    ROUND(100.0 * COUNT(DISTINCT CASE WHEN arg_genes.gene_id IS NOT NULL THEN g.genome_id END) / 
          COUNT(DISTINCT g.genome_id), 2) as prevalence_percent
FROM kbase_ke_pangenome.genome g
JOIN kbase_ke_pangenome.gtdb_species_clade s 
    ON g.gtdb_species_clade_id = s.gtdb_species_clade_id
LEFT JOIN arg_genes ON g.genome_id = arg_genes.genome_id
GROUP BY s.GTDB_species
ORDER BY prevalence_percent DESC
"""

print("Prevalence query prepared")
print("Note: This query assumes arg_genes table/view exists with columns: gene_id, genome_id")
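
The same aggregation can be sketched in pandas, which may help when checking the SQL against a small sample. This is a minimal sketch under stated assumptions: `genomes` (columns `genome_id`, `GTDB_species`) and `arg_genes` (columns `gene_id`, `genome_id`) are hypothetical DataFrames, `species_prevalence` is an illustrative helper, and the toy rows are invented for the example rather than taken from BERDL.

```python
import pandas as pd

def species_prevalence(genomes: pd.DataFrame, arg_genes: pd.DataFrame) -> pd.DataFrame:
    """Per-species ARG prevalence, mirroring the LEFT JOIN / GROUP BY in the SQL."""
    # LEFT JOIN: genomes without any ARG hit are kept, with gene_id = NaN
    merged = genomes.merge(arg_genes, on="genome_id", how="left")

    total = merged.groupby("GTDB_species")["genome_id"].nunique().rename("total_genomes")
    with_arg = (
        merged.dropna(subset=["gene_id"])
        .groupby("GTDB_species")["genome_id"]
        .nunique()
        .rename("genomes_with_arg")
    )
    # nunique() ignores NaN, so species without ARGs get 0 here
    unique_args = merged.groupby("GTDB_species")["gene_id"].nunique().rename("unique_args")

    out = pd.concat([total, with_arg, unique_args], axis=1)
    # Species with no ARG-carrying genome are missing from with_arg: fill with 0
    out["genomes_with_arg"] = out["genomes_with_arg"].fillna(0).astype(int)
    out["prevalence_percent"] = (
        100.0 * out["genomes_with_arg"] / out["total_genomes"]
    ).round(2)
    out = out.sort_values("prevalence_percent", ascending=False)
    return out.rename_axis("GTDB_species").reset_index()

# Toy input (hypothetical identifiers, for illustration only)
genomes = pd.DataFrame({
    "genome_id": ["g1", "g2", "g3", "g4"],
    "GTDB_species": ["s__A", "s__A", "s__B", "s__B"],
})
arg_genes = pd.DataFrame({
    "gene_id": ["blaTEM", "tetA", "blaTEM", "blaTEM"],
    "genome_id": ["g1", "g1", "g3", "g4"],
})

prevalence = species_prevalence(genomes, arg_genes)
print(prevalence)
```

The left merge matters here: an inner join would silently drop genomes without ARGs and inflate prevalence for every species.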

## Identify Hotspot Species

In [None]:
# TODO: Calculate hotspot scores and identify statistical outliers
# Hotspot score = function of (prevalence, diversity, mean_genes_per_genome)

print("Hotspot identification to be implemented...")
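
The notebook does not yet fix how the three inputs are combined, so the following is only one possible definition, offered as an assumption: standardize each metric to a z-score, average them into a hotspot score, and flag species whose score lies more than `threshold` standard deviations above the mean. The column names `prevalence_percent`, `unique_args`, and `mean_genes_per_genome`, the helper names, the threshold of 2.0, and the toy values are all illustrative.

```python
import numpy as np
import pandas as pd

def zscore(col: pd.Series) -> pd.Series:
    """Standardize a column; a constant column maps to all zeros."""
    std = col.std(ddof=0)
    if std == 0 or np.isnan(std):
        return pd.Series(0.0, index=col.index)
    return (col - col.mean()) / std

def hotspot_scores(
    metrics: pd.DataFrame,
    cols=("prevalence_percent", "unique_args", "mean_genes_per_genome"),
    threshold: float = 2.0,
) -> pd.DataFrame:
    """Assumed scoring: equal-weight mean of per-metric z-scores."""
    z = pd.DataFrame({c: zscore(metrics[c].astype(float)) for c in cols})
    out = metrics.copy()
    out["hotspot_score"] = z.mean(axis=1)
    # Outlier = score more than `threshold` SDs above the mean score
    out["is_outlier"] = zscore(out["hotspot_score"]) > threshold
    return out.sort_values("hotspot_score", ascending=False)

# Toy metrics (invented values): S0 is deliberately extreme
metrics = pd.DataFrame({
    "GTDB_species": [f"S{i}" for i in range(8)],
    "prevalence_percent": [95, 10, 12, 11, 9, 10, 11, 12],
    "unique_args": [40, 3, 4, 3, 2, 3, 4, 3],
    "mean_genes_per_genome": [6.0, 0.2, 0.3, 0.2, 0.1, 0.2, 0.3, 0.2],
})

scored = hotspot_scores(metrics)
print(scored[["GTDB_species", "hotspot_score", "is_outlier"]])
```

Equal weighting is a design choice, not a requirement. Z-scores are also sensitive to skewed distributions, so a rank-based or median/MAD variant may be worth comparing once real data is available.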

## Phylogenetic and Environmental Patterns

In [None]:
# TODO: Analyze ARG patterns at higher taxonomic levels
# - Distribution across phyla
# - Distribution across orders
# - Correlation with estimated environment (if available)

print("Phylogenetic analysis to be implemented...")
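
A rough sketch of how the higher-level aggregation and the environment comparison might look. It assumes a hypothetical per-genome DataFrame with a `gtdb_taxonomy` column holding a semicolon-delimited GTDB-style lineage (`d__...;p__...;...;s__...`), a boolean `has_arg` column, and an `environment` label; none of these names come from the BERDL schema, and the taxa and rows below are invented. The chi-square test is one reasonable choice for the environment association, not the notebook's confirmed method.

```python
import pandas as pd
from scipy import stats

RANK_PREFIXES = {"d": "domain", "p": "phylum", "c": "class", "o": "order",
                 "f": "family", "g": "genus", "s": "species"}

def split_lineage(lineage: str) -> dict:
    """Split 'd__X;p__Y;...;s__Z' into {rank_name: taxon_name}."""
    ranks = {}
    for part in lineage.split(";"):
        prefix, _, name = part.strip().partition("__")
        if prefix in RANK_PREFIXES:
            ranks[RANK_PREFIXES[prefix]] = name
    return ranks

def prevalence_by_rank(genomes: pd.DataFrame, rank: str) -> pd.DataFrame:
    """ARG prevalence aggregated at a chosen taxonomic rank."""
    taxa = pd.DataFrame(genomes["gtdb_taxonomy"].map(split_lineage).tolist(),
                        index=genomes.index)
    summary = (
        genomes.assign(**{rank: taxa[rank]})
        .groupby(rank)["has_arg"]
        .agg(total_genomes="size", genomes_with_arg="sum")
    )
    summary["prevalence_percent"] = (
        100.0 * summary["genomes_with_arg"] / summary["total_genomes"]
    ).round(2)
    return summary.sort_values("prevalence_percent", ascending=False)

def environment_association(genomes: pd.DataFrame):
    """Chi-square test of independence: environment vs. ARG presence."""
    table = pd.crosstab(genomes["environment"], genomes["has_arg"])
    chi2, p_value, dof, _expected = stats.chi2_contingency(table)
    return table, chi2, p_value, dof

# Toy genomes (invented lineages and labels)
lineage_a = "d__Bacteria;p__PhylumA;c__ClassA;o__OrderA;f__FamA;g__GenA;s__GenA sp1"
lineage_b = "d__Bacteria;p__PhylumB;c__ClassB;o__OrderB;f__FamB;g__GenB;s__GenB sp1"
toy_genomes = pd.DataFrame({
    "gtdb_taxonomy": [lineage_a] * 4 + [lineage_b] * 4,
    "has_arg": [True, True, True, False, False, False, True, False],
    "environment": ["clinical", "clinical", "clinical", "soil",
                    "soil", "soil", "clinical", "soil"],
})

by_phylum = prevalence_by_rank(toy_genomes, "phylum")
table, chi2, p_value, dof = environment_association(toy_genomes)
print(by_phylum)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")
```

With counts as small as in this toy table, Fisher's exact test (`scipy.stats.fisher_exact` for 2x2 tables) would be the safer choice. Genomes within a species are also not independent samples, so any p-value from a genome-level table should be read with caution.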

## Visualization

In [None]:
# TODO: Create visualizations:
# - Top 20 hotspot species
# - ARG prevalence distribution
# - Phylogenetic tree with ARG overlay

print("Visualizations to be generated...")
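
The first two plots could be drafted as follows. This is a sketch only: it assumes a hypothetical `scored` DataFrame with `GTDB_species`, `hotspot_score`, and `prevalence_percent` columns, and it fills that table with random placeholder values so the layout can be checked before real results exist. The phylogenetic tree overlay is left out because it depends on a tree file and a tree-plotting library that the notebook has not yet chosen.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_hotspots(scored: pd.DataFrame, top_n: int = 20):
    """Bar chart of the top-N hotspot species plus a prevalence histogram."""
    top = scored.nlargest(top_n, "hotspot_score")
    fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(12, 6))

    # Horizontal bars keep long species names readable
    ax_bar.barh(top["GTDB_species"], top["hotspot_score"])
    ax_bar.invert_yaxis()  # highest score at the top
    ax_bar.set_xlabel("Hotspot score")
    ax_bar.set_title(f"Top {len(top)} hotspot species")

    # Fixed 0-100 range so plots are comparable between runs
    ax_hist.hist(scored["prevalence_percent"], bins=20, range=(0, 100))
    ax_hist.set_xlabel("ARG prevalence (% of genomes)")
    ax_hist.set_ylabel("Number of species")
    ax_hist.set_title("ARG prevalence distribution")

    fig.tight_layout()
    return fig, ax_bar, ax_hist

# Placeholder data (random, not real results)
rng = np.random.default_rng(0)
scored = pd.DataFrame({
    "GTDB_species": [f"s__Species_{i}" for i in range(25)],
    "hotspot_score": rng.normal(size=25),
    "prevalence_percent": rng.uniform(0, 100, size=25),
})

fig, ax_bar, ax_hist = plot_hotspots(scored)
```

Returning the figure and axes rather than calling `plt.show()` inside the helper lets the notebook display the figure inline and also save it with `fig.savefig(...)` if needed.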