# Phase 2: Identify Antibiotic Resistance Genes (ARGs)

This notebook identifies antibiotic resistance genes in the pangenome collection by:
1. Querying for functional annotations matching ARG databases
2. Creating an ARG annotation dataset
3. Linking genes to resistance mechanisms and drug classes

## Data Sources for ARG Detection
- CARD (Comprehensive Antibiotic Resistance Database)
- ResFinder database
- PATRIC resistance annotations
- Keyword matching in gene descriptions/functions

In [None]:
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
import re

# Spark session should be initialized from notebook 01
print("Phase 2: ARG Identification")

## Step 1: Download and Parse ARG Reference Databases

TODO: Fetch and parse the CARD, ResFinder, and PATRIC databases.

In [None]:
# Placeholder: Create ARG reference dataset
# In practice, this would involve:
# 1. Downloading CARD database
# 2. Parsing ResFinder annotations
# 3. Creating BLASTP database for homology matching

arg_keywords = [
    'antibiotic', 'resistance', 'resistant', 'beta-lactamase', 'efflux',
    'ampicillin', 'penicillin', 'tetracycline', 'fluoroquinolone', 'macrolide',
    'vancomycin', 'methicillin', 'carbapenem', 'cephalosporin'
]

print(f"ARG keywords for detection: {len(arg_keywords)}")
print(arg_keywords)
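
As a possible starting point for the TODO above, the sketch below parses a CARD-style tab-separated index into a lookup keyed by gene name. The column names (`ARO Accession`, `ARO Name`, `Drug Class`, `Resistance Mechanism`) are assumed to match CARD's index download and should be checked against the real file. The helper name `parse_arg_index` is hypothetical, and the two sample rows are placeholders rather than real CARD records.

```python
import csv
import io

# Illustrative stand-in for a CARD index file.
# The rows are placeholders, not real CARD records.
SAMPLE_TSV = (
    "ARO Accession\tARO Name\tDrug Class\tResistance Mechanism\n"
    "ARO:0000001\texampleA\tpenam;cephalosporin\tantibiotic inactivation\n"
    "ARO:0000002\texampleB\ttetracycline antibiotic\tantibiotic efflux\n"
)


def parse_arg_index(handle):
    """Parse a tab-separated ARG index into {lowercased gene name: annotation}."""
    reference = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        reference[row["ARO Name"].lower()] = {
            "accession": row["ARO Accession"],
            # Assumption: multiple drug classes are separated by ';'.
            "drug_classes": [
                c.strip() for c in row["Drug Class"].split(";") if c.strip()
            ],
            "mechanism": row["Resistance Mechanism"],
        }
    return reference


arg_reference = parse_arg_index(io.StringIO(SAMPLE_TSV))
print(arg_reference["examplea"]["drug_classes"])
```

Passing an open file handle instead of `io.StringIO` would let the same function read a downloaded index from disk.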

## Step 2: Query BERDL for Gene Annotations

In [None]:
# Query genes with functional annotations
# This assumes gene descriptions are available in the database

genes_query = """
SELECT 
    gene_id,
    orthogroup_id,
    genome_id,
    gene_description,
    gene_function
FROM kbase_ke_pangenome.gene
WHERE gene_description IS NOT NULL
    OR gene_function IS NOT NULL
LIMIT 1000
"""

# The query is only defined here, not executed. Table and column names
# will be finalized after exploring the table structure in notebook 01.
print("Query prepared for gene annotation retrieval")
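
Once the schema is confirmed, running the query could look like the sketch below. It assumes a `SparkSession` named `spark` is already available from notebook 01; the helper name `fetch_gene_annotations` and its `max_rows` parameter are hypothetical. The table and column names in `genes_query` are still unverified, so this is a sketch rather than a tested call against BERDL.

```python
def fetch_gene_annotations(spark, query, max_rows=1000):
    """Run `query` through an existing SparkSession and collect the result.

    Returns a pandas DataFrame. Rows are capped on the Spark side before
    collection so a broad query cannot exhaust driver memory.
    """
    spark_df = spark.sql(query)
    return spark_df.limit(max_rows).toPandas()


# Usage, assuming `spark` and `genes_query` exist in the notebook session:
# genes_df = fetch_gene_annotations(spark, genes_query)
# genes_df.head()
```

Keeping the row cap as a parameter makes it easy to raise once the query has been checked on a small sample.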

## Step 3: Identify ARGs by Keyword Matching

In [None]:
# Keyword matching logic (case-insensitive)
# TODO: Extract and annotate ARGs from gene descriptions and annotations

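def extract_drug_class(description):
    """Return the first drug class whose keywords occur in a gene description.

    Matching is case-insensitive and substring-based. Classes are checked in
    the order listed below, so a description mentioning several classes is
    assigned only the first one. Returns None if nothing matches.
    """
    if not description:
        return None

    # Caveat: short gene-symbol keywords ('tet', 'erm', 'cam', 'aac', ...) are
    # matched as plain substrings and can fire inside unrelated words, e.g.
    # 'erm' in 'permease'. Treat hits as candidates to be reviewed.
    drug_mappings = {
        'beta-lactam': ['beta-lactam', 'beta-lactamase', 'penicillin', 'ampicillin', 'cephalosporin', 'carbapenem', 'beta_lactam'],
        'tetracycline': ['tetracycline', 'tet', 'oxytetracycline'],
        'fluoroquinolone': ['fluoroquinolone', 'quinolone', 'gyra', 'gyrb', 'qnr'],
        'macrolide': ['macrolide', 'erythromycin', 'erm', 'mls'],
        'vancomycin': ['vancomycin', 'vana', 'vanb', 'vanc', 'vand'],
        'aminoglycoside': ['aminoglycoside', 'kanamycin', 'gentamicin', 'streptomycin', 'aac', 'aad', 'ags'],
        'sulfonamide': ['sulfonamide', 'sulfon', 'dhps'],
        'chloramphenicol': ['chloramphenicol', 'cam'],
    }

    desc_lower = description.lower()
    for drug_class, keywords in drug_mappings.items():
        if any(kw in desc_lower for kw in keywords):
            return drug_class

    return None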

print("ARG detection functions prepared with case-insensitive matching")
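
The TODO above could be approached along the following lines: combine the description and function text for each gene, search it for any of the ARG keywords from Step 1, and keep the rows that match. This is a minimal sketch, assuming a pandas DataFrame with the `gene_id`, `gene_description`, and `gene_function` columns named in the Step 2 query. The names `ARG_KEYWORDS`, `find_arg_keyword`, and `flag_arg_candidates` are hypothetical, and the three sample genes are made up for illustration.

```python
import re

import pandas as pd

# Same terms as `arg_keywords` in Step 1, repeated so this cell runs on its own.
ARG_KEYWORDS = [
    'antibiotic', 'resistance', 'resistant', 'beta-lactamase', 'efflux',
    'ampicillin', 'penicillin', 'tetracycline', 'fluoroquinolone', 'macrolide',
    'vancomycin', 'methicillin', 'carbapenem', 'cephalosporin',
]

# One compiled, case-insensitive pattern covering every keyword.
ARG_PATTERN = re.compile(
    "|".join(re.escape(k) for k in ARG_KEYWORDS), re.IGNORECASE
)


def find_arg_keyword(text):
    """Return the leftmost ARG keyword in `text` (lowercased), or None."""
    if not isinstance(text, str):
        return None
    match = ARG_PATTERN.search(text)
    return match.group(0).lower() if match else None


def flag_arg_candidates(genes):
    """Keep genes whose description or function mentions an ARG keyword."""
    out = genes.copy()
    # Missing annotations become empty strings so the two fields can be joined.
    combined = (
        out["gene_description"].fillna("") + " " + out["gene_function"].fillna("")
    )
    out["matched_keyword"] = combined.map(find_arg_keyword)
    return out[out["matched_keyword"].notna()].reset_index(drop=True)


# Made-up example rows, not real database records.
genes = pd.DataFrame({
    "gene_id": ["g1", "g2", "g3"],
    "gene_description": [
        "Class A beta-lactamase", "DNA polymerase III subunit alpha", None
    ],
    "gene_function": [None, "DNA replication", "Multidrug efflux pump"],
})
candidates = flag_arg_candidates(genes)
print(candidates[["gene_id", "matched_keyword"]])
```

Broad terms such as `resistance` and `efflux` will also match genes unrelated to antibiotics (for example heavy-metal resistance), so the result is best treated as a candidate list for later filtering.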

## Step 4: Create ARG Annotation Dataset

In [None]:
# TODO: Create comprehensive ARG annotation table
# Columns: gene_id, orthogroup_id, arg_name, drug_class, resistance_mechanism, source_db, confidence

print("ARG annotation dataset creation in progress...")
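
One way to assemble the table described in the TODO is sketched below. It assumes keyword-matched candidates are held in a pandas DataFrame with `gene_id`, `orthogroup_id`, and `gene_description` columns, and it takes the drug-class function as an argument so that `extract_drug_class` from Step 3 can be passed in. The helper name `build_arg_annotations` and the placeholder values `keyword_match` and `low` are assumptions, not settled conventions for this project.

```python
import pandas as pd

ANNOTATION_COLUMNS = [
    "gene_id", "orthogroup_id", "arg_name", "drug_class",
    "resistance_mechanism", "source_db", "confidence",
]


def build_arg_annotations(candidates, drug_class_fn):
    """Assemble the ARG annotation table from keyword-matched candidate genes."""
    out = pd.DataFrame({
        "gene_id": candidates["gene_id"],
        "orthogroup_id": candidates["orthogroup_id"],
        # Keyword hits carry no curated gene name, so reuse the description.
        "arg_name": candidates["gene_description"],
        # Skip missing descriptions instead of passing NaN to the function.
        "drug_class": candidates["gene_description"].map(
            drug_class_fn, na_action="ignore"
        ),
        # A mechanism needs a curated database; left empty for keyword hits.
        "resistance_mechanism": None,
        "source_db": "keyword_match",
        "confidence": "low",
    })
    return out[ANNOTATION_COLUMNS].reset_index(drop=True)


def toy_drug_class(description):
    """Stand-in for extract_drug_class from Step 3, for this example only."""
    return "beta-lactam" if "beta-lactamase" in description.lower() else None


# Made-up example rows, not real database records.
example_candidates = pd.DataFrame({
    "gene_id": ["g1", "g2"],
    "orthogroup_id": ["og1", "og2"],
    "gene_description": ["Class A beta-lactamase", "Multidrug efflux pump"],
})
arg_annotations = build_arg_annotations(example_candidates, toy_drug_class)
print(arg_annotations)
```

Rows matched against CARD or ResFinder could later overwrite `source_db`, `resistance_mechanism`, and `confidence` with database-backed values.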

## Next Steps

1. Save ARG annotation dataset to data/arg_annotations.csv
2. Continue to notebook 03_distribution_analysis.ipynb for prevalence analysis
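
The save step could be as small as the sketch below, which assumes the annotations are held in a pandas DataFrame. The helper name `save_arg_annotations` is hypothetical; the default path is the one given in step 1 above.

```python
from pathlib import Path

import pandas as pd


def save_arg_annotations(annotations, path="data/arg_annotations.csv"):
    """Write the annotation table to CSV, creating the folder if needed."""
    out_path = Path(path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # index=False keeps the pandas row index out of the file.
    annotations.to_csv(out_path, index=False)
    return out_path
```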