# Notebook 01: Data Exploration & Lifestyle Classification

**Project**: Lifestyle-Based COG Stratification

**Goal**: Assess `ncbi_env` coverage, build a genome-level lifestyle classifier, and identify target species for COG analysis.

**Data Sources**:
- `kbase_ke_pangenome.ncbi_env` — NCBI environment metadata (4.1M rows, EAV format)
- `kbase_ke_pangenome.genome` — Genome metadata (293K rows)
- `kbase_ke_pangenome.pangenome` — Per-species pangenome stats (27K rows)
- `kbase_ke_pangenome.gtdb_species_clade` — Taxonomy (27K rows)

**Output**: `../data/species_lifestyle_classification.csv`

In [None]:
spark = get_spark_session()

## 1. Table Row Counts

Verify expected table sizes before analysis.
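
One way to make this check explicit is to compare the observed counts against the approximate sizes listed under Data Sources. The following is a minimal sketch in plain Python, so it runs without Spark. `EXPECTED_ROWS`, `check_counts`, the 10% tolerance, and the example counts are illustrative assumptions, not part of the notebook.

```python
# Approximate sizes from the Data Sources list (rounded, so compare loosely).
EXPECTED_ROWS = {
    'ncbi_env': 4_100_000,
    'genome': 293_000,
    'pangenome': 27_000,
    'gtdb_species_clade': 27_000,
}


def check_counts(observed, expected=EXPECTED_ROWS, tolerance=0.10):
    """Return {table: message} for tables that are missing or off by more than `tolerance`."""
    problems = {}
    for table, want in expected.items():
        got = observed.get(table)
        if got is None:
            problems[table] = 'no count collected'
        elif abs(got - want) / want > tolerance:
            problems[table] = f'expected ~{want:,}, got {got:,}'
    return problems


# Hypothetical counts, standing in for the values printed by the counting cell.
example_counts = {'ncbi_env': 4_120_000, 'genome': 150_000, 'pangenome': 27_400}
print(check_counts(example_counts))
```

An empty result would mean every listed table is present and within tolerance of its documented size.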

In [None]:
tables = ['ncbi_env', 'genome', 'sample', 'pangenome', 'gtdb_species_clade']
for t in tables:
    cnt = spark.sql(f"SELECT COUNT(*) as cnt FROM kbase_ke_pangenome.{t}").collect()[0]['cnt']
    print(f"{t}: {cnt:,} rows")

## 2. Explore ncbi_env Structure

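The `ncbi_env` table is in EAV (entity-attribute-value) format: each row is one (BioSample, attribute, content) triple. In the queries below these are the `accession`, `harmonized_name`, and `content` columns.
We need to find which `harmonized_name` values are most populated and most useful for lifestyle classification.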
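
To make the EAV layout concrete, here is a small sketch in plain Python that collapses triples into one record per BioSample. The accessions and values are made up for illustration, and `pivot_eav` is a hypothetical helper, not a function from the notebook.

```python
from collections import defaultdict

# Hypothetical EAV rows: (accession, harmonized_name, content).
rows = [
    ('SAMN_A', 'host', 'Homo sapiens'),
    ('SAMN_A', 'isolation_source', 'blood'),
    ('SAMN_B', 'isolation_source', 'soil'),
    ('SAMN_B', 'env_broad_scale', 'terrestrial biome'),
]


def pivot_eav(rows):
    """Collapse EAV triples into one dict of attributes per accession."""
    wide = defaultdict(dict)
    for accession, name, content in rows:
        wide[accession][name] = content
    return dict(wide)


wide = pivot_eav(rows)
print(wide['SAMN_A'])
print(wide['SAMN_B'].get('host'))  # this BioSample has no 'host' row at all
```

An attribute that was never recorded has no row at all, so absence has to be handled separately from placeholder values such as 'not applicable'.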

In [None]:
# What harmonized_name values exist and how many biosamples have them?
attr_counts = spark.sql("""
    SELECT harmonized_name, COUNT(DISTINCT accession) as n_biosamples
    FROM kbase_ke_pangenome.ncbi_env
    GROUP BY harmonized_name
    ORDER BY n_biosamples DESC
    LIMIT 30
""")
attr_counts.show(30, truncate=False)

In [None]:
# Sample rows to understand the data format
spark.sql("""
    SELECT accession, harmonized_name, content
    FROM kbase_ke_pangenome.ncbi_env
    WHERE harmonized_name IN ('host', 'isolation_source', 'env_broad_scale', 'env_local_scale', 'env_medium')
    LIMIT 20
""").show(20, truncate=False)

## 3. Assess Genome-to-ncbi_env Join Coverage

How many of the 293K genomes have `ncbi_biosample_id` values that match `ncbi_env.accession`?
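
The three numbers of interest (total genomes, genomes with a BioSample ID, genomes that actually join) can be illustrated on toy data. This is a sketch in plain Python under assumed inputs; `join_coverage` and the sample IDs are hypothetical.

```python
def join_coverage(genome_biosamples, env_accessions):
    """Summarize how many genomes can be linked to the environment table.

    genome_biosamples: {genome_id: biosample_id or None/''}
    env_accessions: iterable of accessions (repeats are expected in EAV data)
    """
    env = set(env_accessions)
    # Drop NULL and empty-string IDs, mirroring the SQL CASE expression.
    usable = {gid: bs for gid, bs in genome_biosamples.items() if bs}
    matched = [gid for gid, bs in usable.items() if bs in env]
    total = len(genome_biosamples)
    return {
        'total_genomes': total,
        'has_biosample': len(usable),
        'genomes_with_env': len(matched),
        'pct_with_env': 100.0 * len(matched) / total if total else 0.0,
    }


genomes = {'G1': 'SAMN_A', 'G2': 'SAMN_B', 'G3': '', 'G4': None, 'G5': 'SAMN_Z'}
env_accessions = ['SAMN_A', 'SAMN_A', 'SAMN_B']
stats = join_coverage(genomes, env_accessions)
print(stats)
```

A large gap between `has_biosample` and `genomes_with_env` would point to IDs that exist on the genome side but have no environment rows.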

In [None]:
# Check how many genomes have non-null biosample IDs
spark.sql("""
    SELECT
        COUNT(*) as total_genomes,
        SUM(CASE WHEN ncbi_biosample_id IS NOT NULL AND ncbi_biosample_id != '' THEN 1 ELSE 0 END) as has_biosample,
        SUM(CASE WHEN ncbi_biosample_id IS NOT NULL AND ncbi_biosample_id != '' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) as pct_with_biosample
    FROM kbase_ke_pangenome.genome
""").show()

In [None]:
# Check how many genomes actually join to ncbi_env
spark.sql("""
    SELECT
        COUNT(DISTINCT g.genome_id) as genomes_with_env
    FROM kbase_ke_pangenome.genome g
    JOIN kbase_ke_pangenome.ncbi_env ne
        ON g.ncbi_biosample_id = ne.accession
""").show()

In [None]:
# How many genomes have lifestyle-relevant attributes specifically?
spark.sql("""
    SELECT
        ne.harmonized_name,
        COUNT(DISTINCT g.genome_id) as n_genomes
    FROM kbase_ke_pangenome.genome g
    JOIN kbase_ke_pangenome.ncbi_env ne
        ON g.ncbi_biosample_id = ne.accession
    WHERE ne.harmonized_name IN ('host', 'isolation_source', 'env_broad_scale', 'env_local_scale', 'env_medium')
    GROUP BY ne.harmonized_name
    ORDER BY n_genomes DESC
""").show()

## 4. Examine Content Values for Lifestyle Classification

For each lifestyle-relevant attribute, examine the most common content values.
This will help us build classification rules.
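
The filtering and counting pattern used in the next cells can be sketched in plain Python. The sentinel list is the one used in the queries; `top_values` and the example values are illustrative assumptions.

```python
from collections import Counter

# Placeholder strings treated as "no information" (compared in lowercase).
SENTINELS = {'not applicable', 'not collected', 'missing', 'na', 'n/a',
             'none', 'unknown', ''}


def top_values(contents, n=3):
    """Count content values, skipping NULLs and sentinel placeholders."""
    kept = [c for c in contents
            if c is not None and c.lower() not in SENTINELS]
    return Counter(kept).most_common(n)


host_values = ['Homo sapiens', 'homo sapiens', 'Not Applicable', None,
               'Homo sapiens', 'Bos taurus', 'missing', '']
print(top_values(host_values))
```

As in the SQL, the sentinel check is case-insensitive but the grouping is on the raw string, so 'Homo sapiens' and 'homo sapiens' are counted separately. That is worth keeping in mind when reading the top-30 tables.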

In [None]:
# Top 'host' values
spark.sql("""
    SELECT content, COUNT(*) as cnt
    FROM kbase_ke_pangenome.ncbi_env
    WHERE harmonized_name = 'host'
        AND content IS NOT NULL
        AND content != ''
        AND LOWER(content) NOT IN ('not applicable', 'not collected', 'missing', 'na', 'n/a', 'none', 'unknown')
    GROUP BY content
    ORDER BY cnt DESC
    LIMIT 30
""").show(30, truncate=False)

In [None]:
# Top 'isolation_source' values
spark.sql("""
    SELECT content, COUNT(*) as cnt
    FROM kbase_ke_pangenome.ncbi_env
    WHERE harmonized_name = 'isolation_source'
        AND content IS NOT NULL
        AND content != ''
        AND LOWER(content) NOT IN ('not applicable', 'not collected', 'missing', 'na', 'n/a', 'none', 'unknown')
    GROUP BY content
    ORDER BY cnt DESC
    LIMIT 30
""").show(30, truncate=False)

In [None]:
# Top 'env_broad_scale' values
spark.sql("""
    SELECT content, COUNT(*) as cnt
    FROM kbase_ke_pangenome.ncbi_env
    WHERE harmonized_name = 'env_broad_scale'
        AND content IS NOT NULL
        AND content != ''
        AND LOWER(content) NOT IN ('not applicable', 'not collected', 'missing', 'na', 'n/a', 'none', 'unknown')
    GROUP BY content
    ORDER BY cnt DESC
    LIMIT 30
""").show(30, truncate=False)

## 5. Build Lifestyle Classifier

Classification rules (refine after inspecting content values above):

**Host-associated**: Genome has a valid `host` attribute (not a sentinel such as 'not applicable') OR `isolation_source` contains host-related keywords (blood, sputum, wound, stool, urine, clinical, patient, etc.)

**Free-living**: Genome is not host-associated under the rule above (no valid `host` attribute and no host-related keyword in `isolation_source`) AND `isolation_source` or `env_broad_scale` contains environmental keywords (soil, water, marine, freshwater, sediment, rhizosphere, etc.)

**Ambiguous/Unknown**: Everything else
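
The rules above can be sketched for a single record in plain Python, which is convenient for trying keyword changes before running them in Spark. The keyword lists here are abbreviated, and `contains_any` and `classify` are hypothetical helpers rather than the notebook's implementation.

```python
NA_VALUES = {'not applicable', 'not collected', 'missing', 'na', 'n/a',
             'none', 'unknown', ''}
HOST_KEYWORDS = ['blood', 'stool', 'patient']   # abbreviated for illustration
ENV_KEYWORDS = ['soil', 'water', 'sediment']    # abbreviated for illustration


def contains_any(text, keywords):
    """Case-insensitive substring match; a missing value never matches."""
    if text is None:
        return False
    lowered = text.lower()
    return any(kw in lowered for kw in keywords)


def classify(host=None, isolation_source=None, env_broad_scale=None):
    """Apply the rules in order: host evidence wins over environmental evidence."""
    valid_host = host is not None and host.lower() not in NA_VALUES
    if valid_host or contains_any(isolation_source, HOST_KEYWORDS):
        return 'host_associated'
    if (contains_any(isolation_source, ENV_KEYWORDS)
            or contains_any(env_broad_scale, ENV_KEYWORDS)):
        return 'free_living'
    return 'ambiguous'


print(classify(host='Homo sapiens'))
print(classify(host='not applicable', isolation_source='forest soil'))
print(classify(host='Zea mays', isolation_source='rhizosphere soil'))
print(classify(isolation_source='laboratory strain'))
```

Because the host rule is checked first, a record with both a valid host and an environmental isolation source (the third example) is labeled host-associated.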

In [None]:
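# Extract all lifestyle-relevant metadata per genome
# Pivot the EAV table into a wide format with one row per genome.
# MAX() keeps a single value per attribute: if a BioSample has several rows for
# the same attribute, the greatest content string (by string ordering) is kept.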
genome_env = spark.sql("""
    SELECT
        g.genome_id,
        g.gtdb_species_clade_id,
        MAX(CASE WHEN ne.harmonized_name = 'host' THEN ne.content END) as host,
        MAX(CASE WHEN ne.harmonized_name = 'isolation_source' THEN ne.content END) as isolation_source,
        MAX(CASE WHEN ne.harmonized_name = 'env_broad_scale' THEN ne.content END) as env_broad_scale,
        MAX(CASE WHEN ne.harmonized_name = 'env_local_scale' THEN ne.content END) as env_local_scale,
        MAX(CASE WHEN ne.harmonized_name = 'env_medium' THEN ne.content END) as env_medium
    FROM kbase_ke_pangenome.genome g
    JOIN kbase_ke_pangenome.ncbi_env ne
        ON g.ncbi_biosample_id = ne.accession
    WHERE ne.harmonized_name IN ('host', 'isolation_source', 'env_broad_scale', 'env_local_scale', 'env_medium')
    GROUP BY g.genome_id, g.gtdb_species_clade_id
""")

genome_env.cache()
print(f"Genomes with any lifestyle metadata: {genome_env.count():,}")
genome_env.show(10, truncate=50)

In [None]:
from pyspark.sql import functions as F

# Define classification rules
# NOTE: Refine these keyword lists after inspecting the content values above!

# Sentinel values to exclude
na_values = ['not applicable', 'not collected', 'missing', 'na', 'n/a', 'none', 'unknown', '']

# Host-related isolation source keywords (clinical/host samples)
host_keywords = ['blood', 'sputum', 'wound', 'stool', 'urine', 'clinical',
                 'patient', 'feces', 'abscess', 'throat', 'nasal', 'skin',
                 'lung', 'cerebrospinal', 'bile', 'tissue', 'biopsy',
                 'human', 'cattle', 'chicken', 'pig', 'mouse', 'bovine',
                 'swine', 'poultry', 'gut', 'intestin', 'fecal', 'rectal']

# Environmental isolation source keywords
env_keywords = ['soil', 'water', 'marine', 'freshwater', 'sediment', 'river',
                'lake', 'ocean', 'sea', 'rhizosphere', 'root', 'leaf',
                'air', 'hot spring', 'hydrothermal', 'mine', 'compost',
                'wastewater', 'activated sludge', 'biofilm', 'rock', 'sand']


def has_valid_host(host_col):
    """Returns True if host column has a meaningful value."""
    return (
        host_col.isNotNull()
        & (~F.lower(host_col).isin(na_values))
    )


def matches_keywords(col, keywords):
    """Returns True if column contains any of the keywords (case-insensitive).

    NOTE: this is a substring match, so short keywords can match inside longer
    words (e.g. 'sea' also matches 'disease').
    """
    lowered = F.lower(col)
    match = F.lit(False)
    for kw in keywords:
        match = match | lowered.contains(kw)
    # A NULL input makes every contains() NULL; coalesce so the result is always boolean.
    return F.coalesce(match, F.lit(False))


# Classify each genome
classified = genome_env.withColumn(
    'lifestyle',
    F.when(
        has_valid_host(F.col('host')) | matches_keywords(F.col('isolation_source'), host_keywords),
        F.lit('host_associated')
    ).when(
        matches_keywords(F.col('isolation_source'), env_keywords)
        | matches_keywords(F.col('env_broad_scale'), env_keywords),
        F.lit('free_living')
    ).otherwise(F.lit('ambiguous'))
)

# Summary
classified.groupBy('lifestyle').count().orderBy('count', ascending=False).show()

In [None]:
# Spot-check: sample genomes from each category
for cat in ['host_associated', 'free_living', 'ambiguous']:
    print(f"\n=== {cat} (sample 5) ===")
    classified.filter(F.col('lifestyle') == cat).select(
        'genome_id', 'host', 'isolation_source', 'env_broad_scale'
    ).show(5, truncate=60)

## 6. Aggregate to Species Level

Assign each species a lifestyle by majority vote over its classified (non-ambiguous) genomes.
Filter to species with:
- >= 10 genomes total (`no_genomes` in the pangenome table, for a meaningful pangenome)
- >= 70% of classified genomes agreeing on lifestyle (clear assignment); ambiguous genomes are excluded from this fraction
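
The voting rule can be sketched in plain Python for one species at a time. `assign_species_lifestyle` is a hypothetical helper, and the label lists are made up. The genome-count filter is a separate step and is not part of this sketch.

```python
from collections import Counter


def assign_species_lifestyle(genome_labels, min_fraction=0.7):
    """Majority vote over non-ambiguous genome labels; None if no clear winner."""
    votes = Counter(label for label in genome_labels if label != 'ambiguous')
    total = sum(votes.values())
    if total == 0:
        return None  # nothing to vote on
    lifestyle, count = votes.most_common(1)[0]
    fraction = count / total
    if fraction < min_fraction:
        return None  # mixed species, no clear assignment
    return {
        'species_lifestyle': lifestyle,
        'n_classified_dominant': count,
        'total_classified': total,
        'lifestyle_fraction': fraction,
    }


clear = ['host_associated'] * 8 + ['free_living'] * 2 + ['ambiguous'] * 5
mixed = ['host_associated'] * 6 + ['free_living'] * 4
sparse = ['free_living'] + ['ambiguous'] * 20

print(assign_species_lifestyle(clear))
print(assign_species_lifestyle(mixed))
print(assign_species_lifestyle(sparse))
```

The `sparse` case passes the 70% rule on the strength of a single classified genome, so `total_classified` is worth inspecting alongside the fraction before trusting an assignment.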

In [None]:
from pyspark.sql.window import Window

# Count lifestyle categories per species (ambiguous genomes do not vote)
species_lifestyle = classified.filter(
    F.col('lifestyle') != 'ambiguous'
).groupBy('gtdb_species_clade_id', 'lifestyle').count()

# Get total classified genomes per species
w = Window.partitionBy('gtdb_species_clade_id')
species_lifestyle = species_lifestyle.withColumn(
    'total_classified', F.sum('count').over(w)
).withColumn(
    'fraction', F.col('count') / F.col('total_classified')
)

species_lifestyle.show(20, truncate=False)

In [None]:
# Assign species lifestyle: majority vote with >= 70% threshold
# Keep only the dominant lifestyle per species
w_rank = Window.partitionBy('gtdb_species_clade_id').orderBy(F.desc('count'))

species_assignment = species_lifestyle.withColumn(
    'rank', F.row_number().over(w_rank)
).filter(
    (F.col('rank') == 1) & (F.col('fraction') >= 0.7)
).select(
    'gtdb_species_clade_id',
    F.col('lifestyle').alias('species_lifestyle'),
    F.col('count').alias('n_classified_dominant'),
    'total_classified',
    F.col('fraction').alias('lifestyle_fraction')
)

print(f"Species with clear lifestyle assignment: {species_assignment.count():,}")
species_assignment.groupBy('species_lifestyle').count().show()

In [None]:
# Join with pangenome stats to filter by genome count and add taxonomy
target_species = species_assignment.join(
    spark.sql("""
        SELECT p.gtdb_species_clade_id, p.no_genomes, p.no_gene_clusters, p.no_core,
               p.no_aux_genome, p.no_singleton_gene_clusters,
               sc.GTDB_taxonomy, sc.GTDB_species
        FROM kbase_ke_pangenome.pangenome p
        JOIN kbase_ke_pangenome.gtdb_species_clade sc
            ON p.gtdb_species_clade_id = sc.gtdb_species_clade_id
    """),
    on='gtdb_species_clade_id',
    how='inner'
).filter(
    F.col('no_genomes') >= 10
)

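# Add phylum for phylogenetic stratification.
# Assumes GTDB_taxonomy is a semicolon-separated lineage ('d__...;p__...;c__...'),
# so item 0 is the domain and item 1 is the phylum.
target_species = target_species.withColumn(
    'phylum', F.split(F.col('GTDB_taxonomy'), ';').getItem(1)
)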

print(f"Target species (>= 10 genomes, clear lifestyle): {target_species.count():,}")
target_species.groupBy('species_lifestyle').count().show()
target_species.groupBy('species_lifestyle', 'phylum').count().orderBy(
    'species_lifestyle', F.desc('count')
).show(30, truncate=False)

In [None]:
# Save species lifestyle classification
import os

os.makedirs('../data', exist_ok=True)  # to_csv does not create missing directories
target_pdf = target_species.toPandas()
target_pdf.to_csv('../data/species_lifestyle_classification.csv', index=False)
print(f"Saved {len(target_pdf)} species to ../data/species_lifestyle_classification.csv")
target_pdf.head(10)

## Findings

Record observations here after running:

- How many genomes had usable lifestyle metadata? ___
- How many species classified as host-associated? ___
- How many species classified as free-living? ___
- Which phyla are represented in each category? ___
- Any concerns about classification quality? ___
- Are the keyword lists adequate or do they need refinement? ___
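
For the last two questions, one possible audit is to look for keywords that match only inside a longer word, since the classifier uses substring matching. The sketch below is plain Python. The helper names and example strings are assumptions for illustration, and the keyword list is a subset of `env_keywords`.

```python
import re


def substring_hits(text, keywords):
    """Keywords found anywhere in the text (the matching the classifier uses)."""
    lowered = text.lower()
    return [kw for kw in keywords if kw in lowered]


def word_hits(text, keywords):
    """Keywords found as whole words or phrases, using regex word boundaries."""
    lowered = text.lower()
    return [kw for kw in keywords
            if re.search(r'\b' + re.escape(kw) + r'\b', lowered)]


def false_positive_candidates(text, keywords):
    """Keywords that match only as part of a longer word."""
    whole = set(word_hits(text, keywords))
    return [kw for kw in substring_hits(text, keywords) if kw not in whole]


env_sample = ['sea', 'air', 'mine', 'sand', 'soil']
print(false_positive_candidates('hair follicle of patient with skin disease', env_sample))
print(false_positive_candidates('sea water', env_sample))
```

Deliberate stems such as 'intestin' (meant to match 'intestinal') would also be flagged by this check, so the output is a list of candidates to review rather than confirmed errors. If word-boundary matching turns out to be preferable, `Column.rlike` with a similar pattern might be one way to apply it in Spark.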