# Notebook 01: Openness Stratification

**Project**: Openness vs Functional Composition

**Goal**: Extract pangenome statistics, compute openness metrics, stratify species into quartiles, and select ~10 species per quartile balanced by phylum.

**Data Sources**:
- `kbase_ke_pangenome.pangenome` — Per-species pangenome stats (27K rows)
- `kbase_ke_pangenome.gtdb_species_clade` — Taxonomy (27K rows)

**Output**: `../data/species_openness_quartiles.csv`

In [None]:
# get_spark_session() is assumed to be provided by the notebook environment
spark = get_spark_session()

## 1. Extract Pangenome Stats

Keep only species with at least 50 genomes, so that the pangenome estimates are robust.

In [None]:
pangenome_df = spark.sql("""
    SELECT 
        p.gtdb_species_clade_id,
        CAST(p.no_genomes AS INT) as no_genomes,
        CAST(p.no_gene_clusters AS INT) as no_gene_clusters,
        CAST(p.no_core AS INT) as no_core,
        CAST(p.no_aux_genome AS INT) as no_aux_genome,
        CAST(p.no_singleton_gene_clusters AS INT) as no_singleton,
        sc.GTDB_taxonomy,
        sc.GTDB_species
    FROM kbase_ke_pangenome.pangenome p
    JOIN kbase_ke_pangenome.gtdb_species_clade sc
        ON p.gtdb_species_clade_id = sc.gtdb_species_clade_id
    WHERE CAST(p.no_genomes AS INT) >= 50
""")

print(f"Species with >= 50 genomes: {pangenome_df.count():,}")
pangenome_df.show(5, truncate=60)

## 2. Compute Openness Metrics
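
Each metric in this section is a ratio over the total number of gene clusters, and openness is defined as one minus the core fraction. As a quick check of those definitions, here is a minimal pandas sketch on made-up counts. The two rows and their taxonomy strings are hypothetical, not values from `kbase_ke_pangenome`, and the toy counts are chosen so that core plus auxiliary equals the total, which is an assumption about the table rather than something this notebook verifies.

```python
import pandas as pd

# Made-up counts for two hypothetical species (not real pangenome data).
# Assumption for the toy rows only: no_core + no_aux_genome == no_gene_clusters.
toy = pd.DataFrame({
    "GTDB_taxonomy": [
        "d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria",
        "d__Bacteria;p__Bacillota;c__Bacilli",
    ],
    "no_gene_clusters": [10000, 4000],
    "no_core": [2500, 3000],
    "no_aux_genome": [7500, 1000],
    "no_singleton": [4000, 200],
})

# Same ratios as the Spark cell, expressed in pandas.
toy["core_fraction"] = toy["no_core"] / toy["no_gene_clusters"]
toy["accessory_fraction"] = toy["no_aux_genome"] / toy["no_gene_clusters"]
toy["singleton_fraction"] = toy["no_singleton"] / toy["no_gene_clusters"]
toy["openness"] = 1 - toy["core_fraction"]

# The phylum is the second ';'-separated field of the taxonomy string.
toy["phylum"] = toy["GTDB_taxonomy"].str.split(";").str[1]

print(toy[["phylum", "core_fraction", "accessory_fraction", "openness"]])
```

If core and auxiliary clusters do partition the total in the real table, `openness` and `accessory_fraction` carry the same information, which is worth confirming before treating them as separate metrics.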

In [None]:
from pyspark.sql import functions as F

# Compute openness metrics
openness_df = pangenome_df.withColumn(
    'core_fraction', F.col('no_core') / F.col('no_gene_clusters')
).withColumn(
    'singleton_fraction', F.col('no_singleton') / F.col('no_gene_clusters')
).withColumn(
    'accessory_fraction', F.col('no_aux_genome') / F.col('no_gene_clusters')
).withColumn(
    'openness', 1 - (F.col('no_core') / F.col('no_gene_clusters'))
).withColumn(
    'phylum', F.split(F.col('GTDB_taxonomy'), ';').getItem(1)
)

# Summary statistics
openness_pdf = openness_df.toPandas()
print(f"Species count: {len(openness_pdf)}")
print("\nOpenness distribution:")
print(openness_pdf['openness'].describe())
print("\nPhylum distribution:")
print(openness_pdf['phylum'].value_counts().head(15))

## 3. Stratify into Quartiles
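
Quartile assignment here relies on `pd.qcut`, which cuts at sample quantiles so that each bin holds roughly the same number of species. The sketch below shows the behavior on eight made-up openness values (not real data), using the same labels as this notebook.

```python
import pandas as pd

# Eight made-up openness values (not real data), deliberately unsorted.
openness = pd.Series([0.10, 0.90, 0.35, 0.60, 0.20, 0.75, 0.45, 0.55])
labels = ["Q1_closed", "Q2", "Q3", "Q4_open"]

# qcut bins by rank-based quantiles, so each label gets about n/4 values.
quartile = pd.qcut(openness, q=4, labels=labels)
counts = quartile.value_counts().sort_index()

print(counts)
```

One caveat: if many species share the same openness value, the quantile edges can coincide and `pd.qcut` raises a `ValueError` under its default `duplicates='raise'` setting, so it may be worth checking for heavy ties before stratifying.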

In [None]:
import pandas as pd

# Assign quartiles by openness
openness_pdf['openness_quartile'] = pd.qcut(
    openness_pdf['openness'], q=4, labels=['Q1_closed', 'Q2', 'Q3', 'Q4_open']
)

print("Species per quartile:")
print(openness_pdf['openness_quartile'].value_counts().sort_index())
print("\nOpenness range per quartile:")
# observed=True avoids the pandas FutureWarning for categorical groupers
print(openness_pdf.groupby('openness_quartile', observed=True)['openness'].agg(['min', 'max', 'median']))

In [None]:
# Phylum distribution per quartile
print("Phylum distribution by quartile:")
print(pd.crosstab(openness_pdf['openness_quartile'], openness_pdf['phylum']))

## 4. Select Target Species

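Select up to 10 species per quartile, balanced by phylum where possible: take the two species with the most genomes from each of the five most common phyla in that quartile.
Species with more genomes are preferred because they give better pangenome estimates. A quartile with fewer than five phyla, or with phyla represented by a single species, yields fewer than 10.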
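
The selection rule can be sketched as a small function and tried on toy data. `select_balanced` and its parameters are hypothetical names introduced for illustration, and the toy species are made up. The sketch takes the largest species by `no_genomes` from each of the most common phyla and then caps the total.

```python
import pandas as pd

def select_balanced(q_df, n_phyla=5, per_phylum=2, total=10):
    """Hypothetical helper: pick the `per_phylum` species with the most
    genomes from each of the `n_phyla` most common phyla, capped at `total`."""
    top_phyla = q_df["phylum"].value_counts().head(n_phyla).index
    picks = [
        q_df[q_df["phylum"] == p].nlargest(per_phylum, "no_genomes")
        for p in top_phyla
    ]
    return pd.concat(picks).head(total)

# Made-up quartile with three phyla of sizes 3, 2, and 1.
toy = pd.DataFrame({
    "GTDB_species": ["s__A", "s__B", "s__C", "s__D", "s__E", "s__F"],
    "phylum": ["p__X", "p__X", "p__X", "p__Y", "p__Y", "p__Z"],
    "no_genomes": [500, 60, 120, 75, 300, 90],
})

picked = select_balanced(toy)
print(picked.to_string(index=False))
```

With only three phyla available, the toy quartile returns five species rather than ten, which is the same shortfall to watch for in sparse quartiles of the real data.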

In [None]:
# Select top species per quartile, balancing phylum representation
target_species = []

for q in ['Q1_closed', 'Q2', 'Q3', 'Q4_open']:
    q_df = openness_pdf[openness_pdf['openness_quartile'] == q].copy()
    
    # Get top phyla in this quartile
    top_phyla = q_df['phylum'].value_counts().head(5).index.tolist()
    
    selected = []
    # Take 2 species per top phylum (up to 10 total)
    for phylum in top_phyla:
        phylum_species = q_df[q_df['phylum'] == phylum].nlargest(2, 'no_genomes')
        selected.append(phylum_species)
        if sum(len(s) for s in selected) >= 10:
            break
    
    q_selected = pd.concat(selected).head(10)
    target_species.append(q_selected)
    print(f"\n{q}: {len(q_selected)} species")
    print(q_selected[['GTDB_species', 'phylum', 'no_genomes', 'openness']].to_string(index=False))

target_df = pd.concat(target_species)
print(f"\nTotal target species: {len(target_df)}")

## 5. Save Results

In [None]:
import os

# Save all species with openness metrics (create the output directory if it is missing)
os.makedirs('../data', exist_ok=True)
openness_pdf.to_csv('../data/all_species_openness.csv', index=False)
print(f"Saved {len(openness_pdf)} species to ../data/all_species_openness.csv")

# Save target species for notebook 02
target_df.to_csv('../data/species_openness_quartiles.csv', index=False)
print(f"Saved {len(target_df)} target species to ../data/species_openness_quartiles.csv")

## Findings

Record after running:
- How many species have >= 50 genomes? ___
- What is the openness range? ___
- Are quartiles phylogenetically balanced? ___
- Any concerns about species selection? ___
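
For the phylogenetic balance question, one possible check is a row-normalized crosstab, which reports each phylum's share within a quartile instead of raw counts. This is a minimal sketch on made-up rows; the column names match this notebook, but the data are hypothetical.

```python
import pandas as pd

# Made-up assignments (not real data): Q1 is mixed, Q4 is a single phylum.
toy = pd.DataFrame({
    "openness_quartile": ["Q1_closed", "Q1_closed", "Q4_open", "Q4_open"],
    "phylum": ["p__X", "p__Y", "p__X", "p__X"],
})

# normalize="index" divides each row by its total, giving per-quartile shares.
share = pd.crosstab(toy["openness_quartile"], toy["phylum"], normalize="index")

print(share)
```

Rows that look very different from one another suggest that openness and phylum are confounded in the selection.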