# NB01: Data Extraction — Ecotype Species Screening

**Run on**: BERDL JupyterHub (Spark available via `get_spark_session()`)

## Goal

Extract and summarize three data dimensions for all 338 species that have phylogenetic tree data:

1. **Phylogenetic substructure** — per-species branch distance statistics from `phylogenetic_tree_distance_pairs`
2. **Environmental diversity** — per-species environmental category counts and entropy from `nmdc_ncbi_biosamples.env_triads_flattened`
3. **Pangenome openness** — core/accessory/singleton counts from `pangenome`

All heavy aggregations run on Spark. Only the small summary tables (~338 rows each), the per-category environment counts (~2K rows), and the genome-to-biosample map (~94K rows) are brought to the driver via `.toPandas()`.

## Outputs

| File | Description | Rows |
|------|-------------|------|
| `species_tree_list.csv` | All tree species with phylogenetic_tree_id | ~338 |
| `species_phylo_stats.csv` | Branch distance statistics per species | ~338 |
| `species_pangenome_stats.csv` | Pangenome openness metrics per species | ~338 |
| `species_env_stats.csv` | Environmental diversity metrics per species | ~338 |
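| `species_env_category_counts.csv` | Genome counts per species and `env_broad_scale` label | ~2K |
| `genome_biosample_map.csv` | genome_id → biosample_accession for tree species | ~94K |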

## Key Pitfalls

- **Genome ID format in `phylogenetic_tree_distance_pairs`**: IDs are bare accessions (e.g. `GCF_021198225.1`, no `RS_`/`GB_` prefix), whereas `genome.genome_id` carries the prefix. Always verify and normalise the key before joining to the `genome` table.
- **Cross-database joins**: Use `nmdc_ncbi_biosamples.env_triads_flattened` with full `database.table` notation.
- **`--` in species IDs**: Avoid placing these in SQL `IN()` clauses. Use Spark DataFrame joins instead.
- **Never `.toPandas()` on large tables**: Aggregate on Spark, collect only summary rows.

In [1]:
# Cell 1: Setup

import os
import numpy as np
import pandas as pd
from pyspark.sql import functions as F

# On JupyterHub: get_spark_session() is available without import.
# Locally (with .venv-berdl active): import from scripts/
try:
    from get_spark_session import get_spark_session
except ImportError:
    pass  # Already available on JupyterHub

spark = get_spark_session()

OUTPUT_PATH = "../data"
os.makedirs(OUTPUT_PATH, exist_ok=True)

print(f"Spark version: {spark.version}")
print(f"Output path: {os.path.abspath(OUTPUT_PATH)}")

Spark version: 4.0.1
Output path: /Users/MCashman/Documents/Projects/Repos/BERIL-research-observatory/projects/ecotype_species_screening/data


---
## STEP 1: Load Tree Species Universe

In [2]:
# Cell 2: Load all species with phylogenetic trees + pangenome stats

tree_species_df = spark.sql("""
    SELECT
        pt.gtdb_species_clade_id,
        pt.phylogenetic_tree_id,
        sc.GTDB_species,
        p.no_genomes,
        p.no_core,
        p.no_aux_genome,
        p.no_singleton_gene_clusters,
        p.no_gene_clusters,
        sc.mean_intra_species_ANI,
        sc.min_intra_species_ANI
    FROM kbase_ke_pangenome.phylogenetic_tree pt
    JOIN kbase_ke_pangenome.gtdb_species_clade sc
        ON pt.gtdb_species_clade_id = sc.gtdb_species_clade_id
    JOIN kbase_ke_pangenome.pangenome p
        ON pt.gtdb_species_clade_id = p.gtdb_species_clade_id
""")

tree_species_pd = tree_species_df.toPandas()

# Derived pangenome metrics
tree_species_pd['singleton_fraction'] = (
    tree_species_pd['no_singleton_gene_clusters'] / tree_species_pd['no_gene_clusters']
)
tree_species_pd['core_fraction'] = (
    tree_species_pd['no_core'] / tree_species_pd['no_gene_clusters']
)
tree_species_pd['aux_fraction'] = (
    tree_species_pd['no_aux_genome'] / tree_species_pd['no_gene_clusters']
)

print(f"Tree species loaded: {len(tree_species_pd)}")
print(f"Genome count range: {tree_species_pd['no_genomes'].min()} – {tree_species_pd['no_genomes'].max()}")
print(f"Genome count median: {tree_species_pd['no_genomes'].median():.0f}")
print(f"\nTop 10 by genome count:")
print(tree_species_pd.nlargest(10, 'no_genomes')[['GTDB_species', 'no_genomes', 'singleton_fraction']].to_string())

Tree species loaded: 338
Genome count range: 12 – 2604
Genome count median: 132

Top 10 by genome count:
                     GTDB_species  no_genomes  singleton_fraction
337   s__Clostridioides_difficile        2604            0.560065
336     s__Enterococcus_B_faecium        2533            0.519981
335  s__Enterobacter_hormaechei_A        2453            0.606672
334     s__Campylobacter_D_jejuni        2313            0.475709
333      s__Enterococcus_faecalis        2250            0.510081
332     s__Listeria_monocytogenes        2212            0.554432
331     s__Neisseria_meningitidis        2121            0.483305
330     s__Streptococcus_pyogenes        2082            0.492156
300   s__Listeria_monocytogenes_B        1923            0.538122
301    s__Mycobacterium_abscessus        1825            0.447394


In [3]:
# Cell 3: Save species tree list and pangenome stats

tree_species_pd.to_csv(f"{OUTPUT_PATH}/species_tree_list.csv", index=False)
print(f"Saved species_tree_list.csv ({len(tree_species_pd)} species)")

# Also save pangenome stats separately for downstream notebooks
pangenome_cols = [
    'gtdb_species_clade_id', 'GTDB_species', 'no_genomes', 'no_core',
    'no_aux_genome', 'no_singleton_gene_clusters', 'no_gene_clusters',
    'singleton_fraction', 'core_fraction', 'aux_fraction',
    'mean_intra_species_ANI', 'min_intra_species_ANI'
]
tree_species_pd[pangenome_cols].to_csv(f"{OUTPUT_PATH}/species_pangenome_stats.csv", index=False)
print(f"Saved species_pangenome_stats.csv")

Saved species_tree_list.csv (338 species)
Saved species_pangenome_stats.csv


---
## STEP 2: Phylogenetic Substructure Statistics

Compute branch distance statistics entirely on Spark.
The full table has 22.6M rows — we never `.toPandas()` it directly.
We aggregate to ~338 rows before collecting.

In [4]:
# Cell 4: Check genome ID format in phylogenetic_tree_distance_pairs
# IMPORTANT: IDs here may be bare accessions (no RS_/GB_ prefix)

sample_pairs = spark.sql("""
    SELECT phylogenetic_tree_id, genome1_id, genome2_id, branch_distance
    FROM kbase_ke_pangenome.phylogenetic_tree_distance_pairs
    LIMIT 5
""").toPandas()

print("Sample rows from phylogenetic_tree_distance_pairs:")
print(sample_pairs.to_string())

# Check prefix: do IDs start with RS_ or GB_?
sample_genome_id = sample_pairs['genome1_id'].iloc[0]
print(f"\nSample genome1_id: {sample_genome_id}")
print(f"Starts with RS_ or GB_: {sample_genome_id.startswith('RS_') or sample_genome_id.startswith('GB_')}")

Sample rows from phylogenetic_tree_distance_pairs:
                   phylogenetic_tree_id       genome1_id       genome2_id  branch_distance
0  20adf934-9676-560f-99d0-fdb9eccf8644  GCF_021198225.1  GCF_900162285.1         0.008065
1  20adf934-9676-560f-99d0-fdb9eccf8644  GCF_021198225.1  GCF_900162295.1         0.008065
2  20adf934-9676-560f-99d0-fdb9eccf8644  GCF_021198225.1  GCF_900162315.1         0.008506
3  20adf934-9676-560f-99d0-fdb9eccf8644  GCF_021198225.1  GCF_900162325.1         0.008876
4  20adf934-9676-560f-99d0-fdb9eccf8644  GCF_021198225.1  GCF_900162345.1         0.005438

Sample genome1_id: GCF_021198225.1
Starts with RS_ or GB_: False


In [5]:
# Cell 5: Determine correct join key for genome IDs in distance table
# If IDs lack prefix, we need to strip RS_/GB_ from genome.genome_id before joining

# Check alignment between phylogenetic_tree_distance_pairs genome IDs
# and kbase_ke_pangenome.genome genome_ids

# Get one species to test
test_tree_id = tree_species_pd['phylogenetic_tree_id'].iloc[0]
test_species_id = tree_species_pd['gtdb_species_clade_id'].iloc[0]

# Get genome IDs from distance table for this species
dist_genome_ids = spark.sql(f"""
    SELECT DISTINCT genome1_id
    FROM kbase_ke_pangenome.phylogenetic_tree_distance_pairs
    WHERE phylogenetic_tree_id = '{test_tree_id}'
    LIMIT 5
""").toPandas()

# Get genome IDs from genome table for same species
genome_table_ids = spark.sql(f"""
    SELECT genome_id
    FROM kbase_ke_pangenome.genome
    WHERE gtdb_species_clade_id = '{test_species_id}'
    LIMIT 5
""").toPandas()

print(f"Test species: {test_species_id[:60]}")
print(f"\ngenome1_id values from distance table: {dist_genome_ids['genome1_id'].tolist()}")
print(f"genome_id values from genome table:    {genome_table_ids['genome_id'].tolist()}")

Test species: s__Bacillus_A_anthracis--RS_GCF_000742895.1

genome1_id values from distance table: ['GCF_000793565.1', 'GCF_022220625.1', 'GCF_000161635.1', 'GCF_001272985.1', 'GCF_002211135.1']
genome_id values from genome table:    ['GB_GCA_020172645.1', 'RS_GCF_000007845.1', 'RS_GCF_000008165.1', 'RS_GCF_000008505.1', 'RS_GCF_000011625.1']
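
The output above shows bare accessions (`GCF_...`) in the distance table and prefixed IDs (`RS_GCF_...`, `GB_GCA_...`) in the `genome` table. A minimal sketch of normalising the join key, assuming the prefix is always exactly `RS_` or `GB_` at the start of the ID. The helper name `strip_gtdb_prefix` is hypothetical and not part of the notebook:

```python
import re

# Hypothetical helper (not in the notebook): drop a leading RS_/GB_ prefix
# so genome.genome_id values line up with the bare accessions stored in
# phylogenetic_tree_distance_pairs.
_PREFIX = re.compile(r"^(RS_|GB_)")

def strip_gtdb_prefix(genome_id: str) -> str:
    # IDs that carry no prefix are returned unchanged.
    return _PREFIX.sub("", genome_id)

genome_table_ids = ["GB_GCA_020172645.1", "RS_GCF_000007845.1"]
bare_ids = [strip_gtdb_prefix(g) for g in genome_table_ids]
print(bare_ids)  # ['GCA_020172645.1', 'GCF_000007845.1']
```

On Spark, the same pattern could likely be applied column-wise with `F.regexp_replace(F.col('genome_id'), '^(RS_|GB_)', '')` before the join, so that no IDs need to be collected to the driver.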


In [6]:
# Cell 6: Compute per-species branch distance statistics on Spark
#
# Metrics computed per species:
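#   - n_genomes_in_tree: count of distinct genome1_id values
#       (can undercount if each pair is stored in one direction only;
#        compare against no_genomes when in doubt)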
#   - n_pairs: total pairwise comparisons
#   - mean, std, min, max, median (p50), IQR (p75-p25) of branch_distance
#   - cv (coefficient of variation = std / mean): high CV -> substructure
#   - max_median_ratio: another substructure indicator

phylo_stats_df = spark.sql("""
    SELECT
        phylogenetic_tree_id,
        COUNT(DISTINCT genome1_id)                                        AS n_genomes_in_tree,
        COUNT(*)                                                           AS n_pairs,
        AVG(branch_distance)                                              AS mean_branch_dist,
        STDDEV(branch_distance)                                           AS std_branch_dist,
        MIN(branch_distance)                                              AS min_branch_dist,
        MAX(branch_distance)                                              AS max_branch_dist,
        PERCENTILE(branch_distance, 0.25)                                 AS q25_branch_dist,
        PERCENTILE(branch_distance, 0.50)                                 AS median_branch_dist,
        PERCENTILE(branch_distance, 0.75)                                 AS q75_branch_dist,
        PERCENTILE(branch_distance, 0.90)                                 AS q90_branch_dist,
        PERCENTILE(branch_distance, 0.75) - PERCENTILE(branch_distance, 0.25) AS iqr_branch_dist,
        STDDEV(branch_distance) / NULLIF(AVG(branch_distance), 0)        AS cv_branch_dist,
        MAX(branch_distance) / NULLIF(PERCENTILE(branch_distance, 0.50), 0) AS max_median_ratio
    FROM kbase_ke_pangenome.phylogenetic_tree_distance_pairs
    GROUP BY phylogenetic_tree_id
""")

phylo_stats_pd = phylo_stats_df.toPandas()

print(f"Phylo stats computed for {len(phylo_stats_pd)} species")
print(f"\nCV summary (coefficient of variation):")
print(phylo_stats_pd['cv_branch_dist'].describe())

print(f"\nSpecies with highest CV (most substructure):")
print(phylo_stats_pd.nlargest(10, 'cv_branch_dist')[
    ['phylogenetic_tree_id', 'n_genomes_in_tree', 'cv_branch_dist', 'max_median_ratio']
].to_string())

Phylo stats computed for 338 species

CV summary (coefficient of variation):
count    338.000000
mean       0.561952
std        0.451880
min        0.118667
25%        0.337553
50%        0.461460
75%        0.654129
max        5.844860
Name: cv_branch_dist, dtype: float64

Species with highest CV (most substructure):
                     phylogenetic_tree_id  n_genomes_in_tree  cv_branch_dist  max_median_ratio
65   1fef4fd7-4d34-57fa-89a4-94a643af9b4e                 73        5.844860       1437.120067
284  455e7227-6fe7-520a-acf9-22644bc4a212                 72        2.731984         74.916667
149  c6a2acc3-8a43-5786-b233-5052fe9e5f03                 93        2.562477         45.577465
91   1d79b707-10ef-53c6-b6f1-79a1dcaedf75                 71        2.137329         34.682099
243  fe22dc4d-4e14-5538-b7d8-866bdd54290a                230        2.061939        105.401786
235  78b6cc03-d808-5dff-9999-cb2087a92ff9                 94        1.988798        123.040541
194  ae699787-0
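
To make the two substructure indicators from Cell 6 concrete, here is a small standalone sketch on made-up distances (the numbers are illustrative, not taken from the table). It assumes the sample standard deviation, which is what Spark's `STDDEV` computes, and returns `None` where the SQL would yield `NULL` through `NULLIF`. The function name `substructure_metrics` is hypothetical:

```python
import statistics

def substructure_metrics(distances):
    """Return (cv, max_median_ratio) for a list of pairwise branch distances."""
    mean = statistics.fmean(distances)
    std = statistics.stdev(distances)        # sample standard deviation
    median = statistics.median(distances)
    cv = std / mean if mean else None        # mirrors NULLIF(AVG(...), 0)
    ratio = max(distances) / median if median else None
    return cv, ratio

# Illustrative toy data: one tight clade vs. a tight clade plus a few distant pairs.
uniform = [0.010, 0.011, 0.009, 0.010, 0.012, 0.010]
two_clades = [0.010, 0.011, 0.009, 0.010, 0.200, 0.210]

print("uniform:   ", substructure_metrics(uniform))
print("two clades:", substructure_metrics(two_clades))
```

A handful of long-distance pairs inflates both the CV and the max/median ratio, which is why both are read as hints of substructure. Neither metric distinguishes genuine clades from a single outlier genome, so extreme values such as the top row above are worth inspecting individually.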

In [7]:
# Cell 7: Join phylo stats back to species IDs and save

phylo_stats_pd = phylo_stats_pd.merge(
    tree_species_pd[['gtdb_species_clade_id', 'GTDB_species', 'phylogenetic_tree_id']],
    on='phylogenetic_tree_id',
    how='left'
)

# How many tree IDs matched to a species?
n_matched = phylo_stats_pd['gtdb_species_clade_id'].notna().sum()
print(f"Phylo stats with matched species ID: {n_matched}/{len(phylo_stats_pd)}")
if n_matched < len(phylo_stats_pd):
    print("Unmatched tree IDs (investigate):")
    print(phylo_stats_pd[phylo_stats_pd['gtdb_species_clade_id'].isna()]['phylogenetic_tree_id'].tolist())

phylo_stats_pd.to_csv(f"{OUTPUT_PATH}/species_phylo_stats.csv", index=False)
print(f"\nSaved species_phylo_stats.csv")

Phylo stats with matched species ID: 338/338

Saved species_phylo_stats.csv


---
## STEP 3: Environmental Diversity via NMDC BioSamples

Link: `kbase_ke_pangenome.sample` → `nmdc_ncbi_biosamples.env_triads_flattened`, joined on `sample.ncbi_biosample_accession_id = env_triads_flattened.accession`

We compute, per species:
- Number of genomes with any environmental annotation
- Number of distinct `env_broad_scale` categories
- Shannon entropy of `env_broad_scale` distribution

In [8]:
# Cell 8: Explore nmdc_ncbi_biosamples schema

# Check what tables are available
spark.sql("SHOW TABLES IN nmdc_ncbi_biosamples").show(20, truncate=False)

# Check env_triads_flattened schema
print("\nenv_triads_flattened schema:")
spark.sql("DESCRIBE nmdc_ncbi_biosamples.env_triads_flattened").show(30, truncate=False)

print("\nSample rows from env_triads_flattened:")
spark.sql("SELECT * FROM nmdc_ncbi_biosamples.env_triads_flattened LIMIT 3").show(truncate=False)

+--------------------+---------------------------------+-----------+
|namespace           |tableName                        |isTemporary|
+--------------------+---------------------------------+-----------+
|nmdc_ncbi_biosamples|attribute_harmonized_pairings    |false      |
|nmdc_ncbi_biosamples|bioprojects_flattened            |false      |
|nmdc_ncbi_biosamples|biosamples_attributes            |false      |
|nmdc_ncbi_biosamples|biosamples_flattened             |false      |
|nmdc_ncbi_biosamples|biosamples_ids                   |false      |
|nmdc_ncbi_biosamples|biosamples_links                 |false      |
|nmdc_ncbi_biosamples|content_pairs_aggregated         |false      |
|nmdc_ncbi_biosamples|env_triads_flattened             |false      |
|nmdc_ncbi_biosamples|harmonized_name_dimensional_stats|false      |
|nmdc_ncbi_biosamples|harmonized_name_usage_stats      |false      |
|nmdc_ncbi_biosamples|measurement_evidence_percentages |false      |
|nmdc_ncbi_biosamples|measurement_

+-------------+---------+-------+
|col_name     |data_type|comment|
+-------------+---------+-------+
|accession    |string   |NULL   |
|attribute    |string   |NULL   |
|instance     |bigint   |NULL   |
|raw_original |string   |NULL   |
|raw_component|string   |NULL   |
|id           |string   |NULL   |
|label        |string   |NULL   |
|prefix       |string   |NULL   |
|source       |string   |NULL   |
+-------------+---------+-------+


Sample rows from env_triads_flattened:


+------------+---------------+--------+----------------------+----------------------+-------------+----------------------+------+------+
|accession   |attribute      |instance|raw_original          |raw_component         |id           |label                 |prefix|source|
+------------+---------------+--------+----------------------+----------------------+-------------+----------------------+------+------+
|SAMN10911001|env_medium     |0       |soil                  |soil                  |ENVO:00001998|soil                  |ENVO  |OAK   |
|SAMN10911002|env_broad_scale|0       |temperate desert biome|temperate desert biome|ENVO:01000182|temperate desert biome|ENVO  |OAK   |
|SAMN10911002|env_local_scale|0       |high desert grassland |high desert grassland |ENVO:00000114|agricultural field    |ENVO  |OAK   |
+------------+---------------+--------+----------------------+----------------------+-------------+----------------------+------+------+



In [9]:
# Cell 9: Check the biosamples_flattened table for fallback env fields
# (if env_triads_flattened has sparse coverage)

print("biosamples_flattened schema (first 20 cols):")
spark.sql("DESCRIBE nmdc_ncbi_biosamples.biosamples_flattened").show(20, truncate=False)

# Check a sample row for env-related columns
print("\nSample row (env-related columns only):")
spark.sql("""
    SELECT accession, isolation_source, env_broad_scale, env_local_scale, env_medium,
           geo_loc_name, lat_lon
    FROM nmdc_ncbi_biosamples.biosamples_flattened
    WHERE env_broad_scale IS NOT NULL
    LIMIT 5
""").show(truncate=False)

biosamples_flattened schema (first 20 cols):


+-------------------+---------+-------+
|col_name           |data_type|comment|
+-------------------+---------+-------+
|submission_date    |string   |NULL   |
|last_update        |string   |NULL   |
|publication_date   |string   |NULL   |
|access             |string   |NULL   |
|id                 |string   |NULL   |
|accession          |string   |NULL   |
|package_content    |string   |NULL   |
|status_status      |string   |NULL   |
|status_when        |string   |NULL   |
|is_reference       |string   |NULL   |
|curation_date      |string   |NULL   |
|curation_status    |string   |NULL   |
|owner_abbreviation |string   |NULL   |
|owner_name         |string   |NULL   |
|owner_url          |string   |NULL   |
|description_title  |string   |NULL   |
|organism_name      |string   |NULL   |
|taxonomy_id        |string   |NULL   |
|taxonomy_name      |string   |NULL   |
|description_comment|string   |NULL   |
+-------------------+---------+-------+
only showing top 20 rows

Sample row (en

+------------+----------------+--------------------+---------------------------------+---------------------------------+-----------------------------+----------------+
|accession   |isolation_source|env_broad_scale     |env_local_scale                  |env_medium                       |geo_loc_name                 |lat_lon         |
+------------+----------------+--------------------+---------------------------------+---------------------------------+-----------------------------+----------------+
|SAMN26930614|NULL            |soil [ENVO:00001998]|agricultural soil [ENVO:00002259]|agricultural soil [ENVO:00002259]|Australia: Pampas, Queensland|27.73 S 151.33 E|
|SAMN26930615|NULL            |soil [ENVO:00001998]|agricultural soil [ENVO:00002259]|agricultural soil [ENVO:00002259]|Australia: Pampas, Queensland|27.73 S 151.33 E|
|SAMN26930616|NULL            |soil [ENVO:00001998]|agricultural soil [ENVO:00002259]|agricultural soil [ENVO:00002259]|Australia: Pampas, Queensland|27.73 S 15

In [10]:
# Cell 10: Get genome → biosample accession map for all tree species
#
# NOTE: spark.createDataFrame() is not supported via Spark Connect (version mismatch).
# Use a SQL subquery against phylogenetic_tree to filter tree species instead.

genome_biosample_df = spark.sql("""
    SELECT
        g.genome_id,
        g.gtdb_species_clade_id,
        s.ncbi_biosample_accession_id
    FROM kbase_ke_pangenome.genome g
    JOIN kbase_ke_pangenome.sample s
        ON g.genome_id = s.genome_id
    WHERE g.gtdb_species_clade_id IN (
        SELECT gtdb_species_clade_id FROM kbase_ke_pangenome.phylogenetic_tree
    )
""")

genome_biosample_df.cache()

n_genomes = genome_biosample_df.count()
n_with_biosample = genome_biosample_df.filter(
    F.col('ncbi_biosample_accession_id').isNotNull()
).count()

print(f"Genomes in tree species: {n_genomes:,}")
print(f"Genomes with biosample accession: {n_with_biosample:,} ({100*n_with_biosample/n_genomes:.1f}%)")

# Save genome-biosample map
genome_biosample_pd = genome_biosample_df.toPandas()
genome_biosample_pd.to_csv(f"{OUTPUT_PATH}/genome_biosample_map.csv", index=False)
print(f"\nSaved genome_biosample_map.csv ({len(genome_biosample_pd):,} rows)")

Genomes in tree species: 94,424
Genomes with biosample accession: 94,424 (100.0%)



Saved genome_biosample_map.csv (94,424 rows)


In [11]:
# Cell 11: Join genomes to env_triads_flattened and compute env diversity per species
#
# IMPORTANT: env_triads_flattened is EAV format.
#   attribute column contains 'env_broad_scale'/'env_local_scale'/'env_medium' as values.
#   label column contains the resolved ENVO term.
# Use conditional COUNT(DISTINCT ...) to pivot in aggregation — do NOT reference e.env_broad_scale.

env_diversity_df = spark.sql("""
    SELECT
        g.gtdb_species_clade_id,
        COUNT(DISTINCT g.genome_id)                                                          AS n_genomes_total,
        COUNT(DISTINCT CASE WHEN t.accession IS NOT NULL THEN g.genome_id END)               AS n_genomes_with_env,
        COUNT(DISTINCT CASE WHEN t.attribute = 'env_broad_scale' THEN t.label END)           AS n_distinct_env_broad,
        COUNT(DISTINCT CASE WHEN t.attribute = 'env_local_scale' THEN t.label END)           AS n_distinct_env_local,
        COUNT(DISTINCT CASE WHEN t.attribute = 'env_medium'      THEN t.label END)           AS n_distinct_env_medium
    FROM kbase_ke_pangenome.genome g
    JOIN kbase_ke_pangenome.sample s
        ON g.genome_id = s.genome_id
    LEFT JOIN nmdc_ncbi_biosamples.env_triads_flattened t
        ON s.ncbi_biosample_accession_id = t.accession
    WHERE g.gtdb_species_clade_id IN (
        SELECT gtdb_species_clade_id FROM kbase_ke_pangenome.phylogenetic_tree
    )
    GROUP BY g.gtdb_species_clade_id
""")

env_diversity_pd = env_diversity_df.toPandas()

env_diversity_pd['env_coverage_fraction'] = (
    env_diversity_pd['n_genomes_with_env'] / env_diversity_pd['n_genomes_total']
)

print(f"Species with env diversity computed: {len(env_diversity_pd)}")
print(f"\nEnv coverage fraction summary:")
print(env_diversity_pd['env_coverage_fraction'].describe())
print(f"\nSpecies with zero env coverage: {(env_diversity_pd['env_coverage_fraction'] == 0).sum()}")
print(f"\nn_distinct_env_broad summary:")
print(env_diversity_pd['n_distinct_env_broad'].describe())
print(f"\nTop 10 by env_broad diversity:")
top_env = env_diversity_pd.merge(
    tree_species_pd[['gtdb_species_clade_id', 'GTDB_species']], on='gtdb_species_clade_id', how='left'
).nlargest(10, 'n_distinct_env_broad')
print(top_env[['GTDB_species', 'n_genomes_with_env', 'n_distinct_env_broad', 'env_coverage_fraction']].to_string())

Species with env diversity computed: 338

Env coverage fraction summary:
count    338.000000
mean       0.279481
std        0.268579
min        0.000000
25%        0.051246
50%        0.192490
75%        0.459233
max        1.000000
Name: env_coverage_fraction, dtype: float64

Species with zero env coverage: 17

n_distinct_env_broad summary:
count    338.000000
mean       5.177515
std        4.727390
min        0.000000
25%        2.000000
50%        4.000000
75%        8.000000
max       39.000000
Name: n_distinct_env_broad, dtype: float64

Top 10 by env_broad diversity:
                         GTDB_species  n_genomes_with_env  n_distinct_env_broad  env_coverage_fraction
7              s__Bacillus_velezensis                 125                    39               0.193199
210              s__Bacillus_subtilis                 127                    31               0.227191
262  s__Lactiplantibacillus_plantarum                 196                    24               0.267760
16       
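
Because `env_triads_flattened` is in EAV format, the per-species counts in Cell 11 come from conditional distinct counts rather than from dedicated columns. The sketch below reproduces that logic in plain Python on invented toy rows. `None` in the attribute and label positions stands in for a genome whose biosample had no match in the `LEFT JOIN`; all data and variable names here are illustrative assumptions:

```python
from collections import defaultdict

# Toy joined rows: (species, genome_id, attribute, label).
rows = [
    ("sp1", "g1", "env_broad_scale", "soil"),
    ("sp1", "g1", "env_medium", "soil"),
    ("sp1", "g2", "env_broad_scale", "marine biome"),
    ("sp1", "g3", None, None),   # genome with no env annotation
    ("sp2", "g4", None, None),
]

genomes_total = defaultdict(set)
genomes_with_env = defaultdict(set)
broad_labels = defaultdict(set)

for species, genome, attribute, label in rows:
    genomes_total[species].add(genome)
    if attribute is not None:                 # CASE WHEN t.accession IS NOT NULL
        genomes_with_env[species].add(genome)
    if attribute == "env_broad_scale":        # CASE WHEN t.attribute = '...'
        broad_labels[species].add(label)

summary = {
    sp: {
        "n_genomes_total": len(genomes_total[sp]),
        "n_genomes_with_env": len(genomes_with_env[sp]),
        "n_distinct_env_broad": len(broad_labels[sp]),
    }
    for sp in genomes_total
}
print(summary)
```

Using sets matters: a genome with several triad rows (as `g1` has here) must be counted once, which is what `COUNT(DISTINCT ...)` guarantees in the SQL.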

In [12]:
# Cell 12: Compute Shannon entropy of env_broad_scale distribution per species
#
# Filter env_triads_flattened to attribute='env_broad_scale', use label as the category.

env_counts_df = spark.sql("""
    SELECT
        g.gtdb_species_clade_id,
        t.label AS env_broad_scale,
        COUNT(DISTINCT g.genome_id) AS n_genomes_in_env
    FROM kbase_ke_pangenome.genome g
    JOIN kbase_ke_pangenome.sample s
        ON g.genome_id = s.genome_id
    JOIN nmdc_ncbi_biosamples.env_triads_flattened t
        ON s.ncbi_biosample_accession_id = t.accession
    WHERE t.attribute = 'env_broad_scale'
      AND g.gtdb_species_clade_id IN (
          SELECT gtdb_species_clade_id FROM kbase_ke_pangenome.phylogenetic_tree
      )
    GROUP BY g.gtdb_species_clade_id, t.label
""")

env_counts_pd = env_counts_df.toPandas()

def shannon_entropy(counts):
    total = counts.sum()
    if total == 0:
        return 0.0
    probs = counts / total
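    # The 1e-12 guard avoids log2(0), but it makes single-category species
    # come out at about -1.44e-12 instead of exactly 0.
    return float(-np.sum(probs * np.log2(probs + 1e-12)))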

entropy_per_species = (
    env_counts_pd
    .groupby('gtdb_species_clade_id')['n_genomes_in_env']
    .apply(shannon_entropy)
    .reset_index()
    .rename(columns={'n_genomes_in_env': 'env_broad_entropy'})
)

print(f"Entropy computed for {len(entropy_per_species)} species with env annotations")
print(entropy_per_species['env_broad_entropy'].describe())

env_counts_pd.to_csv(f"{OUTPUT_PATH}/species_env_category_counts.csv", index=False)
print(f"\nSaved species_env_category_counts.csv ({len(env_counts_pd):,} rows)")

Entropy computed for 321 species with env annotations
count    3.210000e+02
mean     1.523243e+00
std      9.152944e-01
min     -1.442823e-12
25%      8.658566e-01
50%      1.584963e+00
75%      2.222192e+00
max      3.615519e+00
Name: env_broad_entropy, dtype: float64

Saved species_env_category_counts.csv (2,051 rows)
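
The minimum of about `-1.44e-12` in the summary above is a numerical artifact: for a species with a single category, `probs` is `1` and `log2(1 + 1e-12)` is roughly `1.44e-12`. One possible variant, shown here as a sketch rather than as the notebook's method, skips zero counts instead of adding an epsilon. The function name `shannon_entropy_bits` is hypothetical:

```python
import math

def shannon_entropy_bits(counts):
    """Shannon entropy (base 2) of a sequence of non-negative category counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # Zero counts contribute nothing (p * log2(1/p) -> 0 as p -> 0), so drop them.
    probs = [c / total for c in counts if c > 0]
    return sum(p * math.log2(1.0 / p) for p in probs)

print(shannon_entropy_bits([7]))        # single category: 0.0
print(shannon_entropy_bits([1, 1]))     # two equal categories: 1.0
print(shannon_entropy_bits([5, 5, 5]))  # three equal categories, about log2(3)
```

Three equally represented categories give `log2(3) ≈ 1.585` bits, which happens to equal the median reported above. With this variant, single-category species score exactly `0.0`, so any downstream filter of the form `entropy > 0` or `entropy >= 0` behaves predictably.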


In [13]:
# Cell 13: Merge env stats and save

species_env_stats = env_diversity_pd.merge(
    entropy_per_species, on='gtdb_species_clade_id', how='left'
).merge(
    tree_species_pd[['gtdb_species_clade_id', 'GTDB_species']],
    on='gtdb_species_clade_id', how='left'
)

# Fill entropy = 0 for species with no env annotations
species_env_stats['env_broad_entropy'] = species_env_stats['env_broad_entropy'].fillna(0.0)

species_env_stats.to_csv(f"{OUTPUT_PATH}/species_env_stats.csv", index=False)
print(f"Saved species_env_stats.csv ({len(species_env_stats)} species)")

print(f"\nSpecies with zero env coverage (will score 0 for env diversity): "
      f"{(species_env_stats['n_genomes_with_env'] == 0).sum()}")
print(f"Tip: species with 0 env coverage are not disqualified — they score lowest on env dimension only.")

Saved species_env_stats.csv (338 species)

Species with zero env coverage (will score 0 for env diversity): 17
Tip: species with 0 env coverage are not disqualified — they score lowest on env dimension only.


---
## STEP 4: Sanity Checks and Handoff Summary

In [14]:
# Cell 14: Verify all output files and print summary

output_files = [
    'species_tree_list.csv',
    'species_pangenome_stats.csv',
    'species_phylo_stats.csv',
    'species_env_stats.csv',
    'species_env_category_counts.csv',
    'genome_biosample_map.csv',
]

print("=== Output File Inventory ===")
for f in output_files:
    fpath = os.path.join(OUTPUT_PATH, f)
    if os.path.exists(fpath):
        df = pd.read_csv(fpath)
        print(f"  {f}: {len(df):,} rows, {df.shape[1]} cols")
    else:
        print(f"  MISSING: {f}")

print()
print("=== Universe Summary ===")
print(f"  Total tree species: {len(tree_species_pd)}")
print(f"  Species with >=20 genomes in tree: {(tree_species_pd['no_genomes'] >= 20).sum()}")
print(f"  Species with >=50 genomes in tree: {(tree_species_pd['no_genomes'] >= 50).sum()}")
print()
print("=== Phylo Substructure Summary ===")
phylo_check = pd.read_csv(f"{OUTPUT_PATH}/species_phylo_stats.csv")
print(f"  Median CV: {phylo_check['cv_branch_dist'].median():.3f}")
print(f"  Top quartile CV threshold (75th pct): {phylo_check['cv_branch_dist'].quantile(0.75):.3f}")
print()
print("=== Environmental Diversity Summary ===")
env_check = pd.read_csv(f"{OUTPUT_PATH}/species_env_stats.csv")
n_any_env = (env_check['n_genomes_with_env'] > 0).sum()
print(f"  Species with any env annotation: {n_any_env}/{len(env_check)} ({100*n_any_env/len(env_check):.0f}%)")
print(f"  Median n_distinct_env_broad (among annotated): "
      f"{env_check[env_check['n_genomes_with_env']>0]['n_distinct_env_broad'].median():.1f}")
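# Note: the entropy > 0 filter also drops annotated species with a single
# env_broad_scale category, so this median covers multi-category species only.
print(f"  Median env entropy (among annotated): "
      f"{env_check[env_check['env_broad_entropy']>0]['env_broad_entropy'].median():.2f}")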

print()
print("All done. Download data/ directory and proceed with NB02 locally.")

=== Output File Inventory ===
  species_tree_list.csv: 338 rows, 13 cols
  species_pangenome_stats.csv: 338 rows, 12 cols
  species_phylo_stats.csv: 338 rows, 16 cols
  species_env_stats.csv: 338 rows, 9 cols
  species_env_category_counts.csv: 2,051 rows, 3 cols
  genome_biosample_map.csv: 94,424 rows, 3 cols

=== Universe Summary ===
  Total tree species: 338
  Species with >=20 genomes in tree: 336
  Species with >=50 genomes in tree: 334

=== Phylo Substructure Summary ===
  Median CV: 0.461
  Top quartile CV threshold (75th pct): 0.654

=== Environmental Diversity Summary ===
  Species with any env annotation: 321/338 (95%)
  Median n_distinct_env_broad (among annotated): 5.0
  Median env entropy (among annotated): 1.74

All done. Download data/ directory and proceed with NB02 locally.
