# NB02: Map Pathways to Fitness Browser Genes

**Requires**: BERDL JupyterHub (Spark access)

**Purpose**: Extract Fitness Browser gene-fitness data and map genes to GapMind pathway categories
using SEED role descriptions (`seed_desc`) as a proxy for pathway membership. This approach replaces the
DIAMOND-based link table from `conservation_vs_fitness` by querying BERDL databases directly.

**Inputs**:
- `data/gapmind_genome_pathways.csv` (from NB01)
- `data/gapmind_pathway_summary.csv` (from NB01)
- `data/gapmind_species_summary.csv` (read in Section 8)
- BERDL: `kescience_fitnessbrowser.*`, `kbase_ke_pangenome.gtdb_metadata`, `kbase_ke_pangenome.gapmind_pathways`

**Outputs**:
- `data/organism_metadata.csv` — FB organism info
- `data/organism_mapping.tsv` — FB org → GapMind species mapping
- `data/seed_annotations.csv` — SEED subsystem annotations per gene
- `data/gene_fitness_aggregates.csv` — Mean |t|, max |t| per gene
- `data/essential_genes.tsv` — Protein-coding genes absent from genefitness
- `data/pathway_fitness_metrics.csv` — Per-organism per-pathway fitness summary

**Runtime**: ~15-25 minutes (Spark aggregation over 27M genefitness rows)

In [1]:
import pandas as pd
import numpy as np
import sys
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

spark = get_spark_session()

PROJECT_ROOT = Path('/home/cjneely/repos/BERIL-research-observatory/projects/metabolic_capability_dependency')
DATA_DIR = PROJECT_ROOT / 'data'
DATA_DIR.mkdir(exist_ok=True, parents=True)

# Also make src importable
sys.path.insert(0, str(PROJECT_ROOT / 'src'))
from pathway_utils import categorize_pathway, classify_pathway_dependency

print(f'Spark session: {spark}')
print(f'Data directory: {DATA_DIR}')

Spark session: <pyspark.sql.connect.session.SparkSession object at 0x70b52dc5d2b0>
Data directory: /home/cjneely/repos/BERIL-research-observatory/projects/metabolic_capability_dependency/data


## 1. Explore Fitness Browser Schema

Inspect what tables are available and understand their structure.

In [2]:
print('=== Tables in kescience_fitnessbrowser ===')
spark.sql('SHOW TABLES IN kescience_fitnessbrowser').show(50, truncate=False)

print('\n=== Schema: organism ===')
spark.sql('DESCRIBE kescience_fitnessbrowser.organism').show(50, truncate=False)

print('\n=== Schema: gene ===')
spark.sql('DESCRIBE kescience_fitnessbrowser.gene').show(50, truncate=False)

print('\n=== Schema: genefitness ===')
spark.sql('DESCRIBE kescience_fitnessbrowser.genefitness').show(50, truncate=False)

print('\n=== Schema: seedannotation ===')
spark.sql('DESCRIBE kescience_fitnessbrowser.seedannotation').show(50, truncate=False)

=== Tables in kescience_fitnessbrowser ===
+------------------------+---------------------------------+-----------+
|namespace               |tableName                        |isTemporary|
+------------------------+---------------------------------+-----------+
|kescience_fitnessbrowser|organism                         |false      |
|kescience_fitnessbrowser|gene                             |false      |
|kescience_fitnessbrowser|ortholog                         |false      |
|kescience_fitnessbrowser|experiment                       |false      |
|kescience_fitnessbrowser|genefitness                      |false      |
|kescience_fitnessbrowser|cofit                            |false      |
|kescience_fitnessbrowser|specificphenotype                |false      |
|kescience_fitnessbrowser|genedomain                       |false      |
|kescience_fitnessbrowser|genefeature                      |false      |
|kescience_fitnessbrowser|straindataseek                   |false      |
|kescien

## 2. Get Fitness Browser Organism Metadata

In [3]:
organisms = spark.sql('SELECT * FROM kescience_fitnessbrowser.organism').toPandas()

print(f'FB organisms: {len(organisms)}')
print('\nColumns:', organisms.columns.tolist())
print('\nSample:')
print(organisms.head(10).to_string())

organisms.to_csv(DATA_DIR / 'organism_metadata.csv', index=False)
print(f'\nSaved to: {DATA_DIR}/organism_metadata.csv')

FB organisms: 48

Columns: ['orgId', 'division', 'genus', 'species', 'strain', 'taxonomyId']

Sample:
             orgId             division             genus           species               strain taxonomyId
0  acidovorax_3H11   Betaproteobacteria        Acidovorax               sp.           GW101-3H11      12916
1             ANA3  Gammaproteobacteria        Shewanella               sp.                ANA-3      94122
2           azobra  Alphaproteobacteria      Azospirillum        brasilense                Sp245    1064539
3            BFirm   Betaproteobacteria      Burkholderia      phytofirmans                 PsJN     398527
4           Btheta        Bacteroidetes       Bacteroides  thetaiotaomicron             VPI-5482     226186
5          Burk376   Betaproteobacteria  Paraburkholderia         bryophila          376MFSha3.1    1169143
6            Caulo  Alphaproteobacteria       Caulobacter        crescentus               NA1000     565050
7             Cola        Bacteroi

## 3. Inspect Raw GapMind Data

Check the `sequence_scope` column to see if it contains gene identifiers.
If so, we can use it for more precise pathway-gene mapping.

In [4]:
print('Raw gapmind_pathways sample (pre-aggregation):')
raw_sample = spark.sql("""
    SELECT *
    FROM kbase_ke_pangenome.gapmind_pathways
    WHERE pathway = 'his'
    LIMIT 10
""").toPandas()
print(raw_sample.to_string())

print('\n\nDistinct sequence_scope values:')
sc_vals = spark.sql("""
    SELECT sequence_scope, COUNT(*) as n
    FROM kbase_ke_pangenome.gapmind_pathways
    GROUP BY sequence_scope
    ORDER BY n DESC
    LIMIT 20
""").toPandas()
print(sc_vals.to_string())

Raw gapmind_pathways sample (pre-aggregation):
         genome_id pathway                                            clade_name metabolic_category sequence_scope  nHi  nMed  nLo  score     score_category  score_simplified
0  GCA_021627005.1     his  s__Cryptobacteroides_sp900544195--GB_GCA_900544195.1                 aa           core    8     1    2    3.9  steps_missing_low               0.0
1  GCA_021630165.1     his  s__Cryptobacteroides_sp900544195--GB_GCA_900544195.1                 aa           core    9     1    1    6.9  steps_missing_low               0.0
2  GCA_021623465.1     his  s__Cryptobacteroides_sp900544195--GB_GCA_900544195.1                 aa           core    9     1    1    6.9  steps_missing_low               0.0
3  GCA_021635085.1     his  s__Cryptobacteroides_sp900544195--GB_GCA_900544195.1                 aa           core    5     1    5   -5.1  steps_missing_low               0.0
4  GCA_021621345.1     his  s__Cryptobacteroides_sp900544195--GB_GCA_900544195

## 4. Get SEED Subsystem Annotations

SEED annotations describe each gene's functional role, and these roles overlap with GapMind pathways.
We use them as a proxy for pathway membership when direct gene-pathway links are unavailable.

In [5]:
print('Sample SEED annotations:')
seed_sample = spark.sql('SELECT * FROM kescience_fitnessbrowser.seedannotation LIMIT 10').toPandas()
print(seed_sample.to_string())
print('\nColumns:', seed_sample.columns.tolist())

Sample SEED annotations:
     orgId        locusId                                                                                                         seed_desc
0  Pedo557  CA265_RS14400                                                                DNA topoisomerase IB (poxvirus type) (EC 5.99.1.2)
1  Pedo557  CA265_RS14405                                                                     Methionyl-tRNA formyltransferase (EC 2.1.2.9)
2  Pedo557  CA265_RS14410                                                                          Aminodeoxychorismate lyase (EC 4.1.3.38)
3  Pedo557  CA265_RS14415                                                    Ribosomal large subunit pseudouridine synthase D (EC 4.2.1.70)
4  Pedo557  CA265_RS14425                                                         1-aminocyclopropane-1-carboxylate deaminase (EC 3.5.99.7)
5  Pedo557  CA265_RS14430                                                                                  Alpha-L-fucosidase (EC 3.2.1

In [6]:
# Pull all SEED annotations; manageable size (~180K rows for 48 organisms)
# The seedannotation table has `seed_desc` (gene role description), not `subsystem`.
# We match GapMind pathways against these role descriptions via keywords.
seed_all = spark.sql("""
    SELECT orgId, locusId, seed_desc
    FROM kescience_fitnessbrowser.seedannotation
""").toPandas()

print(f'Total SEED annotations: {len(seed_all):,}')
print(f'Unique role descriptions: {seed_all["seed_desc"].nunique():,}')
print(f'Organisms with annotations: {seed_all["orgId"].nunique()}')

print('\nTop 40 most common role descriptions:')
print(seed_all.groupby('seed_desc').size().sort_values(ascending=False).head(40).to_string())

seed_all.to_csv(DATA_DIR / 'seed_annotations.csv', index=False)
print(f'\nSaved to: {DATA_DIR}/seed_annotations.csv')

Total SEED annotations: 177,519
Unique role descriptions: 23,049
Organisms with annotations: 48

Top 40 most common role descriptions:
seed_desc
Mobile element protein                                                                           1494
Transcriptional regulator, LysR family                                                            983
diguanylate cyclase/phosphodiesterase (GGDEF & EAL domains) with PAS/PAC sensor(s)                543
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)                              532
Probable transmembrane protein                                                                    522
3-oxoacyl-[acyl-carrier protein] reductase (EC 1.1.1.100)                                         501
Permease of the drug/metabolite transporter (DMT) superfamily                                     450
Methyl-accepting chemotaxis protein                                                               401
Permeases of the major facilitator supe

## 5. Build GapMind → SEED Keyword Mapping

The `seed_desc` column contains gene role descriptions (e.g. "Histidinol dehydrogenase (EC 1.1.1.23)").
We keyword-match these against GapMind pathway names. Amino acid biosynthesis genes typically
include the pathway name in their description (e.g. "Arginine biosynthesis protein ArgJ"),
so this proxy is expected to work best for those categories. Substring matching is permissive,
so matched descriptions should be spot-checked for unrelated roles that happen to contain a keyword.
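As a minimal illustration of the matching rule, the sketch below applies case-insensitive substring matching to three toy role descriptions. The helper `match_keywords` and the toy series are hypothetical and not part of this notebook's pipeline. The second query shows how a short keyword can also pick up an unrelated role.

```python
import re
import pandas as pd

# Toy role descriptions (illustrative only, not queried from the database)
descs = pd.Series([
    'Histidinol dehydrogenase (EC 1.1.1.23)',
    'Arginine biosynthesis protein ArgJ',
    'D-alanyl-D-alanine carboxypeptidase (EC 3.4.16.4)',
])


def match_keywords(descs: pd.Series, keywords: list) -> pd.Series:
    """Boolean mask: True where any keyword is a case-insensitive substring."""
    # re.escape keeps characters such as '(' or '.' literal in the pattern
    pattern = '|'.join(re.escape(kw) for kw in keywords)
    return descs.str.contains(pattern, case=False, regex=True, na=False)


his_mask = match_keywords(descs, ['histidine biosyn', 'histidinol'])
dala_mask = match_keywords(descs, ['d-alanine', 'alanine racemase'])

print(descs[his_mask].tolist())   # the histidinol dehydrogenase role
print(descs[dala_mask].tolist())  # a cell-wall peptidase, matched only by substring
```

A vectorized mask like this gives the same kind of result as a Python loop over descriptions, and it makes it easy to review which descriptions each keyword list pulls in.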

In [7]:
# Keyword map: GapMind pathway name → list of strings to search in seed_desc values
# Each keyword is checked as a case-insensitive substring of the role description.
# Amino acid biosynthesis descriptions typically contain the amino acid name directly.
PATHWAY_SEED_KEYWORDS = {
    # ── Amino acid biosynthesis ──────────────────────────────────────
    'alanine':     ['alanine biosyn', 'alanine aminotransferase'],
    'arg':         ['arginine biosyn', 'argininosuccinate', 'carbamoyl phosphate synthase', 'ornithine carbamoyltransferase'],
    'arginine':    ['arginine catab', 'arginine deiminase', 'arginine decarboxylase'],
    'asn':         ['asparagine biosyn', 'asparagine synthetase'],
    'asparagine':  ['asparagine biosyn', 'asparagine synthetase'],
    'aspartate':   ['aspartate biosyn', 'aspartate transaminase', 'aspartate aminotransferase'],
    'chorismate':  ['chorismate', 'shikimate', '3-dehydroquinate', 'EPSP synthase', 'aromatic amino acid biosyn'],
    'cys':         ['cysteine biosyn', 'cysteine synthase', 'serine acetyltransferase'],
    'cysteine':    ['cysteine biosyn', 'cysteine synthase', 'serine acetyltransferase'],
    'gln':         ['glutamine synthetase', 'glutamine biosyn'],
    'glu':         ['glutamate biosyn', 'glutamate synthase', 'glutamine oxoglutarate aminotransferase'],
    'glutamate':   ['glutamate biosyn', 'glutamate synthase', 'glutamate dehydrogenase'],
    'glutamine':   ['glutamine synthetase', 'glutamine biosyn'],
    'gly':         ['glycine biosyn', 'serine hydroxymethyltransferase', 'glycine cleavage', 'threonine aldolase'],
    'glycine':     ['glycine biosyn', 'serine hydroxymethyltransferase', 'glycine cleavage'],
    'his':         ['histidine biosyn', 'histidinol', 'imidazoleglycerol', 'phosphoribosyl-atp'],
    'histidine':   ['histidine biosyn', 'histidinol', 'imidazoleglycerol'],
    'ile':         ['isoleucine biosyn', 'threonine dehydratase', 'acetolactate synthase', 'dihydroxyacid dehydratase', 'branched-chain amino acid biosyn'],
    'isoleucine':  ['isoleucine biosyn', 'threonine dehydratase'],
    'leu':         ['leucine biosyn', '2-isopropylmalate', 'branched-chain amino acid biosyn'],
    'leucine':     ['leucine biosyn', '2-isopropylmalate'],
    'lys':         ['lysine biosyn', 'diaminopimelate', 'aspartate kinase', 'aspartate semialdehyde'],
    'lysine':      ['lysine biosyn', 'diaminopimelate', 'lysine aminotransferase'],
    'met':         ['methionine biosyn', 'homocysteine methyltransferase', 'cystathionine', 'O-succinylhomoserine'],
    'methionine':  ['methionine biosyn', 'methionine synthase', 'cystathionine'],
    'phe':         ['phenylalanine biosyn', 'chorismate mutase', 'prephenate dehydratase', 'aromatic amino acid biosyn'],
    'phenylalanine':['phenylalanine biosyn', 'phenylalanine aminotransferase'],
    'pro':         ['proline biosyn', 'gamma-glutamyl kinase', 'pyrroline-5-carboxylate'],
    'proline':     ['proline biosyn', 'gamma-glutamyl kinase', 'pyrroline-5-carboxylate'],
    'ser':         ['serine biosyn', 'phosphoserine', 'phosphoglycerate dehydrogenase'],
    'serine':      ['serine biosyn', 'phosphoserine', 'phosphoglycerate dehydrogenase'],
    'thr':         ['threonine biosyn', 'homoserine kinase', 'threonine synthase', 'aspartate kinase'],
    'threonine':   ['threonine biosyn', 'homoserine kinase', 'threonine synthase'],
    'trp':         ['tryptophan biosyn', 'anthranilate', 'indole-3-glycerol phosphate', 'tryptophan synthase'],
    'tryptophan':  ['tryptophan biosyn', 'anthranilate', 'tryptophan synthase'],
    'tyr':         ['tyrosine biosyn', 'prephenate dehydrogenase', 'chorismate mutase', 'aromatic amino acid biosyn'],
    'tyrosine':    ['tyrosine biosyn', 'prephenate dehydrogenase'],
    'val':         ['valine biosyn', 'acetolactate synthase', 'dihydroxyacid dehydratase', 'branched-chain amino acid biosyn'],
    'valine':      ['valine biosyn', 'acetolactate synthase'],
    # ── Carbon source utilization ─────────────────────────────────────
    '2-oxoglutarate':    ['2-oxoglutarate', 'alpha-ketoglutarate', '2-oxoglutarate dehydrogenase'],
    '4-hydroxybenzoate': ['4-hydroxybenzoate', 'hydroxybenzoate'],
    'acetate':           ['acetate kinase', 'phosphotransacetylase', 'acetyl-coa synthetase'],
    'arabinose':         ['arabinose isomerase', 'ribulokinase', 'l-arabinose', 'arabinose transport'],
    'cellobiose':        ['cellobiose', 'beta-glucosidase', 'phospho-beta-glucosidase'],
    'citrate':           ['citrate lyase', 'citrate synthase', 'citrate transport'],
    'D-alanine':         ['d-alanine', 'alanine racemase'],
    'D-lactate':         ['d-lactate', 'd-lactate dehydrogenase'],
    'D-serine':          ['d-serine', 'd-serine dehydratase', 'd-serine deaminase'],
    'deoxyinosine':      ['purine nucleoside phosphorylase', 'deoxyinosine', 'nucleoside catab'],
    'deoxyribose':       ['deoxyribose', '2-deoxyribose', 'deoxyribose-phosphate aldolase'],
    'ethanol':           ['alcohol dehydrogenase', 'aldehyde dehydrogenase', 'ethanol oxidation'],
    'fructose':          ['fructokinase', 'fructose-bisphosphate', 'fructose pts', 'fructose transport'],
    'galactose':         ['galactokinase', 'galactose-1-phosphate', 'galactose mutarotase', 'gal operon'],
    'glycerol':          ['glycerol kinase', 'glycerol-3-phosphate', 'glycerol facilitator'],
    'L-lactate':         ['l-lactate dehydrogenase', 'l-lactate permease'],
    'L-malate':          ['malate dehydrogenase', 'malate permease', 'malic enzyme'],
    'lactose':           ['lactose permease', 'beta-galactosidase', 'lactose pts'],
    'maltose':           ['maltose-binding protein', 'maltodextrin', 'alpha-glucosidase'],
    'mannose':           ['mannose-6-phosphate isomerase', 'mannose pts', 'mannose transport'],
    'NAG':               ['n-acetylglucosamine', 'glucosamine-6-phosphate', 'nagB', 'nagA'],
    'ribose':            ['ribose transport', 'ribokinase', 'd-ribose'],
    'sorbitol':          ['sorbitol-6-phosphate', 'glucitol', 'l-iditol dehydrogenase'],
    'sucrose':           ['sucrose-6-phosphate', 'sucrose pts', 'invertase', 'sucrose phosphorylase'],
    'thymidine':         ['thymidine phosphorylase', 'thymine permease'],
    'trehalose':         ['trehalose-6-phosphate', 'trehalose pts', 'trehalase'],
    'xylose':            ['xylose isomerase', 'xylulokinase', 'd-xylose'],
    # ── Other ────────────────────────────────────────────────────────
    'citrulline':        ['citrulline', 'ornithine carbamoyltransferase', 'argininosuccinate'],
    'putrescine':        ['putrescine', 'spermidine synthase', 'ornithine decarboxylase', 'agmatinase'],
}


def find_matching_descs(pathway: str, all_descs: np.ndarray) -> list:
    """Return seed_desc values that match keywords for the given GapMind pathway."""
    keywords = PATHWAY_SEED_KEYWORDS.get(pathway, [pathway.lower()])
    matches = [
        desc for desc in all_descs
        if any(kw.lower() in str(desc).lower() for kw in keywords)
    ]
    return matches


# Load GapMind pathway list from NB01 summary
pathway_summary = pd.read_csv(DATA_DIR / 'gapmind_pathway_summary.csv')
pathways_list = pathway_summary['pathway'].tolist()
all_descs = seed_all['seed_desc'].dropna().unique()

# Build the mapping: pathway → matched seed_desc values
pathway_to_descs = {}
no_match = []

for pathway in sorted(pathways_list):
    matched = find_matching_descs(pathway, all_descs)
    pathway_to_descs[pathway] = matched
    if not matched:
        no_match.append(pathway)

matched_count = sum(1 for v in pathway_to_descs.values() if v)
print(f'Pathways with ≥1 seed_desc match: {matched_count} / {len(pathways_list)}')
print(f'Pathways with NO match: {len(no_match)}')
if no_match:
    print('  No-match pathways:', sorted(no_match))

print('\nMapping sample (first 20 pathways, up to 3 matched descs each):')
for pw, descs in sorted(pathway_to_descs.items())[:20]:
    print(f'  {pw} ({len(descs)} matches): {descs[:3]}')

Pathways with ≥1 seed_desc match: 76 / 80
Pathways with NO match: 4
  No-match pathways: ['deoxyribonate', 'myoinositol', 'phenylalanine', 'tyrosine']

Mapping sample (first 20 pathways, up to 3 matched descs each):
  2-oxoglutarate (27 matches): ['Dihydrolipoamide succinyltransferase component (E2) of 2-oxoglutarate dehydrogenase complex (EC 2.3.1.61)', '2-oxoglutarate dehydrogenase E1 component (EC 1.2.4.2)', '4-Hydroxy-2-oxoglutarate aldolase (EC 4.1.3.16) @ 2-dehydro-3-deoxyphosphogluconate aldolase (EC 4.1.2.14)']
  4-hydroxybenzoate (12 matches): ['3-polyprenyl-4-hydroxybenzoate carboxy-lyase UbiX (EC 4.1.1.-)', '4-hydroxybenzoate transporter', 'P-hydroxybenzoate hydroxylase (EC 1.14.13.2)']
  D-alanine (15 matches): ['D-alanyl-D-alanine carboxypeptidase (EC 3.4.16.4)', 'UDP-N-acetylmuramoylalanyl-D-glutamyl-2,6-diaminopimelate--D-alanyl-D-alanine ligase (EC 6.3.2.10)', 'D-serine/D-alanine/glycine transporter']
  D-lactate (4 matches): ['D-lactate dehydrogenase (EC 1.1.1.28)', 'P

## 6. Extract Gene-Level Fitness Aggregates

Query mean |t-score| and max |t-score| per gene across all conditions.
This takes ~5-10 minutes (aggregates 27M rows → ~180K gene records).
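The same per-gene aggregation can be sketched in pandas on a toy table. The column names mirror the query in this section, but the data values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for genefitness: one row per (gene, experiment)
toy = pd.DataFrame({
    'orgId':   ['orgA', 'orgA', 'orgA', 'orgA'],
    'locusId': ['g1', 'g1', 'g2', 'g2'],
    't':       [-4.0, 2.0, 0.5, -1.5],
})

# Take absolute values first so opposite-sign effects do not cancel
toy['abs_t'] = toy['t'].abs()

agg = (
    toy.groupby(['orgId', 'locusId'])['abs_t']
       .agg(n_conditions='count', mean_abs_t='mean', max_abs_t='max')
       .reset_index()
)
print(agg)
```

For gene `g1` the signed mean of `t` would be -1.0, while the mean of `|t|` is 3.0. Averaging absolute values measures the strength of the effect regardless of its direction.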

In [8]:
gene_fitness = spark.sql("""
    SELECT
        orgId,
        locusId,
        COUNT(*)                            AS n_conditions,
        AVG(ABS(fit))                       AS mean_abs_fit,
        AVG(ABS(t))                         AS mean_abs_t,
        MAX(ABS(t))                         AS max_abs_t,
        percentile_approx(ABS(t), 0.5)     AS median_abs_t
    FROM kescience_fitnessbrowser.genefitness
    GROUP BY orgId, locusId
""").toPandas()

print(f'Fitness data: {len(gene_fitness):,} gene records')
print(f'Organisms: {gene_fitness["orgId"].nunique()}')
print('\nSample:')
print(gene_fitness.head(10).to_string())
print('\nMean |t| distribution:')
print(gene_fitness['mean_abs_t'].describe())

gene_fitness.to_csv(DATA_DIR / 'gene_fitness_aggregates.csv', index=False)
print(f'\nSaved to: {DATA_DIR}/gene_fitness_aggregates.csv')

Fitness data: 182,447 gene records
Organisms: 48

Sample:
  orgId  locusId  n_conditions  mean_abs_fit  mean_abs_t  max_abs_t  median_abs_t
0  ANA3  7022501           107      0.065629    0.533129   1.597991      0.458542
1  ANA3  7022518           107      0.298570    1.018012   4.382178      0.864578
2  ANA3  7022523           107      0.327679    0.793829   2.290653      0.736008
3  ANA3  7022525           107      0.204495    0.806039   3.869253      0.662241
4  ANA3  7022527           107      0.309969    1.188399   2.761472      1.205314
5  ANA3  7022535           107      0.387406    1.143489   3.471375      1.099596
6  ANA3  7022550           107      0.291871    0.683936   2.526782      0.618861
7  ANA3  7022556           107      0.166246    0.654944   2.304738      0.571546
8  ANA3  7022561           107      0.087301    0.581252   1.901286      0.481979
9  ANA3  7022572           107      0.472985    0.672740   3.883407      0.469636

Mean |t| distribution:
count    182447.

## 7. Identify Essential Genes

Putative essential genes = protein-coding genes (type='1') with no entries in genefitness.
Absence from genefitness means no viable transposon mutants were recovered under library conditions.
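The definition above is an anti-join: keep the protein-coding genes that have no matching key in the fitness table. A minimal pandas sketch on toy data follows. The gene IDs and the non-coding type code `'5'` are placeholders chosen for illustration.

```python
import pandas as pd

# Toy gene table; type '1' marks protein-coding genes as in the text above
genes = pd.DataFrame({
    'orgId':   ['orgA', 'orgA', 'orgA'],
    'locusId': ['g1', 'g2', 'g3'],
    'type':    ['1', '1', '5'],  # '5' is an arbitrary non-coding placeholder
})

# Toy set of (orgId, locusId) keys that appear in the fitness table
fitness_keys = pd.DataFrame({'orgId': ['orgA'], 'locusId': ['g1']})

# Left join with an indicator column, then keep rows found only on the left
merged = genes.merge(
    fitness_keys.drop_duplicates(),
    on=['orgId', 'locusId'],
    how='left',
    indicator=True,
)
putative_essential = merged[
    (merged['type'] == '1') & (merged['_merge'] == 'left_only')
]
print(putative_essential[['orgId', 'locusId']])
```

Here only `g2` is returned: `g1` has fitness data and `g3` is not protein-coding.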

In [9]:
essential = spark.sql("""
    SELECT
        g.orgId,
        g.locusId,
        g.desc AS gene_desc
    FROM kescience_fitnessbrowser.gene g
    LEFT JOIN (
        SELECT DISTINCT orgId, locusId
        FROM kescience_fitnessbrowser.genefitness
    ) gf
      ON g.orgId = gf.orgId
     AND g.locusId = gf.locusId
    WHERE g.type = '1'
      AND gf.locusId IS NULL
""").toPandas()

print(f'Putative essential genes: {len(essential):,}')
print(f'Organisms: {essential["orgId"].nunique()}')
print('\nEssential genes per organism:')
print(essential.groupby('orgId').size().sort_values(ascending=False).to_string())

essential.to_csv(DATA_DIR / 'essential_genes.tsv', sep='\t', index=False)
print(f'\nSaved to: {DATA_DIR}/essential_genes.tsv')

Putative essential genes: 41,059
Organisms: 48

Essential genes per organism:
orgId
BFirm                      1760
pseudo1_N1B4               1639
Burk376                    1408
Magneto                    1334
azobra                     1310
RalstoniaGMI1000           1103
WCS417                     1092
RalstoniaUW163             1091
Smeli                      1087
acidovorax_3H11            1040
Dino                       1007
RalstoniaBSBF1503          1007
Cup4G11                     985
pseudo6_N2E2                968
psRCH2                      920
PS                          886
PV4                         861
SyringaeB728a_mexBdelta     852
RalstoniaPSI07              837
HerbieS                     837
Koxy                        824
SyringaeB728a               818
pseudo5_N2C3_1              813
pseudo13_GW456_L13          806
MR1                         805
Putida                      794
Korea                       789
Phaeo                       781
SynE                

## 8. Match FB Organisms to GapMind Species

Use NCBI taxonomy IDs to link FB organisms to pangenome species clades.
GapMind pathway completeness data can then be retrieved for those species.
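The join only works if both sides actually hold NCBI taxonomy IDs, so it can help to confirm that a candidate column contains integer-like values before merging. The sketch below is one possible check. The helper `looks_like_taxid` and its threshold are hypothetical, and the input values are illustrative.

```python
import pandas as pd


def looks_like_taxid(values: pd.Series, min_fraction: float = 0.9) -> bool:
    """Heuristic: True if most non-null values are digit-only strings."""
    s = values.dropna().astype(str).str.strip()
    if s.empty:
        return False
    # Fraction of values consisting solely of digits
    return bool(s.str.fullmatch(r'\d+').mean() >= min_fraction)


numeric_like = pd.Series(['12916', '94122', '1064539'])
free_text = pd.Series(['type strain of species', 'not type material'])

print(looks_like_taxid(numeric_like))
print(looks_like_taxid(free_text))
```

Running a check like this on the distinct values of an auto-detected column flags cases where a name-based heuristic picked a free-text field instead of an ID field.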

In [10]:
# Inspect gtdb_metadata schema
print('=== gtdb_metadata schema ===')
spark.sql('DESCRIBE kbase_ke_pangenome.gtdb_metadata').show(50, truncate=False)

print('\nSample rows:')
spark.sql('SELECT * FROM kbase_ke_pangenome.gtdb_metadata LIMIT 5').show(5, truncate=False)

=== gtdb_metadata schema ===
+---------------------------------------+---------+-------+
|col_name                               |data_type|comment|
+---------------------------------------+---------+-------+
|accession                              |string   |NULL   |
|ambiguous_bases                        |string   |NULL   |
|checkm_completeness                    |string   |NULL   |
|checkm_contamination                   |string   |NULL   |
|checkm_marker_count                    |string   |NULL   |
|checkm_marker_lineage                  |string   |NULL   |
|checkm_marker_set_count                |string   |NULL   |
|checkm_strain_heterogeneity            |string   |NULL   |
|coding_bases                           |string   |NULL   |
|coding_density                         |string   |NULL   |
|contig_count                           |string   |NULL   |
|gc_count                               |string   |NULL   |
|gc_percentage                          |string   |NULL   |
|genome_siz

In [11]:
# Get NCBI taxonomy IDs from FB organisms
# Look for column names containing 'tax', 'ncbi', 'id'
print('FB organism columns:', organisms.columns.tolist())
print()
for col in organisms.columns:
    print(f'  {col}: {organisms[col].head(3).tolist()}')

FB organism columns: ['orgId', 'division', 'genus', 'species', 'strain', 'taxonomyId']

  orgId: ['acidovorax_3H11', 'ANA3', 'azobra']
  division: ['Betaproteobacteria', 'Gammaproteobacteria', 'Alphaproteobacteria']
  genus: ['Acidovorax', 'Shewanella', 'Azospirillum']
  species: ['sp.', 'sp.', 'brasilense']
  strain: ['GW101-3H11', 'ANA-3', 'Sp245']
  taxonomyId: ['12916', '94122', '1064539']


In [12]:
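# ── Determine the taxid column name from organism table ──────────────────────
# The substring heuristics below return the first column whose name matches,
# which is not necessarily the intended one. Compare the detected names with
# the schema inspection above and set TAX_COL, GTDB_TAX_COL, and
# GTDB_SPECIES_COL manually if they differ.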
TAX_COL = next(
    (c for c in organisms.columns if 'tax' in c.lower() or 'ncbi' in c.lower()),
    None
)
print(f'Detected FB taxid column: {TAX_COL}')

# Query GTDB metadata schema to find taxid column
gtdb_schema = spark.sql('DESCRIBE kbase_ke_pangenome.gtdb_metadata').toPandas()
GTDB_TAX_COL = next(
    (r['col_name'] for _, r in gtdb_schema.iterrows()
     if 'tax' in r['col_name'].lower() and 'ncbi' in r['col_name'].lower()),
    None
)
print(f'Detected GTDB taxid column: {GTDB_TAX_COL}')

# Also look for species/clade ID column
GTDB_SPECIES_COL = next(
    (r['col_name'] for _, r in gtdb_schema.iterrows()
     if 'species' in r['col_name'].lower() or 'clade' in r['col_name'].lower()),
    None
)
print(f'Detected GTDB species/clade column: {GTDB_SPECIES_COL}')

Detected FB taxid column: taxonomyId
Detected GTDB taxid column: gtdb_type_designation_ncbi_taxa
Detected GTDB species/clade column: gtdb_type_species_of_genus


In [13]:
# Build organism → GapMind species mapping via taxid join
# If TAX_COL or GTDB_TAX_COL is None, set it manually based on the schema output above

if TAX_COL and GTDB_TAX_COL and GTDB_SPECIES_COL:
    # Get distinct species with GapMind data
    gapmind_species_df = pd.read_csv(DATA_DIR / 'gapmind_species_summary.csv')
    gapmind_species_set = set(gapmind_species_df['species'].tolist())

    # Build taxid → clade mapping from GTDB metadata
    gtdb_map = spark.sql(f"""
        SELECT DISTINCT {GTDB_TAX_COL} AS ncbi_taxid, {GTDB_SPECIES_COL} AS clade_name
        FROM kbase_ke_pangenome.gtdb_metadata
        WHERE {GTDB_TAX_COL} IS NOT NULL
    """).toPandas()
    gtdb_map['ncbi_taxid'] = gtdb_map['ncbi_taxid'].astype(str)

    print(f'GTDB taxid→clade mappings: {len(gtdb_map):,}')
    print(gtdb_map.head(10).to_string())

    # Match FB organisms to GTDB clades
    org_map = organisms.copy()
    org_map[TAX_COL] = org_map[TAX_COL].astype(str)
    org_map = org_map.merge(gtdb_map, left_on=TAX_COL, right_on='ncbi_taxid', how='left')

    # Check which clades have GapMind data
    org_map['has_gapmind'] = org_map['clade_name'].isin(gapmind_species_set)

    matched = org_map[org_map['has_gapmind']]
    print(f'\nFB organisms with GapMind species match: {len(matched)} / {len(organisms)}')
    print(matched[['orgId', 'clade_name']].to_string())

else:
    print('WARNING: Could not auto-detect taxid columns.')
    print('Manually set TAX_COL and GTDB_TAX_COL based on schema output above.')
    org_map = organisms.copy()
    org_map['clade_name'] = np.nan
    org_map['has_gapmind'] = False
    matched = org_map.head(0)

# Also try genus-species name matching as a fallback for organisms still unmatched
import re

if 'genus' in organisms.columns and 'species' in organisms.columns:
    gapmind_species_df = pd.read_csv(DATA_DIR / 'gapmind_species_summary.csv')
    for _, row in organisms.iterrows():
        genus = str(row['genus']).strip()
        species = str(row['species']).strip()
        # Escape regex metacharacters so that e.g. 'sp.' is matched literally
        pattern = f'{re.escape(genus)}.*{re.escape(species)}'
        name_match = gapmind_species_df['species'].str.contains(
            pattern, case=False, na=False, regex=True
        )
        # Rows of org_map belonging to this organism
        org_rows = org_map['orgId'] == row['orgId']
        if name_match.any() and not org_map.loc[org_rows, 'has_gapmind'].any():
            clade = gapmind_species_df.loc[name_match, 'species'].iloc[0]
            org_map.loc[org_rows, 'clade_name'] = clade
            org_map.loc[org_rows, 'has_gapmind'] = True

# Save organism mapping
org_map.to_csv(DATA_DIR / 'organism_mapping.tsv', sep='\t', index=False)
print(f'\nSaved organism mapping: {DATA_DIR}/organism_mapping.tsv')

GTDB taxid→clade mappings: 5
                           ncbi_taxid clade_name
0              type strain of species          t
1                   not type material          f
2           type strain of subspecies          f
3  type strain of heterotypic synonym          f
4              type strain of species          f

FB organisms with GapMind species match: 0 / 48
Empty DataFrame
Columns: [orgId, clade_name]
Index: []

Saved organism mapping: /home/cjneely/repos/BERIL-research-observatory/projects/metabolic_capability_dependency/data/organism_mapping.tsv


## 9. Compute Pathway-Level Fitness Metrics

For each (organism, GapMind pathway) pair, use the genes whose SEED role descriptions matched
the pathway keywords as a proxy for pathway gene membership, and aggregate their fitness scores.
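The per-pathway summary can also be expressed as joins followed by a group-by. The sketch below uses toy tables with hypothetical gene IDs. It illustrates the aggregation logic only and is not the notebook's implementation.

```python
import pandas as pd

# Toy gene → pathway assignments (as produced by description matching)
gene_pathway = pd.DataFrame({
    'orgId':   ['orgA', 'orgA', 'orgA'],
    'locusId': ['g1', 'g2', 'g3'],
    'pathway': ['his', 'his', 'his'],
})
# Toy per-gene fitness aggregates; g3 has no fitness data
gene_fit = pd.DataFrame({
    'orgId':      ['orgA', 'orgA'],
    'locusId':    ['g1', 'g2'],
    'mean_abs_t': [1.0, 3.0],
})
# Toy putative-essential gene keys
essential_keys = pd.DataFrame({'orgId': ['orgA'], 'locusId': ['g3']})

joined = gene_pathway.merge(gene_fit, on=['orgId', 'locusId'], how='left')

# Flag essential genes by (orgId, locusId) membership
joined_idx = pd.MultiIndex.from_frame(joined[['orgId', 'locusId']])
joined['is_essential'] = joined_idx.isin(pd.MultiIndex.from_frame(essential_keys))

summary = (
    joined.groupby(['orgId', 'pathway'])
    .agg(
        n_seed_genes=('locusId', 'size'),
        n_with_fitness=('mean_abs_t', 'count'),  # count ignores NaN
        n_essential=('is_essential', 'sum'),
        mean_abs_t=('mean_abs_t', 'mean'),       # mean skips NaN
    )
    .reset_index()
)
summary['pct_essential'] = 100.0 * summary['n_essential'] / summary['n_seed_genes']
print(summary)
```

Genes without fitness data stay in the denominator for `pct_essential` but do not contribute to `mean_abs_t`, which matches how the two quantities are defined in this section.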

In [14]:
# Build gene → pathway assignments via SEED role description matching
# For each pathway, find all genes annotated to matching seed_desc values

# Index: (orgId, locusId) → mean_abs_t, max_abs_t
fitness_idx = gene_fitness.set_index(['orgId', 'locusId'])

# Essential gene set: (orgId, locusId)
essential_set = set(zip(essential['orgId'], essential['locusId']))

pathway_metrics = []

for pathway in sorted(pathways_list):
    matched_descs = pathway_to_descs.get(pathway, [])
    if not matched_descs:
        continue

    # Genes annotated to this pathway via matching seed_desc values
    pathway_genes_seed = seed_all[
        seed_all['seed_desc'].isin(matched_descs)
    ][['orgId', 'locusId']].drop_duplicates()

    if len(pathway_genes_seed) == 0:
        continue

    # Group by organism
    for org_id, org_genes in pathway_genes_seed.groupby('orgId'):
        loci = org_genes['locusId'].tolist()

        # Genes with fitness data
        loci_in_fitness = [
            l for l in loci
            if (org_id, l) in fitness_idx.index
        ]

        # Essential genes in this pathway
        n_essential = sum(1 for l in loci if (org_id, l) in essential_set)

        n_genes = len(loci)
        n_with_fitness = len(loci_in_fitness)
        pct_essential = 100.0 * n_essential / n_genes if n_genes > 0 else np.nan

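        if n_with_fitness > 0:
            # Note: all three pathway summaries below are taken over the per-gene
            # mean |t| values, so max_abs_t is the largest per-gene mean, not the
            # largest single-experiment |t|.
            t_scores = fitness_idx.loc[
                [(org_id, l) for l in loci_in_fitness], 'mean_abs_t'
            ].values
            mean_abs_t = float(np.nanmean(t_scores))
            max_abs_t = float(np.nanmax(t_scores))
            median_abs_t = float(np.nanmedian(t_scores))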
        else:
            mean_abs_t = np.nan
            max_abs_t = np.nan
            median_abs_t = np.nan

        pathway_metrics.append({
            'orgId': org_id,
            'pathway': pathway,
            'pathway_category': categorize_pathway(pathway),
            'n_seed_genes': n_genes,
            'n_with_fitness': n_with_fitness,
            'n_essential': n_essential,
            'pct_essential': pct_essential,
            'mean_abs_t': mean_abs_t,
            'max_abs_t': max_abs_t,
            'median_abs_t': median_abs_t,
            'matched_seed_descs': '|'.join(matched_descs),
        })

pathway_metrics_df = pd.DataFrame(pathway_metrics)
print(f'Pathway-level fitness metrics: {len(pathway_metrics_df):,} records')
print(f'Organisms: {pathway_metrics_df["orgId"].nunique()}')
print(f'Pathways covered: {pathway_metrics_df["pathway"].nunique()}')
print('\nSample:')
print(pathway_metrics_df.head(20).to_string())

Pathway-level fitness metrics: 3,065 records
Organisms: 48
Pathways covered: 76

Sample:
       orgId         pathway pathway_category  n_seed_genes  n_with_fitness  n_essential  pct_essential  mean_abs_t  max_abs_t  median_abs_t
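
The per-pathway, per-organism loop above can also be expressed as joins followed by a single `groupby`. The following is a minimal sketch on made-up toy tables, not the notebook's actual implementation. It assumes `gene_fitness` has one row per `(orgId, locusId)`, and it counts `n_with_fitness` as the number of non-null `mean_abs_t` values, which can differ slightly from the index-membership test used in the loop. The toy values and the `desc_map`, `genes`, and `sketch` names are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the notebook's tables (all values are invented).
seed_all = pd.DataFrame({
    'orgId':     ['orgA', 'orgA', 'orgA', 'orgB'],
    'locusId':   ['g1', 'g2', 'g3', 'g1'],
    'seed_desc': ['Histidinol dehydrogenase', 'Histidinol dehydrogenase',
                  'Unrelated role', 'Histidinol dehydrogenase'],
})
gene_fitness = pd.DataFrame({
    'orgId': ['orgA', 'orgB'],
    'locusId': ['g1', 'g1'],
    'mean_abs_t': [2.0, 1.0],
})
essential = pd.DataFrame({'orgId': ['orgA'], 'locusId': ['g2']})
pathway_to_descs = {'his': ['Histidinol dehydrogenase']}

# Long table with one row per (pathway, seed_desc) pair.
desc_map = pd.DataFrame(
    [(p, d) for p, descs in pathway_to_descs.items() for d in descs],
    columns=['pathway', 'seed_desc'],
)

# One row per (pathway, orgId, locusId), then attach fitness and essentiality.
genes = (
    seed_all.merge(desc_map, on='seed_desc')
    [['pathway', 'orgId', 'locusId']]
    .drop_duplicates()
    .merge(gene_fitness, on=['orgId', 'locusId'], how='left')
    .merge(
        essential[['orgId', 'locusId']].drop_duplicates().assign(is_essential=True),
        on=['orgId', 'locusId'], how='left',
    )
)
genes['is_essential'] = genes['is_essential'].notna()

# Aggregate per organism and pathway; mean/median skip genes without fitness.
sketch = (
    genes.groupby(['orgId', 'pathway'])
    .agg(
        n_seed_genes=('locusId', 'size'),
        n_with_fitness=('mean_abs_t', 'count'),
        n_essential=('is_essential', 'sum'),
        mean_abs_t=('mean_abs_t', 'mean'),
        median_abs_t=('mean_abs_t', 'median'),
    )
    .reset_index()
)
sketch['pct_essential'] = 100.0 * sketch['n_essential'] / sketch['n_seed_genes']
print(sketch.to_string(index=False))
```

With 48 organisms and 76 pathways the loop is fast enough, so this form is mainly useful if the gene tables grow or if the aggregation needs to move into Spark.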

In [15]:
# Filter to records with enough genes for reliable metrics
MIN_SEED_GENES = 3

pathway_metrics_filtered = pathway_metrics_df[
    pathway_metrics_df['n_seed_genes'] >= MIN_SEED_GENES
].copy()

print(f'Records with ≥{MIN_SEED_GENES} SEED genes: {len(pathway_metrics_filtered):,}')
print(f'Organisms: {pathway_metrics_filtered["orgId"].nunique()}')
print(f'Pathways: {pathway_metrics_filtered["pathway"].nunique()}')

print('\nPathway coverage:')
print(pathway_metrics_filtered.groupby('pathway_category')[['pathway', 'orgId']].nunique())

print('\nmean_abs_t distribution by category:')
print(pathway_metrics_filtered.groupby('pathway_category')['mean_abs_t'].describe())

Records with ≥3 SEED genes: 2,063
Organisms: 48
Pathways: 74

Pathway coverage:
                  pathway  orgId
pathway_category                
amino_acid             43     48
carbon                 16     48
other                  15     48

mean_abs_t distribution by category:
                   count      mean       std       min       25%       50%  \
pathway_category                                                             
amino_acid        1246.0  1.841079  1.323796  0.443576  0.921716  1.361938   
carbon             383.0  1.232448  0.806703  0.390014  0.792297  0.973844   
other              372.0  1.777302  1.584412  0.496418  0.800839  1.081759   

                       75%        max  
pathway_category                       
amino_acid        2.282729  15.461652  
carbon            1.352725   8.509163  
other             2.180193  15.461652  
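
Note that the `describe()` counts above sum to 2,001 (1,246 + 383 + 372), fewer than the 2,063 filtered records, because `describe()` skips records whose `mean_abs_t` is NaN, which the loop assigns when no gene in the pathway has fitness data. To see how the choice of `MIN_SEED_GENES` trades record count against such gaps, a small sweep can help. This is a hedged sketch on invented toy data; `threshold_sweep` is a hypothetical helper, not part of `pathway_utils`.

```python
import pandas as pd

# Toy pathway-level records (values invented for illustration).
toy_metrics = pd.DataFrame({
    'orgId':          ['orgA', 'orgA', 'orgB', 'orgB', 'orgB'],
    'pathway':        ['his',  'trp',  'his',  'trp',  'leu'],
    'n_seed_genes':   [5,      2,      3,      1,      4],
    'n_with_fitness': [4,      2,      0,      1,      3],
})


def threshold_sweep(df, thresholds):
    """Summarize what survives each candidate minimum-gene threshold."""
    rows = []
    for t in thresholds:
        kept = df[df['n_seed_genes'] >= t]
        rows.append({
            'min_seed_genes': t,
            'n_records': len(kept),
            'n_pathways': kept['pathway'].nunique(),
            # Records that pass the filter but have no fitness data at all.
            'n_without_fitness': int((kept['n_with_fitness'] == 0).sum()),
        })
    return pd.DataFrame(rows)


sweep = threshold_sweep(toy_metrics, thresholds=[1, 2, 3, 4, 5])
print(sweep.to_string(index=False))
```

Applied to `pathway_metrics_df`, the same helper would show how many records and pathways each threshold retains before committing to a value.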


In [16]:
# Save pathway-level metrics (unfiltered; pathway_metrics_filtered is not written to disk)
pathway_metrics_df.to_csv(DATA_DIR / 'pathway_fitness_metrics.csv', index=False)
print(f'Saved: {DATA_DIR}/pathway_fitness_metrics.csv ({len(pathway_metrics_df):,} rows)')

# Summary stats
print('\n=== Completion Summary ===')
for f in sorted(DATA_DIR.glob('*.csv')) + sorted(DATA_DIR.glob('*.tsv')):
    size_mb = f.stat().st_size / 1024**2
    print(f'  {f.name}: {size_mb:.2f} MB')

Saved: /home/cjneely/repos/BERIL-research-observatory/projects/metabolic_capability_dependency/data/pathway_fitness_metrics.csv (3,065 rows)

=== Completion Summary ===
  gapmind_genome_pathways.csv: 1669.98 MB
  gapmind_pathway_summary.csv: 0.01 MB
  gapmind_species_summary.csv: 1.45 MB
  gene_fitness_aggregates.csv: 16.96 MB
  organism_metadata.csv: 0.00 MB
  pathway_fitness_metrics.csv: 3.47 MB
  seed_annotations.csv: 11.36 MB
  essential_genes.tsv: 2.19 MB
  organism_mapping.tsv: 0.01 MB


## Completion

**Outputs generated**:
1. `data/organism_metadata.csv` — All 48 FB organisms with metadata
2. `data/organism_mapping.tsv` — FB org → GapMind species clade mapping
3. `data/seed_annotations.csv` — SEED subsystem annotations per gene
4. `data/gene_fitness_aggregates.csv` — Mean/max |t-score| per gene
5. `data/essential_genes.tsv` — Putative essential genes
6. `data/pathway_fitness_metrics.csv` — Per-organism per-pathway fitness metrics

**Limitation**: Gene-pathway assignment uses SEED subsystems as a proxy. Pathways with
no SEED match or poorly annotated SEED subsystems will have fewer genes and noisier metrics.
Inspect the `matched_seed_descs` column to see which SEED descriptions were matched for each pathway.
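
One way to carry out that inspection is to unpack the pipe-delimited `matched_seed_descs` strings into one row per pathway and description. The sketch below uses an invented toy frame in place of `pathway_fitness_metrics.csv`, and it assumes no SEED description itself contains a `|` character, since that would break the split.

```python
import pandas as pd

# Toy stand-in for pathway_fitness_metrics.csv (values invented).
toy_metrics = pd.DataFrame({
    'orgId': ['orgA', 'orgB'],
    'pathway': ['his', 'his'],
    'n_seed_genes': [4, 1],
    'matched_seed_descs': [
        'Histidinol dehydrogenase|ATP phosphoribosyltransferase',
        'Histidinol dehydrogenase|ATP phosphoribosyltransferase',
    ],
})

# The matched descriptions are the same for every organism row of a pathway,
# so deduplicate first, then split on '|' and explode to long form.
desc_long = (
    toy_metrics[['pathway', 'matched_seed_descs']]
    .drop_duplicates()
    .assign(seed_desc=lambda d: d['matched_seed_descs'].str.split('|'))
    .explode('seed_desc')
    [['pathway', 'seed_desc']]
    .reset_index(drop=True)
)
print(desc_long.to_string(index=False))

# Number of distinct SEED descriptions matched per pathway.
print(desc_long.groupby('pathway')['seed_desc'].nunique())
```

Pathways with only one or two matched descriptions are the ones most likely to yield small gene sets and noisy metrics.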

**Next step**: Run NB03 to classify pathways as active dependencies vs latent capabilities.