# NB01: Data Extraction and Metabolite Name Harmonization

Extract metabolic data for *Pseudomonas fluorescens* FW300-N2E3 from four BERDL databases and build a unified metabolite crosswalk table.

**Databases:**
- Web of Microbes (`kescience_webofmicrobes`) — exometabolomic profile
- Fitness Browser (`kescience_fitnessbrowser`) — carbon/nitrogen source experiments
- BacDive (`kescience_bacdive`) — species-level metabolite utilization
- Pangenome (`kbase_ke_pangenome`) — GapMind pathway predictions

**Outputs:**
- `data/wom_profile.tsv` — full WoM exometabolomic profile
- `data/fb_experiments.tsv` — FB carbon/nitrogen source experiments
- `data/bacdive_utilization.tsv` — BacDive P. fluorescens utilization
- `data/gapmind_pathways.tsv` — GapMind pathway predictions for FW300-N2E3
- `data/metabolite_crosswalk.tsv` — unified metabolite mapping table

In [1]:
import os
import pandas as pd
import numpy as np

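# Spark session (get_spark_session is not imported in this notebook;
# it is assumed to be supplied by the notebook environment)
spark = get_spark_session()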

DATA_DIR = '../data'
os.makedirs(DATA_DIR, exist_ok=True)

# Strain identifiers
WOM_NAME = 'Pseudomonas sp. (FW300-N2E3)'
FB_ORG = 'pseudo3_N2E3'
BACDIVE_SPECIES = 'Pseudomonas fluorescens'
PANGENOME_GENOME = 'RS_GCF_001307155.1'
PANGENOME_CLADE = 's__Pseudomonas_E_fluorescens_E--RS_GCF_001307155.1'

## 1. Web of Microbes — Exometabolomic Profile

In [2]:
wom_df = spark.sql(f"""
SELECT c.compound_name, obs.action, c.formula, c.pubchem_id,
       c.inchi_key, c.smiles_string, c.metabolite_atlas_id,
       e.env_name, p.project_name
FROM kescience_webofmicrobes.observation obs
JOIN kescience_webofmicrobes.organism o ON o.id = obs.organism_id
JOIN kescience_webofmicrobes.compound c ON c.id = obs.compound_id
JOIN kescience_webofmicrobes.environment e ON e.id = obs.environment_id
LEFT JOIN kescience_webofmicrobes.project p ON p.id = obs.project_id
WHERE o.common_name = '{WOM_NAME}'
ORDER BY obs.action, c.compound_name
""").toPandas()

print(f"WoM observations: {len(wom_df)}")
print(f"\nAction counts:")
print(wom_df['action'].value_counts().to_string())
print(f"\nEnvironment: {wom_df['env_name'].unique()}")
print(f"\nProduced metabolites (E+I): {len(wom_df[wom_df['action'].isin(['E','I'])])}")

wom_df.to_csv(f'{DATA_DIR}/wom_profile.tsv', sep='\t', index=False)
wom_df[wom_df['action'].isin(['E','I'])].head(20)

WoM observations: 105

Action counts:
action
N    47
I    31
E    27

Environment: <ArrowStringArray>
['R2A']
Length: 1, dtype: str

Produced metabolites (E+I): 58


Unnamed: 0,compound_name,action,formula,pubchem_id,inchi_key,smiles_string,metabolite_atlas_id,env_name,project_name
0,2-hydroxy-4-(methylthio)butyric acid,E,C5H10O3S,,,,,R2A,ENIGMA_SK_BEMC2016
1,3-hydroxybenzoate,E,C7H6O3,,,,,R2A,ENIGMA_SK_BEMC2016
2,4-acetamidobutanoate,E,C6H11NO3,,,,,R2A,ENIGMA_SK_BEMC2016
3,4-hydroxy-2-quinolinecarboxylic acid,E,C10H7NO3,,,,,R2A,ENIGMA_SK_BEMC2016
4,4-imidazoleacetic acid,E,C5H6N2O2,,,,,R2A,ENIGMA_SK_BEMC2016
5,5-aminopentanoate,E,C5H11NO2,,,,,R2A,ENIGMA_SK_BEMC2016
6,5-hydroxylysine,E,C6H14N2O3,,,,,R2A,ENIGMA_SK_BEMC2016
7,Cytosine,E,,,,,,R2A,ENIGMA_SK_BEMC2016
8,N-acetyl-alanine,E,C5H9NO3,,,,,R2A,ENIGMA_SK_BEMC2016
9,N-acetyl-glutamic acid,E,C7H11NO5,,,,,R2A,ENIGMA_SK_BEMC2016


## 2. Fitness Browser — Carbon/Nitrogen Source Experiments

In [3]:
# Get all carbon/nitrogen source experiments for FW300-N2E3
fb_exps = spark.sql(f"""
SELECT expName, expGroup, condition_1
FROM kescience_fitnessbrowser.experiment
WHERE orgId = '{FB_ORG}'
AND expGroup IN ('carbon source', 'nitrogen source')
ORDER BY expGroup, condition_1
""").toPandas()

print(f"FB experiments: {len(fb_exps)}")
print(f"  Carbon source: {len(fb_exps[fb_exps['expGroup']=='carbon source'])}")
print(f"  Nitrogen source: {len(fb_exps[fb_exps['expGroup']=='nitrogen source'])}")
print(f"\nUnique conditions: {fb_exps['condition_1'].nunique()}")

# Deduplicate conditions (some have replicates)
fb_conditions = fb_exps.groupby(['condition_1', 'expGroup']).size().reset_index(name='n_replicates')
print(f"\nUnique condition × type pairs: {len(fb_conditions)}")

fb_exps.to_csv(f'{DATA_DIR}/fb_experiments.tsv', sep='\t', index=False)
fb_conditions.sort_values('condition_1')

FB experiments: 120
  Carbon source: 82
  Nitrogen source: 38

Unique conditions: 62

Unique condition × type pairs: 76


Unnamed: 0,condition_1,expGroup,n_replicates
0,Adenine hydrochloride hydrate,nitrogen source,1
1,Adenosine,nitrogen source,1
2,Ammonium chloride,nitrogen source,1
3,Carnitine Hydrochloride,carbon source,1
4,Carnitine Hydrochloride,nitrogen source,1
...,...,...,...
71,Uridine,carbon source,1
72,Uridine,nitrogen source,1
73,a-Ketoglutaric acid disodium salt hydrate,carbon source,2
74,casamino acids,carbon source,1
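
The counts above fit together arithmetically. Only two experiment groups are queried, so the number of condition × type pairs equals the number of unique conditions plus the number of compounds tested under both groups (76 − 62 = 14 here), and the replicate counts sum back to the number of experiments. A minimal sketch on toy rows (the compound/group combinations below are invented for illustration, not taken from the Fitness Browser):

```python
import pandas as pd

# Toy experiment list (hypothetical rows, same column names as fb_exps)
exps = pd.DataFrame({
    "condition_1": ["Uridine", "Uridine", "Glycine", "Glycine", "Glycine"],
    "expGroup": ["carbon source", "nitrogen source",
                 "carbon source", "carbon source", "nitrogen source"],
})

# One row per (condition, group) pair, with the replicate count
pairs = (
    exps.groupby(["condition_1", "expGroup"])
    .size()
    .reset_index(name="n_replicates")
)

n_unique = exps["condition_1"].nunique()
n_pairs = len(pairs)
# Compounds that appear under both experiment groups
groups_per_condition = pairs.groupby("condition_1")["expGroup"].nunique()
n_both = int((groups_per_condition == 2).sum())

print(n_unique, n_pairs, n_both)
print(int(pairs["n_replicates"].sum()) == len(exps))
```

The same identity can serve as a quick sanity check on `fb_conditions` before it is used in the crosswalk.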


In [4]:
# Get gene annotations for FW300-N2E3
fb_genes = spark.sql(f"""
SELECT g.locusId, g.sysName, g.gene, g.desc, g.type
FROM kescience_fitnessbrowser.gene g
WHERE g.orgId = '{FB_ORG}' AND g.type = '1'
""").toPandas()

print(f"Protein-coding genes: {len(fb_genes)}")
fb_genes.head()

Protein-coding genes: 5766


Unnamed: 0,locusId,sysName,gene,desc,type
0,AO353_00005,AO353_00005,,isoprenylcysteine carboxyl methyltransferase,1
1,AO353_00010,AO353_00010,,hypothetical protein,1
2,AO353_00015,AO353_00015,,FAD-dependent oxidoreductase,1
3,AO353_00020,AO353_00020,,delta-aminolevulinic acid dehydratase,1
4,AO353_00025,AO353_00025,,phenazine biosynthesis protein PhzF,1


## 3. BacDive — P. fluorescens Metabolite Utilization

In [5]:
bacdive_df = spark.sql(f"""
SELECT mu.compound_name, mu.chebi_id, mu.utilization,
       s.bacdive_id, s.strain_designation, s.type_strain
FROM kescience_bacdive.metabolite_utilization mu
JOIN kescience_bacdive.strain s ON mu.bacdive_id = s.bacdive_id
JOIN kescience_bacdive.taxonomy t ON t.bacdive_id = s.bacdive_id
WHERE t.species = '{BACDIVE_SPECIES}'
ORDER BY mu.compound_name
""").toPandas()

print(f"BacDive records: {len(bacdive_df)}")
print(f"Unique compounds: {bacdive_df['compound_name'].nunique()}")
print(f"Unique strains: {bacdive_df['bacdive_id'].nunique()}")

# Show all distinct utilization values — BacDive uses +, -, produced, +/-
print(f"\nUtilization value counts:")
print(bacdive_df['utilization'].value_counts().to_string())

# Per-strain consensus: collapse multiple records per strain-compound pair
# Rule: majority vote among explicit '+' and '-' records
# A tie is reported as ambiguous ('+/-')
def strain_consensus(group):
    """Compute one consensus utilization per strain per compound."""
    n_pos = (group['utilization'] == '+').sum()
    n_neg = (group['utilization'] == '-').sum()
    n_prod = (group['utilization'] == 'produced').sum()
    n_ambig = (group['utilization'] == '+/-').sum()
    
    # If any explicit '+' or '-' records exist, take the majority
    if n_pos + n_neg > 0:
        if n_pos > n_neg:
            return '+'
        elif n_neg > n_pos:
            return '-'
        else:
            return '+/-'  # tied
    elif n_prod > 0:
        return 'produced'
    elif n_ambig > 0:
        return '+/-'
    return group['utilization'].iloc[0]  # fallback

strain_level = bacdive_df.groupby(['compound_name', 'bacdive_id']).apply(
    strain_consensus, include_groups=False
).reset_index(name='strain_consensus')

n_raw = len(bacdive_df)
n_deduped = len(strain_level)
print(f"\nPer-strain deduplication: {n_raw} raw records → {n_deduped} strain-compound pairs")

# Example: D-glucose had 104 records from 51 strains
glucose = bacdive_df[bacdive_df['compound_name'] == 'D-glucose']
glucose_dedup = strain_level[strain_level['compound_name'] == 'D-glucose']
print(f"  D-glucose: {len(glucose)} records → {len(glucose_dedup)} strain consensus values")

# Summarize per compound using strain-level consensus
bacdive_summary = strain_level.groupby('compound_name').agg(
    n_strains=('bacdive_id', 'nunique'),
    n_positive=('strain_consensus', lambda x: (x == '+').sum()),
    n_negative=('strain_consensus', lambda x: (x == '-').sum()),
    n_produced=('strain_consensus', lambda x: (x == 'produced').sum()),
    n_ambiguous=('strain_consensus', lambda x: (x == '+/-').sum()),
    n_total=('strain_consensus', 'count')
).reset_index()

# pct_positive uses only strains whose consensus is '+' or '-' (excludes 'produced' and '+/-')
n_tested = bacdive_summary['n_positive'] + bacdive_summary['n_negative']
bacdive_summary['n_utilization_tested'] = n_tested
bacdive_summary['pct_positive'] = (bacdive_summary['n_positive'] / n_tested).round(3)
# NaN where n_tested == 0 (e.g., compound only has 'produced' entries)

bacdive_summary.to_csv(f'{DATA_DIR}/bacdive_utilization.tsv', sep='\t', index=False)
print(f"\nTop compounds by test frequency (per-strain consensus):")
bacdive_summary.sort_values('n_total', ascending=False).head(20)

BacDive records: 1262
Unique compounds: 83
Unique strains: 105

Utilization value counts:
utilization
-           633
+           513
produced    115
+/-           1



Per-strain deduplication: 1262 raw records → 1095 strain-compound pairs
  D-glucose: 104 records → 51 strain consensus values

Top compounds by test frequency (per-strain consensus):


Unnamed: 0,compound_name,n_strains,n_positive,n_negative,n_produced,n_ambiguous,n_total,n_utilization_tested,pct_positive
49,indole,53,0,1,52,0,53,1,0.0
36,esculin,52,2,50,0,0,52,52,0.038
64,nitrate,52,32,20,0,0,52,52,0.615
22,N-acetylglucosamine,51,33,16,0,2,51,49,0.673
16,L-arabinose,51,35,14,0,2,51,49,0.714
8,D-glucose,51,3,6,0,42,51,9,0.333
55,maltose,51,2,47,0,2,51,49,0.041
10,D-mannitol,51,43,6,0,2,51,49,0.878
11,D-mannose,50,45,3,0,2,50,48,0.938
77,tryptophan,50,0,50,0,0,50,50,0.0
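
The D-glucose row is worth a second look: 42 of 51 strains come out ambiguous, yet only one raw record carries `+/-`. Under the rule in `strain_consensus`, the rest must therefore be ties between `+` and `-` records for the same strain. A minimal sketch of the tie behaviour on toy data (the strain IDs and results below are invented, and `consensus` is a simplified restatement of the rule, not the notebook's function):

```python
import pandas as pd

# Toy records (hypothetical strain IDs and results, not BacDive data)
records = pd.DataFrame({
    "bacdive_id": [1, 1, 2, 2, 2, 3, 4],
    "compound_name": ["D-glucose"] * 7,
    "utilization": ["+", "-", "+", "+", "-", "produced", "-"],
})

def consensus(values: pd.Series) -> str:
    """Majority vote over '+' and '-'; a tie is reported as '+/-'."""
    n_pos = (values == "+").sum()
    n_neg = (values == "-").sum()
    if n_pos + n_neg > 0:
        if n_pos > n_neg:
            return "+"
        if n_neg > n_pos:
            return "-"
        return "+/-"  # equal numbers of '+' and '-' records
    if (values == "produced").any():
        return "produced"
    return "+/-"

toy_consensus = records.groupby("bacdive_id")["utilization"].agg(consensus)
print(toy_consensus.to_dict())
# {1: '+/-', 2: '+', 3: 'produced', 4: '-'}
```

Strain 1 has one `+` and one `-` record, so it is excluded from `pct_positive`. This is why D-glucose ends up with only 9 strains in `n_utilization_tested`.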


## 4. GapMind — Pathway Predictions for FW300-N2E3

In [6]:
# GapMind: take best score per genome-pathway pair (multiple rows per pair)
gapmind_df = spark.sql(f"""
WITH scored AS (
    SELECT pathway, genome_id, score_category,
           CASE score_category
               WHEN 'complete' THEN 5
               WHEN 'likely_complete' THEN 4
               WHEN 'steps_missing_low' THEN 3
               WHEN 'steps_missing_medium' THEN 2
               WHEN 'not_present' THEN 1
               ELSE 0
           END as score_value
    FROM kbase_ke_pangenome.gapmind_pathways
    WHERE clade_name = '{PANGENOME_CLADE}'
)
SELECT pathway, genome_id, MAX(score_value) as best_score
FROM scored
GROUP BY pathway, genome_id
ORDER BY pathway, genome_id
""").toPandas()

# Map numeric scores back to category names
score_map = {5: 'complete', 4: 'likely_complete', 3: 'steps_missing_low',
             2: 'steps_missing_medium', 1: 'not_present', 0: 'unknown'}
gapmind_df['best_category'] = gapmind_df['best_score'].map(score_map)

print(f"GapMind pathway-genome pairs: {len(gapmind_df)}")
print(f"Unique pathways: {gapmind_df['pathway'].nunique()}")
print(f"Genomes in clade: {gapmind_df['genome_id'].nunique()}")

# Debug: show sample genome_id values to understand format
sample_ids = sorted(gapmind_df['genome_id'].unique())[:5]
print(f"\nSample genome IDs: {sample_ids}")
print(f"Looking for: {PANGENOME_GENOME}")

# Try matching with and without RS_ prefix
gapmind_n2e3 = gapmind_df[gapmind_df['genome_id'] == PANGENOME_GENOME].copy()
if len(gapmind_n2e3) == 0:
    # Try without RS_ prefix
    alt_id = PANGENOME_GENOME.replace('RS_', '')
    gapmind_n2e3 = gapmind_df[gapmind_df['genome_id'] == alt_id].copy()
    if len(gapmind_n2e3) > 0:
        print(f"Found with alternate ID: {alt_id}")
if len(gapmind_n2e3) == 0:
    # Try partial match
    matches = gapmind_df[gapmind_df['genome_id'].str.contains('001307155', na=False)]
    if len(matches) > 0:
        matched_id = matches['genome_id'].iloc[0]
        gapmind_n2e3 = gapmind_df[gapmind_df['genome_id'] == matched_id].copy()
        print(f"Found with partial match: {matched_id}")

print(f"\nPathways for FW300-N2E3: {len(gapmind_n2e3)}")
if len(gapmind_n2e3) == 0:
    print("WARNING: FW300-N2E3 genome not in GapMind results, using clade-level data")
    # Fall back to clade-level summary (median score across genomes)
    gapmind_n2e3 = gapmind_df.groupby('pathway').agg(
        best_score=('best_score', lambda x: int(x.median())),
        n_genomes=('genome_id', 'nunique')
    ).reset_index()
    gapmind_n2e3['best_category'] = gapmind_n2e3['best_score'].map(score_map)

# Note: this writes predictions for every genome in the clade, not only FW300-N2E3
gapmind_df.to_csv(f'{DATA_DIR}/gapmind_pathways.tsv', sep='\t', index=False)

print("\nPathway completeness for FW300-N2E3:")
print(gapmind_n2e3['best_category'].value_counts().to_string())
gapmind_n2e3.head(20)

GapMind pathway-genome pairs: 3200
Unique pathways: 80
Genomes in clade: 40

Sample genome IDs: ['GCA_002883675.1', 'GCA_002883775.1', 'GCF_001307155.1', 'GCF_002883595.1', 'GCF_002883635.1']
Looking for: RS_GCF_001307155.1
Found with alternate ID: GCF_001307155.1

Pathways for FW300-N2E3: 80

Pathway completeness for FW300-N2E3:
best_category
complete                73
steps_missing_low        5
steps_missing_medium     1
likely_complete          1


Unnamed: 0,pathway,genome_id,best_score,best_category
2,2-oxoglutarate,GCF_001307155.1,5,complete
42,4-hydroxybenzoate,GCF_001307155.1,5,complete
82,D-alanine,GCF_001307155.1,5,complete
122,D-lactate,GCF_001307155.1,5,complete
162,D-serine,GCF_001307155.1,5,complete
202,L-lactate,GCF_001307155.1,5,complete
242,L-malate,GCF_001307155.1,5,complete
282,NAG,GCF_001307155.1,5,complete
322,acetate,GCF_001307155.1,5,complete
362,alanine,GCF_001307155.1,5,complete
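
The lookup above succeeded only on the second attempt, after dropping `RS_`. GTDB prepends `RS_` (RefSeq) or `GB_` (GenBank) to assembly accessions, while the sample IDs printed from the GapMind table are bare `GCF_`/`GCA_` accessions. A minimal sketch of normalizing the ID up front, assuming the GapMind table never stores the prefix (`strip_gtdb_prefix` is a hypothetical helper, not part of any BERDL library):

```python
def strip_gtdb_prefix(genome_id: str) -> str:
    """Drop a leading 'RS_' or 'GB_' marker if present; otherwise return the ID unchanged."""
    for prefix in ("RS_", "GB_"):
        if genome_id.startswith(prefix):
            return genome_id[len(prefix):]
    return genome_id

print(strip_gtdb_prefix("RS_GCF_001307155.1"))  # GCF_001307155.1
print(strip_gtdb_prefix("GCA_002883675.1"))     # GCA_002883675.1
```

Unlike `str.replace('RS_', '')`, this touches the prefix only at the start of the string. The partial-match fallback on `'001307155'` would still be useful as a last resort.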


## 5. Metabolite Name Harmonization

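The hardest part is mapping metabolite names across four databases that use different nomenclature.

**Strategy:**
1. Manual curation for known aliases (e.g., "Sodium D,L-Lactate" → "lactate"), checked first
2. Name match for the rest: normalized names against FB conditions, exact case-insensitive names against BacDive
3. Formula match as a fallback for unmatched compounds (planned; not implemented in the cells below)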
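
The tiers can be sketched as a single function. This is an illustration under assumptions, not the notebook's implementation: `match_compound`, the `targets` frame with `name`/`formula` columns, and the toy rows are all hypothetical, and the formula tier in particular is only one way such a fallback might look. Isomers share a formula, so the sketch accepts a formula hit only when exactly one target carries it.

```python
import pandas as pd

def match_compound(name, formula, targets, manual):
    """Return (target_name, method), or (None, None) when nothing matches."""
    # Tier 1: curated alias
    if name in manual:
        return manual[name], "manual"
    # Tier 2: exact, case-insensitive name
    hit = targets[targets["name"].str.lower() == name.lower()]
    if len(hit) > 0:
        return hit.iloc[0]["name"], "name"
    # Tier 3: formula, accepted only when it identifies a single target
    if isinstance(formula, str) and formula:
        hit = targets[targets["formula"] == formula]
        if len(hit) == 1:
            return hit.iloc[0]["name"], "formula"
    return None, None

# Toy target table (hypothetical; stands in for another database's compound list)
targets = pd.DataFrame({
    "name": ["L-Leucine", "L-Isoleucine", "Glycine", "Thymine"],
    "formula": ["C6H13NO2", "C6H13NO2", "C2H5NO2", "C5H6N2O2"],
})
manual = {"leucine": "L-Leucine"}

print(match_compound("leucine", "C6H13NO2", targets, manual))
print(match_compound("glycine", None, targets, manual))
print(match_compound("norleucine", "C6H13NO2", targets, manual))
print(match_compound("4-imidazoleacetic acid", "C5H6N2O2", targets, manual))
```

The last call returns `('Thymine', 'formula')`: 4-imidazoleacetic acid and thymine are different compounds with the same formula. A formula hit is therefore best recorded with its own match-quality flag and reviewed by hand, much like the `approximate` flag used for base→nucleoside pairs.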

In [7]:
import re

def normalize_compound_name(name):
    """Normalize compound names for cross-database matching."""
    if pd.isna(name):
        return ''
    s = name.lower().strip()
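    # Remove salt forms, stereochemistry prefixes, and common suffixes
    s = re.sub(r'\b(sodium|potassium|calcium|disodium|dibasic|monohydrate|hexahydrate|hydrochloride|dihydrate|monobasic|salt)\b', '', s)
    # The prefix pattern is anchored at the start of the string, so a name that began
    # with a removed salt word (e.g. 'sodium d-lactate') keeps its prefix here;
    # such names rely on the manual maps below
    s = re.sub(r'^(l-|d-|dl-|d,l-|d/l-)', '', s)
    s = re.sub(r'\b(acid|monopotassium)\b', '', s)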
    # Clean up whitespace and punctuation
    s = re.sub(r'[,;()]+', ' ', s)
    s = re.sub(r'\s+', ' ', s).strip()
    return s


# Manual crosswalk for known correspondences between WoM and FB condition names
MANUAL_WOM_TO_FB = {
    'lactate': ['Sodium D,L-Lactate', 'Sodium D-Lactate', 'Sodium L-Lactate'],
    'carnitine': ['Carnitine Hydrochloride'],
    'valine': ['L-Valine'],
    'alanine': ['L-Alanine', 'D-Alanine'],
    'arginine': ['L-Arginine'],
    'proline': ['L-Proline'],
    'phenylalanine': ['L-Phenylalanine'],
    'tryptophan': ['L-Tryptophan'],
    'trehalose': ['D-Trehalose dihydrate'],
    'Malate': ['L-Malic acid disodium salt monohydrate'],
    'glutamic acid': ['L-Glutamic acid monopotassium salt monohydrate'],
    'Adenine': ['Adenine hydrochloride hydrate'],
    'Adenosine': ['Adenosine'],
    'inosine': ['Inosine'],
    'aspartate': ['L-Aspartic Acid'],
    'glycine': ['Glycine'],
    'Guanine': ['Guanine'],
    'Cytosine': ['Cytidine'],  # Note: cytosine vs cytidine — close but not identical
    'Uracil': ['Uridine'],  # Note: uracil vs uridine — base vs nucleoside
    'thymine': ['Thymine'],
    'nicotinamide': ['Nicotinamide'],
    '4-aminobutanoate': ['4-aminobutanoate'],  # GABA
    'sarcosine': ['Sarcosine'],
    'trans-aconitate': ['trans-Aconitate'],
    '5-oxo-proline': ['5-oxo-proline'],
    'betaine': ['Betaine'],
}

# Manual crosswalk for WoM to BacDive
MANUAL_WOM_TO_BACDIVE = {
    'lysine': 'lysine',
    'valine': 'valine',
    'Malate': 'malate',
    'arginine': 'arginine',
    'trehalose': 'trehalose',
    'tryptophan': 'tryptophan',
    'glycine': 'glycine',
    'alanine': 'D-alanine',  # BacDive may test D-alanine separately
}

# Manual crosswalk for WoM/FB to GapMind pathway names
MANUAL_TO_GAPMIND = {
    'lactate': ['D-lactate', 'L-lactate'],
    'alanine': ['alanine', 'D-alanine'],
    'arginine': ['arginine', 'arg'],
    'proline': ['proline'],
    'phenylalanine': ['phenylalanine', 'phe'],
    'tryptophan': ['tryptophan', 'trp'],
    'trehalose': ['trehalose'],
    'malate': ['L-malate'],
    'glutamic acid': ['glutamate'],
    'valine': ['valine', 'val'],
    'carnitine': ['carnitine'],
    'aspartate': ['aspartate'],
    'glycine': ['glycine', 'gly'],
    'serine': ['serine', 'ser', 'D-serine'],
    'histidine': ['histidine', 'his'],
    'isoleucine': ['isoleucine', 'ile'],
    'leucine': ['leucine', 'leu'],
    'lysine': ['lysine', 'lys'],
    'asparagine': ['asparagine', 'asn'],
    'glutamine': ['glutamine', 'gln'],
    'citrulline': ['citrulline'],
    'ornithine': ['ornithine'],
    'acetate': ['acetate'],
    'citrate': ['citrate'],
    'fumarate': ['fumarate'],
    'succinate': ['succinate'],
    'pyruvate': ['pyruvate'],
    'glucose': ['glucose'],
    'fructose': ['fructose'],
    'ribose': ['ribose'],
    'mannose': ['mannose'],
    'gluconate': ['gluconate'],
    'glycerol': ['glycerol'],
    'ethanol': ['ethanol'],
    'inositol': ['myo-inositol'],
}

print(f"Manual WoM→FB mappings: {len(MANUAL_WOM_TO_FB)}")
print(f"Manual WoM→BacDive mappings: {len(MANUAL_WOM_TO_BACDIVE)}")
print(f"Manual metabolite→GapMind mappings: {len(MANUAL_TO_GAPMIND)}")

Manual WoM→FB mappings: 26
Manual WoM→BacDive mappings: 8
Manual metabolite→GapMind mappings: 35
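
It helps to see what the normalizer does to a few FB condition names, since that explains why the manual maps are needed. The block below is a trimmed restatement of `normalize_compound_name` with one optional tweak; the `strip_before_prefix` flag is hypothetical and is not used anywhere in this notebook.

```python
import re

SALT_WORDS = (r"\b(sodium|potassium|calcium|disodium|dibasic|monohydrate|"
              r"hexahydrate|hydrochloride|dihydrate|monobasic|salt)\b")
STEREO_PREFIX = r"^(l-|d-|dl-|d,l-|d/l-)"

def normalize(name: str, strip_before_prefix: bool = False) -> str:
    s = name.lower().strip()
    s = re.sub(SALT_WORDS, "", s)
    if strip_before_prefix:
        # Hypothetical tweak: remove the gap left by a deleted leading salt word
        s = s.strip()
    s = re.sub(STEREO_PREFIX, "", s)
    s = re.sub(r"\b(acid|monopotassium)\b", "", s)
    s = re.sub(r"[,;()]+", " ", s)
    return re.sub(r"\s+", " ", s).strip()

print(normalize("L-Valine"))                # valine
print(normalize("D-Trehalose dihydrate"))   # trehalose
print(normalize("L-Aspartic Acid"))         # aspartic
print(normalize("Sodium D-Lactate"))        # d-lactate
print(normalize("Sodium D-Lactate", strip_before_prefix=True))  # lactate
```

"L-Aspartic Acid" normalizes to `aspartic`, not `aspartate`, and "Sodium D-Lactate" keeps its `d-` because the anchored prefix pattern sees a leading space. Both cases are covered by `MANUAL_WOM_TO_FB`. Enabling the tweak in the real function could change match counts, so the notebook would need to be rerun.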


In [8]:
# Build the unified crosswalk table
# Start with WoM metabolites that are produced (E or I)
wom_produced = wom_df[wom_df['action'].isin(['E', 'I'])].copy()

# Approximate FB matches: base→nucleoside mappings that are related but not identical
APPROXIMATE_FB_MATCHES = {'Cytosine', 'Uracil'}  # Cytosine→Cytidine, Uracil→Uridine

crosswalk_rows = []
for _, row in wom_produced.iterrows():
    wom_name = row['compound_name']
    wom_norm = normalize_compound_name(wom_name)
    
    # FB match
    fb_matches = MANUAL_WOM_TO_FB.get(wom_name, [])
    if not fb_matches:
        # Try normalized matching against FB conditions
        for cond in fb_conditions['condition_1'].unique():
            if normalize_compound_name(cond) == wom_norm:
                fb_matches.append(cond)
    
    # Flag match quality: approximate for base→nucleoside mappings
    fb_match_quality = None
    if fb_matches:
        fb_match_quality = 'approximate' if wom_name in APPROXIMATE_FB_MATCHES else 'exact'
    
    # BacDive match
    bd_match = MANUAL_WOM_TO_BACDIVE.get(wom_name)
    if bd_match is None:
        # Try exact case-insensitive
        bd_candidates = bacdive_summary[
            bacdive_summary['compound_name'].str.lower() == wom_name.lower()
        ]
        if len(bd_candidates) > 0:
            bd_match = bd_candidates.iloc[0]['compound_name']
    
    bd_util = None
    bd_pct = None
    bd_n_tested = None
    bd_n_positive = None
    bd_n_negative = None
    bd_n_produced = None
    if bd_match:
        bd_row = bacdive_summary[bacdive_summary['compound_name'] == bd_match]
        if len(bd_row) > 0:
            r = bd_row.iloc[0]
            bd_n_tested = int(r['n_utilization_tested'])
            bd_n_positive = int(r['n_positive'])
            bd_n_negative = int(r['n_negative'])
            bd_n_produced = int(r['n_produced'])
            bd_pct = r['pct_positive']
            if bd_n_tested > 0:
                bd_util = '+' if bd_pct > 0.5 else '-'
            elif bd_n_produced > 0:
                bd_util = 'produced'  # only 'produced' entries, no +/- tests
            # else: no usable data
    
    # GapMind match
    gm_matches = MANUAL_TO_GAPMIND.get(wom_name.lower(), [])
    gm_best = None
    if gm_matches and len(gapmind_n2e3) > 0:
        for gm_path in gm_matches:
            gm_row = gapmind_n2e3[gapmind_n2e3['pathway'] == gm_path]
            if len(gm_row) > 0:
                score = gm_row.iloc[0].get('best_score', None)
                if score is not None:
                    gm_best = score_map.get(int(score), str(score))
                break
    
    crosswalk_rows.append({
        'wom_compound': wom_name,
        'wom_action': row['action'],
        'wom_formula': row['formula'],
        'fb_condition': '; '.join(fb_matches) if fb_matches else None,
        'fb_matched': len(fb_matches) > 0,
        'fb_match_quality': fb_match_quality,
        'bacdive_compound': bd_match,
        'bacdive_consensus': bd_util,
        'bacdive_pct_positive': bd_pct,
        'bacdive_n_tested': bd_n_tested,
        'bacdive_n_positive': bd_n_positive,
        'bacdive_n_negative': bd_n_negative,
        'bacdive_n_produced': bd_n_produced,
        'gapmind_pathway': '; '.join(gm_matches) if gm_matches else None,
        'gapmind_prediction': gm_best,
    })

crosswalk = pd.DataFrame(crosswalk_rows)

print(f"Crosswalk table: {len(crosswalk)} metabolites")
print(f"  Matched to FB: {crosswalk['fb_matched'].sum()}")
print(f"    Exact matches: {(crosswalk['fb_match_quality'] == 'exact').sum()}")
print(f"    Approximate matches: {(crosswalk['fb_match_quality'] == 'approximate').sum()}")
print(f"      (Cytosine→Cytidine, Uracil→Uridine: base→nucleoside, related but not identical)")
print(f"  Matched to BacDive: {crosswalk['bacdive_compound'].notna().sum()}")
print(f"  Matched to GapMind: {crosswalk['gapmind_prediction'].notna().sum()}")

# Show BacDive matches with full detail
bd_matched = crosswalk[crosswalk['bacdive_compound'].notna()][
    ['wom_compound', 'wom_action', 'bacdive_compound', 'bacdive_consensus',
     'bacdive_n_tested', 'bacdive_n_positive', 'bacdive_n_negative', 'bacdive_n_produced']
]
print(f"\nBacDive matches (detailed):")
print(bd_matched.to_string(index=False))

crosswalk.to_csv(f'{DATA_DIR}/metabolite_crosswalk.tsv', sep='\t', index=False)
crosswalk

Crosswalk table: 58 metabolites
  Matched to FB: 28
    Exact matches: 26
    Approximate matches: 2
      (Cytosine→Cytidine, Uracil→Uridine: base→nucleoside, related but not identical)
  Matched to BacDive: 8
  Matched to GapMind: 13

BacDive matches (detailed):
wom_compound wom_action bacdive_compound bacdive_consensus  bacdive_n_tested  bacdive_n_positive  bacdive_n_negative  bacdive_n_produced
      lysine          E           lysine                 -               3.0                 0.0                 3.0                 0.0
      valine          E           valine                 +               1.0                 1.0                 0.0                 0.0
      Malate          I           malate                 +              49.0                49.0                 0.0                 0.0
     alanine          I        D-alanine               NaN               NaN                 NaN                 NaN                 NaN
    arginine          I         arginine          

Unnamed: 0,wom_compound,wom_action,wom_formula,fb_condition,fb_matched,fb_match_quality,bacdive_compound,bacdive_consensus,bacdive_pct_positive,bacdive_n_tested,bacdive_n_positive,bacdive_n_negative,bacdive_n_produced,gapmind_pathway,gapmind_prediction
0,2-hydroxy-4-(methylthio)butyric acid,E,C5H10O3S,,False,,,,,,,,,,
1,3-hydroxybenzoate,E,C7H6O3,,False,,,,,,,,,,
2,4-acetamidobutanoate,E,C6H11NO3,,False,,,,,,,,,,
3,4-hydroxy-2-quinolinecarboxylic acid,E,C10H7NO3,,False,,,,,,,,,,
4,4-imidazoleacetic acid,E,C5H6N2O2,,False,,,,,,,,,,
5,5-aminopentanoate,E,C5H11NO2,,False,,,,,,,,,,
6,5-hydroxylysine,E,C6H14N2O3,,False,,,,,,,,,,
7,Cytosine,E,,Cytidine,True,approximate,,,,,,,,,
8,N-acetyl-alanine,E,C5H9NO3,,False,,,,,,,,,,
9,N-acetyl-glutamic acid,E,C7H11NO5,,False,,,,,,,,,,
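
One detail in the printed BacDive matches: `alanine` maps to `D-alanine`, but every BacDive column for that row is NaN, which means no `D-alanine` row exists in `bacdive_summary`. The row still counts toward "Matched to BacDive: 8" because the count tests only whether `bacdive_compound` is set. A minimal sketch of a check that reports manual-map targets missing from the available names (`missing_targets` is a hypothetical helper, and the name list below is a stand-in for `bacdive_summary['compound_name']`):

```python
def missing_targets(manual_map, available_names):
    """Return {source_name: [targets not found]} for a manual crosswalk."""
    available = set(available_names)
    missing = {}
    for source, targets in manual_map.items():
        # Accept either a single target string or a list of targets
        target_list = [targets] if isinstance(targets, str) else list(targets)
        absent = [t for t in target_list if t not in available]
        if absent:
            missing[source] = absent
    return missing

manual = {"alanine": "D-alanine", "lysine": "lysine"}
available = ["lysine", "valine", "malate"]  # stand-in for the BacDive compound names

print(missing_targets(manual, available))  # {'alanine': ['D-alanine']}
```

The same check works for `MANUAL_WOM_TO_FB` against `fb_conditions['condition_1']` and for `MANUAL_TO_GAPMIND` against the GapMind pathway names.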


In [9]:
# Also build crosswalk for FB conditions that are NOT in WoM
# (compounds tested in FB but not detected as produced by WoM)
fb_only = []
wom_names_lower = set(wom_produced['compound_name'].str.lower())
# Normalize the WoM names once, outside the loop
wom_names_norm = {normalize_compound_name(n) for n in wom_names_lower}
matched_fb_conditions = set()
for matches in MANUAL_WOM_TO_FB.values():
    matched_fb_conditions.update(matches)

for _, row in fb_conditions.iterrows():
    cond = row['condition_1']
    if cond not in matched_fb_conditions:
        norm = normalize_compound_name(cond)
        if norm not in wom_names_norm:
            # Check GapMind
            gm_matches = MANUAL_TO_GAPMIND.get(norm, [])
            gm_best = None
            if gm_matches and len(gapmind_n2e3) > 0:
                for gm_path in gm_matches:
                    gm_row = gapmind_n2e3[gapmind_n2e3['pathway'] == gm_path]
                    if len(gm_row) > 0:
                        score = gm_row.iloc[0].get('best_score', None)
                        if score is not None:
                            gm_best = score_map.get(int(score), str(score))
                        break
            
            fb_only.append({
                'fb_condition': cond,
                'fb_expGroup': row['expGroup'],
                'wom_action': 'not_detected',
                'gapmind_pathway': '; '.join(gm_matches) if gm_matches else None,
                'gapmind_prediction': gm_best,
            })

fb_only_df = pd.DataFrame(fb_only)
print(f"\nFB conditions NOT in WoM produced set: {len(fb_only_df)}")
fb_only_df


FB conditions NOT in WoM produced set: 45


Unnamed: 0,fb_condition,fb_expGroup,wom_action,gapmind_pathway,gapmind_prediction
0,Ammonium chloride,nitrogen source,not_detected,,
1,D-Fructose,carbon source,not_detected,fructose,complete
2,D-Gluconic Acid sodium salt,carbon source,not_detected,,
3,D-Glucosamine Hydrochloride,carbon source,not_detected,,
4,D-Glucosamine Hydrochloride,nitrogen source,not_detected,,
5,D-Glucose,carbon source,not_detected,glucose,complete
6,D-Mannose,carbon source,not_detected,mannose,likely_complete
7,D-Ribose,carbon source,not_detected,ribose,complete
8,D-Serine,carbon source,not_detected,serine; ser; D-serine,complete
9,D-Serine,nitrogen source,not_detected,serine; ser; D-serine,complete
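
"D-Gluconic Acid sodium salt" gets no GapMind pathway above even though `MANUAL_TO_GAPMIND` has a `gluconate` key. The normalizer drops "acid" and the salt words, leaving `gluconic`, which is not a key. One possible remedy is to rewrite the acid form as the anion before the lookup. This is a sketch only: `acid_to_anion` is a hypothetical helper, and the suffix rule is a naming heuristic that would need checking against the full condition list.

```python
import re

def acid_to_anion(name: str) -> str:
    """Rewrite a trailing '<stem>ic acid' (or bare '<stem>ic') as '<stem>ate'."""
    s = name.lower().strip()
    return re.sub(r"ic(\s+acid)?$", "ate", s)

print(acid_to_anion("gluconic"))       # gluconate
print(acid_to_anion("glutamic acid"))  # glutamate
print(acid_to_anion("fructose"))       # fructose
```

Applied to the normalized name, this would let `gluconic` reach the `gluconate` key. Names where the acid word is not at the end are left unchanged.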


## 6. Summary Statistics

In [10]:
print("=" * 60)
print("DATA EXTRACTION SUMMARY")
print("=" * 60)
print(f"\nStrain: Pseudomonas fluorescens FW300-N2E3")
print(f"GTDB: Pseudomonas_E fluorescens_E (RS_GCF_001307155.1)")
print(f"\n--- Web of Microbes ---")
print(f"  Total observations: {len(wom_df)}")
print(f"  Emerged (de novo): {(wom_df['action']=='E').sum()}")
print(f"  Increased: {(wom_df['action']=='I').sum()}")
print(f"  No change: {(wom_df['action']=='N').sum()}")
print(f"  Medium: R2A")
print(f"\n--- Fitness Browser ---")
print(f"  Carbon source experiments: {len(fb_exps[fb_exps['expGroup']=='carbon source'])}")
print(f"  Nitrogen source experiments: {len(fb_exps[fb_exps['expGroup']=='nitrogen source'])}")
print(f"  Unique C/N conditions: {fb_conditions['condition_1'].nunique()}")
print(f"  Protein-coding genes: {len(fb_genes)}")
print(f"\n--- BacDive ---")
print(f"  P. fluorescens strains tested: {bacdive_df['bacdive_id'].nunique()}")
print(f"  Compounds tested: {bacdive_summary['compound_name'].nunique()}")
print(f"\n--- GapMind ---")
print(f"  Pathways predicted: {gapmind_n2e3['pathway'].nunique() if len(gapmind_n2e3) > 0 else 'N/A'}")
print(f"  Genomes in clade: {gapmind_df['genome_id'].nunique()}")
print(f"\n--- Cross-Database Overlap ---")
print(f"  WoM metabolites matched to FB: {crosswalk['fb_matched'].sum()} / {len(crosswalk)}")
print(f"  WoM metabolites matched to BacDive: {crosswalk['bacdive_compound'].notna().sum()} / {len(crosswalk)}")
print(f"  WoM metabolites matched to GapMind: {crosswalk['gapmind_prediction'].notna().sum()} / {len(crosswalk)}")

DATA EXTRACTION SUMMARY

Strain: Pseudomonas fluorescens FW300-N2E3
GTDB: Pseudomonas_E fluorescens_E (RS_GCF_001307155.1)

--- Web of Microbes ---
  Total observations: 105
  Emerged (de novo): 27
  Increased: 31
  No change: 47
  Medium: R2A

--- Fitness Browser ---
  Carbon source experiments: 82
  Nitrogen source experiments: 38
  Unique C/N conditions: 62
  Protein-coding genes: 5766

--- BacDive ---
  P. fluorescens strains tested: 105
  Compounds tested: 83

--- GapMind ---
  Pathways predicted: 80
  Genomes in clade: 40

--- Cross-Database Overlap ---
  WoM metabolites matched to FB: 28 / 58
  WoM metabolites matched to BacDive: 8 / 58
  WoM metabolites matched to GapMind: 13 / 58


In [11]:
spark.stop()
print("Spark session closed. Data files saved to ../data/")

Spark session closed. Data files saved to ../data/
