# NB02: WoM ↔ Fitness Browser Gene-Level Integration

For each WoM-produced metabolite that has a corresponding FB carbon/nitrogen source experiment,
identify genes with significant fitness effects and annotate them functionally.

**Key Question**: When FW300-N2E3 produces compound X on rich medium (WoM), do genes important
for growing on X as a sole carbon/nitrogen source (FB) reveal the underlying biosynthetic or
catabolic pathways?

**Inputs:**
- `data/metabolite_crosswalk.tsv` — WoM↔FB metabolite mapping from NB01
- `data/fb_experiments.tsv` — FB experiment metadata
- BERDL: `kescience_fitnessbrowser.genefitness` — per-gene fitness scores
- BERDL: `kescience_fitnessbrowser.seedannotation` — SEED functional annotations

**Outputs:**
- `data/wom_fb_gene_table.tsv` — genes with significant fitness for overlapping metabolites
- `data/wom_fb_summary.tsv` — per-metabolite summary of fitness hits
- `figures/fitness_hits_per_metabolite.png` — bar chart of gene counts per metabolite

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

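# get_spark_session() is not imported in this notebook; it is assumed to be
# provided by the BERDL notebook environment
spark = get_spark_session()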

DATA_DIR = '../data'
FIG_DIR = '../figures'
os.makedirs(FIG_DIR, exist_ok=True)

FB_ORG = 'pseudo3_N2E3'

# Load crosswalk from NB01
crosswalk = pd.read_csv(f'{DATA_DIR}/metabolite_crosswalk.tsv', sep='\t')
fb_exps = pd.read_csv(f'{DATA_DIR}/fb_experiments.tsv', sep='\t')

# Filter to WoM metabolites matched to FB
matched = crosswalk[crosswalk['fb_matched'] == True].copy()
print(f"WoM metabolites with FB matches: {len(matched)}")
print(matched[['wom_compound', 'wom_action', 'fb_condition']].to_string(index=False))

WoM metabolites with FB matches: 28
    wom_compound wom_action                                           fb_condition
        Cytosine          E                                               Cytidine
         betaine          E                                                Betaine
       carnitine          E                                Carnitine Hydrochloride
         lactate          E Sodium D,L-Lactate; Sodium D-Lactate; Sodium L-Lactate
          lysine          E                                               L-Lysine
       sarcosine          E                                              Sarcosine
         thymine          E                                                Thymine
 trans-aconitate          E                                        trans-Aconitate
        tyrosine          E                               L-tyrosine disodium salt
          valine          E                                               L-Valine
4-aminobutanoate          I                        

## 1. Extract Gene Fitness for Overlapping Metabolites

Pull per-gene fitness scores for all experiments matching the overlapping conditions.
A gene is considered a significant fitness hit if |fitness| > 1 and |t| > 4.
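
As a minimal sketch of the significance rule stated above, the following applies both cutoffs to a toy table. The values are made up for illustration, and the helper name `flag_significant` is hypothetical rather than part of this notebook.

```python
import numpy as np
import pandas as pd

# Toy fitness records (made-up values, not taken from the Fitness Browser)
toy = pd.DataFrame({
    "locusId": ["g1", "g2", "g3", "g4"],
    "fit": [-2.5, -1.4, 1.8, -0.6],
    "t": [-6.0, -3.0, 5.2, -9.0],
})

def flag_significant(df, fit_cutoff=1.0, t_cutoff=4.0):
    """Return rows where both |fit| and |t| exceed their cutoffs."""
    mask = (df["fit"].abs() > fit_cutoff) & (df["t"].abs() > t_cutoff)
    out = df[mask].copy()
    # Negative fitness means the mutant grows worse, so the gene is needed
    out["direction"] = np.where(out["fit"] > 0, "beneficial", "detrimental")
    return out

toy_sig = flag_significant(toy)
print(toy_sig)
```

Requiring both cutoffs drops `g2` (large effect, weak support) and `g4` (strong support, small effect), which is the point of combining an effect-size filter with a t-like statistic.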

In [2]:
# Build the list of FB condition names to query
all_fb_conditions = set()
for conds in matched['fb_condition']:
    for c in str(conds).split('; '):
        all_fb_conditions.add(c.strip())

print(f"FB conditions to query: {len(all_fb_conditions)}")
for c in sorted(all_fb_conditions):
    print(f"  {c}")

FB conditions to query: 31
  4-aminobutanoate
  5-oxo-proline
  Adenine hydrochloride hydrate
  Adenosine
  Betaine
  Carnitine Hydrochloride
  Cytidine
  D-Alanine
  D-Trehalose dihydrate
  Glycine
  Guanine
  Inosine
  L-Alanine
  L-Arginine
  L-Aspartic Acid
  L-Glutamic acid monopotassium salt monohydrate
  L-Lysine
  L-Malic acid disodium salt monohydrate
  L-Phenylalanine
  L-Proline
  L-Tryptophan
  L-Valine
  L-tyrosine disodium salt
  Nicotinamide
  Sarcosine
  Sodium D,L-Lactate
  Sodium D-Lactate
  Sodium L-Lactate
  Thymine
  Uridine
  trans-Aconitate


In [3]:
# Query gene fitness for all relevant experiments
# Use a single query with IN clause for efficiency
# CAST fit and t to double — they are stored as strings in BERDL
condition_list = "', '".join(all_fb_conditions)

fitness_df = spark.sql(f"""
SELECT gf.locusId, gf.expName,
       CAST(gf.fit AS DOUBLE) as fit,
       CAST(gf.t AS DOUBLE) as t,
       e.condition_1, e.expGroup,
       g.sysName, g.gene, g.desc
FROM kescience_fitnessbrowser.genefitness gf
JOIN kescience_fitnessbrowser.experiment e
    ON gf.orgId = e.orgId AND gf.expName = e.expName
JOIN kescience_fitnessbrowser.gene g
    ON gf.orgId = g.orgId AND gf.locusId = g.locusId
WHERE gf.orgId = '{FB_ORG}'
AND e.condition_1 IN ('{condition_list}')
AND e.expGroup IN ('carbon source', 'nitrogen source')
AND g.type = '1'
""").toPandas()

print(f"Total gene-experiment records: {len(fitness_df)}")
print(f"Unique experiments: {fitness_df['expName'].nunique()}")
print(f"Unique genes: {fitness_df['locusId'].nunique()}")
print(f"Unique conditions: {fitness_df['condition_1'].nunique()}")
print(f"\nColumn dtypes: fit={fitness_df['fit'].dtype}, t={fitness_df['t'].dtype}")

Total gene-experiment records: 221056
Unique experiments: 44
Unique genes: 5024
Unique conditions: 24

Column dtypes: fit=float64, t=float64


In [4]:
# Diagnose which queried conditions returned no data
conditions_with_data = set(fitness_df['condition_1'].unique())
conditions_queried = all_fb_conditions
missing_conditions = sorted(conditions_queried - conditions_with_data)

print(f"Conditions queried: {len(conditions_queried)}")
print(f"Conditions with data: {len(conditions_with_data)}")
print(f"Conditions with NO data: {len(missing_conditions)}")

if missing_conditions:
    print(f"\n--- Missing conditions diagnostic ---")
    # Check if these conditions exist in the experiment table at all
    for cond in missing_conditions:
        exp_match = fb_exps[fb_exps['condition_1'] == cond]
        if len(exp_match) == 0:
            print(f"  {cond}: NO experiments exist for {FB_ORG} with this condition name")
        else:
            n_exps = len(exp_match)
            groups = exp_match['expGroup'].unique()
            print(f"  {cond}: {n_exps} experiment(s) exist ({', '.join(groups)}) "
                  f"but the fitness query returned no rows for them")
    
    print(f"\n  Explanation: These {len(missing_conditions)} conditions were mapped in the")
    print(f"  WoM→FB crosswalk but have no experiments in the Fitness Browser for {FB_ORG}.")
    print(f"  This means these compounds were not tested as C/N sources for this organism,")
    print(f"  or the condition names in the crosswalk don't exactly match the FB experiment names.")

Conditions queried: 31
Conditions with data: 24
Conditions with NO data: 7

--- Missing conditions diagnostic ---
  4-aminobutanoate: NO experiments exist for pseudo3_N2E3 with this condition name
  5-oxo-proline: NO experiments exist for pseudo3_N2E3 with this condition name
  Betaine: NO experiments exist for pseudo3_N2E3 with this condition name
  Guanine: NO experiments exist for pseudo3_N2E3 with this condition name
  Nicotinamide: NO experiments exist for pseudo3_N2E3 with this condition name
  Sarcosine: NO experiments exist for pseudo3_N2E3 with this condition name
  trans-Aconitate: NO experiments exist for pseudo3_N2E3 with this condition name

  Explanation: These 7 conditions were mapped in the
  WoM→FB crosswalk but have no experiments in the Fitness Browser for pseudo3_N2E3.
  This means these compounds were not tested as C/N sources for this organism,
  or the condition names in the crosswalk don't exactly match the FB experiment names.
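
One way to probe the second possibility (a naming mismatch) is to compare normalized names and list close candidates. This is only a sketch: `normalize` and `suggest_matches` are hypothetical helpers, the example names are placeholders, and the 0.6 cutoff is an arbitrary starting point rather than a tuned value.

```python
import difflib

def normalize(name):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())

def suggest_matches(missing, available, cutoff=0.6):
    """Map each missing name to up to three similar names from `available`."""
    norm_to_orig = {normalize(a): a for a in available}
    suggestions = {}
    for m in missing:
        close = difflib.get_close_matches(
            normalize(m), list(norm_to_orig), n=3, cutoff=cutoff
        )
        suggestions[m] = [norm_to_orig[c] for c in close]
    return suggestions

# Placeholder names for illustration only
available_names = ["Sarcosine", "Glycine"]
print(suggest_matches(["  sarcosine", "xyz"], available_names))
```

In the notebook, `missing` would be `missing_conditions` and `available` the unique `condition_1` values in `fb_exps`; any suggested match would still need manual confirmation before editing the crosswalk.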


In [5]:
# Filter for significant fitness effects
sig = fitness_df[(fitness_df['fit'].abs() > 1) & (fitness_df['t'].abs() > 4)].copy()
sig['direction'] = np.where(sig['fit'] > 0, 'beneficial', 'detrimental')

print(f"Significant fitness hits (|fit|>1, |t|>4): {len(sig)}")
print(f"  Detrimental (gene important for growth): {(sig['direction']=='detrimental').sum()}")
print(f"  Beneficial (gene inhibits growth): {(sig['direction']=='beneficial').sum()}")
print(f"\nUnique genes with significant fitness: {sig['locusId'].nunique()}")
print(f"Conditions with hits: {sig['condition_1'].nunique()}")

Significant fitness hits (|fit|>1, |t|>4): 4764
  Detrimental (gene important for growth): 4438
  Beneficial (gene inhibits growth): 326

Unique genes with significant fitness: 601
Conditions with hits: 24


In [6]:
# Map FB conditions back to WoM compound names
fb_to_wom = {}
for _, row in matched.iterrows():
    for c in str(row['fb_condition']).split('; '):
        fb_to_wom[c.strip()] = row['wom_compound']

sig['wom_compound'] = sig['condition_1'].map(fb_to_wom)

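# Summary per metabolite
# Note: n_genes counts unique loci, while n_detrimental / n_beneficial count hit rows
# (one per gene-experiment pair), so they can exceed n_genes when replicates exist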
met_summary = sig.groupby(['wom_compound', 'condition_1', 'expGroup']).agg(
    n_genes=('locusId', 'nunique'),
    n_detrimental=('direction', lambda x: (x == 'detrimental').sum()),
    n_beneficial=('direction', lambda x: (x == 'beneficial').sum()),
    mean_fit_detrimental=('fit', lambda x: x[x < 0].mean() if (x < 0).any() else np.nan),
    min_fit=('fit', 'min'),
    max_fit=('fit', 'max'),
).reset_index().sort_values('n_genes', ascending=False)

print("Significant fitness genes per metabolite:")
print(met_summary.to_string(index=False))

Significant fitness genes per metabolite:
 wom_compound                                    condition_1        expGroup  n_genes  n_detrimental  n_beneficial  mean_fit_detrimental   min_fit   max_fit
    carnitine                        Carnitine Hydrochloride   carbon source      168            156            12             -2.669588 -4.880531  4.075280
       valine                                       L-Valine   carbon source      160            136            24             -2.455440 -4.091335  2.540146
      lactate                               Sodium D-Lactate   carbon source      153            241            21             -2.837218 -5.090104  3.310396
    trehalose                          D-Trehalose dihydrate   carbon source      152            231            31             -2.439370 -4.516183  2.677892
     arginine                                     L-Arginine nitrogen source      140            237             6             -2.369874 -4.506504  1.279048
    Adenosine   

## 2. SEED Functional Annotations

Annotate fitness-significant genes with SEED functional descriptions (`seed_desc`) to help identify pathway associations.
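
The annotation step can be sketched as a left join that keeps every fitness hit, including hits with no SEED entry. The tables and descriptions below are toy placeholders; only the column names `locusId` and `seed_desc` come from this notebook.

```python
import pandas as pd

# Toy hit table: g1 appears twice (two experiments), g3 has no annotation
hits = pd.DataFrame({
    "locusId": ["g1", "g1", "g2", "g3"],
    "fit": [-2.1, -3.0, -1.5, 1.2],
})
seed = pd.DataFrame({
    "locusId": ["g1", "g2"],
    "seed_desc": ["Toy enzyme A", "Toy transporter B"],
})

# how="left" keeps unannotated hits; validate raises an error if a locus
# has more than one annotation row, which would silently duplicate hits
annotated = hits.merge(seed, on="locusId", how="left", validate="many_to_one")

n_annotated = annotated.loc[annotated["seed_desc"].notna(), "locusId"].nunique()
n_total = annotated["locusId"].nunique()
print(f"Annotated genes: {n_annotated} / {n_total}")
```

The `validate` argument is an optional safeguard. If a gene can legitimately carry several annotations, it should be dropped and the resulting row duplication handled explicitly.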

In [7]:
# Get SEED annotations for significant genes
# Note: seedannotation table has columns: orgId, locusId, seed_desc (no seed_subsystem)
sig_loci = sig['locusId'].unique()

# Pull all SEED annotations for the organism and filter in pandas,
# which avoids building a huge IN clause
seed_df = spark.sql(f"""
SELECT locusId, seed_desc
FROM kescience_fitnessbrowser.seedannotation
WHERE orgId = '{FB_ORG}'
""").toPandas()

# Filter to just our significant genes
seed_df = seed_df[seed_df['locusId'].isin(sig_loci)].copy()

print(f"SEED annotations for significant genes: {len(seed_df)}")
print(f"Unique genes with SEED annotation: {seed_df['locusId'].nunique()} / {len(sig_loci)}")

# Top descriptions
if len(seed_df) > 0:
    print(f"\nTop SEED descriptions:")
    print(seed_df['seed_desc'].value_counts().head(20).to_string())

SEED annotations for significant genes: 565
Unique genes with SEED annotation: 565 / 601

Top SEED descriptions:
seed_desc
Cytochrome c oxidase subunit CcoN (EC 1.9.3.1)                                                                                                3
3-ketoacyl-CoA thiolase (EC 2.3.1.16) @ Acetyl-CoA acetyltransferase (EC 2.3.1.9)                                                             2
Cytochrome c oxidase subunit CcoP (EC 1.9.3.1)                                                                                                2
Transcriptional regulator, GntR family domain / Aspartate aminotransferase (EC 2.6.1.1)                                                       2
Phosphoserine aminotransferase (EC 2.6.1.52)                                                                                                  2
Sensory box histidine kinase/response regulator                                                                                               2
Phosphoglucon

In [8]:
# Merge SEED annotations into fitness hits
# A gene can have multiple SEED annotations; keep all
sig_annotated = sig.merge(seed_df, on='locusId', how='left')

# Save full gene table
sig_annotated.to_csv(f'{DATA_DIR}/wom_fb_gene_table.tsv', sep='\t', index=False)
met_summary.to_csv(f'{DATA_DIR}/wom_fb_summary.tsv', sep='\t', index=False)

print(f"Saved {len(sig_annotated)} annotated fitness hits")
print(f"\nSample rows:")
sig_annotated[['wom_compound', 'locusId', 'gene', 'desc', 'fit', 't',
               'direction', 'seed_desc']].sort_values('fit').head(20)

Saved 4764 annotated fitness hits

Sample rows:


index,wom_compound,locusId,gene,desc,fit,t,direction,seed_desc
4499,lactate,AO353_20695,,O-succinylhomoserine sulfhydrylase,-5.268548,-5.128211,detrimental,O-acetylhomoserine sulfhydrylase (EC 2.5.1.49)...
4523,tryptophan,AO353_20695,,O-succinylhomoserine sulfhydrylase,-5.136272,-7.020057,detrimental,O-acetylhomoserine sulfhydrylase (EC 2.5.1.49)...
1166,lactate,AO353_05705,,oxidoreductase,-5.090104,-8.49517,detrimental,"Predicted L-lactate dehydrogenase, Fe-S oxidor..."
1568,Uracil,AO353_07220,,anthranilate synthase,-5.05369,-6.015652,detrimental,"Anthranilate synthase, amidotransferase compon..."
232,lactate,AO353_02070,,prephenate dehydratase,-5.044139,-4.913809,detrimental,Chorismate mutase I (EC 5.4.99.5) / Prephenate...
4690,Uracil,AO353_26580,,dihydropyrimidine dehydrogenase,-5.043094,-10.207538,detrimental,Dihydropyrimidine dehydrogenase [NADP+] (EC 1....
3373,tyrosine,AO353_13070,,phosphoserine phosphatase,-5.036103,-7.69882,detrimental,Phosphoserine phosphatase (EC 3.1.3.3)
3349,lactate,AO353_13070,,phosphoserine phosphatase,-4.979465,-4.853628,detrimental,Phosphoserine phosphatase (EC 3.1.3.3)
178,alanine,AO353_01375,,phosphate acyltransferase,-4.959222,-8.904415,detrimental,Phosphate:acyl-ACP acyltransferase PlsX
4537,glutamic acid,AO353_20695,,O-succinylhomoserine sulfhydrylase,-4.949864,-6.759177,detrimental,O-acetylhomoserine sulfhydrylase (EC 2.5.1.49)...


## 3. Per-Metabolite Fitness Landscape

For each WoM-produced metabolite, show the genes with strongest fitness effects.
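
A compact alternative to looping over compounds is to sort once and take the first rows of each group. This is a sketch on toy data, and `top_negative` is a hypothetical helper, not a function defined in this notebook.

```python
import pandas as pd

# Toy significant hits (made-up values)
toy_hits = pd.DataFrame({
    "wom_compound": ["lactate", "lactate", "lactate", "valine", "valine"],
    "locusId": ["g1", "g2", "g3", "g1", "g4"],
    "fit": [-4.0, -1.2, -2.5, -3.1, -1.8],
})

def top_negative(df, n=2):
    """Return the n most negative fitness rows for each compound."""
    return (
        df.sort_values("fit")              # most negative first
          .groupby("wom_compound")
          .head(n)                         # first n rows per compound
          .sort_values(["wom_compound", "fit"])
          .reset_index(drop=True)
    )

top = top_negative(toy_hits)
print(top)
```

When replicate experiments exist, the same gene can fill several of the top slots for a compound. Collapsing to one row per gene first (for example, keeping its minimum fitness) would avoid that.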

In [9]:
# Show top fitness hits per metabolite (top 5 most detrimental genes each)
print("Top 5 most important genes per metabolite (most negative fitness):")
print("=" * 100)

for compound in sorted(sig['wom_compound'].dropna().unique()):
    subset = sig_annotated[sig_annotated['wom_compound'] == compound].copy()
    top_det = subset.nsmallest(5, 'fit')
    if len(top_det) == 0:
        continue
    wom_action = matched[matched['wom_compound'] == compound]['wom_action'].iloc[0]
    print(f"\n{compound} (WoM: {wom_action}) — {len(subset)} significant genes")
    print("-" * 80)
    for _, g in top_det.iterrows():
        gene_name = g['gene'] if pd.notna(g['gene']) else g['sysName']
        seed = g['seed_desc'] if pd.notna(g.get('seed_desc')) else 'no SEED'
        print(f"  {gene_name:15s}  fit={g['fit']:+.2f}  t={g['t']:+.1f}  {g['desc'][:50]:50s}  [{seed[:40]}]")

Top 5 most important genes per metabolite (most negative fitness):

Adenine (WoM: I) — 109 significant genes
--------------------------------------------------------------------------------
  AO353_01375      fit=-4.80  t=-9.2  phosphate acyltransferase                           [Phosphate:acyl-ACP acyltransferase PlsX]
  AO353_20695      fit=-4.74  t=-8.5  O-succinylhomoserine sulfhydrylase                  [O-acetylhomoserine sulfhydrylase (EC 2.5]
  AO353_20665      fit=-4.66  t=-10.8  N-(5'-phosphoribosyl)anthranilate isomerase         [Phosphoribosylanthranilate isomerase (EC]
  AO353_20635      fit=-4.58  t=-12.5  3-isopropylmalate dehydrogenase                     [3-isopropylmalate dehydrogenase (EC 1.1.]
  AO353_07220      fit=-4.46  t=-8.0  anthranilate synthase                               [Anthranilate synthase, amidotransferase ]

Adenosine (WoM: I) — 134 significant genes
--------------------------------------------------------------------------------
  AO353_20695      

## 4. Visualization

In [10]:
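# Stacked bar chart per metabolite: bars show detrimental and beneficial hit counts
# (gene-experiment rows from met_summary), ordered by the summed n_genes column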
plot_data = met_summary.groupby('wom_compound').agg(
    n_genes=('n_genes', 'sum'),
    n_detrimental=('n_detrimental', 'sum'),
    n_beneficial=('n_beneficial', 'sum'),
).reset_index().sort_values('n_genes', ascending=True)

fig, ax = plt.subplots(figsize=(10, max(6, len(plot_data) * 0.35)))

y_pos = range(len(plot_data))
ax.barh(y_pos, plot_data['n_detrimental'], color='#d62728', label='Detrimental (gene needed)', alpha=0.8)
ax.barh(y_pos, plot_data['n_beneficial'], left=plot_data['n_detrimental'],
        color='#2ca02c', label='Beneficial (gene inhibits)', alpha=0.8)

ax.set_yticks(y_pos)
ax.set_yticklabels(plot_data['wom_compound'])
ax.set_xlabel('Number of significant gene-experiment hits (|fit|>1, |t|>4)')
ax.set_title('Fitness Browser Gene Hits for WoM-Produced Metabolites\n(FW300-N2E3)')
ax.legend(loc='lower right')
ax.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.savefig(f'{FIG_DIR}/fitness_hits_per_metabolite.png', dpi=150, bbox_inches='tight')
plt.show()
print(f"Saved to {FIG_DIR}/fitness_hits_per_metabolite.png")

Saved to ../figures/fitness_hits_per_metabolite.png


In [11]:
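# Condition-level fitness summary table (no heatmap is drawn in this cell)
# Mean and median fitness across all genes serve as a rough indicator of growth quality
# Note: n_sig_genes here counts records with |fit| > 1 only (no |t| filter),
# pooled across replicate experiments for each condition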
avg_fit = fitness_df.groupby(['condition_1']).agg(
    mean_fit=('fit', 'mean'),
    median_fit=('fit', 'median'),
    n_sig_genes=('fit', lambda x: ((x.abs() > 1)).sum()),
    n_total_genes=('fit', 'count'),
).reset_index()

avg_fit['wom_compound'] = avg_fit['condition_1'].map(fb_to_wom)
avg_fit = avg_fit.dropna(subset=['wom_compound'])
avg_fit['pct_sig'] = (avg_fit['n_sig_genes'] / avg_fit['n_total_genes'] * 100).round(1)

print("Experiment-level fitness summary:")
print(avg_fit[['wom_compound', 'condition_1', 'mean_fit', 'median_fit', 
               'n_sig_genes', 'pct_sig']].sort_values('n_sig_genes', ascending=False).to_string(index=False))

Experiment-level fitness summary:
 wom_compound                                    condition_1  mean_fit  median_fit  n_sig_genes  pct_sig
     arginine                                     L-Arginine -0.104687   -0.018210          771      3.8
glutamic acid L-Glutamic acid monopotassium salt monohydrate -0.092822   -0.020009          607      3.0
    trehalose                          D-Trehalose dihydrate -0.109553   -0.018642          458      4.6
phenylalanine                                L-Phenylalanine -0.098087   -0.017988          433      4.3
   tryptophan                                   L-Tryptophan -0.102204   -0.012934          413      4.1
    carnitine                        Carnitine Hydrochloride -0.101921   -0.005891          390      3.9
      alanine                                      D-Alanine -0.115532   -0.020967          387      3.9
      glycine                                        Glycine -0.099377   -0.018285          363      3.6
       valine        

## 5. Production vs. Utilization Gene Overlap

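**Key analysis**: For each metabolite that FW300-N2E3 *produces* (WoM), which genes are important
for fitness when growing *on* that metabolite (FB)? If the same genes appear in both biosynthesis and
catabolism contexts, they may be bifunctional or central metabolic genes. Genes that are significant
across many metabolite conditions more likely reflect general growth requirements than a pathway
specific to any one metabolite.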
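
To separate broadly required genes from metabolite-specific candidates, one option is to count the distinct compounds each gene is significant for. The sketch below uses toy data, and the threshold of 3 simply mirrors the "3+" convention used in this section.

```python
import pandas as pd

# Toy significant hits (made-up values); g3 has two replicate hits on one compound
toy_sig = pd.DataFrame({
    "locusId": ["g1", "g1", "g1", "g2", "g3", "g3"],
    "wom_compound": ["lactate", "valine", "lysine", "lactate", "valine", "valine"],
    "fit": [-3.0, -2.8, -3.2, -4.1, -1.5, -1.7],
})

# Distinct compounds per gene; replicate hits on one compound count once
counts = toy_sig.groupby("locusId")["wom_compound"].nunique()

broad = counts[counts >= 3].index.tolist()      # hit across many compounds
specific = counts[counts == 1].index.tolist()   # hit for a single compound

print("Broad:", broad)
print("Specific:", specific)
```

Genes in `specific` are the more natural starting point for pathway hypotheses about a single metabolite, while those in `broad` would need a control (such as fitness in an unrelated defined-medium condition) before any metabolite-specific interpretation.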

In [12]:
# Find genes that are fitness-important across multiple metabolites
gene_metabolite_counts = sig.groupby('locusId').agg(
    n_metabolites=('wom_compound', 'nunique'),
    metabolites=('wom_compound', lambda x: ', '.join(sorted(x.dropna().unique()))),
    mean_fit=('fit', 'mean'),
).reset_index()

# Merge with gene descriptions
gene_info = sig[['locusId', 'sysName', 'gene', 'desc']].drop_duplicates('locusId')
gene_metabolite_counts = gene_metabolite_counts.merge(gene_info, on='locusId')

# Pleiotropic genes (important for 3+ metabolites)
pleiotropic = gene_metabolite_counts[gene_metabolite_counts['n_metabolites'] >= 3].sort_values(
    'n_metabolites', ascending=False
)

print(f"Genes with significant fitness in 3+ metabolite conditions: {len(pleiotropic)}")
if len(pleiotropic) > 0:
    print("\nTop pleiotropic genes:")
    for _, g in pleiotropic.head(20).iterrows():
        gene_name = g['gene'] if pd.notna(g['gene']) else g['sysName']
        print(f"  {gene_name:15s} ({g['n_metabolites']} metabolites, mean fit={g['mean_fit']:+.2f})")
        print(f"    {g['desc'][:70]}")
        print(f"    Metabolites: {g['metabolites']}")

Genes with significant fitness in 3+ metabolite conditions: 231

Top pleiotropic genes:
  AO353_08180     (21 metabolites, mean fit=-3.43)
    homoserine O-acetyltransferase
    Metabolites: Adenine, Adenosine, Cytosine, Malate, Uracil, alanine, arginine, aspartate, carnitine, glutamic acid, glycine, inosine, lactate, lysine, phenylalanine, proline, thymine, trehalose, tryptophan, tyrosine, valine
  AO353_08345     (21 metabolites, mean fit=-3.53)
    dihydroxy-acid dehydratase
    Metabolites: Adenine, Adenosine, Cytosine, Malate, Uracil, alanine, arginine, aspartate, carnitine, glutamic acid, glycine, inosine, lactate, lysine, phenylalanine, proline, thymine, trehalose, tryptophan, tyrosine, valine
  AO353_08015     (21 metabolites, mean fit=-2.71)
    5,10-methylenetetrahydrofolate reductase
    Metabolites: Adenine, Adenosine, Cytosine, Malate, Uracil, alanine, arginine, aspartate, carnitine, glutamic acid, glycine, inosine, lactate, lysine, phenylalanine, proline, thymine, trehalo

In [13]:
# Summary statistics
print("=" * 60)
print("NB02 SUMMARY: WoM ↔ FB Integration")
print("=" * 60)
print(f"\nWoM-produced metabolites with FB match: {len(matched)}")
print(f"Total gene-experiment records queried: {len(fitness_df)}")
print(f"Significant fitness hits: {len(sig)}")
print(f"Unique genes with significant fitness: {sig['locusId'].nunique()}")
print(f"Conditions with any significant hit: {sig['condition_1'].nunique()}")
if len(pleiotropic) > 0:
    print(f"Pleiotropic genes (3+ metabolites): {len(pleiotropic)}")
print(f"\nGenes with a SEED annotation: {seed_df['locusId'].nunique()} / {len(sig_loci)}")
print(f"\nFiles saved:")
print(f"  {DATA_DIR}/wom_fb_gene_table.tsv")
print(f"  {DATA_DIR}/wom_fb_summary.tsv")
print(f"  {FIG_DIR}/fitness_hits_per_metabolite.png")

NB02 SUMMARY: WoM ↔ FB Integration

WoM-produced metabolites with FB match: 28
Total gene-experiment records queried: 221056
Significant fitness hits: 4764
Unique genes with significant fitness: 601
Conditions with any significant hit: 24
Pleiotropic genes (3+ metabolites): 231

Genes with a SEED annotation: 565 / 601

Files saved:
  ../data/wom_fb_gene_table.tsv
  ../data/wom_fb_summary.tsv
  ../figures/fitness_hits_per_metabolite.png


In [14]:
spark.stop()
print("Spark session closed.")

Spark session closed.
