# Create Combined Growth Matrix (Integrated Data)

**Purpose**: Integrate growth calls from the 2018 paper and the Fitness Browser into one matrix, with the paper taking precedence wherever the two sources overlap

**Data Strategy (Tiered Approach)**:
- **Tier 1**: Supplementary Table S2 (2018 paper) - use for BOTH growth and no-growth calls
- **Tier 2**: Fitness Browser - use for growth calls ONLY (a call is made only where a carbon source experiment exists)
- **Important**: Do NOT treat missing Fitness Browser data as "No Growth"
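
The precedence rule above can be sketched as a small function. This is a minimal illustration on single values, not the notebook's implementation; `resolve_call` and its argument names are hypothetical.

```python
def resolve_call(tier1_call, has_fb_experiment):
    """Combine one Tier 1 call with Fitness Browser presence (hypothetical helper).

    tier1_call: 'Growth', 'No Growth', or '' when the paper has no call.
    has_fb_experiment: True if a Fitness Browser carbon source experiment exists.
    """
    if tier1_call in ("Growth", "No Growth"):
        return tier1_call  # Tier 1 always takes precedence
    if has_fb_experiment:
        return "Growth"  # Tier 2 can only contribute growth calls
    return ""  # missing data stays unknown, never 'No Growth'


print(resolve_call("No Growth", True))  # paper call wins over the database
print(resolve_call("", True))           # database fills an empty cell
print(repr(resolve_call("", False)))    # no data in either source
```

The first call corresponds to the Type 2 discrepancy described below, where the paper's "No Growth" is kept even though a Fitness Browser experiment exists.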

**Why This Strategy?**:
- 170 cases where Supplementary Table shows growth but Fitness Browser has no data
- All Fitness Browser experiments already pass quality filters (gMed >= 50, mad12 <= 0.5)
- Missing data could mean the condition was never tested, the experiment failed, or the result was not published
- Both sources come from the same lab (Morgan group), so they are treated as equally trustworthy
- The difference is scope: the paper reports both Growth AND No Growth; the database reports Growth only

**Discrepancies Found**:
- Type 1: 170 cases - Paper shows growth, FB has no data (experiments not in public database)
- Type 2: 3 cases - FB shows growth, paper shows no growth (Tier 1 takes precedence)

**Output File**:
- `results/combined_growth_matrix.csv` - Master growth matrix (57 organisms × 208 carbon sources)

**Values**:
- `Growth` - Confirmed growth (from either source)
- `No Growth` - Confirmed no growth (from Supplementary Table ONLY)
- (blank, stored as an empty string) - Unknown/not tested

**Last updated**: 2025-10-15

## Setup

In [1]:
import sqlite3
import pandas as pd
import numpy as np
from pathlib import Path
import re

print("Imports successful")

Imports successful


## Configuration

In [2]:
# Paths
DB_PATH = Path("../data/source/feba.db")
RESULTS_DIR = Path("results")

# Input files (created by previous notebooks)
SUPP_TABLE_FILE = RESULTS_DIR / "supplementary_table_s2_clean.csv"
ORGANISM_METADATA_FILE = RESULTS_DIR / "organism_metadata.csv"

# Output file
OUTPUT_FILE = RESULTS_DIR / "combined_growth_matrix.csv"

print("Configuration set")
print(f"  Database: {DB_PATH}")
print(f"  Supplementary table: {SUPP_TABLE_FILE}")
print(f"  Organism metadata: {ORGANISM_METADATA_FILE}")
print(f"  Output: {OUTPUT_FILE}")

Configuration set
  Database: ../data/source/feba.db
  Supplementary table: results/supplementary_table_s2_clean.csv
  Organism metadata: results/organism_metadata.csv
  Output: results/combined_growth_matrix.csv


## Load Tier 1 Data (2018 Paper)

In [3]:
print("Loading Tier 1 data (Supplementary Table S2 - 2018 paper)...")
supp_data = pd.read_csv(SUPP_TABLE_FILE, index_col=0)

print(f"\nLoaded Tier 1 data:")
print(f"  Shape: {supp_data.shape[0]} carbon sources × {supp_data.shape[1]} organisms")
print(f"  Growth calls: {(supp_data == 'Growth').sum().sum()}")
print(f"  No Growth calls: {(supp_data == 'No Growth').sum().sum()}")

print(f"\nOrganisms in supplementary table:")
for org in supp_data.columns:
    print(f"  - {org}")

Loading Tier 1 data (Supplementary Table S2 - 2018 paper)...

Loaded Tier 1 data:
  Shape: 94 carbon sources × 28 organisms
  Growth calls: 931
  No Growth calls: 1701

Organisms in supplementary table:
  - Acidovorax sp. GW101-3H11
  - Azospirillum brasilense sp. 245
  - Burkholderia phytofirmans PsJN
  - Caulobacter crescentus NA1000
  - Cupriavidus basilensis 4G11
  - Dechlorosoma suillum PS
  - Dinoroseobacter shibae DFL-12
  - Dyella japonica UNC79MFTsu3.2 IMG taxon ID 2556921674
  - Echinicola vietnamensis DSM 17526
  - Escherichia coli BW25113
  - Herbaspirillum seropedicae SmR1
  - Klebsiella michiganensis M5a1
  - Marinobacter adhaerens HP15
  - Pedobacter sp. GW460-11-11-14-LB5
  - Phaeobacter inhibens BS107
  - Pseudomonas fluorescens FW300-N1B4
  - Pseudomonas fluorescens FW300-N2C3
  - Pseudomonas fluorescens FW300-N2E2
  - Pseudomonas fluorescens FW300-N2E3
  - Pseudomonas fluorescens GW456-L13
  - Pseudomonas simiae WCS417
  - Pseudomonas stutzeri RCH2
  - Shewanella ama

## Load Organism Metadata

In [4]:
print("\nLoading organism metadata...")
metadata = pd.read_csv(ORGANISM_METADATA_FILE)

print(f"Loaded metadata for {len(metadata)} organisms")
print(f"\nCreating orgId to Species_Name mapping...")

# Create mapping dictionaries
orgid_to_species = dict(zip(metadata['orgId'], metadata['Species_Name']))
species_to_orgid = dict(zip(metadata['Species_Name'], metadata['orgId']))

print(f"Created mappings for {len(orgid_to_species)} organisms")


Loading organism metadata...
Loaded metadata for 57 organisms

Creating orgId to Species_Name mapping...
Created mappings for 57 organisms
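
Building a lookup with `dict(zip(...))` silently keeps only the last entry when a key repeats, so it can be worth checking that `Species_Name` is unique before relying on `species_to_orgid`. The following is a minimal sketch on made-up metadata (the organism IDs and names are placeholders, not values from the real file):

```python
import pandas as pd

# Toy metadata; the real table is loaded from ORGANISM_METADATA_FILE
metadata = pd.DataFrame({
    "orgId": ["orgA", "orgB", "orgC"],
    "Species_Name": ["Species one", "Species two", "Species two"],
})

# Rows whose Species_Name occurs more than once
repeated = metadata.loc[metadata["Species_Name"].duplicated(keep=False)]
print(repeated)

# The reverse lookup collapses repeated names onto the last orgId seen
species_to_orgid = dict(zip(metadata["Species_Name"], metadata["orgId"]))
print(len(species_to_orgid), "names for", len(metadata), "rows")
```

If the number of dictionary keys is smaller than the number of metadata rows, at least one name is shared by several organisms.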


## Load Tier 2 Data (Fitness Browser)

In [5]:
print("\nLoading Tier 2 data (Fitness Browser database)...")
conn = sqlite3.connect(str(DB_PATH))

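# Query carbon source experiments: one row per (organism, carbon source) pair.
# GROUP BY already makes each pair unique, so DISTINCT is unnecessary.
# Note: ALL experiments in feba.db already pass quality filters
query = """
SELECT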
    e.orgId,
    e.condition_1 as carbon_source,
    COUNT(DISTINCT e.expName) as n_experiments
FROM Experiment e
WHERE e.expGroup = 'carbon source'
  AND e.num > 0
GROUP BY e.orgId, e.condition_1
ORDER BY e.orgId, e.condition_1
"""

fb_data = pd.read_sql_query(query, conn)
conn.close()

print(f"\nLoaded Fitness Browser data:")
print(f"  Organism-carbon pairs: {len(fb_data):,}")
print(f"  Unique organisms: {fb_data['orgId'].nunique()}")
print(f"  Unique carbon sources: {fb_data['carbon_source'].nunique()}")

print(f"\nNote: All experiments already pass quality filters (gMed >= 50, mad12 <= 0.5)")


Loading Tier 2 data (Fitness Browser database)...

Loaded Fitness Browser data:
  Organism-carbon pairs: 1,089
  Unique organisms: 44
  Unique carbon sources: 198

Note: All experiments already pass quality filters (gMed >= 50, mad12 <= 0.5)
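
To see what the query returns without access to `feba.db`, here is a minimal sketch that runs the same grouping against an in-memory SQLite table. The table is reduced to the columns the query touches, and every row (organism IDs, experiment names, conditions) is made up for illustration.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Experiment "
    "(orgId TEXT, expName TEXT, expGroup TEXT, condition_1 TEXT, num INTEGER)"
)
conn.executemany(
    "INSERT INTO Experiment VALUES (?, ?, ?, ?, ?)",
    [
        ("orgA", "exp001", "carbon source", "D-Glucose", 1),
        ("orgA", "exp002", "carbon source", "D-Glucose", 2),   # replicate
        ("orgA", "exp003", "stress", "Cisplatin", 3),          # filtered out
        ("orgB", "exp001", "carbon source", "Sodium acetate", 1),
    ],
)

query = """
SELECT e.orgId,
       e.condition_1 AS carbon_source,
       COUNT(DISTINCT e.expName) AS n_experiments
FROM Experiment e
WHERE e.expGroup = 'carbon source'
  AND e.num > 0
GROUP BY e.orgId, e.condition_1
ORDER BY e.orgId, e.condition_1
"""
toy = pd.read_sql_query(query, conn)
conn.close()
print(toy)
```

Replicate experiments collapse into a single organism-carbon pair, which is why the notebook counts pairs rather than experiments.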


## Match Carbon Source Names

In [6]:
def normalize_carbon_name(name):
    """Normalize carbon source name for matching"""
    if pd.isna(name):
        return name
    
    name = str(name).lower().strip()
    
    # Remove common suffixes
    name = re.sub(r'\s+(monohydrate|dihydrate|trihydrate|pentahydrate|hexahydrate)', '', name)
    name = re.sub(r'\s+(salt|potassium salt|sodium salt|disodium salt|hydrochloride|hcl)', '', name)
    
    # Normalize stereochemistry prefixes; only the "(-)" form is collapsed, "(+)" is left as-is
    name = re.sub(r'd-\(-\)-', 'd-', name)
    name = re.sub(r'l-\(-\)-', 'l-', name)
    
    # Remove punctuation
    name = re.sub(r'[,\.]', '', name)
    
    # Normalize whitespace
    name = ' '.join(name.split())
    
    return name

# Create normalized versions for matching
supp_carbons_normalized = {normalize_carbon_name(c): c for c in supp_data.index}
fb_carbons_normalized = {normalize_carbon_name(c): c for c in fb_data['carbon_source'].unique()}

print(f"Normalized carbon source names for matching")
print(f"  Supplementary table: {len(supp_carbons_normalized)} unique names")
print(f"  Fitness Browser: {len(fb_carbons_normalized)} unique names")

Normalized carbon source names for matching
  Supplementary table: 94 unique names
  Fitness Browser: 198 unique names
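
The matrix framework in the next section is indexed by the raw names, so names that differ only in case or in a hydrate/salt suffix would end up as separate rows. One way to look for such near-duplicates, sketched here on made-up name lists with a compact copy of the normalizer, is to compare the two dictionaries by key:

```python
import re


def normalize_carbon_name(name):
    """Compact copy of the normalizer above so this sketch runs on its own."""
    name = str(name).lower().strip()
    name = re.sub(r'\s+(monohydrate|dihydrate|trihydrate|pentahydrate|hexahydrate)', '', name)
    name = re.sub(r'\s+(salt|potassium salt|sodium salt|disodium salt|hydrochloride|hcl)', '', name)
    name = re.sub(r'd-\(-\)-', 'd-', name)
    name = re.sub(r'l-\(-\)-', 'l-', name)
    name = re.sub(r'[,\.]', '', name)
    return ' '.join(name.split())


# Made-up lists standing in for supp_data.index and fb_data['carbon_source']
supp_names = ["D-Glucose", "Sodium acetate", "L-Malic acid"]
fb_names = ["D-Glucose", "Sodium Acetate", "D-Trehalose dihydrate"]

supp_norm = {normalize_carbon_name(c): c for c in supp_names}
fb_norm = {normalize_carbon_name(c): c for c in fb_names}

# Pairs that agree after normalization but differ as raw strings
near_matches = [
    (supp_norm[key], fb_norm[key])
    for key in sorted(supp_norm.keys() & fb_norm.keys())
    if supp_norm[key] != fb_norm[key]
]
for supp_name, fb_name in near_matches:
    print(f"{supp_name!r} vs {fb_name!r}")
```

Any pair this prints is a candidate for manual review before the two sources are merged.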


## Create Combined Matrix Framework

In [7]:
print("\nCreating combined matrix framework...")

# Get all organisms (57 listed in the organism metadata)
all_organisms = metadata['Species_Name'].tolist()

# Get all carbon sources (union of both sources)
all_carbon_sources = sorted(
    set(supp_data.index.tolist()) | 
    set(fb_data['carbon_source'].unique())
)

# Create the matrix pre-filled with empty strings ("" marks unknown/not tested)
combined_matrix = pd.DataFrame(
    '',
    index=all_carbon_sources,
    columns=all_organisms,
)

print(f"Created empty matrix:")
print(f"  Shape: {combined_matrix.shape[0]} carbon sources × {combined_matrix.shape[1]} organisms")
print(f"  Total cells: {combined_matrix.size:,}")


Creating combined matrix framework...
Created empty matrix:
  Shape: 208 carbon sources × 57 organisms
  Total cells: 11,856


## Fill Matrix: Tier 1 (Supplementary Table)

In [8]:
print("\nFilling matrix with Tier 1 data (2018 paper)...")

n_filled = 0
for carbon in supp_data.index:
    for organism in supp_data.columns:
        if organism in combined_matrix.columns:
            value = supp_data.loc[carbon, organism]
            if pd.notna(value) and value != '':
                combined_matrix.loc[carbon, organism] = value
                n_filled += 1

print(f"Filled {n_filled:,} cells from Supplementary Table S2")
print(f"  Growth: {(combined_matrix == 'Growth').sum().sum()}")
print(f"  No Growth: {(combined_matrix == 'No Growth').sum().sum()}")


Filling matrix with Tier 1 data (2018 paper)...
Filled 2,068 cells from Supplementary Table S2
  Growth: 737
  No Growth: 1331
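
The supplementary table holds 2,632 calls (931 + 1,701, which equals 94 × 28), but only 2,068 (94 × 22) reach the matrix. That is consistent with 6 of the 28 organism names having no exact match among the `Species_Name` columns, although the notebook does not report which ones. A minimal sketch for listing skipped columns, using toy frames with made-up names:

```python
import pandas as pd

# Toy stand-ins for supp_data and combined_matrix
supp_data = pd.DataFrame(
    {"Org one": ["Growth"], "Org two strain X": ["No Growth"]},
    index=["D-Glucose"],
)
combined_matrix = pd.DataFrame("", index=["D-Glucose"], columns=["Org one", "Org two"])

# Tier 1 columns absent from the matrix are skipped by the fill loop
unmatched = supp_data.columns.difference(combined_matrix.columns).tolist()
print(f"{len(unmatched)} Tier 1 organism(s) without a matching column: {unmatched}")
```

Running the same two lines against the real `supp_data` and `combined_matrix` would show which names need an explicit mapping.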


## Fill Matrix: Tier 2 (Fitness Browser - Growth Only)

In [9]:
print("\nFilling matrix with Tier 2 data (Fitness Browser growth calls)...")

n_added = 0
for _, row in fb_data.iterrows():
    org_species = orgid_to_species.get(row['orgId'])
    carbon = row['carbon_source']
    
    if org_species in combined_matrix.columns and carbon in combined_matrix.index:
        # Only add if cell is currently empty (Tier 1 takes precedence)
        if combined_matrix.loc[carbon, org_species] == '':
            combined_matrix.loc[carbon, org_species] = 'Growth'
            n_added += 1

print(f"Added {n_added:,} additional growth calls from Fitness Browser")
print(f"\nFinal matrix composition:")
print(f"  Growth: {(combined_matrix == 'Growth').sum().sum()}")
print(f"  No Growth: {(combined_matrix == 'No Growth').sum().sum()}")
print(f"  Unknown (blank): {(combined_matrix == '').sum().sum()}")


Filling matrix with Tier 2 data (Fitness Browser growth calls)...
Added 519 additional growth calls from Fitness Browser

Final matrix composition:
  Growth: 1256
  No Growth: 1331
  Unknown (blank): 9269
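
The reported counts can be cross-checked against the tiering rules. The sketch below uses only numbers printed by the cells above; the variable names are illustrative.

```python
# Counts reported by the Tier 1 and Tier 2 fill cells
tier1_growth, tier1_no_growth = 737, 1331
fb_added = 519
final_growth, final_no_growth, final_unknown = 1256, 1331, 9269
n_carbon_sources, n_organisms = 208, 57

# Tier 2 adds growth calls only, so 'No Growth' must not change
assert final_no_growth == tier1_no_growth
# Every added Fitness Browser call lands in a previously empty cell
assert final_growth == tier1_growth + fb_added
# Each cell holds exactly one of the three states
assert final_growth + final_no_growth + final_unknown == n_carbon_sources * n_organisms

print("Counts are consistent")
```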


## Save Combined Matrix

In [10]:
print(f"\nSaving combined growth matrix to: {OUTPUT_FILE}")
combined_matrix.to_csv(OUTPUT_FILE)
print(f"Saved {combined_matrix.shape[0]} carbon sources × {combined_matrix.shape[1]} organisms")


Saving combined growth matrix to: results/combined_growth_matrix.csv
Saved 208 carbon sources × 57 organisms
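
One caveat when reading the file back: by default `pd.read_csv` parses empty fields as `NaN`, so a later `== ''` comparison would count zero unknown cells. The sketch below round-trips a toy matrix (made-up values) through an in-memory buffer to show both behaviors; passing `keep_default_na=False` is one way to keep the blanks as empty strings.

```python
import io
import pandas as pd

matrix = pd.DataFrame(
    {"Org one": ["Growth", ""], "Org two": ["No Growth", "Growth"]},
    index=["D-Glucose", "L-Malic acid"],
)

buffer = io.StringIO()
matrix.to_csv(buffer)

buffer.seek(0)
default_read = pd.read_csv(buffer, index_col=0)  # blanks become NaN
buffer.seek(0)
as_saved = pd.read_csv(buffer, index_col=0, keep_default_na=False)  # blanks stay ''

print("NaN cells with default settings:", default_read.isna().sum().sum())
print("Empty-string cells with keep_default_na=False:", (as_saved == "").sum().sum())
```

Downstream notebooks can use either convention, as long as the check for unknown cells matches the way the file was read.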


## Summary Statistics

In [11]:
print("="*70)
print("COMBINED GROWTH MATRIX SUMMARY")
print("="*70)

print(f"\nData sources (tiered approach):")
print(f"  Tier 1: Supplementary Table S2 (2018 paper)")
print(f"    - Use for: BOTH growth and no-growth")
print(f"    - Coverage: 28 organisms × 94 carbon sources")
print(f"  Tier 2: Fitness Browser")
print(f"    - Use for: Growth calls ONLY")
print(f"    - Coverage: {fb_data['orgId'].nunique()} organisms × {fb_data['carbon_source'].nunique()} carbon sources")

print(f"\nMatrix dimensions:")
print(f"  Organisms: {len(combined_matrix.columns)}")
print(f"  Carbon sources: {len(combined_matrix.index)}")
print(f"  Total cells: {combined_matrix.size:,}")

n_growth = (combined_matrix == 'Growth').sum().sum()
n_no_growth = (combined_matrix == 'No Growth').sum().sum()
n_unknown = (combined_matrix == '').sum().sum()

print(f"\nData composition:")
print(f"  Growth: {n_growth:,} ({100*n_growth/combined_matrix.size:.1f}%)")
print(f"  No Growth: {n_no_growth:,} ({100*n_no_growth/combined_matrix.size:.1f}%)")
print(f"  Unknown: {n_unknown:,} ({100*n_unknown/combined_matrix.size:.1f}%)")

print(f"\nInterpretation:")
print(f"  'Growth' = Confirmed growth (high confidence)")
print(f"  'No Growth' = Confirmed no growth (conservative threshold)")
print(f"  '' (blank) = Unknown/not tested")

print(f"\nFiles created:")
print(f"  {OUTPUT_FILE}")

print(f"\nNext step:")
print(f"  Run 04-analyze-data-discrepancies.ipynb to find conflicts")

print("\n" + "="*70)

COMBINED GROWTH MATRIX SUMMARY

Data sources (tiered approach):
  Tier 1: Supplementary Table S2 (2018 paper)
    - Use for: BOTH growth and no-growth
    - Coverage: 28 organisms × 94 carbon sources
  Tier 2: Fitness Browser
    - Use for: Growth calls ONLY
    - Coverage: 44 organisms × 198 carbon sources

Matrix dimensions:
  Organisms: 57
  Carbon sources: 208
  Total cells: 11,856

Data composition:
  Growth: 1,256 (10.6%)
  No Growth: 1,331 (11.2%)
  Unknown: 9,269 (78.2%)

Interpretation:
  'Growth' = Confirmed growth (high confidence)
  'No Growth' = Confirmed no growth (conservative threshold)
  '' (blank) = Unknown/not tested

Files created:
  results/combined_growth_matrix.csv

Next step:
  Run 04-analyze-data-discrepancies.ipynb to find conflicts

