# Filter Growth Matrix to Suitable Carbon Sources and Organisms

**Parent**: CDMSCI-193 - RBTnSeq Modeling Analysis

**Ticket**: CDMSCI-196 - Compile Carbon Sources List

## Objective

Create a filtered growth matrix containing only:
1. Carbon sources suitable for metabolic modeling (recommendation: 'use')
2. Organisms with at least one Growth or No Growth call for those carbon sources

## Approach

1. Load the carbon source evaluation results from notebook 05
2. Filter for compounds with `Recommendation == 'use'` only (direct usage)
3. Create filtered growth matrix with only these compounds
4. Remove organisms that have no data (all Unknown) for the filtered compounds
5. Compare statistics between full and filtered datasets
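
The steps above can be sketched end to end on toy data. This is a minimal illustration, not the notebook's actual inputs: the compound names, organism names, and the `'exclude'` label are hypothetical, and unknown cells are assumed to be stored as missing values.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the evaluation table and growth matrix (hypothetical data).
evaluation_df = pd.DataFrame({
    "Carbon_Source": ["D-Glucose", "Starch", "Acetate"],
    "Recommendation": ["use", "exclude", "use"],  # 'exclude' is an assumed label
})
full_matrix = pd.DataFrame(
    {
        "OrgA": ["Growth", "Growth", "No Growth"],
        "OrgB": [np.nan, "Growth", np.nan],  # NaN assumed to mean Unknown
    },
    index=["D-Glucose", "Starch", "Acetate"],
)

# Steps 2-3: keep only rows whose compound is recommended as 'use'.
use_names = evaluation_df.loc[
    evaluation_df["Recommendation"] == "use", "Carbon_Source"
]
filtered = full_matrix.loc[full_matrix.index.isin(use_names)]

# Step 4: drop organisms (columns) with no Growth / No Growth call left.
has_data = filtered.isin(["Growth", "No Growth"]).any(axis=0)
filtered = filtered.loc[:, has_data]

# Step 5: compare shapes before and after filtering.
print(full_matrix.shape, "->", filtered.shape)
```

In this toy case `OrgB` only has data for the excluded compound, so it drops out once the carbon source filter is applied.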

## Outputs

- `results/combined_growth_matrix_filtered.csv` - Filtered growth matrix (141 'use' compounds × organisms with data)

**Last updated**: 2025-10-15

## Setup

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

print("Imports successful")

Imports successful


## Configuration

In [None]:
# Input files
EVALUATION_FILE = Path('results/carbon_source_evaluation_gpt5_full_set_analsyis.csv')
FULL_MATRIX_FILE = Path('results/combined_growth_matrix.csv')

# Output file
OUTPUT_DIR = Path('results')
FILTERED_MATRIX_FILE = OUTPUT_DIR / 'combined_growth_matrix_filtered.csv'

print("Configuration set")
print(f"  Evaluation: {EVALUATION_FILE}")
print(f"  Full matrix: {FULL_MATRIX_FILE}")
print(f"  Output: {FILTERED_MATRIX_FILE}")

## Load Data

In [3]:
# Load evaluation results
print("Loading carbon source evaluation...")
evaluation_df = pd.read_csv(EVALUATION_FILE)
print(f"  Loaded {len(evaluation_df)} evaluated compounds")

# Load full growth matrix
print("\nLoading full growth matrix...")
full_matrix = pd.read_csv(FULL_MATRIX_FILE, index_col=0)
print(f"  Shape: {full_matrix.shape[0]} carbon sources × {full_matrix.shape[1]} organisms")
print(f"  Total cells: {full_matrix.shape[0] * full_matrix.shape[1]:,}")

Loading carbon source evaluation...


FileNotFoundError: [Errno 2] No such file or directory: 'results/carbon_source_evaluation_final.csv'
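
Every later cell depends on `evaluation_df` and `full_matrix`, so a missing input file surfaces again downstream as a chain of `NameError`s. A minimal sketch of an up-front check, assuming a hypothetical helper `require_files` that is not part of the original notebook:

```python
from pathlib import Path


def require_files(*paths):
    """Raise one FileNotFoundError naming every missing input (hypothetical helper)."""
    missing = [str(p) for p in paths if not Path(p).is_file()]
    if missing:
        raise FileNotFoundError("Missing input file(s): " + ", ".join(missing))


# Intended usage with the notebook's configured paths:
# require_files(EVALUATION_FILE, FULL_MATRIX_FILE)
```

Calling it right after the Configuration cell would stop the run at the first cell with the full list of missing paths, rather than partway through loading.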

## Exclude Sucrose (Experimental Issue)

**Issue**: Sucrose growth data is unreliable due to stock solution contamination (reported by experiment author, 2025-10-15).

**Action**: Explicitly remove Sucrose from the growth matrix before filtering.

In [4]:
# Remove Sucrose from growth matrix due to experimental issue
print("Checking for Sucrose in growth matrix...")
if "Sucrose" in full_matrix.index:
    print("  Found Sucrose - removing due to stock solution contamination issue")
    full_matrix = full_matrix.drop("Sucrose")
    print(f"  Updated matrix shape: {full_matrix.shape[0]} carbon sources × {full_matrix.shape[1]} organisms")
else:
    print("  Sucrose not found in matrix (already excluded)")

Checking for Sucrose in growth matrix...


NameError: name 'full_matrix' is not defined
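
If more compounds are later excluded for experimental reasons, the same removal could be driven from one table of names and reasons. A minimal sketch on toy data, assuming a hypothetical `EXCLUDED_COMPOUNDS` mapping that does not exist in the original notebook:

```python
import pandas as pd

# Hypothetical registry of compounds excluded for experimental reasons.
EXCLUDED_COMPOUNDS = {"Sucrose": "stock solution contamination"}

toy_matrix = pd.DataFrame(
    {"OrgA": ["Growth", "No Growth"]},
    index=["Sucrose", "D-Glucose"],
)

# errors="ignore" makes the drop a no-op for names that are already absent,
# so the cell can be re-run safely.
cleaned = toy_matrix.drop(index=list(EXCLUDED_COMPOUNDS), errors="ignore")
print(cleaned.index.tolist())
```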

## Filter for 'use' Recommendation Only

Keep only compounds that can be modeled **directly**, without modification.

In [5]:
# Filter evaluation for 'use' recommendation only
use_compounds = evaluation_df[evaluation_df['Recommendation'] == 'use']

print(f"Carbon sources with 'use' recommendation: {len(use_compounds)}")
print(f"\nBreakdown by recommendation:")
for rec, count in evaluation_df['Recommendation'].value_counts().items():
    pct = 100 * count / len(evaluation_df)
    print(f"  {rec:20s}: {count:3d} ({pct:5.1f}%)")

# Get list of carbon sources to keep
carbon_sources_to_keep = use_compounds['Carbon_Source'].tolist()
print(f"\nCarbon sources to keep in filtered matrix: {len(carbon_sources_to_keep)}")

NameError: name 'evaluation_df' is not defined
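
The comparison `== 'use'` is exact, so labels with stray whitespace or different capitalization would be dropped silently. Whether the evaluation file contains such variants is not known from this notebook. Assuming it might, a tolerant match could look like this (toy data with hypothetical labels):

```python
import pandas as pd

toy_eval = pd.DataFrame({
    "Carbon_Source": ["D-Glucose", "Acetate", "Starch"],
    "Recommendation": ["use", " Use ", "exclude"],  # hypothetical label variants
})

# Normalize whitespace and case before comparing against 'use'.
normalized = toy_eval["Recommendation"].str.strip().str.lower()
use_compounds = toy_eval[normalized == "use"]
print(use_compounds["Carbon_Source"].tolist())
```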

## Create Filtered Growth Matrix

In [6]:
# Filter growth matrix to keep only 'use' compounds
print("Creating filtered growth matrix...")

# Find which carbon sources exist in the growth matrix
# (handle potential mismatches between evaluation and matrix)
available_in_matrix = [cs for cs in carbon_sources_to_keep if cs in full_matrix.index]
missing_from_matrix = [cs for cs in carbon_sources_to_keep if cs not in full_matrix.index]

if missing_from_matrix:
    print(f"\n⚠ Warning: {len(missing_from_matrix)} compounds in evaluation not found in growth matrix:")
    for cs in missing_from_matrix[:10]:  # Show first 10
        print(f"    - {cs}")
    if len(missing_from_matrix) > 10:
        print(f"    ... and {len(missing_from_matrix) - 10} more")

# Create filtered matrix
filtered_matrix = full_matrix.loc[available_in_matrix].copy()

print(f"\nFiltered matrix shape: {filtered_matrix.shape[0]} carbon sources × {filtered_matrix.shape[1]} organisms")
print(f"Total cells: {filtered_matrix.shape[0] * filtered_matrix.shape[1]:,}")

Creating filtered growth matrix...


NameError: name 'carbon_sources_to_keep' is not defined
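
The cell above warns about 'use' compounds that are missing from the matrix. It does not report the reverse case: matrix rows that never appear in the evaluation are dropped without a message. A minimal sketch that checks both directions with `pandas.Index.difference`, using hypothetical names:

```python
import pandas as pd

# Hypothetical names; only the comparison pattern matters.
matrix_index = pd.Index(["D-Glucose", "Acetate", "Mystery sugar"])
evaluated = ["Acetate", "D-Glucose", "L-Lactate"]

# Evaluated compounds that have no row in the growth matrix.
missing_from_matrix = pd.Index(evaluated).difference(matrix_index)

# Matrix rows that were never evaluated and would be dropped silently.
not_evaluated = matrix_index.difference(evaluated)

print(missing_from_matrix.tolist(), not_evaluated.tolist())
```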

## Remove Organisms Without Data

In [7]:
print("Checking for organisms without any carbon source data...\n")

# For each organism, count how many non-unknown cells it has
organisms_with_data = []
organisms_without_data = []

for organism in filtered_matrix.columns:
    # Count cells that are Growth or No Growth (not empty/unknown)
    n_tested = ((filtered_matrix[organism] == 'Growth') | (filtered_matrix[organism] == 'No Growth')).sum()
    
    if n_tested > 0:
        organisms_with_data.append(organism)
    else:
        organisms_without_data.append(organism)

print(f"Organisms with data: {len(organisms_with_data)}")
print(f"Organisms without data: {len(organisms_without_data)}")

if organisms_without_data:
    print(f"\nOrganisms to remove (no data for filtered carbon sources):")
    for org in organisms_without_data:
        print(f"  - {org}")
    
    # Filter matrix to keep only organisms with data
    print(f"\nFiltering matrix to keep only organisms with data...")
    filtered_matrix = filtered_matrix[organisms_with_data].copy()
    print(f"  New shape: {filtered_matrix.shape[0]} carbon sources × {filtered_matrix.shape[1]} organisms")
else:
    print(f"\n✓ All organisms have data for at least one filtered carbon source")

Checking for organisms without any carbon source data...



NameError: name 'filtered_matrix' is not defined
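
The per-organism loop above can also be written without explicit iteration. A minimal sketch on toy data, assuming unknown cells are stored as missing values (the organism names are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "OrgA": ["Growth", np.nan],
        "OrgB": [np.nan, np.nan],
        "OrgC": ["No Growth", "Growth"],
    },
    index=["D-Glucose", "Acetate"],
)

# True where a cell holds an actual call, False for anything else.
tested = toy.isin(["Growth", "No Growth"])

# Number of tested carbon sources per organism (column-wise sum).
n_tested = tested.sum(axis=0)

# Keep only organisms with at least one call.
kept = toy.loc[:, n_tested > 0]
print(kept.columns.tolist())
```

Using `isin` with the two explicit labels means any other cell content (empty, `NaN`, or an unexpected string) counts as untested, matching the loop's behavior.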

In [8]:
print("=" * 80)
print("COMPARISON: FULL vs FILTERED DATASET")
print("=" * 80)

# Function to calculate stats
def calculate_stats(matrix, label):
    total_cells = matrix.shape[0] * matrix.shape[1]
    
    # Count values
    growth_cells = (matrix == 'Growth').sum().sum()
    no_growth_cells = (matrix == 'No Growth').sum().sum()
    unknown_cells = total_cells - growth_cells - no_growth_cells
    
    print(f"\n{label}:")
    print(f"  Carbon sources: {matrix.shape[0]}")
    print(f"  Organisms: {matrix.shape[1]}")
    print(f"  Total cells: {total_cells:,}")
    print(f"  Growth: {growth_cells:,} ({100*growth_cells/total_cells:.1f}%)")
    print(f"  No Growth: {no_growth_cells:,} ({100*no_growth_cells/total_cells:.1f}%)")
    print(f"  Unknown: {unknown_cells:,} ({100*unknown_cells/total_cells:.1f}%)")
    
    return {
        'carbon_sources': matrix.shape[0],
        'organisms': matrix.shape[1],
        'total': total_cells,
        'growth': growth_cells,
        'no_growth': no_growth_cells,
        'unknown': unknown_cells
    }

full_stats = calculate_stats(full_matrix, "FULL DATASET")
filtered_stats = calculate_stats(filtered_matrix, "FILTERED DATASET (use only + organisms with data)")

# Show reduction
print("\n" + "-" * 80)
print("REDUCTION:")
print("-" * 80)
print(f"  Carbon sources removed: {full_stats['carbon_sources'] - filtered_stats['carbon_sources']} "
      f"({100*(full_stats['carbon_sources'] - filtered_stats['carbon_sources'])/full_stats['carbon_sources']:.1f}%)")
print(f"  Organisms removed: {full_stats['organisms'] - filtered_stats['organisms']} "
      f"({100*(full_stats['organisms'] - filtered_stats['organisms'])/full_stats['organisms']:.1f}%)")
print(f"  Total cells removed: {full_stats['total'] - filtered_stats['total']:,} "
      f"({100*(full_stats['total'] - filtered_stats['total'])/full_stats['total']:.1f}%)")
print(f"  Growth calls removed: {full_stats['growth'] - filtered_stats['growth']:,}")
print(f"  No Growth calls removed: {full_stats['no_growth'] - filtered_stats['no_growth']:,}")
print(f"  Unknown cells removed: {full_stats['unknown'] - filtered_stats['unknown']:,}")

COMPARISON: FULL vs FILTERED DATASET


NameError: name 'full_matrix' is not defined

## Compare Statistics

In [9]:
print("=" * 80)
print("COMPARISON: FULL vs FILTERED DATASET")
print("=" * 80)

# Function to calculate stats
def calculate_stats(matrix, label):
    total_cells = matrix.shape[0] * matrix.shape[1]
    
    # Count values
    growth_cells = (matrix == 'Growth').sum().sum()
    no_growth_cells = (matrix == 'No Growth').sum().sum()
    unknown_cells = total_cells - growth_cells - no_growth_cells
    
    print(f"\n{label}:")
    print(f"  Carbon sources: {matrix.shape[0]}")
    print(f"  Organisms: {matrix.shape[1]}")
    print(f"  Total cells: {total_cells:,}")
    print(f"  Growth: {growth_cells:,} ({100*growth_cells/total_cells:.1f}%)")
    print(f"  No Growth: {no_growth_cells:,} ({100*no_growth_cells/total_cells:.1f}%)")
    print(f"  Unknown: {unknown_cells:,} ({100*unknown_cells/total_cells:.1f}%)")
    
    return {
        'carbon_sources': matrix.shape[0],
        'organisms': matrix.shape[1],
        'total': total_cells,
        'growth': growth_cells,
        'no_growth': no_growth_cells,
        'unknown': unknown_cells
    }

full_stats = calculate_stats(full_matrix, "FULL DATASET")
filtered_stats = calculate_stats(filtered_matrix, "FILTERED DATASET (use only)")

# Show reduction
print("\n" + "-" * 80)
print("REDUCTION:")
print("-" * 80)
print(f"  Carbon sources removed: {full_stats['carbon_sources'] - filtered_stats['carbon_sources']} "
      f"({100*(full_stats['carbon_sources'] - filtered_stats['carbon_sources'])/full_stats['carbon_sources']:.1f}%)")
print(f"  Total cells removed: {full_stats['total'] - filtered_stats['total']:,} "
      f"({100*(full_stats['total'] - filtered_stats['total'])/full_stats['total']:.1f}%)")
print(f"  Growth calls removed: {full_stats['growth'] - filtered_stats['growth']:,}")
print(f"  No Growth calls removed: {full_stats['no_growth'] - filtered_stats['no_growth']:,}")
print(f"  Unknown cells removed: {full_stats['unknown'] - filtered_stats['unknown']:,}")

COMPARISON: FULL vs FILTERED DATASET


NameError: name 'full_matrix' is not defined

## Summary

**Output Created**:
- `results/combined_growth_matrix_filtered.csv` - Filtered growth matrix with:
  - 141 carbon sources (recommendation: 'use' only)
  - Only organisms with at least some growth data for these carbon sources

**Filtering Applied**:
1. **Carbon Source Filtering**: Removed 67 carbon sources (32.2%)
   - Excludes: polymers, proprietary blends, atypical compounds
   - Keeps only: compounds suitable for direct metabolic modeling
2. **Organism Filtering**: Removed organisms with no data for filtered carbon sources
   - Ensures all organisms in the final matrix have at least some growth data
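
The carbon source figures in this summary can be cross-checked with simple arithmetic. The sketch below assumes the 32.2% is taken over kept plus removed carbon sources. The notebook's recorded outputs do not show the full matrix size, so the total of 208 is inferred rather than reported.

```python
kept = 141     # 'use' carbon sources, from the summary
removed = 67   # carbon sources removed, from the summary

# Inferred size of the matrix before carbon source filtering.
total = kept + removed
pct_removed = 100 * removed / total

print(total, f"{pct_removed:.1f}%")
```

Under that assumption the two counts and the percentage are mutually consistent.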

**Next Steps**:
1. Update interactive HTML viewer to support dataset selection (full vs filtered)
2. Use filtered matrix in CDMSCI-197 for ModelSEED mapping
3. Use filtered matrix in CDMSCI-199 for FBA validation

**Why Filter?**
- **Carbon Sources**: Focuses on compounds that can be directly modeled (no modifications needed)
- **Organisms**: Removes organisms without any experimental data for the modeling-suitable compounds
- **Result**: Clean, focused dataset optimized for metabolic modeling workflows

## Save Filtered Matrix

In [10]:
# Save filtered growth matrix
print("\nSaving filtered growth matrix...")
filtered_matrix.to_csv(FILTERED_MATRIX_FILE)
print(f"  Saved to: {FILTERED_MATRIX_FILE}")

print("\n" + "=" * 80)
print("COMPLETE")
print("=" * 80)
print(f"\nFiltered matrix ready for:")
print(f"  - CDMSCI-197: ModelSEED mapping (use only suitable compounds)")
print(f"  - CDMSCI-199: FBA validation (focus on modeling-suitable carbon sources)")
print(f"  - Interactive viewer: Compare full vs filtered datasets")


Saving filtered growth matrix...
  Saved to: results/combined_growth_matrix_filtered.csv

COMPLETE

Filtered matrix ready for:
  - CDMSCI-197: ModelSEED mapping (use only suitable compounds)
  - CDMSCI-199: FBA validation (focus on modeling-suitable carbon sources)
  - Interactive viewer: Compare full vs filtered datasets
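
Downstream notebooks will read this CSV back, so a quick round-trip check can confirm that labels and unknown cells survive saving. A minimal sketch on toy data written to a temporary directory, assuming unknown cells are stored as missing values (all names are hypothetical):

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"OrgA": ["Growth", np.nan], "OrgC": ["No Growth", "Growth"]},
    index=pd.Index(["D-Glucose", "Acetate"], name="Carbon_Source"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_filtered.csv"
    toy.to_csv(path)
    # index_col=0 restores carbon sources as the row index, as in the Load Data cell.
    reloaded = pd.read_csv(path, index_col=0)

# Compare cell by cell, treating missing values as an explicit 'Unknown' label.
same_cells = (
    toy.fillna("Unknown").to_numpy() == reloaded.fillna("Unknown").to_numpy()
).all()
print(bool(same_cells))
```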


## Exclusion Rationale

The 'use' filter excludes three categories of carbon source:
- Polymers (need monomer representation)
- Proprietary blends (use defined alternatives)
- Atypical compounds (not suitable for bacterial metabolism)