# Evaluate Carbon Sources for Metabolic Modeling

**Parent**: CDMSCI-193 - RBTnSeq Modeling Analysis

**Ticket**: CDMSCI-196 - Compile Carbon Sources List

## Objective

Evaluate which carbon sources from our dataset are suitable for genome-scale metabolic modeling.

## Motivation

Not all experimental carbon sources are suitable for metabolic modeling:
- **Polymers** need to be represented by monomers (e.g., Amylose → Glucose)
- **Proprietary blends** are undefined mixtures (e.g., commercial prebiotics)
- **Complex mixtures** can't be mapped to single metabolites (e.g., casamino acids)
- **Atypical compounds** may not be in metabolic databases (e.g., nucleosides)

## Two-Step Approach

**Step 1: Comprehensive Analysis (GPT-4o)**
- Fast evaluation of all 207 carbon sources (~3-4 minutes)
- Categorizes compounds into: simple_metabolite, polymer, proprietary, complex_mixture, unclear
- Recommendations: use, use_monomer, exclude, manual_review

**Step 2: Deep Dive (GPT-5)**
- Detailed reasoning for compounds flagged as "manual_review"
- Extended analysis with biochemical pathway knowledge
- Final recommendations for edge cases

## Outputs

1. `results/carbon_source_evaluation_gpt4o.csv` - Full GPT-4o analysis
2. `results/carbon_source_evaluation_gpt5.csv` - GPT-5 deep dive for manual review cases
3. `results/carbon_source_evaluation_final.csv` - Curated final list with recommendations

**Last updated**: 2025-10-15

## Setup

In [1]:
import json
import pandas as pd
import numpy as np
from pathlib import Path
import requests
import time

print("Imports successful")

Imports successful


## Configuration

In [10]:
# Input path
CARBON_SOURCES_FILE = Path('results/combined_growth_matrix.csv')

# Output paths
OUTPUT_DIR = Path('results')
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

GPT4O_OUTPUT = OUTPUT_DIR / 'carbon_source_evaluation_gpt4o.csv'
#GPT4O_OUTPUT = OUTPUT_DIR / 'carbon_source_evaluation_gpt5_full_set_analsyis.csv'

GPT5_OUTPUT = OUTPUT_DIR / 'carbon_source_evaluation_gpt5.csv'
FINAL_OUTPUT = OUTPUT_DIR / 'carbon_source_evaluation_final.csv'

# Argo proxy for LLM
ARGO_BASE_URL = 'http://localhost:8000/v1'
GPT4O_MODEL = 'gpt4o'  # Fast, accurate for classification
#GPT4O_MODEL = 'gpt5'  # Alternative: run the full Step 1 set through GPT-5 instead
GPT5_MODEL = 'gpt5'    # Slower, deep reasoning

print("Configuration set")
print(f"  Carbon sources: {CARBON_SOURCES_FILE}")
print(f"  GPT-4o output: {GPT4O_OUTPUT}")
print(f"  GPT-5 output: {GPT5_OUTPUT}")
print(f"  Final output: {FINAL_OUTPUT}")

Configuration set
  Carbon sources: results/combined_growth_matrix.csv
  GPT-4o output: results/carbon_source_evaluation_gpt5_full_set_analsyis.csv
  GPT-5 output: results/carbon_source_evaluation_gpt5.csv
  Final output: results/carbon_source_evaluation_final.csv


## Load Carbon Sources

In [11]:
print("Loading carbon sources from combined growth matrix...")
growth_matrix = pd.read_csv(CARBON_SOURCES_FILE, index_col=0)

# Filter out NaN values from index
carbon_sources = [cs for cs in growth_matrix.index.tolist() if pd.notna(cs)]

print(f"\nLoaded {len(carbon_sources)} carbon sources")
print(f"\nFirst 10 carbon sources:")
for i, cs in enumerate(carbon_sources[:10], 1):
    print(f"  {i:3d}. {cs}")

Loading carbon sources from combined growth matrix...

Loaded 207 carbon sources

First 10 carbon sources:
    1. (+)-Arabinogalactan
    2. 1,2-Propanediol
    3. 1,3-Butandiol
    4. 1,4-B-D-Galactobiose
    5. 1,4-Butanediol
    6. 1,5-Pentanediol
    7. 1-Pentanol
    8. 2'-Deoxycytidine
    9. 2'-Deoxyinosine
   10. 2-Deoxy-D-Ribose
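
The loading cell drops missing index values but does not strip whitespace or remove repeated names, so the same compound could be sent to the LLM more than once. The following is a minimal sketch of a clean-up pass; `clean_names` and the example list are hypothetical and not part of the notebook.

```python
import pandas as pd

def clean_names(raw_names):
    """Strip whitespace, drop missing or empty entries, and de-duplicate in order."""
    seen = set()
    cleaned = []
    for name in raw_names:
        if pd.isna(name):
            continue  # skip NaN / None index entries
        name = str(name).strip()
        if name and name not in seen:
            seen.add(name)
            cleaned.append(name)
    return cleaned

# Illustrative input only; real names come from growth_matrix.index
example = ["D-Glucose", " D-Glucose ", float("nan"), "Glycerol", ""]
print(clean_names(example))
```

If adopted, it could be applied as `carbon_sources = clean_names(growth_matrix.index.tolist())`.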


## Step 1: Comprehensive Analysis with GPT-4o

Each carbon source is classified with a single GPT-4o call that returns a structured JSON verdict.

In [12]:
def evaluate_carbon_source_gpt4o(compound_name):
    """Use GPT-4o to evaluate if carbon source is suitable for metabolic modeling"""
    
    prompt = f"""You are a metabolic modeling expert evaluating carbon sources for genome-scale metabolic models (GEMs).

Compound name: "{compound_name}"

Task: Evaluate if this compound is suitable as a carbon source in metabolic modeling.

Evaluation criteria:
1. Is this a defined chemical compound (not a complex mixture or proprietary blend)?
2. Can it be represented by a single metabolite in a metabolic model?
3. Is it a typical carbon source that bacteria can metabolize?
4. Can it be mapped to a biochemical database (e.g., ModelSEED, KEGG)?

Response format (JSON):
{{
  "suitable": true/false,
  "category": "simple_metabolite" | "polymer" | "complex_mixture" | "proprietary" | "unclear",
  "recommendation": "use" | "use_monomer" | "exclude" | "manual_review",
  "reasoning": "One-line explanation",
  "suggested_alternative": "Alternative compound if needed (or empty string)"
}}

Categories:
- simple_metabolite: Single defined compound (e.g., D-Glucose, Glycerol)
- polymer: Polysaccharide/polymer (suggest monomer, e.g., Amylose → Glucose)
- complex_mixture: Undefined mixture (e.g., "yeast extract")
- proprietary: Commercial product (e.g., "Actilight", "FiberGum")
- unclear: Cannot determine from name alone

Recommendations:
- use: Directly usable
- use_monomer: Use the monomer unit instead
- exclude: Not suitable for modeling
- manual_review: Needs expert evaluation
"""
    
    try:
        response = requests.post(
            f"{ARGO_BASE_URL}/chat/completions",
            json={
                "model": GPT4O_MODEL,
                "messages": [{"role": "user", "content": prompt}],
                "temperature": 0.1,
                "max_tokens": 300
            },
            timeout=30
        )
        
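        if response.status_code != 200:
            # Report HTTP failures instead of silently returning None
            print(f"    Error: HTTP {response.status_code} from {ARGO_BASE_URL}")
            return None

        result = response.json()
        content = result['choices'][0]['message']['content']

        # Strip a Markdown code fence if the model wrapped its JSON in one
        if '```json' in content:
            content = content.split('```json')[1].split('```')[0].strip()
        elif '```' in content:
            content = content.split('```')[1].split('```')[0].strip()

        return json.loads(content)
    except Exception as e:
        # Covers network errors, timeouts, and malformed JSON
        print(f"    Error: {e}")
        return None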

print("GPT-4o evaluation function defined")

GPT-4o evaluation function defined
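
The fence-stripping and `json.loads` steps appear in both evaluation functions, so they could be factored into one helper that is testable without calling the proxy. This is a sketch under that assumption; `parse_llm_json` and `FENCE` are hypothetical names, not part of the notebook.

```python
import json

# Three backticks, built programmatically so this example can sit inside a fenced block
FENCE = "`" * 3

def parse_llm_json(content):
    """Return the JSON object embedded in an LLM reply, or None if parsing fails."""
    text = content.strip()
    if FENCE in text:
        # Keep the text that follows the first fence (up to the next fence, if any)
        text = text.split(FENCE)[1]
        # Drop an optional "json" language tag left over from the opening fence
        if text.lstrip().lower().startswith("json"):
            text = text.lstrip()[4:]
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError:
        return None

print(parse_llm_json('{"suitable": true, "category": "polymer"}'))
```

Returning `None` on a parse failure matches how the evaluation functions already signal errors to the calling loop.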


In [13]:
print("=" * 80)
print("STEP 1: COMPREHENSIVE ANALYSIS WITH GPT-4o")
print("=" * 80)
print(f"\nEvaluating {len(carbon_sources)} carbon sources...")
print(f"(Expected time: ~{len(carbon_sources) * 0.5 / 60:.0f} minutes)\n")

evaluations_gpt4o = []

for i, carbon_source in enumerate(carbon_sources, 1):
    # Print progress for every compound
    print(f"  [{i}/{len(carbon_sources)}] {carbon_source}")
    
    eval_result = evaluate_carbon_source_gpt4o(carbon_source)
    
    if eval_result:
        evaluations_gpt4o.append({
            'Carbon_Source': carbon_source,
            'Suitable': eval_result.get('suitable', False),
            'Category': eval_result.get('category', 'unclear'),
            'Recommendation': eval_result.get('recommendation', 'manual_review'),
            'Reasoning': eval_result.get('reasoning', ''),
            'Suggested_Alternative': eval_result.get('suggested_alternative', '')
        })
    else:
        evaluations_gpt4o.append({
            'Carbon_Source': carbon_source,
            'Suitable': None,
            'Category': 'error',
            'Recommendation': 'manual_review',
            'Reasoning': 'GPT-4o evaluation failed',
            'Suggested_Alternative': ''
        })
    
    time.sleep(0.5)  # Rate limiting

# Save GPT-4o results
gpt4o_df = pd.DataFrame(evaluations_gpt4o)
gpt4o_df.to_csv(GPT4O_OUTPUT, index=False)

print(f"\n{'=' * 80}")
print("STEP 1 COMPLETE")
print(f"{'=' * 80}")

# Statistics (failed calls have Suitable=None and are reported separately)
failed = gpt4o_df['Suitable'].isna().sum()
suitable = gpt4o_df['Suitable'].eq(True).sum()
unsuitable = len(gpt4o_df) - suitable - failed
use_directly = (gpt4o_df['Recommendation'] == 'use').sum()
use_monomer = (gpt4o_df['Recommendation'] == 'use_monomer').sum()
exclude = (gpt4o_df['Recommendation'] == 'exclude').sum()
manual = (gpt4o_df['Recommendation'] == 'manual_review').sum()

print(f"\nSuitable: {suitable} ({100*suitable/len(gpt4o_df):.1f}%)")
print(f"Unsuitable: {unsuitable} ({100*unsuitable/len(gpt4o_df):.1f}%)")
print(f"Evaluation failed: {failed}")
print(f"\nRecommendations:")
print(f"  Use directly: {use_directly}")
print(f"  Use monomer: {use_monomer}")
print(f"  Exclude: {exclude}")
print(f"  Manual review: {manual}")
print(f"\nSaved to: {GPT4O_OUTPUT}")

STEP 1: COMPREHENSIVE ANALYSIS WITH GPT-4o

Evaluating 207 carbon sources...
(Expected time: ~2 minutes)

  [1/207] (+)-Arabinogalactan
  [2/207] 1,2-Propanediol
  [3/207] 1,3-Butandiol
  [4/207] 1,4-B-D-Galactobiose
  [5/207] 1,4-Butanediol
  [6/207] 1,5-Pentanediol
  [7/207] 1-Pentanol
  [8/207] 2'-Deoxycytidine
  [9/207] 2'-Deoxyinosine
  [10/207] 2-Deoxy-D-Ribose
  [11/207] 2-Deoxy-D-ribonic acid lithium salt
  [12/207] 2-Deoxyadenosine 5-monophosphate
  [13/207] 2-Deoxyadenosine monohydrate
  [14/207] 2-Piperidinone
  [15/207] 2-methyl-1-butanol
  [16/207] 3-Methyl-2-Oxobutanoic Acid
  [17/207] 3-methyl-1-butanol
  [18/207] 3-methyl-2-oxopentanoic acid
  [19/207] 3-methyl-3-butenol
  [20/207] 4-Aminobutyric acid
  [21/207] 4-Hydroxybenzoic Acid
  [22/207] 4-Hydroxyvalerate
  [23/207] 4-Methyl-2-oxovaleric acid
  [24/207] 5-Aminovaleric acid
  [25/207] 5-Keto-D-Gluconic Acid potassium salt
  [26/207] 6-O-Acetyl-D-glucose
  [27/207] Acetylated xylan
  [28/207] Actilight
  [29/207] A
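
The Step 1 loop supplies defaults for missing keys but accepts whatever label the model returns, so a misspelled category or recommendation would pass through to the CSV unchanged. The sketch below coerces out-of-vocabulary labels to the conservative defaults the loop already uses; `normalize_evaluation` is a hypothetical helper, and the allowed values are copied from the prompt.

```python
ALLOWED_CATEGORIES = {
    "simple_metabolite", "polymer", "complex_mixture", "proprietary", "unclear",
}
ALLOWED_RECOMMENDATIONS = {"use", "use_monomer", "exclude", "manual_review"}

def normalize_evaluation(evaluation):
    """Return a copy of the evaluation with unknown labels replaced by safe defaults."""
    category = str(evaluation.get("category", "")).strip().lower()
    recommendation = str(evaluation.get("recommendation", "")).strip().lower()
    return {
        **evaluation,
        # Unknown category -> "unclear"; unknown recommendation -> "manual_review"
        "category": category if category in ALLOWED_CATEGORIES else "unclear",
        "recommendation": (
            recommendation
            if recommendation in ALLOWED_RECOMMENDATIONS
            else "manual_review"
        ),
    }

print(normalize_evaluation({"category": "Polymer ", "recommendation": "use monomer"}))
```

Sending unknown recommendations to `manual_review` means they are picked up by Step 2 rather than silently dropped from the counts.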

## Review GPT-4o Results

Examine the category distribution and list the compounds flagged for manual review.

In [6]:
print("\n" + "=" * 80)
print("GPT-4o ANALYSIS SUMMARY")
print("=" * 80)

print("\n" + "-" * 80)
print("BREAKDOWN BY CATEGORY")
print("-" * 80)
for category, count in gpt4o_df['Category'].value_counts().items():
    pct = 100 * count / len(gpt4o_df)
    print(f"  {category:20s}: {count:3d} ({pct:5.1f}%)")

print("\n" + "-" * 80)
print("COMPOUNDS FLAGGED FOR MANUAL REVIEW")
print("-" * 80)

manual_review = gpt4o_df[gpt4o_df['Recommendation'] == 'manual_review']
print(f"\nTotal: {len(manual_review)} compounds\n")

for _, row in manual_review.iterrows():
    print(f"  - {row['Carbon_Source']}")
    print(f"    Category: {row['Category']}")
    print(f"    Reason: {row['Reasoning']}")
    print()


GPT-4o ANALYSIS SUMMARY

--------------------------------------------------------------------------------
BREAKDOWN BY CATEGORY
--------------------------------------------------------------------------------
  simple_metabolite   : 155 ( 74.9%)
  polymer             :  29 ( 14.0%)
  proprietary         :  18 (  8.7%)
  complex_mixture     :   3 (  1.4%)
  unclear             :   2 (  1.0%)

--------------------------------------------------------------------------------
COMPOUNDS FLAGGED FOR MANUAL REVIEW
--------------------------------------------------------------------------------

Total: 24 compounds

  - 2-Deoxy-D-ribonic acid lithium salt
    Category: simple_metabolite
    Reason: 2-Deoxy-D-ribonic acid is a defined compound but not a typical carbon source for bacteria.

  - Actilight
    Category: proprietary
    Reason: Actilight is a commercial product and likely a proprietary blend, not a single defined compound.

  - Avantafiber
    Category: proprietary
    Reason: Avan
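
The category breakdown and the manual-review list can also be viewed together as a cross-tabulation, which shows which categories the flagged compounds come from. The sketch below uses a small hypothetical frame in place of `gpt4o_df`; the rows are illustrative, not results from this analysis.

```python
import pandas as pd

# Hypothetical stand-in for gpt4o_df with the same two columns
demo = pd.DataFrame({
    "Category": ["simple_metabolite", "simple_metabolite", "polymer", "proprietary"],
    "Recommendation": ["use", "manual_review", "use_monomer", "manual_review"],
})

# Rows are categories, columns are recommendations; margins=True adds "All" totals
table = pd.crosstab(demo["Category"], demo["Recommendation"], margins=True)
print(table)
```

On the real data the call would be `pd.crosstab(gpt4o_df['Category'], gpt4o_df['Recommendation'], margins=True)`.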

## Step 2: Deep Dive with GPT-5

Each compound that GPT-4o flagged as "manual_review" is re-evaluated by GPT-5, with the GPT-4o reasoning passed in as context.

In [7]:
def evaluate_carbon_source_gpt5(compound_name, gpt4o_reasoning):
    """Use GPT-5 for deep analysis of edge cases"""
    
    prompt = f"""You are a senior metabolic modeling expert performing a detailed evaluation of a carbon source compound.

Compound name: "{compound_name}"

Initial assessment (GPT-4o): {gpt4o_reasoning}

Task: Provide a definitive recommendation on whether this compound can be used in genome-scale metabolic modeling.

Consider:
1. Biochemical databases: Is it in KEGG, ModelSEED, BiGG, MetaCyc?
2. Metabolic pathways: Which bacterial pathways could metabolize this?
3. Literature evidence: Is there experimental data on bacterial metabolism of this compound?
4. Modeling feasibility: Can it be represented as a single exchange reaction?
5. Alternatives: If unsuitable, what's the best proxy compound?

For proprietary/unclear compounds:
- Research the likely composition (e.g., commercial prebiotics are often FOS/inulin)
- Suggest concrete alternatives if available
- State confidence level in your recommendation

Response format (JSON):
{{
  "recommendation": "use" | "use_alternative" | "exclude",
  "confidence": "high" | "medium" | "low",
  "reasoning": "2-3 sentence detailed explanation with specific references",
  "suggested_compound": "Specific compound to use instead (or empty if 'use')",
  "database_ids": "Any known KEGG/ModelSEED IDs (or empty string)",
  "metabolic_pathway": "Known pathway if applicable (or empty string)"
}}
"""
    
    try:
        response = requests.post(
            f"{ARGO_BASE_URL}/chat/completions",
            json={
                "model": GPT5_MODEL,
                "messages": [{"role": "user", "content": prompt}],
                "temperature": 0.1,
                "max_tokens": 800
            },
            timeout=90  # GPT-5 needs more time for reasoning
        )
        
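        if response.status_code != 200:
            # Report HTTP failures instead of silently returning None
            print(f"    Error: HTTP {response.status_code} from {ARGO_BASE_URL}")
            return None

        result = response.json()
        content = result['choices'][0]['message']['content']

        # Strip a Markdown code fence if the model wrapped its JSON in one
        if '```json' in content:
            content = content.split('```json')[1].split('```')[0].strip()
        elif '```' in content:
            content = content.split('```')[1].split('```')[0].strip()

        return json.loads(content)
    except Exception as e:
        # Covers network errors, timeouts, and malformed JSON
        print(f"    Error: {e}")
        return None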

print("GPT-5 deep dive function defined")

GPT-5 deep dive function defined
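
The `database_ids` field is free text produced by the model, so any identifiers in it should be treated as candidates to verify, not as confirmed mappings. As a first filter, the sketch below extracts substrings that match the usual identifier shapes (KEGG compound IDs as `C` plus five digits, ModelSEED compound IDs as `cpd` plus five digits). `extract_candidate_ids` is a hypothetical helper, and a format match does not show that the ID exists or refers to the right compound.

```python
import re

KEGG_COMPOUND = re.compile(r"\bC\d{5}\b")
MODELSEED_COMPOUND = re.compile(r"\bcpd\d{5}\b")

def extract_candidate_ids(database_ids):
    """Pull substrings shaped like KEGG or ModelSEED compound IDs out of free text."""
    # str() also tolerates the model returning a list or None instead of a string
    text = str(database_ids or "")
    return {
        "kegg": KEGG_COMPOUND.findall(text),
        "modelseed": MODELSEED_COMPOUND.findall(text),
    }

# Illustrative string only; every extracted ID still needs a database lookup
print(extract_candidate_ids("KEGG: C00031; ModelSEED: cpd00027"))
```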


In [8]:
if len(manual_review) > 0:
    print("\n" + "=" * 80)
    print("STEP 2: DEEP DIVE WITH GPT-5")
    print("=" * 80)
    print(f"\nAnalyzing {len(manual_review)} compounds with extended reasoning...")
    print(f"(Expected time: ~{len(manual_review) * 1.5 / 60:.0f} minutes)\n")
    
    evaluations_gpt5 = []
    
    for i, (_, row) in enumerate(manual_review.iterrows(), 1):
        carbon_source = row['Carbon_Source']
        gpt4o_reasoning = row['Reasoning']
        
        print(f"  [{i}/{len(manual_review)}] {carbon_source}")
        
        gpt5_result = evaluate_carbon_source_gpt5(carbon_source, gpt4o_reasoning)
        
        if gpt5_result:
            print(f"    → Recommendation: {gpt5_result.get('recommendation', 'unknown')}")
            print(f"    → Confidence: {gpt5_result.get('confidence', 'unknown')}")
            
            evaluations_gpt5.append({
                'Carbon_Source': carbon_source,
                'GPT4o_Reasoning': gpt4o_reasoning,
                'GPT5_Recommendation': gpt5_result.get('recommendation', 'exclude'),
                'GPT5_Confidence': gpt5_result.get('confidence', 'low'),
                'GPT5_Reasoning': gpt5_result.get('reasoning', ''),
                'Suggested_Compound': gpt5_result.get('suggested_compound', ''),
                'Database_IDs': gpt5_result.get('database_ids', ''),
                'Metabolic_Pathway': gpt5_result.get('metabolic_pathway', '')
            })
        else:
            print(f"    → GPT-5 request failed")
            evaluations_gpt5.append({
                'Carbon_Source': carbon_source,
                'GPT4o_Reasoning': gpt4o_reasoning,
                'GPT5_Recommendation': 'exclude',
                'GPT5_Confidence': 'low',
                'GPT5_Reasoning': 'GPT-5 evaluation failed',
                'Suggested_Compound': '',
                'Database_IDs': '',
                'Metabolic_Pathway': ''
            })
        
        time.sleep(1)  # Rate limiting for GPT-5
    
    # Save GPT-5 results
    gpt5_df = pd.DataFrame(evaluations_gpt5)
    gpt5_df.to_csv(GPT5_OUTPUT, index=False)
    
    print(f"\n{'=' * 80}")
    print("STEP 2 COMPLETE")
    print(f"{'=' * 80}")
    print(f"\nSaved to: {GPT5_OUTPUT}")
    
    # Statistics
    use = (gpt5_df['GPT5_Recommendation'] == 'use').sum()
    use_alt = (gpt5_df['GPT5_Recommendation'] == 'use_alternative').sum()
    exclude = (gpt5_df['GPT5_Recommendation'] == 'exclude').sum()
    
    print(f"\nGPT-5 Recommendations:")
    print(f"  Use as-is: {use}")
    print(f"  Use alternative: {use_alt}")
    print(f"  Exclude: {exclude}")
else:
    print("\n✓ No manual review cases - skipping GPT-5 deep dive")
    gpt5_df = pd.DataFrame()


STEP 2: DEEP DIVE WITH GPT-5

Analyzing 24 compounds with extended reasoning...
(Expected time: ~1 minutes)

  [1/24] 2-Deoxy-D-ribonic acid lithium salt
    → Recommendation: use_alternative
    → Confidence: medium
  [2/24] Actilight
    → Recommendation: use_alternative
    → Confidence: medium
  [3/24] Avantafiber
    → Recommendation: use_alternative
    → Confidence: low
  [4/24] Bimuno-prebiotic
    → Recommendation: use_alternative
    → Confidence: high
  [5/24] Bioecolians-prebiotic
    → Recommendation: use_alternative
    → Confidence: medium
  [6/24] Carnitine Hydrochloride
    → Recommendation: use_alternative
    → Confidence: high
  [7/24] CravingZGone-prebiotic
    → Recommendation: use_alternative
    → Confidence: medium
  [8/24] D-Leucrose
    → Recommendation: use_alternative
    → Confidence: medium
  [9/24] Fibersol-2-AG-fiber
    → Recommendation: use_alternative
    → Confidence: high
  [10/24] Glucuronamide
    → Recommendation: use_alternative
    → Confiden
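
In the Step 2 loop a failed GPT-5 request is recorded as `exclude` with `low` confidence, so in the final list it looks the same as a genuine exclusion. The sketch below separates out rows that may deserve a human look before they are accepted; the frame is a hypothetical stand-in for `gpt5_df`, and the rows are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for gpt5_df (column names match the notebook)
demo_gpt5 = pd.DataFrame({
    "Carbon_Source": ["compound_a", "compound_b", "compound_c"],
    "GPT5_Recommendation": ["use_alternative", "use_alternative", "exclude"],
    "GPT5_Confidence": ["high", "low", "low"],
    "GPT5_Reasoning": ["...", "...", "GPT-5 evaluation failed"],
})

# Flag low-confidence verdicts and requests that failed outright
needs_human = demo_gpt5[
    (demo_gpt5["GPT5_Confidence"] == "low")
    | (demo_gpt5["GPT5_Reasoning"] == "GPT-5 evaluation failed")
]
print(needs_human["Carbon_Source"].tolist())
```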

## Create Final Curated List

Combine GPT-4o and GPT-5 results into a single actionable dataset.

In [9]:
print("\n" + "=" * 80)
print("CREATING FINAL CURATED LIST")
print("=" * 80)

# Start with GPT-4o results
final_df = gpt4o_df.copy()

# Merge GPT-5 recommendations for manual review cases
if len(gpt5_df) > 0:
    # Create a mapping of GPT-5 decisions
    gpt5_decisions = gpt5_df.set_index('Carbon_Source').to_dict('index')
    
    # Update final recommendations based on GPT-5
    for idx, row in final_df.iterrows():
        if row['Recommendation'] == 'manual_review':
            carbon_source = row['Carbon_Source']
            if carbon_source in gpt5_decisions:
                gpt5_rec = gpt5_decisions[carbon_source]['GPT5_Recommendation']
                gpt5_conf = gpt5_decisions[carbon_source]['GPT5_Confidence']
                gpt5_reasoning = gpt5_decisions[carbon_source]['GPT5_Reasoning']
                suggested = gpt5_decisions[carbon_source]['Suggested_Compound']
                
                # Map GPT-5 recommendation to our schema
                if gpt5_rec == 'use':
                    final_df.at[idx, 'Recommendation'] = 'use'
                    final_df.at[idx, 'Suitable'] = True
                elif gpt5_rec == 'use_alternative' and suggested:
                    final_df.at[idx, 'Recommendation'] = 'use_alternative'
                    final_df.at[idx, 'Suggested_Alternative'] = suggested
                    final_df.at[idx, 'Suitable'] = False
                else:
                    final_df.at[idx, 'Recommendation'] = 'exclude'
                    final_df.at[idx, 'Suitable'] = False
                
                # Append GPT-5 reasoning
                final_df.at[idx, 'Reasoning'] = f"{row['Reasoning']} | GPT-5: {gpt5_reasoning} (confidence: {gpt5_conf})"

# Save final curated list
final_df.to_csv(FINAL_OUTPUT, index=False)

print(f"\nSaved final curated list to: {FINAL_OUTPUT}")

# Final statistics
total = len(final_df)
use = (final_df['Recommendation'] == 'use').sum()
use_monomer = (final_df['Recommendation'] == 'use_monomer').sum()
use_alt = (final_df['Recommendation'] == 'use_alternative').sum()
exclude = (final_df['Recommendation'] == 'exclude').sum()
manual_remain = (final_df['Recommendation'] == 'manual_review').sum()

print(f"\n{'=' * 80}")
print("FINAL SUMMARY")
print(f"{'=' * 80}")
print(f"\nTotal carbon sources: {total}")
print(f"\nRecommendations:")
print(f"  Use directly: {use} ({100*use/total:.1f}%)")
print(f"  Use monomer: {use_monomer} ({100*use_monomer/total:.1f}%)")
print(f"  Use alternative: {use_alt} ({100*use_alt/total:.1f}%)")
print(f"  Exclude: {exclude} ({100*exclude/total:.1f}%)")
print(f"  Still needs review: {manual_remain} ({100*manual_remain/total:.1f}%)")

print(f"\n{'=' * 80}")
print("EVALUATION COMPLETE")
print(f"{'=' * 80}")


CREATING FINAL CURATED LIST

Saved final curated list to: results/carbon_source_evaluation_final.csv

FINAL SUMMARY

Total carbon sources: 207

Recommendations:
  Use directly: 141 (68.1%)
  Use monomer: 29 (14.0%)
  Use alternative: 23 (11.1%)
  Exclude: 14 (6.8%)
  Still needs review: 0 (0.0%)

EVALUATION COMPLETE
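
The mapping from a GPT-5 verdict to the final schema is embedded in the merge loop. Written as a small pure function, the same rules can be checked in isolation. This is a sketch that mirrors the branches in the cell above; `resolve_recommendation` is a hypothetical name.

```python
def resolve_recommendation(gpt5_rec, suggested):
    """Map a GPT-5 verdict to (Recommendation, Suitable) using the merge-cell rules."""
    if gpt5_rec == "use":
        return "use", True
    if gpt5_rec == "use_alternative" and suggested:
        return "use_alternative", False
    # Everything else, including use_alternative with no named proxy, is excluded
    return "exclude", False

print(resolve_recommendation("use_alternative", ""))
```

The last branch means that a `use_alternative` verdict with no suggested compound ends up as `exclude`.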


## Summary

**Files Created**:
1. `results/carbon_source_evaluation_gpt4o.csv` - Full GPT-4o analysis of all 207 compounds
2. `results/carbon_source_evaluation_gpt5.csv` - GPT-5 deep dive for manual review cases
3. `results/carbon_source_evaluation_final.csv` - Final curated recommendations

**Next Steps**:
1. Review the final curated list
2. Use this list in CDMSCI-197 for ModelSEED mapping
3. Only map compounds with recommendation: "use", "use_monomer", or "use_alternative"
4. Skip compounds with recommendation: "exclude"

**Workflow Integration**:
- This notebook should be run BEFORE starting CDMSCI-197
- CDMSCI-197 will use the curated list from this analysis
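
The next steps above can be sketched as a filter over the final CSV that also picks which name to look up: the source itself for "use", and the suggested monomer or proxy otherwise. `compounds_to_map`, the `Name_To_Map` column, and the demo rows are assumptions for illustration; CDMSCI-197 may handle this differently.

```python
import pandas as pd

MAPPABLE = {"use", "use_monomer", "use_alternative"}

def compounds_to_map(final_df):
    """Return the mappable rows with the compound name that should be looked up."""
    keep = final_df[final_df["Recommendation"].isin(MAPPABLE)].copy()
    # Empty strings are read back from CSV as NaN, so normalize both to ""
    alternative = keep["Suggested_Alternative"].fillna("").astype(str).str.strip()
    # Prefer the suggested monomer/proxy unless the source is usable as-is
    use_alt = (keep["Recommendation"] != "use") & (alternative != "")
    keep["Name_To_Map"] = keep["Carbon_Source"].where(~use_alt, alternative)
    return keep[["Carbon_Source", "Recommendation", "Name_To_Map"]]

# Illustrative rows only; real input would be pd.read_csv(FINAL_OUTPUT)
demo = pd.DataFrame({
    "Carbon_Source": ["D-Glucose", "Amylose", "Proprietary blend Y", "Undefined mix Z"],
    "Recommendation": ["use", "use_monomer", "use_alternative", "exclude"],
    "Suggested_Alternative": [None, "Glucose", "Proxy Y", ""],
})
print(compounds_to_map(demo))
```

Keeping `Carbon_Source` next to `Name_To_Map` keeps the link back to the original experimental condition after a monomer or proxy is substituted.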