# Map Carbon Sources to ModelSEED Compound IDs

**Parent**: CDMSCI-193 - RBTnSeq Modeling Analysis

**Ticket**: CDMSCI-197 - Translate to Computational Media Formulations

## Objective

Map 141 filtered carbon sources from CDMSCI-196 to ModelSEED compound IDs (cpd#####) for metabolic modeling.

## Input

The input is the filtered growth matrix from CDMSCI-196, which contains:
- 141 carbon sources (after removing unsuitable compounds)
- 44 organisms (after filtering out organisms with no growth data)

## Mapping Strategy

**Round 1: Automated Search**
1. Search local template (GramNegModelTemplateV6.json)
2. Search ModelSEED local database (offline)
3. Handle duplicates by choosing the lowest compound ID
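
The duplicate rule in step 3 can be sketched as follows. This is a minimal illustration, not the notebook's own code: `pick_lowest_id` is a hypothetical helper, and the IDs in the example are placeholders rather than results from this analysis.

```python
def pick_lowest_id(candidate_ids):
    """Return (selected_id, remaining_ids), using the lowest compound ID as tie-breaker."""
    # IDs follow the cpd##### pattern, so the numeric suffix defines the order.
    ordered = sorted(set(candidate_ids), key=lambda cpd_id: int(cpd_id[3:]))
    return ordered[0], ordered[1:]


# Illustrative IDs only (hypothetical input, not notebook output).
selected, others = pick_lowest_id(["cpd00137", "cpd00027", "cpd00137"])
print(selected, others)
```

Keeping the remaining IDs alongside the selected one makes it possible to record the alternatives for later review instead of silently discarding them.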

**Round 2: AI-Assisted Mapping**
1. Use GPT-4o (via Argo proxy) for unmapped compounds
2. Provide compound name + chemical context
3. Get ModelSEED ID suggestion with explanation

## Outputs

1. `carbon_source_mapping.csv` - Complete mapping table
2. `media/` directory - Individual media JSON files for each carbon source

**Last updated**: 2025-10-15

## Setup

In [1]:
import json
import pandas as pd
import numpy as np
from pathlib import Path
from urllib.request import urlopen, URLError
from urllib.parse import quote
import requests
import time

print("Imports successful")

Imports successful


## Configuration

In [2]:
# Paths
CARBON_SOURCES_FILE = Path('../CDMSCI-196-carbon-sources/results/combined_growth_matrix_filtered.csv')
TEMPLATE_PATH = Path('../references/build_metabolic_model/GramNegModelTemplateV6.json')

# Output paths
OUTPUT_DIR = Path('results')
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

MEDIA_DIR = Path('media')
MEDIA_DIR.mkdir(parents=True, exist_ok=True)

MAPPING_FILE = OUTPUT_DIR / 'carbon_source_mapping.csv'

# Argo proxy for LLM
ARGO_BASE_URL = 'http://localhost:8000/v1'
ARGO_MODEL = 'gpt4o'

# Local ModelSEED Database files
MODELSEED_ALIASES = Path('../data/modelseed_database/Unique_ModelSEED_Compound_Aliases.txt')
MODELSEED_NAMES = Path('../data/modelseed_database/Unique_ModelSEED_Compound_Names.txt')

print(f"Configuration set")
print(f"  Carbon sources: {CARBON_SOURCES_FILE}")
print(f"  Template: {TEMPLATE_PATH}")
print(f"  Output: {MAPPING_FILE}")
print(f"  Media directory: {MEDIA_DIR}")

Configuration set
  Carbon sources: ../CDMSCI-196-carbon-sources/results/combined_growth_matrix_filtered.csv
  Template: ../references/build_metabolic_model/GramNegModelTemplateV6.json
  Output: results/carbon_source_mapping.csv
  Media directory: media


## Load Filtered Carbon Sources

In [3]:
print("Loading filtered carbon sources from CDMSCI-196...")
growth_matrix = pd.read_csv(CARBON_SOURCES_FILE, index_col=0)

# Filter out NaN values from index
carbon_sources = [cs for cs in growth_matrix.index.tolist() if pd.notna(cs)]

print(f"\nLoaded {len(carbon_sources)} filtered carbon sources")
print(f"(These are the 141 carbon sources selected for modeling in CDMSCI-196)")
print(f"\nFirst 10 carbon sources:")
for i, cs in enumerate(carbon_sources[:10], 1):
    print(f"  {i:3d}. {cs}")

Loading filtered carbon sources from CDMSCI-196...

Loaded 140 filtered carbon sources
(These are the 141 carbon sources selected for modeling in CDMSCI-196)

First 10 carbon sources:
    1. 1,2-Propanediol
    2. 1,3-Butandiol
    3. 1,4-Butanediol
    4. 1,5-Pentanediol
    5. 1-Pentanol
    6. 2-Deoxy-D-Ribose
    7. 2-methyl-1-butanol
    8. 3-Methyl-2-Oxobutanoic Acid
    9. 3-methyl-1-butanol
   10. 3-methyl-2-oxopentanoic acid


## Load Template

In [4]:
print(f"Loading ModelSEED template: {TEMPLATE_PATH}")
with open(TEMPLATE_PATH) as f:
    template = json.load(f)

print(f"\nTemplate loaded:")
print(f"  Compounds: {len(template['compounds'])}")
print(f"  Reactions: {len(template['reactions'])}")

# Keep a reference to the template's compound list (the search functions scan it linearly)
template_compounds = template['compounds']
print(f"\nIndexed {len(template_compounds)} compounds for searching")

Loading ModelSEED template: ../references/build_metabolic_model/GramNegModelTemplateV6.json

Template loaded:
  Compounds: 6573
  Reactions: 8584

Indexed 6573 compounds for searching


In [5]:
# Load ModelSEED alias files (local, no internet needed)
print("Loading ModelSEED alias files...")

# Load compound names
compound_names = {}
with open(MODELSEED_NAMES) as f:
    next(f)  # Skip header
    for line in f:
        parts = line.strip().split('\t')
        if len(parts) >= 2:
            cpd_id = parts[0]
            name = parts[1].lower()
            if name not in compound_names:
                compound_names[name] = []
            compound_names[name].append(cpd_id)

print(f"  Loaded {len(compound_names):,} compound names")

# Load compound aliases
compound_aliases = {}
with open(MODELSEED_ALIASES) as f:
    next(f)  # Skip header
    for line in f:
        parts = line.strip().split('\t')
        if len(parts) >= 3:
            cpd_id = parts[0]
            alias = parts[1].lower()
            source = parts[2]
            if alias not in compound_aliases:
                compound_aliases[alias] = []
            compound_aliases[alias].append(cpd_id)

print(f"  Loaded {len(compound_aliases):,} compound aliases")
print("ModelSEED database ready for offline searching")

Loading ModelSEED alias files...
  Loaded 130,196 compound names
  Loaded 108,286 compound aliases
ModelSEED database ready for offline searching
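
The loading pattern used for both files (skip the header, split on tabs, group IDs under a lowercased key) can be sketched on an in-memory file. This is a minimal sketch under assumptions: `load_name_index` is a hypothetical helper, and the header labels and the two sample rows are placeholders, not the real file contents.

```python
import io
from collections import defaultdict


def load_name_index(handle, min_columns=2):
    """Build a lowercase-name -> [compound IDs] index from a tab-separated file."""
    index = defaultdict(list)
    next(handle)  # skip the header row
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= min_columns:
            # Column 0 holds the compound ID, column 1 the name or alias.
            index[parts[1].lower()].append(parts[0])
    return index


# Hypothetical two-row file; the header labels are placeholders.
sample = io.StringIO("id\tname\ncpd00027\tD-Glucose\ncpd00100\tGlycerol\n")
name_index = load_name_index(sample)
print(name_index["d-glucose"])
```

Using `defaultdict(list)` removes the explicit "create the list if the key is missing" branch while keeping the one-name-to-many-IDs structure.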


## Search Functions

In [6]:
def search_template(compound_name):
    """Search for compound in local template"""
    matches = []
    search_lower = compound_name.lower()
    
    for compound in template_compounds:
        # Search in name
        if search_lower == compound['name'].lower():
            matches.append(compound)
            continue
        
        # Search in abbreviation
        abbr = compound.get('abbreviation', '')
        if abbr and search_lower == abbr.lower():
            matches.append(compound)
            continue
        
        # Search in aliases (substring match, unlike the exact name/abbreviation checks above)
        for alias in compound.get('aliases', []):
            if search_lower in alias.lower():
                matches.append(compound)
                break
    
    return matches


def search_template_by_id(compound_id):
    """Search for compound by ID in local template"""
    for compound in template_compounds:
        if compound['id'] == compound_id:
            return compound
    return None


def search_modelseed_local(compound_name):
    """Search ModelSEED using local alias files (offline)"""
    search_lower = compound_name.lower()
    found_ids = set()

    # Search in compound names
    if search_lower in compound_names:
        found_ids.update(compound_names[search_lower])

    # Search in aliases
    if search_lower in compound_aliases:
        found_ids.update(compound_aliases[search_lower])

    # Get compound details from template (IDs absent from the template are dropped)
    matches = []
    for cpd_id in found_ids:
        compound = search_template_by_id(cpd_id)
        if compound:
            matches.append({
                'id': cpd_id,
                'name': compound['name'],
                'formula': compound.get('formula', ''),
                'charge': compound.get('defaultCharge', 0),
                'mass': compound.get('mass', 0),
                'source': 'modelseed_local'
            })

    return matches


def search_compound_round1(compound_name):
    """Round 1: Search template and local ModelSEED database"""
    # Try template first (faster, offline)
    template_matches = search_template(compound_name)
    
    # Try local ModelSEED if no template matches
    modelseed_matches = search_modelseed_local(compound_name) if not template_matches else []
    
    # Combine and deduplicate
    all_matches = []
    seen_ids = set()
    
    for match in template_matches:
        cpd_id = match['id']
        if cpd_id not in seen_ids:
            all_matches.append({
                'id': cpd_id,
                'name': match['name'],
                'formula': match.get('formula', ''),
                'charge': match.get('defaultCharge', 0),
                'mass': match.get('mass', 0),
                'source': 'template'
            })
            seen_ids.add(cpd_id)
    
    for match in modelseed_matches:
        cpd_id = match['id']
        if cpd_id not in seen_ids:
            all_matches.append(match)
            seen_ids.add(cpd_id)
    
    # Sort by ID (lower IDs first)
    all_matches.sort(key=lambda x: x['id'])
    
    return all_matches

print("Search functions defined")

Search functions defined
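
`search_template_by_id` scans the whole compound list on every call. If lookups by ID become frequent, a dictionary keyed by ID is one possible alternative. The sketch below assumes only that each record has an `id` key, as in the functions above; `build_id_index` and the sample records are hypothetical.

```python
def build_id_index(compounds):
    """Map compound ID -> compound record for constant-time lookup."""
    return {compound["id"]: compound for compound in compounds}


# Hypothetical records shaped like the template entries (only 'id' and 'name' shown).
sample_compounds = [
    {"id": "cpd00027", "name": "D-Glucose"},
    {"id": "cpd00100", "name": "Glycerol"},
]
id_index = build_id_index(sample_compounds)

print(id_index.get("cpd00100", {}).get("name"))
print(id_index.get("cpd99999"))  # an ID that is not in the sample
```

If the same ID appeared twice in the list, the dictionary would keep the last record, whereas the linear scan returns the first, so the two approaches only agree when IDs are unique.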


## Round 1: Automated Mapping

Search for ModelSEED compound IDs using local template and database files.

In [7]:
print("="*80)
print("ROUND 1: AUTOMATED MAPPING")
print("="*80)

mappings = []
unmapped = []

for i, carbon_source in enumerate(carbon_sources, 1):
    print(f"\n[{i}/{len(carbon_sources)}] {carbon_source}")
    
    matches = search_compound_round1(carbon_source)
    
    if matches:
        # Found matches
        best_match = matches[0]  # Lowest ID (already sorted)
        
        if len(matches) > 1:
            # Report duplicates
            duplicate_ids = [m['id'] for m in matches]
            print(f"  DUPLICATE: Found {len(matches)} matches: {duplicate_ids}")
            print(f"  Selected: {best_match['id']} (lowest ID)")
        else:
            print(f"  Mapped: {best_match['id']} - {best_match['name']}")
        
        mappings.append({
            'Carbon_Source_Original': carbon_source,
            'ModelSEED_ID': best_match['id'],
            'ModelSEED_Name': best_match['name'],
            'Formula': best_match['formula'],
            'Mass': best_match['mass'],
            'Charge': best_match['charge'],
            'Mapping_Method': f"round1_{best_match['source']}",
            'Confidence': 'High',
            'AI_Explanation': '',
            'Duplicate_IDs': ';'.join([m['id'] for m in matches[1:]]) if len(matches) > 1 else ''
        })
    else:
        # No matches found
        print(f"  NOT FOUND - will try LLM in Round 2")
        unmapped.append(carbon_source)

print(f"\n{'='*80}")
print(f"ROUND 1 COMPLETE")
print(f"  Mapped: {len(mappings)} ({100*len(mappings)/len(carbon_sources):.1f}%)")
print(f"  Unmapped: {len(unmapped)} ({100*len(unmapped)/len(carbon_sources):.1f}%)")
print(f"{'='*80}")

# Save Round 1 intermediate results
print(f"\nSaving Round 1 intermediate results...")
round1_df = pd.DataFrame(mappings)
round1_file = OUTPUT_DIR / 'round1_mapped.csv'
round1_df.to_csv(round1_file, index=False)
print(f"  Mapped compounds: {round1_file}")

unmapped_file = OUTPUT_DIR / 'round1_unmapped.txt'
with open(unmapped_file, 'w') as f:
    for cs in unmapped:
        f.write(f"{cs}\n")
print(f"  Unmapped compounds: {unmapped_file}")
print(f"\nIntermediate files saved. Proceeding to Round 2...")

ROUND 1: AUTOMATED MAPPING

[1/140] 1,2-Propanediol
  Mapped: cpd00453 - 1,2-Propanediol

[2/140] 1,3-Butandiol
  NOT FOUND - will try LLM in Round 2

[3/140] 1,4-Butanediol
  NOT FOUND - will try LLM in Round 2

[4/140] 1,5-Pentanediol
  NOT FOUND - will try LLM in Round 2

[5/140] 1-Pentanol
  Mapped: cpd16586 - 1-Pentanol

[6/140] 2-Deoxy-D-Ribose
  Mapped: cpd01242 - Thyminose

[7/140] 2-methyl-1-butanol
  Mapped: cpd16873 - 2-methyl-1-butanol

[8/140] 3-Methyl-2-Oxobutanoic Acid
  Mapped: cpd00123 - 3-Methyl-2-oxobutanoate

[9/140] 3-methyl-1-butanol
  Mapped: cpd04533 - Isoamyl alcohol

[10/140] 3-methyl-2-oxopentanoic acid
  NOT FOUND - will try LLM in Round 2

[11/140] 4-Aminobutyric acid
  Mapped: cpd00281 - GABA

[12/140] 4-Hydroxybenzoic Acid
  Mapped: cpd00136 - 4-Hydroxybenzoate

[13/140] 4-Hydroxyvalerate
  NOT FOUND - will try LLM in Round 2

[14/140] 4-Methyl-2-oxovaleric acid
  NOT FOUND - will try LLM in Round 2

[15/140] 5-Aminovaleric acid
  Mapped: cpd00339 - 5-Ami

## Round 2: AI-Assisted Mapping

In [8]:
def ask_llm_for_mapping(compound_name):
    """Use LLM to suggest ModelSEED compound ID"""
    
    prompt = f"""You are a biochemistry expert helping map compound names to ModelSEED database IDs.

Compound name: "{compound_name}"

Task: Suggest the most likely ModelSEED compound ID (format: cpd#####) for this compound.

Context:
- This is a carbon source from bacterial growth experiments
- ModelSEED uses standardized compound IDs (e.g., cpd00027 = D-Glucose)
- Common carbon sources: glucose (cpd00027), glycerol (cpd00100), acetate (cpd00029)
- For complex names, try to identify the base metabolite
- For salts/hydrates, use the base compound (e.g., "Citric Acid" → cpd00137 = Citrate)
- For polymers, suggest the monomer (e.g., "Amylose" → cpd00027 = Glucose)

Response format (JSON):
{{
  "compound_id": "cpd#####",
  "compound_name": "Official ModelSEED name",
  "explanation": "One-line explanation of mapping rationale",
  "confidence": "high/medium/low"
}}

If you cannot confidently map this compound, return:
{{
  "compound_id": "UNMAPPED",
  "compound_name": "",
  "explanation": "Reason why mapping is not possible",
  "confidence": "low"
}}
"""
    
    try:
        response = requests.post(
            f"{ARGO_BASE_URL}/chat/completions",
            json={
                "model": ARGO_MODEL,
                "messages": [{"role": "user", "content": prompt}],
                "temperature": 0.1,  # Low temperature for consistent reasoning
                "max_tokens": 500
            },
            timeout=90
        )
        
        if response.status_code == 200:
            result = response.json()
            content = result['choices'][0]['message']['content']
            
            # Try to parse JSON from response
            # Handle markdown code blocks if present
            if '```json' in content:
                content = content.split('```json')[1].split('```')[0].strip()
            elif '```' in content:
                content = content.split('```')[1].split('```')[0].strip()
            
            mapping = json.loads(content)
            return mapping
        else:
            print(f"    WARNING: LLM request failed with status {response.status_code}")
            return None
            
    except Exception as e:
        print(f"    WARNING: LLM request error: {e}")
        return None

print("LLM mapping function defined")

LLM mapping function defined
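
The reply-parsing step inside `ask_llm_for_mapping` can be isolated and tested without calling the LLM. The following is a minimal sketch of that idea, not the notebook's implementation: `extract_json` is a hypothetical helper that uses a regular expression instead of string splitting and returns `None` when the payload is not valid JSON.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence delimiter


def extract_json(reply):
    """Parse a JSON object from an LLM reply, with or without a code fence."""
    pattern = FENCE + r"(?:json)?\s*(.*?)" + FENCE
    match = re.search(pattern, reply, flags=re.DOTALL)
    payload = match.group(1) if match else reply
    try:
        return json.loads(payload.strip())
    except json.JSONDecodeError:
        return None


# Hypothetical replies, shaped like the response format requested in the prompt.
fenced = FENCE + 'json\n{"compound_id": "UNMAPPED", "confidence": "low"}\n' + FENCE
print(extract_json(fenced))
print(extract_json("no JSON here"))
```

Separating parsing from the HTTP call lets a malformed reply be told apart from a failed request, which the combined `except Exception` handler cannot do.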


In [9]:
if unmapped:
    print("\n" + "="*80)
    print("ROUND 2: AI-ASSISTED MAPPING")
    print("="*80)
    print(f"\nAttempting to map {len(unmapped)} compounds using LLM...")
    
    for i, carbon_source in enumerate(unmapped, 1):
        print(f"\n[{i}/{len(unmapped)}] {carbon_source}")
        
        llm_result = ask_llm_for_mapping(carbon_source)
        
        if llm_result:
            cpd_id = llm_result.get('compound_id', 'UNMAPPED')
            cpd_name = llm_result.get('compound_name', '')
            explanation = llm_result.get('explanation', '')
            confidence = llm_result.get('confidence', 'low')
            
            if cpd_id != 'UNMAPPED':
                print(f"  LLM Mapped: {cpd_id} - {cpd_name}")
                print(f"     Explanation: {explanation}")
                print(f"     Confidence: {confidence}")
                
                # Verify LLM suggestion exists in template
                verification = search_template_by_id(cpd_id)
                if verification:
                    print(f"  Verified in template")
                    mappings.append({
                        'Carbon_Source_Original': carbon_source,
                        'ModelSEED_ID': cpd_id,
                        'ModelSEED_Name': verification['name'],
                        'Formula': verification.get('formula', ''),
                        'Mass': verification.get('mass', 0),
                        'Charge': verification.get('defaultCharge', 0),
                        'Mapping_Method': 'round2_llm',
                        'Confidence': confidence.capitalize(),
                        'AI_Explanation': explanation,
                        'Duplicate_IDs': ''
                    })
                else:
                    print(f"  NOT VERIFIED - LLM suggested ID not in template")
                    mappings.append({
                        'Carbon_Source_Original': carbon_source,
                        'ModelSEED_ID': 'UNMAPPED',
                        'ModelSEED_Name': '',
                        'Formula': '',
                        'Mass': 0,
                        'Charge': 0,
                        'Mapping_Method': 'round2_llm_unverified',
                        'Confidence': 'Low',
                        'AI_Explanation': f"LLM suggested {cpd_id} but not found in template",
                        'Duplicate_IDs': ''
                    })
            else:
                print(f"  LLM could not map: {explanation}")
                mappings.append({
                    'Carbon_Source_Original': carbon_source,
                    'ModelSEED_ID': 'UNMAPPED',
                    'ModelSEED_Name': '',
                    'Formula': '',
                    'Mass': 0,
                    'Charge': 0,
                    'Mapping_Method': 'round2_llm_failed',
                    'Confidence': 'Low',
                    'AI_Explanation': explanation,
                    'Duplicate_IDs': ''
                })
        else:
            print(f"  LLM request failed")
            mappings.append({
                'Carbon_Source_Original': carbon_source,
                'ModelSEED_ID': 'UNMAPPED',
                'ModelSEED_Name': '',
                'Formula': '',
                'Mass': 0,
                'Charge': 0,
                'Mapping_Method': 'round2_llm_error',
                'Confidence': 'Low',
                'AI_Explanation': 'LLM request error',
                'Duplicate_IDs': ''
            })
        
        # Rate limit: small delay between requests
        time.sleep(0.5)
    
    print(f"\n{'='*80}")
    print("ROUND 2 COMPLETE")
    print(f"{'='*80}")
    
    # Save Round 2 results
    print(f"\nSaving Round 2 intermediate results...")
    round2_df = pd.DataFrame(mappings)
    round2_file = OUTPUT_DIR / 'round2_all_mappings.csv'
    round2_df.to_csv(round2_file, index=False)
    print(f"  All mappings so far: {round2_file}")
    
    # Save still unmapped after Round 2
    still_unmapped = round2_df[round2_df['ModelSEED_ID'] == 'UNMAPPED']
    if len(still_unmapped) > 0:
        unmapped_round2_file = OUTPUT_DIR / 'round2_still_unmapped.csv'
        still_unmapped.to_csv(unmapped_round2_file, index=False)
        print(f"  Still unmapped compounds: {unmapped_round2_file}")
        print(f"  {len(still_unmapped)} compounds still need manual curation")
else:
    print("\nAll compounds mapped in Round 1 - skipping Round 2")


ROUND 2: AI-ASSISTED MAPPING

Attempting to map 54 compounds using LLM...

[1/54] 1,3-Butandiol
  LLM Mapped: cpd00738 - 1,3-Butanediol
     Explanation: 1,3-Butanediol is a known compound in the ModelSEED database, used as a carbon source in bacterial growth experiments.
     Confidence: high
  Verified in template

[2/54] 1,4-Butanediol
  LLM Mapped: cpd00751 - 1,4-Butanediol
     Explanation: 1,4-Butanediol is a known compound in the ModelSEED database with the ID cpd00751.
     Confidence: high
  Verified in template

[3/54] 1,5-Pentanediol
  LLM Mapped: cpd19020 - 1,5-Pentanediol
     Explanation: 1,5-Pentanediol is a diol with a five-carbon chain, and it is directly listed in the ModelSEED database.
     Confidence: high
  Verified in template

[4/54] 3-methyl-2-oxopentanoic acid
  LLM Mapped: cpd11493 - 3-Methyl-2-oxopentanoate
     Explanation: The compound '3-methyl-2-oxopentanoic acid' corresponds to the keto acid '3-Methyl-2-oxopentanoate' in the ModelSEED database.
     Co

## Round 3: Validate Duplicate Mappings (GPT-5)

**Purpose**: Identify and flag problematic duplicate ModelSEED mappings from Round 2.

**Why GPT-5**: It is expected to reason about chemical structures and biochemical relationships more reliably than GPT-4o, the model used in Round 2.

**What it does**:
1. Finds all ModelSEED compound IDs mapped to multiple carbon sources
2. Uses GPT-5 to deeply analyze each duplicate group
3. Checks if formulas match expected structures
4. Flags incorrect mappings (e.g., LLM hallucinations)
5. Generates validation report
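
Step 1 of the list above (finding compound IDs shared by several carbon sources) can be sketched in plain Python. This is an illustration under assumptions: `find_duplicate_groups` is a hypothetical helper, and the rows are made-up inputs that reuse the column names of the mapping table.

```python
from collections import defaultdict


def find_duplicate_groups(mappings):
    """Return {ModelSEED ID: [carbon sources]} for IDs shared by two or more sources."""
    groups = defaultdict(list)
    for row in mappings:
        if row["ModelSEED_ID"] != "UNMAPPED":  # unmapped rows are not duplicates
            groups[row["ModelSEED_ID"]].append(row["Carbon_Source_Original"])
    return {cpd_id: names for cpd_id, names in groups.items() if len(names) > 1}


# Hypothetical mapping rows (not results from this notebook).
rows = [
    {"Carbon_Source_Original": "D-Glucose", "ModelSEED_ID": "cpd00027"},
    {"Carbon_Source_Original": "Dextrose", "ModelSEED_ID": "cpd00027"},
    {"Carbon_Source_Original": "Glycerol", "ModelSEED_ID": "cpd00100"},
    {"Carbon_Source_Original": "Unknown A", "ModelSEED_ID": "UNMAPPED"},
    {"Carbon_Source_Original": "Unknown B", "ModelSEED_ID": "UNMAPPED"},
]
duplicate_groups = find_duplicate_groups(rows)
print(duplicate_groups)
```

Excluding `UNMAPPED` before grouping matters: otherwise every unmapped source would be reported as one large duplicate group.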

**Why this matters for CDMSCI-199**:
- The Round 2 LLM may hallucinate incorrect mappings
- When different experimental carbon sources map to the same ModelSEED ID, their in silico predictions are identical
- Incorrect mappings directly inflate false positive/negative rates in FBA analysis
- Without validation, mapping errors cannot be distinguished from model errors

In [10]:
def validate_duplicate_mapping_gpt5(cpd_id, cpd_formula, carbon_sources, mappings_list):
    """
    Use GPT-5 to validate whether multiple carbon sources correctly map to same ModelSEED compound
    
    Args:
        cpd_id: ModelSEED compound ID (e.g., 'cpd00751')
        cpd_formula: Chemical formula of the ModelSEED compound
        carbon_sources: List of carbon source names mapped to this compound
        mappings_list: List of mapping dict entries for each carbon source
    
    Returns:
        dict with validation results
    """
    
    # Build detailed context
    mapping_details = []
    for i, (cs, mapping) in enumerate(zip(carbon_sources, mappings_list), 1):
        mapping_details.append(
            f"{i}. '{cs}'\n"
            f"   Method: {mapping.get('Mapping_Method', 'unknown')}\n"
            f"   LLM Explanation: {mapping.get('AI_Explanation', 'none')}"
        )
    
    mapping_details_str = '\n'.join(mapping_details)
    
    prompt = f"""You are a biochemistry expert with deep knowledge of metabolic databases and chemical structures.

**CRITICAL TASK**: Analyze if these {len(carbon_sources)} different carbon sources correctly map to the SAME ModelSEED compound.

**ModelSEED Compound**:
- ID: {cpd_id}
- Formula: {cpd_formula}

**Carbon Sources Mapped to This Compound**:
{mapping_details_str}

**Deep Analysis Required**:
1. For EACH carbon source, determine the expected chemical formula
2. Compare expected formula to ModelSEED compound formula ({cpd_formula})
3. Check if they are:
   - IDENTICAL compound
   - Salt/hydrate forms (same base compound)
   - Stereoisomers (same formula, different stereochemistry)
   - COMPLETELY DIFFERENT (wrong formula → LLM hallucination)

4. Flag mappings as:
   - CORRECT: Same compound, salt forms, stereoisomers, reasonable derivatives
   - QUESTIONABLE: Unclear relationship, may need manual review
   - INCORRECT: Completely wrong formula, unrelated structure, LLM hallucination

**Common Issues to Watch For**:
- LLM claiming a compound exists in ModelSEED when formula doesn't match
- Dicarboxylic acids with different chain lengths mapped to same ID
- Diols with different structures mapped together
- Sugars vs non-sugars mixed together

**Response Format (JSON)**:
{{
  "verdict": "all_correct" | "mixed" | "all_incorrect",
  "explanation": "Brief analysis of the mapping group",
  "correct_sources": ["list of correctly mapped sources"],
  "incorrect_sources": ["list of incorrectly mapped sources"],
  "questionable_sources": ["list of questionable mappings"],
  "incorrect_reasons": {{"source_name": "expected formula vs actual formula, why it's wrong"}},
  "confidence": "high" | "medium" | "low"
}}

Be strict: if formula doesn't match, flag as INCORRECT.
"""
    
    try:
        response = requests.post(
            f"{ARGO_BASE_URL}/chat/completions",
            json={
                "model": "gpt5",  # Use GPT-5 for better reasoning
                "messages": [{"role": "user", "content": prompt}],
                "temperature": 0.05,  # Very low temperature for accurate analysis
                "max_tokens": 2000
            },
            timeout=180
        )
        
        if response.status_code == 200:
            result = response.json()
            content = result['choices'][0]['message']['content']
            
            # Parse JSON response
            if '```json' in content:
                content = content.split('```json')[1].split('```')[0].strip()
            elif '```' in content:
                content = content.split('```')[1].split('```')[0].strip()
            
            validation = json.loads(content)
            return validation
        else:
            print(f"    WARNING: GPT-5 validation failed with status {response.status_code}")
            return None
            
    except Exception as e:
        print(f"    WARNING: GPT-5 validation error: {e}")
        return None

print("GPT-5 duplicate validation function defined")

GPT-5 duplicate validation function defined
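
The prompt asks the model to compare formulas, and a deterministic comparison could serve as a cross-check on its verdict. The sketch below is one possible approach, not part of the notebook: `element_counts` and `same_formula` are hypothetical helpers, they handle only flat formulas without parentheses, and ignoring hydrogen is an assumption made because stored formulas may reflect a different protonation state than the compound name suggests.

```python
import re
from collections import Counter


def element_counts(formula):
    """Parse a flat formula such as 'C6H12O6' into element counts."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def same_formula(a, b, ignore=("H",)):
    """Compare two formulas, skipping elements that vary with protonation state."""
    counts_a, counts_b = element_counts(a), element_counts(b)
    for element in ignore:
        counts_a.pop(element, None)
        counts_b.pop(element, None)
    return counts_a == counts_b


# Citric acid vs. its trianion differ only in hydrogen count.
print(same_formula("C6H8O7", "C6H5O7"))
# 1,4-Butanediol vs. 1,5-pentanediol differ in carbon count.
print(same_formula("C4H10O2", "C5H12O2"))
```

A check like this only catches differences in composition; isomers with identical formulas would still need the structural reasoning requested in the prompt.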


In [12]:
print("\n" + "="*80)
print("ROUND 3: VALIDATE DUPLICATE MAPPINGS (GPT-5)")
print("="*80)

# Create DataFrame from mappings
mapping_df_r3 = pd.DataFrame(mappings)

# Find all duplicate ModelSEED compound IDs (excluding UNMAPPED)
mapped_only = mapping_df_r3[mapping_df_r3['ModelSEED_ID'] != 'UNMAPPED'].copy()
duplicate_cpds = mapped_only[mapped_only.duplicated('ModelSEED_ID', keep=False)]['ModelSEED_ID'].unique()

print(f"\nFound {len(duplicate_cpds)} ModelSEED compounds with duplicate mappings")

if len(duplicate_cpds) == 0:
    print("  No duplicates to validate - all carbon sources map to unique compounds")
    validation_results = []
else:
    print(f"  These compounds are mapped to multiple carbon sources")
    print(f"  Will validate each group using GPT-5 for deep chemical analysis\n")

    validation_results = []
    for i, cpd_id in enumerate(duplicate_cpds, 1):
        # Get all mappings for this compound
        cpd_mappings = mapped_only[mapped_only['ModelSEED_ID'] == cpd_id]
        # NOTE: this rebinds the notebook-level `carbon_sources` list loaded earlier
        carbon_sources = cpd_mappings['Carbon_Source_Original'].tolist()
        cpd_name = cpd_mappings.iloc[0]['ModelSEED_Name']
        cpd_formula = cpd_mappings.iloc[0]['Formula']

        print(f"\n[{i}/{len(duplicate_cpds)}] {cpd_id} ({cpd_name}, {cpd_formula})")
        print(f"  Mapped to {len(carbon_sources)} carbon sources:")
        for cs in carbon_sources:
            method = cpd_mappings[cpd_mappings['Carbon_Source_Original'] == cs]['Mapping_Method'].iloc[0]
            print(f"    - {cs} ({method})")

        # Skip if all from Round 1 (these are reliable)
        all_round1 = all(cpd_mappings['Mapping_Method'].str.startswith('round1'))
        if all_round1:
            print(f"  ✓ All from Round 1 (reliable) - skipping validation")
            continue

        # Validate using GPT-5
        print(f"  Validating with GPT-5 (deep chemical analysis)...")
        mappings_list = cpd_mappings.to_dict('records')
        validation = validate_duplicate_mapping_gpt5(cpd_id, cpd_formula, carbon_sources, mappings_list)

        if validation:
            verdict = validation.get('verdict', 'unknown')
            explanation = validation.get('explanation', '')
            incorrect_sources = validation.get('incorrect_sources', [])
            questionable_sources = validation.get('questionable_sources', [])
            incorrect_reasons = validation.get('incorrect_reasons', {})
            confidence = validation.get('confidence', 'low')

            print(f"  Verdict: {verdict.upper()} (confidence: {confidence})")
            print(f"  Explanation: {explanation}")

            if len(incorrect_sources) > 0:
                print(f"  ⚠️  FLAGGED INCORRECT: {len(incorrect_sources)} mapping(s)")
                for src in incorrect_sources:
                    reason = incorrect_reasons.get(src, 'No reason provided')
                    print(f"     - {src}: {reason}")

            if len(questionable_sources) > 0:
                print(f"  ⚠️  QUESTIONABLE: {len(questionable_sources)} mapping(s) need manual review")
                for src in questionable_sources:
                    print(f"     - {src}")

            if len(incorrect_sources) == 0 and len(questionable_sources) == 0:
                print(f"  ✓ All mappings validated as correct")

            # Store validation result
            validation_results.append({
                'ModelSEED_ID': cpd_id,
                'ModelSEED_Name': cpd_name,
                'Formula': cpd_formula,
                'Total_Sources': len(carbon_sources),
                'Carbon_Sources': '; '.join(carbon_sources),
                'Verdict': verdict,
                'Confidence': confidence,
                'Incorrect_Count': len(incorrect_sources),
                'Incorrect_Sources': '; '.join(incorrect_sources),
                'Questionable_Count': len(questionable_sources),
                'Questionable_Sources': '; '.join(questionable_sources),
                'Explanation': explanation,
                'Incorrect_Reasons': json.dumps(incorrect_reasons)
            })
        else:
            print(f"  ERROR: GPT-5 validation failed")

        # Rate limit
        time.sleep(1)

    print("\n" + "="*80)
    print("ROUND 3 COMPLETE")
    print("="*80)

    # Save validation results
    if len(validation_results) > 0:
        validation_df = pd.DataFrame(validation_results)
        validation_file = OUTPUT_DIR / 'round3_duplicate_validation.csv'
        validation_df.to_csv(validation_file, index=False)
        print(f"\nValidation results saved: {validation_file}")

        # Summary statistics
        total_validated = len(validation_df)
        all_correct = len(validation_df[validation_df['Verdict'] == 'all_correct'])
        mixed = len(validation_df[validation_df['Verdict'] == 'mixed'])
        all_incorrect = len(validation_df[validation_df['Verdict'] == 'all_incorrect'])
        total_incorrect_mappings = validation_df['Incorrect_Count'].sum()
        total_questionable_mappings = validation_df['Questionable_Count'].sum()

        print(f"\nRound 3 Summary:")
        print(f"  Duplicate groups validated: {total_validated}")
        print(f"  All correct: {all_correct}")
        print(f"  Mixed (some incorrect): {mixed}")
        print(f"  All incorrect: {all_incorrect}")
        print(f"  Total incorrect mappings flagged: {total_incorrect_mappings}")
        print(f"  Total questionable mappings: {total_questionable_mappings}")
    elif len(duplicate_cpds) > 0:
        print("\nAll duplicate groups were from Round 1 (reliable) - no validation needed")


ROUND 3: VALIDATE DUPLICATE MAPPINGS (GPT-5)

Found 14 ModelSEED compounds with duplicate mappings
  These compounds are mapped to multiple carbon sources
  Will validate each group using GPT-5 for deep chemical analysis


[1/14] cpd00137 (Citrate, C6H5O7)
  Mapped to 2 carbon sources:
    - Citric Acid (round1_modelseed_local)
    - Trisodium citrate dihydrate (round2_llm)
  Validating with GPT-5 (deep chemical analysis)...
  Verdict: ALL_CORRECT (confidence: high)
  Explanation: ModelSEED cpd00137 has formula C6H5O7, which corresponds to the fully deprotonated citrate trianion. Citric acid has formula C6H8O7 (the fully protonated acid) and is the same base compound differing only by three protons. Trisodium citrate dihydrate has formula Na3C6H5O7·2H2O (salt/hydrate of citrate; underlying anion C6H5O7³⁻). Both sources are salt/acid–base/hydrate forms of citrate and appropriately map to the citrate entity in ModelSEED.
  ✓ All mappings validated as correct


[2/14] cpd00108 (Galactose

## Round 4: Deep Dive for Unmapped Compounds (GPT-5)

**Purpose**: Make a final attempt to map the remaining UNMAPPED compounds using GPT-5's deeper analysis.

**Why NON-OPTIONAL**: Every unmapped compound means lost experimental data in CDMSCI-199.

**Why GPT-5**: More sophisticated reasoning about:
- Alternative compound names and synonyms
- Biochemical transformations and derivatives
- Structural relationships
- Stereochemistry considerations

**What it does**:
1. Takes all compounds still UNMAPPED after Round 2
2. Provides GPT-5 with context about why previous attempts failed
3. Asks for comprehensive structural and biochemical analysis
4. Suggests best available mapping or clearly documents why it's impossible

**Output**: Final decision on each unmapped compound for manual review
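
The Round 4 cell has to pull a JSON object out of a free-text model reply before it can read `compound_id` and `decision`. The sketch below shows one way to isolate that step so a malformed reply degrades to UNMAPPED instead of raising. The helper name `extract_json_reply` and the fallback dictionary are assumptions made for illustration and are not part of the notebook.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example fence-safe


def extract_json_reply(content: str) -> dict:
    """Parse a JSON object from an LLM reply that may be wrapped in a Markdown fence."""
    # Hypothetical fallback: an unusable reply is treated as unmapped and flagged for review.
    fallback = {"compound_id": "UNMAPPED", "decision": "unmappable",
                "manual_review_needed": True}
    text = content.strip()
    if FENCE in text:
        # Keep only the text after the first fence and before the next one.
        text = text.split(FENCE, 2)[1].lstrip()
        # Drop an optional language tag such as "json" after the opening fence.
        if text.lower().startswith("json"):
            text = text[4:]
    try:
        parsed = json.loads(text.strip())
    except json.JSONDecodeError:
        return dict(fallback)
    # A bare list or string is not a usable reply either.
    return parsed if isinstance(parsed, dict) else dict(fallback)
```

Returning a flagged fallback keeps the loop going when a single reply is malformed, and the `manual_review_needed` flag still surfaces the compound in the decision summary.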

In [13]:
print("\n" + "="*80)
print("ROUND 4: DEEP DIVE FOR UNMAPPED COMPOUNDS (GPT-5)")
print("="*80)

# Get unmapped compounds
mapping_df_r4 = pd.DataFrame(mappings)
still_unmapped = mapping_df_r4[mapping_df_r4['ModelSEED_ID'] == 'UNMAPPED']

if len(still_unmapped) == 0:
    print("\n✓ All compounds successfully mapped - no deep dive needed")
else:
    print(f"\nDeep dive analysis for {len(still_unmapped)} unmapped compounds...")
    print("Using GPT-5 for comprehensive biochemical analysis\n")
    
    round4_updates = []
    
    for pos, (idx, row) in enumerate(still_unmapped.iterrows(), 1):
        carbon_source = row['Carbon_Source_Original']
        prev_method = row['Mapping_Method']
        prev_explanation = row['AI_Explanation']
        
        # `idx` is the row's index label in `mappings`; `pos` is the 1-based progress counter
        print(f"\n[{pos}/{len(still_unmapped)}] {carbon_source}")
        print(f"  Previous attempt: {prev_method}")
        print(f"  Reason unmapped: {prev_explanation[:100]}...")
        
        # More detailed prompt for GPT-5
        prompt = f"""You are a biochemistry expert with deep knowledge of metabolic databases.

**Compound**: "{carbon_source}"
**Previous attempt**: {prev_method}
**Why it failed**: {prev_explanation}

**COMPREHENSIVE ANALYSIS TASK**:

1. **Chemical Structure Identification**:
   - Determine the exact chemical structure
   - Calculate expected molecular formula
   - Identify functional groups

2. **Database Search Strategy**:
   - Search for exact name
   - Check common synonyms and alternative names
   - Look for base compound (remove salts, hydrates)
   - Consider stereoisomers
   - Check parent compounds or related metabolites

3. **ModelSEED Mapping**:
   - If exact match exists: provide ID
   - If parent compound exists: suggest with caveat
   - If structurally similar compound exists: explain relationship
   - If truly absent: clearly state why it's not mappable

4. **Decision**:
   - MAPPED: Found a valid ModelSEED ID
   - APPROXIMATE: Found close match (document differences)
   - UNMAPPABLE: Not in ModelSEED (provide clear justification)

**Response Format (JSON)**:
{{
  "compound_id": "cpd#####" or "UNMAPPED",
  "compound_name": "ModelSEED name",
  "formula": "Expected formula",
  "decision": "mapped" | "approximate" | "unmappable",
  "explanation": "Detailed reasoning (200+ words)",
  "confidence": "high" | "medium" | "low",
  "caveats": "Important limitations of this mapping",
  "manual_review_needed": true | false
}}

Be thorough and honest about limitations.
"""
        
        try:
            response = requests.post(
                f"{ARGO_BASE_URL}/chat/completions",
                json={
                    "model": "gpt5",  # Use GPT-5 for deep analysis
                    "messages": [{"role": "user", "content": prompt}],
                    "temperature": 0.05,
                    "max_tokens": 2000
                },
                timeout=180
            )
            
            if response.status_code == 200:
                result = response.json()
                content = result['choices'][0]['message']['content']
                
                if '```json' in content:
                    content = content.split('```json')[1].split('```')[0].strip()
                elif '```' in content:
                    content = content.split('```')[1].split('```')[0].strip()
                
                deep_result = json.loads(content)
                cpd_id = deep_result.get('compound_id', 'UNMAPPED')
                decision = deep_result.get('decision', 'unmappable')
                explanation = deep_result.get('explanation', '')
                caveats = deep_result.get('caveats', '')
                manual_review = deep_result.get('manual_review_needed', True)
                
                print(f"  GPT-5 Decision: {decision.upper()}")
                print(f"  Compound ID: {cpd_id}")
                print(f"  Explanation: {explanation[:150]}...")
                if caveats:
                    print(f"  ⚠️  Caveats: {caveats[:100]}...")
                if manual_review:
                    print(f"  ⚠️  MANUAL REVIEW REQUIRED")
                
                if cpd_id != 'UNMAPPED':
                    # Verify ID exists in template
                    verification = search_template_by_id(cpd_id)
                    if verification:
                        print(f"  ✓ Verified in template")
                        
                        # Update mapping
                        mappings[idx] = {
                            'Carbon_Source_Original': carbon_source,
                            'ModelSEED_ID': cpd_id,
                            'ModelSEED_Name': verification['name'],
                            'Formula': verification.get('formula', ''),
                            'Mass': verification.get('mass', 0),
                            'Charge': verification.get('defaultCharge', 0),
                            'Mapping_Method': f'round4_gpt5_{decision}',
                            'Confidence': deep_result.get('confidence', 'medium').capitalize(),
                            'AI_Explanation': explanation,
                            'Duplicate_IDs': ''
                        }
                        
                        round4_updates.append({
                            'Carbon_Source': carbon_source,
                            'ModelSEED_ID': cpd_id,
                            'Decision': decision,
                            'Caveats': caveats,
                            'Manual_Review_Needed': manual_review,
                            'Explanation': explanation
                        })
                    else:
                        print(f"  ✗ ID not verified in template - keeping UNMAPPED")
                else:
                    print(f"  Confirmed UNMAPPABLE")
                    round4_updates.append({
                        'Carbon_Source': carbon_source,
                        'ModelSEED_ID': 'UNMAPPED',
                        'Decision': 'unmappable',
                        'Caveats': '',
                        'Manual_Review_Needed': True,
                        'Explanation': explanation
                    })
                    
        except Exception as e:
            print(f"  ERROR: {e}")
        
        time.sleep(1)
    
    print(f"\n{'='*80}")
    print("ROUND 4 COMPLETE")
    print(f"{'='*80}")
    
    # Save Round 4 results
    if len(round4_updates) > 0:
        round4_df = pd.DataFrame(round4_updates)
        round4_file = OUTPUT_DIR / 'round4_deep_dive_results.csv'
        round4_df.to_csv(round4_file, index=False)
        print(f"\nRound 4 results saved: {round4_file}")
        
        newly_mapped = len(round4_df[round4_df['ModelSEED_ID'] != 'UNMAPPED'])
        still_unmapped_after_r4 = len(round4_df[round4_df['ModelSEED_ID'] == 'UNMAPPED'])
        needs_review = len(round4_df[round4_df['Manual_Review_Needed'] == True])
        
        print(f"\nRound 4 Summary:")
        print(f"  Newly mapped: {newly_mapped}")
        print(f"  Still unmapped: {still_unmapped_after_r4}")
        print(f"  Require manual review: {needs_review}")


ROUND 4: DEEP DIVE FOR UNMAPPED COMPOUNDS (GPT-5)

Deep dive analysis for 10 unmapped compounds...
Using GPT-5 for comprehensive biochemical analysis


[94/10] 6-O-Acetyl-D-glucose
  Previous attempt: round2_llm_failed
  Reason unmapped: 6-O-Acetyl-D-glucose is a modified form of D-glucose with an acetyl group, and there is no direct ma...
  GPT-5 Decision: UNMAPPABLE
  Compound ID: UNMAPPED
  Explanation: 1) Chemical structure: 6-O-Acetyl-D-glucose is D-glucose in which the primary hydroxyl at C6 (the CH2OH group) is esterified with an acetyl (CH3CO–) g...
  ⚠️  Caveats: ModelSEED likely lacks small-molecule O-acetylated monosaccharides; using D-glucose as a placeholder...
  ⚠️  MANUAL REVIEW REQUIRED
  Confirmed UNMAPPABLE

[100/10] D-Gluconic Acid sodium salt
  Previous attempt: round2_llm_unverified
  Reason unmapped: LLM suggested cpd00257 but not found in template...
  GPT-5 Decision: APPROXIMATE
  Compound ID: UNMAPPED
  Explanation: D-Gluconic acid sodium salt is the sodium sa

## Decision Summary: Mapping Results

**PURPOSE**: Comprehensive report of all mapping decisions before creating media files.

**This section provides**:
1. Final mapping statistics
2. List of all flagged issues from Round 3 (duplicates)
3. List of all Round 4 decisions (unmapped compounds)
4. Clear identification of compounds requiring manual review
5. Summary of the impact on CDMSCI-199

**⚠️ MANUAL REVIEW CHECKPOINT ⚠️**

**DO NOT proceed to media generation** until you have:
1. Reviewed Round 3 validation results
2. Reviewed Round 4 deep dive results  
3. Made manual corrections to problematic mappings
4. Updated `carbon_source_mapping.csv` as needed

**Files to review**:
- `results/round3_duplicate_validation.csv`
- `results/round4_deep_dive_results.csv`
- `results/carbon_source_mapping.csv`
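
As a possible aid for the review checkpoint, the sketch below collects the carbon sources that Round 3 flagged, grouped by ModelSEED ID. It assumes the column names written by the Round 3 cell (`ModelSEED_ID`, `Incorrect_Sources`, `Questionable_Sources`) and the `; ` separator used there. The function name `flagged_sources` is hypothetical, and the code reads CSV text using only the standard library so it can be tried without the result files.

```python
import csv
import io


def flagged_sources(csv_text: str) -> dict:
    """Map each ModelSEED ID to the carbon sources flagged incorrect or questionable."""
    flagged = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        names = []
        for column in ("Incorrect_Sources", "Questionable_Sources"):
            value = (row.get(column) or "").strip()
            if value:
                # Round 3 joins source names with "; ", so split on the semicolon.
                names.extend(part.strip() for part in value.split(";"))
        if names:
            flagged[row["ModelSEED_ID"]] = names
    return flagged
```

For the real file, something like `flagged_sources(Path('results/round3_duplicate_validation.csv').read_text())` should work, assuming no carbon source name contains a semicolon.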

In [15]:
print("\n" + "="*80)
print("DECISION SUMMARY: MAPPING RESULTS")
print("="*80)

# Create final DataFrame
mapping_df_final = pd.DataFrame(mappings)

# Overall statistics
total = len(mapping_df_final)
mapped = (mapping_df_final['ModelSEED_ID'] != 'UNMAPPED').sum()
unmapped_final = (mapping_df_final['ModelSEED_ID'] == 'UNMAPPED').sum()

round1_mapped = mapping_df_final['Mapping_Method'].str.startswith('round1').sum()
round2_mapped = (mapping_df_final['Mapping_Method'].str.startswith('round2') & 
                 (mapping_df_final['ModelSEED_ID'] != 'UNMAPPED')).sum()
round4_mapped = (mapping_df_final['Mapping_Method'].str.startswith('round4') & 
                 (mapping_df_final['ModelSEED_ID'] != 'UNMAPPED')).sum()

print("\n" + "="*80)
print("FINAL MAPPING STATISTICS")
print("="*80)
print(f"\nTotal carbon sources: {total}")
print(f"\nSuccessfully mapped: {mapped} ({100*mapped/total:.1f}%)")
print(f"  Round 1 (automated): {round1_mapped}")
print(f"  Round 2 (GPT-4o): {round2_mapped}")
print(f"  Round 4 (GPT-5 deep dive): {round4_mapped}")
print(f"\nStill unmapped: {unmapped_final} ({100*unmapped_final/total:.1f}%)")

# Round 3 issues
print("\n" + "="*80)
print("ROUND 3: DUPLICATE VALIDATION ISSUES")
print("="*80)

validation_file = OUTPUT_DIR / 'round3_duplicate_validation.csv'
if validation_file.exists():
    validation_df = pd.read_csv(validation_file)
    total_incorrect = validation_df['Incorrect_Count'].sum()
    total_questionable = validation_df['Questionable_Count'].sum()
    
    print(f"\nTotal duplicate groups validated: {len(validation_df)}")
    print(f"Total incorrect mappings flagged: {total_incorrect}")
    print(f"Total questionable mappings: {total_questionable}")
    
    if total_incorrect > 0:
        print(f"\n⚠️  INCORRECT MAPPINGS REQUIRING CORRECTION:")
        for _, row in validation_df[validation_df['Incorrect_Count'] > 0].iterrows():
            expl = row.get('Explanation', '')
            if pd.isna(expl):
                expl = ''
            print(f"\n  {row['ModelSEED_ID']} ({row['ModelSEED_Name']}, {row['Formula']})")
            print(f"    Incorrect: {row['Incorrect_Sources']}")
            print(f"    Reason: {str(expl)[:150]}...")

    if total_questionable > 0:
        print(f"\n⚠️  QUESTIONABLE MAPPINGS REQUIRING REVIEW:")
        for _, row in validation_df[validation_df['Questionable_Count'] > 0].iterrows():
            print(f"\n  {row['ModelSEED_ID']}: {row['Questionable_Sources']}")
else:
    print("\n  No duplicate validation results found")


# Round 4 decisions
print("\n" + "="*80)
print("ROUND 4: DEEP DIVE DECISIONS")
print("="*80)

round4_file = OUTPUT_DIR / 'round4_deep_dive_results.csv'
if round4_file.exists():
    round4_df = pd.read_csv(round4_file)
    print(f"\nTotal compounds analyzed: {len(round4_df)}")
    
    for _, row in round4_df.iterrows():
        print(f"\n  {row['Carbon_Source']}")
        print(f"    Decision: {row['Decision'].upper()}")
        print(f"    ModelSEED ID: {row['ModelSEED_ID']}")
        caveats = row.get('Caveats', '')
        if not pd.isna(caveats) and str(caveats).strip():
            print(f"    ⚠️  Caveats: {str(caveats)[:100]}...")
        needs_review = row.get('Manual_Review_Needed', False)
        # The CSV column may come back as bool, str, or NaN; only explicit true values count as True
        needs_review = str(needs_review).strip().lower() in ('true', '1', '1.0')
        if needs_review:
            print(f"    ⚠️  REQUIRES MANUAL REVIEW")
else:
    print("\n  No Round 4 results (all compounds mapped in earlier rounds)")


DECISION SUMMARY: MAPPING RESULTS

FINAL MAPPING STATISTICS

Total carbon sources: 140

Successfully mapped: 130 (92.9%)
  Round 1 (automated): 86
  Round 2 (GPT-4o): 44
  Round 4 (GPT-5 deep dive): 0

Still unmapped: 10 (7.1%)

ROUND 3: DUPLICATE VALIDATION ISSUES

Total duplicate groups validated: 14
Total incorrect mappings flagged: 16
Total questionable mappings: 0

⚠️  INCORRECT MAPPINGS REQUIRING CORRECTION:

  cpd00108 (Galactose, C6H12O6)
    Incorrect: D-Maltose monohydrate
    Reason: cpd00108 has formula C6H12O6, characteristic of a monosaccharide hexose. D-Galactose matches this formula (stereoisomer of other C6H12O6 sugars), so i...

  cpd00751 (L-Fucose, C6H12O5)
    Incorrect: 1,4-Butanediol; 5-Keto-D-Gluconic Acid potassium salt; D-(-)-tagatose; Sodium adipate
    Reason: ModelSEED cpd00751 has formula C6H12O5, characteristic of 6-deoxyhexoses (e.g., fucose/rhamnose). Of the listed carbon sources, only L-Fucose matches ...

  cpd00122 (N-Acetyl-D-glucosamine, C8H15NO6)

## Save Final Mapping Table

Save the complete mapping table with all decisions from Rounds 1-4.

In [16]:
print(f"\nSaving final mapping table to: {MAPPING_FILE}")
mapping_df = pd.DataFrame(mappings)
mapping_df.to_csv(MAPPING_FILE, index=False)
print(f"✓ Saved {len(mapping_df)} mappings")

# Display first 20 rows
print(f"\nFirst 20 mappings:")
display(mapping_df.head(20))

# Display unmapped
unmapped_df = mapping_df[mapping_df['ModelSEED_ID'] == 'UNMAPPED']
if len(unmapped_df) > 0:
    print(f"\nUnmapped compounds ({len(unmapped_df)}):")
    for idx, row in unmapped_df.iterrows():
        print(f"  - {row['Carbon_Source_Original']}: {row['AI_Explanation'][:80]}...")


Saving final mapping table to: results/carbon_source_mapping.csv
✓ Saved 140 mappings

First 20 mappings:


| | Carbon_Source_Original | ModelSEED_ID | ModelSEED_Name | Formula | Mass | Charge | Mapping_Method | Confidence | AI_Explanation | Duplicate_IDs |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1,2-Propanediol | cpd00453 | 1,2-Propanediol | C3H8O2 | 76 | 0 | round1_template | High | | |
| 1 | 1-Pentanol | cpd16586 | 1-Pentanol | C5H12O | 88 | 0 | round1_template | High | | |
| 2 | 2-Deoxy-D-Ribose | cpd01242 | Thyminose | C5H10O4 | 134 | 0 | round1_modelseed_local | High | | |
| 3 | 2-methyl-1-butanol | cpd16873 | 2-methyl-1-butanol | C5H12O | 0 | 0 | round1_template | High | | |
| 4 | 3-Methyl-2-Oxobutanoic Acid | cpd00123 | 3-Methyl-2-oxobutanoate | C5H7O3 | 115 | -1 | round1_modelseed_local | High | | |
| 5 | 3-methyl-1-butanol | cpd04533 | Isoamyl alcohol | C5H12O | 88 | 0 | round1_modelseed_local | High | | |
| 6 | 4-Aminobutyric acid | cpd00281 | GABA | C4H9NO2 | 103 | 0 | round1_modelseed_local | High | | |
| 7 | 4-Hydroxybenzoic Acid | cpd00136 | 4-Hydroxybenzoate | C7H5O3 | 137 | -1 | round1_modelseed_local | High | | |
| 8 | 5-Aminovaleric acid | cpd00339 | 5-Aminopentanoate | C5H11NO2 | 117 | 0 | round1_modelseed_local | High | | |
| 9 | Adenosine | cpd00182 | Adenosine | C10H13N5O4 | 267 | 0 | round1_template | High | | |



Unmapped compounds (10):
  - 6-O-Acetyl-D-glucose: 6-O-Acetyl-D-glucose is a modified form of D-glucose with an acetyl group, and t...
  - D-Gluconic Acid sodium salt: LLM suggested cpd00257 but not found in template...
  - D-Glucuronic Acid: LLM suggested cpd00257 but not found in template...
  - Dodecandioic acid: LLM suggested cpd29673 but not found in template...
  - Gly-DL-Asp: Gly-DL-Asp is a dipeptide composed of glycine and aspartic acid, and ModelSEED m...
  - L-(-)-sorbose: L-(-)-sorbose is not a common carbon source in the ModelSEED database, and there...
  - L-Rhamnose monohydrate: LLM suggested cpd08397 but not found in template...
  - Lacto-N-neotetraose: Lacto-N-neotetraose is a specific oligosaccharide and does not have a direct mat...
  - Maltitol: Maltitol is a sugar alcohol derived from maltose, and it does not have a direct ...
  - Methyl-B-D-galactopyranoside: Methyl-B-D-galactopyranoside is a specific methylated sugar derivative that does...


## Round 5: Apply Manual Corrections

Apply manual corrections from the Round 3 validation and verify all mappings against the ModelSEED template.

**Input**: `Manual_review_media_cpds.csv` (manual corrections file)

**What this does**:
1. Loads manual corrections based on Round 3 GPT-5 validation
2. Applies corrections to fix incorrect AI mappings
3. Validates all compounds exist in ModelSEED template
4. Updates carbon_source_mapping.csv with corrected mappings

**Why needed**: Round 2 (GPT-4o) hallucinated some mappings; Round 3 flagged 16 mappings as incorrect across the duplicate groups. These must be corrected before generating media files.
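
The correction logic can also be expressed as a small pure function, which makes it easier to test apart from the notebook state. The sketch below is a simplified illustration and not the notebook's implementation: it uses the correction-file column names `Carbon Source` and `Manual Review`, takes the set of template IDs as an argument, and only updates `ModelSEED_ID` and `Mapping_Method`. The function name `apply_corrections` and the ID `cpd99999` in the usage example are placeholders.

```python
def apply_corrections(mappings, corrections, template_ids):
    """Return corrected copies of `mappings`; IDs missing from the template become UNMAPPED."""
    # Index the manual decisions by carbon source name.
    by_source = {c["Carbon Source"]: str(c["Manual Review"]).strip() for c in corrections}
    corrected = []
    for mapping in mappings:
        mapping = dict(mapping)  # copy so the caller's records are not mutated
        new_id = by_source.get(mapping["Carbon_Source_Original"])
        if new_id is not None:
            # A corrected ID is only usable if the template contains it.
            if new_id != "UNMAPPED" and new_id not in template_ids:
                new_id = "UNMAPPED"
            mapping["ModelSEED_ID"] = new_id
            mapping["Mapping_Method"] = "manual_correction"
        corrected.append(mapping)
    return corrected


# Usage with placeholder data: cpd99999 stands in for a corrected ID.
example = apply_corrections(
    [{"Carbon_Source_Original": "D-Maltose monohydrate",
      "ModelSEED_ID": "cpd00108", "Mapping_Method": "round2_llm"}],
    [{"Carbon Source": "D-Maltose monohydrate", "Manual Review": "cpd99999"}],
    template_ids={"cpd99999"},
)
```

Working on copies means the pre-correction records stay available for comparison, at the cost of holding both lists in memory.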

In [None]:
# Load ModelSEED Template for Validation
print("="*80)
print("LOADING MODELSEED TEMPLATE FOR VALIDATION")
print("="*80)

# Load ModelSEED template to validate compounds
template_path = TEMPLATE_PATH  # same template path defined in the Configuration cell
print(f"\nLoading template from: {template_path}")

with open(template_path, 'r') as f:
    template = json.load(f)

# Build lookup dict for compounds
compound_lookup = {}
for cpd in template['compounds']:
    compound_lookup[cpd['id']] = {
        'name': cpd.get('name', ''),
        'formula': cpd.get('formula', ''),
        'mass': cpd.get('mass', 0),
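        # Round 4 reads the charge from 'defaultCharge'; fall back to 'charge' if that key is absent
        'charge': cpd.get('defaultCharge', cpd.get('charge', 0))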
    }

print(f"✓ Loaded {len(compound_lookup)} compounds from template")
print()

In [None]:
# Apply Manual Corrections
print("="*80)
print("APPLYING MANUAL CORRECTIONS")
print("="*80)
print()
print("Applying manual corrections based on Round 3 validation and template verification")
print()

# Check if manual review file exists
manual_review_file = Path('Manual_review_media_cpds.csv')

if manual_review_file.exists():
    print(f"Loading manual corrections from: {manual_review_file}")
    manual_corrections = pd.read_csv(manual_review_file)
    print(f"Found {len(manual_corrections)} manual corrections")
    print()

    # Convert current mappings to DataFrame for easier manipulation
    mapping_df = pd.DataFrame(mappings)

    corrections_applied = 0
    corrections_failed = []

    for _, correction_row in manual_corrections.iterrows():
        carbon_source = correction_row['Carbon Source']
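        # Coerce to str and strip so blank cells or stray whitespace do not break the comparisons below
        old_cpd = str(correction_row['Current (Wrong)']).strip()
        new_cpd = str(correction_row['Manual Review']).strip()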

        # Find the mapping
        mask = mapping_df['Carbon_Source_Original'] == carbon_source

        if not mask.any():
            print(f"WARNING: '{carbon_source}' not found in mappings")
            corrections_failed.append((carbon_source, "Not found in mappings"))
            continue

        # Get current ModelSEED_ID
        current_cpd = mapping_df.loc[mask, 'ModelSEED_ID'].values[0]

        # Verify it matches expected old value
        if current_cpd.strip() != old_cpd.strip():
            print(f"WARNING: Expected {old_cpd} but found {current_cpd} for '{carbon_source}'")

        # Apply correction
        if new_cpd == 'UNMAPPED':
            # Mark as UNMAPPED
            mapping_df.loc[mask, 'ModelSEED_ID'] = 'UNMAPPED'
            mapping_df.loc[mask, 'ModelSEED_Name'] = ''
            mapping_df.loc[mask, 'Formula'] = ''
            mapping_df.loc[mask, 'Mass'] = 0
            mapping_df.loc[mask, 'Charge'] = 0
            mapping_df.loc[mask, 'Mapping_Method'] = 'manual_correction'
            mapping_df.loc[mask, 'Confidence'] = 'High'
            mapping_df.loc[mask, 'AI_Explanation'] = f"Manual correction: No valid ModelSEED mapping found. Previously incorrectly mapped to {old_cpd}."

            print(f"✓ {carbon_source}")
            print(f"  {old_cpd} → UNMAPPED")
            corrections_applied += 1
        else:
            # Look up new compound in template
            if new_cpd in compound_lookup:
                compound_info = compound_lookup[new_cpd]

                # Apply correction
                mapping_df.loc[mask, 'ModelSEED_ID'] = new_cpd
                mapping_df.loc[mask, 'ModelSEED_Name'] = compound_info['name']
                mapping_df.loc[mask, 'Formula'] = compound_info['formula']
                mapping_df.loc[mask, 'Mass'] = compound_info['mass']
                mapping_df.loc[mask, 'Charge'] = compound_info['charge']
                mapping_df.loc[mask, 'Mapping_Method'] = 'manual_correction'
                mapping_df.loc[mask, 'Confidence'] = 'High'
                mapping_df.loc[mask, 'AI_Explanation'] = f"Manual correction: {compound_info['name']} ({compound_info['formula']}). Previously incorrectly mapped to {old_cpd}."

                print(f"✓ {carbon_source}")
                print(f"  {old_cpd} → {new_cpd} ({compound_info['name']}, {compound_info['formula']})")
                corrections_applied += 1
            else:
                # Compound not in template - mark as UNMAPPED
                print(f"WARNING: {carbon_source}: {new_cpd} not in template - marking as UNMAPPED")

                mapping_df.loc[mask, 'ModelSEED_ID'] = 'UNMAPPED'
                mapping_df.loc[mask, 'ModelSEED_Name'] = ''
                mapping_df.loc[mask, 'Formula'] = ''
                mapping_df.loc[mask, 'Mass'] = 0
                mapping_df.loc[mask, 'Charge'] = 0
                mapping_df.loc[mask, 'Mapping_Method'] = 'manual_correction'
                mapping_df.loc[mask, 'Confidence'] = 'High'
                mapping_df.loc[mask, 'AI_Explanation'] = f"Manual correction: Compound not found in ModelSEED template (GramNegModelTemplateV6). Cannot be used in FBA simulations. Previously incorrectly mapped to {old_cpd}."

                corrections_applied += 1
        print()

    # Update mappings list with corrected data
    mappings = mapping_df.to_dict('records')

    print("="*80)
    print("MANUAL CORRECTIONS SUMMARY")
    print("="*80)
    print(f"Corrections applied: {corrections_applied}")
    print(f"Corrections failed: {len(corrections_failed)}")

    if corrections_failed:
        print("\nFailed corrections:")
        for source, reason in corrections_failed:
            print(f"  - {source}: {reason}")

    # Calculate final statistics
    mapped_count = len([m for m in mappings if m['ModelSEED_ID'] != 'UNMAPPED'])
    unmapped_count = len([m for m in mappings if m['ModelSEED_ID'] == 'UNMAPPED'])

    print(f"\nFinal statistics:")
    print(f"  Total carbon sources: {len(mappings)}")
    print(f"  Mapped: {mapped_count} ({mapped_count/len(mappings)*100:.1f}%)")
    print(f"  Unmapped: {unmapped_count} ({unmapped_count/len(mappings)*100:.1f}%)")

else:
    print(f"WARNING: Manual review file not found: {manual_review_file}")
    print("Proceeding with Round 4 results (may contain incorrect mappings)")
    print()

print()

In [None]:
# Validate All Mapped Compounds Against Template
print("="*80)
print("TEMPLATE VALIDATION CHECK")
print("="*80)
print()
print("Verifying all mapped compounds exist in ModelSEED template...")
print()

not_in_template = []
mapped_only = [m for m in mappings if m['ModelSEED_ID'] != 'UNMAPPED']

for mapping in mapped_only:
    cpd_id = mapping['ModelSEED_ID']
    carbon_source = mapping['Carbon_Source_Original']

    if cpd_id not in compound_lookup:
        not_in_template.append((carbon_source, cpd_id))
        print(f"ERROR: {carbon_source} → {cpd_id} (NOT IN TEMPLATE)")

if not_in_template:
    print()
    print(f"ERROR: {len(not_in_template)} mapped compounds NOT in template!")
    print("These compounds cannot be used in FBA simulations.")
    print()
    print("Action: Marking as UNMAPPED")

    # Automatically mark as UNMAPPED
    for carbon_source, cpd_id in not_in_template:
        for i, mapping in enumerate(mappings):
            if mapping['Carbon_Source_Original'] == carbon_source:
                mappings[i]['ModelSEED_ID'] = 'UNMAPPED'
                mappings[i]['ModelSEED_Name'] = ''
                mappings[i]['Formula'] = ''
                mappings[i]['Mass'] = 0
                mappings[i]['Charge'] = 0
                mappings[i]['Mapping_Method'] = 'template_validation_failed'
                mappings[i]['Confidence'] = 'High'
                mappings[i]['AI_Explanation'] = f"Template validation failed: {cpd_id} not found in GramNegModelTemplateV6. Cannot be used in models."
                print(f"  → Marked as UNMAPPED: {carbon_source}")

    # Recalculate statistics
    mapped_count = len([m for m in mappings if m['ModelSEED_ID'] != 'UNMAPPED'])
    unmapped_count = len([m for m in mappings if m['ModelSEED_ID'] == 'UNMAPPED'])

    print()
    print("Updated statistics:")
    print(f"  Mapped: {mapped_count} ({mapped_count/len(mappings)*100:.1f}%)")
    print(f"  Unmapped: {unmapped_count} ({unmapped_count/len(mappings)*100:.1f}%)")
else:
    print(f"✓ All {len(mapped_only)} mapped compounds exist in template")
    print()
    print("Template validation: PASSED")

print()
print("="*80)

In [None]:
# Save Final Corrected Mappings
print("="*80)
print("SAVING FINAL CORRECTED MAPPINGS")
print("="*80)
print()

# Save to CSV
final_mapping_df = pd.DataFrame(mappings)
output_file = OUTPUT_DIR / 'carbon_source_mapping.csv'
final_mapping_df.to_csv(output_file, index=False)

print(f"✓ Saved final mapping table: {output_file}")
print(f"  Total rows: {len(final_mapping_df)}")

# Summary by mapping method
print("\nMappings by method:")
method_counts = final_mapping_df['Mapping_Method'].value_counts()
for method, count in method_counts.items():
    print(f"  {method}: {count}")

# Summary by status
mapped = final_mapping_df[final_mapping_df['ModelSEED_ID'] != 'UNMAPPED']
unmapped = final_mapping_df[final_mapping_df['ModelSEED_ID'] == 'UNMAPPED']

print(f"\nFinal status:")
print(f"  Mapped: {len(mapped)} ({len(mapped)/len(final_mapping_df)*100:.1f}%)")
print(f"  Unmapped: {len(unmapped)} ({len(unmapped)/len(final_mapping_df)*100:.1f}%)")

print()
print("="*80)
print()

## Generate Media JSON Files

Create individual media formulation files for each mapped carbon source.
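
Each media file combines the shared base formulation with one carbon source, and the file name is derived from the carbon source name. The sketch below shows that idea with two hypothetical helpers, `safe_media_filename` and `build_media`. The regex-based naming is an alternative to the chained `replace` calls in the notebook and produces different names for some compounds, so treat it as an illustration, not a drop-in replacement.

```python
import json
import re


def safe_media_filename(carbon_source: str) -> str:
    """Replace every run of characters outside [A-Za-z0-9._-] with one underscore."""
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", carbon_source).strip("_")
    return f"{stem or 'unnamed'}.json"


def build_media(base_media: dict, cpd_id: str, uptake: float = -5, upper: float = 100) -> dict:
    """Copy the base media and add one carbon source with (lower, upper) flux bounds."""
    media = dict(base_media)
    media[cpd_id] = (uptake, upper)
    return media


# Usage: bounds are tuples in Python but are written as two-element lists in JSON.
media = build_media({"cpd00007": (-10, 100)}, "cpd00137")
serialized = json.dumps(media, indent=2)
```

Because JSON has no tuple type, any code that reads these files back receives lists and should not rely on tuple behavior.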

In [17]:
# Base media formulation (same for all)
BASE_MEDIA = {
    'cpd00007': (-10, 100),    # O2
    'cpd00001': (-100, 100),   # H2O
    'cpd00009': (-100, 100),   # Phosphate
    'cpd00013': (-100, 100),   # NH3
    'cpd00048': (-100, 100),   # Sulfate
    'cpd00099': (-100, 100),   # Cl-
    'cpd00067': (-100, 100),   # H+
    'cpd00205': (-100, 100),   # K+
    'cpd00254': (-100, 100),   # Mg2+
    'cpd00971': (-100, 100),   # Na+
    'cpd00149': (-100, 100),   # Co2+
    'cpd00063': (-100, 100),   # Ca2+
    'cpd00058': (-100, 100),   # Cu2+
    'cpd00034': (-100, 100),   # Zn2+
    'cpd00030': (-100, 100),   # Mn2+
    'cpd10515': (-100, 100),   # Fe2+
    'cpd10516': (-100, 100),   # Fe3+
    'cpd11574': (-100, 100),   # Molybdate
    'cpd00244': (-100, 100),   # Ni2+
}

# Carbon source uptake rate (negative = uptake)
CARBON_UPTAKE_RATE = -5

print("Base media formulation loaded")
print(f"  Base nutrients: {len(BASE_MEDIA)}")
print(f"  Carbon uptake rate: {CARBON_UPTAKE_RATE} mmol/gDW/hr")

Base media formulation loaded
  Base nutrients: 19
  Carbon uptake rate: -5 mmol/gDW/hr
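
Each media file is the base formulation plus exactly one carbon source. The following is a minimal sketch of that step as a function, not the notebook's actual code: `build_media` and `EXAMPLE_BASE` are hypothetical names, and the base medium is trimmed to two compounds for illustration. It also shows that `(lower, upper)` tuples come back as lists after a JSON round trip.

```python
import json

# Trimmed-down base medium for illustration only (the notebook's BASE_MEDIA has 19 entries).
EXAMPLE_BASE = {
    "cpd00007": (-10, 100),   # O2
    "cpd00001": (-100, 100),  # H2O
}


def build_media(base, carbon_cpd_id, uptake_rate=-5, max_secretion=100):
    """Return a new media dict: base nutrients plus one carbon source.

    Hypothetical helper. Raises if the carbon source is already a base
    nutrient, because overwriting it would silently change the base bounds.
    """
    if carbon_cpd_id in base:
        raise ValueError(f"{carbon_cpd_id} is already in the base medium")
    media = dict(base)  # a shallow copy is enough: the values are immutable tuples
    media[carbon_cpd_id] = (uptake_rate, max_secretion)
    return media


# cpd01242 is the ID this notebook reports for 2-Deoxy-D-Ribose
media = build_media(EXAMPLE_BASE, "cpd01242")

# JSON has no tuple type, so bounds are lists after saving and reloading
round_tripped = json.loads(json.dumps(media))
```

Guarding against a carbon source that is also a base nutrient is a design choice of this sketch; the notebook cell below simply overwrites the entry.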


In [18]:
print("\nGenerating media JSON files...")
print("="*80)

media_generated = 0
media_skipped = 0

for idx, row in mapping_df.iterrows():
    carbon_source = row['Carbon_Source_Original']
    cpd_id = row['ModelSEED_ID']
    
    if cpd_id == 'UNMAPPED':
        print(f"  Skipping {carbon_source} (unmapped)")
        media_skipped += 1
        continue
    
    # Create media formulation
    media_dict = BASE_MEDIA.copy()
    media_dict[cpd_id] = (CARBON_UPTAKE_RATE, 100)
    
    # Create safe filename
    safe_filename = carbon_source.replace('/', '_').replace(' ', '_').replace(',', '')
    safe_filename = safe_filename.replace('(', '').replace(')', '')
    media_file = MEDIA_DIR / f"{safe_filename}.json"
    
    # Save media file
    with open(media_file, 'w') as f:
        json.dump(media_dict, f, indent=2)
    
    media_generated += 1
    
    if media_generated <= 5:
        print(f"  Generated: {media_file.name}")

if media_generated > 5:
    print(f"  ... and {media_generated - 5} more files")

print(f"\n{'='*80}")
print(f"Media generation complete")
print(f"  Generated: {media_generated}")
print(f"  Skipped: {media_skipped}")
print(f"  Output directory: {MEDIA_DIR}")
print(f"{'='*80}")


Generating media JSON files...
  Generated: 12-Propanediol.json
  Generated: 1-Pentanol.json
  Generated: 2-Deoxy-D-Ribose.json
  Generated: 2-methyl-1-butanol.json
  Generated: 3-Methyl-2-Oxobutanoic_Acid.json
  Skipping 6-O-Acetyl-D-glucose (unmapped)
  Skipping D-Gluconic Acid sodium salt (unmapped)
  Skipping D-Glucuronic Acid (unmapped)
  Skipping Dodecandioic acid (unmapped)
  Skipping Gly-DL-Asp (unmapped)
  Skipping L-(-)-sorbose (unmapped)
  Skipping L-Rhamnose monohydrate (unmapped)
  Skipping Lacto-N-neotetraose (unmapped)
  Skipping Maltitol (unmapped)
  Skipping Methyl-B-D-galactopyranoside (unmapped)
  ... and 125 more files

Media generation complete
  Generated: 130
  Skipped: 10
  Output directory: media
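
The filename cleanup above drops commas, so `1,2-Propanediol` is written as `12-Propanediol.json`, and two differently punctuated names could in principle map to the same file and overwrite each other. The sketch below is one possible alternative, not the scheme used to produce the files listed above: `safe_media_filename` and `assign_filenames` are hypothetical helpers that keep commas as underscores and raise on any collision.

```python
import re


def safe_media_filename(name):
    """Hypothetical sanitizer: map spaces, '/' and ',' to '_', drop other unsafe characters."""
    name = name.strip().replace(" ", "_").replace("/", "_")
    # Keep a separator where the comma was, so "1,2-Propanediol" stays readable
    name = name.replace(",", "_")
    return re.sub(r"[^A-Za-z0-9._-]", "", name) + ".json"


def assign_filenames(carbon_sources):
    """Map each carbon source name to a filename, raising if two names collide."""
    seen = {}  # filename -> carbon source that claimed it
    for source in carbon_sources:
        filename = safe_media_filename(source)
        if filename in seen:
            raise ValueError(
                f"{source!r} and {seen[filename]!r} both map to {filename}"
            )
        seen[filename] = source
    return {source: filename for filename, source in seen.items()}


names = assign_filenames(["1,2-Propanediol", "D-Glucuronic Acid"])
```

Running `assign_filenames` over the full `Carbon_Source_Original` column before writing any files would surface collisions up front instead of after a file has been overwritten.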


## Usage Example

How to load media files for ModelSEEDpy:

In [19]:
# Example: load one generated media file
print("Example: Loading media for 2-Deoxy-D-Ribose")
print("="*80)

# Use one of the generated media files as the example
example_file = MEDIA_DIR / "2-Deoxy-D-Ribose.json"

if example_file.exists():
    with open(example_file) as f:
        media_dict = json.load(f)
    
    print(f"\nLoaded media from: {example_file.name}")
    print(f"Total compounds: {len(media_dict)}")
    print("\nCarbon source:")
    for cpd_id, bounds in media_dict.items():
        # The carbon source is the only compound whose lower bound equals CARBON_UPTAKE_RATE
        if bounds[0] == CARBON_UPTAKE_RATE:
            print(f"  {cpd_id}: {bounds}")
    
    print("\nUsage with ModelSEEDpy:")
    print("  from modelseedpy import MSMedia")
    print("  media = MSMedia.from_dict(media_dict)")
    print("  model.medium = media.get_media_constraints()")
else:
    print("Example media file not found")

Example: Loading media for 2-Deoxy-D-Ribose

Loaded media from: 2-Deoxy-D-Ribose.json
Total compounds: 20

Carbon source:
  cpd01242: [-5, 100]

Usage with ModelSEEDpy:
  from modelseedpy import MSMedia
  media = MSMedia.from_dict(media_dict)
  model.medium = media.get_media_constraints()
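
The cell above finds the carbon source by matching its lower bound against `CARBON_UPTAKE_RATE`, which breaks if a base nutrient is ever given the same bound. A minimal sketch of a more robust approach compares the loaded compound IDs against the base formulation instead. `load_media` and `find_carbon_sources` are hypothetical helpers, and the demo uses a temporary file with a two-compound base medium rather than the real `media/` directory.

```python
import json
import tempfile
from pathlib import Path


def load_media(path):
    """Load a media JSON file and restore (lower, upper) tuples."""
    with open(path) as f:
        raw = json.load(f)
    return {cpd_id: tuple(bounds) for cpd_id, bounds in raw.items()}


def find_carbon_sources(media, base_ids):
    """Return the compounds present in the media but absent from the base formulation."""
    return sorted(set(media) - set(base_ids))


# Self-contained demo: a temporary file stands in for media/<name>.json
base_ids = {"cpd00007", "cpd00001"}
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo.json"
    demo_file.write_text(json.dumps({
        "cpd00007": [-10, 100],
        "cpd00001": [-100, 100],
        "cpd01242": [-5, 100],
    }))
    demo_media = load_media(demo_file)

carbon = find_carbon_sources(demo_media, base_ids)
```

In the notebook, `base_ids` would be `BASE_MEDIA.keys()`, and a result with anything other than one entry would flag a malformed media file.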


## Summary

**Intermediate Files Created**:
1. `results/round1_mapped.csv` - Compounds mapped in Round 1 (automated)
2. `results/round1_unmapped.txt` - Compounds that need Round 2
3. `results/round2_all_mappings.csv` - All mappings after Round 2
4. `results/round2_still_unmapped.csv` - Compounds still unmapped after Round 2 (if any)

**Final Output Files**:
1. `results/carbon_source_mapping.csv` - Complete final mapping table
2. `media/*.json` - Individual media formulation files (one per mapped carbon source)

**Workflow Summary**:
- Round 1: Automated search (template + local database)
- Round 2: AI-assisted mapping with GPT-4o
- Round 3: Optional deep dive with GPT-5 (o3) for difficult cases
- Final: Combined mapping table and media generation

**Next Steps**:
1. Review unmapped compounds (if any) and perform manual curation
2. Update mapping CSV with manual corrections if needed
3. Re-run media generation section if needed
4. Proceed to CDMSCI-198: Build metabolic models
5. Proceed to CDMSCI-199: Run FBA simulations with these media
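
Steps 1 to 3 can be sketched as follows, assuming manual curation yields a dictionary from carbon source name to compound ID. This is an illustration, not part of the notebook: `apply_manual_corrections` is a hypothetical helper, the `manual` method label and the `none`/`template` labels in the demo rows are made up, and `cpd99999` is a placeholder rather than a curated ID. Only the column names come from the mapping table above.

```python
import csv
import io


def apply_manual_corrections(rows, corrections):
    """Return new rows with ModelSEED_ID replaced for curated carbon sources.

    `corrections` maps Carbon_Source_Original -> ModelSEED compound ID.
    The input rows are left untouched.
    """
    updated = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are not modified
        new_id = corrections.get(row["Carbon_Source_Original"])
        if new_id is not None:
            row["ModelSEED_ID"] = new_id
            row["Mapping_Method"] = "manual"  # hypothetical label
        updated.append(row)
    return updated


# In-memory stand-in for results/carbon_source_mapping.csv
demo_csv = (
    "Carbon_Source_Original,ModelSEED_ID,Mapping_Method\n"
    "Maltitol,UNMAPPED,none\n"
    "2-Deoxy-D-Ribose,cpd01242,template\n"
)
rows = list(csv.DictReader(io.StringIO(demo_csv)))

# cpd99999 is a placeholder ID, not a real curation result
corrected = apply_manual_corrections(rows, {"Maltitol": "cpd99999"})
still_unmapped = [
    r["Carbon_Source_Original"] for r in corrected if r["ModelSEED_ID"] == "UNMAPPED"
]
```

After writing the corrected rows back to the CSV, re-running the media generation cell would pick up the newly mapped compounds.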