# Map Carbon Sources to ModelSEED Compound IDs

**Parent**: CDMSCI-193 - RBTnSeq Modeling Analysis

**Ticket**: CDMSCI-197 - Translate to Computational Media Formulations

## Objective

Map 141 filtered carbon sources from CDMSCI-196 to ModelSEED compound IDs (cpd#####) for metabolic modeling.

## Input

The input is the filtered growth matrix from CDMSCI-196, which contains:
- 141 carbon sources (after removing unsuitable compounds)
- 44 organisms (after removing organisms with no growth data)

## Mapping Strategy

**Round 1: Automated Search**
1. Search local template (GramNegModelTemplateV6.json)
2. Search ModelSEED local database (offline)
3. Resolve duplicates by choosing the lowest compound ID

**Round 2: AI-Assisted Mapping**
1. Use GPT-4o (via Argo proxy) for unmapped compounds
2. Provide compound name + chemical context
3. Get ModelSEED ID suggestion with explanation

## Outputs

1. `carbon_source_mapping.csv` - Complete mapping table
2. `media/` directory - Individual media JSON files for each carbon source

**Last updated**: 2025-10-15

## Setup

In [1]:
import json
import pandas as pd
import numpy as np
from pathlib import Path
from urllib.request import urlopen, URLError
from urllib.parse import quote
import requests
import time

print("Imports successful")

Imports successful


## Configuration

In [2]:
# Paths
CARBON_SOURCES_FILE = Path('../CDMSCI-196-carbon-sources/results/combined_growth_matrix_filtered.csv')
TEMPLATE_PATH = Path('../references/build_metabolic_model/GramNegModelTemplateV6.json')

# Output paths
OUTPUT_DIR = Path('results')
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

MEDIA_DIR = Path('media')
MEDIA_DIR.mkdir(parents=True, exist_ok=True)

MAPPING_FILE = OUTPUT_DIR / 'carbon_source_mapping.csv'

# Argo proxy for LLM
ARGO_BASE_URL = 'http://localhost:8000/v1'
ARGO_MODEL = 'gpt4o'

# Local ModelSEED Database files
MODELSEED_ALIASES = Path('../data/modelseed_database/Unique_ModelSEED_Compound_Aliases.txt')
MODELSEED_NAMES = Path('../data/modelseed_database/Unique_ModelSEED_Compound_Names.txt')

print(f"Configuration set")
print(f"  Carbon sources: {CARBON_SOURCES_FILE}")
print(f"  Template: {TEMPLATE_PATH}")
print(f"  Output: {MAPPING_FILE}")
print(f"  Media directory: {MEDIA_DIR}")

Configuration set
  Carbon sources: ../CDMSCI-196-carbon-sources/results/combined_growth_matrix_filtered.csv
  Template: ../references/build_metabolic_model/GramNegModelTemplateV6.json
  Output: results/carbon_source_mapping.csv
  Media directory: media


## Load Filtered Carbon Sources

In [3]:
print("Loading filtered carbon sources from CDMSCI-196...")
growth_matrix = pd.read_csv(CARBON_SOURCES_FILE, index_col=0)

# Filter out NaN values from index
carbon_sources = [cs for cs in growth_matrix.index.tolist() if pd.notna(cs)]

print(f"\nLoaded {len(carbon_sources)} filtered carbon sources")
print(f"(These are the 141 carbon sources selected for modeling in CDMSCI-196)")
print(f"\nFirst 10 carbon sources:")
for i, cs in enumerate(carbon_sources[:10], 1):
    print(f"  {i:3d}. {cs}")

Loading filtered carbon sources from CDMSCI-196...

Loaded 141 filtered carbon sources
(These are the 141 carbon sources selected for modeling in CDMSCI-196)

First 10 carbon sources:
    1. 1,2-Propanediol
    2. 1,3-Butandiol
    3. 1,4-Butanediol
    4. 1,5-Pentanediol
    5. 1-Pentanol
    6. 2-Deoxy-D-Ribose
    7. 2-methyl-1-butanol
    8. 3-Methyl-2-Oxobutanoic Acid
    9. 3-methyl-1-butanol
   10. 3-methyl-2-oxopentanoic acid


## Load Template

In [4]:
print(f"Loading ModelSEED template: {TEMPLATE_PATH}")
with open(TEMPLATE_PATH) as f:
    template = json.load(f)

print(f"\nTemplate loaded:")
print(f"  Compounds: {len(template['compounds'])}")
print(f"  Reactions: {len(template['reactions'])}")

# Keep a reference to the compound list (the search functions scan it linearly)
template_compounds = template['compounds']
print(f"\nIndexed {len(template_compounds)} compounds for searching")

Loading ModelSEED template: ../references/build_metabolic_model/GramNegModelTemplateV6.json

Template loaded:
  Compounds: 6573
  Reactions: 8584

Indexed 6573 compounds for searching
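
The compound list loaded above is a plain Python list, so any lookup by compound ID has to scan it from the start. A minimal sketch of a dictionary index follows, assuming every compound record carries a unique `id` key. The names `toy_compounds`, `compounds_by_id`, and `lookup_by_id` are illustrative and do not appear in the notebook.

```python
# Toy stand-in for template['compounds']; the real list comes from the template JSON.
toy_compounds = [
    {"id": "cpd00027", "name": "D-Glucose", "formula": "C6H12O6"},
    {"id": "cpd00100", "name": "Glycerol", "formula": "C3H8O3"},
]

# Build the index once: compound ID -> compound record.
# Assumption: IDs are unique; a repeated ID would silently overwrite the earlier record.
compounds_by_id = {c["id"]: c for c in toy_compounds}


def lookup_by_id(compound_id):
    """Return the compound record for an ID, or None if the ID is absent."""
    return compounds_by_id.get(compound_id)


print(lookup_by_id("cpd00100")["name"])
print(lookup_by_id("cpd99999"))
```

With a few thousand compounds the linear scan is unlikely to be a bottleneck, so this is mainly a readability and scaling choice.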


In [5]:
# Load ModelSEED alias files (local, no internet needed)
print("Loading ModelSEED alias files...")

# Load compound names
compound_names = {}
with open(MODELSEED_NAMES) as f:
    next(f)  # Skip header
    for line in f:
        parts = line.strip().split('\t')
        if len(parts) >= 2:
            cpd_id = parts[0]
            name = parts[1].lower()
            if name not in compound_names:
                compound_names[name] = []
            compound_names[name].append(cpd_id)

print(f"  Loaded {len(compound_names):,} compound names")

# Load compound aliases
compound_aliases = {}
with open(MODELSEED_ALIASES) as f:
    next(f)  # Skip header
    for line in f:
        parts = line.strip().split('\t')
        if len(parts) >= 3:
            cpd_id = parts[0]
            alias = parts[1].lower()
            source = parts[2]
            if alias not in compound_aliases:
                compound_aliases[alias] = []
            compound_aliases[alias].append(cpd_id)

print(f"  Loaded {len(compound_aliases):,} compound aliases")
print("ModelSEED database ready for offline searching")

Loading ModelSEED alias files...
  Loaded 130,196 compound names
  Loaded 108,286 compound aliases
ModelSEED database ready for offline searching
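
The two loading loops above differ only in the minimum number of columns they require. A hedged sketch of a shared loader is shown below. It assumes the same tab-separated layout the loops rely on (compound ID in column 1, name or alias in column 2, one header row). The helper `load_lookup` and the toy rows are illustrative, not part of the notebook or of the ModelSEED files.

```python
import io
from collections import defaultdict


def load_lookup(handle, min_columns=2):
    """Map the lowercased text in column 2 to the list of compound IDs in column 1."""
    lookup = defaultdict(list)
    next(handle)  # skip the header row
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= min_columns:
            lookup[parts[1].lower()].append(parts[0])
    return lookup


# Made-up rows in the assumed layout; the header text is a placeholder.
toy_names = io.StringIO(
    "ModelSEED ID\tName\n"
    "cpd00027\tD-Glucose\n"
    "cpd00027\tGlucose\n"
    "cpd00100\tGlycerol\n"
)
names = load_lookup(toy_names)
print(names["glucose"])
```

For the alias file, the same function could be called with `min_columns=3` to mirror the stricter check in the original loop.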


## Search Functions

In [6]:
def search_template(compound_name):
    """Search for compound in local template"""
    matches = []
    search_lower = compound_name.lower()
    
    for compound in template_compounds:
        # Search in name
        if search_lower == compound['name'].lower():
            matches.append(compound)
            continue
        
        # Search in abbreviation
        abbr = compound.get('abbreviation', '')
        if abbr and search_lower == abbr.lower():
            matches.append(compound)
            continue
        
        # Search in aliases (substring match, unlike the exact matches above,
        # so short names can over-match)
        for alias in compound.get('aliases', []):
            if search_lower in alias.lower():
                matches.append(compound)
                break
    
    return matches


def search_template_by_id(compound_id):
    """Search for compound by ID in local template"""
    for compound in template_compounds:
        if compound['id'] == compound_id:
            return compound
    return None


def search_modelseed_local(compound_name):
    """Search ModelSEED using local alias files (offline)"""
    search_lower = compound_name.lower()
    found_ids = set()

    # Search in compound names
    if search_lower in compound_names:
        found_ids.update(compound_names[search_lower])

    # Search in aliases
    if search_lower in compound_aliases:
        found_ids.update(compound_aliases[search_lower])

    # Get compound details from template (IDs missing from the template are dropped)
    matches = []
    for cpd_id in found_ids:
        compound = search_template_by_id(cpd_id)
        if compound:
            matches.append({
                'id': cpd_id,
                'name': compound['name'],
                'formula': compound.get('formula', ''),
                'charge': compound.get('defaultCharge', 0),
                'mass': compound.get('mass', 0),
                'source': 'modelseed_local'
            })

    return matches


def search_compound_round1(compound_name):
    """Round 1: Search template and local ModelSEED database"""
    # Try template first (faster, offline)
    template_matches = search_template(compound_name)
    
    # Try local ModelSEED if no template matches
    modelseed_matches = search_modelseed_local(compound_name) if not template_matches else []
    
    # Combine and deduplicate
    all_matches = []
    seen_ids = set()
    
    for match in template_matches:
        cpd_id = match['id']
        if cpd_id not in seen_ids:
            all_matches.append({
                'id': cpd_id,
                'name': match['name'],
                'formula': match.get('formula', ''),
                'charge': match.get('defaultCharge', 0),
                'mass': match.get('mass', 0),
                'source': 'template'
            })
            seen_ids.add(cpd_id)
    
    for match in modelseed_matches:
        cpd_id = match['id']
        if cpd_id not in seen_ids:
            all_matches.append(match)
            seen_ids.add(cpd_id)
    
    # Sort by ID (lower IDs first)
    all_matches.sort(key=lambda x: x['id'])
    
    return all_matches

print("Search functions defined")

Search functions defined
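
The search functions above match a carbon source name only as written (case-insensitively), so names that differ from the database entry by a salt suffix or by acid versus anion naming fall through to Round 2. One possible refinement, sketched here as an assumption rather than something the notebook does, is to try a few spelling variants before giving up. The helper `name_variants` and its two rewrite rules are hypothetical heuristics and will not be chemically valid for every name.

```python
import re


def name_variants(name):
    """Return lowercase spelling variants of a compound name, original first."""
    base = name.lower().strip()
    variants = [base]
    # Heuristic 1: drop a trailing counter-ion phrase such as " sodium salt".
    no_salt = re.sub(r"\s+(sodium|potassium)\s+salt$", "", base)
    variants.append(no_salt)
    # Heuristic 2: rename the free acid as its anion ("...ic acid" -> "...ate").
    variants.append(re.sub(r"ic\s+acid$", "ate", no_salt))
    # Remove duplicates while preserving order.
    return list(dict.fromkeys(variants))


print(name_variants("D-Gluconic Acid sodium salt"))
```

Each variant could then be passed to the existing search functions in turn, stopping at the first hit and recording which variant matched so the mapping stays auditable.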


## Round 1: Automated Mapping

Search for ModelSEED compound IDs using local template and database files.

In [7]:
print("="*80)
print("ROUND 1: AUTOMATED MAPPING")
print("="*80)

mappings = []
unmapped = []

for i, carbon_source in enumerate(carbon_sources, 1):
    print(f"\n[{i}/{len(carbon_sources)}] {carbon_source}")
    
    matches = search_compound_round1(carbon_source)
    
    if matches:
        # Found matches
        best_match = matches[0]  # Lowest ID (already sorted)
        
        if len(matches) > 1:
            # Report duplicates
            duplicate_ids = [m['id'] for m in matches]
            print(f"  DUPLICATE: Found {len(matches)} matches: {duplicate_ids}")
            print(f"  Selected: {best_match['id']} (lowest ID)")
        else:
            print(f"  Mapped: {best_match['id']} - {best_match['name']}")
        
        mappings.append({
            'Carbon_Source_Original': carbon_source,
            'ModelSEED_ID': best_match['id'],
            'ModelSEED_Name': best_match['name'],
            'Formula': best_match['formula'],
            'Mass': best_match['mass'],
            'Charge': best_match['charge'],
            'Mapping_Method': f"round1_{best_match['source']}",
            'Confidence': 'High',
            'AI_Explanation': '',
            'Duplicate_IDs': ';'.join([m['id'] for m in matches[1:]]) if len(matches) > 1 else ''
        })
    else:
        # No matches found
        print(f"  NOT FOUND - will try LLM in Round 2")
        unmapped.append(carbon_source)

print(f"\n{'='*80}")
print(f"ROUND 1 COMPLETE")
print(f"  Mapped: {len(mappings)} ({100*len(mappings)/len(carbon_sources):.1f}%)")
print(f"  Unmapped: {len(unmapped)} ({100*len(unmapped)/len(carbon_sources):.1f}%)")
print(f"{'='*80}")

# Save Round 1 intermediate results
print(f"\nSaving Round 1 intermediate results...")
round1_df = pd.DataFrame(mappings)
round1_file = OUTPUT_DIR / 'round1_mapped.csv'
round1_df.to_csv(round1_file, index=False)
print(f"  Mapped compounds: {round1_file}")

unmapped_file = OUTPUT_DIR / 'round1_unmapped.txt'
with open(unmapped_file, 'w') as f:
    for cs in unmapped:
        f.write(f"{cs}\n")
print(f"  Unmapped compounds: {unmapped_file}")
print(f"\nIntermediate files saved. Proceeding to Round 2...")

ROUND 1: AUTOMATED MAPPING

[1/141] 1,2-Propanediol
  Mapped: cpd00453 - 1,2-Propanediol

[2/141] 1,3-Butandiol
  NOT FOUND - will try LLM in Round 2

[3/141] 1,4-Butanediol
  NOT FOUND - will try LLM in Round 2

[4/141] 1,5-Pentanediol
  NOT FOUND - will try LLM in Round 2

[5/141] 1-Pentanol
  Mapped: cpd16586 - 1-Pentanol

[6/141] 2-Deoxy-D-Ribose
  Mapped: cpd01242 - Thyminose

[7/141] 2-methyl-1-butanol
  Mapped: cpd16873 - 2-methyl-1-butanol

[8/141] 3-Methyl-2-Oxobutanoic Acid
  Mapped: cpd00123 - 3-Methyl-2-oxobutanoate

[9/141] 3-methyl-1-butanol
  Mapped: cpd04533 - Isoamyl alcohol

[10/141] 3-methyl-2-oxopentanoic acid
  NOT FOUND - will try LLM in Round 2

[11/141] 4-Aminobutyric acid
  Mapped: cpd00281 - GABA

[12/141] 4-Hydroxybenzoic Acid
  Mapped: cpd00136 - 4-Hydroxybenzoate

[13/141] 4-Hydroxyvalerate
  NOT FOUND - will try LLM in Round 2

[14/141] 4-Methyl-2-oxovaleric acid
  NOT FOUND - will try LLM in Round 2

[15/141] 5-Aminovaleric acid
  Mapped: cpd00339 - 5-Ami

## Round 2: AI-Assisted Mapping

In [8]:
def ask_llm_for_mapping(compound_name):
    """Use LLM to suggest ModelSEED compound ID"""
    
    prompt = f"""You are a biochemistry expert helping map compound names to ModelSEED database IDs.

Compound name: "{compound_name}"

Task: Suggest the most likely ModelSEED compound ID (format: cpd#####) for this compound.

Context:
- This is a carbon source from bacterial growth experiments
- ModelSEED uses standardized compound IDs (e.g., cpd00027 = D-Glucose)
- Common carbon sources: glucose (cpd00027), glycerol (cpd00100), acetate (cpd00029)
- For complex names, try to identify the base metabolite
- For salts/hydrates, use the base compound (e.g., "Citric Acid" → cpd00137 = Citrate)
- For polymers, suggest the monomer (e.g., "Amylose" → cpd00027 = Glucose)

Response format (JSON):
{{
  "compound_id": "cpd#####",
  "compound_name": "Official ModelSEED name",
  "explanation": "One-line explanation of mapping rationale",
  "confidence": "high/medium/low"
}}

If you cannot confidently map this compound, return:
{{
  "compound_id": "UNMAPPED",
  "compound_name": "",
  "explanation": "Reason why mapping is not possible",
  "confidence": "low"
}}
"""
    
    try:
        response = requests.post(
            f"{ARGO_BASE_URL}/chat/completions",
            json={
                "model": ARGO_MODEL,
                "messages": [{"role": "user", "content": prompt}],
                "temperature": 0.1,  # Low temperature for consistent reasoning
                "max_tokens": 500
            },
            timeout=90
        )
        
        if response.status_code == 200:
            result = response.json()
            content = result['choices'][0]['message']['content']
            
            # Try to parse JSON from response
            # Handle markdown code blocks if present
            if '```json' in content:
                content = content.split('```json')[1].split('```')[0].strip()
            elif '```' in content:
                content = content.split('```')[1].split('```')[0].strip()
            
            mapping = json.loads(content)
            return mapping
        else:
            print(f"    WARNING: LLM request failed with status {response.status_code}")
            return None
            
    except Exception as e:
        print(f"    WARNING: LLM request error: {e}")
        return None

print("LLM mapping function defined")

LLM mapping function defined
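
The response handling above splits on Markdown fences and then trusts whatever ID the model returns until the template check in the next cell. A hedged alternative sketch is given below: it takes the text between the first `{` and the last `}` and rejects IDs that do not follow the `cpd#####` pattern described in the Objective. The function name `parse_llm_mapping` is illustrative, and the five-digit rule is an assumption drawn from that pattern.

```python
import json
import re

# Assumption: valid IDs are "cpd" followed by exactly five digits (cpd#####).
CPD_ID_PATTERN = re.compile(r"cpd\d{5}")


def parse_llm_mapping(content):
    """Parse the JSON object embedded in an LLM reply; return None if absent or invalid."""
    start, end = content.find("{"), content.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        mapping = json.loads(content[start:end + 1])
    except json.JSONDecodeError:
        return None
    cpd_id = mapping.get("compound_id", "UNMAPPED")
    # Downgrade malformed or non-string IDs instead of passing them on.
    if not isinstance(cpd_id, str) or (
        cpd_id != "UNMAPPED" and not CPD_ID_PATTERN.fullmatch(cpd_id)
    ):
        mapping["compound_id"] = "UNMAPPED"
    return mapping


reply = 'Here is the mapping:\n```json\n{"compound_id": "cpd00027", "confidence": "high"}\n```'
print(parse_llm_mapping(reply))
```

A well-formed ID can still be the wrong compound, so this check complements the template verification rather than replacing it.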


In [9]:
if unmapped:
    print("\n" + "="*80)
    print("ROUND 2: AI-ASSISTED MAPPING")
    print("="*80)
    print(f"\nAttempting to map {len(unmapped)} compounds using LLM...")
    
    for i, carbon_source in enumerate(unmapped, 1):
        print(f"\n[{i}/{len(unmapped)}] {carbon_source}")
        
        llm_result = ask_llm_for_mapping(carbon_source)
        
        if llm_result:
            cpd_id = llm_result.get('compound_id', 'UNMAPPED')
            cpd_name = llm_result.get('compound_name', '')
            explanation = llm_result.get('explanation', '')
            confidence = llm_result.get('confidence', 'low')
            
            if cpd_id != 'UNMAPPED':
                print(f"  LLM Mapped: {cpd_id} - {cpd_name}")
                print(f"     Explanation: {explanation}")
                print(f"     Confidence: {confidence}")
                
                # Verify LLM suggestion exists in template
                verification = search_template_by_id(cpd_id)
                if verification:
                    print(f"  Verified in template")
                    mappings.append({
                        'Carbon_Source_Original': carbon_source,
                        'ModelSEED_ID': cpd_id,
                        'ModelSEED_Name': verification['name'],
                        'Formula': verification.get('formula', ''),
                        'Mass': verification.get('mass', 0),
                        'Charge': verification.get('defaultCharge', 0),
                        'Mapping_Method': 'round2_llm',
                        'Confidence': confidence.capitalize(),
                        'AI_Explanation': explanation,
                        'Duplicate_IDs': ''
                    })
                else:
                    print(f"  NOT VERIFIED - LLM suggested ID not in template")
                    mappings.append({
                        'Carbon_Source_Original': carbon_source,
                        'ModelSEED_ID': 'UNMAPPED',
                        'ModelSEED_Name': '',
                        'Formula': '',
                        'Mass': 0,
                        'Charge': 0,
                        'Mapping_Method': 'round2_llm_unverified',
                        'Confidence': 'Low',
                        'AI_Explanation': f"LLM suggested {cpd_id} but not found in template",
                        'Duplicate_IDs': ''
                    })
            else:
                print(f"  LLM could not map: {explanation}")
                mappings.append({
                    'Carbon_Source_Original': carbon_source,
                    'ModelSEED_ID': 'UNMAPPED',
                    'ModelSEED_Name': '',
                    'Formula': '',
                    'Mass': 0,
                    'Charge': 0,
                    'Mapping_Method': 'round2_llm_failed',
                    'Confidence': 'Low',
                    'AI_Explanation': explanation,
                    'Duplicate_IDs': ''
                })
        else:
            print(f"  LLM request failed")
            mappings.append({
                'Carbon_Source_Original': carbon_source,
                'ModelSEED_ID': 'UNMAPPED',
                'ModelSEED_Name': '',
                'Formula': '',
                'Mass': 0,
                'Charge': 0,
                'Mapping_Method': 'round2_llm_error',
                'Confidence': 'Low',
                'AI_Explanation': 'LLM request error',
                'Duplicate_IDs': ''
            })
        
        # Rate limit: small delay between requests
        time.sleep(0.5)
    
    print(f"\n{'='*80}")
    print("ROUND 2 COMPLETE")
    print(f"{'='*80}")
    
    # Save Round 2 results
    print(f"\nSaving Round 2 intermediate results...")
    round2_df = pd.DataFrame(mappings)
    round2_file = OUTPUT_DIR / 'round2_all_mappings.csv'
    round2_df.to_csv(round2_file, index=False)
    print(f"  All mappings so far: {round2_file}")
    
    # Save still unmapped after Round 2
    still_unmapped = round2_df[round2_df['ModelSEED_ID'] == 'UNMAPPED']
    if len(still_unmapped) > 0:
        unmapped_round2_file = OUTPUT_DIR / 'round2_still_unmapped.csv'
        still_unmapped.to_csv(unmapped_round2_file, index=False)
        print(f"  Still unmapped compounds: {unmapped_round2_file}")
        print(f"  {len(still_unmapped)} compounds still need manual curation")
else:
    print("\nAll compounds mapped in Round 1 - skipping Round 2")


ROUND 2: AI-ASSISTED MAPPING

Attempting to map 54 compounds using LLM...

[1/54] 1,3-Butandiol
  LLM Mapped: cpd00738 - 1,3-Butanediol
     Explanation: 1,3-Butandiol is a diol with two hydroxyl groups on a butane chain, matching the ModelSEED compound 1,3-Butanediol.
     Confidence: medium
  Verified in template

[2/54] 1,4-Butanediol
  LLM Mapped: cpd00751 - 1,4-Butanediol
     Explanation: 1,4-Butanediol is a known compound in the ModelSEED database with the ID cpd00751.
     Confidence: high
  Verified in template

[3/54] 1,5-Pentanediol
  LLM Mapped: cpd19020 - 1,5-Pentanediol
     Explanation: 1,5-Pentanediol is a diol with a five-carbon chain, and this ID corresponds to it in the ModelSEED database.
     Confidence: medium
  Verified in template

[4/54] 3-methyl-2-oxopentanoic acid
  LLM Mapped: cpd11493 - 3-Methyl-2-oxopentanoate
     Explanation: The compound '3-methyl-2-oxopentanoic acid' corresponds to the keto acid form of leucine, which is mapped to cpd11493 in the Mode
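
The Round 2 cell builds the same UNMAPPED record in three places, differing only in the method label and explanation. A small helper, sketched here under the assumption that the column set stays as shown in that cell, would keep the three branches consistent. The name `unmapped_record` and the example arguments are illustrative.

```python
def unmapped_record(carbon_source, method, explanation):
    """Build a mapping row for a compound that could not be mapped."""
    return {
        "Carbon_Source_Original": carbon_source,
        "ModelSEED_ID": "UNMAPPED",
        "ModelSEED_Name": "",
        "Formula": "",
        "Mass": 0,
        "Charge": 0,
        "Mapping_Method": method,
        "Confidence": "Low",
        "AI_Explanation": explanation,
        "Duplicate_IDs": "",
    }


# Placeholder compound name; not taken from the dataset.
row = unmapped_record("Example compound", "round2_llm_error", "LLM request error")
print(row["ModelSEED_ID"], row["Mapping_Method"])
```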

## Optional Round 3: Deep Dive with GPT-5

For the 8 compounds that remain unmapped after Round 2, use GPT-5 for deeper biochemical analysis.

**When to use Round 3:**
- Compounds are still UNMAPPED after Round 2 (currently 8)
- More detailed reasoning and analysis are needed

**Skip this section if:**
- Manual curation of the 8 remaining compounds is acceptable

Uncomment the cell below to run Round 3.

In [10]:
# # OPTIONAL: Round 3 - Deep Dive with GPT-5 for the 8 unmapped compounds
# # Uncomment to run
# 
# # Get unmapped compounds
# unmapped_df_round2 = pd.DataFrame(mappings)
# still_unmapped = unmapped_df_round2[unmapped_df_round2['ModelSEED_ID'] == 'UNMAPPED']
# 
# if len(still_unmapped) > 0:
#     print("\n" + "="*80)
#     print("ROUND 3: DEEP DIVE WITH GPT-5")
#     print("="*80)
#     print(f"\n{len(still_unmapped)} compounds need deeper analysis...")
#     
#     for idx, row in still_unmapped.iterrows():
#         carbon_source = row['Carbon_Source_Original']
#         print(f"\n[Deep dive] {carbon_source}")
#         
#         # More detailed prompt for GPT-5
#         prompt = f"""You are a biochemistry expert with deep knowledge of metabolic databases.
# 
# Compound: \"{carbon_source}\"
# Previous attempt: {row['Mapping_Method']}
# Previous explanation: {row['AI_Explanation']}
# 
# Task: Perform comprehensive analysis to map this to ModelSEED.
# 
# Analysis steps:
# 1. Identify core chemical structure
# 2. Consider biochemical transformations
# 3. Search for synonyms and alternative names
# 4. Check if it's a salt, hydrate, or derivative
# 5. Suggest best ModelSEED ID
# 
# Response (JSON):
# {{
#   \"compound_id\": \"cpd#####\",
#   \"compound_name\": \"ModelSEED name\",
#   \"explanation\": \"Detailed reasoning\",
#   \"confidence\": \"high/medium/low\",
#   \"alternative_ids\": [\"cpd#####\"]  
# }}
# """
#         
#         try:
#             response = requests.post(
#                 f"{ARGO_BASE_URL}/chat/completions",
#                 json={
#                     "model": "gpt5",  # Use GPT-5 for deeper analysis
#                     "messages": [{"role": "user", "content": prompt}],
#                     "temperature": 0.1,
#                     "max_tokens": 1000
#                 },
#                 timeout=120
#             )
#             
#             if response.status_code == 200:
#                 result = response.json()
#                 content = result['choices'][0]['message']['content']
#                 
#                 if '```json' in content:
#                     content = content.split('```json')[1].split('```')[0].strip()
#                 elif '```' in content:
#                     content = content.split('```')[1].split('```')[0].strip()
#                 
#                 deep_result = json.loads(content)
#                 cpd_id = deep_result.get('compound_id', 'UNMAPPED')
#                 
#                 if cpd_id != 'UNMAPPED':
#                     verification = search_template_by_id(cpd_id)
#                     if verification:
#                         print(f"  GPT-5 Mapped: {cpd_id}")
#                         print(f"  Explanation: {deep_result.get('explanation', '')[:100]}...")
#                         
#                         # Update mapping
#                         mappings[idx] = {
#                             'Carbon_Source_Original': carbon_source,
#                             'ModelSEED_ID': cpd_id,
#                             'ModelSEED_Name': verification['name'],
#                             'Formula': verification.get('formula', ''),
#                             'Mass': verification.get('mass', 0),
#                             'Charge': verification.get('defaultCharge', 0),
#                             'Mapping_Method': 'round3_gpt5',
#                             'Confidence': deep_result.get('confidence', 'medium').capitalize(),
#                             'AI_Explanation': deep_result.get('explanation', ''),
#                             'Duplicate_IDs': ';'.join(deep_result.get('alternative_ids', []))
#                         }
#                         print(f"  Updated mapping")
#         except Exception as e:
#             print(f"  Error: {e}")
#         
#         time.sleep(1)
#     
#     print(f"\n{'='*80}")
#     print("ROUND 3 COMPLETE")
#     print(f"{'='*80}")
# else:
#     print("\nNo compounds need deep dive")
# 
# print("Round 3 (optional) - uncomment to run")

## Final Summary Statistics

Combined results from all mapping rounds.

In [11]:
# Create DataFrame
mapping_df = pd.DataFrame(mappings)

# Statistics
total = len(mapping_df)
mapped = (mapping_df['ModelSEED_ID'] != 'UNMAPPED').sum()
unmapped_final = (mapping_df['ModelSEED_ID'] == 'UNMAPPED').sum()

round1_mapped = mapping_df['Mapping_Method'].str.startswith('round1').sum()
round2_mapped = (mapping_df['Mapping_Method'].str.startswith('round2') & 
                 (mapping_df['ModelSEED_ID'] != 'UNMAPPED')).sum()

duplicates = (mapping_df['Duplicate_IDs'] != '').sum()
ai_assisted = (mapping_df['AI_Explanation'] != '').sum()

print("\n" + "="*80)
print("MAPPING SUMMARY")
print("="*80)
print(f"\nTotal carbon sources: {total}")
print(f"\nSuccessfully mapped: {mapped} ({100*mapped/total:.1f}%)")
print(f"  Round 1 (automated): {round1_mapped}")
print(f"  Round 2 (AI-assisted): {round2_mapped}")
print(f"\nUnmapped: {unmapped_final} ({100*unmapped_final/total:.1f}%)")
print(f"\nDuplicates resolved: {duplicates}")
print(f"AI-assisted mappings: {ai_assisted}")

print(f"\n{'='*80}")


MAPPING SUMMARY

Total carbon sources: 141

Successfully mapped: 133 (94.3%)
  Round 1 (automated): 87
  Round 2 (AI-assisted): 46

Unmapped: 8 (5.7%)

Duplicates resolved: 2
AI-assisted mappings: 54
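
The "AI-assisted mappings" figure counts every row with a non-empty explanation, which includes Round 2 rows that ended up UNMAPPED. A per-method breakdown makes that distinction visible. The sketch below uses made-up rows with the same keys as the mapping records; `toy_rows` and `by_method` are illustrative names.

```python
from collections import Counter

# Made-up rows; only the keys mirror the notebook's mapping records.
toy_rows = [
    {"Mapping_Method": "round1_template", "ModelSEED_ID": "cpd00027"},
    {"Mapping_Method": "round2_llm", "ModelSEED_ID": "cpd00100"},
    {"Mapping_Method": "round2_llm_failed", "ModelSEED_ID": "UNMAPPED"},
    {"Mapping_Method": "round2_llm_unverified", "ModelSEED_ID": "UNMAPPED"},
]

# Count rows per mapping method, then count rows that actually have an ID.
by_method = Counter(r["Mapping_Method"] for r in toy_rows)
mapped = sum(r["ModelSEED_ID"] != "UNMAPPED" for r in toy_rows)

for method, count in sorted(by_method.items()):
    print(f"{method}: {count}")
print(f"mapped: {mapped} of {len(toy_rows)}")
```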



## Save Final Mapping Table

Final combined output with all mappings.

In [12]:
print(f"\nSaving mapping table to: {MAPPING_FILE}")
mapping_df.to_csv(MAPPING_FILE, index=False)
print(f"Saved {len(mapping_df)} mappings")

# Display first 20 rows
print(f"\nFirst 20 mappings:")
display(mapping_df.head(20))


Saving mapping table to: results/carbon_source_mapping.csv
Saved 141 mappings

First 20 mappings:


| | Carbon_Source_Original | ModelSEED_ID | ModelSEED_Name | Formula | Mass | Charge | Mapping_Method | Confidence | AI_Explanation | Duplicate_IDs |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1,2-Propanediol | cpd00453 | 1,2-Propanediol | C3H8O2 | 76 | 0 | round1_template | High | | |
| 1 | 1-Pentanol | cpd16586 | 1-Pentanol | C5H12O | 88 | 0 | round1_template | High | | |
| 2 | 2-Deoxy-D-Ribose | cpd01242 | Thyminose | C5H10O4 | 134 | 0 | round1_modelseed_local | High | | |
| 3 | 2-methyl-1-butanol | cpd16873 | 2-methyl-1-butanol | C5H12O | 0 | 0 | round1_template | High | | |
| 4 | 3-Methyl-2-Oxobutanoic Acid | cpd00123 | 3-Methyl-2-oxobutanoate | C5H7O3 | 115 | -1 | round1_modelseed_local | High | | |
| 5 | 3-methyl-1-butanol | cpd04533 | Isoamyl alcohol | C5H12O | 88 | 0 | round1_modelseed_local | High | | |
| 6 | 4-Aminobutyric acid | cpd00281 | GABA | C4H9NO2 | 103 | 0 | round1_modelseed_local | High | | |
| 7 | 4-Hydroxybenzoic Acid | cpd00136 | 4-Hydroxybenzoate | C7H5O3 | 137 | -1 | round1_modelseed_local | High | | |
| 8 | 5-Aminovaleric acid | cpd00339 | 5-Aminopentanoate | C5H11NO2 | 117 | 0 | round1_modelseed_local | High | | |
| 9 | Adenosine | cpd00182 | Adenosine | C10H13N5O4 | 267 | 0 | round1_template | High | | |


## Review Unmapped Compounds

In [13]:
unmapped_df = mapping_df[mapping_df['ModelSEED_ID'] == 'UNMAPPED']

if len(unmapped_df) > 0:
    print(f"\nMANUAL CURATION REQUIRED for {len(unmapped_df)} compounds:")
    print("="*80)
    
    for idx, row in unmapped_df.iterrows():
        print(f"\n{row['Carbon_Source_Original']}")
        print(f"  Method: {row['Mapping_Method']}")
        if row['AI_Explanation']:
            print(f"  AI says: {row['AI_Explanation']}")
    
    print("\n" + "="*80)
    print("\nSuggestions for manual curation:")
    print("1. Check ModelSEED web interface: https://modelseed.org")
    print("2. Search KEGG database for compound IDs")
    print("3. For complex mixtures, consider using representative compound")
    print("4. For proprietary prebiotics, research composition")
    print("5. Update mapping CSV manually and re-run media generation")
else:
    print("\nAll compounds successfully mapped!")


MANUAL CURATION REQUIRED for 8 compounds:

6-O-Acetyl-D-glucose
  Method: round2_llm_failed
  AI says: 6-O-Acetyl-D-glucose is a modified form of D-glucose with an acetyl group, and there is no direct match in the ModelSEED database for this specific compound.

D-Gluconic Acid sodium salt
  Method: round2_llm_unverified
  AI says: LLM suggested cpd00257 but not found in template

D-Glucuronic Acid
  Method: round2_llm_unverified
  AI says: LLM suggested cpd00257 but not found in template

Dodecandioic acid
  Method: round2_llm_unverified
  AI says: LLM suggested cpd29697 but not found in template

Gly-DL-Asp
  Method: round2_llm_failed
  AI says: Gly-DL-Asp is a dipeptide composed of glycine and aspartic acid, and ModelSEED may not have a specific ID for dipeptides. The database typically includes individual amino acids or common metabolites.

Lacto-N-neotetraose
  Method: round2_llm_failed
  AI says: Lacto-N-neotetraose is a specific oligosaccharide and does not have a direct match i

## Generate Media JSON Files

Create individual media formulation files for each mapped carbon source.

In [14]:
# Base media formulation (same for all)
BASE_MEDIA = {
    'cpd00007': (-10, 100),    # O2
    'cpd00001': (-100, 100),   # H2O
    'cpd00009': (-100, 100),   # Phosphate
    'cpd00013': (-100, 100),   # NH3
    'cpd00048': (-100, 100),   # Sulfate
    'cpd00099': (-100, 100),   # Cl-
    'cpd00067': (-100, 100),   # H+
    'cpd00205': (-100, 100),   # K+
    'cpd00254': (-100, 100),   # Mg2+
    'cpd00971': (-100, 100),   # Na+
    'cpd00149': (-100, 100),   # Co2+
    'cpd00063': (-100, 100),   # Ca2+
    'cpd00058': (-100, 100),   # Cu2+
    'cpd00034': (-100, 100),   # Zn2+
    'cpd00030': (-100, 100),   # Mn2+
    'cpd10515': (-100, 100),   # Fe2+
    'cpd10516': (-100, 100),   # Fe3+
    'cpd11574': (-100, 100),   # Molybdate
    'cpd00244': (-100, 100),   # Ni2+
}

# Carbon source uptake rate (negative = uptake)
CARBON_UPTAKE_RATE = -5

print("Base media formulation loaded")
print(f"  Base nutrients: {len(BASE_MEDIA)}")
print(f"  Carbon uptake rate: {CARBON_UPTAKE_RATE} mmol/gDW/hr")

Base media formulation loaded
  Base nutrients: 19
  Carbon uptake rate: -5 mmol/gDW/hr


In [15]:
print(f"\nGenerating media JSON files...")
print("="*80)

media_generated = 0
media_skipped = 0

for idx, row in mapping_df.iterrows():
    carbon_source = row['Carbon_Source_Original']
    cpd_id = row['ModelSEED_ID']
    
    if cpd_id == 'UNMAPPED':
        print(f"  Skipping {carbon_source} (unmapped)")
        media_skipped += 1
        continue
    
    # Create media formulation
    media_dict = BASE_MEDIA.copy()
    media_dict[cpd_id] = (CARBON_UPTAKE_RATE, 100)
    
    # Create safe filename
    safe_filename = carbon_source.replace('/', '_').replace(' ', '_').replace(',', '')
    safe_filename = safe_filename.replace('(', '').replace(')', '')
    media_file = MEDIA_DIR / f"{safe_filename}.json"
    
    # Save media file
    with open(media_file, 'w') as f:
        json.dump(media_dict, f, indent=2)
    
    media_generated += 1
    
    if media_generated <= 5:
        print(f"  Generated: {media_file.name}")

if media_generated > 5:
    print(f"  ... and {media_generated - 5} more files")

print(f"\n{'='*80}")
print(f"Media generation complete")
print(f"  Generated: {media_generated}")
print(f"  Skipped: {media_skipped}")
print(f"  Output directory: {MEDIA_DIR}")
print(f"{'='*80}")


Generating media JSON files...
  Generated: 12-Propanediol.json
  Generated: 1-Pentanol.json
  Generated: 2-Deoxy-D-Ribose.json
  Generated: 2-methyl-1-butanol.json
  Generated: 3-Methyl-2-Oxobutanoic_Acid.json
  Skipping 6-O-Acetyl-D-glucose (unmapped)
  Skipping D-Gluconic Acid sodium salt (unmapped)
  Skipping D-Glucuronic Acid (unmapped)
  Skipping Dodecandioic acid (unmapped)
  Skipping Gly-DL-Asp (unmapped)
  Skipping Lacto-N-neotetraose (unmapped)
  Skipping Maltitol (unmapped)
  Skipping Methyl-B-D-galactopyranoside (unmapped)
  ... and 128 more files

Media generation complete
  Generated: 133
  Skipped: 8
  Output directory: media
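
The filename cleanup above deletes commas outright, which is why `1,2-Propanediol` becomes `12-Propanediol.json`, and two different names could in principle collapse to the same file. A hedged sketch of a more conservative scheme follows: it replaces unsafe characters with underscores and appends a counter on collision. The helpers `safe_stem` and `unique_stems` are illustrative, and `12-Propanediol` in the demo is a hypothetical name used only to show that the two spellings stay distinct.

```python
import re


def safe_stem(name):
    """Turn a compound name into a filesystem-friendly file stem."""
    # Replace each run of characters outside [A-Za-z0-9.-] with one underscore.
    stem = re.sub(r"[^A-Za-z0-9.\-]+", "_", name)
    return stem.strip("_")


def unique_stems(names):
    """Map each name to a stem, appending a counter when two names collide."""
    seen = {}
    result = {}
    for name in names:
        stem = safe_stem(name)
        seen[stem] = seen.get(stem, 0) + 1
        # Simplification: a suffixed stem is not re-checked against other stems.
        result[name] = stem if seen[stem] == 1 else f"{stem}_{seen[stem]}"
    return result


print(unique_stems(["1,2-Propanediol", "12-Propanediol", "D-Gluconic Acid sodium salt"]))
```

Switching schemes would rename the existing media files, so any downstream step that locates files by name would need the same function.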


## Usage Example

How to load media files for ModelSEEDpy:

In [16]:
# Example: load the media file for one carbon source
example_name = "2-Deoxy-D-Ribose"
print(f"Example: Loading media for {example_name}")
print("="*80)

# Locate the media file for the example carbon source
example_file = MEDIA_DIR / f"{example_name}.json"

if example_file.exists():
    with open(example_file) as f:
        media_dict = json.load(f)
    
    print(f"\nLoaded media from: {example_file.name}")
    print(f"Total compounds: {len(media_dict)}")
    print(f"\nCarbon source:")
    for cpd_id, bounds in media_dict.items():
        if bounds[0] == CARBON_UPTAKE_RATE:  # Find carbon source
            print(f"  {cpd_id}: {bounds}")
    
    print(f"\nUsage with ModelSEEDpy:")
    print("  from modelseedpy import MSMedia")
    print(f"  media = MSMedia.from_dict(media_dict)")
    print("  model.medium = media.get_media_constraints()")
else:
    print("Example media file not found")

Example: Loading media for 2-Deoxy-D-Ribose

Loaded media from: 2-Deoxy-D-Ribose.json
Total compounds: 20

Carbon source:
  cpd01242: [-5, 100]

Usage with ModelSEEDpy:
  from modelseedpy import MSMedia
  media = MSMedia.from_dict(media_dict)
  model.medium = media.get_media_constraints()
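
Note that the bounds print as `[-5, 100]`: JSON has no tuple type, so the `(lower, upper)` tuples written during media generation come back as lists. If downstream code distinguishes tuples from other values (an assumption worth checking against the ModelSEEDpy version in use), the bounds can be converted back after loading. The helper `load_media_bounds` is illustrative and not part of the notebook.

```python
import json


def load_media_bounds(text):
    """Parse a media JSON string and return {compound_id: (lower, upper)}."""
    raw = json.loads(text)
    # JSON arrays load as lists; convert each pair back to a tuple.
    return {cpd_id: tuple(bounds) for cpd_id, bounds in raw.items()}


# Round trip: tuples are written as JSON arrays and restored as tuples.
original = {"cpd00007": (-10, 100), "cpd01242": (-5, 100)}
restored = load_media_bounds(json.dumps(original))
print(restored == original)
```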


## Summary

**Intermediate Files Created**:
1. `results/round1_mapped.csv` - Compounds mapped in Round 1 (automated)
2. `results/round1_unmapped.txt` - Compounds that need Round 2
3. `results/round2_all_mappings.csv` - All mappings after Round 2
4. `results/round2_still_unmapped.csv` - Compounds still unmapped after Round 2 (if any)

**Final Output Files**:
1. `results/carbon_source_mapping.csv` - Complete final mapping table
2. `media/*.json` - Individual media formulation files (one per mapped carbon source)

**Workflow Summary**:
- Round 1: Automated search (template + local database)
- Round 2: AI-assisted mapping with GPT-4o
- Round 3: Optional deep dive with GPT-5 for difficult cases
- Final: Combined mapping table and media generation

**Next Steps**:
1. Review unmapped compounds (if any) and perform manual curation
2. Update mapping CSV with manual corrections if needed
3. Re-run media generation section if needed
4. Proceed to CDMSCI-198: Build metabolic models
5. Proceed to CDMSCI-199: Run FBA simulations with these media