# CDMSCI-199: Condition-Specific Gap-filling Experiment (Corrected)

## Objective

For each of the 571 false-negative cases (experimental growth = 1 but model prediction = 0), run condition-specific gap-filling using the **proper ModelSEEDpy protocol**:

1. Use the **Core-V5.2** template for gap-filling (not GramNegModelTemplateV6)
2. Use **MSGapfill** (not COBRApy's basic gapfill)
3. Use **MSMedia** for media definitions
4. Follow the reference workflow in `references/build_metabolic_model/build_model.ipynb`

## Research Question

**Can condition-specific gap-filling rescue false negatives, and if so, is it adding meaningful biology or just overfitting?**

## Key Difference from CDMSCI-198

- **CDMSCI-198**: Gap-filled on pyruvate minimal medium only
- **This analysis**: Gap-fills each model on the medium for its specific carbon source

## Runtime

Estimated: 5-10 hours for 571 gap-filling experiments
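
The 5-10 hour figure follows from an assumed 30 to 60 seconds per gap-filling run, the same per-run bounds used in the Configuration cell. A quick check of the arithmetic, with the per-run times treated as assumptions rather than measurements:

```python
# Back-of-the-envelope runtime estimate.
# Assumption: each gap-filling run takes between 30 and 60 seconds.
N_EXPERIMENTS = 571
SECONDS_PER_RUN = (30, 60)


def estimate_hours(n_runs, seconds_per_run):
    """Return total wall-clock hours for n_runs sequential runs."""
    return n_runs * seconds_per_run / 3600


low, high = (estimate_hours(N_EXPERIMENTS, s) for s in SECONDS_PER_RUN)
print(f"Estimated runtime: {low:.1f} - {high:.1f} hours")
```

This works out to roughly 4.8 to 9.5 hours when the runs execute sequentially.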

## Setup and Imports

In [None]:
import cobra
from cobra.io import load_json_model, save_json_model
import pandas as pd
import json
from pathlib import Path
from tqdm.notebook import tqdm
import time
import warnings
warnings.filterwarnings('ignore')

# ModelSEEDpy imports (proper gap-filling protocol)
from modelseedpy import MSMedia, MSGapfill, MSBuilder
from modelseedpy.core.mstemplate import MSTemplateBuilder
from modelseedpy.core.msmodel import get_reaction_constraints_from_direction

print(f"COBRApy version: {cobra.__version__}")
print(f"Pandas version: {pd.__version__}")
print("ModelSEEDpy imported successfully")

## Load Input Data

In [None]:
# Paths
models_dir = Path('../CDMSCI-198-build-models/models')
media_dir = Path('../CDMSCI-197-media-formulations/media')
false_negatives_file = Path('results/false_negatives.csv')
core_template_path = Path('../references/build_metabolic_model/Core-V5.2.json')

# Load false negatives
fn_df = pd.read_csv(false_negatives_file)
print(f"Loaded {len(fn_df)} false negatives to process")
print(f"\nFirst 5 FNs:")
print(fn_df[['organism', 'carbon_source', 'biomass_flux']].head())

In [None]:
# Add orgId to FN dataframe
org_metadata = pd.read_csv('results/organism_metadata.csv')
fn_df = fn_df.merge(org_metadata[['organism', 'orgId']], on='organism', how='left')

print(f"Added orgId column")
print(f"Missing orgId: {fn_df['orgId'].isna().sum()}")
if fn_df['orgId'].isna().sum() > 0:
    print("\nOrganisms with missing orgId:")
    print(fn_df[fn_df['orgId'].isna()]['organism'].unique())
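
The left merge keeps organisms that have no metadata match (their `orgId` becomes NaN) and would duplicate rows if an organism name appeared more than once in the metadata. The sketch below runs both checks in plain Python on toy records. All organism names and IDs are invented placeholders, and `check_merge_keys` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

# Toy stand-ins for fn_df and org_metadata (all values are hypothetical).
false_negatives = [
    {"organism": "Organism A", "carbon_source": "D-Glucose"},
    {"organism": "Organism B", "carbon_source": "Sucrose"},
    {"organism": "Organism C", "carbon_source": "L-Lactate"},
]
metadata = [
    {"organism": "Organism A", "orgId": "orgA"},
    {"organism": "Organism B", "orgId": "orgB"},
    {"organism": "Organism B", "orgId": "orgB2"},  # duplicate key on purpose
]


def check_merge_keys(left_rows, right_rows, key="organism"):
    """Return (keys duplicated on the right, left keys with no match)."""
    right_counts = Counter(row[key] for row in right_rows)
    duplicated = sorted(k for k, n in right_counts.items() if n > 1)
    unmatched = sorted({row[key] for row in left_rows} - set(right_counts))
    return duplicated, unmatched


duplicated, unmatched = check_merge_keys(false_negatives, metadata)
print("Duplicated in metadata:", duplicated)
print("Missing from metadata:", unmatched)
```

In pandas, passing `validate="many_to_one"` to `merge` raises an error when the right-hand keys are not unique, which covers the first check directly.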

## Load Core-V5.2 Template (Proper Gap-filling Template)

In [None]:
print(f"Loading Core-V5.2 template: {core_template_path}")
print("This is the proper template for gap-filling (per reference workflow)")
print()

try:
    with open(core_template_path) as fh:
        template_core = MSTemplateBuilder.from_dict(json.load(fh)).build()
    print(f"✓ Core-V5.2 template loaded successfully")
    print(f"  Reactions: {len(template_core.reactions):,}")
    print(f"  Compounds: {len(template_core.compcompounds):,}")
except Exception as e:
    print(f"✗ ERROR: Could not load Core-V5.2 template: {e}")
    raise

## Helper Functions (from CDMSCI-198)

In [None]:
def integrate_gapfill_solution(template, model, gapfill_result):
    """
    Integrate gapfill solution by adding reactions from template to model.
    Returns a tuple: (list of added reactions, result of MSBuilder.add_exchanges_to_model).
    """
    added_reactions = []
    
    # Process new reactions
    gap_sol = {}
    for rxn_id, direction in gapfill_result.get('new', {}).items():
        # Skip exchange reactions (EX_*) - they'll be added automatically
        if rxn_id.startswith('EX_'):
            continue
            
        # Remove index suffix (e.g., rxn05481_c0 -> rxn05481_c)
        if rxn_id.endswith('0'):
            template_rxn_id = rxn_id[:-1]  # Remove just the 0
        else:
            template_rxn_id = rxn_id
            
        if template_rxn_id in template.reactions:
            gap_sol[template_rxn_id] = get_reaction_constraints_from_direction(direction)
    
    # Add reactions to model
    for rxn_id, (lb, ub) in gap_sol.items():
        template_reaction = template.reactions.get_by_id(rxn_id)
        model_reaction = template_reaction.to_reaction(model)
        model_reaction.lower_bound = lb
        model_reaction.upper_bound = ub
        added_reactions.append(model_reaction)
    
    model.add_reactions(added_reactions)
    
    # Add missing exchanges
    add_exchanges = MSBuilder.add_exchanges_to_model(model)
    
    return added_reactions, add_exchanges

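def apply_media_to_model(media, model, prefix='EX_'):
    """Build a COBRApy medium dict (exchange ID -> max uptake) from an MSMedia object."""
    medium = {}
    for cpd, (lb, _ub) in media.get_media_constraints().items():
        rxn_exchange = f'{prefix}{cpd}'
        # Skip compounds the model has no exchange reaction for
        if rxn_exchange in model.reactions:
            # Media lower bounds are typically negative (uptake); COBRApy expects positive limits
            medium[rxn_exchange] = abs(lb)
    return medium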

print("Helper functions defined")
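
`integrate_gapfill_solution` does two small transformations before it touches the model: it maps model-level reaction IDs back to template IDs, and it converts a direction symbol into flux bounds. The sketch below reproduces that logic on plain dictionaries. `direction_to_bounds` is a hypothetical stand-in that assumes the common convention (`>` forward, `<` reverse, anything else reversible, bounds of ±1000). It is not ModelSEEDpy's own function, and the reaction IDs are illustrative.

```python
def to_template_id(rxn_id):
    """Strip the trailing compartment index, e.g. rxn05481_c0 -> rxn05481_c."""
    return rxn_id[:-1] if rxn_id.endswith("0") else rxn_id


def direction_to_bounds(direction, limit=1000.0):
    """Hypothetical stand-in: '>' forward, '<' reverse, otherwise reversible."""
    if direction == ">":
        return (0.0, limit)
    if direction == "<":
        return (-limit, 0.0)
    return (-limit, limit)


def plan_additions(new_reactions, template_ids):
    """Return {template_id: (lb, ub)} for non-exchange reactions in the template."""
    plan = {}
    for rxn_id, direction in new_reactions.items():
        if rxn_id.startswith("EX_"):
            continue  # exchanges are handled separately
        template_id = to_template_id(rxn_id)
        if template_id in template_ids:
            plan[template_id] = direction_to_bounds(direction)
    return plan


# Illustrative input shaped like the 'new' entry of a gap-filling result.
example_new = {
    "rxn05481_c0": ">",
    "rxn00001_c0": "=",
    "EX_cpd00027_e0": "<",
    "rxn99999_c0": "<",  # not in the toy template, so it is dropped
}
example_template = {"rxn05481_c", "rxn00001_c"}
plan = plan_additions(example_new, example_template)
print(plan)
```

Two properties of the original helper show up here. The suffix rule strips only a single trailing `0`, so it assumes compartment index 0. Reactions that are absent from the template are dropped without a warning.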

## Configuration

In [None]:
# Growth threshold
GROWTH_THRESHOLD = 0.001  # h^-1

# Results storage
results = []
reaction_details = []
errors = []

print(f"Configuration:")
print(f"  Growth threshold: {GROWTH_THRESHOLD} h^-1")
print(f"  Total experiments: {len(fn_df)}")
print(f"  Estimated time: {len(fn_df) * 30 / 3600:.1f} - {len(fn_df) * 60 / 3600:.1f} hours")

## Run Gap-filling Experiments

**This is expected to take 5-10 hours. The progress bar updates in real time.**

For each false negative:
1. Load draft model
2. Load carbon source media (as MSMedia object)
3. Test pre-gapfill growth
4. Run MSGapfill with Core-V5.2 template
5. Integrate gapfill solution
6. Test post-gapfill growth
7. Track all reactions added
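
The steps above lead to one of four outcomes per experiment. A minimal sketch of that decision logic, using the same threshold as the Configuration cell; the outcome labels are hypothetical names chosen for illustration and do not appear in the saved results:

```python
GROWTH_THRESHOLD = 0.001  # h^-1, same value as the Configuration cell


def classify_outcome(pre_flux, post_flux, n_new_reactions,
                     threshold=GROWTH_THRESHOLD):
    """Label one experiment, checking conditions in the same order as the loop."""
    if pre_flux > threshold:
        return "draft_already_grows"  # logged as an error, not as a result
    if n_new_reactions == 0:
        return "no_solution"          # gap-filling proposed nothing
    if post_flux > threshold:
        return "rescued"              # false negative turned into growth
    return "not_rescued"              # reactions added, still no growth


# (pre-gapfill flux, post-gapfill flux, number of proposed reactions)
examples = [
    (0.0, 0.25, 3),
    (0.0, 0.0, 0),
    (0.0, 0.0005, 2),
    (0.12, 0.12, 0),
]
labels = [classify_outcome(*example) for example in examples]
for example, label in zip(examples, labels):
    print(example, "->", label)
```

Only the `rescued` case counts as a gap-filling success. A post-gapfill flux at or below the threshold is recorded as a failure even when reactions were added.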

In [None]:
# Track timing
start_time = time.time()

print(f"Starting {len(fn_df)} gap-filling experiments...")
print(f"Start time: {time.strftime('%Y-%m-%d %H:%M:%S')}")
print()

# Main loop with progress bar
for idx, row in tqdm(fn_df.iterrows(), total=len(fn_df), desc="Gap-filling FNs"):
    org_id = row.get('orgId')
    
    if pd.isna(org_id):
        errors.append({
            'organism': row['organism'],
            'orgId': '',
            'carbon_source': row['carbon_source'],
            'error': 'Missing orgId'
        })
        continue
    
    organism = row['organism']
    carbon_source = row['carbon_source']
    
    # Construct paths
    draft_model_path = models_dir / f'{org_id}_draft.json'
    
    # Try to find media file (handle different naming)
    media_path = None
    possible_names = [
        f"{carbon_source}.json",
        f"{carbon_source.replace(' ', '_')}.json",
        f"{carbon_source.replace(',', '').replace(' ', '_')}.json",
    ]
    
    for name in possible_names:
        test_path = media_dir / name
        if test_path.exists():
            media_path = test_path
            break
    
    # Check if files exist
    if not draft_model_path.exists():
        errors.append({
            'organism': organism,
            'orgId': org_id,
            'carbon_source': carbon_source,
            'error': 'Draft model file not found'
        })
        continue
    
    if media_path is None:
        errors.append({
            'organism': organism,
            'orgId': org_id,
            'carbon_source': carbon_source,
            'error': 'Media file not found'
        })
        continue
    
    # Load draft model
    try:
        model = load_json_model(str(draft_model_path))
    except Exception as e:
        errors.append({
            'organism': organism,
            'orgId': org_id,
            'carbon_source': carbon_source,
            'error': f'Model load error: {str(e)[:100]}'
        })
        continue
    
    # Load media as MSMedia object
    try:
        with open(media_path, 'r') as f:
            media_dict = json.load(f)
        # Convert to MSMedia format (ModelSEED compound IDs with bounds)
        msmedia = MSMedia.from_dict(media_dict)
    except Exception as e:
        errors.append({
            'organism': organism,
            'orgId': org_id,
            'carbon_source': carbon_source,
            'error': f'Media load error: {str(e)[:100]}'
        })
        continue
    
    # Apply media and test pre-gapfill
    try:
        model.medium = apply_media_to_model(msmedia, model)
        model.objective = 'bio1'
        pre_gapfill_solution = model.optimize()
        pre_gapfill_flux = pre_gapfill_solution.objective_value
    except Exception:
        # Treat any failure here (e.g. missing 'bio1' or a solver error) as no growth
        pre_gapfill_flux = 0.0
    
    if pre_gapfill_flux > GROWTH_THRESHOLD:
        # Shouldn't happen for a false negative
        errors.append({
            'organism': organism,
            'orgId': org_id,
            'carbon_source': carbon_source,
            'error': f'Draft already grows (flux={pre_gapfill_flux:.4f})'
        })
        continue
    
    # Run gap-filling using MSGapfill (proper protocol)
    try:
        # Create MSGapfill object with Core-V5.2 template
        gapfiller = MSGapfill(
            model,
            default_gapfill_templates=[template_core],
            default_target='bio1'
        )
        
        # Run gapfilling; fall back to an empty dict in case no result object is returned
        gapfill_result = gapfiller.run_gapfilling(msmedia) or {}
        
        num_gapfilled = len(gapfill_result.get('new', {}))
        
        if num_gapfilled > 0:
            # Integrate gapfill solution
            model_gapfilled = model.copy()
            added_rxns, added_exch = integrate_gapfill_solution(template_core, model_gapfilled, gapfill_result)
            
            # Test gap-filled model
            model_gapfilled.medium = apply_media_to_model(msmedia, model_gapfilled)
            model_gapfilled.objective = 'bio1'
            gapfilled_solution = model_gapfilled.optimize()
            post_gapfill_flux = gapfilled_solution.objective_value
            gapfill_success = post_gapfill_flux > GROWTH_THRESHOLD
            
            # Record result
            results.append({
                'organism': organism,
                'orgId': org_id,
                'carbon_source': carbon_source,
                'media_filename': media_path.name,
                'pre_gapfill_flux': pre_gapfill_flux,
                'post_gapfill_flux': post_gapfill_flux,
                'gapfill_success': gapfill_success,
                'num_reactions_added': len(added_rxns),
                'num_exchanges_added': len(added_exch),
                'reactions_added': ';'.join([r.id for r in added_rxns]),
            })
            
            # Record detailed reactions
            for reaction in added_rxns:
                reaction_details.append({
                    'organism': organism,
                    'orgId': org_id,
                    'carbon_source': carbon_source,
                    'reaction_id': reaction.id,
                    'reaction_name': reaction.name,
                    'reaction_formula': reaction.build_reaction_string(),
                    'subsystem': reaction.subsystem
                })
        else:
            # No solution found
            results.append({
                'organism': organism,
                'orgId': org_id,
                'carbon_source': carbon_source,
                'media_filename': media_path.name,
                'pre_gapfill_flux': pre_gapfill_flux,
                'post_gapfill_flux': 0.0,
                'gapfill_success': False,
                'num_reactions_added': 0,
                'num_exchanges_added': 0,
                'reactions_added': '',
            })
    
    except Exception as e:
        errors.append({
            'organism': organism,
            'orgId': org_id,
            'carbon_source': carbon_source,
            'error': f'Gap-filling error: {str(e)[:100]}'
        })

# Done!
elapsed_time = time.time() - start_time
print(f"\n{'='*80}")
print(f"COMPLETED!")
print(f"{'='*80}")
print(f"End time: {time.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Total time: {elapsed_time/60:.1f} minutes ({elapsed_time/3600:.2f} hours)")
print(f"Experiments completed: {len(results)}")
print(f"Errors: {len(errors)}")
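
The loop above resolves each media file by trying three filename variants in order: the raw carbon source name, the name with spaces replaced by underscores, and the name with commas removed as well. The sketch below isolates that lookup so it can be checked on its own. The function names and the carbon source names are illustrative, and the demonstration runs in a temporary directory.

```python
import tempfile
from pathlib import Path


def candidate_media_names(carbon_source):
    """Return candidate filenames in the order the loop tries them."""
    return [
        f"{carbon_source}.json",
        f"{carbon_source.replace(' ', '_')}.json",
        f"{carbon_source.replace(',', '').replace(' ', '_')}.json",
    ]


def find_media_file(media_dir, carbon_source):
    """Return the first existing candidate path, or None if none exists."""
    for name in candidate_media_names(carbon_source):
        path = Path(media_dir) / name
        if path.exists():
            return path
    return None


# Demonstration in a throwaway directory with one toy media file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Sodium_D-Lactate.json").write_text("{}")
    found = find_media_file(tmp, "Sodium D-Lactate")
    missing = find_media_file(tmp, "Citric Acid")
    found_name = found.name if found else None

print(found_name, missing)
```

Running the lookup over all carbon sources before starting the long loop would surface naming mismatches early instead of hours into the run.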

## Save Results

In [None]:
# Save main results
results_df = pd.DataFrame(results)
results_df.to_csv('results/condition_specific_gapfilling_results.csv', index=False)
print(f"✓ Saved main results: results/condition_specific_gapfilling_results.csv")
print(f"  Rows: {len(results_df):,}")

# Save detailed reactions
reactions_df = pd.DataFrame(reaction_details)
reactions_df.to_csv('results/condition_specific_gapfilling_reactions.csv', index=False)
print(f"✓ Saved detailed reactions: results/condition_specific_gapfilling_reactions.csv")
print(f"  Rows: {len(reactions_df):,}")

# Save errors
if errors:
    errors_df = pd.DataFrame(errors)
    errors_df.to_csv('results/condition_specific_gapfilling_errors.csv', index=False)
    print(f"✓ Saved errors: results/condition_specific_gapfilling_errors.csv")
    print(f"  Rows: {len(errors_df):,}")

## Summary Statistics

In [None]:
print(f"{'='*80}")
print(f"SUMMARY STATISTICS")
print(f"{'='*80}")
print()
print(f"Total experiments: {len(results_df)}")
print(f"Errors: {len(errors)}")
print()

# Guard against an empty results table (e.g. every experiment errored out)
if len(results_df) > 0:
    success = results_df['gapfill_success'].astype(bool)
    print(f"Successful gap-filling: {success.sum()} ({100*success.mean():.1f}%)")
    print(f"Failed gap-filling: {(~success).sum()} ({100*(~success).mean():.1f}%)")
    print()
    print("Reactions added statistics:")
    print(results_df['num_reactions_added'].describe())
    print()
    print(f"Mean reactions added: {results_df['num_reactions_added'].mean():.1f}")
    print(f"Median reactions added: {results_df['num_reactions_added'].median():.1f}")
    print(f"Max reactions added: {results_df['num_reactions_added'].max()}")

## Experiment Complete!

Next steps:
1. Review `results/condition_specific_gapfilling_results.csv`
2. Analyze which reactions are most frequently added
3. Assess biological plausibility of added reactions
4. Compare to pyruvate gap-filling reactions from CDMSCI-198
5. Make recommendations on a multi-condition gap-filling approach
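
As a starting point for step 2, the semicolon-joined `reactions_added` column can be split and tallied. The sketch below uses only the standard library and a toy CSV in the same column layout. The organisms, carbon sources, and reaction IDs are placeholders, not results from this experiment.

```python
import csv
import io
from collections import Counter

# Toy rows in the layout of condition_specific_gapfilling_results.csv
# (only the columns needed here; all values are placeholders).
toy_csv = """organism,carbon_source,gapfill_success,reactions_added
Organism A,D-Glucose,True,rxn00001_c0;rxn00002_c0
Organism B,Sucrose,True,rxn00001_c0
Organism C,L-Lactate,False,
"""


def count_added_reactions(rows):
    """Count how many experiments added each reaction ID."""
    counts = Counter()
    for row in rows:
        # Empty strings appear when no reactions were added
        ids = [r for r in row["reactions_added"].split(";") if r]
        counts.update(ids)
    return counts


rows = list(csv.DictReader(io.StringIO(toy_csv)))
counts = count_added_reactions(rows)
for rxn_id, n in counts.most_common():
    print(rxn_id, n)
```

Reactions that recur across many organisms and carbon sources are candidates for a shared gap, while reactions added only once are worth a closer look for overfitting.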