# CAMEO Evaluation Test

This notebook replicates the evaluation workflow from the AMAS notebook `test_LLM_synonyms_plain.ipynb` using CAMEO functions.
It tests both single-model evaluation and batch evaluation of multiple models.

In [1]:
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import logging

# Add the project root to the Python path
project_root = Path().absolute().parent
sys.path.insert(0, str(project_root))

# Import CAMEO functions
from cameo.core import annotate_single_model, print_results
from cameo.utils import (
    evaluate_single_model,
    evaluate_models_in_folder,
    print_evaluation_results,
    compare_with_amas_results
)

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

## Configuration

Set up paths and parameters for evaluation.

In [2]:
# Configuration
test_model_file = "test_models/BIOMD0000000190.xml"  # Single test model
# model_dir = "/Users/luna/Desktop/CRBM/AMAS_proj/Models/BioModels"  # Directory with multiple models
model_dir = "test_models"
output_dir = "./results/"  # Output directory for results

# LLM configuration
llm_model = "meta-llama/llama-3.3-70b-instruct:free"  # or "gpt-4o-mini"

# Evaluation parameters
max_entities_per_model = 10  # Limit entities per model for testing
num_models_to_test = 5  # Number of models to test in batch evaluation

# Entity and database configuration
entity_type = "chemical"
database = "chebi"

print(f"Test model: {test_model_file}")
print(f"Model directory: {model_dir}")
print(f"LLM model: {llm_model}")
print(f"Entity type: {entity_type}")
print(f"Database: {database}")
print(f"Max entities per model: {max_entities_per_model}")
print(f"Number of models to test: {num_models_to_test}")

Test model: test_models/BIOMD0000000190.xml
Model directory: test_models
LLM model: meta-llama/llama-3.3-70b-instruct:free
Entity type: chemical
Database: chebi
Max entities per model: 10
Number of models to test: 5


## Test 1: Single Model Evaluation

Test the evaluation of a single model using both the core interface (`annotate_single_model`) and the utils function (`evaluate_single_model`).

In [3]:
print("Test 1: Single Model Evaluation")
print("-" * 40)

# Check if test model exists
if os.path.exists(test_model_file):
    print(f"✓ Test model found: {test_model_file}")
else:
    print(f"✗ Test model not found: {test_model_file}")
    print("Please ensure the test model is available in the test_models directory.")

Test 1: Single Model Evaluation
----------------------------------------
✓ Test model found: BIOMD0000000190.xml


In [3]:
# Test using core interface (annotate_single_model)
print("\n1.1 Testing core interface (annotate_single_model)")
print("-" * 50)

try:
    recommendations_df, metrics = annotate_single_model(
        model_file=test_model_file,
        llm_model=llm_model,
        max_entities=max_entities_per_model,
        entity_type=entity_type,
        database=database
    )
    
    print("✓ Core interface test successful")
    print(f"  - Generated {len(recommendations_df)} recommendations")
    print(f"  - Accuracy: {metrics['accuracy']:.1%}")
    print(f"  - Annotation rate: {metrics['annotation_rate']:.1%}")
    print(f"  - Total time: {metrics['total_time']:.2f}s")
    print(f"  - LLM time: {metrics['llm_time']:.2f}s")
    print(f"  - Search time: {metrics['search_time']:.2f}s")
    
    # Display sample recommendations
    print("\n  Sample recommendations:")
    sample_cols = ['id', 'display_name', 'annotation', 'annotation_label', 'match_score', 'existing']
    print(recommendations_df[sample_cols].head().to_string(index=False))
    
except Exception as e:
    print(f"✗ Core interface test failed: {e}")
    import traceback
    traceback.print_exc()

2025-05-25 16:35:00,525 - INFO - Starting annotation for model: test_models/BIOMD0000000190.xml
2025-05-25 16:35:00,526 - INFO - Using LLM model: meta-llama/llama-3.3-70b-instruct:free
2025-05-25 16:35:00,526 - INFO - Entity type: chemical, Database: chebi
2025-05-25 16:35:00,526 - INFO - Step 1: Finding existing annotations...
2025-05-25 16:35:00,537 - INFO - Found 11 entities with existing annotations
2025-05-25 16:35:00,537 - INFO - Selected 10 entities for evaluation
2025-05-25 16:35:00,537 - INFO - Step 3: Extracting model context...
2025-05-25 16:35:00,578 - INFO - Extracted context for model: Model_1
2025-05-25 16:35:00,578 - INFO - Step 4: Formatting LLM prompt...
2025-05-25 16:35:00,608 - INFO - Step 5: Querying LLM (meta-llama/llama-3.3-70b-instruct:free)...



1.1 Testing core interface (annotate_single_model)
--------------------------------------------------


2025-05-25 16:35:01,941 - INFO - HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-25 16:35:15,369 - INFO - LLM response received in 14.76s
2025-05-25 16:35:15,373 - INFO - Step 6: Parsing LLM response...
2025-05-25 16:35:15,375 - INFO - Parsed synonyms for 11 entities
2025-05-25 16:35:15,375 - INFO - Step 7: Searching chebi database...
2025-05-25 16:35:16,803 - INFO - Database search completed in 1.43s
2025-05-25 16:35:16,803 - INFO - Step 8: Generating recommendation table...
2025-05-25 16:35:16,806 - INFO - Annotation completed in 16.28s
2025-05-25 16:35:16,806 - INFO - Generated 30 recommendations


✓ Core interface test successful
  - Generated 30 recommendations
  - Accuracy: 90.0%
  - Annotation rate: 90.9%
  - Total time: 16.28s
  - LLM time: 14.76s
  - Search time: 1.43s

  Sample recommendations:
 id             display_name  annotation            annotation_label  match_score  existing
SAM  S-adenosyl-L-methionine CHEBI:15414     S-adenosyl-L-methionine     0.666667         1
SAM  S-adenosyl-L-methionine CHEBI:33442 (S)-S-adenosyl-L-methionine     0.333333         0
SAM  S-adenosyl-L-methionine CHEBI:67040   S-adenosyl-L-methioninate     0.333333         0
  A S-adenosylmethioninamine CHEBI:15625    S-adenosylmethioninamine     1.000000         1
  P               Putrescine CHEBI:17148                  putrescine     1.000000         1
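The `match_score` values in the sample above are multiples of 1/3, and each entity was given three synonyms. One reading consistent with that, offered here as an assumption rather than as CAMEO's documented behavior, is that a candidate's score is the fraction of synonyms that retrieved it. The sketch below illustrates that reading; `match_scores` and the `hits` mapping are hypothetical names, and which synonym retrieved which ID is made up for illustration.

```python
from collections import Counter


def match_scores(synonym_hits):
    """Score each candidate ID by the fraction of synonyms that retrieved it.

    synonym_hits maps each synonym to the candidate IDs it matched. This is a
    hypothetical structure, not CAMEO's internal representation.
    """
    if not synonym_hits:
        return {}
    counts = Counter()
    for ids in synonym_hits.values():
        counts.update(set(ids))  # count each ID at most once per synonym
    n_synonyms = len(synonym_hits)
    return {cid: count / n_synonyms for cid, count in counts.items()}


# Illustrative hits only: the synonym-to-ID assignments are assumptions.
hits = {
    "S-adenosylmethionine": ["CHEBI:15414", "CHEBI:67040"],
    "AdoMet": ["CHEBI:15414"],
    "SAMe": ["CHEBI:33442"],
}
scores = match_scores(hits)
for cid, score in sorted(scores.items()):
    print(f"{cid}: {score:.6f}")
```

Under this reading, a score of 1.0 means every synonym pointed to the same entry, which is why single-candidate entities such as `A` and `P` show 1.000000 in the table.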


In [6]:
metrics

{'total_entities': 11,
 'entities_with_predictions': 10,
 'annotation_rate': 0.9090909090909091,
 'total_predictions': 30,
 'correct_matches': 9,
 'accuracy': 0.9,
 'total_time': 16.279825925827026,
 'llm_time': 14.760626077651978,
 'search_time': 1.4276001453399658}
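The two rates in this dictionary can be reproduced from its counts. The check below assumes that `annotation_rate` is `entities_with_predictions / total_entities` and that `accuracy` is `correct_matches / entities_with_predictions`; those denominators are inferred from the numbers, not taken from the CAMEO source. The variable name `reported` is used only for this sketch.

```python
# Counts copied from the metrics output above.
reported = {
    "total_entities": 11,
    "entities_with_predictions": 10,
    "total_predictions": 30,
    "correct_matches": 9,
}

# Assumed definitions (inferred from the reported values).
annotation_rate = reported["entities_with_predictions"] / reported["total_entities"]
accuracy = reported["correct_matches"] / reported["entities_with_predictions"]
predictions_per_entity = reported["total_predictions"] / reported["entities_with_predictions"]

print(f"annotation_rate = {annotation_rate:.4f}")
print(f"accuracy = {accuracy:.4f}")
print(f"predictions per annotated entity = {predictions_per_entity:.1f}")
```

If these definitions hold, accuracy is measured per entity, not per candidate, so 30 candidate predictions can still yield 90% accuracy when 9 of 10 entities have the existing annotation among their candidates.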

In [3]:
# Test using utils evaluation function
print("\n1.2 Testing utils evaluation function")
print("-" * 50)

try:
    result_df = evaluate_single_model(
        model_file=test_model_file,
        llm_model=llm_model,
        max_entities=max_entities_per_model,
        entity_type=entity_type,
        database=database,
        save_llm_results=True,
        output_dir=output_dir
    )
    
    if result_df is not None:
        print("✓ Utils evaluation test successful")
        print(f"  - Generated {len(result_df)} result rows")
        print(f"  - Average accuracy: {result_df['accuracy'].mean():.1%}")
        
        # Display sample results with updated column names
        print("\n  Sample results:")
        sample_cols = ['species_id', 'display_name', 'synonyms_LLM', 'predictions', 'accuracy', 'total_time']
        available_cols = [col for col in sample_cols if col in result_df.columns]
        print(result_df[available_cols].head().to_string(index=False))
        
        # Show formula-based metrics
        print("\n  Formula-based metrics:")
        formula_cols = ['species_id', 'recall_formula', 'precision_formula', 'recall_chebi', 'precision_chebi']
        available_formula_cols = [col for col in formula_cols if col in result_df.columns]
        print(result_df[available_formula_cols].head().to_string(index=False))
        
        # Show LLM results
        print("\n  LLM Results:")
        llm_cols = ['species_id', 'synonyms_LLM', 'reason']
        available_llm_cols = [col for col in llm_cols if col in result_df.columns]
        if available_llm_cols:
            for idx, row in result_df[available_llm_cols].head(3).iterrows():
                print(f"  {row['species_id']}: {row.get('synonyms_LLM', 'N/A')}")
            if 'reason' in result_df.columns and not result_df['reason'].empty:
                # Cast to str so a missing (NaN) reason does not raise on slicing
                print(f"  Reason: {str(result_df['reason'].iloc[0])[:100]}...")
        
        # Show match scores (updated from predictions_hits)
        print("\n  Match scores:")
        if 'match_score' in result_df.columns:
            for idx, row in result_df[['species_id', 'match_score']].head(3).iterrows():
                print(f"  {row['species_id']}: {row['match_score']}")
        
        # Show existing annotation names (now using proper ChEBI labels)
        print("\n  Existing annotation names:")
        if 'exist_annotation_name' in result_df.columns:
            for idx, row in result_df[['species_id', 'exist_annotation_name']].head(3).iterrows():
                print(f"  {row['species_id']}: {row['exist_annotation_name']}")
        
        # Save results
        os.makedirs(output_dir, exist_ok=True)
        result_file = os.path.join(output_dir, "single_model_test_results.csv")
        result_df.to_csv(result_file, index=False)
        print(f"\n  Results saved to: {result_file}")
        
    else:
        print("✗ Utils evaluation test failed: No results generated")
        
except Exception as e:
    print(f"✗ Utils evaluation test failed: {e}")
    import traceback
    traceback.print_exc()

2025-05-25 16:39:39,867 - INFO - Evaluating model: BIOMD0000000190.xml
2025-05-25 16:39:39,878 - INFO - Evaluating 10 entities in BIOMD0000000190.xml



1.2 Testing utils evaluation function
--------------------------------------------------


2025-05-25 16:39:41,592 - INFO - HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"


LLM results saved to: results/llama-3.3-70b-instruct/BIOMD0000000190_llm_results.txt
✓ Utils evaluation test successful
  - Generated 10 result rows
  - Average accuracy: 100.0%

  Sample results:
species_id                        display_name                                                                           synonyms_LLM                             predictions  accuracy  total_time
       SAM                S-adenosylmethionine                                                   [S-adenosylmethionine, AdoMet, SAMe] [CHEBI:15414, CHEBI:33442, CHEBI:67040]         1   19.295151
         A decarboxylated S-adenosylmethionine              [decarboxylated S-adenosylmethionine, S-adenosylmethioninamine, dcAdoMet]                           [CHEBI:15625]         1   19.295151
         P                          putrescine                                    [putrescine, 1,4-diaminobutane, butane-1,4-diamine]             [CHEBI:17148, CHEBI:326268]         1   19.295151
         S         

## Test 2: Batch Model Evaluation

Test the evaluation of multiple models in a directory.

In [4]:
print("\nTest 2: Batch Model Evaluation")
print("-" * 40)

# Check if model directory exists
if os.path.exists(model_dir):
    model_files = [f for f in os.listdir(model_dir) if f.endswith('.xml')]
    print(f"✓ Model directory found: {model_dir}")
    print(f"  - Found {len(model_files)} XML files")
    print(f"  - Will test first {min(num_models_to_test, len(model_files))} models")
else:
    print(f"✗ Model directory not found: {model_dir}")
    print("Skipping batch evaluation test.")
    model_files = []


Test 2: Batch Model Evaluation
----------------------------------------
✓ Model directory found: test_models
  - Found 3 XML files
  - Will test first 3 models
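`os.listdir` returns entries in arbitrary order, so "the first N models" is only reproducible if the listing is sorted first. The log output suggests the models are processed in ascending ID order, but whether `evaluate_models_in_folder` sorts internally is not confirmed here. A minimal sketch of a deterministic selection, using a hypothetical helper `select_models`:

```python
from pathlib import Path


def select_models(model_dir, num_models):
    """Return up to num_models .xml file names from model_dir, sorted by name.

    Hypothetical helper for illustration; not part of the CAMEO API.
    """
    directory = Path(model_dir)
    if not directory.is_dir():
        return []
    names = sorted(
        p.name for p in directory.iterdir() if p.is_file() and p.suffix == ".xml"
    )
    return names[:num_models]


# Usage: returns an empty list if the directory does not exist.
print(select_models("test_models", 5))
```

Sorting before slicing keeps the selected subset stable across machines and file systems, which matters when comparing results between runs.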


In [6]:
# Run batch evaluation if models are available
if model_files:
    print("\n2.1 Running batch evaluation")
    print("-" * 50)
    
    try:
        batch_results_df = evaluate_models_in_folder(
            model_dir=model_dir,
            num_models=min(num_models_to_test, len(model_files)),
            llm_model=llm_model,
            max_entities=max_entities_per_model,
            entity_type=entity_type,
            database=database,
            save_llm_results=True,
            output_dir=output_dir,
            output_file="batch_evaluation_results.csv",
            start_at=1
        )
        
        if not batch_results_df.empty:
            print("✓ Batch evaluation successful")
            print(f"  - Evaluated {batch_results_df['model'].nunique()} models")
            print(f"  - Generated {len(batch_results_df)} total result rows")
            print(f"  - Average accuracy: {batch_results_df['accuracy'].mean():.1%}")
            
            # Show updated metrics
            print("\n  Updated metrics summary:")
            if 'recall_formula' in batch_results_df.columns:
                print(f"  - Average recall (formula): {batch_results_df['recall_formula'].mean():.3f}")
                print(f"  - Average precision (formula): {batch_results_df['precision_formula'].mean():.3f}")
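            # Guard the ChEBI columns the same way as the formula columns
            if 'recall_chebi' in batch_results_df.columns:
                print(f"  - Average recall (ChEBI): {batch_results_df['recall_chebi'].mean():.3f}")
                print(f"  - Average precision (ChEBI): {batch_results_df['precision_chebi'].mean():.3f}")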
            
            # Show sample of LLM results
            print("\n  Sample LLM results:")
            if 'synonyms_LLM' in batch_results_df.columns:
                for idx, row in batch_results_df[['species_id', 'synonyms_LLM']].head(3).iterrows():
                    print(f"    {row['species_id']}: {row['synonyms_LLM']}")
            
            # Show match scores instead of predictions_hits
            print("\n  Sample match scores:")
            if 'match_score' in batch_results_df.columns:
                for idx, row in batch_results_df[['species_id', 'match_score']].head(3).iterrows():
                    print(f"    {row['species_id']}: {row['match_score']}")
            
            # Print summary statistics
            print("\n  Summary statistics:")
            print_evaluation_results(os.path.join(output_dir, "batch_evaluation_results.csv"))
            
        else:
            print("✗ Batch evaluation failed: No results generated")
            
    except Exception as e:
        print(f"✗ Batch evaluation failed: {e}")
        import traceback
        traceback.print_exc()
else:
    print("\nSkipping batch evaluation - no model files available")

2025-05-25 16:31:32,475 - INFO - Evaluating 3 models starting from index 1
2025-05-25 16:31:32,476 - INFO - Evaluating 1/3: BIOMD0000000190.xml
2025-05-25 16:31:32,476 - INFO - Evaluating model: BIOMD0000000190.xml
2025-05-25 16:31:32,494 - INFO - Evaluating 10 entities in BIOMD0000000190.xml



2.1 Running batch evaluation
--------------------------------------------------


2025-05-25 16:31:33,726 - INFO - HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-25 16:31:47,641 - INFO - Saved intermediate results to: results/batch_evaluation_results.csv_1.csv
2025-05-25 16:31:47,642 - INFO - Evaluating 2/3: BIOMD0000000508.xml
2025-05-25 16:31:47,642 - INFO - Evaluating model: BIOMD0000000508.xml
2025-05-25 16:31:47,646 - INFO - Evaluating 5 entities in BIOMD0000000508.xml


LLM results saved to: results/llama-3.3-70b-instruct/BIOMD0000000190_llm_results.txt


2025-05-25 16:31:50,066 - INFO - HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-25 16:31:52,557 - INFO - Saved intermediate results to: results/batch_evaluation_results.csv_2.csv
2025-05-25 16:31:52,557 - INFO - Evaluating 3/3: BIOMD0000000634.xml
2025-05-25 16:31:52,557 - INFO - Evaluating model: BIOMD0000000634.xml
2025-05-25 16:31:52,574 - INFO - Evaluating 5 entities in BIOMD0000000634.xml


LLM results saved to: results/llama-3.3-70b-instruct/BIOMD0000000508_llm_results.txt


2025-05-25 16:31:54,555 - INFO - HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-25 16:32:03,496 - INFO - Saved intermediate results to: results/batch_evaluation_results.csv_3.csv
2025-05-25 16:32:03,498 - INFO - Saved final results to: results/batch_evaluation_results.csv


LLM results saved to: results/llama-3.3-70b-instruct/BIOMD0000000634_llm_results.txt
✓ Batch evaluation successful
  - Evaluated 3 models
  - Generated 20 total result rows
  - Average accuracy: 80.0%

  Updated metrics summary:
  - Average recall (formula): 0.800
  - Average precision (formula): 0.750
  - Average recall (ChEBI): 0.800
  - Average precision (ChEBI): 0.424
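The gap between ChEBI recall and ChEBI precision is what one would expect when several candidates are returned per entity but only one matches the existing annotation. The sketch below uses the standard set-based definitions of precision and recall; it is an assumption that CAMEO computes `recall_chebi` and `precision_chebi` this way. The example uses the three candidates listed for `SAM` in the single-model results and its existing annotation `CHEBI:15414`.

```python
def precision_recall(predicted, reference):
    """Set-based precision and recall for one entity.

    Standard definitions; assumed, not confirmed, to match CAMEO's columns.
    """
    pred, ref = set(predicted), set(reference)
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall


# SAM: three candidate IDs, one existing annotation.
precision, recall = precision_recall(
    ["CHEBI:15414", "CHEBI:33442", "CHEBI:67040"],
    ["CHEBI:15414"],
)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

With these definitions, every extra candidate lowers precision while leaving recall unchanged, so averaging over entities with two or three candidates each pulls mean precision well below mean recall.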

  Sample LLM results:
    SAM: ['S-adenosylmethionine', 'AdoMet', 'SAMe']
    A: ['S-adenosylmethioninamine', 'decarboxylated S-adenosylmethionine', 'dcAdoMet']
    P: ['putrescine', '1,4-diaminobutane', 'butane-1,4-diamine']

  Sample match scores:
    SAM: [1.0, 0.3333333333333333, 0.6666666666666666]
    A: [1.0]
    P: [1.0, 0.3333333333333333]

  Summary statistics:
Number of models assessed: 3
Number of models with predictions: 3
Average accuracy (per model): 0.77
Ave. total time (per model): 10.20
Ave. total time (per element, per model): 1.53
Ave. LLM time (per model): 9.76
Ave. LLM time (per element, per mod