# Comparing LLM Models with Cellmage

This notebook compares the performance of two LLMs using the cellmage library.
The models compared are:
- OpenAI GPT-4.1-nano
- Google Gemini-2.5-flash

**Date:** April 24, 2025

## Setup and Configuration

Let's set up our development environment and import the necessary modules.

In [1]:
# Setup environment
import os
import sys
import logging

# Skip dotenv loading for testing
os.environ["CELLMAGE_SKIP_DOTENV"] = "1"

# Set up logging
logging.basicConfig(level=logging.INFO)

# Ensure the cellmage package can be imported
# Get the absolute path of the current working directory
notebook_dir = os.getcwd()
# Get the project root directory (parent of the notebook directory)
project_root = os.path.abspath(os.path.join(notebook_dir, ".."))

print(f"Notebook directory: {notebook_dir}")
print(f"Project root directory: {project_root}")

if project_root not in sys.path:
    sys.path.insert(0, project_root)
    print(f"Added path: {project_root}")

try:
    # Import cellmage
    import cellmage

    # Check version - handle case where __version__ might not be available
    try:
        print(f"Cellmage version: {cellmage.__version__}")
    except AttributeError:
        print("Cellmage imported successfully, but version information is not available")
except ModuleNotFoundError as e:
    print(f"Error importing cellmage: {e}")
    print("\nDebug information:")
    print(f"Current working directory: {os.getcwd()}")
    print(f"Python path: {sys.path}")
    print("\nTry running this notebook from the project root directory")

2025-04-24 07:02:48,516 - cellmage - INFO - Cellmage logging initialized


Notebook directory: /Users/tpinto/madpin/cellmage/notebooks
Project root directory: /Users/tpinto/madpin/cellmage
Added path: /Users/tpinto/madpin/cellmage
Cellmage version: 0.1.0


## Helper Functions for Model Comparison

Let's define some helper functions to assist with our model comparison.

In [2]:
import time


def create_chat_manager(model_name):
    """Create a chat manager for the specified model."""
    from cellmage.adapters.direct_client import DirectLLMAdapter
    from cellmage.resources.memory_loader import MemoryLoader
    from cellmage.storage.memory_store import MemoryStore

    # Create LLM client
    llm_client = DirectLLMAdapter()
    # Set the model name as an override
    llm_client.set_override("model", model_name)

    # Create components
    persona_loader = MemoryLoader()
    snippet_provider = MemoryLoader()
    history_store = MemoryStore()

    # Add standard persona
    persona_loader.add_persona(
        name="standard_persona",
        system_message="You are a helpful assistant that provides accurate and concise information.",
        config={"temperature": 0.5},
    )

    # Create chat manager
    chat_manager = cellmage.ChatManager(
        llm_client=llm_client,
        persona_loader=persona_loader,
        snippet_provider=snippet_provider,
        history_store=history_store,
    )

    chat_manager.set_default_persona("standard_persona")
    return chat_manager


def time_response(chat_manager, prompt):
    """Time how long it takes to get a response."""
    # Clear history to ensure fresh context
    chat_manager.clear_history()

    # Time the response with a monotonic clock suited to measuring intervals
    start_time = time.perf_counter()
    response = chat_manager.chat(prompt, stream=False)
    end_time = time.perf_counter()

    duration = end_time - start_time
    response_length = len(response) if response else 0

    return {
        "duration": duration,
        "response": response,
        "length": response_length,
        # Rough estimate that assumes about 4 characters per token
        "tokens_per_second": response_length / (4 * duration) if duration > 0 else 0,
    }


def compare_models(prompts, models=("gpt-4.1-nano", "gemini-2.5-flash")):
    """Compare multiple models across multiple prompts."""
    results = {}

    for model in models:
        print(f"\n===== Setting up {model} =====")
        chat_manager = create_chat_manager(model)
        results[model] = []

        for i, prompt in enumerate(prompts):
            print(f"\n- Testing prompt {i + 1}/{len(prompts)} on {model}")
            try:
                result = time_response(chat_manager, prompt)
                print(f"  Duration: {result['duration']:.2f}s, Length: {result['length']} chars")
                results[model].append(result)
            except Exception as e:
                print(f"  Error with {model}: {str(e)}")
                results[model].append({"error": str(e)})

    return results
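
A single timed call is sensitive to network jitter and cold starts, so one option is to repeat each measurement and report the median. The sketch below is a hypothetical extension, not part of cellmage: `time_repeated` and the stub `fake_chat` are names introduced here for illustration, and `fake_chat` merely stands in for a real model call.

```python
import statistics
import time


def time_repeated(call, repeats=3):
    """Run `call` several times and summarize the wall-clock durations.

    `call` is any zero-argument callable that returns the response text.
    """
    durations = []
    response = None
    for _ in range(repeats):
        start = time.perf_counter()
        response = call()
        durations.append(time.perf_counter() - start)

    median = statistics.median(durations)
    length = len(response) if response else 0
    return {
        "durations": durations,
        "median": median,
        "length": length,
        # Same rough heuristic as time_response: about 4 characters per token
        "tokens_per_second": length / (4 * median) if median > 0 else 0,
    }


def fake_chat():
    """Stand-in for a real model call (hypothetical, for illustration only)."""
    time.sleep(0.01)
    return "x" * 400


stats = time_repeated(fake_chat, repeats=3)
print(f"Median: {stats['median']:.3f}s over {len(stats['durations'])} runs")
```

With the helpers in this notebook, `call` could be something like `lambda: chat_manager.chat(prompt, stream=False)`, with `chat_manager.clear_history()` run before each repeat so every call starts from a fresh context.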

## Test Prompts

Let's define a variety of test prompts to evaluate different aspects of the models.

In [3]:
test_prompts = [
    # 1. Simple factual question
    "What is the capital of France and what are three interesting facts about it?",
    # 2. Code generation task
    """Write a Python function to check if a string is a palindrome. 
    The function should ignore spaces, punctuation, and capitalization.""",
    # 3. Creative writing task
    """Write a short story (around 150 words) about a robot that develops 
    emotions and how it navigates its first human friendship.""",
    # 4. Analytical reasoning task
    """What are the ethical considerations when implementing AI in healthcare? 
    Consider at least three perspectives and discuss potential solutions to ethical dilemmas.""",
    # 5. Technical explanation task
    """Explain how public key cryptography works in simple terms that a high school 
    student could understand. Include an analogy to help explain it.""",
]

## Running the Comparison

Now let's run our comparison between the GPT-4.1-nano and Gemini-2.5-flash models.

In [4]:
from datetime import datetime

# Run the comparison (Note: This may take some time to complete)
print(f"Starting model comparison at {datetime.now().strftime('%H:%M:%S')}")
comparison_results = compare_models(test_prompts)
print(f"\nComparison completed at {datetime.now().strftime('%H:%M:%S')}")

## Analyzing the Results

Let's analyze the results of our model comparison.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Create a summary dataframe
summary_data = []

for model, results in comparison_results.items():
    for i, result in enumerate(results):
        if "error" not in result:
            summary_data.append(
                {
                    "Model": model,
                    "Prompt": f"Prompt {i + 1}",
                    "Time (s)": result["duration"],
                    "Response Length": result["length"],
                    "Est. Tokens/Second": result["tokens_per_second"],
                }
            )

summary_df = pd.DataFrame(summary_data)
print(summary_df.to_string(index=False))

# Compare average response time and length
avg_stats = (
    summary_df.groupby("Model")
    .agg({"Time (s)": "mean", "Response Length": "mean", "Est. Tokens/Second": "mean"})
    .reset_index()
)

print("\nAverage Statistics by Model:")
print(avg_stats.to_string(index=False))

## Visualizing the Comparison

Let's create some visualizations to better understand the differences between models.

In [None]:
# Set figure size
plt.figure(figsize=(12, 10))

# Plot 1: Response Times by Prompt
plt.subplot(2, 2, 1)
for model in summary_df["Model"].unique():
    model_data = summary_df[summary_df["Model"] == model]
    plt.plot(model_data["Prompt"], model_data["Time (s)"], marker="o", label=model)
plt.title("Response Time by Prompt")
plt.ylabel("Time (seconds)")
plt.xticks(rotation=45)
plt.legend()
plt.grid(alpha=0.3)

# Plot 2: Response Length by Prompt
plt.subplot(2, 2, 2)
for model in summary_df["Model"].unique():
    model_data = summary_df[summary_df["Model"] == model]
    plt.plot(model_data["Prompt"], model_data["Response Length"], marker="o", label=model)
plt.title("Response Length by Prompt")
plt.ylabel("Length (characters)")
plt.xticks(rotation=45)
plt.legend()
plt.grid(alpha=0.3)

# Plot 3: Average Response Time
plt.subplot(2, 2, 3)
plt.bar(avg_stats["Model"], avg_stats["Time (s)"])
plt.title("Average Response Time")
plt.ylabel("Time (seconds)")
plt.grid(axis="y", alpha=0.3)

# Plot 4: Estimated Tokens per Second
plt.subplot(2, 2, 4)
plt.bar(avg_stats["Model"], avg_stats["Est. Tokens/Second"])
plt.title("Estimated Tokens per Second")
plt.ylabel("Tokens/Second")
plt.grid(axis="y", alpha=0.3)

plt.tight_layout()
plt.show()

## Qualitative Analysis

Now let's look at some selected responses for qualitative comparison.

In [None]:
# Select a couple of interesting prompts for detailed comparison
interesting_prompts = [0, 2]  # Factual and Creative tasks

for i in interesting_prompts:
    print(f"\n===== Prompt {i + 1}: {test_prompts[i][:50]}... =====")

    for model, results in comparison_results.items():
        if i < len(results) and "error" not in results[i]:
            print(f"\n{model} response (Time: {results[i]['duration']:.2f}s):")
            print("-" * 50)
            print(results[i]["response"][:500] + ("..." if len(results[i]["response"]) > 500 else ""))
            print("-" * 50)

## Testing with a Complex Coding Task

Let's test both models with a more complex coding task to evaluate their programming capabilities.

In [None]:
complex_coding_prompt = """
Create a Python class for a simple task management system with these features:

1. Add tasks with priority levels (high, medium, low)
2. Mark tasks as complete
3. List tasks filtered by priority or completion status
4. Delete tasks
5. Export tasks to JSON format

Include appropriate error handling, documentation, and an example usage of the class.
"""

import time  # no-op if an earlier cell already imported it

print("Testing complex coding task...")
coding_results = {}

for model in ["gpt-4.1-nano", "gemini-2.5-flash"]:
    print(f"\nTesting {model}...")

    try:
        chat_manager = create_chat_manager(model)
        start_time = time.perf_counter()
        response = chat_manager.chat(complex_coding_prompt, stream=False)
        duration = time.perf_counter() - start_time

        coding_results[model] = {"response": response, "duration": duration}

        print(f"Response received in {duration:.2f} seconds")
    except Exception as e:
        print(f"Error: {str(e)}")
        coding_results[model] = {"error": str(e)}

## Comparing Code Quality

Let's extract the code from each model's response so that its quality can be reviewed side by side.

In [None]:
import re


def extract_python_code(text):
    """Extract Python code blocks from text."""
    # Match code blocks in markdown format with ```python ... ``` or just ``` ... ```
    pattern = r"```(?:python)?\n([\s\S]*?)\n```"
    matches = re.findall(pattern, text)

    if matches:
        return "\n\n".join(matches)
    return None


for model, result in coding_results.items():
    if "error" not in result:
        print(f"\n===== {model} Code Solution =====")
        print(f"Response time: {result['duration']:.2f} seconds")

        # Extract and display the code
        code = extract_python_code(result["response"])
        if code:
            print("\nCode sample:")
            print("-" * 80)
            print(code)
            print("-" * 80)

## Testing Model Response to Historical Context

Let's see how well each model handles a multi-turn conversation in which later questions depend on earlier turns.

In [None]:
def test_contextual_memory(model_name):
    """Test how well a model maintains context across multiple turns."""
    print(f"\n===== Testing contextual memory for {model_name} =====")

    chat_manager = create_chat_manager(model_name)

    # Multi-turn conversation about a fictional character
    prompts = [
        "Let's create a fictional character named Alex who is a software developer with a passion for hiking.",
        "What programming languages might Alex be proficient in given their career?",
        "What hiking gear would you recommend for Alex if they're planning a 3-day trek in the mountains?",
        "Alex is considering a career change to become a park ranger. What transferable skills from software development would be useful?",
    ]

    responses = []
    for i, prompt in enumerate(prompts):
        print(f"\nTurn {i + 1}: {prompt}")
        try:
            response = chat_manager.chat(prompt, stream=False)
            print(f"Response length: {len(response)} characters")
            responses.append({"prompt": prompt, "response": response})
        except Exception as e:
            print(f"Error: {str(e)}")

    return responses


# Test both models
gpt_context_test = test_contextual_memory("gpt-4.1-nano")
gemini_context_test = test_contextual_memory("gemini-2.5-flash")
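
The test above records each response but does not check whether later turns actually use what was established earlier. One crude heuristic is to look for details from the first turn (for example the name Alex or the hiking hobby) in each response. The sketch below assumes the list-of-dicts shape returned by `test_contextual_memory`; `mentions_per_turn` and the example responses are hypothetical and written only for illustration. Keyword presence is a weak signal, not proof that the model kept track of the conversation.

```python
def mentions_per_turn(responses, keywords):
    """For each turn, list which keywords appear in the response (case-insensitive)."""
    report = []
    for turn, item in enumerate(responses, start=1):
        text = item["response"].lower()
        found = [kw for kw in keywords if kw.lower() in text]
        report.append({"turn": turn, "found": found})
    return report


# Illustrative stand-in for the list returned by test_contextual_memory
example_responses = [
    {"prompt": "first turn", "response": "Alex is a software developer who loves hiking."},
    {"prompt": "second turn", "response": "They probably know Python and TypeScript."},
]

context_report = mentions_per_turn(example_responses, ["Alex", "hiking"])
for row in context_report:
    print(row)
```

Applied to `gpt_context_test` and `gemini_context_test`, a report like this shows at a glance which turns still refer to the character by name or mention the earlier details.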

## Conclusion: Model Comparison Summary

Let's summarize what we've learned from comparing the GPT-4.1-nano and Gemini-2.5-flash models.

In [None]:
def format_summary(model, results, context_results, coding_result):
    """Format a summary of model performance."""
    # Calculate average stats only when every prompt succeeded; a single
    # failed prompt makes both averages "N/A"
    if results and all("error" not in r for r in results):
        avg_time = sum(r["duration"] for r in results) / len(results)
        avg_length = sum(r["length"] for r in results) / len(results)
    else:
        avg_time = avg_length = "N/A"

    # Check if a coding result is available. "Code quality" only records
    # whether a fenced code block was found, not how good the code is
    if coding_result and "error" not in coding_result:
        coding_time = coding_result["duration"]
        code_quality = "Available" if extract_python_code(coding_result["response"]) else "Not found"
    else:
        coding_time = "N/A"
        code_quality = "N/A"

    # Context handling is a proxy: "Good" means at least 3 turns returned a response
    context_quality = "Good" if context_results and len(context_results) >= 3 else "Limited"

    return f"""
    Model: {model}
    Average response time: {avg_time if isinstance(avg_time, str) else f"{avg_time:.2f}s"}
    Average response length: {avg_length if isinstance(avg_length, str) else f"{int(avg_length)} chars"}
    Complex coding task time: {coding_time if isinstance(coding_time, str) else f"{coding_time:.2f}s"}
    Code quality: {code_quality}
    Context handling: {context_quality}
    """


# Generate summaries
gpt_summary = format_summary(
    "GPT-4.1-nano",
    comparison_results.get("gpt-4.1-nano", []),
    gpt_context_test,
    coding_results.get("gpt-4.1-nano", {}),
)

gemini_summary = format_summary(
    "Gemini-2.5-flash",
    comparison_results.get("gemini-2.5-flash", []),
    gemini_context_test,
    coding_results.get("gemini-2.5-flash", {}),
)

print("===== Model Comparison Summary =====")
print(gpt_summary)
print(gemini_summary)
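
`format_summary` reports "N/A" for both averages as soon as one prompt fails. An alternative, sketched below, is to average over the successful runs and report the number of failures alongside. `summarize_runs` is a hypothetical helper, and the example values are synthetic placeholders rather than measurements.

```python
def summarize_runs(results):
    """Average duration and length over successful runs and count failures."""
    ok = [r for r in results if "error" not in r]
    failed = len(results) - len(ok)
    if not ok:
        return {"runs": len(results), "failed": failed, "avg_time": None, "avg_length": None}
    return {
        "runs": len(results),
        "failed": failed,
        "avg_time": sum(r["duration"] for r in ok) / len(ok),
        "avg_length": sum(r["length"] for r in ok) / len(ok),
    }


# Synthetic values for illustration only, not measurements
example_results = [
    {"duration": 1.0, "length": 100},
    {"error": "timeout"},
    {"duration": 3.0, "length": 300},
]

run_summary = summarize_runs(example_results)
print(run_summary)
```

Reporting the failure count next to the averages keeps a partial run informative while making it clear that the averages cover fewer prompts.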

## Final Observations

The points below are tentative expectations to verify against the results produced above; each model may show different strengths:

**GPT-4.1-nano:**
- May excel in coding tasks and technical explanations
- Could have stronger contextual memory across conversation turns

**Gemini-2.5-flash:**
- May excel in creative content generation
- Could offer faster response times

The cellmage library provides a consistent interface for working with both models, making it easy to switch between them or combine their capabilities for different use cases.

When deciding which model to use for a particular application, consider the specific requirements of your task, such as response time constraints, complexity of the problem, and the importance of contextual understanding.