# Optimizing Language Models with DSPy GEPA: From 42% to 64% Accuracy

_Authored by: [Behrooz Azarkhalili](https://github.com/behroozazarkhalili)_

This notebook demonstrates how to use DSPy's GEPA (Genetic-Pareto) optimizer to improve language model performance on mathematical reasoning tasks. We'll work with the NuminaMath-1.5 dataset and show how GEPA can raise accuracy from 42% to 64% through automated prompt optimization.

**What you'll learn:**
- Setting up DSPy with local (Ollama) or cloud (OpenRouter) language models
- Processing and filtering mathematical problem datasets
- Building a baseline Chain-of-Thought reasoning program
- Optimizing prompts with GEPA using error-driven feedback
- Evaluating improvements in model accuracy

**Key Results:**
- Baseline accuracy: 42.3% (569/1344 correct)
- Optimized accuracy: 64.0% (860/1344 correct)
- **+21.7 percentage-point improvement** through automated prompt engineering

GEPA works by analyzing errors, generating targeted feedback, and automatically refining prompts to address common failure patterns. This makes it particularly effective for complex reasoning tasks where prompt quality significantly impacts performance.
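
The "Pareto" half of GEPA's name refers to how candidate prompts are kept alive during the search: a candidate is worth keeping if it is the best on at least one validation example, even when its average score is lower. The block below is a simplified illustration of that idea, not DSPy's implementation. The helper `pareto_candidates`, the prompt names, and the score table are all hypothetical.

```python
from collections import Counter


def pareto_candidates(scores: dict[str, list[float]]) -> Counter:
    """Count, per candidate, the examples on which it ties for the best score."""
    num_examples = len(next(iter(scores.values())))
    wins = Counter()
    for i in range(num_examples):
        # Best score any candidate achieved on example i
        best = max(per_example[i] for per_example in scores.values())
        for name, per_example in scores.items():
            if per_example[i] == best:
                wins[name] += 1
    return wins


# Hypothetical per-example scores (1 = correct, 0 = incorrect) for three prompts
scores = {
    "prompt_a": [1, 1, 0, 1],
    "prompt_b": [0, 0, 1, 1],
    "prompt_c": [0, 0, 0, 0],
}
frontier = pareto_candidates(scores)
print(dict(frontier))
```

Here `prompt_b` stays in the pool because it is the only candidate that solves the third example, while `prompt_c` is never best anywhere and drops out.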

## Installation and Setup

Install the required dependencies (for example, `pip install dspy datasets`), then import the libraries for DSPy, dataset processing, and environment access.

In [1]:
import dspy
from datasets import load_dataset
import os

## Language Model Configuration

Configure your language model - either local (Ollama) or cloud-based (OpenRouter) - for use with DSPy.

In [None]:
# ============================================
# OPTION 1: Local Ollama Configuration
# ============================================
# Prerequisites:
# 1. Install Ollama: curl -fsSL https://ollama.ai/install.sh | sh
# 2. Pull the models: ollama pull gemma2:9b && ollama pull gemma2:27b

# Main LM for inference
# ollama_llm = dspy.LM(
#     model='ollama_chat/gemma2:9b',  # Format: ollama_chat/{model_name}
#     api_base='http://localhost:11434',  # Ollama default endpoint
#     api_key='',  # Empty string for local Ollama
#     max_tokens=65536,
#     temperature=1.0
# )

# Reflection LM for GEPA optimization (can be same or larger model)
# reflection_lm = dspy.LM(
#     model='ollama_chat/gemma2:27b',  # Use larger model for better reflection
#     api_base='http://localhost:11434',
#     api_key='',
#     max_tokens=65536,
#     temperature=1.0
# )

# Set Ollama as default LM
# dspy.configure(lm=ollama_llm)

# print("✅ Ollama LM configured successfully!")
# print(f"Main model: {ollama_llm.model}")
# print(f"Reflection model: {reflection_lm.model}")

In [5]:
# ============================================
# OPTION 2: Cloud OpenRouter Configuration
# ============================================
# Active by default; comment out this cell if you use Ollama instead
# Requires OPENROUTER_API_KEY environment variable

# Main LM for inference
open_router_lm = dspy.LM(
    'openrouter/openai/gpt-4.1-nano', 
    api_key=os.getenv('OPENROUTER_API_KEY'), 
    api_base='https://openrouter.ai/api/v1',
    max_tokens=65536,
    temperature=1.0
)

# Reflection LM for GEPA optimization
reflection_lm = dspy.LM(
    'openrouter/meta-llama/llama-4-scout', 
    api_key=os.getenv('OPENROUTER_API_KEY'), 
    api_base='https://openrouter.ai/api/v1',
    max_tokens=65536,
    temperature=1.0
)

# Set OpenRouter as default LM
dspy.configure(lm=open_router_lm)
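
`os.getenv` returns `None` when the variable is missing, so a forgotten key only surfaces later as an authentication error. A minimal sketch of an early check follows. `require_env` is a hypothetical helper, not part of DSPy.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set.")
    return value


# Usage (would replace os.getenv in the dspy.LM calls above):
# api_key = require_env("OPENROUTER_API_KEY")
```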


## Dataset Loading and Filtering

Load the NuminaMath-1.5 dataset and keep only problems whose answers parse as integers, so predictions can be scored by exact match.

In [None]:
train_split = load_dataset("AI-MO/NuminaMath-1.5")['train']

In [None]:
def is_numeric_answer(answer: str) -> bool:
    """
    Check if an answer can be converted to an integer.
    
    Args:
        answer: The answer string to validate
    
    Returns:
        True if answer can be converted to int, False otherwise
    """
    try:
        int(answer)  # Attempt conversion to integer
        return True
    except (ValueError, TypeError):
        return False
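
The filter is stricter than "numeric" might suggest: only values that Python's `int()` accepts pass, so decimals, fractions, and LaTeX-formatted answers are dropped. A short sketch with made-up sample answers shows the behavior (the function is repeated so the block runs on its own):

```python
def is_numeric_answer(answer) -> bool:
    """Same check as above: True only if int() accepts the value."""
    try:
        int(answer)
        return True
    except (ValueError, TypeError):
        return False


# Illustrative answer strings, not taken from the dataset
samples = ["42", " -7 ", "3.5", "\\frac{1}{2}", "1,000", None]
results = {repr(s): is_numeric_answer(s) for s in samples}
for sample, ok in results.items():
    print(f"{sample:>16} -> {ok}")
```

Only `"42"` and `" -7 "` pass; `int()` tolerates surrounding whitespace and a sign but rejects decimal points, commas, and non-string `None`.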

In [None]:
# Keep only the samples whose 'answer' field parses as an integer
train_split = train_split.filter(lambda x: is_numeric_answer(x['answer']))

In [None]:
print(train_split[12]['answer'])

In [None]:
def init_dataset(
    train_split_ratio: float = None, 
    test_split_ratio: float = None, 
    val_split_ratio: float = None, 
    sample_fraction: float = 1.0
) -> tuple[list, list, list]:
    """
    Initialize and split the NuminaMath-1.5 dataset into train/val/test sets.
    
    Loads the dataset, filters for numeric answers, converts to DSPy Examples,
    shuffles with fixed seed for reproducibility, and optionally samples a fraction.
    
    Args:
        train_split_ratio: Proportion for training (default: 0.5)
        test_split_ratio: Proportion for testing (default: 0.45)
        val_split_ratio: Proportion for validation (default: 0.05)
        sample_fraction: Fraction of dataset to use (default: 1.0 = full dataset)
    
    Returns:
        Tuple of (train_set, val_set, test_set) as lists of DSPy Examples
    
    Raises:
        AssertionError: If split ratios don't sum to 1.0
    """
    # Set default split ratios
    if train_split_ratio is None:
        train_split_ratio = 0.5
    if test_split_ratio is None:
        test_split_ratio = 0.45
    if val_split_ratio is None:
        val_split_ratio = 0.05
    
    # Validate split ratios sum to 1.0 (with a tolerance for floating-point rounding)
    assert abs(train_split_ratio + test_split_ratio + val_split_ratio - 1.0) < 1e-9, \
        "Ratios must sum to 1.0"

    # Load dataset from Hugging Face Hub
    train_split = load_dataset("AI-MO/NuminaMath-1.5")['train']
    
    # Filter for problems with numeric answers only
    train_split = train_split.filter(lambda x: is_numeric_answer(x['answer']))
    
    # Convert to DSPy Examples with input/output fields
    train_split = [
        dspy.Example({
            "problem": x['problem'],
            'solution': x['solution'],
            'answer': x['answer'],
        }).with_inputs("problem")  # Mark 'problem' as input field
        for x in train_split
    ]
    
    # Shuffle with fixed seed for reproducibility
    import random
    random.Random(0).shuffle(train_split)
    tot_num = len(train_split)
    print(f"Total number of examples after filtering: {tot_num}")

    # Apply sampling if requested
    if sample_fraction < 1.0:
        sample_num = int(tot_num * sample_fraction)
        train_split = train_split[:sample_num]
        tot_num = sample_num
        print(f"Sampled down to {sample_num} examples.")
    
    # Split into train/val/test based on ratios
    train_end = int(train_split_ratio * tot_num)
    val_end = int((train_split_ratio + val_split_ratio) * tot_num)
    
    train_set = train_split[:train_end]
    val_set = train_split[train_end:val_end]
    test_set = train_split[val_end:]

    return train_set, val_set, test_set
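
The shuffle-and-slice logic can be tried on a toy list without downloading the dataset. The sketch below assumes a hypothetical `split_by_ratio` helper that mirrors the index arithmetic in `init_dataset`. It compares the ratio sum with `math.isclose`, because sums such as `0.7 + 0.2 + 0.1` are not exactly `1.0` in floating point.

```python
import math
import random


def split_by_ratio(items, train_ratio=0.5, val_ratio=0.05, test_ratio=0.45, seed=0):
    """Shuffle a copy of items and cut it into train/val/test slices."""
    # isclose tolerates floating-point rounding in the ratio sum
    if not math.isclose(train_ratio + val_ratio + test_ratio, 1.0):
        raise ValueError("Ratios must sum to 1.0")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    train_end = int(train_ratio * len(shuffled))
    val_end = int((train_ratio + val_ratio) * len(shuffled))
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


# Default 50/5/45 split on 100 toy items
train, val, test = split_by_ratio(range(100))
print(len(train), len(val), len(test))

# A 70/20/10 split whose ratio sum is only approximately 1.0
train2, val2, test2 = split_by_ratio(range(100), 0.7, 0.2, 0.1)
print(len(train2), len(val2), len(test2))
```

Because the test slice takes everything after `val_end`, no example is lost to integer truncation.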

## Dataset Preparation Functions

With `init_dataset` defined above, build the train/val/test splits from a 1% sample of the filtered dataset and preview a few examples.

In [None]:
train_set, val_set, test_set = init_dataset(sample_fraction=0.01)

len(train_set), len(val_set), len(test_set)

In [None]:
print("Problem:")
print(train_set[0]['problem'])
print("\n\nSolution:")
print(train_set[0]['solution'])
print("\n\nAnswer:")
print(train_set[0]['answer'])

In [None]:
print(test_set[0]['problem'])
print("\n\nAnswer:")
print(test_set[0]['answer'])

In [None]:
class GenerateResponse(dspy.Signature):
    """Solve the problem and provide the answer in the correct format."""
    problem = dspy.InputField()
    answer = dspy.OutputField()

program = dspy.ChainOfThought(GenerateResponse)

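## Baseline Chain-of-Thought Program

The `program` defined above is a simple baseline built on DSPy's Chain-of-Thought module. To establish its initial performance, define an exact-match metric and evaluate the program on the test set.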

In [None]:
def metric(
    example: dspy.Example, 
    prediction: dspy.Prediction, 
    trace=None, 
    pred_name=None, 
    pred_trace=None
) -> int:
    """
    Evaluation metric comparing model prediction against ground truth.
    
    Extracts integer answers from both example and prediction, returning 1 for
    exact match and 0 for mismatch or parsing failures.
    
    Args:
        example: DSPy Example containing ground truth 'answer'
        prediction: DSPy Prediction containing model's 'answer'
        trace: Optional trace information (unused)
        pred_name: Optional prediction name (unused)
        pred_trace: Optional prediction trace (unused)
    
    Returns:
        1 if answers match exactly, 0 otherwise
    """
    # Extract ground truth as integer
    correct_answer = int(example['answer'])
    
    try:
        # Attempt to parse model's answer as integer
        llm_answer = int(prediction.answer)
    except (ValueError, TypeError):
        # Return 0 if answer can't be parsed
        return 0
    
    # Return 1 for exact match, 0 for mismatch
    return int(correct_answer == llm_answer)
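
The metric is strict about format: an answer counts only if `int()` can parse the model's output directly. The standalone sketch below uses a hypothetical `exact_int_match` helper working on plain values instead of DSPy objects. Unlike `metric`, it also returns 0 when the reference answer itself cannot be parsed.

```python
def exact_int_match(gold, predicted) -> int:
    """Return 1 if both values parse to the same integer, else 0."""
    try:
        return int(int(gold) == int(predicted))
    except (ValueError, TypeError):
        # Unparseable or missing answers count as wrong
        return 0


# Illustrative model outputs scored against the reference answer "42"
cases = ["42", " 42\n", "42.0", "The answer is 42", None, "41"]
for predicted in cases:
    print(f"{predicted!r:>20} -> {exact_int_match('42', predicted)}")
```

Only the first two cases score 1. A correct value wrapped in prose or written as `42.0` scores 0, which is why the feedback metric used for optimization spells out the expected format.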

In [None]:
# Evaluate on the held-out test set with the exact-match metric
evaluate = dspy.Evaluate(
    devset=test_set,
    metric=metric,
    num_threads=32,
    display_table=True,
    display_progress=True
)

evaluate(program)

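## Evaluation Metric

The `metric` function defined above compares model predictions against ground truth answers. As a sanity check, run the program on a single example to confirm it returns an answer in a parseable form.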

In [None]:
# Sanity check: run the program on a single test example
print("=== Testing program on a single example ===")
test_example = test_set[0]
print(f"Input problem: {test_example.problem[:100]}...")
print(f"Expected answer: {test_example.answer}")

try:
    # Call the program with a keyword argument matching the signature's input field
    prediction = program(problem=test_example.problem)
    print(f"Program prediction: {prediction}")
    print(f"Prediction answer: {prediction.answer}")
    print(f"Prediction type: {type(prediction.answer)}")
    print("✅ Program works!")
except Exception as e:
    print(f"❌ Program failed: {e}")
    import traceback
    traceback.print_exc()

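## Baseline Evaluation

The baseline Chain-of-Thought evaluation establishes our starting accuracy before optimization. GEPA additionally needs a metric that returns textual feedback alongside the score, followed by the optimizer configuration.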

In [None]:
def metric_with_feedback(
    example: dspy.Example, 
    prediction: dspy.Prediction, 
    trace=None, 
    pred_name=None, 
    pred_trace=None
) -> dspy.Prediction:
    """
    Enhanced evaluation metric with detailed feedback for GEPA optimization.
    
    Evaluates predictions and generates targeted feedback including error analysis
    and the complete solution for learning. Feedback helps GEPA identify failure
    patterns and improve prompts.
    
    Args:
        example: DSPy Example with ground truth answer and solution
        prediction: DSPy Prediction with model's answer
        trace: Optional trace information (unused)
        pred_name: Optional prediction name (unused)
        pred_trace: Optional prediction trace (unused)
    
    Returns:
        DSPy Prediction with score (0 or 1) and detailed feedback text
    """
    # Extract ground truth and solution
    correct_answer = int(example['answer'])
    written_solution = example.get('solution', '')
    
    try:
        # Attempt to parse model's answer
        llm_answer = int(prediction.answer)
    except (ValueError, TypeError):
        # Handle parsing failure with detailed feedback
        feedback_text = (
            f"The final answer must be a valid integer and nothing else. "
            f"You responded with '{prediction.answer}', which couldn't be parsed as a python integer. "
            f"Please ensure your answer is a valid integer without any additional text or formatting."
        )
        feedback_text += f" The correct answer is '{correct_answer}'."
        
        # Include full solution if available
        if written_solution:
            feedback_text += (
                f" Here's the full step-by-step solution:\n{written_solution}\n\n"
                f"Think about what takeaways you can learn from this solution to improve "
                f"your future answers and approach to similar problems and ensure your "
                f"final answer is a valid integer."
            )
        return dspy.Prediction(score=0, feedback=feedback_text)

    # Score: 1 for correct, 0 for incorrect
    score = int(correct_answer == llm_answer)

    # Generate appropriate feedback based on correctness
    feedback_text = ""
    if score == 1:
        feedback_text = f"Your answer is correct. The correct answer is '{correct_answer}'."
    else:
        feedback_text = f"Your answer is incorrect. The correct answer is '{correct_answer}'."
    
    # Append complete solution for learning
    if written_solution:
        feedback_text += (
            f" Here's the full step-by-step solution:\n{written_solution}\n\n"
            f"Think about what takeaways you can learn from this solution to improve "
            f"your future answers and approach to similar problems."
        )

    return dspy.Prediction(score=score, feedback=feedback_text)
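
To see the kind of text the reflection step receives, the scoring and feedback rules can be condensed into a plain function that needs no DSPy objects. `build_feedback` is a hypothetical, shortened variant of `metric_with_feedback` with abbreviated wording. It is meant for inspection only.

```python
def build_feedback(correct_answer: int, raw_answer, solution: str = "") -> tuple[int, str]:
    """Return (score, feedback) following the rules of the feedback metric."""
    try:
        score = int(int(raw_answer) == correct_answer)
        verdict = "correct" if score else "incorrect"
        feedback = f"Your answer is {verdict}. The correct answer is '{correct_answer}'."
    except (ValueError, TypeError):
        # Unparseable output: explain the format requirement
        score = 0
        feedback = (
            "The final answer must be a valid integer and nothing else. "
            f"You responded with '{raw_answer}'. The correct answer is '{correct_answer}'."
        )
    if solution:
        feedback += f" Here's the full step-by-step solution:\n{solution}"
    return score, feedback


# Toy example: a model that answers in words instead of digits
score, feedback = build_feedback(12, "twelve", solution="6 + 6 = 12")
print(score)
print(feedback)
```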

In [None]:
from dspy import GEPA

optimizer = GEPA(
    metric=metric_with_feedback,
    auto="heavy",
    num_threads=32,
    track_stats=True,
    reflection_minibatch_size=16,
    track_best_outputs=True,
    add_format_failure_as_feedback=True,
    reflection_lm=reflection_lm
)


## GEPA Optimization

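Apply the GEPA optimizer with error-driven feedback to automatically improve the prompt and boost performance. Calling `compile` runs the search on the training set and uses the validation set to score candidate prompts.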

In [None]:
optimized_program = optimizer.compile(
    program,
    trainset=train_set,
    valset=val_set,
)

In [None]:
print(optimized_program.predict.signature.instructions)

## Optimized Program Evaluation

Evaluate the GEPA-optimized program on the same test set to measure the accuracy improvement over the baseline.

In [None]:
evaluate(optimized_program)
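
The counts reported in the Key Results (569 and 860 correct out of 1344) can be turned into absolute and relative gains with a few lines of arithmetic. The variable names here are illustrative.

```python
total = 1344
baseline_correct, optimized_correct = 569, 860

baseline_acc = baseline_correct / total
optimized_acc = optimized_correct / total

# Absolute gain in percentage points vs. gain relative to the baseline
absolute_gain_pp = (optimized_acc - baseline_acc) * 100
relative_gain = (optimized_correct - baseline_correct) / baseline_correct

print(f"Baseline:      {baseline_acc:.1%}")               # 42.3%
print(f"Optimized:     {optimized_acc:.1%}")              # 64.0%
print(f"Absolute gain: {absolute_gain_pp:.1f} points")    # 21.7 points
print(f"Relative gain: {relative_gain:.1%}")              # 51.1%
```

The improvement is 21.7 percentage points, which corresponds to roughly 51% more problems solved than the baseline.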