# Optimizing Language Models with DSPy GEPA

_Authored by: [Behrooz Azarkhalili](https://github.com/behroozazarkhalili)_

This notebook demonstrates how to use DSPy's GEPA (Genetic-Pareto) optimizer, a reflective prompt optimizer, to improve language model performance on mathematical reasoning tasks. We'll work with the NuminaMath-1.5 dataset and show how GEPA can boost accuracy through automated prompt optimization.

**What you'll learn:**
- Setting up DSPy with language models (OpenRouter) 
- Processing and filtering mathematical problem datasets
- Building a baseline Chain-of-Thought reasoning program
- Optimizing prompts with GEPA using error-driven feedback
- Evaluating improvements in model accuracy


GEPA works by analyzing errors, generating targeted feedback, and automatically refining prompts to address common failure patterns. This makes it particularly effective for complex reasoning tasks where prompt quality significantly impacts performance.

## Installation and Setup

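Install the required dependencies (the imports in this notebook need the `dspy`, `datasets`, and `python-dotenv` packages), then import the libraries used for DSPy, dataset processing, and model configuration.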

In [2]:
import dspy
from datasets import load_dataset
import os

## Language Model Configuration

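Configure the language models for use with DSPy. This notebook uses cloud-hosted models through OpenRouter; DSPy can also target a local backend such as Ollama, but only the OpenRouter setup is shown here.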

In [3]:
from dotenv import load_dotenv
load_dotenv("../../.env")

True
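
`load_dotenv` returning `True` only means the `.env` file was found, not that the key is present. As a minimal sketch, a fail-fast check can surface a missing key before any model call is made. The helper name `require_env` is hypothetical and is not part of DSPy or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for illustration; not part of DSPy or python-dotenv.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment."
        )
    return value


# Example usage (commented out so the sketch runs without a real key):
# api_key = require_env("OPENROUTER_API_KEY")
```

The returned value could then be passed as `api_key=` when constructing the `dspy.LM` objects in the configuration cell.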

### Model Selection Rationale

**Main LM: `openrouter/openai/gpt-4.1-nano`**

*Primary Role:* High-volume inference during baseline evaluation and GEPA optimization iterations

*Key Selection Criteria:*
1. **Cost Efficiency** - $0.10/M input tokens, $0.40/M output tokens (~90% cheaper than GPT-4.1 or Claude)
2. **Low Latency** - Fastest GPT-4.1 variant, enables rapid iteration with 16-32 parallel threads
3. **Adequate Performance** - Solid general benchmarks (MMLU: 80.1%, GPQA: 50.3%); the baseline measured later in this notebook is about 52% on the sampled test set
4. **Context Window** - 1M tokens for long chain-of-thought reasoning

---

**Reflection LM: `openrouter/qwen/qwen3-next-80b-a3b-thinking`**

*Primary Role:* Deep error analysis and prompt improvement during GEPA's reflection phase

*Key Selection Criteria:*
1. **Advanced Reasoning** - "Thinking" variant specialized for analytical reasoning and pattern identification
2. **Quality Over Speed** - ~16 reflection calls vs 2000+ inference calls, can afford slower, higher-quality model
3. **Context Handling** - Long context window (about 262K tokens natively) for processing multiple training examples
4. **Cost Trade-off** - More expensive per token but negligible total cost due to low volume

**Architecture Philosophy:** Use a cheap, fast model for high-volume inference (99% of calls) and a smart, analytical model for low-volume reflection (1% of calls). This asymmetric design optimizes for both cost efficiency and learning quality.
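
The arithmetic behind this asymmetric design can be sketched as below. The prices and call counts come from the text above; the per-call token counts are hypothetical placeholders chosen only to show the calculation, not measurements from this notebook.

```python
def estimate_cost_usd(
    n_calls: int,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Estimate total cost in USD from per-million-token prices."""
    input_cost = n_calls * input_tokens_per_call / 1_000_000 * input_price_per_m
    output_cost = n_calls * output_tokens_per_call / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Prices ($0.10/M input, $0.40/M output) and the 2000-call figure are from the text.
# 300 input and 500 output tokens per call are ASSUMED values for illustration.
main_cost = estimate_cost_usd(2000, 300, 500, 0.10, 0.40)

# Share of calls that go to the reflection model (~16 reflection vs 2000 inference calls).
reflection_share = 16 / (2000 + 16)

print(f"main LM: ${main_cost:.2f}, reflection share of calls: {reflection_share:.1%}")
```

Under these assumed token counts the main model's bill stays well under a dollar, and the reflection model handles under 1% of calls, which is why its higher per-token price has little effect on the total.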

In [4]:
# ============================================
# OpenRouter Language Model Configuration
# ============================================
# Requires OPENROUTER_API_KEY environment variable
# Sign up at https://openrouter.ai/ to get your API key

# Main LM for inference
open_router_lm = dspy.LM(
    'openrouter/openai/gpt-4.1-nano', 
    api_key=os.getenv('OPENROUTER_API_KEY'), 
    api_base='https://openrouter.ai/api/v1',
    max_tokens=65536,
    temperature=1.0
)

# Reflection LM for GEPA optimization
reflection_lm = dspy.LM(
    'openrouter/qwen/qwen3-next-80b-a3b-thinking', 
    api_key=os.getenv('OPENROUTER_API_KEY'), 
    api_base='https://openrouter.ai/api/v1',
    max_tokens=65536,
    temperature=1.0
)

# Set OpenRouter as default LM
dspy.configure(lm=open_router_lm)

print("✅ OpenRouter LM configured successfully!")
print("Main model: openrouter/openai/gpt-4.1-nano")
print("Reflection model: openrouter/qwen/qwen3-next-80b-a3b-thinking")

✅ OpenRouter LM configured successfully!
Main model: openrouter/openai/gpt-4.1-nano
Reflection model: openrouter/qwen/qwen3-next-80b-a3b-thinking


## Dataset Preparation Functions

Helper functions to process the dataset, split it into train/val/test sets, and preview examples.

In [None]:
def init_dataset(
    train_split_ratio: float = None, 
    test_split_ratio: float = None, 
    val_split_ratio: float = None, 
    sample_fraction: float = 1.0
) -> tuple[list, list, list]:
    """
    Initialize and split the NuminaMath-1.5 dataset into train/val/test sets.
    
    Loads the dataset, converts each row to a DSPy Example, shuffles with a
    fixed seed for reproducibility, and optionally samples a fraction.
    No filtering by answer type is applied.
    
    Args:
        train_split_ratio: Proportion for training (default: 0.5)
        test_split_ratio: Proportion for testing (default: 0.4)
        val_split_ratio: Proportion for validation (default: 0.1)
        sample_fraction: Fraction of dataset to use (default: 1.0 = full dataset)
    
    Returns:
        Tuple of (train_set, val_set, test_set) as lists of DSPy Examples
    
    Raises:
        AssertionError: If split ratios don't sum to 1.0
    """
    # Set default split ratios
    if train_split_ratio is None:
        train_split_ratio = 0.5
    if test_split_ratio is None:
        test_split_ratio = 0.4
    if val_split_ratio is None:
        val_split_ratio = 0.1
    
    # Validate split ratios sum to 1.0 (with a tolerance for floating-point error)
    assert abs(train_split_ratio + test_split_ratio + val_split_ratio - 1.0) < 1e-9, "Ratios must sum to 1.0"

    # Load dataset from Hugging Face Hub
    train_split = load_dataset("AI-MO/NuminaMath-1.5")['train']
    
    # Convert to DSPy Examples with input/output fields
    train_split = [
        dspy.Example({
            "problem": x['problem'],
            'solution': x['solution'],
            'answer': x['answer'],
        }).with_inputs("problem")  # Mark 'problem' as input field
        for x in train_split
    ]
    
    # Shuffle with fixed seed for reproducibility
    import random
    random.Random(0).shuffle(train_split)
    tot_num = len(train_split)
    print(f"Total number of examples after filtering: {tot_num}")

    # Apply sampling if requested
    if sample_fraction < 1.0:
        sample_num = int(tot_num * sample_fraction)
        train_split = train_split[:sample_num]
        tot_num = sample_num
        print(f"Sampled down to {sample_num} examples.")
    
    # Split into train/val/test based on ratios
    train_end = int(train_split_ratio * tot_num)
    val_end = int((train_split_ratio + val_split_ratio) * tot_num)
    
    train_set = train_split[:train_end]
    val_set = train_split[train_end:val_end]
    test_set = train_split[val_end:]

    return train_set, val_set, test_set
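
Note that `init_dataset` as written keeps every example, including ones whose answer is symbolic or `notfound`. If only numeric answers are wanted, one possible filter is sketched below. The helper `has_numeric_answer` is hypothetical, and "numeric" here is assumed to mean "parses with Python's `float`".

```python
def has_numeric_answer(answer: str) -> bool:
    """Return True if the answer parses as a plain Python float (e.g. '394', '14.4').

    Hypothetical helper for illustration; LaTeX or fraction answers are rejected.
    """
    try:
        float(answer.strip())
        return True
    except (ValueError, AttributeError):
        return False


# Answer strings of the kind found in NuminaMath-1.5
answers = ["394", "14.4", "notfound", "\\frac{n}{n+1}", "7/15"]
numeric_only = [a for a in answers if has_numeric_answer(a)]
print(numeric_only)
```

Inside `init_dataset`, such a check could be added as an `if has_numeric_answer(x['answer'])` clause in the list comprehension. Doing so would change the example counts reported in this notebook.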

In [11]:
train_set, val_set, test_set = init_dataset(sample_fraction=0.00025)

print(len(train_set), len(val_set), len(test_set))

Total number of examples after filtering: 896215
Sampled down to 224 examples.
112 22 90
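
The split sizes printed above follow directly from the index arithmetic in `init_dataset`. A small sketch that reproduces it (the helper name `split_sizes` is introduced here for illustration):

```python
def split_sizes(total: int, train_ratio: float, val_ratio: float) -> tuple[int, int, int]:
    """Mirror the slicing in init_dataset and return (train, val, test) sizes."""
    train_end = int(train_ratio * total)
    val_end = int((train_ratio + val_ratio) * total)
    return train_end, val_end - train_end, total - val_end


# 896215 examples sampled at 0.00025, then split with the code's defaults (0.5 / 0.1 / 0.4)
sampled = int(896215 * 0.00025)
sizes = split_sizes(sampled, train_ratio=0.5, val_ratio=0.1)
print(sampled, sizes)
```

Because both boundaries are truncated with `int`, the test set absorbs any rounding remainder.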


In [12]:
print("Problem:")
print(train_set[0]['problem'])
print("\n\nSolution:")
print(train_set[0]['solution'])
print("\n\nAnswer:")
print(train_set[0]['answer'])

Problem:
In the diagram, $AB = 15\text{ cm},$ $DC = 24\text{ cm},$ and $AD = 9\text{ cm}.$ What is the length of $AC,$ to the nearest tenth of a centimeter?

[asy]
draw((0,0)--(9,16)--(33,16)--(9,0)--cycle,black+linewidth(1));
draw((9,16)--(9,0),black+linewidth(1));
draw((0,0)--(33,16),black+linewidth(1));
draw((9,0)--(9,0.5)--(8.5,0.5)--(8.5,0)--cycle,black+linewidth(1));
draw((9,16)--(9.5,16)--(9.5,15.5)--(9,15.5)--cycle,black+linewidth(1));
label("$A$",(0,0),NW);
label("$B$",(9,16),NW);
label("$C$",(33,16),E);
label("$D$",(9,0),SE);
label("15 cm",(0,0)--(9,16),NW);
label("9 cm",(0,0)--(9,0),S);
label("24 cm",(9,0)--(33,16),SE);
[/asy]


Solution:
Extend $AD$ to point $E$ where it intersects the perpendicular from $C$ on $BC$'s extension.

[asy]
draw((0,0)--(9,16)--(33,16)--(9,0)--cycle,black+linewidth(1));
draw((9,16)--(9,0),black+linewidth(1));
draw((0,0)--(33,16),black+linewidth(1));
draw((9,0)--(9,0.5)--(8.5,0.5)--(8.5,0)--cycle,black+linewidth(1));
draw((9,16)--(9.5,16)--(9.5,15

In [13]:
print(test_set[0]['problem'])
print("\n\nAnswer:")
print(test_set[0]['answer'])

a cistern is two - third full of water . pipe a can fill the remaining part in 12 minutes and pipe b in 8 minutes . once the cistern is emptied , how much time will they take to fill it together completely ?


Answer:
14.4


## Baseline Chain-of-Thought Program

Create a simple baseline using DSPy's Chain-of-Thought module to establish initial performance.

In [14]:
class GenerateResponse(dspy.Signature):
    """Solve the problem and provide the answer in the correct format."""
    problem = dspy.InputField()
    answer = dspy.OutputField()

program = dspy.ChainOfThought(GenerateResponse)

## Evaluation Metric

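Define the evaluation metric to compare model predictions against ground truth answers. The metric takes the last token containing a digit in each answer, keeps only the digits before any decimal point, and compares the resulting integers, so it is a lenient approximation rather than an exact-match check.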

In [15]:
def parse_integer_answer(answer):
    try:
        # find the last token that has a number in it
        answer = [token for token in answer.split() if any(c.isdigit() for c in token)][-1]
        answer = answer.split(".")[0]
        answer = "".join([c for c in answer if c.isdigit()])
        answer = int(answer)

    except (ValueError, IndexError, TypeError):
        answer = 0

    return answer

def metric(gold, pred, trace=None):
    return int(parse_integer_answer(str(gold.answer))) == int(parse_integer_answer(str(pred.answer)))
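
To make the metric's behavior concrete, the sketch below repeats `parse_integer_answer` (copied from the cell above so the block is self-contained) and runs it on answer strings of the kind that appear in this dataset. It illustrates the parser's limits and is not a proposed replacement.

```python
def parse_integer_answer(answer):
    try:
        # find the last token that has a number in it
        answer = [token for token in answer.split() if any(c.isdigit() for c in token)][-1]
        answer = answer.split(".")[0]
        answer = "".join([c for c in answer if c.isdigit()])
        answer = int(answer)
    except (ValueError, IndexError, TypeError):
        answer = 0
    return answer


samples = ["14.4", "4.8 minutes", "0.13", "0.75", "7/15", "\\boxed{-\\frac{9}{4}}", "notfound"]
parsed = {s: parse_integer_answer(s) for s in samples}
for text, value in parsed.items():
    print(f"{text!r} -> {value}")
```

Three consequences follow: decimals are truncated (so `0.13` and `0.75` both parse to `0` and would be scored as matching), signs and fraction structure are lost (`7/15` becomes `715`), and answers with no digits fall back to `0`.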

## Baseline Evaluation

Evaluate the baseline Chain-of-Thought program to establish our starting accuracy before optimization.

In [16]:
evaluate = dspy.Evaluate(
    devset=test_set,
    metric=metric,
    num_threads=16,
    display_table=True,
    display_progress=True
)

evaluate(program)

Average Metric: 47.00 / 90 (52.2%): 100%|██████████| 90/90 [00:45<00:00,  1.98it/s]

2025/10/04 20:23:25 INFO dspy.evaluate.evaluate: Average Metric: 47 / 90 (52.2%)





| row | problem | solution | example_answer | reasoning | pred_answer | metric |
|---|---|---|---|---|---|---|
| 0 | a cistern is two - third full of water . pipe a can fill the remai... | First, let's find out how much time it would take for each pipe to... | 14.4 | The cistern is initially two-thirds full, so the remaining part to... | 4.8 minutes | |
| 1 | In the geometric sequence $\{a_n\}$ with common ratio $q > 1$, we ... | (1) Since $a_2 = 2$ and $S_3 = 7$ with $q > 1$, We have the system... | \frac{n}{n+1} | Given that \(\{a_n\}\) is a geometric sequence with ratio \(q > 1\... | (1) The general term formula for the sequence \(\{a_n\}\) is \(\bo... | ✔️ [True] |
| 2 | Given that the derivative of the function $f(x)$ is $f'(x)$, and i... | Given that the derivative of the function $f(x)$ is $f'(x)$, and i... | -\dfrac{9}{4} | The function given is \[f(x) = x^2 + 3x f'(2) + \ln x.\] Note that... | \(\boxed{-\frac{9}{4}}\) | ✔️ [True] |
| 3 | 67. Given a point $A$ and a line $l$, $B$ is an arbitrary point on... | 67. The desired geometric locus of points consists of two lines pa... | notfound | Given a fixed point \(A\) and a line \(l\), and an arbitrary point... | The locus of points \(M\) such that \(ABM\) is an equilateral tria... | |
| 4 | A unit has a total of 620 staff members. To investigate the time w... | **Analysis** This question examines systematic sampling, which is ... | 394 | The total staff is 620, divided into 62 segments, which implies ea... | 394 | ✔️ [True] |
| ... | ... | ... | ... | ... | ... | ... |
| 85 | Darnel sprinted 0.88 lap and then took a break by jogging 0.75 lap... | To find out how many laps farther Darnel sprinted than jogged, we ... | 0.13 | Darnel sprinted 0.88 lap and then jogged 0.75 lap. To find how man... | Darnel sprinted 0.13 laps farther than he jogged. | ✔️ [True] |
| 86 | In hexagon $FIGURE$, $\angle F \cong \angle I \cong \angle U \cong... | The sum of the angle measures in a hexagon is \(180(6-2) = 720\) d... | 45^\circ | The problem describes a hexagon named FIGURE with six vertices: F,... | 30 | |
| 87 | A, B, C, and D enter into a partnership. A subscribes 1/3 of the c... | Let's denote the total capital as X. A subscribes 1/3 of the capit... | 7/15 | A's share of profit is Rs. 810 in a total profit of Rs. 2430. The ... | B subscribes to 2/15 of the capital. | |
| 88 | At a laundromat, it costs a certain amount for a washer and a quar... | Let's denote the cost for a washer as \( W \). Samantha does 2 loa... | \$4 | Let the cost for the washer be \( x \) dollars. Samantha does 2 lo... | The washer costs \(\boxed{\$4}\). | ✔️ [True] |


EvaluationResult(score=52.22, results=<list of 90 results>)

### Understanding the Baseline Results

The evaluation table shows our model's performance on 90 test problems:

**Table Columns:**
- `problem`: The mathematical question from NuminaMath-1.5
- `solution`: Reference step-by-step solution from the dataset
- `example_answer`: Ground truth answer
- `reasoning`: Model's chain-of-thought reasoning process
- `pred_answer`: Model's final prediction
- `metric`: ✔️ indicates a correct answer; a blank cell indicates an incorrect one

**Key Observations:**
- **Baseline Accuracy: ~52%** - The model gets roughly half the problems correct
- **Reasoning Quality**: The model generates coherent step-by-step reasoning (see the `reasoning` column)
- **Common Failures**: 
  - Calculation errors
  - Misinterpreting problem statements (e.g., row 0: the model predicted 4.8 minutes, the time to fill only the remaining third, whereas filling the whole cistern takes 14.4 minutes)
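
The row 0 failure can be checked with exact arithmetic. The sketch below uses only the quantities stated in the problem and suggests that 4.8 minutes answers a different question (filling the remaining third) from the one asked (filling the emptied cistern completely).

```python
from fractions import Fraction

remaining = Fraction(1, 3)   # the cistern is two-thirds full, so one third remains
rate_a = remaining / 12      # pipe A fills that third in 12 minutes
rate_b = remaining / 8       # pipe B fills that third in 8 minutes
combined = rate_a + rate_b   # cisterns per minute with both pipes open

time_full = 1 / combined               # emptied cistern, filled completely
time_remaining = remaining / combined  # only the remaining third

print(float(time_full), float(time_remaining))
```

The first value matches the ground truth answer and the second matches the model's prediction.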

**Why This Matters:**
This baseline performance demonstrates that while GPT-4.1 Nano has reasonable mathematical reasoning capability, there's significant room for improvement. GEPA will analyze these errors and automatically refine the prompt to address common failure patterns, potentially boosting accuracy by 10-20 percentage points.

## GEPA Optimization

Apply GEPA optimizer with error-driven feedback to automatically improve the prompt and boost performance.

### How GEPA Works: Error-Driven Prompt Improvement

GEPA (Genetic-Pareto) is an automatic, reflective prompt optimization technique that learns from mistakes to improve model performance. Here's how it works:

**The GEPA Optimization Cycle:**

1. **Evaluation Phase** - Run the model on training examples and collect predictions
2. **Error Analysis** - Identify which problems the model got wrong
3. **Feedback Generation** - Create detailed feedback explaining:
   - What the correct answer should be
   - Why the model's answer was wrong
   - The complete step-by-step solution
4. **Reflection Phase** - Use the reflection LM (Qwen3 Thinking) to:
   - Analyze patterns across multiple failed examples
   - Identify common failure modes (e.g., "model miscalculates ratios", "model misinterprets word problems")
   - Generate improved prompt instructions to address these patterns
5. **Prompt Update** - Rewrite the predictor's instructions (the prompt text stored on its signature) with the new guidelines
6. **Validation** - Test the updated prompt on validation set
7. **Iteration** - Repeat the cycle, keeping only improvements that boost validation accuracy
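
The cycle above can be sketched as a toy loop in plain Python. This is not DSPy's implementation: the "prompt" is a set of rule names, `reflect` is a stand-in for the reflection LM, and the Pareto-based candidate selection that gives GEPA its name is omitted. All names here are hypothetical.

```python
from collections import Counter


def score(prompt: frozenset, examples: list[str]) -> float:
    """Toy evaluation: an example is solved only if the prompt contains its rule."""
    return sum(rule in prompt for rule in examples) / len(examples)


def reflect(prompt: frozenset, minibatch: list[str]) -> frozenset:
    """Stand-in for the reflection LM: add the rule missed most often in the minibatch."""
    failures = Counter(rule for rule in minibatch if rule not in prompt)
    if not failures:
        return prompt
    most_common_rule, _ = failures.most_common(1)[0]
    return prompt | {most_common_rule}


def optimize(train: list[str], val: list[str], steps: int, batch_size: int) -> frozenset:
    best = frozenset()
    for step in range(steps):
        start = (step * batch_size) % len(train)
        minibatch = train[start:start + batch_size]
        candidate = reflect(best, minibatch)
        # Keep the candidate only if it improves validation accuracy.
        if score(candidate, val) > score(best, val):
            best = candidate
    return best


# Each example is labeled with the (hypothetical) rule needed to solve it.
train = ["ratios", "units", "ratios", "geometry", "ratios", "units"]
val = ["ratios", "units", "geometry", "ratios"]
best_prompt = optimize(train, val, steps=3, batch_size=2)
print(sorted(best_prompt), score(best_prompt, val))
```

The point of the sketch is the accept-or-reject step: a reflected candidate replaces the current prompt only when validation accuracy goes up.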

**Why We Need `metric_with_feedback`:**

Unlike a standard metric that just returns 0 or 1 (correct/incorrect), `metric_with_feedback` returns:
- **Score**: 0 or 1 for correctness
- **Feedback**: Rich textual explanation including the ground truth solution

This feedback is crucial because GEPA's reflection model needs to understand *why* predictions failed to generate better prompts. The more detailed the feedback, the better GEPA can identify patterns and create targeted improvements.

**Key Parameters:**
- `auto="light"`: Controls optimization intensity (light/medium/heavy)
- `reflection_minibatch_size=16`: Number of training examples sampled for each reflection step
- `reflection_lm`: The smarter model used for analyzing errors and improving prompts
- `num_threads=32`: Parallel evaluation for faster optimization

In [17]:
def metric_with_feedback(
    example: dspy.Example, 
    prediction: dspy.Prediction, 
    trace=None, 
    pred_name=None, 
    pred_trace=None
) -> dspy.Prediction:
    """
    Enhanced evaluation metric with detailed feedback for GEPA optimization.
    
    Evaluates predictions and generates targeted feedback including error analysis
    and the complete solution for learning. Feedback helps GEPA identify failure
    patterns and improve prompts.
    
    Args:
        example: DSPy Example with ground truth answer and solution
        prediction: DSPy Prediction with model's answer
        trace: Optional trace information (unused)
        pred_name: Optional prediction name (unused)
        pred_trace: Optional prediction trace (unused)
    
    Returns:
        DSPy Prediction with score (0 or 1) and detailed feedback text
    """
    # Extract the reference solution (may be empty)
    written_solution = example.get('solution', '')

    # Score: 1 for correct, 0 for incorrect.
    # `metric` parses both answers leniently and never raises, so no
    # separate parse-failure branch is needed here.
    score = int(metric(example, prediction))

    # Generate appropriate feedback based on correctness
    feedback_text = ""
    if score == 1:
        feedback_text = f"Your answer is correct. The correct answer is '{example.get('answer', '')}'."
    else:
        feedback_text = f"Your answer is incorrect. The correct answer is '{example.get('answer', '')}'."

    # Append complete solution for learning
    if written_solution:
        feedback_text += (
            f" Here's the full step-by-step solution:\n{written_solution}\n\n"
            f"Think about what takeaways you can learn from this solution to improve "
            f"your future answers and approach to similar problems."
        )

    return dspy.Prediction(score=score, feedback=feedback_text)

In [None]:
from dspy import GEPA

optimizer = GEPA(
    metric=metric_with_feedback,
    auto="light",
    num_threads=32,
    track_stats=True,
    reflection_minibatch_size=16,
    track_best_outputs=True,
    add_format_failure_as_feedback=True,
    reflection_lm=reflection_lm,
)

In [None]:
optimized_program = optimizer.compile(
    program,
    trainset=train_set,
    valset=val_set,
)

In [20]:
print(optimized_program.predict.signature.instructions)

Solve the problem step-by-step, following these guidelines:

- Carefully read the problem statement to understand all provided data and conditions explicitly.
- Define all variables and parameters clearly at the beginning.
- For geometry problems:
  - Confirm exact shape properties (e.g., isosceles triangle has two equal sides; quadratic equation solutions may form sides where two sides equal one root value and the third side is the other root).
  - Apply correct formulas (e.g., circumradius R = abc/(4Δ) or precise isosceles triangle formulas) and verify triangle inequalities (sum of any two sides > third side).
- For word problems:
  - Correctly interpret phrases (e.g., "A beats B by 200 meters" means when A finishes the race, B has run 800 meters).
  - For gradual change problems (fleets, age, etc.), track each year/item step-by-step with clear calculations.
- For functional equations with recurrences (e.g., f(x) + f(x+1) = 1):
  - Break domain into intervals based on integer/fr

## Optimized Program Evaluation

Evaluate the GEPA-optimized program to measure the improvement in accuracy and effectiveness.

In [21]:
evaluate(optimized_program)

Average Metric: 52.00 / 90 (57.8%): 100%|██████████| 90/90 [01:13<00:00,  1.23it/s]

2025/10/04 20:48:20 INFO dspy.evaluate.evaluate: Average Metric: 52 / 90 (57.8%)





| row | problem | solution | example_answer | reasoning | pred_answer | metric |
|---|---|---|---|---|---|---|
| 0 | a cistern is two - third full of water . pipe a can fill the remai... | First, let's find out how much time it would take for each pipe to... | 14.4 | Given: - Cistern is two-thirds full initially. - Pipe A can fill t... | \boxed{4.8 \text{ minutes}} \nor approximately 4 minutes and 48 se... | |
| 1 | In the geometric sequence $\{a_n\}$ with common ratio $q > 1$, we ... | (1) Since $a_2 = 2$ and $S_3 = 7$ with $q > 1$, We have the system... | \frac{n}{n+1} | Given the geometric sequence \(\{a_n\}\) with common ratio \(q > 1... | (1) \boxed{a_n = 2^{n-1}} \; \\\n(2) \boxed{T_n = \frac{1}{2} - \f... | |
| 2 | Given that the derivative of the function $f(x)$ is $f'(x)$, and i... | Given that the derivative of the function $f(x)$ is $f'(x)$, and i... | -\dfrac{9}{4} | We are given that \(f(x) = x^2 + 3x f'(2) + \ln x\). To find \(f'(... | \boxed{-\frac{9}{4}} | ✔️ [True] |
| 3 | 67. Given a point $A$ and a line $l$, $B$ is an arbitrary point on... | 67. The desired geometric locus of points consists of two lines pa... | notfound | Given a fixed point \(A\) and a line \(l\), and an arbitrary point... | The locus of points \(M\) is the two lines passing through \(A\) t... | |
| 4 | A unit has a total of 620 staff members. To investigate the time w... | **Analysis** This question examines systematic sampling, which is ... | 394 | The total number of staff members is 620. The staff was divided in... | 394 | ✔️ [True] |
| ... | ... | ... | ... | ... | ... | ... |
| 85 | Darnel sprinted 0.88 lap and then took a break by jogging 0.75 lap... | To find out how many laps farther Darnel sprinted than jogged, we ... | 0.13 | Darnel sprinted 0.88 lap and then jogged 0.75 lap. To find out how... | \boxed{0.13} | ✔️ [True] |
| 86 | In hexagon $FIGURE$, $\angle F \cong \angle I \cong \angle U \cong... | The sum of the angle measures in a hexagon is \(180(6-2) = 720\) d... | 45^\circ | The problem describes a hexagon labeled FIGURE with certain angle ... | 45 | ✔️ [True] |
| 87 | A, B, C, and D enter into a partnership. A subscribes 1/3 of the c... | Let's denote the total capital as X. A subscribes 1/3 of the capit... | 7/15 | Let's denote the total capital as 1 (or 1 fraction). The capital s... | The fraction of the capital subscribed by B is \boxed{0}. | |
| 88 | At a laundromat, it costs a certain amount for a washer and a quar... | Let's denote the cost for a washer as \( W \). Samantha does 2 loa... | \$4 | Let the cost of using the washer be \(w\) dollars. Since each load... | \boxed{4} | ✔️ [True] |


EvaluationResult(score=57.78, results=<list of 90 results>)

### Understanding the Optimization Results

**Performance Improvement:**
- **Baseline Accuracy**: 52.2% (47/90 correct)
- **Optimized Accuracy**: 57.8% (52/90 correct)
- **Improvement**: +5.6 percentage points (~11% relative improvement)
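
The figures above can be reproduced from the raw counts. The sketch also computes a rough binomial standard error for a 90-example accuracy estimate; this assumes independent examples and ignores that both runs share the same test set, so treat it only as a sense of scale.

```python
import math

n = 90
baseline_correct, optimized_correct = 47, 52

baseline_acc = baseline_correct / n
optimized_acc = optimized_correct / n

# Absolute gain in percentage points and relative gain in percent
abs_gain_pp = (optimized_acc - baseline_acc) * 100
rel_gain_pct = (optimized_acc - baseline_acc) / baseline_acc * 100

# Rough binomial standard error of one accuracy estimate, in percentage points
se_pp = math.sqrt(baseline_acc * (1 - baseline_acc) / n) * 100

print(f"+{abs_gain_pp:.1f} pp, {rel_gain_pct:.1f}% relative, standard error ~{se_pp:.1f} pp")
```

Under these assumptions the gain is about the size of one standard error, so a larger test set would help confirm it.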

**What Changed:**
See the instruction GEPA developed above.

**Why the Modest Improvement?**

The gain of about 6 percentage points is expected given:
1. **Small Training Set**: Only 112 training examples (about 0.0125% of the full dataset; the whole 224-example sample is 0.025%)
2. **Light Optimization**: Using `auto="light"` for faster iteration
3. **Simple Baseline**: Chain-of-Thought already provides decent reasoning structure
4. **Model Limitations**: GPT-4.1 Nano's mathematical capabilities are the ceiling

**Cost Efficiency:**

This entire experiment (baseline evaluation, GEPA optimization, and final evaluation, all drawn from the 224-example sample) cost **less than $0.50** thanks to:
- GPT-4.1 Nano's low pricing ($0.10/M input, $0.40/M output)
- Asymmetric architecture (cheap model for 99% of calls, smart model for 1%)
- Small sample size for demonstration purposes

**Key Takeaway:**

Even with limited data and light optimization, GEPA identified failure patterns and generated targeted prompt improvements. With more training data (`sample_fraction=0.01` or higher) and heavier optimization (`auto="medium"` or `"heavy"`), larger gains are plausible; the 15-25% improvement and 65-70% accuracy suggested here are projections that this notebook does not measure.