# Entity Extraction

This tutorial demonstrates how to perform entity extraction using the CoNLL-2003 dataset with DSPy. The focus is on extracting entities referring to people. We will:

* Extract and label entities from the CoNLL-2003 dataset that refer to people
* Define a DSPy program for extracting entities that refer to people
* Optimize and evaluate the program on a subset of the CoNLL-2003 dataset

By the end of this tutorial, you'll understand how to structure tasks in DSPy using signatures and modules, evaluate your system's performance, and improve its quality with optimizers.

## Load and Prepare the Dataset

In this section, we prepare the CoNLL-2003 dataset, which is commonly used for entity extraction tasks. The dataset includes tokens annotated with entity labels such as persons, organizations, and locations.

We will:

1. Load the dataset using the Hugging Face datasets library.
2. Define a function to extract tokens referring to people.
3. Slice the dataset to create smaller subsets for training and testing.

DSPy expects examples in a structured format, so we'll also transform the dataset into DSPy Examples for easy integration.

In [4]:
import os
import tempfile
from datasets import load_dataset
from typing import Dict, Any, List
import dspy

def load_conll_dataset() -> dict:
    """
    Loads the CoNLL-2003 dataset into train, validation, and test splits.
    
    Returns:
        dict: Dataset splits with keys 'train', 'validation', and 'test'.
    """
    with tempfile.TemporaryDirectory() as temp_dir:
        # Use a temporary Hugging Face cache directory for compatibility with certain hosted notebook
        # environments that don't support the default Hugging Face cache directory
        os.environ["HF_DATASETS_CACHE"] = temp_dir
        return load_dataset("conll2003", trust_remote_code=True)

def extract_people_entities(data_row: Dict[str, Any]) -> List[str]:
    """
    Extracts entities referring to people from a row of the CoNLL-2003 dataset.
    
    Args:
        data_row (Dict[str, Any]): A row from the dataset containing tokens and NER tags.
    
    Returns:
        List[str]: List of tokens tagged as people.
    """
    return [
        token
        for token, ner_tag in zip(data_row["tokens"], data_row["ner_tags"])
        if ner_tag in (1, 2)  # CoNLL entity codes 1 and 2 refer to people
    ]

def prepare_dataset(data_split, start: int, end: int) -> List[dspy.Example]:
    """
    Prepares a sliced dataset split for use with DSPy.
    
    Args:
        data_split: The dataset split (e.g., train or test).
        start (int): Starting index of the slice.
        end (int): Ending index of the slice.
    
    Returns:
        List[dspy.Example]: List of DSPy Examples with tokens and expected labels.
    """
    return [
        dspy.Example(
            tokens=row["tokens"],
            expected_extracted_people=extract_people_entities(row)
        ).with_inputs("tokens")
        for row in data_split.select(range(start, end))
    ]

# Load the dataset
dataset = load_conll_dataset()

# Prepare the training and test sets
train_set = prepare_dataset(dataset["train"], 0, 50)
test_set = prepare_dataset(dataset["test"], 0, 200)
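The tag filter used in `extract_people_entities` can be checked without downloading anything. The sketch below assumes the label order of the Hugging Face `conll2003` `ner_tags` feature (`O`, `B-PER`, `I-PER`, ...), which is where codes 1 and 2 come from. The sample row and the `people_tokens` helper are hypothetical, written only for illustration.

```python
from typing import Any, Dict, List

# Label names for the `ner_tags` feature, listed in ID order (assumed to match
# the Hugging Face `conll2003` dataset card).
NER_LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# IDs of the two person tags: beginning-of-person and inside-of-person.
PERSON_TAG_IDS = {NER_LABELS.index("B-PER"), NER_LABELS.index("I-PER")}


def people_tokens(row: Dict[str, Any]) -> List[str]:
    """Keep the tokens tagged B-PER or I-PER, preserving their order."""
    return [
        token
        for token, tag in zip(row["tokens"], row["ner_tags"])
        if tag in PERSON_TAG_IDS
    ]


# Hypothetical, hand-tagged row (not taken from the dataset).
sample_row = {
    "tokens": ["Italy", "recalled", "Marcello", "Cuttitta", "on", "Friday"],
    "ner_tags": [5, 0, 1, 2, 0, 0],  # B-LOC, O, B-PER, I-PER, O, O
}

print(people_tokens(sample_row))
```

Looking the IDs up by label name rather than hard-coding `(1, 2)` makes it easier to adapt the filter to organizations or locations.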

## Configure DSPy and Create an Entity Extraction Program

Here, we define a DSPy program for extracting entities referring to people from tokenized text.

Then, we configure DSPy to use a particular language model (`xai/grok-3-mini` in this run) for all invocations of the program.

Key DSPy Concepts Introduced:

* Signatures: Define structured input/output schemas for your program.
* Modules: Encapsulate program logic in reusable, composable units.

Specifically, we'll:

* Create a `PeopleExtraction` DSPy Signature to specify the input (`tokens`) and output (`extracted_people`) fields.
* Define a `people_extractor` program that uses DSPy's built-in `dspy.ChainOfThought` module to implement the `PeopleExtraction` signature. The program extracts entities referring to people from a list of input tokens using language model (LM) prompting.
* Use the `dspy.LM` class and the `dspy.configure()` function to configure the language model that DSPy will use when invoking the program.

In [5]:
from typing import List

class PeopleExtraction(dspy.Signature):
    """
    Extract contiguous tokens referring to specific people, if any, from a list of string tokens.
    Output a list of tokens. In other words, do not combine multiple tokens into a single value.
    """
    tokens: list[str] = dspy.InputField(desc="tokenized text")
    extracted_people: list[str] = dspy.OutputField(desc="all tokens referring to specific people extracted from the tokenized text")

people_extractor = dspy.ChainOfThought(PeopleExtraction)

In [6]:
# Configure the DSPy environment with the language model.
# For Grok, use the model name "xai/grok-3-mini" and make the API key
# available as os.environ['XAI_API_KEY'] (loaded here from grok_key.ini).
from dotenv import load_dotenv
load_dotenv("grok_key.ini")
lm = dspy.LM('xai/grok-3-mini', api_key=os.environ['XAI_API_KEY'])
# To use a local model served by Ollama instead:
# lm = dspy.LM('ollama_chat/devstral', api_base='http://localhost:11434', api_key='')
dspy.configure(lm=lm)
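If the key is missing, `os.environ['XAI_API_KEY']` fails with a bare `KeyError`. A small guard can give a clearer message. This is a minimal sketch using only the standard library. `require_env` and `DEMO_API_KEY` are hypothetical names, not part of DSPy or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or fail with a readable message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it or add it to the file passed to load_dotenv()."
        )
    return value


# Demo with a hypothetical variable so the snippet runs without a real key.
os.environ["DEMO_API_KEY"] = "not-a-real-key"
print(require_env("DEMO_API_KEY"))

# In the cell above, the same guard could be used as (assumes dspy is installed):
# lm = dspy.LM('xai/grok-3-mini', api_key=require_env('XAI_API_KEY'))
```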

## Define Metric and Evaluation Functions

In DSPy, evaluating a program's performance is critical for iterative development. A good evaluation framework allows us to:

* Measure the quality of our program's outputs.
* Compare outputs against ground-truth labels.
* Identify areas for improvement.

What We'll Do:

* Define a custom metric (`extraction_correctness_metric`) to evaluate whether the extracted entities match the ground truth.
* Create an evaluation function (`evaluate_correctness`) to apply this metric to a training or test dataset and compute the overall accuracy.

The evaluation function uses the `dspy.Evaluate` utility to handle parallelism and visualization of results.

In [7]:
def extraction_correctness_metric(example: dspy.Example, prediction: dspy.Prediction, trace=None) -> bool:
    """
    Computes correctness of entity extraction predictions.
    
    Args:
        example (dspy.Example): The dataset example containing expected people entities.
        prediction (dspy.Prediction): The prediction from the DSPy people extraction program.
        trace: Optional trace object for debugging.
    
    Returns:
        bool: True if predictions match expectations, False otherwise.
    """
    return prediction.extracted_people == example.expected_extracted_people

evaluate_correctness = dspy.Evaluate(
    devset=test_set,
    metric=extraction_correctness_metric,
    num_threads=24,
    display_progress=True,
    display_table=True
)
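Because the metric compares two lists with `==`, a prediction counts as correct only when it contains the same tokens in the same order. The sketch below illustrates that strictness on plain lists and contrasts it with a partial-credit score. `token_f1` is a hypothetical alternative written for this illustration and is not used elsewhere in the tutorial.

```python
from collections import Counter
from typing import List


def exact_match(expected: List[str], predicted: List[str]) -> bool:
    """Same comparison as the metric above, applied to plain lists."""
    return predicted == expected


def token_f1(expected: List[str], predicted: List[str]) -> float:
    """Hypothetical partial-credit score: F1 over token multisets, ignoring order."""
    if not expected and not predicted:
        return 1.0  # nothing to find and nothing predicted
    overlap = sum((Counter(expected) & Counter(predicted)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(expected)
    return 2 * precision * recall / (precision + recall)


expected = ["Campese", "Rob", "Andrew"]

# Same tokens, different order: exact match fails, token F1 gives full credit.
reordered = ["Rob", "Andrew", "Campese"]
print(exact_match(expected, reordered), token_f1(expected, reordered))

# One token missing: exact match fails, token F1 gives partial credit.
partial = ["Campese", "Rob"]
print(exact_match(expected, partial), round(token_f1(expected, partial), 2))
```

Exact match is the simpler choice when token order is meaningful. A partial-credit score may give an optimizer a smoother signal, at the cost of accepting reordered output.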

## Evaluate Initial Extractor

Before optimizing our program, we need a baseline evaluation to understand its current performance. This helps us:

* Establish a reference point for comparison after optimization.
* Identify potential weaknesses in the initial implementation.

In this step, we'll run our `people_extractor` program on the test set and measure its accuracy using the evaluation framework defined earlier.

In [8]:
evaluate_correctness(people_extractor, devset=test_set)

Average Metric: 190.00 / 200 (95.0%): 100%|██████████| 200/200 [01:01<00:00,  3.27it/s]

2025/06/24 17:27:03 INFO dspy.evaluate.evaluate: Average Metric: 190 / 200 (95.0%)





| | tokens | expected_extracted_people | reasoning | extracted_people | extraction_correctness_metric |
|---|---|---|---|---|---|
| 0 | [SOCCER, -, JAPAN, GET, LUCKY, WIN, ,, CHINA, IN, SURPRISE, DEFEAT... | [CHINA] | After analyzing the provided tokens: ["SOCCER", "-", "JAPAN", "GET... | [] | |
| 1 | [Nadim, Ladki] | [Nadim, Ladki] | The provided tokens are ["Nadim", "Ladki"]. Both tokens are capita... | [Nadim, Ladki] | ✔️ [True] |
| 2 | [AL-AIN, ,, United, Arab, Emirates, 1996-12-06] | [] | After reviewing the tokens ["AL-AIN", ",", "United", "Arab", "Emir... | [] | ✔️ [True] |
| 3 | [Japan, began, the, defence, of, their, Asian, Cup, title, with, a... | [] | I analyzed the provided tokens to identify any that refer to speci... | [] | ✔️ [True] |
| 4 | [But, China, saw, their, luck, desert, them, in, the, second, matc... | [] | After reviewing the provided tokens, I analyzed each one to identi... | [] | ✔️ [True] |
| ... | ... | ... | ... | ... | ... |
| 195 | ['The', 'Wallabies', 'have', 'their', 'sights', 'set', 'on', 'a', ... | [David, Campese] | I reviewed the list of tokens to identify any that refer to specif... | [David, Campese] | ✔️ [True] |
| 196 | ['The', 'Wallabies', 'currently', 'have', 'no', 'plans', 'to', 'ma... | [] | In the provided tokenized text, the phrase "the 34-year-old winger... | [34-year-old, winger] | |
| 197 | ['Campese', 'will', 'be', 'up', 'against', 'a', 'familiar', 'foe',... | [Campese, Rob, Andrew] | I scanned the list of tokens for contiguous tokens that refer to s... | [Campese, Rob, Andrew] | ✔️ [True] |
| 198 | ['"', 'Campo', 'has', 'a', 'massive', 'following', 'in', 'this', '... | [Campo, Andrew] | I analyzed the provided tokens to identify any that refer to speci... | [Campo, Andrew] | ✔️ [True] |


95.0
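The two failing rows visible in the table fail in opposite ways: row 0 misses a token the dataset labels as a person, while row 196 adds tokens the dataset does not label. A small helper can make that distinction explicit when reviewing failures. This is a sketch on plain lists. `diff_tokens` is a hypothetical helper, and the expected and predicted values are copied from rows 0 and 196 above.

```python
from collections import Counter
from typing import Dict, List


def diff_tokens(expected: List[str], predicted: List[str]) -> Dict[str, List[str]]:
    """Split a mismatch into tokens the program missed and tokens it added."""
    exp, pred = Counter(expected), Counter(predicted)
    return {
        "missed": sorted((exp - pred).elements()),    # labeled, but not predicted
        "spurious": sorted((pred - exp).elements()),  # predicted, but not labeled
    }


# (expected, predicted) pairs copied from the failing rows shown in the table.
failures = {
    0: (["CHINA"], []),
    196: ([], ["34-year-old", "winger"]),
}

for row_id, (expected, predicted) in failures.items():
    print(row_id, diff_tokens(expected, predicted))
```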

## Optimize the Model

DSPy includes powerful optimizers that can improve the quality of your system.

Here, we use DSPy's MIPROv2 optimizer to:

* Automatically tune the program's language model (LM) prompt by 
    1. using the LM to adjust the prompt's instructions and 
    2. building few-shot examples from the training dataset that are augmented with reasoning generated from `dspy.ChainOfThought`.
* Maximize correctness on a validation split held out from the training set (40 of the 50 examples in the run below).

This optimization process is automated, saving time and effort while improving accuracy.

In [9]:
mipro_optimizer = dspy.MIPROv2(
    metric=extraction_correctness_metric,
    auto="medium",
)
optimized_people_extractor = mipro_optimizer.compile(
    people_extractor,
    trainset=train_set,
    max_bootstrapped_demos=4,
    requires_permission_to_run=False,
    minibatch=False
)

2025/06/24 17:27:03 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING MEDIUM AUTO RUN SETTINGS:
num_trials: 18
minibatch: False
num_fewshot_candidates: 12
num_instruct_candidates: 6
valset size: 40

2025/06/24 17:27:03 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/06/24 17:27:03 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/06/24 17:27:03 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=12 sets of demonstrations...


Bootstrapping set 1/12
Bootstrapping set 2/12
Bootstrapping set 3/12


 40%|████      | 4/10 [00:20<00:30,  5.14s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 4/12


 40%|████      | 4/10 [00:12<00:18,  3.14s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 5/12


 30%|███       | 3/10 [00:13<00:30,  4.38s/it]


Bootstrapped 2 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 6/12


 30%|███       | 3/10 [00:00<?, ?it/s]


Bootstrapped 2 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 7/12


 10%|█         | 1/10 [00:00<00:00, 332.75it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 8/12


 30%|███       | 3/10 [00:00<00:00, 125.06it/s]


Bootstrapped 2 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 9/12


 40%|████      | 4/10 [00:00<00:00, 363.41it/s]


Bootstrapped 3 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 10/12


 10%|█         | 1/10 [00:00<00:00, 154.98it/s]

Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 11/12



 30%|███       | 3/10 [00:06<00:14,  2.02s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 12/12


 30%|███       | 3/10 [00:00<00:00, 398.55it/s]
2025/06/24 17:27:57 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/06/24 17:27:57 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.


Bootstrapped 2 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Error getting source code: unhashable type: 'dict'.

Running without program aware proposer.


2025/06/24 17:28:06 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=6 instructions...

2025/06/24 17:28:52 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/06/24 17:28:52 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Extract contiguous tokens referring to specific people, if any, from a list of string tokens.
Output a list of tokens. In other words, do not combine multiple tokens into a single value.

2025/06/24 17:28:52 INFO dspy.teleprompt.mipro_optimizer_v2: 1: You are tasked with performing named entity recognition specifically for extracting contiguous tokens that refer to individual people's names from a list of string tokens derived from news articles and official reports on topics such as EU politics, public health issues like mad cow disease, and international trade involving countries like Germany, Britain, and France. Your goal is to identify and output a list of these tokens without combining or merging them into a single value. For 

Average Metric: 39.00 / 40 (97.5%): 100%|██████████| 40/40 [00:32<00:00,  1.23it/s] 

2025/06/24 17:29:25 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)
2025/06/24 17:29:25 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 97.5






2025/06/24 17:29:25 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 2 / 18 =====


Average Metric: 39.00 / 40 (97.5%): 100%|██████████| 40/40 [00:33<00:00,  1.18it/s] 

2025/06/24 17:29:59 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)
2025/06/24 17:29:59 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 6'].
2025/06/24 17:29:59 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [97.5, 97.5]
2025/06/24 17:29:59 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/06/24 17:29:59 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 3 / 18 =====



Average Metric: 38.00 / 40 (95.0%): 100%|██████████| 40/40 [00:34<00:00,  1.17it/s] 

2025/06/24 17:30:33 INFO dspy.evaluate.evaluate: Average Metric: 38 / 40 (95.0%)
2025/06/24 17:30:33 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 95.0 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 2'].
2025/06/24 17:30:33 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [97.5, 97.5, 95.0]
2025/06/24 17:30:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/06/24 17:30:33 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 4 / 18 =====



Average Metric: 38.00 / 40 (95.0%): 100%|██████████| 40/40 [00:29<00:00,  1.34it/s] 

2025/06/24 17:31:03 INFO dspy.evaluate.evaluate: Average Metric: 38 / 40 (95.0%)
2025/06/24 17:31:04 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 95.0 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 6'].
2025/06/24 17:31:04 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [97.5, 97.5, 95.0, 95.0]
2025/06/24 17:31:04 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/06/24 17:31:04 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 5 / 18 =====



Average Metric: 39.00 / 40 (97.5%): 100%|██████████| 40/40 [00:27<00:00,  1.44it/s] 

2025/06/24 17:31:31 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)
2025/06/24 17:31:31 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 4'].
2025/06/24 17:31:31 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [97.5, 97.5, 95.0, 95.0, 97.5]
2025/06/24 17:31:31 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/06/24 17:31:31 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 6 / 18 =====



Average Metric: 39.00 / 40 (97.5%): 100%|██████████| 40/40 [00:30<00:00,  1.30it/s] 

2025/06/24 17:32:02 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)
2025/06/24 17:32:02 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 5'].
2025/06/24 17:32:02 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [97.5, 97.5, 95.0, 95.0, 97.5, 97.5]
2025/06/24 17:32:02 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/06/24 17:32:02 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 18 =====



Average Metric: 38.00 / 40 (95.0%): 100%|██████████| 40/40 [00:33<00:00,  1.20it/s] 

2025/06/24 17:32:36 INFO dspy.evaluate.evaluate: Average Metric: 38 / 40 (95.0%)
2025/06/24 17:32:36 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 95.0 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 6'].
2025/06/24 17:32:36 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [97.5, 97.5, 95.0, 95.0, 97.5, 97.5, 95.0]
2025/06/24 17:32:36 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/06/24 17:32:36 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 8 / 18 =====



Average Metric: 39.00 / 40 (97.5%): 100%|██████████| 40/40 [00:35<00:00,  1.12it/s] 

2025/06/24 17:33:12 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)
2025/06/24 17:33:12 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 1'].
2025/06/24 17:33:12 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [97.5, 97.5, 95.0, 95.0, 97.5, 97.5, 95.0, 97.5]
2025/06/24 17:33:12 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/06/24 17:33:12 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 9 / 18 =====



Average Metric: 39.00 / 40 (97.5%): 100%|██████████| 40/40 [00:27<00:00,  1.47it/s] 

2025/06/24 17:33:39 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)
2025/06/24 17:33:39 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 3'].
2025/06/24 17:33:39 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [97.5, 97.5, 95.0, 95.0, 97.5, 97.5, 95.0, 97.5, 97.5]
2025/06/24 17:33:39 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/06/24 17:33:39 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 10 / 18 =====



Average Metric: 39.00 / 40 (97.5%): 100%|██████████| 40/40 [00:27<00:00,  1.44it/s] 

2025/06/24 17:34:07 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)
2025/06/24 17:34:07 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 10'].
2025/06/24 17:34:07 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [97.5, 97.5, 95.0, 95.0, 97.5, 97.5, 95.0, 97.5, 97.5, 97.5]
2025/06/24 17:34:07 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/06/24 17:34:07 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 11 / 18 =====



Average Metric: 39.00 / 40 (97.5%): 100%|██████████| 40/40 [00:00<00:00, 459.91it/s] 

2025/06/24 17:34:07 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)





2025/06/24 17:34:08 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 0'].
2025/06/24 17:34:08 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [97.5, 97.5, 95.0, 95.0, 97.5, 97.5, 95.0, 97.5, 97.5, 97.5, 97.5]
2025/06/24 17:34:08 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/06/24 17:34:08 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 12 / 18 =====


Average Metric: 39.00 / 40 (97.5%): 100%|██████████| 40/40 [00:00<00:00, 669.60it/s] 

2025/06/24 17:34:08 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)
2025/06/24 17:34:08 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 6'].
2025/06/24 17:34:08 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [97.5, 97.5, 95.0, 95.0, 97.5, 97.5, 95.0, 97.5, 97.5, 97.5, 97.5, 97.5]
2025/06/24 17:34:08 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/06/24 17:34:08 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 13 / 18 =====



Average Metric: 39.00 / 40 (97.5%): 100%|██████████| 40/40 [00:27<00:00,  1.46it/s] 

2025/06/24 17:34:35 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)
2025/06/24 17:34:35 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 8'].
2025/06/24 17:34:35 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [97.5, 97.5, 95.0, 95.0, 97.5, 97.5, 95.0, 97.5, 97.5, 97.5, 97.5, 97.5, 97.5]
2025/06/24 17:34:35 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/06/24 17:34:35 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 14 / 18 =====



Average Metric: 39.00 / 40 (97.5%): 100%|██████████| 40/40 [00:39<00:00,  1.00it/s] 

2025/06/24 17:35:15 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)





2025/06/24 17:35:16 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 0'].
2025/06/24 17:35:16 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [97.5, 97.5, 95.0, 95.0, 97.5, 97.5, 95.0, 97.5, 97.5, 97.5, 97.5, 97.5, 97.5, 97.5]
2025/06/24 17:35:16 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/06/24 17:35:16 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 15 / 18 =====


Average Metric: 39.00 / 40 (97.5%): 100%|██████████| 40/40 [00:32<00:00,  1.24it/s] 

2025/06/24 17:35:48 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)
2025/06/24 17:35:48 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 7'].
2025/06/24 17:35:48 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [97.5, 97.5, 95.0, 95.0, 97.5, 97.5, 95.0, 97.5, 97.5, 97.5, 97.5, 97.5, 97.5, 97.5, 97.5]
2025/06/24 17:35:48 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/06/24 17:35:48 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 16 / 18 =====



Average Metric: 38.00 / 40 (95.0%): 100%|██████████| 40/40 [00:32<00:00,  1.21it/s] 

2025/06/24 17:36:21 INFO dspy.evaluate.evaluate: Average Metric: 38 / 40 (95.0%)
2025/06/24 17:36:21 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 95.0 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 9'].
2025/06/24 17:36:21 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [97.5, 97.5, 95.0, 95.0, 97.5, 97.5, 95.0, 97.5, 97.5, 97.5, 97.5, 97.5, 97.5, 97.5, 97.5, 95.0]
2025/06/24 17:36:21 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/06/24 17:36:21 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 17 / 18 =====



Average Metric: 39.00 / 40 (97.5%): 100%|██████████| 40/40 [00:29<00:00,  1.38it/s] 

2025/06/24 17:36:51 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)
2025/06/24 17:36:51 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 11'].
2025/06/24 17:36:51 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [97.5, 97.5, 95.0, 95.0, 97.5, 97.5, 95.0, 97.5, 97.5, 97.5, 97.5, 97.5, 97.5, 97.5, 97.5, 95.0, 97.5]
2025/06/24 17:36:51 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/06/24 17:36:51 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 18 / 18 =====



Average Metric: 39.00 / 40 (97.5%): 100%|██████████| 40/40 [00:00<00:00, 652.89it/s] 

2025/06/24 17:36:51 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)
2025/06/24 17:36:51 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 0'].
2025/06/24 17:36:51 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [97.5, 97.5, 95.0, 95.0, 97.5, 97.5, 95.0, 97.5, 97.5, 97.5, 97.5, 97.5, 97.5, 97.5, 97.5, 95.0, 97.5, 97.5]
2025/06/24 17:36:51 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/06/24 17:36:51 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 19 / 18 =====



Average Metric: 39.00 / 40 (97.5%): 100%|██████████| 40/40 [00:29<00:00,  1.37it/s] 

2025/06/24 17:37:20 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)
2025/06/24 17:37:20 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 3'].
2025/06/24 17:37:20 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [97.5, 97.5, 95.0, 95.0, 97.5, 97.5, 95.0, 97.5, 97.5, 97.5, 97.5, 97.5, 97.5, 97.5, 97.5, 95.0, 97.5, 97.5, 97.5]
2025/06/24 17:37:20 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/06/24 17:37:20 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 97.5!





## Evaluate Optimized Program

After optimization, we re-evaluate the program on the test set to measure improvements. Comparing the optimized and initial results allows us to:

* Quantify the benefits of optimization.
* Validate that the program generalizes well to unseen data.

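In this run, accuracy on the test set is unchanged at 95.0% (190 / 200). Every trial scored at or below the default program's 97.5% on the validation split, so the optimizer found no candidate that beat the baseline.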

In [10]:
evaluate_correctness(optimized_people_extractor, devset=test_set)

Average Metric: 190.00 / 200 (95.0%): 100%|██████████| 200/200 [00:00<00:00, 635.07it/s]

2025/06/24 17:37:21 INFO dspy.evaluate.evaluate: Average Metric: 190 / 200 (95.0%)





| | tokens | expected_extracted_people | reasoning | extracted_people | extraction_correctness_metric |
|---|---|---|---|---|---|
| 0 | [SOCCER, -, JAPAN, GET, LUCKY, WIN, ,, CHINA, IN, SURPRISE, DEFEAT... | [CHINA] | After analyzing the provided tokens: ["SOCCER", "-", "JAPAN", "GET... | [] | |
| 1 | [Nadim, Ladki] | [Nadim, Ladki] | The provided tokens are ["Nadim", "Ladki"]. Both tokens are capita... | [Nadim, Ladki] | ✔️ [True] |
| 2 | [AL-AIN, ,, United, Arab, Emirates, 1996-12-06] | [] | After reviewing the tokens ["AL-AIN", ",", "United", "Arab", "Emir... | [] | ✔️ [True] |
| 3 | [Japan, began, the, defence, of, their, Asian, Cup, title, with, a... | [] | I analyzed the provided tokens to identify any that refer to speci... | [] | ✔️ [True] |
| 4 | [But, China, saw, their, luck, desert, them, in, the, second, matc... | [] | After reviewing the provided tokens, I analyzed each one to identi... | [] | ✔️ [True] |
| ... | ... | ... | ... | ... | ... |
| 195 | ['The', 'Wallabies', 'have', 'their', 'sights', 'set', 'on', 'a', ... | [David, Campese] | I reviewed the list of tokens to identify any that refer to specif... | [David, Campese] | ✔️ [True] |
| 196 | ['The', 'Wallabies', 'currently', 'have', 'no', 'plans', 'to', 'ma... | [] | In the provided tokenized text, the phrase "the 34-year-old winger... | [34-year-old, winger] | |
| 197 | ['Campese', 'will', 'be', 'up', 'against', 'a', 'familiar', 'foe',... | [Campese, Rob, Andrew] | I scanned the list of tokens for contiguous tokens that refer to s... | [Campese, Rob, Andrew] | ✔️ [True] |
| 198 | ['"', 'Campo', 'has', 'a', 'massive', 'following', 'in', 'this', '... | [Campo, Andrew] | I analyzed the provided tokens to identify any that refer to speci... | [Campo, Andrew] | ✔️ [True] |


95.0
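The optimizer's trial log can be tallied to see how the candidates compared with the starting program. The sketch below copies the final "Scores so far" list from the log. Treating the first entry as the default program follows the "Default program score: 97.5" log line. The tally is only a reading aid and is not how MIPROv2 itself selects a program.

```python
# Final "Scores so far" list, copied from the optimizer log (validation split of 40).
scores = [
    97.5, 97.5, 95.0, 95.0, 97.5, 97.5, 95.0, 97.5, 97.5, 97.5,
    97.5, 97.5, 97.5, 97.5, 97.5, 95.0, 97.5, 97.5, 97.5,
]

default_score = scores[0]            # first entry: the unoptimized program
best_score = max(scores)
first_best_index = scores.index(best_score)
tied_with_best = scores.count(best_score)
beat_default = [i for i, s in enumerate(scores) if s > default_score]

# With 40 validation examples, one example is worth 100 / 40 = 2.5 points,
# so 97.5 vs 95.0 is a difference of a single example.
points_per_example = 100 / 40

print(f"best score: {best_score} (first reached at index {first_best_index})")
print(f"entries tied with the best score: {tied_with_best} of {len(scores)}")
print(f"entries that beat the default program: {beat_default}")
print(f"points per validation example: {points_per_example}")
```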

## Inspect Optimized Program's Prompt

After optimizing the program, we can inspect the history of interactions to see how DSPy has augmented the program's prompt with few-shot examples. This step demonstrates:

* The structure of the prompt used by the program.
* How few-shot examples are added to guide the model's behavior.

Use `dspy.inspect_history(n=1)` to view the last interaction and analyze the generated prompt.

In [11]:
dspy.inspect_history(n=1)





[2025-06-24T17:37:21.177945]

System message:

Your input fields are:
1. `tokens` (list[str]): tokenized text
Your output fields are:
1. `reasoning` (str): 
2. `extracted_people` (list[str]): all tokens referring to specific people extracted from the tokenized text
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## tokens ## ]]
{tokens}

[[ ## reasoning ## ]]
{reasoning}

[[ ## extracted_people ## ]]
{extracted_people}        # note: the value you produce must adhere to the JSON schema: {"type": "array", "items": {"type": "string"}}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Extract contiguous tokens referring to specific people, if any, from a list of string tokens.
        Output a list of tokens. In other words, do not combine multiple tokens into a single value.


User message:

[[ ## tokens ## ]]
["Leeds", "'", "England", "under-21", "striker", "Lee", "Bowyer"

In [12]:
cost = sum([x['cost'] for x in lm.history if x['cost'] is not None])  # cost in USD, as calculated by LiteLLM for certain providers
cost

0.2982494
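The cost calculation skips history entries whose `cost` is `None`. The sketch below shows the same sum on plain dictionaries and also counts how many entries carried no cost, which helps judge how complete the total is. The history entries are hypothetical, and `total_cost` and `entries_without_cost` are illustrative helpers rather than DSPy functions.

```python
from typing import Any, Dict, List


def total_cost(history: List[Dict[str, Any]]) -> float:
    """Sum the 'cost' field over entries that report one."""
    return sum(entry["cost"] for entry in history if entry.get("cost") is not None)


def entries_without_cost(history: List[Dict[str, Any]]) -> int:
    """Count entries whose cost is missing or None."""
    return sum(1 for entry in history if entry.get("cost") is None)


# Hypothetical history entries; real entries in lm.history carry more fields.
history = [{"cost": 0.25}, {"cost": None}, {"cost": 0.5}, {}]

print(total_cost(history))            # sum over the two entries that report a cost
print(entries_without_cost(history))  # entries skipped by the sum
```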

## Saving and Loading Optimized Programs

DSPy supports saving and loading programs, enabling you to reuse optimized systems without the need to re-optimize from scratch. 
This feature is especially useful for deploying your programs in production environments or sharing them with collaborators.

In this step, we'll save the optimized program to a file and demonstrate how to load it back for future use.

In [13]:
optimized_people_extractor.save("optimized_extractor.json")

loaded_people_extractor = dspy.ChainOfThought(PeopleExtraction)
loaded_people_extractor.load("optimized_extractor.json")

loaded_people_extractor(tokens=["Italy", "recalled", "Marcello", "Cuttitta"]).extracted_people

['Marcello', 'Cuttitta']