# ImpPres LLM Baseline

In this notebook, you implement a baseline for ImpPres classification using an LLM.
The baseline must be implemented with DSPy.



In [29]:
# Configure the DSPy environment with the language model.
# For Grok, the API key must be in os.environ['XAI_API_KEY']
# and the model name is "xai/grok-3-mini".
import os
import dspy


In [30]:
def load_api_key(path, name):
    """Copy a `name=value` entry from an .ini-style file into os.environ, skipping commented lines."""
    with open(path) as f:
        for line in f:
            if name in line and not line.strip().startswith("#"):
                key_value = line.strip().split("=")
                if len(key_value) == 2:
                    os.environ[name] = key_value[1].split()[0]

load_api_key("grok_key.ini", "XAI_API_KEY")
load_api_key("gemini_key.ini", "GEMINI_API_KEY")

In [31]:

lm = dspy.LM('xai/grok-3-mini', api_key=os.environ['XAI_API_KEY'])
# For Ollama:
# lm = dspy.LM('ollama_chat/devstral', api_base='http://localhost:11434', api_key='')
dspy.configure(lm=lm)

Let's try a simple CoT optimized with MIPROv2

In [32]:
from typing import Literal

# Define the module.
# Joint prompt module, rebuilding the best module from 1.4.
class anli_classification_signature(dspy.Signature):

    """Label the relationship between given premise and hypothesis."""
    
    premise: str = dspy.InputField()
    hypothesis: str = dspy.InputField()
    label: Literal['entailment', 'contradiction', 'neutral'] = dspy.OutputField()

joint_prompt = dspy.ChainOfThought(anli_classification_signature)

## Implement the DSPy program to classify pairs (premise, hypothesis) as entailment, contradiction, or neutral.

We use the dataset from the previous part (ANLI) as training data.

In [33]:
from datasets import load_dataset

train_dataset = load_dataset("facebook/anli", split="train_r1")
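# Keep only samples with a non-empty 'reason'. Note: the result is stored in `dataset`;
# `train_dataset` itself stays unfiltered.
dataset = train_dataset.filter(lambda x: x['reason'] is not None and x['reason'] != "")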

train_dataset

Dataset({
    features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
    num_rows: 16946
})

In [40]:
from dspy.teleprompt import MIPROv2
import random

label_map = {
    "entailment": 0,
    "neutral": 1,
    "contradiction": 2
}

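# Metric for evaluation: compares the integer gold label with the predicted string label.
def binary_metrics(example, prediction, trace=None):
    # Debug output: gold label (int), then predicted label (str)
    print(example["label"])
    print(prediction.label)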
    return example["label"] == label_map[prediction.label]
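
The metric above compares ANLI's integer gold label with the model's string label through `label_map`. A minimal sketch of a quieter, more defensive variant follows. The name `safe_label_metric` is hypothetical, `SimpleNamespace` stands in for a DSPy prediction object, and treating an unrecognized label as a miss is an assumption rather than something the notebook does.

```python
from types import SimpleNamespace

label_map = {"entailment": 0, "neutral": 1, "contradiction": 2}

def safe_label_metric(example, prediction, trace=None):
    """Return True when the predicted string label maps to the gold integer label."""
    # Normalize case/whitespace; .get() yields None for labels outside label_map.
    predicted = label_map.get(str(prediction.label).strip().lower())
    return predicted is not None and example["label"] == predicted

gold = {"label": 0}  # 0 == entailment
print(safe_label_metric(gold, SimpleNamespace(label="entailment")))
print(safe_label_metric(gold, SimpleNamespace(label="Entailment ")))
print(safe_label_metric(gold, SimpleNamespace(label="unknown")))
```

Because it takes the same `(example, prediction, trace)` arguments, a function like this could be passed as `metric` in place of `binary_metrics`, without the per-example printing.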


In [35]:
prediction = joint_prompt(premise = train_dataset[0]['premise'], hypothesis = train_dataset[0]['hypothesis'])
print(prediction)

Prediction(
    reasoning='The premise explicitly states that the Parma trolleybus system comprises four urban routes. The hypothesis claims that the system has over 2 urban routes. Since four is greater than two, the information in the premise directly confirms the hypothesis, making it true based on the given details.',
    label='entailment'
)


In [36]:

print(binary_metrics(train_dataset[0], joint_prompt(premise = train_dataset[0]['premise'], hypothesis = train_dataset[0]['hypothesis'])))

sample = random.sample(list(train_dataset), 50)

# Wrap each sampled row as a dspy.Example; only premise and hypothesis are inputs.
example_trainset = [
    dspy.Example(premise=ex["premise"], hypothesis=ex["hypothesis"],
                 label=ex["label"], reason=ex["reason"]).with_inputs("premise", "hypothesis")
    for ex in sample
]

0
entailment
True


In [43]:
teleprompter = MIPROv2(
    metric = binary_metrics,
    auto = "light"
)
optimized_program = teleprompter.compile(
    joint_prompt,
    trainset=example_trainset,
    requires_permission_to_run=False,
)

2025/08/10 09:24:11 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING LIGHT AUTO RUN SETTINGS:
num_trials: 10
minibatch: False
num_fewshot_candidates: 6
num_instruct_candidates: 3
valset size: 40

2025/08/10 09:24:11 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/08/10 09:24:11 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/08/10 09:24:11 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=6 sets of demonstrations...


Bootstrapping set 1/6
Bootstrapping set 2/6
Bootstrapping set 3/6


 10%|█         | 1/10 [00:05<00:49,  5.54s/it]

0
neutral


 20%|██        | 2/10 [00:09<00:35,  4.45s/it]

2
entailment


 30%|███       | 3/10 [00:12<00:28,  4.05s/it]

2
contradiction


 40%|████      | 4/10 [00:17<00:25,  4.27s/it]

0
neutral


 50%|█████     | 5/10 [00:21<00:20,  4.04s/it]

1
neutral


 60%|██████    | 6/10 [00:30<00:23,  5.82s/it]

0
neutral


 70%|███████   | 7/10 [00:33<00:15,  5.09s/it]

2
contradiction


 80%|████████  | 8/10 [00:37<00:09,  4.70s/it]


1
neutral
Bootstrapped 4 full traces after 8 examples for up to 1 rounds, amounting to 8 attempts.
Bootstrapping set 4/6


 10%|█         | 1/10 [00:03<00:31,  3.49s/it]

2
contradiction


 20%|██        | 2/10 [00:08<00:34,  4.26s/it]

0
neutral


 30%|███       | 3/10 [00:12<00:31,  4.44s/it]

0
entailment


 40%|████      | 4/10 [00:17<00:27,  4.61s/it]

1
neutral


 50%|█████     | 5/10 [00:21<00:21,  4.28s/it]

2
entailment


 60%|██████    | 6/10 [00:26<00:18,  4.67s/it]

0
neutral


 70%|███████   | 7/10 [00:31<00:14,  4.72s/it]

1
entailment


 80%|████████  | 8/10 [00:37<00:10,  5.04s/it]

0
neutral


 90%|█████████ | 9/10 [00:40<00:04,  4.55s/it]


1
neutral
Bootstrapped 4 full traces after 9 examples for up to 1 rounds, amounting to 9 attempts.
Bootstrapping set 5/6


 10%|█         | 1/10 [00:03<00:34,  3.88s/it]

2
contradiction


 20%|██        | 2/10 [00:07<00:28,  3.58s/it]


0
entailment
Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 6/6


 10%|█         | 1/10 [00:05<00:52,  5.83s/it]

0
contradiction


 20%|██        | 2/10 [00:09<00:35,  4.44s/it]

2
contradiction


 30%|███       | 3/10 [00:12<00:29,  4.26s/it]
2025/08/10 09:25:50 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/08/10 09:25:50 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.


2
contradiction
Bootstrapped 2 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Error getting source code: unhashable type: 'dict'.

Running without program aware proposer.


2025/08/10 09:26:03 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=3 instructions...

2025/08/10 09:26:34 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/08/10 09:26:34 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Label the relationship between given premise and hypothesis.

2025/08/10 09:26:34 INFO dspy.teleprompt.mipro_optimizer_v2: 1: Act as a meticulous fact-checker analyzing real-world statements: Given a detailed premise and a concise hypothesis, first reason step by step to evaluate their relationship—considering whether the hypothesis is fully supported (entailment), neither supported nor contradicted (neutral), or directly opposed (contradiction). Then, assign a clear label based on the reasoning: "entailment" if the premise confirms the hypothesis, "neutral" if it's unrelated or inconclusive, or "contradiction" if there's a conflict. Make your response engaging and thorough to ensure accurate natural language inference.

2025/08/10 

  0%|          | 0/40 [00:00<?, ?it/s]1
neutral
Average Metric: 1.00 / 1 (100.0%):   2%|▎         | 1/40 [00:05<03:21,  5.17s/it]1
neutral
Average Metric: 2.00 / 2 (100.0%):   5%|▌         | 2/40 [00:13<04:28,  7.07s/it]0
entailment
Average Metric: 3.00 / 3 (100.0%):   8%|▊         | 3/40 [00:13<02:29,  4.03s/it]1
neutral
Average Metric: 4.00 / 4 (100.0%):  10%|█         | 4/40 [00:14<01:29,  2.49s/it]1
neutral
Average Metric: 5.00 / 5 (100.0%):  12%|█▎        | 5/40 [00:14<00:58,  1.67s/it]2
contradiction
Average Metric: 6.00 / 6 (100.0%):  15%|█▌        | 6/40 [00:14<00:41,  1.23s/it]0
entailment
Average Metric: 7.00 / 7 (100.0%):  18%|█▊        | 7/40 [00:14<00:30,  1.08it/s]0
entailment
Average Metric: 8.00 / 8 (100.0%):  20%|██        | 8/40 [00:15<00:23,  1.37it/s]1
contradiction
Average Metric: 8.00 / 9 (88.9%):  22%|██▎       | 9/40 [00:16<00:23,  1.32it/s] 2
contradiction
Average Metric: 9.00 / 10 (90.0%):  25%|██▌       | 10/40 [00:17<00:32,  1.09s/it]1
neutral
Average Metric

2025/08/10 09:27:10 INFO dspy.evaluate.evaluate: Average Metric: 37 / 40 (92.5%)
2025/08/10 09:27:10 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 92.5

2025/08/10 09:27:11 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 2 / 10 =====



  0%|          | 0/40 [00:00<?, ?it/s]0
entailment
Average Metric: 1.00 / 1 (100.0%):   2%|▎         | 1/40 [00:04<02:46,  4.28s/it]0
entailment
Average Metric: 2.00 / 2 (100.0%):   2%|▎         | 1/40 [00:04<02:46,  4.28s/it]1
neutral
1
neutral
Average Metric: 3.00 / 3 (100.0%):   5%|▌         | 2/40 [00:04<02:42,  4.28s/it]1
neutral
Average Metric: 3.00 / 3 (100.0%):   8%|▊         | 3/40 [00:04<00:49,  1.33s/it]1
neutral
Average Metric: 6.00 / 6 (100.0%):  12%|█▎        | 5/40 [00:04<00:46,  1.33s/it]2
contradiction
Average Metric: 7.00 / 7 (100.0%):  18%|█▊        | 7/40 [00:05<00:14,  2.22it/s]1
neutral
Average Metric: 8.00 / 8 (100.0%):  18%|█▊        | 7/40 [00:05<00:14,  2.22it/s]0
entailment
Average Metric: 9.00 / 9 (100.0%):  22%|██▎       | 9/40 [00:07<00:20,  1.49it/s]1
neutral
Average Metric: 10.00 / 10 (100.0%):  25%|██▌       | 10/40 [00:08<00:22,  1.32it/s]2
contradiction
Average Metric: 11.00 / 11 (100.0%):  28%|██▊       | 11/40 [00:09<00:23,  1.25it/s]0
entailment
A

2025/08/10 09:27:36 INFO dspy.evaluate.evaluate: Average Metric: 38 / 40 (95.0%)
2025/08/10 09:27:36 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far! Score: 95.0
2025/08/10 09:27:36 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 95.0 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 3'].
2025/08/10 09:27:36 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0]
2025/08/10 09:27:36 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 95.0


2025/08/10 09:27:36 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 3 / 10 =====



  0%|          | 0/40 [00:00<?, ?it/s]1
neutral
Average Metric: 1.00 / 1 (100.0%):   2%|▎         | 1/40 [00:06<03:58,  6.11s/it]0
entailment
0
entailment
Average Metric: 3.00 / 3 (100.0%):   5%|▌         | 2/40 [00:06<01:47,  2.82s/it]1
neutral
Average Metric: 4.00 / 4 (100.0%):  10%|█         | 4/40 [00:07<00:46,  1.28s/it]1
neutral
Average Metric: 5.00 / 5 (100.0%):  12%|█▎        | 5/40 [00:07<00:32,  1.09it/s]2
contradiction
Average Metric: 6.00 / 6 (100.0%):  15%|█▌        | 6/40 [00:07<00:22,  1.49it/s]1
contradiction
Average Metric: 6.00 / 7 (85.7%):  15%|█▌        | 6/40 [00:07<00:22,  1.49it/s] 1
neutral
Average Metric: 7.00 / 8 (87.5%):  20%|██        | 8/40 [00:08<00:21,  1.52it/s]0
entailment
Average Metric: 8.00 / 9 (88.9%):  22%|██▎       | 9/40 [00:12<00:43,  1.42s/it]2
contradiction
Average Metric: 9.00 / 10 (90.0%):  25%|██▌       | 10/40 [00:13<00:36,  1.23s/it]0
entailment
Average Metric: 10.00 / 11 (90.9%):  28%|██▊       | 11/40 [00:13<00:28,  1.02it/s]2
contradi

2025/08/10 09:28:12 INFO dspy.evaluate.evaluate: Average Metric: 36 / 40 (90.0%)
2025/08/10 09:28:12 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 90.0 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 0'].
2025/08/10 09:28:12 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 90.0]
2025/08/10 09:28:12 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 95.0


2025/08/10 09:28:12 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 4 / 10 =====



  0%|          | 0/40 [00:00<?, ?it/s]0
entailment
Average Metric: 1.00 / 1 (100.0%):   2%|▎         | 1/40 [00:03<02:13,  3.43s/it]1
neutral
1
neutral
Average Metric: 3.00 / 3 (100.0%):   5%|▌         | 2/40 [00:04<01:21,  2.13s/it]11
neutral
Average Metric: 4.00 / 4 (100.0%):  10%|█         | 4/40 [00:04<00:32,  1.12it/s]
neutral
Average Metric: 5.00 / 5 (100.0%):  10%|█         | 4/40 [00:04<00:32,  1.12it/s]1
neutral
Average Metric: 6.00 / 6 (100.0%):  15%|█▌        | 6/40 [00:05<00:19,  1.76it/s]0
entailment
Average Metric: 7.00 / 7 (100.0%):  15%|█▌        | 6/40 [00:05<00:19,  1.76it/s]2
contradiction
Average Metric: 8.00 / 8 (100.0%):  20%|██        | 8/40 [00:06<00:18,  1.71it/s]0
entailment
Average Metric: 9.00 / 9 (100.0%):  22%|██▎       | 9/40 [00:07<00:18,  1.68it/s]0
entailment
Average Metric: 10.00 / 10 (100.0%):  25%|██▌       | 10/40 [00:08<00:19,  1.51it/s]2
contradiction
Average Metric: 11.00 / 11 (100.0%):  28%|██▊       | 11/40 [00:08<00:15,  1.82it/s]1
neutral
A

2025/08/10 09:28:39 INFO dspy.evaluate.evaluate: Average Metric: 37 / 40 (92.5%)
2025/08/10 09:28:39 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 92.5 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 5'].
2025/08/10 09:28:39 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 90.0, 92.5]
2025/08/10 09:28:39 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 95.0


2025/08/10 09:28:39 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 5 / 10 =====



  0%|          | 0/40 [00:00<?, ?it/s]1
neutral
Average Metric: 1.00 / 1 (100.0%):   2%|▎         | 1/40 [00:04<02:58,  4.57s/it]0
entailment
Average Metric: 2.00 / 2 (100.0%):   5%|▌         | 2/40 [00:04<01:14,  1.95s/it]1
neutral
Average Metric: 3.00 / 3 (100.0%):   8%|▊         | 3/40 [00:06<01:08,  1.85s/it]1
neutral
Average Metric: 4.00 / 4 (100.0%):  10%|█         | 4/40 [00:06<00:46,  1.28s/it]1
neutral
Average Metric: 5.00 / 5 (100.0%):  12%|█▎        | 5/40 [00:07<00:34,  1.02it/s]0
entailment
2
contradiction
Average Metric: 7.00 / 7 (100.0%):  15%|█▌        | 6/40 [00:07<00:26,  1.29it/s]1
neutral
Average Metric: 8.00 / 8 (100.0%):  20%|██        | 8/40 [00:08<00:19,  1.68it/s]1
neutral
Average Metric: 9.00 / 9 (100.0%):  22%|██▎       | 9/40 [00:10<00:32,  1.06s/it]0
entailment
Average Metric: 10.00 / 10 (100.0%):  25%|██▌       | 10/40 [00:10<00:24,  1.25it/s]1
neutral
Average Metric: 11.00 / 11 (100.0%):  28%|██▊       | 11/40 [00:11<00:20,  1.41it/s]0
entailment
Average

2025/08/10 09:29:12 INFO dspy.evaluate.evaluate: Average Metric: 37 / 40 (92.5%)
2025/08/10 09:29:12 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 92.5 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 2'].
2025/08/10 09:29:12 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 90.0, 92.5, 92.5]
2025/08/10 09:29:12 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 95.0


2025/08/10 09:29:12 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 6 / 10 =====



  0%|          | 0/40 [00:00<?, ?it/s]0
entailment
Average Metric: 1.00 / 1 (100.0%):   2%|▎         | 1/40 [00:03<02:34,  3.95s/it]1
neutral
Average Metric: 2.00 / 2 (100.0%):   2%|▎         | 1/40 [00:03<02:34,  3.95s/it]1
neutral
Average Metric: 3.00 / 3 (100.0%):   8%|▊         | 3/40 [00:04<00:39,  1.07s/it]1
neutral
Average Metric: 4.00 / 4 (100.0%):   8%|▊         | 3/40 [00:04<00:39,  1.07s/it]0
entailment
Average Metric: 5.00 / 5 (100.0%):  12%|█▎        | 5/40 [00:04<00:22,  1.58it/s]1
neutral
Average Metric: 6.00 / 6 (100.0%):  15%|█▌        | 6/40 [00:04<00:18,  1.88it/s]1
neutral
Average Metric: 7.00 / 7 (100.0%):  18%|█▊        | 7/40 [00:05<00:15,  2.13it/s]2
contradiction
Average Metric: 8.00 / 8 (100.0%):  20%|██        | 8/40 [00:05<00:15,  2.08it/s]0
entailment
Average Metric: 9.00 / 9 (100.0%):  22%|██▎       | 9/40 [00:06<00:20,  1.50it/s]0
entailment
Average Metric: 10.00 / 10 (100.0%):  25%|██▌       | 10/40 [00:07<00:20,  1.43it/s]1
neutral
Average Metric: 11.0

2025/08/10 09:29:37 INFO dspy.evaluate.evaluate: Average Metric: 37 / 40 (92.5%)
2025/08/10 09:29:37 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 92.5 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 5'].
2025/08/10 09:29:37 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 90.0, 92.5, 92.5, 92.5]
2025/08/10 09:29:37 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 95.0


2025/08/10 09:29:37 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 10 =====



2
contradiction
1
neutral
0
entailment
1
contradiction
1
neutral
1
neutral
1
neutral
0
entailment
0
entailment
0
entailment
2
contradiction
2
contradiction
1
neutral
1
neutral
1
neutral
0
entailment
2
contradiction
1
neutral
0
entailment
1
neutral
2
neutral
0
entailment
1
neutral
0
entailment
1
neutral
0
entailment
2
contradiction
1
neutral
0
entailment
0
entailment
2
contradiction
0
neutral
2
contradiction
0
entailment
2
contradiction
2
contradiction
0
entailment
1
neutral
0
entailment
1
contradiction
Average Metric: 36.00 / 40 (90.0%): 100%|██████████| 40/40 [00:00<00:00, 507.09it/s] 

2025/08/10 09:29:37 INFO dspy.evaluate.evaluate: Average Metric: 36 / 40 (90.0%)
2025/08/10 09:29:37 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 90.0 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 0'].
2025/08/10 09:29:37 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 90.0, 92.5, 92.5, 92.5, 90.0]
2025/08/10 09:29:37 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 95.0


2025/08/10 09:29:37 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 8 / 10 =====



  0%|          | 0/40 [00:00<?, ?it/s]1
neutral
Average Metric: 1.00 / 1 (100.0%):   2%|▎         | 1/40 [00:03<02:30,  3.85s/it]1
neutral
Average Metric: 2.00 / 2 (100.0%):   5%|▌         | 2/40 [00:04<01:08,  1.79s/it]0
entailment
Average Metric: 3.00 / 3 (100.0%):   5%|▌         | 2/40 [00:04<01:08,  1.79s/it]1
neutral
Average Metric: 4.00 / 4 (100.0%):   8%|▊         | 3/40 [00:05<01:06,  1.79s/it]2
contradiction
Average Metric: 5.00 / 5 (100.0%):  10%|█         | 4/40 [00:05<00:37,  1.04s/it]0
entailment
Average Metric: 6.00 / 6 (100.0%):  15%|█▌        | 6/40 [00:05<00:22,  1.50it/s]1
neutral
Average Metric: 7.00 / 7 (100.0%):  18%|█▊        | 7/40 [00:07<00:33,  1.02s/it]1
contradiction
Average Metric: 7.00 / 8 (87.5%):  20%|██        | 8/40 [00:08<00:29,  1.10it/s] 0
entailment
Average Metric: 8.00 / 9 (88.9%):  22%|██▎       | 9/40 [00:09<00:31,  1.03s/it]2
neutral
Average Metric: 8.00 / 10 (80.0%):  25%|██▌       | 10/40 [00:10<00:25,  1.18it/s]2
contradiction
Average Metric

2025/08/10 09:30:14 INFO dspy.evaluate.evaluate: Average Metric: 36 / 40 (90.0%)
2025/08/10 09:30:14 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 90.0 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5'].
2025/08/10 09:30:14 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 90.0, 92.5, 92.5, 92.5, 90.0, 90.0]
2025/08/10 09:30:14 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 95.0


2025/08/10 09:30:14 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 9 / 10 =====



  0%|          | 0/40 [00:00<?, ?it/s]0
entailment
Average Metric: 1.00 / 1 (100.0%):   2%|▎         | 1/40 [00:03<02:18,  3.56s/it]1
neutral
Average Metric: 2.00 / 2 (100.0%):   5%|▌         | 2/40 [00:04<01:07,  1.77s/it]1
neutral
Average Metric: 3.00 / 3 (100.0%):   8%|▊         | 3/40 [00:04<00:37,  1.02s/it]1
neutral
Average Metric: 4.00 / 4 (100.0%):  10%|█         | 4/40 [00:04<00:28,  1.27it/s]1
neutral
Average Metric: 5.00 / 5 (100.0%):  12%|█▎        | 5/40 [00:05<00:27,  1.28it/s]0
entailment
Average Metric: 6.00 / 6 (100.0%):  15%|█▌        | 6/40 [00:05<00:22,  1.54it/s]1
neutral
Average Metric: 7.00 / 7 (100.0%):  15%|█▌        | 6/40 [00:05<00:22,  1.54it/s]2
contradiction
Average Metric: 8.00 / 8 (100.0%):  20%|██        | 8/40 [00:06<00:17,  1.78it/s]0
entailment
Average Metric: 9.00 / 9 (100.0%):  20%|██        | 8/40 [00:06<00:17,  1.78it/s]0
entailment
Average Metric: 10.00 / 10 (100.0%):  25%|██▌       | 10/40 [00:08<00:18,  1.65it/s]2
contradiction
Average Metric

2025/08/10 09:30:41 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)
2025/08/10 09:30:41 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far! Score: 97.5
2025/08/10 09:30:41 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 4'].
2025/08/10 09:30:41 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 90.0, 92.5, 92.5, 92.5, 90.0, 90.0, 97.5]
2025/08/10 09:30:41 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/08/10 09:30:41 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 10 / 10 =====



2
contradiction
1
neutral
0
entailment
1
contradiction
1
neutral
1
neutral
1
neutral
0
entailment
0
entailment
0
entailment
2
contradiction
2
neutral
1
neutral
1
neutral
1
neutral
0
entailment
2
contradiction
1
neutral
1
neutral
2
contradiction
0
entailment
1
neutral
0
entailment
1
neutral
0
entailment
2
contradiction
1
neutral
0
entailment
0
entailment
2
contradiction
0
neutral
0
entailment
2
contradiction
0
entailment
2
contradiction
2
contradiction
0
entailment
1
neutral
0
entailment
1
contradiction
Average Metric: 36.00 / 40 (90.0%): 100%|██████████| 40/40 [00:00<00:00, 582.27it/s] 

2025/08/10 09:30:41 INFO dspy.evaluate.evaluate: Average Metric: 36 / 40 (90.0%)
2025/08/10 09:30:41 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 90.0 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5'].
2025/08/10 09:30:41 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 90.0, 92.5, 92.5, 92.5, 90.0, 90.0, 97.5, 90.0]
2025/08/10 09:30:41 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/08/10 09:30:41 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 11 / 10 =====



2
contradiction
0
entailment
1
neutral
1
neutral
1
neutral
1
neutral
0
entailment
0
entailment
1
neutral
2
contradiction
1
neutral
1
neutral
0
entailment
0
entailment
1
neutral
2
contradiction
2
contradiction
1
neutral
1
neutral
2
contradiction
0
entailment
0
entailment
2
contradiction
0
entailment
Average Metric: 1.00 / 1 (100.0%):   0%|          | 0/40 [00:00<?, ?it/s]
entailment
1
neutral
1
neutral
2
contradiction
2
contradiction
0
entailment
0
entailment
1
neutral
0
entailment
0
entailment
1
neutral
0
entailment
0
entailment
2
contradiction
Average Metric: 2.00 / 2 (100.0%):   2%|▎         | 1/40 [00:00<00:03, 10.30it/s]2
contradiction
1
contradiction
Average Metric: 39.00 / 40 (97.5%): 100%|██████████| 40/40 [00:00<00:00, 182.67it/s]

2025/08/10 09:30:42 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)
2025/08/10 09:30:42 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 4'].
2025/08/10 09:30:42 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 90.0, 92.5, 92.5, 92.5, 90.0, 90.0, 97.5, 90.0, 97.5]
2025/08/10 09:30:42 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/08/10 09:30:42 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 97.5!
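
The optimizer log is long, and the per-trial results are easy to lose among the debug prints. As a rough sketch, the score lines can be pulled out with a regular expression. The three sample lines are copied from the log above; the helper `parse_trials` and the regex are illustrative assumptions about the log format, not part of DSPy.

```python
import re

log = """\
2025/08/10 09:27:36 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 95.0 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 3'].
2025/08/10 09:28:12 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 90.0 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 0'].
2025/08/10 09:30:41 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 4'].
"""

# Matches "Score: <number> with parameters [<list contents>]".
SCORE_LINE = re.compile(r"Score: ([\d.]+) with parameters \[(.*?)\]")

def parse_trials(text):
    """Return a (score, parameter names) tuple for every per-trial score line."""
    trials = []
    for score, params in SCORE_LINE.findall(text):
        trials.append((float(score), re.findall(r"'([^']*)'", params)))
    return trials

trials = parse_trials(log)
best_score, best_params = max(trials, key=lambda trial: trial[0])
print(best_score, best_params)
```

On these sample lines the best trial is the one combining Instruction 1 with Few-Shot Set 4, matching the "Returning best identified program with score 97.5!" line.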





## Load ImpPres dataset

In [37]:
from datasets import load_dataset

sections = ['presupposition_all_n_presupposition', 
            'presupposition_both_presupposition', 
            'presupposition_change_of_state', 
            'presupposition_cleft_existence', 
            'presupposition_cleft_uniqueness', 
            'presupposition_only_presupposition', 
            'presupposition_possessed_definites_existence', 
            'presupposition_possessed_definites_uniqueness', 
            'presupposition_question_presupposition']

dataset = {}
for section in sections:
    print(f"Loading dataset for section: {section}")
    dataset[section] = load_dataset("facebook/imppres", section)

Loading dataset for section: presupposition_all_n_presupposition
Loading dataset for section: presupposition_both_presupposition
Loading dataset for section: presupposition_change_of_state
Loading dataset for section: presupposition_cleft_existence
Loading dataset for section: presupposition_cleft_uniqueness
Loading dataset for section: presupposition_only_presupposition
Loading dataset for section: presupposition_possessed_definites_existence
Loading dataset for section: presupposition_possessed_definites_uniqueness
Loading dataset for section: presupposition_question_presupposition


In [38]:
dataset

{'presupposition_all_n_presupposition': DatasetDict({
     all_n_presupposition: Dataset({
         features: ['premise', 'hypothesis', 'trigger', 'trigger1', 'trigger2', 'presupposition', 'gold_label', 'UID', 'pairID', 'paradigmID'],
         num_rows: 1900
     })
 }),
 'presupposition_both_presupposition': DatasetDict({
     both_presupposition: Dataset({
         features: ['premise', 'hypothesis', 'trigger', 'trigger1', 'trigger2', 'presupposition', 'gold_label', 'UID', 'pairID', 'paradigmID'],
         num_rows: 1900
     })
 }),
 'presupposition_change_of_state': DatasetDict({
     change_of_state: Dataset({
         features: ['premise', 'hypothesis', 'trigger', 'trigger1', 'trigger2', 'presupposition', 'gold_label', 'UID', 'pairID', 'paradigmID'],
         num_rows: 1900
     })
 }),
 'presupposition_cleft_existence': DatasetDict({
     cleft_existence: Dataset({
         features: ['premise', 'hypothesis', 'trigger', 'trigger1', 'trigger2', 'presupposition', 'gold_label', 'UI

## Evaluate Metrics

Let's use the Hugging Face `evaluate` package to compute the performance of the baseline.


In [39]:
import evaluate
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

## Your Turn

Compute the classification metrics on the baseline LLM model on each test section of the ANLI dataset for samples that have a non-empty 'reason' field.

You also must show a comparison between the DeBERTa baseline model and this LLM baseline model. The comparison metric should compute the agreement between the two models:
* On how many samples they are both correct [Correct]
* On how many samples Model1 is correct and Model2 is incorrect [Correct1]
* On how many samples Model1 is incorrect and Model2 is correct [Correct2]
* On how many samples both are incorrect [Incorrect]
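
The four agreement buckets listed above can be sketched as a single pass over the two prediction lists and the gold labels. The helper name `agreement_counts` and the toy labels are illustrative assumptions; only the bucket names come from the task description.

```python
def agreement_counts(model1_preds, model2_preds, gold_labels):
    """Count the four correct/incorrect combinations for two models."""
    counts = {"Correct": 0, "Correct1": 0, "Correct2": 0, "Incorrect": 0}
    for pred1, pred2, gold in zip(model1_preds, model2_preds, gold_labels):
        ok1, ok2 = pred1 == gold, pred2 == gold
        if ok1 and ok2:
            counts["Correct"] += 1      # both models correct
        elif ok1:
            counts["Correct1"] += 1     # only Model1 correct
        elif ok2:
            counts["Correct2"] += 1     # only Model2 correct
        else:
            counts["Incorrect"] += 1    # both models incorrect
    return counts

# Toy data: one sample per bucket.
gold = ["entailment", "neutral", "contradiction", "neutral"]
model1 = ["entailment", "neutral", "neutral", "contradiction"]
model2 = ["entailment", "contradiction", "contradiction", "contradiction"]
print(agreement_counts(model1, model2, gold))
```

The four counts always sum to the number of samples, which is a quick sanity check on any real run.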

In [55]:
import json

with open("dataset_with_baseline_evaluation.json", "r") as f:
    dataset_with_baseline_evaluation = json.load(f)

In [59]:
from tqdm import tqdm

for section_name, entries in dataset_with_baseline_evaluation.items():
    for entry in tqdm(filter(lambda ex: 0 <= ex['paradigmID'] <= 9, entries), total = 190):
        prediction = optimized_program(premise=entry['premise'], hypothesis=entry['hypothesis'])
        entry['llm_pred_label'] = prediction.label
        entry['CoT_reasoning'] = prediction.reasoning


  0%|          | 0/190 [00:00<?, ?it/s]

100%|██████████| 190/190 [14:30<00:00,  4.58s/it]
100%|██████████| 190/190 [14:41<00:00,  4.64s/it]
100%|██████████| 190/190 [14:24<00:00,  4.55s/it]
100%|██████████| 190/190 [14:39<00:00,  4.63s/it]
100%|██████████| 190/190 [14:14<00:00,  4.50s/it]
100%|██████████| 190/190 [14:01<00:00,  4.43s/it]
100%|██████████| 190/190 [15:07<00:00,  4.78s/it]
100%|██████████| 190/190 [17:46<00:00,  5.61s/it]
100%|██████████| 190/190 [14:14<00:00,  4.50s/it]
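
The loop above annotates only entries whose `paradigmID` lies between 0 and 9, which the progress bars show to be 190 entries per section. A minimal sketch of selecting the annotated entries by key rather than by position follows; the helper name `evaluated_entries` and the toy records are illustrative assumptions.

```python
def evaluated_entries(entries, key="llm_pred_label"):
    """Keep only the entries that already carry an LLM prediction."""
    return [entry for entry in entries if key in entry]

# Toy section: three paradigms with two entries each (synthetic sizes, not ImpPres).
section = [
    {"paradigmID": pid, "premise": f"premise {pid}-{i}"}
    for pid in (0, 10, 5)
    for i in range(2)
]
for entry in section:
    if 0 <= entry["paradigmID"] <= 9:
        entry["llm_pred_label"] = "neutral"  # placeholder for a real model call

subset = evaluated_entries(section)
print(len(subset), [entry["paradigmID"] for entry in subset])
```

Selecting by key does not depend on entry order, whereas a positional slice such as `entries[:190]` only matches when the annotated entries happen to come first.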


In [72]:
from collections import Counter

accuracy = evaluate.load("accuracy")
precision = evaluate.load("precision")
recall = evaluate.load("recall")
f1 = evaluate.load("f1")

def display_metrics(section_evaluation_results, section_name):
    # Prepare predictions and gold labels
    baseline_preds = [label_map[e["baseline_pred_label"]] for e in section_evaluation_results]
    llm_preds = [label_map[e["llm_pred_label"]] for e in section_evaluation_results]
    gold_labels = [label_map[e["gold_label"]] for e in section_evaluation_results]

    # Standard metrics for the LLM predictions
    acc = accuracy.compute(predictions=llm_preds, references=gold_labels)["accuracy"]
    prec = precision.compute(predictions=llm_preds, references=gold_labels, average="weighted", zero_division=0)["precision"]
    rec = recall.compute(predictions=llm_preds, references=gold_labels, average="weighted")["recall"]
    f1_score = f1.compute(predictions=llm_preds, references=gold_labels, average="weighted")["f1"]

    # Overlap statistics
    both_correct = sum(
        1 for b, l, g in zip(baseline_preds, llm_preds, gold_labels) if b == g and l == g
    )
    correct1_wrong2 = sum(
        1 for b, l, g in zip(baseline_preds, llm_preds, gold_labels) if b == g and l != g
    )
    wrong1_correct2 = sum(
        1 for b, l, g in zip(baseline_preds, llm_preds, gold_labels) if b != g and l == g
    )
    both_wrong = sum(
        1 for b, l, g in zip(baseline_preds, llm_preds, gold_labels) if b != g and l != g
    )

    # Print results
    print(f"\n=== Metrics for {section_name} ===")
    # Counts of each label predicted:
    print("Gold label counts:", Counter(gold_labels))
    print("Baseline prediction counts:", Counter(baseline_preds))
    print("LLM prediction counts:", Counter(llm_preds))
    print(f"{'Accuracy:':<25} {acc:.4f}")
    print(f"{'Precision (weighted):':<25} {prec:.4f}")
    print(f"{'Recall (weighted):':<25} {rec:.4f}")
    print(f"{'F1-score (weighted):':<25} {f1_score:.4f}")
    print("\n=== Model Agreement/Disagreement ===")
    print(f"{'Both correct:':<25} {both_correct}")
    print(f"{'Baseline correct, LLM wrong:':<25} {correct1_wrong2}")
    print(f"{'Baseline wrong, LLM correct:':<25} {wrong1_correct2}")
    print(f"{'Both wrong:':<25} {both_wrong}")
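
With `average="weighted"`, the per-class scores are averaged with each class weighted by its support, meaning its count in the gold labels. A minimal pure-Python sketch of that computation on toy labels follows, assuming the usual scikit-learn-style definition with zero division treated as 0; `weighted_prf` is a hypothetical helper, not part of `evaluate`.

```python
from collections import Counter

def weighted_prf(predictions, references):
    """Support-weighted precision, recall, and F1 over the gold classes."""
    support = Counter(references)
    total = len(references)
    precision = recall = f1 = 0.0
    for label, n_gold in support.items():
        true_pos = sum(1 for p, g in zip(predictions, references) if p == g == label)
        n_pred = sum(1 for p in predictions if p == label)
        p_c = true_pos / n_pred if n_pred else 0.0   # class precision
        r_c = true_pos / n_gold                      # class recall
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = n_gold / total                      # share of gold labels
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1

gold = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
precision, recall, f1 = weighted_prf(pred, gold)
accuracy = sum(p == g for p, g in zip(pred, gold)) / len(gold)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

Support-weighted recall always equals accuracy, which is why the Accuracy and Recall rows in the results below are identical.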


In [73]:
for section_name, entries in dataset_with_baseline_evaluation.items():
    display_metrics(entries[:190], section_name)


=== Metrics for presupposition_all_n_presupposition ===
Gold label counts: Counter({1: 80, 2: 60, 0: 50})
Baseline prediction counts: Counter({2: 110, 1: 55, 0: 25})
LLM prediction counts: Counter({1: 93, 2: 59, 0: 38})
Accuracy:                 0.9105
Precision (weighted):     0.9214
Recall (weighted):        0.9105
F1-score (weighted):      0.9095

=== Model Agreement/Disagreement ===
Both correct:             82
Baseline correct, LLM wrong: 4
Baseline wrong, LLM correct: 91
Both wrong:               13

=== Metrics for presupposition_both_presupposition ===
Gold label counts: Counter({1: 80, 2: 60, 0: 50})
Baseline prediction counts: Counter({2: 103, 1: 69, 0: 18})
LLM prediction counts: Counter({1: 87, 2: 58, 0: 45})
Accuracy:                 0.9632
Precision (weighted):     0.9661
Recall (weighted):        0.9632
F1-score (weighted):      0.9631

=== Model Agreement/Disagreement ===
Both correct:             73
Baseline correct, LLM wrong: 2
Baseline wrong, LLM correct: 110
Both 