# ANLI Baseline with LLM

In this notebook, implement a baseline for ANLI classification using an LLM.
The baseline must be implemented with DSPy.



In [None]:
# Configure the DSPy environment with the language model.
# For Grok, the API key must be available in os.environ['XAI_API_KEY'].
# Model identifiers:
#   "xai/grok-3-mini"
#   "gemini/gemini-2.5-flash"
import os
import dspy

In [21]:
def load_api_key(path, name):
    """Read a `name=value` line from an ini-style file and export it to os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip commented-out lines and lines that do not mention the key
            if line.startswith("#") or name not in line:
                continue
            _, sep, value = line.partition("=")
            # Guard against a missing "=" or an empty value
            if sep and value.strip():
                os.environ[name] = value.split()[0]

load_api_key("grok_key.ini", "XAI_API_KEY")
load_api_key("gemini_key.ini", "GEMINI_API_KEY")

In [48]:


lm = dspy.LM('xai/grok-3-mini', api_key=os.environ['XAI_API_KEY'])
# for ollama 
# lm = dspy.LM('ollama_chat/devstral', api_base='http://localhost:11434', api_key='')
dspy.configure(lm=lm)

In [69]:
from typing import Literal

# Define the signature (input and output fields) used by the classifier
class anli_classification_signature(dspy.Signature):

    """Label the relationship between the given premise and hypothesis."""
    
    premise: str = dspy.InputField()
    hypothesis: str = dspy.InputField()
    label: Literal['entailment', 'contradiction', 'neutral'] = dspy.OutputField()
    reason: str = dspy.OutputField()

classify = dspy.ChainOfThought(anli_classification_signature)


## Load ANLI dataset

In [32]:
from datasets import load_dataset

dataset = load_dataset("facebook/anli")
# Keep only samples that have a non-empty 'reason' field
dataset = dataset.filter(lambda x: x['reason'] is not None and x['reason'] != "")

In [33]:
dataset

DatasetDict({
    train_r1: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 2923
    })
    dev_r1: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    test_r1: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    train_r2: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 4861
    })
    dev_r2: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    test_r2: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    train_r3: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 13375
    })
    dev_r3: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1200


## Evaluate Metrics

Let's use the Hugging Face `evaluate` package to compute the performance of the baseline.


In [34]:
from evaluate import load

accuracy = load("accuracy")
precision = load("precision")
recall = load("recall")
f1 = load("f1")


In [35]:
import evaluate
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

In [36]:
clf_metrics.compute(predictions=[0, 1, 0], references=[0, 1, 1])

{'accuracy': 0.6666666666666666,
 'f1': 0.6666666666666666,
 'precision': 1.0,
 'recall': 0.5}
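The toy example above is binary. ANLI has three classes, so precision, recall, and F1 need an averaging strategy before they can be reported as single numbers. The following is a minimal pure-Python sketch of macro averaging, assuming integer labels 0, 1, and 2. The function name `macro_metrics` is hypothetical and is not part of the `evaluate` package.

```python
from collections import Counter

def macro_metrics(predictions, references, labels=(0, 1, 2)):
    """Accuracy plus macro-averaged precision, recall, and F1 (illustrative helper)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")

    # Per-class true positives, false positives, and false negatives
    tp, fp, fn = Counter(), Counter(), Counter()
    for pred, gold in zip(predictions, references):
        if pred == gold:
            tp[gold] += 1
        else:
            fp[pred] += 1
            fn[gold] += 1

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        p = tp[label] / p_den if p_den else 0.0
        r = tp[label] / r_den if r_den else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(references)
    k = len(labels)
    return {
        "accuracy": sum(tp.values()) / n if n else 0.0,
        "precision": sum(precisions) / k,
        "recall": sum(recalls) / k,
        "f1": sum(f1s) / k,
    }
```

Macro averaging weights the three classes equally, which is one reasonable choice here; weighted or micro averaging are alternatives if class balance matters differently for the comparison.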

## Your Turn

Compute the classification metrics of the baseline LLM on each test section of the ANLI dataset, using only samples that have a non-empty 'reason' field.

You must also show a comparison between the DeBERTa baseline model and this LLM baseline model. The comparison metric should compute the agreement between the two models:
* On how many samples they are both correct [Correct]
* On how many samples Model1 is correct and Model2 is incorrect [Correct1]
* On how many samples Model1 is incorrect and Model2 is correct [Correct2]
* On how many samples both are incorrect [Incorrect]
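The four agreement counts listed above can be sketched as follows. This is a minimal illustration, assuming the gold labels and both models' predictions are aligned lists of the same length; the function name `agreement_counts` and its argument names are hypothetical.

```python
def agreement_counts(gold, preds_model1, preds_model2):
    """Count samples by which of the two models predicted the gold label."""
    if not (len(gold) == len(preds_model1) == len(preds_model2)):
        raise ValueError("all three lists must have the same length")

    counts = {"Correct": 0, "Correct1": 0, "Correct2": 0, "Incorrect": 0}
    for g, p1, p2 in zip(gold, preds_model1, preds_model2):
        ok1 = p1 == g
        ok2 = p2 == g
        if ok1 and ok2:
            counts["Correct"] += 1      # both models correct
        elif ok1:
            counts["Correct1"] += 1     # only Model1 correct
        elif ok2:
            counts["Correct2"] += 1     # only Model2 correct
        else:
            counts["Incorrect"] += 1    # both models incorrect
    return counts
```

The four counts always sum to the number of samples, which gives a quick sanity check when the two prediction lists come from different notebooks.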

In [87]:
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# Map the LLM's string labels to ANLI's integer labels
label_map = {
    "entailment": 0,
    "neutral": 1,
    "contradiction": 2
}

# Metric for the optimizer: True when the predicted label matches the gold label
def binary_metrics(example, prediction, trace=None):
    # print(example.label)  # debug: uncomment to inspect the gold labels
    # .get() returns None for an unexpected label string, which counts as a miss
    return example.label == label_map.get(prediction["label"])

# Build a small training set from the first 50 examples of dev_r3
example_trainset = []
for example in dataset['dev_r3']:
    example_trainset.append(
        dspy.Example(
            premise=example["premise"],
            hypothesis=example["hypothesis"],
            label=example["label"],
            reason=example["reason"],
        ).with_inputs("premise", "hypothesis")
    )
    if len(example_trainset) >= 50:
        break




In [89]:

# Optimize the few-shot demonstrations on the dev_r3 subset built above

dspy.configure_cache(
    enable_disk_cache=True,
    enable_memory_cache=True,
)

config = dict(max_bootstrapped_demos=4, max_labeled_demos=4, num_candidate_programs=10, num_threads=4)

teleprompter = BootstrapFewShotWithRandomSearch(metric=binary_metrics, **config)
optimized_program = teleprompter.compile(classify, trainset=example_trainset)



Going to sample between 1 and 4 traces per predictor.
Will attempt to bootstrap 10 candidate sets.

2025/08/05 17:07:06 INFO dspy.evaluate.evaluate: Average Metric: 34 / 50 (68.0%)
New best score: 68.0 for seed -3
Scores so far: [68.0]
Best score so far: 68.0

2025/08/05 17:08:52 INFO dspy.evaluate.evaluate: Average Metric: 30 / 50 (60.0%)
Scores so far: [68.0, 60.0]
Best score so far: 68.0

Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
2025/08/05 17:11:37 INFO dspy.evaluate.evaluate: Average Metric: 34 / 50 (68.0%)
Scores so far: [68.0, 60.0, 68.0]
Best score so far: 68.0

Bootstrapped 4 full traces after 7 examples for up to 1 rounds, amounting to 7 attempts.
2025/08/05 17:14:04 INFO dspy.evaluate.evaluate: Average Metric: 34 / 50 (68.0%)
Scores so far: [68.0, 60.0, 68.0, 68.0]
Best score so far: 68.0

Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
2025/08/05 17:15:54 INFO dspy.evaluate.evaluate: Average Metric: 29 / 50 (58.0%)
Scores so far: [68.0, 60.0, 68.0, 68.0, 58.0]
Best score so far: 68.0

Bootstrapped 1 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
2025/08/05 17:18:02 INFO dspy.evaluate.evaluate: Average Metric: 30 / 50 (60.0%)
Scores so far: [68.0, 60.0, 68.0, 68.0, 58.0, 60.0]
Best score so far: 68.0

Bootstrapped 2 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.
2025/08/05 17:20:17 INFO dspy.evaluate.evaluate: Average Metric: 36 / 50 (72.0%)
New best score: 72.0 for seed 3
Scores so far: [68.0, 60.0, 68.0, 68.0, 58.0, 60.0, 72.0]
Best score so far: 72.0

Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
2025/08/05 17:22:02 INFO dspy.evaluate.evaluate: Average Metric: 36 / 50 (72.0%)
Scores so far: [68.0, 60.0, 68.0, 68.0, 58.0, 60.0, 72.0, 72.0]
Best score so far: 72.0

Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
2025/08/05 17:23:44 INFO dspy.evaluate.evaluate: Average Metric: 32 / 50 (64.0%)
Scores so far: [68.0, 60.0, 68.0, 68.0, 58.0, 60.0, 72.0, 72.0, 64.0]
Best score so far: 72.0

Bootstrapped 1 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
2025/08/05 17:25:50 INFO dspy.evaluate.evaluate: Average Metric: 31 / 50 (62.0%)
Scores so far: [68.0, 60.0, 68.0, 68.0, 58.0, 60.0, 72.0, 72.0, 64.0, 62.0]
Best score so far: 72.0

Bootstrapped 3 full traces after 9 examples for up to 1 rounds, amounting to 9 attempts.
2025/08/05 17:28:16 INFO dspy.evaluate.evaluate: Average Metric: 32 / 50 (64.0%)
Scores so far: [68.0, 60.0, 68.0, 68.0, 58.0, 60.0, 72.0, 72.0, 64.0, 62.0, 64.0]
Best score so far: 72.0

Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
2025/08/05 17:29:58 INFO dspy.evaluate.evaluate: Average Metric: 34 / 50 (68.0%)
Scores so far: [68.0, 60.0, 68.0, 68.0, 58.0, 60.0, 72.0, 72.0, 64.0, 62.0, 64.0, 68.0]
Best score so far: 72.0

Bootstrapped 4 full traces after 8 examples for up to 1 rounds, amounting to 8 attempts.
2025/08/05 17:32:22 INFO dspy.evaluate.evaluate: Average Metric: 34 / 50 (68.0%)
Scores so far: [68.0, 60.0, 68.0, 68.0, 58.0, 60.0, 72.0, 72.0, 64.0, 62.0, 64.0, 68.0, 68.0]
Best score so far: 72.0
13 candidate programs found.


In [None]:
# Next on the agenda:
# - Create a list of evaluations (pred/gold labels) on test_r3, same as the 1.2 baseline.
# - Combine it with the list made in 1.2, including both pred labels and reasons.
# - Compare the results with the baseline and provide agreement metrics between the two models.
# - Agreement metrics: maybe follow GPT, seems reasonable.
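The first agenda item, collecting predicted and gold labels, can be sketched independently of the model. The helper below is a minimal illustration, not the notebook's final implementation: `collect_predictions` and `predict_fn` are hypothetical names, `predict_fn` is assumed to take a premise and a hypothesis and return a label string, and `-1` is an assumed placeholder for any output that does not match a known label.

```python
# Same mapping as the notebook's label_map (string label -> ANLI integer label)
LABEL_MAP = {"entailment": 0, "neutral": 1, "contradiction": 2}

def collect_predictions(examples, predict_fn):
    """Run predict_fn over examples and return (predictions, gold_labels).

    predict_fn is a hypothetical callable: (premise, hypothesis) -> label string.
    Unrecognized label strings are mapped to -1 so they always count as errors.
    """
    preds, golds = [], []
    for ex in examples:
        raw = predict_fn(ex["premise"], ex["hypothesis"])
        # Normalize case and whitespace before looking up the integer label
        preds.append(LABEL_MAP.get(str(raw).strip().lower(), -1))
        golds.append(ex["label"])
    return preds, golds
```

With the objects defined earlier in this notebook, `predict_fn` could plausibly be `lambda p, h: optimized_program(premise=p, hypothesis=h).label`, and `examples` could be `dataset["test_r3"]`; both are assumptions about how the remaining cells will be written.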