If you are using this notebook on google colab, you might want to uncomment the following two cells

In [1]:
# from google.colab import userdata

In [2]:
# !pip install "dspy>=3.0.3" sacrebleu tqdm rich

## LM Setup and Intro

The API key for [ai.ufal.mff.cuni.cz](https://chat.ai.e-infra.cz/) can be found like this:
1. Click on the profile picture at the top right corner
2. Select Settings
3. Select the Account tab
4. Expand API keys
5. Click the Copy to Clipboard icon

You should copy the key and add it as a new Google Colab secret (left row, icon with a key) under the name "CHAT_AI_API_KEY". If using the notebook locally, you can store the keys in files, for example in ~/.ssh/chat_ai_api.key, ideally, this key should be set to be readable only by you current user. You could also directly copy the key and use it as a string here, but it is a bad practice in case you forget about it and share the Jupyter notebook with someone later (e.g. through git).

In [3]:
import dspy

# OpenAI servers
# lm = dspy.LM("gpt-4o-mini", api_key=userdata.get("OPENAI_API_KEY"))

# CESNET servers (https://docs.cerit.io/en/docs/web-apps/chat-ai)
lm = dspy.LM(
    "openai/gpt-oss-120b",
    base_url="https://llm.ai.e-infra.cz/v1/",
    # api_key=userdata.get("CHAT_AI_API_KEY")  # load the API key from Colab's user data
    api_key=open("/home/mhn/.ssh/chat_ai_api.key").read().strip(),  # load the API key from a local file
    max_length=16000, # we increase the token limit to 16k, because gpt120-oss takes up a lot of tokens when thinking
)

dspy.configure(lm=lm)

In [4]:
lm("Hello world!")

[{'text': 'Hello! üëã How can I assist you today?',
  'reasoning_content': 'The user just says "Hello world!" Probably a greeting. We should respond friendly. Maybe ask how we can help.'}]

DSPy uses cache by default. If you run the exact same request multiple times, you will get the same answer.

In [5]:
print(lm("Why is the sky blue!", temperature=0.7)[0])

{'text': '**Short answer:**  \nThe sky looks blue because the Earth‚Äôs atmosphere scatters short‚Äëwavelength (blue‚Äëviolet) light from the Sun much more efficiently than it scatters longer‚Äëwavelength (red, orange, yellow) light. Our eyes perceive that scattered blue light coming from every direction overhead.\n\n---\n\n## The physics behind the color\n\n| Step | What happens | Why it matters |\n|------|--------------|----------------|\n| **1. Sunlight is white** | Sunlight contains a continuous spectrum of colors (red\u202f‚Üí\u202fviolet). | ‚ÄúWhite‚Äù light is just a mix of all visible wavelengths. |\n| **2. Light enters the atmosphere** | Photons encounter molecules (N‚ÇÇ, O‚ÇÇ) and tiny particles. | These scatter the light in all directions. |\n| **3. Rayleigh scattering** | For particles much smaller than the light‚Äôs wavelength, the scattering intensity follows an inverse‚Äëfourth‚Äëpower law:\u202f\\(I \\propto 1/Œª^{4}\\). | Shorter wavelengths (blue ~450\u202fnm, violet

In [6]:
lm("Why is the sky blue!", temperature=0.1)

[{'text': '**Short answer:**  \nThe sky looks blue because the Earth‚Äôs atmosphere scatters short‚Äëwavelength (blue‚Äëviolet) sunlight much more efficiently than it scatters longer‚Äëwavelength (red, orange, yellow) light. Our eyes end up receiving a lot of that scattered blue light from every direction.\n\n---\n\n### The physics in a nutshell  \n\n1. **Sunlight is a mixture of colors**  \n   Sunlight contains all visible wavelengths (roughly 380\u202fnm\u202f‚Äì\u202f750\u202fnm). If you pass it through a prism you see a rainbow, which shows that the Sun emits roughly equal amounts of red, orange, yellow, green, blue, indigo, and violet light.\n\n2. **What the atmosphere does: Rayleigh scattering**  \n   - The air is filled with molecules (N‚ÇÇ, O‚ÇÇ, etc.) that are *much* smaller than the wavelength of visible light.  \n   - When light hits these tiny particles, it is scattered in all directions.  \n   - The scattering intensity follows **Rayleigh‚Äôs law**, which says the amount o

If you want to sample multiple responses with the same sampling parameters, you can use `rollout_id`. You might get different reponses, but they will still get cached for a given `rollout_id` Note that this only makes sense with sampling (e.g. temperature > 0).

In [7]:
lm("Why is the sky blue!", temperature=0.7, rollout_id=1)

[{'text': '**Short answer:**  \nThe sky looks blue because the Earth‚Äôs atmosphere scatters short‚Äëwavelength (blue‚Äëviolet) light from the Sun much more efficiently than it scatters longer‚Äëwavelength (red, orange, yellow) light. Our eyes perceive that scattered blue light coming from every direction overhead.\n\n---\n\n## The physics behind the color\n\n| Step | What happens | Why it matters |\n|------|--------------|----------------|\n| **1. Sunlight is white** | Sunlight contains a continuous spectrum of colors (red\u202f‚Üí\u202fviolet). | ‚ÄúWhite‚Äù light is just a mix of all visible wavelengths. |\n| **2. Light enters the atmosphere** | Photons encounter molecules (N‚ÇÇ, O‚ÇÇ) and tiny particles. | These scatter the light in all directions. |\n| **3. Rayleigh scattering** | For particles much smaller than the light‚Äôs wavelength, the scattering intensity follows an inverse‚Äëfourth‚Äëpower law:\u202f\\(I \\propto 1/Œª^{4}\\). | Shorter wavelengths (blue ~450\u202fnm, viole

In [8]:
lm("Why is the sky blue!", temperature=0.7, rollout_id=2)

[{'text': '**Short answer:**  \nThe sky looks blue because the Earth‚Äôs atmosphere scatters short‚Äëwavelength (blue‚Äëviolet) light from the Sun much more efficiently than it scatters longer‚Äëwavelength (red, orange, yellow) light. Our eyes perceive that scattered blue light coming from every direction overhead.\n\n---\n\n## The physics behind the color\n\n| Step | What happens | Why it matters |\n|------|--------------|----------------|\n| **1. Sunlight is white** | Sunlight contains a continuous spectrum of colors (red\u202f‚Üí\u202fviolet). | ‚ÄúWhite‚Äù light is just a mix of all visible wavelengths. |\n| **2. Light enters the atmosphere** | Photons encounter molecules (N‚ÇÇ, O‚ÇÇ) and tiny particles. | These scatter the light in all directions. |\n| **3. Rayleigh scattering** | For particles much smaller than the light‚Äôs wavelength, the scattering intensity follows an inverse‚Äëfourth‚Äëpower law:\u202f\\(I \\propto 1/Œª^{4}\\). | Shorter wavelengths (blue ~450\u202fnm, viole

## DSPy Signatures

Signatures specify the inputs and outputs and their types.

In [9]:
from typing import Literal
# string based definition
Translate1 = "src_lang, tgt_lang, src -> tgt"

# class based definition (can contain initial prompt, description of the fields, etc.)
class Translate2(dspy.Signature):
    """Translate the src text from src_lang to tgt_lang."""
    src_lang: str = dspy.InputField(desc="Source language")
    tgt_lang: str = dspy.InputField(desc="Target language")
    src: str = dspy.InputField()

    tgt: str = dspy.OutputField()
    terminology_list: list[tuple[str, str, Literal["standard translation", "transliteration", "other"]]] = dspy.OutputField(desc="list of triplets of (src term, tgt term, reasoning)")

translate2 = dspy.Predict(Translate2)
pred = translate2(src_lang="English", tgt_lang="French", src="""Prague (/Ààpr…ëÀê…°/ PRAHG; Czech: Praha [Ààpra…¶a] ‚ìò)[a] is the capital and largest city of the Czech Republic[9] and the historical capital of Bohemia. Prague, located on the Vltava River, has a population of about 1.4 million, while its metropolitan area is home to approximately 2.3 million people.""")

In [10]:
print(pred)

Prediction(
    tgt='Prague (/Ààpr…ëÀê…°/ PRAHG\u202f; tch√®que\u202f: Praha [Ààpra…¶a] ‚ìò)[a] est la capitale et la plus grande ville de la R√©publique tch√®que et l‚Äôancienne capitale historique de la Boh√™me. Prague, situ√©e sur le fleuve Vltava, compte une population d‚Äôenviron 1,4\u202fmillion d‚Äôhabitants, tandis que son agglom√©ration abrite environ\u202f2,3\u202fmillions de personnes.',
    terminology_list=[('Prague', 'Prague', 'standard translation'), ('Czech Republic', 'R√©publique tch√®que', 'standard translation'), ('Bohemia', 'Boh√™me', 'standard translation'), ('Vltava River', 'rivi√®re Vltava', 'standard translation')]
)


In [11]:
dspy.inspect_history()





[34m[2026-02-27T02:48:30.892970][0m

[31mSystem message:[0m

Your input fields are:
1. `src_lang` (str): Source language
2. `tgt_lang` (str): Target language
3. `src` (str):
Your output fields are:
1. `tgt` (str): 
2. `terminology_list` (list[tuple[str, str, Literal['standard translation', 'transliteration', 'other']]]): list of triplets of (src term, tgt term, reasoning)
All interactions will be structured in the following way, with the appropriate values filled in.

Inputs will have the following structure:

[[ ## src_lang ## ]]
{src_lang}

[[ ## tgt_lang ## ]]
{tgt_lang}

[[ ## src ## ]]
{src}

Outputs will be a JSON object with the following fields.

{
  "tgt": "{tgt}",
  "terminology_list": "{terminology_list}        # note: the value you produce must adhere to the JSON schema: {\"type\": \"array\", \"items\": {\"type\": \"array\", \"maxItems\": 3, \"minItems\": 3, \"prefixItems\": [{\"type\": \"string\"}, {\"type\": \"string\"}, {\"type\": \"string\", \"enum\": [\"standar

In [12]:
# To use signatures, we can wrap them in built-in modules, such as dspy.Predict and dspy.ChainOfThought
translate1 = dspy.Predict(Translate1)
translate1_cot = dspy.ChainOfThought(Translate1)
translate2 = dspy.Predict(Translate2)

In [13]:
translate1(src_lang="English", tgt_lang="Czech", src="Prague is the capital of the Czech Republic.")

Prediction(
    tgt='Praha je hlavn√≠ mƒõsto ƒåesk√© republiky.'
)

In [14]:
translate1_cot(src_lang="English", tgt_lang="Czech", src="Prague is the capital of the Czech Republic.")

Prediction(
    reasoning='The sentence states that Prague is the capital of the Czech Republic. In Czech, "Prague" is "Praha", "is" translates to the verb "je", "the capital" becomes "hlavn√≠m mƒõstem", and "of the Czech Republic" is "ƒåesk√© republiky". The proper case and word order for this statement in Czech is: "Praha je hlavn√≠m mƒõstem ƒåesk√© republiky."',
    tgt='Praha je hlavn√≠m mƒõstem ƒåesk√© republiky.'
)

Let's explore what does the actual prompt sent to the LLM look like.

In [15]:
import json
adapter = dspy.ChatAdapter()
print(json.dumps(adapter.format(translate2.signature, demos=[], inputs=dict(src_lang="English", tgt_lang="Czech", src="Prague is the capital of the Czech Republic.")), indent=2))

[
  {
    "role": "system",
    "content": "Your input fields are:\n1. `src_lang` (str): Source language\n2. `tgt_lang` (str): Target language\n3. `src` (str):\nYour output fields are:\n1. `tgt` (str): \n2. `terminology_list` (list[tuple[str, str, Literal['standard translation', 'transliteration', 'other']]]): list of triplets of (src term, tgt term, reasoning)\nAll interactions will be structured in the following way, with the appropriate values filled in.\n\n[[ ## src_lang ## ]]\n{src_lang}\n\n[[ ## tgt_lang ## ]]\n{tgt_lang}\n\n[[ ## src ## ]]\n{src}\n\n[[ ## tgt ## ]]\n{tgt}\n\n[[ ## terminology_list ## ]]\n{terminology_list}        # note: the value you produce must adhere to the JSON schema: {\"type\": \"array\", \"items\": {\"type\": \"array\", \"maxItems\": 3, \"minItems\": 3, \"prefixItems\": [{\"type\": \"string\"}, {\"type\": \"string\"}, {\"type\": \"string\", \"enum\": [\"standard translation\", \"transliteration\", \"other\"]}]}}\n\n[[ ## completed ## ]]\nIn adhering to t

In [16]:
for message in adapter.format(translate1.signature, demos=[], inputs=dict(src_lang="English", tgt_lang="Czech", src="Prague is the capital of the Czech Republic.")):
  print(f"**{message['role']}**")
  print(message['content'])
  print("\n")

**system**
Your input fields are:
1. `src_lang` (str): 
2. `tgt_lang` (str): 
3. `src` (str):
Your output fields are:
1. `tgt` (str):
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## src_lang ## ]]
{src_lang}

[[ ## tgt_lang ## ]]
{tgt_lang}

[[ ## src ## ]]
{src}

[[ ## tgt ## ]]
{tgt}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Given the fields `src_lang`, `tgt_lang`, `src`, produce the fields `tgt`.


**user**
[[ ## src_lang ## ]]
English

[[ ## tgt_lang ## ]]
Czech

[[ ## src ## ]]
Prague is the capital of the Czech Republic.

Respond with the corresponding output fields, starting with the field `[[ ## tgt ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.




Let's also look at the variant with the initial prompt

In [17]:
for message in adapter.format(translate2.signature, demos=[], inputs=dict(src_lang="English", tgt_lang="Czech", src="Prague is the capital of the Czech Republic.")):
  print(f"**{message['role']}**")
  print(message['content'])
  print("\n")

**system**
Your input fields are:
1. `src_lang` (str): Source language
2. `tgt_lang` (str): Target language
3. `src` (str):
Your output fields are:
1. `tgt` (str): 
2. `terminology_list` (list[tuple[str, str, Literal['standard translation', 'transliteration', 'other']]]): list of triplets of (src term, tgt term, reasoning)
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## src_lang ## ]]
{src_lang}

[[ ## tgt_lang ## ]]
{tgt_lang}

[[ ## src ## ]]
{src}

[[ ## tgt ## ]]
{tgt}

[[ ## terminology_list ## ]]
{terminology_list}        # note: the value you produce must adhere to the JSON schema: {"type": "array", "items": {"type": "array", "maxItems": 3, "minItems": 3, "prefixItems": [{"type": "string"}, {"type": "string"}, {"type": "string", "enum": ["standard translation", "transliteration", "other"]}]}}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Translate the src text from src_lang to tgt_lang.



## Metrics

Before we get around to using optimizers, we have to specify some metrics we can optimize. Let's try using our LM to also score the translations.

In [18]:
class EvaluateTranslation(dspy.Signature):
    """Assign an integer score (0-10) to the translation. 0 - completely wrong, 10 - perfect."""
    src_lang: str = dspy.InputField()
    tgt_lang: str = dspy.InputField()
    src: str = dspy.InputField()
    tgt: str = dspy.InputField()

    score: int = dspy.OutputField()
evaluate_translation = dspy.Predict(EvaluateTranslation)

def metric(gold, pred, trace=None):  # -> bool, int, float, higher is better
    src_lang = gold.src_lang
    tgt_lang = gold.tgt_lang
    src = gold.src
    tgt = pred.tgt

    eval_output = evaluate_translation(src_lang=src_lang, tgt_lang=tgt_lang, src=src, tgt=tgt)
    score = eval_output.score

    return score/10

evaluate_translation(src_lang="English", tgt_lang="Czech", src="Prague is the capital of the Czech Republic.", tgt="Praha je hlavn√≠m mƒõstem ƒåesk√© republiky.")

Prediction(
    score=10
)

In [19]:
evaluate_translation(src_lang="English", tgt_lang="Czech", src="Prague is the capital of the Czech Republic.", tgt="Praha je hlavn√≠m ƒåesk√© republiky.")

Prediction(
    score=2
)

In [20]:
evaluate_translation(src_lang="English", tgt_lang="Czech", src="Prague is the capital of the Czech Republic.", tgt="Praha je hlavn√≠m mƒõstem ƒåesk√© Republiky.")

Prediction(
    score=9
)

## Data preparation

In [21]:
evaluate_translation(src_lang="English", tgt_lang="Czech", src="Prague is the capital of the Czech Republic.", tgt="Praha je mƒõstem ƒåesk√© republiky.")

Prediction(
    score=3
)

In [22]:
import pandas as pd
# df = pd.read_json("/content/drive/MyDrive/DGT/wmt24_esa.jsonl.zst", lines=True)  # load from google drive
df = pd.read_json("../data/wmt24_esa.jsonl.zst", lines=True)  # load from local file
import random
random.seed(42)
df=df[df["langs"]=="en-cs"]
df=df[df["system"] == "refA"]
data = df.to_dict("records")
random.shuffle(data)

trainset = data[:300]
testset = data[300:]

print(f"{len(trainset)=}, {len(testset)=}")
print(trainset[0])

len(trainset)=300, len(testset)=619
{'langs': 'en-cs', 'line_id': 787, 'src': "Hi everyone, welcome back to its Dwight Cooking Show. Today I'll be giving you a tuna macaroni salad recipe. Nice and easy. Just a simple recipe. Here is a list of my ingredients, my bell pepper, my green onions, white onions, half a teaspoon of salt for taste, black pepper, or if you have ground pepper, you can use it, my macaroni, three eggs that were going to be boiled and my tuna.", 'tgt': 'Ahoj v≈°ichni, v√≠t√°m v√°s u Dwightova po≈ôadu o va≈ôen√≠. Dnes v√°s nauƒç√≠m recept na tƒõstovinov√Ω sal√°t s tu≈à√°kem. Dobr√Ω a jednoduch√Ω. Prostƒõ snadn√Ω recept. Tady je seznam p≈ô√≠sad: paprika, jarn√≠ cibulka, b√≠l√° cibule, p≈Øl l≈æiƒçky soli pro lep≈°√≠ chu≈•, ƒçern√Ω pep≈ô, nebo pokud m√°te mlet√Ω pep≈ô, m≈Ø≈æete pou≈æ√≠t ten. D√°le tƒõstoviny, t≈ôi vejce, kter√° uva≈ô√≠me, a tu≈à√°ka.', 'doc_id': 'test-en-speech_WLS2EoW96t4_000', 'domain': 'speech', 'esa_spans': [], 'esa_score': 100, 'system': 'refA', 'an

In [23]:
def convert_example(row):
    return dspy.Example(
        src_lang="English",
        tgt_lang="Czech",
        src=row["src"],
        tgt=row["tgt"],
    ).with_inputs("src_lang", "tgt_lang", "src")

trainset = [convert_example(row) for row in trainset]
testset = [convert_example(row) for row in testset]

print(trainset[0])

Example({'src_lang': 'English', 'tgt_lang': 'Czech', 'src': "Hi everyone, welcome back to its Dwight Cooking Show. Today I'll be giving you a tuna macaroni salad recipe. Nice and easy. Just a simple recipe. Here is a list of my ingredients, my bell pepper, my green onions, white onions, half a teaspoon of salt for taste, black pepper, or if you have ground pepper, you can use it, my macaroni, three eggs that were going to be boiled and my tuna.", 'tgt': 'Ahoj v≈°ichni, v√≠t√°m v√°s u Dwightova po≈ôadu o va≈ôen√≠. Dnes v√°s nauƒç√≠m recept na tƒõstovinov√Ω sal√°t s tu≈à√°kem. Dobr√Ω a jednoduch√Ω. Prostƒõ snadn√Ω recept. Tady je seznam p≈ô√≠sad: paprika, jarn√≠ cibulka, b√≠l√° cibule, p≈Øl l≈æiƒçky soli pro lep≈°√≠ chu≈•, ƒçern√Ω pep≈ô, nebo pokud m√°te mlet√Ω pep≈ô, m≈Ø≈æete pou≈æ√≠t ten. D√°le tƒõstoviny, t≈ôi vejce, kter√° uva≈ô√≠me, a tu≈à√°ka.'}) (input_keys={'src', 'tgt_lang', 'src_lang'})


## Optimization

### MIPROv2

In [24]:
optimizer_mipro = dspy.MIPROv2(
    metric=metric,
    auto="light",
    num_threads=16,
)

optimized_program_mipro = optimizer_mipro.compile(
    translate1,
    trainset=trainset,
)

# Save the optimized program for future use
optimized_program_mipro.save("../optimized_b2_mipro.json")

2026/02/27 02:48:33 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING LIGHT AUTO RUN SETTINGS:
num_trials: 10
minibatch: True
num_fewshot_candidates: 6
num_instruct_candidates: 3
valset size: 100

2026/02/27 02:48:33 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2026/02/27 02:48:33 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2026/02/27 02:48:33 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=6 sets of demonstrations...


Bootstrapping set 1/6
Bootstrapping set 2/6
Bootstrapping set 3/6


  0%|          | 0/60 [00:00<?, ?it/s]

  7%|‚ñã         | 4/60 [00:00<00:02, 23.17it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 4/6


  3%|‚ñé         | 2/60 [00:00<00:01, 49.63it/s]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 5/6


  7%|‚ñã         | 4/60 [00:00<00:01, 52.34it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 6/6


  7%|‚ñã         | 4/60 [00:00<00:01, 55.74it/s]
2026/02/27 02:48:33 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2026/02/27 02:48:33 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


2026/02/27 02:48:33 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=3 instructions...

2026/02/27 02:48:33 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2026/02/27 02:48:33 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Given the fields `src_lang`, `tgt_lang`, `src`, produce the fields `tgt`.

2026/02/27 02:48:33 INFO dspy.teleprompt.mipro_optimizer_v2: 1: You are an expert bilingual translator specializing in English‚ÄëCzech translation.  
Your task is to translate the given source sentence (`Src`) from the source language (`Src Lang`) into the target language (`Tgt Lang`) while preserving the original meaning, tone, style, and any domain‚Äëspecific terminology.  

**Guidelines**
1. **Preserve immutable items** ‚Äì Keep URLs, email addresses, Twitter handles, brand names, and other tokens that should not be translated exactly as they appear.
2. **Czech typographic conventions** ‚Äì Use Czech quotation marks (‚Äû‚Ä¶‚Äú), proper case agreement, corr

Average Metric: 84.90 / 100 (84.9%): 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100/100 [00:08<00:00, 11.21it/s]

2026/02/27 02:48:42 INFO dspy.evaluate.evaluate: Average Metric: 84.9 / 100 (84.9%)
2026/02/27 02:48:42 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 84.9

  sampler = optuna.samplers.TPESampler(seed=seed, multivariate=True)
2026/02/27 02:48:42 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 13 - Minibatch ==



Average Metric: 29.20 / 35 (83.4%): 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 35/35 [00:44<00:00,  1.27s/it]

2026/02/27 02:49:26 INFO dspy.evaluate.evaluate: Average Metric: 29.2 / 35 (83.4%)
2026/02/27 02:49:26 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 83.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 3'].
2026/02/27 02:49:26 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [83.43]
2026/02/27 02:49:26 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [84.9]
2026/02/27 02:49:26 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 84.9


2026/02/27 02:49:26 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 13 - Minibatch ==



Average Metric: 27.90 / 35 (79.7%): 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 35/35 [00:43<00:00,  1.24s/it]

2026/02/27 02:50:10 INFO dspy.evaluate.evaluate: Average Metric: 27.900000000000002 / 35 (79.7%)
2026/02/27 02:50:10 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 79.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 0'].
2026/02/27 02:50:10 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [83.43, 79.71]
2026/02/27 02:50:10 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [84.9]
2026/02/27 02:50:10 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 84.9


2026/02/27 02:50:10 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 13 - Minibatch ==



Average Metric: 26.80 / 34 (78.8%):  97%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñã| 34/35 [00:38<00:01,  1.39s/it]

2026/02/27 02:50:48 ERROR dspy.utils.parallelizer: Error for Example({'src_lang': 'English', 'tgt_lang': 'Czech', 'src': 'It is December 1997, and the Imperial Sugar Company is acquiring a new production site at Port Wentworth, from Savannah Foods and Industries Incorporated. There is nothing really of note here, it was doing what businesses do, and that is acquiring to expand. The site has been home to food production and processing since the early 1900s. Savannah Industries Incorporated, began construction of granulated sugar production facilities at Port Wentworth during the 1910s, completing it in 1917.', 'tgt': 'P√≠≈°e se prosinec 1997 a spoleƒçnosti Imperial Sugar Company kupuje od Savannah Foods and Industries Incorporated nov√Ω v√Ωrobn√≠ z√°vod v Port Wentworth. Na tom nen√≠ opravdu nic v√Ωznamn√©ho. Dƒõlali jen to, co bƒõ≈ænƒõ spoleƒçnosti dƒõlaj√≠ ‚Äì z√≠sk√°vali majetek, aby mohli expandovat. V tomto z√°vodƒõ se j√≠dlo vyr√°bƒõlo a zpracov√°valo u≈æ od poƒç√°tku 20. stolet√≠

Average Metric: 26.80 / 34 (78.8%): 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 35/35 [00:38<00:00,  1.10s/it]

2026/02/27 02:50:48 INFO dspy.evaluate.evaluate: Average Metric: 26.8 / 35 (76.6%)
2026/02/27 02:50:48 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 76.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 5'].
2026/02/27 02:50:48 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [83.43, 79.71, 76.57]
2026/02/27 02:50:48 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [84.9]
2026/02/27 02:50:48 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 84.9


2026/02/27 02:50:48 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 13 - Minibatch ==



Average Metric: 28.90 / 35 (82.6%): 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 35/35 [00:36<00:00,  1.06s/it]

2026/02/27 02:51:25 INFO dspy.evaluate.evaluate: Average Metric: 28.9 / 35 (82.6%)
2026/02/27 02:51:25 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 82.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 2'].
2026/02/27 02:51:25 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [83.43, 79.71, 76.57, 82.57]
2026/02/27 02:51:25 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [84.9]
2026/02/27 02:51:25 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 84.9


2026/02/27 02:51:25 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 13 - Minibatch ==



Average Metric: 30.70 / 35 (87.7%): 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 35/35 [00:27<00:00,  1.26it/s]

2026/02/27 02:51:53 INFO dspy.evaluate.evaluate: Average Metric: 30.7 / 35 (87.7%)
2026/02/27 02:51:53 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 87.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 5'].
2026/02/27 02:51:53 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [83.43, 79.71, 76.57, 82.57, 87.71]
2026/02/27 02:51:53 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [84.9]
2026/02/27 02:51:53 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 84.9


2026/02/27 02:51:53 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 13 - Full Evaluation =====
2026/02/27 02:51:53 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 87.71) from minibatch trials...



Average Metric: 88.30 / 100 (88.3%): 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100/100 [00:46<00:00,  2.13it/s]

2026/02/27 02:52:40 INFO dspy.evaluate.evaluate: Average Metric: 88.3 / 100 (88.3%)
2026/02/27 02:52:40 INFO dspy.teleprompt.mipro_optimizer_v2: [92mNew best full eval score![0m Score: 88.3
2026/02/27 02:52:40 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [84.9, 88.3]
2026/02/27 02:52:40 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 88.3
2026/02/27 02:52:40 INFO dspy.teleprompt.mipro_optimizer_v2: 

2026/02/27 02:52:40 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 13 - Minibatch ==



Average Metric: 27.20 / 35 (77.7%): 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 35/35 [00:26<00:00,  1.31it/s]

2026/02/27 02:53:07 INFO dspy.evaluate.evaluate: Average Metric: 27.2 / 35 (77.7%)
2026/02/27 02:53:07 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 77.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 0'].
2026/02/27 02:53:07 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [83.43, 79.71, 76.57, 82.57, 87.71, 77.71]
2026/02/27 02:53:07 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [84.9, 88.3]
2026/02/27 02:53:07 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 88.3


2026/02/27 02:53:07 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 9 / 13 - Minibatch ==



Average Metric: 29.30 / 35 (83.7%): 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 35/35 [00:19<00:00,  1.78it/s]

2026/02/27 02:53:27 INFO dspy.evaluate.evaluate: Average Metric: 29.3 / 35 (83.7%)
2026/02/27 02:53:27 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 83.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5'].
2026/02/27 02:53:27 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [83.43, 79.71, 76.57, 82.57, 87.71, 77.71, 83.71]
2026/02/27 02:53:27 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [84.9, 88.3]
2026/02/27 02:53:27 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 88.3


2026/02/27 02:53:27 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 10 / 13 - Minibatch ==



Average Metric: 29.10 / 35 (83.1%): 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 35/35 [00:39<00:00,  1.13s/it]

2026/02/27 02:54:06 INFO dspy.evaluate.evaluate: Average Metric: 29.1 / 35 (83.1%)
2026/02/27 02:54:06 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 83.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 4'].
2026/02/27 02:54:06 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [83.43, 79.71, 76.57, 82.57, 87.71, 77.71, 83.71, 83.14]
2026/02/27 02:54:06 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [84.9, 88.3]
2026/02/27 02:54:06 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 88.3


2026/02/27 02:54:06 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 11 / 13 - Minibatch ==



Average Metric: 30.70 / 35 (87.7%): 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 35/35 [00:23<00:00,  1.47it/s]

2026/02/27 02:54:30 INFO dspy.evaluate.evaluate: Average Metric: 30.7 / 35 (87.7%)
2026/02/27 02:54:30 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 87.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 4'].
2026/02/27 02:54:30 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [83.43, 79.71, 76.57, 82.57, 87.71, 77.71, 83.71, 83.14, 87.71]
2026/02/27 02:54:30 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [84.9, 88.3]
2026/02/27 02:54:30 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 88.3


2026/02/27 02:54:30 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 12 / 13 - Minibatch ==



Average Metric: 30.90 / 35 (88.3%): 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 35/35 [00:00<00:00, 945.87it/s]

2026/02/27 02:54:30 INFO dspy.evaluate.evaluate: Average Metric: 30.9 / 35 (88.3%)
2026/02/27 02:54:30 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 5'].
2026/02/27 02:54:30 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [83.43, 79.71, 76.57, 82.57, 87.71, 77.71, 83.71, 83.14, 87.71, 88.29]
2026/02/27 02:54:30 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [84.9, 88.3]
2026/02/27 02:54:30 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 88.3


2026/02/27 02:54:30 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 13 / 13 - Full Evaluation =====
2026/02/27 02:54:30 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 87.71) from minibatch trials...



Average Metric: 86.90 / 100 (86.9%): 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100/100 [00:48<00:00,  2.07it/s]

2026/02/27 02:55:18 INFO dspy.evaluate.evaluate: Average Metric: 86.9 / 100 (86.9%)
2026/02/27 02:55:18 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [84.9, 88.3, 86.9]
2026/02/27 02:55:18 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 88.3
2026/02/27 02:55:18 INFO dspy.teleprompt.mipro_optimizer_v2: 

2026/02/27 02:55:18 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 88.3!





In [25]:
optimized_program_mipro(src_lang="English", tgt_lang="Czech", src="Prague is the capital of the Czech Republic.")

Prediction(
    tgt='Praha je hlavn√≠m mƒõstem ƒåesk√© republiky.'
)

In [26]:
dspy.inspect_history()





[34m[2026-02-27T02:55:19.827694][0m

[31mSystem message:[0m

Your input fields are:
1. `src_lang` (str): 
2. `tgt_lang` (str): 
3. `src` (str):
Your output fields are:
1. `tgt` (str):
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## src_lang ## ]]
{src_lang}

[[ ## tgt_lang ## ]]
{tgt_lang}

[[ ## src ## ]]
{src}

[[ ## tgt ## ]]
{tgt}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Given the fields `src_lang`, `tgt_lang`, `src`, produce the fields `tgt`.


[31mUser message:[0m

[[ ## src_lang ## ]]
English

[[ ## tgt_lang ## ]]
Czech

[[ ## src ## ]]
Damn this rusty K1100! Is there a single screw which has not seized?!


[31mAssistant message:[0m

[[ ## tgt ## ]]
Sakra, ten rezav√Ω K1100! Existuje v≈Øbec jedin√Ω ≈°roub, kter√Ω se nep≈ôipletl?!

[[ ## completed ## ]]


[31mUser message:[0m

[[ ## src_lang ## ]]
English

[[ ## tgt_lang ## ]]
Czech

[[ ## src ## ]]
The thick cloud cover

### SIMBA

In [None]:
optimizer_simba = dspy.SIMBA(
    metric=metric,
    num_threads=16,
)

optimized_program_simba = optimizer_simba.compile(
    translate1,
    trainset=trainset,
)

# Save optimize program for future use
optimized_program_simba.save("../optimized_b2_simba.json")

2026/02/27 02:55:19 INFO dspy.teleprompt.simba: Starting batch 1 of 8.
2026/02/27 02:55:20 INFO dspy.teleprompt.simba: Sampling program trajectories on 32 examples x 6 samples.


Processed 43 / 192 examples:  22%|‚ñà‚ñà‚ñè       | 42/192 [00:19<00:20,  7.27it/s]


LM Response: {"final":"{ \"score\": 3 }"} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 45 / 192 examples:  23%|‚ñà‚ñà‚ñé       | 44/192 [00:21<00:31,  4.68it/s]


LM Response: {"final":{"score":4}} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 68 / 192 examples:  35%|‚ñà‚ñà‚ñà‚ñç      | 67/192 [00:24<00:16,  7.81it/s]


LM Response: {"final":{"score":4}} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 99 / 192 examples:  51%|‚ñà‚ñà‚ñà‚ñà‚ñà     | 98/192 [00:27<00:05, 17.75it/s]


LM Response: {"final":{"score":4}} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 131 / 192 examples:  68%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñä   | 130/192 [00:28<00:02, 26.65it/s]


LM Response: {"final":{"score":4}} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 163 / 192 examples:  84%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç | 162/192 [00:29<00:00, 39.43it/s]


LM Response: {"final":{"score":4}} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 182 / 192 examples:  95%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç| 182/192 [00:33<00:02,  4.17it/s]


LM Response: {"final": "{\"score\": 8}"} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 183 / 192 examples:  95%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç| 182/192 [00:33<00:02,  4.17it/s]


LM Response: {"final": {"score": 9}} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 185 / 192 examples:  96%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñå| 184/192 [00:34<00:01,  4.17it/s]


LM Response: {"final": {"score": 9}} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 191 / 192 examples:  99%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñâ| 191/192 [00:36<00:00,  3.37it/s]


LM Response: {"final":{"score":8}} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 192 / 192 examples: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 192/192 [00:37<00:00,  5.09it/s]

2026/02/27 02:55:58 INFO dspy.teleprompt.simba: Batch 1: Baseline mini-batch score: 0.7911458333333333

2026/02/27 02:55:58 INFO dspy.teleprompt.simba: Batch 1: Processing bucket #1, with max score 0.9, max-to-min gap 0.9, and max-to-avg gap 0.6000000000000001.
2026/02/27 02:55:58 INFO dspy.teleprompt.simba: Batch 1: Invoking strategy: append_a_demo_
2026/02/27 02:55:58 INFO dspy.teleprompt.simba_utils: Added 1 demos (one each) across all predictors.
2026/02/27 02:55:58 INFO dspy.teleprompt.simba: 

2026/02/27 02:55:58 INFO dspy.teleprompt.simba: Batch 1: Processing bucket #2, with max score 0.9, max-to-min gap 0.20000000000000007, and max-to-avg gap 0.1166666666666667.
2026/02/27 02:55:58 INFO dspy.teleprompt.simba: Batch 1: Invoking strategy: append_a_demo_
2026/02/27 02:55:58 INFO dspy.teleprompt.simba_utils: Added 1 demos (one each) across all predictors.
2026/02/27 02:55:58 INFO dspy.teleprompt.simba: 

2026/02/27 02:55:58 INFO dspy.teleprompt.simba: Batch 1: Processing bucket #3,




2026/02/27 02:56:08 INFO dspy.teleprompt.simba_utils: Advice for self: When the input contains a proper noun or abbreviation (e.g., "CSA"), first check whether a localized form exists in the target language (e.g., "Cosa"). If a local name is known, replace the source term with that name; otherwise keep the original term unchanged. Also adapt generic phrases to the target language idiom (e.g., use "ve mƒõstƒõ ‚Ä¶" instead of a literal "v ‚Ä¶"). Ensure the translation preserves punctuation and spacing exactly as in the source, and avoid unintentionally altering case. If you have control over LM parameters, lower the temperature for deterministic output or generate multiple candidates (n>1) and select the one that best respects proper‚Äënoun handling and natural Czech phrasing.
2026/02/27 02:56:08 INFO dspy.teleprompt.simba: 

2026/02/27 02:56:08 INFO dspy.teleprompt.simba: Batch 1: Processing bucket #4, with max score 0.7, max-to-min gap 0.19999999999999996, and max-to-avg gap 0.06666666

Processed 58 / 224 examples:  25%|‚ñà‚ñà‚ñå       | 57/224 [00:44<02:23,  1.17it/s]


LM Response: {"final{"      :

  "score:   8"} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 164 / 224 examples:  73%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñé  | 164/224 [02:03<00:55,  1.08it/s]


LM Response: {"final{" 	:	"score"		} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 192 / 224 examples:  85%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñå | 191/224 [02:20<00:20,  1.64it/s]


LM Response: {"final": {"score": 9}} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 224 / 224 examples: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 224/224 [02:41<00:00,  1.38it/s]

2026/02/27 02:58:54 INFO dspy.teleprompt.simba: Scores after 1 batches: [0.8625, 0.784375, 0.828125, 0.846875, 0.815625, 0.75, 0.8125], Best: 0.8625

2026/02/27 02:58:54 INFO dspy.teleprompt.simba: Starting batch 2 of 8.





2026/02/27 02:58:55 INFO dspy.teleprompt.simba: Sampling program trajectories on 32 examples x 6 samples.


Processed 192 / 192 examples: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 192/192 [01:35<00:00,  2.02it/s]

2026/02/27 03:00:30 INFO dspy.teleprompt.simba: Batch 2: Baseline mini-batch score: 0.8635416666666668

2026/02/27 03:00:30 INFO dspy.teleprompt.simba: Batch 2: Processing bucket #1, with max score 1.0, max-to-min gap 1.0, and max-to-avg gap 0.33333333333333337.
2026/02/27 03:00:30 INFO dspy.teleprompt.simba: Batch 2: Invoking strategy: append_a_rule





2026/02/27 03:00:35 INFO dspy.teleprompt.simba_utils: Advice for self: If the `src` value matches a URL pattern (e.g., starts with `http://` or `https://` and contains typical URL characters), then set `tgt` to be exactly the same string as `src` without any modification. Do **not** attempt to fetch or summarise the page, and avoid generating messages like "Nemohu z√≠skat obsah‚Ä¶". For any other plain‚Äëtext input, proceed with normal translation. Ensure that the output preserves all punctuation, spacing, and case exactly as in the source. When in doubt about whether the input is a URL, use a simple regex check such as `^https?://`.
2026/02/27 03:00:35 INFO dspy.teleprompt.simba: 

2026/02/27 03:00:35 INFO dspy.teleprompt.simba: Batch 2: Processing bucket #2, with max score 0.9, max-to-min gap 0.6000000000000001, and max-to-avg gap 0.25.
2026/02/27 03:00:35 INFO dspy.teleprompt.simba: Batch 2: Invoking strategy: append_a_rule
2026/02/27 03:00:42 INFO dspy.teleprompt.simba_utils: Advic

Processed 93 / 224 examples:  42%|‚ñà‚ñà‚ñà‚ñà‚ñè     | 93/224 [01:12<01:29,  1.47it/s]


LM Response: {"final":{"score":9}} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 136 / 224 examples:  60%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà    | 135/224 [01:44<01:03,  1.40it/s]


LM Response: {"final": "{\"score\": 4}"} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 162 / 224 examples:  72%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñè  | 162/224 [02:05<00:44,  1.38it/s]


LM Response: {"final":"{\"score\": 8}"} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 224 / 224 examples: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 224/224 [02:45<00:00,  1.35it/s]

2026/02/27 03:03:50 INFO dspy.teleprompt.simba: Scores after 2 batches: [0.83125, 0.875, 0.8625, 0.878125, 0.8, 0.88125, 0.859375], Best: 0.88125

2026/02/27 03:03:50 INFO dspy.teleprompt.simba: Starting batch 3 of 8.





2026/02/27 03:03:52 INFO dspy.teleprompt.simba: Sampling program trajectories on 32 examples x 6 samples.


Processed 66 / 192 examples:  34%|‚ñà‚ñà‚ñà‚ñç      | 65/192 [00:49<01:26,  1.48it/s]


LM Response: {"final{"

: 

"score"

 
  


  

 
  	
	
	    

	  	
	
	
 			



	 	
 	   

  	
	
 

 
  	
	
	
	
		

 

 
 





			  

 
 

 
 

 
 

  
  









 
  



 
    

 
  	 


 
 

 
 

 
 

 
 

 
 

 
 

 
 

  
 

 
 

 
 

 
  

  






 

 
  



 

 
 

 
 

 
 

 
 

  
  





 




  

 
  









 
  
  







 

 
  






 

 
 



 










 
  






 

 
 

 
 

 
 




  

 
  

 

 
 




 
 




 

 
 

 
 
 
 








 
 

 
 

 
 





 
 




 








 

 
 

 
 






 


 
 





 
 

 
 















 







 








 
 


















 
 




 
 

 
 








 
 










 

 



 



 



 
 




 











 







 








 





 
 




 
 

 
 






 
 




 







 



 



 
 







 
 




 




 
 








 
 


 

 
 


    
		
	 
	
	
	
	
 


	
 
 




 
 








 
 











 
 





 
 





 








 
 





 
 






 








 


 
 
















 





 
 




 




 




  













  






  





Processed 93 / 192 examples:  48%|‚ñà‚ñà‚ñà‚ñà‚ñä     | 93/192 [00:58<00:25,  3.82it/s]


LM Response: {"final": {"score": 10}} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 155 / 192 examples:  81%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà  | 155/192 [01:27<00:16,  2.26it/s]


LM Response: {"final{"

: 

"score"

 
  


  

 
  	
	
	    

	  	
	
	
 			



	 	
 	   

  	
	
 

 
  	
	
	
	
		

 

 
 





			  

 
 

 
 

 
 

  
  









 
  



 
    

 
  	 


 
 

 
 

 
 

 
 

 
 

 
 

 
 

  
 

 
 

 
 

 
  

  






 

 
  



 

 
 

 
 

 
 

 
 

  
  





 




  

 
  









 
  
  







 

 
  






 

 
 



 










 
  






 

 
 

 
 

 
 




  

 
  

 

 
 




 
 




 

 
 

 
 
 
 








 
 

 
 

 
 





 
 




 








 

 
 

 
 






 


 
 





 
 

 
 















 







 








 
 


















 
 




 
 

 
 








 
 










 

 



 



 



 
 




 











 







 








 





 
 




 
 

 
 






 
 




 







 



 



 
 







 
 




 




 
 








 
 


 

 
 


    
		
	 
	
	
	
	
 


	
 
 




 
 








 
 











 
 





 
 





 








 
 





 
 






 








 


 
 
















 





 
 




 




 




  













  






  





Processed 173 / 192 examples:  90%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà | 173/192 [01:29<00:02,  8.64it/s]


LM Response: {"final": {"score": 10}} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 191 / 192 examples:  99%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñâ| 191/192 [01:40<00:00,  1.14it/s]


LM Response: {"final{"










































































  : 9} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 192 / 192 examples: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 192/192 [01:40<00:00,  1.90it/s]

2026/02/27 03:05:33 INFO dspy.teleprompt.simba: Batch 3: Baseline mini-batch score: 0.803125

2026/02/27 03:05:33 INFO dspy.teleprompt.simba: Batch 3: Processing bucket #1, with max score 1.0, max-to-min gap 1.0, and max-to-avg gap 0.33333333333333337.
2026/02/27 03:05:33 INFO dspy.teleprompt.simba: Batch 3: Invoking strategy: append_a_demo_
2026/02/27 03:05:33 INFO dspy.teleprompt.simba_utils: Added 1 demos (one each) across all predictors.
2026/02/27 03:05:33 INFO dspy.teleprompt.simba: 

2026/02/27 03:05:33 INFO dspy.teleprompt.simba: Batch 3: Processing bucket #2, with max score 0.9, max-to-min gap 0.9, and max-to-avg gap 0.35.
2026/02/27 03:05:33 INFO dspy.teleprompt.simba: Batch 3: Invoking strategy: append_a_demo_
2026/02/27 03:05:33 INFO dspy.teleprompt.simba_utils: Added 1 demos (one each) across all predictors.
2026/02/27 03:05:33 INFO dspy.teleprompt.simba: 

2026/02/27 03:05:33 INFO dspy.teleprompt.simba: Batch 3: Processing bucket #3, with max score 0.9, max-to-min gap 0.9




2026/02/27 03:05:39 INFO dspy.teleprompt.simba_utils: Advice for self: If the input contains the phrase "light rail systems", translate it as "lehk√© ≈æelezniƒçn√≠ syst√©my" (not "lehk√© kolejov√© dopravy"). If it contains "adaptive reuse", use the established Czech term "adaptivn√≠ opƒõtovn√© vyu≈æit√≠". When the source mentions "adaptations" (or similar), prefer "p≈ôizp≈Øsoben√≠" instead of literal words like "p≈ôechody". For "smartphone communication" render it as "komunikaci na smartphonech" (or "komunikaci p≈ôes smartphony"). In general, prioritize idiomatic Czech constructions: keep proper nouns unchanged, ensure subject‚Äëverb agreement, and avoid overly literal word‚Äëby‚Äëword mappings. Choose natural word order and prefer terms that are commonly used in Czech technical and everyday language.
2026/02/27 03:05:39 INFO dspy.teleprompt.simba: 

2026/02/27 03:05:39 INFO dspy.teleprompt.simba: Batch 3: Processing bucket #4, with max score 0.8, max-to-min gap 0.4, and max-to-avg gap

Processed 67 / 224 examples:  30%|‚ñà‚ñà‚ñâ       | 67/224 [00:52<03:56,  1.51s/it]


LM Response: {"final":{"score":8}} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 82 / 224 examples:  37%|‚ñà‚ñà‚ñà‚ñã      | 82/224 [01:00<01:17,  1.84it/s]


LM Response: {"final": {"score": 10}} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 192 / 224 examples:  86%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñå | 192/224 [02:21<00:22,  1.41it/s]


LM Response: {"final":{"score":8}} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 224 / 224 examples: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 224/224 [02:43<00:00,  1.37it/s]

2026/02/27 03:08:33 INFO dspy.teleprompt.simba: Scores after 3 batches: [0.834375, 0.821875, 0.75625, 0.81875, 0.825, 0.75625, 0.8], Best: 0.834375

2026/02/27 03:08:33 INFO dspy.teleprompt.simba: Starting batch 4 of 8.





2026/02/27 03:08:35 INFO dspy.teleprompt.simba: Sampling program trajectories on 32 examples x 6 samples.


Processed 192 / 192 examples: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 192/192 [01:25<00:00,  2.25it/s]

2026/02/27 03:10:00 INFO dspy.teleprompt.simba: Batch 4: Baseline mini-batch score: 0.8500000000000001

2026/02/27 03:10:00 INFO dspy.teleprompt.simba: Batch 4: Processing bucket #1, with max score 0.9, max-to-min gap 0.6000000000000001, and max-to-avg gap 0.33333333333333337.
2026/02/27 03:10:00 INFO dspy.teleprompt.simba: Batch 4: Invoking strategy: append_a_demo_
2026/02/27 03:10:00 INFO dspy.teleprompt.simba_utils: Added 1 demos (one each) across all predictors.
2026/02/27 03:10:00 INFO dspy.teleprompt.simba: 

2026/02/27 03:10:00 INFO dspy.teleprompt.simba: Batch 4: Processing bucket #2, with max score 0.9, max-to-min gap 0.5, and max-to-avg gap 0.15000000000000002.
2026/02/27 03:10:00 INFO dspy.teleprompt.simba: Batch 4: Invoking strategy: append_a_rule





2026/02/27 03:10:08 INFO dspy.teleprompt.simba_utils: Advice for self: When the input contains idiomatic English expressions, proper nouns, or technical phrases, translate them following the provided Czech style guidelines. Specifically: 
1. Preserve proper nouns unchanged (e.g., "Basil" stays "Basil").
2. Use correct Czech diacritics and word forms ‚Äì "nep≈ô√≠tel" not "nepr√≠tel", genitive "nep≈ô√≠tele" for "of my enemy".
3. Prefer idiomatic constructions: map "would be heavily polluted" ‚Üí "bude silnƒõ zneƒçi≈°tƒõna", use "sp√≠≈°e" instead of "sp√≠≈°", and phrase contrasts with "sp√≠≈°e".
4. Keep subject‚Äëverb agreement and natural word order; avoid adding unnecessary pronouns like "ti".
5. Translate descriptive clauses naturally, e.g., "spojen√Ω svou nelibost√≠" ‚Üí "spojeni sv√Ωm odporem", and "co ho zastavuje" ‚Üí a clearer expression such as "co br√°n√≠ tomu".
6. Maintain consistent Czech quotation marks and punctuation (use ‚Äû‚Ä¶‚Äú). 
If the source sentence reads "The enemy

Processed 224 / 224 examples: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 224/224 [02:21<00:00,  1.58it/s]

2026/02/27 03:12:49 INFO dspy.teleprompt.simba: Scores after 4 batches: [0.8375, 0.878125, 0.8625, 0.884375, 0.8718750000000001, 0.865625, 0.8625], Best: 0.884375

2026/02/27 03:12:49 INFO dspy.teleprompt.simba: Starting batch 5 of 8.





2026/02/27 03:12:51 INFO dspy.teleprompt.simba: Sampling program trajectories on 32 examples x 6 samples.


Processed 145 / 192 examples:  76%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñå  | 145/192 [01:09<00:14,  3.17it/s]


LM Response: {"final":"{\"score\": 9}"} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 173 / 192 examples:  90%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà | 173/192 [01:17<00:04,  4.57it/s]


LM Response: {"final{"

 		

 	    

    
     

     
	

  

 	  

  	 

 	 


 	 

 	 

 

   

 	 
  	 

 	  

 
  	 

  	
 

 
 	  

   	 

 	 

 	 

  	 

 
 	 

 
  











	  

  	



  



  



 	 

  









	



 

 
 	 

  	



 	





 	  

  	 

 
 

 	

 

 
 	 







 	



	



 	  

 
	 
 	



	 


				



     

 	



 	








 









  



	 





	





 










 	



	 




 




 







 
 




 





 





 





	



 










  







 
 


 	




 




 
 	



			


	 




 
  










    



	

 
      


 





 
		

 
  



 

 

 
  



 


 
    

   

 
         

  









 








 
  










 
 									



  



 			
 







  










 
 



 



 	



  



  




 
















   

        

















 







    











  

  

 





  

 	



       
 












  
 

 
 











 





 	






 
	




 











 
        

    

 






                 








 

















 



 

Processed 192 / 192 examples: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 192/192 [01:29<00:00,  2.14it/s]

2026/02/27 03:14:21 INFO dspy.teleprompt.simba: Batch 5: Baseline mini-batch score: 0.8041666666666667

2026/02/27 03:14:21 INFO dspy.teleprompt.simba: Batch 5: Processing bucket #1, with max score 0.9, max-to-min gap 0.9, and max-to-avg gap 0.21666666666666679.
2026/02/27 03:14:21 INFO dspy.teleprompt.simba: Batch 5: Invoking strategy: append_a_demo_
2026/02/27 03:14:21 INFO dspy.teleprompt.simba_utils: Added 1 demos (one each) across all predictors.
2026/02/27 03:14:21 INFO dspy.teleprompt.simba: 

2026/02/27 03:14:21 INFO dspy.teleprompt.simba: Batch 5: Processing bucket #2, with max score 0.9, max-to-min gap 0.9, and max-to-avg gap 0.19999999999999996.
2026/02/27 03:14:21 INFO dspy.teleprompt.simba: Batch 5: Invoking strategy: append_a_demo_
2026/02/27 03:14:21 INFO dspy.teleprompt.simba_utils: Added 1 demos (one each) across all predictors.
2026/02/27 03:14:21 INFO dspy.teleprompt.simba: 

2026/02/27 03:14:21 INFO dspy.teleprompt.simba: Batch 5: Processing bucket #3, with max scor




2026/02/27 03:14:24 INFO dspy.teleprompt.simba_utils: Advice for self: If the input contains casual filler words such as "Yea" or "so" (or similar informal English connectors), then translate them as the informal Czech filler "tak" rather than "tak≈æe". Preserve the colloquial tone (e.g., keep "Jo" for "Yea") and maintain natural Czech word order, placing "dnes" after the object pronoun when appropriate. This avoids overly formal or stilted phrasing and yields more idiomatic translations.
2026/02/27 03:14:24 INFO dspy.teleprompt.simba: 

2026/02/27 03:14:24 INFO dspy.teleprompt.simba: Batch 5: Processing bucket #6, with max score 0.7, max-to-min gap 0.49999999999999994, and max-to-avg gap 0.2666666666666666.
2026/02/27 03:14:24 INFO dspy.teleprompt.simba: Batch 5: Invoking strategy: append_a_demo_
2026/02/27 03:14:24 INFO dspy.teleprompt.simba_utils: Added 1 demos (one each) across all predictors.
2026/02/27 03:14:24 INFO dspy.teleprompt.simba: 

2026/02/27 03:14:24 INFO dspy.telepromp

Processed 122 / 224 examples:  54%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç    | 122/224 [01:27<01:35,  1.06it/s]


LM Response: {"final":"{\n  \"score\": 9\n}"} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 145 / 224 examples:  65%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç   | 145/224 [01:38<00:32,  2.46it/s]


LM Response: {"final": {"score": 9}} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 224 / 224 examples: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 224/224 [02:35<00:00,  1.45it/s]

2026/02/27 03:16:59 INFO dspy.teleprompt.simba: Scores after 5 batches: [0.8125, 0.8250000000000001, 0.81875, 0.7875, 0.83125, 0.846875, 0.8562500000000001], Best: 0.8562500000000001

2026/02/27 03:16:59 INFO dspy.teleprompt.simba: Starting batch 6 of 8.





2026/02/27 03:17:02 INFO dspy.teleprompt.simba: Sampling program trajectories on 32 examples x 6 samples.


Processed 46 / 192 examples:  24%|‚ñà‚ñà‚ñç       | 46/192 [00:24<00:38,  3.78it/s]


LM Response: {"final":"{\"score\": 9}"} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 192 / 192 examples: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 192/192 [01:18<00:00,  2.44it/s]

2026/02/27 03:18:21 INFO dspy.teleprompt.simba: Batch 6: Baseline mini-batch score: 0.86875

2026/02/27 03:18:21 INFO dspy.teleprompt.simba: Batch 6: Processing bucket #1, with max score 1.0, max-to-min gap 1.0, and max-to-avg gap 0.20000000000000007.
2026/02/27 03:18:21 INFO dspy.teleprompt.simba: Batch 6: Invoking strategy: append_a_demo_
2026/02/27 03:18:21 INFO dspy.teleprompt.simba_utils: Added 1 demos (one each) across all predictors.
2026/02/27 03:18:21 INFO dspy.teleprompt.simba: 

2026/02/27 03:18:21 INFO dspy.teleprompt.simba: Batch 6: Processing bucket #2, with max score 1.0, max-to-min gap 0.6, and max-to-avg gap 0.18333333333333324.
2026/02/27 03:18:21 INFO dspy.teleprompt.simba: Batch 6: Invoking strategy: append_a_demo_
2026/02/27 03:18:21 INFO dspy.teleprompt.simba_utils: Added 1 demos (one each) across all predictors.
2026/02/27 03:18:21 INFO dspy.teleprompt.simba: 

2026/02/27 03:18:21 INFO dspy.teleprompt.simba: Batch 6: Processing bucket #3, with max score 0.8, max-




2026/02/27 03:18:25 INFO dspy.teleprompt.simba_utils: Advice for self: If the input sentence contains informal or colloquial English expressions such as ‚Äúit must suck to be ‚Ä¶‚Äù, then translate them using Czech idioms that convey the same level of vulgarity or strong feeling (e.g., use ‚Äúna hovno‚Äù instead of a neutral phrase like ‚Äúna nic‚Äù). Preserve the overall sentence structure but avoid redundant words (e.g., do not repeat ‚Äúmus√≠‚Äù) and choose Czech adjectives that match the intensity of the English word (e.g., prefer ‚Äústra≈°n√Ω‚Äù for ‚Äúawful‚Äù when the tone is harsh). For standard phrases like ‚ÄúI can't imagine ‚Ä¶‚Äù, keep the literal translation ‚Äúnedok√°≈æu si p≈ôedstavit‚Äù. In short, prioritize idiomatic, tone‚Äëpreserving equivalents over literal, word‚Äëfor‚Äëword translations, especially for slang or profanity.
2026/02/27 03:18:25 INFO dspy.teleprompt.simba: 

2026/02/27 03:18:25 INFO dspy.teleprompt.simba: Batch 6: Processing bucket #5, with max score 

Processed 142 / 224 examples:  63%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñé   | 141/224 [01:26<00:55,  1.49it/s]


LM Response: {"final{"  		



 	



 

 	



 	



 

 	



 	



 

 	



 	



 

 	



 	



 	



  	



  





 	



 	



 	



 	



 	



 	



 	



  	



 	



 	



 	



 	



 	



 	



 	



 	



 	



 	



 	



 	



 	



 	



 	



 



  	



  



 	



  



  



  



  



  



  



 	



  





 	



 	



  





  





  





  





 	



  





  



  



  



  



  	



  





  



  



  	



 	



  












	



 	



	







	



 	



 	



 	



 	







	



 	



	



	



	



	



	



 	



 	



	



 	



	



 	



	



	



	



	



	



	



 	



 	



 	



	



	

 	



	



 	



 	



 	



 	




	



 	



	





	






 	













 	



	



	



 	



	



	



	 	
   




  











 	



  



 	



  











   




 	 




 	



 	

















 	



 	



 	



 	



 	



 
   


   





 





 	







  
 
 





 	








 	 









  



 	



 	




 













  




 	

Processed 224 / 224 examples: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 224/224 [02:23<00:00,  1.56it/s]

2026/02/27 03:21:03 INFO dspy.teleprompt.simba: Scores after 6 batches: [0.85, 0.85, 0.86875, 0.83125, 0.86875, 0.859375, 0.865625], Best: 0.86875

2026/02/27 03:21:03 INFO dspy.teleprompt.simba: Starting batch 7 of 8.





2026/02/27 03:21:06 INFO dspy.teleprompt.simba: Sampling program trajectories on 32 examples x 6 samples.


Processed 192 / 192 examples: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 192/192 [01:28<00:00,  2.16it/s]

2026/02/27 03:22:35 INFO dspy.teleprompt.simba: Batch 7: Baseline mini-batch score: 0.8682291666666667

2026/02/27 03:22:35 INFO dspy.teleprompt.simba: Batch 7: Processing bucket #1, with max score 1.0, max-to-min gap 0.7, and max-to-avg gap 0.16666666666666663.
2026/02/27 03:22:35 INFO dspy.teleprompt.simba: Batch 7: Invoking strategy: append_a_rule





2026/02/27 03:22:40 INFO dspy.teleprompt.simba_utils: Advice for self: If the module receives a `src` that contains English hashtags (e.g., "#DIY #HomeRenovation"), then it should:
1. Split the string on whitespace to isolate each hashtag.
2. For each hashtag, strip the leading "#" and separate camel‚Äëcased or concatenated words (e.g., "HomeRenovation" ‚Üí ["Home", "Renovation"]).
3. Translate each word using a Czech domain‚Äëspecific dictionary: keep common abbreviations like "DIY" unchanged, but map "Home" ‚Üí "Dom√°c√≠" and "Renovation" ‚Üí "Renovace".
4. Re‚Äëcombine the translated words into a single CamelCase token (e.g., "Dom√°c√≠Renovace") and prepend "#".
5. Preserve the original hashtag order and any unchanged hashtags (e.g., "#DIY" stays as "#DIY").
6. Ensure no extra characters are added or dropped, and that the final output respects Czech orthography and idiom usage.
Following these steps will produce translations such as "#DIY #Dom√°c√≠Renovace", matching the successful 

Processed 84 / 224 examples:  37%|‚ñà‚ñà‚ñà‚ñã      | 83/224 [00:56<02:02,  1.15it/s]


LM Response: {"final":{"score":9}} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 224 / 224 examples: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 224/224 [02:36<00:00,  1.43it/s]

2026/02/27 03:25:50 INFO dspy.teleprompt.simba: Scores after 7 batches: [0.853125, 0.84375, 0.846875, 0.865625, 0.86875, 0.86875, 0.846875], Best: 0.86875

2026/02/27 03:25:50 INFO dspy.teleprompt.simba: Starting batch 8 of 8.





2026/02/27 03:25:54 INFO dspy.teleprompt.simba: Sampling program trajectories on 32 examples x 6 samples.


Processed 31 / 192 examples:  16%|‚ñà‚ñå        | 31/192 [00:22<01:18,  2.06it/s]


LM Response: {"final": {"score": 9}} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 79 / 192 examples:  41%|‚ñà‚ñà‚ñà‚ñà      | 78/192 [00:47<01:37,  1.17it/s]


LM Response: {"final": {"score": 9}} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 142 / 192 examples:  74%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç  | 142/192 [01:09<00:18,  2.78it/s]


LM Response: {"final": {"score": 9}} 

Expected to find output fields in the LM response: [score] 

Actual output fields parsed from the LM response: [] 




Processed 187 / 192 examples:  97%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñã| 187/192 [01:27<00:03,  1.52it/s]

In [None]:
optimized_program_simba(src_lang="English", tgt_lang="Czech", src="Prague is the capital of the Czech Republic.")

Prediction(
    tgt='Praha je hlavn√≠m mƒõstem ƒåesk√© republiky.'
)

In [None]:
dspy.inspect_history()





[34m[2025-09-22T14:38:43.791122][0m

[31mSystem message:[0m

Your input fields are:
1. `src_lang` (str): 
2. `tgt_lang` (str): 
3. `src` (str):
Your output fields are:
1. `tgt` (str):
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## src_lang ## ]]
{src_lang}

[[ ## tgt_lang ## ]]
{tgt_lang}

[[ ## src ## ]]
{src}

[[ ## tgt ## ]]
{tgt}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Given the fields `src_lang`, `tgt_lang`, `src`, produce the fields `tgt`.
        
        If the input `src` contains mostly repeated characters and informal language (like 'waaaaahoooo!! lol'), then avoid attempting a translation and instead return the original `src` string as the `tgt`. Focus on preserving the original input when it appears to be non-standard or already in the target language's character set.


[31mUser message:[0m

[[ ## src_lang ## ]]
English

[[ ## tgt_lang ## ]]
Czech

[[ ## src ## ]]
A

In [None]:
for message in adapter.format(optimized_program_simba.signature, demos=[], inputs=dict(src_lang="English", tgt_lang="Czech", src="Prague is the capital of the Czech Republic.")):
  print(f"**{message['role']}**")
  print(message['content'])
  print("\n")

## Evaluation

In [None]:
# this can be used to estimate USD cost for commercial models such as
# gpt-4o-mini, it is not relevant for our local models
cost = sum([x['cost'] for x in lm.history if x['cost'] is not None])
cost

0

In [None]:
lm.history[-1]

In [None]:
output_base = translate1.batch(testset, num_threads=16)
output_mipro = optimized_program_mipro.batch(testset, num_threads=16)
output_optimized_simba = optimized_program_simba.batch(testset, num_threads=16)

Processed 619 / 619 examples: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 619/619 [01:45<00:00,  5.86it/s]
Processed 619 / 619 examples: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 619/619 [02:04<00:00,  4.97it/s]
Processed 619 / 619 examples: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 619/619 [01:46<00:00,  5.83it/s]


In [None]:
from sacrebleu import corpus_bleu
print("Base:", corpus_bleu([ex.tgt for ex in testset], [[out.tgt for out in output_base]]))
print("MIPRO:", corpus_bleu([ex.tgt for ex in testset], [[out.tgt for out in output_mipro]]))
print("SIMBA:", corpus_bleu([ex.tgt for ex in testset], [[out.tgt for out in output_optimized_simba]]))

Base: BLEU = 28.72 60.2/34.6/22.2/14.7 (BP = 1.000 ratio = 1.020 hyp_len = 24776 ref_len = 24298)
MIPRO: BLEU = 29.04 60.6/35.0/22.5/14.9 (BP = 1.000 ratio = 1.002 hyp_len = 24776 ref_len = 24734)
SIMBA: BLEU = 28.93 60.2/34.9/22.4/14.9 (BP = 1.000 ratio = 1.026 hyp_len = 24776 ref_len = 24149)


Place {{1}} in the {{2}}.