## Programming, Not Prompting, LLMs with DSPy
  
  Reference : [Notebook](https://github.com/ALucek/dspy-breakdown/blob/main/dspy_breakdown.ipynb), [YouTube](https://youtu.be/Zv4LjO8teqE?si=Sp0EJ2cvwwtZ-BT6)
  
**[DSPy (Declarative Self-improving Python)](https://dspy.ai/)** is a framework from Stanford NLP that treats language models as programmable functions rather than prompt templates. It offers a PyTorch-like interface for defining, composing, and optimizing LLM operations.

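Instead of hand-writing and maintaining complex prompts, developers **declare only the input/output signature**, and **DSPy handles prompt engineering and optimization automatically**. Through techniques such as automatic prompt tuning and self-improvement, the framework lets you improve LLM pipelines systematically.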

<img src="./Media/dspy_workflow.png" width="500">

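The DSPy workflow consists of four core steps:

* Define the program using Signatures and Modules.
* Design a measurable success metric that clearly reflects the program's performance.
* Compile the program and optimize it against that metric.
* Collect more data and iterate on the results.

This notebook walks through the approaches DSPy offers at each of these steps and applies them in practice.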

### Setup

<img src="./Media/dspy.png" width="200">

In [2]:
import dspy
import warnings
warnings.filterwarnings("ignore")

### Configure LLM

By default, DSPy caches models and their responses across the environment. Once a language model is configured, every subsequent call uses it automatically unless you specify otherwise.

In [3]:
import os
from dotenv import load_dotenv

_ = load_dotenv()

In [4]:
lm = dspy.LM('openai/gpt-4o-mini')
dspy.configure(lm=lm)

In [5]:
lm(messages=[{"role":"user", "content":"Say this is a test!"}])

['This is a test! How can I assist you further?']

---

### Signatures

A DSPy signature follows the same idea as an ordinary function signature, except that it is defined in natural language. This is the core of what DSPy replaces in traditional prompting: instead of telling the LLM what to do, you declare what the LLM will do.

The basic format is:

'input -> output'

Inputs and outputs can take whatever form you need: you can declare multiple inputs and outputs, attach type information, or define a more explicitly structured schema.

<img src="./Media/signatures.png" width="400">

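Under the hood a prompt is still sent to the language model, but it is assembled modularly from the natural-language signature rather than being a fixed string. In other words, the wording and structure of the prompt change dynamically with the signature.

Abstracting prompting away can feel counterintuitive at first, but this design makes it easy to swap models and enables the algorithmic optimization covered later.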
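To make the idea concrete, here is a minimal plain-Python sketch of how a signature string could be split into typed fields and turned into a field listing. It is not DSPy's parser: `Field`, `parse_signature`, and `render_field_listing` are hypothetical names introduced for illustration, and the naive comma split would not handle types such as `dict[str, float]`.

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    type_name: str = "str"  # assumption: untyped fields are treated as strings


def parse_fields(spec: str) -> list:
    """Split 'a, b: float' into Field objects (naive comma split)."""
    fields = []
    for part in spec.split(","):
        name, _, type_name = part.partition(":")
        fields.append(Field(name.strip(), type_name.strip() or "str"))
    return fields


def parse_signature(signature: str) -> tuple:
    """Split 'inputs -> outputs' into (input_fields, output_fields)."""
    inputs, arrow, outputs = signature.partition("->")
    if not arrow:
        raise ValueError("signature must contain '->'")
    return parse_fields(inputs), parse_fields(outputs)


def render_field_listing(signature: str) -> str:
    """Render a field listing loosely modeled on the system message DSPy prints."""
    inputs, outputs = parse_signature(signature)
    lines = ["Your input fields are:"]
    lines += [f"{i}. `{f.name}` ({f.type_name})" for i, f in enumerate(inputs, 1)]
    lines.append("Your output fields are:")
    lines += [f"{i}. `{f.name}` ({f.type_name})" for i, f in enumerate(outputs, 1)]
    return "\n".join(lines)


print(render_field_listing("question, context -> answer, confidence: float"))
```

Changing only the signature string changes the rendered listing, which is the sense in which the prompt is derived from the declaration rather than written by hand.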

### Simple Input & Output

In [6]:
qna = dspy.Predict('question -> answer')
response = qna(question="Why is the sky blue")
print("Response: ", response.answer)

Response:  The sky appears blue because of a phenomenon called Rayleigh scattering. When sunlight enters the Earth's atmosphere, it collides with molecules and small particles in the air. Sunlight is made up of many colors, each having different wavelengths. Blue light waves are shorter and are scattered in all directions more than other colors with longer wavelengths, such as red and orange. This scattering causes us to see the sky as blue during the day.


In [7]:
lm.inspect_history()





[2026-01-25T15:47:57.898021]

System message:

Your input fields are:
1. `question` (str):
Your output fields are:
1. `answer` (str):
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## question ## ]]
{question}

[[ ## answer ## ]]
{answer}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Given the fields `question`, produce the fields `answer`.


User message:

[[ ## question ## ]]
Why is the sky blue

Respond with the corresponding output fields, starting with the field `[[ ## answer ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


Response:

[[ ## answer ## ]]
The sky appears blue because of a phenomenon called Rayleigh scattering. When sunlight enters the Earth's atmosphere, it collides with molecules and small particles in the air. Sunlight is made up of many colors, each having different wavelengths. Blue light waves are shorter a

In [8]:
summ = dspy.Predict('document -> summary')

document = """
The market for our products is intensely competitive and is characterized by rapid technological change and evolving industry standards. 
We believe that the principal competitive factors in this market are performance, breadth of product offerings, access to customers and partners and distribution channels, software support, conformity to industry standard APIs, manufacturing capabilities, processor pricing, and total system costs.
We believe that our ability to remain competitive will depend on how well we are able to anticipate the features and functions that customers and partners will demand and whether we are able to deliver consistent volumes of our products at acceptable levels of quality and at competitive prices.
We expect competition to increase from both existing competitors and new market entrants with products that may be lower priced than ours or may provide better performance or additional features not provided by our products.
In addition, it is possible that new competitors or alliances among competitors could emerge and acquire significant market share.
A significant source of competition comes from companies that provide or intend to provide GPUs, CPUs, DPUs, embedded SoCs, and other accelerated, AI computing processor products, and providers of semiconductor-based high-performance interconnect products based on InfiniBand, Ethernet, Fibre Channel, and proprietary technologies.
Some of our competitors may have greater marketing, financial, distribution and manufacturing resources than we do and may be more able to adapt to customers or technological changes.
We expect an increasingly competitive environment in the future.
"""

response = summ(document=document)
print("Summary: ", response.summary)

Summary:  The market for our products is highly competitive and influenced by rapid technological advancements and changing industry standards. Key competitive factors include product performance, range, customer access, and distribution channels. Our competitiveness hinges on our ability to predict customer demands, deliver quality products consistently, and maintain competitive pricing. We anticipate increased competition from both current players and new entrants offering lower prices or superior features, as well as potential alliances among competitors. Significant competition arises from companies focused on GPUs, CPUs, DPUs, and other advanced computing products. Some rivals may possess greater resources, making it challenging to keep pace with market changes. The competitive landscape is expected to become more intense in the future.


### Multiple Inputs and Outputs

<img src='./Media/multiple_signature.png' width="400">

In [9]:
multi = dspy.Predict('question, context -> answer, citation')

question = "What's my name?"
context = "The user you're talking to is Adam Lucek, AI youtuber extraordinaire"

response = multi(question=question, context=context)

print("Answer: ", response.answer)
print("\nCitation: ", response.citation)

Answer:  Your name is Adam Lucek.

Citation:  The user identified as Adam Lucek in the context provided.


### Type Hints with Outputs

<img src="./Media/input_type.png" width="400">

In [10]:
emotion = dspy.Predict('input -> sentiment: str, confidence: float, reasoning: str')

text = "I don't quite know, I didn't really like it"

response = emotion(input=text)

print("Sentiment Classification: ", response.sentiment)
print("\nConfidence: ", response.confidence)
print("\nReasoning: ", response.reasoning)

Sentiment Classification:  negative

Confidence:  0.85

Reasoning:  The phrase "I didn't really like it" clearly expresses dissatisfaction or discontent, indicating a negative sentiment. The use of "didn't really like" suggests a strong sense of disapproval, thus reinforcing the negative sentiment. The confidence level is high due to the clarity of the expression.


### Class-Based Signatures

For more advanced signatures, DSPy lets you define a Pydantic-style class (a data-structure schema) instead of a simple inline string. The class must inherit from `dspy.Signature`, with input fields declared explicitly via `dspy.InputField()` and output fields via `dspy.OutputField()`.

Each field accepts an optional `desc` argument that supplies extra context or a description for that field.

In [11]:
from typing import Literal

class TextStyleTransfer(dspy.Signature):
    """Transfer text between different writing styles while preserving content."""
    text: str = dspy.InputField()
    source_style: Literal["academic", "casual", "business", "poetic"] = dspy.InputField()
    target_style: Literal["academic", "casual", "business", "poetic"] = dspy.InputField()
    preserved_keywords: list[str] = dspy.OutputField()
    transformed_text: str = dspy.OutputField()
    style_metrics: dict[str, float] = dspy.OutputField(desc="Scores for formality, complexity, emotiveness")

text = "This coffee shop makes the best lattes ever! Their new barista really knows what he's doing with the espresso machine."

style_transfer = dspy.Predict(TextStyleTransfer)

response = style_transfer(
    text=text,
    source_style="casual",
    target_style="poetic"
)

print("Transformed Text: ", response.transformed_text)
print("\nStyle Metrics: ", response.style_metrics)
print("\nPreserved Keywords: ", response.preserved_keywords)

Transformed Text:  In the haven of fragrant brews, a sanctuary for souls to meet,  
The coffee shop whispers secrets of lattes, a delightful treat.  
A bard behind the counter, with skilled hands and a knowing heart,  
Crafts magic with the espresso machine, a true work of art.

Style Metrics:  {'formality': 0.6, 'complexity': 0.75, 'emotiveness': 0.8}

Preserved Keywords:  ['coffee shop', 'lattes', 'barista', 'espresso machine']


### Modules

<img src="./Media/modules.png">

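Modules are the layer that applies different prompting strategies to a signature. The signature examples above used the basic Predict module, but DSPy also ships several widely used strategies and their variants. The main modules currently available are:

* **ChainOfThought**: Implements chain-of-thought prompting, eliciting reasoning steps before the output is produced. The module automatically adds a cue such as "Let's think step by step" to encourage structured reasoning. Best for complex problems that need to be worked through in several steps.
  
* **ProgramOfThought**: Generates executable Python code to solve the problem, with built-in error handling and code regeneration. Use it for mathematical or algorithmic problems that are better solved by actually running code.

* **ReAct**: Implements the ReAct pattern, which alternates reasoning, acting (tool use), and observation in a structured loop. Best for tasks that need multi-step reasoning plus interaction with external tools or APIs.

And a few helper modules:

* **MultiChainComparison**: Takes several reasoning attempts (3 by default), compares the different reasoning paths, and merges them into a single, more accurate response. Suitable when accuracy matters and you can afford multiple attempts.

* **Majority**: A utility function that takes multiple completions, normalizes their text, and returns the most frequent answer. Useful when you want simple voting across several generations to increase reliability.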

### Chain of Thought

<img src="./Media/cot_module.png" width="300">

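**ChainOfThought** works by modifying the prompt signature to include an explicit reasoning step before the output. When initialized with a signature, it builds an extended signature with a reasoning field prepended, defined with the prefix "Reasoning: Let's think step by step in order to". This reasoning field makes the language model write out its thought process before giving the final answer. Note that in the inspected prompt further down, the field is listed simply as `reasoning`, without the prefix text.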

In [12]:
# Define the Signature and Module
cot_emotion = dspy.ChainOfThought('input -> sentiment: str')

# Example
text = "That was phenomenal, but I hated it!"

# Run
cot_response = cot_emotion(input=text)

# Output
print("Sentiment: ", cot_response.sentiment)
# Inherently added reasoning
print("\nReasoning: ", cot_response.reasoning)

Sentiment:  mixed

Reasoning:  The phrase "That was phenomenal" suggests a positive experience or appreciation, indicating enjoyment or admiration. However, the phrase "but I hated it!" introduces a strong negative sentiment that contradicts the previous positive remark. This juxtaposition indicates a mixed sentiment where the speaker acknowledges something remarkable but also expresses strong dislike, possibly due to personal reasons or conflicting feelings about the experience.


In [13]:
lm.inspect_history()





[2026-01-25T15:49:31.635495]

System message:

Your input fields are:
1. `input` (str):
Your output fields are:
1. `reasoning` (str): 
2. `sentiment` (str):
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## input ## ]]
{input}

[[ ## reasoning ## ]]
{reasoning}

[[ ## sentiment ## ]]
{sentiment}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Given the fields `input`, produce the fields `sentiment`.


User message:

[[ ## input ## ]]
That was phenomenal, but I hated it!

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## sentiment ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


Response:

[[ ## reasoning ## ]]
The phrase "That was phenomenal" suggests a positive experience or appreciation, indicating enjoyment or admiration. However, the phrase "but I hated it!" introduces a st

### Program of Thought

<img src="./Media/program_of_thought.png">

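ProgramOfThought (PoT) solves tasks by generating executable Python code rather than relying directly on natural-language output. Given a task, PoT first uses a ChainOfThought predictor to generate Python code and then runs it in an isolated Python interpreter. If execution fails, PoT enters a refinement loop: it passes the error back to the language model and asks for corrected code, repeating up to a configured maximum number of iterations (3 by default). The final output is grounded in the actual result of the successfully executed code rather than in text the language model produced on its own.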
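The generate, execute, and retry loop described above can be sketched in plain Python. This is an illustration under stated assumptions, not DSPy's implementation: `fake_generator` is a scripted stand-in for the language model, `run_code` and `program_of_thought` are hypothetical names, and plain `exec()` is not a sandbox.

```python
import traceback


def run_code(code: str) -> dict:
    """Execute generated code in a fresh namespace and return that namespace.

    NOTE: exec() is NOT isolated; it only stands in for a sandboxed interpreter.
    """
    namespace = {}
    exec(code, namespace)
    return namespace


def program_of_thought(generate_code, max_iters: int = 3):
    """Generate code, run it, and feed any error back until it succeeds."""
    error = None
    for attempt in range(1, max_iters + 1):
        code = generate_code(error)  # hypothetical stand-in for the LM call
        try:
            return run_code(code)["result"], attempt
        except Exception:
            error = traceback.format_exc(limit=1)
    raise RuntimeError(f"no working code after {max_iters} attempts: {error}")


def fake_generator(error):
    """Scripted 'model': the first draft has a bug, the retry fixes it."""
    if error is None:
        return "result = sum(data) / len(data)"  # NameError: data is undefined
    return "data = [1.5, 2.5, 3.5]\nresult = sum(data) / len(data)"


value, attempts = program_of_thought(fake_generator)
print(value, attempts)
```

The answer is read from the executed namespace, so a draft that fails to run can never be returned as the result.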

In [14]:
# Define the Signature
class MathAnalysis(dspy.Signature):
    """Analyze a dataset and compute various statistical metrics."""
    numbers: list[float] = dspy.InputField(desc="List of numerical values to analyze")
    required_metrics: list[str] = dspy.InputField(desc="List of metrics to calculate (e.g. ['mean', variance', 'quartiles'])")
    analysis_results: dict[str, float] = dspy.OutputField(desc="Dictionary containing the calculated metrics")

# Create the module
math_analyzer = dspy.ProgramOfThought(MathAnalysis)

# Example
data = [1.5, 2.8, 3.2, 4.7, 5.1, 2.3, 3.9]
metrics = ['mean', 'median']

# Run
pot_response = math_analyzer(
    numbers=data,
    required_metrics=metrics
)

In [15]:
print("Reasoning: ", pot_response.reasoning)
print("\nResults: ", pot_response.analysis_results)

Reasoning:  The code provided calculates the mean and median of the given list of numbers using NumPy functions. The mean is the average of the numbers, while the median is the middle value when the numbers are sorted. The results were expected based on standard calculations for these statistics.

Results:  {'mean': 3.2857142857142856, 'median': 3.2}


In [16]:
lm.inspect_history()





[2026-01-25T15:49:43.573132]

System message:

Your input fields are:
1. `numbers` (list[float]): List of numerical values to analyze
2. `required_metrics` (list[str]): List of metrics to calculate (e.g. ['mean', variance', 'quartiles'])
3. `final_generated_code` (str): python code that answers the question
4. `code_output` (str): output of previously-generated python code
Your output fields are:
1. `reasoning` (str): 
2. `analysis_results` (dict[str, float]): Dictionary containing the calculated metrics
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## numbers ## ]]
{numbers}

[[ ## required_metrics ## ]]
{required_metrics}

[[ ## final_generated_code ## ]]
{final_generated_code}

[[ ## code_output ## ]]
{code_output}

[[ ## reasoning ## ]]
{reasoning}

[[ ## analysis_results ## ]]
{analysis_results}        # note: the value you produce must adhere to the JSON schema: {"type": "object", "additionalProperties":

### Reasoning + Acting (ReAct)

<img src="./Media/react.png">

ReAct combines **reasoning** with **tool usage** to enable interactive problem solving. The model maintains a **trajectory of thought-action pairs**: at each step it explains its reasoning, picks a tool, supplies the arguments for that tool, and then observes the result to decide what to do next. Each iteration therefore has four parts: a **thought** describing the strategy, a **tool selection** from the available tools, the **arguments** passed to the tool, and the **observation** returned by running it. The loop continues until the model chooses "finish" or the maximum number of iterations is reached. A simple example follows.

In [17]:
# Define a Tool
def wikipedia_search(query: str) -> list[str]:
    """Retrieves abstracts from Wikipedia."""
    # Existing Wikipedia Abstracts Server
    results = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')(query, k=3)
    return [x['text'] for x in results]

# Define ReAct Module
react_module = dspy.ReAct('question -> response', tools=[wikipedia_search])

# Example
text = "Who won the world series in 1983 and who won the world cup in 1966?"

# Run
react_response = react_module(question=text)

print("Answer: ", react_response.response)
print("\nReasoning: ", react_response.reasoning)

Answer:  The Baltimore Orioles won the World Series in 1983, and England won the World Cup in 1966.

Reasoning:  The Baltimore Orioles won the World Series in 1983, defeating the Philadelphia Phillies. The England national football team won the World Cup in 1966, which was held in England, and they defeated West Germany in the final match.
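The thought, tool, arguments, and observation loop from this section can also be sketched without any model at all. The code below is a toy under stated assumptions: `scripted_policy` is a hypothetical stand-in for the language model, `lookup` is a made-up in-memory tool, and `react` is not DSPy's implementation.

```python
def lookup(query: str) -> str:
    """Toy tool backed by a tiny in-memory table (stand-in for a search API)."""
    facts = {"1966 world cup": "England won the 1966 FIFA World Cup."}
    return facts.get(query.lower(), "no result")


def scripted_policy(question: str, trajectory: list) -> dict:
    """Hypothetical stand-in for the LM: search once, then finish."""
    if not trajectory:
        return {"thought": "I should look this up.", "tool": "lookup",
                "args": {"query": "1966 world cup"}}
    return {"thought": "I have what I need.", "tool": "finish", "args": {}}


def react(question: str, policy, tools: dict, max_iters: int = 5) -> list:
    """Alternate thought -> tool call -> observation until 'finish' or max_iters."""
    trajectory = []
    for _ in range(max_iters):
        step = policy(question, trajectory)
        if step["tool"] == "finish":
            trajectory.append({**step, "observation": "done"})
            break
        observation = tools[step["tool"]](**step["args"])
        trajectory.append({**step, "observation": observation})
    return trajectory


steps = react("Who won the world cup in 1966?", scripted_policy, {"lookup": lookup})
for s in steps:
    print(s["thought"], "|", s["tool"], "|", s["observation"])
```

Because the whole trajectory is passed back to the policy on every turn, each decision can depend on all earlier observations.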


### Multi Chain Comparison

<img src="./Media/multi_chain.png">

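MultiChainComparison is a **meta-predictor** that synthesizes several existing completions into one more robust final prediction. It does not generate predictions itself; instead it takes **M different completions (3 by default)** from other predictors as input. These may come from the same predictor run at different temperatures, from entirely different predictors, or from repeated calls with identical settings.

Each completion is labeled Student Attempt #1:, Student Attempt #2:, and so on, and each attempt is packaged as «I'm trying to [reasoning] I'm not sure but my prediction is [answer]». The module then uses a prompt such as "Accurate Reasoning: Thank you everyone. Let's now holistically…" to get the model to analyze, compare, and critique the attempts together and produce a single final answer.

By making the model explicitly compare and review multiple solution paths before committing to a decision, this approach helps mitigate errors that any single prediction might contain.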
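As a rough illustration of the packaging step described above, the sketch below formats a few completions into labeled attempts and appends the comparison cue. The function names are hypothetical, the sample completions are invented for the example, and the exact wording DSPy uses may differ.

```python
def format_attempts(completions, answer_field):
    """Render each completion as a labeled 'student attempt' line."""
    attempts = []
    for i, completion in enumerate(completions, start=1):
        reasoning = completion["reasoning"].strip()
        answer = completion[answer_field].strip()
        attempts.append(
            f"Student Attempt #{i}: «I'm trying to {reasoning} "
            f"I'm not sure but my prediction is {answer}»"
        )
    return attempts


def build_comparison_prompt(completions, answer_field):
    """Stack the attempts and end with the cue asking for a holistic comparison."""
    lines = format_attempts(completions, answer_field)
    lines.append("Accurate Reasoning: Thank you everyone. Let's now holistically")
    return "\n".join(lines)


# Made-up completions, for illustration only
completions = [
    {"reasoning": "read the exclamation mark as surprise.", "sentiment": "surprised"},
    {"reasoning": "treat the word as merely unusual.", "sentiment": "neutral"},
    {"reasoning": "read the tone as intrigued.", "sentiment": "positive"},
]
prompt = build_comparison_prompt(completions, "sentiment")
print(prompt)
```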

In [18]:
# Run CoT completions with increasing temperatures
text = "That was piculiar!"   # strange, piculiar, exotic, phenominal

cot_completions = []
for i in range(4):
    # Temperature increases: 0.7, 0.8, 0.9, 1.0
    temp_config = dict(temperature=0.7+(0.1*i))
    completion = cot_emotion(input=text, config=temp_config)
    cot_completions.append(completion)

# Synthesize with MultiChainComparison
mcot_emotion = dspy.MultiChainComparison('input -> sentiment', M=4)
final_result = mcot_emotion(completions=cot_completions, input=text)

print(f"Sentiment: {final_result.sentiment}")
print(f"\nReasoning: {final_result.rationale}")

for i in range(4):
    print(f"\nCompletion {i+1}: ", cot_completions[i])

Sentiment: Positive

Reasoning: The word "piculous" conveys a sense of something being unusual or strange, and the exclamation mark implies an emotional reaction from the speaker. This suggests a mixture of surprise and intrigue. Therefore, the sentiment can be categorized as Positive, as the speaker seems to express an emotional response to an unexpected situation.

Completion 1:  Prediction(
    reasoning='The word "piciular" suggests that something was unusual or strange, which can indicate a sense of confusion or surprise. The exclamation mark emphasizes the speaker\'s emotional reaction, indicating that they found the experience noteworthy or unexpected.',
    sentiment='Surprised'
)

Completion 2:  Prediction(
    reasoning='The word "piculair" suggests that something was unusual or strange, potentially implying a sense of confusion or curiosity. The use of punctuation indicates a strong reaction, which could indicate surprise or intrigue.',
    sentiment='Neutral with a hint of 

### Majority

<img src="./Media/majority.png">

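Majority is a utility function that applies a basic voting mechanism over multiple completions and picks the most common answer. It accepts either a Prediction object containing the completions or the list of completions directly. Majority normalizes the values of the target field, which you can specify explicitly; if you do not, the last output field is used. Normalization is handled by the normalize_text function, which helps treat texts that mean the same thing but are worded slightly differently as the same answer, and returns None for answers that should be ignored. On a tie, the completion generated earlier wins. The function is especially useful alongside modules that produce many completions, for example by running a predictor several times at different temperatures, and gives a simple way to pick the most common answer. Majority returns a new Prediction object containing only the winning completion.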
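The voting rule described above (normalize, count, earliest answer wins a tie) can be sketched in a few lines of plain Python. This is an assumption-laden illustration: `normalize` here is a simplified stand-in, and DSPy's own normalize_text may differ in detail.

```python
import string
from typing import Optional


def normalize(text: str) -> Optional[str]:
    """Lowercase, drop punctuation, collapse whitespace; empty answers become None."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    cleaned = " ".join(cleaned.split())
    return cleaned or None


def majority_vote(answers: list) -> Optional[str]:
    """Return the first original answer whose normalized form is most common."""
    counts = {}
    for answer in answers:
        key = normalize(answer)
        if key is not None:  # ignored answers do not get a vote
            counts[key] = counts.get(key, 0) + 1
    if not counts:
        return None
    # dicts keep insertion order and max() returns the first maximum,
    # so a tie goes to the answer that appeared earliest.
    winner = max(counts, key=counts.get)
    return next(a for a in answers if normalize(a) == winner)


print(majority_vote(["Surprised", "neutral", "surprised!", "Neutral."]))
```

In this example both normalized answers receive two votes, and "Surprised" is returned because it appeared first.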


In [19]:
# Example Completions From Prior Multi-Chain
majority_result = dspy.majority(cot_completions, field="sentiment")

# Results
print(f"Most common sentiment: {majority_result.sentiment}")

Most common sentiment: Surprised


---

## Evaluators

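Modules are the building blocks of a program, but you may have noticed that there is only so much you can do by tuning or changing the modules directly, the way you would iterate on a prompt chain. This is where DSPy sets itself apart: it aims to measure module performance against a predefined metric and to tune performance based on the result.

You therefore need to think carefully about what the ideal LLM output looks like and how to measure it. For a classification problem this may be simple accuracy; for something like retrieval-augmented generation (RAG) it may be a more complex criterion such as **faithfulness to the retrieved context**.

### The Example Data Type

The data type used by DSPy evaluators and metrics is the Example object. It behaves much like a dict, but organizes and handles the data in the format the DSPy backend expects.

You can define fields freely, but they must match the input/output format of the module you are using.

A training set is a list of these Example objects.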

In [20]:
qa_pair = dspy.Example(question="What is my name?", answer="Your name is Adam Lucek")

print(qa_pair)
print(qa_pair.question)
print(qa_pair.answer)

Example({'question': 'What is my name?', 'answer': 'Your name is Adam Lucek'}) (input_keys=None)
What is my name?
Your name is Adam Lucek


In [21]:
classification_pair = dspy.Example(excerpt="I really love programming!", classification="Positive", confidence=0.95)

print(classification_pair)
print(classification_pair.excerpt)
print(classification_pair.classification)
print(classification_pair.confidence)

Example({'excerpt': 'I really love programming!', 'classification': 'Positive', 'confidence': 0.95}) (input_keys=None)
I really love programming!
Positive
0.95


You can also use the .with_inputs() method to separate inputs from labels explicitly.
Fields not listed in .with_inputs() are treated as labels or metadata.

In [22]:
article_summary = dspy.Example(article="Placeholder for Article", summary="Expected Summary").with_inputs("article")

input_key_only = article_summary.inputs()
non_input_key_only = article_summary.labels()

print("Example with input fields only: ", input_key_only)
print("\nExample object Non-Input fields only: ", non_input_key_only)

Example with input fields only:  Example({'article': 'Placeholder for Article'}) (input_keys={'article'})

Example object Non-Input fields only:  Example({'summary': 'Expected Summary'}) (input_keys=None)


### Metrics

<img src="./Media/metrics.png" width="600">

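Now that the data format is clear, it is time to think about the **metric**. Metrics are central to DSPy: the framework optimizes modules against the metric you define.

A DSPy metric is simple to define. It is just a function that takes an Example from your data and the system's output and returns a score quantifying how good that output is. The question, then, is this: by what criteria is your system's output "good" or "bad"?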
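As a minimal sketch of that definition, here are two metric functions with the `(example, pred, trace=None)` shape used in this notebook. To keep the block runnable without DSPy, `SimpleNamespace` objects stand in for Example and Prediction; `token_f1` is an illustrative choice, not something the notebook itself uses.

```python
from types import SimpleNamespace


def exact_match(example, pred, trace=None) -> bool:
    """True only when the answers match after trimming and lowercasing."""
    return example.answer.strip().lower() == pred.answer.strip().lower()


def token_f1(example, pred, trace=None) -> float:
    """Softer score: harmonic mean of token-level precision and recall."""
    gold = example.answer.lower().split()
    guess = pred.answer.lower().split()
    common = sum(min(gold.count(t), guess.count(t)) for t in set(gold))
    if common == 0:
        return 0.0
    precision = common / len(guess)
    recall = common / len(gold)
    return 2 * precision * recall / (precision + recall)


# Stand-ins for dspy.Example and a module's Prediction
example = SimpleNamespace(question="What is my name?", answer="Your name is Adam Lucek")
pred = SimpleNamespace(answer="Adam Lucek")

print(exact_match(example, pred))
print(round(token_f1(example, pred), 3))
```

The two functions disagree on the same prediction, which is the point: choosing the metric is choosing what "good" means for the task.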

### Simple Metrics

<img src="./Media/simple_metrics.png" width="250" >

Let's start with the simplest form.
We set up and run a validation for a sentiment classification module based on exact match.

#### Setup Module

In [23]:
# Simple Tweet Sentiment Classification Module
from typing import Literal

class TwtSentiment(dspy.Signature):
    tweet: str = dspy.InputField(desc="Candidate tweet for classification")
    sentiment: Literal["positive", "negative", "neutral"] = dspy.OutputField()

twt_sentiment = dspy.ChainOfThought(TwtSentiment)

#### Format Dataset

We pull examples, each a tweet with a sentiment label, from the [MTEB Tweet Sentiment Extraction](https://huggingface.co/datasets/mteb/tweet_sentiment_extraction) dataset.
This dataset serves as the reference data for validation.

In [24]:
import json

# Formatting Examples
examples = []
num_examples = 50

with open("./datasets/train.jsonl", 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        if num_examples and i >= num_examples:
            break
        data = json.loads(line.strip())
        example = dspy.Example(
            tweet=data['text'],
            sentiment=data['label_text']
        ).with_inputs("tweet")
        examples.append(example)

In [25]:
examples[12]

Example({'tweet': 'My Sharpie is running DANGERously low on ink', 'sentiment': 'negative'}) (input_keys={'tweet'})

#### Defining Metric

The metric takes an Example, a Prediction, and an optional trace (covered later).

Here the metric returns True or False depending on whether the sentiment predicted by the LLM matches the ground truth.

In [26]:
def validate_answer(example, pred, trace=None):
    return example.sentiment.lower() == pred.sentiment.lower()

#### Running a Manual Evaluation

For each example, we run one prediction using its defined input (the tweet).

The prediction is passed to the validate_answer metric, which returns True or False. The results are appended to the scores list in order.

In [27]:
scores = []
for x in examples:
    pred = twt_sentiment(**x.inputs())
    score = validate_answer(x, pred)
    scores.append(score)

In [28]:
accuracy = sum(scores) / len(scores)
print("Baseline Accuracy: ", accuracy)

Baseline Accuracy:  0.64


### Intermediate Metrics

<img src="./Media/inter_metrics.png" width="250">

Direct comparison against the ground truth is useful, but for comparing and evaluating **long-form output**, the LLM-as-a-Judge approach has also proven effective.

Let's implement an LLM-based metric.

#### Setup Module

In [29]:
# CoT For Summarizing a Dialogue

dialog_sum = dspy.ChainOfThought("dialogue: str -> summary: str")

#### Format Dataset

The dataset for this example is DialogSum, a collection of dialogues paired with their summaries.

We treat the **summaries in the dataset as the gold standard** and evaluate model output with LLM-based **fuzzy metrics**.

In [30]:
num_examples = 20
dialogsum_examples = []

with open("./datasets/dialogsum.train.jsonl", 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        if num_examples and i >= num_examples:
            break
        data = json.loads(line.strip())
        example = dspy.Example(
            dialogue=data['dialogue'],
            summary=data['summary']
        ).with_inputs("dialogue")
        dialogsum_examples.append(example)

#### Metric Signature

Because the metric itself now uses a module, we need a dynamic signature that the metric's prediction can use.

In [31]:
# Define the signature for automatic assessments
class Assess(dspy.Signature):
    """Assess the quality of a dialog summary along the specified dimensions."""
    assessed_text = dspy.InputField()
    assessment_question = dspy.InputField()
    assessment_answer: bool = dspy.OutputField()

#### Metric Definition

This metric uses an LLM as the judge to decide whether the generated dialogue summary is accurate with respect to the original dialogue, and whether it is concise compared with the expected summary.

In [32]:
def dialog_metric(gold, pred, trace=None):
    dialogue, gold_summary, generated_summary = gold.dialogue, gold.summary, pred.summary
    # Define Assessment Questions
    accurate_question = f"Given this original dialog: '{dialogue}', does the summary accurately represent what was discussed without adding or changing information?"
    concise_question = f"""Compare the level of detail in the generated summary with the gold summary:
    Gold summary: '{gold_summary}'
    Is the generated summary appropriately detailed - neither too sparse nor too verbose compared to the gold summary?"""
    # Run Predictions
    accurate = dspy.Predict(Assess)(assessed_text=generated_summary, assessment_question=accurate_question)
    concise = dspy.Predict(Assess)(assessed_text=generated_summary, assessment_question=concise_question)
    # Extract boolean assessment answers
    accurate, concise = [m.assessment_answer for m in [accurate, concise]]
    # Calculate score - accuracy is required for any points
    score = (accurate + concise) if accurate else 0

    if trace is not None:
        return score >= 2

    return score / 2.0

#### Running Evaluation

We run a manual evaluation again, in the same way as before.

In [33]:
intermediate_scores = []
for x in dialogsum_examples:
    pred = dialog_sum(**x.inputs())
    score = dialog_metric(x, pred)
    intermediate_scores.append(score)

In [34]:
final_score = sum(intermediate_scores) / len(intermediate_scores)
print("Dialog Metric Score: ", final_score)

Dialog Metric Score:  0.85


### Advanced Metrics with Tracing in DSPy

<img src="./Media/advan_metrics.png" width="250">

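The DSPy documentation highlights two key points about using modules as metrics.

1. When the metric is itself a DSPy program, one of the most powerful ways to iterate is to compile (optimize) the metric itself.
This is usually easy: the metric's output is typically a simple value (for example, a score out of 5), so a **metric for the metric** is easy to define, and optimization is possible with only a handful of examples.

2. When the metric is used during evaluation, DSPy does not trace the program's internal steps.
During compilation (optimization), however, DSPy **traces the language model calls**. The trace contains the inputs and outputs of each DSPy predictor, which you can use to validate intermediate steps and feed into optimization.

Looking at the second point more closely with the earlier example, the metric operates in two modes.

**Standard Evaluation (trace=None)**: Returns a score normalized to the range 0 to 1, based on the summary's accuracy and conciseness. **Factual accuracy** acts as a gating condition.

**Compilation Mode (trace available)**: During compilation, DSPy provides the trace of the ChainOfThought module (dialog_sum). Whereas standard evaluation returns a score between 0 and 1, in compilation mode the return logic switches to a **binary success criterion (e.g. score ≥ 2)**. This **binary signal (success / failure)** gives a clear verdict for each example, so DSPy can optimize much more effectively during compilation.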

In [35]:
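def dialog_metric(gold, pred, trace=None):
    dialogue, gold_summary, generated_summary = gold.dialogue, gold.summary, pred.summary

    # Assessment questions (same as in the earlier definition)
    accurate_question = f"Given this original dialog: '{dialogue}', does the summary accurately represent what was discussed without adding or changing information?"
    concise_question = f"""Compare the level of detail in the generated summary with the gold summary:
    Gold summary: '{gold_summary}'
    Is the generated summary appropriately detailed - neither too sparse nor too verbose compared to the gold summary?"""

    # LLM-based assessment using Assess signature
    accurate = dspy.Predict(Assess)(assessed_text=generated_summary, assessment_question=accurate_question)
    concise = dspy.Predict(Assess)(assessed_text=generated_summary, assessment_question=concise_question)

    # Extract boolean answers; accuracy is required for any points
    accurate, concise = [m.assessment_answer for m in [accurate, concise]]
    score = (accurate + concise) if accurate else 0

    if trace is not None:
        # During compilation: Can access and validate CoT reasoning steps
        # We're not doing anything with it currently but you can access in this way
        reasoning_steps = [output.reasoning for *_, output in trace if hasattr(output, 'reasoning')]
        # Return binary success criteria for optimization
        return score >= 2  # Requires both accuracy and conciseness

    return score / 2.0  # Normalized evaluation score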

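Tracing is especially valuable for more complex modules such as the ChainOfThought implementation. It changes what the metric feeds back to DSPy during optimization: at compile time, instead of a normalized score, the metric reports a binary success signal based on whether a criterion (e.g. score ≥ 2) is met. That binary feedback gives a clear success/failure signal for each example, which lets DSPy optimize the program much more effectively.

This **dual-mode evaluation** strategy serves two different purposes. During ordinary evaluation it provides a fine-grained normalized score so you can gauge model performance precisely; during compilation it switches to a binary success criterion that steers the optimization more clearly.

With this approach we keep a rich evaluation metric while still giving compilation the clear signal it needs to improve the model. The metric can also be extended further by incorporating signals from intermediate steps that are normally hidden.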
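The scoring arithmetic behind the two modes is easy to check in isolation. The sketch below strips out the LLM calls and takes the two judgments as plain booleans; `dual_mode_score` is a hypothetical helper that mirrors only the scoring logic of `dialog_metric`.

```python
def dual_mode_score(accurate: bool, concise: bool, trace=None):
    """Accuracy gates the score; a non-None trace switches to pass/fail."""
    score = (accurate + concise) if accurate else 0
    if trace is not None:
        return score >= 2   # compilation mode: binary success signal
    return score / 2.0      # evaluation mode: normalized score


# Enumerate every combination of the two judgments in both modes
for accurate in (True, False):
    for concise in (True, False):
        evaluation = dual_mode_score(accurate, concise)
        compilation = dual_mode_score(accurate, concise, trace=[])
        print(accurate, concise, evaluation, compilation)
```

An accurate but verbose summary earns 0.5 during evaluation yet counts as a failure during compilation, so only summaries that pass both checks can be kept as demonstrations.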

---

### Optimization

<img src="./Media/optimizers.png" width="600">

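Now that the module and the metric for evaluation are ready, we can move on to the last step: **program optimization**. This replaces tweaking prompts by feel with automatic testing, evaluation, and iterative improvement driven by measurable values.

DSPy offers several ways to optimize a program. The following is summarized from the official DSPy documentation.

**Automatic Few-Shot Learning** These optimizers extend the signature by automatically generating optimized examples and including them in the prompt, implementing few-shot learning.

* **LabeledFewShot**: Builds few-shot examples from the provided labeled input-output data. Parameters: `k`, the number of examples to include in the prompt, and `trainset`, the training data from which the k examples are randomly selected.

* **BootstrapFewShot**: Uses a teacher module (by default, the program itself) to generate complete demonstrations for every stage of the program, along with the labeled examples in the trainset. Key parameters are `max_labeled_demos`, the number of demonstrations randomly selected from the trainset, and `max_bootstrapped_demos`, the number of additional examples generated by the teacher. Bootstrapping validates demonstrations with the metric, and only those that pass are included in the "compiled" prompt. As an advanced feature, it also supports using a different, structurally compatible DSPy program as the teacher for harder tasks.

* **BootstrapFewShotWithRandomSearch**: Runs BootstrapFewShot several times with random search over the generated demonstrations and selects the best-performing program. The parameters mirror those of BootstrapFewShot, plus `num_candidate_programs`, the number of random programs evaluated during optimization. The candidates evaluated include the uncompiled original program, the LabeledFewShot-optimized program, a BootstrapFewShot-compiled program with unshuffled examples, and `num_candidate_programs` BootstrapFewShot-compiled programs with randomly shuffled example sets.

* **KNNFewShot**: Uses the k-Nearest Neighbors algorithm to find the training demonstrations closest to a given input example. These nearest-neighbor demonstrations then serve as the trainset for the BootstrapFewShot optimization. See the corresponding notebook for an example.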
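The bootstrapping and random-search ideas in the list above can be sketched with toy stand-ins. Everything here is an assumption made for illustration: `teacher` is a keyword rule rather than an LM program, `build_program` is a toy "compiled program", the data is made up, and none of this is DSPy's actual implementation.

```python
import random


def metric(example, prediction, trace=None):
    """Exact match between the gold label and a predicted label string."""
    return example["sentiment"] == prediction


def teacher(tweet):
    """Hypothetical teacher: a crude keyword rule standing in for an LM program."""
    if "love" in tweet:
        return "positive"
    if "hate" in tweet:
        return "negative"
    return "positive"  # deliberately wrong on neutral tweets


def bootstrap_demos(trainset, max_bootstrapped_demos):
    """Keep only teacher outputs that pass the metric, up to the budget."""
    demos = []
    for example in trainset:
        if len(demos) >= max_bootstrapped_demos:
            break
        prediction = teacher(example["tweet"])
        if metric(example, prediction):
            demos.append({"tweet": example["tweet"], "sentiment": prediction})
    return demos


def build_program(demos):
    """Toy 'compiled program': copy the label of the demo sharing the most words."""
    def program(tweet):
        words = set(tweet.split())
        best = max(demos, key=lambda d: len(words & set(d["tweet"].split())), default=None)
        return best["sentiment"] if best else "neutral"
    return program


def random_search(trainset, valset, num_candidate_programs=4, seed=0):
    """Bootstrap on shuffled trainsets and keep the best-scoring candidate."""
    rng = random.Random(seed)
    best_score, best_demos = -1.0, []
    for _ in range(num_candidate_programs):
        shuffled = trainset[:]
        rng.shuffle(shuffled)
        demos = bootstrap_demos(shuffled, max_bootstrapped_demos=2)
        program = build_program(demos)
        score = sum(metric(ex, program(ex["tweet"])) for ex in valset) / len(valset)
        if score > best_score:
            best_score, best_demos = score, demos
    return best_score, best_demos


trainset = [
    {"tweet": "i love this", "sentiment": "positive"},
    {"tweet": "i hate mondays", "sentiment": "negative"},
    {"tweet": "it is fine", "sentiment": "neutral"},
    {"tweet": "love it so much", "sentiment": "positive"},
]
valset = [
    {"tweet": "i hate this", "sentiment": "negative"},
    {"tweet": "love love love", "sentiment": "positive"},
]

all_demos = bootstrap_demos(trainset, max_bootstrapped_demos=4)
best_score, best_demos = random_search(trainset, valset)
print(all_demos)
print(best_score, best_demos)
```

The neutral tweet never becomes a demonstration because the teacher's output for it fails the metric, which is the filtering role the metric plays during bootstrapping.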


**Automatic Instruction Optimization** These optimizers optimize the instructions that go into the prompt. MIPROv2 can also optimize the few-shot examples alongside them.

* **COPRO**: Generates and iteratively refines new instructions for each step, optimizing them with coordinate ascent (hill-climbing) using the metric function and the `trainset`. Parameters include `depth`, the number of prompt-improvement iterations the optimizer runs.

* **MIPROv2**: Generates both instructions and few-shot examples for each step. Instruction generation is data-aware and demonstration-aware, and Bayesian Optimization is used to search the space of instructions and demonstrations across modules.
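
To make the coordinate-ascent idea behind COPRO more concrete, here is a toy sketch in plain Python. It is not COPRO's implementation: `propose_variants` and `toy_metric` are hypothetical stand-ins for LM-generated instruction proposals and for scoring the program on the trainset. Only the loop shape (propose, score, keep the best, repeat up to `depth` times) mirrors the description above.

```python
from typing import Callable, List


def hill_climb_instruction(
    seed: str,
    propose_variants: Callable[[str], List[str]],
    score: Callable[[str], float],
    depth: int = 3,
) -> str:
    """Greedy search: each round keeps the best-scoring instruction seen so far."""
    best, best_score = seed, score(seed)
    for _ in range(depth):
        improved = False
        for candidate in propose_variants(best):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
                improved = True
        if not improved:  # no candidate beat the current best, so stop early
            break
    return best


# Hypothetical proposer: appends fixed hints instead of asking an LM.
def propose_variants(instruction: str) -> List[str]:
    suffixes = [" Answer with one word.", " Choose positive, negative, or neutral."]
    return [instruction + s for s in suffixes if s not in instruction]


# Hypothetical metric: counts useful hints instead of running a trainset.
def toy_metric(instruction: str) -> int:
    keywords = ["one word", "positive, negative, or neutral"]
    return sum(k in instruction for k in keywords)


best_instruction = hill_climb_instruction(
    "Classify the tweet.", propose_variants, toy_metric, depth=3
)
print(best_instruction)
```

In a real optimizer the scoring step is the expensive part, since each candidate instruction means running the program over the training examples.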


**Automatic Finetuning** This optimizer fine-tunes the underlying LM(s).

* **BootstrapFinetune**: Distills a prompt-based DSPy program into weight updates. The output is a DSPy program that has the same steps, but where each step is conducted by a finetuned model instead of a prompted LM.

**Program Transformations**

* **Ensemble**: Ensembles a set of DSPy programs and either uses the full set or randomly samples a subset into a single program.
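
As a rough illustration of what ensembling means here, the sketch below combines several classifiers by majority vote. The three "programs" are hypothetical plain functions, not DSPy modules, and majority voting is only one possible way to reduce several outputs to one.

```python
from collections import Counter
from typing import Callable, List


def majority_vote_ensemble(programs: List[Callable[[str], str]]) -> Callable[[str], str]:
    """Wrap several classifiers into one that returns the most common label."""
    def ensemble(tweet: str) -> str:
        votes = [program(tweet) for program in programs]
        return Counter(votes).most_common(1)[0][0]
    return ensemble


# Hypothetical stand-ins for separately optimized programs.
def always_positive(tweet: str) -> str:
    return "positive"


def keyword_based(tweet: str) -> str:
    return "negative" if "hate" in tweet.lower() else "positive"


def always_negative(tweet: str) -> str:
    return "negative"


ensemble = majority_vote_ensemble([always_positive, keyword_based, always_negative])
print(ensemble("I hate Mondays"))
print(ensemble("Nice day"))
```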

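**Loading the tweet train and test data**

This example optimizes the tweet sentiment classification module used earlier.
Classification is not the most compelling use case for an LLM,
but it is a lightweight way to see how each optimizer works internally.

That understanding carries over to applying these optimizers to more complex DSPy programs.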

In [36]:
import json

# Formatting Examples
twitter_train = []
twitter_test = []
train_size = 100 # how many for train
test_size = 200  # how many for test

with open("./datasets/train.jsonl", 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        if i >= (train_size + test_size):
            break

        data = json.loads(line.strip())
        example = dspy.Example(
            tweet = data['text'],
            sentiment = data['label_text']
        ).with_inputs("tweet")

        if i < train_size:
            twitter_train.append(example)
        else:
            twitter_test.append(example)

#### Candidate Program

In [37]:
# Simple Tweet Sentiment Classification Module
from typing import Literal

class TwtSentiment(dspy.Signature):
    tweet: str = dspy.InputField(desc="Candidate tweet for classification")
    sentiment: Literal["positive", "negative", "neutral"] = dspy.OutputField()

base_twt_sentiment = dspy.Predict(TwtSentiment)
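
Because the output field is typed as a `Literal`, it can be worth checking that every label in the dataset is one of the allowed values before optimizing. The sketch below shows one way to do that with the standard library. The sample labels are made up for illustration, and the case-insensitive comparison is an assumption.

```python
from typing import Literal, get_args

SentimentLabel = Literal["positive", "negative", "neutral"]
ALLOWED_LABELS = set(get_args(SentimentLabel))


def unexpected_labels(labels):
    """Return labels that match no allowed value, comparing case-insensitively."""
    return sorted({label for label in labels if label.lower() not in ALLOWED_LABELS})


# Made-up sample; with the real data this could be [ex.sentiment for ex in twitter_train].
sample = ["positive", "Negative", "neutral", "mixed"]
print(unexpected_labels(sample))
```

An empty result means every gold label can be produced by the signature.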

#### Simple Metrics

In [38]:
def validate_answer(example, pred, trace=None):
    return example.sentiment.lower() == pred.sentiment.lower()
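
The metric above assumes `pred.sentiment` is always a string. A slightly more defensive variant is sketched below, assuming a prediction could occasionally be missing the field. `validate_answer_safe` is a hypothetical name, and `SimpleNamespace` objects stand in for DSPy examples and predictions.

```python
from types import SimpleNamespace


def validate_answer_safe(example, pred, trace=None):
    """Case-insensitive exact match that returns False instead of raising."""
    predicted = getattr(pred, "sentiment", None)
    if not isinstance(predicted, str):
        return False
    return example.sentiment.strip().lower() == predicted.strip().lower()


gold = SimpleNamespace(sentiment="positive")
print(validate_answer_safe(gold, SimpleNamespace(sentiment="Positive ")))
print(validate_answer_safe(gold, SimpleNamespace(sentiment=None)))
```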

#### Baseline Score

In [39]:
baseline_scores = []
for x in twitter_test:
    pred = base_twt_sentiment(**x.inputs())
    score = validate_answer(x, pred)
    baseline_scores.append(score)

base_accuracy = baseline_scores.count(True) / len(baseline_scores)
print("Baseline Accuracy: ", base_accuracy)

Baseline Accuracy:  0.755
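
The same scoring loop is repeated for every optimized program in the rest of this section, so it can be factored into a small helper. The sketch below is one possible shape. `accuracy` is a hypothetical helper, and the toy data and predictor are stand-ins so the block runs without an LM.

```python
from types import SimpleNamespace


def accuracy(predict, dataset, metric):
    """Fraction of examples for which metric(example, prediction) is truthy."""
    if not dataset:
        raise ValueError("dataset is empty")
    hits = sum(bool(metric(example, predict(example))) for example in dataset)
    return hits / len(dataset)


# Toy stand-ins for dspy.Example objects and a program.
toy_test = [
    SimpleNamespace(tweet="love it", sentiment="positive"),
    SimpleNamespace(tweet="so good", sentiment="positive"),
    SimpleNamespace(tweet="great", sentiment="positive"),
    SimpleNamespace(tweet="awful", sentiment="negative"),
]


def toy_predict(example):
    return SimpleNamespace(sentiment="positive")  # always predicts positive


def exact_match(example, pred, trace=None):
    return example.sentiment.lower() == pred.sentiment.lower()


print(accuracy(toy_predict, toy_test, exact_match))
```

With the objects from this notebook, the call would look like `accuracy(lambda x: base_twt_sentiment(**x.inputs()), twitter_test, validate_answer)`.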


#### Example tweet to run through each program

In [40]:
# Expected Positive Label
example_tweet = "Hi! Waking up, and not lazy at all. You would be proud of me, 8 am here!!! Btw, nice colour, not burnt." 

### Automatic Few-Shot Learning

<img src="./Media/auto_fewshot.png" width="300">

These optimizers either retrieve training examples similar to the query at inference time
or generate optimized examples from the program itself.

#### LabeledFewShot

<img src="./Media/labeled_few_shot.png" >

This is the simplest optimizer.
It randomly selects `k` examples from the training data and uses them as **demonstrations**.
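
Conceptually, the selection step amounts to sampling without replacement. The sketch below illustrates this with the standard library. The fixed seed is an assumption made for reproducibility, and `sample_demos` is a hypothetical helper rather than the optimizer's own code.

```python
import random


def sample_demos(trainset, k, seed=0):
    """Pick up to k distinct training examples to use as prompt demos."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    return rng.sample(trainset, min(k, len(trainset)))


toy_trainset = [{"tweet": f"tweet {i}", "sentiment": "neutral"} for i in range(100)]
demos = sample_demos(toy_trainset, k=16)
print(len(demos))
```

Since the demos are chosen without looking at the metric, nothing guarantees that they help. That makes this optimizer a useful baseline for the metric-driven ones that follow.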

In [42]:
from dspy.teleprompt import LabeledFewShot
lfs_optimizer = LabeledFewShot(k=16)   # Use 16 examples in prompts
lfs_twt_sentiment = lfs_optimizer.compile(base_twt_sentiment, trainset=twitter_train)

In [45]:
lfs_scores = []
for x in twitter_test:
    pred = lfs_twt_sentiment(**x.inputs())
    score = validate_answer(x, pred)
    lfs_scores.append(score)

lfs_accuracy = lfs_scores.count(True) / len(lfs_scores)
print("Labeled Few Shot Accuracy: ", lfs_accuracy)

Labeled Few Shot Accuracy:  0.68


In [47]:
lfs_twt_sentiment.save("./optimized/lfs_twt_sentiment.json")

In [48]:
print(lfs_twt_sentiment(tweet=example_tweet).sentiment)

positive


#### BootstrapFewShot

<img src="./Media/bootstrap_fewshot.png">

Runs the program and keeps only the successful runs, as judged by the metric, to build high-quality examples.

In [53]:
from dspy.teleprompt import BootstrapFewShot

bsfs_optimizer = BootstrapFewShot(
    metric=validate_answer,     # Function to evaluate quality 
    max_bootstrapped_demos=4,   # Generated examples
    max_labeled_demos=16,       # Examples from training data
    metric_threshold=1          # Minimum quality threshold
)

bsfs_twt_sentiment = bsfs_optimizer.compile(base_twt_sentiment, trainset=twitter_train)

  4%|█▋                                         | 4/100 [00:00<00:02, 34.13it/s]

Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
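
The log line above reports that four traces were collected after four training examples, which suggests the optimizer stops as soon as it has `max_bootstrapped_demos` passing runs. A simplified sketch of that loop follows. It is a single-predictor approximation under that assumption, and the teacher, metric, and data are toy stand-ins.

```python
from types import SimpleNamespace


def bootstrap_demos(trainset, teacher, metric, max_bootstrapped_demos=4):
    """Run the teacher over the trainset and keep only runs that pass the metric."""
    demos, attempts = [], 0
    for example in trainset:
        if len(demos) >= max_bootstrapped_demos:
            break  # enough demos collected; remaining examples are not visited
        attempts += 1
        pred = teacher(example)
        if metric(example, pred):
            demos.append(SimpleNamespace(tweet=example.tweet, sentiment=pred.sentiment))
    return demos, attempts


labels = ["positive", "negative", "positive", "positive",
          "negative", "positive", "positive", "positive"]
toy_train = [SimpleNamespace(tweet=f"tweet {i}", sentiment=s) for i, s in enumerate(labels)]


def toy_teacher(example):
    return SimpleNamespace(sentiment="positive")  # stand-in for an LM call


def exact_match(example, pred, trace=None):
    return example.sentiment == pred.sentiment


demos, attempts = bootstrap_demos(toy_train, toy_teacher, exact_match)
print(f"Bootstrapped {len(demos)} traces after {attempts} examples")
```

In this toy run two examples fail the metric, so six attempts are needed to gather four demos.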





In [54]:
bsfs_scores = []
for x in twitter_test:
    pred = bsfs_twt_sentiment(**x.inputs())
    score = validate_answer(x, pred)
    bsfs_scores.append(score)

bsfs_accuracy = bsfs_scores.count(True) / len(bsfs_scores)
print("Bootstrap Few Shot Accuracy: ", bsfs_accuracy)

Bootstrap Few Shot Accuracy:  0.705


In [55]:
bsfs_twt_sentiment.save("./optimized/bsfs_twt_sentiment.json")

In [56]:
print(bsfs_twt_sentiment(tweet=example_tweet).sentiment)

positive


#### BootstrapFewShotWithRandomSearch

<img src="./Media/bsfswrs_diagram.png">

An extension of BootstrapFewShot that tries several random combinations of examples and keeps the best-performing one.

In [58]:
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

bsfswrs_optimizer = BootstrapFewShotWithRandomSearch(
    metric=validate_answer,
    num_candidate_programs=16,
    max_bootstrapped_demos=4,
    max_labeled_demos=16
)

bsfswrs_twt_sentiment = bsfswrs_optimizer.compile(base_twt_sentiment, trainset=twitter_train)

Going to sample between 1 and 4 traces per predictor.
Will attempt to bootstrap 16 candidate sets.
Average Metric: 67.00 / 100 (67.0%): 100%|████| 100/100 [00:08<00:00, 12.00it/s]

2026/01/25 16:53:46 INFO dspy.evaluate.evaluate: Average Metric: 67 / 100 (67.0%)



New best score: 67.0 for seed -3
Scores so far: [67.0]
Best score so far: 67.0
Average Metric: 81.00 / 100 (81.0%): 100%|████| 100/100 [00:11<00:00,  8.77it/s]

2026/01/25 16:53:58 INFO dspy.evaluate.evaluate: Average Metric: 81 / 100 (81.0%)



New best score: 81.0 for seed -2
Scores so far: [67.0, 81.0]
Best score so far: 81.0


  4%|█▋                                         | 4/100 [00:00<00:02, 39.03it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Average Metric: 75.00 / 100 (75.0%): 100%|████| 100/100 [00:12<00:00,  7.78it/s]

2026/01/25 16:54:11 INFO dspy.evaluate.evaluate: Average Metric: 75 / 100 (75.0%)



Scores so far: [67.0, 81.0, 75.0]
Best score so far: 81.0


  5%|██▏                                        | 5/100 [00:04<01:24,  1.13it/s]


Bootstrapped 4 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.
Average Metric: 80.00 / 100 (80.0%): 100%|████| 100/100 [00:11<00:00,  8.55it/s]

2026/01/25 16:54:27 INFO dspy.evaluate.evaluate: Average Metric: 80 / 100 (80.0%)



Scores so far: [67.0, 81.0, 75.0, 80.0]
Best score so far: 81.0


  2%|▊                                          | 2/100 [00:01<01:11,  1.38it/s]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Average Metric: 80.00 / 100 (80.0%): 100%|████| 100/100 [00:10<00:00,  9.73it/s]

2026/01/25 16:54:39 INFO dspy.evaluate.evaluate: Average Metric: 80 / 100 (80.0%)



Scores so far: [67.0, 81.0, 75.0, 80.0, 80.0]
Best score so far: 81.0


  1%|▍                                          | 1/100 [00:00<01:37,  1.02it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Average Metric: 78.00 / 100 (78.0%): 100%|████| 100/100 [00:11<00:00,  8.83it/s]

2026/01/25 16:54:51 INFO dspy.evaluate.evaluate: Average Metric: 78 / 100 (78.0%)



Scores so far: [67.0, 81.0, 75.0, 80.0, 80.0, 78.0]
Best score so far: 81.0


  2%|▊                                          | 2/100 [00:01<01:27,  1.12it/s]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Average Metric: 77.00 / 100 (77.0%): 100%|████| 100/100 [00:13<00:00,  7.59it/s]

2026/01/25 16:55:06 INFO dspy.evaluate.evaluate: Average Metric: 77 / 100 (77.0%)



Scores so far: [67.0, 81.0, 75.0, 80.0, 80.0, 78.0, 77.0]
Best score so far: 81.0


  2%|▊                                          | 2/100 [00:01<01:28,  1.10it/s]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Average Metric: 75.00 / 100 (75.0%): 100%|████| 100/100 [00:16<00:00,  5.93it/s]

2026/01/25 16:55:25 INFO dspy.evaluate.evaluate: Average Metric: 75 / 100 (75.0%)



Scores so far: [67.0, 81.0, 75.0, 80.0, 80.0, 78.0, 77.0, 75.0]
Best score so far: 81.0


  3%|█▎                                         | 3/100 [00:02<01:28,  1.09it/s]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Average Metric: 73.00 / 100 (73.0%): 100%|████| 100/100 [00:10<00:00,  9.72it/s]

2026/01/25 16:55:38 INFO dspy.evaluate.evaluate: Average Metric: 73 / 100 (73.0%)



Scores so far: [67.0, 81.0, 75.0, 80.0, 80.0, 78.0, 77.0, 75.0, 73.0]
Best score so far: 81.0


  3%|█▎                                         | 3/100 [00:02<01:12,  1.34it/s]


Bootstrapped 1 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Average Metric: 78.00 / 100 (78.0%): 100%|████| 100/100 [00:11<00:00,  8.40it/s]

2026/01/25 16:55:52 INFO dspy.evaluate.evaluate: Average Metric: 78 / 100 (78.0%)



Scores so far: [67.0, 81.0, 75.0, 80.0, 80.0, 78.0, 77.0, 75.0, 73.0, 78.0]
Best score so far: 81.0


  3%|█▎                                         | 3/100 [00:02<01:29,  1.08it/s]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Average Metric: 78.00 / 100 (78.0%): 100%|████| 100/100 [00:10<00:00,  9.37it/s]

2026/01/25 16:56:06 INFO dspy.evaluate.evaluate: Average Metric: 78 / 100 (78.0%)



Scores so far: [67.0, 81.0, 75.0, 80.0, 80.0, 78.0, 77.0, 75.0, 73.0, 78.0, 78.0]
Best score so far: 81.0


  2%|▊                                          | 2/100 [00:01<01:30,  1.08it/s]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Average Metric: 83.00 / 100 (83.0%): 100%|████| 100/100 [00:11<00:00,  8.38it/s]

2026/01/25 16:56:20 INFO dspy.evaluate.evaluate: Average Metric: 83 / 100 (83.0%)



New best score: 83.0 for seed 8
Scores so far: [67.0, 81.0, 75.0, 80.0, 80.0, 78.0, 77.0, 75.0, 73.0, 78.0, 78.0, 83.0]
Best score so far: 83.0


  4%|█▋                                         | 4/100 [00:03<01:20,  1.19it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Average Metric: 77.00 / 100 (77.0%): 100%|████| 100/100 [00:11<00:00,  8.52it/s]

2026/01/25 16:56:35 INFO dspy.evaluate.evaluate: Average Metric: 77 / 100 (77.0%)



Scores so far: [67.0, 81.0, 75.0, 80.0, 80.0, 78.0, 77.0, 75.0, 73.0, 78.0, 78.0, 83.0, 77.0]
Best score so far: 83.0


  1%|▍                                          | 1/100 [00:01<01:50,  1.11s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Average Metric: 78.00 / 100 (78.0%): 100%|████| 100/100 [00:13<00:00,  7.54it/s]

2026/01/25 16:56:49 INFO dspy.evaluate.evaluate: Average Metric: 78 / 100 (78.0%)



Scores so far: [67.0, 81.0, 75.0, 80.0, 80.0, 78.0, 77.0, 75.0, 73.0, 78.0, 78.0, 83.0, 77.0, 78.0]
Best score so far: 83.0


  6%|██▌                                        | 6/100 [00:03<01:00,  1.55it/s]


Bootstrapped 4 full traces after 6 examples for up to 1 rounds, amounting to 6 attempts.
Average Metric: 80.00 / 100 (80.0%): 100%|████| 100/100 [00:10<00:00,  9.34it/s]

2026/01/25 16:57:04 INFO dspy.evaluate.evaluate: Average Metric: 80 / 100 (80.0%)



Scores so far: [67.0, 81.0, 75.0, 80.0, 80.0, 78.0, 77.0, 75.0, 73.0, 78.0, 78.0, 83.0, 77.0, 78.0, 80.0]
Best score so far: 83.0


  6%|██▌                                        | 6/100 [00:05<01:21,  1.15it/s]


Bootstrapped 4 full traces after 6 examples for up to 1 rounds, amounting to 6 attempts.
Average Metric: 78.00 / 100 (78.0%): 100%|████| 100/100 [00:10<00:00,  9.30it/s]

2026/01/25 16:57:20 INFO dspy.evaluate.evaluate: Average Metric: 78 / 100 (78.0%)



Scores so far: [67.0, 81.0, 75.0, 80.0, 80.0, 78.0, 77.0, 75.0, 73.0, 78.0, 78.0, 83.0, 77.0, 78.0, 80.0, 78.0]
Best score so far: 83.0


  4%|█▋                                         | 4/100 [00:03<01:12,  1.33it/s]


Bootstrapped 3 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Average Metric: 81.00 / 100 (81.0%): 100%|████| 100/100 [00:11<00:00,  8.88it/s]

2026/01/25 16:57:34 INFO dspy.evaluate.evaluate: Average Metric: 81 / 100 (81.0%)



Scores so far: [67.0, 81.0, 75.0, 80.0, 80.0, 78.0, 77.0, 75.0, 73.0, 78.0, 78.0, 83.0, 77.0, 78.0, 80.0, 78.0, 81.0]
Best score so far: 83.0


  2%|▊                                          | 2/100 [00:01<01:17,  1.26it/s]


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Average Metric: 80.00 / 100 (80.0%): 100%|████| 100/100 [00:10<00:00,  9.58it/s]

2026/01/25 16:57:46 INFO dspy.evaluate.evaluate: Average Metric: 80 / 100 (80.0%)



Scores so far: [67.0, 81.0, 75.0, 80.0, 80.0, 78.0, 77.0, 75.0, 73.0, 78.0, 78.0, 83.0, 77.0, 78.0, 80.0, 78.0, 81.0, 80.0]
Best score so far: 83.0


  2%|▊                                          | 2/100 [00:01<01:27,  1.12it/s]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Average Metric: 81.00 / 100 (81.0%): 100%|████| 100/100 [00:08<00:00, 11.11it/s]

2026/01/25 16:57:57 INFO dspy.evaluate.evaluate: Average Metric: 81 / 100 (81.0%)



Scores so far: [67.0, 81.0, 75.0, 80.0, 80.0, 78.0, 77.0, 75.0, 73.0, 78.0, 78.0, 83.0, 77.0, 78.0, 80.0, 78.0, 81.0, 80.0, 81.0]
Best score so far: 83.0
19 candidate programs found.
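
The log can be read as a simple argmax over candidate programs. The sketch below replays that selection using the scores printed above. The mapping of seeds -3, -2, and -1 to the three fixed candidates is an inference from the optimizer description and the log, not something the log states in full.

```python
# Scores copied from the final "Scores so far" line in the log above.
scores = [67.0, 81.0, 75.0, 80.0, 80.0, 78.0, 77.0, 75.0, 73.0, 78.0,
          78.0, 83.0, 77.0, 78.0, 80.0, 78.0, 81.0, 80.0, 81.0]
num_candidate_programs = 16

# Three fixed candidates plus num_candidate_programs shuffled ones gives 19.
seeds = list(range(-3, num_candidate_programs))


def describe(seed):
    """Assumed meaning of each seed, inferred from the description and log."""
    if seed == -3:
        return "uncompiled program"
    if seed == -2:
        return "labeled few-shot only"
    if seed == -1:
        return "bootstrapped, unshuffled trainset"
    return f"bootstrapped, shuffled trainset (seed {seed})"


best_seed, best_score = max(zip(seeds, scores), key=lambda pair: pair[1])
print(best_seed, best_score, describe(best_seed))
```

The denominators of 100 in the log suggest the candidates were scored on the 100 training examples, so these numbers are not directly comparable to accuracy on the held-out test set.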


In [60]:
bsfswrs_scores = []
for x in twitter_test:
    pred = bsfswrs_twt_sentiment(**x.inputs())
    score = validate_answer(x, pred)
    bsfswrs_scores.append(score)

bsfswrs_accuracy = bsfswrs_scores.count(True) / len(bsfswrs_scores)
print("Bootstrap Few Shot With Random Search Accuracy: ", bsfswrs_accuracy)

Bootstrap Few Shot With Random Search Accuracy:  0.69


In [62]:
bsfswrs_twt_sentiment.save("./optimized/bsfswrs_twt_sentiment.json")

In [63]:
print(bsfswrs_twt_sentiment(tweet=example_tweet).sentiment)

positive


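#### KNNFewShot

<img src="./Media/knn_diagram.png">

Dynamically selects relevant examples based on their similarity to the input.

##### Defining the embedding function

KNN retrieval relies on vector similarity, so it needs a fast embedding function.
Here we use a very simple setup based on the OpenAI API.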

In [66]:
from openai import OpenAI
import numpy as np

client = OpenAI()

def openai_embeddings(texts):
    if isinstance(texts, str):
        texts = [texts]

    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    # Convert to numpy array
    embeddings = np.array([embedding.embedding for embedding in response.data], dtype=np.float32)

    # If single text, return single embedding
    if len(embeddings) == 1:
        return embeddings[0]
    return embeddings

In [70]:
from dspy.teleprompt import KNNFewShot

knn_optimizer = KNNFewShot(
    k=5,                               # Number of neighbors to use
    trainset=twitter_train,            # Dataset for finding neighbors
    vectorizer=openai_embeddings       # Function to convert inputs to vectors
)

knn_twt_sentiment = knn_optimizer.compile(base_twt_sentiment)
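
The retrieval step can be pictured as a cosine-similarity ranking over the embedded training examples. The sketch below uses tiny hand-written 2-D vectors in place of real embeddings, and `nearest_examples` is a hypothetical helper, not the optimizer's API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_examples(query_vec, train_vecs, train_examples, k=2):
    """Return the k training examples whose vectors are most similar to the query."""
    ranked = sorted(
        range(len(train_vecs)),
        key=lambda i: cosine(query_vec, train_vecs[i]),
        reverse=True,
    )
    return [train_examples[i] for i in ranked[:k]]


# Made-up vectors and tweets standing in for embedded training data.
toy_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
toy_examples = ["great day!", "love this", "meeting at 3pm", "worst service ever"]

neighbors = nearest_examples([1.0, 0.05], toy_vecs, toy_examples, k=2)
print(neighbors)
```

Because the neighbors depend on the input, the demos are chosen again for every call. Per the optimizer description earlier, those neighbors then feed a small BootstrapFewShot run, which makes each prediction slower than with a statically compiled program.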

In [71]:
knn_scores = []
for x in twitter_test:
    pred = knn_twt_sentiment(**x.inputs())
    score = validate_answer(x, pred)
    knn_scores.append(score)

 80%|████████████████████████████████████         | 4/5 [00:07<00:01,  1.96s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:05<00:01,  1.38s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:04<00:01,  1.23s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.00it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.17it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.26it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.08it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.18it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.24it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:02<00:00,  1.35it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.30it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.30it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.18it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.18it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.19it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.30it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.13it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.17it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.11it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.08it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.30it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.16it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.18it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:02<00:00,  1.42it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:02<00:00,  1.35it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.25it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.18it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.27it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.18it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.28it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.30it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.29it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:05<00:01,  1.40s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.17it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.09it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.30it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.30it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:04<00:01,  1.23s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.18it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.08it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.19it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.18it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:02<00:00,  1.43it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:02<00:00,  1.45it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.30it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.30it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.18it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.08it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:02<00:00,  1.64it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:05<00:01,  1.46s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.18it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:04<00:01,  1.08s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.18it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.18it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.30it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.30it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.30it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:05<00:01,  1.48s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.19it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:02<00:00,  1.57it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.22it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:02<00:00,  1.38it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:04<00:01,  1.14s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.30it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.27it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.18it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.30it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.30it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.31it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.18it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.30it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.19it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:02<00:00,  1.40it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.08it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:04<00:01,  1.00s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.30it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.18it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.08it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:02<00:00,  1.45it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.08it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.17it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.26it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.30it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.30it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.30it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.18it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.13it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.




In [72]:
knn_accuracy = knn_scores.count(True) / len(knn_scores)
print("KNN Few Shot Accuracy: ", knn_accuracy)

KNN Few Shot Accuracy:  0.71


In [73]:
knn_twt_sentiment.save("./optimized/knn_twt_sentiment.json")

In [74]:
print(knn_twt_sentiment(tweet=example_tweet).sentiment)

 80%|████████████████████████████████████         | 4/5 [00:03<00:00,  1.30it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
positive


## Instruction Optimization

<img src="./Media/auto_instr.png" width="300">

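Unlike the few-shot optimizers covered earlier, these optimizers do not select demonstrations.
Instead, they improve the actual instructions and the prompt itself that are passed to the model,
with the goal of raising zero-shot performance.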

### COPRO (Coordinate Prompt Optimization)

<img src="./Media/copro_diagram.png">

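COPRO generates new instructions for each step and refines them iteratively.
It optimizes them with coordinate ascent, that is, hill climbing guided by the metric function and the training dataset.

One of this optimizer's main parameters is **depth**, the number of rounds of prompt improvement (optimization iterations) it performs.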
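The breadth/depth search described above can be illustrated with a small, library-free sketch. This is not DSPy's COPRO implementation: `hill_climb`, `toy_propose`, `toy_score`, and `HINTS` are hypothetical names made up for illustration. In the real optimizer, the proposal step is an LLM call and the score comes from running the metric over the training set.

```python
import random


def hill_climb(seed, propose, score, breadth=10, depth=3):
    """Toy hill climbing over instructions: propose `breadth` candidates per
    round, for `depth` rounds, and keep whichever candidate scores highest."""
    best, best_score = seed, score(seed)
    history = [(seed, best_score)]  # every instruction tried so far, with its score
    for _ in range(depth):  # one improvement round per depth level
        # Candidates are proposed from the results of the rounds completed so far.
        candidates = [propose(best, history) for _ in range(breadth)]
        for cand in candidates:
            cand_score = score(cand)
            history.append((cand, cand_score))
            if cand_score > best_score:  # accept only strict improvements
                best, best_score = cand, cand_score
    return best, best_score, history


# Hypothetical stand-ins for the LLM proposer and the dataset metric.
HINTS = ["Be concise.", "Answer with one label.", "Consider sarcasm.", "Think step by step."]
rng = random.Random(0)


def toy_propose(current, history):
    # Assumption: a "proposal" just appends a random hint to the current best.
    return current + " " + rng.choice(HINTS)


def toy_score(instruction):
    # Assumption: the "metric" counts how many distinct hints are present.
    return sum(hint in instruction for hint in HINTS)


seed = "Classify the tweet's sentiment."
best, best_score, history = hill_climb(seed, toy_propose, toy_score, breadth=4, depth=3)
print(best_score, len(history))
```

With `breadth=4` and `depth=3`, the sketch evaluates the seed plus 12 candidates. Raising either value increases the number of evaluations roughly linearly, which is why both parameters directly affect optimization cost.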

In [76]:
from dspy.teleprompt import COPRO

copro_optimizer = COPRO(
    metric=validate_answer,                  # Metric to Optimize Against
    prompt_model=dspy.LM('openai/gpt-4o'),   # Different Model for Prompt Generation
    breadth=10,                              # New Prompts per iteration
    depth=3,                                 # Number of improvement rounds
    init_temperature=1.4                     # Creativity in generation
)
copro_twt_sentiment = copro_optimizer.compile(base_twt_sentiment, trainset=twitter_train, eval_kwargs={'num_threads':6, 'display_progress':True})

2026/01/25 17:34:05 INFO dspy.teleprompt.copro_optimizer: Iteration Depth: 1/3.
2026/01/25 17:34:05 INFO dspy.teleprompt.copro_optimizer: At Depth 1/3, Evaluating Prompt Candidate #1/10 for Predictor 1 of 1.






[2026-01-25T17:34:05.272980]

System message:

Your input fields are:
1. `basic_instruction` (str): The initial instructions before optimization
Your output fields are:
1. `proposed_instruction` (str): The improved instructions for the language model
2. `proposed_prefix_for_output_field` (str): The string at the end of the prompt, which will help the model start solving the task
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## basic_instruction ## ]]
{basic_instruction}

[[ ## proposed_instruction ## ]]
{proposed_instruction}

[[ ## proposed_prefix_for_output_field ## ]]
{proposed_prefix_for_output_field}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        You are an instruction optimizer for large language models. I will give you a ``signature`` of fields (inputs and outputs) in English. Your task is to propose an instruction that will lead a good language model to perform the ta

2026/01/25 17:34:19 INFO dspy.evaluate.evaluate: Average Metric: 67 / 100 (67.0%)
2026/01/25 17:34:19 INFO dspy.teleprompt.copro_optimizer: At Depth 1/3, Evaluating Prompt Candidate #2/10 for Predictor 1 of 1.








2026/01/25 17:34:34 INFO dspy.evaluate.evaluate: Average Metric: 65 / 100 (65.0%)
2026/01/25 17:34:34 INFO dspy.teleprompt.copro_optimizer: At Depth 1/3, Evaluating Prompt Candidate #3/10 for Predictor 1 of 1.








2026/01/25 17:34:52 INFO dspy.evaluate.evaluate: Average Metric: 69 / 100 (69.0%)
2026/01/25 17:34:52 INFO dspy.teleprompt.copro_optimizer: At Depth 1/3, Evaluating Prompt Candidate #4/10 for Predictor 1 of 1.








2026/01/25 17:35:08 INFO dspy.evaluate.evaluate: Average Metric: 65 / 100 (65.0%)
2026/01/25 17:35:08 INFO dspy.teleprompt.copro_optimizer: At Depth 1/3, Evaluating Prompt Candidate #5/10 for Predictor 1 of 1.








2026/01/25 17:35:25 INFO dspy.evaluate.evaluate: Average Metric: 67 / 100 (67.0%)
2026/01/25 17:35:25 INFO dspy.teleprompt.copro_optimizer: At Depth 1/3, Evaluating Prompt Candidate #6/10 for Predictor 1 of 1.








2026/01/25 17:35:39 INFO dspy.evaluate.evaluate: Average Metric: 66 / 100 (66.0%)
2026/01/25 17:35:39 INFO dspy.teleprompt.copro_optimizer: At Depth 1/3, Evaluating Prompt Candidate #7/10 for Predictor 1 of 1.








2026/01/25 17:35:56 INFO dspy.evaluate.evaluate: Average Metric: 68 / 100 (68.0%)
2026/01/25 17:35:56 INFO dspy.teleprompt.copro_optimizer: At Depth 1/3, Evaluating Prompt Candidate #8/10 for Predictor 1 of 1.








2026/01/25 17:36:11 INFO dspy.evaluate.evaluate: Average Metric: 64 / 100 (64.0%)
2026/01/25 17:36:11 INFO dspy.teleprompt.copro_optimizer: At Depth 1/3, Evaluating Prompt Candidate #9/10 for Predictor 1 of 1.








2026/01/25 17:36:26 INFO dspy.evaluate.evaluate: Average Metric: 66 / 100 (66.0%)
2026/01/25 17:36:26 INFO dspy.teleprompt.copro_optimizer: At Depth 1/3, Evaluating Prompt Candidate #10/10 for Predictor 1 of 1.








2026/01/25 17:36:27 INFO dspy.evaluate.evaluate: Average Metric: 67 / 100 (67.0%)








2026/01/25 17:36:30 INFO dspy.teleprompt.copro_optimizer: Iteration Depth: 2/3.
2026/01/25 17:36:30 INFO dspy.teleprompt.copro_optimizer: At Depth 2/3, Evaluating Prompt Candidate #1/10 for Predictor 1 of 1.


Average Metric: 64.00 / 100 (64.0%): 100%|████| 100/100 [00:13<00:00,  7.38it/s]

2026/01/25 17:36:43 INFO dspy.evaluate.evaluate: Average Metric: 64 / 100 (64.0%)
2026/01/25 17:36:43 INFO dspy.teleprompt.copro_optimizer: At Depth 2/3, Evaluating Prompt Candidate #2/10 for Predictor 1 of 1.

[2026-01-25T17:36:30.276702]

System message:

Your input fields are:
1. `attempted_instructions` (str):
Your output fields are:
1. `proposed_instruction` (str): The improved instructions for the language model
2. `proposed_prefix_for_output_field` (str): The string at the end of the prompt, which will help the model start solving the task
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## attempted_instructions ## ]]
{attempted_instructions}

[[ ## proposed_instruction ## ]]
{proposed_instruction}

[[ ## proposed_prefix_for_output_field ## ]]
{proposed_prefix_for_output_field}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        You are an instruction optimizer for large language models. I will give some task instructions I've tried, along with their corresponding validation scores. The instructions are arranged in increasing order based on their scores, where higher scores indicat

2026/01/25 17:37:00 INFO dspy.evaluate.evaluate: Average Metric: 66 / 100 (66.0%)
2026/01/25 17:37:00 INFO dspy.teleprompt.copro_optimizer: At Depth 2/3, Evaluating Prompt Candidate #3/10 for Predictor 1 of 1.

2026/01/25 17:37:18 INFO dspy.evaluate.evaluate: Average Metric: 68 / 100 (68.0%)
2026/01/25 17:37:18 INFO dspy.teleprompt.copro_optimizer: At Depth 2/3, Evaluating Prompt Candidate #4/10 for Predictor 1 of 1.

2026/01/25 17:37:38 INFO dspy.evaluate.evaluate: Average Metric: 66 / 100 (66.0%)
2026/01/25 17:37:38 INFO dspy.teleprompt.copro_optimizer: At Depth 2/3, Evaluating Prompt Candidate #5/10 for Predictor 1 of 1.

2026/01/25 17:37:54 INFO dspy.evaluate.evaluate: Average Metric: 66 / 100 (66.0%)
2026/01/25 17:37:54 INFO dspy.teleprompt.copro_optimizer: At Depth 2/3, Evaluating Prompt Candidate #6/10 for Predictor 1 of 1.

2026/01/25 17:38:08 INFO dspy.evaluate.evaluate: Average Metric: 67 / 100 (67.0%)
2026/01/25 17:38:08 INFO dspy.teleprompt.copro_optimizer: At Depth 2/3, Evaluating Prompt Candidate #7/10 for Predictor 1 of 1.

2026/01/25 17:38:23 INFO dspy.evaluate.evaluate: Average Metric: 65 / 100 (65.0%)
2026/01/25 17:38:23 INFO dspy.teleprompt.copro_optimizer: At Depth 2/3, Evaluating Prompt Candidate #8/10 for Predictor 1 of 1.

2026/01/25 17:38:38 INFO dspy.evaluate.evaluate: Average Metric: 65 / 100 (65.0%)
2026/01/25 17:38:38 INFO dspy.teleprompt.copro_optimizer: At Depth 2/3, Evaluating Prompt Candidate #9/10 for Predictor 1 of 1.

2026/01/25 17:38:56 INFO dspy.evaluate.evaluate: Average Metric: 67 / 100 (67.0%)
2026/01/25 17:38:56 INFO dspy.teleprompt.copro_optimizer: At Depth 2/3, Evaluating Prompt Candidate #10/10 for Predictor 1 of 1.

2026/01/25 17:39:13 INFO dspy.evaluate.evaluate: Average Metric: 65 / 100 (65.0%)

2026/01/25 17:39:17 INFO dspy.teleprompt.copro_optimizer: Iteration Depth: 3/3.
2026/01/25 17:39:17 INFO dspy.teleprompt.copro_optimizer: At Depth 3/3, Evaluating Prompt Candidate #1/10 for Predictor 1 of 1.


Average Metric: 67.00 / 100 (67.0%): 100%|████| 100/100 [00:19<00:00,  5.26it/s]

2026/01/25 17:39:36 INFO dspy.evaluate.evaluate: Average Metric: 67 / 100 (67.0%)
2026/01/25 17:39:36 INFO dspy.teleprompt.copro_optimizer: At Depth 3/3, Evaluating Prompt Candidate #2/10 for Predictor 1 of 1.

[2026-01-25T17:39:17.075249]

System message:

Your input fields are:
1. `attempted_instructions` (str):
Your output fields are:
1. `proposed_instruction` (str): The improved instructions for the language model
2. `proposed_prefix_for_output_field` (str): The string at the end of the prompt, which will help the model start solving the task
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## attempted_instructions ## ]]
{attempted_instructions}

[[ ## proposed_instruction ## ]]
{proposed_instruction}

[[ ## proposed_prefix_for_output_field ## ]]
{proposed_prefix_for_output_field}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        You are an instruction optimizer for large language models. I will give some task instructions I've tried, along with their corresponding validation scores. The instructions are arranged in increasing order based on their scores, where higher scores indicat

2026/01/25 17:40:11 INFO dspy.evaluate.evaluate: Average Metric: 66 / 100 (66.0%)
2026/01/25 17:40:11 INFO dspy.teleprompt.copro_optimizer: At Depth 3/3, Evaluating Prompt Candidate #3/10 for Predictor 1 of 1.

2026/01/25 17:40:26 INFO dspy.evaluate.evaluate: Average Metric: 71 / 100 (71.0%)
2026/01/25 17:40:26 INFO dspy.teleprompt.copro_optimizer: At Depth 3/3, Evaluating Prompt Candidate #4/10 for Predictor 1 of 1.

2026/01/25 17:40:49 INFO dspy.evaluate.evaluate: Average Metric: 63 / 100 (63.0%)
2026/01/25 17:40:49 INFO dspy.teleprompt.copro_optimizer: At Depth 3/3, Evaluating Prompt Candidate #5/10 for Predictor 1 of 1.

2026/01/25 17:41:05 INFO dspy.evaluate.evaluate: Average Metric: 66 / 100 (66.0%)
2026/01/25 17:41:05 INFO dspy.teleprompt.copro_optimizer: At Depth 3/3, Evaluating Prompt Candidate #6/10 for Predictor 1 of 1.

2026/01/25 17:41:23 INFO dspy.evaluate.evaluate: Average Metric: 64 / 100 (64.0%)
2026/01/25 17:41:23 INFO dspy.teleprompt.copro_optimizer: At Depth 3/3, Evaluating Prompt Candidate #7/10 for Predictor 1 of 1.

2026/01/25 17:41:42 INFO dspy.evaluate.evaluate: Average Metric: 65 / 100 (65.0%)
2026/01/25 17:41:42 INFO dspy.teleprompt.copro_optimizer: At Depth 3/3, Evaluating Prompt Candidate #8/10 for Predictor 1 of 1.

2026/01/25 17:41:54 INFO dspy.evaluate.evaluate: Average Metric: 60 / 100 (60.0%)
2026/01/25 17:41:54 INFO dspy.teleprompt.copro_optimizer: At Depth 3/3, Evaluating Prompt Candidate #9/10 for Predictor 1 of 1.

2026/01/25 17:42:09 INFO dspy.evaluate.evaluate: Average Metric: 65 / 100 (65.0%)
2026/01/25 17:42:09 INFO dspy.teleprompt.copro_optimizer: At Depth 3/3, Evaluating Prompt Candidate #10/10 for Predictor 1 of 1.

2026/01/25 17:42:26 INFO dspy.evaluate.evaluate: Average Metric: 67 / 100 (67.0%)

In [77]:
copro_scores = []
for x in twitter_test:
    pred = copro_twt_sentiment(**x.inputs())
    score = validate_answer(x, pred)
    copro_scores.append(score)

copro_accuracy = copro_scores.count(True) / len(copro_scores)
print("COPRO Accuracy: ", copro_accuracy)

COPRO Accuracy:  0.73
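
The COPRO log above is long, and the per-candidate scores are easy to lose among the prompt dumps. The following is a minimal sketch, not part of DSPy, for pulling the `Average Metric` values out of such log text with a regular expression. The helper name `extract_scores` and the pattern `SCORE_LINE` are hypothetical; the three sample lines are copied from the depth-3 portion of the log.

```python
import re

# Matches the summary lines the evaluator logs, e.g.
# "... INFO dspy.evaluate.evaluate: Average Metric: 66 / 100 (66.0%)"
SCORE_LINE = re.compile(
    r"dspy\.evaluate\.evaluate: Average Metric: (\d+(?:\.\d+)?) / (\d+)"
)


def extract_scores(log_text: str) -> list[float]:
    """Return every logged score as a percentage, in log order."""
    scores = []
    for line in log_text.splitlines():
        match = SCORE_LINE.search(line)
        if match:
            correct = float(match.group(1))
            total = int(match.group(2))
            scores.append(100.0 * correct / total)
    return scores


# Three evaluator lines copied from the log above (depth 3/3).
sample_log = """\
2026/01/25 17:40:11 INFO dspy.evaluate.evaluate: Average Metric: 66 / 100 (66.0%)
2026/01/25 17:40:26 INFO dspy.evaluate.evaluate: Average Metric: 71 / 100 (71.0%)
2026/01/25 17:40:49 INFO dspy.evaluate.evaluate: Average Metric: 63 / 100 (63.0%)
"""

scores = extract_scores(sample_log)
print("scores:", scores)
print("best score in sample:", max(scores))
```

Run over the full log, this makes it easy to see that the validation scores during the search stay in a narrow band, so a single best candidate should not be over-interpreted.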


In [78]:
copro_twt_sentiment.save("./optimized/copro_twt_sentiment.json")

In [79]:
print(copro_twt_sentiment(tweet=example_tweet).sentiment)

positive


### MIPROv2 (Multiprompt Instruction Proposal Optimizer Version 2)

<img src="./Media/mipro_diagram.png">

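At each step, MIPROv2 generates both instructions and few-shot examples.
Instruction generation is data-aware and demonstration-aware: the proposals take the dataset and the bootstrapped demonstrations into account.

It then uses **Bayesian Optimization** to search the space of instruction and example combinations across all modules efficiently,
and automatically selects the configuration that scores best on the given metric.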
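
To make the search loop concrete, here is a toy sketch of the idea: pick an (instruction, few-shot set) pair, score it on a minibatch, and periodically run a full evaluation on the pair with the best minibatch average. It is not DSPy's implementation. Proposals are drawn uniformly at random instead of by a Bayesian optimizer, and `simulated_metric`, `evaluate`, and `search` are hypothetical names with a made-up scoring rule. The sizes (3 instructions, 6 few-shot sets, minibatch of 35, validation set of 100) mirror the values printed in the log below.

```python
import itertools
import random
from collections import defaultdict
from statistics import mean

NUM_INSTRUCTIONS = 3    # "num_instruct_candidates: 3" in the log
NUM_FEWSHOT_SETS = 6    # "num_fewshot_candidates: 6" in the log
MINIBATCH_SIZE = 35
VALSET_SIZE = 100


def simulated_metric(instruction: int, fewshot: int, example: int) -> bool:
    """Hypothetical stand-in for running the program and metric on one example."""
    quality = 0.60 + 0.05 * instruction + 0.02 * fewshot  # made-up ground truth
    seed = instruction * 10_000 + fewshot * 1_000 + example
    return random.Random(seed).random() < quality


def evaluate(config: tuple[int, int], examples: list[int]) -> float:
    """Percentage of examples the configuration gets right."""
    instruction, fewshot = config
    passed = sum(simulated_metric(instruction, fewshot, ex) for ex in examples)
    return 100.0 * passed / len(examples)


def search(num_trials: int = 10, full_eval_every: int = 5, seed: int = 0):
    rng = random.Random(seed)
    configs = list(
        itertools.product(range(NUM_INSTRUCTIONS), range(NUM_FEWSHOT_SETS))
    )
    valset = list(range(VALSET_SIZE))
    minibatch_scores = defaultdict(list)  # config -> minibatch scores
    full_scores = {}                      # config -> full-validation score

    for trial in range(1, num_trials + 1):
        # Random proposal; a Bayesian optimizer would use past scores here.
        config = rng.choice(configs)
        batch = rng.sample(valset, MINIBATCH_SIZE)
        minibatch_scores[config].append(evaluate(config, batch))

        if trial % full_eval_every == 0:
            # Promote the best-averaging config that has no full score yet.
            pending = [c for c in minibatch_scores if c not in full_scores]
            if pending:
                top = max(pending, key=lambda c: mean(minibatch_scores[c]))
                full_scores[top] = evaluate(top, valset)

    best = max(full_scores, key=full_scores.get)
    return best, full_scores[best]


best_config, best_score = search()
print(f"best (instruction, few-shot set): {best_config}, full score: {best_score:.1f}")
```

Minibatch scores are cheap but noisy, which is why only full-evaluation scores decide the final winner in this sketch.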

In [84]:
from dspy.teleprompt import MIPROv2

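mipro_optimizer = MIPROv2(
    metric=validate_answer,
    prompt_model=dspy.LM('openai/gpt-4o'),  # a different model, used only to propose instructions
    # num_candidates=10,                    # number of instruction candidates to try
)

# Note: twitter_test doubles as the validation set here, so accuracy later
# measured on twitter_test will be optimistic.
mipro_twt_sentiment = mipro_optimizer.compile(base_twt_sentiment, trainset=twitter_train, valset=twitter_test)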

2026/01/25 17:50:29 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING LIGHT AUTO RUN SETTINGS:
num_trials: 10
minibatch: True
num_fewshot_candidates: 6
num_instruct_candidates: 3
valset size: 100

2026/01/25 17:50:29 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2026/01/25 17:50:29 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2026/01/25 17:50:29 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=6 sets of demonstrations...


Bootstrapping set 1/6
Bootstrapping set 2/6
Bootstrapping set 3/6


  4%|█▋                                         | 4/100 [00:03<01:21,  1.18it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 4/6


  5%|██▏                                        | 5/100 [00:03<01:08,  1.38it/s]


Bootstrapped 3 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.
Bootstrapping set 5/6


  1%|▍                                          | 1/100 [00:00<01:14,  1.32it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 6/6


  1%|▍                                          | 1/100 [00:00<01:15,  1.31it/s]
2026/01/25 17:50:38 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2026/01/25 17:50:38 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.


2026/01/25 17:51:27 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=3 instructions...

2026/01/25 17:51:45 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2026/01/25 17:51:45 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Given the fields `tweet`, produce the fields `sentiment`.

2026/01/25 17:51:45 INFO dspy.teleprompt.mipro_optimizer_v2: 1: Imagine you are tasked with moderating a social media platform for a large event attended by thousands of users. The goal is to quickly identify tweets that might indicate user dissatisfaction or negativity, which could alert you to potential issues during the event. You have access to tweets being posted in real-time. Your job is to process each incoming tweet and classify it as either "neutral" or "negative" based on the sentiment expressed in the text. Your analysis will guide immediate responses from event coordinators to ensure a positive experience for all attendees. Use the `Predict` module to achieve thi

Average Metric: 79.00 / 100 (79.0%): 100%|███| 100/100 [00:00<00:00, 192.97it/s]

2026/01/25 17:51:46 INFO dspy.evaluate.evaluate: Average Metric: 79 / 100 (79.0%)
2026/01/25 17:51:46 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 79.0

2026/01/25 17:51:46 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 13 - Minibatch ==



Average Metric: 25.00 / 35 (71.4%): 100%|███████| 35/35 [00:04<00:00,  8.25it/s]

2026/01/25 17:51:51 INFO dspy.evaluate.evaluate: Average Metric: 25 / 35 (71.4%)
2026/01/25 17:51:51 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 71.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 3'].
2026/01/25 17:51:51 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [71.43]
2026/01/25 17:51:51 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [79.0]
2026/01/25 17:51:51 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 79.0


2026/01/25 17:51:51 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 13 - Minibatch ==



Average Metric: 26.00 / 35 (74.3%): 100%|███████| 35/35 [00:03<00:00,  8.80it/s]

2026/01/25 17:51:55 INFO dspy.evaluate.evaluate: Average Metric: 26 / 35 (74.3%)
2026/01/25 17:51:55 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 74.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 0'].
2026/01/25 17:51:55 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [71.43, 74.29]
2026/01/25 17:51:55 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [79.0]
2026/01/25 17:51:55 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 79.0


2026/01/25 17:51:55 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 13 - Minibatch ==



Average Metric: 23.00 / 35 (65.7%): 100%|███████| 35/35 [00:04<00:00,  7.86it/s]

2026/01/25 17:51:59 INFO dspy.evaluate.evaluate: Average Metric: 23 / 35 (65.7%)
2026/01/25 17:51:59 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 65.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 5'].
2026/01/25 17:51:59 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [71.43, 74.29, 65.71]
2026/01/25 17:51:59 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [79.0]
2026/01/25 17:51:59 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 79.0


2026/01/25 17:51:59 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 13 - Minibatch ==



Average Metric: 26.00 / 35 (74.3%): 100%|███████| 35/35 [00:04<00:00,  7.95it/s]

2026/01/25 17:52:04 INFO dspy.evaluate.evaluate: Average Metric: 26 / 35 (74.3%)
2026/01/25 17:52:04 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 74.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 2'].
2026/01/25 17:52:04 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [71.43, 74.29, 65.71, 74.29]
2026/01/25 17:52:04 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [79.0]
2026/01/25 17:52:04 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 79.0


2026/01/25 17:52:04 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 13 - Minibatch ==



Average Metric: 25.00 / 35 (71.4%): 100%|███████| 35/35 [00:04<00:00,  8.38it/s]

2026/01/25 17:52:08 INFO dspy.evaluate.evaluate: Average Metric: 25 / 35 (71.4%)
2026/01/25 17:52:08 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 71.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 5'].
2026/01/25 17:52:08 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [71.43, 74.29, 65.71, 74.29, 71.43]
2026/01/25 17:52:08 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [79.0]
2026/01/25 17:52:08 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 79.0


2026/01/25 17:52:08 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 13 - Full Evaluation =====
2026/01/25 17:52:08 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 74.29) from minibatch trials...



Average Metric: 74.00 / 100 (74.0%): 100%|████| 100/100 [00:06<00:00, 14.84it/s]

2026/01/25 17:52:15 INFO dspy.evaluate.evaluate: Average Metric: 74 / 100 (74.0%)
2026/01/25 17:52:15 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [79.0, 74.0]





2026/01/25 17:52:15 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 79.0
2026/01/25 17:52:15 INFO dspy.teleprompt.mipro_optimizer_v2: 

2026/01/25 17:52:15 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 13 - Minibatch ==


Average Metric: 26.00 / 35 (74.3%): 100%|██████| 35/35 [00:00<00:00, 160.50it/s]

2026/01/25 17:52:15 INFO dspy.evaluate.evaluate: Average Metric: 26 / 35 (74.3%)
2026/01/25 17:52:15 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 74.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 0'].





2026/01/25 17:52:15 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [71.43, 74.29, 65.71, 74.29, 71.43, 74.29]
2026/01/25 17:52:15 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [79.0, 74.0]
2026/01/25 17:52:15 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 79.0


2026/01/25 17:52:15 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 9 / 13 - Minibatch ==


Average Metric: 24.00 / 35 (68.6%): 100%|███████| 35/35 [00:05<00:00,  6.87it/s]

2026/01/25 17:52:20 INFO dspy.evaluate.evaluate: Average Metric: 24 / 35 (68.6%)
2026/01/25 17:52:20 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 68.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5'].
2026/01/25 17:52:20 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [71.43, 74.29, 65.71, 74.29, 71.43, 74.29, 68.57]
2026/01/25 17:52:20 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [79.0, 74.0]
2026/01/25 17:52:20 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 79.0


2026/01/25 17:52:20 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 10 / 13 - Minibatch ==



Average Metric: 30.00 / 35 (85.7%): 100%|███████| 35/35 [00:03<00:00,  8.76it/s]

2026/01/25 17:52:25 INFO dspy.evaluate.evaluate: Average Metric: 30 / 35 (85.7%)
2026/01/25 17:52:25 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 85.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 4'].
2026/01/25 17:52:25 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [71.43, 74.29, 65.71, 74.29, 71.43, 74.29, 68.57, 85.71]
2026/01/25 17:52:25 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [79.0, 74.0]
2026/01/25 17:52:25 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 79.0


2026/01/25 17:52:25 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 11 / 13 - Minibatch ==



Average Metric: 23.00 / 35 (65.7%): 100%|███████| 35/35 [00:04<00:00,  7.44it/s]

2026/01/25 17:52:29 INFO dspy.evaluate.evaluate: Average Metric: 23 / 35 (65.7%)
2026/01/25 17:52:29 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 65.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 4'].
2026/01/25 17:52:29 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [71.43, 74.29, 65.71, 74.29, 71.43, 74.29, 68.57, 85.71, 65.71]
2026/01/25 17:52:29 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [79.0, 74.0]
2026/01/25 17:52:29 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 79.0


2026/01/25 17:52:29 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 12 / 13 - Minibatch ==



Average Metric: 30.00 / 35 (85.7%): 100%|██████| 35/35 [00:00<00:00, 136.80it/s]

2026/01/25 17:52:30 INFO dspy.evaluate.evaluate: Average Metric: 30 / 35 (85.7%)
2026/01/25 17:52:30 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 85.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 0'].
2026/01/25 17:52:30 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [71.43, 74.29, 65.71, 74.29, 71.43, 74.29, 68.57, 85.71, 65.71, 85.71]
2026/01/25 17:52:30 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [79.0, 74.0]
2026/01/25 17:52:30 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 79.0







2026/01/25 17:52:30 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 13 / 13 - Full Evaluation =====
2026/01/25 17:52:30 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 85.71) from minibatch trials...


Average Metric: 79.00 / 100 (79.0%): 100%|███| 100/100 [00:00<00:00, 140.59it/s]

2026/01/25 17:52:30 INFO dspy.evaluate.evaluate: Average Metric: 79 / 100 (79.0%)
2026/01/25 17:52:30 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [79.0, 74.0, 79.0]
2026/01/25 17:52:30 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 79.0





2026/01/25 17:52:30 INFO dspy.teleprompt.mipro_optimizer_v2: 

2026/01/25 17:52:30 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 79.0!


In [85]:
mipro_scores = []
for x in twitter_test:
    pred = mipro_twt_sentiment(**x.inputs())
    score = validate_answer(x, pred)
    mipro_scores.append(score)

mipro_accuracy = mipro_scores.count(True) / len(mipro_scores)
print("MIPRO Accuracy: ", mipro_accuracy)

MIPRO Accuracy:  0.755


In [86]:
mipro_twt_sentiment.save("./optimized/mipro_twt_sentiment.json")

In [87]:
print(mipro_twt_sentiment(tweet=example_tweet).sentiment)

positive


## Automatic Finetuning

<img src="./Media/auto_ft.png" width='300'>

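Once a program has been optimized well enough, the next step is to cut inference cost and latency while maintaining or extending its performance.
The ideal strategy is to learn the best behavior with a large, high-performing model and then transfer that knowledge to a **smaller, more efficient model (distillation)**. Alternatively, the existing model can be **trained further (continual training)**.

DSPy provides BootstrapFinetune to automate this process.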

### BootstrapFinetune

<img src="./Media/bootstrap_finetune_diagram.png">

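BootstrapFinetune creates a fine-tuned version of the language model from the results of successful program runs.
In this example, the best-performing program from MIPROv2 is baked directly into gpt-4o-mini.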

In [88]:
dspy.settings.experimental = True

#### Loading Additional Data

In [90]:
import json

# Formatting Examples
bsft_twitter_train = []
bsft_twitter_test = []
train_size = 500    # how many for train
test_size = 200     # how many for test

with open("./datasets/train.jsonl", 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        if i >= (train_size + test_size):
            break
        data = json.loads(line.strip())
        example = dspy.Example(
            tweet=data['text'],
            sentiment=data['label_text']
        ).with_inputs("tweet")

        if i < train_size:
            bsft_twitter_train.append(example)
        else:
            bsft_twitter_test.append(example)

#### Teacher and Student

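The core purpose of BootstrapFinetune is to use the best-optimized program to generate training data for fine-tuning a language model. This requires a teacher model, which generates examples across the whole dataset, and a student program that contains the target model to be fine-tuned on that data.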
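The data-generation idea described above can be sketched without DSPy at all. The snippet below is a minimal illustration under stated assumptions, not BootstrapFinetune's actual implementation: `teacher`, `exact_match`, and `bootstrap_training_data` are hypothetical names, the teacher is a toy keyword rule standing in for the optimized program, and the prompt/completion record layout is an assumption as well. It only shows the filtering step: a teacher prediction becomes a training record when the metric accepts it.

```python
from typing import Callable

Example = dict[str, str]  # assumed layout: {"tweet": ..., "sentiment": ...}


def teacher(tweet: str) -> str:
    """Toy stand-in for the optimized teacher program (hypothetical rule)."""
    lowered = tweet.lower()
    if "love" in lowered:
        return "positive"
    if "hate" in lowered:
        return "negative"
    return "neutral"


def exact_match(example: Example, prediction: str) -> bool:
    """Metric: the prediction must equal the gold label."""
    return example["sentiment"] == prediction


def bootstrap_training_data(
    examples: list[Example],
    program: Callable[[str], str],
    metric: Callable[[Example, str], bool],
) -> list[dict[str, str]]:
    """Run the teacher on every example and keep only metric-approved outputs."""
    kept = []
    for example in examples:
        prediction = program(example["tweet"])
        if metric(example, prediction):
            kept.append({"prompt": example["tweet"], "completion": prediction})
    return kept


trainset = [
    {"tweet": "I love this phone", "sentiment": "positive"},
    {"tweet": "I hate waiting in line", "sentiment": "negative"},
    {"tweet": "Love the new update, not", "sentiment": "negative"},  # teacher gets this wrong
]

finetune_data = bootstrap_training_data(trainset, teacher, exact_match)
print(len(finetune_data), "of", len(trainset), "examples kept")
```

Filtering with the metric means the student is fine-tuned only on outputs the teacher got right, which is why the training set can end up smaller than the data passed in.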

In [99]:
# First make a deep copy of your optimized MIPRO program as the teacher
teacher = mipro_twt_sentiment.deepcopy()

# Create student as a copy but with your target model
student = mipro_twt_sentiment.deepcopy()
student.set_lm(dspy.LM("openai/gpt-4o-mini-2024-07-18"))  # e.g., mistral or whatever model you want to fine-tune

In [None]:
from dspy.teleprompt import BootstrapFinetune

bsft_optimizer = BootstrapFinetune(
    metric=validate_answer,          # Used to filter training data
    num_threads=16                   # For parallel processing
)

bsft_twt_sentiment = bsft_optimizer.compile(
    student=student,
    trainset=bsft_twitter_train,
    teacher=teacher
)

2026/01/25 19:37:57 INFO dspy.teleprompt.bootstrap_finetune: Preparing the student and teacher programs...
2026/01/25 19:37:57 INFO dspy.teleprompt.bootstrap_finetune: Bootstrapping data...


Average Metric: 355.00 / 500 (71.0%): 100%|██| 500/500 [00:03<00:00, 149.47it/s]


2026/01/25 19:38:00 INFO dspy.evaluate.evaluate: Average Metric: 355 / 500 (71.0%)
2026/01/25 19:38:00 INFO dspy.teleprompt.bootstrap_finetune: Preparing the train data...
2026/01/25 19:38:00 INFO dspy.teleprompt.bootstrap_finetune: Collected data for 500 examples
2026/01/25 19:38:00 INFO dspy.teleprompt.bootstrap_finetune: After filtering with the metric, 355 examples remain
2026/01/25 19:38:00 INFO dspy.teleprompt.bootstrap_finetune: Using 355 data points for fine-tuning the model: openai/gpt-4o-mini-2024-07-18
2026/01/25 19:38:00 INFO dspy.teleprompt.bootstrap_finetune: Starting LM fine-tuning...
2026/01/25 19:38:00 INFO dspy.teleprompt.bootstrap_finetune: 1 fine-tuning job(s) to start
2026/01/25 19:38:01 INFO dspy.teleprompt.bootstrap_finetune: Starting 1 fine-tuning job(s)...
2026/01/25 19:38:01 INFO dspy.teleprompt.bootstrap_finetune: Calling lm.kill() on the LM to be fine-tuned to free up resources. This won't have any effect if the LM is not running.


[OpenAI Provider] Validating the data format
[OpenAI Provider] Saving the data to a file
[OpenAI Provider] Data saved to /Users/jun/.dspy_cache/finetune/3ae7bc380886d77d.jsonl
[OpenAI Provider] Uploading the data to the provider
[OpenAI Provider] Starting remote training
[OpenAI Provider] Job started with the OpenAI Job ID ftjob-Vllo9ZHQx82f4S3ahjl1Fwln
[OpenAI Provider] Waiting for training to complete
[OpenAI Provider] 2026-01-25 19:38:03 Validating training file: file-K5iyD9ydTPwByUrVMSgxvi


In [None]:
bsft_scores = []
for x in bsft_twitter_test:
    pred = bsft_twt_sentiment(**x.inputs())
    score = validate_answer(x, pred)
    bsft_scores.append(score)

bsft_accuracy = bsft_scores.count(True) / len(bsft_scores)
print("Bootstrap Fine Tune Accuracy: ", bsft_accuracy)

In [None]:
bsft_twt_sentiment.save("./optimized/bsft_twt_sentiment.pkl")

In [None]:
print(bsft_twt_sentiment(tweet=example_tweet).sentiment)

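## Choosing an Optimizer

**From the DSPy documentation:**

* If you have very few examples (around 10), start with **BootstrapFewShot**.
* If you have more data (50 examples or more), try **BootstrapFewShotWithRandomSearch**.
* If you only want to optimize instructions (that is, keep the prompt 0-shot), use **MIPROv2 configured for 0-shot optimization**.
* If you are willing to spend more inference calls on a longer optimization run (e.g., 40 trials or more) and have enough data (e.g., 200 examples or more) to prevent overfitting, use **MIPROv2**.
* If you have used one of these successfully with a large language model (e.g., 7B parameters or more) and need a very efficient program, use **BootstrapFinetune** to fine-tune a small language model for the task.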

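If you are not sure which to choose, use the **Ensemble compiler** to combine several optimized programs, then post-process their outputs with majority voting, weighted majority voting, or a similar scheme to get the final output.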
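The voting step can be sketched in plain Python. This is an illustration only and does not call DSPy's own Ensemble; `majority_vote`, `weighted_vote`, and `ensemble_predict` are hypothetical helpers, the three "programs" are toy functions standing in for optimized DSPy programs, and the weights are made-up numbers.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def majority_vote(labels: Sequence[str]) -> str:
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def weighted_vote(labels: Sequence[str], weights: Sequence[float]) -> str:
    """Sum the weight behind each label and return the heaviest one."""
    totals: dict[str, float] = {}
    for label, weight in zip(labels, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


def ensemble_predict(
    programs: Sequence[Callable[[str], str]],
    tweet: str,
    weights: Optional[Sequence[float]] = None,
) -> str:
    """Run every program on the same input and reduce the outputs to one label."""
    labels = [program(tweet) for program in programs]
    if weights is None:
        return majority_vote(labels)
    return weighted_vote(labels, weights)


# Toy stand-ins for three optimized programs (hypothetical).
programs = [
    lambda tweet: "positive",
    lambda tweet: "negative",
    lambda tweet: "positive",
]

plain = ensemble_predict(programs, "what a day")
weighted = ensemble_predict(programs, "what a day", weights=[0.2, 0.9, 0.3])
print(plain, weighted)
```

Weights could come from each program's held-out accuracy, for example, so that a single strong program can outvote two weaker ones, as the weighted call above shows.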

<img src="./Media/ensemble_diagram.png">

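**Re-optimizing an Optimized Program**

As emphasized earlier, running optimization just once is usually not enough.
You should iterate across metrics, programs, and the metrics inside those programs.

DSPy provides **BetterTogether**, a built-in optimizer that encourages this.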

<img src="./Media/better_together.png" width=400>

To check whether it actually makes a difference, though, we will do it by hand here.
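Comparing programs by hand only means something if every program is scored on the same held-out examples with the same metric. The scoring loop repeated throughout this notebook can be factored into one helper, sketched below under assumptions: `accuracy`, `baseline`, and `keyword_program` are hypothetical names, examples are plain dicts instead of `dspy.Example` objects, and the toy programs take the tweet text directly instead of `**x.inputs()`.

```python
from typing import Callable

Example = dict[str, str]  # assumed layout: {"tweet": ..., "sentiment": ...}


def exact_match(example: Example, prediction: str) -> bool:
    """Metric: the prediction must equal the gold label."""
    return example["sentiment"] == prediction


def accuracy(
    program: Callable[[str], str],
    examples: list[Example],
    metric: Callable[[Example, str], bool],
) -> float:
    """Fraction of examples for which the metric accepts the program's output."""
    if not examples:
        raise ValueError("cannot score a program on an empty example list")
    hits = sum(1 for ex in examples if metric(ex, program(ex["tweet"])))
    return hits / len(examples)


def baseline(tweet: str) -> str:
    """Toy program that always answers 'neutral' (hypothetical)."""
    return "neutral"


def keyword_program(tweet: str) -> str:
    """Toy program using two keywords (hypothetical)."""
    lowered = tweet.lower()
    if "great" in lowered:
        return "positive"
    if "awful" in lowered:
        return "negative"
    return "neutral"


unseen = [
    {"tweet": "great day", "sentiment": "positive"},
    {"tweet": "awful service", "sentiment": "negative"},
    {"tweet": "great, another delay", "sentiment": "negative"},
    {"tweet": "it is fine", "sentiment": "neutral"},
]

scores = {
    "baseline": accuracy(baseline, unseen, exact_match),
    "keyword": accuracy(keyword_program, unseen, exact_match),
}
print(scores)
```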

#### **Getting Unseen Data**


In [97]:
import json
# Formatting Examples
final_twitter_train = []
final_twitter_test = []
train_size = 300  # how many for train
test_size = 300   # how many for test
start_row = 1500  # start reading from this row

with open("./datasets/train.jsonl", 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        # Skip until we reach start_row
        if i < start_row:
            continue
        # Adjust the index for our collection logic
        collection_index = i - start_row

        if collection_index >= (train_size + test_size):
            break

        data = json.loads(line.strip())
        example = dspy.Example(
            tweet=data['text'],
            sentiment=data['label_text']
        ).with_inputs("tweet")
        if collection_index < train_size:
            final_twitter_train.append(example)
        else:
            final_twitter_test.append(example)

#### **Optimizing the Fine-Tuned Program with MIPROv2**

In [98]:
mipro_optimizer = MIPROv2(
    metric=validate_answer,
    prompt_model=dspy.LM('openai/gpt-4o'),  # Different Model for Prompt Generation
    num_candidates=10,                      # Instructions to try
)

mipro_bsft_twt_sentiment = mipro_optimizer.compile(bsft_twt_sentiment, trainset=final_twitter_train, valset=final_twitter_test)

NameError: name 'bsft_twt_sentiment' is not defined

In [None]:
final_scores = []
for x in final_twitter_test:
    pred = mipro_bsft_twt_sentiment(**x.inputs())
    score = validate_answer(x, pred)
    final_scores.append(score)

mipro_bsft_accuracy = final_scores.count(True) / len(final_scores)
print("MIPROv2 After Bootstrap Fine Tune Accuracy: ", mipro_bsft_accuracy)

In [None]:
mipro_bsft_twt_sentiment.save("./optimized/mipro_bsft_twt_sentiment.pkl")

In [None]:
print(mipro_bsft_twt_sentiment(tweet=example_tweet).sentiment)

---

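## **Final Thoughts**

Since this notebook is essentially a code-centered exploration of the official DSPy documentation, be sure to check the official docs. The DSPy team keeps updating a range of tutorials and guides as part of its latest release (as of December 2024).

Overall, DSPy offers an interesting approach to applying language models in programs. Instead of relying on prompt trial and error, it brings rigor by centering on clear metric definitions and optimization. Rather than having you work directly with text strings that are hard to interpret or tune, it provides clean base templates that can be further optimized algorithmically: it can adjust or generate few-shot examples automatically, rewrite the instructions passed to the LLM, or combine the two.

Inspired by deep learning frameworks, DSPy provides a powerful way to reliably optimize and iterate on LLM applications in a systematic, controlled manner. The ecosystem is also growing by the day. Don't forget to give the DSPy repository a star.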
