<a href="https://colab.research.google.com/github/parthasarathydNU/gen-ai-coursework/blob/main/advanced-llms/CourseWork/INFO_7374_Lecture_9_LLM_Programs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM Programs

https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/

The focus is shifting from models to compound systems with multiple components.

Examples:
- AlphaCode 2: state-of-the-art results in programming through a carefully engineered system that uses LLMs to generate up to 1 million possible solutions for a task and then filter down the set
- AlphaGeometry: combines an LLM with a traditional symbolic solver to tackle olympiad problems
- enterprise LLM applications: Databricks found that 60% of LLM applications use some form of [retrieval-augmented generation (RAG)](https://arxiv.org/pdf/2005.11401.pdf), and 30% use multi-step chains

This shift to compound systems opens many interesting design questions, but it is also exciting, because it means leading AI results can be achieved through clever engineering, not just scaling up training.

## Why Compound Systems?

Definition: a compound AI system is a system that tackles AI tasks using multiple interacting components.
- Reasons for using compound systems:
    1. Some tasks are easier to improve via system design than model scaling
    2. Systems can be dynamic and incorporate timely data
    3. Improving control and trust is easier with systems
    4. Performance goals vary widely and require system flexibility
- Compound systems match industry trends in other AI fields like self-driving cars
- Compound systems will likely remain a leading paradigm as models improve

![](https://i.imgur.com/ZoZDECO.png)

## AlphaCodium

https://arxiv.org/pdf/2401.08500.pdf

A key observation is that generating additional useful tests is easier than generating a correct code solution. Adding specific tests requires mainly understanding the problem, some insight, and basic brute-force or logical reasoning.

![](https://i.imgur.com/FLbbrzV.png)

![](https://i.imgur.com/Jg25n0W.png)

The pre-processing phase is a linear flow in which AlphaCodium reasons about the problem in natural language.

The code iterations phase is where AlphaCodium generates, runs, and fixes solution code against the public and AI-generated tests.

### Preprocessing Stage

1. Problem reflection: Describe the problem, in bullet points, while addressing the problem goals, inputs, outputs, rules, constraints, and other relevant details that appear in the problem description.


![](https://i.imgur.com/CinCRPK.png)

2. Public tests reasoning: Explain why each test input leads to the output.

3. Generate possible solutions: Generate a list of 2-3 possible solutions to the problem, described in natural language.

4. Rank solutions: Rank the possible solutions and choose the "best solution" in terms of correctness, simplicity, and robustness. This is not necessarily the "most efficient" solution.

5. Generate additional AI tests: Generate an additional 6-8 diverse input-output tests for the problem. Try to cover cases and aspects not covered by the original public tests.


### Code Iterations Phase
1. Generate an initial code solution: Choose a potential solution, then run on the public/AI tests. Select the one that has the highest pass rate.
2. Iterate on public tests: Start with the initial code solution and iteratively run it on the public tests. If any test fails, include the error message in the prompt and try to fix the code.
3. Iterate on AI-generated tests: Continue the run-fix iterations on the AI-generated tests.
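
The three steps above can be sketched as a small run-fix loop. This is a minimal illustration, not AlphaCodium's actual implementation: `propose_fix` is a hypothetical stand-in for the LLM call that receives the failing code plus error messages, and the toy task (squaring an integer), the test values, and `max_iters` are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

Test = Tuple[int, int]             # (input, expected output) for a toy one-argument task
Solution = Callable[[int], int]


def failing_tests(solution: Solution, tests: List[Test]) -> List[str]:
    """Run every test and return one error message per failure."""
    errors = []
    for x, expected in tests:
        try:
            got = solution(x)
        except Exception as exc:   # a crash counts as a failure
            errors.append(f"input={x}: raised {exc!r}")
            continue
        if got != expected:
            errors.append(f"input={x}: expected {expected}, got {got}")
    return errors


def pass_rate(solution: Solution, tests: List[Test]) -> float:
    """Fraction of tests the solution passes."""
    return 1.0 - len(failing_tests(solution, tests)) / len(tests)


def run_fix_loop(candidates, public_tests, ai_tests, propose_fix, max_iters=3):
    # Step 1: start from the candidate with the highest pass rate on all tests.
    all_tests = public_tests + ai_tests
    best = max(candidates, key=lambda s: pass_rate(s, all_tests))
    # Steps 2 and 3: iterate on the public tests first, then on the AI-generated tests.
    for tests in (public_tests, ai_tests):
        for _ in range(max_iters):
            errors = failing_tests(best, tests)
            if not errors:
                break
            best = propose_fix(best, errors)   # hypothetical LLM call
    return best


def toy_fix(solution: Solution, errors: List[str]) -> Solution:
    # Stand-in for the model: ignores the errors and returns a correct solution.
    return lambda x: x * x


fixed = run_fix_loop(
    candidates=[lambda x: x + 1, lambda x: 2 * x],
    public_tests=[(2, 4), (3, 9)],
    ai_tests=[(0, 0), (5, 25)],
    propose_fix=toy_fix,
)
print(fixed(7))  # 49
```

In a real system the error messages returned by `failing_tests` would be placed in the prompt so the model can see exactly which inputs failed.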

# Verifiers

Generating a single completion is a bit risky. Can we have a model *check* the result before giving the final result?

## Outcome Supervision

Outcome supervision provides feedback only on the final result.

## Let's Verify Step By Step

https://arxiv.org/abs/2305.20050 (May 2023)


- Process supervision: feedback for each intermediate reasoning step
- Process supervision significantly outperforms outcome supervision for training models to solve problems from the MATH dataset
- detecting and mitigating hallucinations is essential to improve reasoning capabilities
    - One effective method involves training reward models to discriminate between desirable and undesirable outputs. The reward model can then be used in an RL pipeline, or to perform search.
    - Resulting system is only as good as the reward model
- Process supervision vs outcome supervision
    - process supervision is easier for humans to interpret
    - Models trained with outcome supervision often use incorrect reasoning to reach the correct final answer
    - Process supervision can train a much more reliable reward model than outcome supervision
- Evaluation
    - Evaluate a reward model by its ability to perform best-of-N search over uniformly sampled solutions from the generator.
    - For each test problem, we select the solution ranked highest by the reward model, and grade it based on its final answer, and report the fraction that are correct. A reward model that is more reliable will select the correct solution more often
- Data Collection
    - Show a human labeler a step-by-step solution produced by the generator. For each step, the labeler assigns *positive*, *negative*, or *neutral*. A positive step is correct and reasonable. A negative step is incorrect or unreasonable. A neutral step is ambiguous.
    - 800k step-level labels across 75k solutions to 12k problems.
    - Prioritize solutions where the final answer is wrong but the current reward model rates the solution highly
    - ![](https://i.imgur.com/yrySkvD.png)
- Training
    - Train PRMs to predict the correctness of each step after the last token in each step. This prediction takes the form of a single token, and we maximize the log likelihood of these target tokens during training. The PRM can therefore be trained in a standard LM pipeline without special accommodations.
    - When comparing process supervision and outcome supervision, the authors deliberately choose to supervise only up to the first incorrect step to make the comparison more straightforward
- Solution Comparison
    - To compare multiple solutions, it is necessary to compute a single score for each solution. This is an important but straightforward detail: we define the PRM score for a solution to be the probability that every step is correct under the PRM. We implement this as the product of the correctness probabilities for each step
- Results
    - ![](https://i.imgur.com/MLjemN5.png)
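
The solution score and the best-of-N evaluation described above can be sketched in a few lines. This is an illustrative sketch only: the per-step probabilities, answers, and the `problems` structure are made-up toy data, and the function names are hypothetical rather than taken from the paper's code.

```python
import math
from typing import Dict, List, Sequence


def prm_solution_score(step_probs: Sequence[float]) -> float:
    """Probability that every step is correct: the product of per-step probabilities."""
    return math.prod(step_probs)


def best_of_n(solutions: List[Dict]) -> Dict:
    """Return the candidate solution that the reward model ranks highest."""
    return max(solutions, key=lambda s: prm_solution_score(s["step_probs"]))


def best_of_n_accuracy(problems: List[Dict]) -> float:
    """Fraction of problems whose top-ranked solution has the correct final answer."""
    correct = sum(
        best_of_n(p["solutions"])["final_answer"] == p["gold_answer"]
        for p in problems
    )
    return correct / len(problems)


# Toy data: two problems, two sampled solutions each (all numbers are invented).
problems = [
    {
        "gold_answer": "4",
        "solutions": [
            {"step_probs": [0.9, 0.9, 0.9], "final_answer": "4"},
            {"step_probs": [0.99, 0.2], "final_answer": "5"},
        ],
    },
    {
        "gold_answer": "8",
        "solutions": [
            {"step_probs": [0.6, 0.5], "final_answer": "7"},
            {"step_probs": [0.8, 0.3], "final_answer": "8"},
        ],
    },
]

print(best_of_n_accuracy(problems))  # 0.5
```

Because the score is a product, a single low-probability step drags down the whole solution, which is why the second candidate of the first problem loses despite its confident first step.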

## Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations (Feb 2024)

https://arxiv.org/abs/2312.08935

- Process reward model
- *automatically* constructed process wise supervision data
- Applications
    - Verification: can be used to rerank multiple outputs from LMs
    - RL: Can be used to train LM policy
- Inspired by MCTS
    - define the quality of an intermediate step as its potential to deduce the correct final answer.
    - by leveraging the correctness of the answer, we can automatically gather step wise supervision
    - for a given math problem with a golden answer and a step-by-step solution, to obtain the label of a specific step, we use a fine-tuned LLM to decode multiple subsequent reasoning paths from this step
    - we further validate whether the decoded final answer matches with the golden answer.
    - if a reasoning step can deduce more correct answers than another over multiple trials, it would be assigned a higher correctness score

- Task formulation:
    - Verification: given a problem $p$, sample $N$ candidate solutions. These candidates are then scored using a reward model, and the highest scoring solution is selected as the final answer. A better reward model elevates the likelihood of selecting the solution containing the correct answer.
    - RL: use the PRM to supervise LLMs with step by step PPO
- ORM: $$L_{ORM} = -\left(y_s \log r_s + (1 - y_s) \log (1 - r_s)\right)$$
- PRM: assigns a score to each reasoning step $$L_{PRM} = -\sum_{i=1}^{K} \left[ y_{s_i} \log r_{s_i} + (1 - y_{s_i}) \log (1 - r_{s_i}) \right]$$
    - in their experiments, the authors found little difference when using a neutral class
- Automatic PRM dataset collection
    - Defining the quality of reasoning step: *potential to deduce the correct answer*
    - To quantify and estimate the *potential* for a given reasoning step $s_i$, we use a LM (completer) to finalize $N$ subsequent reasoning processes from this step. Then we estimate the potential of this step based on the correctness of all decoded answers
    - ![there are some typos](https://i.imgur.com/PSGEgrX.png)
    - Estimation: From the completion results, how do we get the quality score for a step?
        - Hard estimation: 1 if any completion leads to the correct solution, 0 if no completion leads to the correct solution
        - Soft estimation: quality score is the proportion that leads to the correct solution
- RL: after training the PRM, use PPO in a step by step manner. Instead of using PPO with ORM which only offers a reward at the end of the response, step by step PPO has rewards at the end of each reasoning step
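
To make the hard and soft estimation rules and the PRM loss concrete, here is a small sketch. It assumes the completer's decoded final answers are already available as strings; the answers, labels, and scores below are invented toy values, and the function names are hypothetical.

```python
import math
from typing import List


def hard_estimate(completion_answers: List[str], gold: str) -> int:
    """1 if any completion reaches the gold answer, else 0."""
    return int(any(answer == gold for answer in completion_answers))


def soft_estimate(completion_answers: List[str], gold: str) -> float:
    """Proportion of completions that reach the gold answer."""
    hits = sum(answer == gold for answer in completion_answers)
    return hits / len(completion_answers)


def prm_loss(labels: List[float], scores: List[float]) -> float:
    """Binary cross-entropy summed over the K steps of one solution."""
    return -sum(
        y * math.log(r) + (1 - y) * math.log(1 - r)
        for y, r in zip(labels, scores)
    )


gold = "12"
# Final answers decoded from N = 4 completions started at each step (toy data).
step_1_answers = ["12", "12", "10", "12"]
step_2_answers = ["10", "9", "10", "11"]

print(hard_estimate(step_1_answers, gold), soft_estimate(step_1_answers, gold))  # 1 0.75
print(hard_estimate(step_2_answers, gold), soft_estimate(step_2_answers, gold))  # 0 0.0

# Train-time loss for one solution with hard labels [1, 0] and PRM scores [0.8, 0.3].
loss = prm_loss([1, 0], [0.8, 0.3])
```

Hard estimation gives binary labels that plug directly into the cross-entropy loss, while soft estimation keeps more information about how often a step leads to the correct answer.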

# DSPy


The framework for programming—not prompting—foundation models


Define a task, a metric, and a general architecture and DSPy will optimize it.

For example:

1. Task: sentiment classification
2. Metric: Accuracy
3. Architecture: Few shot with CoT

DSPy attempts to be the PyTorch for optimizing LLM Programs.

In PyTorch
1. Task: sentiment classification
2. Metric: Accuracy
3. Architecture: BERT/GPT/Number of transformer blocks/etc


PyTorch tunes model weights; DSPy can tune LM weights, prompts (instructions), and few-shot demonstrations.

In [None]:
%load_ext autoreload
%autoreload 2

import sys
import os

try: # When on google Colab, let's clone the notebook so we download the cache.
    import google.colab
    repo_path = 'dspy'
    !git -C $repo_path pull origin || git clone https://github.com/stanfordnlp/dspy $repo_path
except ImportError:  # not running on Colab
    repo_path = '.'

if repo_path not in sys.path:
    sys.path.append(repo_path)

# Set up the cache for this notebook
os.environ["DSP_NOTEBOOK_CACHEDIR"] = os.path.join(repo_path, 'cache')

import pkg_resources # Install the package if it's not installed
if "dspy-ai" not in {pkg.key for pkg in pkg_resources.working_set}:
    !pip install -U pip
    !pip install dspy-ai
    !pip install openai~=0.28.1
    # !pip install -e $repo_path

import dspy


In [None]:
from google.colab import userdata
import os
os.environ["TOGETHER_API_KEY"] = userdata.get('TOGETHER_API_KEY')

In [None]:
lm = dspy.OpenAI(
    model="Qwen/Qwen1.5-4B-Chat",
    max_tokens=400,
    model_type="chat",
    api_key=os.environ["TOGETHER_API_KEY"],
    api_base="https://api.together.xyz/v1",
)


In [None]:
lm("what is your name?")

['I am Qwen, a large language model created by Alibaba Cloud.']

In [None]:
colbertv2_wiki17_abstracts = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
dspy.settings.configure(lm=lm, rm=colbertv2_wiki17_abstracts)

## Signature

A signature is a textual specification of the input/output behavior of a DSPy module.

* This is a way for you to tell the LM what it needs to do, not how.


### Shorthand signatures

In [None]:
sentence = "it's a charming and often affecting journey."  # example from the SST-2 dataset.

classify = dspy.Predict('sentence -> sentiment')
classify(sentence=sentence).sentiment

'positive'

In [None]:
# Example from the XSum dataset.
document = """The 21-year-old made seven appearances for the Hammers and netted his only goal for them in a Europa League qualification round match against Andorran side FC Lustrains last season. Lee had two loan spells in League One last term, with Blackpool and then Colchester United. He scored twice for the U's but was unable to save them from relegation. The length of Lee's contract with the promoted Tykes has not been revealed. Find all the latest football transfers on our dedicated page."""

summarize = dspy.ChainOfThought('document -> summary')
response = summarize(document=document)

print(response.summary)

The article is about a 21-year-old football player who made seven appearances for a team called the Hammers and scored his only goal for them in a Europa League qualification round match against Andorran side FC Lustrains last season. He had two loan spells in League One last term, with Blackpool and then Colchester United. He scored twice for the U's but was unable to save them from relegation. The length of his contract with the promoted Tykes has not been revealed.


In [None]:
response

Prediction(
    rationale='Produce the summary.',
    summary="The article is about a 21-year-old football player who made seven appearances for a team called the Hammers and scored his only goal for them in a Europa League qualification round match against Andorran side FC Lustrains last season. He had two loan spells in League One last term, with Blackpool and then Colchester United. He scored twice for the U's but was unable to save them from relegation. The length of his contract with the promoted Tykes has not been revealed."
)

In [None]:
lm.inspect_history(n=1)





Given the fields `document`, produce the fields `summary`.

---

Follow the following format.

Document: ${document}
Reasoning: Let's think step by step in order to ${produce the summary}. We ...
Summary: ${summary}

---

Document: The 21-year-old made seven appearances for the Hammers and netted his only goal for them in a Europa League qualification round match against Andorran side FC Lustrains last season. Lee had two loan spells in League One last term, with Blackpool and then Colchester United. He scored twice for the U's but was unable to save them from relegation. The length of Lee's contract with the promoted Tykes has not been revealed. Find all the latest football transfers on our dedicated page.
Reasoning: Let's think step by step in order to Produce the summary.
Summary: The article is about a 21-year-old football player who made seven appearances for a team called the Hammers and scored his only goal for them in a Europa League qualification round match against A

#### Multiple outputs

In [None]:
sentence = "it's a charming and often affecting journey."

classify = dspy.Predict('sentence -> sentiment, emotion')
classify(sentence=sentence)

Prediction(
    sentiment='positive',
    emotion='charming, affecting'
)

In [None]:
lm.inspect_history(n=1)





Given the fields `sentence`, produce the fields `sentiment`, `emotion`.

---

Follow the following format.

Sentence: ${sentence}
Sentiment: ${sentiment}
Emotion: ${emotion}

---

Sentence: it's a charming and often affecting journey.
Sentiment: positive
Emotion: charming, affecting





#### Multiple Inputs

In [None]:
sentence = "it's a charming and often affecting journey."  # example from the SST-2 dataset.
comment = "the view was stunning."

classify = dspy.Predict('sentence, comment -> sentiment')
classify(sentence=sentence, comment=comment)

Prediction(
    sentiment='Positive'
)

In [None]:
lm.inspect_history(n=1)





Given the fields `sentence`, `comment`, produce the fields `sentiment`.

---

Follow the following format.

Sentence: ${sentence}
Comment: ${comment}
Sentiment: ${sentiment}

---

Sentence: it's a charming and often affecting journey.
Comment: the view was stunning.
Sentiment: Positive





#### Other potential signatures



1. Question Answering: `"question -> answer"`

2. Sentiment Classification: `"sentence -> sentiment"`

3. Summarization: `"document -> summary"`

4. Retrieval-Augmented Question Answering: `"context, question -> answer"`

5. Multiple-Choice Question Answering with Reasoning: `"question, choices -> reasoning, selection"`
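
To see how a shorthand string maps to the prompts printed by `inspect_history` above, here is a rough sketch of parsing a signature and rendering the prompt skeleton. This is not DSPy's actual implementation: `parse_signature` and `render_prompt` are hypothetical helpers written only to mirror the output format shown in this notebook.

```python
from typing import List, Tuple


def _split_names(part: str) -> List[str]:
    """Split a comma-separated field list and drop surrounding whitespace."""
    return [name.strip() for name in part.split(",") if name.strip()]


def parse_signature(signature: str) -> Tuple[List[str], List[str]]:
    """Split 'a, b -> c, d' into (['a', 'b'], ['c', 'd'])."""
    inputs, outputs = signature.split("->")
    return _split_names(inputs), _split_names(outputs)


def render_prompt(signature: str) -> str:
    """Render a prompt skeleton similar to the ones shown by inspect_history."""
    inputs, outputs = parse_signature(signature)
    quoted_in = ", ".join(f"`{name}`" for name in inputs)
    quoted_out = ", ".join(f"`{name}`" for name in outputs)
    lines = [
        f"Given the fields {quoted_in}, produce the fields {quoted_out}.",
        "",
        "---",
        "",
        "Follow the following format.",
        "",
    ]
    # One "Name: ${name}" placeholder line per field, inputs first.
    lines += [f"{name.capitalize()}: ${{{name}}}" for name in inputs + outputs]
    return "\n".join(lines)


print(parse_signature("question, choices -> reasoning, selection"))
print(render_prompt("sentence -> sentiment, emotion"))
```

The field names carry the task description here, which is why descriptive names such as `sentiment` or `summary` matter more than they would for ordinary variable names.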

### Class based Signatures

Adds more clarity to your prompt

In [None]:
class Emotion(dspy.Signature):
    """Classify emotion among sadness, joy, love, anger, fear, surprise."""

    sentence = dspy.InputField()
    sentiment = dspy.OutputField()

sentence = "i started feeling a little vulnerable when the giant spotlight started blinding me"  # from dair-ai/emotion

classify = dspy.Predict(Emotion)
classify(sentence=sentence)

Prediction(
    sentiment='sadness'
)

In [None]:
lm.inspect_history(n=1)





Classify emotion among sadness, joy, love, anger, fear, surprise.

---

Follow the following format.

Sentence: ${sentence}
Sentiment: ${sentiment}

---

Sentence: i started feeling a little vulnerable when the giant spotlight started blinding me
Sentiment: sadness





#### Multiple inputs

In [None]:
class CheckCitationFaithfulness(dspy.Signature):
    """Verify that the text is based on the provided context."""

    context = dspy.InputField(desc="facts here are assumed to be true")
    text = dspy.InputField()
    faithfulness = dspy.OutputField(desc="True/False indicating if text is faithful to context")

context = "The 21-year-old made seven appearances for the Hammers and netted his only goal for them in a Europa League qualification round match against Andorran side FC Lustrains last season. Lee had two loan spells in League One last term, with Blackpool and then Colchester United. He scored twice for the U's but was unable to save them from relegation. The length of Lee's contract with the promoted Tykes has not been revealed. Find all the latest football transfers on our dedicated page."

text = "Lee scored 3 goals for Colchester United."

faithfulness = dspy.ChainOfThought(CheckCitationFaithfulness)
faithfulness(context=context, text=text)

Prediction(
    rationale="verify the faithfulness of the text to the context. We need to check if the text accurately reflects the information provided in the context.\n\nContext: The 21-year-old made seven appearances for the Hammers and netted his only goal for them in a Europa League qualification round match against Andorran side FC Lustrains last season. Lee had two loan spells in League One last term, with Blackpool and then Colchester United. He scored twice for the U's but was unable to save them from relegation. The length of Lee's contract with the promoted Tykes has not been revealed. Find all the latest football transfers on our dedicated page.\n\nText: Lee scored 3 goals for Colchester United.\n\nReasoning: The text states that Lee scored twice for Colchester United, which is not consistent with the information provided in the context. The context states that Lee scored his only goal for the Hammers and scored twice for Colchester United, but not three goals.",
    faithf

In [None]:
lm.inspect_history(n=1)





Verify that the text is based on the provided context.

---

Follow the following format.

Context: facts here are assumed to be true

Text: ${text}

Reasoning: Let's think step by step in order to ${produce the faithfulness}. We ...

Faithfulness: True/False indicating if text is faithful to context

---

Context: The 21-year-old made seven appearances for the Hammers and netted his only goal for them in a Europa League qualification round match against Andorran side FC Lustrains last season. Lee had two loan spells in League One last term, with Blackpool and then Colchester United. He scored twice for the U's but was unable to save them from relegation. The length of Lee's contract with the promoted Tykes has not been revealed. Find all the latest football transfers on our dedicated page.

Text: Lee scored 3 goals for Colchester United.

Reasoning: Let's think step by step in order to verify the faithfulness of the text to the context. We need to check if the text accurately

## Modules

Each built-in module abstracts a prompting technique (like chain of thought or ReAct)

#### CoT

In [None]:
question = "What are interesting topics in machine learning?"

# 1) Declare with a signature (config options could also be passed here).
answerer = dspy.ChainOfThought('question -> answer')

# 2) Call with input argument.
response = answerer(question=question)

# 3) Access the outputs.
response.answer

'Interesting topics in machine learning include deep learning, natural language processing, computer vision, and reinforcement learning.'

In [None]:
lm.inspect_history(n=1)





Given the fields `question`, produce the fields `answer`.

---

Follow the following format.

Question: ${question}
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: ${answer}

---

Question: What are interesting topics in machine learning?
Reasoning: Let's think step by step in order to produce the answer. We need to identify interesting topics in machine learning. Some interesting topics in machine learning include deep learning, natural language processing, computer vision, and reinforcement learning. These topics are currently being researched and have the potential to revolutionize various industries.
Answer: Interesting topics in machine learning include deep learning, natural language processing, computer vision, and reinforcement learning.





In [None]:
response

Prediction(
    rationale='produce the answer. We need to identify interesting topics in machine learning. Some interesting topics in machine learning include deep learning, natural language processing, computer vision, and reinforcement learning. These topics are currently being researched and have the potential to revolutionize various industries.',
    answer='Interesting topics in machine learning include deep learning, natural language processing, computer vision, and reinforcement learning.'
)

#### CoT with Hint

In [None]:
# Define a simple signature for basic question answering
class BasicQA(dspy.Signature):
    """Answer questions with short factoid answers."""
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

# Pass signature to ChainOfThoughtWithHint module
generate_answer = dspy.ChainOfThoughtWithHint(BasicQA)

# Call the predictor on a particular input alongside a hint.
question='What is the color of the sky?'
hint = "It's what you often see during a sunny day."
pred = generate_answer(question=question, hint=hint)

print(f"Question: {question}")
print(f"Predicted Answer: {pred.answer}")

Question: What is the color of the sky?
Predicted Answer: Blue.


In [None]:
lm.inspect_history(n=1)





Answer questions with short factoid answers.

---

Follow the following format.

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the answer}. We ...

Hint: ${hint}

Answer: often between 1 and 5 words

---

Question: What is the color of the sky?

Reasoning: Let's think step by step in order to Blue.

Hint: It's what you often see during a sunny day.

Answer: Blue.





#### ReAct

https://arxiv.org/abs/2210.03629

The ReAct module is designed to compose interleaved steps of Thought, Action, and Observation.
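
The Thought, Action, Observation loop can be sketched without any LM at all. The following is a simplified illustration, not the `dspy.ReAct` implementation: `scripted_lm` is a hypothetical stand-in that always searches once and then finishes, and `CORPUS` is a one-entry toy retriever invented for the example.

```python
import re
from typing import Callable

# Toy stand-in for a retriever (the passage text is made up).
CORPUS = {"sky color": "On a clear day the sky looks blue."}


def search(query: str) -> str:
    """Look up a passage for the query in the toy corpus."""
    return CORPUS.get(query, "no passage found")


def scripted_lm(transcript: str) -> str:
    """Stand-in for the LM: search once, then finish. A real LM would read the transcript."""
    if "Observation 1:" not in transcript:
        return "Thought: I need to look up the color of the sky.\nAction: Search[sky color]"
    return "Thought: The passage says blue.\nAction: Finish[blue]"


def react(question: str, lm: Callable[[str], str], max_steps: int = 5) -> str:
    """Interleave Thought/Action from the LM with Observations from the tool."""
    transcript = f"Question: {question}\n"
    for step in range(1, max_steps + 1):
        reply = lm(transcript)
        transcript += reply + "\n"
        match = re.search(r"Action: (Search|Finish)\[(.*?)\]", reply)
        if match is None:
            break                      # the LM produced no valid action
        action, argument = match.groups()
        if action == "Finish":
            return argument            # final answer
        transcript += f"Observation {step}: {search(argument)}\n"
    return ""                          # no answer within the step budget


print(react("What is the color of the sky?", scripted_lm))  # blue
```

The key design point is that each observation is appended to the transcript, so the next Thought is conditioned on everything retrieved so far.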

In [None]:
# Define a simple signature for basic question answering
class BasicQA(dspy.Signature):
    """Answer questions with short factoid answers."""
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

# Pass signature to ReAct module
react_module = dspy.ReAct(BasicQA)

# Call the ReAct module on a particular input
question = 'What is the color of the sky?'
result = react_module(question=question)

print(f"Question: {question}")
print(f"Final Predicted Answer (after ReAct process): {result.answer}")

Question: What is the color of the sky?
Final Predicted Answer (after ReAct process): blue


In [None]:
lm.inspect_history(n=1)





You will be given `question` and you will respond with `answer`.

To do this, you will interleave Thought, Action, and Observation steps.

Thought can reason about the current situation, and Action can be the following types:

(1) Search[query], which takes a search query and returns one or more potentially relevant passages from a corpus
(2) Finish[answer], which returns the final `answer` and finishes the task

---

Follow the following format.

Question: ${question}

Thought 1: next steps to take based on last observation

Action 1: always either Search[query] or, when done, Finish[answer]

Observation 1: observations based on action

Thought 2: next steps to take based on last observation

Action 2: always either Search[query] or, when done, Finish[answer]

---

Question: What is the color of the sky?

Thought 1: I need to find information about the color of the sky.

Action 1: Search[sky color]

Observation 1:
[1] «Sky blue | Sky blue is the name of a colour that resembles the

#### Retrieve

In [None]:
query='When was the first FIFA World Cup held?'

# Call the retriever on a particular query.
retrieve = dspy.Retrieve(k=3)
topK_passages = retrieve(query).passages

print(f"Top {retrieve.k} passages for question: {query} \n", '-' * 30, '\n')

for idx, passage in enumerate(topK_passages):
    print(f'{idx+1}]', passage, '\n')

Top 3 passages for question: When was the first FIFA World Cup held? 
 ------------------------------ 

1] History of the FIFA World Cup | The FIFA World Cup was first held in 1930, when FIFA president Jules Rimet decided to stage an international football tournament. The inaugural edition, held in 1930, was contested as a final tournament of only thirteen teams invited by the organization. Since then, the World Cup has experienced successive expansions and format remodeling to its current 32-team final tournament preceded by a two-year qualifying process, involving over 200 teams from around the world. 

2] 1950 FIFA World Cup | The 1950 FIFA World Cup, held in Brazil from 24 June to 16 July 1950, was the fourth FIFA World Cup. It was the first World Cup since 1938, the planned 1942 and 1946 competitions having been cancelled owing to World War II. It was won by Uruguay, who had won the inaugural competition in 1930, clinching the cup by beating the hosts Brazil 2–1 in the deciding ma

## Data


Data is represented using the `Example` type. It is similar to a Python `dict` but has additional utility functions.

All DSPy modules return `Prediction`s which are subclassed from `Example`

In [None]:
qa_pair = dspy.Example(question="This is a question?", answer="This is an answer.")

print(qa_pair)
print(qa_pair.question)
print(qa_pair.answer)

Example({'question': 'This is a question?', 'answer': 'This is an answer.'}) (input_keys=None)
This is a question?
This is an answer.


In our example above, we specify the field values, but we don't say which fields are inputs and which are outputs. Think back to when we were doing supervised fine tuning: we had our prompt (the input) and the completion (the output).


To represent this in DSPy, we can use the `with_inputs` function.

In [None]:
article_summary = dspy.Example(article= "This is an article.", summary= "This is a summary.").with_inputs("article")

input_key_only = article_summary.inputs()
non_input_key_only = article_summary.labels()

print("Example object with Input fields only:", input_key_only)
print("Example object with Non-Input fields only:", non_input_key_only)

Example object with Input fields only: Example({'article': 'This is an article.'}) (input_keys=None)
Example object with Non-Input fields only: Example({'summary': 'This is a summary.'}) (input_keys=None)


## Metrics


In PyTorch, we have metrics that we are optimizing. For example, it may be the negative log likelihood loss, or an L2 loss.


These are calculations that can be computed automatically.


In DSPy, metrics are functions that take the inputs and outputs of your DSPy program and return a score.

For classification tasks such as sentiment analysis, it may be possible to use simple metrics like F1 or accuracy but for more sophisticated answers, this may not be sufficient.


For more sophisticated tasks, you may want to use a form of LM feedback. For example, we can leverage techniques we've used previously, such as LLM as a Judge, or a reward model.

Metrics are flexible: you can implement any Python logic you want. For example, you could count the number of words and return a score of 1 when the output has fewer than 5 words and 0 otherwise.

You can also combine several criteria and return the average or minimum of the individual scores. For example, you can count the number of words, get a score from an LLM as a Judge, normalize both, and return the minimum of the two.
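
As a rough sketch of the combined metric just described, the example below takes the minimum of a word-count score and a normalized judge score. `judge_score` is a hypothetical placeholder for an LLM as a Judge call (here a keyword check that returns a 1-5 rating), and `SimpleNamespace` stands in for a DSPy `Prediction`, so the snippet runs without an LM.

```python
from types import SimpleNamespace


def brevity_score(text: str, max_words: int = 5) -> float:
    """1.0 when the text has fewer than max_words words, else 0.0."""
    return 1.0 if len(text.split()) < max_words else 0.0


def judge_score(text: str) -> int:
    """Hypothetical stand-in for an LLM as a Judge call returning a 1-5 rating."""
    return 5 if "blue" in text.lower() else 1


def combined_metric(example, pred, trace=None) -> float:
    """Minimum of the brevity score and the judge rating normalized to [0, 1]."""
    brevity = brevity_score(pred.answer)
    quality = (judge_score(pred.answer) - 1) / 4   # map 1..5 onto 0..1
    return min(brevity, quality)


short_and_right = SimpleNamespace(answer="Blue")
long_but_right = SimpleNamespace(answer="The sky is usually blue during the day")
short_but_wrong = SimpleNamespace(answer="Green")

print(combined_metric(None, short_and_right))   # 1.0
print(combined_metric(None, long_but_right))    # 0.0
print(combined_metric(None, short_but_wrong))   # 0.0
```

Taking the minimum means an output must satisfy every criterion to score well, whereas an average would let a strong score on one criterion mask a failure on another.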












#### Exact match metric

In [None]:
def validate_answer(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()

#### Trace


When compiling (bootstrapping demonstrations), `trace` is not `None`. In that case we want to be strict, so the metric returns a `bool`, and only examples for which it returns `True` are kept as demonstrations.

When the metric is used for evaluation or optimization, `trace` is `None`, and the metric can return a numeric score instead.

In [None]:
def validate_context_and_answer(example, pred, trace=None):
    # check the gold label and the predicted answer are the same
    answer_match = example.answer.lower() == pred.answer.lower()

    # check the predicted answer comes from one of the retrieved contexts
    context_match = any((pred.answer.lower() in c.lower()) for c in pred.context)  # compare case-insensitively on both sides

    if trace is None: # if we're doing evaluation or optimization
        return (answer_match + context_match) / 2.0
    else: # if we're doing bootstrapping, i.e. self-generating good demonstrations of each step
        return answer_match and context_match

In [None]:
def metric(example, pred, trace=None):
    answer_match = example.sentiment.lower() == pred.sentiment.lower()
    word_count = len(pred.sentiment.strip().split()) == 1
    if trace is None:
        return min(int(answer_match), int(word_count))
    else:
        return answer_match and word_count

In [None]:
scores = []

dev_set = [dspy.Example(statement="Today is a lovely day.", sentiment="Positive").with_inputs("statement")]

program = dspy.Predict("statement -> sentiment")

for x in dev_set:
    pred = program(**x.inputs())
    print("Prediction: ")
    print(pred)
    score = metric(x, pred)
    print("Score: ")
    print(score)
    scores.append(score)


Prediction: 
Prediction(
    sentiment='Positive'
)
Score: 
1


#### Using AI Feedback

In [None]:
class Assess(dspy.Signature):
    """Assess the quality of a tweet along the specific dimension."""

    assessed_text = dspy.InputField()
    assessment_question = dspy.InputField()
    assessment_answer = dspy.OutputField(desc="yes/no")


In [None]:
def metric(example, pred, trace=None):
    question, answer, tweet = example.question, example.answer, pred.output

    engaging = "Does the assessed text make for a self-contained, engaging tweet?"
    correct = f"The text should answer `{question}` with `{answer}`. Does the assessed text contain this answer?"

    correct =  dspy.Predict(Assess)(assessed_text=tweet, assessment_question=correct)
    print(correct)
    lm.inspect_history(n=1)


    engaging = dspy.Predict(Assess)(assessed_text=tweet, assessment_question=engaging)
    print(engaging)
    lm.inspect_history(n=1)

    correct, engaging = [m.assessment_answer.strip().lower().startswith('yes') for m in [correct, engaging]]  # tolerate 'Yes', 'yes.', etc.
    score = (correct + engaging) if correct and (len(tweet) <= 280) else 0

    if trace is not None: return score >= 2
    return score / 2.0

In [None]:
scores = []

dev_set = [dspy.Example(question="Is machine learning cool?", answer="Machine learning is very cool.").with_inputs("question")]

program = dspy.Predict("question -> output")

for x in dev_set:
    pred = program(**x.inputs())
    print(pred)
    score = metric(x, pred)
    print(score)
    scores.append(score)

Prediction(
    output='Yes, machine learning is cool!'
)
Prediction(
    assessment_answer='Yes.'
)




Assess the quality of a tweet along the specific dimension.

---

Follow the following format.

Assessed Text: ${assessed_text}
Assessment Question: ${assessment_question}
Assessment Answer: yes/no

---

Assessed Text: Yes, machine learning is cool!
Assessment Question: The text should answer `Is machine learning cool?` with `Machine learning is very cool.`. Does the assessed text contain this answer?
Assessment Answer: Yes.



Prediction(
    assessment_answer='Yes.'
)




Assess the quality of a tweet along the specific dimension.

---

Follow the following format.

Assessed Text: ${assessed_text}
Assessment Question: ${assessment_question}
Assessment Answer: yes/no

---

Assessed Text: Yes, machine learning is cool!
Assessment Question: Does the assessed text make for a self-contained, engaging tweet?
Assessment Answer: Yes.



1.0
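The scoring rule in this metric is separate from the LM calls, so it can be sketched on its own. Below, the two judge answers are passed in as plain strings. The helper names `parse_yes_no` and `tweet_score` are hypothetical and not part of DSPy.

```python
def parse_yes_no(raw: str) -> bool:
    # Judges may answer "Yes.", " yes", "YES" and so on; normalise first.
    return raw.strip().lower().startswith("yes")

def tweet_score(tweet: str, is_correct: bool, is_engaging: bool, max_len: int = 280) -> float:
    # Correctness and length act as gates: failing either gives a score of 0.
    if not is_correct or len(tweet) > max_len:
        return 0.0
    # Otherwise each satisfied criterion contributes half of the score.
    return (int(is_correct) + int(is_engaging)) / 2.0

tweet = "Yes, machine learning is cool!"
full = tweet_score(tweet, parse_yes_no("Yes."), parse_yes_no(" Yes."))
half = tweet_score(tweet, parse_yes_no("yes"), parse_yes_no("No."))
wrong = tweet_score(tweet, parse_yes_no("No"), parse_yes_no("Yes"))
too_long = tweet_score("x" * 281, True, True)
print(full, half, wrong, too_long)
```

An engaging tweet that is incorrect or over the length limit still scores 0, which matches the `if correct and (len(tweet) <= 280) else 0` condition in the metric.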


## Optimizers

Traditional deep neural networks (DNNs) can be optimized with gradient descent, given a loss function and some training data.

A DSPy module has 3 kinds of parameters:
1. LM weights
2. instructions
3. demonstrations of the input/output behavior


Each kind of parameter is updated differently:
1. LM weights -> fine-tuning the model weights
2. Instructions -> rewriting the instructions
3. Demonstrations -> selecting or generating few-shot examples

Compiling often produces better prompts than humans write. This is not because DSPy optimizers are more creative than humans, but because they can try many more variations, do so systematically, and tune the prompt directly against the metric.


In [None]:
from dspy.teleprompt import *

dspy/cache/compiler


### Labeled Few Shot

Constructs few-shot examples (demos) from the provided labeled input/output pairs.
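Conceptually, this amounts to sampling `k` labeled examples and placing them in the prompt ahead of the new input. The sketch below shows that idea in plain Python. The helpers `sample_demos` and `render_prompt`, the fixed seed, and the dict-based examples are assumptions for illustration, not DSPy's implementation.

```python
import random

def sample_demos(trainset, k, seed=0):
    # Pick up to k labeled examples; a fixed seed keeps the choice reproducible.
    rng = random.Random(seed)
    return rng.sample(trainset, min(k, len(trainset)))

def render_prompt(demos, text):
    # Use a "Text: ... / Emotion: ..." layout with "---" between blocks.
    blocks = [f"Text: {d['text']}\nEmotion: {d['emotion']}" for d in demos]
    # The final block leaves the output field open for the LM to complete.
    blocks.append(f"Text: {text}\nEmotion:")
    return "\n\n---\n\n".join(blocks)

trainset = [
    {"text": "I'm having the best day ever!", "emotion": "happy"},
    {"text": "I feel so alone. No one understands me.", "emotion": "sad"},
    {"text": "I'm so stressed out about this deadline.", "emotion": "anxious"},
    {"text": "I'm so grateful for my friends and family!", "emotion": "happy"},
]

demos = sample_demos(trainset, k=3)
prompt = render_prompt(demos, "I feel like I'm in a rut.")
print(prompt)
```

Because the demos are sampled without regard to their labels, a small `k` can yield demos that all share one label, which biases the prediction toward that label.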

In [None]:
from dspy.teleprompt import LabeledFewShot
from dspy import Example

dataset = [
    Example(text="I'm so excited for my vacation next week!", emotion="happy"),
    Example(text="I feel like I'm not good enough.", emotion="sad"),
    Example(text="I can't believe I got the job! I'm thrilled!", emotion="happy"),
    Example(text="I'm really worried about my presentation tomorrow.", emotion="anxious"),
    Example(text="I miss my family so much. I wish I could see them.", emotion="sad"),
    Example(text="I'm having the best day ever!", emotion="happy"),
    Example(text="I'm so stressed out about this deadline.", emotion="anxious"),
    Example(text="I feel like crying. Nothing seems to be going right.", emotion="sad"),
    Example(text="I'm over the moon! We're finally getting married!", emotion="happy"),
    Example(text="I'm terrified of heights. I don't think I can do this.", emotion="anxious"),
    Example(text="I feel so alone. No one understands me.", emotion="sad"),
    Example(text="I can't stop smiling! Today has been amazing!", emotion="happy"),
    Example(text="I'm really anxious about this exam. I hope I pass.", emotion="anxious"),
    Example(text="I'm feeling down. I think I need a hug.", emotion="sad"),
    Example(text="I'm so grateful for my friends and family!", emotion="happy"),
    Example(text="I'm having a panic attack. I can't breathe.", emotion="anxious"),
    Example(text="I feel like I'm in a rut. Nothing excites me anymore.", emotion="sad"),
    Example(text="I'm on top of the world! Everything is going perfectly!", emotion="happy"),
    Example(text="I'm so nervous about this first date. What if it goes badly?", emotion="anxious"),
    Example(text="I feel so empty inside. I don't know what to do.", emotion="sad")
]

dataset = [e.with_inputs("text") for e in dataset]
train_size = int(len(dataset) * 0.8)
train_set =  dataset[:train_size]
dev_set = dataset[train_size:]



class EmotionClassifier(dspy.Module):
    def __init__(self):
        super().__init__()
        self.program = dspy.Predict("text -> emotion")

    def forward(self, text):
        return self.program(text=text)

def metric(example, pred, trace=None):
    text, gt_emotion, predicted_emotion = example.text, example.emotion, pred.emotion
    score = predicted_emotion.strip().lower() == gt_emotion

    if trace is not None: return score
    return float(score)


program = EmotionClassifier()
labeled_fewshot_optimizer = LabeledFewShot(k=3)
your_dspy_program_compiled = labeled_fewshot_optimizer.compile(student = program, trainset=train_set)

In [None]:
your_dspy_program_compiled(dev_set[0].text)

Prediction(
    emotion='depressed'
)

In [None]:
lm.inspect_history(n=1)





Given the fields `text`, produce the fields `emotion`.

---

Follow the following format.

Text: ${text}
Emotion: ${emotion}

---

Text: I'm really anxious about this exam. I hope I pass.
Emotion: anxious

---

Text: I'm having a panic attack. I can't breathe.
Emotion: anxious

---

Text: I'm so stressed out about this deadline.
Emotion: anxious

---

Text: I feel like I'm in a rut. Nothing excites me anymore.
Emotion: depressed





In [None]:
your_dspy_program_compiled.save("./emotion_classifier_few_shot_labeled.json")

In [None]:
import json
with open("./emotion_classifier_few_shot_labeled.json", "r") as f:
    data = json.loads(f.read())
    print(json.dumps(data, indent=4))


{
    "program": {
        "lm": null,
        "traces": [],
        "train": [],
        "demos": [
            {
                "text": "I'm really anxious about this exam. I hope I pass.",
                "emotion": "anxious"
            },
            {
                "text": "I'm having a panic attack. I can't breathe.",
                "emotion": "anxious"
            },
            {
                "text": "I'm so stressed out about this deadline.",
                "emotion": "anxious"
            }
        ],
        "signature_instructions": "Given the fields `text`, produce the fields `emotion`.",
        "signature_prefix": "Emotion:"
    }
}


### Bootstrap Few Shot

Uses your program to self-generate complete demonstrations for every stage of the program. Demonstrations that pass the metric are used as-is, without any further optimization.
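The bootstrapping idea can be sketched without an LM: run the program on training examples, keep the outputs that pass the metric, and stop once enough demonstrations are collected. In the sketch below, `bootstrap_demos`, `toy_program`, and `exact_match` are hypothetical names, and the keyword rule stands in for a real LM call.

```python
def bootstrap_demos(program, trainset, metric, max_demos):
    """Run `program` on training examples and keep outputs that pass `metric`."""
    demos = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        prediction = program(example["text"])
        # A non-None trace tells the metric it is being used for bootstrapping.
        if metric(example, prediction, trace=[]):
            # Store the input together with the generated rationale and label.
            demos.append({**example, **prediction})
    return demos

def toy_program(text):
    # Hypothetical stand-in for an LM call: keyword rule plus a canned rationale.
    emotion = "happy" if "excited" in text or "thrilled" in text else "sad"
    return {"rationale": f"The text reads as {emotion}.", "emotion": emotion}

def exact_match(example, prediction, trace=None):
    return prediction["emotion"] == example["emotion"]

trainset = [
    {"text": "I'm so excited for my vacation next week!", "emotion": "happy"},
    {"text": "I'm really worried about my presentation tomorrow.", "emotion": "anxious"},
    {"text": "I feel like I'm not good enough.", "emotion": "sad"},
    {"text": "I can't believe I got the job! I'm thrilled!", "emotion": "happy"},
]

demos = bootstrap_demos(toy_program, trainset, exact_match, max_demos=2)
print([d["emotion"] for d in demos])
```

In this toy run the second example is rejected because the prediction does not match its label, so two demonstrations are collected after three examples.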

In [None]:
from dspy.teleprompt import BootstrapFewShot

class EmotionCotClassifier(dspy.Module):
    def __init__(self):
        super().__init__()
        self.program = dspy.ChainOfThought("text -> emotion")

    def forward(self, text):
        return self.program(text=text)

def metric(example, pred, trace=None):
    text, gt_emotion, predicted_emotion = example.text, example.emotion, pred.emotion
    score = predicted_emotion.strip().lower() == gt_emotion

    if trace is not None: return score
    return float(score)

program = EmotionCotClassifier()

fewshot_optimizer = BootstrapFewShot(metric=metric, max_bootstrapped_demos=2, max_labeled_demos=4, max_rounds=1, max_errors=5)

compiled_few_shot = fewshot_optimizer.compile(student=program, trainset=train_set[:4])

 50%|█████     | 2/4 [00:01<00:01,  1.12it/s]

Bootstrapped 2 full traces after 3 examples in round 0.





In [None]:
compiled_few_shot(dev_set[0].text)

Prediction(
    rationale='${produce the emotion}. We need to analyze the text to determine the underlying emotion. The text mentions a feeling of being stuck in a rut, which suggests a lack of interest or motivation. Therefore, the emotion that best fits this text is "boredom".',
    emotion='boredom'
)

In [None]:
lm.inspect_history(n=1)





Given the fields `text`, produce the fields `emotion`.

---

Follow the following format.

Text: ${text}
Reasoning: Let's think step by step in order to ${produce the emotion}. We ...
Emotion: ${emotion}

---

Text: I'm so excited for my vacation next week!
Reasoning: Let's think step by step in order to ${produce the emotion}. We need to consider the context of the text. The text mentions a vacation, which is typically associated with positive emotions such as excitement and anticipation. Therefore, the emotion that best fits the text is "happy".
Emotion: happy

---

Text: I feel like I'm not good enough.
Reasoning: Let's think step by step in order to ${produce the emotion}. We ...
Emotion: sad

---

Text: I'm really worried about my presentation tomorrow.
Emotion: anxious

---

Text: I can't believe I got the job! I'm thrilled!
Emotion: happy

---

Text: I feel like I'm in a rut. Nothing excites me anymore.
Reasoning: Let's think step by step in order to ${produce the emoti

### Bootstrap Few Shot With Random Search

Applies `BootstrapFewShot` several times, with random search over the generated demonstrations, and selects the best program.
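The search itself is simple: draw several candidate demo sets, score each one with the metric on a validation set, and keep the best. The sketch below illustrates this under stated assumptions. The word-overlap "program" inside `evaluate` is a toy stand-in for an LM, and `random_search` is a hypothetical helper rather than the DSPy optimizer.

```python
import random

def evaluate(demos, devset):
    # Toy "program": predict the label of the demo sharing the most words with the input.
    def predict(text):
        words = set(text.lower().split())
        best = max(demos, key=lambda d: len(words & set(d["text"].lower().split())))
        return best["emotion"]
    hits = sum(predict(ex["text"]) == ex["emotion"] for ex in devset)
    return hits / len(devset)

def random_search(pool, devset, k, num_candidates, seed=0):
    rng = random.Random(seed)
    best_score, best_demos = -1.0, None
    for _ in range(num_candidates):
        candidate = rng.sample(pool, k)      # one random demo set
        score = evaluate(candidate, devset)  # score it with the metric
        if score > best_score:               # keep the best candidate so far
            best_score, best_demos = score, candidate
    return best_score, best_demos

pool = [
    {"text": "I'm so excited for my vacation next week!", "emotion": "happy"},
    {"text": "I feel like I'm not good enough.", "emotion": "sad"},
    {"text": "I'm really worried about my presentation tomorrow.", "emotion": "anxious"},
    {"text": "I miss my family so much. I wish I could see them.", "emotion": "sad"},
    {"text": "I'm so stressed out about this deadline.", "emotion": "anxious"},
    {"text": "I'm having the best day ever!", "emotion": "happy"},
]
devset = [
    {"text": "I'm so grateful for my friends and family!", "emotion": "happy"},
    {"text": "I feel so empty inside. I don't know what to do.", "emotion": "sad"},
]

best_score, best_demos = random_search(pool, devset, k=3, num_candidates=16)
print(best_score, [d["emotion"] for d in best_demos])
```

Each candidate costs a full evaluation pass, so the number of candidates trades compile time against the chance of finding a good demo set.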

In [None]:
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

class EmotionCotClassifier(dspy.Module):
    def __init__(self):
        super().__init__()
        self.program = dspy.ChainOfThought("text -> emotion")

    def forward(self, text):
        return self.program(text=text)

def metric(example, pred, trace=None):
    text, gt_emotion, predicted_emotion = example.text, example.emotion, pred.emotion
    score = predicted_emotion.strip().lower() == gt_emotion

    if trace is not None: return score
    return float(score)

program = EmotionCotClassifier()

fewshot_with_random_search_optimizer = BootstrapFewShotWithRandomSearch(metric=metric, max_bootstrapped_demos=2, max_labeled_demos=4, max_rounds=1, max_errors=5)

compiled_few_shot = fewshot_with_random_search_optimizer.compile(student=program, trainset=train_set[:4])

Going to sample between 1 and 2 traces per predictor.
Will attempt to train 16 candidate sets.


Average Metric: 1.0 / 4  (25.0): 100%|██████████| 4/4 [00:01<00:00,  2.63it/s]


Average Metric: 1.0 / 4  (25.0%)
Score: 25.0 for set: [0]
New best score: 25.0 for seed -3
Scores so far: [25.0]
Best score: 25.0


Average Metric: 4.0 / 4  (100.0): 100%|██████████| 4/4 [00:01<00:00,  3.06it/s]


Average Metric: 4.0 / 4  (100.0%)
Score: 100.0 for set: [4]
New best score: 100.0 for seed -2
Scores so far: [25.0, 100.0]
Best score: 100.0


 50%|█████     | 2/4 [00:00<00:00, 517.59it/s]


Bootstrapped 2 full traces after 3 examples in round 0.


Average Metric: 4.0 / 4  (100.0): 100%|██████████| 4/4 [00:01<00:00,  3.58it/s]


Average Metric: 4.0 / 4  (100.0%)
Score: 100.0 for set: [4]
Scores so far: [25.0, 100.0, 100.0]
Best score: 100.0
Average of max per entry across top 1 scores: 1.0
Average of max per entry across top 2 scores: 1.0
Average of max per entry across top 3 scores: 1.0
Average of max per entry across top 5 scores: 1.0
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


 75%|███████▌  | 3/4 [00:01<00:00,  2.30it/s]


Bootstrapped 2 full traces after 4 examples in round 0.


Average Metric: 3.0 / 4  (75.0): 100%|██████████| 4/4 [00:00<00:00,  5.80it/s]


Average Metric: 3.0 / 4  (75.0%)
Score: 75.0 for set: [4]
Scores so far: [25.0, 100.0, 100.0, 75.0]
Best score: 100.0
Average of max per entry across top 1 scores: 1.0
Average of max per entry across top 2 scores: 1.0
Average of max per entry across top 3 scores: 1.0
Average of max per entry across top 5 scores: 1.0
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


 25%|██▌       | 1/4 [00:00<00:01,  1.86it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 3.0 / 4  (75.0): 100%|██████████| 4/4 [00:01<00:00,  3.49it/s]


Average Metric: 3.0 / 4  (75.0%)
Score: 75.0 for set: [4]
Scores so far: [25.0, 100.0, 100.0, 75.0, 75.0]
Best score: 100.0
Average of max per entry across top 1 scores: 1.0
Average of max per entry across top 2 scores: 1.0
Average of max per entry across top 3 scores: 1.0
Average of max per entry across top 5 scores: 1.0
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


 25%|██▌       | 1/4 [00:00<00:01,  2.36it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 3.0 / 4  (75.0): 100%|██████████| 4/4 [00:00<00:00,  5.72it/s]


Average Metric: 3.0 / 4  (75.0%)
Score: 75.0 for set: [4]
Scores so far: [25.0, 100.0, 100.0, 75.0, 75.0, 75.0]
Best score: 100.0
Average of max per entry across top 1 scores: 1.0
Average of max per entry across top 2 scores: 1.0
Average of max per entry across top 3 scores: 1.0
Average of max per entry across top 5 scores: 1.0
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


 25%|██▌       | 1/4 [00:00<00:00, 957.39it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 3.0 / 4  (75.0): 100%|██████████| 4/4 [00:00<00:00, 481.77it/s] 


Average Metric: 3.0 / 4  (75.0%)
Score: 75.0 for set: [4]
Scores so far: [25.0, 100.0, 100.0, 75.0, 75.0, 75.0, 75.0]
Best score: 100.0
Average of max per entry across top 1 scores: 1.0
Average of max per entry across top 2 scores: 1.0
Average of max per entry across top 3 scores: 1.0
Average of max per entry across top 5 scores: 1.0
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


 25%|██▌       | 1/4 [00:00<00:01,  2.28it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 3.0 / 4  (75.0): 100%|██████████| 4/4 [00:00<00:00,  4.50it/s]


Average Metric: 3.0 / 4  (75.0%)
Score: 75.0 for set: [4]
Scores so far: [25.0, 100.0, 100.0, 75.0, 75.0, 75.0, 75.0, 75.0]
Best score: 100.0
Average of max per entry across top 1 scores: 1.0
Average of max per entry across top 2 scores: 1.0
Average of max per entry across top 3 scores: 1.0
Average of max per entry across top 5 scores: 1.0
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


 75%|███████▌  | 3/4 [00:01<00:00,  2.09it/s]


Bootstrapped 2 full traces after 4 examples in round 0.


Average Metric: 4.0 / 4  (100.0): 100%|██████████| 4/4 [00:01<00:00,  3.44it/s]


Average Metric: 4.0 / 4  (100.0%)
Score: 100.0 for set: [4]
Scores so far: [25.0, 100.0, 100.0, 75.0, 75.0, 75.0, 75.0, 75.0, 100.0]
Best score: 100.0
Average of max per entry across top 1 scores: 1.0
Average of max per entry across top 2 scores: 1.0
Average of max per entry across top 3 scores: 1.0
Average of max per entry across top 5 scores: 1.0
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


 25%|██▌       | 1/4 [00:00<00:02,  1.45it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 3.0 / 4  (75.0): 100%|██████████| 4/4 [00:01<00:00,  3.47it/s]


Average Metric: 3.0 / 4  (75.0%)
Score: 75.0 for set: [4]
Scores so far: [25.0, 100.0, 100.0, 75.0, 75.0, 75.0, 75.0, 75.0, 100.0, 75.0]
Best score: 100.0
Average of max per entry across top 1 scores: 1.0
Average of max per entry across top 2 scores: 1.0
Average of max per entry across top 3 scores: 1.0
Average of max per entry across top 5 scores: 1.0
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


 50%|█████     | 2/4 [00:00<00:00,  6.78it/s]


Bootstrapped 2 full traces after 3 examples in round 0.


Average Metric: 4.0 / 4  (100.0): 100%|██████████| 4/4 [00:01<00:00,  3.99it/s]


Average Metric: 4.0 / 4  (100.0%)
Score: 100.0 for set: [4]
Scores so far: [25.0, 100.0, 100.0, 75.0, 75.0, 75.0, 75.0, 75.0, 100.0, 75.0, 100.0]
Best score: 100.0
Average of max per entry across top 1 scores: 1.0
Average of max per entry across top 2 scores: 1.0
Average of max per entry across top 3 scores: 1.0
Average of max per entry across top 5 scores: 1.0
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


 50%|█████     | 2/4 [00:00<00:00,  6.52it/s]


Bootstrapped 1 full traces after 3 examples in round 0.


Average Metric: 3.0 / 4  (75.0): 100%|██████████| 4/4 [00:00<00:00, 385.44it/s]


Average Metric: 3.0 / 4  (75.0%)
Score: 75.0 for set: [4]
Scores so far: [25.0, 100.0, 100.0, 75.0, 75.0, 75.0, 75.0, 75.0, 100.0, 75.0, 100.0, 75.0]
Best score: 100.0
Average of max per entry across top 1 scores: 1.0
Average of max per entry across top 2 scores: 1.0
Average of max per entry across top 3 scores: 1.0
Average of max per entry across top 5 scores: 1.0
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


 50%|█████     | 2/4 [00:00<00:00, 1063.06it/s]


Bootstrapped 2 full traces after 3 examples in round 0.


Average Metric: 4.0 / 4  (100.0): 100%|██████████| 4/4 [00:00<00:00, 706.68it/s]


Average Metric: 4.0 / 4  (100.0%)
Score: 100.0 for set: [4]
Scores so far: [25.0, 100.0, 100.0, 75.0, 75.0, 75.0, 75.0, 75.0, 100.0, 75.0, 100.0, 75.0, 100.0]
Best score: 100.0
Average of max per entry across top 1 scores: 1.0
Average of max per entry across top 2 scores: 1.0
Average of max per entry across top 3 scores: 1.0
Average of max per entry across top 5 scores: 1.0
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


 25%|██▌       | 1/4 [00:00<00:00, 1163.15it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 3.0 / 4  (75.0): 100%|██████████| 4/4 [00:00<00:00, 827.20it/s]


Average Metric: 3.0 / 4  (75.0%)
Score: 75.0 for set: [4]
Scores so far: [25.0, 100.0, 100.0, 75.0, 75.0, 75.0, 75.0, 75.0, 100.0, 75.0, 100.0, 75.0, 100.0, 75.0]
Best score: 100.0
Average of max per entry across top 1 scores: 1.0
Average of max per entry across top 2 scores: 1.0
Average of max per entry across top 3 scores: 1.0
Average of max per entry across top 5 scores: 1.0
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


 50%|█████     | 2/4 [00:00<00:00, 1198.20it/s]


Bootstrapped 2 full traces after 3 examples in round 0.


Average Metric: 4.0 / 4  (100.0): 100%|██████████| 4/4 [00:00<00:00, 761.63it/s]


Average Metric: 4.0 / 4  (100.0%)
Score: 100.0 for set: [4]
Scores so far: [25.0, 100.0, 100.0, 75.0, 75.0, 75.0, 75.0, 75.0, 100.0, 75.0, 100.0, 75.0, 100.0, 75.0, 100.0]
Best score: 100.0
Average of max per entry across top 1 scores: 1.0
Average of max per entry across top 2 scores: 1.0
Average of max per entry across top 3 scores: 1.0
Average of max per entry across top 5 scores: 1.0
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


 75%|███████▌  | 3/4 [00:00<00:00, 10.19it/s]


Bootstrapped 2 full traces after 4 examples in round 0.


Average Metric: 3.0 / 4  (75.0): 100%|██████████| 4/4 [00:00<00:00, 719.06it/s]


Average Metric: 3.0 / 4  (75.0%)
Score: 75.0 for set: [4]
Scores so far: [25.0, 100.0, 100.0, 75.0, 75.0, 75.0, 75.0, 75.0, 100.0, 75.0, 100.0, 75.0, 100.0, 75.0, 100.0, 75.0]
Best score: 100.0
Average of max per entry across top 1 scores: 1.0
Average of max per entry across top 2 scores: 1.0
Average of max per entry across top 3 scores: 1.0
Average of max per entry across top 5 scores: 1.0
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


 75%|███████▌  | 3/4 [00:00<00:00,  3.07it/s]


Bootstrapped 2 full traces after 4 examples in round 0.


Average Metric: 3.0 / 4  (75.0): 100%|██████████| 4/4 [00:01<00:00,  3.98it/s]


Average Metric: 3.0 / 4  (75.0%)
Score: 75.0 for set: [4]
Scores so far: [25.0, 100.0, 100.0, 75.0, 75.0, 75.0, 75.0, 75.0, 100.0, 75.0, 100.0, 75.0, 100.0, 75.0, 100.0, 75.0, 75.0]
Best score: 100.0
Average of max per entry across top 1 scores: 1.0
Average of max per entry across top 2 scores: 1.0
Average of max per entry across top 3 scores: 1.0
Average of max per entry across top 5 scores: 1.0
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


 25%|██▌       | 1/4 [00:00<00:01,  2.27it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 4.0 / 4  (100.0): 100%|██████████| 4/4 [00:00<00:00,  5.58it/s]


Average Metric: 4.0 / 4  (100.0%)
Score: 100.0 for set: [4]
Scores so far: [25.0, 100.0, 100.0, 75.0, 75.0, 75.0, 75.0, 75.0, 100.0, 75.0, 100.0, 75.0, 100.0, 75.0, 100.0, 75.0, 75.0, 100.0]
Best score: 100.0
Average of max per entry across top 1 scores: 1.0
Average of max per entry across top 2 scores: 1.0
Average of max per entry across top 3 scores: 1.0
Average of max per entry across top 5 scores: 1.0
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


 25%|██▌       | 1/4 [00:00<00:01,  1.56it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 4.0 / 4  (100.0): 100%|██████████| 4/4 [00:01<00:00,  3.66it/s]

Average Metric: 4.0 / 4  (100.0%)
Score: 100.0 for set: [4]
Scores so far: [25.0, 100.0, 100.0, 75.0, 75.0, 75.0, 75.0, 75.0, 100.0, 75.0, 100.0, 75.0, 100.0, 75.0, 100.0, 75.0, 75.0, 100.0, 100.0]
Best score: 100.0
Average of max per entry across top 1 scores: 1.0
Average of max per entry across top 2 scores: 1.0
Average of max per entry across top 3 scores: 1.0
Average of max per entry across top 5 scores: 1.0
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0
19 candidate programs found.





In [None]:
compiled_few_shot(dev_set[0].text)

Prediction(
    rationale='${produce the emotion}. We need to analyze the text to determine the underlying emotion. The text mentions a feeling of being stuck in a rut, which suggests a lack of interest or motivation. Therefore, the emotion that best fits this text is "boredom".',
    emotion='boredom'
)

In [None]:
lm.inspect_history(n=1)





Given the fields `text`, produce the fields `emotion`.

---

Follow the following format.

Text: ${text}
Reasoning: Let's think step by step in order to ${produce the emotion}. We ...
Emotion: ${emotion}

---

Text: I'm really worried about my presentation tomorrow.
Emotion: anxious

---

Text: I feel like I'm not good enough.
Emotion: sad

---

Text: I'm so excited for my vacation next week!
Emotion: happy

---

Text: I can't believe I got the job! I'm thrilled!
Emotion: happy

---

Text: I feel like I'm in a rut. Nothing excites me anymore.
Reasoning: Let's think step by step in order to ${produce the emotion}. We need to analyze the text to determine the underlying emotion. The text mentions a feeling of being stuck in a rut, which suggests a lack of interest or motivation. Therefore, the emotion that best fits this text is "boredom".
Emotion: boredom





### Loading a program

```
loaded_program = YOUR_PROGRAM_CLASS()
loaded_program.load(path=YOUR_SAVE_PATH)
```
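The class must be instantiated before loading because the saved JSON holds only state (demos and signature instructions), not the program's code. The sketch below shows that round trip with a toy class. `ToyProgram` and its fields are hypothetical and only mirror the general shape of the saved file.

```python
import json
import os
import tempfile

class ToyProgram:
    def __init__(self):
        self.instructions = "Given the fields `text`, produce the fields `emotion`."
        self.demos = []

    def save(self, path):
        # Only the learned state is written, not the class definition.
        state = {"demos": self.demos, "signature_instructions": self.instructions}
        with open(path, "w") as f:
            json.dump(state, f, indent=4)

    def load(self, path):
        # Overwrite this instance's state with what was saved.
        with open(path) as f:
            state = json.load(f)
        self.demos = state["demos"]
        self.instructions = state["signature_instructions"]

original = ToyProgram()
original.demos = [{"text": "I'm so stressed out about this deadline.", "emotion": "anxious"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_program.json")
    original.save(path)
    restored = ToyProgram()  # instantiate first, then load state into it
    restored.load(path)

print(restored.demos)
```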

## Typed Predictors


In [None]:
from pydantic import BaseModel, Field

class Input(BaseModel):
    context: str = Field(description="The context for the question")
    query: str = Field(description="The question to be answered")

class Output(BaseModel):
    answer: str = Field(description="The answer for the question")
    confidence: float = Field(ge=0, le=1, description="The confidence score for the answer")

In [None]:
class QASignature(dspy.Signature):
    """Answer the question based on the context and query provided, and report how confident you are in the answer as a number between 0 and 1."""

    input: Input = dspy.InputField()
    output: Output = dspy.OutputField()

In [None]:
predictor = dspy.TypedPredictor(QASignature)

In [None]:
doc_query_pair = Input(
    context="The quick brown fox jumps over the lazy dog",
    query="What does the fox jump over?",
)

prediction = predictor(input=doc_query_pair)

In [None]:
prediction

Prediction(
    output=Output(answer='The fox jumps over the lazy dog.', confidence=0.95)
)

In [None]:
answer = prediction.output.answer
confidence_score = prediction.output.confidence

print(f"Prediction: {prediction}\n\n")
print(f"Answer: {answer}, Answer Type: {type(answer)}")
print(f"Confidence Score: {confidence_score}, Confidence Score Type: {type(confidence_score)}")

Prediction: Prediction(
    output=Output(answer='The fox jumps over the lazy dog.', confidence=0.95)
)


Answer: The fox jumps over the lazy dog., Answer Type: <class 'str'>
Confidence Score: 0.95, Confidence Score Type: <class 'float'>
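The value of a typed output is that a malformed response is rejected instead of passed downstream. The sketch below illustrates that validation step with only the standard library. The dataclass is a stand-in for the pydantic `Output` model (an assumption so the snippet runs without extra packages), and `parse_output` is a hypothetical helper.

```python
import json
from dataclasses import dataclass

@dataclass
class Output:
    answer: str
    confidence: float

    def __post_init__(self):
        # Mimic the type check and the ge=0 / le=1 bounds of the pydantic model.
        if not isinstance(self.answer, str):
            raise TypeError("answer must be a string")
        self.confidence = float(self.confidence)
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")

def parse_output(raw_json: str) -> Output:
    # Parse the raw LM text as JSON, then validate it against the schema.
    return Output(**json.loads(raw_json))

ok = parse_output('{"answer": "The lazy dog.", "confidence": 0.95}')

try:
    # A value such as 9 on a ten-point scale violates the 0..1 bound.
    parse_output('{"answer": "The lazy dog.", "confidence": 9}')
    rejected = False
except ValueError:
    rejected = True

print(ok, rejected)
```

The prompt and the schema need to agree on the confidence scale. Otherwise a well-formed answer can still fail validation.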


### Assertions/Suggestions

https://arxiv.org/abs/2312.13382

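A suggestion does not fail the program when its condition is not met. Instead, the module is retried with the prompt updated to include the refinement feedback, up to a specified retry count. If the condition still fails after that many retries, execution moves on to the next module.

An assertion is the strict counterpart: if its condition is still violated once the retries are exhausted, it raises an error and the program stops.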

![](https://i.imgur.com/dCKOJ82.png)

![](https://i.imgur.com/SMh3v9L.png)
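The difference between the two constraint types can be sketched as a retry loop that feeds the failure message back into the next attempt. This is a minimal illustration, assuming both types share the same retry path and differ only in what happens when retries run out. `run_with_constraint`, `AssertionFailed`, and the toy generators are hypothetical names, not DSPy's API.

```python
class AssertionFailed(Exception):
    pass

def run_with_constraint(generate, check, feedback, max_retries=2, hard=False):
    """Call `generate` until `check` passes or retries run out.

    hard=True  -> assertion-like: raise once retries are exhausted.
    hard=False -> suggestion-like: return the last output and continue.
    """
    hints = []
    output = generate(hints)
    for _ in range(max_retries):
        if check(output):
            return output
        hints.append(feedback)    # add the refinement message to the "prompt"
        output = generate(hints)  # retry with the extra guidance
    if check(output) or not hard:
        return output
    raise AssertionFailed(feedback)

def compliant(hints):
    # Toy generator that follows the feedback once it sees it.
    return "short" if hints else "a very long answer that breaks the limit"

def stubborn(hints):
    # Toy generator that ignores the feedback.
    return "a very long answer that breaks the limit"

def is_short(text):
    return len(text) <= 10

fixed = run_with_constraint(compliant, is_short, "Keep it under 10 characters.", hard=True)
soft = run_with_constraint(stubborn, is_short, "Keep it under 10 characters.", hard=False)

try:
    run_with_constraint(stubborn, is_short, "Keep it under 10 characters.", hard=True)
    halted = False
except AssertionFailed:
    halted = True

print(fixed, soft, halted)
```

A soft constraint suits style preferences where a slightly off output is acceptable. A hard constraint suits requirements that later steps depend on.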