# Bootstrap Few-shot Prompting with LangSmith

Prompt engineering is a pain. You can use _examples_ to optimize the prompt for you with the help of tools like LangSmith. Instead of guessing which examples will be the most impactful, you can use tried-and-true evaluation practices to curate and compile the right examples for your pipeline. The main steps are:

1. Create a dataset
2. Pick a metric to improve
3. Create an initial system
4. Decide the update logic (few-shot examples vs. instruction teaching vs. other methods, how to format the examples, etc.)
5. Train!


Below is an example of bootstrapping a gpt-3.5-turbo model on an entailment task using few-shot examples. This example is inspired by Christopher Potts' [example](https://github.com/stanfordnlp/dspy/blob/main/examples/nli/scone/scone.ipynb) on the SCONE dataset.

The task is natural language inference, where the LLM must predict whether a statement can be logically concluded from a premise (the grounding statement).

In [None]:
%pip install -U langsmith langchain langchain_openai langchain_community langchain_anthropic trustcall rich numpy pandas

In [1]:
import os

# Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGSMITH_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGSMITH_TRACING"] = "true"
# os.environ["LANGSMITH_API_KEY"] = "YOUR API KEY"
# os.environ["OPENAI_API_KEY"] = "YOUR API KEY"

In [2]:
# Cache LLM responses in a local SQLite database so repeated identical calls are served from disk
from langchain_community.cache import SQLiteCache
from langchain_core.globals import set_llm_cache

set_llm_cache(SQLiteCache(database_path=".langchain.db"))

In [3]:
from langsmith import Client

client = Client()

public_datasets = [
    "https://smith.langchain.com/public/1d065de2-56c1-496e-bc66-bdce308e6537/d",  # train
    "https://smith.langchain.com/public/3205fa05-bd78-4eaf-924f-96df0f577b1f/d",  # train2
    "https://smith.langchain.com/public/fdf16166-1edd-418f-b777-3af82034931d/d",  # dev
    "https://smith.langchain.com/public/aee61506-3c60-4ca8-95c4-0314c9719ca8/d",  # dev2
    "https://smith.langchain.com/public/8d40d210-f8e6-4def-a206-78c5080c5d53/d",  # test
]
for ds in public_datasets:
    client.clone_public_dataset(ds)
train_name = "scone-train2"
dev_name = "scone-dev2"
test_name = "scone-test-one-scoped"
full_test_name = "scone-test"

example = next(client.list_examples(dataset_name=train_name))
print("inputs", example.inputs)
print("outputs", example.outputs)

inputs {'context': 'A man who does not walk confidently dropping produce.', 'question': 'Can we logically conclude for sure that a man who does not walk confidently dropping kale?'}
outputs {'answer': 'No', 'category': 'one_not_scoped'}


As the values above show, these examples can be tricky!

## Evaluator

Since we have ground-truth classification labels, we can use an exact-match criterion as our evaluator.

In [4]:
import sys

import langsmith as ls


def exact_match(run, example):
    # Evaluate the exact match correctness of the NLI result
    try:
        # Default path: the chain returns a dict with a Yes/No string.
        predicted = run.outputs["is_entailed"]
        expected = example.outputs["answer"]
        score = expected.lower() == predicted.lower()
    except Exception:
        try:
            # Fallback: the chain returned an object with a boolean `is_entailed`.
            expected = example.outputs["answer"]
            expected_bool = {"no": False, "yes": True}.get(expected.strip().lower())
            score = run.outputs["output"].is_entailed == expected_bool
        except Exception:
            # Anything that cannot be parsed counts as a miss.
            score = 0
    return {
        "key": "exact_match",
        "score": int(score),
    }
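
To sanity-check the scoring logic without calling LangSmith, here is a minimal sketch that mirrors only the first branch of the evaluator. The `SimpleNamespace` objects are stand-ins that model nothing but the `outputs` attribute, and `exact_match_sketch`, `stub_example`, `hit`, and `miss` are illustrative names, not part of the notebook.

```python
from types import SimpleNamespace


def exact_match_sketch(run, example):
    """Score 1 when the predicted label equals the reference label, ignoring case."""
    predicted = (run.outputs or {}).get("is_entailed", "")
    expected = example.outputs["answer"]
    score = predicted.strip().lower() == expected.strip().lower()
    return {"key": "exact_match", "score": int(score)}


# Stand-in objects (assumption): real runs and examples carry many more fields.
stub_example = SimpleNamespace(outputs={"answer": "No", "category": "one_not_scoped"})
hit = SimpleNamespace(outputs={"is_entailed": "no", "reasoning": "..."})
miss = SimpleNamespace(outputs={"is_entailed": "Yes", "reasoning": "..."})

print(exact_match_sketch(hit, stub_example))   # {'key': 'exact_match', 'score': 1}
print(exact_match_sketch(miss, stub_example))  # {'key': 'exact_match', 'score': 0}
```

Because the comparison is case-insensitive, a lowercase "no" still matches the reference label "No".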

In [131]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# The base prompt template. {context} and {question} are the input variables;
# doubled braces such as ${{context}} are rendered literally.
original_prompt_str = """You are given some context (a premise) and a question (a hypothesis). You must indicate with Yes/No answer whether we can logically conclude the hypothesis from the premise.

---

Follow the following format.

Context: ${{context}}

Question: ${{question}}

Reasoning: Let's think step by step in order to ${{produce the answer}}. We ...

Answer: Yes or No

Context: {context}

Question: {question}

Reasoning: Let's think step by step in order to"""
base_prompt = PromptTemplate.from_template(original_prompt_str)


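def parse(pred: str):
    # Split the completion into the reasoning and the final Yes/No answer.
    fnd = "\nAnswer:"
    idx = pred.find(fnd)
    if idx == -1:
        # No "Answer:" marker: keep the text as reasoning and leave the answer empty.
        return {"is_entailed": "", "reasoning": pred.strip()}
    answer = pred[idx + len(fnd) :].strip()
    return {"is_entailed": answer, "reasoning": pred[:idx].strip()}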


chain = base_prompt | ChatOpenAI(model="gpt-3.5-turbo") | StrOutputParser() | parse

In [6]:
prediction = chain.invoke(example.inputs)
prediction

{'is_entailed': 'No',
 'reasoning': 'produce the answer. We are given that the man does not walk confidently and drops produce. We are not specifically told that he drops kale. It could be any type of produce. Therefore, we cannot logically conclude for sure that he drops kale.'}
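
The parsing step is easy to get subtly wrong when the model omits the answer marker, so here is a small standalone sketch of the same idea. `split_completion` is a hypothetical helper, not the notebook's `parse`: it splits at the last `"\nAnswer:"` marker (the notebook splits at the first) and returns an empty answer when the marker is missing.

```python
def split_completion(pred: str) -> dict:
    """Split a completion into reasoning and answer at the last "Answer:" marker."""
    reasoning, marker, answer = pred.rpartition("\nAnswer:")
    if not marker:
        # No marker found: keep the text as reasoning and leave the answer empty.
        return {"is_entailed": "", "reasoning": pred.strip()}
    return {"is_entailed": answer.strip(), "reasoning": reasoning.strip()}


sample = "produce the answer. We are not told that he drops kale.\nAnswer: No"
print(split_completion(sample))
print(split_completion("The model rambled and never answered."))
```

An empty `is_entailed` never equals "Yes" or "No", so the exact-match evaluator scores such completions as misses.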

## Initial Evaluation

In [None]:
res = ls.evaluate(
    chain.invoke,
    data="scone-test2",
    evaluators=[exact_match],
    metadata={"optimizer": None},
)

The baseline scores about 55%. Definitely room for improvement.

## ✨ Optimize ✨


This just means "use data to update the system". At present, LangChain runnables don't natively support a "backwards" method (à la PyTorch), but you can pretty easily define updates/mutations for the key components you'd want to update (such as prompts or LLMs).

For instance, component-wise, you could apply:
- Few shot prompting: add an additional string input or MessagesPlaceholder in the prompt template
- Updating the instructions: update the prompt template directly (likely the system prompt)
- LLM: do a backwards pass.

We will focus on few-shot prompting to limit the search space. We will then apply a genetic/evolutionary algorithm to compare performance of different few-shot examples and pick the ones that provide the most "lift" of the provided metric.

We'll first create a constructor for our chain that accepts the prompt to use (as a template string), letting us re-create the chain with each updated state.

In [38]:
# We will define how we want our few-shot examples to be formatted
import random
from typing import List, Optional

from langchain_core.runnables import RunnableLambda


def format_example(example: dict):
    inputs = example["input"]
    outputs = example["output"]
    return f"""

Context: {inputs['context']}

Question: {inputs['question']}

Reasoning: {outputs['reasoning']}

Answer: {outputs['is_entailed']}

"""


def create_chain(prompt=None, llm=None):
    if prompt is None:
        prompt = base_prompt
    else:
        prompt = PromptTemplate.from_template(prompt)
    llm = llm or ChatOpenAI(model="gpt-3.5-turbo")
    chain = (prompt | llm | StrOutputParser() | parse).with_config(tags=["to_train"])
    return chain.invoke
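
As a rough illustration of the few-shot update mentioned above, the sketch below splices formatted examples into a template string ahead of the final `Context: {context}` slot. It uses plain `str.format` rather than `PromptTemplate`, and `escape_braces`, `render_example`, `add_few_shot`, and the toy template are all hypothetical names introduced here. The one detail worth noting is that braces inside example text have to be doubled, otherwise they would be read as template variables.

```python
def escape_braces(text: str) -> str:
    """Double any braces so the text survives a later str.format call."""
    return text.replace("{", "{{").replace("}", "}}")


def render_example(example: dict) -> str:
    """Render one example in the same Context/Question/Reasoning/Answer layout."""
    inputs, outputs = example["input"], example["output"]
    return (
        f"Context: {inputs['context']}\n\n"
        f"Question: {inputs['question']}\n\n"
        f"Reasoning: {outputs['reasoning']}\n\n"
        f"Answer: {outputs['is_entailed']}"
    )


def add_few_shot(template: str, examples: list) -> str:
    """Insert rendered examples before the final 'Context: {context}' slot."""
    head, slot, tail = template.rpartition("Context: {context}")
    if not slot:
        raise ValueError("template has no 'Context: {context}' slot")
    shots = "\n\n---\n\n".join(escape_braces(render_example(e)) for e in examples)
    return f"{head}{shots}\n\n---\n\n{slot}{tail}"


# Toy template and example (assumptions, not the notebook's data).
toy_template = "Answer Yes or No.\n\nContext: {context}\n\nQuestion: {question}\n\nReasoning:"
toy_example = {
    "input": {"context": "A man drops {produce}.", "question": "Does he drop kale?"},
    "output": {"reasoning": "Produce need not be kale.", "is_entailed": "No"},
}
few_shot_template = add_few_shot(toy_template, [toy_example])
print(few_shot_template.format(context="X", question="Y"))
```

The resulting string still exposes exactly `{context}` and `{question}`, so it could be handed to a constructor like `create_chain` unchanged.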

## Training

Next, we'll define the training utilities.

In [105]:
meta_prompt = """You are an expert prompt engineer tasked with improving prompts for various AI tasks. You will be provided with the current prompt and a set of annotated predictions made by an AI using this prompt. Your goal is to analyze the results and create an improved version of the prompt.

## Current Prompt
<current_prompt>
{current_prompt}
</current_prompt>
## Analysis
Carefully analyze the current prompt and the annotated predictions. Identify elements of the prompt that seem to lead to high-quality responses versus low-quality responses. Pay close attention to any user-provided scores, feedback, or notes.

## Brainstorming
<task>
First, state the task the original prompt is trying to perform.
</task>
<brainstorm>
Then, brainstorm what should be improved. Some example aspects you can address (depending on the task):
1. Clarity: How can you make the instructions clearer (without loss of generality)?
2. Context: What additional context is the current prompt missing that would help it avoid errors identified by the annotated feedback above?
3. Constraints: Are there any constraints that need to be added or modified?
4. Structure: How can you improve the prompt's structure or formatting?
5. Focus: What changes could help the AI focus on the most critical aspects of the task?
6. Task-specific considerations: What unique elements of this task need to be addressed?
</brainstorm>
<plan>
After brainstorming, create a plan of proposed edits. Cite specific annotated scores or notes that would be improved using the fixed instructions.
</plan>
<improved_prompt>
Finally, write the improved prompt.
</improved_prompt>

## Important Notes
1. All variables in double curly braces {{variable_name}} are placeholders for input that MUST be retained in the new version of the prompt. These represent task-critical information that the AI needs access to.
2. Ensure that your improved prompt is clear, concise, and tailored to the specific task at hand.
3. Consider the appropriate length for the expected output and provide guidance on this in your prompt if necessary.
4. If the task involves sensitive or controversial topics, include appropriate guidelines for handling such content responsibly.

Remember, your goal is to create a prompt that will consistently guide the AI to generate high-quality responses across the entire dataset (and when generalizing beyond the dataset)."""
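
The meta-prompt is itself a template, which is why the literal placeholder in its notes is written with doubled braces. The sketch below shows the escaping rule with Python's built-in `str.format`; we assume LangChain's default f-string template format follows the same rule. The shortened template text is illustrative only.

```python
# Doubled braces are emitted as single literal braces; single braces are filled in.
toy_meta_prompt = "Keep the {{variable_name}} placeholders.\nCurrent prompt: {current_prompt}"

rendered = toy_meta_prompt.format(current_prompt="Context: {context}")
print(rendered)
# Keep the {variable_name} placeholders.
# Current prompt: Context: {context}
```

Note that the substituted value is not re-parsed, so the `{context}` inside the current prompt reaches the optimizer model verbatim.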

In [106]:
from collections import defaultdict


def format_feedback(single_feedback, max_score=1):
    """
    Formats a single feedback item into a structured string.

    This function takes a feedback object and an optional maximum score,
    then formats the feedback's score (if present) and comment into a structured
    string representation. The feedback's key is used as an identifier in the
    output string.

    Parameters:
    - single_feedback (object): An object representing a single piece of feedback.
                                It must have `score`, `comment`, and `key` attributes.
    - max_score (int, optional): The maximum possible score that can be assigned to
                                 feedback. Defaults to 1.

    Returns:
    - str: A structured string representation of the feedback, including the key,
           score (if available), and comment.
    """
    if single_feedback.score is None:
        if single_feedback.value is not None:
            val = f"Value: {single_feedback.value}"
        else:
            val = ""
    else:
        val = f"\nScore:[{single_feedback.score}/{max_score}]"
    comment = (single_feedback.comment or "").strip()
    if comment:
        comment = f"\n{comment}"
    return f"""<feedback key={single_feedback.key}>{val}{comment}
</feedback>"""


def format_run_with_feedback(run, feedback, id):
    """
    Formats the output of a run along with its associated feedback into a structured string.

    This function takes a run object and a list of feedback objects associated with that run,
    then formats the run's output and the feedback into a structured string representation
    suitable for display or further processing.

    Parameters:
    - run (object): An object representing a single run. Must have `inputs` and
                    `outputs` attributes.
    - feedback (list): A list of feedback objects to be formatted and included with the run's output.
    - id (int): Index used to label the example in the output string.

    Returns:
    - str: A structured string representation of the run's output and associated feedback.
    """
    all_feedback = "\n".join([format_feedback(f) for f in feedback])
    return f"""<example id={id}>
<input>
{run.inputs}
</input>
<prediction>
{run.outputs}
</prediction>
<annotations>
{all_feedback}
</annotations>
</example>"""
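
To see roughly what the optimizer model receives for one piece of feedback, here is a compact standalone sketch of the formatting. `render_feedback` is a hypothetical re-implementation for illustration, and the `SimpleNamespace` stand-in models only the `key`, `score`, `value`, and `comment` attributes.

```python
from types import SimpleNamespace


def render_feedback(fb, max_score=1):
    """Render one feedback item as a small XML-like block."""
    if fb.score is not None:
        val = f"\nScore:[{fb.score}/{max_score}]"
    elif fb.value is not None:
        val = f"Value: {fb.value}"
    else:
        val = ""
    # Treat a missing comment as empty instead of printing the string "None".
    comment = (fb.comment or "").strip()
    if comment:
        comment = f"\n{comment}"
    return f"<feedback key={fb.key}>{val}{comment}\n</feedback>"


# Stand-in feedback object (assumption): an exact-match miss with no comment.
stub_feedback = SimpleNamespace(key="exact_match", score=0, value=None, comment=None)
print(render_feedback(stub_feedback))
# <feedback key=exact_match>
# Score:[0/1]
# </feedback>
```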

In [151]:
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from pydantic.v1 import BaseModel, Field, root_validator
from trustcall import create_extractor


def optimize(current_prompt, annotated_predictions):
    input_variables = list(PromptTemplate.from_template(current_prompt).input_variables)

    class OptimizerOutput(BaseModel):
        """Think step-by-step, then write the optimized prompt."""

        task_objective: str = Field(
            description="What task is this prompt seeking to solve? What defines success here?"
        )
        brainstorm: str = Field(
            description="At least 3 bullet points brainstorming how to improve the prompt. Can focus on logical/correctness, style, or any other qualities that are salient, given the provided annotations."
        )
        plan: str = Field(
            description="Proposed edits and citations on which feedback will be improved."
        )
        improved_prompt: str = Field(
            description=f"The full text of the optimized prompt. Ensure that all the curly bracket {{variable_name}}'s are retained in the new prompt. These are: {input_variables}."
        )

        @root_validator
        def check_prompt_strings(cls, values):
            predicted_input_variables = list(
                PromptTemplate.from_template(
                    values.get("improved_prompt") or ""
                ).input_variables
            )
            missing = set(input_variables) - set(predicted_input_variables)
            extra = set(predicted_input_variables) - set(input_variables)
            if missing or extra:
                raise ValueError(
                    f"Unexpected variables included in output prompt. Expected {input_variables}. Got: {predicted_input_variables}.\nMissing: {missing}\nExtra: {extra}"
                )
            return values

    optim_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", meta_prompt),
            (
                "user",
                """Given the following annotated/evaluated predictions, optimize the provided prompt.
    <annotated_predictions>
    {annotated_predictions}
    </annotated_predictions>
    
    Remember to first brainstorm, then plan, and finally generate the optimized prompt. Remember to retain all bracketed variable placeholders.""",
            ),
        ]
    )

    meta_optimizer = optim_prompt | create_extractor(
        ChatAnthropic(model="claude-3-5-sonnet-20240620"),
        tools=[OptimizerOutput],
        tool_choice=OptimizerOutput.__name__,
    )
    results = meta_optimizer.invoke(
        {
            "annotated_predictions": annotated_predictions,
            "current_prompt": current_prompt,
        }
    )
    return results["responses"][0]
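
The validator in `optimize` rejects any rewritten prompt that drops or adds an input variable. The same check can be sketched with the standard library alone, assuming `str.format`-style templates. `template_variables` and `check_same_variables` are hypothetical helpers, not LangChain APIs.

```python
from string import Formatter


def template_variables(template: str) -> set:
    """Return the named replacement fields in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def check_same_variables(original: str, candidate: str) -> None:
    """Raise if the candidate prompt lost or gained any input variable."""
    expected = template_variables(original)
    got = template_variables(candidate)
    missing, extra = expected - got, got - expected
    if missing or extra:
        raise ValueError(f"Missing: {sorted(missing)}; extra: {sorted(extra)}")


old = "Context: {context}\n\nQuestion: {question}\n\nFormat: ${{answer}}"
good = "Premise: {context}\nHypothesis: {question}"
bad = "Premise: {context}\nHint: {hint}"

check_same_variables(old, good)  # passes silently
try:
    check_same_variables(old, bad)
except ValueError as err:
    print(err)  # Missing: ['question']; extra: ['hint']
```

Escaped braces such as `${{answer}}` are treated as literal text, so they do not count as variables.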

In [152]:
from difflib import SequenceMatcher

from rich.console import Console
from rich.jupyter import print as richprint
from rich.panel import Panel
from rich.syntax import Syntax


def colorize_diff(diff):
    for op, i1, i2, j1, j2 in diff.get_opcodes():
        if op == "equal":
            yield diff.a[i1:i2]
        elif op == "insert":
            yield f"[green]{diff.b[j1:j2]}[/green]"
        elif op == "delete":
            yield f"[red]{diff.a[i1:i2]}[/red]"
        elif op == "replace":
            yield f"[red]{diff.a[i1:i2]}[/red][green]{diff.b[j1:j2]}[/green]"


def print_rich_diff(original, updated, title: str = ""):
    diff = SequenceMatcher(None, original, updated)
    colorized_diff = "".join(colorize_diff(diff))
    panel = Panel(
        colorized_diff, title=title or "Prompt Diff", expand=False, border_style="bold"
    )

    richprint(panel)
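
If `rich` is not available, the same opcode walk can produce a plain-text diff. This is a minimal sketch; `inline_diff` and its `[-...-]` / `{+...+}` markers are illustrative choices, not part of the notebook.

```python
from difflib import SequenceMatcher


def inline_diff(original: str, updated: str) -> str:
    """Render a character-level diff with [-deleted-] and {+inserted+} markers."""
    matcher = SequenceMatcher(None, original, updated)
    parts = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            parts.append(original[i1:i2])
        if op in ("delete", "replace"):
            parts.append(f"[-{original[i1:i2]}-]")
        if op in ("insert", "replace"):
            parts.append(f"{{+{updated[j1:j2]}+}}")
    return "".join(parts)


print(inline_diff("cat", "cart"))  # ca{+r+}t
```

Each opcode covers a slice of the original (`i1:i2`) and a slice of the updated text (`j1:j2`), so a `replace` emits both a deletion and an insertion.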

In [153]:
import itertools

import numpy as np


def step(
    construct_chain,
    prompt: str,
    train_examples,
    evaluators,
    step_idx,
) -> str:
    # TODO: Batching to speed it up
    chain = construct_chain(prompt)
    results = ls.evaluate(
        chain, data=train_examples[:15], evaluators=evaluators, blocking=False
    )
    formatted = []
    for idx, res in enumerate(results):
        formatted.append(
            format_run_with_feedback(
                res["run"], res.get("evaluation_results", {}).get("results") or [], idx
            )
        )
    updated = optimize(
        current_prompt=prompt, annotated_predictions="\n".join(formatted)
    )
    new_prompt = updated.improved_prompt
    print_rich_diff(prompt or "", new_prompt, f"Prompt diff at step {step_idx}")
    return new_prompt


def eval(eval_dataset, system, evaluators, step_n) -> float:
    """Compute the metrics on the validation dataset."""
    dev_results = ls.evaluate(
        system,
        # TODO: do whole
        data=itertools.islice(
            ls.Client().list_examples(dataset_name=eval_dataset), 0, 15
        ),
        evaluators=evaluators,
        metadata={
            "step": step_n,
        },
    )
    scores = []
    for res in dev_results:
        scores.append(res["evaluation_results"]["results"][0].score)
    # Assumes a single metric for now
    return np.mean(scores)


def train(
    chain_constructor,
    original,
    train_dataset,
    eval_dataset,
    evaluators,
    steps: int = 5,
):
    """Run the full training loop"""
    best_score = eval(eval_dataset, chain_constructor(original), evaluators, 0)
    best_step = 0
    scores = [(best_score, original)]
    train_examples = list(ls.Client().list_examples(dataset_name=train_dataset))
    updated = original
    for step_number in range(steps):
        updated = step(
            chain_constructor, updated, train_examples, evaluators, step_idx=step_number
        )
        updated_chain = chain_constructor(updated)
        updated_score = eval(eval_dataset, updated_chain, evaluators, step_number + 1)
        scores.append((updated_score, updated))

        if updated_score > best_score:
            print(
                f"New best score {updated_score} > {best_score}. Updating the best prompt."
            )
            best_score = updated_score
            best_step = step_number + 1
        else:
            print(f"No improvement ({updated_score} <= {best_score}). Continuing")
    print("Best overall score: ", best_score)
    print("Best step: ", best_step)
    return sorted(scores, key=lambda x: x[0], reverse=True)
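
Stripped of the LangSmith calls, the loop above is a simple propose-and-score search. The toy sketch below shows that control flow with deterministic stand-ins; `hill_climb`, `toy_propose`, and `toy_score` are hypothetical, with `toy_propose` playing the role of `step` and `toy_score` the role of `eval`. As in `train`, each proposal builds on the latest candidate rather than on the best one so far.

```python
def hill_climb(initial, propose, score, steps=5):
    """Propose a new candidate each step and remember the best-scoring one."""
    best, best_score = initial, score(initial)
    current = initial
    history = [(best_score, initial)]
    for _ in range(steps):
        current = propose(current)  # always mutate the latest candidate
        current_score = score(current)
        history.append((current_score, current))
        if current_score > best_score:
            best, best_score = current, current_score
    return best, best_score, history


# Toy stand-ins: a "prompt" is a string, and the score rewards lengths near 12.
def toy_propose(prompt):
    return prompt + "!"


def toy_score(prompt):
    return -abs(len(prompt) - 12)


best, best_score, history = hill_climb("Answer Yes", toy_propose, toy_score, steps=5)
print(best, best_score)  # Answer Yes!! 0
```

Keeping the full history, as `train` does with `scores`, makes it possible to recover the best candidate even when later steps regress.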

#### Train

Now we can finally run the training loop!

In [None]:
import functools

# We will train with gpt-4o
llm = ChatOpenAI(model="gpt-4o")
all_scores = train(
    functools.partial(create_chain, llm=llm),
    original_prompt_str,
    train_name,
    dev_name,
    [exact_match],
    steps=5,
)

View the evaluation results for experiment: 'elderly-payment-30' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=7ad95300-319a-4f8a-8a56-bb9f2ace11aa





View the evaluation results for experiment: 'ample-nail-98' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/a1613752-8df2-4e05-be08-9af1a2b55f60/compare?selectedSessions=13926443-ff28-4627-8e0b-ebbc7da41011





View the evaluation results for experiment: 'puzzled-knife-28' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=a9ba2210-4a71-4629-8ede-bad8566dfdd7





View the evaluation results for experiment: 'tart-science-50' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/a1613752-8df2-4e05-be08-9af1a2b55f60/compare?selectedSessions=7329ddf9-d492-4bdc-8068-bb77c0063d21





## Compare on held-out set

It's easy to overfit a single benchmark if you explicitly choose your pipeline based on metrics on that benchmark.

Let's compare models on an unseen test set to see whether the selected examples are reliably better.

In [None]:
# best_score, best_examples = all_scores[0]

In [None]:
# original_model = create_chain()
# # This time we will apply gpt-3.5-turbo, but use the few-shot examples + reasoning trajectories
# # from gpt-4 to help induce better performance
# best_performing_model = create_chain(best_examples)

In [None]:
# for model_name, model in [
#     ("optimized", best_performing_model),
#     # ("original", original_model),
# ]:
#     client.run_on_dataset(
#         dataset_name=test_name,
#         llm_or_chain_factory=model,
#         evaluation=eval_config,
#         verbose=True,
#         project_metadata={
#             "model": model_name,
#         },
#     )

Using the GPT-4 generated examples, we were able to boost the performance from ~0.54 to ~0.87: not bad!