## Optimization with LangSmith

Prompt engineering isn't always the most fun. You can use data to optimize the prompt for you with the help of tools like LangSmith. Main steps are:
1. Create a dataset
2. Pick a metric to improve
3. Create an initial chain
4. Decide the update logic (few-shot examples vs. instruction teaching vs. other methods, how to format the examples, etc.)
5. Train!


Below is an example bootstrapping a gpt-3.5-turbo model on an entailment task using few-shot examples.

In [2]:
# %pip install -U langsmith langchain_openai pandas

In [None]:
import os

# Update with your API URL if using a hosted instance of Langsmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY"  # Update with your API key
# We are using openai here as well
# os.environ["OPENAI_API_KEY"] = "YOUR API KEY"

In [4]:
from langsmith import Client

client = Client()

In [5]:
# TODO: Add clone steps
public_datasets = [
    "https://smith.langchain.com/public/1d065de2-56c1-496e-bc66-bdce308e6537/d",  # train
    "https://smith.langchain.com/public/fdf16166-1edd-418f-b777-3af82034931d/d",  # dev
    "https://smith.langchain.com/public/8d40d210-f8e6-4def-a206-78c5080c5d53/d",  # test
]
for ds in public_datasets:
    client.clone_public_dataset(ds)
train_name = "scone-train"
dev_name = "scone-dev"
test_name = "scone-test-one-scoped"
full_test_name = "scone-test"

## Evaluator

In [6]:
example = next(client.list_examples(dataset_name=train_name))
print(example.inputs)
print(example.outputs)

{'context': 'A man who does not walk confidently dropping produce.', 'question': 'Can we logically conclude for sure that a man who does not walk confidently dropping kale?'}
{'answer': 'No', 'category': 'one_not_scoped'}


In [7]:
from langsmith.evaluation import run_evaluator


@run_evaluator
def exact_match(run, example):
    predicted = run.outputs["output"]
    expected = example.outputs["answer"]
    expected_bool = {"no": False, "yes": True}.get(expected.strip().lower())
    score = expected_bool == predicted.is_entailed
    return {
        "key": "exact_match",
        "score": int(score),
        "comment": f"predicted={predicted}\nexpected={expected}={expected_bool}",
    }

In [10]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import ChatOpenAI


class EntailmentOutput(BaseModel):
    reasoning: str = Field(
        description="Think step-by-step to avoid any logical errors in your decision"
    )
    is_entailed: bool


prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a logical expert in predicting entailment.{examples}"),
        (
            "user",
            "Can you logically conclude the hypothesis given the premise?\n"
            "Hypothesis: {question}\n"
            "Premise: {context}\n",
        ),
    ]
).partial(examples="")
chain = prompt | ChatOpenAI(model="gpt-3.5-turbo").with_structured_output(
    schema=EntailmentOutput
)

  warn_beta(


In [11]:
prediction = chain.invoke(example.inputs)
prediction

EntailmentOutput(reasoning='If a man drops produce and kale is a type of produce, then it can be logically concluded that a man who does not walk confidently drops kale.', is_entailed=True)

In [12]:
example.outputs

{'answer': 'No', 'category': 'one_not_scoped'}

## Initial Evaluation

In [13]:
from langchain.smith import RunEvalConfig

eval_config = RunEvalConfig(
    custom_evaluators=[exact_match],
)

res = client.run_on_dataset(
    dataset_name=dev_name,
    llm_or_chain_factory=chain,
    evaluation=eval_config,
    project_metadata={"optimizer": None},
)

View the evaluation results for project 'advanced-page-6' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/1bba4039-6f9c-4309-b09b-04453af5edc9/compare?selectedSessions=6d52ffb4-f7f5-4a65-896e-c35566b67acf

View all tests for Dataset scone-dev at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/1bba4039-6f9c-4309-b09b-04453af5edc9
[------------------------------------------------->] 50/50

Got about ~55% on it. Whoopy.

## ✨ Optimize ✨


This just means "use data to update system and improve metric". LangChain runnables don't natively support a "backwards" method (a la pytorch), but you can pretty easily define updates/mutations for most of them.

For instance:
- Few shot prompting: add an additional string input or MessagesPlaceholder in the prompt template
- Updating the Instructions: just update the prompt template directly (likely the system prompt)
- etc.

If you had a completely unconstrained search space, it'll be an expensive and hard-to tune system. (e.g., NAS hasn't been that successfull in the industry).


Projects like DSPy have a bunch of off-the-shelf optimizers that encapsulate logic for mutating the model based on metrics. For instance the `BootstrapFewShotWithRandomSearch` does the following in order:

1. Zero shot eval - This is really a No-Op in their code.
2. Labeled few shot randomly sample K from the training set
3. Bootstrapped few shot - go through training examples, predict with the base model, if it gets it right, add it to the few-shot pool
(repeat (3) for N candidate programs)


You can configure additional branching in (3) and (2). It's pretty similar to a genetic algorithm.

In [19]:
# We will define how we want our few-shot examples to be formatted
import random
from typing import List, Optional

from langchain_core.runnables import RunnableLambda


def format_example(example: dict):
    inputs = example["input"]
    outputs = example["output"]
    return f"Hypothesis: {inputs['question']}\nPremise: {inputs['context']}\nAnswer: {outputs}"


def format_few_shot(input_: dict, examples: Optional[List[dict]] = None):
    if examples:
        # TODO: make this configurable / bound to the prompt template
        input_["examples"] = "\n\n## Examples\n" + "\n".join(
            f"{i}: {format_example(e)}" for i, e in enumerate(examples)
        )
    return input_


# And we will create a placeholder in the template to add few-shot examples
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a logical expert in predicting entailment.{examples}"),
        (
            "user",
            "Can you logically conclude the hypothesis given the premise?\n"
            "Hypothesis: {question}\n"
            "Premise: {context}\n",
        ),
    ]
).partial(examples="")


def create_chain(examples: Optional[List] = None):
    chain = (
        RunnableLambda(format_few_shot).bind(examples=examples)
        | prompt
        | ChatOpenAI(model="gpt-3.5-turbo").with_structured_output(
            schema=EntailmentOutput
        )
    ).with_config(tags=["to_train"])
    return chain

In [22]:
def step(
    chain,
    train_dataset,
    eval_dataset,
    eval_config,
    examples=None,
    k: int = 5,
):
    collected = examples.copy() if examples else {}
    train_results = client.run_on_dataset(
        dataset_name=train_dataset,
        llm_or_chain_factory=chain,
        evaluation=eval_config,
    )
    # Or could query langsmith, but there's a lag there.
    df = train_results.to_dataframe()
    feedback_keys = [c for c in df.columns if c.startswith("feedback.")]
    passing = df[
        df.apply(lambda x: bool(x[feedback_keys].values.all()), axis=1)
    ].index.tolist()
    for passing_idx in passing:
        if passing_idx not in collected:
            collected[passing_idx] = train_results["results"][passing_idx]

    return collected


def eval(eval_dataset, chain, eval_config, step_n) -> float:
    dev_results = client.run_on_dataset(
        dataset_name=eval_dataset,
        llm_or_chain_factory=chain,
        evaluation=eval_config,
        verbose=True,
        project_metadata={
            "step": step_n,
        },
    )
    df = dev_results.to_dataframe()
    feedback_key = [c for c in df.columns if c.startswith("feedback.")][0]
    # Assume single metric rn ha
    return df[feedback_key].mean()


def train(
    chain_constructor,
    train_dataset,
    eval_dataset,
    eval_config,
    steps: int = 4,
    k: int = 5,
):
    best_score = eval(eval_dataset, chain_constructor(), eval_config, 0)
    best_step = 0
    examples = None
    for step_number in range(steps):
        chain = chain_constructor(examples)
        collected = step(chain, train_dataset, eval_dataset, eval_config, k=k)
        foo = collected
        # TODO: probably want some diversity of labels here lol
        selected = random.sample(sorted(collected), k)
        sampled_examples = {k: collected[k] for k in selected}
        selected_examples = list(selected_examples.values())
        updated_chain = chain_constructor(examples=selected_examples)
        updated_score = eval(eval_dataset, updated_chain, eval_config, step_number + 1)
        if updated_score > best_score:
            print(
                f"New best score {updated_score} > {best_score}. Updating selected examples."
            )
            examples = selected_examples
            best_step = step_number + 1
        else:
            print("Underperformed. Continuing")
    print("Best overall score: ", best_score)
    print("Best step: ", best_step)
    return examples

In [None]:
examples = train(create_chain, train_name, dev_name, eval_config)

View the evaluation results for project 'kind-wall-40' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/1bba4039-6f9c-4309-b09b-04453af5edc9/compare?selectedSessions=249f8b54-4488-4435-b288-a73da66e6ae3

View all tests for Dataset scone-dev at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/1bba4039-6f9c-4309-b09b-04453af5edc9
[------------------------------------------------->] 50/50

Unnamed: 0,feedback.exact_match,error,execution_time,run_id
count,50.0,0.0,50.0,50
unique,,0.0,,50
top,,,,f91c0f8c-d27a-49e4-92c6-0ea20ecd6d07
freq,,,,1
mean,0.54,,1.674155,
std,0.503457,,0.493475,
min,0.0,,1.073126,
25%,0.0,,1.376178,
50%,1.0,,1.572887,
75%,1.0,,1.894788,


View the evaluation results for project 'helpful-potato-26' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/c4a2ec92-6e2c-4004-af24-7654f507a073/compare?selectedSessions=ca3f35e2-2450-4dde-8da3-c77e5c0abc28

View all tests for Dataset scone-train at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/c4a2ec92-6e2c-4004-af24-7654f507a073
[------------------------------------------------->] 200/200View the evaluation results for project 'helpful-cushion-72' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/1bba4039-6f9c-4309-b09b-04453af5edc9/compare?selectedSessions=9fc32a64-7fd9-4bcc-9f8c-03269342270f

View all tests for Dataset scone-dev at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/1bba4039-6f9c-4309-b09b-04453af5edc9
[------------------------------------------------->] 50/50

Unnamed: 0,feedback.exact_match,error,execution_time,run_id
count,50.0,0.0,50.0,50
unique,,0.0,,50
top,,,,f0c09a8f-5ace-4514-82ba-7b8714a29216
freq,,,,1
mean,0.58,,1.758317,
std,0.498569,,0.480888,
min,0.0,,1.119977,
25%,0.0,,1.47187,
50%,1.0,,1.660316,
75%,1.0,,1.890199,


New best score. Updating selected examples.
View the evaluation results for project 'memorable-place-30' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/c4a2ec92-6e2c-4004-af24-7654f507a073/compare?selectedSessions=bd87407c-c890-4574-a21b-819ee5cb2df6

View all tests for Dataset scone-train at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/c4a2ec92-6e2c-4004-af24-7654f507a073
[------------------------------------------------->] 200/200View the evaluation results for project 'bold-scarecrow-92' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/1bba4039-6f9c-4309-b09b-04453af5edc9/compare?selectedSessions=799181f4-892f-4910-980d-3fafe05ec092

View all tests for Dataset scone-dev at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/1bba4039-6f9c-4309-b09b-04453af5edc9
[------------------------------------------------->] 50/50

Unnamed: 0,feedback.exact_match,error,execution_time,run_id
count,50.0,0.0,50.0,50
unique,,0.0,,50
top,,,,2a42e116-6ada-4762-8215-4ae920a23773
freq,,,,1
mean,0.56,,1.858655,
std,0.501427,,0.503588,
min,0.0,,1.165229,
25%,0.0,,1.510991,
50%,1.0,,1.818283,
75%,1.0,,2.103101,


New best score. Updating selected examples.
View the evaluation results for project 'warm-sofa-98' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/c4a2ec92-6e2c-4004-af24-7654f507a073/compare?selectedSessions=8a4c0794-1b74-4265-8bd4-3d3fbc07e0ad

View all tests for Dataset scone-train at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/c4a2ec92-6e2c-4004-af24-7654f507a073
[------------------------------------------------->] 199/200

## Compare on held-out set

It's easy to overfit a benchmark if you do model selection on it. Let's compare models on the test set we had held-out before.

In [None]:
original_model = create_chain()
best_performing_model = create_chain(examples)

In [None]:
for model_name, model in [
    ("original", original_model),
    ("optimized", best_performing_model),
]:
    client.run_on_dataset(
        dataset_name=test_dataset,
        llm_or_chain_factory=model,
        evaluation=eval_config,
        verbose=True,
        project_metadata={
            "model": model_name,
        },
    )