## Optimization with LangSmith

Prompt engineering isn't always fun. With tools like LangSmith, you can use data to optimize the prompt for you. The main steps are:
1. Create a dataset
2. Pick a metric to improve
3. Create an initial chain
4. Decide the update logic (few-shot examples vs. instruction teaching vs. other methods, how to format the examples, etc.)
5. Train!


Below is an example bootstrapping a gpt-3.5-turbo model on an entailment task using few-shot examples.

In [None]:
# %pip install -U langsmith langchain_openai pandas

In [1]:
import os

# Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY"  # Update with your API key
# We are using openai here as well
os.environ["OPENAI_API_KEY"] = "YOUR API KEY"

In [2]:
# # Optional: cache LLM calls in SQLite so repeated requests are not re-sent
# from langchain.cache import SQLiteCache
# from langchain_core.globals import set_llm_cache
# 
# set_llm_cache(SQLiteCache(database_path=".langchain.db"))

In [3]:
from langsmith import Client

client = Client()

In [4]:
# Clone the public train/dev/test datasets into your LangSmith workspace
public_datasets = [
    "https://smith.langchain.com/public/1d065de2-56c1-496e-bc66-bdce308e6537/d",  # train
    "https://smith.langchain.com/public/fdf16166-1edd-418f-b777-3af82034931d/d",  # dev
    "https://smith.langchain.com/public/8d40d210-f8e6-4def-a206-78c5080c5d53/d",  # test
]
for ds in public_datasets:
    client.clone_public_dataset(ds)
# train_name = "scone-train"
train_name = "scone-train2"
# dev_name = "scone-dev"
dev_name = "scone-dev2"
test_name = "scone-test-one-scoped"
full_test_name = "scone-test"

## Evaluator

In [5]:
example = next(client.list_examples(dataset_name=train_name))
print(example.inputs)
print(example.outputs)

{'context': 'A man who does not walk confidently dropping produce.', 'question': 'Can we logically conclude for sure that a man who does not walk confidently dropping kale?'}
{'answer': 'No', 'category': 'one_not_scoped'}


In [6]:
from langsmith.evaluation import run_evaluator


@run_evaluator
def exact_match(run, example):
    """Score 1 if the predicted Yes/No label matches the labeled answer, else 0."""
    try:
        # Default path: the chain returns a dict with a string "is_entailed" field.
        predicted = run.outputs["is_entailed"]
        expected = example.outputs["answer"]
        score = expected.lower() == predicted.lower()
    except Exception as e:
        try:
            # Fallback: a structured output object with a boolean `is_entailed`.
            expected = example.outputs["answer"]
            expected_bool = {"no": False, "yes": True}.get(expected.strip().lower())
            score = run.outputs["output"].is_entailed == expected_bool
        except Exception as e2:
            # Neither output shape matched: log both errors and count a miss.
            print("ERROR", e, e2, run.outputs)
            score = 0
    return {
        "key": "exact_match",
        "score": int(score),
    }
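
To see what the evaluator counts as a match, the comparison can be exercised without LangSmith. The sketch below mirrors only the string branch; `score_exact_match` and the `SimpleNamespace` stand-ins for `run` and `example` are illustrative names, not part of the LangSmith API.

```python
from types import SimpleNamespace


def score_exact_match(run, example) -> int:
    """Case-insensitive comparison of the predicted and labeled answers."""
    predicted = run.outputs["is_entailed"]
    expected = example.outputs["answer"]
    return int(expected.lower() == predicted.lower())


# Hypothetical stand-ins that only carry an `outputs` dict.
fake_example = SimpleNamespace(outputs={"answer": "No", "category": "one_not_scoped"})
hit = SimpleNamespace(outputs={"is_entailed": "no"})
miss = SimpleNamespace(outputs={"is_entailed": "Yes"})
punctuated = SimpleNamespace(outputs={"is_entailed": "No."})

for name, run in [("hit", hit), ("miss", miss), ("punctuated", punctuated)]:
    print(name, score_exact_match(run, fake_example))
```

The `punctuated` case scores as a miss, so the metric only works if the chain returns a bare `Yes` or `No`.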

In [10]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# And we will create a placeholder in the template to add few-shot examples
prompt = PromptTemplate.from_template(
    """You are given some context (a premise) and a question (a hypothesis). You must indicate with Yes/No answer whether we can logically conclude the hypothesis from the premise.

---

Follow the following format.

Context: ${{context}}

Question: ${{question}}

Reasoning: Let's think step by step in order to ${{produce the answer}}. We ...

Answer: Yes or No

---{examples}

Context: {context}

Question: {question}

Reasoning: Let's think step by step in order to"""
).partial(examples="")


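def parse(pred: str):
    """Split the completion into the reasoning and the final Yes/No answer."""
    fnd = "\nAnswer:"
    idx = pred.find(fnd)
    if idx == -1:
        # No answer marker: keep the text as reasoning and leave the answer empty.
        return {"is_entailed": "", "reasoning": pred.strip()}
    answer = pred[idx + len(fnd) :].strip()
    return {"is_entailed": answer, "reasoning": pred[:idx].strip()}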


chain = prompt | ChatOpenAI(model="gpt-3.5-turbo") | StrOutputParser() | parse

In [11]:
prediction = chain.invoke(example.inputs)
prediction

{'is_entailed': 'No',
 'reasoning': 'produce the answer. We know that dropping produce is a general behavior that can happen to anyone, regardless of their confidence while walking. Therefore, we cannot logically conclude for sure that a man who does not walk confidently dropping kale.'}
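
The prompt ends mid-sentence ("in order to"), so the completion begins with the rest of that sentence and ends with an `Answer:` line. That is why the reasoning above starts with "produce the answer". A minimal stand-alone sketch of the split, assuming the model follows the format (`split_completion` and the sample string are made up for illustration):

```python
def split_completion(completion: str, marker: str = "\nAnswer:") -> dict:
    """Split a completion into reasoning and answer at the first marker."""
    reasoning, found, answer = completion.partition(marker)
    if not found:
        # No marker: keep the text as reasoning and leave the answer empty.
        return {"is_entailed": "", "reasoning": completion.strip()}
    return {"is_entailed": answer.strip(), "reasoning": reasoning.strip()}


sample = "produce the answer. Kale is produce, but not all produce is kale.\n\nAnswer: No"
result = split_completion(sample)
print(result["is_entailed"])
print(result["reasoning"])
```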

## Initial Evaluation

In [12]:
from langchain.smith import RunEvalConfig

eval_config = RunEvalConfig(
    custom_evaluators=[exact_match],
)

In [None]:
res = client.run_on_dataset(
    dataset_name="scone-test2",  # dev_name,
    llm_or_chain_factory=chain,
    evaluation=eval_config,
    project_metadata={"optimizer": None},
)

This gets about 55% exact match. Whoopee.

## ✨ Optimize ✨


This just means "use data to update the system and improve the metric". LangChain runnables don't natively support a "backward" method (à la PyTorch), but you can pretty easily define updates/mutations for most of them.

For instance:
- Few-shot prompting: add an extra string input or a `MessagesPlaceholder` to the prompt template
- Updating the instructions: just update the prompt template directly (likely the system prompt)
- etc.
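
One way to picture these updates is to treat the prompt as a small immutable state and each update as a function that returns a new state. The sketch below uses plain Python rather than LangChain classes; `PromptState`, `with_examples`, and `with_instructions` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class PromptState:
    instructions: str
    examples: Tuple[str, ...] = ()

    def render(self, context: str, question: str) -> str:
        # Each few-shot example is followed by a separator line.
        shots = "".join(f"\n\n{shot}\n\n---" for shot in self.examples)
        return (
            f"{self.instructions}\n\n---{shots}\n\n"
            f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"
        )


def with_examples(state: PromptState, *shots: str) -> PromptState:
    """Few-shot update: append formatted examples, leave instructions alone."""
    return replace(state, examples=state.examples + shots)


def with_instructions(state: PromptState, instructions: str) -> PromptState:
    """Instruction update: swap the instruction text, keep the examples."""
    return replace(state, instructions=instructions)


base = PromptState("Answer Yes or No: does the context entail the question?")
few_shot = with_examples(
    base, "Context: A dog runs.\n\nQuestion: Does an animal run?\n\nAnswer: Yes"
)
print(few_shot.render("A man drops produce.", "Did he drop kale?"))
```

Because every update returns a new state, earlier candidates stay available for comparison, which is what a search over prompts needs.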

If you had a completely unconstrained search space, it would be an expensive and hard-to-tune system (e.g., neural architecture search (NAS) hasn't been that successful in industry).


Projects like DSPy have a bunch of off-the-shelf optimizers that encapsulate logic for mutating the model based on metrics. For instance, `BootstrapFewShotWithRandomSearch` does the following, in order:

1. Zero-shot eval. This is really a no-op in their code.
2. Labeled few-shot: randomly sample K examples from the training set.
3. Bootstrapped few-shot: go through the training examples, predict with the base model, and add each example it gets right to the few-shot pool. Repeat this step for N candidate programs.


You can configure additional branching in (3) and (2). It's pretty similar to a genetic algorithm.
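
The three stages above can be sketched end to end with a toy stand-in for the model. This is not DSPy's implementation: `toy_model` is a hypothetical keyword heuristic that nearby demos can override, and the six training and four dev questions are made up.

```python
import random
from typing import Dict, List

Example = Dict[str, str]  # {"question": ..., "answer": "Yes" | "No"}


def toy_model(question: str, demos: List[Example]) -> str:
    """Hypothetical LLM stand-in: guess from the word "not", unless a demo
    sharing at least two words with the question suggests otherwise."""
    words = set(question.split())
    guess = "No" if "not" in words else "Yes"
    if not demos:
        return guess
    closest = max(demos, key=lambda d: len(words & set(d["question"].split())))
    shared = len(words & set(closest["question"].split()))
    return closest["answer"] if shared >= 2 else guess


def accuracy(demos: List[Example], dataset: List[Example]) -> float:
    hits = sum(toy_model(ex["question"], demos) == ex["answer"] for ex in dataset)
    return hits / len(dataset)


def bootstrap(train: List[Example], k: int, rng: random.Random) -> List[Example]:
    """Keep up to k shuffled training examples the base (zero-shot) model gets right."""
    pool: List[Example] = []
    for ex in rng.sample(train, len(train)):
        if len(pool) >= k:
            break
        if toy_model(ex["question"], []) == ex["answer"]:
            pool.append(ex)
    return pool


def random_search(train, dev, k=3, n_candidates=4, seed=0):
    rng = random.Random(seed)
    candidates = [[]]                                   # 1. zero-shot
    candidates.append(rng.sample(train, k))             # 2. labeled few-shot
    candidates += [bootstrap(train, k, rng) for _ in range(n_candidates)]  # 3.
    scored = [(accuracy(demos, dev), demos) for demos in candidates]
    return max(scored, key=lambda pair: pair[0])        # best candidate on dev


train = [
    {"question": "not walking implies not moving", "answer": "No"},
    {"question": "walking implies moving", "answer": "Yes"},
    {"question": "not moving implies not walking", "answer": "Yes"},
    {"question": "moving implies walking", "answer": "No"},
    {"question": "not eating kale implies not eating produce", "answer": "No"},
    {"question": "eating kale implies eating produce", "answer": "Yes"},
]
dev = [
    {"question": "not eating produce implies not eating kale", "answer": "Yes"},
    {"question": "eating produce implies eating kale", "answer": "No"},
    {"question": "running implies moving", "answer": "Yes"},
    {"question": "not running implies not moving", "answer": "No"},
]

best_score, best_demos = random_search(train, dev)
print(best_score, [d["question"] for d in best_demos])
```

Since the zero-shot candidate is always in the pool, the selected score can never fall below the zero-shot score on the dev set. It can still overfit that set, which is why a separate test set is needed.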

In [25]:
# We will define how we want our few-shot examples to be formatted
import random
from typing import List, Optional

from langchain_core.runnables import RunnableLambda


def format_example(example: dict):
    inputs = example["input"]
    outputs = example["output"]
    return f"""

Context: {inputs['context']}

Question: {inputs['question']}

Reasoning: {outputs['reasoning']}

Answer: {outputs['is_entailed']}

"""


def format_few_shot(input_: dict, examples: Optional[List[dict]] = None):
    if examples:
        # TODO: make this configurable / bound to the prompt template
        # Return a new dict so the caller's input is not mutated.
        return {
            **input_,
            "examples": "--".join(format_example(e) for e in examples) + "--",
        }
    return input_


def create_chain(examples: Optional[List] = None, llm=None):
    llm = llm or ChatOpenAI(model="gpt-3.5-turbo")
    chain = (
        RunnableLambda(format_few_shot).bind(examples=examples)
        | prompt
        | llm
        | StrOutputParser()
        | parse
    ).with_config(tags=["to_train"])
    return chain

In [20]:
from langchain_core.tracers.context import collect_runs
from tqdm.auto import tqdm


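def step(
    construct_chain,
    train_examples,
    eval_config,
    examples=None,
    bootstrap_k: int = 8,
):
    """Collect up to `bootstrap_k` training examples that the chain answers correctly."""
    collected = examples.copy() if examples else []
    # Copy before shuffling so the caller's list is left untouched.
    train_examples = train_examples.copy()
    random.shuffle(train_examples)
    evaluator = eval_config.custom_evaluators[0]
    # TODO: Batching to speed it up
    while train_examples and len(collected) < bootstrap_k:
        # Never pop more examples than remain in the training pool.
        n_needed = min(bootstrap_k - len(collected), len(train_examples))
        train_batch = [train_examples.pop() for _ in range(n_needed)]
        batch_ids = {e.id for e in train_batch}
        # Don't show the chain an example it is about to be scored on.
        chain = construct_chain([e for e in collected if e["id"] not in batch_ids])
        with collect_runs() as cb:
            chain.batch([e.inputs for e in train_batch])
        # NOTE: this pairing assumes `cb.traced_runs` is in the same order as `train_batch`.
        for run, example in zip(cb.traced_runs, train_batch):
            metric = evaluator.evaluate_run(run, example)
            score = metric.score
            # Check if success
            if score:
                collected.append(
                    {
                        "input": example.inputs,
                        "output": run.outputs,
                        "id": example.id,
                    }
                )
    return collected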


def eval(eval_dataset, chain, eval_config, step_n) -> float:
    dev_results = client.run_on_dataset(
        dataset_name=eval_dataset,
        llm_or_chain_factory=chain,
        evaluation=eval_config,
        verbose=True,
        concurrency_level=8,
        project_metadata={
            "step": step_n,
        },
    )
    df = dev_results.to_dataframe()
    feedback_key = [c for c in df.columns if c.startswith("feedback.")][0]
    # Assume single metric rn ha
    return df[feedback_key].mean()


def train(
    chain_constructor,
    train_dataset,
    eval_dataset,
    eval_config,
    steps: int = 5,
    k: int = 8,
    bootstrap_k: int = 8,
):
    best_score = eval(eval_dataset, chain_constructor(), eval_config, 0)
    best_step = 0
    scores = [(best_score, [])]
    train_examples = list(client.list_examples(dataset_name=train_dataset))
    for step_number in range(steps):
        collected = step(
            chain_constructor, train_examples, eval_config, bootstrap_k=bootstrap_k
        )
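        if len(collected) < k:
            # TODO: probably want some diversity of labels here lol
            to_sample = min(k - len(collected), len(train_examples))
            # Pad with labeled examples, converted to the dict shape that
            # `format_example` expects. Labels carry no reasoning, so it is left empty.
            collected += [
                {
                    "input": e.inputs,
                    "output": {"reasoning": "", "is_entailed": e.outputs["answer"]},
                    "id": e.id,
                }
                for e in random.sample(train_examples, to_sample)
            ]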
        selected_examples = collected
        updated_chain = chain_constructor(examples=selected_examples)
        updated_score = eval(eval_dataset, updated_chain, eval_config, step_number + 1)
        scores.append((updated_score, selected_examples))

        if updated_score > best_score:
            print(
                f"New best score {updated_score} > {best_score}. Updating selected examples."
            )
            best_score = updated_score
            examples = selected_examples
            best_step = step_number + 1
        else:
            print("Underperformed. Continuing")
    print("Best overall score: ", best_score)
    print("Best step: ", best_step)
    return sorted(scores, key=lambda x: x[0], reverse=True)

In [21]:
import functools

llm = ChatOpenAI(model="gpt-4-turbo-preview")
all_scores = train(
    functools.partial(create_chain, llm=llm),
    train_name,
    dev_name,
    eval_config,
    steps=10,
)

View the evaluation results for project 'essential-point-83' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=41bfb05f-d8b7-427f-b374-21a74e654ed5

View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50

Unnamed: 0,feedback.exact_match,error,execution_time,run_id
count,50.0,0.0,50.0,50
unique,,0.0,,50
top,,,,974da439-2784-49d7-8ea6-507d45f420f5
freq,,,,1
mean,0.84,,0.040363,
std,0.370328,,0.052331,
min,0.0,,0.008406,
25%,1.0,,0.014565,
50%,1.0,,0.019673,
75%,1.0,,0.027574,


View the evaluation results for project 'excellent-man-73' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=7ac05947-8a77-42ac-beda-714e90d5741e

View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50

Unnamed: 0,feedback.exact_match,error,execution_time,run_id
count,50.0,0.0,50.0,50
unique,,0.0,,50
top,,,,da5103cf-c1da-4c09-a938-297bd7dd7408
freq,,,,1
mean,0.82,,5.961965,
std,0.388088,,1.693766,
min,0.0,,2.969188,
25%,1.0,,4.784548,
50%,1.0,,5.662,
75%,1.0,,6.719448,


Underperformed. Continuing
View the evaluation results for project 'worthwhile-table-54' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=e8af6293-068d-4ca7-a1c0-7b6d739b96ea

View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50

Unnamed: 0,feedback.exact_match,error,execution_time,run_id
count,50.0,0.0,50.0,50
unique,,0.0,,50
top,,,,b2f1cdb8-1a7a-4df2-82b6-9b3246cd7451
freq,,,,1
mean,0.88,,5.97616,
std,0.328261,,2.184316,
min,0.0,,3.297546,
25%,1.0,,4.178783,
50%,1.0,,5.525802,
75%,1.0,,7.423345,


New best score 0.88 > 0.84. Updating selected examples.
View the evaluation results for project 'glossy-click-58' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=624606d3-6a74-464a-b15b-763ca50f4d35

View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50

Unnamed: 0,feedback.exact_match,error,execution_time,run_id
count,50.0,0.0,50.0,50
unique,,0.0,,50
top,,,,474256cf-ace2-4510-ab1a-6fdaeaae80f3
freq,,,,1
mean,0.8,,23.840804,
std,0.404061,,42.348033,
min,0.0,,2.351115,
25%,1.0,,4.407975,
50%,1.0,,5.905005,
75%,1.0,,8.113949,


Underperformed. Continuing
View the evaluation results for project 'stupendous-scarecrow-56' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=7453522e-704d-436d-a1ec-fe4ce6bf1629

View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50

Unnamed: 0,feedback.exact_match,error,execution_time,run_id
count,50.0,0.0,50.0,50
unique,,0.0,,50
top,,,,2fa03e3f-3af3-4f15-a59e-568471ab3fe3
freq,,,,1
mean,0.88,,6.150452,
std,0.328261,,2.696097,
min,0.0,,3.327153,
25%,1.0,,4.657029,
50%,1.0,,5.834938,
75%,1.0,,6.94373,


Underperformed. Continuing
View the evaluation results for project 'aching-frame-9' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=7bfb5653-d303-4569-b0b2-e1611e5a4024

View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50

Unnamed: 0,feedback.exact_match,error,execution_time,run_id
count,50.0,0.0,50.0,50
unique,,0.0,,50
top,,,,5f30e4f3-9e01-4ce8-841d-b3c56e43e29c
freq,,,,1
mean,0.9,,4.53655,
std,0.303046,,1.7305,
min,0.0,,2.408829,
25%,1.0,,3.232037,
50%,1.0,,3.769973,
75%,1.0,,5.969299,


New best score 0.9 > 0.88. Updating selected examples.
View the evaluation results for project 'sparkling-acknowledgment-96' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=ba7ded51-cf17-4f5c-8686-eff0c70956e4

View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50

Unnamed: 0,feedback.exact_match,error,execution_time,run_id
count,50.0,0.0,50.0,50
unique,,0.0,,50
top,,,,0158a345-4a57-41a3-8216-24e3e162e339
freq,,,,1
mean,0.9,,4.673715,
std,0.303046,,1.73606,
min,0.0,,1.990211,
25%,1.0,,3.194072,
50%,1.0,,4.176253,
75%,1.0,,5.73169,


Underperformed. Continuing
View the evaluation results for project 'left-linen-3' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=0e1787d0-1e4d-488d-b7ca-6540c97f0f55

View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50

Unnamed: 0,feedback.exact_match,error,execution_time,run_id
count,50.0,0.0,50.0,50
unique,,0.0,,50
top,,,,65515645-0e2e-44ab-8b43-1a026a51d990
freq,,,,1
mean,0.88,,4.828782,
std,0.328261,,1.848141,
min,0.0,,1.977732,
25%,1.0,,3.116855,
50%,1.0,,4.794901,
75%,1.0,,6.149255,


Underperformed. Continuing
View the evaluation results for project 'proper-push-44' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=0be0745f-65a4-4a19-b76f-587e91392b2b

View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50

Unnamed: 0,feedback.exact_match,error,execution_time,run_id
count,50.0,0.0,50.0,50
unique,,0.0,,50
top,,,,d211eead-1f2a-4dcf-9e11-c53fa36b8302
freq,,,,1
mean,0.92,,5.145023,
std,0.274048,,1.948927,
min,0.0,,2.58651,
25%,1.0,,3.806319,
50%,1.0,,4.782008,
75%,1.0,,5.876891,


New best score 0.92 > 0.9. Updating selected examples.
View the evaluation results for project 'excellent-butter-4' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=4e9555a7-01ec-44f4-bf4c-9fa547bfbc0c

View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50

Unnamed: 0,feedback.exact_match,error,execution_time,run_id
count,50.0,0.0,50.0,50
unique,,0.0,,50
top,,,,ca2263e4-49f0-4479-a60f-9f4011201ca9
freq,,,,1
mean,0.92,,4.790062,
std,0.274048,,1.863609,
min,0.0,,2.469936,
25%,1.0,,3.411112,
50%,1.0,,4.494514,
75%,1.0,,5.816941,


Underperformed. Continuing
View the evaluation results for project 'definite-family-71' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=6b167b1f-d435-491f-a2f0-7472545e585c

View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50

Unnamed: 0,feedback.exact_match,error,execution_time,run_id
count,50.0,0.0,50.0,50
unique,,0.0,,50
top,,,,bf6cba87-acbc-4c92-88c1-a5c6e95b9e09
freq,,,,1
mean,0.84,,4.912422,
std,0.370328,,1.79685,
min,0.0,,2.453865,
25%,1.0,,3.595303,
50%,1.0,,4.249207,
75%,1.0,,6.062839,


Underperformed. Continuing
Best overall score:  0.92
Best step:  8


## Compare on held-out set

It's easy to overfit a benchmark if you do model selection on it. Let's compare models on the test set we held out earlier.

In [22]:
best_score, best_examples = all_scores[0]

In [26]:
original_model = create_chain()
best_performing_model = create_chain(best_examples)

In [27]:
for model_name, model in [
    ("optimized", best_performing_model),
    # ("original", original_model),
]:
    client.run_on_dataset(
        dataset_name=test_name,
        llm_or_chain_factory=model,
        evaluation=eval_config,
        verbose=True,
        project_metadata={
            "model": model_name,
        },
    )

View the evaluation results for project 'proper-summer-37' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/29cedfd4-7574-4f96-b807-a63f642cd654/compare?selectedSessions=02c734fe-69ff-4a6d-bd15-e0d92b963cf2

View all tests for Dataset scone-test-one-scoped at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/29cedfd4-7574-4f96-b807-a63f642cd654
[------------------------------------------------->] 200/200

Unnamed: 0,feedback.exact_match,error,execution_time,run_id
count,200.0,0.0,200.0,200
unique,,0.0,,200
top,,,,06524dfd-d8cd-499c-bfac-7e4b441ff2ab
freq,,,,1
mean,0.81,,2.525272,
std,0.393285,,6.933165,
min,0.0,,1.23494,
25%,1.0,,1.765877,
50%,1.0,,1.923477,
75%,1.0,,2.144059,


Using the GPT-4-generated examples, we were able to boost performance from ~0.5 to ~0.8. Not bad!
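
With 200 test examples, a rough sense of the noise on that score comes from the standard error of the mean. The sketch below plugs in the `mean` and `std` from the test-set summary above and uses a normal approximation, which is an assumption here rather than something the notebook computes.

```python
import math

n = 200          # number of test examples
mean = 0.81      # feedback.exact_match mean on the test set
std = 0.393285   # feedback.exact_match sample standard deviation

standard_error = std / math.sqrt(n)
margin = 1.96 * standard_error  # half-width of an approximate 95% interval
low, high = mean - margin, mean + margin
print(f"exact_match = {mean:.2f} +/- {margin:.3f} ({low:.3f} to {high:.3f})")
```

For the same spread, the 50-example dev runs are about twice as noisy, since sqrt(200 / 50) = 2. Differences of a few points between training steps on the dev set may therefore be noise.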