## Prompt-tuning with LangSmith + Claude

Prompt engineering isn't always the most fun, especially for tasks where metrics are hard to define.

Turns out LLMs can do a [decent job at prompt engineering](https://arxiv.org/abs/2211.01910), especially when incorporating human feedback on representative data.

LangSmith makes this whole flow very easy. Let's give it a whirl!

This example is based on [@alexalbert__'s example Claude workflow](https://x.com/alexalbert__/status/1767258557039378511?s=20).

In [None]:
%pip install -U langsmith langchain_anthropic langchain sqlite

In [2]:
import os

# Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY"  # Update with your API key
# We are using Anthropic here as well
os.environ["ANTHROPIC_API_KEY"] = "YOUR API KEY"
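
Hardcoded keys are easy to leak when a notebook is shared. One alternative, sketched here with only the standard library, is to prompt for a key when the variable is not already set. `ensure_env` is a hypothetical helper, not part of LangSmith or LangChain.

```python
import os
from getpass import getpass
from typing import Callable


def ensure_env(name: str, ask: Callable[[str], str] = getpass) -> None:
    """Set environment variable `name` by prompting, unless it already has a value."""
    if not os.environ.get(name):
        # getpass hides the typed value so the key never appears in cell output.
        os.environ[name] = ask(f"{name}: ")
```

You would then call `ensure_env("LANGCHAIN_API_KEY")` and `ensure_env("ANTHROPIC_API_KEY")` instead of assigning the keys inline.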

In [3]:
# We can set an LLM cache in case you want to re-run the steps
from langchain.cache import SQLiteCache
from langchain_core.globals import set_llm_cache

set_llm_cache(SQLiteCache(database_path=".langchain.db"))

In [4]:
from langsmith import Client

client = Client()

## 1. Pick a task

Let's say I want to write a tweet generator for academic papers, one that is catchy but neither impersonal nor laden with buzzwords. Let's see if we can "optimize" a prompt without having to engineer it ourselves.

In [5]:
from langchain import hub
from langchain_anthropic import ChatAnthropic
from langchain_core.output_parsers import StrOutputParser

task = (
    "Generate a tweet to market an academic paper or open source project. It should be"
    " well crafted but avoid gimmicks or over-reliance on buzzwords."
)


# See: https://smith.langchain.com/hub/wfh/metaprompt
prompt = hub.pull("wfh/metaprompt")
llm = ChatAnthropic(model="claude-3-opus-20240229")


def get_instructions(gen: str):
    return gen.split("<Instructions>")[1].split("</Instructions>")[0]


meta_prompter = prompt | llm | StrOutputParser() | get_instructions


recommended_prompt = meta_prompter.invoke(
    {
        "task": task,
        "input_variables": """
{paper}
""",
    }
)
print(recommended_prompt)


Your task is to write an engaging tweet to market the following academic paper or open source project:

<paper>
{paper}
</paper>

Pay close attention to the abstract or project summary, as that will contain the key points you'll want to highlight in the tweet.  Think carefully about how to concisely summarize the work in a way that will be interesting and appealing to a broad audience on Twitter. 

The tweet should strike a balance between being clear and informative about the work, while also using some marketing language to generate excitement and interest. However, avoid being gimmicky or relying too heavily on buzzy jargon.  Focus on substance over style.

Write out the full text of your proposed tweet inside <tweet> tags. The tweet must be under 280 characters in length.



OK, so it's a fine-but-not-great prompt. Let's see how it does!
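
One caveat about `get_instructions`: it assumes the model always emits an `<Instructions>` block, and if the tag is missing, `split(...)[1]` raises a bare `IndexError`. A slightly more defensive variant could look like the sketch below. The helper name `extract_tag` is hypothetical, and the sample string is made up for illustration.

```python
import re


def extract_tag(text: str, tag: str) -> str:
    """Return the contents of the first <tag>...</tag> block in `text`."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        # Fail with a readable message instead of a bare IndexError.
        raise ValueError(f"No <{tag}> block found in model output")
    return match.group(1)


# Made-up model output, used only to exercise the helper.
sample = "<Inputs>{paper}</Inputs>\n<Instructions>\nWrite a tweet.\n</Instructions>"
print(extract_tag(sample, "Instructions").strip())
```

The same helper would also cover the `<tweet>` and `<improved_prompt>` tags used later in this notebook.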

## 2. Dataset

For some tasks, you can generate the dataset yourself. For this notebook, we have created a 10-datapoint dataset of scraped arXiv papers.

In [19]:
public_ds = "https://smith.langchain.com/public/42bbacae-e9b2-4410-a053-ce3c11abec83/d"
ds_name = "Tweet Generator"
client.clone_public_dataset(public_ds)
ds = client.read_dataset(dataset_name=ds_name)
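
If you would rather build your own dataset than clone the public one, something like the following sketch could work. It assumes the `create_dataset` and `create_examples` methods of the `langsmith` client. The function name `upload_papers` is illustrative, and the `paper` input key is chosen to match the `{paper}` prompt variable.

```python
from typing import Any, Sequence


def upload_papers(client: Any, papers: Sequence[str], dataset_name: str):
    """Create a dataset with one example per paper text.

    `client` is expected to be a `langsmith.Client` instance.
    """
    dataset = client.create_dataset(dataset_name=dataset_name)
    client.create_examples(
        # Each example only needs inputs; there are no reference outputs here
        # because tweet quality is judged by hand.
        inputs=[{"paper": text} for text in papers],
        dataset_id=dataset.id,
    )
    return dataset
```

Calling `upload_papers(client, my_paper_texts, "My Tweet Generator")` would then give you a dataset name to pass as `ds_name`.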

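In [None]:
# Peek at one example to see the input format.
# (The generator chain itself is defined in the next section.)
example = next(iter(client.list_examples(dataset_name=ds_name)))
example.inputs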

## 3. Predict

We will refrain from defining metrics for now (it's quite subjective). Instead we will run the first version of the generator against the dataset and manually review + provide feedback on the results.

In [20]:
from langchain_core.prompts import PromptTemplate


def parse_tweet(response: str):
    try:
        return response.split("<tweet>")[1].split("</tweet>")[0].strip()
    except IndexError:  # no <tweet> tags in the response
        return response.strip()


def create_tweet_generator(prompt_str: str):
    prompt = PromptTemplate.from_template(prompt_str)
    return prompt | llm | StrOutputParser() | parse_tweet


tweet_generator = create_tweet_generator(recommended_prompt)
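
Although scoring stays manual here, the generated prompt does state one objective constraint: the tweet must be under 280 characters. A cheap programmatic check could flag violations before you start annotating. The sketch below is not part of the original workflow. `check_length` is a hypothetical helper, and it counts Python string characters, which only approximates how the platform counts.

```python
MAX_TWEET_CHARS = 280


def check_length(tweet: str, limit: int = MAX_TWEET_CHARS) -> dict:
    """Report the character count and whether it is strictly under `limit`."""
    n_chars = len(tweet)
    return {"chars": n_chars, "within_limit": n_chars < limit}


# Made-up tweet text, used only to exercise the helper.
print(check_length("New paper: a simple trick that makes transformers train faster."))
```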

In [21]:
res = client.run_on_dataset(
    dataset_name=ds_name,
    llm_or_chain_factory=tweet_generator,
)

View the evaluation results for project 'brief-reason-82' at:
https://smith.langchain.com/o/30239cd8-922f-4722-808d-897e1e722845/datasets/453ddc95-6353-4cb6-ba79-06b8a6f518b6/compare?selectedSessions=c66cc5bc-f9b3-4e31-a881-fc374da3a02c

View all tests for Dataset Tweet Generator at:
https://smith.langchain.com/o/30239cd8-922f-4722-808d-897e1e722845/datasets/453ddc95-6353-4cb6-ba79-06b8a6f518b6
[------------------------------------------------->] 10/10

## 4. Label

Now, we will use an annotation queue to score + add notes to the results. We will use this to iterate on our prompt!

For this notebook, I will be logging two types of feedback:

`note` - freeform comments on the runs

`tweet_quality` - a 0-4 score of the generated tweet based on my subjective preferences
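
The annotation queue UI is the intended way to record these, but the same two feedback keys could also be logged from code. The sketch below assumes the `create_feedback` method of the `langsmith` client; `log_review` is a hypothetical wrapper.

```python
from typing import Any, Optional


def log_review(
    client: Any,
    run_id: Any,
    quality: Optional[int] = None,
    note: Optional[str] = None,
) -> None:
    """Attach a 0-4 `tweet_quality` score and/or a freeform `note` to a run.

    `client` is expected to be a `langsmith.Client` instance.
    """
    if quality is not None:
        if not 0 <= quality <= 4:
            raise ValueError("tweet_quality must be between 0 and 4")
        client.create_feedback(run_id, key="tweet_quality", score=quality)
    if note:
        client.create_feedback(run_id, key="note", comment=note)
```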

In [22]:
q = client.create_annotation_queue(name="Tweet Generator")

In [23]:
client.add_runs_to_annotation_queue(
    q.id,
    run_ids=[
        r.id
        for r in client.list_runs(project_name=res["project_name"], execution_order=1)
    ],
)

Now, go through the runs to label them. Return to this notebook when you are finished.

![Queue](./img/queue.png)

## 5. Update

With the human feedback in place, let's update the prompt and try again.

In [26]:
from collections import defaultdict


def format_feedback(single_feedback, max_score=4):
    if single_feedback.score is None:
        score = ""
    else:
        score = f"\nScore:[{single_feedback.score}/{max_score}]"
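    # Keep the leading newline so the comment does not run into the score,
    # and skip the comment entirely when none was left.
    comment = f"\n{single_feedback.comment}" if single_feedback.comment else ""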
    return f"""<feedback key={single_feedback.key}>{score}{comment}
</feedback>"""


def format_run_with_feedback(run, feedback):
    all_feedback = "\n".join([format_feedback(f) for f in feedback])
    return f"""<example>
<tweet>
{run.outputs["output"]}
</tweet>
<annotations>
{all_feedback}
</annotations>
</example>"""


def get_formatted_feedback(project_name: str):
    traces = list(client.list_runs(project_name=project_name, execution_order=1))
    feedbacks = defaultdict(list)
    for f in client.list_feedback(run_ids=[r.id for r in traces]):
        feedbacks[f.run_id].append(f)
    return [
        format_run_with_feedback(r, feedbacks[r.id])
        for r in traces
        if r.id in feedbacks
    ]

In [39]:
formatted_feedback = get_formatted_feedback(res["project_name"])
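
To judge whether successive prompts are actually improving, it can help to reduce the `tweet_quality` scores to a single number per test project. This is a minimal sketch, assuming feedback objects expose `.key` and `.score` as they do in `format_feedback` above. `mean_score` is a hypothetical helper, and the sample objects are stand-ins, not real results.

```python
from statistics import mean
from types import SimpleNamespace
from typing import Iterable, Optional


def mean_score(feedback: Iterable, key: str = "tweet_quality") -> Optional[float]:
    """Average the scores logged under `key`, ignoring unscored feedback."""
    scores = [f.score for f in feedback if f.key == key and f.score is not None]
    return mean(scores) if scores else None


# Stand-in feedback objects, used only to exercise the helper.
fake_feedback = [
    SimpleNamespace(key="tweet_quality", score=2),
    SimpleNamespace(key="note", score=None),
    SimpleNamespace(key="tweet_quality", score=4),
]
print(mean_score(fake_feedback))
```

With real data, you could pass the result of `client.list_feedback(run_ids=[...])` instead of the stand-ins.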

LLMs are especially good at 2 things:
1. Generating grammatical text
2. Summarization

Now that we've left a mixture of scores and free-form comments, we can use an "optimizer prompt" ([wfh/optimizerprompt](https://smith.langchain.com/hub/wfh/optimizerprompt)) to incorporate the feedback into an updated prompt.


In [40]:
# See: https://smith.langchain.com/hub/wfh/optimizerprompt
optimizer_prompt = hub.pull("wfh/optimizerprompt")


def extract_new_prompt(gen: str):
    return gen.split("<improved_prompt>")[1].split("</improved_prompt>")[0].strip()


optimizer = optimizer_prompt | llm | StrOutputParser() | extract_new_prompt
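
Because the optimizer rewrites the prompt freely, it may drop the `{paper}` placeholder or introduce stray braces, either of which would cause problems in `create_tweet_generator`. A quick sanity check before reusing the output might look like this. It uses Python's standard `string.Formatter` to approximate what an f-string-style template will see; `template_variables` and `validate_prompt` are hypothetical helpers.

```python
from string import Formatter


def template_variables(template: str) -> set:
    """Return the placeholder names found in a `{name}`-style template."""
    # Formatter.parse raises ValueError on unbalanced braces.
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def validate_prompt(template: str, required=("paper",)) -> None:
    """Raise ValueError if placeholders are missing or unexpected."""
    found = template_variables(template)
    missing = set(required) - found
    unexpected = found - set(required)
    if missing or unexpected:
        raise ValueError(
            f"missing={sorted(missing)}, unexpected={sorted(unexpected)}"
        )


# Passes silently: exactly one placeholder, and it is the expected one.
validate_prompt("Write a tweet about:\n<paper>\n{paper}\n</paper>")
```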

In [41]:
current_prompt = recommended_prompt
new_prompt = optimizer.invoke(
    {
        "current_prompt": current_prompt,
        "annotated_predictions": "\n\n".join(formatted_feedback).strip(),
    }
)

In [42]:
print("Original Prompt\n\n" + current_prompt)
print("*" * 80 + "\nNew Prompt\n\n" + new_prompt)

Original Prompt


Your task is to write an engaging tweet to market the following academic paper or open source project:

<paper>
{paper}
</paper>

Pay close attention to the abstract or project summary, as that will contain the key points you'll want to highlight in the tweet.  Think carefully about how to concisely summarize the work in a way that will be interesting and appealing to a broad audience on Twitter. 

The tweet should strike a balance between being clear and informative about the work, while also using some marketing language to generate excitement and interest. However, avoid being gimmicky or relying too heavily on buzzy jargon.  Focus on substance over style.

Write out the full text of your proposed tweet inside <tweet> tags. The tweet must be under 280 characters in length.

********************************************************************************
New Prompt

Your task is to write an engaging tweet to market the following academic paper or open source project

## 6. Repeat!

Now that we have an "upgraded" prompt, we can test it out again and repeat until we are satisfied with the result.

If you find the prompt isn't converging to something you want, you can manually update the prompt (you are the optimizer in this case) and/or be more explicit in your free-form note feedback.

In [44]:
tweet_generator = create_tweet_generator(new_prompt)

updated_results = client.run_on_dataset(
    dataset_name=ds_name,
    llm_or_chain_factory=tweet_generator,
)

View the evaluation results for project 'aching-vein-26' at:
https://smith.langchain.com/o/30239cd8-922f-4722-808d-897e1e722845/datasets/453ddc95-6353-4cb6-ba79-06b8a6f518b6/compare?selectedSessions=8b937ad4-acea-4987-9c3d-db2c1edc25fc

View all tests for Dataset Tweet Generator at:
https://smith.langchain.com/o/30239cd8-922f-4722-808d-897e1e722845/datasets/453ddc95-6353-4cb6-ba79-06b8a6f518b6
[------------------------------------------------->] 10/10

In [45]:
client.add_runs_to_annotation_queue(
    q.id,
    run_ids=[
        r.id
        for r in client.list_runs(
            project_name=updated_results["project_name"], execution_order=1
        )
    ],
)

Then review the new runs, provide feedback, and repeat.

Once you've provided feedback, you can continue here:

In [46]:
formatted_feedback = get_formatted_feedback(updated_results["project_name"])

In [47]:
# Swap them out
current_prompt = new_prompt
new_prompt = optimizer.invoke(
    {
        "current_prompt": current_prompt,
        "annotated_predictions": "\n\n".join(formatted_feedback).strip(),
    }
)

In [48]:
print("Previous Prompt\n\n" + current_prompt)
print("*" * 80 + "\nNew Prompt\n\n" + new_prompt)

Previous Prompt

Your task is to write an engaging tweet to market the following academic paper or open source project:

<paper>
{paper}
</paper>

Here are the key elements to include in your tweet:

1. Concisely summarize the most important new findings from the paper. Focus on the key advances rather than the minutiae of the methodology. 

2. Explain why the findings are important or how they could lead to new applications. Avoid overhyped terms like "breakthrough" and focus on clear, grounded explanations using concrete examples where possible.

3. Briefly define any essential technical jargon that appears in your summary (e.g. what are "spintronic devices"?). Aim to make the significance of the work accessible to a broad audience.

4. Credit the authors and/or their institutions, and include a link to the paper/project.

Feel free to use more than 280 characters to achieve a good balance of being punchy and interesting while also including key context and explanations. But don't le

## Conclusion

Congrats! You've "optimized" a prompt on a subjective task using human feedback and an automatic prompt engineer flow. LangSmith makes it easy to score and improve LLM systems even when a hard metric is difficult to define.

You can push the optimized version of your prompt to the hub (here and in future iterations) to version each change.

In [49]:
hub.push("wfh/academic-tweet-generator", PromptTemplate.from_template(new_prompt))

'https://smith.langchain.com/hub/wfh/academic-tweet-generator/03670db0'

#### Extensions:

We haven't optimized the meta-prompts above. Feel free to make them your own by forking and updating them!
Some easy extensions you could try out include:

1. Including the full history of previous prompts and annotations (or the most recent N prompts with feedback) in the "optimizer prompt" step. This may help it converge better, especially if you're using a small dataset.
2. Updating the optimizer prompt to encourage usage of few-shot examples, or to encourage other prompting tricks.
3. Incorporating an LLM judge by including the annotations as few-shot examples and instructing it to critique the generated outputs. This could help speed up the human annotation process.
4. Generating and including a validation set, to avoid over-fitting to this training dataset.
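
As one possible starting point for the first extension, the sketch below keeps a running history of (prompt, annotated predictions) pairs and renders the most recent N rounds into one string. The tag names and the `format_history` helper are hypothetical. As used above, the optimizer prompt only receives `current_prompt` and `annotated_predictions`, so you would likely need to fork it and add an input variable for the history.

```python
from typing import List, Tuple


def format_history(history: List[Tuple[str, List[str]]], last_n: int = 3) -> str:
    """Render the most recent `last_n` rounds of (prompt, feedback) as tagged text."""
    start = max(len(history) - last_n, 0)
    rounds = []
    # Number rounds from 1 so the optimizer can tell older from newer attempts.
    for i, (prompt, feedback) in enumerate(history[start:], start=start + 1):
        joined = "\n\n".join(feedback)
        rounds.append(
            f"<round number={i}>\n<prompt>\n{prompt}\n</prompt>\n"
            f"<annotated_predictions>\n{joined}\n</annotated_predictions>\n</round>"
        )
    return "\n\n".join(rounds)
```

After each optimization step, you would append `(current_prompt, formatted_feedback)` to the history list and pass `format_history(history)` to the forked optimizer prompt.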