# Bootstrap Few-shot Prompting with LangSmith
## Author : AISchool ( http://aischool.ai/%ec%98%a8%eb%9d%bc%ec%9d%b8-%ea%b0%95%ec%9d%98-%ec%b9%b4%ed%85%8c%ea%b3%a0%eb%a6%ac/ )
## Reference : https://github.com/langchain-ai/langsmith-cookbook/blob/main/optimization/bootstrap-fewshot/bootstrap-few-shot.ipynb

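Prompt engineering is tedious. With a tool like LangSmith, however, you can use examples to optimize your prompts. **Instead of guessing which examples will be most effective, you can use a tested evaluation method to carefully curate and edit the examples that suit your pipeline.** The main steps are:

1.   Create a dataset
2.   Choose a metric to improve
3.   Create the initial system
4.   Decide on the update logic (few-shot examples vs. instruction teaching vs. other methods, how to format the examples, etc.)
5.   Train!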

Below is **an example of bootstrapping the gpt-3.5-turbo model on an entailment task using few-shot examples**. It is inspired by Christopher Potts' example on the SCONE dataset ( https://github.com/stanfordnlp/dspy/blob/main/examples/nli/scone/scone.ipynb ).

The task is Natural Language Inference (NLI): the LLM must predict whether a conclusion follows logically from a premise (the context sentence).












# Install libraries

In [None]:
!pip install langchain langsmith langchainhub langchain-openai tiktoken langchain_community


In [None]:
import os
from uuid import uuid4

# Used by the agent in this tutorial
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

unique_id = uuid4().hex[0:8]
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"Bootstrap Few-shot Prompting- {unique_id}"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "YOUR_LANGCHAIN_API_KEY"

In [None]:
unique_id

'2156ccdb'

In [None]:
from langsmith import Client

client = Client()

In [None]:
public_datasets = [
    "https://smith.langchain.com/public/1d065de2-56c1-496e-bc66-bdce308e6537/d",  # train
    "https://smith.langchain.com/public/3205fa05-bd78-4eaf-924f-96df0f577b1f/d",  # train2
    "https://smith.langchain.com/public/fdf16166-1edd-418f-b777-3af82034931d/d",  # dev
    "https://smith.langchain.com/public/aee61506-3c60-4ca8-95c4-0314c9719ca8/d",  # dev2
    "https://smith.langchain.com/public/8d40d210-f8e6-4def-a206-78c5080c5d53/d",  # test
]
for ds in public_datasets:
    client.clone_public_dataset(ds)

In [None]:
train_name = "scone-train2"
dev_name = "scone-dev2"
test_name = "scone-test-one-scoped"
full_test_name = "scone-test"

In [None]:
example = next(client.list_examples(dataset_name=train_name))
print("inputs", example.inputs)
print("outputs", example.outputs)

inputs {'context': 'A man who does not walk confidently dropping produce.', 'question': 'Can we logically conclude for sure that a man who does not walk confidently dropping kale?'}
outputs {'answer': 'No', 'category': 'one_not_scoped'}



Looking at the values above, we can see that these examples are fairly difficult problems.







# Evaluator


Because we have ground-truth classification labels, we can use exact match as the evaluator.
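As a minimal sketch of what exact-match scoring does, independent of LangSmith, the following compares a predicted label with the gold label after trimming whitespace and lowercasing. The function name `exact_match_score` is hypothetical and not part of the notebook, and the trimming is an assumption added here for illustration.

```python
def exact_match_score(predicted: str, expected: str) -> int:
    """Return 1 if the two labels match after trimming and lowercasing, else 0."""
    return int(predicted.strip().lower() == expected.strip().lower())


print(exact_match_score("No", "no"))     # 1
print(exact_match_score(" Yes ", "No"))  # 0
print(exact_match_score("No.", "No"))    # 0: trailing punctuation breaks the match
```

The last case shows the main weakness of exact match: any extra text in the model's answer counts as a miss, so the prompt has to constrain the answer format tightly.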

In [None]:
from langsmith.evaluation import run_evaluator


@run_evaluator
def exact_match(run, example):
    # Evaluate the exact match correctness of the NLI result
    try:
        # Usual case: the chain returns a dict with a Yes/No string
        predicted = run.outputs["is_entailed"]
        expected = example.outputs["answer"]
        score = expected.lower() == predicted.lower()
    except Exception:
        try:
            # Fallback: the output is an object with a boolean `is_entailed`
            expected = example.outputs["answer"]
            expected_bool = {"no": False, "yes": True}.get(expected.strip().lower())
            score = run.outputs["output"].is_entailed == expected_bool
        except Exception:
            # Anything that cannot be parsed counts as incorrect
            score = 0
    return {
        "key": "exact_match",
        "score": int(score),
    }

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# And we will create a placeholder in the template to add few-shot examples
prompt = PromptTemplate.from_template(
    """You are given some context (a premise) and a question (a hypothesis). You must indicate with Yes/No answer whether we can logically conclude the hypothesis from the premise.

---

Follow the following format.

Context: ${{context}}

Question: ${{question}}

Reasoning: Let's think step by step in order to ${{produce the answer}}. We ...

Answer: Yes or No

---{examples}

Context: {context}

Question: {question}

Reasoning: Let's think step by step in order to"""
).partial(examples="")


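def parse(pred: str):
    # Split the completion into the reasoning and the final Yes/No answer
    fnd = "\nAnswer:"
    idx = pred.find(fnd)
    if idx == -1:
        # No answer marker found: treat the whole completion as reasoning
        return {"is_entailed": "", "reasoning": pred.strip()}
    answer = pred[idx + len(fnd) :].strip()
    return {"is_entailed": answer, "reasoning": pred[:idx].strip()}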


chain = prompt | ChatOpenAI(model="gpt-3.5-turbo") | StrOutputParser() | parse

In [None]:
prediction = chain.invoke(example.inputs)

In [None]:
example.inputs

{'context': 'A man who does not walk confidently dropping produce.',
 'question': 'Can we logically conclude for sure that a man who does not walk confidently dropping kale?'}

In [None]:
prediction

{'is_entailed': 'No',
 'reasoning': 'produce the answer. We know that dropping produce could include any type of produce, not specifically kale. So, we cannot logically conclude that a man who does not walk confidently drops kale.'}
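
The reasoning above starts with "produce the answer" because the prompt ends mid-sentence with "Let's think step by step in order to" and the model continues from there. The sketch below reproduces the splitting logic on a hand-written completion so the behavior can be checked without calling the API. The helper name `split_completion` and the completion string are invented for illustration.

```python
def split_completion(pred: str) -> dict:
    """Split a completion into reasoning and answer at the "Answer:" marker."""
    marker = "\nAnswer:"
    idx = pred.find(marker)
    if idx == -1:
        # No marker: keep everything as reasoning and leave the answer empty
        return {"is_entailed": "", "reasoning": pred.strip()}
    return {
        "is_entailed": pred[idx + len(marker):].strip(),
        "reasoning": pred[:idx].strip(),
    }


# Hypothetical completion, written by hand rather than produced by a model
completion = " produce the answer. Produce is broader than kale.\n\nAnswer: No"
result = split_completion(completion)
print(result["is_entailed"])  # No
print(result["reasoning"])    # produce the answer. Produce is broader than kale.
```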


# Initial Evaluation

In [None]:
from langchain.smith import RunEvalConfig

eval_config = RunEvalConfig(
    custom_evaluators=[exact_match],
)

In [None]:
dev_name

'scone-dev2'

In [None]:
res = client.run_on_dataset(
    dataset_name=dev_name,
    llm_or_chain_factory=chain,
    evaluation=eval_config,
    project_metadata={"optimizer": None},
)

View the evaluation results for project 'upbeat-fire-7' at:
https://smith.langchain.com/o/2da4358c-aaa8-5f93-b4af-fa2d78b96bd8/datasets/83064b47-7278-45d2-bba6-766f20d59705/compare?selectedSessions=dd30695c-d79c-41e1-b829-8ac3bb325d52

View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/2da4358c-aaa8-5f93-b4af-fa2d78b96bd8/datasets/83064b47-7278-45d2-bba6-766f20d59705
[------------------------------------------------->] 50/50

In [None]:
res

This run scored about 50% accuracy. There is clearly room for improvement.
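
For context on the 50% figure: on a binary Yes/No task, always predicting the more common label already yields at least 50% accuracy. The sketch below uses made-up labels (not the SCONE data) to show how accuracy and this majority-class baseline can be computed side by side. The function names are hypothetical.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Fraction of predictions that equal the gold label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Invented toy data for illustration only
labels = ["Yes", "No", "No", "Yes", "No", "No"]
predictions = ["Yes", "Yes", "No", "No", "No", "Yes"]
print(accuracy(predictions, labels))
print(majority_baseline(labels))
```

In this toy data the predictions score below the majority baseline, which is the kind of result that signals the system is not yet doing better than a trivial strategy.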








# ✨ Optimize ✨

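Optimizing here means **"using data to update the system"**. LangChain runnables do not currently support a "backward" method out of the box (as PyTorch does), but **you can easily define updates to the main components you want to change (for example, the prompt or the LLM)**.

For example, the update for each component could look like this:

*   Few shot prompting: add an extra string input or a MessagesPlaceholder to the prompt template
*   Updating the instructions: update the prompt template directly (usually the system prompt)
*   LLM: perform a backward pass.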









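To constrain the search space, we will **focus on few-shot prompting**. We will then **apply a genetic/evolutionary algorithm to compare the performance of different few-shot examples and select the ones that give the largest "lift" on the chosen metric**.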
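
The selection idea can be sketched without any LLM calls. In the toy below, `score_fn` stands in for an evaluation run on the dev set and `candidate_pool` for bootstrapped examples; both are hypothetical. It is a plain random-search loop that keeps the best-scoring subset, which is a simplification of a genetic/evolutionary algorithm and not the notebook's actual training code.

```python
import random


def select_best_subset(candidate_pool, score_fn, k=3, rounds=20, seed=0):
    """Sample `rounds` random subsets of size k and keep the best-scoring one."""
    rng = random.Random(seed)
    # Start from the empty set, i.e. the prompt with no few-shot examples
    best_subset, best_score = [], score_fn([])
    for _ in range(rounds):
        subset = rng.sample(candidate_pool, min(k, len(candidate_pool)))
        score = score_fn(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


def toy_score(subset):
    # Toy assumption: examples with even ids help, odd ids hurt
    return sum(1 if x % 2 == 0 else -1 for x in subset)


best, score = select_best_subset(list(range(10)), toy_score)
print(best, score)
```

In a real run each call to `score_fn` is a full evaluation over the dev set, so the number of rounds is bounded by cost rather than by the size of the search space.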

First, we create a constructor for a chain that accepts few-shot examples, so that the chain can be rebuilt with each updated state.

In [None]:
# We will define how we want our few-shot examples to be formatted
import random
from typing import List, Optional

from langchain_core.runnables import RunnableLambda


def format_example(example: dict):
    inputs = example["input"]
    outputs = example["output"]
    return f"""

Context: {inputs['context']}

Question: {inputs['question']}

Reasoning: {outputs['reasoning']}

Answer: {outputs['is_entailed']}

"""


def format_few_shot(input_: dict, examples: Optional[List[dict]] = None):
    if examples:
        # TODO: make this configurable / bound to the prompt template
        # Build a new dict so the caller's input is not mutated
        input_ = {
            **input_,
            "examples": "--".join(format_example(e) for e in examples) + "--",
        }
    return input_


def create_chain(examples: Optional[List] = None, llm=None):
    llm = llm or ChatOpenAI(model="gpt-3.5-turbo")
    chain = (
        RunnableLambda(format_few_shot).bind(examples=examples)
        | prompt
        | llm
        | StrOutputParser()
        | parse
    ).with_config(tags=["to_train"])
    return chain
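
To make the formatting concrete, the sketch below renders two hand-written examples into the string that fills the `{examples}` placeholder, using the same layout as `format_example`. It uses only the standard library so it runs without LangChain. The helper names `render_example` and `render_few_shot` and the example data are invented for illustration.

```python
def render_example(example: dict) -> str:
    """Render one demonstration in the Context/Question/Reasoning/Answer layout."""
    inputs, outputs = example["input"], example["output"]
    return (
        f"\n\nContext: {inputs['context']}"
        f"\n\nQuestion: {inputs['question']}"
        f"\n\nReasoning: {outputs['reasoning']}"
        f"\n\nAnswer: {outputs['is_entailed']}\n\n"
    )


def render_few_shot(examples: list) -> str:
    # Mirrors the notebook's "--" separator between and after examples
    return "--".join(render_example(e) for e in examples) + "--"


# Invented demonstrations, not taken from the SCONE dataset
examples = [
    {
        "input": {"context": "A girl holds a rose.", "question": "Does a girl hold a flower?"},
        "output": {"reasoning": "A rose is a flower.", "is_entailed": "Yes"},
    },
    {
        "input": {"context": "A boy eats fruit.", "question": "Does a boy eat an apple?"},
        "output": {"reasoning": "Fruit need not be an apple.", "is_entailed": "No"},
    },
]
few_shot_block = render_few_shot(examples)
print(few_shot_block)
```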

# Training

Next, we define the training utilities.

In [None]:
from langchain_core.tracers.context import collect_runs

In [None]:
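def step(
    construct_chain,
    train_examples,
    eval_config,
    examples=None,
    bootstrap_k: int = 8,
):
    """Collect up to `bootstrap_k` training examples that the chain answers correctly."""
    collected = examples.copy() if examples else []
    # Copy before shuffling so the caller's list is not mutated
    train_examples = train_examples.copy()
    random.shuffle(train_examples)
    evaluator = eval_config.custom_evaluators[0]
    # TODO: Batching to speed it up
    while train_examples:
        if len(collected) >= bootstrap_k:
            break
        # Never pop more examples than remain in the pool
        batch_size = min(bootstrap_k - len(collected), len(train_examples))
        train_batch = [train_examples.pop() for _ in range(batch_size)]
        # Do not show the chain an example it is about to be tested on
        batch_ids = {e.id for e in train_batch}
        chain = construct_chain([e for e in collected if e["id"] not in batch_ids])
        with collect_runs() as cb:
            chain.batch([e.inputs for e in train_batch])
        # NOTE: this pairing assumes the traced runs come back in input order
        for run, example in zip(cb.traced_runs, train_batch):
            metric = evaluator.evaluate_run(run, example)
            score = metric.score
            # Keep only the examples the chain answered correctly
            if score:
                collected.append(
                    {
                        "input": example.inputs,
                        "output": run.outputs,
                        "id": example.id,
                    }
                )
    return collected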
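
The core of this step is a filter: run the current system on training examples and keep only those it answers correctly, together with its own output, as few-shot demonstrations. A minimal offline sketch follows, where `fake_model` is a hypothetical stand-in for the LLM chain and the data is invented.

```python
def bootstrap(train_examples, predict, target=2):
    """Collect up to `target` examples whose prediction matches the gold answer."""
    collected = []
    for ex in train_examples:
        if len(collected) >= target:
            break
        output = predict(ex["inputs"])
        if output["is_entailed"].lower() == ex["answer"].lower():
            # Store the model's own output so its reasoning can be reused
            collected.append({"input": ex["inputs"], "output": output, "id": ex["id"]})
    return collected


def fake_model(inputs):
    # Hypothetical stand-in for an LLM call: always answers "No"
    return {"is_entailed": "No", "reasoning": "placeholder reasoning"}


# Invented training data for illustration only
train = [
    {"id": 1, "inputs": {"question": "q1"}, "answer": "Yes"},
    {"id": 2, "inputs": {"question": "q2"}, "answer": "No"},
    {"id": 3, "inputs": {"question": "q3"}, "answer": "No"},
    {"id": 4, "inputs": {"question": "q4"}, "answer": "No"},
]
demos = bootstrap(train, fake_model)
print([d["id"] for d in demos])  # [2, 3]
```

Because only correct predictions are kept, the stored reasoning traces are ones that led to the right label, which is what makes them useful as demonstrations.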

In [None]:
def eval(eval_dataset, chain, eval_config, step_n) -> float:
    """Compute the metrics on the validation dataset."""
    dev_results = client.run_on_dataset(
        dataset_name=eval_dataset,
        llm_or_chain_factory=chain,
        evaluation=eval_config,
        verbose=True,
        concurrency_level=1,
        project_metadata={
            "step": step_n,
        },
    )
    df = dev_results.to_dataframe()
    # Assumes a single feedback metric for now
    feedback_key = [c for c in df.columns if c.startswith("feedback.")][0]
    return df[feedback_key].mean()

In [None]:
def train(
    chain_constructor,
    train_dataset,
    eval_dataset,
    eval_config,
    steps: int = 5,
    k: int = 8,
    bootstrap_k: int = 8,
):
    """Run the full training loop"""
    best_score = eval(eval_dataset, chain_constructor(), eval_config, 0)
    best_step = 0
    scores = [(best_score, [])]
    train_examples = list(client.list_examples(dataset_name=train_dataset))
    for step_number in range(steps):
        collected = step(
            chain_constructor, train_examples, eval_config, bootstrap_k=bootstrap_k
        )
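        if len(collected) < k:
            # TODO: probably want some diversity of labels here lol
            # Pad with labeled training examples, converted to the dict format
            # that format_example expects. These carry the gold answer but no
            # model-generated reasoning, so the reasoning field is left empty.
            collected_ids = {e["id"] for e in collected}
            pool = [e for e in train_examples if e.id not in collected_ids]
            to_sample = min(k - len(collected), len(pool))
            collected += [
                {
                    "input": e.inputs,
                    "output": {"is_entailed": e.outputs["answer"], "reasoning": ""},
                    "id": e.id,
                }
                for e in random.sample(pool, to_sample)
            ]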
        selected_examples = collected
        updated_chain = chain_constructor(examples=selected_examples)
        updated_score = eval(eval_dataset, updated_chain, eval_config, step_number + 1)
        scores.append((updated_score, selected_examples))

        if updated_score > best_score:
            print(
                f"New best score {updated_score} > {best_score}. Updating selected examples."
            )
            best_score = updated_score
            best_step = step_number + 1
        else:
            print("Underperformed. Continuing")
    print("Best overall score: ", best_score)
    print("Best step: ", best_step)
    return sorted(scores, key=lambda x: x[0], reverse=True)

# Train


Now we can finally run the training loop!

In [None]:
import functools

# We will train with gpt-4o
llm = ChatOpenAI(model="gpt-4o")
all_scores = train(
    functools.partial(create_chain, llm=llm),
    train_name,
    dev_name,
    eval_config,
    #steps=10,
    steps=1,
)

In [None]:
all_scores

[(0.9375,
  [{'input': {'context': 'The three children are not in the classroom, but they are holding sunflowers.',
     'question': 'Can we logically conclude for sure that the three children are not in the classroom, but they are holding plants?'},
    'output': {'is_entailed': 'Yes',
     'reasoning': "Reasoning: Let's think step by step in order to produce the answer. We know from the context that the three children are holding sunflowers. Sunflowers are a type of plant. Therefore, if the children are holding sunflowers, they are indeed holding plants."},
    'id': UUID('8cef87d4-d3d2-4367-8689-824855c7cb0c')},
   {'input': {'context': 'The dog liked sandwich, but not from the store that is not far away.',
     'question': 'Can we logically conclude for sure that the dog liked food, but not from the store that is not far away?'},
    'output': {'is_entailed': 'Yes',
     'reasoning': 'Reasoning: Let\'s think step by step in order to produce the answer. We know from the context that

# Compare on held-out set

Selecting a pipeline based on a single benchmark makes it easy to overfit to that benchmark.

To check that the selected examples are reliably better, let's **compare the models on an unseen test set**.
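
A minimal sketch of why this matters, using invented scores rather than results from this notebook: choosing among candidates by dev score alone can favor one that does worse on held-out data, so the test score of the chosen candidate is the number to report.

```python
# Hypothetical dev/test scores, invented purely for illustration
candidates = {
    "baseline": {"dev": 0.50, "test": 0.50},
    "few_shot_a": {"dev": 0.94, "test": 0.90},
    "few_shot_b": {"dev": 0.96, "test": 0.71},
}

# Selection looks only at the dev score, as the training loop does
chosen = max(candidates, key=lambda name: candidates[name]["dev"])
gap = candidates[chosen]["dev"] - candidates[chosen]["test"]
print(chosen)         # few_shot_b
print(round(gap, 2))  # 0.25
```

Here the candidate with the best dev score has a large dev-to-test gap, which is the overfitting pattern the held-out comparison is meant to catch.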


In [None]:
best_score, best_examples = all_scores[0]

In [None]:
best_score

0.9375

In [None]:
best_examples

[{'input': {'context': 'The three children are not in the classroom, but they are holding sunflowers.',
   'question': 'Can we logically conclude for sure that the three children are not in the classroom, but they are holding plants?'},
  'output': {'is_entailed': 'Yes',
   'reasoning': "Reasoning: Let's think step by step in order to produce the answer. We know from the context that the three children are holding sunflowers. Sunflowers are a type of plant. Therefore, if the children are holding sunflowers, they are indeed holding plants."},
  'id': UUID('8cef87d4-d3d2-4367-8689-824855c7cb0c')},
 {'input': {'context': 'The dog liked sandwich, but not from the store that is not far away.',
   'question': 'Can we logically conclude for sure that the dog liked food, but not from the store that is not far away?'},
  'output': {'is_entailed': 'Yes',
   'reasoning': 'Reasoning: Let\'s think step by step in order to produce the answer. We know from the context that the dog liked the sandwich 

In [None]:
len(best_examples)

In [None]:
original_model = create_chain()
# This time we will apply gpt-3.5-turbo, but use the few-shot examples + reasoning trajectories
# from gpt-4o to help induce better performance
best_performing_model = create_chain(best_examples)

In [None]:
full_test_name

In [None]:
for model_name, model in [
    ("optimized", best_performing_model),
    # ("original", original_model),
]:
    client.run_on_dataset(
        dataset_name=full_test_name,
        llm_or_chain_factory=model,
        evaluation=eval_config,
        verbose=True,
        project_metadata={
            "model": model_name,
        },
    )

By adding few-shot examples, we were able to **improve performance from about 0.54 to about 0.95**!








# Measure performance without few-shot examples for comparison

In [None]:
for model_name, model in [
    # ("optimized", best_performing_model),
    ("original", original_model),
]:
    client.run_on_dataset(
        dataset_name=full_test_name,
        llm_or_chain_factory=model,
        evaluation=eval_config,
        verbose=True,
        project_metadata={
            "model": model_name,
        },
    )