# Building an LLM-as-a-judge evaluation to detect hallucinations with Braintrust

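Let's say you're working on a customer service bot and want to evaluate the quality of its responses. Consider a question like "What is your return policy?" If the correct answer is "You can return items within 30 days of purchase," but your bot generates "You can return items within 30 days," how would you decide whether this is a good response?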

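A heuristic like the `Levenshtein` string distance would indicate that the response is incorrect, because the two strings differ. A better approach is to use an LLM-as-a-judge to assess the accuracy of the response. LLM-as-a-judge is a technique that uses an LLM to score the quality of answers. Because LLMs can reason about language beyond surface-level string comparisons, they can evaluate answers more accurately.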
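To make the limitation of string heuristics concrete, here is a minimal sketch of a Levenshtein comparison on the two return-policy answers. The `levenshtein` and `similarity` helpers are written here for illustration only and are not part of the notebook or of any library it uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # prev holds the distances from the previous prefix of a to every prefix of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete a character from a
                    curr[j - 1] + 1,     # insert a character into a
                    prev[j - 1] + cost,  # substitute, or keep a match
                )
            )
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalize the distance to a 0-1 similarity (1 means identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


expected = "You can return items within 30 days of purchase"
generated = "You can return items within 30 days"

print(levenshtein(expected, generated))
print(round(similarity(expected, generated), 2))
```

The distance is non-zero even though a human reader would likely accept the shorter answer, which is the gap an LLM-based judge is meant to close.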

In this cookbook, we'll walk through how to build an LLM-as-a-judge scorer that can detect hallucinations using [Braintrust](https://www.braintrust.dev/), a third-party evaluation platform that is compatible with OpenAI's models.

## Installing dependencies

Let's install a few basic dependencies. We'll use the CoQA dataset (via DuckDB), [Braintrust](https://www.braintrust.dev/) for evals, and [OpenAI's models](https://platform.openai.com/docs/models). Please note that Braintrust is a third-party evaluation platform; review its [terms of service and privacy policy](https://www.braintrust.dev/legal/terms-of-service) before proceeding.

In [2]:
%pip install autoevals duckdb braintrust openai --quiet


Note: you may need to restart the kernel to use updated packages.


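Next, let's initialize the OpenAI client. We'll use the `AsyncOpenAI` client so that we can parallelize our requests. The `braintrust.wrap_openai` function wraps the OpenAI client so that LLM calls are logged to [Braintrust](https://www.braintrust.dev/). We'll use Braintrust to run the evaluations below.
Before proceeding, sign up for a [Braintrust account](https://www.braintrust.dev/signup) and set the `BRAINTRUST_API_KEY` environment variable to a valid API key. The code below also reads `OPENAI_API_KEY` from the environment.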

In [1]:
import os

import braintrust
from openai import AsyncOpenAI

braintrust.login(api_key=os.environ["BRAINTRUST_API_KEY"])
client = braintrust.wrap_openai(AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"]))
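The reason for choosing an async client is that many requests can be in flight at once. The sketch below illustrates the pattern with a stand-in coroutine instead of a real API call; `fake_llm_call` and `run_all` are hypothetical names introduced here, and the concurrency limit of 10 is an arbitrary choice.

```python
import asyncio


async def fake_llm_call(prompt: str) -> str:
    # Hypothetical stand-in for a network call such as client.chat.completions.create.
    await asyncio.sleep(0.01)
    return f"answer to: {prompt}"


async def run_all(prompts, max_concurrency=10):
    # The semaphore caps how many calls are in flight at the same time.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(prompt):
        async with semaphore:
            return await fake_llm_call(prompt)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(p) for p in prompts))


results = asyncio.run(run_all([f"question {i}" for i in range(25)]))
print(len(results))
```

Inside a notebook, where an event loop is already running, you would write `await run_all(...)` rather than `asyncio.run(...)`.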

## Explore the dataset

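We'll use the [CoQA dataset](https://stanfordnlp.github.io/coqa/), which contains a diverse set of passages, questions, and answers. Because CoQA is quite large, we'll only look at the first several passages. As with any public dataset, the underlying LLMs may have memorized parts of it, so when developing your own scorers it's a good idea to test them on your own private data.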

In [2]:
import duckdb

# DuckDB has an easy wrapper for loading datasets from Hugging Face.
con = duckdb.connect(":memory:")
full_result = con.query("""
    SELECT * FROM 'hf://datasets/stanfordnlp/coqa/data/validation-00000-of-00001.parquet'
        LIMIT 40
""").fetchall()

single_result = full_result[10]

print("Passage:")
print(single_result[1])

print("\nQuestion:")
print(single_result[2][0])

print("\nAnswer:")
print(single_result[3]["input_text"][0])

Passage:
(CNN)A chiseled boxer's Instagram feed shows him making constant references to the Bible and enjoying gospel singing with his wife. 

Another features his formidable opponent counting stacks of money, hanging out in strip clubs, and flashing diamond watches and Ferraris. 

Welcome to the world of boxing promotion, circa 2015. 

American Floyd Mayweather and Filipino Manny Pacquiao are set to officially announce their heavily anticipated boxing match at a press conference in Los Angeles Wednesday. 

With the combined purse for the May 2 bout in Las Vegas reported to touch $300 million pending viewership numbers, the incentives to self-promote could not be higher. 

"Nowadays you have to be on social media to launch the fight and to build hype," says boxing promoter Nisse Sauerland, CEO of Team Sauerland. "It couldn't be done without it." 

Thirty-eight year old Mayweather (47-0, 26 knockouts), who favors the moniker "The Money Man" or "TBE" (The Best Ever), boasts nearly five m

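The data contains a series of passages, each with several questions and answers. Let's flatten this into a list of `QuestionAnswer` records, one per `(passage, question, answer)` triple. For these baseline records, the generated answer is simply the expected answer.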

In [3]:
from dataclasses import dataclass


@dataclass
class QuestionAnswer:
    passage: str
    question: str
    expected_answer: str
    generated_answer: str


qa_pairs = [
    QuestionAnswer(
        passage=r[1],
        question=question,
        generated_answer=r[3]["input_text"][i],
        expected_answer=r[3]["input_text"][i],
    )
    for r in full_result
    for (i, question) in enumerate(r[2])
]

print(len(qa_pairs))

629


### Adding hallucinations

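Because Braintrust's scorer is designed to test for hallucinations, we can use the QA pairs to generate known hallucinations. We'll create hallucinated answers by asking an LLM to confidently answer each question without access to the passage.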

In [4]:
import asyncio
import random

random.seed(42)


async def hallucinate_answer(qa):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """\
You are a helpful hallucinating assistant, who makes up fake answers to questions.

Answer the following question in 1 sentence. If you know the answer, then make up some fake
superfluous details that are not in the passage you have memorized.

Make sure to always answer it confidently, even if you don't know the answer. Do not use words
like "perhaps", "likely", "maybe", etc. or punctuation like "...".Do not admit that you cannot
or do not know the answer.""",
            },
            {"role": "user", "content": qa.question},
        ],
        temperature=1,
        max_tokens=100,
    )
    return response.choices[0].message.content


hallucinated_answers = await asyncio.gather(
    *[hallucinate_answer(qa) for qa in qa_pairs]
)


hallucinations = [
    QuestionAnswer(
        passage=qa.passage,
        question=qa.question,
        expected_answer=qa.expected_answer,
        generated_answer=hallucination,
    )
    for (qa, hallucination) in zip(qa_pairs, hallucinated_answers)
    # Exclude simple yes/no answers.
    if "yes" not in hallucination.lower() and "no" not in hallucination.lower()
]

print("Passage:")
print(hallucinations[0].passage)
print("\nQuestion:")
print(hallucinations[0].question)
print("\nExpected Answer:")
print(hallucinations[0].expected_answer)
print("\nGenerated Answer:")
print(hallucinations[0].generated_answer)

print("\n\nNumber of hallucinations:", len(hallucinations))

Passage:
Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. But Cotton wasn't alone in her little home above the barn, oh no. She shared her hay bed with her mommy and 5 other sisters. All of her sisters were cute and fluffy, like Cotton. But she was the only white one in the bunch. The rest of her sisters were all orange with beautiful white tiger stripes like Cotton's mommy. Being different made Cotton quite sad. She often wished she looked like the rest of her family. So one day, when Cotton found a can of the old farmer's orange paint, she used it to paint herself like them. When her mommy and sisters found her they started laughing. 

"What are you doing, Cotton?!" 

"I only wanted to be more like you". 

Cotton's mommy rubbed her face on Cotton's and said "Oh Cotton, but your fur is so pretty and special, like you. We would never want you to

## Creating the evaluators

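We'll consider a few popular approaches for creating an LLM-as-a-judge. For each approach, we'll create a scorer and then "meta-evaluate" it to see how it performs.
Since we know the hallucinated answers are incorrect, we'll assess the quality of an evaluator by testing how often it scores the hallucinated answers as `0`.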
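As a rough illustration of this meta-evaluation idea, the sketch below computes the fraction of known hallucinations that a judge scored exactly `0`. The `zero_rate` helper and the example scores are hypothetical; the evaluations later in this cookbook use a graded difference rather than this strict all-or-nothing count.

```python
def zero_rate(scores):
    """Fraction of hallucinated answers that the judge scored exactly 0."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(1 for s in scores if s == 0) / len(scores)


# Hypothetical judge scores for five answers that are all known hallucinations.
example_scores = [0, 0, 0.5, 0, 1]
print(zero_rate(example_scores))
```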

### LLM-as-a-judge #1: Numeric rater

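A common initial intuition when creating an LLM-as-a-judge is to ask the LLM to rate the answer on a scale of 1 to 5. The benefit of this approach is that it's easy to convert the LLM's output into a numeric score.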

We'll use a modified version of the [Factuality](https://github.com/braintrustdata/autoevals/blob/main/templates/factuality.yaml) template, but ask the LLM to rate the answer on a scale of 1 to 10.

In [5]:
import json

PROMPT = """\
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {input}
************
[Expert]: {expected}
************
[Submission]: {output}
************
[END DATA]

Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
Rate the submission on a scale of 1 to 10.
"""


@braintrust.traced
async def numeric_rater(input, output, expected):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": PROMPT.format(input=input, output=output, expected=expected),
            }
        ],
        temperature=0,
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "rate",
                    "description": "Rate the submission on a scale of 1 to 10.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "rating": {"type": "integer", "minimum": 1, "maximum": 10},
                        },
                        "required": ["rating"],
                    },
                },
            }
        ],
        tool_choice={"type": "function", "function": {"name": "rate"}},
    )
    arguments = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    return (arguments["rating"] - 1) / 9


print(qa_pairs[10].question, "On a correct answer:", qa_pairs[10].generated_answer)
print(
    await numeric_rater(
        qa_pairs[10].question,
        qa_pairs[10].generated_answer,
        qa_pairs[10].expected_answer,
    )
)

print(
    hallucinations[10].question,
    "On a hallucinated answer:",
    hallucinations[10].generated_answer,
)
print(
    await numeric_rater(
        hallucinations[10].question,
        hallucinations[10].generated_answer,
        hallucinations[10].expected_answer,
    )
)

What did the other cats do when Cotton emerged from the bucket of water? On a correct answer: licked her face
1.0
What? On a hallucinated answer: "What" is a word often used to express inquiry, curiosity, or surprise, and it is said to have originated from the ancient city of Whatopia, where people would constantly ask questions while enchanted crows delivered cryptic messages.
0.0


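This looks promising! Now that we've sanity-checked it on a single example, let's run a proper evaluation to see how it performs on a wider set of data. An evaluation consists of three components: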

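- **Data**: In this case, the `input` is the question, the hallucinated answer, and the ground-truth answer. The scorer converts this into a score between 0 and 1. Because every answer is a hallucination, the expected score is 0.
- **Task**: The task is simply calling the numeric rater on each input.
- **Scores**: We assess the quality of the generated score by comparing it with the expected score. Since both numbers are between 0 and 1, we can use the normalized difference, `1 - abs(output - expected)`, as the score.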
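The arithmetic behind these components can be traced by hand. The sketch below maps a few 1-to-10 ratings onto the 0-to-1 range and then applies the normalized difference against an expected score of 0; `rating_to_score` is a hypothetical helper name that mirrors the `(rating - 1) / 9` expression in the rater.

```python
def rating_to_score(rating: int, low: int = 1, high: int = 10) -> float:
    """Map an integer rating in [low, high] linearly onto [0, 1]."""
    return (rating - low) / (high - low)


def normalized_diff(output: float, expected: float) -> float:
    """1 when the judge's score equals the expected score, 0 when it is maximally far."""
    return 1 - abs(output - expected)


# A hallucination has an expected score of 0, so any rating above 1 loses credit.
for rating in (1, 4, 10):
    score = rating_to_score(rating)
    print(rating, round(score, 3), round(normalized_diff(score, 0), 3))
```

A rating of 1 on a hallucination earns full credit, a rating of 10 earns none, and intermediate ratings earn partial credit in proportion to how close they are to 1.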

In [7]:
from dataclasses import asdict

from braintrust import Eval


def data():
    for pair in hallucinations:
        yield dict(
            input=dict(asdict(pair)), expected=0, metadata=dict(hallucination=True)
        )


async def task(input):
    return await numeric_rater(
        input=input["question"],
        output=input["generated_answer"],
        expected=input["expected_answer"],
    )


def normalized_diff(output, expected):
    return 1 - abs(output - expected)


await Eval(
    "LLM-as-a-judge",
    data=data,
    task=task,
    scores=[normalized_diff],
    experiment_name="Numeric rater",
    max_concurrency=10,
)

Experiment Numeric rater is running at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Numeric%20rater
LLM-as-a-judge [experiment_name=Numeric rater] (data): 270it [00:00, 54634.41it/s]


LLM-as-a-judge [experiment_name=Numeric rater] (tasks):   0%|          | 0/270 [00:00<?, ?it/s]


95.35% 'normalized_diff' score

201.60tok prompt_tokens
5tok completion_tokens
206.60tok total_tokens

See results for Numeric rater at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Numeric%20rater


EvalResultWithSummary(summary="...", results=[...])

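The numeric rater scored about 95% overall (95.35% on `normalized_diff`). That's not bad, but a gap of roughly 5% from a perfect score can make it very hard to trust the evaluations. Let's dig into the Braintrust UI to get some insight into what's going on.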

![Partial credit](../images/Custom-LLM-as-a-Judge-Partial-Credit.gif)

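It looks like many of the incorrect answers were given intermediate ratings between 1 and 10. However, we currently have no insight into why the model gave these scores. Let's see if we can fix that next.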

### LLM-as-a-judge #2: Adding reasoning

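Let's tweak the prompt so that the LLM also reasons about its rating. This technique is called [Chain of Thought Reasoning](https://en.wikipedia.org/wiki/Chain_of_thought_reasoning). In addition to potentially improving the score, it gives us some insight into why the model assigned these scores.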

In [9]:
@braintrust.traced
async def numeric_rater(input, output, expected):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": PROMPT.format(input=input, output=output, expected=expected),
            }
        ],
        temperature=0,
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "rate",
                    "description": "Rate the submission on a scale of 1 to 10.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "reasons": {
                                "description": "Write out in a step by step manner your reasoning to be sure that your conclusion is correct. Avoid simply stating the correct answer at the outset.",
                                "title": "Reasoning",
                                "type": "string",
                            },
                            "rating": {"type": "integer", "minimum": 1, "maximum": 10},
                        },
                        "required": ["rating"],
                    },
                },
            }
        ],
        tool_choice={"type": "function", "function": {"name": "rate"}},
    )
    arguments = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    return (arguments["rating"] - 1) / 9


print(qa_pairs[10].question, "On a correct answer:", qa_pairs[10].generated_answer)
print(
    await numeric_rater(
        qa_pairs[10].question,
        qa_pairs[10].generated_answer,
        qa_pairs[10].expected_answer,
    )
)

print(
    hallucinations[10].question,
    "On a hallucinated answer:",
    hallucinations[10].generated_answer,
)
print(
    await numeric_rater(
        hallucinations[10].question,
        hallucinations[10].generated_answer,
        hallucinations[10].expected_answer,
    )
)

What did the other cats do when Cotton emerged from the bucket of water? On a correct answer: licked her face
1.0
What? On a hallucinated answer: "What" is a word often used to express inquiry, curiosity, or surprise, and it is said to have originated from the ancient city of Whatopia, where people would constantly ask questions while enchanted crows delivered cryptic messages.
0.0


In [10]:
await Eval(
    "LLM-as-a-judge",
    data=data,
    task=task,
    scores=[normalized_diff],
    experiment_name="Numeric rater with reasoning",
    max_concurrency=10,
)

Experiment Numeric rater with reasoning is running at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Numeric%20rater%20with%20reasoning
LLM-as-a-judge [experiment_name=Numeric rater with reasoning] (data): 270it [00:00, 111715.70it/s]


LLM-as-a-judge [experiment_name=Numeric rater with reasoning] (tasks):   0%|          | 0/270 [00:00<?, ?it/s]


Numeric rater with reasoning compared to Numeric rater:
92.10% (-03.25%) 'normalized_diff' score	(5 improvements, 63 regressions)

3.68s duration
3.68s llm_duration
239.60tok (+3800.00%) 'prompt_tokens'    	(0 improvements, 270 regressions)
136.82tok (+13182.22%) 'completion_tokens'	(0 improvements, 270 regressions)
376.43tok (+16982.22%) 'total_tokens'     	(0 improvements, 270 regressions)
0.00$ estimated_cost

See results for Numeric rater with reasoning at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Numeric%20rater%20with%20reasoning


EvalResultWithSummary(summary="...", results=[...])

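Adding reasoning doesn't appear to have improved the score (in fact, it's about 3 percentage points worse: 92.10% versus 95.35%). However, by looking at one of the failures, we can get some insight into what the model was thinking. Here is an example of a hallucinated answer: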

![Output](../images/Custom-LLM-as-a-Judge-Output.png)

And here is the score along with its reasoning:

![Reasoning](../images/Custom-LLM-as-a-Judge-Reasoning.png)

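It looks like the model is applying its own judgment to award partial credit. This is a common problem with numeric rating, for both models and humans, and it can often be solved with a better prompt.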

### LLM-as-a-judge #3: Classifying instead of rating

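Next, we'll spell out specific criteria and ask the model to classify the answer according to them. This lets us steer the model more precisely toward the hallucinations we want to test for. Intuitively, giving the model specific criteria to evaluate against should produce more accurate scores.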

In [11]:
PROMPT = """\
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {input}
************
[Expert]: {expected}
************
[Submission]: {output}
************
[END DATA]

Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.

Answer the question by calling `select_choice` with your reasoning in a step-by-step manner to be
sure that your conclusion is correct. Avoid simply stating the correct answer at the outset. Select a
single choice by setting the `choice` parameter to a single choice from A, B, C, D, or E.
"""

# Since we're testing for hallucinations, penalize (B) as much as (D).
CHOICE_SCORES = {
    "A": 0.5,
    "B": 0,
    "C": 1,
    "D": 0,
    "E": 1,
}


@braintrust.traced
async def classifier(input, output, expected):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": PROMPT.format(input=input, output=output, expected=expected),
            }
        ],
        temperature=0,
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "rate",
                    "description": "Call this function to select a choice.",
                    "parameters": {
                        "properties": {
                            "reasons": {
                                "description": "Write out in a step by step manner your reasoning to be sure that your conclusion is correct. Avoid simply stating the correct answer at the outset.",
                                "type": "string",
                            },
                            "choice": {
                                "description": "The choice",
                                "type": "string",
                                "enum": ["A", "B", "C", "D", "E"],
                            },
                        },
                        "required": ["reasons", "choice"],
                        "type": "object",
                    },
                },
            }
        ],
        tool_choice={"type": "function", "function": {"name": "rate"}},
    )
    arguments = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    choice = arguments["choice"]
    return CHOICE_SCORES[choice] if choice in CHOICE_SCORES else None


print(qa_pairs[10].question, "On a correct answer:", qa_pairs[10].generated_answer)
print(
    await classifier(
        qa_pairs[10].question,
        qa_pairs[10].generated_answer,
        qa_pairs[10].expected_answer,
    )
)

print(
    hallucinations[10].question,
    "On a hallucinated answer:",
    hallucinations[10].generated_answer,
)
print(
    await classifier(
        hallucinations[10].question,
        hallucinations[10].generated_answer,
        hallucinations[10].expected_answer,
    )
)

What did the other cats do when Cotton emerged from the bucket of water? On a correct answer: licked her face
1
What? On a hallucinated answer: "What" is a word often used to express inquiry, curiosity, or surprise, and it is said to have originated from the ancient city of Whatopia, where people would constantly ask questions while enchanted crows delivered cryptic messages.
0


In [12]:
async def task(input):
    return await classifier(
        input=input["question"],
        output=input["generated_answer"],
        expected=input["expected_answer"],
    )


await Eval(
    "LLM-as-a-judge",
    data=data,
    task=task,
    scores=[normalized_diff],
    experiment_name="Classifier",
    max_concurrency=10,
)

Experiment Classifier is running at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Classifier
LLM-as-a-judge [experiment_name=Classifier] (data): 270it [00:00, 84930.41it/s]


LLM-as-a-judge [experiment_name=Classifier] (tasks):   0%|          | 0/270 [00:00<?, ?it/s]


Classifier compared to Numeric rater with reasoning:
98.15% (+06.05%) 'normalized_diff' score	(86 improvements, 5 regressions)

4.41s (+72.60%) 'duration'         	(104 improvements, 165 regressions)
4.40s (+72.59%) 'llm_duration'     	(104 improvements, 165 regressions)
418.60tok (+17900.00%) 'prompt_tokens'    	(0 improvements, 270 regressions)
164.91tok (+2809.26%) 'completion_tokens'	(64 improvements, 204 regressions)
583.52tok (+20709.26%) 'total_tokens'     	(0 improvements, 270 regressions)
0.00$ (+00.07%) 'estimated_cost'   	(8 improvements, 255 regressions)

See results for Classifier at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Classifier


EvalResultWithSummary(summary="...", results=[...])

The classifier scored 98%, which is a significant improvement!

### Codifying this pattern

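The classifier above can be rewritten more concisely with `autoevals.LLMClassifier`, as follows. Note that the template here uses `{{input}}`-style placeholders rather than `str.format` fields: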

```python
import autoevals

PROMPT = """\
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {{input}}
************
[Expert]: {{expected}}
************
[Submission]: {{output}}
************
[END DATA]

Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.

Answer the question by calling `select_choice` with your reasoning in a step-by-step manner to be
sure that your conclusion is correct. Avoid simply stating the correct answer at the outset. Select a
single choice by setting the `choice` parameter to a single choice from A, B, C, D, or E.
"""

Classifier = autoevals.LLMClassifier(
    name="Hallucination detector",
    prompt_template=PROMPT,
    choice_scores={"A": 0.5, "B": 0, "C": 1, "D": 0, "E": 1},
    use_cot=True,
)
```

## Next steps

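As a next step, you could dig into the individual improvements and regressions to assess them and consider future improvements to the prompt. You could also test on your own data and double-check that the results hold for your use case.
You could also measure a model like o1, try fine-tuning a smaller model to see whether the results are reproducible, or use few-shot prompting to align the model with more subjective criteria.
In all cases, you should strive to evaluate your results so that you can rigorously assess the impact of each change.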
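As one possible starting point for few-shot prompting, the sketch below assembles a block of labelled examples that could be placed ahead of the classification prompt. The examples, their labels, and the `format_example` and `build_few_shot_prefix` helpers are all hypothetical; in practice you would draw the examples from your own data.

```python
# Hypothetical labelled examples; replace these with cases from your own data.
FEW_SHOT_EXAMPLES = [
    {
        "question": "What is your return policy?",
        "expert": "You can return items within 30 days of purchase",
        "submission": "You can return items within 30 days",
        "choice": "A",
    },
    {
        "question": "What is your return policy?",
        "expert": "You can return items within 30 days of purchase",
        "submission": "You can return items within 90 days, no receipt needed",
        "choice": "D",
    },
]


def format_example(ex: dict) -> str:
    """Render one labelled example in the same bracketed style as the prompt."""
    return (
        f"[Question]: {ex['question']}\n"
        f"[Expert]: {ex['expert']}\n"
        f"[Submission]: {ex['submission']}\n"
        f"[Choice]: {ex['choice']}"
    )


def build_few_shot_prefix(examples: list) -> str:
    """Join the labelled examples into a block to place before the real data."""
    blocks = "\n\n".join(format_example(ex) for ex in examples)
    return f"Here are some labelled examples:\n\n{blocks}\n\n"


prefix = build_few_shot_prefix(FEW_SHOT_EXAMPLES)
print(prefix)
```

If the prefix is combined with a template that is later passed through `str.format` or a mustache renderer, any braces in the example text would need escaping first.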