# Code-graded eval: classification task

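In this lesson, we implement a slightly more complex **code-graded evaluation** from scratch to test a prompt that classifies customer complaints.

The goal is to write a prompt that reliably classifies customer complaints into the following categories: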

* Software Bug
* Hardware Malfunction
* User Error
* Feature Request
* Service Outage

For example, the following complaint:

> The website is completely down, I can't access any pages

should be classified as `Service Outage`.

In some cases we also want to allow up to two applicable categories, as in this example:

> I think I installed something incorrectly, and now my computer won't start at all

This should be classified as both `User Error` and `Hardware Malfunction`.

---


## The evaluation data set

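We start by defining an evaluation data set made up of inputs and golden answers (the correct labels). An evaluation data set with around 100 inputs is generally preferable, but this lesson uses only a handful of items to keep runs fast and cheap.

The test set is a list of dictionaries. Each dictionary has a `complaint` string and a `golden_answer` list containing one or two category names: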


In [1]:
eval_data = [
    {
        "complaint": "The app crashes every time I try to upload a photo",
        "golden_answer": ["Software Bug"]
    },
    {
        "complaint": "My printer isn't recognized by my computer",
        "golden_answer": ["Hardware Malfunction"]
    },
    {
        "complaint": "I can't figure out how to change my password",
        "golden_answer": ["User Error"]
    },
    {
        "complaint": "The website is completely down, I can't access any pages",
        "golden_answer": ["Service Outage"]
    },
    {
        "complaint": "It would be great if the app had a dark mode option",
        "golden_answer": ["Feature Request"]
    },
    {
        "complaint": "The software keeps freezing when I try to save large files",
        "golden_answer": ["Software Bug"]
    },
    {
        "complaint": "My wireless mouse isn't working, even with new batteries",
        "golden_answer": ["Hardware Malfunction"]
    },
    {
        "complaint": "I accidentally deleted some important files, can you help me recover them?",
        "golden_answer": ["User Error"]
    },
    {
        "complaint": "None of your servers are responding, is there an outage?",
        "golden_answer": ["Service Outage"]
    },
    {
        "complaint": "Could you add a feature to export data in CSV format?",
        "golden_answer": ["Feature Request"]
    },
    {
        "complaint": "The app is crashing and my phone is overheating",
        "golden_answer": ["Software Bug", "Hardware Malfunction"]
    },
    {
        "complaint": "I can't remember my password!",
        "golden_answer": ["User Error"]
    },
    {
        "complaint": "The new update broke something and the app no longer works for me",
        "golden_answer": ["Software Bug"]
    },
    {
        "complaint": "I think I installed something incorrectly, now my computer won't start at all",
        "golden_answer": ["User Error", "Hardware Malfunction"]
    },
    {
        "complaint": "Your service is down, and I urgently need a feature to batch process files",
        "golden_answer": ["Service Outage", "Feature Request"]
    },
    {
        "complaint": "The graphics card is making weird noises",
        "golden_answer": ["Hardware Malfunction"]
    },
    {
        "complaint": "My keyboard just totally stopped working out of nowhere",
        "golden_answer": ["Hardware Malfunction"]
    },
    {
        "complaint": "Whenever I open your app, my phone gets really slow",
        "golden_answer": ["Software Bug"]
    },
    {
        "complaint": "Can you make the interface more user-friendly? I always get lost in the menus",
        "golden_answer": ["Feature Request", "User Error"]
    },
    {
        "complaint": "The cloud storage isn't syncing and I can't access my files from other devices",
        "golden_answer": ["Software Bug", "Service Outage"]
    }
]
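
Because the grader compares labels against the golden answers, a typo in a golden label would silently turn into a wrong score. The sketch below is one way to sanity-check a data set of this shape before running anything. The names `ALLOWED_CATEGORIES`, `MAX_LABELS`, and `validate_eval_data` are hypothetical helpers introduced here for illustration, and the two-item `sample` list (including its deliberately misspelled label) is made up for the demo.

```python
# Hypothetical helper, not part of the original notebook.
ALLOWED_CATEGORIES = {
    "Software Bug",
    "Hardware Malfunction",
    "User Error",
    "Feature Request",
    "Service Outage",
}
MAX_LABELS = 2  # assumption: the lesson allows at most two categories per complaint


def validate_eval_data(data):
    """Return a list of human-readable problems found in the data set."""
    problems = []
    for index, item in enumerate(data):
        labels = item.get("golden_answer", [])
        if not item.get("complaint"):
            problems.append(f"item {index}: missing complaint text")
        if not 1 <= len(labels) <= MAX_LABELS:
            problems.append(f"item {index}: expected 1 to {MAX_LABELS} labels, got {len(labels)}")
        unknown = [label for label in labels if label not in ALLOWED_CATEGORIES]
        if unknown:
            problems.append(f"item {index}: unknown labels {unknown}")
    return problems


# Made-up sample: the second item has a casing typo in its label.
sample = [
    {"complaint": "The app crashes every time I try to upload a photo",
     "golden_answer": ["Software Bug"]},
    {"complaint": "My printer isn't recognized by my computer",
     "golden_answer": ["Hardware malfunction"]},
]

for problem in validate_eval_data(sample):
    print(problem)
```

Running the same check on the full `eval_data` list should return an empty list if every label is spelled exactly as in the category list.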

---

## An initial prompt

We start by measuring how a basic prompt performs. The prompt-generating function below takes a `complaint` and returns a prompt string:


In [2]:
def basic_prompt(complaint):
    return f"""
    Classify the following customer complaint into one or more of these categories: 
    Software Bug, Hardware Malfunction, User Error, Feature Request, or Service Outage.
    Only respond with the matching category or categories and nothing else.

    Complaint: {complaint}

    Classification:
    """

---

## Collecting outputs

Next, we write the logic that evaluates the prompt. It is a little more involved than the "leg count" example from the previous lesson:


In [3]:
from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()
client = Anthropic()

def get_model_response(prompt, model_name):
    response = client.messages.create(
        model=model_name,
        max_tokens=200,
        messages=[{'role': 'user', 'content': prompt}]
    )
    return response.content[0].text

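def calculate_accuracy(eval_data, model_responses):
    """Return the fraction of responses whose category set equals the golden set."""
    correct_predictions = 0
    total_predictions = len(eval_data)
    
    for item, response in zip(eval_data, model_responses):
        # Lowercase both sides and compare as sets so order and casing are ignored.
        golden_set = set(category.lower() for category in item["golden_answer"])
        # The response is expected to be a comma-separated list of category names.
        prediction_set = set(category.strip().lower() for category in response.split(','))
        
        # Set equality: a missing or extra category counts as incorrect.
        if golden_set == prediction_set:
            correct_predictions += 1
    
    return correct_predictions / total_predictions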

def evaluate_prompt(prompt_func, eval_data, model_name):
    print(f"Evaluating with model: {model_name}")
    model_responses = [get_model_response(prompt_func(item['complaint']), model_name) for item in eval_data]
    accuracy = calculate_accuracy(eval_data, model_responses)
    
    print(f"Accuracy: {accuracy:.2%}")
    
    for item, response in zip(eval_data, model_responses):
        print(f"\nComplaint: {item['complaint']}")
        print(f"Golden Answer: {item['golden_answer']}")
        print(f"Model Response: {response}")
    return accuracy

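The `evaluate_prompt` function does the following:

1. It passes each input to the prompt-generating function, runs the model through `get_model_response`, and collects the responses.
2. It calls `calculate_accuracy` to compare the model outputs with the golden answers in the data set.
3. `calculate_accuracy` splits each response on commas, trims and lowercases each label, and compares the resulting `set` with the set of golden labels. Unlike the raw string exact-match in the previous "leg count" eval, the order and casing of the labels do not matter, but a response with a missing or extra category still counts as incorrect.
4. `calculate_accuracy` returns the accuracy score.
5. `evaluate_prompt` prints the accuracy along with each complaint, golden answer, and model response, then returns the accuracy.

**Note that instead of comparing raw strings as before, we compare `set`s of categories: the labels may appear in any order, but the predicted set must match the golden set.**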
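
To make the grading rule concrete, here is a minimal sketch that applies the same comparison as `calculate_accuracy` to a few hand-written responses. The helper names `parse_categories` and `is_correct` are hypothetical and introduced only for this illustration, and the response strings are made up to show the edge cases.

```python
def parse_categories(response: str) -> set:
    """Split a comma-separated response into a set of trimmed, lowercased labels."""
    return {part.strip().lower() for part in response.split(",")}


def is_correct(golden_answer: list, response: str) -> bool:
    """Apply the same set-equality rule that calculate_accuracy uses."""
    golden_set = {category.lower() for category in golden_answer}
    return golden_set == parse_categories(response)


# Hand-written cases: (golden answer, model response)
cases = [
    (["Software Bug", "Hardware Malfunction"], "Hardware Malfunction, software bug"),
    (["Hardware Malfunction"], "Hardware Malfunction, Software Bug, User Error"),
    (["User Error", "Hardware Malfunction"], "User Error"),
    (["Software Bug"], "'Software Bug'"),
]

for golden, response in cases:
    print(f"{response!r} vs {golden}: {is_correct(golden, response)}")
```

Only the first case passes: order and casing are ignored, while an extra label, a missing label, or stray quote characters around a label all count as a miss.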


Let's try it with the initial `basic_prompt`.

In [7]:
# evaluate_prompt(basic_prompt, eval_data, model_name="claude-3-haiku-20240307")
evaluate_prompt(basic_prompt, eval_data, model_name="claude-sonnet-4-5-20250929")


Evaluating with model: claude-sonnet-4-5-20250929
Accuracy: 85.00%

Complaint: The app crashes every time I try to upload a photo
Golden Answer: ['Software Bug']
Model Response: Software Bug

Complaint: My printer isn't recognized by my computer
Golden Answer: ['Hardware Malfunction']
Model Response: Hardware Malfunction, Software Bug, User Error

Complaint: I can't figure out how to change my password
Golden Answer: ['User Error']
Model Response: User Error

Complaint: The website is completely down, I can't access any pages
Golden Answer: ['Service Outage']
Model Response: Service Outage

Complaint: It would be great if the app had a dark mode option
Golden Answer: ['Feature Request']
Model Response: Feature Request

Complaint: The software keeps freezing when I try to save large files
Golden Answer: ['Software Bug']
Model Response: Software Bug

Complaint: My wireless mouse isn't working, even with new batteries
Golden Answer: ['Hardware Malfunction']
Model Response: Hardware Malfun

0.85
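
An accuracy of 85% on 20 items means three responses were graded as incorrect, and scanning the full printout to find them is tedious. The sketch below lists only the misses, along with which categories were missing or extra. `find_mismatches` is a hypothetical helper, and it assumes the model responses have been collected into a list, which `evaluate_prompt` as written does not return. The two sample responses are taken from the output shown above.

```python
def find_mismatches(eval_data, model_responses):
    """Return one record per item whose predicted category set differs from the golden set."""
    mismatches = []
    for item, response in zip(eval_data, model_responses):
        golden = {category.lower() for category in item["golden_answer"]}
        predicted = {category.strip().lower() for category in response.split(",")}
        if golden != predicted:
            mismatches.append({
                "complaint": item["complaint"],
                "missing": sorted(golden - predicted),  # expected but not predicted
                "extra": sorted(predicted - golden),    # predicted but not expected
            })
    return mismatches


sample_data = [
    {"complaint": "The app crashes every time I try to upload a photo",
     "golden_answer": ["Software Bug"]},
    {"complaint": "My printer isn't recognized by my computer",
     "golden_answer": ["Hardware Malfunction"]},
]
sample_responses = [
    "Software Bug",
    "Hardware Malfunction, Software Bug, User Error",
]

for miss in find_mismatches(sample_data, sample_responses):
    print(miss)
```

For the printer complaint, the record shows two extra categories and nothing missing, which suggests the basic prompt over-labels ambiguous complaints.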

---

## An improved prompt

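The initial prompt scored 85%. Let's improve the prompt and rerun the evaluation to see whether the score goes up.

The next prompt describes each category in more detail and adds nine input/output examples (few-shot):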


In [5]:
def improved_prompt(complaint):
    return f"""
    You are an AI assistant specializing in customer support issue classification. Your task is to analyze customer complaints and categorize them into one or more of the following categories:

    1. Software Bug: Issues related to software not functioning as intended.
    2. Hardware Malfunction: Problems with physical devices or components.
    3. User Error: Difficulties arising from user misunderstanding or misuse.
    4. Feature Request: Suggestions for new functionalities or improvements.
    5. Service Outage: System-wide issues affecting service availability.

    Important Guidelines:
    - A complaint may fall into multiple categories. If so, list all that apply but try to prioritize picking a single category when possible.

    Examples:
    1. Complaint: "The app crashes when I try to save my progress."
    Classification: Software Bug

    2. Complaint: "My keyboard isn't working after I spilled coffee on it."
    Classification: Hardware Malfunction

    3. Complaint: "I can't find the login button on your website."
    Classification: User Error

    4. Complaint: "It would be great if your app had a dark mode."
    Classification: Feature Request

    5. Complaint: "None of your services are loading for me or my colleagues."
    Classification: Service Outage

    6. Complaint: "The app breaks every time I try to change my profile picture"
    Classification: Software Bug

    7. Complaint: "The app is acting buggy on my phone and it seems like your website is down, so I'm completely stuck!"
    Classification: Software Bug, Service Outage

    8. Complaint: "Your software makes my computer super laggy and awful, I hate it!"
    Classification: Software Bug

    9. Complaint: "Your dumb app always breaks when I try to do anything with images."
    Classification: Software Bug

    Now, please classify the following customer complaint:

    <complaint>{complaint}</complaint>

    Only respond with the appropriate categories and nothing else.
    Classification:
    """

Run the evaluation with the improved prompt:


In [8]:
# evaluate_prompt(improved_prompt, eval_data, model_name="claude-3-haiku-20240307")
evaluate_prompt(improved_prompt, eval_data, model_name="claude-sonnet-4-5-20250929")

Evaluating with model: claude-sonnet-4-5-20250929
Accuracy: 95.00%

Complaint: The app crashes every time I try to upload a photo
Golden Answer: ['Software Bug']
Model Response: Software Bug

Complaint: My printer isn't recognized by my computer
Golden Answer: ['Hardware Malfunction']
Model Response: Hardware Malfunction

Complaint: I can't figure out how to change my password
Golden Answer: ['User Error']
Model Response: User Error

Complaint: The website is completely down, I can't access any pages
Golden Answer: ['Service Outage']
Model Response: Service Outage

Complaint: It would be great if the app had a dark mode option
Golden Answer: ['Feature Request']
Model Response: Feature Request

Complaint: The software keeps freezing when I try to save large files
Golden Answer: ['Software Bug']
Model Response: Software Bug

Complaint: My wireless mouse isn't working, even with new batteries
Golden Answer: ['Hardware Malfunction']
Model Response: Hardware Malfunction

Complaint: I accide

0.95

With the improved prompt, accuracy rises from 85% to 95%!

Once again, we followed the standard "prompt + eval" loop:

![process.png](../images/process.png)

**Keep in mind that this is a simple evaluation on a very small data set. The lesson is meant to show the general flow of a code-graded eval, not to serve as a canonical example of a production-scale evaluation.**
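
One way to see how much the small data set limits the conclusion is to put a confidence interval around each accuracy. The sketch below uses the Wilson score interval, a standard formula for a proportion. It is not part of the original lesson. `wilson_interval` is a hypothetical helper, and the counts 17/20 and 19/20 are derived from the 85% and 95% scores on 20 items.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


basic_low, basic_high = wilson_interval(17, 20)        # basic prompt: 85%
improved_low, improved_high = wilson_interval(19, 20)  # improved prompt: 95%

print(f"basic:    {basic_low:.2f} to {basic_high:.2f}")
print(f"improved: {improved_low:.2f} to {improved_high:.2f}")
```

With only 20 items the two intervals overlap considerably, so the jump from 85% to 95% is encouraging but not conclusive. A data set closer to the 100 inputs mentioned earlier would narrow both intervals.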

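This approach works, but writing evaluation logic from scratch every time is tedious, and comparing results side by side is hard. What if a tool could present results in clean tables and charts and make it easy to run an eval across multiple models? That is exactly what the next lesson covers: an evaluation framework that makes it easier to write repeatable, scalable evals for production use cases.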
