# Custom model-graded evals

**Note: This lesson lives in a folder that contains the accompanying code files. Download the entire folder if you want to follow along and run the evaluation yourself.**

In this lesson, we'll learn how to write **custom model-graded evaluations** with promptfoo. We'll start with a simple goal: write a prompt that turns long, technically complex Wikipedia articles into short summaries that a grade-school audience can understand.

For example, given the full text of the [Wikipedia article on convolutional neural networks](https://en.wikipedia.org/wiki/Convolutional_neural_network) as input, we want a plain-language summary along these lines:

> Convolutional neural networks, or CNNs, are a special type of computer program that can learn to recognize images and patterns. They work a bit like the human brain, using layers of artificial "neurons" to process information.
>
> CNNs are really good at tasks like identifying objects in pictures or recognizing faces. They do this by breaking down images into smaller pieces and looking for important features, kind of like putting together a puzzle.
>
> What makes CNNs special is that they can learn these features on their own by looking at lots of examples. This allows them to get better and better at recognizing things, sometimes even matching human-level performance.
>
> Scientists and engineers use CNNs for all sorts of cool applications, like helping self-driving cars see the road, finding new medicines, or even teaching computers to play games like chess and Go.

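To evaluate how well our prompts work, we'll write a custom model-graded assertion that scores each summary on three metrics:

* Conciseness (1-5): is the summary as concise as possible?
* Accuracy (1-5): is the summary completely accurate with respect to the original article?
* Tone (1-5): is the tone appropriate for a grade-school student with no technical background?

Each metric is scored from 1 to 5. We average the three scores and aim for an **average of at least 4.5 out of 5**. To do this, we'll define our own model-grader function!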
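
As a minimal sketch of the pass criterion described above, the three scores can be averaged and compared against the 4.5 target. The function names `average_score` and `passes` are hypothetical and are not part of promptfoo or of this lesson's code files:

```python
def average_score(conciseness: float, accuracy: float, tone: float) -> float:
    """Average the three rubric scores (each expected to be between 1 and 5)."""
    return (conciseness + accuracy + tone) / 3


def passes(conciseness: float, accuracy: float, tone: float, threshold: float = 4.5) -> bool:
    """Return True when the average score meets the target threshold."""
    return average_score(conciseness, accuracy, tone) >= threshold


# Two 5s and a 4 average to about 4.67, which clears the 4.5 target.
print(passes(5, 5, 4))
# A single 3 pulls the average down to about 4.33, which does not.
print(passes(5, 5, 3))
```

If the grader only returns whole numbers, this threshold is strict: a summary needs at least two 5s and nothing below a 4 to pass.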

---


## The input data

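Our goal is to write a prompt that summarizes complex Wikipedia articles into short, easy-to-understand text. First, we need a set of articles to summarize as part of the evaluation.

This folder contains an `articles` directory with eight txt files. Each file holds the body text of one Wikipedia article, and we'll use these as the inputs for our evaluation. Open a few of them to see how long and complex they are.

With only eight test cases, this dataset is too small for a real-world evaluation. As we've said repeatedly throughout this course, we recommend an evaluation dataset of at least 100 entries.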
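
To get a quick feel for how long the articles are without opening each one, a small script along these lines could print their sizes. This is only a sketch: the helper name `article_stats` is hypothetical, and it assumes the files are UTF-8 text in the `articles` directory next to the script:

```python
from pathlib import Path


def article_stats(directory: str = "articles") -> list[tuple[str, int, int]]:
    """Return (filename, character count, word count) for each .txt file."""
    stats = []
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        # Splitting on whitespace gives a rough word count, not a token count.
        stats.append((path.name, len(text), len(text.split())))
    return stats


if __name__ == "__main__":
    for name, chars, words in article_stats():
        print(f"{name}: {chars} characters, {words} words")
```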

---


## Our prompts

Take a look at `prompts.py`. It contains three prompt-generating functions that we'll evaluate with promptfoo:

```py
def basic_summarize(article):
  return f"Summarize this article {article}"

def better_summarize(article):
  return f"""
  Summarize this article for a grade-school audience: {article}"""

def best_summarize(article):
  return f"""
  You are tasked with summarizing long wikipedia articles for a grade-school audience.
  Write a short summary, keeping it as concise as possible. 
  The summary is intended for a non-technical, grade-school audience. 
  This is the article: {article}"""
```

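**Important: these prompts are all mediocre overall.** To reduce the number of tokens used when running this evaluation set, we deliberately kept them short and left out best practices such as including plenty of examples.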

---


## Updating the config file

`promptfooconfig.yaml` contains fields we've seen before:

```yaml
description: 'Summarization Evaluation'

prompts:
  - prompts.py:basic_summarize
  - prompts.py:better_summarize
  - prompts.py:best_summarize

providers:
  - id: anthropic:messages:claude-3-5-sonnet-20240620
    label: "3.5 Sonnet"

tests:
  - vars:
      article: file://articles/article1.txt
  - vars:
      article: file://articles/article2.txt
  - vars:
      article: file://articles/article3.txt
  - vars:
      article: file://articles/article4.txt
  - vars:
      article: file://articles/article5.txt
  - vars:
      article: file://articles/article6.txt
  - vars:
      article: file://articles/article7.txt
  - vars:
      article: file://articles/article8.txt

defaultTest:
  assert:
    - type: python
      value: file://custom_llm_eval.py

```

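Here we tell promptfoo to evaluate the three prompts in `prompts.py` and to use Claude 3.5 Sonnet as the provider.

In `tests`, each test supplies a different value for the `article` variable. What's new is that the values are loaded from text files. The articles are too long to embed directly in the YAML, so we reference files instead. For example, this configuration: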

```yaml
tests:
  - vars:
      article: file://articles/article1.txt
```

tells promptfoo to run a test in which the `article` variable holds the full text of `article1.txt`. We repeat this pattern for all eight files.
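
Writing these entries by hand works for eight files but gets tedious for a larger dataset. As a rough sketch, the `tests` section could be generated rather than typed. The helper `build_tests_yaml` is hypothetical, and it assumes the files are named `article1.txt`, `article2.txt`, and so on, as they are in this folder:

```python
def build_tests_yaml(num_articles: int, directory: str = "articles") -> str:
    """Build the `tests:` section of promptfooconfig.yaml as YAML text."""
    lines = ["tests:"]
    for i in range(1, num_articles + 1):
        lines.append("  - vars:")
        lines.append(f"      article: file://{directory}/article{i}.txt")
    return "\n".join(lines)


# Print the section for the eight articles used in this lesson.
print(build_tests_yaml(8))
```

The output can be pasted into `promptfooconfig.yaml` in place of the hand-written `tests` list.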

---


## Writing the custom model-grader function

Next, let's turn our attention to the last field in the YAML file:

```yaml
defaultTest:
  assert:
    - type: python
      value: file://custom_llm_eval.py
```

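This tells promptfoo to run the Python assertion defined in `custom_llm_eval.py` for every test. It's the same syntax we used when defining a custom code-graded assertion, but this time we'll write a function that uses another model to grade the model's output.

Let's look at the contents of `custom_llm_eval.py`. It contains quite a lot of code: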

```py
import anthropic
import os
import json

def llm_eval(summary, article):
    client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

    prompt = f"""Evaluate the following summary based on these criteria:
    1. Conciseness (1-5) - is the summary as concise as possible?
        - Conciseness of 1: The summary is unnecessarily long, including excessive details, repetitions, or irrelevant information. It fails to distill the key points effectively.
        - Conciseness of 3: The summary captures most key points but could be more focused. It may include some unnecessary details or slightly overexplain certain concepts.
        - Conciseness of 5: The summary effectively condenses the main ideas into a brief, focused text. It includes all essential information without any superfluous details or explanations.
    2. Accuracy (1-5) - is the summary completely accurate based on the initial article?
        - Accuracy of 1: The summary contains significant errors, misrepresentations, or omissions that fundamentally alter the meaning or key points of the original article.
        - Accuracy of 3:  The summary captures some key points correctly but may have minor inaccuracies or omissions. The overall message is generally correct, but some details may be wrong.
        - Accuracy of 5: The summary faithfully represents the main gist of the original article without any errors or misinterpretations. All included information is correct and aligns with the source material.
    3. Tone (1-5) - is the summary appropriate for a grade school student with no technical training?
        - Tone of 1: The summary uses language or concepts that are too complex, technical, or mature for a grade school audience. It may contain jargon, advanced terminology, or themes that are not suitable for young readers.
        - Tone of 3: The summary mostly uses language suitable for grade school students but occasionally includes terms or concepts that may be challenging. Some explanations might be needed for full comprehension.
        - Tone of 5: The summary consistently uses simple, clear language that is easily understandable by grade school students. It explains complex ideas in a way that is accessible and engaging for young readers.
    4. Explanation - a general description of the way the summary is evaluated

    <examples>
    <example>
    This summary:
    <summary>
    Artificial neural networks are computer systems inspired by how the human brain works. They are made up of interconnected "neurons" that process information. These networks can learn to do tasks by looking at lots of examples, similar to how humans learn. 

    Some key things about neural networks:
    - They can recognize patterns and make predictions
    - They improve with more data and practice
    - They're used for things like identifying objects in images, translating languages, and playing games

    Neural networks are a powerful tool in artificial intelligence and are behind many of the "smart" technologies we use today. While they can do amazing things, they still aren't as complex or capable as the human brain.
    </summary>
    Should receive a 5 for tone, a 5 for accuracy, and a 5 for conciseness
    </example>

    <example>
    This summary:
    <summary>
    Here is a summary of the key points from the article on artificial neural networks (ANNs):

    1. ANNs are computational models inspired by biological neural networks in animal brains. They consist of interconnected artificial neurons that process and transmit signals.

    2. Basic structure:
    - Input layer receives data
    - Hidden 
```

(Excerpt: the file continues beyond this point.)


### `get_assert()`

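promptfoo automatically looks for a function named `get_assert` in the assertion file and passes it two arguments:

- `output`: the model's response
- `context`: a dictionary that includes the variables and prompt that generated `output`

promptfoo expects the function to return one of the following:

- a bool (pass/fail)
- a float (score)
- a GradingResult dictionary

Here we return a GradingResult dictionary, which must include the following properties:

- `pass`: boolean
- `score`: float
- `reason`: a string explanation

Here's a commented version of the function that explains what's happening: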

```py
def get_assert(output: str, context, threshold=4.5):
    # Get the specific article from the context
    article = context['vars']['article']
    # Pass the model output and the article to our llm_eval function,
    # capturing the score it returns and the evaluation explanation
    score, evaluation = llm_eval(output, article)
    # Return a dictionary indicating whether the output passed the test,
    # its score, and the explanation behind the score
    return {
        "pass": score >= threshold,
        "score": score,
        "reason": evaluation
    }
```
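
To see the shape of the returned dictionary without calling the API, the grader can be replaced by a stub. The sketch below is illustrative only: `fake_llm_eval` and the extra `grader` parameter are hypothetical additions, the scores are made up, and the context dictionary is built by hand rather than by promptfoo:

```python
import json


def fake_llm_eval(summary, article):
    """Stand-in for llm_eval: returns a fixed, made-up evaluation with no API call."""
    evaluation = {"conciseness": 5, "accuracy": 5, "tone": 3, "explanation": "Too technical."}
    scores = [evaluation["conciseness"], evaluation["accuracy"], evaluation["tone"]]
    return sum(scores) / len(scores), json.dumps(evaluation)


def get_assert(output, context, threshold=4.5, grader=fake_llm_eval):
    # `grader` is a hypothetical parameter added so the stub can be swapped in.
    article = context["vars"]["article"]
    score, evaluation = grader(output, article)
    return {"pass": score >= threshold, "score": score, "reason": evaluation}


# A hand-built context with the same keys that get_assert reads.
result = get_assert("A summary.", {"vars": {"article": "Full article text."}})
print(result["pass"], round(result["score"], 2))  # prints: False 4.33
```

Because the stubbed tone score is 3, the average falls below 4.5 and the assertion reports a failure, with the grader's JSON text carried along as the `reason`.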

### `llm_eval()`

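Next, let's take a closer look at the `llm_eval` function that does the actual grading. This function does the following:

1. Defines the grading rubric for summaries (a very long prompt)
2. Sends that grading prompt to the Anthropic API to run the grader model
3. Parses the response and calculates the average score
4. Returns the average score along with the full text of the model's response

Here's the code in full: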

```py
def llm_eval(summary, article):
    """
    Evaluate summary using an LLM (Claude).
    
    Args:
    summary (str): The summary to evaluate.
    article (str): The original text that was summarized.
    
    Returns:
    tuple: (avg_score, raw_evaluation), where avg_score is the average of the
    numeric scores and raw_evaluation is the grader's JSON response text.
    """
    client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

    prompt = f"""Evaluate the following summary based on these criteria:
    1. Conciseness (1-5) - is the summary as concise as possible?
        - Conciseness of 1: The summary is unnecessarily long, including excessive details, repetitions, or irrelevant information. It fails to distill the key points effectively.
        - Conciseness of 3:  The summary captures most key points but could be more focused. It may include some unnecessary details or slightly overexplain certain concepts.
        - Conciseness of 5: The summary effectively condenses the main ideas into a brief, focused text. It includes all essential information without any superfluous details or explanations.
    2. Accuracy (1-5) - is the summary completely accurate based on the initial article?
        - Accuracy of 1: The summary contains significant errors, misrepresentations, or omissions that fundamentally alter the meaning or key points of the original article.
        - Accuracy of 3:  The summary captures some key points correctly but may have minor inaccuracies or omissions. The overall message is generally correct, but some details may be wrong.
        - Accuracy of 5: The summary faithfully represents the main gist of the original article without any errors or misinterpretations. All included information is correct and aligns with the source material.
    3. Tone (1-5) - is the summary appropriate for a grade school student with no technical training?
        - Tone of 1: The summary uses language or concepts that are too complex, technical, or mature for a grade school audience. It may contain jargon, advanced terminology, or themes that are not suitable for young readers.
        - Tone of 3: The summary mostly uses language suitable for grade school students but occasionally includes terms or concepts that may be challenging. Some explanations might be needed for full comprehension.
        - Tone of 5: The summary consistently uses simple, clear language that is easily understandable by grade school students. It explains complex ideas in a way that is accessible and engaging for young readers.
    4. Explanation - a general description of the way the summary is evaluated

    <examples>
    <example>
    This summary:
    <summary>
    Artificial neural networks are computer systems inspired by how the human brain works. They are made up of interconnected "neurons" that process information. These networks can learn to do tasks by looking at lots of examples, similar to how humans learn. 

    Some key things about neural networks:
    - They can recognize patterns and make predictions
    - They improve with more data and practice
    - They're used for things like identifying objects in images, translating languages, and playing games

    Neural networks are a powerful tool in artificial intelligence and are behind many of the "smart" technologies we use today. While they can do amazing things, they still aren't as complex or capable as the human brain.
    </summary>
    Should receive a 5 for tone, a 5 for accuracy, and a 5 for conciseness
    </example>

    <example>
    This summary:
    <summary>
    Here is a summary of the key points from the article on artificial neural networks (ANNs):

    1. ANNs are computational models inspired by biological neural networks in animal brains. They consist of interconnected artificial neurons that process and transmit signals.

    2. Basic structure:
    - Input layer receives data
    - Hidden layers process information 
    - Output layer produces results
    - Neurons are connected by weighted edges

    3. Learning process:
    - ANNs learn by adjusting connection weights
    - Use techniques like backpropagation to minimize errors
    - Can perform supervised, unsupervised, and reinforcement learning

    4. Key developments:
    - Convolutional neural networks (CNNs) for image processing
    - Recurrent neural networks (RNNs) for sequential data
    - Deep learning with many hidden layers

    5. Applications:
    - Pattern recognition, classification, regression
    - Computer vision, speech recognition, natural language processing
    - Game playing, robotics, financial modeling

    6. Advantages:
    - Can model complex non-linear relationships
    - Ability to learn and generalize from data
    - Adaptable to many different types of problems

    7. Challenges:
    - Require large amounts of training data
    - Can be computationally intensive
    - "Black box" nature can make interpretability difficult

    8. Recent advances:
    - Improved hardware (GPUs) enabling deeper networks
    - New architectures like transformers for language tasks
    - Progress in areas like generative AI

    The article provides a comprehensive overview of ANN concepts, history, types, applications, and ongoing research areas in this field of artificial intelligence and machine learning.
    </summary>
    Should receive a 1 for tone, a 5 for accuracy, and a 3 for conciseness
    </example>
    </examples>

    Provide a score for each criterion in JSON format. Here is the format you should follow always:

    <json>
    {{
    "conciseness": <number>,
    "accuracy": <number>,
    "tone": <number>,
    "explanation": <string>
    }}
    </json>


    Original Text: <original_article>{article}</original_article>
    
    Summary to Evaluate: <summary>{summary}</summary>
    """
    
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1000,
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": prompt
            },
            {
                "role": "assistant",
                "content": "<json>" 
            }
        ],
        stop_sequences=["</json>"]
    )
    
    evaluation = json.loads(response.content[0].text)
    # Filter out non-numeric values and calculate the average
    numeric_values = [value for key, value in evaluation.items() if isinstance(value, (int, float))]
    avg_score = sum(numeric_values) / len(numeric_values)
    # Return the average score and the overall model response
    return avg_score, response.content[0].text
```
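
Because the assistant turn is prefilled with `<json>` and generation stops at `</json>`, the text that comes back should be just the JSON body, which is why it can be handed straight to `json.loads`. The parsing and averaging step can be tried offline. The sketch below uses a hypothetical helper name and a hand-written sample response (not real model output), and adds two guards that the lesson's code does not have: it skips booleans and raises an error when no scores are found:

```python
import json


def average_numeric_scores(raw_json: str) -> float:
    """Parse the grader's JSON text and average its numeric fields."""
    evaluation = json.loads(raw_json)
    numeric_values = [
        value
        for value in evaluation.values()
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    ]
    if not numeric_values:
        raise ValueError("grader response contained no numeric scores")
    return sum(numeric_values) / len(numeric_values)


# Hand-written text in the shape the grader is asked to produce.
sample = '{"conciseness": 4, "accuracy": 5, "tone": 3, "explanation": "Clear but wordy."}'
print(average_numeric_scores(sample))  # prints: 4.0
```

The string-valued `explanation` field is ignored by the filter, so only the three scores contribute to the average.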

---


## Running the eval

To run the evaluation, we use the same command as before:

```bash
npx promptfoo@latest eval
```

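This can take a little while: in addition to the model calls that generate the article summaries, each summary also requires an additional model call to grade it.

Here's a screenshot of the evaluation results: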
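
As a back-of-the-envelope sketch, the number of model calls for this configuration can be counted as follows. The helper name is hypothetical, and the count assumes one grading call per generated summary and that nothing is served from a cache:

```python
def count_model_calls(num_prompts: int, num_tests: int, num_providers: int = 1) -> dict[str, int]:
    """Count generation and grading calls for one eval run."""
    generation_calls = num_prompts * num_tests * num_providers
    # One grading call per generated summary (one model-graded assertion each).
    grading_calls = generation_calls
    return {
        "generation": generation_calls,
        "grading": grading_calls,
        "total": generation_calls + grading_calls,
    }


# Three prompts, eight articles, one provider, as in promptfooconfig.yaml.
print(count_model_calls(num_prompts=3, num_tests=8))
```

With three prompts and eight articles, that comes to 24 summaries and 24 grading calls, and every call includes the full article text in its input.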

![eval_result.png](../images/eval_result.png)

To understand the results better, launch the web view:

```bash
npx promptfoo@latest view
```

Here's a screenshot of the web dashboard:

![web_view.png](../images/web_view.png)


Click the magnifying glass in any cell to see the details of that test result:

![explanation.png](../images/explanation.png)

In this example, we can see that the output fails our custom llm-eval function because its tone score is too low.


The top row of the results also shows a score summary for each prompt:

![overall_scores.png](../images/overall_scores.png)

As expected, the `best_summarize` prompt performs best.


The top of the dashboard also includes a chart that visualizes the scores:

![distribution.png](../images/distribution.png)

In the chart above:

* Red: `basic_summarize`
* Blue: `better_summarize`
* Green: `best_summarize`

This chart shows that `best_summarize` not only never fails a test, but also scores higher than the other prompts on every input.
