# Evaluation with MT Bench

This notebook shows how to run end-to-end evaluations of your trained model with [MT Bench](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge). Evaluating with MT Bench is a two-step process. In the first step, we run inference with your model to generate answers to the 80 multi-turn MT Bench questions. In the second step, we generate judgments (GPT-4 is the default judge) comparing your model's answers with reference answers. Each answer is scored on a scale of 1 to 10, considering factors such as helpfulness, relevance, accuracy, depth, creativity, and level of detail.

## Prerequisites and Configuration

Start by cloning the [FastChat](https://github.com/lm-sys/FastChat) repo, which includes the MT Bench framework.


In [1]:
FAST_CHAT_REPO = "/home/gcpuser/Eval/FastChat"  # Folder to clone to.
! git clone https://github.com/lm-sys/FastChat.git $FAST_CHAT_REPO

Cloning into '/home/gcpuser/Eval/FastChat'...
remote: Enumerating objects: 8425, done.
remote: Counting objects: 100% (150/150), done.
remote: Compressing objects: 100% (66/66), done.
remote: Total 8425 (delta 135), reused 84 (delta 84), pack-reused 8275 (from 4)
Receiving objects: 100% (8425/8425), 34.52 MiB | 36.36 MiB/s, done.
Resolving deltas: 100% (6398/6398), done.


Then, navigate to that folder and install FastChat in editable mode with the `model_worker` and `llm_judge` extras.

In [2]:
import os

os.chdir(FAST_CHAT_REPO)
! pip install -e ".[model_worker,llm_judge]"

Obtaining file:///home/gcpuser/Eval/FastChat
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting markdown2[all]
  Downloading markdown2-2.5.2-py3-none-any.whl.metadata (2.1 kB)
Collecting nh3
  Downloading nh3-0.2.19-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.7 kB)
Collecting pydantic-settings
  Downloading pydantic_settings-2.6.1-py3-none-any.whl.metadata (3.5 kB)
Collecting shortuuid
  Downloading shortuuid-1.0.13-py3-none-any.whl.metadata (5.8 kB)
Collecting openai<1
  Downloading openai-0.28.1-py3-none-any.whl.metadata (11 kB)
Collecting anthropic>=0.3
  Downloading anthropic-0.40.0-py3-none-any.whl.metadata (23 kB)
Collecting wavedrom (from markdown2[all])
  Downloading wavedrom-2.0.3.post3.tar.gz (137 kB)

A judge is needed to score your model's responses against the reference responses. By default, the judge is GPT-4. To access GPT-4 models, an OpenAI API key is required. Details on creating an OpenAI account and generating a key can be found on [OpenAI's quickstart page](https://platform.openai.com/docs/quickstart).

<b>⚠️ Cost considerations</b>: To estimate the cost of judging 160 examples (80 x 2-turn conversations) with GPT-4, please visit [OpenAI's pricing](https://openai.com/api/pricing/) page.

In [3]:
os.environ["OPENAI_API_KEY"] = ""  # NOTE: Set your OpenAI API key here
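If you would rather not store the key in the notebook itself, one option is to prompt for it at run time. The following is a minimal sketch using the standard library's `getpass`; the helper name `ensure_openai_key` is hypothetical and not part of FastChat.

```python
import os
from getpass import getpass


def ensure_openai_key(env=os.environ, prompt_fn=getpass):
    """Prompt for the key only when OPENAI_API_KEY is unset or empty.

    `ensure_openai_key` is an illustrative helper, not a FastChat function.
    """
    if not env.get("OPENAI_API_KEY"):
        env["OPENAI_API_KEY"] = prompt_fn("OpenAI API key: ")


# In the notebook, call this once before running the judge:
# ensure_openai_key()
```

Because the key is entered interactively, it does not end up in the saved notebook or in its output cells.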

Finally, point to your model (`MODEL_PATH`). MT Bench supports Hugging Face repo IDs and paths to local folders that contain your model.
Also, provide a human-friendly custom `MODEL_ID` for your model; it is used to uniquely reference your model when generating judgments or inspecting scores.

In [4]:
MODEL_PATH = "meta-llama/Llama-3.2-1B-Instruct"
MODEL_ID = "my_model"

## Step 1: Run inference

Navigate to the LLM judge folder and run inference, passing in your model path and model id as shown below.

Additional arguments to consider (more details [here](https://github.com/lm-sys/FastChat/blob/1cd4b74fa00d1a60852ea9c88e4cc4fc070e4512/fastchat/llm_judge/gen_model_answer.py#L209C1-L271C6)):
- You can change the location of the output file by setting `--answer-file=<file path>`.
- You can cap the number of tokens your model generates by setting `--max-new-token=<number of tokens>`.
- You can specify the model revision to be loaded by `--revision=<model revision>`.
- You can set the number of GPUs to be used when running inference with your model with `--num-gpus-per-model=<num GPUs>` (if not set, the default is 1).
- You can restrict the GPU memory used when running inference by `--max-gpu-memory=<max memory>`.
- You can overwrite the default `dtype` with `--dtype=<dtype>` (if not set, the default is to use float16 on GPU, float32 on CPU).
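The optional arguments above can be combined in a single call. The sketch below assembles the command as an argument list; the helper `build_gen_answer_cmd` is hypothetical, and the values `512`, `1`, and `bfloat16` are illustrative only, not recommendations.

```python
def build_gen_answer_cmd(model_path, model_id, **options):
    """Return the gen_model_answer.py command as an argument list.

    Keyword options map to the flags listed above, for example
    max_new_token=512 becomes "--max-new-token 512". Options set to
    None are skipped, so the script's defaults apply.
    """
    cmd = ["python", "gen_model_answer.py",
           "--model-path", model_path, "--model-id", model_id]
    for name, value in options.items():
        if value is not None:
            cmd += ["--" + name.replace("_", "-"), str(value)]
    return cmd


# Illustrative values only.
cmd = build_gen_answer_cmd(
    "meta-llama/Llama-3.2-1B-Instruct", "my_model",
    max_new_token=512, num_gpus_per_model=1, dtype="bfloat16",
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the LLM judge folder, or the printed string can be pasted after `!` in a notebook cell.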

In [5]:
LLM_JUDGE_FOLDER = f"{FAST_CHAT_REPO}/fastchat/llm_judge/"
os.chdir(LLM_JUDGE_FOLDER)
! python gen_model_answer.py --model-path $MODEL_PATH --model-id $MODEL_ID

Output to data/mt_bench/model_answer/my_model.jsonl
tokenizer_config.json: 100%|███████████████| 54.5k/54.5k [00:00<00:00, 6.50MB/s]
tokenizer.json: 100%|██████████████████████| 9.09M/9.09M [00:00<00:00, 17.8MB/s]
special_tokens_map.json: 100%|█████████████████| 296/296 [00:00<00:00, 2.77MB/s]
config.json: 100%|█████████████████████████████| 877/877 [00:00<00:00, 6.89MB/s]
model.safetensors: 100%|███████████████████| 2.47G/2.47G [00:59<00:00, 41.6MB/s]
generation_config.json: 100%|██████████████████| 189/189 [00:00<00:00, 1.30MB/s]
  0%|                                                    | 0/80 [00:00<?, ?it/s]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may ob

Inspect the inference results folder to make sure your inference results are there. The default output filename is `<MODEL_ID>.jsonl`.

In [6]:
INFERENCE_RESULTS_FOLDER = os.path.join(LLM_JUDGE_FOLDER, "data/mt_bench/model_answer")
inference_result_files = os.listdir(INFERENCE_RESULTS_FOLDER)
print(f"files: {inference_result_files}")

files: ['my_model.jsonl']
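Beyond checking that the file exists, it can help to confirm that every question received a non-empty answer for each turn. The sketch below assumes each line of the answer file is a JSON object with a `question_id` and a `choices` list whose first entry holds a `turns` list of strings; verify this against your own file, since the layout is an assumption here. The helper names are hypothetical.

```python
import json


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def summarize_answers(records):
    """Count records and list question ids with a missing or empty turn.

    Assumes the (unverified) layout: record["choices"][0]["turns"].
    """
    incomplete = []
    for record in records:
        choices = record.get("choices") or [{}]
        turns = choices[0].get("turns") or []
        if not turns or any(not turn.strip() for turn in turns):
            incomplete.append(record.get("question_id"))
    return {"num_answers": len(records), "incomplete": incomplete}


# Example, using the folder and model id defined earlier in this notebook:
# records = load_jsonl(f"{INFERENCE_RESULTS_FOLDER}/{MODEL_ID}.jsonl")
# print(summarize_answers(records))
```

For a complete run you would expect 80 records and an empty `incomplete` list.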


## Step 2: Judge the model answers

In this notebook, we demonstrate the recommended "single-answer" grading mode, where the judge assigns each turn a score from 1 to 10. There are two additional grading modes, in which the judged model is compared pairwise to a single baseline model (`pairwise-baseline`) or to multiple baseline models (`pairwise-all`), and win rates are generated. For more details, see FastChat's [other grading options](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge#other-grading-options) section.

The command that invokes the GPT-4 judge to score each answer (single-answer grading) is shown below. The `echo -ne '\n'` prefix is required because we invoke the shell from a notebook, and the script (`gen_judgment.py`) waits for the user to confirm by pressing "Enter". Piping the `\n` character into the script emulates pressing "Enter" right after executing `gen_judgment.py`.

Additional arguments to consider (more details [here](https://github.com/lm-sys/FastChat/blob/1cd4b74fa00d1a60852ea9c88e4cc4fc070e4512/fastchat/llm_judge/gen_judgment.py#L170)):
- You can change the location of the judgment file by setting `--judge-file=<file path>`.
- If you want multiple concurrent API calls to the judge, set `--parallel=<number of concurrent API calls>` (default is 1).
- You can use a different judge model by setting `--judge-model=<judge model name>` (default is `gpt-4`). This option is undocumented, and it might not be very informative if you are interested in generating comparative results, since those rely on the default judge.
- If you want to change the model that generated the reference answers, set `--baseline-model=<baseline model name>` (default is `gpt-3.5-turbo`). This option is also undocumented, since the reference answers are used for comparative analysis.
- If you want to test judgment on a subset of the answers, set `--first-n=<number of answers to judge>`. This flag is mainly used for debugging; you can use it to reduce your judgment costs when testing the MT Bench framework.
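As an alternative to the `echo` pipe, the same confirmation can be sent from Python by writing a newline to the script's standard input. The following is a sketch only; `run_with_enter` and `judgment_cmd` are hypothetical helpers, and the flags they emit are the ones listed above.

```python
import subprocess
import sys


def run_with_enter(cmd, **run_kwargs):
    """Run cmd, sending one newline on stdin to emulate pressing Enter."""
    return subprocess.run(cmd, input="\n", text=True, check=True, **run_kwargs)


def judgment_cmd(model_id, first_n=None, parallel=None):
    """Build the gen_judgment.py command; None leaves a flag at its default."""
    cmd = [sys.executable, "gen_judgment.py", "--model-list", model_id]
    if first_n is not None:
        cmd += ["--first-n", str(first_n)]
    if parallel is not None:
        cmd += ["--parallel", str(parallel)]
    return cmd


# Cheap trial on a small subset before paying for the full run:
# run_with_enter(judgment_cmd("my_model", first_n=2), cwd=LLM_JUDGE_FOLDER)
```

Running a small `--first-n` trial first is a low-cost way to confirm that the API key and file paths are set up correctly.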

In [7]:
os.chdir(LLM_JUDGE_FOLDER)
! echo -ne '\n' | python gen_judgment.py --model-list $MODEL_ID

Stats:
{
    "bench_name": "mt_bench",
    "mode": "single",
    "judge": "gpt-4",
    "baseline": null,
    "model_list": [
        "my_model"
    ],
    "total_num_questions": 80,
    "total_num_matches": 160,
    "output_path": "data/mt_bench/model_judgment/gpt-4_single.jsonl"
}
  0%|                                                   | 0/160 [00:00<?, ?it/s]question: 81, turn: 1, model: my_model, score: 10, judge: ('gpt-4', 'single-v1')
  1%|▎                                          | 1/160 [00:04<12:46,  4.82s/it]question: 82, turn: 1, model: my_model, score: 10, judge: ('gpt-4', 'single-v1')
  1%|▌                                          | 2/160 [00:12<17:48,  6.77s/it]question: 83, turn: 1, model: my_model, score: 9, judge: ('gpt-4', 'single-v1')
  2%|▊                                          | 3/160 [00:19<17:41,  6.76s/it]question: 84, turn: 1, model: my_model, score: 10, judge: ('gpt-4', 'single-v1')
  2%|█                                          | 4/160 [00:25<16:40,  6.4

Inspect the judgment folder to make sure your results are there. The default output filename is `gpt-4_single.jsonl`.

In [8]:
out_files = os.listdir(os.path.join(LLM_JUDGE_FOLDER, "data/mt_bench/model_judgment"))
print(f"files: {out_files}")

files: ['gpt-4_single.jsonl']
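The judgment file can also be inspected directly. The sketch below averages the scores per turn for one model, assuming each line is a JSON object with `model`, `turn`, and `score` fields, and that a score of `-1` marks a judgment whose score could not be parsed. Both are assumptions to check against your own file, and the helper names are hypothetical.

```python
import json
from collections import defaultdict


def read_judgments(path):
    """Parse a JSON Lines judgment file into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def mean_score_by_turn(judgments, model_id):
    """Average the judge scores per turn for one model.

    Scores of -1 are skipped (assumed to mark unparsed judgments).
    """
    scores = defaultdict(list)
    for judgment in judgments:
        if judgment.get("model") == model_id and judgment.get("score", -1) != -1:
            scores[judgment["turn"]].append(judgment["score"])
    return {turn: sum(vals) / len(vals) for turn, vals in sorted(scores.items())}


# Example, using the names defined earlier in this notebook:
# judgments = read_judgments(
#     f"{LLM_JUDGE_FOLDER}/data/mt_bench/model_judgment/gpt-4_single.jsonl")
# print(mean_score_by_turn(judgments, MODEL_ID))
```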


Retrieve your aggregate judgment score (with per-turn breakdown), as shown below.

In [9]:
! python show_result.py --model-list $MODEL_ID

Mode: single
Input file: data/mt_bench/model_judgment/gpt-4_single.jsonl

########## First turn ##########
                 score
model    turn         
my_model 1     5.84375

########## Second turn ##########
                score
model    turn        
my_model 2     4.9375

########## Average ##########
             score
model             
my_model  5.390625
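As a quick sanity check on the numbers above: if the overall score is the mean over all scored answers (an assumption about how `show_result.py` aggregates), and both turns have the same number of scored answers, then the overall score is the midpoint of the two per-turn means.

```python
first_turn = 5.84375   # First-turn mean from the output above.
second_turn = 4.9375   # Second-turn mean from the output above.

# With equally many scored answers per turn, the mean over all scores
# equals the midpoint of the two per-turn means.
overall = (first_turn + second_turn) / 2
print(overall)  # 5.390625
```

This matches the reported average of 5.390625.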


## [Optional] Retain your configuration for reproducibility

To be able to reproduce your evaluation run in the future, save the configuration of your evaluation together with your evaluation metrics.

In [10]:
import datetime
import json

import git  # Provided by the GitPython package.

repo = git.Repo(FAST_CHAT_REPO)

evaluation_config_dict = {
    "fast_chat_repo": {
        # Last tag in the repo's tag list; not necessarily the checked-out commit.
        "repo_tag": str(repo.tags[-1]),
        "commit_hash": repo.head.commit.hexsha,
    },
    "configs": {
        "model_path": MODEL_PATH,
        "model_id": MODEL_ID,
    },
    "timestamp": str(datetime.datetime.now()),
    "eval_metrics": "<add relevant metrics here>",
}

# Write the configuration next to the evaluation outputs.
with open("./evaluation_config.json", "w") as output_file:
    json.dump(evaluation_config_dict, output_file, indent=2)