# Calculate Agreement between Humans and GPT-4 as LLM judge with MT-bench Human Judgement Dataset

Source:

https://colab.research.google.com/drive/1ctgygDRJhVGUJTQy8-bRZCl1WNcT8De6?usp=sharing

https://huggingface.co/datasets/lmsys/mt_bench_human_judgments

LLM Judge:

https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge

## MT-Bench-101

MT-Bench-101 is designed to evaluate the fine-grained abilities of LLMs in multi-turn dialogues.

Based on a detailed analysis of real multi-turn dialogue data, its authors construct a three-tier hierarchical ability taxonomy. The benchmark comprises 4,208 turns across 1,388 multi-turn dialogues in 13 distinct tasks.

Dataset:

https://github.com/mtbench101/mt-bench-101

Paper:

MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues

https://arxiv.org/abs/2402.14762

## Introduction

In this notebook, we compute the agreement between human judges and the GPT-4 judge on the MT-bench human judgment dataset (https://huggingface.co/datasets/lmsys/mt_bench_human_judgments). When tie votes are excluded, humans and GPT-4 agree on about 85% of votes, which is at least as high as the agreement among humans themselves (81-82%). When ties are counted, both figures fall to roughly 63-67%.

In [1]:
# !pip install datasets

In [2]:
import argparse
import json
import os

import numpy as np
from datasets import load_dataset



### Download data


In [3]:
dataset = load_dataset("lmsys/mt_bench_human_judgments")
dataset["human"].to_json("human_judgments.json")
dataset["gpt4_pair"].to_json("gpt4_pair_judgments.json")

Generating gpt4_pair split: 100%|██████████| 2400/2400 [00:00<00:00, 119307.48 examples/s]
Generating human split: 100%|██████████| 3355/3355 [00:00<00:00, 174191.55 examples/s]
Creating json from Arrow format: 100%|██████████| 4/4 [00:00<00:00, 22.36ba/s]
Creating json from Arrow format: 100%|██████████| 3/3 [00:00<00:00, 26.28ba/s]


11356420
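
The agreement code further down reads these files one line at a time, so it relies on each line holding one JSON record. The sketch below mimics that layout on a throwaway file rather than the real download. The field names are the ones the notebook accesses later (`question_id`, `model_a`, `model_b`, `winner`, `judge`, `turn`); the values and the file name `toy_judgments.json` are invented for illustration.

```python
import json
import os
import tempfile

# Hypothetical record: field names follow what the notebook reads later,
# but every value here is made up.
record = {
    "question_id": 1,
    "model_a": "alpha",
    "model_b": "beta",
    "winner": "model_a",
    "judge": "expert_0",
    "turn": 1,
}

# Write two records, one per line, to a temporary JSON Lines file.
path = os.path.join(tempfile.mkdtemp(), "toy_judgments.json")
with open(path, "w") as f:
    for turn in (1, 2):
        f.write(json.dumps({**record, "turn": turn}) + "\n")

# Read the file back the same way the agreement code does: line by line.
with open(path) as f:
    rows = [json.loads(line) for line in f]

print(len(rows), sorted(rows[0]))
```

Swapping `path` for `human_judgments.json` should let you inspect the real records in the same way, assuming the download cell above has been run.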

### Agreement Computation Code

In [4]:
def get_judge_name(judge):
    if isinstance(judge, list) and judge[0] == "gpt-4" and judge[1].startswith("pair"):
        return "gpt4-pair"
    if judge.startswith("expert"):
        return "human"
    if judge.startswith("author"):
        return "author"
    return judge


def revert(vote):
    if vote == "model_a":
        return "model_b"
    elif vote == "model_b":
        return "model_a"
    return vote


def get_mt_bench_votes_data(raw_votes):
    data = [{}, {}]

    for judge_votes in raw_votes:
        for vote in judge_votes:
            turn = vote["turn"] - 1
            if vote["model_a"] < vote["model_b"]:
                key = (vote["question_id"], vote["model_a"], vote["model_b"])
                winner = vote["winner"]
            else:
                key = (vote["question_id"], vote["model_b"], vote["model_a"])
                winner = revert(vote["winner"])
            judge = get_judge_name(vote["judge"])
            if key not in data[turn]:
                data[turn][key] = {}
            if judge not in data[turn][key]:
                data[turn][key][judge] = []
            data[turn][key][judge].append(winner)

    return data


def convertvote(vote):
    if "tie" in vote:
        return "tie"
    return vote


def equalvote(vote1, vote2):
    if "tie" in vote1 and "tie" in vote2:
        return True
    return vote1 == vote2


# data: Dict[(question_id, model_1, model_2) -> Dict[judge_name -> List[winner]]]
def get_mt_bench_agreement(data, judge1, judge2, ban):
    if judge1.startswith("gpt4") and judge2 == "human":
        stats = [0, 0]
        for votes in data.values():
            if judge1 not in votes or judge2 not in votes: continue
            assert len(votes[judge1]) == 1
            if convertvote(votes[judge1][0]) in ban: continue
            for v in votes[judge2]:
                if convertvote(v) in ban: continue
                stats[1] += 1
                stats[0] += equalvote(votes[judge1][0], v)
        return stats[0], stats[1]
    elif judge1 == "human" and judge2 == "human":
        stats = [0, 0]
        for votes in data.values():
            if "human" not in votes: continue
            for i in range(len(votes["human"]) - 1):
                for j in range(i + 1, len(votes["human"])):
                    if convertvote(votes["human"][i]) in ban or convertvote(votes["human"][j]) in ban:
                        continue
                    stats[1] += 1
                    stats[0] += equalvote(votes["human"][i], votes["human"][j])
        return stats[0], stats[1]
    else:
        raise Exception("Unsupported judges.")


def run_mt_bench_agreement(judges, votefiles):
    # votes[i]: List of votes
    votes = []
    for filename in votefiles:
        # Each file is JSON Lines: one vote record per line.
        with open(filename, "r") as f:
            data = [json.loads(line) for line in f]
        votes.append(data)

    data = get_mt_bench_votes_data(votes)

    agree, total = get_mt_bench_agreement(data[0], judges[0], judges[1], ban=[])
    print(f"turn 1 with tie. #total: {total}, #agree: {agree}, ratio: {agree/total:.2f}")
    agree, total = get_mt_bench_agreement(data[0], judges[0], judges[1], ban=["tie"])
    print(f"turn 1 without tie. #total: {total}, #agree: {agree}, ratio: {agree/total:.2f}")
    agree, total = get_mt_bench_agreement(data[1], judges[0], judges[1], ban=[])
    print(f"turn 2 with tie. #total: {total}, #agree: {agree}, ratio: {agree/total:.2f}")
    agree, total = get_mt_bench_agreement(data[1], judges[0], judges[1], ban=["tie"])
    print(f"turn 2 without tie. #total: {total}, #agree: {agree}, ratio: {agree/total:.2f}")
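
To make the key normalization in `get_mt_bench_votes_data` concrete, here is a minimal sketch on three made-up votes. The model names `alpha` and `beta`, the question id, and the helper `canonical` are hypothetical. The agreement count is also simplified: it covers only the GPT-4 vs. human case where the GPT-4 vote is not a tie, and it compares votes by plain equality instead of the substring check in `equalvote`.

```python
def revert(vote):
    """Swap the winner label when the model pair is reordered."""
    if vote == "model_a":
        return "model_b"
    if vote == "model_b":
        return "model_a"
    return vote  # a tie stays a tie


def canonical(vote):
    """Return (key, winner) with the two models in alphabetical order."""
    a, b = vote["model_a"], vote["model_b"]
    if a < b:
        return (vote["question_id"], a, b), vote["winner"]
    return (vote["question_id"], b, a), revert(vote["winner"])


# Hypothetical votes on the same question and model pair.
votes = [
    {"question_id": 1, "model_a": "alpha", "model_b": "beta",
     "winner": "model_a", "judge": "gpt4_pair"},
    # Same pair shown in the opposite order: "model_b" here means alpha won.
    {"question_id": 1, "model_a": "beta", "model_b": "alpha",
     "winner": "model_b", "judge": "expert_0"},
    {"question_id": 1, "model_a": "alpha", "model_b": "beta",
     "winner": "tie", "judge": "expert_1"},
]

canon = [canonical(v) for v in votes]
keys = {key for key, _ in canon}
winners = [winner for _, winner in canon]

gpt4_vote, human_votes = winners[0], winners[1:]

# "With tie": every human vote counts toward the total.
with_tie = (sum(h == gpt4_vote for h in human_votes), len(human_votes))

# "Without tie": human tie votes are dropped before counting.
kept = [h for h in human_votes if h != "tie"]
without_tie = (sum(h == gpt4_vote for h in kept), len(kept))

print(keys)
print(winners)
print("with tie (agree, total):", with_tie)
print("without tie (agree, total):", without_tie)
```

All three votes collapse onto one key, and the second vote is flipped so that it refers to the same model as the first. Dropping the tie shrinks the denominator from 2 to 1, which is why the "without tie" ratios are higher than the "with tie" ratios.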

### Results

In [5]:
# Compute agreement between GPT-4 and humans
run_mt_bench_agreement(["gpt4_pair", "human"], ["gpt4_pair_judgments.json", "human_judgments.json"])

turn 1 with tie. #total: 1343, #agree: 886, ratio: 0.66
turn 1 without tie. #total: 859, #agree: 727, ratio: 0.85
turn 2 with tie. #total: 1325, #agree: 871, ratio: 0.66
turn 2 without tie. #total: 864, #agree: 731, ratio: 0.85


In [6]:
# Compute agreement between humans and humans
run_mt_bench_agreement(["human", "human"], ["human_judgments.json"])

turn 1 with tie. #total: 721, #agree: 454, ratio: 0.63
turn 1 without tie. #total: 479, #agree: 388, ratio: 0.81
turn 2 with tie. #total: 707, #agree: 471, ratio: 0.67
turn 2 without tie. #total: 474, #agree: 388, ratio: 0.82
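
As a quick sanity check, the ratios can be recomputed from the `#agree` and `#total` counts printed above. The sketch below copies those counts by hand; the dictionary layout and the labels `gpt4-human` and `human-human` are only illustrative.

```python
# (agree, total) pairs copied from the printed results above.
counts = {
    ("gpt4-human", 1, "with tie"): (886, 1343),
    ("gpt4-human", 1, "without tie"): (727, 859),
    ("gpt4-human", 2, "with tie"): (871, 1325),
    ("gpt4-human", 2, "without tie"): (731, 864),
    ("human-human", 1, "with tie"): (454, 721),
    ("human-human", 1, "without tie"): (388, 479),
    ("human-human", 2, "with tie"): (471, 707),
    ("human-human", 2, "without tie"): (388, 474),
}

# Agreement ratio = agreeing vote pairs / compared vote pairs.
ratios = {key: agree / total for key, (agree, total) in counts.items()}

for (pair, turn, setting), ratio in ratios.items():
    print(f"{pair:<12} turn {turn} {setting:<12} {ratio:.2f}")

# Excluding ties, GPT-4 vs. human agreement is above human vs. human
# agreement on both turns.
for turn in (1, 2):
    gpt4 = ratios[("gpt4-human", turn, "without tie")]
    human = ratios[("human-human", turn, "without tie")]
    print(f"turn {turn}: gpt4-human {gpt4:.2f} vs human-human {human:.2f}")
```

The recomputed values match the printed ratios. They also show that the "over 80%" figure applies only to the setting without ties.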
