# LLM Evals: Optimally Combining Multiple Evaluators with Token Probabilities, Structured Outputs, and the CROWDLAB Algorithm

In this notebook we delve into the problem of measuring the performance of multiple evaluators (Whether human or LLM-as-Judge) in LLM Evaluations.

No labeling strategy is perfect. The quality of LLM-as-Judge varies depending on problem context ([Bavaresco et al., 2024](https://arxiv.org/abs/2406.18403v1)) while using expert human annotators to provide ground-truth labels is expensive and time-consuming. In addition, human annotators are fallible and may provide annotations at a lower quality than state-of-the-art LLMs like GPT-4.

In this notebook, we replicate a popular academic paper on LLM-As-Judge, and in the process, showcase two methods, simple consensus, and a more advanced multiannotator consensus algorithm (CROWDLAB, [Goh et al., 2022](https://arxiv.org/abs/2210.06812)) implemented in [cleanlab](https://github.com/cleanlab/cleanlab), a popular open-source package for data and ML/AI.


###  Installing requirements

In [1]:
# Installing the necessary packages for the evaluation
# cleanlab: Provides an implementation of CROWDLAB algorithm
# datasets: for importing the reference datasets
# openai: To interact with OpenAI's API
# pandas: For data manipulation
# numpy: For numerical computations

!pip install cleanlab --quiet
!pip install datasets --quiet
!pip install openai --quiet --upgrade
!pip install pandas --quiet
!pip install numpy --quiet

## Example task: Evaluating LLM Responses in MT-Bench

As an example task for this notebook, we'll use  MT-Bench, a collection of pairwise comparison tasks used to benchmark LLM-as-Judge ([Zheng et al., 2024](https://arxiv.org/abs/2306.05685)). The MT-Bench dataset consists of 80 unique multi-step writing tasks executed by LLMs, with multiple humans as well as an LLM-as-judge (specifically, GPT-4) evaluating the performance of the tasks using pair-wise comparisons between two executions.

Here is an example task from the MT-Bench dataset, answered by two different models:

| Task | Model A Response | Model B Response |
| --- | --- | --- |
| "Compose an engaging travel blog post about a recent trip to Hawaii" | "I recently had the pleasure of visiting Hawaii and it quickly and it quickly became one of my favorite places..." | "Aloha! I recently had the pleasure of embarking on a trip..." |

Then, an evaluator (either human or the LLM-as-Judge) is asked to pick the better response between Model A and Model B.

To replicate this paper, we'll load the dataset, and then use both the simple and advanced multiannotator consensus algorithms to evaluate the performance of the LLM-as-Judge on the MT-Bench dataset.


### Preparing the MT-Bench dataset

Now, we'll load up the MT-Bench dataset and transform it into a format that can be used for evaluation.

In [2]:
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("lmsys/mt_bench_human_judgments")

  from .autonotebook import tqdm as notebook_tqdm


In [3]:
# dataset has both "human" and "gpt4"-graded entries, which we can combine
dataset.keys()

dict_keys(['gpt4_pair', 'human'])

In [4]:
gpt4_graded_df = dataset["gpt4_pair"].to_pandas()
human_graded_df = dataset["human"].to_pandas()
combined_df = pd.concat([gpt4_graded_df, human_graded_df])

The original MT-Bench problems are "multi-turn" (That is, they involve multiple turns of interaction between the model and the evaluator). For simplicity, we will consider a "single-turn" version of the task, and use the evaluator ratings for the first turn.

Here's an example task:

In [5]:
combined_df['conversation_a'].iloc[0]

array([{'content': 'Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.', 'role': 'user'},
       {'content': 'I recently had the pleasure of visiting Hawaii and it quickly became one of my favorite places. From the stunning beaches to the lush mountains, this place has it all. The people are incredibly friendly and the culture is alive and well. One of the highlights of my trip was visiting the Polynesian Cultural Center. Here, I was able to learn about the culture of the native Hawaiian people and try my hand at traditional crafts and activities. I also had a chance to explore some of the natural wonders of the island, including the breathtaking Hanauma Bay and the majestic Waimea Canyon. Whether you’re looking for a relaxing beach vacation or an adventure filled with culture and nature, Hawaii is the perfect destination.', 'role': 'assistant'},
       {'content': 'Rewrite your previous response. Start every 

In [6]:
# Truncate to single-turn by taking initial ask and answer
combined_df['conversation_a'] = combined_df['conversation_a'].apply(lambda array: array[:2])
combined_df['conversation_b'] = combined_df['conversation_b'].apply(lambda array: array[:2])

In [7]:
# Limit rows to those judging the first turn of conversation:
combined_df = combined_df[combined_df.turn == 1]

In [8]:
#integer-ize winner labels for the cleanlab algorithm
mapping_dict = dict(model_a=0, model_b=1)
reverse_mapping = {v: k for k, v in mapping_dict.items()} # useful for interpreting results later
combined_df.loc[:, 'winner_binary'] = combined_df['winner'].apply(lambda s: mapping_dict.get(s))

### Analyzing the MT-Bench LLM-as-Judge dataset

We see that GPT4 has the most judgements, followed by human annotators. 
Even within human annotators, there is a large variance in how many examples each judge has graded. 

In addition, some examples have more judgements than others. 

This is a common problem in real-world datasets and a great example to tackle!

In [9]:
# Large variance in how many times each judge has graded a conversation
combined_df.judge.value_counts()

judge
gpt4_pair    1200
expert_24     103
author_4      102
author_0       92
expert_0       74
             ... 
expert_18       5
expert_54       5
expert_30       3
author_1        3
expert_52       2
Name: count, Length: 66, dtype: int64

Next, we examine the distribution of judges-per-example in the MT-Bench dataset:

In [10]:
combined_df_wide = combined_df[combined_df.turn==1].pivot_table(
    index=['question_id', 'model_a', 'model_b'],
    columns='judge',
    values=['winner_binary'],
    aggfunc='first'
)

In [11]:
combined_df_wide.count(axis=1).value_counts().sort_index()

1    882
2    411
3    124
4     17
5      2
6      2
Name: count, dtype: int64

We see that each evaluation has between one and six evaluators

# Approach 1: Simple calculation of consensus results

A simple way to aggregate multiple reviewers is to take consensus votes - this produces an answer quickly but does not take into account the quality of the reviewers, or utilize the number of reviewers in determining confidence

Here's how we can quickly generate consensus labels:

In [12]:
import numpy as np

consensus = combined_df_wide.mode(axis=1)
consensus_labels = consensus.iloc[:, 0]

results_df = pd.DataFrame({
    'winner': np.where(consensus_labels, combined_df_wide.index.get_level_values('model_b'), 
                       combined_df_wide.index.get_level_values('model_a')),
    'loser': np.where(consensus_labels, combined_df_wide.index.get_level_values('model_a'), 
                      combined_df_wide.index.get_level_values('model_b'))
})


wins = results_df['winner'].value_counts()
appearances = pd.concat([results_df['winner'], results_df['loser']]).value_counts()
win_rates = (wins / appearances).sort_values(ascending=False)

Using consensus labels as our ground truth, we can complete our evaluation by calculating ranked win rates:

In [13]:
for rank, (model, win_rate) in enumerate(win_rates.items(), 1):
    print(f"{rank}. {model}: Win Rate = {win_rate:.2f} ({wins[model]} wins out of {appearances[model]} appearances)")

1. gpt-4: Win Rate = 0.84 (408 wins out of 486 appearances)
2. claude-v1: Win Rate = 0.72 (319 wins out of 443 appearances)
3. gpt-3.5-turbo: Win Rate = 0.66 (325 wins out of 496 appearances)
4. vicuna-13b-v1.2: Win Rate = 0.52 (240 wins out of 458 appearances)
5. alpaca-13b: Win Rate = 0.20 (98 wins out of 493 appearances)
6. llama-13b: Win Rate = 0.10 (48 wins out of 500 appearances)


We can also measure judges by their level of agreement with the consensus. Understanding consensus is useful for understanding the quality of the judges, but high consensus doesn't necessarily indicate high quality evaluations. (For example, if all judges are low quality, they may all agree on the wrong answer!)

In [14]:
winner_binary_df = combined_df_wide['winner_binary']
vote_counts_row = winner_binary_df.notna().sum(axis=1)
vote_counts_judge = winner_binary_df.notna().sum()
majority_vote = winner_binary_df[vote_counts_row > 1].mode(axis=1).iloc[:, 0]

judge_agreement = {judge: {'agree': 0, 'total': 0} for judge in winner_binary_df.columns}
for judge in winner_binary_df.columns:
    judge_votes = winner_binary_df[judge]
    valid_votes = judge_votes[vote_counts_row > 1]
    agree_counts = (valid_votes == majority_vote[valid_votes.index]).sum()
    total_counts = valid_votes.notna().sum()
    judge_agreement[judge]['agree'] = agree_counts
    judge_agreement[judge]['total'] = total_counts

agreement_percentages = {judge: data['agree'] / data['total'] if data['total'] > 0 else 0 
                         for judge, data in judge_agreement.items()}
judge_metrics = pd.DataFrame({
    'Evaluations': vote_counts_judge,
    'Agreement': agreement_percentages
})

ranked_judges = judge_metrics[judge_metrics['Evaluations'] >= 10].sort_values('Evaluations', ascending=False)


In [15]:
print("\nJudge Summary for 10 most active judges:")
for judge, row in ranked_judges[:10].iterrows():
    print(f"{judge}: {int(row['Evaluations'])} evaluations, {row['Agreement']*100:.2f}% agreement")


Judge Summary for 10 most active judges:
gpt4_pair: 882 evaluations, 88.15% agreement
author_4: 71 evaluations, 94.34% agreement
author_0: 65 evaluations, 100.00% agreement
expert_0: 58 evaluations, 92.31% agreement
expert_24: 58 evaluations, 97.50% agreement
author_3: 36 evaluations, 95.83% agreement
author_2: 33 evaluations, 96.00% agreement
expert_9: 30 evaluations, 100.00% agreement
expert_50: 24 evaluations, 80.00% agreement
expert_51: 22 evaluations, 100.00% agreement


We've now replicated the main finding of the paper, which is that GPT4 reached about a high (above 80%) consensus with human evaluators!

If you are short on time, this agreement percentage calculation can help surface evaluators who tend to disagree with the consensus. 

# Approach 2: Advanced multiannotator algorithm utilizing GPT-4o-mini token probabilities, structured outputs, and the CROWDLAB algorithm 

The simple consensus method does not attempt to estimate the quality of the judges, nor does it place any confidence weighting on labels based on the quantity of judges involved. To improve our labeling, we can utilize a more advanced consensus algorithm. In this notebook, we'll use CROWDLAB, a consensus algorithm shown to outperform many other consensus models in a variety of settings ([Goh et al., 2022](https://arxiv.org/abs/2210.06812)) and implemented in the open-source package [cleanlab](http://github.com/cleanlab/cleanlab)

The algorithm requires two inputs:
1. Judgements from Human or AI evaluators, which we already have. 
2. A quantitative model score. The algorithm then combines the model score, which can be from any ML or AI-based model, with the evaluators. We'll use GPT-4o-mini to construct that now!


### Constructing a probabilistic model with GPT logprobs, structured outputs, and a multiannotator algorithm

The multiannotator algorithm in cleanlab combines the underlying probabilities of the model with the evaluators. To create probabilities, we'll extract token probabilities from GPT-4o-mini!

We'll start by creating a prompt that compares the two responses in MT-Bench. (This is also a setup you can use for general LLM-as-Judge tasks)

In [16]:
from textwrap import dedent

def conversation_to_text(conversation_obj_list, assistant_label):
    result_txt = ""
    for conv_obj in conversation_obj_list:
        result_txt += f"{conv_obj['role'].upper()} {assistant_label.upper() if conv_obj['role'] == 'assistant' else ''}: {conv_obj['content']} \n"
    return result_txt


def produce_prompt_for_llm_evaluation(conversation_a, conversation_b):
    prompt_preamble = f"""
    You are a logical and accurate conversation-reading and grading AI system.
    You will be shown instructions from USER and response by ASSISTANT A and ASSISTANT B.
    Read each conversation carefully and decide whether ASSISTANT B or ASSISTANT A better complies with the USER's instructions
    Please output ONLY A if ASSISTANT A better complies with the USER's demands, and output ONLY B if ASSISTANT B complies better with the USER's demands.
    
    USER instructions:
    {conversation_a[0]['content']}

    ASSISTANT A response:
    {conversation_a[1]['content']}

    END ASSISTANT A response

    ASSISTANT B response: 
    {conversation_b[1]['content']}

    END ASSISTANT B response

    Respond ONLY with the single letter, A or B, no punctuation, spacing, or additional characters.
    """
    return dedent(prompt_preamble)


In [17]:
example_prompt = produce_prompt_for_llm_evaluation(combined_df['conversation_a'].iloc[0], combined_df['conversation_b'].iloc[0])

In [18]:
print(example_prompt)


    You are a logical and accurate converation reading and grading AI system.
    You will be shown instructions from USER and response by ASSISTANT A and ASSISTANT B.
    Read each conversation carefully and decide whether ASSISTANT B or ASSISTANT A better complies with the USER's instructions
    Please output ONLY A if ASSISTANT A better complies with the USER's demands, and output ONLY B if ASSISTANT B complies better with the USER's demands.

    USER instructions:
    Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.

    ASSISTANT A response:
    I recently had the pleasure of visiting Hawaii and it quickly became one of my favorite places. From the stunning beaches to the lush mountains, this place has it all. The people are incredibly friendly and the culture is alive and well. One of the highlights of my trip was visiting the Polynesian Cultural Center. Here, I was able to learn about the culture o

In [19]:
from openai import OpenAI

openai_client = OpenAI()

In [21]:
from enum import Enum
from pydantic import BaseModel, Field
from typing import Literal, get_args

class ABChoice(BaseModel):
    choice: Literal["A", "B"] = Field(..., description="Choose either A or B")

def get_completion_with_probs(client: OpenAI, prompt: str, model_name: str, choice_schema: BaseModel, prob_rounding: int=4,  **kwargs):
    choices = get_args(choice_schema.model_fields.get('choice').annotation)
    if not choices:
        raise ValueError("choice_schema must have a field named 'choice' with a Literal type")

    completion = client.beta.chat.completions.parse(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
        response_format=choice_schema,
        logprobs=True,
        **kwargs
    )
    
    probs = {k: 0.0 for k in choices}
    for token_info in completion.choices[0].logprobs.content:
        if token_info.token in choices:
            probs[token_info.token] = np.exp(token_info.logprob)
        for tlp in token_info.top_logprobs:
            if tlp.token in choices:
                probs[tlp.token] = max(probs[tlp.token], np.exp(tlp.logprob))
    
    total = sum(probs.values())
    return {k: round(v / total, prob_rounding) for k, v in probs.items()}


In [22]:
prompt="RESPOND RANDOMLY with EITHER the letter A or B, NO OTHER WORDS"

get_completion_with_probs(client=openai_client,
                      prompt=prompt,
                      choice_schema=ABChoice,
                      model_name="gpt-4o-mini",
                      top_logprobs=2)

{'A': 0.5622, 'B': 0.4378}

Interestingly, these probabilities vary by model:

In [23]:
get_completion_with_probs(client=openai_client,
                      prompt=prompt,
                      choice_schema=ABChoice,
                      model_name="gpt-4o-2024-08-06",
                      top_logprobs=2)

{'A': 0.5, 'B': 0.5}

In MT-Bench, many of the examples are judged multiple times, but we only need to score each conversation once, so we'll key by question and the two answerers, and then drop duplicates before creating the prompt for LLM-As-Judge for each example

In [24]:
for_llm_df = combined_df.drop_duplicates(subset=['question_id', 'model_a', 'model_b']).copy()

In [25]:
for_llm_df.loc[:, 'conversation_prompt_text'] = for_llm_df.loc[:, ['conversation_a', 'conversation_b']].apply(
    lambda s: produce_prompt_for_llm_evaluation(s['conversation_a'], s['conversation_b']),
    axis=1
)

In [28]:
for_llm_df.loc[:,'score_results'] = for_llm_df.loc[:, 'conversation_prompt_text'].apply(
    lambda s: get_completion_with_probs(
        prompt=s,
        client=openai_client,
        model_name="gpt-4o-mini",
        max_tokens=10,
        top_logprobs=10,
        choice_schema=ABChoice,
    )
)

We now extract the model results for each conversation:

In [30]:
score_results_only = for_llm_df.set_index(['question_id', 'model_a', 'model_b'])[['score_results']]
score_results_only[['A', 'B']] = score_results_only['score_results'].apply(lambda d: pd.Series([d.get('A', 0), d.get('B', 0)]))
score_results_only.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,score_results,A,B
question_id,model_a,model_b,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
81,alpaca-13b,claude-v1,"{'A': 0.0, 'B': 1.0}",0.0,1.0
81,alpaca-13b,gpt-3.5-turbo,"{'A': 0.0, 'B': 1.0}",0.0,1.0
81,alpaca-13b,gpt-4,"{'A': 0.0, 'B': 1.0}",0.0,1.0
81,alpaca-13b,vicuna-13b-v1.2,"{'A': 0.0, 'B': 1.0}",0.0,1.0
81,gpt-3.5-turbo,claude-v1,"{'A': 0.7982, 'B': 0.2018}",0.7982,0.2018
81,gpt-3.5-turbo,gpt-4,"{'A': 0.0534, 'B': 0.9466}",0.0534,0.9466
81,gpt-4,claude-v1,"{'A': 1.0, 'B': 0.0}",1.0,0.0
81,llama-13b,alpaca-13b,"{'A': 0.0004, 'B': 0.9996}",0.0004,0.9996
81,llama-13b,claude-v1,"{'A': 0.0, 'B': 1.0}",0.0,1.0
81,llama-13b,gpt-3.5-turbo,"{'A': 0.0, 'B': 1.0}",0.0,1.0


In [31]:
# Drop any answers not in the original dataset
score_results_only = score_results_only[score_results_only.index.isin(combined_df_wide.index)]

And can now feed the results into cleanlab:

In [32]:
from cleanlab.multiannotator import get_label_quality_multiannotator

results = get_label_quality_multiannotator(combined_df_wide, score_results_only[['A', 'B']].to_numpy(), verbose=False)

In [33]:
consensus_results = results["label_quality"]
consensus_results["consensus_label"] = consensus_results["consensus_label"].apply(lambda i: {0:"A",1:"B"}.get(i))

In [34]:
consensus_results.tail(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,consensus_label,consensus_quality_score,annotator_agreement,num_annotations
question_id,model_a,model_b,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
160,gpt-4,gpt-3.5-turbo,A,0.916097,1.0,1
160,gpt-4,llama-13b,A,0.916097,1.0,1
160,llama-13b,alpaca-13b,A,0.916097,1.0,3
160,llama-13b,claude-v1,B,0.916095,1.0,1
160,llama-13b,gpt-3.5-turbo,B,0.570378,0.5,2
160,llama-13b,gpt-4,B,0.916095,1.0,1
160,llama-13b,vicuna-13b-v1.2,B,0.916096,1.0,2
160,vicuna-13b-v1.2,claude-v1,B,0.916092,1.0,1
160,vicuna-13b-v1.2,gpt-3.5-turbo,A,0.907919,0.5,2
160,vicuna-13b-v1.2,llama-13b,A,0.916097,1.0,1


The produced consensus label here comes with a confidence score, which can be used to understand the reliability of the label.

In this example, we see that the `llama-13b` vs. `gpt-3.5-turbo` comparison has a low consensus quality score, while the `vicuna-13b-v1.2` vs. `gpt-3.5-turbo` has a high consensus quality score despite both having two disagreeing annotators! This is because the more advanced algorithm takes into account the quality of the annotators and the confidence of the provided model in its predictions.

Next, we look at per-annotator quality scores:

In [35]:
results["annotator_stats"]["worst_class"] = results["annotator_stats"]["worst_class"].apply(lambda i: {0:"A",1:"B"}.get(i))
results["annotator_stats"].sort_values("num_examples_labeled", ascending=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,annotator_quality,agreement_with_consensus,worst_class,num_examples_labeled
Unnamed: 0_level_1,judge,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
winner_binary,gpt4_pair,0.962963,0.982993,A,882
winner_binary,author_4,0.943396,0.957746,B,71
winner_binary,author_0,1.000000,1.000000,A,65
winner_binary,expert_24,0.975000,0.982759,B,58
winner_binary,expert_0,1.000000,1.000000,B,58
winner_binary,...,...,...,...,...
winner_binary,expert_30,1.000000,1.000000,A,3
winner_binary,expert_54,1.000000,1.000000,A,3
winner_binary,author_1,0.500000,0.500000,A,2
winner_binary,expert_18,1.000000,1.000000,A,2


With this analysis, we find that LLM-As-Judge with GPT-4 has an even higher accuracy than computed by the simple consensus method.


### Limitations

The traditional consensus score is a simple and easy-to-understand method for combining multiple evaluators. However, it does not take into account the quality of the judges.

Conversely, the CROWDLAB multiannotator algorithm estimates annotator quality but is dependant on the quality of the provided model scores. If the model scores are not directionally accurate, or are predisposed towards a certain reviewer, the algorithm will not be able to accurately estimate the quality of the judges and the true labels.


## Conclusion

In this notebook, we demonstrated two methods for combining multiple evaluators (human or LLM-as-Judge) utilizing GPT token logprobs and structured outputs capabilities. We showed a simple method for computing consensus agreement, and then also demonstrated an advanced multiannotator algorithm that can be used to estimate the quality of the judges and true labels. 

## References
- [LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
](https://arxiv.org/abs/2406.18403v1) - Bavaresco et al. Published June 2024
- [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/abs/2306.05685) - Zheng, Lianmin, et al. Published December 2024
- [CROWDLAB: Supervised learning to infer consensus labels and quality scores for data with multiple annotators](https://arxiv.org/abs/2210.06812) -  Goh et al. Published January 2023
- [Estimate Consensus and Annotator Quality for Data Labeled by Multiple Annotators](https://docs.cleanlab.ai/stable/tutorials/multiannotator.html)
- [OpenAI Structured Outputs Guide](https://platform.openai.com/docs/guides/structured-outputs/)