# Combine Multiple Evaluators (Human or LLM-as-Judge) with CROWDLAB and GPT token probabilities

In this notebook we delve into the problem of measuring the performance of evaluators (whether human or LLM-as-Judge) for complex tasks.

No labeling strategy is perfect. The quality of LLM-as-Judge varies widely depending on problem context ([Bavaresco et al., 2024](https://arxiv.org/abs/2406.18403v1)). Using expert human annotators to provide ground-truth labels is expensive and time-consuming. In addition, human annotators are fallible and may provide annotations of lower quality than state-of-the-art LLMs like GPT-4.

We showcase two methods, simple consensus and an advanced open-source algorithm (CROWDLAB), to produce a single label from multiple evaluators and estimate the reliability of the evaluators.

## Setup

In [64]:
# Installing the necessary packages for the evaluation
# datasets: for importing the reference datasets
# openai: To interact with OpenAI's API
# cleanlab: Provides an implementation of CROWDLAB algorithm
# pandas: For data manipulation
# numpy: For numerical computations

!pip install datasets --quiet
!pip install openai --quiet
!pip install cleanlab --quiet
!pip install pandas --quiet
!pip install numpy --quiet



## Example task: Evaluating LLM Responses in MT-Bench

For the purpose of this notebook, we consider MT-Bench, a suite of pairwise comparison tasks used to benchmark LLM-as-a-Judge ([Zheng et al., 2024](https://arxiv.org/abs/2306.05685)). The MT-Bench dataset consists of 80 unique tasks executed by LLMs, with multiple humans as well as an LLM-as-judge (GPT-4) evaluating the performance of the tasks using pairwise comparisons between two executions.

Here is an example task from the MT-Bench dataset:

| Task | Model A Response | Model B Response |
| --- | --- | --- |
| "Compose an engaging travel blog post about a recent trip to Hawaii" | "I recently had the pleasure of visiting Hawaii and it quickly became one of my favorite places..." | "Aloha! I recently had the pleasure of embarking on a trip..." |


We'll load up the MT-Bench dataset and transform it into a format that can be used for evaluation.


In [71]:
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("lmsys/mt_bench_human_judgments")

In [72]:
# dataset has both "human" and "gpt4"-graded entries, which we can combine
dataset.keys()

dict_keys(['gpt4_pair', 'human'])

In [73]:
gpt4_graded_df = dataset["gpt4_pair"].to_pandas()
human_graded_df = dataset["human"].to_pandas()
combined_df = pd.concat([gpt4_graded_df, human_graded_df], axis=0)

The original MT-Bench problems are "multi-turn" (that is, they involve multiple turns of interaction between the user and the model). For simplicity, we will consider a "single-turn" version of the task, and use the evaluator ratings for the first turn.

In [143]:
# Example multi-turn conversation:
combined_df['conversation_a'].iloc[0]

array([{'content': 'Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.', 'role': 'user'},
       {'content': 'I recently had the pleasure of visiting Hawaii and it quickly became one of my favorite places. From the stunning beaches to the lush mountains, this place has it all. The people are incredibly friendly and the culture is alive and well. One of the highlights of my trip was visiting the Polynesian Cultural Center. Here, I was able to learn about the culture of the native Hawaiian people and try my hand at traditional crafts and activities. I also had a chance to explore some of the natural wonders of the island, including the breathtaking Hanauma Bay and the majestic Waimea Canyon. Whether you’re looking for a relaxing beach vacation or an adventure filled with culture and nature, Hawaii is the perfect destination.', 'role': 'assistant'}],
      dtype=object)

In [75]:
# Truncate to single-turn:
combined_df['conversation_a'] = combined_df['conversation_a'].apply(lambda array: array[:2])
combined_df['conversation_b'] = combined_df['conversation_b'].apply(lambda array: array[:2])

In [76]:
# Limit rows to those judging the first turn of conversation:
combined_df = combined_df[combined_df.turn == 1]

In [77]:
combined_df.judge.value_counts()

judge
gpt4_pair    1200
expert_24     103
author_4      102
author_0       92
expert_0       74
             ... 
expert_18       5
expert_54       5
expert_30       3
author_1        3
expert_52       2
Name: count, Length: 66, dtype: int64

In [78]:
#integer-ize winner labels
mapping_dict = dict(model_a=0, model_b=1)
reverse_mapping = {v: k for k, v in mapping_dict.items()}
combined_df.loc[:, 'winner_binary'] = combined_df['winner'].apply(lambda s: mapping_dict.get(s))

Next, we examine the distribution of judges-per-example in the MT-Bench dataset:

In [79]:
combined_df_wide = combined_df[combined_df.turn==1].pivot_table(
    index=['question_id', 'model_a', 'model_b'],
    columns='judge',
    values=['winner_binary'],
    aggfunc='first'
)

In [80]:
combined_df_wide.count(axis=1).value_counts()

1    882
2    411
3    124
4     17
6      2
5      2
Name: count, dtype: int64

We see that each evaluation has between one and six evaluators, and that most have only one or two.

## Approach 1: Generating simple consensus results

In the absence of any other method, a simple way to aggregate multiple reviewers is to take a majority (consensus) vote. This produces an answer, but it weights every reviewer equally and does not take into account the quality of the reviewers.
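As a toy illustration of majority voting with missing votes, the sketch below uses made-up data and a hypothetical helper named `majority_vote`; it is not part of the notebook's pipeline. Unlike this sketch, which reports a tie as `None`, taking the first column of a pandas row-wise `mode` is generally understood to keep the smallest of several tied values, so ties in this notebook's setup would resolve to label 0 (`model_a`).

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common non-missing vote, or None on a tie or when no votes exist."""
    counts = Counter(v for v in votes if v is not None)
    if not counts:
        return None
    ranked = counts.most_common()
    # A tie between the two most frequent labels is reported explicitly.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Hypothetical votes: 0 = model_a wins, 1 = model_b wins, None = judge did not vote
toy_votes = {
    "example_1": [0, 0, 1],
    "example_2": [1, None, None],
    "example_3": [0, 1, None],
}

toy_consensus = {name: majority_vote(votes) for name, votes in toy_votes.items()}
print(toy_consensus)
```

With only one or two judges on most examples, ties and single-vote "majorities" are common, which is one reason to look beyond plain voting.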

In [144]:
import numpy as np

consensus = combined_df_wide.mode(axis=1)
consensus_labels = consensus.iloc[:, 0]

results_df = pd.DataFrame({
    'winner': np.where(consensus_labels, combined_df_wide.index.get_level_values('model_b'), 
                       combined_df_wide.index.get_level_values('model_a')),
    'loser': np.where(consensus_labels, combined_df_wide.index.get_level_values('model_a'), 
                      combined_df_wide.index.get_level_values('model_b'))
})


wins = results_df['winner'].value_counts()
appearances = pd.concat([results_df['winner'], results_df['loser']]).value_counts()
win_rates = (wins / appearances).sort_values(ascending=False)

This gets us the following ranked win rates:

In [145]:
for rank, (model, win_rate) in enumerate(win_rates.items(), 1):
    print(f"{rank}. {model}: Win Rate = {win_rate:.2f} ({wins[model]} wins out of {appearances[model]} appearances)")

1. gpt-4: Win Rate = 0.84 (408 wins out of 486 appearances)
2. claude-v1: Win Rate = 0.72 (319 wins out of 443 appearances)
3. gpt-3.5-turbo: Win Rate = 0.66 (325 wins out of 496 appearances)
4. vicuna-13b-v1.2: Win Rate = 0.52 (240 wins out of 458 appearances)
5. alpaca-13b: Win Rate = 0.20 (98 wins out of 493 appearances)
6. llama-13b: Win Rate = 0.10 (48 wins out of 500 appearances)
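The win-rate arithmetic (wins divided by total appearances as either winner or loser) can be checked by hand on a small example. The model names and outcomes below are hypothetical and only illustrate the calculation.

```python
from collections import Counter

# Hypothetical pairwise outcomes as (winner, loser) tuples
toy_results = [
    ("model_x", "model_y"),
    ("model_x", "model_z"),
    ("model_y", "model_z"),
    ("model_z", "model_x"),
]

toy_wins = Counter(winner for winner, _ in toy_results)
toy_appearances = Counter()
for winner, loser in toy_results:
    toy_appearances[winner] += 1
    toy_appearances[loser] += 1

# Win rate = wins / appearances; Counter returns 0 for models with no wins
toy_win_rates = {model: toy_wins[model] / toy_appearances[model] for model in toy_appearances}

for model, rate in sorted(toy_win_rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: {rate:.2f} ({toy_wins[model]} wins out of {toy_appearances[model]} appearances)")
```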


We can also measure judges by their level of agreement with the consensus. Understanding consensus is useful for understanding the quality of the judges, but high consensus doesn't necessarily indicate high quality evaluations. For example, if all judges are low quality, they may all agree on the wrong answer.
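To make the caveat concrete, here is a small sketch with made-up votes and a made-up ground truth (which real evaluation data would not have). Two judges share the same bias and always vote 1, so they agree perfectly with the majority while being right only half the time; the third judge disagrees once and is actually more accurate.

```python
from collections import Counter

# Hypothetical ground truth and judge votes (0 = model_a wins, 1 = model_b wins)
toy_truth = [0, 1, 0, 1]
toy_judges = {
    "judge_1": [1, 1, 1, 1],
    "judge_2": [1, 1, 1, 1],
    "judge_3": [1, 1, 0, 1],
}

n_examples = len(toy_truth)

# Majority vote per example (three binary votes, so no ties are possible)
toy_majority = [
    Counter(votes[i] for votes in toy_judges.values()).most_common(1)[0][0]
    for i in range(n_examples)
]

toy_agreement = {
    judge: sum(v == m for v, m in zip(votes, toy_majority)) / n_examples
    for judge, votes in toy_judges.items()
}
toy_accuracy = {
    judge: sum(v == t for v, t in zip(votes, toy_truth)) / n_examples
    for judge, votes in toy_judges.items()
}

print("agreement with majority:", toy_agreement)
print("accuracy vs. truth:     ", toy_accuracy)
```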

In [146]:
winner_binary_df = combined_df_wide['winner_binary']
vote_counts_row = winner_binary_df.notna().sum(axis=1)
vote_counts_judge = winner_binary_df.notna().sum()
majority_vote = winner_binary_df[vote_counts_row > 1].mode(axis=1).iloc[:, 0]

judge_agreement = {judge: {'agree': 0, 'total': 0} for judge in winner_binary_df.columns}
for judge in winner_binary_df.columns:
    judge_votes = winner_binary_df[judge]
    valid_votes = judge_votes[vote_counts_row > 1]
    agree_counts = (valid_votes == majority_vote[valid_votes.index]).sum()
    total_counts = valid_votes.notna().sum()
    judge_agreement[judge]['agree'] = agree_counts
    judge_agreement[judge]['total'] = total_counts

agreement_percentages = {judge: data['agree'] / data['total'] if data['total'] > 0 else 0 
                         for judge, data in judge_agreement.items()}
judge_metrics = pd.DataFrame({
    'Evaluations': vote_counts_judge,
    'Agreement': agreement_percentages
})

ranked_judges = judge_metrics[judge_metrics['Evaluations'] >= 10].sort_values('Evaluations', ascending=False)


In [153]:
print("\nJudge Summary:")
for judge, row in ranked_judges[:10].iterrows():
    print(f"{judge}: {int(row['Evaluations'])} evaluations, {row['Agreement']*100:.2f}% agreement")


Judge Summary:
gpt4_pair: 882 evaluations, 88.15% agreement
author_4: 71 evaluations, 94.34% agreement
author_0: 65 evaluations, 100.00% agreement
expert_0: 58 evaluations, 92.31% agreement
expert_24: 58 evaluations, 97.50% agreement
author_3: 36 evaluations, 95.83% agreement
author_2: 33 evaluations, 96.00% agreement
expert_9: 30 evaluations, 100.00% agreement
expert_50: 24 evaluations, 80.00% agreement
expert_51: 22 evaluations, 100.00% agreement


## Approach 2: Using CROWDLAB with GPT-4o mini logprobs

The simple consensus method above does not take into account the quality of the judges. The CROWDLAB algorithm instead uses a probabilistic model to estimate both the quality of each judge and the true answer to each problem.

The CROWDLAB algorithm requires two inputs:
1. Judgements from human or AI evaluators, which we already have.
2. A quantitative model score, in the form of predicted probabilities for each possible answer. We'll use GPT token probabilities to construct that now!
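To build intuition for how the two inputs can be combined, the sketch below blends a model's class probabilities with one-hot judge votes using a weighted average. This is a deliberately simplified illustration and not the CROWDLAB estimator: the function name `blended_consensus` and the fixed weights are assumptions made for the example, whereas CROWDLAB estimates its weights from the data.

```python
def blended_consensus(model_probs, votes, model_weight=1.0, judge_weights=None):
    """Weighted average of model probabilities and one-hot judge votes.

    Simplified illustration only; the weights here are supplied by hand.
    """
    totals = [model_weight * p for p in model_probs]
    weight_sum = model_weight
    for judge, vote in votes.items():
        if vote is None:  # this judge did not label the example
            continue
        weight = 1.0 if judge_weights is None else judge_weights.get(judge, 1.0)
        totals[vote] += weight
        weight_sum += weight
    return [t / weight_sum for t in totals]


# Hypothetical example: the model leans towards class 1 (B), the judges are split
toy_votes = {"judge_1": 0, "judge_2": 1, "judge_3": None}

equal = blended_consensus([0.27, 0.73], toy_votes)
trusted = blended_consensus([0.27, 0.73], toy_votes, judge_weights={"judge_1": 3.0})

print("equal weights:    ", [round(p, 3) for p in equal])
print("judge_1 trusted 3x:", [round(p, 3) for p in trusted])
```

With equal weights the model score breaks the tie between the two judges; giving one judge more weight flips the outcome, which is why estimating judge quality matters.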


### Constructing a probabilistic model with GPT logprobs

Here, we'll use the underlying token probabilities from GPT to construct a numerical model score for each response. We'll start by creating a prompt that compares the two responses in MT-Bench:

In [106]:
def conversation_to_text(conversation_obj_list, assistant_label):
    result_txt = ""
    for conv_obj in conversation_obj_list:
        result_txt += f"{conv_obj['role'].upper()} {assistant_label.upper() if conv_obj['role'] == 'assistant' else ''}: {conv_obj['content']} \n"
    return result_txt

In [154]:
from textwrap import dedent

def produce_prompt_for_llm_evaluation(conversation_a, conversation_b):
    prompt_preamble = f"""
    You are a logical and accurate conversation reading and grading AI system.
    You will be shown two conversations between USER and ASSISTANT.
    Read each conversation carefully and decide which one better complies with the USER's instructions
    Please output ONLY "A" if the ASSISTANT in conversation A better complies with the USER's demands, and output only "B" if the ASSISTANT
    in conversation B better complies with the USER's demands

    <Conversation A>
    {conversation_to_text(conversation_a, "a")}
    </Conversation A>

    That was conversation A, here is conversation B:

    <Conversation B>
    {conversation_to_text(conversation_b, "b")}
    </Conversation B>

    Please respond with "A" if Assistant A was better and "B" if Assistant B was better. ONLY RETURN "A" OR "B"
    """
    return dedent(prompt_preamble)


In [157]:
example_prompt = produce_prompt_for_llm_evaluation(combined_df['conversation_a'].iloc[0], combined_df['conversation_b'].iloc[0])

In [158]:
print(example_prompt)


    You are a logical and accurate conversation reading and grading AI system.
    You will be shown two conversations between USER and ASSISTANT.
    Read each conversation carefully and decide which one better complies with the USER's instructions
    Please output ONLY "A" if the ASSISTANT in conversation A better complies with the USER's demands, and output only "B" if the ASSISTANT
    in conversation B better complies with the USER's demands

    <Conversation A>
    USER : Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions. 
ASSISTANT A: I recently had the pleasure of visiting Hawaii and it quickly became one of my favorite places. From the stunning beaches to the lush mountains, this place has it all. The people are incredibly friendly and the culture is alive and well. One of the highlights of my trip was visiting the Polynesian Cultural Center. Here, I was able to learn about the culture of the native Hawai

In [113]:
from openai import OpenAI

openai_client = OpenAI()

In [159]:

def get_completion_with_probs(client, prompt, model_name, seed=123, max_tokens=10, temperature=0, top_logprobs=5, **kwargs):
    """Return {token: probability} for the top candidates of the FIRST generated token."""
    completion = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=temperature,
        seed=seed,
        logprobs=True,
        top_logprobs=top_logprobs,
        **kwargs
    )

    # Convert each log probability to a probability, rounded to two decimals
    prob_dict = {
        tlp.token: round(np.exp(tlp.logprob), 2)
        for tlp in completion.choices[0].logprobs.content[0].top_logprobs
    }

    return prob_dict
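The conversion inside `get_completion_with_probs` is just exponentiation: a log probability `lp` corresponds to a probability `exp(lp)`. The sketch below applies the same arithmetic to hypothetical log probabilities (made-up numbers, not real API output) so it can be run without an API key.

```python
import math

# Hypothetical log probabilities for the first generated token (not real API output)
toy_top_logprobs = {"A": -0.05, "B": -3.0}

# Same conversion as in the function above: exponentiate, then round to two decimals
toy_probs = {
    token: round(math.exp(logprob), 2)
    for token, logprob in toy_top_logprobs.items()
}

print(toy_probs)  # {'A': 0.95, 'B': 0.05}
```

Because only the top few tokens are returned and each value is rounded, the resulting probabilities are not guaranteed to sum to exactly one.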

In [115]:
prompt="Please respond randomly with EITHER the letter A or B, NO OTHER WORDS"

get_completion_with_probs(client=openai_client,
                      prompt=prompt,
                      model_name="gpt-4o-mini",
                      top_logprobs=2)

{'A': 0.95, 'B': 0.05}

(Interestingly, these probabilities vary by model:)

In [116]:
get_completion_with_probs(client=openai_client,
                      prompt=prompt,
                      model_name="gpt-4",
                      top_logprobs=2)

{'B': 0.65, 'A': 0.35}

In [160]:
get_completion_with_probs(client=openai_client,
                      prompt=prompt,
                      model_name="gpt-4o",
                      top_logprobs=2)

{'A': 0.75, 'B': 0.25}

In MT-Bench, many of the examples are judged multiple times, but we only need to score each conversation once, so we'll drop duplicates:

In [161]:
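# .copy() gives an independent DataFrame, so adding columns below does not trigger SettingWithCopyWarning
for_llm_df = combined_df.drop_duplicates(subset=['question_id', 'model_a', 'model_b']).copy()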

In [162]:
for_llm_df.loc[:, 'conversation_prompt_text'] = for_llm_df.apply(
    lambda s: produce_prompt_for_llm_evaluation(s['conversation_a'], s['conversation_b']),
    axis=1)



In [163]:
for_llm_df.loc[:,'score_results'] = for_llm_df['conversation_prompt_text'].apply(lambda s: get_completion_with_probs(prompt=s, client=openai_client, model_name="gpt-4o", max_tokens=10, top_logprobs=2))



We now extract the model results for each conversation:

In [185]:
score_results_only = for_llm_df.set_index(['question_id', 'model_a', 'model_b'])[['score_results']]
score_results_only['A'] = score_results_only['score_results'].apply(lambda d: d.get('A',0))
score_results_only['B'] = score_results_only['score_results'].apply(lambda d: d.get('B',0))
score_results_only.head()

| question_id | model_a | model_b | score_results | A | B |
| --- | --- | --- | --- | --- | --- |
| 81 | alpaca-13b | claude-v1 | {'B': 1.0, 'A': 0.0} | 0.0 | 1.0 |
| 81 | alpaca-13b | gpt-3.5-turbo | {'B': 1.0, 'A': 0.0} | 0.0 | 1.0 |
| 81 | alpaca-13b | gpt-4 | {'B': 1.0, '"B': 0.0} | 0.0 | 1.0 |
| 81 | alpaca-13b | vicuna-13b-v1.2 | {'B': 1.0, 'A': 0.0} | 0.0 | 1.0 |
| 81 | gpt-3.5-turbo | claude-v1 | {'B': 0.73, 'A': 0.27} | 0.27 | 0.73 |
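The returned dictionaries do not always contain both letters (one row above has `'"B'` instead of `'A'`), and rounded values may not sum to exactly one. Row-normalized probabilities are the usual expectation for a predicted-probability array, so one hedged option is to renormalize over the two labels. The helper name `normalize_ab` and the uniform fallback are assumptions made for this sketch, not something the notebook does.

```python
def normalize_ab(score_dict, labels=("A", "B")):
    """Return probabilities over `labels` that sum to 1.

    Falls back to a uniform distribution when none of the labels appear
    among the returned tokens (an assumption made for this sketch).
    """
    raw = [float(score_dict.get(label, 0.0)) for label in labels]
    total = sum(raw)
    if total <= 0:
        return [1.0 / len(labels)] * len(labels)
    return [p / total for p in raw]


print(normalize_ab({"B": 1.0, '"B': 0.0}))   # only "B" is a valid label here
print(normalize_ab({"A": 0.27, "B": 0.70}))  # rounded values that sum to 0.97
print(normalize_ab({"Neither": 1.0}))        # no valid label returned
```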


In [186]:
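# Align the model scores row-for-row with the judge labels (same index, same order),
# since the scores are passed to cleanlab below as a positional NumPy array
score_results_only = score_results_only.reindex(combined_df_wide.index)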

We can now feed the judge labels and model scores into cleanlab:

In [187]:
from cleanlab.multiannotator import get_label_quality_multiannotator

results = get_label_quality_multiannotator(combined_df_wide, score_results_only[['A', 'B']].to_numpy(), verbose=False)

In [188]:
consensus_results = results["label_quality"]
consensus_results["consensus_label"] = consensus_results["consensus_label"].apply(lambda i: {0:"A",1:"B"}.get(i))

In [189]:
consensus_results.head()

| question_id | model_a | model_b | consensus_label | consensus_quality_score | annotator_agreement | num_annotations |
| --- | --- | --- | --- | --- | --- | --- |
| 81 | alpaca-13b | claude-v1 | B | 0.916097 | 1.0 | 1 |
| 81 | alpaca-13b | gpt-3.5-turbo | B | 0.916097 | 1.0 | 3 |
| 81 | alpaca-13b | gpt-4 | B | 0.916097 | 1.0 | 1 |
| 81 | alpaca-13b | vicuna-13b-v1.2 | B | 0.916097 | 1.0 | 2 |
| 81 | claude-v1 | alpaca-13b | A | 0.916095 | 1.0 | 1 |


The produced consensus label comes with a confidence score (`consensus_quality_score`), which can be used to understand the reliability of the label.
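One way to act on this score is to route low-confidence consensus labels to additional review. The sketch below uses made-up rows and a hypothetical threshold of 0.75; an appropriate cutoff would depend on the task and is not specified in this notebook.

```python
# Hypothetical consensus rows, mimicking the columns shown above
toy_rows = [
    {"pair": "q1", "consensus_label": "B", "consensus_quality_score": 0.92},
    {"pair": "q2", "consensus_label": "A", "consensus_quality_score": 0.55},
    {"pair": "q3", "consensus_label": "B", "consensus_quality_score": 0.71},
]

REVIEW_THRESHOLD = 0.75  # hypothetical cutoff, chosen only for illustration

# Least confident first, so reviewers see the most doubtful labels at the top
needs_review = sorted(
    (row for row in toy_rows if row["consensus_quality_score"] < REVIEW_THRESHOLD),
    key=lambda row: row["consensus_quality_score"],
)

for row in needs_review:
    print(f"{row['pair']}: label {row['consensus_label']}, score {row['consensus_quality_score']:.2f}")

# On the notebook's DataFrame, the equivalent filter would be:
# consensus_results[consensus_results["consensus_quality_score"] < REVIEW_THRESHOLD]
```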

In [184]:
results["annotator_stats"]["worst_class"] = results["annotator_stats"]["worst_class"].apply(lambda i: {0:"A",1:"B"}.get(i))
results["annotator_stats"].sort_values("num_examples_labeled", ascending=False)

| | judge | annotator_quality | agreement_with_consensus | worst_class | num_examples_labeled |
| --- | --- | --- | --- | --- | --- |
| winner_binary | gpt4_pair | 0.962963 | 0.982993 | A | 882 |
| winner_binary | author_4 | 0.943396 | 0.957746 | B | 71 |
| winner_binary | author_0 | 1.000000 | 1.000000 | A | 65 |
| winner_binary | expert_0 | 1.000000 | 1.000000 | B | 58 |
| winner_binary | expert_24 | 0.975000 | 0.982759 | B | 58 |
| winner_binary | ... | ... | ... | ... | ... |
| winner_binary | expert_30 | 1.000000 | 1.000000 | A | 3 |
| winner_binary | expert_54 | 1.000000 | 1.000000 | A | 3 |
| winner_binary | author_1 | 0.500000 | 0.500000 | A | 2 |
| winner_binary | expert_18 | 1.000000 | 1.000000 | A | 2 |


Using a more sophisticated algorithm helps us to estimate the quality of the judges and the true answer to the problem.

### Limitations

The traditional consensus score is a simple and easy-to-understand method for combining multiple evaluators. However, it does not take into account the quality of the judges.

The quality of the CROWDLAB algorithm depends on the quality of the model scores. If the model scores are not directionally accurate, or are predisposed towards a certain reviewer, the CROWDLAB algorithm will not be able to accurately estimate the quality of the judges and the true answer to the problem.
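A rough way to check whether model scores are directionally accurate is to compare the model's preferred label with the majority vote on examples that several judges labeled. The sketch below does this on made-up data; it is a sanity check under the assumption that the multi-judge majority is mostly right, not a validation procedure taken from this notebook.

```python
def argmax(probs):
    """Index of the largest probability (first one wins on ties)."""
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical model probabilities over [A, B] and majority labels (0 = A, 1 = B)
toy_model_probs = {
    "q1": [0.10, 0.90],
    "q2": [0.60, 0.40],
    "q3": [0.30, 0.70],
    "q4": [0.80, 0.20],
}
toy_majority_labels = {"q1": 1, "q2": 1, "q3": 1, "q4": 0}

hits = [
    argmax(toy_model_probs[q]) == toy_majority_labels[q]
    for q in toy_majority_labels
]
toy_directional_accuracy = sum(hits) / len(hits)

print(f"Model agrees with the majority on {toy_directional_accuracy:.0%} of examples")
```

An agreement rate close to chance (50% for two classes) would suggest the model scores carry little signal for this task.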


## Conclusion

In this notebook, we demonstrated two methods for combining multiple evaluators (human or LLM-as-Judge): simple consensus, and CROWDLAB with GPT logprobs. We showed that the CROWDLAB algorithm can be used to estimate the quality of the judges and the true answer to the problem. We also noted that the quality of the CROWDLAB estimates depends on the quality of the model scores: if the model scores are not accurate, CROWDLAB will not be able to accurately estimate the quality of the judges or the true answer to the problem.

## References

- [LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks](https://arxiv.org/abs/2406.18403v1) - Bavaresco et al. Published June 2024
- [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/abs/2306.05685) - Zheng et al. Published December 2024
- [CROWDLAB: Supervised learning to infer consensus labels and quality scores for data with multiple annotators](https://arxiv.org/abs/2210.06812) - Goh et al. Published January 2023
