# Calibrating LLM-as-Judge with Human Annotations and Logprobs

This notebook examines how to measure the performance of LLM-as-Judge techniques on complex tasks.

The quality of LLM-as-Judge varies widely depending on the problem context ([Bavaresco et al., 2024](https://arxiv.org/abs/2406.18403v1)), and evaluating that quality is challenging. Using expert human annotators to provide ground-truth labels is expensive and time-consuming. In addition, human annotators are fallible and may produce lower-quality annotations than state-of-the-art LLMs such as GPT-4.

We showcase two methods for calibrating LLM-as-Judge with human annotations: simple consensus and the more advanced CROWDLAB algorithm (Goh et al., 2024).

## Setup

In [1]:
# Install the packages needed for the evaluation:
# datasets: loads the reference datasets
# openai: interacts with OpenAI's API
# cleanlab: open-source implementation of the CROWDLAB algorithm

!pip install datasets --quiet
!pip install openai --quiet
!pip install cleanlab --quiet

## Example task

For this notebook, we consider MT-Bench, a suite of pairwise comparison tasks used to benchmark LLM-as-a-Judge ([Zheng et al., 2024](https://arxiv.org/abs/2306.05685)). The MT-Bench dataset consists of 80 unique tasks executed by LLMs; human and LLM judges evaluate task performance through pairwise comparisons between two executions.

In [2]:
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("lmsys/mt_bench_human_judgments")

In [15]:
# Show the first 5 rows of the human-judgment split
human_graded_df = pd.DataFrame(dataset['human'])

human_graded_df.head()

| | question_id | model_a | model_b | winner | judge | conversation_a | conversation_b | turn |
|---|---|---|---|---|---|---|---|---|
| 0 | 81 | alpaca-13b | gpt-3.5-turbo | model_b | author_2 | [{'content': 'Compose an engaging travel blog ... | [{'content': 'Compose an engaging travel blog ... | 1 |
| 1 | 81 | alpaca-13b | gpt-3.5-turbo | model_b | author_2 | [{'content': 'Compose an engaging travel blog ... | [{'content': 'Compose an engaging travel blog ... | 2 |
| 2 | 81 | alpaca-13b | gpt-3.5-turbo | model_b | expert_17 | [{'content': 'Compose an engaging travel blog ... | [{'content': 'Compose an engaging travel blog ... | 1 |
| 3 | 81 | alpaca-13b | gpt-3.5-turbo | model_b | expert_17 | [{'content': 'Compose an engaging travel blog ... | [{'content': 'Compose an engaging travel blog ... | 2 |
| 4 | 81 | alpaca-13b | vicuna-13b-v1.2 | model_b | expert_0 | [{'content': 'Compose an engaging travel blog ... | [{'content': 'Compose an engaging travel blog ... | 1 |
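
The rows above show that several judges can rate the same comparison (same `question_id`, `model_a`, `model_b`, and `turn`). As a minimal sketch of what "simple consensus" could look like, the snippet below takes a majority vote over the `winner` column for each comparison. It is an illustration under stated assumptions, not necessarily the procedure used later in this notebook: `toy_df`, `majority_vote`, the `no_consensus` label, and the `"tie"` value are hypothetical, and the toy rows are made up rather than taken from MT-Bench.

```python
import pandas as pd

# Toy stand-in for a few rows of the human-judgment split.
# All values are illustrative only, not real MT-Bench annotations.
toy_df = pd.DataFrame(
    {
        "question_id": [81, 81, 81, 82, 82],
        "model_a": ["alpaca-13b"] * 3 + ["llama-13b"] * 2,
        "model_b": ["gpt-3.5-turbo"] * 3 + ["vicuna-13b-v1.2"] * 2,
        "turn": [1, 1, 1, 1, 1],
        "judge": ["author_2", "expert_17", "expert_0", "expert_3", "expert_5"],
        "winner": ["model_b", "model_b", "model_a", "model_a", "tie"],
    }
)


def majority_vote(df, keys=("question_id", "model_a", "model_b", "turn")):
    """Hypothetical helper: one consensus label per pairwise comparison."""
    rows = []
    for group_key, group in df.groupby(list(keys)):
        # value_counts() sorts labels by frequency, most common first.
        counts = group["winner"].value_counts()
        top_count = counts.iloc[0]
        top_labels = counts[counts == top_count].index.tolist()
        rows.append(
            {
                **dict(zip(keys, group_key)),
                # If two labels share the top count, report no consensus.
                "consensus": top_labels[0] if len(top_labels) == 1 else "no_consensus",
                "n_votes": int(counts.sum()),
                # Fraction of judges who chose the most common label.
                "agreement": top_count / counts.sum(),
            }
        )
    return pd.DataFrame(rows)


consensus_df = majority_vote(toy_df)
print(consensus_df)
```

Majority voting weights every judge equally and discards the comparison when the top labels are tied, which is one reason a more advanced aggregation method can be preferable.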
