# Evaluating LLM performance without ground truth using an LLM judge

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kluster-ai/klusterai-cookbook/blob/main/examples/llm-as-a-judge.ipynb)

How can we test a model's accuracy when the ground truth is unavailable? One approach is to have a larger model, which should generally perform better, evaluate the predictions made by the base model.

This tutorial uses a base model (`klusterai/Meta-Llama-3.1-8B-Instruct-Turbo`) to classify movies into genres based on their descriptions. Next, a larger model (`klusterai/Meta-Llama-3.3-70B-Instruct-Turbo`) acts as a judge, tasked with determining whether the base model's predictions are correct. Since the dataset also contains the ground truth, the notebook also assesses how well the judge model performed.

A great breakdown on calculating a model's accuracy can be found in our <a href="/tutorials/klusterai-api/model-comparison/" target="_blank"> model comparison notebook</a>.

You'll be using the same dataset as in our <a href="/tutorials/klusterai-api/text-classification/text-classification-openai-api/" target="_blank">text classification notebook</a>, which is an extract from the IMDB top 1000 movies dataset categorized into 21 different genres.

## Prerequisites

Before getting started, ensure you have the following:

- **A kluster.ai account** - sign up on the <a href="https://platform.kluster.ai/signup" target="_blank">kluster.ai platform</a> if you don't have one
- **A kluster.ai API key** - after signing in, go to the <a href="https://platform.kluster.ai/apikeys" target="_blank">**API Keys**</a> section and create a new key. For detailed instructions, check out the <a href="/get-started/get-api-key/" target="_blank">Get an API key</a> guide

## Setup

In this notebook, we'll use Python's `getpass` module to input the key safely. After execution, please provide your unique kluster.ai API key (ensure no spaces).

In [2]:
from getpass import getpass

api_key = getpass("Enter your kluster.ai API key: ")

Enter your kluster.ai API key:  ········


Next, ensure you've installed the OpenAI Python library:

In [4]:
%pip install -q OpenAI

Note: you may need to restart the kernel to use updated packages.


With the OpenAI Python library installed, we import the necessary dependencies for the tutorial:

In [5]:
from openai import OpenAI

import pandas as pd
import time
import json
import os
from IPython.display import clear_output, display
import matplotlib.pyplot as plt
import urllib.request
import numpy as np

Then, initialize the `client` by pointing it to the kluster.ai endpoint and passing your API key.

In [6]:
# Set up the client
client = OpenAI(
    base_url="https://api.kluster.ai/v1",
    api_key=api_key,
)

## Get the data

Now that you've initialized an OpenAI-compatible client pointing to kluster.ai, we can talk about the data.

This notebook uses an extract of the Top 1000 IMDb Movies dataset, which contains descriptions and genres for each movie. In some cases, a movie can have more than one genre. When calculating the accuracy, we'll consider a prediction correct if the predicted genre matches at least one of the genres listed in the dataset (the ground truth). This ground truth allows the notebook to calculate the accuracy and measure how well a given LLM has performed.
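The matching rule can be expressed in a few lines of Python. This is a minimal sketch: the helper name `matches_ground_truth` is hypothetical, and it assumes the `Genre` field is a comma-separated string such as `"Crime, Drama"`.

```python
def matches_ground_truth(predicted: str, genre_field: str) -> bool:
    """Return True if the predicted genre is one of the labels in genre_field."""
    # Split the comma-separated labels and drop surrounding whitespace
    labels = [label.strip() for label in genre_field.split(",")]
    return predicted.strip() in labels


print(matches_ground_truth("Crime", "Crime, Drama"))   # True
print(matches_ground_truth("Action", "Crime, Drama"))  # False
```

Comparing against the split list rather than the raw string avoids partial matches, for example `Music` being accepted for a movie labeled only `Musical`.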


In [7]:
# IMDB Top 1000 dataset:
url = "https://raw.githubusercontent.com/kluster-ai/klusterai-cookbook/refs/heads/main/data/imdb_top_1000.csv"
urllib.request.urlretrieve(url,filename='imdb_top_1000.csv')

# Load and process the dataset based on URL content
df = pd.read_csv('imdb_top_1000.csv', usecols=['Series_Title', 'Overview', 'Genre'])
df.head(3)

| | Series_Title | Genre | Overview |
|---|---|---|---|
| 0 | The Shawshank Redemption | Drama | Two imprisoned men bond over a number of years... |
| 1 | The Godfather | Crime, Drama | An organized crime dynasty's aging patriarch t... |
| 2 | The Dark Knight | Action, Crime, Drama | When the menace known as the Joker wreaks havo... |


## Perform batch inference

To execute the batch inference job, we'll create the following functions:

1. **Create the batch job file** - we'll generate a JSON Lines file with the requests to be processed by the model. Since there are two models, we'll create one file for the base model and one for the judge model. You can also work with a single file by specifying the model in each request
2. **Upload the batch job file** - once it is ready, we'll upload it to the <a href="https://platform.kluster.ai/signup" target="_blank">kluster.ai platform</a> using the API, where it will be processed. We'll receive a unique ID associated with our file
3. **Start the batch job** - after the file is uploaded, we'll initiate the job to process the uploaded data, using the file ID obtained before
4. **Monitor job progress** - (optional) track the status of the batch job to ensure it has been successfully completed
5. **Retrieve results** - once the job has completed execution, we can access and process the resultant data

Next, we will run the functions for the base model and feed the results to the pipeline using the judge model.

This notebook is prepared for you to follow along. Run the cells below to watch it all come together.

### Create the batch job file

The following snippets prepare the JSONL file, where each line represents a different request. The functions are written so they can be reused for both the base and judge models.

Note that each separate batch request can have its own model. Also, we are using a temperature of `0.5`; feel free to change it and compare the outcomes, keeping in mind that the model is only asked to respond with a single word, the genre.

In [21]:
# Ensure the directory exists
os.makedirs("llm_as_judge", exist_ok=True)

# Create the batch job file with the prompt and content for the model
def create_batch_file(data, model, system_prompt):
    batch_requests = []
    
    for index, row in data.iterrows():
        content = row["Overview"]

        request = {
            "custom_id": f"{model}-{index}-analysis",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "temperature": 0.5,
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": content},
                ],
            },
        }
        
        batch_requests.append(request)

    return batch_requests


# Save file
def save_batch_file(batch_requests, model):
    filename = f"llm_as_judge/batch_job_{model}_request.jsonl"
    with open(filename, "w") as file:
        for request in batch_requests:
            file.write(json.dumps(request) + "\n")
    return filename
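Before uploading, it can help to confirm that the file is valid JSON Lines. The following sketch is not part of the original pipeline: `validate_jsonl` is a hypothetical helper that assumes every request needs the `custom_id`, `method`, `url`, and `body` keys used above and that `custom_id` values should be unique.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"custom_id", "method", "url", "body"}  # keys used in this tutorial's requests


def validate_jsonl(path):
    """Return the number of valid requests, raising ValueError on the first problem."""
    seen_ids = set()
    with open(path, "r", encoding="utf-8") as file:
        for line_number, line in enumerate(file, start=1):
            # json.JSONDecodeError is a subclass of ValueError
            request = json.loads(line)
            missing = REQUIRED_KEYS - request.keys()
            if missing:
                raise ValueError(f"Line {line_number} is missing keys: {sorted(missing)}")
            if request["custom_id"] in seen_ids:
                raise ValueError(f"Line {line_number} repeats custom_id {request['custom_id']!r}")
            seen_ids.add(request["custom_id"])
    return len(seen_ids)


# Usage with a throwaway two-line file
sample = [
    {"custom_id": "demo-0", "method": "POST", "url": "/v1/chat/completions", "body": {}},
    {"custom_id": "demo-1", "method": "POST", "url": "/v1/chat/completions", "body": {}},
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    for request in sample:
        tmp.write(json.dumps(request) + "\n")
print(validate_jsonl(tmp.name))  # 2
os.remove(tmp.name)
```

In the notebook, the same check could be run on the path returned by `save_batch_file()` before calling the upload step.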


### Upload batch job file to kluster.ai

Once we've prepared our input file, it's time to upload it to the kluster.ai platform. To do so, you can use the `files.create` endpoint of the client, where the purpose is set to `batch`. This will return the file ID, which we need for the next steps.

In [11]:
def upload_batch_file(data_dir):
  print(f"Creating request for {data_dir}")
  
  with open(data_dir, 'rb') as file:
    upload_response = client.files.create(
      file=file,
      purpose="batch"
    )

  # Print file ID
  file_id = upload_response.id
  print(f"File uploaded successfully. File ID: {file_id}")

  return upload_response

### Start the job

Once all the files have been successfully uploaded, we're ready to start (create) the batch jobs by providing the file ID. To start each job, we use the `batches.create` method, for which we need to set the endpoint to `/v1/chat/completions`. This returns the details of each batch job, including its ID.

In [12]:
# Create batch job with completions endpoint
def create_batch_job(file_id):
  batch_job = client.batches.create(
    input_file_id=file_id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
  )

  print(f"Batch job created with ID {batch_job.id}")
  return batch_job

### Check job progress

Once your batch jobs have been created, you can track their progress.

To monitor the job's progress, we can use the `batches.retrieve` method and pass the batch job ID. The response contains a `status` field that indicates whether the job has completed, along with counts of completed and total requests. We can repeat this process for every batch job ID obtained in the previous step.

The following snippet checks the status of a batch job every 10 seconds until it completes or ends in a failed, cancelled, or expired state.

In [34]:
def monitor_batch_job(job):
    finished = False

    # Loop until the job reaches a terminal state
    while not finished:
        updated_job = client.batches.retrieve(job.id)
        status = updated_job.status

        # If job is completed
        if status == "completed":
            message = "Job completed!"
            finished = True
        # If job failed, cancelled or expired
        elif status in ["failed", "cancelled", "expired"]:
            message = f"Job ended with status: {status}"
            finished = True
        # If job is ongoing
        else:
            # Use separate names for the counts so the loop flag is not overwritten
            done = updated_job.request_counts.completed
            total = updated_job.request_counts.total
            message = f"Job status: {status} - Progress: {done}/{total}"

        # Clear the previous output and display the updated status
        clear_output(wait=True)
        display(message)

        # Check every 10 seconds
        if not finished:
            time.sleep(10)


## Get the results

When the batch job is completed, we'll retrieve the results and parse the response generated for each request. To fetch the results from the platform, you must retrieve the `output_file_id` from the batch job and then use the `files.content` endpoint, providing that specific file ID. We will repeat this for every batch job ID. Note that the job status must be `completed` to retrieve the results!

In [16]:
#Parse results as a JSON object
def parse_json_objects(data_string):
    if isinstance(data_string, bytes):
        data_string = data_string.decode('utf-8')

    json_strings = data_string.strip().split('\n')
    json_objects = []

    for json_str in json_strings:
        try:
            json_obj = json.loads(json_str)
            json_objects.append(json_obj)
        except json.JSONDecodeError as e:
            print(f"Error parsing JSON: {e}")

    return json_objects

# Retrieve results with job ID
def retrieve_results(batch_job):
    job = client.batches.retrieve(batch_job.id)
    result_file_id = job.output_file_id
    result = client.files.content(result_file_id).content

    # Parse JSON results
    return parse_json_objects(result)

Now that the basic inference pipeline has been established, let's run it first for the base model.

## Batch inference for the base model

This example uses Llama 3.1 8B as the base model. If you'd like to test different models, feel free to modify the scripts accordingly.

Please refer to the <a href="/get-started/models/#model-comparison-table" target="_blank">Supported models</a> section for a list of the models we support.

For the base model, the prompt is similar to that of the <a href="/tutorials/klusterai-api/text-classification/text-classification-openai-api/" target="_blank">text classification notebook</a>: we ask the model to classify each movie's genre based on a description and provide a specific set of options as possible genres.

In [33]:
# System prompt
SYSTEM_PROMPT_BASE = """
    You are a helpful assistant who classifies movie genres based on the provided description. Choose one of the following options: 
    Action, Adventure, Animation, Biography, Comedy, Crime, Drama, Family, Fantasy, Film-Noir, History, Horror, Music, Musical, Mystery, Romance, Sci-Fi, Sport, Thriller, War, Western.
    Provide your response as a single word with the matching genre. Don't include punctuation.
    """

# Model
model_name = "Llama3.1-8B"
model = "klusterai/Meta-Llama-3.1-8B-Instruct-Turbo"

# Create batch file
batch_request = create_batch_file(df, model, SYSTEM_PROMPT_BASE)
filename = save_batch_file(batch_request, model_name)
print(f"Batch file created {filename}")

# Upload batch file
batch_file = upload_batch_file(filename)

# Create batch job
batch_job = create_batch_job(batch_file.id)

# Monitor batch job
monitor_batch_job(batch_job)


'Job status: in_progress - Progress: 194/1000'

Next, we can preview what one of the requests in the batch job file looks like:

In [27]:
!head -n 1 $filename

{"custom_id": "klusterai/Meta-Llama-3.1-8B-Instruct-Turbo-0-analysis", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "klusterai/Meta-Llama-3.1-8B-Instruct-Turbo", "temperature": 0.5, "messages": [{"role": "system", "content": "\n    You are a helpful assistant who classifies movie genres based on the provided description. Choose one of the following options: \n    Action, Adventure, Animation, Biography, Comedy, Crime, Drama, Family, Fantasy, Film-Noir, History, Horror, Music, Musical, Mystery, Romance, Sci-Fi, Sport, Thriller, War, Western.\n    Provide your response as a single word with the matching genre. Don't include punctuation.\n    "}, {"role": "user", "content": "Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency."}]}}


This example uses Llama 3.1 8B as the base model and the larger Llama 3.3 70B as the judge model (the artificial ground truth). If you'd like to test different models, feel free to modify the scripts accordingly.

For the judge model, we must be very specific about the task to be executed, providing unambiguous guidelines on what constitutes a correct and an incorrect prediction by the base model. For example, the prompt must also cover cases in which the base model offers a response that is not formatted correctly. You might need to tune each prompt to ensure the judge model accurately measures the base model's response.

In [9]:


# Judge model
model = "klusterai/Meta-Llama-3.3-70B-Instruct-Turbo"

SYSTEM_PROMPT_JUDGE = """
    You will receive a movie description, a list of possible genres, and a predicted movie genre made by another LLM.
    Your task is to evaluate whether the predicted genre is ‘correct’ or ‘incorrect’ based on the following steps and requirements.
    
    Steps to Follow:
    1. Carefully read the movie description.
    2. Determine your own classification of the genre for the movie. Do not rely on the LLM's answer since it may be incorrect. Do not rely on individual words to identify the genre; read the whole description to identify the genre.
    3. Read the LLM answer (enclosed in double quotes) and evaluate if it is correct by following the Evaluation Criteria below.
    4. Provide your evaluation as 'correct' or 'incorrect'.
    
    Evaluation Criteria:
    - Ensure the LLM answer (enclosed in double quotes) is one of the provided genres. If it is not listed, the evaluation should be ‘incorrect’.
    - If the LLM answer (enclosed in double quotes) does not align with the movie description, the evaluation should be ‘incorrect’.
    - The first letter of the LLM answer (enclosed in double quotes) must be capitalized (e.g., Drama). If it has any other capitalization, the evaluation should be ‘incorrect’.
    - All other letters in the LLM answer (enclosed in double quotes) must be lowercase. Otherwise, the evaluation should be ‘incorrect’.
    - If the LLM answer consists of multiple words, the evaluation should be ‘incorrect’.
    - If the LLM answer includes punctuation, spaces, or additional characters, the evaluation should be ‘incorrect’.
    
    Output Rules:
    - Provide the evaluation with no additional text, punctuation, or explanation.
    - The output should be in lowercase.
    
    Final Answer Format:
    evaluation
    
    Example:
    correct
    """
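Some of the formatting criteria in the judge prompt can also be checked deterministically before the judge is called, which leaves the judge to focus on whether the genre fits the description. The sketch below is an optional addition, not part of the original notebook. `is_well_formed` is a hypothetical helper, and it only checks membership in the genre list, so hyphenated labels such as `Sci-Fi` pass.

```python
# The 21 genres offered to the base model in its system prompt
GENRES = {
    "Action", "Adventure", "Animation", "Biography", "Comedy", "Crime", "Drama",
    "Family", "Fantasy", "Film-Noir", "History", "Horror", "Music", "Musical",
    "Mystery", "Romance", "Sci-Fi", "Sport", "Thriller", "War", "Western",
}


def is_well_formed(answer: str) -> bool:
    """True only if the answer is exactly one of the listed genres (case-sensitive, no extra characters)."""
    return answer in GENRES


for candidate in ["Drama", "drama", "Drama.", "Crime Drama", "Sci-Fi"]:
    print(candidate, is_well_formed(candidate))
```

Answers that fail this check could be marked `incorrect` directly, without spending a judge request on them.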


## Build our evaluation pipeline

In this section, we'll create several utility functions that will help us:

1. Prepare our data for batch processing
2. Send requests to the kluster.ai API
3. Monitor the progress of our evaluation
4. Collect and analyze results

These functions will make our evaluation process more efficient and organized. Let's go through each one and understand its purpose.

1. **`create_tasks()`** - formats our data for the API
2. **`save_tasks()`** - prepares batch files for processing
3. **`monitor_job_status()`** - tracks evaluation progress
4. **`get_results()`** - collects and processes model outputs

### Create and manage batch files

A batch file in our context is a collection of requests that we'll send to our models for evaluation. Think of it as an organized list of tasks we want our models to complete.

We'll take the following steps to create batch files:

1. **Creating tasks** - we'll convert each movie description into a format LLMs can process
2. **Organizing data** - we'll add necessary metadata and instructions for each task
3. **Saving files** - we'll store these tasks in a structured format (JSONL) for processing

Let's break down the key components of our batch file creation:
- **`custom_id`** - helps us track individual requests
- **`system_prompt`** - provides instructions to the model
- **`content`** - the actual text we want to classify

This structured approach allows us to efficiently process multiple requests in parallel.

In [5]:
def create_tasks(user_contents, system_prompt, task_type, model):
    tasks = []
    for index, user_content in enumerate(user_contents):
        task = {
            "custom_id": f"{task_type}-{index}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "temperature": 0,
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_content},
                ],
            }
        }
        tasks.append(task)
    return tasks

def save_tasks(tasks, task_type):
    filename = f"batch_tasks_{task_type}.jsonl"
    with open(filename, 'w') as file:
        for task in tasks:
            file.write(json.dumps(task) + '\n')
    return filename

### Upload files to kluster.ai

Now that we've prepared our batch files, we'll upload them to the <a href="https://platform.kluster.ai/" target="_blank">kluster.ai platform</a> for batch inference. This step is crucial for:

1. Getting our data to the models
2. Setting up the processing queue
3. Preparing for inference

Once the upload is complete, the following actions will take place:

1. The platform queues our requests
2. Models process them efficiently
3. Results are made available for collection

In [6]:
def create_batch_job(file_name):
    print(f"Creating batch job for {file_name}")
    # Open the file in a context manager so the handle is closed after the upload
    with open(file_name, "rb") as file:
        batch_file = client.files.create(
            file=file,
            purpose="batch"
        )

    batch_job = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h"
    )

    return batch_job

### Check job progress

This function provides real-time monitoring of batch job progress:

- Continuously checks job status via the kluster.ai API
- Displays current completion count (completed/total requests)
- Updates status every 10 seconds until job is finished
- Automatically clears previous output for clean progress tracking

In [7]:
def parse_json_objects(data_string):
    if isinstance(data_string, bytes):
        data_string = data_string.decode('utf-8')

    json_strings = data_string.strip().split('\n')
    json_objects = []

    for json_str in json_strings:
        try:
            json_obj = json.loads(json_str)
            json_objects.append(json_obj)
        except json.JSONDecodeError as e:
            print(f"Error parsing JSON: {e}")

    return json_objects

def monitor_job_status(client, job_id, task_type):
    all_completed = False

    while not all_completed:
        all_completed = True
        output_lines = []

        updated_job = client.batches.retrieve(job_id)
        status = updated_job.status.lower()

        if status == "completed":
            output_lines.append(f"{task_type.capitalize()} job completed!")
        elif status in ("failed", "cancelled", "expired"):
            # Stop polling on terminal failure states instead of looping forever
            output_lines.append(f"{task_type.capitalize()} job ended with status: {updated_job.status}")
        else:
            all_completed = False
            completed = updated_job.request_counts.completed
            total = updated_job.request_counts.total
            output_lines.append(f"{task_type.capitalize()} job status: {updated_job.status} - Progress: {completed}/{total}")

        # Clear the output and display updated status
        clear_output(wait=True)
        for line in output_lines:
            display(line)

        if not all_completed:
            time.sleep(10)
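One possible variation on this polling pattern adds a timeout guard and separates the loop from the API call so it can be exercised offline. This is a sketch under stated assumptions, not the notebook's implementation: `poll_until_terminal` and its parameters are hypothetical, and the terminal states are the ones named earlier in this tutorial (`completed`, `failed`, `cancelled`, `expired`).

```python
import time

TERMINAL_STATES = {"completed", "failed", "cancelled", "expired"}


def poll_until_terminal(fetch_status, interval=10.0, timeout=3600.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it returns a terminal state or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        status = fetch_status().lower()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Job still '{status}' after {timeout} seconds")
        sleep(interval)


# Offline usage: a fake status source stands in for the API call
fake_statuses = iter(["validating", "in_progress", "completed"])
print(poll_until_terminal(lambda: next(fake_statuses), sleep=lambda _: None))  # completed
```

With the real client, `fetch_status` could be `lambda: client.batches.retrieve(job.id).status`, reusing the call shown above.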

### Collect and process results

The `get_results()` function below does the following:

1. Retrieves the completed batch job results
2. Extracts the model's response content from each result
3. Returns a list of all model responses

In [8]:
def get_results(client, job_id):
    batch_job = client.batches.retrieve(job_id)
    result_file_id = batch_job.output_file_id
    result = client.files.content(result_file_id).content
    results = parse_json_objects(result)
    answers = []
    
    for res in results:
        result = res['response']['body']['choices'][0]['message']['content']
        answers.append(result)
    
    return answers
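`get_results()` returns answers in the order they appear in the output file, and the calling code assigns them to the DataFrame by position. If the output order ever differs from the input order (the OpenAI Batch API documentation notes that it may; whether the same applies to this platform is an assumption), the answers can be re-sorted using the index embedded in each `custom_id`. The sketch below uses a hypothetical helper, `answers_in_request_order`, and assumes the `{task_type}-{index}` naming from `create_tasks()` and the response layout read by `get_results()`.

```python
def answers_in_request_order(results):
    """Sort parsed batch results by the integer suffix of custom_id and return the answers."""
    def request_index(result):
        # custom_id looks like "assistant-12"; take the part after the last hyphen
        return int(result["custom_id"].rsplit("-", 1)[1])

    ordered = sorted(results, key=request_index)
    return [r["response"]["body"]["choices"][0]["message"]["content"] for r in ordered]


def fake_result(custom_id, content):
    # Minimal stand-in for one parsed line of the output file
    return {"custom_id": custom_id,
            "response": {"body": {"choices": [{"message": {"content": content}}]}}}


shuffled = [
    fake_result("assistant-2", "War"),
    fake_result("assistant-0", "Drama"),
    fake_result("assistant-1", "Crime"),
]
print(answers_in_request_order(shuffled))  # ['Drama', 'Crime', 'War']
```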

## Data acquisition

Now that we have covered the core functions and workflow used for batch inference, let's turn to the data. In this guide, we'll be using the IMDb Top 1000 dataset, which contains information about top-rated movies, including their descriptions and genres. Let's download it, keep the last 300 rows, and see what it looks like.

In [9]:
# IMDB Top 1000 dataset:
url = "https://raw.githubusercontent.com/kluster-ai/klusterai-cookbook/refs/heads/main/data/imdb_top_1000.csv"
urllib.request.urlretrieve(url,filename='imdb_top_1000.csv')

# Load and process the dataset based on URL content
df = pd.read_csv('imdb_top_1000.csv', usecols=['Series_Title', 'Overview', 'Genre']).tail(300)
df[['Series_Title','Overview']].head(3)

| | Series_Title | Overview |
|---|---|---|
| 700 | Wait Until Dark | A recently blinded woman is terrorized by a trio of thugs while they search for a heroin-stuffed doll they believe is in her apartment. |
| 701 | Guess Who's Coming to Dinner | A couple's attitudes are challenged when their daughter introduces them to her African-American fiancé. |
| 702 | Bonnie and Clyde | Bored waitress Bonnie Parker falls in love with an ex-con named Clyde Barrow and together they start a violent crime spree through the country, stealing cars and robbing banks. |


## Performing batch inference

In this section, we will perform batch inference using the previously defined helper functions and the IMDb dataset. The goal is to classify movie genres based on their descriptions using a Large Language Model (LLM).

We define the input prompts for the LLM, which consist of a system prompt outlining the task and user content, which includes a list of movie descriptions from our dataset.

In [10]:
prompt_dict = {
    "ASSISTANT_PROMPT" : '''
        You are a helpful assistant that classifies movie genres based on the movie description. Choose one of the following options: 
        Action, Adventure, Animation, Biography, Comedy, Crime, Drama, Family, Fantasy, Film-Noir, History, Horror, Music, Musical, Mystery, Romance, Sci-Fi, Sport, Thriller, War, Western.
        Provide your response as a single word with the matching genre. Don't include punctuation.
    ''',
    "USER_CONTENTS" : df['Overview'].tolist()
}

Next, we'll create and save the tasks, submit the batch inference job, and monitor its progress. Once the process is complete, the predictions will be integrated into the dataset.

In [11]:
task_list = create_tasks(user_contents=prompt_dict["USER_CONTENTS"], 
                         system_prompt=prompt_dict["ASSISTANT_PROMPT"], 
                         model="klusterai/Meta-Llama-3.1-8B-Instruct-Turbo", 
                         task_type='assistant')
filename = save_tasks(task_list, task_type='assistant')
job = create_batch_job(filename)
monitor_job_status(client=client, job_id=job.id, task_type='assistant')
df['predicted_genre'] = get_results(client=client, job_id=job.id)

'Assistant job completed!'

## LLM as a judge

This section evaluates the performance of the initial LLM predictions. We use another LLM as a judge to assess whether the predicted genres align with the movie descriptions.

First, we define the input prompts for the LLM judge. These prompts include the movie description, a list of possible genres, and the genre predicted by the first LLM. The judge LLM evaluates the correctness of the predictions based on specific criteria.

In [12]:
prompt_dict = {
    "JUDGE_PROMPT" : '''
        You will be provided with a movie description, a list of possible genres, and a predicted movie genre made by another LLM. Your task is to evaluate whether the predicted genre is ‘correct’ or ‘incorrect’ based on the following steps and requirements.
        
        Steps to Follow:
        1. Carefully read the movie description.
        2. Determine your own classification of the genre for the movie. Do not rely on the LLM's answer since it may be incorrect. Do not rely on individual words to identify the genre; read the whole description to identify the genre.
        3. Read the LLM answer (enclosed in double quotes) and evaluate if it is the correct answer by following the Evaluation Criteria mentioned below.
        4. Provide your evaluation as 'correct' or 'incorrect'.
        
        Evaluation Criteria:
        - Ensure the LLM answer (enclosed in double quotes) is one of the provided genres. If it is not listed, the evaluation should be ‘incorrect’.
        - If the LLM answer (enclosed in double quotes) does not align with the movie description, the evaluation should be ‘incorrect’.
        - The first letter of the LLM answer (enclosed in double quotes) must be capitalized (e.g., Drama). If it has any other capitalization, the evaluation should be ‘incorrect’.
        - All other letters in the LLM answer (enclosed in double quotes) must be lowercase. Otherwise, the evaluation should be ‘incorrect’.
        - If the LLM answer consists of multiple words, the evaluation should be ‘incorrect’.
        - If the LLM answer includes punctuation, spaces, or additional characters, the evaluation should be ‘incorrect’.
        
        Output Rules:
        - Provide the evaluation with no additional text, punctuation, or explanation.
        - The output should be in lowercase.
        
        Final Answer Format:
        evaluation
        
        Example:
        correct
    ''',
    "USER_CONTENTS" : [f'''Movie Description: {row['Overview']}.
        Available Genres: Action, Adventure, Animation, Biography, Comedy, Crime, Drama, Family, Fantasy, Film-Noir, History, Horror, Music, Musical, Mystery, Romance, Sci-Fi, Sport, Thriller, War, Western
        LLM answer: "{row['predicted_genre']}"
        ''' for _, row in df.iterrows()
        ]
}

Following the same set of steps as the previous inference, we will create and save the tasks, submit the batch inference job, and monitor its progress. Once the process is complete, the predictions will also be integrated into the dataset.

In [13]:
task_list = create_tasks(user_contents=prompt_dict["USER_CONTENTS"], 
                         system_prompt=prompt_dict["JUDGE_PROMPT"], 
                         task_type='judge', 
                         model="klusterai/Meta-Llama-3.3-70B-Instruct-Turbo")
filename = save_tasks(task_list, task_type='judge')
job = create_batch_job(filename)
monitor_job_status(client=client, job_id=job.id, task_type='judge')
df['judge_evaluation'] = get_results(client=client, job_id=job.id)

'Judge job completed!'

Now, we will calculate the base model's classification accuracy according to the LLM judge, that is, the share of predictions the judge marked as `correct`. If you are unfamiliar with accuracy metrics, please refer to our previous <a href="https://github.com/kluster-ai/klusterai-cookbook/blob/main/examples/model-comparison.ipynb" target="_blank">notebook</a>.

In [14]:
print('LLM Judge-determined accuracy: ',df['judge_evaluation'].value_counts(normalize=True)['correct'])

LLM Judge-determined accuracy:  0.86
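The one-liner above counts only outputs that are exactly `correct`. Judge responses may carry stray whitespace, capitalization, or punctuation, so a normalization step can make the count more robust. A minimal sketch, assuming plain Python lists rather than the DataFrame column; `normalize_verdict` and `judge_accuracy` are hypothetical names:

```python
def normalize_verdict(raw: str) -> str:
    """Map a raw judge response to 'correct', 'incorrect', or 'invalid'."""
    # Trim whitespace, then surrounding periods and quotes, then lowercase
    cleaned = raw.strip().strip(".'\"").lower()
    return cleaned if cleaned in {"correct", "incorrect"} else "invalid"


def judge_accuracy(raw_verdicts):
    """Share of responses judged 'correct'; invalid responses count as not correct."""
    verdicts = [normalize_verdict(v) for v in raw_verdicts]
    return verdicts.count("correct") / len(verdicts)


print(judge_accuracy(["correct", " Correct\n", "incorrect", "correct."]))  # 0.75
```

In the notebook, this could be applied as `judge_accuracy(df['judge_evaluation'].tolist())`.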


## Conclusion

According to the LLM judge, the baseline model's accuracy was 86%. This demonstrates how, in situations where we lack ground truth, we can leverage a large language model to evaluate the responses of another model. By doing so, we can establish a ground truth or an evaluation metric to assess model performance, refine prompts, or understand how well the model performs.

This approach is particularly valuable when dealing with large datasets containing thousands of entries, where manual evaluation would be impractical. Automating this process saves significant time and reduces costs by eliminating the need for extensive human annotations. Ultimately, it provides a scalable and efficient way to gain meaningful insights into model performance.

### (Optional) Validation against ground truth

According to the LLM judge, the baseline model's accuracy is 86%. But how accurate is this evaluation? In this particular case, the IMDb Top 1000 dataset provides ground truth labels, allowing us to calculate the accuracy of the predicted genres directly. Let's compare and see how close the results are.

In [15]:
print('LLM ground truth accuracy: ',df.apply(lambda row: row['predicted_genre'] in row['Genre'].split(', '), axis=1).mean())

LLM ground truth accuracy:  0.7833333333333333


Although the ground truth accuracy (about 78%) is not identical to the LLM judge's estimate (86%), in situations where we lack ground truth, using an LLM as an evaluator offers a valuable way to assess how well our baseline model is performing.
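Beyond comparing the two headline numbers, it can be informative to check how often the judge and the ground truth agree on individual rows. The sketch below is an illustration on made-up values, not results from this notebook. `agreement_summary` is a hypothetical helper that takes two parallel lists of booleans.

```python
from collections import Counter


def agreement_summary(judge_says_correct, truth_says_correct):
    """Count the four judge/ground-truth outcomes and the overall agreement rate."""
    pairs = Counter(zip(judge_says_correct, truth_says_correct))
    total = sum(pairs.values())
    return {
        "both_correct": pairs[(True, True)],
        "judge_only": pairs[(True, False)],    # judge accepted, labels disagree
        "truth_only": pairs[(False, True)],    # judge rejected a prediction the labels accept
        "both_incorrect": pairs[(False, False)],
        "agreement": (pairs[(True, True)] + pairs[(False, False)]) / total,
    }


# Made-up example values
judge = [True, True, False, True, False]
truth = [True, False, False, True, True]
print(agreement_summary(judge, truth))
```

With the DataFrame from this notebook, the inputs could be `(df['judge_evaluation'] == 'correct').tolist()` and the boolean values computed in the ground-truth cell above.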