# Evaluating LLM performance without ground truth using an LLM judge

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kluster-ai/klusterai-cookbook/blob/main/examples/prompt-engineering.ipynb)

In our previous notebook, we explored how to select the best model for a classification task by calculating each model's accuracy against ground truth labels. In real-life applications, though, ground truth is not always available, and creating it may depend on human annotation, which is time-consuming and costly, especially when the model is making predictions over large volumes of data.

In this notebook, we'll use an LLM as a judge to evaluate another LLM's answers on a classification task, giving us a baseline for understanding how the base model is performing.

Using the IMDb Top 1000 dataset, we will ask the base model to classify each movie's genre based on its description. Then we will ask a different model to evaluate the first model's answer and decide whether it is correct. Because the IMDb Top 1000 dataset does provide ground truth, we'll also be able to check how well our judge performs against the dataset's original labels.

## Config

Enter your personal kluster.ai API key (make sure it has no blank spaces). Remember to <a href="https://platform.kluster.ai/signup" target="_blank">sign up</a> if you don't have one yet.

In [1]:
from getpass import getpass
# Enter your personal kluster.ai API key (make sure it has no blank spaces)
api_key = getpass("Enter your kluster.ai API key: ")

Enter your kluster.ai API key:  ········


## Setup

In [2]:
%pip install -q openai

Note: you may need to restart the kernel to use updated packages.


In [3]:
import os
import urllib.request
import pandas as pd
import numpy as np
import random
import requests
from openai import OpenAI
import time
import json
from IPython.display import clear_output, display
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, accuracy_score

pd.set_option('display.max_columns', 1000, 'display.width', 1000, 'display.max_rows',1000, 'display.max_colwidth', 500)

In [5]:
# Set up the client
client = OpenAI(
    base_url="https://api.kluster.ai/v1",
    api_key=api_key,
)

## 1. Perform a classification task

In [6]:
# IMDB Top 1000 dataset:
url = "https://raw.githubusercontent.com/kluster-ai/klusterai-cookbook/refs/heads/main/data/imdb_top_1000.csv"
urllib.request.urlretrieve(url,filename='imdb_top_1000.csv')

# Load only the columns we need and keep the last 300 rows of the dataset
df = pd.read_csv('imdb_top_1000.csv', usecols=['Series_Title', 'Overview', 'Genre']).tail(300)
df.head(3)

| | Series_Title | Genre | Overview |
|---|---|---|---|
| 700 | Wait Until Dark | Thriller | A recently blinded woman is terrorized by a trio of thugs while they search for a heroin-stuffed doll they believe is in her apartment. |
| 701 | Guess Who's Coming to Dinner | Comedy, Drama | A couple's attitudes are challenged when their daughter introduces them to her African-American fianc. |
| 702 | Bonnie and Clyde | Action, Biography, Crime | Bored waitress Bonnie Parker falls in love with an ex-con named Clyde Barrow and together they start a violent crime spree through the country, stealing cars and robbing banks. |


#### Create the Batch File

In [7]:
def create_tasks(df, task_type, system_prompt, model):
    tasks = []
    for index, row in df.iterrows():
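        if task_type == 'assistant':
            # The base model only sees the movie description
            content = row['Overview']
        elif task_type == 'judge':
            # The judge sees the description, the allowed genres, and the base model's answer
            content = f'''
            Movie Description: {row['Overview']}.
            Available Genres: Action, Adventure, Animation, Biography, Comedy, Crime, Drama, Family, Fantasy, Film-Noir, History, Horror, Music, Musical, Mystery, Romance, Sci-Fi, Sport, Thriller, War, Western
            LLM answer: "{row['predicted_genre']}"
            '''
        else:
            raise ValueError(f"Unknown task_type: {task_type!r}")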
        
        task = {
            "custom_id": f"{task_type}-{index}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "temperature": 0,
                "response_format": {"type": "json_object"},
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": content},
                ],
            }
        }
        tasks.append(task)
    return tasks

def save_tasks(tasks, task_type):
    filename = f"batch_tasks_{task_type}.jsonl"
    with open(filename, 'w') as file:
        for task in tasks:
            file.write(json.dumps(task) + '\n')
    return filename

In [20]:
ASSISTANT_PROMPT = '''
    You are a helpful assistant that classifies movie genres based on the movie description. Choose one of the following options:
    Action, Adventure, Animation, Biography, Comedy, Crime, Drama, Family, Fantasy, Film-Noir, History, Horror, Music, Musical, Mystery, Romance, Sci-Fi, Sport, Thriller, War, Western.
    Provide your response as a single word with the matching genre. Don't include punctuation.
    '''

task_list = create_tasks(df, system_prompt=ASSISTANT_PROMPT, model="klusterai/Meta-Llama-3.1-8B-Instruct-Turbo", task_type='assistant')
filename = save_tasks(task_list, task_type='assistant')
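
Before uploading, it can be worth sanity-checking the batch file, since each line is expected to be a standalone JSON request with its own `custom_id`. The following is a minimal sketch under stated assumptions: `validate_batch_lines` is a hypothetical helper (it is not part of the OpenAI or kluster.ai client), and the set of required keys is taken from the task structure built in `create_tasks` above.

```python
import json

REQUIRED_KEYS = ("custom_id", "method", "url", "body")  # assumed from create_tasks

def validate_batch_lines(lines):
    """Return a list of human-readable problems found in JSONL batch lines."""
    problems = []
    seen_ids = set()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            task = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(task, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        missing = [key for key in REQUIRED_KEYS if key not in task]
        if missing:
            problems.append(f"line {number}: missing keys {missing}")
            continue
        if task["custom_id"] in seen_ids:
            problems.append(f"line {number}: duplicate custom_id {task['custom_id']!r}")
        seen_ids.add(task["custom_id"])
    return problems

# Illustrative input: one valid line, a duplicate of it, and a malformed line
good = json.dumps({"custom_id": "assistant-700", "method": "POST",
                   "url": "/v1/chat/completions", "body": {}})
sample_lines = [good, good, "{not json"]
problems = validate_batch_lines(sample_lines)
for problem in problems:
    print(problem)
```

In the notebook, the same helper could be called on the file written by `save_tasks`, for example by opening `filename` in a `with` block and passing the file object, since iterating over a text file yields its lines.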

#### Upload Batch File to kluster.ai

In [21]:
def create_batch_job(file_name):
    print(f"Creating batch job for {file_name}")
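    # Use a context manager so the file handle is closed after the upload
    with open(file_name, "rb") as batch_input:
        batch_file = client.files.create(
            file=batch_input,
            purpose="batch"
        )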

    batch_job = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h"
    )

    return batch_job

job = create_batch_job(filename)

Creating batch job for batch_tasks_assistant.jsonl


#### Check Job progress

In [22]:
def parse_json_objects(data_string):
    if isinstance(data_string, bytes):
        data_string = data_string.decode('utf-8')

    json_strings = data_string.strip().split('\n')
    json_objects = []

    for json_str in json_strings:
        try:
            json_obj = json.loads(json_str)
            json_objects.append(json_obj)
        except json.JSONDecodeError as e:
            print(f"Error parsing JSON: {e}")

    return json_objects

def monitor_job_status(client, job_id, task_type):
    # Statuses after which the job will never reach "completed"
    # (assumed to follow the OpenAI Batch API status names)
    failed_states = {"failed", "expired", "cancelled"}

    while True:
        updated_job = client.batches.retrieve(job_id)
        status = updated_job.status.lower()

        if status == "completed":
            message = f"{task_type.capitalize()} job completed!"
        else:
            completed = updated_job.request_counts.completed
            total = updated_job.request_counts.total
            message = f"{task_type.capitalize()} job status: {updated_job.status} - Progress: {completed}/{total}"

        # Clear the output and display updated status
        clear_output(wait=True)
        display(message)

        if status == "completed":
            return
        if status in failed_states:
            # Stop polling instead of looping forever on a job that cannot finish
            raise RuntimeError(f"{task_type.capitalize()} job ended with status '{status}'")

        time.sleep(10)

monitor_job_status(client=client, job_id=job.id, task_type='assistant')

'Assistant job completed!'

#### Get the results

In [23]:
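batch_job = client.batches.retrieve(job.id)
result_file_id = batch_job.output_file_id
raw_results = client.files.content(result_file_id).content
results = parse_json_objects(raw_results)

# Key each answer by the DataFrame index stored in custom_id ("assistant-<index>"),
# because batch output lines may not come back in the order they were submitted.
answers = {}
for res in results:
    row_index = int(res['custom_id'].split('-')[-1])
    answers[row_index] = res['response']['body']['choices'][0]['message']['content']

# Assigning a Series aligns the answers with the matching rows by index
df['predicted_genre'] = pd.Series(answers)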

## 2. LLM as a Judge

Now that the base model has produced a predicted genre for each movie, we'll ask a second, larger model to act as a judge and label each prediction as correct or incorrect.

In [24]:
JUDGE_PROMPT = '''
    You will be provided with a movie description, a list of possible genres, and a predicted movie genre made by another LLM. Your task is to evaluate whether the predicted genre is ‘correct’ or ‘incorrect’ based on the following steps and requirements.
    
    Steps to Follow:
    1. Carefully read the movie description.
    2. Determine your own classification of the genre for the movie. Do not rely on the LLM's answer since it may be incorrect. Do not rely on individual words to identify the genre; read the whole description to identify the genre.
    3. Read the LLM answer (enclosed in double quotes) and evaluate if it is the correct answer by following the Evaluation Criteria mentioned below.
    4. Provide your evaluation as 'correct' or 'incorrect'.
    
    Evaluation Criteria:
    - Ensure the LLM answer (enclosed in double quotes) is one of the provided genres. If it is not listed, the evaluation should be ‘incorrect’.
    - If the LLM answer (enclosed in double quotes) does not align with the movie description, the evaluation should be ‘incorrect’.
    - The first letter of the LLM answer (enclosed in double quotes) must be capitalized (e.g., Drama). If it has any other capitalization, the evaluation should be ‘incorrect’.
    - All other letters in the LLM answer (enclosed in double quotes) must be lowercase. Otherwise, the evaluation should be ‘incorrect’.
    - If the LLM answer consists of multiple words, the evaluation should be ‘incorrect’.
    - If the LLM answer includes punctuation, spaces, or additional characters, the evaluation should be ‘incorrect’.
    
    Output Rules:
    - Provide the evaluation with no additional text, punctuation, or explanation.
    - The output should be in lowercase.
    
    Final Answer Format:
    evaluation
    
    Example:
    correct
    '''

task_list = create_tasks(df, task_type='judge', system_prompt=JUDGE_PROMPT, model="klusterai/Meta-Llama-3.1-405B-Instruct-Turbo")
filename = save_tasks(task_list, task_type='judge')
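
Several of the evaluation criteria in the judge prompt are purely mechanical (listed genre, single word, capitalization, no punctuation), so they can also be checked in code as a cross-check on the judge's verdicts. The following is a minimal sketch: `format_violations` is a hypothetical helper that is not part of the notebook, and it applies the rules literally as the prompt words them.

```python
GENRES = [
    "Action", "Adventure", "Animation", "Biography", "Comedy", "Crime", "Drama",
    "Family", "Fantasy", "Film-Noir", "History", "Horror", "Music", "Musical",
    "Mystery", "Romance", "Sci-Fi", "Sport", "Thriller", "War", "Western",
]

def format_violations(answer, genres=GENRES):
    """List which of the prompt's mechanical rules an answer breaks."""
    violations = []
    if answer not in genres:
        violations.append("not a listed genre")
    if len(answer.split()) != 1 or answer != answer.strip():
        violations.append("not a single word")
    if not answer.isalpha():
        violations.append("contains non-letter characters")
    # str.capitalize() upper-cases the first character and lower-cases the rest
    if answer != answer.capitalize():
        violations.append("capitalization")
    return violations

for candidate in ["Drama", "drama", "Crime, Drama", "Sci-Fi"]:
    print(repr(candidate), format_violations(candidate))
```

Applied literally, the capitalization and punctuation rules also flag `Sci-Fi` (and would flag `Film-Noir`), even though both appear in the allowed list, which is worth keeping in mind when reading the judge's verdicts for those genres.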

#### Upload Batch File to kluster.ai

In [25]:
job = create_batch_job(filename)

Creating batch job for batch_tasks_judge.jsonl


#### Check job progress

In [26]:
monitor_job_status(client=client, job_id=job.id, task_type='judge')

'Judge job completed!'

#### Get the results

In [27]:
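batch_job = client.batches.retrieve(job.id)
result_file_id = batch_job.output_file_id
raw_results = client.files.content(result_file_id).content
results = parse_json_objects(raw_results)

# Key each verdict by the DataFrame index stored in custom_id ("judge-<index>"),
# because batch output lines may not come back in the order they were submitted.
evaluations = {}
for res in results:
    row_index = int(res['custom_id'].split('-')[-1])
    evaluations[row_index] = res['response']['body']['choices'][0]['message']['content']

# Assigning a Series aligns the verdicts with the matching rows by index
df['judge_evaluation'] = pd.Series(evaluations)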

print('LLM Judge-determined accuracy: ',df['judge_evaluation'].value_counts(normalize=True)['correct'])

LLM Judge-determined accuracy:  0.8666666666666667
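
The accuracy figure counts only verdicts that are exactly the string `correct`, so stray whitespace, capitalization, or extra text from the judge would be treated as a miss without any warning. Below is a minimal sketch of a more defensive tally; `tally_verdicts` is a hypothetical helper, and the sample verdicts are made up for illustration.

```python
from collections import Counter

def tally_verdicts(raw_verdicts):
    """Normalize judge outputs and count correct / incorrect / unparseable."""
    counts = Counter()
    for raw in raw_verdicts:
        # Drop surrounding whitespace, quotes, and trailing periods, then lowercase
        verdict = raw.strip().strip("\"'.").lower()
        if verdict not in ("correct", "incorrect"):
            verdict = "unparseable"
        counts[verdict] += 1
    total = sum(counts.values())
    accuracy = counts["correct"] / total if total else float("nan")
    return accuracy, counts

# Made-up verdicts: two usable "correct", one "incorrect", one free-text answer
accuracy, counts = tally_verdicts(
    ["correct", " Correct\n", "incorrect", "I think it is correct"]
)
print(accuracy, dict(counts))
```

In the notebook, the helper could be applied to `df['judge_evaluation']` directly. A non-zero `unparseable` count would suggest tightening the judge prompt before trusting the accuracy figure.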


## Conclusion

According to the LLM judge, the accuracy of the baseline model was about 87%. This demonstrates how, in situations where we lack ground truth, we can use a large language model to evaluate the responses of another model. Doing so gives us a proxy for ground truth, an evaluation metric that lets us assess model performance, refine prompts, and understand how well the model is performing overall.

This approach is particularly valuable when dealing with large datasets containing thousands of entries, where manual evaluation would be impractical. Automating this process not only saves significant time but also reduces costs by eliminating the need for extensive human annotations. Ultimately, it provides a scalable and efficient way to gain meaningful insights into model performance.

## (Optional) Validation against ground truth

According to the LLM judge, the accuracy of the baseline model is about 87%. But how accurate is this evaluation? In this particular case, the IMDb Top 1000 dataset provides ground truth labels, allowing us to directly calculate the accuracy of the predicted genres. Because a movie can have several genres, we count a prediction as correct if it matches any of them. Let's compare and see how close the results are.

In [30]:
print('LLM ground truth accuracy: ',df.apply(lambda row: row['predicted_genre'] in row['Genre'].split(', '), axis=1).mean())

LLM ground truth accuracy:  0.7866666666666666


The ground truth accuracy (about 79%) is lower than the judge's estimate (about 87%), so in aggregate the judge is somewhat more lenient than the dataset labels. Even so, in situations where we lack ground truth, using an LLM as an evaluator offers a practical way to estimate how well our baseline model is performing.
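
Comparing two aggregate accuracies can hide where the judge and the labels disagree, so a row-by-row comparison is a useful complement. The following is a minimal sketch: `judge_agreement` is a hypothetical helper, the rows are made up for illustration, and a prediction is treated as truly correct when it matches any of the comma-separated genres, mirroring the ground truth check above.

```python
def judge_agreement(rows):
    """rows: iterable of (predicted_genre, true_genres, judge_verdict) tuples."""
    counts = {
        "both_correct": 0,
        "both_incorrect": 0,
        "judge_too_lenient": 0,  # judge said correct, labels disagree
        "judge_too_strict": 0,   # judge said incorrect, labels say correct
    }
    for predicted, true_genres, verdict in rows:
        truly_correct = predicted in true_genres.split(", ")
        judged_correct = verdict.strip().lower() == "correct"
        if truly_correct and judged_correct:
            key = "both_correct"
        elif not truly_correct and not judged_correct:
            key = "both_incorrect"
        elif judged_correct:
            key = "judge_too_lenient"
        else:
            key = "judge_too_strict"
        counts[key] += 1
    total = sum(counts.values())
    agreed = counts["both_correct"] + counts["both_incorrect"]
    agreement = agreed / total if total else float("nan")
    return agreement, counts

# Made-up rows for illustration only (not results from the notebook)
toy_rows = [
    ("Thriller", "Thriller", "correct"),
    ("Drama", "Comedy, Drama", "correct"),
    ("Romance", "Action, Biography, Crime", "correct"),
    ("Crime", "Action, Biography, Crime", "incorrect"),
]
agreement, counts = judge_agreement(toy_rows)
print(agreement, counts)
```

In the notebook, the same helper could be fed with `zip(df['predicted_genre'], df['Genre'], df['judge_evaluation'])`. The split between the lenient and strict counts would then show which direction the judge's errors lean.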