# LLM as a Judge to Detect Wrong Answers with kluster.ai

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kluster-ai/klusterai-cookbook/blob/main/examples/llm-as-a-judge.ipynb)

Welcome to the LLM as a Judge Notebook!

Large language models (LLMs) are powerful tools for generating predictions, answering questions, and solving a variety of tasks. However, assessing the quality of their outputs often requires human judgment. When a batch API is used to generate large volumes of predictions, manually evaluating each response is challenging, if not impractical. This notebook shows how to use LLMs themselves as judges of LLM outputs through the <a href="https://kluster.ai/" target="_blank">kluster.ai</a> Batch API. Automating the evaluation gives insight into both the accuracy and the quality of model-generated responses.

This notebook guides you through evaluating LLM-generated answers using an LLM as a judge:
1. **Answer Initial Question:** An LLM creates four distinct answers to a single question: one correct response and three deliberately altered versions, each containing a different type of error.
2. **Evaluate Responses:** Submit these answers to an LLM acting as the judge, which scores each answer and briefly explains the score.


## Config

Enter your personal kluster.ai API key (make sure it has no blank spaces). Remember to <a href="https://platform.kluster.ai/signup" target="_blank">sign up</a> if you don't have one yet.

In [36]:
from getpass import getpass
# Enter your personal kluster.ai API key (make sure in advance it has no blank spaces)
api_key = getpass("Enter your kluster.ai API key: ")

Enter your kluster.ai API key:  ········


## Setup

In [37]:
%pip install openai

Note: you may need to restart the kernel to use updated packages.


In [38]:
import os
import urllib.request
import pandas as pd
import numpy as np
import random
import requests
from openai import OpenAI
import time
import json
from IPython.display import clear_output, display

pd.set_option('display.max_columns', 1000, 'display.width', 1000, 'display.max_rows',1000, 'display.max_colwidth', 500)

In [39]:
# Set up the client
client = OpenAI(
    base_url="https://api.kluster.ai/v1",
    api_key=api_key,
)

## 1. Answer a Simple Question

In this first part of the process, we use an LLM to generate answers to a simple question. The model is asked the same question four times: once as-is, and three times with an added instruction that deliberately degrades the answer.

This setup allows us to create a diverse range of answers for each question, intentionally varying their quality. These variations will serve as input for the next stage, where another model, acting as the judge, will evaluate the responses to determine which are good, which are flawed, and why.

In [50]:
df = pd.DataFrame({
    "question": [
        "Describe the three laws of motion formulated by Isaac Newton.",
        "Describe the three laws of motion formulated by Isaac Newton. Provide a correct but vague and incomplete answer.",
        "Describe the three laws of motion formulated by Isaac Newton. Provide the correct names, but mix them in order (don’t mention anything about this).",
        "Describe the three laws of motion formulated by Isaac Newton. Invent about 80% of the answer and include some real information. Don’t specify which parts are real or invented.",
    ]})
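
The four prompts share one base question and differ only in an appended instruction. A minimal sketch of building the same list programmatically follows; the names `BASE_QUESTION`, `MODIFIERS`, and `build_questions` are hypothetical and not part of the original notebook, which hard-codes the prompts.

```python
# Hypothetical helper names for illustration only.
BASE_QUESTION = "Describe the three laws of motion formulated by Isaac Newton."

MODIFIERS = [
    "",  # unmodified question: expected to yield the correct answer
    "Provide a correct but vague and incomplete answer.",
    "Provide the correct names, but mix them in order (don't mention anything about this).",
    "Invent about 80% of the answer and include some real information. "
    "Don't specify which parts are real or invented.",
]

def build_questions(base, modifiers):
    """Append each modifier to the base question; an empty modifier leaves it unchanged."""
    return [f"{base} {modifier}".strip() for modifier in modifiers]

questions = build_questions(BASE_QUESTION, MODIFIERS)
```

The resulting list could be passed to `pd.DataFrame({"question": questions})`. Keeping the base question separate also makes it easy to show the judge only the unmodified question later on.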

Now, using the <a href="https://kluster.ai/" target="_blank">kluster.ai</a> Batch API, we’ll create the batch tasks, write them to a batch file, and upload that file to kluster.ai. Once it is uploaded, we’ll wait for the job to complete and retrieve the results. This process generates the answers from the first model and prepares us for the evaluation step.

#### Create the Batch File

In [51]:
def create_tasks(df, task_type, system_prompt):
    tasks = []
    for index, row in df.iterrows():
        if task_type == 'assistant':
            content = row['question']
        elif task_type == 'judge':
            content = f'''
            User question: {row['question']}. \n\nLLM answer: {row['answer']}
            '''

        task = {
            "custom_id": f"{task_type}-{index}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "klusterai/Meta-Llama-3.1-405B-Instruct-Turbo",
                "temperature": 0.5,
                "response_format": {"type": "json_object"},
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": content},
                ],
            }
        }
        tasks.append(task)
    return tasks

def save_tasks(tasks, task_type):
    filename = f"batch_tasks_{task_type}.jsonl"
    with open(filename, 'w') as file:
        for task in tasks:
            file.write(json.dumps(task) + '\n')
    return filename
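
Before uploading, it can help to check that every line of the batch file is valid JSON, carries the expected keys, and has a unique `custom_id`. The sketch below is one way to do that; `validate_batch_lines` and `REQUIRED_KEYS` are hypothetical names, and the required keys are simply those written by `create_tasks` above, not an official schema.

```python
import json

# Keys each request line is expected to carry, taken from create_tasks above (an assumption,
# not an official schema).
REQUIRED_KEYS = {"custom_id", "method", "url", "body"}

def validate_batch_lines(lines):
    """Return a list of problems found in an iterable of JSONL lines (empty if none)."""
    problems = []
    seen_ids = set()
    for line_number, line in enumerate(lines, start=1):
        try:
            task = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {line_number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(task, dict):
            problems.append(f"line {line_number}: expected a JSON object")
            continue
        missing = REQUIRED_KEYS - task.keys()
        if missing:
            problems.append(f"line {line_number}: missing keys {sorted(missing)}")
        custom_id = task.get("custom_id")
        if custom_id is not None:
            if custom_id in seen_ids:
                problems.append(f"line {line_number}: duplicate custom_id {custom_id!r}")
            seen_ids.add(custom_id)
    return problems
```

For a file on disk, the file object can be passed directly, for example `with open(filename) as f: print(validate_batch_lines(f))`.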

In [52]:
ASSISTANT_PROMPT = '''
    Provide an answer to the question asked. Use no more than 150 words.
    '''

task_list = create_tasks(df, task_type='assistant', system_prompt=ASSISTANT_PROMPT)
filename = save_tasks(task_list, task_type='assistant')

#### Upload Batch File to kluster.ai

In [53]:
def create_batch_job(file_name):
    print(f"Creating batch job for {file_name}")
    # Open the file in a context manager so the handle is closed after the upload
    with open(file_name, "rb") as file:
        batch_file = client.files.create(
            file=file,
            purpose="batch"
        )

    batch_job = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h"
    )

    return batch_job

job = create_batch_job(filename)

Creating batch job for batch_tasks_assistant.jsonl


#### Check Job Progress

In [54]:
def parse_json_objects(data_string):
    if isinstance(data_string, bytes):
        data_string = data_string.decode('utf-8')

    json_strings = data_string.strip().split('\n')
    json_objects = []

    for json_str in json_strings:
        try:
            json_obj = json.loads(json_str)
            json_objects.append(json_obj)
        except json.JSONDecodeError as e:
            print(f"Error parsing JSON: {e}")

    return json_objects

def monitor_job_status(client, job_id, task_type):
    all_completed = False

    while not all_completed:
        all_completed = True
        output_lines = []

        updated_job = client.batches.retrieve(job_id)

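        status = updated_job.status.lower()
        if status in ("failed", "expired", "cancelled"):
            # Stop polling on a terminal failure state instead of looping forever
            # (status names follow the OpenAI-compatible Batch API)
            raise RuntimeError(f"{task_type.capitalize()} job ended with status: {updated_job.status}")

        if status != "completed":
            all_completed = False
            completed = updated_job.request_counts.completed
            total = updated_job.request_counts.total
            output_lines.append(f"{task_type.capitalize()} job status: {updated_job.status} - Progress: {completed}/{total}")
        else:
            output_lines.append(f"{task_type.capitalize()} job completed!")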

        # Clear the output and display updated status
        clear_output(wait=True)
        for line in output_lines:
            display(line)

        if not all_completed:
            time.sleep(10)

monitor_job_status(client=client, job_id=job.id, task_type='assistant')

'Assistant job completed!'
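
The polling logic can also be separated from the notebook display and given a timeout. The following is a minimal sketch, not the notebook's method: `wait_for_status` is a hypothetical helper, and the terminal status names are an assumption based on the OpenAI-compatible Batch API.

```python
import time

# Assumed terminal statuses, based on the OpenAI-compatible Batch API.
SUCCESS_STATUS = "completed"
FAILURE_STATUSES = {"failed", "expired", "cancelled"}

def wait_for_status(get_status, poll_seconds=10, timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until the job completes, fails, or the timeout elapses."""
    waited = 0
    while True:
        status = get_status().lower()
        if status == SUCCESS_STATUS:
            return status
        if status in FAILURE_STATUSES:
            raise RuntimeError(f"Batch job ended with status: {status}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Batch job still '{status}' after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
```

With the client defined earlier, a call might look like `wait_for_status(lambda: client.batches.retrieve(job.id).status)`. Passing `sleep` as a parameter keeps the helper easy to test without real delays.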

#### Get the Results

In [55]:
batch_job = client.batches.retrieve(job.id)
result_file_id = batch_job.output_file_id
result = client.files.content(result_file_id).content
results = parse_json_objects(result)
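answers = {}

for res in results:
    # Batch output order is not guaranteed to match input order,
    # so map each answer back to its row through the custom_id suffix
    index = int(res['custom_id'].split('-')[-1])
    answers[index] = res['response']['body']['choices'][0]['message']['content']

df['answer'] = df.index.map(answers)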

In [56]:
df

Unnamed: 0,question,answer
0,Describe the three laws of motion formulated by Isaac Newton.,"Isaac Newton formulated three laws of motion that describe how objects move and respond to forces. \n\n1. The First Law (Law of Inertia): An object at rest will remain at rest, and an object in motion will continue to move with a constant velocity, unless acted upon by an external force.\n\n2. The Second Law (Law of Acceleration): The force applied to an object is equal to the mass of the object multiplied by its acceleration (F = ma). This law shows that the more massive an object is, the m..."
1,Describe the three laws of motion formulated by Isaac Newton. Provide a correct but vague and incomplete answer.,"Isaac Newton's laws of motion are three fundamental principles that describe how objects move and respond to forces. The first law is about things staying the same unless something else happens. The second law is related to the amount of ""oomph"" needed to change something's motion. The third law is about what happens when things interact with each other - basically, it's a two-way street. These laws, which were presented in the late 17th century, laid the foundation for classical mechanics a..."
2,"Describe the three laws of motion formulated by Isaac Newton. Provide the correct names, but mix them in order (don’t mention anything about this).","Isaac Newton formulated the following laws of motion:\n\n1. Newton's Third Law of Motion states that every object in a state of uniform motion will remain in that state of motion unless an external force acts upon it. This means that an object at rest will stay at rest, and an object in motion will continue moving, unless a force is applied.\n\n2. Newton's Second Law of Motion states that for every action, there is an equal and opposite reaction. This means that when two objects interact, th..."
3,Describe the three laws of motion formulated by Isaac Newton. Invent about 80% of the answer and include some real information. Don’t specify which parts are real or invented.,"Isaac Newton's three laws of motion revolutionized our understanding of the physical world. The first law, also known as the law of celestial harmony, states that an object at rest will remain at rest, while an object in motion will continue to move in a spiral pattern, seeking harmony with the universe. However, an external force can disrupt this harmony and alter the object's trajectory.\n\nThe second law, or the law of energetic equivalence, asserts that the force applied to an object is ..."


## 2. Evaluate the Answers

In this section, we use the same model in a new role: as a judge. The model evaluates each answer generated in the previous step, assigning a score and a brief explanation. Because we know which answers were deliberately degraded, the scores show how reliably the judge detects wrong answers.

In [64]:
JUDGE_PROMPT = '''
    You are an evaluator tasked with assessing the quality of a response provided by an expert to a specific user question. 
    Generate a score from 1 to 5 based on relevance, accuracy and clarity and provide a brief explanation for your score. 
    Output your evaluation in this format: "Score: X/5, Explanation: [Your explanation]"
    Use no more than 50 words in the explanation.
    '''
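
Since the prompt requests a fixed `Score: X/5, Explanation: ...` format, the judge's reply can be turned into a numeric score for filtering or aggregation. The sketch below assumes the model follows that format; `parse_evaluation` is a hypothetical helper, and it returns `None` as the score when the reply does not match.

```python
import re

# Matches the "Score: X/5, Explanation: ..." format requested in the judge prompt.
SCORE_PATTERN = re.compile(
    r"Score:\s*([1-5])\s*/\s*5\s*,?\s*Explanation:\s*(.*)",
    re.DOTALL | re.IGNORECASE,
)

def parse_evaluation(text):
    """Return (score, explanation), or (None, text) when the format is not followed."""
    match = SCORE_PATTERN.search(text)
    if match is None:
        return None, text.strip()
    return int(match.group(1)), match.group(2).strip()
```

Once the judge's replies are available, they could be split into two columns with something like `df['evaluation'].apply(parse_evaluation)`, and answers below a chosen score threshold flagged as likely wrong.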

Now, we’ll replace every question with the original, unmodified one, so the judge receives no hint about which answers were deliberately made vague, reordered, or invented.

In [65]:
df_copy = df.copy()
df_copy['question'] = df_copy['question'].iloc[0] # We don't give info to the judge about which one is true or invented.
task_list = create_tasks(df_copy, task_type='judge', system_prompt=JUDGE_PROMPT)
filename = save_tasks(task_list, task_type='judge')
job = create_batch_job(filename)
monitor_job_status(client=client, job_id=job.id, task_type='judge')

'Judge job completed!'

In [66]:
batch_job = client.batches.retrieve(job.id)
result_file_id = batch_job.output_file_id
result = client.files.content(result_file_id).content
results = parse_json_objects(result)
evals = {}

for res in results:
    task_id = res['custom_id']
    index = int(task_id.split('-')[-1])
    result = res['response']['body']['choices'][0]['message']['content']
    # Key by row index: batch output order is not guaranteed to match input order
    evals[index] = result
    question = df_copy.iloc[index]['question']
    answer = df_copy.iloc[index]['answer']
    print(f'QUESTION: {question}')
    print('\n -------------------------- \n')
    print(f"Task ID: {task_id}. \n\nEVALUATION: {result}\n\nLLM ANSWER: {answer}")

df['evaluation'] = df.index.map(evals)

QUESTION: Describe the three laws of motion formulated by Isaac Newton.

 -------------------------- 

Task ID: judge-0. 

EVALUATION: Score: 5/5, Explanation: The response accurately and clearly describes Newton's three laws of motion, providing concise definitions and explanations for each law. The answer is well-organized and relevant to the user's question, demonstrating a thorough understanding of the topic.

LLM ANSWER: Isaac Newton formulated three laws of motion that describe how objects move and respond to forces. 

1. The First Law (Law of Inertia): An object at rest will remain at rest, and an object in motion will continue to move with a constant velocity, unless acted upon by an external force.

2. The Second Law (Law of Acceleration): The force applied to an object is equal to the mass of the object multiplied by its acceleration (F = ma). This law shows that the more massive an object is, the more force is required to produce a given acceleration.

3. The Third Law (Law 