# LLM-as-a-Judge

This notebook showcases the different scenarios of LLM-as-a-Judge with concrete examples.

Its main purpose is to create alignment and structure the work that needs to be done to integrate AI-assisted judgments into the Search Relevance Workbench.

Judgments are generated with language models in four use cases:

1. The classic approach: Users define a judgment generation process by specifying a query set, an LLM, and a prompt. For every query in the query set, a search configuration retrieves the top n documents, and for each query-doc pair the LLM generates a judgment together with a reasoning statement.
2. The classic approach embedded in an experiment: Users run an experiment and want to create judgments "on the fly" by referencing an empty judgment list. The process of judgment generation is identical to the first option. The difference is the user journey: create an empty list, then run an experiment that includes judgment generation as part of the overarching process.
3. Filling in gaps: Users run an experiment and want to fill in any gaps. It is common that not all query-doc pairs have been judged. With LLM-as-a-Judge these gaps can be filled on the fly.
4. Similarity-based judgments: Users want to generate judgments with LLMs that are based on the similarity of a provided reference statement to a retrieved document.
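
The classic approach from items 1 and 2 can be sketched as a loop over queries and their top-n hits. This is only an illustration under stated assumptions: `generate_judgments`, `search`, and `judge` are hypothetical names, and the two stub functions stand in for a real search configuration and a real LLM call.

```python
from typing import Callable


def generate_judgments(
    queries: list[str],
    search: Callable[[str], list[dict]],
    judge: Callable[[str, dict], dict],
    top_n: int = 10,
) -> list[dict]:
    """Judge the top_n hits of every query (hypothetical helper)."""
    results = []
    for query in queries:
        for doc in search(query)[:top_n]:
            # The judge is expected to return the keys "judgment" and "explanation"
            verdict = judge(query, doc)
            results.append({
                "query": query,
                "doc": doc["id"],
                "judgment": verdict["judgment"],
                "explanation": verdict["explanation"],
            })
    return results


# Stubs standing in for a search configuration and an LLM (assumptions).
def fake_search(query: str) -> list[dict]:
    return [
        {"id": "1", "title": "Barbie Malibu House Playset 60 cm"},
        {"id": "2", "title": "Airpods Case"},
    ]


def fake_judge(query: str, doc: dict) -> dict:
    hit = query.lower() in doc["title"].lower()
    return {"judgment": 3 if hit else 0, "explanation": "stub: substring match"}


sample = generate_judgments(["barbie"], fake_search, fake_judge, top_n=2)
print(sample)
```

Passing the search and judge steps in as callables keeps the loop identical for use cases 1 to 3; only the set of query-doc pairs that reaches the judge changes.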

## Notebook requirements

* Python 3
* Follow the instructions in `README.md` to set up a virtual environment and install the required Python libraries
* Ollama is used as the local model serving engine
* The [Gemma 3](https://ollama.com/library/gemma3) model is used for LLM-based judgments. Download the model with `ollama pull gemma3`
* [All-MiniLM](https://ollama.com/library/all-minilm) is used for the similarity-based judgments. Download the model with `ollama pull all-minilm`

### Query Set and Documents

We use a small set of three queries and three documents.

The example documents are taken from the [ESCI dataset](https://github.com/amazon-science/esci-data).

In [80]:
queries = ["barbie", "airpods", "apple airpods"]
documents = [
    {
        "id": "1",
        "abstract": "",
        "title": "Barbie Malibu House Playset 60 cm"
    },
    {
        "id": "2",
        "abstract": "Airpods Case - Airspo 7 in 1 Airpods Accessories Set Compatible with Airpods 1 & 2",
        "title": "Airpods Case - Airspo 7 in 1 Airpods Accessories Set Compatible with Airpods 1 & 2 Protective Silicone Cover Floral Print Cute Case (Black Rose)"
    },
    {
        "id": "3",
        "abstract": "Welcome to the World's First Apple Cider Vinegar Gummies! Vegan, Gluten Free, Non-GMO and made with the highest quality ingredients! Our patented formula combines Apple Cider Vinegar, traditionally used as a remedy for digestion, gut health, and appetite, plus essential Vitamins B9 and B12 that support overall good health. Taste the Apple. Not the Vinegar.",
        "title": """Products Apple Cider Vinegar Gummy Vitamins by Goli Nutrition - 3 Pack - (180 Count, Organic, Vegan, Gluten-Free, Non-GMO, with"The Mother", Vitamin B9, B12, Beetroot, Pomegranate)"""
    }
]

### Query-Doc Pairs

We combine every document with each query to have a list of 9 query-doc pairs.

We structure each pair as a user message that we can pass to an instruction-based LLM.

In [81]:
user_messages = []

for query in queries:
    for doc in documents:
        user_message = {
            "role": "user",
            "content": f"""
            Query: {query}
            doc1:
                title: {doc['title']}
                abstract: {doc['abstract']}
            """
        }
        user_messages.append(user_message)

## Use Cases 1 & 2 - The Classic LLM-as-a-Judge Scenario

In [82]:
import ollama

In [83]:
from ollama import chat
from ollama import ChatResponse

# We create a prompt that consists of a system message and a user message.

# The system message contains the instructions for the LLM: an explanation of the judgment scale,
# the output format, and examples.
system_message = {
    "role": "system",
    "content": """You are evaluating the results from a search engine. For each query, you will be provided with multiple documents. Your task is to evaluate each document and assign a judgment on a scale of 0 to 3, where:
    - 0 indicates the document is irrelevant to the query.
    - 1 indicates the document is somewhat relevant to the query.
    - 2 indicates the document is mostly relevant to the query.
    - 3 indicates the document is perfectly relevant to the query.

    For each document, provide:
    1. An explanation of the judgment.
    2. The judgment value.

    The response should be in the following JSON format:
    {
      "explanation": "Your detailed reasoning behind the judgment",
      "judgment": <numeric value>
    }

    Here are three examples:
    User:
    Query: Farm animals

    doc1:
      title: All about farm animals
      abstract: This document is all about farm animals
    Assistant:
    {
      "explanation": "This document appears to perfectly respond to the user's query",
      "judgment": 3
    }

    User:
    Query: Farm animals

    doc2:
      title: Somewhat about farm animals
      abstract: This document somewhat talks about farm animals
    Assistant:
    {
      "explanation": "This document is somewhat relevant to the user's query",
      "judgment": 1
    }

    User:
    Query: Farm animals

    doc3:
      title: This document has nothing to do with farm animals
      abstract: We will talk about everything except for farm animals.
    Assistant:
    {
      "explanation": "This document is not relevant at all to the user's query",
      "judgment": 0
    }"""
  }
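
The system message asks for a JSON object with an `explanation` and a `judgment`. Models may wrap that object in a Markdown code fence or add surrounding text, so turning a reply into a usable judgment usually needs a small parsing step. The sketch below shows one way to do this; `parse_judgment` is a hypothetical helper, and the sample reply is hand-written for illustration rather than taken from a model run.

```python
import json
import re


def parse_judgment(raw: str) -> dict:
    """Extract the judgment object from a model reply (hypothetical helper)."""
    # Take everything from the first "{" to the last "}"; this drops
    # code fences and any stray text around the JSON object.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model reply")
    parsed = json.loads(match.group(0))
    judgment = int(parsed["judgment"])
    # Enforce the 0 to 3 scale defined in the system message
    if not 0 <= judgment <= 3:
        raise ValueError(f"judgment {judgment} is outside the 0-3 scale")
    return {"judgment": judgment, "explanation": str(parsed.get("explanation", ""))}


# Hand-written sample reply, including the code fence a model might emit
reply = '```json\n{\n  "explanation": "Title mentions the query term.",\n  "judgment": 3\n}\n```'
parsed_reply = parse_judgment(reply)
print(parsed_reply)
```

Raising on malformed or out-of-scale replies makes failures visible, so the affected query-doc pair can be retried instead of being stored with a wrong value.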

In [84]:
# We pass the system message and one user message at a time to the LLM and retrieve the answer
for msg in user_messages:
    response: ChatResponse = chat(model='gemma3', messages=[
      system_message, msg
    ])
    print(response['message']['content'])

```json
{
  "explanation": "This document is highly relevant to the query 'barbie'. The title explicitly mentions 'Barbie' and the abstract refers to a Barbie product (the Malibu House Playset).",
  "judgment": 3
}
```
```json
{
  "explanation": "This document is completely irrelevant to the query 'barbie'. It discusses Airpods accessories and has nothing to do with the Barbie brand or the topic of Barbie.",
  "judgment": 0
}
```
```json
{
  "explanation": "This document discusses apple cider vinegar gummies, which has absolutely no connection to the query 'barbie'. It's entirely irrelevant.",
  "judgment": 0
}
```
```json
{
  "explanation": "This document is completely irrelevant to the query 'airpods'. It discusses a Barbie play set and has no connection to the topic of wireless earbuds.",
  "judgment": 0
}
```
```json
{
  "explanation": "This document is highly relevant to the query 'airpods'. It directly mentions 'airpods' in the title and abstract, and describes a case specificall

## Use Case 3 - Filling the Gaps

The LLM-as-a-Judge process only requests judgments for unrated query-doc pairs.

The basic process behind judging stays the same.

### Judgments

We now assume a judgment list that contains judgments for some of the query-doc pairs, not all.

In [85]:
judgments = [
    {
        "query": "barbie",
        "doc": "1",
        "judgment": 1
    },
    {
        "query": "barbie",
        "doc": "3",
        "judgment": 0
    },
    {
        "query": "airpods",
        "doc": "3",
        "judgment": 0
    }
]

In [86]:
# We now create user messages only for unrated query-doc pairs
user_messages = []

for query in queries:
    for doc in documents:
        # Match on the document's own id rather than its position in the list
        if any(j["query"] == query and j["doc"] == doc["id"] for j in judgments):
            continue  # Skip if a judgment exists

        user_message = {
            "role": "user",
            "content": f"""
            Query: {query}
            doc1:
                title: {doc['title']}
                abstract: {doc['abstract']}
            """
        }
        user_messages.append(user_message)

In [87]:
# We pass the system message and one user message at a time to the LLM and retrieve the answer
for msg in user_messages:
    response: ChatResponse = chat(model='gemma3', messages=[
      system_message, msg
    ])
    print(response['message']['content'])

```json
{
  "explanation": "This document is completely irrelevant to the query 'barbie'. It discusses Airpods accessories and has nothing to do with the famous doll.",
  "judgment": 0
}
```
```json
{
  "explanation": "This document is completely irrelevant to the query 'airpods'. It discusses a Barbie toy house and has no connection to the topic of wireless earbuds.",
  "judgment": 0
}
```
```json
{
  "explanation": "This document is highly relevant to the query 'airpods'. It specifically mentions 'Airpods' in the title and abstract, detailing a case accessory for Airpods.",
  "judgment": 3
}
```
```json
{
  "explanation": "This document is completely irrelevant to the user's query about 'apple airpods'. It discusses a Barbie Malibu House Playset and has no connection to the topic.",
  "judgment": 0
}
```
```json
{
  "explanation": "This document is highly relevant as it directly addresses the query 'apple airpods' by discussing an 'Airpods' accessory set. While it focuses on a case, 

## Use Case 4 - Similarity-based Judgments

For similarity-based judgments, no instruction-based LLM acts as a judge. Instead, we use text embeddings to measure the similarity of a returned document to the query or to a provided reference statement.
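
A similarity score is not yet a judgment on the 0 to 3 scale used in the earlier use cases. One possible bridge is to bucket the cosine similarity with fixed thresholds. The sketch below assumes that approach; the function names and the threshold values 0.2, 0.4, and 0.6 are illustrative assumptions, not calibrated values, and would need tuning against human judgments.

```python
import numpy as np


def cosine(a, b) -> float:
    """Cosine similarity of two vectors: dot product divided by the product of their norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def similarity_to_judgment(similarity: float, thresholds=(0.2, 0.4, 0.6)) -> int:
    """Map a similarity score to a 0-3 judgment (hypothetical thresholds)."""
    # Count how many thresholds the score reaches:
    # none -> 0 (irrelevant), all three -> 3 (perfectly relevant)
    return int(sum(similarity >= t for t in thresholds))


print(cosine([1.0, 0.0], [1.0, 1.0]))
print(similarity_to_judgment(0.64))
print(similarity_to_judgment(-0.09))
```

Fixed thresholds are the simplest option. Because the typical score range differs between embedding models, the cut-offs are best treated as per-model settings.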

### Query-to-Doc Similarity 

In [88]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [89]:
#query = "barbie"
#document = "Barbie Malibu House Playset 60 cm"
query = "apple airpods"
document = "wireless earbuds"

In [90]:
query_embedding = ollama.embeddings(model='all-minilm', prompt=query)["embedding"]
document_embedding = ollama.embeddings(model='all-minilm', prompt=document)["embedding"]

In [91]:
similarity = cosine_similarity([query_embedding], [document_embedding])[0][0]
print(similarity)

0.43989050572567057


In [92]:
similarities = []
for query in queries:
    # Embed each query once instead of once per document
    query_embedding = ollama.embeddings(model='all-minilm', prompt=query)["embedding"]
    for doc in documents:
        doc_text = " ".join([str(doc['title']), str(doc['abstract'])])
        document_embedding = ollama.embeddings(model='all-minilm', prompt=doc_text)["embedding"]
        similarity = cosine_similarity([query_embedding], [document_embedding])[0][0]
        similarities.append({"query": query, "document": doc['title'], "similarity": similarity})

In [93]:
def print_json_values(json_array):
    """Print the values of each dict, one per line, with a blank line between records."""
    for obj in json_array:
        values = []
        for value in obj.values():
            if isinstance(value, list):
                values.append("\n".join(map(str, value)))  # One list element per line
            else:
                values.append(str(value))
        print("\n".join(values) + "\n")  # One value per line, blank line after each record

print_json_values(similarities)

barbie
Barbie Malibu House Playset 60 cm
0.5484308971684816

barbie
Airpods Case - Airspo 7 in 1 Airpods Accessories Set Compatible with Airpods 1 & 2 Protective Silicone Cover Floral Print Cute Case (Black Rose)
0.12998155612524992

barbie
Products Apple Cider Vinegar Gummy Vitamins by Goli Nutrition - 3 Pack - (180 Count, Organic, Vegan, Gluten-Free, Non-GMO, with"The Mother", Vitamin B9, B12, Beetroot, Pomegranate)
0.025337382598659565

airpods
Barbie Malibu House Playset 60 cm
0.0606708010043219

airpods
Airpods Case - Airspo 7 in 1 Airpods Accessories Set Compatible with Airpods 1 & 2 Protective Silicone Cover Floral Print Cute Case (Black Rose)
0.6365950399555693

airpods
Products Apple Cider Vinegar Gummy Vitamins by Goli Nutrition - 3 Pack - (180 Count, Organic, Vegan, Gluten-Free, Non-GMO, with"The Mother", Vitamin B9, B12, Beetroot, Pomegranate)
-0.08574967492363803

apple airpods
Barbie Malibu House Playset 60 cm
0.06269648046198087

apple airpods
Airpods Case - Airspo 7 in 

### Reference-to-Doc Similarity

Reference statements are more common in question-answering and similar systems.

For this scenario we create an additional query set that contains a reference for each query.



In [94]:
queries = [
    {
        "query": "What's the captial of Germany",
        "reference": "Berlin is the capital of Germany."
    },
    {
        "query": "who was the 30th president of the united states",  
        "reference": "Calvin Coolidge was America's 30th president from 1923-1929."
    },
    {
        "query": "How many books are in the Harry Potter series",
        "reference": "There are 7 books in the Harry Potter series."
    }
]
documents = [
    {
        "id": "1",
        "title": "Bonn",
        "abstract": "Before Berlin, Bonn was the capital of capital of Germany from 1949 to 1990,",
    },
    {
        "id": "2",
        "title": "Harry Potter Fandom",
        "abstract": "8 Harry Potter movies exist.",
    },
    {
        "id": "3",
        "title": "Calvin Coolidge",
        "abstract": "In 1872 Calvin Coolidge was born in Plymouth Notch. In 1923 he became America's 30th president.",
    }
]

In [95]:
similarities = []
for query in queries:
    reference_text = query['reference']
    # Embed each reference once instead of once per document
    reference_embedding = ollama.embeddings(model='all-minilm', prompt=reference_text)["embedding"]
    for doc in documents:
        doc_text = " ".join([str(doc['title']), str(doc['abstract'])])
        document_embedding = ollama.embeddings(model='all-minilm', prompt=doc_text)["embedding"]
        similarity = cosine_similarity([reference_embedding], [document_embedding])[0][0]
        similarities.append({"query": query['query'], "reference": reference_text, "document": doc['title'], "similarity": similarity})

In [96]:
print_json_values(similarities)

What's the captial of Germany
Berlin is the capital of Germany.
Bonn
0.6175566088869178

What's the captial of Germany
Berlin is the capital of Germany.
Harry Potter Fandom
-0.04672439933953921

What's the captial of Germany
Berlin is the capital of Germany.
Calvin Coolidge
0.015167892858684733

who was the 30th president of the united states
Calvin Coolidge was America's 30th president from 1923-1929.
Bonn
0.11039106903291925

who was the 30th president of the united states
Calvin Coolidge was America's 30th president from 1923-1929.
Harry Potter Fandom
-0.038401802991445425

who was the 30th president of the united states
Calvin Coolidge was America's 30th president from 1923-1929.
Calvin Coolidge
0.7403091215042535

How many books are in the Harry Potter series
There are 7 books in the Harry Potter series.
Bonn
-0.11268842032783113

How many books are in the Harry Potter series
There are 7 books in the Harry Potter series.
Harry Potter Fandom
0.6414299004788524

How many books are i