### LLM as a Judge

This notebook demonstrates using a large language model (LLM) to **evaluate responses** or make judgments on specific questions. 

We will:

- Format a prompt with examples and a target question.
- Send it to the LLM for evaluation.
- Collect the critique and outcome in a structured way.

### Setup

This cell imports the necessary libraries, loads environment variables, and reads the API key for the Google GenAI client.

- **pandas**: For data manipulation.  
- **os & dotenv**: To load environment variables securely.  
- **pydantic**: To define structured output schemas.  
- **google.genai**: To interact with the Google LLM API.  
- **yaml**: To read the prompt template from a YAML file.

We also load the `GOOGLE_API_KEY` from a `.env` file to authenticate requests to the LLM.


In [1]:
import pandas as pd 
import os 
from pydantic import BaseModel, Field 
from google import genai 
from google.genai import types 
from dotenv import load_dotenv 
import yaml 

load_dotenv() 

GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
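
`os.getenv` returns `None` when the variable is unset, so a missing key would otherwise only surface at the first API call. A minimal sketch of a fail-fast check, assuming a hypothetical helper named `require_env` (not part of the original notebook):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file.")
    return value
```

It could be used as `GOOGLE_API_KEY = require_env("GOOGLE_API_KEY")` right after `load_dotenv()`.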

### Define Output Schema

We define a **Pydantic model** to structure the LLM's output:

- `critique`: A detailed explanation of the LLM's reasoning.  
- `outcome`: The final judgment based on the critique (e.g., "pass" or "fail").  

Using a schema ensures the model's responses are **consistent and machine-readable**.


In [2]:
class LLMValidatorOutput(BaseModel):
    critique: str = Field(description="The formulated detailed critique explaining your reasoning")
    outcome: str = Field(description="The outcome of the critique (pass/fail)")
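
Because `outcome` is declared as a plain `str`, the schema by itself does not restrict the value to the two expected verdicts. One possible post-check is sketched below with the standard library only; `validate_judgement` and `ALLOWED_OUTCOMES` are hypothetical names, and the pass/fail vocabulary is an assumption taken from the field description.

```python
import json

ALLOWED_OUTCOMES = {"pass", "fail"}  # assumed verdict vocabulary


def validate_judgement(raw_json: str) -> dict:
    """Parse a judge response and check that it has the two expected fields."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = {"critique", "outcome"} - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # Normalize case and whitespace before checking the verdict.
    outcome = str(data["outcome"]).strip().lower()
    if outcome not in ALLOWED_OUTCOMES:
        raise ValueError(f"unexpected outcome: {data['outcome']!r}")
    return {"critique": str(data["critique"]).strip(), "outcome": outcome}


result = validate_judgement('{"critique": "Clear and complete.", "outcome": "PASS"}')
print(result["outcome"])
```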

### Load Prompt Template

Here we load the **prompt template** from a YAML file (`prompt.yml`).

In [3]:
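# Read the YAML prompt file; the context manager closes it afterwards.
with open("prompt.yml", "r", encoding="utf-8") as f:
    prompt = yaml.safe_load(f)
prompt_template = prompt["template"]["content"]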
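
The template is later filled with `str.format`, so any literal braces in it (for instance a JSON output example) have to be doubled. A small sketch with a made-up template: the wording is hypothetical, and only the four placeholder names come from this notebook.

```python
# Hypothetical template text; only the placeholder names match the notebook.
demo_template = (
    "Evaluate a {feature} response.\n"
    "{examples}\n"
    "<nlq>{question}</nlq>\n"
    "<response>{llm_answer}</response>\n"
    'Answer as JSON: {{"critique": "...", "outcome": "pass or fail"}}'
)

demo_prompt = demo_template.format(
    feature="Payment Options",
    examples="<example-0>...</example-0>",
    question="Can i use my visa card",
    llm_answer="(chatbot answer goes here)",
)
print(demo_prompt)
```

Braces inside the substituted values are not re-interpreted by `str.format`, so only the template text itself needs the doubling.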

### Format Examples for Prompt

This function converts a DataFrame of examples into a **structured text format** that the LLM can understand.  

- Each example includes:  
  - `<nlq>`: The user question.  
  - `<response>`: The AI's answer.  
  - `<evaluation>`: A JSON-like critique and outcome.  
- The function wraps each example in `<example-{i}>` tags for clarity.  

This makes it easy to feed multiple examples into the prompt in a consistent format.


In [4]:
def format_examples(df_examples: pd.DataFrame) -> str:
    # Render each ground-truth row as a tagged few-shot example block.
    examples_list = [
        f"""
        <example-{i}>
        <nlq>{example["User"].strip()}</nlq>
        <response>{example["AI"].strip()}</response>
        <evaluation>
        {{
            "critique": "{example["Critique"].strip()}",
            "outcome": "{example["Judgement"].strip().lower()}"
        }}
        </evaluation>
        </example-{i}>
        """
        for i, example in df_examples.iterrows()
    ]
    return "\n".join(examples_list)
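
The function above places the critique between double quotes by hand, so a critique that contains quotes or newlines would not yield valid JSON inside `<evaluation>`. If strict JSON matters, one option is to let `json.dumps` do the escaping. The sketch below handles a single example and uses a hypothetical `format_one_example` helper with plain arguments instead of a DataFrame row.

```python
import json


def format_one_example(i: int, user: str, ai: str, critique: str, judgement: str) -> str:
    """Render one few-shot example with a JSON-escaped evaluation block."""
    evaluation = json.dumps(
        {"critique": critique.strip(), "outcome": judgement.strip().lower()},
        indent=4,
        ensure_ascii=False,
    )
    return (
        f"<example-{i}>\n"
        f"<nlq>{user.strip()}</nlq>\n"
        f"<response>{ai.strip()}</response>\n"
        f"<evaluation>\n{evaluation}\n</evaluation>\n"
        f"</example-{i}>"
    )


block = format_one_example(
    0,
    "Track an order",
    "(chatbot answer goes here)",
    'The user asked to "track an order".\nThe answer covers it.',
    "PASS",
)
print(block)
```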

### Initialize LLM Client

We set up the **Google GenAI client** and define a configuration for generating content:

- `temperature=0.0`: Minimizes sampling randomness so repeated evaluations are as consistent as possible.  
- `response_mime_type='application/json'`: Returns the output in JSON format.  
- `response_schema=LLMValidatorOutput`: Constrains the response to our predefined schema (`critique` and `outcome`).  

In [5]:
llm_instance = genai.Client(api_key=GOOGLE_API_KEY)
config = types.GenerateContentConfig(
    temperature=0.0,
    response_mime_type='application/json',
    response_schema=LLMValidatorOutput,
)

### Load Ground Truth Data

We load the **ground truth dataset** containing example features, user questions, AI responses, critiques, and judgments.

This dataset will be used to:

- Provide examples to the LLM.  
- Compare new AI responses against established evaluations.  
- Generate structured prompts for the LLM to critique and judge. 

In [6]:
df_groundtruth = pd.read_csv("data/groundtruth-vs2.csv") 
df_groundtruth

|  | ID | Feature | Scenario | User | AI | Judgement | Critique |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Product order | Order Tracking | Track an order | For purchases made directly from our website, ... | PASS | The AI's response comprehensively covers how t... |
| 1 | 2 | Product order | Delivery Options | Hello, is it possible to schedule a delivery? | For purchases made directly from our website, ... | FAIL | The user's question is clear and straightforwa... |
| 2 | 3 | Product order | Order cancellation | I would like to cancel my order, please. | For purchases made directly from our website, ... | PASS | Critique:\nThe user’s question, "I want to can... |
| 3 | 4 | Product order | Lost Package | Lost package / missing parcel. | For purchases made directly from our website, ... | FAIL | Critique:\nThe user’s query, "lost parcel," in... |
| 4 | 5 | Product order | Modify Order Before Shipment | Hello, I would like to change the size of a pr... | For purchases made directly from our website, ... | FAIL | Critique:\nThe user’s question is clear: they ... |
| 5 | 6 | Return Management | Initiate Product Return | How to return a product. | For purchases fromour website, you can return ... | PASS | Critique:\nThe user’s question, "How to return... |
| 6 | 7 | Return Management | Request Refund | How can I get refunded? | For purchases made directly from our website, ... | PASS | Critique:\nThe user’s question, "How do I get ... |
| 7 | 8 | Return Management | Seller Non-Response | You replied to me one week ago that the seller... | For purchases from Marketplace sellers, you mu... | FAIL | critique:\nThe user’s question indicates frust... |
| 8 | 9 | Payment Options | Access to invoice | Download an invoice. | To download your invoice, first log in to your... | PASS | Critique:\nThe AI's response effectively instr... |
| 9 | 11 | Payment Options | Accepted Payment Methods | What payment methods are accepted? | For purchases on our website, the following pa... | PASS | Critique:\nIntroduction:\nThe user inquired ab... |


### Load Chatbot Answers

We load the dataset containing **AI-generated answers** to various user questions.  

This dataset will be used to:

- Feed questions and answers into the LLM for evaluation.  
- Compare the LLM's critique and judgment against ground truth examples.

In [7]:
df_qa = pd.read_csv("data/chatbot-answers-vs2.csv") 
df_qa

|  | question | llm_answer | Feature |
|---|---|---|---|
| 0 | Track an order | For purchases made directly from our website, ... | Product Order |
| 1 | How to return a product | For purchases made directly from our website, ... | Return Management |
| 2 | Download an invoice | To download your invoice, follow these steps:\... | Payment Options |
| 3 | Cancel an order | For purchases made directly from our website, ... | Product Order |
| 4 | Refund | For purchases made directly from our website, ... | Return Management |
| 5 | Can i use my visa card | I’m sorry, but Visa cards are not accepted for... | Payment Options |


### Select Test Case

Here we select a **single test example** from the chatbot answers dataset to evaluate.  

This allows us to **focus on one case** while testing the LLM evaluation workflow.


In [23]:
# Select one row; the output shown here is for row 5 (the Visa card question).
# df_test = df_qa.iloc[0:1].reset_index(drop=True)
# df_test = df_qa.iloc[1:2].reset_index(drop=True)
df_test = df_qa.iloc[5:6].reset_index(drop=True)
df_test

|  | question | llm_answer | Feature |
|---|---|---|---|
| 0 | Can i use my visa card | I’m sorry, but Visa cards are not accepted for... | Payment Options |


### Generate Formatted Prompt

This loop constructs the **final prompt** for each test case in `df_test`:

- Retrieves relevant examples from the ground truth dataset that match the current feature.  
- Fills the prompt template with:
  - Feature name  
  - Example blocks  (Few-shot Prompting)
  - Test question and the AI's answer  

This step ensures the LLM receives a **well-structured, context-rich prompt** for evaluation.


In [24]:
import pprint

for idx, row in df_test.iterrows():
    # Keep only ground-truth examples for the same feature (case-insensitive match).
    same_feature = df_groundtruth["Feature"].str.lower() == row["Feature"].lower()
    examples_text = format_examples(df_groundtruth[same_feature])

    # Fill the template placeholders with the feature, examples, and test case.
    formatted_prompt_str = prompt_template.format(
        feature=row.get("Feature", ""),
        examples=examples_text,
        question=row.get("question", ""),
        llm_answer=row.get("llm_answer", ""),
    )

# df_test holds a single row, so this is the prompt for that row.
formatted_prompt_str = formatted_prompt_str.strip()
pprint.pp(formatted_prompt_str)

('You are an AI response evaluator with advanced capabilities to judge if a '
 'response is appropriate, helpful, and meets specific evaluation criteria. \n'
 'Your role is to assess responses involving Payment Options-related queries, '
 'guided by detailed evaluation standards.\n'
 '\n'
 '### Guidelines for Evaluation:\n'
 'Consider the following criteria when evaluating the response:\n'
 '\n'
 '1. **Accuracy and Relevance**  \n'
 "  - Does the response directly address the customer's query?  \n"
 '  - Is the information factually accurate and aligned with official company '
 'policies?  \n'
 '\n'
 '2. **Completeness**  \n'
 "  - Does the response fully address all aspects of the customer's query?  \n"
 '  - Are there any critical details, steps, or aspects of the query left '
 'unaddressed?  \n'
 '\n'
 '3. **Clarity and Understandability**  \n'
 '  - Is the response written in simple and clear language, avoiding '
 'unnecessary jargon?  \n'
 '  - Is it structured logically so the cu
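
The ground-truth file spells the feature as `Product order` while the chatbot answers use `Product Order`, which is why the loop lowercases both sides before comparing. A minimal sketch of that matching step on plain dictionaries follows; the rows are abbreviated from the datasets loaded earlier, and `examples_for_feature` is a hypothetical helper that also trims surrounding spaces, which the notebook code does not do.

```python
ground_truth = [
    {"Feature": "Product order", "User": "Track an order", "Judgement": "PASS"},
    {"Feature": "Return Management", "User": "How to return a product.", "Judgement": "PASS"},
    {"Feature": "Payment Options", "User": "Download an invoice.", "Judgement": "PASS"},
]


def examples_for_feature(rows: list[dict], feature: str) -> list[dict]:
    """Return the rows whose Feature matches, ignoring case and outer spaces."""
    wanted = feature.strip().lower()
    return [row for row in rows if row["Feature"].strip().lower() == wanted]


matches = examples_for_feature(ground_truth, "Product Order")
print([row["User"] for row in matches])
```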

### Send Prompt to LLM

Here we send the **formatted prompt** to the LLM for evaluation:

- `model="gemini-2.0-flash"` specifies the LLM to use.  
- `contents=formatted_prompt_str` provides the structured input prompt.  
- `config=config` applies our generation settings, including:
  - Low-randomness output (`temperature=0.0`)  
  - JSON response format  
  - Schema validation (`LLMValidatorOutput`)  

The LLM will return a **critique and outcome** based on the provided examples and test case.


In [None]:
response = llm_instance.models.generate_content( 
    # model="gemini-2.0-flash-lite",
    model="gemini-2.0-flash", 
    contents=formatted_prompt_str, 
    config=config 
    )

### Display LLM Outcome

Here we extract the **final judgment** from the LLM's response:

- `response.parsed.outcome` contains the verdict (e.g., `"pass"` or `"fail"`).  
- This shows the LLM's overall evaluation of the AI-generated answer based on the provided examples and prompt.  


In [29]:
response.parsed.outcome

'pass'
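
Once several test cases have been judged, the verdicts can be compared with human labels such as the `Judgement` column of the ground-truth file. Below is a sketch of a simple agreement rate, assuming both lists are aligned by test case. The function name and the sample labels are illustrative only and are not results from this notebook.

```python
def agreement_rate(judge_outcomes: list[str], human_labels: list[str]) -> float:
    """Fraction of cases where the LLM judge and the human label agree."""
    if len(judge_outcomes) != len(human_labels):
        raise ValueError("both lists must have the same length")
    if not judge_outcomes:
        raise ValueError("no cases to compare")
    # Compare case-insensitively, since the ground truth uses upper-case labels.
    matches = sum(
        judge.strip().lower() == human.strip().lower()
        for judge, human in zip(judge_outcomes, human_labels)
    )
    return matches / len(judge_outcomes)


# Made-up labels for illustration only.
rate = agreement_rate(["pass", "fail", "pass", "pass"], ["PASS", "FAIL", "FAIL", "PASS"])
print(f"{rate:.2f}")
```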

### Display LLM Critique

After sending the prompt, we extract and display the **detailed critique** from the LLM's response:

- `response.parsed.critique` contains the reasoning behind the LLM's judgment.   

This helps us understand **why the LLM evaluated the answer as it did**.


In [30]:
pprint.pp(response.parsed.critique)

("The response directly answers the user's question about using a Visa card. "
 'It states that Visa cards are not accepted and suggests alternative payment '
 "methods. The response is clear, concise, and directly addresses the user's "
 'query. It adheres to a professional tone and provides helpful information by '
 "suggesting alternatives. However, without knowing the specific service, it's "
 'impossible to verify the accuracy of the statement about Visa card '
 'acceptance. Assuming the information is accurate, the response is helpful '
 "and meets the user's needs.")
