# LLM-as-a-Judge Simply Explained: A Complete Guide to Run LLM Evals

Recently, the concept of “LLM as a Judge” has been gaining significant traction in the AI and NLP communities. As someone deeply involved in the field of LLM evaluation, I’ve seen firsthand how LLM judges are rapidly becoming the preferred method for evaluating language models. The reasons are clear: compared to traditional human evaluators, LLM judges offer faster, more scalable, and cost-effective assessments—eliminating much of the slow, expensive, and labor-intensive work that comes with manual review.

However, it’s important to recognize that LLM judges are not without their own challenges and limitations. Blindly relying on them can lead to misleading results and unnecessary frustration. That’s why, in this guide, I’ll share everything I’ve learned about leveraging LLM judges for system evaluation, including:

- The core principles behind LLM-as-a-Judge
- The practical benefits and pitfalls of automated evaluation
- Step-by-step instructions for setting up and running LLM-based evals 

---

## What exactly is “LLM as a Judge”?

“LLM-as-a-Judge” refers to the process of using Large Language Models (LLMs) to evaluate the outputs of other LLM systems. Instead of relying on human evaluators—which can be slow, expensive, and inconsistent—this approach leverages the reasoning and language understanding capabilities of LLMs to provide automated, scalable assessments.

The process typically works as follows:
1. **Define Evaluation Criteria:** You start by crafting an evaluation prompt that clearly specifies the criteria you want to assess (such as accuracy, relevance, faithfulness, bias, or any custom metric).
2. **Present Inputs and Outputs:** The LLM judge is given the original input (e.g., a question or task) and the output generated by the LLM system under evaluation.
3. **Automated Scoring:** The LLM judge reviews the information and assigns a score or rating based on the defined criteria.

LLM judges are commonly used to power advanced evaluation metrics like G-Eval, answer relevancy, faithfulness, and bias detection. By automating the evaluation process, LLM-as-a-Judge enables faster, more consistent, and more scalable assessments—making it an increasingly popular choice for both research and production environments.
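
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used later in this guide: `build_judge_prompt`, `run_judge`, and `fake_judge` are hypothetical names, and a real setup would replace `fake_judge` with a call to an actual LLM.

```python
import re
from typing import Callable, Optional


def build_judge_prompt(criterion: str, question: str, answer: str) -> str:
    """Steps 1 and 2: state the criterion, then show the input and the output."""
    return (
        f"Evaluate the answer against this criterion: {criterion}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Respond in the format: Score: <1-5>"
    )


def run_judge(judge: Callable[[str], str], criterion: str,
              question: str, answer: str) -> Optional[int]:
    """Step 3: ask the judge and pull the numeric score out of its reply."""
    reply = judge(build_judge_prompt(criterion, question, answer))
    match = re.search(r"Score:\s*(\d+)", reply)
    return int(match.group(1)) if match else None


def fake_judge(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; always returns the same reply."""
    return "Score: 4"


score = run_judge(fake_judge, "relevance", "What is the notice period?",
                  "The notice period is 30 days.")
print(score)
```

Passing the judge in as a plain callable keeps the prompt and parsing logic testable without any API calls.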

---

## Prerequisites

Before you get started, please make sure you have the following ready:

---

### 1. Sample Contract File for Testing

To try out the contract analysis workflow, download the sample contract file provided below:

- [Download Sample Contract (Google Drive)](https://drive.google.com/file/d/1E557kdNBZ5cDUvVDLNrEVRuKcRSYDG3Z/view?usp=sharing)

### 2. Ground Truth CSV File

Download the ground truth CSV file from the link below:

- [Download Ground Truth CSV (Google Drive)](https://drive.google.com/file/d/1E557kdNBZ5cDUvVDLNrEVRuKcRSYDG3Z/view?usp=sharing)

### 3. OpenAI API Key

You’ll need your own OpenAI API key to access the language models used for contract evaluation. If you don’t have one yet, follow this step-by-step guide to generate your API key:

- [How to get your own OpenAI API key (Medium article)](https://medium.com/@lorenzozar/how-to-get-your-own-openai-api-key-f4d44e60c327)

---

# Step 1: Install the Dependencies

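Run the following command in a Jupyter notebook cell to install the required packages. In a terminal, drop the leading `!`. Depending on your LangChain version, the document loaders used later may also need `langchain-community`, `pypdf`, and `docx2txt`; install them if an import fails.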

```python
!pip install gradio langchain openai python-docx PyPDF2 pandas
```

---


| Package       | Purpose / Use in Project                                                                 |
|---------------|-----------------------------------------------------------------------------------------|
| **gradio**    | Build interactive web UIs for machine learning and data apps. Lets users upload files, view results, and interact with your tool in a browser. |
| **langchain** | Framework for building applications powered by large language models (LLMs). Helps with document loading, processing, and LLM integration.      |
| **openai**    | Official Python client for OpenAI’s API. Allows your code to send prompts and receive responses from models like GPT-4.                         |
| **python-docx** | Read, write, and extract text from Microsoft Word (.docx) files. Used to process contract documents in Word format.                        |
| **PyPDF2**    | Read and extract text from PDF files. Enables your tool to analyze contracts provided as PDFs.                                                  |
| **pandas**    | Powerful data analysis and manipulation library. Used to organize, process, and display results in tables (dataframes).                        |


## After Installing Dependencies: Let's Start Importing!

Now that you’ve installed all the necessary libraries, let’s import them into your Python script or notebook. Here’s a summary of each import and its purpose:

| Import Statement                                                                 | Purpose / Usage                                                                                                 |
|----------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------|
| `import gradio as gr`                                                            | Imports Gradio for building interactive web interfaces for your app.                                            |
| `from langchain.document_loaders import PyPDFLoader, Docx2txtLoader, TextLoader` | Imports document loaders from LangChain to extract text from PDF, DOCX, and TXT files.                         |
| `from openai import OpenAI`                                                      | Imports the OpenAI client to interact with language models like GPT-4 for contract analysis.                    |
| `import pandas as pd`                                                            | Imports Pandas for organizing, processing, and displaying results in tables (dataframes).                      |
| `import os`                                                                     | Imports Python’s built-in OS module for handling file paths and interacting with the operating system.          |
| `import tempfile`                                                               | Imports the tempfile module to safely create and manage temporary files and directories during file processing. |


In [13]:
import gradio as gr
from langchain.document_loaders import PyPDFLoader, Docx2txtLoader, TextLoader
from langchain.schema import Document
from openai import OpenAI
import pandas as pd
import os
import io
import tempfile
import re

# Designing Key Terms and Evaluation Metrics

Hey! Now that we’re building our LLM contract evaluation system, let’s talk about two of the most important foundations: **Key Terms** and **Evaluation Metrics**.

---

## What are Key Terms?

Key terms are the specific contract clauses or topics that we want our system to automatically extract and analyze from any uploaded contract. Think of them as the “must-find” items in every contract review. By defining these up front, we ensure our tool is always looking for the most important legal concepts.

Here are the key terms we’ve chosen:

| Key Term                   | What It Means (in contracts)                                  |
|----------------------------|---------------------------------------------------------------|
| Service Warranty           | Guarantees and standards for services provided                |
| Limitation of Liability    | Limits on legal responsibility for damages or losses          |
| Governing Law              | Which jurisdiction’s laws apply to the contract               |
| Termination for Cause      | When and how the contract can be ended early                  |
| Payment Terms              | Details about payment amounts, schedules, and methods         |
| Confidentiality Obligations| Rules about keeping information private                       |

---

## What are Evaluation Metrics?

Once we extract these key terms, we need a way to judge how well the extraction (and the LLM’s answer) matches what we want. That’s where evaluation metrics come in! These are the criteria we use to score and justify each answer.

We group our metrics into three categories, inspired by the HHH (Helpful, Honest, Harmless) framework:

---

### Helpful
| Metric                                                        | What It Measures                                      |
|---------------------------------------------------------------|-------------------------------------------------------|
| Was the information extracted as per the question asked?      | Did the answer directly address the key term?         |
| Was the information complete?                                 | Is all relevant information included?                 |
| Was the information enough to make a conclusive decision?     | Is the answer sufficient for decision-making?         |
| Were associated red flags covered in the extracted output?    | Are potential issues or risks mentioned?              |

---

### Honest
| Metric                                                        | What It Measures                                      |
|---------------------------------------------------------------|-------------------------------------------------------|
| Was the information extracted from all relevant clauses?      | Are multiple relevant sections included if needed?    |
| Was the page number of extracted information correct?         | Are page references accurate?                         |
| Was the AI reasoning discussing the relevant clause?          | Is the explanation focused on the right part?         |
| Does the information stay within document scope?              | Is the answer limited to the uploaded contract?       |

---

### Harmless
| Metric                                                        | What It Measures                                      |
|---------------------------------------------------------------|-------------------------------------------------------|
| Were results free from misleading claims?                     | Are there any false or misleading statements?         |
| Does the tool avoid generic/non-contract answers?             | Is the answer specific to the contract, not generic?  |
| Did the AI avoid illegal or insensitive justifications?       | Are explanations appropriate and lawful?              |
| Did the tool prevent false claims about people/entities?      | Are there any incorrect statements about parties?     |
| Did the tool avoid hateful/profane content?                   | Is the output free from inappropriate language?       |

---

**In summary:**  
- We define key terms to focus our extraction.
- We use a set of evaluation metrics (grouped as Helpful, Honest, Harmless) to systematically judge the quality, accuracy, and safety of every answer our LLM provides.

This structure ensures our contract analysis is thorough, reliable, and responsible!
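
One possible way to keep the category grouping available in code is a small lookup table. This is only a sketch: `METRIC_CATEGORIES` and `category_of` are illustrative names that do not appear in the original notebook, and only a few metrics per category are listed to keep it short.

```python
from typing import Dict, List

# Illustrative subset of the metrics, grouped as in the tables above.
METRIC_CATEGORIES: Dict[str, List[str]] = {
    "Helpful": [
        "Was the information complete?",
        "Were associated red flags covered in the extracted output?",
    ],
    "Honest": [
        "Was the page number of extracted information correct?",
        "Does the information stay within document scope?",
    ],
    "Harmless": [
        "Were results free from misleading claims?",
        "Does the tool avoid generic/non-contract answers?",
    ],
}

# Reverse lookup: metric text -> category name.
CATEGORY_OF = {
    metric: category
    for category, metrics in METRIC_CATEGORIES.items()
    for metric in metrics
}


def category_of(metric: str) -> str:
    """Return the category for a metric, or 'Uncategorized' if it is unknown."""
    return CATEGORY_OF.get(metric, "Uncategorized")


print(category_of("Was the information complete?"))
```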

In [14]:
KEY_TERMS = [
    "Service Warranty",
    "Limitation of Liability",
    "Governing Law",
    "Termination for Cause",
    "Payment Terms",
    "Confidentiality Obligations"
]

EVALUATION_METRICS = [
    "Was the information extracted as per the question asked in the key term?",
    "Was the information complete?",
    "Was the information enough to make a conclusive decision?",
    "Were associated red flags covered in the extracted output?",
    "Was the information extracted from all relevant clauses?",
    "Was the page number of extracted information correct?",
    "Was the AI reasoning discussing the relevant clause?",
    "Does the information stay within document scope?",
    "Were results free from misleading claims?",
    "Does the tool avoid generic/non-contract answers?",
    "Did the AI avoid illegal or insensitive justifications?",
    "Did the tool prevent false claims about people/entities?",
    "Did the tool avoid hateful/profane content?"
]

## 📄 What Does `extract_text_from_file` Do?

Hey! Now that we have our key terms and evaluation metrics set up, let’s talk about how we actually get the text out of the documents we want to analyze. That’s where the `extract_text_from_file` function comes in!

---

### What’s the Purpose?

This function is designed to **extract all the text** from a contract file, no matter if it’s a PDF, Word document, plain text, or even a CSV. It’s the first step in our pipeline—turning a file into something our LLM can read and analyze.

---

### How Does It Work? (Step by Step)

1. **Figure Out the File Type**
   - The function looks at the file extension (like `.pdf`, `.docx`, `.txt`, or `.csv`) to see what kind of document you’ve uploaded.

2. **Pick the Right Loader**
   - Depending on the file type, it uses a special tool (called a “loader”) to read the file:
     - **PDFs:** Uses `PyPDFLoader`
     - **Word Docs (.docx, .doc):** Uses `Docx2txtLoader`
     - **Text Files (.txt):** Uses `TextLoader`
     - **CSV Files:** Uses `pandas.read_csv` to read the table and turn it into a string

3. **Extract the Text**
   - For PDFs, Word, and text files, it grabs the text from each page or section and joins them all together into one big string.
   - For CSVs, it converts the whole table into a string.

4. **Handle Unsupported Files**
   - If you upload a file type it doesn’t recognize, it raises an error so you know something’s wrong.

5. **Return the Results**
   - It gives you back two things:
     - The **full extracted text** (as a string)
     - The **list of document objects** (which can be useful if you want to know about page numbers or other metadata later)

---

### Why Is This Important?

- **Universal Input:** You can upload contracts in different formats, and this function will handle them all.
- **Foundation for Analysis:** The extracted text is what we’ll feed into our LLM to find key terms and evaluate the contract.
- **Error Handling:** It makes sure you don’t accidentally try to process a file type that isn’t supported.
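
The extension check in step 1 can also be done before a file ever reaches the extractor. The helper below is a small sketch under that assumption; `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names, not part of the notebook.

```python
import os

# Extensions the extractor is described as handling in the steps above.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".doc", ".txt", ".csv"}


def is_supported(file_path: str) -> bool:
    """Return True if the file extension is one the extractor can handle."""
    ext = os.path.splitext(file_path)[1].lower()
    return ext in SUPPORTED_EXTENSIONS


print(is_supported("contract.PDF"))
print(is_supported("contract.xlsx"))
```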

In [15]:
def extract_text_from_file(file_path):
    ext = os.path.splitext(file_path)[1].lower()
    if ext == ".pdf":
        loader = PyPDFLoader(file_path)
        docs = loader.load()
        text = "\n".join([doc.page_content for doc in docs])
    elif ext in [".docx", ".doc"]:
        loader = Docx2txtLoader(file_path)
        docs = loader.load()
        text = "\n".join([doc.page_content for doc in docs])
    elif ext in [".txt"]:
        loader = TextLoader(file_path)
        docs = loader.load()
        text = "\n".join([doc.page_content for doc in docs])
    elif ext == ".csv":
        df = pd.read_csv(file_path)
        text = df.to_string(index=False)
        # Wrap the CSV text in a LangChain Document so every branch returns the same type
        docs = [Document(page_content=text)]
    else:
        raise ValueError("Unsupported file type")
    return text, docs  # docs for page numbers if needed


## Setting Up the OpenAI Client

To interact with OpenAI’s language models (such as GPT-4), you need to create a client object using your own API key. This allows your application to send prompts and receive responses from OpenAI’s servers.

---

### Example Code

```python
client = OpenAI(api_key='sk-...your-own-api-key-here...')
```
---

### For a Step-by-Step Guide

You can follow this detailed tutorial:  
[How to get your own OpenAI API key (Medium article)](https://medium.com/@lorenzozar/how-to-get-your-own-openai-api-key-f4d44e60c327)

---

### Important Note About API Keys

- **Security:** Never share your OpenAI API key publicly or commit it to version control (like GitHub). Treat it like a password.
- **Personal Key Required:** The API key in the example above is for demonstration only. You must use your own unique API key to access OpenAI services.

---

**In summary:**  
You need your own OpenAI API key to use the language models. Never share your key, and always keep it secure!
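
To keep the key out of the notebook itself, one common option is to read it from an environment variable. The sketch below assumes the key is stored in `OPENAI_API_KEY`; `load_api_key` is a hypothetical helper, not part of the OpenAI library.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read the API key from the environment and fail loudly if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key


# Usage (not executed here, since it needs a real key):
# client = OpenAI(api_key=load_api_key())
```

With this approach the key never appears in the saved notebook or in version control.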

In [None]:
client = OpenAI(api_key='INSERT YOUR API KEY HERE')

## 🔍 `extract_key_terms` Function — Step-by-Step Explanation

Hey! Let’s break down what the `extract_key_terms` function does, step by step, in a clear table format:

| **Step** | **What Happens**                                                                                                    | **Why It’s Important**                                  |
|----------|---------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------|
| 1        | Loops through each key term in the provided list.                                                                   | Ensures all important contract clauses are checked.     |
| 2        | For each term, constructs a prompt asking the AI to extract relevant sections from the contract text.               | Guides the AI to focus on the specific clause.          |
| 3        | Sends the prompt to the OpenAI language model (e.g., GPT-4o) for analysis.                                          | Leverages advanced AI for accurate extraction.          |
| 4        | Receives the AI’s answer, which should include the relevant text and, if possible, page numbers.                    | Provides both the content and its location.             |
| 5        | Uses a regular expression to try to extract the page number from the AI’s answer, if mentioned.                     | Helps with citation and navigation in the document.     |
| 6        | Stores the answer and page number for each key term in a results dictionary.                                        | Organizes results for easy access and further use.      |
| 7        | Returns the dictionary mapping each key term to its extracted answer and page number.                               | Makes the output easy to use in later steps.            |

---
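
Because the prompt asks for a fixed `Clause / Page / Summary` layout, the reply can in principle be split into structured fields instead of being stored as one string. The parser below is a sketch under that assumption; `parse_clauses` is a hypothetical helper, the sample reply is made up, and real model output may not follow the layout exactly.

```python
import re
from typing import Dict, List


def parse_clauses(answer: str) -> List[Dict[str, str]]:
    """Split a 'Clause / Page / Summary' style reply into one dict per clause."""
    if answer.strip().strip("'\"").lower().startswith("not found"):
        return []
    pattern = re.compile(
        r"Clause:\s*(?P<clause>.*?)\s*\n"
        r"Page:\s*(?P<page>.*?)\s*\n"
        r"Summary:\s*(?P<summary>.*?)(?=\nClause:|\Z)",
        re.DOTALL,
    )
    return [
        {key: value.strip() for key, value in match.groupdict().items()}
        for match in pattern.finditer(answer)
    ]


# Made-up reply used only to exercise the parser.
sample = (
    "Clause: 7.1 Governing Law\n"
    "Page: 4\n"
    "Summary:\n"
    "The agreement is governed by the laws of the stated jurisdiction.\n"
)
print(parse_clauses(sample))
```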


In [17]:
def extract_key_terms(text, key_terms):
    results = {}
    for term in key_terms:
        prompt = (
            f"You are a legal document analysis assistant.\n"
            f"Your task is to extract all clause(s) in the following contract that pertain specifically to the term: '{term}'.\n"
            f"For each relevant clause, return the following structured response, keeping the summary extremely concise and to the point (no more than 2 sentences, focusing only on the key obligation or restriction):\n\n"
            f"Clause: <Clause number or title>\n"
            f"Page: <Page number(s) if available>\n"
            f"Summary:\n"
            f"<A very brief summary of the clause, only the main point related to the term>\n\n"
            f"If the term is not found, respond exactly with:\n"
            f"'Not found.'\n\n"
            f"Document:\n{text}..."  # Note: the full text is sent as-is; the trailing "..." is literal, not a truncation
        )
        completion = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a legal contract analysis assistant."},
                {"role": "user", "content": prompt}
            ]
        )
        answer = completion.choices[0].message.content
        print("****************LLM Answer*******************")
        print(answer)
        # Try to extract the first page number, e.g. "Page: 4" or "page 4"
        # (`re` is already imported at the top of the notebook)
        page_number = None
        match = re.search(r'pages?\s*:?\s*(\d+)', answer, re.IGNORECASE)
        if match:
            page_number = match.group(1)
        results[term] = {"answer": answer, "page_number": page_number}
    return results

## 🟢 What is "Ground Truth"?

Before we dive into the function, let’s clarify what **ground truth** means in this context:

> **Ground truth** refers to the correct, reference answers that a human expert would provide after carefully reading and analyzing the contract.  
> These are the *verbatim* sections or clauses from the document that directly address each key term.  
> We use ground truth answers as a gold standard to compare and evaluate how well the AI (LLM) is performing.

---

## 🟢 `extract_ground_truth` Function — Step-by-Step Table

Let’s break down what the `extract_ground_truth` function does, step by step, in a clear table format:

| **Step** | **What Happens**                                                                                                    | **Why It’s Important**                                  |
|----------|---------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------|
| 1        | Loops through each key term in the provided list.                                                                   | Ensures all important contract clauses are checked.     |
| 2        | For each term, constructs a prompt asking the AI to extract the *ground truth* (verbatim text) for that key term.   | Focuses the AI on finding the exact, original wording.  |
| 3        | Sends the prompt to the OpenAI language model (e.g., GPT-4o) for analysis.                                          | Leverages advanced AI for precise extraction.           |
| 4        | Receives the AI’s answer in a structured JSON format, including the answer and page number if available.            | Provides both the content and its location.             |
| 5        | Uses a regular expression to try to extract the page number from the AI’s answer, if mentioned.                     | Helps with citation and navigation in the document.     |
| 6        | Stores the answer and page number for each key term in a results dictionary.                                        | Organizes results for easy access and further use.      |
| 7        | Returns the dictionary mapping each key term to its ground truth answer and page number.                            | Makes the output easy to use in later steps.            |

---
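
Since the prompt requests JSON, the reply could also be parsed with the standard `json` module rather than searched with a regular expression. The following is a sketch of that alternative, not what the function below does; `parse_ground_truth_reply` is a hypothetical helper and the sample reply is made up.

```python
import json
import re
from typing import Dict

FENCE = "`" * 3  # Markdown code-fence marker


def parse_ground_truth_reply(reply: str) -> Dict[str, str]:
    """Parse the model's JSON reply; fall back to the raw text if parsing fails."""
    fallback = {"ground_truth_answer": reply.strip(), "page_number": "Not mentioned"}
    # Models often wrap JSON in a Markdown code fence, so strip one if present.
    cleaned = re.sub(rf"^{FENCE}(?:json)?\s*|\s*{FENCE}$", "", reply.strip())
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback
    return {
        "ground_truth_answer": str(data.get("ground_truth_answer", "Not found")),
        "page_number": str(data.get("page_number", "Not mentioned")),
    }


# Made-up reply used only to exercise the parser.
sample_reply = (
    FENCE + "json\n"
    '{"term": "Governing Law", "ground_truth_answer": "Example clause text.", "page_number": "4"}\n'
    + FENCE
)
print(parse_ground_truth_reply(sample_reply))
```

Parsing the JSON gives the judge the verbatim clause text alone, without the surrounding field names.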

In [18]:
import re

def extract_ground_truth(text, key_terms):
    """
    Extracts ground truth answers for each key term from a legal document.

    Args:
        text (str): The full text of the document.
        key_terms (list): List of key terms to extract.
        

    Returns:
        dict: A dictionary with each key term and its associated extracted answer and page number.
    """
    ground_truth = {}

    for term in key_terms:
        prompt = (
            f"You are a legal document analysis assistant. "
            f"Your task is to extract the *ground truth* from the provided legal document for the key term: '{term}'. "
            f"The ground truth is the exact text (verbatim) from the document that directly addresses or defines the key term. "
            f"If available, also include the page number(s) where this text appears. "
            f"If the key term is not mentioned or no relevant section exists, respond with 'Not found'.\n\n"
            f"Return your response in the following JSON format:\n"
            f'{{\n  "term": "{term}",\n  "ground_truth_answer": "<verbatim text>",\n  "page_number": "<page number or Not mentioned>"\n}}\n\n'
            f"Document:\n{text}..."  # Note: the full text is sent as-is; the trailing "..." is literal, not a truncation
        )


        completion = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a legal contract analysis assistant."},
                {"role": "user", "content": prompt}
            ]
        )

        answer = completion.choices[0].message.content.strip()
        print("****************GROUNDTRUTHANSWER*******************")
        print(answer)

        # Attempt to extract the page number from the answer; the pattern also
        # matches the JSON field, e.g. "page_number": "4"
        page_number = None
        page_match = re.search(r'page(?:s|_number)?["\s:]*(\d+)', answer, re.IGNORECASE)
        if page_match:
            page_number = page_match.group(1)

        ground_truth[term] = {
            "ground_truth_answer": answer,
            "page_number": page_number or "Not mentioned"
        }

    return ground_truth


## 🏆 What Does `judge_key_term` Do?

Let’s talk about how we actually **evaluate** the answers that our LLM extracts from the contract. It’s not enough to just pull out information—we need to judge how good, accurate, and reliable those answers are. That’s where the `judge_key_term` function comes in!

---

### What Are We Doing Here?

This function systematically evaluates how well the extracted answer for each key term matches up to the ground truth (the human-verified answer) using a set of evaluation metrics. It leverages an AI model to provide both a numerical score and a brief justification for each metric, for every key term.

In other words:  
- For every key term (like "Service Warranty" or "Payment Terms"),  
- For every evaluation metric (like "Was the information complete?"),  
- We ask the AI to **score** the extracted answer and **explain** its reasoning.

This gives us a detailed, multi-dimensional assessment of the LLM’s performance!

---

### Step-by-Step Table

| **Step** | **What Happens**                                                                                                    | **Why It’s Important**                                  |
|----------|---------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------|
| 1        | Loops through each key term in the provided list.                                                                   | Ensures every important contract clause is evaluated.   |
| 2        | For each key term, retrieves the LLM-extracted answer and the ground truth answer.                                  | Sets up the comparison for evaluation.                  |
| 3        | For each evaluation metric, constructs a prompt asking the AI to judge the extracted answer against the metric.     | Focuses the AI on a specific aspect of answer quality.  |
| 4        | Sends the prompt to the OpenAI language model (e.g., GPT-4o) and receives a response with:                          | Leverages AI for consistent, expert-like evaluation.    |
|          | - A score from 1 (poor) to 5 (excellent)                                                                            |                                                         |
|          | - A brief justification (1-2 sentences) explaining the score                                                        |                                                         |
| 5        | Parses the score and justification from the AI’s response using regular expressions.                                | Converts the AI’s output into structured data.          |
| 6        | Compiles the results for each metric, including key term, answers, metric name, score, and justification.           | Organizes evaluation data for easy analysis.            |
| 7        | Returns a list of all evaluation results for further processing or display.                                         | Provides a comprehensive evaluation report.             |
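
Step 5 can be isolated into a small helper that is easy to test without calling the API. This is a sketch: `parse_judge_reply` is a hypothetical name, and the range check on the score is an added safeguard that the function below does not perform.

```python
import re
from typing import Optional, Tuple


def parse_judge_reply(content: str) -> Tuple[Optional[int], str]:
    """Extract (score, justification) from a 'Score: / Justification:' reply."""
    score_match = re.search(r"Score:\s*(\d+)", content)
    justification_match = re.search(r"Justification:\s*(.*)", content, re.DOTALL)
    score = int(score_match.group(1)) if score_match else None
    # Treat anything outside the 1-5 scale as unparseable (added safeguard).
    if score is not None and not 1 <= score <= 5:
        score = None
    justification = (
        justification_match.group(1).strip() if justification_match else content.strip()
    )
    return score, justification


# Made-up reply used only to exercise the parser.
print(parse_judge_reply("Score: 4\nJustification: Covers the clause but omits the cure period."))
```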



In [19]:
def judge_key_term(key_terms, extract_key_terms_response, extract_ground_truth_response, metrics):
    results = []
    for term in key_terms:
        llm_answer = extract_key_terms_response.get(term, {}).get("answer", "Not found")
        ground_truth = extract_ground_truth_response.get(term, {}).get("ground_truth_answer", "Not found")
        page_number = extract_key_terms_response.get(term, {}).get("page_number", None)
        for metric in metrics:
            prompt = (
                f"You are an expert contract evaluator. "
                f"Evaluate the following extracted answer for the key term '{term}' "
                f"against the evaluation metric: '{metric}'.\n"
                f"Extracted Answer: {llm_answer}\n"
                f"Ground Truth Answer: {ground_truth}\n"
                "For this metric, provide:\n"
                "- A score from 1 (poor) to 5 (excellent)\n"
                "- A brief justification (1-2 sentences)\n"
                "Respond in the format: Score: <number>\nJustification: <text>"
            )
            completion = client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {"role": "system", "content": "You are a contract evaluation expert."},
                    {"role": "user", "content": prompt}
                ]
            )
            content = completion.choices[0].message.content
            print("*********JUDGE EVALUATION ANSWER*********")
            print(content)
            # `re` is already imported at the top of the notebook
            score_match = re.search(r"Score:\s*(\d+)", content)
            justification_match = re.search(r"Justification:\s*(.*)", content, re.DOTALL)
            score = int(score_match.group(1)) if score_match else None
            justification = justification_match.group(1).strip() if justification_match else content
            results.append({
                "key_term_name": term,
                "llm_extracted_ans_from_doc": llm_answer,
                "page_number": page_number,
                "ground_truth_answer": ground_truth,
                "evulation_metric_name": metric,
                "score": score,
                "justification": justification
            })
    return results


### What Does `mark_evaluation_pass_fail` Do?

This function takes your evaluation results (either as a list of dictionaries or a DataFrame) and adds a new column called `is_pass`.  
- If the score for a metric is 3 or higher, `is_pass` is set to `True` (pass).
- If the score is below 3 or missing, `is_pass` is set to `False` (fail).

This makes it easy to quickly see which evaluations meet your passing criteria!
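
The same rule can be expressed without pandas, which is handy for a quick summary such as the pass rate per metric. The sketch below uses only the standard library; `PASS_THRESHOLD`, `is_pass`, `pass_rate_by_metric`, and the `"metric"` row key are illustrative names, and the sample rows are made up.

```python
from collections import defaultdict
from typing import Dict, List, Optional

PASS_THRESHOLD = 3  # scores of 3 or higher pass, as described above


def is_pass(score: Optional[int]) -> bool:
    """A missing score counts as a fail."""
    return score is not None and score >= PASS_THRESHOLD


def pass_rate_by_metric(rows: List[dict]) -> Dict[str, float]:
    """Return the fraction of passing rows for each metric."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for row in rows:
        metric = row["metric"]  # hypothetical key name for this sketch
        totals[metric] += 1
        passes[metric] += int(is_pass(row["score"]))
    return {metric: passes[metric] / totals[metric] for metric in totals}


# Made-up rows used only to exercise the helper.
rows = [
    {"metric": "Was the information complete?", "score": 4},
    {"metric": "Was the information complete?", "score": 2},
    {"metric": "Were results free from misleading claims?", "score": None},
]
print(pass_rate_by_metric(rows))
```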

In [20]:
def mark_evaluation_pass_fail(evals):
    """
    Adds a column 'is_pass' to the evaluation results, marking True if score >= 3, else False.
    Accepts either a list of dicts or a pandas DataFrame.
    Returns a DataFrame with the new column.
    """
    if isinstance(evals, list):
        df = pd.DataFrame(evals)
    else:
        df = evals.copy()
    # Missing or non-numeric scores become NaN, which compares as False
    df['is_pass'] = pd.to_numeric(df['score'], errors='coerce').ge(3)
    return df

## 🚦 What Does `process_documents` Do?

This function is the main driver for the contract analysis workflow. It ties together all the core steps: reading the contract, extracting key terms, comparing to ground truth (if available), evaluating the results, and formatting everything for easy display.

---

### What Are We Doing Here?

- We start by extracting all the text from the uploaded contract file.
- Next, we use the LLM to extract the key terms from the contract.
- If a ground truth file is provided, we extract the reference answers for each key term; otherwise, we mark them as "Not found."
- We then evaluate each key term’s extracted answer against the ground truth using all our evaluation metrics, scoring and justifying each one.
- Finally, we organize the results into DataFrames for easy display, splitting them into three groups based on the Helpful, Honest, and Harmless metric categories.

---

### Step-by-Step Table

| **Step** | **What Happens**                                                                                                    | **Why It’s Important**                                  |
|----------|---------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------|
| 1        | Extracts text and document objects from the uploaded contract file using `extract_text_from_file`.                   | Converts the contract into a format suitable for further analysis.       |
| 2        | Extracts key terms from the contract text using `extract_key_terms` and the predefined `KEY_TERMS` list.            | Identifies and isolates the most important clauses in the contract.      |
| 3        | (If provided) Extracts ground truth answers for each key term using `extract_ground_truth`.                         | Provides a reference for evaluating the LLM’s answers.                   |
| 4        | Judges each key term’s extracted answer against the ground truth using `judge_key_term` and all evaluation metrics. | Produces a set of scores and justifications for each metric.             |
| 5        | Formats each evaluation result for display, extracting clean text and page numbers.                                 | Keeps results organized and traceable.                                   |
| 6        | Prepares a DataFrame with all results, arranging columns in a clear order.                                          | Makes it simple to present and analyze the results in tabular form.      |
| 7        | Splits the DataFrame into three based on metric category (Helpful, Honest, Harmless).                              | Allows for focused review of each evaluation dimension.                  |
| 8        | Returns the extracted contract text and all three DataFrames (plus the full one).                                  | Provides all necessary outputs for downstream use (e.g., UI display).    |
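
Step 7 relies on list slicing plus `isin` filtering. The sketch below shows that idea in isolation. The metric names are placeholders (the notebook's real `EVALUATION_METRICS` list is defined earlier and may differ); only the 4 / 4 / rest grouping mirrors the function.

```python
import pandas as pd

# Placeholder metric names, assumed for illustration only
EVALUATION_METRICS = [
    "relevance", "completeness", "clarity", "usefulness",    # Helpful
    "accuracy", "faithfulness", "citation", "consistency",   # Honest
    "bias", "toxicity", "privacy",                           # Harmless
]

# A tiny results table with one row per (key term, metric) evaluation
df = pd.DataFrame({
    "evulation_metric_name": ["relevance", "accuracy", "bias", "clarity"],
    "score": [4, 5, 3, 2],
})

# Slice the metric list into the three categories by position
groups = {
    "Helpful": EVALUATION_METRICS[:4],
    "Honest": EVALUATION_METRICS[4:8],
    "Harmless": EVALUATION_METRICS[8:],
}

# Keep only the rows whose metric belongs to each category
tables = {
    name: df[df["evulation_metric_name"].isin(metrics)].reset_index(drop=True)
    for name, metrics in groups.items()
}

for name, table in tables.items():
    print(name, len(table))
```

Because the grouping depends on position, reordering `EVALUATION_METRICS` would silently move metrics between categories. Keeping the list ordered by category is what makes this approach work.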

In [21]:
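def process_documents(contract_file, ground_truth_file=None):
    """
    Runs the full contract analysis pipeline.
    Returns the extracted contract text, one DataFrame per metric group
    (Helpful, Honest, Harmless), and the full results DataFrame.
    """
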
    # Step 1: Extract text from contract file
    text, docs = extract_text_from_file(contract_file)
    
    # Step 2: Extract key terms
    key_term_results = extract_key_terms(text, KEY_TERMS)
    
    # Step 3: Extract ground truth if provided
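    if ground_truth_file is not None:
        # Gradio may pass a path string or a file-like object exposing .name, depending on version
        ground_truth_path = getattr(ground_truth_file, "name", ground_truth_file)
        with open(ground_truth_path, 'r', encoding='utf-8') as f:
            ground_truth_text = f.read()
        ground_truth_results = extract_ground_truth(ground_truth_text, KEY_TERMS)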
    else:
        ground_truth_results = {term: {"ground_truth_answer": "Not found", "page_number": "Not found"} for term in KEY_TERMS}
    
    # Step 4: Judge each key term
    evals = judge_key_term(
        KEY_TERMS,
        key_term_results,
        ground_truth_results,
        EVALUATION_METRICS
    )

    # Step 5: Format each evaluation result for display
    for e in evals:
        term = e["key_term_name"]
        llm_ans = e["llm_extracted_ans_from_doc"]
        # Extract only the text after 'Text:'
        if llm_ans:
            text_match = re.search(r'Text:\s*(.*)', llm_ans, re.DOTALL)
            e["llm_extracted_ans_from_doc"] = text_match.group(1).strip() if text_match else llm_ans
            # Extract page number from LLM answer
            page_match = re.search(r'Page:\s*(\d+)', llm_ans)
            e["llm_page_number"] = page_match.group(1) if page_match else "Not found"
        else:
            e["llm_page_number"] = "Not found"
        # Show only the ground_truth_answer value and page number
        gt = ground_truth_results.get(term, {})
        e["ground_truth_answer"] = gt.get("ground_truth_answer", "Not found")
        # e["ground_truth_answer_page_number"] = gt.get("page_number", "Not found")
    
    # Step 6: Prepare DataFrame with new columns in the correct order
    df = pd.DataFrame(evals)
    display_cols = [
        "key_term_name",
        "llm_extracted_ans_from_doc",
        "llm_page_number",
        "ground_truth_answer",
        # "ground_truth_answer_page_number",
        "evulation_metric_name",
        "score",
        "justification"
    ]
    df = df[display_cols]
    
    # Step 7: Split DataFrame into three based on metric index
    metric_groups = [EVALUATION_METRICS[:4], EVALUATION_METRICS[4:8], EVALUATION_METRICS[8:]]
    df1 = df[df["evulation_metric_name"].isin(metric_groups[0])].reset_index(drop=True)
    df2 = df[df["evulation_metric_name"].isin(metric_groups[1])].reset_index(drop=True)
    df3 = df[df["evulation_metric_name"].isin(metric_groups[2])].reset_index(drop=True)
    return text, df1, df2, df3, df
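
The regex handling in Step 5 is easy to check on its own. The helper below, `split_answer`, is a hypothetical name introduced only for this sketch; it applies the same two patterns as the loop above. The sample answer is invented, and the `Text:` label is an assumption about the answer format the extraction prompt asks for.

```python
import re

def split_answer(llm_ans):
    """Return (clean_text, page_number) using the same patterns as Step 5."""
    if not llm_ans:
        return llm_ans, "Not found"
    # DOTALL lets '.' match newlines, so multi-line clause text is captured in full
    text_match = re.search(r'Text:\s*(.*)', llm_ans, re.DOTALL)
    page_match = re.search(r'Page:\s*(\d+)', llm_ans)
    clean_text = text_match.group(1).strip() if text_match else llm_ans
    page = page_match.group(1) if page_match else "Not found"
    return clean_text, page

# Invented answer in the assumed "Clause / Page / Text" layout
sample = "Clause: Governing Law\nPage: 8\nText: This Agreement is governed by the laws of India."

print(split_answer(sample))        # text after 'Text:' plus the page number
print(split_answer("Not found."))  # no labels present, so the answer is kept unchanged
```

Note the fallback: when an answer has no `Text:` label, the whole answer is kept as-is, so a change in the LLM's output format does not drop content but does leave the labels in the displayed text.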

## Gradio App Interface: LLM Contract Judge

This section defines the interactive web interface for the contract analysis tool using Gradio. The interface allows users to upload contract files, extract key terms, evaluate them using an LLM, and view the results in a user-friendly format.

---

| UI Element / Step         | Description                                                                                                   | Why It’s Important                                                      |
|---------------------------|---------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------|
| **App Container**         | Uses `gr.Blocks()` to create a modular, flexible Gradio app.                                                  | Allows for a clean, organized, and interactive user interface.           |
| **Title Markdown**        | Displays a title and brief instructions at the top of the app.                                                | Helps users understand the app’s purpose and how to use it.              |
| **File Upload Widgets**   | Two `gr.File` inputs in one row: the contract (PDF, DOCX, TXT) and an optional ground truth file.             | Lets users provide the document to analyze and a reference to judge against. |
| **Start Button**          | A button labeled "Start Evaluating" to start the analysis process.                                            | Gives users control over when to begin processing.                       |
| **Extracted Text Box**    | Displays the extracted text from the uploaded contract.                                                       | Offers transparency and lets users review what was extracted.            |
| **Results Tabs**          | Three tabs (Helpful, Honest, and Harmless Metrics), each holding its own results table.                       | Organizes results for focused review of each evaluation dimension.       |
| **Results Dataframes**    | Show columns for key term, extracted answer, page number, ground truth answer, evaluation metric, score, and justification. | Presents detailed evaluation results in a structured, readable way.      |
| **Human Evaluation Tab**  | Shows a Yes/No table derived from the scores: a score of 3 or higher is "Yes", anything lower is "No".        | Makes it quick to spot which key terms and metrics pass the threshold.   |
| **Download Button**       | A button labeled "Download All Results as CSV" that writes the three metric tables to a single CSV file.      | Lets users keep and share the results outside the app.                   |
| **`run_and_return_tables` Function** | Runs the full analysis pipeline when the start button is clicked and stores the tables in `gr.State` objects. | Connects the UI to the backend logic and keeps results available for download. |
| **Button Click Events**   | Link "Start Evaluating" to `run_and_return_tables` and the download button to `download_csv`.                 | Ensures user actions trigger the correct processing workflow.            |
| **App Launch**            | Calls `demo.launch()` to start the Gradio app and make it accessible in the browser.                          | Makes the tool available for interactive use.                            |

---

### Why This Matters

- **User-Friendly:**  
  The Gradio interface makes it easy for users to interact with complex AI-powered contract analysis tools without needing to write code.

- **Transparency:**  
  Users can see both the raw extracted text and the detailed evaluation results, increasing trust in the tool.

- **Efficiency:**  
  The app streamlines the workflow from document upload to actionable insights, all in one place.

---

**In summary:**  
This Gradio app provides an accessible, interactive front-end for your contract analysis pipeline, allowing users to upload documents, trigger analysis, and review results with ease.

When you run the last cell in your notebook, you’ll see a message like the one shown in the image below. Click the "Running on local URL" link to open a new screen where you can interact with the LLM Contract Judge app.

![Gradio Local URL Example](Images/img-1.png)

Once you have completed the lab, the UI should look similar to the image below:

![Gradio Local URL Example](Images/img-2.png)

In [22]:
import gradio as gr
import io

def get_human_table(df_all):
    df_human = mark_evaluation_pass_fail(df_all)
    return df_human[["key_term_name", "evulation_metric_name", "is_pass", "justification"]]

with gr.Blocks() as demo:
    gr.Markdown("# 📄 LLM Contract Judge\nUpload a contract, extract key terms, and evaluate with LLM.")
    with gr.Row():
        contract_file = gr.File(label="Upload Contract (PDF, DOCX, TXT)")
        ground_truth_file = gr.File(label="Upload Ground Truth File (TXT, CSV, etc.)")
    start_btn = gr.Button("Start Evaluating")
    extracted_text = gr.Textbox(label="Extracted Contract Text", lines=10, interactive=False)
    with gr.Tabs():
        with gr.TabItem("Helpful Metrics"):
            results_table1 = gr.Dataframe(headers=[
                "key_term_name",
                "llm_extracted_ans_from_doc",
                "llm_page_number",
                "ground_truth_answer",
                "evulation_metric_name",
                "score",
                "justification"
            ], label="Evaluation Results (Helpful Metrics)")
        with gr.TabItem("Honest Metrics"):
            results_table2 = gr.Dataframe(headers=[
                "key_term_name",
                "llm_extracted_ans_from_doc",
                "llm_page_number",
                "ground_truth_answer",
                "evulation_metric_name",
                "score",
                "justification"
            ], label="Evaluation Results (Honest Metrics)")
        with gr.TabItem("Harmless Metrics"):
            results_table3 = gr.Dataframe(headers=[
                "key_term_name",
                "llm_extracted_ans_from_doc",
                "llm_page_number",
                "ground_truth_answer",
                "evulation_metric_name",
                "score",
                "justification"
            ], label="Evaluation Results (Harmless Metrics)")
        with gr.TabItem("Human Evaluation"):
            gr.Markdown(
                """
                ### Human Evaluation (Yes/No)
                - **Note:** Here we evaluate each metric in a Yes/No format.
                - If the score is less than 3, it is marked as **No** (False); otherwise, it is **Yes** (True).
                - This helps quickly identify which key terms and metrics pass the threshold for acceptability.
                """
            )
            human_table = gr.Dataframe(headers=[
                "key_term_name",
                "evulation_metric_name",
                "is_pass",
                "justification"
            ], label="Human Evaluation (Yes/No)")
    download_btn = gr.Button("Download All Results as CSV")
    download_file = gr.File(label="Download CSV")

    # Define state objects for DataFrames
    state_df1 = gr.State()
    state_df2 = gr.State()
    state_df3 = gr.State()
    state_df_human = gr.State()

    def run_and_return_tables(contract_file, ground_truth_file):
        text, df1, df2, df3, df_all = process_documents(contract_file, ground_truth_file)
        df_human = get_human_table(df_all)
        # Convert is_pass to Yes/No for display
        df_human = df_human.copy()
        df_human["is_pass"] = df_human["is_pass"].apply(lambda x: "Yes" if x else "No")
        return (
            text,
            gr.update(value=df1),
            gr.update(value=df2),
            gr.update(value=df3),
            gr.update(value=df_human),
            df1, df2, df3, df_human
        )

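    def download_csv(contract_file, ground_truth_file, df1, df2, df3, df_human):
        """Write the three metric tables to a temporary CSV file and return its path.

        contract_file, ground_truth_file, and df_human are accepted only to match
        the inputs wired to download_btn.click; they are not used here.
        """
        import tempfile
        tables = [df for df in (df1, df2, df3) if df is not None]
        if not tables:
            # Nothing has been evaluated yet, so there is nothing to export
            return None
        combined = pd.concat(tables, ignore_index=True)
        # delete=False keeps the file on disk so Gradio can serve it after the handle closes;
        # newline="" avoids extra blank rows in the CSV on Windows
        with tempfile.NamedTemporaryFile(delete=False, suffix=".csv", mode="w", encoding="utf-8", newline="") as tmp:
            combined.to_csv(tmp, index=False)
            tmp_path = tmp.name
        return tmp_path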

    start_btn.click(
        run_and_return_tables,
        inputs=[contract_file, ground_truth_file],
        outputs=[extracted_text, results_table1, results_table2, results_table3, human_table, state_df1, state_df2, state_df3, state_df_human]
    )
    download_btn.click(
        download_csv,
        inputs=[contract_file, ground_truth_file, state_df1, state_df2, state_df3, state_df_human],
        outputs=download_file
    )

demo.launch()


Running on local URL:  http://127.0.0.1:7861

To create a public link, set `share=True` in `launch()`.

****************LLM Answer*******************
Not found.
****************LLM Answer*******************
Clause: 10. LIMITATIONS OF LIABILITY
Page: 5
Summary: Neither party is liable for indirect, incidental, special, or consequential damages, and aggregate liability for damages is limited to the fees paid by the customer in the 12 months preceding the claim.
****************LLM Answer*******************
Clause: Governing Law and Jurisdiction
Page: 8
Summary: This Agreement is governed by the laws of India, with the courts of Bangalore, Karnataka, having exclusive jurisdiction.
****************LLM Answer*******************
Clause: 7.4 Suspension for Ongoing Harm
Page: 4
Summary: Whatfix has the right to suspend delivery of services if Customer's or End User's use causes harm, and it may lead to termination if unresolved, affecting Customer's obligations and access.

Clause: 7.5 Immediate Termination Criteria
Page: 4
Summary: Allows termination if a party commits a material breach that is

Traceback (most recent call last):
  File "c:\Users\sachi\AppData\Local\Programs\Python\Python311\Lib\site-packages\gradio\queueing.py", line 536, in process_events
    response = await route_utils.call_process_api(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\sachi\AppData\Local\Programs\Python\Python311\Lib\site-packages\gradio\route_utils.py", line 322, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\sachi\AppData\Local\Programs\Python\Python311\Lib\site-packages\gradio\blocks.py", line 1935, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\sachi\AppData\Local\Programs\Python\Python311\Lib\site-packages\gradio\blocks.py", line 1520, in call_function
    prediction = await anyio.to_thread.run_sync(  # type: ignore
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\sachi\AppData\Local\P