# LLM-as-a-Judge Simply Explained: A Complete Guide to Run LLM Evals

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](<https://colab.research.google.com/github/sachin0034/hands_on_AI_introduction_to_AI_evaluations-4038348/blob/main/Lab-1%28LLM_as_judge%29/LLM_As_A_Judge.ipynb>)

"LLM-as-a-judge" is a technique where large language models — like GPT — are used to evaluate the quality of outputs generated by other AI models or systems. Instead of relying on human evaluators, which can be time-consuming and expensive, we use an LLM to act as the judge, scoring or ranking generated content based on factors like correctness, coherence, relevance, or even tone and style.

This approach became popular because evaluating open-ended text (like summaries, chatbot replies, or creative writing) is inherently subjective. Traditional metrics like accuracy or BLEU scores often fall short since there’s no single 'right' answer. LLMs help fill that gap by providing nuanced judgments, often closer to how a human would interpret or assess the output.

So in essence, LLM-as-a-judge is a scalable, cost-effective, and surprisingly reliable way to evaluate the quality of language model outputs — especially when human evaluation isn’t feasible at scale.

---

## Getting Started

---

## Prerequisites

Before you get started, please make sure you have the following ready:

### 1. Sample Contract File for Testing

To try out the contract analysis workflow, download the sample contract file provided below:

- [Download Sample Contract (Google Drive)](https://drive.google.com/file/d/1E557kdNBZ5cDUvVDLNrEVRuKcRSYDG3Z/view?usp=sharing)

### 2. OpenAI API Key

You’ll need your own OpenAI API key to access the language models used for contract evaluation. If you don’t have one yet, follow this step-by-step guide to generate your API key:

- [How to get your own OpenAI API key (Medium article)](https://medium.com/@lorenzozar/how-to-get-your-own-openai-api-key-f4d44e60c327)

---

## Step 1: Install the Dependencies
Each dependency serves a specific purpose in the LLM Judge Lab:

| Package                 | Purpose / Use in Project                                                                                     |
|-------------------------|--------------------------------------------------------------------------------------------------------------|
| **gradio**              | Builds a web-based UI for interaction. Allows users to input text, upload files, and view model evaluations. |
| **langchain**           | Manages the logic of LLM interactions, from document processing to chaining LLM calls.                       |
| **langchain-community** | Provides the document loaders (`PyPDFLoader`, `Docx2txtLoader`, `TextLoader`) used to read uploaded files.   |
| **openai**              | Connects the system to OpenAI’s models (e.g., GPT-4) for generating judgments or scores.                     |
| **docx2txt**            | Required by `Docx2txtLoader` to extract text from `.docx` files for evaluation.                              |
| **pypdf**               | Required by `PyPDFLoader` to extract text from PDFs, enabling the model to assess uploaded PDF documents.    |
| **pandas**              | Structures and displays results in tables or dataframes for better analysis and comparison.                  |

In [None]:
# Install necessary packages
! pip install gradio langchain langchain-community openai docx2txt pypdf pandas

## Step 2: Import Dependencies

Now that you've installed all the necessary libraries, it's time to import them into your Python script or Jupyter notebook.

- Start by importing **Gradio** to build the interactive web interface for your LLM-as-a-judge lab.

- Next, bring in **document loaders from LangChain** — specifically for handling PDF, DOCX, and plain text files. These will help you extract content from user-uploaded documents.

- Then, import the **OpenAI client**, which you'll use to connect to models like GPT-4 for analyzing and judging text.

- You’ll also want **Pandas** to organize and display results in table formats, especially when dealing with comparisons or scores.

- Finally, include Python’s built-in **os** and **tempfile** modules. These are useful for file path handling and safely working with temporary files during processing.

Once these imports are in place, you're ready to move on to building the file processing and evaluation pipeline!


In [None]:
import gradio as gr
# Document loaders live in langchain-community; Document lives in langchain-core
from langchain_community.document_loaders import PyPDFLoader, Docx2txtLoader, TextLoader
from langchain_core.documents import Document
from openai import OpenAI
import pandas as pd
import os
import io
import tempfile
import re
import time

## Step 3: Key Terms and Evaluation Metrics

Here we define the key terms and the evaluation metrics. Key terms are the specific pieces of information to extract from the contract; evaluation metrics are the criteria used to judge the LLM's responses.

### Key Terms

These are the critical contract clauses we want the LLM to extract and analyze:

- **Product Name**  
- **Limitation of Liability In Months**  
- **Governing Law**  

Each of these helps focus the LLM’s attention on high-priority legal elements.

---

### Evaluation Metrics

Each metric belongs to one of three categories: Helpful, Honest, or Harmless.

| Category     | Metric                                                      | What It Measures                                                 |
|--------------|-------------------------------------------------------------|------------------------------------------------------------------|
| **Helpful**  | Was the information extracted as per the question asked?    | Did the answer directly address the key term?                    |
| **Helpful**  | Was the information complete?                               | Is all relevant information included?                            |
| **Helpful**  | Was the information enough to make a conclusive decision?   | Is the answer sufficient for decision-making?                    |
| **Helpful**  | Were associated red flags covered in the extracted output?  | Are potential issues or risks mentioned?                         |
| **Honest**   | Was the information extracted from all relevant clauses?    | Are multiple relevant sections included if needed?               |
| **Honest**   | Was the page number of extracted information correct?       | Are page references accurate?                                    |
| **Honest**   | Was the AI reasoning discussing the relevant clause?        | Is the explanation focused on the right part?                    |
| **Honest**   | Does the information stay within document scope?            | Is the answer limited to the uploaded contract?                  |
| **Harmless** | Were results free from misleading claims?                   | Are there any false or misleading statements?                    |
| **Harmless** | Does the tool avoid generic/non-contract answers?           | Is the answer specific to the contract, not generic?             |
| **Harmless** | Did the AI avoid illegal or insensitive justifications?     | Are explanations appropriate and lawful?                         |
| **Harmless** | Did the tool prevent false claims about people/entities?    | Are there any incorrect statements about parties?                |
| **Harmless** | Did the tool avoid hateful/profane content?                 | Is the output free from inappropriate language?                  |




In [3]:
KEY_TERMS = [
    "Product Name",
    "Limitation of Liability In Months",
    "Governing Law"
]

EVALUATION_METRICS = [
    "Was the information extracted as per the question asked in the key term?",
    "Was the information complete?",
    "Was the information enough to make a conclusive decision?",
    "Was the AI reasoning discussing the relevant clause?",
    "Does the information stay within document scope?",
    "Were results free from misleading claims?",
    "Does the tool avoid generic/non-contract answers?",
    "Did the tool prevent false claims about people/entities?"
]

# Additional evaluation metrics you can use; to enable one, add it to EVALUATION_METRICS above:
# - Were associated red flags covered in the extracted output?
# - Was the information extracted from all relevant clauses?
# - Was the page number of extracted information correct?
# - Did the AI avoid illegal or insensitive justifications?
# - Did the tool avoid hateful/profane content?
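
The three categories from the metrics table can also be kept alongside the metric text in code. The sketch below is one possible way to do that; `METRIC_CATEGORIES`, `flatten_metrics`, and `category_of` are illustrative names and are not part of the original notebook.

```python
# Hypothetical mapping from category to metric text (names are illustrative).
METRIC_CATEGORIES = {
    "Helpful": [
        "Was the information extracted as per the question asked in the key term?",
        "Was the information complete?",
        "Was the information enough to make a conclusive decision?",
    ],
    "Honest": [
        "Was the AI reasoning discussing the relevant clause?",
        "Does the information stay within document scope?",
    ],
    "Harmless": [
        "Were results free from misleading claims?",
        "Does the tool avoid generic/non-contract answers?",
        "Did the tool prevent false claims about people/entities?",
    ],
}


def flatten_metrics(categories):
    """Return all metrics as one flat list, preserving category order."""
    return [metric for metrics in categories.values() for metric in metrics]


def category_of(metric, categories=METRIC_CATEGORIES):
    """Return the category a metric belongs to, or None if it is unassigned."""
    for name, metrics in categories.items():
        if metric in metrics:
            return name
    return None


ALL_METRICS = flatten_metrics(METRIC_CATEGORIES)
print(len(ALL_METRICS))
print(category_of("Was the information complete?"))
```

Keeping the category next to each metric means a later grouping step can look the category up instead of relying on list positions.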


## Step 4: Extract Text from Documents

This step ensures that all files—no matter the format—are converted into a **standardized format** for the LLM to analyze.  
It keeps the pipeline consistent, reliable, and ready for downstream tasks like key term extraction or clause classification.

### What File Types Are Supported?

| File Type | Extensions      | Extracted Using  |
|-----------|-----------------|------------------|
| 📄 PDF    | `.pdf`          | `PyPDFLoader`    |
| 📝 Word   | `.docx`, `.doc` | `Docx2txtLoader` |
| 📃 Text   | `.txt`          | `TextLoader`     |
| 📊 CSV    | `.csv`          | `pandas`         |

### What Does the Function Return?

- `text`: Complete raw text from the document  
- `docs`: Structured content, including page-wise segmentation (useful for referencing clauses by page)


In [4]:
def extract_text_from_file(file_path):
    # Get the file extension and convert it to lowercase
    ext = os.path.splitext(file_path)[1].lower()

    # Handle PDF files
    if ext == ".pdf":
        loader = PyPDFLoader(file_path)  # Use PyPDFLoader to read PDF
        docs = loader.load()  # Load document into LangChain Document objects
        text = "\n".join([doc.page_content for doc in docs])  # Combine all page content

    # Handle Word documents (.docx, .doc)
    elif ext in [".docx", ".doc"]:
        loader = Docx2txtLoader(file_path)  # Use Docx2txtLoader for Word files
        docs = loader.load()
        text = "\n".join([doc.page_content for doc in docs])

    # Handle plain text files
    elif ext in [".txt"]:
        loader = TextLoader(file_path)  # Use TextLoader for .txt files
        docs = loader.load()
        text = "\n".join([doc.page_content for doc in docs])

    # Handle CSV files
    elif ext == ".csv":
        df = pd.read_csv(file_path)  # Read CSV using pandas
        text = df.to_string(index=False)  # Convert DataFrame to plain string
        # Wrap in a LangChain Document so the return structure matches the loaders
        docs = [Document(page_content=text)]

    # Unsupported file types
    else:
        raise ValueError("Unsupported file type")

    # Return both raw text and structured docs for further processing
    return text, docs  # docs may include page-level details
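
When a loader returns one document per page (as `PyPDFLoader` does), the `docs` list can be used to find the page on which a clause appears. The sketch below illustrates the idea with a small stand-in class and made-up sample text; `Page` and `find_pages` are hypothetical names, and real LangChain `Document` objects expose the same `page_content` attribute.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Stand-in for a LangChain Document: page text plus optional metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def find_pages(docs, phrase):
    """Return 1-based page numbers whose text contains `phrase` (case-insensitive)."""
    needle = phrase.lower()
    return [
        number
        for number, doc in enumerate(docs, start=1)
        if needle in doc.page_content.lower()
    ]


# Made-up sample pages, for illustration only
docs = [
    Page("This Agreement covers the product Acme Widget."),
    Page("Limitation of liability: twelve (12) months of fees."),
    Page("Governing law: the laws of the State of New York."),
]
print(find_pages(docs, "governing law"))  # [3]
```

For `.docx`, `.txt`, and `.csv` inputs the list typically holds a single document, so a page lookup like this is only meaningful for PDFs.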


## Step 5: Set Up OpenAI Client

To use GPT models, you’ll need to set up access using your **OpenAI API key**.

### Requirements

- You **must** have a valid OpenAI API key to proceed.
- **Keep your API key secure** — never commit it to public repositories or share it.

### How to Get an API Key

> **Don’t have one yet?**  
> Follow this simple guide to create your own API key:  
> 👉 [How to get your own OpenAI API key (Medium article)](https://medium.com/@lorenzozar/how-to-get-your-own-openai-api-key-f4d44e60c327)

Once you have the key, place it in the code here:

```python
api_key = '<Insert Your API Key>'
```

In [None]:
api_key = '<Insert Your API Key>'

try:
    client = OpenAI(api_key=api_key)
    # Minimal API call to check if the key is valid
    client.models.list()
    print("✅ OpenAI API key is valid.")
except Exception as e:
    print("❌ Invalid OpenAI API key or connection error:", e)
    raise RuntimeError("OpenAI API key check failed. Please provide a valid key.")
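
Since the key should never end up in a repository, one common alternative to pasting it into the notebook is to read it from an environment variable. The sketch below assumes the variable is named `OPENAI_API_KEY` (the name the OpenAI SDK also looks for by default); `load_api_key` is a hypothetical helper, not part of the original notebook.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Read the API key from an environment variable instead of the source code."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key


# Example usage (uncomment once the variable is set in your environment):
# api_key = load_api_key()
```

In Colab you would set the variable in the session before calling the helper; locally, exporting it in your shell works the same way.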

## Step 6: Extract Key Terms from the Document

The **`extract_key_terms(text, key_terms)`** function is used to pull out and summarize the most important legal clauses from your uploaded document.

It takes two inputs:

- `text`: The full contract content, extracted earlier using `extract_text_from_file()`
- `key_terms`: A list of specific legal terms you defined earlier (e.g. "Product Name", "Governing Law")

---

### How the Function Works:

Once the document is uploaded and text is extracted, this function:

1. Loops through each key term from the list.
2. For each term:
   - It sends a carefully crafted prompt (along with the full text) to the OpenAI model.
   - It asks the model to find the key term, provide its value, and return it in a structured **JSON format**.
3. Then, it:
   - Parses the JSON response from the model.
   - Extracts the **value** and **page number** (if mentioned).
   - Stores the result in a dictionary.

---

✅ This helps transform lengthy legal documents into clear, structured insights you can quickly review and evaluate.


In [None]:
def extract_key_terms(text, key_terms):
    import json
    import re

    def safe_json_parse(response_text):
        """Safely parse JSON from LLM response with fallback strategies."""
        # Strategy 1: Try direct JSON parsing
        try:
            return json.loads(response_text.strip())
        except json.JSONDecodeError:
            pass

        # Strategy 2: Extract from code blocks
        json_patterns = [
            r'```json\s*(.*?)\s*```',
            r'```\s*(.*?)\s*```',
            r'\{.*\}'
        ]

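        for pattern in json_patterns:
            json_match = re.search(pattern, response_text, re.DOTALL)
            if json_match:
                # Use the captured group when the pattern has one (code-block
                # patterns); otherwise use the whole match (bare {...} pattern)
                candidate = json_match.group(1) if json_match.lastindex else json_match.group(0)
                try:
                    return json.loads(candidate.strip())
                except json.JSONDecodeError:
                    continue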

        # Fallback: return default structure
        return {"Value": "Not found", "Page Number": None, "Section": None}

    results = {}
    for term in key_terms:
        prompt = (
            f"Act as a legal expert. From this contract text, extract the value for '{term}'.\n\n"
            f"Contract Text: {text}\n\n"
            f"Instructions:\n"
            f"1. Provide a one-word answer for '{term}' if found\n"
            f"2. Include page number if available\n"
            f"3. If not found, use 'Not found'\n\n"
            f"Return ONLY valid JSON in this exact format:\n"
            f'{{"Value": "your_answer", "Page Number": "page_number_or_null", "Section": "section_name"}}'
        )

        try:
            completion = client.chat.completions.create(
                model="gpt-4o-mini",  # Fixed model name
                messages=[
                    {"role": "system", "content": "You are a legal contract analysis assistant. Always return valid JSON only."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.1  # Lower temperature for consistency
            )
            answer = completion.choices[0].message.content
            print("****************LLM Answer*******************")
            print(answer)
            print("*********************************************")

            # Use safe JSON parsing
            parsed_response = safe_json_parse(answer)
            print(f"DEBUG: Parsed JSON successfully: {parsed_response}")

            value = parsed_response.get("Value", "Not found")
            print(f"DEBUG: Extracted value: {value}")

            # Extract page number with multiple fallback options
            page_number = (
                parsed_response.get("Page Number") or
                parsed_response.get("PageNumber") or
                parsed_response.get("page_number")
            )
            print(f"DEBUG: Direct page number extraction: {page_number}")

            # If not found, try extracting from Section field
            if not page_number:
                section = parsed_response.get("Section", "")
                print(f"DEBUG: Section field content: {section}")
                if section and "Page" in str(section):
                    page_match = re.search(r'Page (\d+)', str(section))
                    if page_match:
                        page_number = page_match.group(1)
                        print(f"DEBUG: Extracted page number from section: {page_number}")

        except Exception as e:
            print(f"DEBUG: API call or parsing failed: {e}")
            value = "Error"
            page_number = None

        final_result = {"Value": value, "page_number": page_number}
        print(f"DEBUG: Final result for term '{term}': {final_result}")
        print("=" * 60)

        results[term] = final_result

    print(f"DEBUG: All results: {results}")
    return results
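
The JSON fallback strategy used inside `extract_key_terms()` can be tried on its own, without calling the API. The sketch below is a simplified standalone variant under the same idea (direct parse, then fenced block, then first `{...}` span); `parse_llm_json` and the sample replies are made up for illustration.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so the example can sit inside a fenced block
DEFAULT = {"Value": "Not found", "Page Number": None, "Section": None}


def parse_llm_json(response_text):
    """Parse JSON from an LLM reply, falling back to progressively looser patterns."""
    # Strategy 1: the reply is already valid JSON
    try:
        return json.loads(response_text.strip())
    except json.JSONDecodeError:
        pass

    # Strategy 2: JSON inside a fenced block; Strategy 3: first {...} span
    patterns = [
        FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE,
        r"(\{.*\})",
    ]
    for pattern in patterns:
        match = re.search(pattern, response_text, re.DOTALL)
        if match:
            try:
                return json.loads(match.group(1))
            except json.JSONDecodeError:
                continue

    # Nothing parseable: return a copy of the default structure
    return dict(DEFAULT)


fenced_reply = f'Here is the result:\n{FENCE}json\n{{"Value": "12"}}\n{FENCE}'
print(parse_llm_json('{"Value": "Acme"}'))
print(parse_llm_json(fenced_reply))
print(parse_llm_json('The answer is {"Value": "New York"} as stated.'))
print(parse_llm_json("no json here"))
```

Returning a default structure instead of raising keeps the loop over key terms running even when one reply is malformed.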

## Step 7: Judge the LLM's Response

The `judge_llm()` function is responsible for **evaluating the quality of each extracted answer** provided by the LLM for the key terms.

---

### Inputs

- `key_terms`: The list of legal terms you’re looking for
- `extract_key_terms_response`: The previous step’s output (values + page numbers)
- `metrics`: A list of evaluation metrics

Note that the judge sees only the extracted answer, not the contract text itself.

---

### How the Evaluation Works

1. Loops through each key term.
2. For each term and each evaluation metric:
   - Sends a prompt to GPT to **judge the quality** of the LLM's extracted answer.
   - GPT responds with:
     - A **score from 0 to 5**
     - A brief **justification**
3. The score is interpreted as:
   - **Score ≥ 3** → ✅ `LLM_Judge_Response = True` (Pass)
   - **Score < 3**, or no score found in the reply → ❌ `LLM_Judge_Response = False` (Fail)
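
The score parsing and pass/fail step can be sketched in isolation as follows. This mirrors the threshold of 3 that the notebook's code applies; `parse_judge_response` is a hypothetical helper name and the sample replies are made up.

```python
import re


def parse_judge_response(content, threshold=3):
    """Extract (score, passed, justification) from 'Score: N / Justification: ...' text."""
    score_match = re.search(r"Score:\s*(\d+)", content)
    score = int(score_match.group(1)) if score_match else None

    just_match = re.search(r"Justification:\s*(.*)", content, re.DOTALL)
    justification = just_match.group(1).strip() if just_match else content.strip()

    # A missing score counts as a fail
    passed = score is not None and score >= threshold
    return score, passed, justification


print(parse_judge_response("Score: 4\nJustification: Matches clause 12."))
print(parse_judge_response("The answer looks fine."))
```

Treating an unparseable reply as a fail is the conservative choice: a judge reply that ignores the format never counts as a pass.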


In [None]:
def judge_llm(key_terms, extract_key_terms_response, metrics):
    results = []  # List to store evaluation results for each key term

    print(f"DEBUG: Starting evaluation for {len(key_terms)} key terms and {len(metrics)} metrics")
    print(f"DEBUG: Key terms: {key_terms}")
    print(f"DEBUG: Metrics: {metrics}")
    print(f"DEBUG: Extract key terms response: {extract_key_terms_response}")
    print("=" * 80)

    for term in key_terms:
        print(f"\nDEBUG: Processing term: '{term}'")

        # Get the LLM's extracted answer for the current key term
        llm_answer = extract_key_terms_response.get(term, {}).get("Value", "Not found")
        page_number = extract_key_terms_response.get(term, {}).get("page_number", None)

        print(f"DEBUG: LLM Answer: {llm_answer}")
        print(f"DEBUG: Page Number: {page_number}")

        for metric in metrics:
            print(f"\n  DEBUG: Evaluating metric: '{metric}'")

            # Construct the evaluation prompt for the current metric only,
            # so each call judges one (key term, metric) pair
            prompt = (
                f"You are an expert contract lawyer. Evaluate the extracted answer for the key term '{term}' against the evaluation metric below.\n\n"

                f"KEY TERM: {term}\n"
                f"EVALUATION METRIC:\n{metric}\n\n"
                f"EXTRACTED ANSWER:\n{llm_answer}\n\n"

                "INSTRUCTIONS:\n"
                "- Check if the answer clearly addresses the key term and meets the evaluation metric.\n"
                "- Assign a score from 0 to 5 using the scoring guide below.\n"
                "- Provide a short justification for the score you assign.\n\n"

                "SCORING GUIDE:\n"
                "Score 0 : Key term not addressed at all.\n"
                "Score 1 : Answer is irrelevant or empty.\n"
                "Score 2 : Some relevant info, but fails to meet the metric or is incomplete.\n"
                "Score 3 : Adequate answer, partially meets the metric with acceptable accuracy.\n"
                "Score 4 : Strong answer, meets the metric with good clarity and detail.\n"
                "Score 5 : Excellent answer, complete, accurate, and fully meets the metric with clear legal context.\n\n"

                "RESPONSE FORMAT:\n"
                "Score: <number>\n"
                "Justification: <text>\n"
            )
            print(f"  DEBUG: Prompt length: {len(prompt)} characters")

            # Call the LLM to get its evaluation response
            completion = client.chat.completions.create(
                model="gpt-4.1-nano",
                messages=[
                    {"role": "system", "content": "You are a contract evaluation expert."},
                    {"role": "user", "content": prompt}
                ]
            )

            content = completion.choices[0].message.content
            print("  *********JUDGE EVALUATION ANSWER*********")
            print(content)
            print("  *****************************************")

            # Use regex (re is imported at the top of the notebook) to extract the structured score and justification

            # Extract score from the response
            score_match = re.search(r"Score:\s*(\d+)", content)
            score = int(score_match.group(1)) if score_match else None
            print(f"  DEBUG: Extracted score: {score}")

            if not score_match:
                print(f"  DEBUG: Score regex didn't match. Raw content: {repr(content)}")

            # Extract justification from the response
            justification_match = re.search(r"Justification:\s*(.*)", content, re.DOTALL)
            justification = justification_match.group(1).strip() if justification_match else content
            print(f"  DEBUG: Extracted justification: {justification[:100]}...")  # Show first 100 chars

            if not justification_match:
                print(f"  DEBUG: Justification regex didn't match. Using full content as justification")

            # Determine pass/fail status based on score threshold
            pass_fail = score is not None and score >= 3
            print(f"  DEBUG: Pass/Fail (score >= 3): {pass_fail}")

            # Create result entry
            result_entry = {
                "key_term_name": term,
                "llm_extracted_ans_from_doc": llm_answer,
                "page_number": page_number,
                "evulation_metric_name": metric,
                "LLM_Judge_Response": pass_fail,
                "justification": justification
            }

            print(f"  DEBUG: Result entry: {result_entry}")

            # Append results for this term + metric
            results.append(result_entry)
            print(f"  DEBUG: Added result. Total results so far: {len(results)}")

        print(f"DEBUG: Completed all metrics for term '{term}'")
        print("-" * 60)

    print(f"\nDEBUG: Final results count: {len(results)}")
    print(f"DEBUG: Sample result: {results[0] if results else 'No results'}")
    return results  # Return all evaluation results
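
Because `judge_llm()` returns one row per key term and metric, a quick summary per key term is often useful. The sketch below computes a pass rate from rows shaped like that output; `pass_rate_by_term` is a hypothetical helper and the sample rows are made up.

```python
from collections import defaultdict


def pass_rate_by_term(results):
    """Return {key_term_name: fraction of metrics passed} from judge_llm-style rows."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for row in results:
        term = row["key_term_name"]
        totals[term] += 1
        if row["LLM_Judge_Response"]:
            passes[term] += 1
    return {term: passes[term] / totals[term] for term in totals}


# Made-up rows using the same keys as the judge_llm() result entries
sample = [
    {"key_term_name": "Governing Law", "LLM_Judge_Response": True},
    {"key_term_name": "Governing Law", "LLM_Judge_Response": False},
    {"key_term_name": "Product Name", "LLM_Judge_Response": True},
]
print(pass_rate_by_term(sample))  # {'Governing Law': 0.5, 'Product Name': 1.0}
```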

## Function: process_documents_with_progress

Runs the full contract analysis pipeline with real-time progress using Gradio.

---

### Overview

- Extracts text from a contract file.
- Identifies key terms using `extract_key_terms()`.
- Evaluates responses with `judge_llm()` against defined metrics.
- Returns structured results as DataFrames.

---

### Parameters

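- `contract_file`: Uploaded contract file (PDF, DOCX, etc.).
- `progress`: Gradio progress tracker (`gr.Progress()`), used to report each stage of the pipeline.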

---

### Flow Summary

1. Extract text from the document.
2. Pull key terms using the LLM.
3. Score results against evaluation metrics.
4. Organize and return results in DataFrames.

---
> ## **Note**  
> - Relies on `extract_text_from_file`, `extract_key_terms`, and `judge_llm`.  
> - Uses predefined key terms and evaluation metrics for analysis.



In [None]:
def process_documents_with_progress(contract_file, progress=gr.Progress()):
    """
    Process documents with progress updates
    """
    try:
        # Step 1: Extract text from contract file
        progress(0.1, desc="📄 Extracting text from contract file...")
        text, docs = extract_text_from_file(contract_file)
        progress(0.2, desc="✅ Text extraction completed")

        # Step 2: Extract key terms
        progress(0.3, desc="🔍 Extracting key terms from contract...")
        key_term_results = extract_key_terms(text, KEY_TERMS)
        progress(0.5, desc="✅ Key terms extraction completed")

        # Step 3: Judge each key term
        progress(0.6, desc="⚖️ Evaluating key terms with LLM judge...")
        evals = judge_llm(
            KEY_TERMS,
            key_term_results,
            EVALUATION_METRICS
        )
        progress(0.8, desc="✅ Evaluation completed")

        # Step 4: Format results
        progress(0.9, desc="📊 Formatting results for display...")

        # Format each evaluation result for display
        for e in evals:
            term = e["key_term_name"]
            llm_ans = e["llm_extracted_ans_from_doc"]
            # Extract only the text after 'Text:'
            if llm_ans:
                text_match = re.search(r'Text:\s*(.*)', llm_ans, re.DOTALL)
                e["llm_extracted_ans_from_doc"] = text_match.group(1).strip() if text_match else llm_ans

        # Prepare DataFrame with new columns in the correct order
        df = pd.DataFrame(evals)
        display_cols = [
            "key_term_name",
            "llm_extracted_ans_from_doc",
            "page_number",
            "evulation_metric_name",
            "LLM_Judge_Response",
            "justification"
        ]
        df = df[display_cols]

        # Split DataFrame into three groups that match the metric categories:
        # Helpful (first 3 metrics), Honest (next 2), Harmless (last 3)
        metric_groups = [EVALUATION_METRICS[:3], EVALUATION_METRICS[3:5], EVALUATION_METRICS[5:]]
        df1 = df[df["evulation_metric_name"].isin(metric_groups[0])].reset_index(drop=True)
        df2 = df[df["evulation_metric_name"].isin(metric_groups[1])].reset_index(drop=True)
        df3 = df[df["evulation_metric_name"].isin(metric_groups[2])].reset_index(drop=True)

        progress(1.0, desc="🎉 Processing completed successfully!")

        return text, df1, df2, df3, df

    except Exception as e:
        progress(1.0, desc=f"❌ Error occurred: {str(e)}")
        raise e

## Gradio UI: LLM Contract Judge

A simple web interface to upload contracts, extract key terms, evaluate them with an LLM, and view or download the results.

---

### What It Does

- Accepts contract files (PDF, DOCX, TXT)
- Extracts and evaluates key legal terms using GPT
- Displays results grouped by evaluation metrics
- Allows easy CSV export of final outputs

---

### Inputs

- `contract_file`: Upload contract document
- `start_btn`: Starts the evaluation process
- `download_btn`: Exports all results as a CSV

---

### Flow Summary

1. User uploads contract and clicks **Start Evaluating**
2. `run_and_return_tables()` calls `process_documents_with_progress()`
3. Extracted results appear in 3 separate tabs
4. User clicks **Download** to export all results as CSV

---

> ## **Note:**  
> - When you run the Gradio app, you'll see a line such as:  
>   `* Running on local URL: http://127.0.0.1:7868` (the port may differ on your machine)
> - Click the URL to open the Gradio UI in a new browser tab.
> - After processing, click the **Download All Results as CSV** button.
> - A temporary `.csv` file with an auto-generated name will be created for download.





In [None]:
# Create the main Gradio interface using Blocks
with gr.Blocks() as demo:
    # Title Markdown
    gr.Markdown("# 📄 LLM Contract Judge\nUpload a contract, extract key terms, and evaluate with LLM.")

    # File upload component in a horizontal row layout
    with gr.Row():
        contract_file = gr.File(label="Upload Contract (PDF, DOCX, TXT)")

    # Button to trigger the evaluation process
    start_btn = gr.Button("🚀 Start Evaluating", variant="primary")

    # A non-editable textbox to show progress or status updates
    progress_text = gr.Textbox(
        label="Processing Status",
        value="Ready to start evaluation...",
        interactive=False
    )

    # Output box to display raw extracted contract text
    extracted_text = gr.Textbox(label="Extracted Contract Text", lines=10, interactive=False)

    # Tabbed interface for displaying different metric evaluation results
    with gr.Tabs():
        # Helpful Metrics tab
        with gr.TabItem("Helpful Metrics"):
            results_table1 = gr.Dataframe(headers=[
                "key_term_name",
                "llm_extracted_ans_from_doc",
                "page_number",
                "evulation_metric_name",
                "LLM_Judge_Response",
                "justification"
            ], label="Evaluation Results (Helpful Metrics)")

        # Honest Metrics tab
        with gr.TabItem("Honest Metrics"):
            results_table2 = gr.Dataframe(headers=[
                "key_term_name",
                "llm_extracted_ans_from_doc",
                "page_number",
                "evulation_metric_name",
                "LLM_Judge_Response",
                "justification"
            ], label="Evaluation Results (Honest Metrics)")

        # Harmless Metrics tab
        with gr.TabItem("Harmless Metrics"):
            results_table3 = gr.Dataframe(headers=[
                "key_term_name",
                "llm_extracted_ans_from_doc",
                "page_number",
                "evulation_metric_name",
                "LLM_Judge_Response",
                "justification"
            ], label="Evaluation Results (Harmless Metrics)")

    # Button to download evaluation results as a CSV
    download_btn = gr.Button("📥 Download All Results as CSV")

    # File component to show the downloadable CSV file
    download_file = gr.File(label="Download CSV")

    # State variables to store DataFrames for use during download
    state_df1 = gr.State()
    state_df2 = gr.State()
    state_df3 = gr.State()
    state_df_all = gr.State()

    # Main function to process contract and return data for all tables
    def run_and_return_tables(contract_file, progress=gr.Progress()):
        if not contract_file:
            # No file uploaded: return a message and clear the tables and state.
            # The tuple must match the 8 outputs wired to start_btn.click().
            return (
                "Please upload a contract file first.",
                gr.update(value=None),
                gr.update(value=None),
                gr.update(value=None),
                None, None, None, None
            )

        try:
            # Report the start of processing through the Gradio progress tracker
            progress(0, desc="🔄 Starting document processing...")

            # Function processes the document and returns the text and 3 metrics tables
            text, df1, df2, df3, df_all = process_documents_with_progress(contract_file, progress)

            return (
                text,
                gr.update(value=df1),
                gr.update(value=df2),
                gr.update(value=df3),
                df1, df2, df3, df_all
            )

        except Exception as e:
            # Return error details in case of failure.
            # The tuple must match the 8 outputs wired to start_btn.click().
            error_msg = f"❌ Error during processing: {str(e)}"
            return (
                error_msg,
                gr.update(value=None),
                gr.update(value=None),
                gr.update(value=None),
                None, None, None, None
            )

    # Function to generate and return the downloadable CSV file from results
    def download_csv(contract_file, df_all):
        if df_all is None:
            return None

        try:
            # Define the columns we need from the original data
            required_cols = [
                "key_term_name",
                "llm_extracted_ans_from_doc",
                "page_number",
                "evulation_metric_name",
                "LLM_Judge_Response",
                "justification"
            ]

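            # Filter to only include the specified columns
            df_filtered = df_all[required_cols].copy()

            # pivot_table/groupby drop rows whose key is missing, so give missing
            # page numbers a placeholder to keep those key terms in the CSV
            df_filtered["page_number"] = df_filtered["page_number"].astype(object).fillna("N/A")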

            # Create the pivot table to transform data from long to wide format
            # Each key_term will become a row, and each metric will become a column
            pivot_df = df_filtered.pivot_table(
                index=['key_term_name', 'llm_extracted_ans_from_doc', 'page_number'],
                columns='evulation_metric_name',
                values='LLM_Judge_Response',
                aggfunc='first'  # Take first value if there are duplicates
            ).reset_index()

            # Flatten the column names (remove multi-level index)
            pivot_df.columns.name = None

            # Rename the basic columns to match your desired format
            pivot_df = pivot_df.rename(columns={
                'key_term_name': 'key_term_name',
                'llm_extracted_ans_from_doc': 'value',
                'page_number': 'page_number'
            })

            # Create a justification column by combining all justifications for each key term
            justification_df = df_filtered.groupby(['key_term_name', 'llm_extracted_ans_from_doc', 'page_number'])['justification'].apply(
                lambda x: ' | '.join(x.dropna().unique())  # Combine unique justifications with separator
            ).reset_index()

            # Merge the justification back to the pivot table
            final_df = pivot_df.merge(
                justification_df,
                left_on=['key_term_name', 'value', 'page_number'],
                right_on=['key_term_name', 'llm_extracted_ans_from_doc', 'page_number'],
                how='left'
            )

            # Drop the duplicate column from merge
            if 'llm_extracted_ans_from_doc' in final_df.columns:
                final_df = final_df.drop('llm_extracted_ans_from_doc', axis=1)

            # Reorder columns: basic info first, then metrics, then justification
            basic_cols = ['key_term_name', 'value', 'page_number']
            metric_cols = [col for col in final_df.columns if col not in basic_cols + ['justification']]
            column_order = basic_cols + metric_cols + ['justification']

            final_df = final_df[column_order]

            # Write to a temporary file and return the path
            with tempfile.NamedTemporaryFile(delete=False, suffix=".csv", mode="w", encoding="utf-8") as tmp:
                final_df.to_csv(tmp, index=False)
                tmp_path = tmp.name

            return tmp_path

        except Exception as e:
            print(f"Error in download_csv: {str(e)}")
            return None

    # Trigger processing function when 'Start Evaluating' is clicked
    start_btn.click(
        run_and_return_tables,
        inputs=[contract_file],
        outputs=[extracted_text, results_table1, results_table2, results_table3, state_df1, state_df2, state_df3, state_df_all]
    )

    # Trigger CSV download when 'Download' is clicked
    download_btn.click(
        download_csv,
        inputs=[contract_file, state_df_all],
        outputs=download_file
    )

# Launch the Gradio app
demo.launch()
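
The long-to-wide reshaping that `download_csv()` performs can also be sketched without pandas, which makes the target CSV layout easier to see. The helper below is a simplified illustration using only the standard library; `to_wide_csv` is a hypothetical name, the sample rows are made up, and justifications are left out for brevity.

```python
import csv
import io


def to_wide_csv(rows):
    """Pivot judge rows to one line per key term with one column per metric."""
    metrics = []  # metric names in first-seen order
    wide = {}     # (term, value, page) -> {metric: pass/fail}
    for row in rows:
        metric = row["evulation_metric_name"]
        if metric not in metrics:
            metrics.append(metric)
        key = (row["key_term_name"], row["llm_extracted_ans_from_doc"], row["page_number"])
        wide.setdefault(key, {})[metric] = row["LLM_Judge_Response"]

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["key_term_name", "value", "page_number", *metrics])
    for (term, value, page), scores in wide.items():
        writer.writerow([term, value, page, *(scores.get(m, "") for m in metrics)])
    return buffer.getvalue()


# Made-up rows using the same keys as the notebook's result entries
rows = [
    {"key_term_name": "Governing Law", "llm_extracted_ans_from_doc": "New York",
     "page_number": None, "evulation_metric_name": "Complete?", "LLM_Judge_Response": True},
    {"key_term_name": "Governing Law", "llm_extracted_ans_from_doc": "New York",
     "page_number": None, "evulation_metric_name": "In scope?", "LLM_Judge_Response": False},
]
print(to_wide_csv(rows))
```

This plain-Python version keeps key terms whose page number is `None`; the page cell is simply written as empty.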