# LLM-as-a-Judge Simply Explained: A Complete Guide to Run LLM Evals

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sachin0034/MLAI-community-labs/blob/main/Class-Labs/Lab-14%28LLM-Judge-lab%29/Lab-2%28LLM-as-a-Judge%20with%20LLM%29/LLM_As_A_Judge.ipynb)

"LLM-as-a-judge" is a technique where large language models — like GPT — are used to evaluate the quality of outputs generated by other AI models or systems. Instead of relying on human evaluators, which can be time-consuming and expensive, we use an LLM to act as the judge, scoring or ranking generated content based on factors like correctness, coherence, relevance, or even tone and style.

This approach became popular because evaluating open-ended text (like summaries, chatbot replies, or creative writing) is inherently subjective. Traditional metrics like accuracy or BLEU scores often fall short since there’s no single 'right' answer. LLMs help fill that gap by providing nuanced judgments, often closer to how a human would interpret or assess the output.

In short, LLM-as-a-judge is a scalable, cost-effective way to evaluate the quality of language model outputs, and it is often reliable enough to stand in for human review when human evaluation isn’t feasible at scale.

---

## Prerequisites

Before you get started, please make sure you have the following ready:

---

### 1. Sample Contract File for Testing

To try out the contract analysis workflow, download the sample contract file provided below:

- [Download Sample Contract (Google Drive)](https://drive.google.com/file/d/1E557kdNBZ5cDUvVDLNrEVRuKcRSYDG3Z/view?usp=sharing)

### 2. OpenAI API Key

You’ll need your own OpenAI API key to access the language models used for contract evaluation. If you don’t have one yet, follow this step-by-step guide to generate your API key:

- [How to get your own OpenAI API key (Medium article)](https://medium.com/@lorenzozar/how-to-get-your-own-openai-api-key-f4d44e60c327)

---

## Step 1: Install the Dependencies
Each dependency serves a specific purpose in the LLM Judge Lab:

| Package                 | Purpose / Use in Project                                                                                     |
|-------------------------|--------------------------------------------------------------------------------------------------------------|
| **gradio**              | Builds a web-based UI for interaction. Allows users to input text, upload files, and view model evaluations. |
| **langchain**           | Manages the logic of LLM interactions, from document processing to chaining LLM calls.                       |
| **langchain-community** | Provides the document loaders (`PyPDFLoader`, `Docx2txtLoader`, `TextLoader`) used in this lab.              |
| **openai**              | Connects the system to OpenAI’s models (e.g., GPT-4) for generating judgments or scores.                     |
| **python-docx**         | Parses and extracts content from `.docx` files for evaluation.                                               |
| **docx2txt**            | Required by `Docx2txtLoader` to read `.docx` files.                                                          |
| **pypdf**               | Extracts text from PDFs (used by `PyPDFLoader`), enabling the model to assess uploaded PDF documents.        |
| **pandas**              | Structures and displays results in tables or dataframes for better analysis and comparison.                  |

In [1]:
# Install necessary packages
!pip install gradio langchain langchain-community openai python-docx docx2txt pypdf pandas


## 🚀 Step 2: Import the Libraries

Now that you've installed all the necessary libraries, it's time to import them into your Python script or Jupyter notebook.

Start by importing **Gradio** to build the interactive web interface for your LLM-as-a-judge lab.

Next, bring in **document loaders from LangChain** — specifically for handling PDF, DOCX, and plain text files. These will help you extract content from user-uploaded documents.

Then, import the **OpenAI client**, which you'll use to call OpenAI chat models for analyzing and judging text (this lab uses `gpt-4o-mini`).

You’ll also want **Pandas** to organize and display results in table formats, especially when dealing with comparisons or scores.

Finally, include Python’s built-in **os**, **tempfile**, and **re** modules. These are useful for file path handling, safely working with temporary files during processing, and parsing the model's replies with regular expressions.

Once these imports are in place, you're ready to move on to building the file processing and evaluation pipeline!


In [2]:
import gradio as gr
# Document loaders live in langchain-community; Document lives in langchain-core
from langchain_community.document_loaders import PyPDFLoader, Docx2txtLoader, TextLoader
from langchain_core.documents import Document
from openai import OpenAI
import pandas as pd
import os
import io
import tempfile
import re

## 🧠 Step 3: Define Key Terms and Evaluation Metrics

Now, let’s define the **what** and the **how** of the evaluation:

### 🔑 Key Terms

These are the critical contract clauses we want the LLM to extract and analyze:

- **Service Warranty**  
- **Limitation of Liability**  
- **Governing Law**  
- **Termination for Cause**  
- **Payment Terms**  
- **Confidentiality Obligations**

Each of these helps focus the LLM’s attention on high-priority legal elements.

---

### Helpful
| Metric                                                        | What It Measures                                      |
|---------------------------------------------------------------|-------------------------------------------------------|
| Was the information extracted as per the question asked?      | Did the answer directly address the key term?         |
| Was the information complete?                                 | Is all relevant information included?                 |
| Was the information enough to make a conclusive decision?     | Is the answer sufficient for decision-making?         |
| Were associated red flags covered in the extracted output?    | Are potential issues or risks mentioned?              |

---

### Honest
| Metric                                                        | What It Measures                                      |
|---------------------------------------------------------------|-------------------------------------------------------|
| Was the information extracted from all relevant clauses?      | Are multiple relevant sections included if needed?    |
| Was the page number of extracted information correct?         | Are page references accurate?                         |
| Was the AI reasoning discussing the relevant clause?          | Is the explanation focused on the right part?         |
| Does the information stay within document scope?              | Is the answer limited to the uploaded contract?       |

---

### Harmless
| Metric                                                        | What It Measures                                      |
|---------------------------------------------------------------|-------------------------------------------------------|
| Were results free from misleading claims?                     | Are there any false or misleading statements?         |
| Does the tool avoid generic/non-contract answers?             | Is the answer specific to the contract, not generic?  |
| Did the AI avoid illegal or insensitive justifications?       | Are explanations appropriate and lawful?              |
| Did the tool prevent false claims about people/entities?      | Are there any incorrect statements about parties?     |
| Did the tool avoid hateful/profane content?                   | Is the output free from inappropriate language?       |

---




In [3]:
KEY_TERMS = [
    "Service Warranty",
    "Limitation of Liability",
    "Governing Law",
    "Termination for Cause",
    "Payment Terms",
    "Confidentiality Obligations"
]

EVALUATION_METRICS = [
    "Was the information extracted as per the question asked in the key term?",
    "Was the information complete?",
    "Was the information enough to make a conclusive decision?",
    "Were associated red flags covered in the extracted output?",
    "Was the information extracted from all relevant clauses?",
    "Was the page number of extracted information correct?",
    "Was the AI reasoning discussing the relevant clause?",
    "Does the information stay within document scope?",
    "Were results free from misleading claims?",
    "Does the tool avoid generic/non-contract answers?",
    "Did the AI avoid illegal or insensitive justifications?",
    "Did the tool prevent false claims about people/entities?",
    "Did the tool avoid hateful/profane content?"
]
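
The thirteen metrics fall into the Helpful, Honest, and Harmless groups described above (four, four, and five questions). Below is a minimal sketch of keeping that grouping explicit, so a metric can be mapped back to its group by name rather than by list position. `build_metric_groups` and `group_for` are hypothetical helpers, not part of the lab code, and the short placeholder list stands in for `EVALUATION_METRICS`.

```python
def build_metric_groups(metrics, sizes=(4, 4, 5), names=("Helpful", "Honest", "Harmless")):
    """Slice a flat metric list into named groups of the given sizes."""
    if sum(sizes) != len(metrics):
        raise ValueError("group sizes must add up to the number of metrics")
    groups, start = {}, 0
    for name, size in zip(names, sizes):
        groups[name] = list(metrics[start:start + size])
        start += size
    return groups


def group_for(metric, groups):
    """Return the name of the group that contains `metric`, or None."""
    for name, members in groups.items():
        if metric in members:
            return name
    return None


# Stand-in for EVALUATION_METRICS: 13 placeholder questions (assumption).
sample_metrics = [f"metric {i}" for i in range(1, 14)]
groups = build_metric_groups(sample_metrics)
print({name: len(members) for name, members in groups.items()})
print(group_for("metric 6", groups))
```

Passing the real `EVALUATION_METRICS` list instead of `sample_metrics` should give the same three groups that the tables above describe.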

## 📂 Step 4: Extract Text from Documents

To analyze uploaded files, we first extract their text using the `extract_text_from_file` function.

It supports:

- **PDF** (`.pdf`) via `PyPDFLoader`
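- **Word** (`.docx`) via `Docx2txtLoader`. The code also accepts the `.doc` extension, but `Docx2txtLoader` reads the zipped `.docx` format, so legacy `.doc` files will likely fail to load.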
- **Text** (`.txt`) via `TextLoader`
- **CSV** (`.csv`) via `pandas`

The function returns:
- `text`: Complete extracted content
- `docs`: Structured data (useful for page references)

✅ This ensures consistent input for the LLM across all major file types.


In [4]:
def extract_text_from_file(file_path):
    # Get the file extension and convert it to lowercase
    ext = os.path.splitext(file_path)[1].lower()

    # Handle PDF files
    if ext == ".pdf":
        loader = PyPDFLoader(file_path)  # Use PyPDFLoader to read PDF
        docs = loader.load()  # Load document into LangChain Document objects
        text = "\n".join([doc.page_content for doc in docs])  # Combine all page content

    # Handle Word documents (.docx, .doc)
    elif ext in [".docx", ".doc"]:
        loader = Docx2txtLoader(file_path)  # Use Docx2txtLoader for Word files
        docs = loader.load()
        text = "\n".join([doc.page_content for doc in docs])

    # Handle plain text files
    elif ext in [".txt"]:
        loader = TextLoader(file_path)  # Use TextLoader for .txt files
        docs = loader.load()
        text = "\n".join([doc.page_content for doc in docs])

    # Handle CSV files
    elif ext == ".csv":
        df = pd.read_csv(file_path)  # Read CSV using pandas
        text = df.to_string(index=False)  # Convert DataFrame to plain string
        # Wrap in a LangChain Document to keep the same structure as the loaders
        docs = [Document(page_content=text)]

    # Unsupported file types
    else:
        raise ValueError("Unsupported file type")

    # Return both raw text and structured docs for further processing
    return text, docs  # docs may include page-level details
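
`extract_text_from_file` joins pages with a bare newline, so the combined `text` no longer shows where each page starts, while later steps ask the model for page numbers. One possible workaround, sketched here on plain strings rather than LangChain `Document` objects, is to prefix each page with a marker before joining. The helper name `join_pages_with_markers` and the `[Page N]` marker format are assumptions for illustration, not part of the lab code.

```python
def join_pages_with_markers(pages, start=1):
    """Join page strings, prefixing each with a '[Page N]' marker line."""
    marked = []
    for number, content in enumerate(pages, start=start):
        marked.append(f"[Page {number}]\n{content.strip()}")
    return "\n\n".join(marked)


pages = [
    "Payment is due within 30 days of invoice.",
    "This agreement is governed by the laws of Ohio.",
]
print(join_pages_with_markers(pages))
```

With the real loader output, `pages` could be built as `[doc.page_content for doc in docs]`.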


## 🔐 Step 5: Set Up OpenAI Client

To use GPT models, you must have your own OpenAI API key.

**Important:**
- You cannot proceed without a valid API key.
- Keep your key secure and never share it or commit it to public repositories.

> 🔎 **Note:** If you don’t have an API key, follow this guide to get one:  
> [How to get your own OpenAI API key (Medium article)](https://medium.com/@lorenzozar/how-to-get-your-own-openai-api-key-f4d44e60c327)


In [10]:
# Prefer the OPENAI_API_KEY environment variable so the key stays out of the notebook;
# the placeholder string is only a fallback.
api_key = os.getenv("OPENAI_API_KEY", "Insert Your API Key Here")

try:
    client = OpenAI(api_key=api_key)
    # Minimal API call to check if the key is valid
    client.models.list()
    print("✅ OpenAI API key is valid.")
except Exception as e:
    print("❌ Invalid OpenAI API key or connection error:", e)
    raise RuntimeError("OpenAI API key check failed. Please provide a valid key.") from e

❌ Invalid OpenAI API key or connection error: Error code: 401 - {'error': {'message': 'Incorrect API key provided: Insert Y************Here. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}


RuntimeError: OpenAI API key check failed. Please provide a valid key.

## 🧠 Step 6: Extract Key Terms from the Document

The `extract_key_terms(text, key_terms)` function is used to pull out and summarize the most important legal clauses from your uploaded document.

It takes two inputs:

- `text`: The full contract content, extracted earlier using `extract_text_from_file()`
- `key_terms`: A list of specific legal terms you defined earlier (e.g. "Payment Terms", "Governing Law")

---

### 🔍 What It Does:

Once the document is uploaded and text is extracted, this function:

1. Loops through each key term from the list.
2. For each term:
   - It sends a carefully crafted prompt (along with the full text) to the OpenAI model (`gpt-4o-mini` in this lab).
   - It asks the model to find the clause, summarize it in simple language, and return it in a structured **JSON format**.
3. Then, it:
   - Parses the JSON response from the model.
   - Extracts the **summary** and **page number** (if mentioned).
   - Stores the result in a dictionary.

---

### 📤 Output:

Returns a dictionary that looks like this:

```python
{
  "Payment Terms": {
    "Summary": "Payments must be made within 30 days of invoice.",
    "page_number": "5"
  },
  "Confidentiality Obligations": {
    "Summary": "Parties must keep all shared data confidential.",
    "page_number": "7"
  },
  ...
}
```
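
The parsing step can be tried on its own, without calling the API. Below is a minimal sketch, assuming the model replies with a JSON object that has `Section` and `Summary` keys, optionally wrapped in a fenced `json` block. `parse_extraction` is a hypothetical helper name, and the sample replies are made up for illustration.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built here to keep this example readable


def parse_extraction(answer):
    """Turn a model reply into {'Summary': ..., 'page_number': ...}."""
    # Prefer the contents of a fenced json block if the reply has one.
    fenced = re.search(FENCE + r"json\s*(.*?)\s*" + FENCE, answer, re.DOTALL)
    json_str = fenced.group(1) if fenced else answer
    try:
        data = json.loads(json_str)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict):
        # Plain-text replies such as 'Not found.' end up here.
        return {"Summary": "Not found", "page_number": None}
    page = re.search(r"Page (\d+)", str(data.get("Section", "")))
    return {
        "Summary": data.get("Summary", "Not found"),
        "page_number": page.group(1) if page else None,
    }


reply = (
    FENCE + "json\n"
    '{"Section": "Payment Terms (Page 5)", "Summary": "Pay within 30 days."}\n'
    + FENCE
)
print(parse_extraction(reply))
print(parse_extraction("Not found."))
```

Checking that the parsed value is a dictionary matters because `json.loads` also accepts bare strings and lists, which have no `.get` method.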


In [11]:
import json


def extract_key_terms(text, key_terms):
    results = {}
    for term in key_terms:
        # Ask for a JSON object whose keys match what the parser below reads
        prompt = (
            f"You are a legal document analysis assistant.\n"
            f"Find the '{term}' clause in the contract.\n\n"
            f"Instructions:\n"
            f"1. If found, provide a brief summary in simple language and to the point\n"
            f"2. Include the section title and page number if available\n"
            f"3. Quote only the most important sentence from the actual clause\n"
            f"4. If not found, set \"Summary\" to \"Not found\"\n\n"
            f"Respond with a single JSON object and nothing else, using exactly these keys:\n"
            f"{{\n"
            f"  \"Section\": \"[Title] (Page [number])\",\n"
            f"  \"Summary\": \"[Brief explanation in plain English]\",\n"
            f"  \"Key Quote\": \"[Most relevant sentence]\"\n"
            f"}}\n\n"
            f"Document:\n{text}"
        )
        completion = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are a legal contract analysis assistant. Provide clear, concise explanations that non-lawyers can understand."},
                {"role": "user", "content": prompt}
            ]
        )
        answer = completion.choices[0].message.content
        print("****************LLM Answer*******************")
        print(answer)

        # Defaults used when the reply cannot be parsed
        summary = "Not found"
        page_number = None
        try:
            # Extract JSON from response if it's wrapped in ```json blocks
            json_match = re.search(r'```json\s*(.*?)\s*```', answer, re.DOTALL)
            json_str = json_match.group(1) if json_match else answer
            parsed_response = json.loads(json_str)
        except json.JSONDecodeError:
            parsed_response = None

        # json.loads can also return a string or list, so check for a dict
        if isinstance(parsed_response, dict):
            summary = parsed_response.get("Summary", "Not found")
            # Extract page number from the Section field, e.g. "Payment (Page 5)"
            page_match = re.search(r'Page (\d+)', str(parsed_response.get("Section", "")))
            if page_match:
                page_number = page_match.group(1)

        results[term] = {"Summary": summary, "page_number": page_number}
    return results

## ✅ Step 7: Judge the LLM's Response

The `judge_llm()` function is responsible for **evaluating the quality of each extracted answer** provided by the LLM for the key terms.

---

### 📥 It Takes:
- `key_terms`: The list of legal terms you’re looking for
- `extract_key_terms_response`: The previous step’s output (summaries + page numbers)
- `metrics`: The list of evaluation questions (`EVALUATION_METRICS`)

Note that the judge receives only the extracted summary, not the contract text, so it scores the answer on its own terms rather than checking it against the source document.

---

### 🧠 What It Does:

1. Loops over each key term.
2. For every term and metric:
   - Sends a prompt to GPT to **evaluate the LLM's extracted answer**.
   - The model gives a **score from 0 to 5** and a **short justification**.
3. Interprets the score:
   - **Score > 3** → ✅ `LLM_Judge_Response = True` (Pass)
   - **Score ≤ 3** → ❌ `LLM_Judge_Response = False` (Fail)
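
The score interpretation above can be sketched as a small standalone parser. It assumes the judge replies in the `Score: <number>` / `Justification: <text>` format; `parse_judge_reply` is a hypothetical helper name, and the sample replies are made up for illustration.

```python
import re


def parse_judge_reply(content, threshold=3):
    """Extract score, justification, and pass/fail from a judge reply."""
    score_match = re.search(r"Score:\s*(\d+)", content)
    score = int(score_match.group(1)) if score_match else None
    just_match = re.search(r"Justification:\s*(.*)", content, re.DOTALL)
    justification = just_match.group(1).strip() if just_match else content.strip()
    # A reply without a parsable score is treated as a fail.
    passed = score is not None and score > threshold
    return {"score": score, "passed": passed, "justification": justification}


print(parse_judge_reply("Score: 4\nJustification: Covers the clause accurately."))
print(parse_judge_reply("Score: 3\nJustification: Partially correct."))
print(parse_judge_reply("I cannot evaluate this."))
```

A score of exactly 3 fails because the threshold is strict (`score > 3`), which matches the Pass/Fail rule listed above.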

---

### 🎯 Why This Step Matters:

This step gives a **consistent, repeatable check** of how well the AI extracted each legal term. Because the judge is itself an LLM, treat its scores as a useful signal rather than ground truth.

It adds a **scoring + explanation layer** on top of the extraction, which makes weak answers easier to spot and review.


In [12]:
def judge_llm(key_terms, extract_key_terms_response, metrics):
    results = []  # List to store evaluation results for each key term

    for term in key_terms:
        # Get the LLM's extracted answer for the current key term
        llm_answer = extract_key_terms_response.get(term, {}).get("Summary", "Not found")

        for metric in metrics:
            # Build the prompt for this one metric, so each stored result
            # reflects the metric it is recorded under
            prompt = (
                f"You are an expert contract lawyer. Carefully analyze the following extracted answer for the key term '{term}' "
                f"against this evaluation metric:\n"
                f"{metric}\n"
                f"Extracted Answer: {llm_answer}\n"
                "Scoring Instructions:\n"
                "- Use a score from 0 to 5, where:\n"
                "    0 = The key term is not found, not addressed, or the answer is completely missing/irrelevant.\n"
                "    1 = Very poor: answer almost entirely fails the metric.\n"
                "    2 = Poor: answer mostly fails the metric.\n"
                "    3 = Fair: answer partially meets the metric but has notable gaps or errors.\n"
                "    4 = Good: answer mostly meets the metric, but could be improved.\n"
                "    5 = Excellent: answer fully meets the metric.\n"
                "- If the extracted answer is 'Not found.' or does not address the key term at all, you MUST give a score of 0.\n"
                "- Carefully consider the metric before assigning a score. Do NOT skip intermediate scores (2, 3) if appropriate.\n"
                "- Optimize your evaluation for accuracy and completeness.\n"
                "Provide:\n"
                "- A single score from 0 (not found) to 5 (excellent)\n"
                "- A short justification (1-2 sentences) to the point for your score\n"
                "Respond in the format: Score: <number>\nJustification: <text>"
            )

            # Call the LLM to get its evaluation response
            completion = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "You are a contract evaluation expert."},
                    {"role": "user", "content": prompt}
                ]
            )

            content = completion.choices[0].message.content
            print("*********JUDGE EVALUATION ANSWER")
            print(content)

            # Extract score from the response (re is imported at the top of the notebook)
            score_match = re.search(r"Score:\s*(\d+)", content)
            score = int(score_match.group(1)) if score_match else None

            # Extract justification from the response
            justification_match = re.search(r"Justification:\s*(.*)", content, re.DOTALL)
            justification = justification_match.group(1).strip() if justification_match else content

            # Pass only when a score was parsed and it is above the threshold of 3
            pass_fail = score is not None and score > 3

            # Append results for this term + metric
            results.append({
                "key_term_name": term,
                "llm_extracted_ans_from_doc": llm_answer,
                "evulation_metric_name": metric,
                "LLM_Judge_Response": pass_fail,
                "justification": justification
            })

    return results  # Return all evaluation results
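
Since `judge_llm` returns one row per key term and metric, a quick summary can help when reading the results. Below is a minimal sketch that computes the fraction of metrics passed for each key term. `pass_rate_by_term` is a hypothetical helper, and the sample rows are made up; only the `key_term_name` and `LLM_Judge_Response` keys are taken from the function above.

```python
from collections import defaultdict


def pass_rate_by_term(results):
    """Return {key_term_name: fraction of metrics that passed}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for row in results:
        term = row["key_term_name"]
        total[term] += 1
        if row["LLM_Judge_Response"]:
            passed[term] += 1
    return {term: passed[term] / total[term] for term in total}


# Made-up rows in the same shape as judge_llm's output (assumption).
sample_results = [
    {"key_term_name": "Payment Terms", "LLM_Judge_Response": True},
    {"key_term_name": "Payment Terms", "LLM_Judge_Response": False},
    {"key_term_name": "Governing Law", "LLM_Judge_Response": True},
]
print(pass_rate_by_term(sample_results))
```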



## 📌 Function: process_documents_with_progress

Runs a full contract analysis pipeline with live progress updates using Gradio.

🧾 What it does:
-----------------
- Extracts raw text from an uploaded contract file.
- Identifies and extracts key terms from the contract using predefined keywords.
- Uses a language model to evaluate the quality of the extracted key term answers against defined metrics.
- Formats the results into separate DataFrames for display and review.

📥 Parameters:
-----------------
- `contract_file`: Uploaded contract file (e.g. PDF, DOCX).
- `progress` (optional): Gradio progress handler to display status updates in the UI.

📤 Returns:
-----------------
- `text`: Extracted text content from the document.
- `df1`: Evaluation results for the first group of metrics (Helpful).
- `df2`: Evaluation results for the second group of metrics (Honest).
- `df3`: Evaluation results for the third group of metrics (Harmless).
- `df`: Full evaluation DataFrame.

🛠️ Internal Flow:
-----------------
1. Extracts text from the document.
2. Passes extracted text to `extract_key_terms()` to isolate relevant terms.
3. Sends key term results to `judge_llm()` for scoring against evaluation metrics.
4. Cleans and organizes the judged results.
5. Splits final results into groups for easier UI display.
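
The flow above can be exercised without any API calls by passing the stages in as functions. The sketch below is an illustration only: `run_pipeline` and the stub stages are hypothetical, and the `progress(fraction, desc=...)` callback merely mimics how the lab code calls Gradio's progress handler.

```python
def run_pipeline(file_path, extract, find_terms, judge, progress=lambda fraction, desc="": None):
    """Run extract -> find_terms -> judge, reporting progress before each stage."""
    progress(0.1, desc="Extracting text")
    text = extract(file_path)
    progress(0.3, desc="Extracting key terms")
    term_results = find_terms(text)
    progress(0.6, desc="Judging answers")
    evals = judge(term_results)
    progress(1.0, desc="Done")
    return text, evals


# Stub stages so the flow can be tested in isolation (all made up).
log = []
text, evals = run_pipeline(
    "contract.txt",
    extract=lambda path: "Payment is due in 30 days.",
    find_terms=lambda text: {"Payment Terms": {"Summary": text}},
    judge=lambda terms: [{"key_term_name": t, "LLM_Judge_Response": True} for t in terms],
    progress=lambda fraction, desc="": log.append((fraction, desc)),
)
print(evals)
print(log)
```

Keeping the stages as parameters makes it easy to swap in a cheaper stub while testing the UI wiring.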

⚠️ Notes:
-----------------
- The function depends on earlier steps (`extract_text_from_file`, `extract_key_terms`, and `judge_llm`).
- It takes the uploaded file as input; the extracted text and `KEY_TERMS` are passed to the earlier steps internally.
- It uses the key terms to extract specific content as JSON, then evaluates each extracted answer and returns the results as DataFrames.


In [13]:
import io
import time


def process_documents_with_progress(contract_file, progress=gr.Progress()):
    """
    Process documents with progress updates
    """
    try:
        # Step 1: Extract text from contract file
        progress(0.1, desc="📄 Extracting text from contract file...")
        text, docs = extract_text_from_file(contract_file)
        progress(0.2, desc="✅ Text extraction completed")

        # Step 2: Extract key terms
        progress(0.3, desc="🔍 Extracting key terms from contract...")
        key_term_results = extract_key_terms(text, KEY_TERMS)
        progress(0.5, desc="✅ Key terms extraction completed")

        # Step 3: Judge each key term
        progress(0.6, desc="⚖️ Evaluating key terms with LLM judge...")
        evals = judge_llm(
            KEY_TERMS,
            key_term_results,
            EVALUATION_METRICS
        )
        progress(0.8, desc="✅ Evaluation completed")

        # Step 4: Format results
        progress(0.9, desc="📊 Formatting results for display...")

        # Format each evaluation result for display
        for e in evals:
            llm_ans = e["llm_extracted_ans_from_doc"]
            if llm_ans:
                # Keep only the text after a 'Text:' label, if the answer has one
                text_match = re.search(r'Text:\s*(.*)', llm_ans, re.DOTALL)
                e["llm_extracted_ans_from_doc"] = text_match.group(1).strip() if text_match else llm_ans

        # Prepare DataFrame with new columns in the correct order
        df = pd.DataFrame(evals)
        display_cols = [
            "key_term_name",
            "llm_extracted_ans_from_doc",
            # "llm_page_number",
            "evulation_metric_name",
            "LLM_Judge_Response",
            "justification"
        ]
        df = df[display_cols]

        # Split DataFrame into three based on metric index
        metric_groups = [EVALUATION_METRICS[:4], EVALUATION_METRICS[4:8], EVALUATION_METRICS[8:]]
        df1 = df[df["evulation_metric_name"].isin(metric_groups[0])].reset_index(drop=True)
        df2 = df[df["evulation_metric_name"].isin(metric_groups[1])].reset_index(drop=True)
        df3 = df[df["evulation_metric_name"].isin(metric_groups[2])].reset_index(drop=True)

        progress(1.0, desc="🎉 Processing completed successfully!")

        return text, df1, df2, df3, df

    except Exception as e:
        progress(1.0, desc=f"❌ Error occurred: {str(e)}")
        raise  # Re-raise with the original traceback

## 📌 Gradio UI: LLM Contract Judge

🧾 What it does:
-----------------
A web UI to upload a contract, extract key terms, evaluate them using an LLM, and view or download the results.

📥 Inputs:
-----------------
- `contract_file`: Upload contract (PDF, DOCX, TXT)
- `start_btn`: Starts the evaluation process
- `download_btn`: Downloads combined results as a CSV

📤 Outputs:
-----------------
- `progress_text`: Status textbox. The code below never updates it; live progress is reported through Gradio's `progress` handler instead
- `extracted_text`: Displays the raw extracted contract text
- `results_table1/2/3`: Show evaluation results grouped by helpful, honest, and harmless metrics
- `download_file`: Final downloadable CSV file of all results

🛠️ Internal Flow:
-----------------
1. User uploads a file and clicks "Start Evaluating".
2. `run_and_return_tables()` calls `process_documents_with_progress()`.
3. Results are returned to the UI and shown in 3 separate tabs.
4. User can click "Download" to export all tables as a CSV.

🎯 Key Features:
-----------------
- Live progress updates using Gradio's `progress` utility
- State handling to preserve processed DataFrames
- Easy CSV export after evaluation



When you run the last cell in your notebook, you’ll see a message like the one shown in the image below. Click the "Running on local URL" link to open the LLM Contract Judge app in a new screen.

![Gradio Local URL Example](Images/img-1.png)

Once you are done with the lab, you will see a UI like the one in the image below:

![LLM Contract Judge UI](Images/img-2.png)

In [None]:
import gradio as gr
import pandas as pd
import tempfile
import os

# Create the main Gradio interface using Blocks
with gr.Blocks() as demo:
    # Title Markdown
    gr.Markdown("# 📄 LLM Contract Judge\nUpload a contract, extract key terms, and evaluate with LLM.")

    # File upload component in a horizontal row layout
    with gr.Row():
        contract_file = gr.File(label="Upload Contract (PDF, DOCX, TXT)")

    # Button to trigger the evaluation process
    start_btn = gr.Button("🚀 Start Evaluating", variant="primary")

    # A non-editable textbox to show progress or status updates
    progress_text = gr.Textbox(
        label="Processing Status",
        value="Ready to start evaluation...",
        interactive=False
    )

    # Output box to display raw extracted contract text
    extracted_text = gr.Textbox(label="Extracted Contract Text", lines=10, interactive=False)

    # Tabbed interface for displaying different metric evaluation results
    with gr.Tabs():
        # Helpful Metrics tab
        with gr.TabItem("Helpful Metrics"):
            results_table1 = gr.Dataframe(headers=[
                "key_term_name",
                "llm_extracted_ans_from_doc",
                "evulation_metric_name",
                "LLM_Judge_Response",
                "justification"
            ], label="Evaluation Results (Helpful Metrics)")

        # Honest Metrics tab
        with gr.TabItem("Honest Metrics"):
            results_table2 = gr.Dataframe(headers=[
                "key_term_name",
                "llm_extracted_ans_from_doc",
                "evulation_metric_name",
                "LLM_Judge_Response",
                "justification"
            ], label="Evaluation Results (Honest Metrics)")

        # Harmless Metrics tab
        with gr.TabItem("Harmless Metrics"):
            results_table3 = gr.Dataframe(headers=[
                "key_term_name",
                "llm_extracted_ans_from_doc",
                "evulation_metric_name",
                "LLM_Judge_Response",
                "justification"
            ], label="Evaluation Results (Harmless Metrics)")

    # Button to download evaluation results as a CSV
    download_btn = gr.Button("📥 Download All Results as CSV")

    # File component to show the downloadable CSV file
    download_file = gr.File(label="Download CSV")

    # State variables to store DataFrames for use during download
    state_df1 = gr.State()
    state_df2 = gr.State()
    state_df3 = gr.State()
    state_df_all = gr.State()

    # Main function to process contract and return data for all tables
    def run_and_return_tables(contract_file, progress=gr.Progress()):
        if not contract_file:
            # If file not uploaded, show a message and clear the tables and state.
            # The tuple has 8 items, one per component in the click handler's outputs.
            return (
                "Please upload a contract file first.",
                gr.update(value=None),
                gr.update(value=None),
                gr.update(value=None),
                None, None, None, None
            )

        try:
            # Process the document; progress is reported through gr.Progress.
            # Returns the extracted text, the 3 metric tables, and the full table.
            text, df1, df2, df3, df_all = process_documents_with_progress(contract_file, progress)

            return (
                text,
                gr.update(value=df1),
                gr.update(value=df2),
                gr.update(value=df3),
                df1, df2, df3, df_all
            )

        except Exception as e:
            # Show the error in the text box and clear the tables and state
            error_msg = f"❌ Error during processing: {str(e)}"
            return (
                error_msg,
                gr.update(value=None),
                gr.update(value=None),
                gr.update(value=None),
                None, None, None, None
            )

    # Function to generate and return the downloadable CSV file from results
    def download_csv(contract_file, df_all):
        if df_all is None:
            return None

        try:
            # Define the display columns in the same format as original
            display_cols = [
                "key_term_name",
                "llm_extracted_ans_from_doc",
                # "llm_page_number",
                "evulation_metric_name",
                "LLM_Judge_Response",
                "justification"
            ]

            # Filter to only include the specified columns
            final_df = df_all[display_cols]

            # Write to a temporary file and return the path
            with tempfile.NamedTemporaryFile(delete=False, suffix=".csv", mode="w", encoding="utf-8") as tmp:
                final_df.to_csv(tmp, index=False)
                tmp_path = tmp.name

            return tmp_path

        except Exception as e:
            print(f"Error in download_csv: {str(e)}")
            return None

    # Trigger processing function when 'Start Evaluating' is clicked
    start_btn.click(
        run_and_return_tables,
        inputs=[contract_file],
        outputs=[extracted_text, results_table1, results_table2, results_table3, state_df1, state_df2, state_df3, state_df_all]
    )

    # Trigger CSV download when 'Download' is clicked
    download_btn.click(
        download_csv,
        inputs=[contract_file, state_df_all],
        outputs=download_file
    )

# Launch the Gradio app
demo.launch()
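
The CSV export step can also be tried outside the app. Below is a minimal sketch that mirrors what `download_csv` does, using only the standard library instead of pandas. `rows_to_temp_csv` is a hypothetical helper, and the sample row is made up for illustration.

```python
import csv
import tempfile


def rows_to_temp_csv(rows, columns):
    """Write the selected columns of dict rows to a temp CSV and return its path."""
    with tempfile.NamedTemporaryFile(
        delete=False, suffix=".csv", mode="w", encoding="utf-8", newline=""
    ) as tmp:
        # extrasaction="ignore" drops keys that are not in `columns`
        writer = csv.DictWriter(tmp, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
        return tmp.name


rows = [{"key_term_name": "Payment Terms", "LLM_Judge_Response": True, "extra": "dropped"}]
path = rows_to_temp_csv(rows, ["key_term_name", "LLM_Judge_Response"])
print(path)
```

Because the file is created with `delete=False`, it stays on disk after the function returns, which is what lets a download component serve it; cleaning it up afterwards is left to the caller.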