# LLM-as-a-Judge with Advanced Azure AI Evaluation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](<https://colab.research.google.com/github/sachin0034/MLAI-community-labs/blob/main/Class-Labs/Lab-14(LLM-Judge-lab)/Lab-2(LLM-as-a-Judge%20with%20Advanced%20Azure%20AI%20Evaluation)/LLM_as_a_Judge_with_Advanced_Azure_AI_Evaluation.ipynb>)

In the previous lab, our Legal Document Analyzer used OpenAI models both to extract key terms from legal contracts and to evaluate those extractions. This lab keeps the OpenAI-based extraction but replaces the evaluation step with the built-in evaluators from Azure AI. These evaluators score each extraction for groundedness, coherence, relevance, and fluency, which gives a more comprehensive view of model performance.

# Intro to Azure AI Evaluation
Azure AI Evaluation is a set of tools within Azure AI Foundry (formerly Azure AI Studio) for assessing the performance and quality of generative AI models and applications. It provides a structured way to measure aspects of AI responses such as accuracy, groundedness, and safety, using both built-in and custom metrics.

[Read more about Azure AI Evaluation](https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/develop/evaluate-sdk)

---

## Prerequisites

> ⚠️ **Note:** Make sure you have an active **Azure subscription account**. Without it, you won't be able to run this notebook with Azure integrations.  

---

Before you get started, please make sure you have the following ready:

---

### 1. Sample Contract File for Testing

To try out the contract analysis workflow, download the sample contract file provided below:

- [Download Sample Contract (Google Drive)](https://drive.google.com/file/d/1E557kdNBZ5cDUvVDLNrEVRuKcRSYDG3Z/view?usp=sharing)

---

### 2. OpenAI API Key

You’ll need your own OpenAI API key to access the language models used for contract evaluation. If you don’t have one yet, follow this step-by-step guide to generate your API key:

- [How to get your own OpenAI API key (Medium article)](https://medium.com/@lorenzozar/how-to-get-your-own-openai-api-key-f4d44e60c327)

---

### 3. Azure Setup (for Azure OpenAI / Azure AI Foundry Users)

If you're using Azure services, ensure you have the following:

- ✅ **An active Azure subscription**
- ✅ **Sufficient Azure portal permissions** (Contributor, Cognitive Services Contributor, or Owner at the subscription level)
- ✅ **An Azure AI Foundry resource or an Azure OpenAI resource deployed**

---


### 📦 Installation

Install all the required packages with the `pip` command in the cell that follows the table below:

#### ✅ Package Breakdown

| Package                  | Purpose                                                                 |
|--------------------------|-------------------------------------------------------------------------|
| `langchain`              | Framework for developing LLM-powered applications                      |
| `pypdf`                  | Extract text from PDF documents                                        |
| `docx2txt`               | Read and convert `.docx` files to plain text                           |
| `pandas`                 | Data manipulation and table formatting                                 |
| `openai`                 | Access to OpenAI’s GPT models                                          |
| `gradio`                 | Frontend interface to run your LLM app interactively                   |
| `azure-ai-generative`    | Azure SDK for working with generative AI models                        |
| `langchain-community`    | Community-contributed integrations and tools for LangChain             |
| `azure-ai-evaluation`    | Evaluation tools for scoring LLM responses                            |
| `azure-ai-projects`      | Tools to manage and structure LLM workflows in Azure                   |
| `semantic-kernel`        | SDK for integrating AI models with symbolic reasoning & memory         |



In [None]:
! pip install langchain pypdf docx2txt pandas openai gradio azure-ai-generative langchain-community azure-ai-evaluation  azure-ai-projects semantic-kernel

In [23]:
import os
import re
import time

import gradio as gr
import pandas as pd
# Document loaders live in langchain-community; the old `langchain.document_loaders` path is deprecated
from langchain_community.document_loaders import PyPDFLoader, Docx2txtLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from openai import OpenAI
from openai import APIError, APIConnectionError, RateLimitError

# Azure AI Evaluation imports
from azure.ai.evaluation import GroundednessEvaluator, CoherenceEvaluator, RelevanceEvaluator, FluencyEvaluator

## ⚙️ Configuration Setup

### 1. 🔑 OpenAI API Key

To use the language models for contract evaluation, you’ll need your own OpenAI API key.

If you haven’t generated one yet, follow this guide:

- 👉 [How to get your own OpenAI API key (Medium article)](https://medium.com/@lorenzozar/how-to-get-your-own-openai-api-key-f4d44e60c327)

---

### 2. ☁️ Azure Configuration (for Azure OpenAI / Azure AI Foundry Users)

If you're integrating with Azure services, ensure the following:

- ✅ **An active Azure subscription**
- ✅ **Proper permissions in the Azure portal** (Contributor, Cognitive Services Contributor, or Owner at subscription level)
- ✅ **An Azure AI Foundry or Azure OpenAI resource deployed**

> **⚠️ Note:** Azure OpenAI and Foundry services **require a premium Azure subscription**.

---

### 📘 Reference Documentation

**Azure AI Foundry Setup:**
- [Create an Azure AI Foundry Project](https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/create-projects?tabs=ai-foundry&pivots=fdp-project)


**Azure OpenAI Configuration:**
- [Azure OpenAI API](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/reference)


---


In [37]:
# Initialize OpenAI client

# How to get your OpenAI API key: https://medium.com/@lorenzozar/how-to-get-your-own-openai-api-key-f4d44e60c327
client = OpenAI(api_key='Insert Your API Key')

# Azure AI project details and Azure OpenAI connection settings (replace the placeholders with your own values)
# How to create the project: https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/create-projects?tabs=ai-foundry&pivots=fdp-project
azure_ai_project = {
    "subscription_id": "Insert your subscription_id",
    "resource_group_name": "Insert your resource_group_name",
    "project_name": "Insert your project_name",
}

# Where to find these values: https://learn.microsoft.com/en-us/azure/ai-foundry/openai/reference
model_config = {
    "azure_endpoint": "Insert your azure_endpoint",
    "api_key": "Insert your api_key",
    "azure_deployment": "Insert your azure_deployment",
    "api_version": "Insert your api_version"
}
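
Hardcoding keys in a notebook makes them easy to leak. As a minimal sketch, the same `model_config` dictionary could be built from environment variables instead. The variable names below (`AZURE_OPENAI_ENDPOINT` and so on) and the helper `load_model_config` are assumptions chosen for illustration, not names required by the Azure SDK.

```python
import os

# Hypothetical environment variable names mapped to model_config keys;
# rename them to match your own setup.
ENV_TO_CONFIG_KEY = {
    "AZURE_OPENAI_ENDPOINT": "azure_endpoint",
    "AZURE_OPENAI_API_KEY": "api_key",
    "AZURE_OPENAI_DEPLOYMENT": "azure_deployment",
    "AZURE_OPENAI_API_VERSION": "api_version",
}


def load_model_config(env=None):
    """Build a model_config-style dict from environment variables.

    Raises KeyError naming every missing variable, so a misconfiguration
    surfaces before any evaluator is constructed.
    """
    env = os.environ if env is None else env
    missing = [name for name in ENV_TO_CONFIG_KEY if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[name] for name, key in ENV_TO_CONFIG_KEY.items()}
```

With the variables set in the shell or in Colab secrets, `model_config = load_model_config()` would replace the literal dictionary above.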

### 🧠 Azure Evaluators Initialization

We use Azure's **built-in evaluators**. Each one wraps a predefined judging prompt and runs it against the Azure OpenAI deployment described in `model_config`, so the evaluators return scores directly and no custom prompt engineering is needed.

The following evaluators are initialized:

- **Groundedness** – Checks whether the response is supported by the supplied source content.
- **Coherence** – Assesses the logical flow and structure of the response.
- **Relevance** – Determines if the response accurately addresses the specific key term.
- **Fluency** – Evaluates grammar, clarity, and language quality.

---

### 🔑 Key Terms to Extract & Evaluate

The LLM extracts the following predefined key terms from the contract, and the Azure evaluators then score each extraction:

- Service Warranty  
- Limitation of Liability  
- Governing Law  
- Termination for Cause  
- Payment Terms  
- Confidentiality Obligations


In [38]:
# Initialize Azure evaluators
groundedness_eval = GroundednessEvaluator(model_config=model_config)
coherence_eval = CoherenceEvaluator(model_config=model_config)
relevance_eval = RelevanceEvaluator(model_config=model_config)
fluency_eval = FluencyEvaluator(model_config=model_config)

KEY_TERMS = [
    "Service Warranty",
    "Limitation of Liability",
    "Governing Law",
    "Termination for Cause",
    "Payment Terms",
    "Confidentiality Obligations"
]

### 📄 Document Loading

#### 📂 Extracting Text from Files

The `extract_text_from_file` function supports **multiple document formats** and handles file parsing based on the extension:

- **PDF**: Uses `PyPDFLoader` to extract content from each page.
- **DOCX/DOC**: Uses `Docx2txtLoader` to retrieve structured text from Word documents.
- **TXT**: Uses `TextLoader` to process plain text files.
- **CSV**: Uses `pandas` to read tabular data and convert it into string format for LLM consumption.

The function returns both the **raw text** and a list of `Document` objects, which can be used for downstream evaluation or processing.


In [39]:
def extract_text_from_file(file_path):
    """
    Extracts text from various file types with improved error handling.

    Args:
        file_path (str): The path to the file.

    Returns:
        tuple: A tuple containing the extracted text (str) and a list of Document objects.

    Raises:
        ValueError: If the file type is unsupported.
        Exception: For errors during file reading or processing.
    """
    ext = os.path.splitext(file_path)[1].lower()
    try:
        if ext == ".pdf":
            loader = PyPDFLoader(file_path)
            docs = loader.load()
            text = "\n".join([doc.page_content for doc in docs])
        elif ext in [".docx", ".doc"]:
            loader = Docx2txtLoader(file_path)
            docs = loader.load()
            text = "\n".join([doc.page_content for doc in docs])
        elif ext in [".txt"]:
            loader = TextLoader(file_path)
            docs = loader.load()
            text = "\n".join([doc.page_content for doc in docs])
        elif ext == ".csv":
            df = pd.read_csv(file_path)
            text = df.to_string(index=False)
            docs = [Document(page_content=text, metadata={"page": "N/A"})]
        else:
            raise ValueError("Unsupported file type")
        return text, docs
    except FileNotFoundError:
        print(f"Error: File not found at {file_path}")
        raise
    except pd.errors.EmptyDataError:
        print(f"Error: CSV file is empty or malformed at {file_path}")
        raise
    except Exception as e:
        print(f"An error occurred while reading the file {file_path}: {e}")
        raise

def chunk_document(docs, chunk_size=1000, chunk_overlap=200):
    """
    Splits a list of Document objects into smaller chunks.

    Args:
        docs (list): A list of Langchain Document objects.
        chunk_size (int): The maximum size of each chunk in characters.
        chunk_overlap (int): The number of characters to overlap between chunks.

    Returns:
        list: A list of smaller Langchain Document chunks.
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        is_separator_regex=False,
    )
    chunks = text_splitter.split_documents(docs)
    return chunks
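
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to break on separators such as paragraphs and sentences); the function name `sliding_window_chunks` is hypothetical and only illustrates how consecutive chunks share `chunk_overlap` characters.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a clause that straddles a chunk boundary still appears whole in at least one chunk, at the cost of sending some text to the LLM twice.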

### 🔁 Why We Use `query_llm_with_retry`

We use this function to **ensure reliable communication with the LLM**. APIs can fail due to rate limits, network issues, or temporary service disruptions. This retry mechanism helps prevent crashes and improves the robustness of the application by automatically retrying failed requests.


In [40]:
def query_llm_with_retry(messages, model="gpt-4o-mini", max_retries=3, delay=1):
    """
    Sends a query to the LLM with retry logic for API errors.

    Args:
        messages (list): The list of messages for the chat completion.
        model (str): The LLM model to use.
        max_retries (int): Maximum number of retries.
        delay (int): Base delay in seconds; the wait doubles after each failed attempt.

    Returns:
        str: The content of the LLM's response.

    Raises:
        Exception: If the LLM call fails after all retries.
    """
    for i in range(max_retries):
        try:
            completion = client.chat.completions.create(
                model=model,
                messages=messages
            )
            return completion.choices[0].message.content
        except (APIError, APIConnectionError, RateLimitError) as e:
            print(f"API error during LLM call (Attempt {i+1}/{max_retries}): {e}")
            if i < max_retries - 1:
                time.sleep(delay * (2 ** i))
            else:
                print("Max retries reached. LLM call failed.")
                raise
        except Exception as e:
            print(f"An unexpected error occurred during LLM call (Attempt {i+1}/{max_retries}): {e}")
            if i < max_retries - 1:
                time.sleep(delay * (2 ** i))
            else:
                print("Max retries reached. LLM call failed.")
                raise
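
The wait between attempts follows `delay * (2 ** i)`, so it grows exponentially rather than staying constant. The small helper below (a hypothetical name, `backoff_schedule`) only reproduces that arithmetic so the waits for a given configuration can be inspected without calling the API.

```python
def backoff_schedule(max_retries=3, delay=1):
    """Return the sleep durations (seconds) between attempts.

    There is no sleep after the final attempt, because the function
    re-raises the error at that point.
    """
    return [delay * (2 ** attempt) for attempt in range(max_retries - 1)]


print(backoff_schedule())                          # [1, 2]
print(backoff_schedule(max_retries=5, delay=0.5))  # [0.5, 1.0, 2.0, 4.0]
```

With the defaults, a call that keeps failing therefore waits 1 second, then 2 seconds, and gives up on the third attempt.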

## 📄 Function: `extract_key_terms(docs, key_terms)`

### ✅ Purpose
Extracts legal clauses relevant to predefined key terms (e.g., *Payment Terms*, *Governing Law*) from a document by using a Large Language Model (LLM). This enables automated compliance checks or contract analysis without manual review.

---

### 🧠 Why Use It?
- Legal documents are often long and complex.
- LLMs have token limits, so we split the document into chunks and analyze each individually.
- This function ensures we can **reliably find and summarize legal obligations** across entire documents using AI.

---

### 🧱 Inputs

| Parameter   | Type    | Description                                                                 |
|-------------|---------|-----------------------------------------------------------------------------|
| `docs`      | `list`  | A list of Langchain `Document` objects containing the document content.     |
| `key_terms` | `list`  | List of strings representing legal terms to search for (e.g., `["Confidentiality"]`). |

---

### 🔄 Internal Workflow

1. **Chunking**: The input document is split into smaller overlapping pieces to fit LLM limits.
2. **Term Loop**: For each term in `key_terms`, every chunk is checked via LLM.
3. **LLM Prompting**:
   - The model is prompted with the chunk and asked if the key term is relevant.
   - If **relevant**, it provides a **short summary**.
4. **Result Handling**:
   - Relevant summaries are collected with associated **page numbers**.
   - If no chunk is relevant, the answer is set to `"Not found."` and the page number to `"Not applicable"`.

---

### 📤 Output

Returns a `dict` where:

- **Key**: A legal term.
- **Value**: A dictionary containing:
  - `"answer"` – a brief summary or `"Not found."`
  - `"page_number"` – page(s) where relevant clauses were found.

#### Example:
```json
{
  "Confidentiality Obligations": {
    "answer": "Found relevant information. Summary: The agreement prohibits both parties from disclosing trade secrets.",
    "page_number": "3, 5"
  },
  "Governing Law": {
    "answer": "Not found.",
    "page_number": "Not applicable"
  }
}
```
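
The `page_number` field is built from each chunk's `page` metadata. As far as we can tell, `PyPDFLoader` stores that value as a zero-based index, so the first page of a PDF is reported as `0`; treat this as an assumption and check it against your installed version. The sketch below shows one way to turn such values into a human-readable, one-based string. The helper name `format_page_numbers` is hypothetical and is not part of the notebook.

```python
def format_page_numbers(pages, zero_based=True):
    """Format page values as a sorted, de-duplicated, comma-separated string.

    Non-integer entries such as "N/A" are ignored; if no integer page
    remains, "N/A" is returned.
    """
    numeric = sorted({page for page in pages if isinstance(page, int)})
    if not numeric:
        return "N/A"
    offset = 1 if zero_based else 0  # shift 0-based indices to 1-based page numbers
    return ", ".join(str(page + offset) for page in numeric)


print(format_page_numbers([4, 2, 2, "N/A"]))  # 3, 5
```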

---

In [41]:
def extract_key_terms(docs, key_terms):
    """
    Extracts clauses related to key terms from document chunks using LLM.
    Uses chunking to handle token limits and attempts to correlate results with page numbers.

    Args:
        docs (list): A list of Langchain Document objects (can be chunks).
        key_terms (list): List of key terms to extract.

    Returns:
        dict: A dictionary with each key term and its associated extracted answer and page number.
    """
    results = {}
    chunks = chunk_document(docs)

    for term in key_terms:
        term_results = []
        found_in_chunks = False
        for i, chunk in enumerate(chunks):
            prompt = (
                f"You are a legal document analysis assistant. Your task is to determine if the following document snippet "
                f"contains information pertaining to the term: '{term}'. If it does, extract the relevant clause(s) and provide "
                f"a very short summary. If relevant information is found, respond in the following structured format:\n\n"
                f"Relevant: Yes  \n"
                f"Summary: <A very short and to the point summary of the relevant clause(s)>\n\n"
                f"If the term is not found in this snippet, respond exactly with:  \n"
                f"Relevant: No  \n\n"
                f"Document Snippet (Chunk {i+1}):  \n"
                f"{chunk.page_content}  \n\n"
                f"Please ensure that your analysis is precise and adheres to the legal context, "
                f"maintaining clarity and conciseness in your summary."
            )

            messages = [
                {"role": "system", "content": "You are a legal contract analysis assistant."},
                {"role": "user", "content": prompt}
            ]

            try:
                answer = query_llm_with_retry(messages, model="gpt-4o-mini")
                if "Relevant: Yes" in answer:
                    summary_match = re.search(r"Summary:\s*(.*)", answer, re.DOTALL)
                    summary = summary_match.group(1).strip() if summary_match else "Summary extraction failed."
                    page_number = chunk.metadata.get("page", "N/A")
                    term_results.append({"clause_summary": summary, "page_number": page_number, "chunk_index": i})
                    found_in_chunks = True
                elif "Relevant: No" in answer:
                    pass
                else:
                    print(f"Warning: Unexpected LLM response format for term '{term}' in chunk {i+1}. Response:\n{answer}")

            except Exception as e:
                print(f"An error occurred while processing chunk {i+1} for term '{term}': {e}")

        if not found_in_chunks or not term_results:
            results[term] = {"answer": "Not found.", "page_number": "Not applicable"}
        else:
            combined_summary = " ".join([res["clause_summary"] for res in term_results])
            page_numbers = sorted(list(set([res["page_number"] for res in term_results])))
            page_info = ", ".join(map(str, page_numbers)) if page_numbers and page_numbers != ["N/A"] else "N/A"

            results[term] = {
                "answer": f"Found relevant information. Summary: {combined_summary}",
                "page_number": page_info
            }

    return results
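
`extract_key_terms` relies on exact, case-sensitive substring checks (`"Relevant: Yes"`), so a reply such as `relevant: yes` falls through to the warning branch. A slightly more tolerant parser could look like the sketch below. The function name `parse_relevance_reply` is hypothetical, and the reply format is the one requested in the prompt above.

```python
import re


def parse_relevance_reply(answer):
    """Return the summary text if the reply says 'Relevant: Yes', else None.

    Raises ValueError when the reply contains no 'Relevant: Yes/No' line.
    """
    verdict = re.search(r"Relevant:\s*(Yes|No)\b", answer, re.IGNORECASE)
    if verdict is None:
        raise ValueError("Reply does not follow the expected format")
    if verdict.group(1).lower() == "no":
        return None
    # Everything after 'Summary:' is treated as the summary text
    summary = re.search(r"Summary:\s*(.*)", answer, re.DOTALL)
    return summary.group(1).strip() if summary else ""


print(parse_relevance_reply("Relevant: Yes  \nSummary: Payment is due within 30 days."))
# Payment is due within 30 days.
```

Raising on an unexpected format keeps malformed replies visible instead of silently counting them as "not relevant".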


In [29]:
def truncate_text(text, max_length):
    """Truncate text to specified length for display purposes"""
    return text[:max_length] + "..." if len(text) > max_length else text

## 🧪 Function: `azure_judge(entry, results_df)`

### 🔹 Purpose
Evaluates a single LLM response (`entry`) using four quality metrics — **groundedness**, **coherence**, **relevance**, and **fluency** — and appends the results to the `results_df` DataFrame.

---

### 🧱 Parameters

| Name         | Type        | Description                                      |
|--------------|-------------|--------------------------------------------------|
| `entry`      | dict        | A dictionary with the keys `Key Term Name`, `query`, `context`, and `llm_response`. |
| `results_df` | DataFrame   | A Pandas DataFrame to which the evaluation results are appended. |

---

### 🛠️ Evaluation Metrics

- **Groundedness:** Checks how well the response is based on provided context.
- **Coherence:** Measures logical flow and consistency of the response.
- **Relevance:** Assesses how relevant the response is to the original query.
- **Fluency:** Evaluates the grammatical correctness and readability.

---

### ⚙️ Behavior

- Extracts necessary inputs from the `entry`.
- Calls evaluator functions with these inputs.
- Appends evaluation results (or `None` in case of an error) to `results_df`.

---

### ✅ Returns

- Updated `results_df` with a new row of evaluation scores and metadata.
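
To exercise this scoring logic without Azure credentials, the evaluators can be swapped for stand-ins. The sketch below is a test double, not the Azure SDK: `StubEvaluator` and `judge_entry` are hypothetical names, and the only behavior borrowed from the notebook is that each evaluator is called with keyword arguments and returns a dictionary keyed by its metric name.

```python
class StubEvaluator:
    """Stand-in for an Azure evaluator: returns a fixed score under `metric`."""

    def __init__(self, metric, score):
        self.metric = metric
        self.score = score

    def __call__(self, **kwargs):
        if not kwargs.get("response"):
            raise ValueError("response must be a non-empty string")
        return {self.metric: self.score}


def judge_entry(entry, evaluators):
    """Score one entry; on any failure every score is None, as in azure_judge."""
    response = entry["llm_response"]
    inputs = {
        "groundedness": {"query": entry["query"], "context": entry["context"], "response": response},
        "coherence": {"query": entry["query"], "response": response},
        "relevance": {"query": entry["query"], "response": response},
        "fluency": {"response": response},
    }
    try:
        return {metric: evaluators[metric](**kwargs)[metric] for metric, kwargs in inputs.items()}
    except Exception as exc:
        print(f"Error evaluating entry: {exc}")
        return {metric: None for metric in inputs}


stubs = {name: StubEvaluator(name, 4) for name in ("groundedness", "coherence", "relevance", "fluency")}
entry = {"query": "Extract Payment Terms.", "context": "Invoices are due in 30 days.", "llm_response": "Due in 30 days."}
print(judge_entry(entry, stubs))
```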



In [42]:
def azure_judge(entry, results_df):
    """Evaluate a single entry using all evaluators and add to results dataframe"""

    # Build the keyword arguments each evaluator expects from the LLM response
    groundedness_input = {
        "query": entry["query"],
        "context": entry["context"],
        "response": entry["llm_response"]
    }

    coherence_input = {
        "query": entry["query"],
        "response": entry["llm_response"]
    }

    relevance_input = {
        "query": entry["query"],
        "response": entry["llm_response"]
    }

    fluency_input = {
        "response": entry["llm_response"]
    }

    try:
        # Get all scores
        groundedness_score = groundedness_eval(**groundedness_input)
        coherence_score = coherence_eval(**coherence_input)
        relevance_score = relevance_eval(**relevance_input)
        fluency_score = fluency_eval(**fluency_input)

        # Create new row for the dataframe
        new_row = {
            'Key Term Name': entry['Key Term Name'], # Ensure Key Term Name is passed through
            'Context': entry['context'],
            'Query': entry['query'],
            'LLM Response': entry['llm_response'],
            'Groundedness Score': groundedness_score['groundedness'],
            'Coherence Score': coherence_score['coherence'],
            'Relevance Score': relevance_score['relevance'],
            'Fluency Score': fluency_score['fluency']
        }

    except Exception as e:
        print(f"Error evaluating entry: {e}")
        # Create row with error indicators
        new_row = {
            'Key Term Name': entry['Key Term Name'],
            'Context': entry['context'],
            'Query': entry['query'],
            'LLM Response': entry['llm_response'],
            'Groundedness Score': None,
            'Coherence Score': None,
            'Relevance Score': None,
            'Fluency Score': None
        }

    # Use concat with a pre-defined DataFrame
    new_row_df = pd.DataFrame([new_row])
    return pd.concat([results_df, new_row_df], ignore_index=True)

### Function: `generate_evaluation_data`

The `generate_evaluation_data` function prepares a structured dataset for evaluating how well an LLM has extracted information about specific legal key terms from a document. It iterates through a predefined list of key terms, checks if a term exists in the LLM result, and retrieves the associated answer. For each valid term, it constructs an evaluation entry consisting of:
- the key term name,
- a query requesting extraction of information for that term,
- a truncated version (first 2000 characters) of the original document text as context,
- and the LLM’s response.

These entries are compiled into a list and returned. This dataset is typically used for scoring the LLM outputs against evaluation metrics such as groundedness, fluency, coherence, and relevance.


In [43]:
def generate_evaluation_data(llm_results, document_text):
    """Create one evaluation entry per key term from the LLM extraction results."""
    evaluation_data = []

    for term in KEY_TERMS:
        if term in llm_results:
            llm_answer = llm_results[term].get('answer', 'Not found.')

            # Create evaluation entry
            entry = {
                'Key Term Name': term,  # Add the key term name here
                'query': f"Extract information about {term} from the legal document.",
                'context': document_text[:2000],  # Use first 2000 chars as context
                'llm_response': llm_answer
            }
            evaluation_data.append(entry)

    return evaluation_data
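
Because the context is always the first 2000 characters of the document, a clause that appears later in the contract is missing from the context the groundedness evaluator sees, which may lower its score even when the extraction is correct. One possible alternative, sketched here under assumptions, is to build the context from the chunks that actually matched the term. The helper `build_context` is hypothetical, and it assumes the matching chunk indices are made available (`extract_key_terms` records a `chunk_index` internally but does not currently return it).

```python
def build_context(chunk_texts, matched_indices, max_chars=2000):
    """Join the matched chunks, in document order, within a character budget."""
    parts = []
    remaining = max_chars
    for index in sorted(set(matched_indices)):
        if remaining <= 0:
            break
        piece = chunk_texts[index][:remaining]  # trim the last piece to fit the budget
        parts.append(piece)
        remaining -= len(piece) + 1  # +1 accounts for the newline separator
    return "\n".join(parts)


chunks = ["aaaa", "bbbb", "cccc"]
print(repr(build_context(chunks, matched_indices=[2, 0], max_chars=7)))
# 'aaaa\ncc'
```

The character budget keeps the evaluator input at the same size as the current approach while making it more likely that the relevant clause is included.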

### Function: `evaluate_responses`

The `evaluate_responses` function scores the LLM-generated outputs with Azure's built-in evaluators (groundedness, coherence, relevance, and fluency). It first calls `generate_evaluation_data` to build one evaluation entry per key term from the LLM's response and the first part of the document text. It then creates an empty DataFrame with the expected columns and passes each entry to `azure_judge`, which appends one row of scores. The result is a DataFrame with the evaluation scores for each response. If an error occurs during processing, the function catches the exception and returns a single-column DataFrame containing the error message.


In [44]:
def evaluate_responses(llm_results, document_text):
    """Score each LLM extraction with the Azure evaluators and return the results as a DataFrame"""
    try:
        # Create evaluation dataset
        evaluation_data = generate_evaluation_data(llm_results,document_text)

        # Initialize results dataframe with all necessary columns
        results_df = pd.DataFrame(columns=[
            'Key Term Name', 'Context', 'Query', 'LLM Response', 'Groundedness Score',
            'Coherence Score', 'Relevance Score', 'Fluency Score'
        ])

        # Evaluate each entry
        for entry in evaluation_data:
            results_df = azure_judge(entry, results_df)

        return results_df

    except Exception as e:
        print(f"Error during evaluation: {e}")
        # Return an empty DataFrame or a DataFrame with error info for Gradio
        return pd.DataFrame({'Error': [f"Evaluation failed: {e}"]})
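
Once the per-term scores are available, a per-metric average gives a quick overall picture. The sketch below works on plain dictionaries, for example the output of `results_df.to_dict("records")`, and skips missing values (`None` or NaN) that `azure_judge` leaves behind when an evaluation fails. The helper name `summarize_scores` is hypothetical.

```python
import math
import numbers
from statistics import mean

SCORE_COLUMNS = ["Groundedness Score", "Coherence Score", "Relevance Score", "Fluency Score"]


def _is_score(value):
    """True for real numbers other than booleans and NaN."""
    return (
        isinstance(value, numbers.Real)
        and not isinstance(value, bool)
        and not math.isnan(value)
    )


def summarize_scores(rows):
    """Return the mean of each score column, or None if a column has no valid scores."""
    summary = {}
    for column in SCORE_COLUMNS:
        values = [row.get(column) for row in rows]
        values = [value for value in values if _is_score(value)]
        summary[column] = round(mean(values), 2) if values else None
    return summary


rows = [
    {"Groundedness Score": 5, "Coherence Score": 4, "Relevance Score": 4, "Fluency Score": 3},
    {"Groundedness Score": 3, "Coherence Score": 4, "Relevance Score": None, "Fluency Score": float("nan")},
]
print(summarize_scores(rows))
```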

### Function: `process_document`

The `process_document` function handles the end-to-end flow of extracting and evaluating legal key terms from a contract file. It first reads and extracts text from the provided contract document, then uses an LLM to identify and extract predefined key legal terms. Once extracted, it evaluates the quality of these responses using Azure's evaluation metrics (groundedness, coherence, relevance, fluency). The function returns both the LLM-extracted results and the evaluation results. If an error occurs at any stage, it catches the exception and ensures a fallback response structure is returned, enabling smooth downstream handling in interfaces like Gradio.


In [45]:
def process_document(contract_file):
    """
    Processes the uploaded contract: extracts key terms and evaluates the extractions.

    Args:
        contract_file (str): Path to the contract document.

    Returns:
        tuple: A tuple containing the LLM extraction results (dict) and the evaluation results (DataFrame).
    """
    llm_results = {}

    evaluation_results = None
    document_text = ""

    try:
        # Step 1: Extract text and documents from the contract file
        print(f"Processing contract file: {contract_file}")
        contract_text, contract_docs = extract_text_from_file(contract_file)
        document_text = contract_text

        # Step 2: Extract key terms using the LLM on document chunks
        print("Extracting key terms using LLM...")
        llm_results = extract_key_terms(contract_docs, KEY_TERMS)
        print("Key term extraction complete.")
        # Step 3: Score the LLM responses with the Azure evaluators
        print("Starting evaluation...")
        evaluation_results = evaluate_responses(llm_results, document_text)
        print("Evaluation complete.")

    except Exception as e:
        print(f"\nAn error occurred during document processing: {e}")
        llm_results = {"Error": f"Processing failed: {e}"}

        # Ensure evaluation_results is a DataFrame even on error for Gradio
        evaluation_results = pd.DataFrame({'Error': [f"Processing failed: {e}"]})

    return llm_results, evaluation_results

### 📄 Application Flow: Legal Document Analyzer with Azure Evaluation

This application provides a user-friendly interface for analyzing legal documents using an LLM (`gpt-4o-mini` by default) and evaluating the extracted results with Azure's built-in evaluators.

---

### 🔁 End-to-End Flow

1. **User Uploads Document**
   - The user uploads a contract document (`.pdf`, `.docx`, `.doc`, `.txt`, or `.csv`) via the Gradio interface.

2. **Button Click Triggers Analysis**
   - Clicking the "Analyze Document" button invokes the `display_results` function.

3. **Document Processing**
   - Internally, `display_results` calls `process_document`, which performs:
     - **Text Extraction**: Extracts text from the uploaded file.
     - **LLM Term Extraction**: Extracts key legal terms using the LLM.
     - **Evaluation**: Evaluates the extracted responses using Azure metrics (Groundedness, Coherence, Relevance, Fluency).

4. **Formatting Results**
   - The LLM-extracted terms are formatted into a Markdown string.
   - The evaluation scores are structured into a clean `DataFrame` for display.

5. **Displaying Results**
   - The Gradio interface displays:
     - A **Markdown section** showing the key terms and their extracted answers.
     - A **DataFrame table** with evaluation scores for each term.

6. **Error Handling**
   - If any step fails (e.g., file parsing or API errors), a fallback response is returned with appropriate error messaging, ensuring the UI doesn't break.

---

### ✅ Outcome

The user receives:
- **LLM Extraction Results**: Each key legal term and its extracted answer.
- **Evaluation Table**: Groundedness, coherence, relevance, and fluency scores for each extracted answer, produced by Azure's evaluators.

---

> 💡 **Note**: You must have an active **Azure Premium subscription** to access the evaluation capabilities.


In [46]:
def display_results(contract_file):
    """
    Display results in a formatted way for Gradio interface
    """
    llm_results, evaluation_results = process_document(contract_file)

    # Format LLM Results
    llm_output = "## LLM Extraction Results\n\n"
    for term, result in llm_results.items():
        if isinstance(result, dict) and 'answer' in result:
            llm_output += f"**{term}:**\n"
            # End with a blank line so the next term is not rendered as part of this bullet
            llm_output += f"- Answer: {result['answer']}\n\n"
        else:
            llm_output += f"**{term}:** {result}\n\n"



    # Prepare Evaluation Results for Gradio DataFrame
    if evaluation_results is not None and not evaluation_results.empty:
        if 'Error' in evaluation_results.columns:
            # If there's an error, just return a simple DataFrame with the error message
            # Gradio DataFrame can display this, but it won't be in the desired evaluation format.
            # You might want to handle this error display differently in the UI if needed.
            # For now, it will show a table with one column 'Error' and the message.
            evaluation_df_for_gradio = evaluation_results
        else:
            # Define the desired column order for the final table
            desired_columns = [
                'Key Term Name', 'Query', 'LLM Response',
                'Groundedness Score', 'Coherence Score', 'Relevance Score', 'Fluency Score'
            ]

            # Filter and reorder columns
            # Drop 'Context' as it's not requested in the final output format.
            display_df = evaluation_results.copy()
            if 'Context' in display_df.columns:
                display_df = display_df.drop(columns=['Context'])

            final_columns = [col for col in desired_columns if col in display_df.columns]
            evaluation_df_for_gradio = display_df[final_columns]
    else:
        # Return an empty DataFrame with the desired columns if no evaluation is performed
        # This prevents Gradio from throwing an error about unexpected output type.
        evaluation_df_for_gradio = pd.DataFrame(columns=[
            'Key Term Name', 'Query', 'LLM Response',
            'Groundedness Score', 'Coherence Score', 'Relevance Score', 'Fluency Score'
        ])
        # Optionally add a placeholder row indicating that no evaluation was performed:
        # evaluation_df_for_gradio.loc[0] = ["N/A"] * len(evaluation_df_for_gradio.columns)
        # evaluation_df_for_gradio.loc[0, 'Key Term Name'] = "No evaluation performed."

    return llm_output, evaluation_df_for_gradio

# Gradio Interface
def create_interface():
    """Create Gradio interface for the legal document analyzer"""

    with gr.Blocks(title="Legal Document Analyzer with Azure Evaluation") as interface:
        gr.Markdown("# Legal Document Analyzer with Azure Evaluation")
        gr.Markdown("Upload a legal contract document to extract key terms and evaluate the results.")

        with gr.Row():
            with gr.Column():
                contract_file = gr.File(
                    label="Upload Contract Document",
                    file_types=[".pdf", ".docx", ".doc", ".txt", ".csv"]
                )
                analyze_btn = gr.Button("Analyze Document", variant="primary")

        with gr.Row():
            with gr.Column():
                llm_output = gr.Markdown(label="LLM Results")

        with gr.Row():
            # Gradio infers the column headers from the returned DataFrame;
            # wrap=True wraps long text inside cells.
            evaluation_output = gr.DataFrame(label="Evaluation Results", wrap=True)

        analyze_btn.click(
            fn=display_results,
            inputs=[contract_file],
            outputs=[llm_output,evaluation_output]
        )

    return interface

In [47]:
if __name__ == "__main__":
    interface = create_interface()
    interface.launch(debug=True, share=True)
