# Azure AI Evaluation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sachin0034/MLAI-community-labs/blob/main/AI-evaluation-course-labs/Lab-3(Building_AI_evaluators)/Advanced_Azure_AI_Evaluation.ipynb)



In this lab, we will learn about **Azure AI Evaluation**, a set of tools and services from Microsoft Azure for assessing and monitoring the performance, safety, and quality of AI applications, particularly generative AI and machine learning models.



# Get Started

## Prerequisites

Before you begin, make sure you have the following ready:


> **Note (Azure subscription requirement):**  
> To run this lab and access Azure OpenAI services, you must have an **Azure account with a paid (Pay-as-you-go or Enterprise) subscription**.  
> Free-tier or trial accounts **do not** provide access to Azure OpenAI resources.


### 1. Sample Contract File for Testing


- **Download the sample contract here:**  
  [Download Sample Contract (Google Drive)](https://drive.google.com/file/d/1E557kdNBZ5cDUvVDLNrEVRuKcRSYDG3Z/view?usp=sharing)


### 2. OpenAI API Key

An OpenAI API key is required for accessing the language models used in contract evaluation.

- **Don’t have an OpenAI API key? Get one in a few minutes by following this step-by-step guide:**  
  [How to get your own OpenAI API key (Medium article)](https://medium.com/@lorenzozar/how-to-get-your-own-openai-api-key-f4d44e60c327)

# Step 1: Install Dependencies and Their Descriptions

To begin working with Azure AI Evaluation and related tools, install the required Python packages listed below.


| Package Name | Description |
|--------------|-------------|
| **langchain** | Framework to build applications powered by language models, helping with prompt management and chaining AI responses. |
| **pypdf** | Library to read and manipulate PDF files, useful for processing contract documents in PDF format. |
| **docx2txt** | Simple tool to extract text from Microsoft Word (.docx) files, enabling text extraction from contracts. |
| **pandas** | Data analysis and manipulation library, useful for handling structured evaluation results and datasets. |
| **openai** | Python client for the OpenAI API, allowing access to OpenAI language models for text generation and evaluation. |
| **gradio** | Easy-to-use library to build interactive UIs for machine learning demos, helpful for building evaluation interfaces. |
| **azure-ai-generative** | Azure SDK to use Azure’s generative AI capabilities including chat and text generation. |
| **langchain-community** | Community-maintained extensions and tools for LangChain to support additional AI workflow features. |
| **azure-ai-evaluation** | Azure AI Evaluation SDK providing built-in evaluators to measure AI-generated content quality and safety. |
| **azure-ai-projects** | SDK to manage Azure AI Foundry projects and execute evaluations, enabling integration with Azure's AI management tools. |
| **semantic-kernel** | Framework for building AI apps with semantic memory and prompt orchestration, enhancing complex AI workflows. |


Installing these packages equips your Azure AI contract evaluation environment with capabilities for document processing, language model access, interactive interfaces, and quality evaluation tools.

---




In [None]:
# Install Dependencies
! pip install langchain pypdf docx2txt pandas openai gradio azure-ai-generative langchain-community azure-ai-evaluation azure-ai-projects semantic-kernel
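
After the install cell runs, it can help to confirm that every package actually resolved. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the lab, and the package list simply mirrors the install command above.

```python
from importlib import metadata

# Distribution names copied from the pip install command in this lab.
REQUIRED_PACKAGES = [
    "langchain", "pypdf", "docx2txt", "pandas", "openai", "gradio",
    "azure-ai-generative", "langchain-community", "azure-ai-evaluation",
    "azure-ai-projects", "semantic-kernel",
]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print a short report; missing packages are flagged instead of raising.
for name, version in installed_versions(REQUIRED_PACKAGES).items():
    print(f"{name:22s} {version or 'NOT INSTALLED'}")
```

Any line reporting `NOT INSTALLED` suggests re-running the install cell before continuing.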

# Step 2: Imports and Configuration

This code cell sets up the necessary imports and configurations for processing contract documents, interacting with OpenAI models, and using Azure AI Evaluation tools.

| Import / Library | Purpose |
|------------------|---------|
| `os`, `io`, `tempfile`, `re`, `json`, `time` | Standard Python libraries for file handling, string processing, JSON manipulation, and timing. |
| `PyPDFLoader`, `Docx2txtLoader`, `TextLoader` | LangChain loaders to read and extract text from PDF, DOCX, and text files. |
| `RecursiveCharacterTextSplitter` | Splits large documents into smaller chunks for better processing by AI models. |
| `Document` | LangChain object representing textual documents with associated metadata. |
| `pandas` | For organizing data and evaluation results in tabular format. |
| `gradio` | To build interactive user interfaces and demos. |
| `OpenAI` and error classes | OpenAI Python SDK for making API calls and handling errors like rate limits or connection issues. |
| `GroundednessEvaluator`, `CoherenceEvaluator`, `RelevanceEvaluator`, `FluencyEvaluator` | Azure AI Evaluation SDK classes to measure AI-generated content’s quality on various dimensions. |

---

This setup prepares the environment to load contract documents, process them, generate AI responses, and evaluate those responses using Azure AI Evaluation’s built-in metrics.


In [2]:
import os
import io
import tempfile
import re
import json
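# Loaders live in langchain-community; the older `langchain.*` import paths are deprecated
from langchain_community.document_loaders import PyPDFLoader, Docx2txtLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
import pandas as pd
from langchain_core.documents import Document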
from openai import OpenAI
from openai import APIError, APIConnectionError, RateLimitError
import time
import gradio as gr

# Azure AI Evaluation imports
from azure.ai.evaluation import GroundednessEvaluator, CoherenceEvaluator, RelevanceEvaluator, FluencyEvaluator

## Step 3: OpenAI and Azure AI Configuration

### OpenAI Client Initialization

> 1. **OpenAI API Key**  
>    To use the OpenAI client, you need an API key.  
>    👉 Follow this guide to get your API key:  
>    [How to get your own OpenAI API key (Medium)](https://medium.com/@lorenzozar/how-to-get-your-own-openai-api-key-f4d44e60c327)

---

### Azure AI Project Configuration

The `azure_ai_project` dictionary holds metadata for identifying and accessing a specific Azure AI project. This includes:

- `subscription_id`: The unique identifier of your Azure subscription.
- `resource_group_name`: The name of the resource group that contains your Azure resources.
- `project_name`: The name of the Azure AI project or workspace.

> **To configure the Azure AI project:**  
> 👉 Use this official Microsoft guide to create an Azure AI project:  
> [Create Azure AI Projects (Microsoft Docs)](https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/create-projects?tabs=ai-foundry&pivots=fdp-project)


---

### Azure OpenAI Model Configuration

The `model_config` dictionary contains connection details required to interact with an Azure-hosted OpenAI deployment.

- `azure_endpoint`: The base URL of the Azure OpenAI resource.
- `api_key`: The API key for authentication against the Azure OpenAI endpoint.
- `azure_deployment`: The specific deployment name of the OpenAI model within Azure.
- `api_version`: The version of the Azure OpenAI API to be used (e.g., `2024-02-15-preview`).

> **To configure the Azure OpenAI model deployment:**  
> 👉 Follow this official Microsoft guide to create and deploy an Azure OpenAI resource:  
> [Create and Deploy Azure OpenAI Resources (Microsoft Docs)](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/create-resource?pivots=web-portal)

---

> ❗ **Important:**  
> You **must** have both your **OpenAI API key** and your **Azure AI project + model configuration** completed.  
> Without these keys and setup, the application **will not run** and you **cannot proceed** further.



In [None]:
# Initialize OpenAI client (replace the placeholder with your own API key; avoid committing real keys)
# Guide to getting an API key: https://medium.com/@lorenzozar/how-to-get-your-own-openai-api-key-f4d44e60c327
client = OpenAI(api_key='YOUR_OPENAI_API_KEY')


# Initialize Azure AI project configuration (replace placeholders with your own values)
# Guide to creating an Azure AI project: https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/create-projects?tabs=ai-foundry&pivots=fdp-project
azure_ai_project = {
    "subscription_id": "your-subscription-id",
    "resource_group_name": "your-resource-group-name",
    "project_name": "your-project-name",
}


# Azure OpenAI model configuration (replace placeholders with your own values)
# Guide to creating and deploying an Azure OpenAI resource: https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/create-resource?pivots=web-portal
model_config = {
    "azure_endpoint": "your-azure-endpoint",
    "api_key": "YOUR_AZURE_API_KEY",
    "azure_deployment": "your-deployment-name",
    "api_version": "your-api-version"
}
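
Hard-coding keys in a notebook makes them easy to leak. As a hedged alternative, the sketch below builds the same four-key dictionary from environment variables and rejects values that still look like the placeholders above. The environment variable names and the helper `load_model_config` are assumptions chosen for illustration; they are not required by the Azure SDK.

```python
import os

# Values beginning with these prefixes are treated as unfilled placeholders.
PLACEHOLDER_PREFIXES = ("your-", "YOUR_")

# Hypothetical environment variable names; rename them to match your own setup.
ENV_TO_CONFIG_KEY = {
    "AZURE_OPENAI_ENDPOINT": "azure_endpoint",
    "AZURE_OPENAI_API_KEY": "api_key",
    "AZURE_OPENAI_DEPLOYMENT": "azure_deployment",
    "AZURE_OPENAI_API_VERSION": "api_version",
}


def load_model_config(env=None):
    """Build a model_config-style dict from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    config, problems = {}, []
    for var, key in ENV_TO_CONFIG_KEY.items():
        value = (env.get(var) or "").strip()
        if not value or value.startswith(PLACEHOLDER_PREFIXES):
            problems.append(var)  # missing or still a placeholder
        else:
            config[key] = value
    if problems:
        raise ValueError("Missing or placeholder values for: " + ", ".join(problems))
    return config
```

The returned dictionary has the same four keys as `model_config`, so it could be passed to the evaluators unchanged. Failing early with a `ValueError` tends to be easier to debug than an authentication error raised later by the first evaluator call.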


## Step 4: Azure Evaluator Initialization

This step initializes four evaluators from the Azure AI Evaluation SDK. Each one assesses a different quality aspect of generated text (here, the values the LLM extracts from the contract).

- **`GroundednessEvaluator`**: Measures how well the generated content is grounded in the source material.
- **`CoherenceEvaluator`**: Evaluates logical flow and consistency between sentences.
- **`RelevanceEvaluator`**: Checks how relevant the output is to the input or expected context.
- **`FluencyEvaluator`**: Assesses the grammatical and linguistic quality of the content.

Each evaluator is configured using the `model_config` dictionary, which must contain Azure OpenAI endpoint, API key, deployment name, and API version.



---

### Key Terms to Extract

`KEY_TERMS` lists the legal or contractual terms to identify and extract from the contract.

- **`"Product Name"`**: The name of the product being described or licensed.
- **`"Limitation of Liability In Months"`**: The duration of the limitation-of-liability period stated in the agreement, expressed in months.
- **`"Governing Law"`**: Specifies which jurisdiction’s laws apply to the contract.

In [4]:
# Initialize Azure evaluators
groundedness_eval = GroundednessEvaluator(model_config=model_config)
coherence_eval = CoherenceEvaluator(model_config=model_config)
relevance_eval = RelevanceEvaluator(model_config=model_config)
fluency_eval = FluencyEvaluator(model_config=model_config)

# Key terms to extract
KEY_TERMS = [
    "Product Name",
    "Limitation of Liability In Months",
    "Governing Law"
]

## Step 5: File Text Extraction Helper

### Function: `extract_text_from_file(file_path)`

This helper function handles the extraction of text from various file types, converting them into a consistent structure for downstream processing (e.g., with LangChain or LLM pipelines).

#### Parameters

- **`file_path`** *(str)*: Path to the input file whose content needs to be extracted.

#### Returns

- **`text`** *(str)*: Full extracted text from the file, as a single string.
- **`docs`** *(list)*: A list of document-like objects, each having a `page_content` attribute for structured processing.

---

### Supported File Types

| File Type | Loader Used         | Description                              |
|-----------|---------------------|------------------------------------------|
| `.pdf`    | `PyPDFLoader`       | Extracts page-wise text from PDF         |
| `.docx`, `.doc` | `Docx2txtLoader` | Reads content from Word documents (legacy `.doc` files are routed to the same loader and may fail to parse) |
| `.txt`    | `TextLoader`        | Loads plain text from text files         |
| `.csv`    | `pandas.read_csv`   | Converts CSV rows into a text string     |



> ⚠️ **Warning:** If the file type is not recognized, the function raises a `ValueError`.


In [5]:
def extract_text_from_file(file_path):
    # Get the file extension and convert it to lowercase
    ext = os.path.splitext(file_path)[1].lower()

    # Handle PDF files
    if ext == ".pdf":
        loader = PyPDFLoader(file_path)  # Use PyPDFLoader to read PDF
        docs = loader.load()  # Load document into LangChain Document objects
        text = "\n".join([doc.page_content for doc in docs])  # Combine all page content

    # Handle Word documents (.docx, .doc)
    elif ext in [".docx", ".doc"]:
        loader = Docx2txtLoader(file_path)  # Use Docx2txtLoader for Word files
        docs = loader.load()
        text = "\n".join([doc.page_content for doc in docs])

    # Handle plain text files
    elif ext in [".txt"]:
        loader = TextLoader(file_path)  # Use TextLoader for .txt files
        docs = loader.load()
        text = "\n".join([doc.page_content for doc in docs])

    # Handle CSV files
    elif ext == ".csv":
        df = pd.read_csv(file_path)  # Read CSV using pandas
        text = df.to_string(index=False)  # Convert DataFrame to plain string
        # Wrap the text in a LangChain Document to keep the return structure consistent
        docs = [Document(page_content=text)]

    # Unsupported file types
    else:
        raise ValueError("Unsupported file type")

    # Return both raw text and structured docs for further processing
    return text, docs  # docs may include page-level details
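
The extension dispatch above can also be used as a pre-flight check, so that unsupported uploads are rejected before any loader runs. The sketch below is a standalone illustration using only the standard library; `LOADER_BY_EXTENSION` and `loader_name_for` are hypothetical names, and the mapping only restates the supported-file-types table.

```python
import os

# Extension-to-loader mapping, restating the supported file types table.
LOADER_BY_EXTENSION = {
    ".pdf": "PyPDFLoader",
    ".docx": "Docx2txtLoader",
    ".doc": "Docx2txtLoader",
    ".txt": "TextLoader",
    ".csv": "pandas.read_csv",
}


def loader_name_for(file_path):
    """Return the name of the loader that would handle file_path, or raise ValueError."""
    # Lowercase the extension so "Contract.PDF" and "contract.pdf" behave the same.
    ext = os.path.splitext(file_path)[1].lower()
    try:
        return LOADER_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"Unsupported file type: {ext or '(no extension)'}") from None


print(loader_name_for("Sample_Contract.PDF"))
```

Including the offending extension in the error message makes the failure easier to diagnose than a bare "Unsupported file type".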

## Step 6: Key Term Extraction with LLM and JSON Parsing

### Function: `extract_key_terms(text, key_terms)`

This function extracts specific legal or contractual terms from a text document using a large language model (LLM). It prompts the LLM to return structured JSON responses and parses them with resilience against formatting issues.

---

### 📥 Parameters

- **`text`** *(str)*: Full contract or document text to be analyzed.
- **`key_terms`** *(list of str)*: List of term names (e.g., `"Governing Law"`) to extract from the document.

---

### 📤 Returns

- **`results`** *(dict)*: Dictionary mapping each key term to a dictionary containing:
  - `"Value"`: The extracted value or `"Not found"`.
  - `"page_number"`: The page number where the term was found (if available).

Example:
```json
{
  "Governing Law": {
    "Value": "California",
    "page_number": "5"
  }
}
```


In [6]:
def extract_key_terms(text, key_terms):

    def safe_json_parse(response_text):
        """Safely parse JSON from LLM response with fallback strategies."""
        # Strategy 1: Try direct JSON parsing
        try:
            return json.loads(response_text.strip())
        except json.JSONDecodeError:
            pass

        # Strategy 2: Extract from code blocks
        json_patterns = [
            r'```json\s*(.*?)\s*```',
            r'```\s*(.*?)\s*```',
            r'\{.*\}'
        ]

        for pattern in json_patterns:
            json_match = re.search(pattern, response_text, re.DOTALL)
            if json_match:
                try:
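                    # Use the captured group when the pattern defines one, otherwise the whole match
                    candidate = json_match.group(1) if json_match.lastindex else json_match.group(0)
                    return json.loads(candidate.strip())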
                except json.JSONDecodeError:
                    continue

        # Fallback: return default structure
        return {"Value": "Not found", "Page Number": None, "Section": None}

    results = {}
    for term in key_terms:
        prompt = (
            f"Act as a legal expert. From this contract text, extract the value for '{term}'.\n\n"
            f"Contract Text: {text}\n\n"
            f"Instructions:\n"
            f"1. Provide a one-word answer for '{term}' if found\n"
            f"2. Include page number if available\n"
            f"3. If not found, use 'Not found'\n\n"
            f"Return ONLY valid JSON in this exact format:\n"
            f'{{"Value": "your_answer", "Page Number": "page_number_or_null", "Section": "section_name"}}'
        )

        try:
            completion = client.chat.completions.create(
                model="gpt-4o-mini",  # OpenAI chat model used for extraction
                messages=[
                    {"role": "system", "content": "You are a legal contract analysis assistant. Always return valid JSON only."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.1  # Lower temperature for consistency
            )
            answer = completion.choices[0].message.content
            print("****************LLM Answer*******************")
            print(answer)
            print("*********************************************")

            # Use safe JSON parsing
            parsed_response = safe_json_parse(answer)
            print(f"DEBUG: Parsed JSON successfully: {parsed_response}")

            value = parsed_response.get("Value", "Not found")
            print(f"DEBUG: Extracted value: {value}")

            # Extract page number with multiple fallback options
            page_number = (
                parsed_response.get("Page Number") or
                parsed_response.get("PageNumber") or
                parsed_response.get("page_number")
            )
            print(f"DEBUG: Direct page number extraction: {page_number}")

            # If not found, try extracting from Section field
            if not page_number:
                section = parsed_response.get("Section", "")
                print(f"DEBUG: Section field content: {section}")
                if section and "Page" in str(section):
                    page_match = re.search(r'Page (\d+)', str(section))
                    if page_match:
                        page_number = page_match.group(1)
                        print(f"DEBUG: Extracted page number from section: {page_number}")

        except Exception as e:
            print(f"DEBUG: API call or parsing failed: {e}")
            value = "Error"
            page_number = None

        final_result = {"Value": value, "page_number": page_number}
        print(f"DEBUG: Final result for term '{term}': {final_result}")
        print("=" * 60)

        results[term] = final_result

    print(f"DEBUG: All results: {results}")
    return results
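
To see the fallback parsing in isolation, here is a minimal standalone sketch of the same idea: try the raw text, then a fenced block, then the outermost braces, and finally a default. The names `parse_llm_json` and `page_number_from` are hypothetical, and the sample responses in the usage lines are invented inputs rather than real model output. Unlike the function above, this sketch converts page numbers to strings.

```python
import json
import re

DEFAULT_RESULT = {"Value": "Not found", "Page Number": None, "Section": None}


def parse_llm_json(response_text):
    """Return the first JSON object recoverable from an LLM response, else a default."""
    candidates = [response_text.strip()]
    # Fenced block, with or without a "json" language tag.
    fenced = re.search(r"```(?:json)?\s*(.*?)\s*```", response_text, re.DOTALL)
    if fenced:
        candidates.append(fenced.group(1))
    # Outermost braces, for JSON embedded in surrounding prose.
    braces = re.search(r"\{.*\}", response_text, re.DOTALL)
    if braces:
        candidates.append(braces.group(0))
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):  # ignore bare numbers, strings, or lists
            return parsed
    return dict(DEFAULT_RESULT)


def page_number_from(parsed):
    """Look for a page number under several key spellings, then inside "Section"."""
    for key in ("Page Number", "PageNumber", "page_number"):
        if parsed.get(key):
            return str(parsed[key])
    match = re.search(r"Page (\d+)", str(parsed.get("Section") or ""))
    return match.group(1) if match else None


reply = 'Here you go:\n```json\n{"Value": "Delaware", "Section": "Clause 12, Page 7"}\n```'
parsed = parse_llm_json(reply)
print(parsed["Value"], page_number_from(parsed))
```

Returning a copy of the default keeps callers from accidentally mutating the shared `DEFAULT_RESULT` dictionary.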



## Step 7: LLM Response Evaluation with Azure AI

### Function: `azure_judge(entry, results_df)`

This function evaluates a single LLM-generated response across four evaluation dimensions (**groundedness**, **coherence**, **relevance**, and **fluency**) using the Azure AI evaluators. It returns an updated results DataFrame with the new evaluation appended.

---

### 📥 Parameters

- **`entry`** *(dict)*: A dictionary containing the following fields:
  - `"Key Term Name"`: Name of the key term being evaluated.
  - `"context"`: Background or reference text used for grounding.
  - `"query"`: The question or prompt issued to the LLM.
  - `"llm_response"`: The LLM-generated output to be evaluated.

- **`results_df`** *(pd.DataFrame)*: The existing DataFrame to which the evaluation result will be appended.

---

### 📤 Returns

- **`pd.DataFrame`**: An updated DataFrame with a new row containing:
  - Evaluation scores from all four evaluators
  - The query, context, response, and key term name

---


In [7]:
def azure_judge(entry, results_df):
    """Evaluate a single entry using all evaluators and add to results dataframe"""

    # Format inputs for each evaluator; the LLM's extracted value is the response under evaluation
    groundedness_input = {
        "query": entry["query"],
        "context": entry["context"],
        "response": entry["llm_response"]
    }

    coherence_input = {
        "query": entry["query"],
        "response": entry["llm_response"]
    }

    relevance_input = {
        "query": entry["query"],
        "response": entry["llm_response"]
    }

    fluency_input = {
        "response": entry["llm_response"]
    }

    try:
        # Get all scores
        groundedness_score = groundedness_eval(**groundedness_input)
        coherence_score = coherence_eval(**coherence_input)
        relevance_score = relevance_eval(**relevance_input)
        fluency_score = fluency_eval(**fluency_input)

        # Create new row for the dataframe
        new_row = {
            'Key Term Name': entry['Key Term Name'], # Ensure Key Term Name is passed through
            'Context': entry['context'],
            'Query': entry['query'],
            'LLM Response': entry['llm_response'],
            'Groundedness Score': groundedness_score['groundedness'],
            'Coherence Score': coherence_score['coherence'],
            'Relevance Score': relevance_score['relevance'],
            'Fluency Score': fluency_score['fluency']
        }

    except Exception as e:
        print(f"Error evaluating entry: {e}")
        # Create row with error indicators
        new_row = {
            'Key Term Name': entry['Key Term Name'],
            'Context': entry['context'],
            'Query': entry['query'],
            'LLM Response': entry['llm_response'],
            'Groundedness Score': None,
            'Coherence Score': None,
            'Relevance Score': None,
            'Fluency Score': None
        }

    # Use concat with a pre-defined DataFrame
    new_row_df = pd.DataFrame([new_row])
    return pd.concat([results_df, new_row_df], ignore_index=True)
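
In `azure_judge`, all four evaluators share one `try` block, so a failure in any one of them sets every score to `None`. One possible variation, sketched here under stated assumptions, is to guard each evaluator call separately. The helper `score_safely` is hypothetical, and the two functions below are stand-ins that imitate the evaluators' call shape (keyword inputs, dictionary output); they do not call Azure.

```python
def score_safely(evaluator, score_key, **inputs):
    """Call one evaluator and return its score, or None if that call fails."""
    try:
        return evaluator(**inputs).get(score_key)
    except Exception as exc:
        print(f"{score_key} evaluation failed: {exc}")
        return None


# Stand-in evaluators for illustration only.
def fake_fluency(*, response):
    return {"fluency": 4.0}


def failing_relevance(*, query, response):
    raise RuntimeError("simulated rate limit")


query = "Extract the Governing Law from this contract"
row = {
    "Fluency Score": score_safely(fake_fluency, "fluency", response="Delaware"),
    "Relevance Score": score_safely(
        failing_relevance, "relevance", query=query, response="Delaware"
    ),
}
print(row)
```

With this shape, a rate-limit error on one metric would leave the other three scores intact in the results row.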

## Step 8: Main Document Processing Pipeline

### Function: `process_document(file)`

This core function orchestrates the end-to-end workflow for document ingestion, key term extraction, and quality evaluation of AI-generated outputs using Azure Large Language Model (LLM) evaluators. It is designed to process contractual or related documents, extracting relevant terms and assessing the reliability and quality of the extracted data.

---

### Parameters

- **`file`** *(UploadedFile or equivalent)*:  
  The input file to be processed. Supported formats include `.pdf`, `.docx`, `.txt`, and `.csv`.

---

### Returns

A tuple consisting of:

1. **Status Message** *(str)* — Indicates success or explains errors encountered during processing.  
2. **`extracted_df`** *(pandas.DataFrame)* — A structured table containing extracted key contractual terms, their detected values, and associated page numbers or locations within the document.  
3. **`evaluation_summary`** *(pandas.DataFrame)* — A summary table containing quality scores generated by Azure LLM evaluators for each extracted term, measuring attributes such as groundedness, coherence, relevance, and fluency.

---

### Workflow Overview

#### 1. File Validation  
The function initially verifies that a valid file has been provided. If no file is detected, it terminates early with a user-friendly prompt requesting file upload.

#### 2. Document Text Extraction  
The input document is processed using the `extract_text_from_file()` utility, which supports multiple file formats. This step yields:  
- A single concatenated string representing the full document text.  
- A structured list of LangChain-compatible document objects (`docs`) for downstream NLP processing.

> **Supported formats:** PDF, Microsoft Word (.docx), plain text (.txt), and CSV files.

#### 3. Key Term Extraction  
Using a prompt-based approach, `extract_key_terms()` asks the LLM to identify each key term and its value in the extracted text. The output is a dictionary mapping each key term to its detected value and, where available, the page number where it was found.

> **Key terms used in this lab:** `"Product Name"`, `"Limitation of Liability In Months"`, `"Governing Law"`.

#### 4. Azure AI Evaluation  
Each extracted key term is then evaluated for quality and reliability:  
- An evaluation entry is prepared combining the query, the document context (the first 2,000 characters of the extracted text), and the LLM’s extracted value.  
- The `azure_judge()` function invokes Azure AI evaluators to assess each response across multiple dimensions:  
  - **Groundedness:** Degree to which the output is supported by source content.  
  - **Coherence:** Logical consistency and clarity of the extracted response.  
  - **Relevance:** Pertinence of the response to the specific query or clause.  
  - **Fluency:** Language quality, readability, and grammatical correctness.

In [8]:
# Main Processing Function

def process_document(file):
    """Main function to process document and return results"""

    if file is None:
        return "Please upload a document first.", None, None

    try:
        # Step 2: Extract text from document
        print("Step 2: Extracting text from document...")
        document_text, docs = extract_text_from_file(file.name)

        # Step 3: Extract key terms using LLM
        print("Step 3: Extracting key terms...")
        key_term_results = extract_key_terms(document_text, KEY_TERMS)

        # Step 4: Prepare data for Azure evaluation
        print("Step 4: Preparing for Azure evaluation...")
        results_df = pd.DataFrame()

        extracted_info = []

        for term, result in key_term_results.items():
            # Prepare entry for azure evaluation
            entry = {
                'Key Term Name': term,
                'context': document_text[:2000],  # Limit context for API
                'query': f"Extract the {term} from this contract",
                'llm_response': result['Value']
            }

            # Add to extracted info for display
            extracted_info.append({
                'Key Term': term,
                'Extracted Value': result['Value'],
                'Page Number': result['page_number'] if result['page_number'] else 'N/A'
            })

            # Evaluate with Azure
            print(f"Evaluating {term}...")
            results_df = azure_judge(entry, results_df)

        # Create extracted info dataframe for display
        extracted_df = pd.DataFrame(extracted_info)

        # Create evaluation summary
        if not results_df.empty:
            evaluation_summary = results_df[['Key Term Name', 'LLM Response', 'Groundedness Score',
                                           'Coherence Score', 'Relevance Score', 'Fluency Score']].copy()
        else:
            evaluation_summary = pd.DataFrame()

        return (
            f"✅ Document processed successfully!\n\nExtracted {len(key_term_results)} key terms.",
            extracted_df,
            evaluation_summary
        )

    except Exception as e:
        return f"❌ Error processing document: {str(e)}", None, None
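
Because the evaluation context is the first 2,000 characters of the document, a value extracted from a later page may not appear in the context that the groundedness evaluator sees. One possible alternative, sketched below, is to center the context on the first occurrence of the extracted value. The helper `context_window` is hypothetical and not part of the lab; it uses a case-sensitive match and falls back to the leading characters when the value is not found.

```python
def context_window(text, value, width=2000):
    """Return up to `width` characters of `text` around the first occurrence of `value`.

    Falls back to the first `width` characters when `value` is empty or absent.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    index = text.find(value) if value else -1  # case-sensitive search
    if index < 0:
        return text[:width]
    # Split the remaining budget evenly on both sides of the match.
    half = max(0, (width - len(value)) // 2)
    start = max(0, index - half)
    # Shift left near the end of the text so the window stays full-sized.
    start = min(start, max(0, len(text) - width))
    return text[start:start + width]


sample = "A" * 5000 + "Governing Law: Delaware" + "B" * 5000
window = context_window(sample, "Delaware", width=200)
print("Delaware" in window, len(window))
```

In `process_document`, this would replace `document_text[:2000]` with something like `context_window(document_text, result['Value'])`; whether that improves groundedness scores on a given contract would need to be checked empirically.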

## Step 9: Dashboard Visualization of LLM Evaluation Scores

### Function: `create_dashboard(evaluation_df)`

This function generates a comprehensive three-panel dashboard using Matplotlib and Seaborn to visualize evaluation metrics of Large Language Model (LLM) outputs. It presents key quality attributes such as groundedness, coherence, relevance, and fluency across extracted terms from document analysis.

---

### Parameters

- **`evaluation_df`** *(pandas.DataFrame)*:  
  A DataFrame containing LLM evaluation scores, typically produced by the `process_document()` pipeline. It should include columns representing different quality metrics aligned with specific key terms.

---

### Returns

- **`fig`** *(matplotlib.figure.Figure)*:  
  A Matplotlib Figure object that encapsulates the dashboard's visualizations. This figure can be rendered directly in various Python UI frameworks like Streamlit or saved as an image file for reporting purposes.

---

### Visualization Panels

#### 1. Average Evaluation Scores (Bar Chart)  
Displays the overall average scores for the four core evaluation metrics (groundedness, coherence, relevance, fluency) aggregated across all key terms and documents.  
> **Note:** The chart's y-axis spans 0 to 5. Azure's AI-assisted quality evaluators score on a scale from 1 (lowest) to 5 (highest).

#### 2. Performance by Key Term (Bar Chart)  
Aggregates and depicts scores per extracted key term, allowing users to identify specific contract clauses or topics where the LLM's performance is stronger or weaker.

#### 3. Overall Performance Gauge  
A horizontal gauge that fills in proportion to the mean evaluation score across all terms and metrics (score / 5), with the numeric score and a status label shown as text.  
- The fill color indicates the performance level:  
  - 🟢 Excellent (4.0 – 5.0)  
  - 🟠 Good (3.0 – 4.0)  
  - 🔴 Fair (2.0 – 3.0)  
  - ⚪ Poor (< 2.0)
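
The banding above can be expressed as a small lookup, which keeps the thresholds, labels, and colors in one place. This is a standalone sketch; `STATUS_BANDS` and `status_for` are hypothetical names, while the thresholds and hex colors mirror the ones used in the dashboard code for this step.

```python
# (minimum score, label, hex color), checked from the highest band down.
STATUS_BANDS = [
    (4.0, "Excellent", "#2ECC71"),  # green
    (3.0, "Good", "#F39C12"),       # orange
    (2.0, "Fair", "#E74C3C"),       # red
]
POOR = ("Poor", "#95A5A6")          # gray, anything below 2.0


def status_for(score):
    """Return the (label, color) pair for a mean evaluation score."""
    for minimum, label, color in STATUS_BANDS:
        if score >= minimum:
            return label, color
    return POOR


for sample_score in (4.5, 3.2, 2.0, 1.4):
    print(sample_score, status_for(sample_score))
```

Each lower bound is inclusive, so a score of exactly 4.0 is labeled Excellent and exactly 3.0 is labeled Good.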

---

### Internal Logic and Features

- **Data Preparation:**  
  - Converts evaluation score columns to numeric format.  
  - Filters out records with all missing (`NaN`) values to maintain plot accuracy.

- **Styling and Palettes:**  
  - Employs Seaborn's `"husl"` color palette to ensure vibrant, consistent aesthetics.  
  - Custom colors are assigned for the gauge’s performance segments.

- **Annotations and Responsiveness:**  
  - Bar charts include value labels directly on bars for readability.  
  - Axes and tick labels are styled for clear visualization even with multiple terms.

- **Gauge Plot Implementation:**  
  - Draws on standard Cartesian axes with `fill_between`, using an x-range of 0 to π.  
  - A light gray band marks the full range, and a band in the status color covers the fraction `overall_score / 5`.  
  - The mean score and its status label are rendered as centered text, and ticks and spines are hidden.

- **Fallback Behavior:**  
  - If `evaluation_df` is `None` or empty, the function returns `None`.  
  - If the DataFrame has rows but no valid numeric scores, the function returns a figure with a message indicating that no valid evaluation scores are available.





In [9]:
# Dashboard Visualization Function

def create_dashboard(evaluation_df):
    """Create dashboard visualization of evaluation scores"""

    if evaluation_df is None or evaluation_df.empty:
        return None

    import numpy as np  # required by the gauge code below (np.linspace, np.ones_like)
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Set style
    plt.style.use('default')
    sns.set_palette("husl")

    # Create figure with subplots
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))
    fig.suptitle('Document Evaluation Dashboard', fontsize=16, fontweight='bold')

    # Score columns
    score_columns = ['Groundedness Score', 'Coherence Score', 'Relevance Score', 'Fluency Score']

    # Filter out None values and convert to numeric
    plot_data = evaluation_df.copy()
    for col in score_columns:
        plot_data[col] = pd.to_numeric(plot_data[col], errors='coerce')

    # Remove rows with all NaN scores
    plot_data = plot_data.dropna(subset=score_columns, how='all')

    if plot_data.empty:
        # If no valid data, show message
        fig.text(0.5, 0.5, 'No valid evaluation scores available',
                ha='center', va='center', fontsize=20)
        return fig

    # 1. Individual Scores Bar Chart
    ax1 = axes[0]
    scores_avg = plot_data[score_columns].mean()
    bars1 = ax1.bar(range(len(scores_avg)), scores_avg.values,
                    color=['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4'])
    ax1.set_xlabel('Evaluation Metrics')
    ax1.set_ylabel('Average Score')
    ax1.set_title('Average Evaluation Scores')
    ax1.set_xticks(range(len(scores_avg)))
    ax1.set_xticklabels([col.replace(' Score', '') for col in scores_avg.index], rotation=45)
    ax1.set_ylim(0, 5)  # Assuming scores are 0-5

    # Add value labels on bars
    for bar, value in zip(bars1, scores_avg.values):
        height = bar.get_height()
        if not pd.isna(height):
            ax1.text(bar.get_x() + bar.get_width()/2., height + 0.05,
                    f'{height:.2f}', ha='center', va='bottom')

    # 2. Key Terms Performance
    ax2 = axes[1]
    if len(plot_data) > 0:
        term_scores = plot_data.groupby('Key Term Name')[score_columns].mean().mean(axis=1)
        bars2 = ax2.bar(range(len(term_scores)), term_scores.values,
                       color=['#FFD93D', '#6BCF7F', '#4D96FF'])
        ax2.set_xlabel('Key Terms')
        ax2.set_ylabel('Average Score')
        ax2.set_title('Performance by Key Term')
        ax2.set_xticks(range(len(term_scores)))
        ax2.set_xticklabels(term_scores.index, rotation=45, ha='right')
        ax2.set_ylim(0, 5)

        # Add value labels
        for bar, value in zip(bars2, term_scores.values):
            height = bar.get_height()
            if not pd.isna(height):
                ax2.text(bar.get_x() + bar.get_width()/2., height + 0.05,
                        f'{height:.2f}', ha='center', va='bottom')

    # 3. Overall Performance Gauge
    ax3 = axes[2]
    overall_score = plot_data[score_columns].mean().mean()

    # Create a simple gauge (on these Cartesian axes it renders as a horizontal fill bar from 0 to pi)
    theta = np.linspace(0, np.pi, 100)
    r = np.ones_like(theta)

    # Color based on score
    if pd.notna(overall_score):
        if overall_score >= 4:
            color = '#2ECC71'  # Green
            status = 'Excellent'
        elif overall_score >= 3:
            color = '#F39C12'  # Orange
            status = 'Good'
        elif overall_score >= 2:
            color = '#E74C3C'  # Red
            status = 'Fair'
        else:
            color = '#95A5A6'  # Gray
            status = 'Poor'

        # Plot gauge
        ax3.fill_between(theta, 0, r, alpha=0.3, color='lightgray')
        gauge_theta = np.linspace(0, np.pi * (overall_score/5), 50)
        gauge_r = np.ones_like(gauge_theta)
        ax3.fill_between(gauge_theta, 0, gauge_r, alpha=0.8, color=color)

        ax3.set_ylim(0, 1)
        ax3.set_xlim(0, np.pi)
        ax3.set_title('Overall Performance')
        ax3.text(np.pi/2, 0.5, f'{overall_score:.2f}\n{status}',
                ha='center', va='center', fontsize=14, fontweight='bold')
        ax3.set_xticks([])
        ax3.set_yticks([])
        ax3.spines['top'].set_visible(False)
        ax3.spines['right'].set_visible(False)
        ax3.spines['bottom'].set_visible(False)
        ax3.spines['left'].set_visible(False)

    plt.tight_layout()
    return fig
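
The gauge thresholds used above can be isolated into a small helper so they are easy to check on their own. The following is a minimal sketch, not part of the original notebook: `classify_score` and `gauge_fraction` are hypothetical names, and the clamping to the range 0 to 1 in `gauge_fraction` is an added safeguard that `create_dashboard` itself does not apply.

```python
import math
from typing import Optional, Tuple


def classify_score(score: Optional[float]) -> Optional[Tuple[str, str]]:
    """Return (hex_color, status) for an overall score, or None if it is missing.

    Hypothetical helper: mirrors the thresholds in the gauge section of
    create_dashboard() (>= 4 Excellent, >= 3 Good, >= 2 Fair, otherwise Poor).
    """
    if score is None or math.isnan(score):
        return None
    if score >= 4:
        return '#2ECC71', 'Excellent'  # Green
    if score >= 3:
        return '#F39C12', 'Good'       # Orange
    if score >= 2:
        return '#E74C3C', 'Fair'       # Red
    return '#95A5A6', 'Poor'           # Gray


def gauge_fraction(score: float, max_score: float = 5.0) -> float:
    """Fraction of the gauge to fill for a given score.

    Assumption: the result is clamped to [0, 1]; the notebook code does not clamp.
    """
    return max(0.0, min(1.0, score / max_score))


if __name__ == "__main__":
    for sample in (4.25, 3.0, 2.5, 1.2, float('nan')):
        print(sample, classify_score(sample))
```

Keeping the thresholds in one place would also let the bar charts and the gauge share the same colour coding if that is ever needed.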

## Step 10: Gradio Interface – Legal Document Analyzer

### Purpose

This section defines the **Gradio user interface (UI)**, which lets users upload a legal document, extract key contractual terms, score the quality of each extraction with the Azure AI Evaluation metrics (groundedness, coherence, relevance, and fluency), and view the results in a dashboard.

---

### Function: `gradio_process_wrapper(file)`

A wrapper function that connects the Gradio frontend input to the backend processing pipeline.

#### Workflow:

1. Receives a document upload (`file`), then calls `process_document(file)` to:
   - Extract document text and key contractual terms.
   - Run quality evaluation metrics on the extracted data.
2. If evaluation results are available, invokes `create_dashboard()` to generate a comprehensive visualization of performance.
3. Returns the following outputs to the UI:
   - Status message indicating success or error.
   - Table of extracted key terms with associated values and locations.
   - Table of LLM evaluation scores per term.
   - Matplotlib figure containing the evaluation dashboard.

In [None]:
# Gradio Interface

import numpy as np

def gradio_process_wrapper(file):
    """Wrapper function for Gradio that processes document and returns dashboard"""

    # Process the document
    status, extracted_df, evaluation_df = process_document(file)

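    # Create the dashboard figure only if evaluation data exists.
    # (Named dashboard_fig to avoid confusion with the gr.Plot component
    # called dashboard_plot that is defined in the UI below.)
    dashboard_fig = None
    if evaluation_df is not None and not evaluation_df.empty:
        dashboard_fig = create_dashboard(evaluation_df)

    return status, extracted_df, evaluation_df, dashboard_fig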

# Create Gradio interface
with gr.Blocks(title="Legal Document Analyzer", theme=gr.themes.Soft()) as demo:
    gr.Markdown(
        """
        # 📋 Legal Document Analyzer

        Upload a legal document (PDF, DOCX, TXT, CSV) and extract key terms with AI evaluation.

        **Key Terms Extracted:**
        - Product Name
        - Limitation of Liability In Months
        - Governing Law
        """
    )

    with gr.Row():
        with gr.Column(scale=1):
            # Upload section
            gr.Markdown("## 📤 Upload Document")
            file_input = gr.File(
                label="Choose Document",
                file_types=[".pdf", ".docx", ".doc", ".txt", ".csv"],
                type="filepath"
            )

            process_btn = gr.Button(
                "🚀 Start Evaluating",
                variant="primary",
                size="lg"
            )

            # Status output
            status_output = gr.Textbox(
                label="Status",
                interactive=False,
                placeholder="Upload a document and click 'Start Evaluating'..."
            )

        with gr.Column(scale=2):
            # Results section
            gr.Markdown("## 📊 Evaluation Dashboard")
            dashboard_plot = gr.Plot(label="Performance Dashboard")

    with gr.Row():
        with gr.Column():
            gr.Markdown("## 📝 Extracted Information")
            extracted_table = gr.Dataframe(
                label="Key Terms Extracted",
                interactive=False,
                wrap=True
            )

        with gr.Column():
            gr.Markdown("## 🎯 Evaluation Scores")
            evaluation_table = gr.Dataframe(
                label="AI Evaluation Results",
                interactive=False,
                wrap=True
            )

    # Event handlers
    process_btn.click(
        fn=gradio_process_wrapper,
        inputs=[file_input],
        outputs=[status_output, extracted_table, evaluation_table, dashboard_plot],
        show_progress=True
    )

    # Metrics reference section
    gr.Markdown(
        """
        ## 📋 Evaluation Metrics Explained

        - **Groundedness**: How well the response is supported by the context
        - **Coherence**: How logical and well-structured the response is
        - **Relevance**: How relevant the response is to the query
        - **Fluency**: How natural and well-written the response is

        *Scores range from 1-5, with 5 being the best.*
        """
    )

# Launch the interface
if __name__ == "__main__":
    demo.launch(debug=True, share=True)