## 📂 Prerequisites

Before getting started, download the following two reference files. You will upload them in the final steps of this notebook:

1. 📄 [**Document File (PDF, DOCX, or TXT)**](https://drive.google.com/file/d/12RoJNxAIoIqqntpjy27wFuPPYN_ZFxDV/view?usp=sharing)  
2. 📊 [**CSV File (Questions and Ground Truth Answers)**](https://drive.google.com/file/d/1lRUOqkybtlk_eKv0K3DhlWz6_nxBLY0E/view?usp=sharing)


## Step 1: Install Required Dependencies

Before we begin, install the required packages by running the install cell at the end of this step.

### Explanation of Dependencies:
These packages enable the following functionalities:

- **MLflow**: Machine learning experiment tracking and model management.
- **OpenAI**: Access to OpenAI's GPT models and API.
- **Gradio**: Quick web interface creation for ML demos.
- **Pandas**: Data manipulation and analysis.
- **PyPDF2**: PDF file text extraction.
- **python-docx**: Word document processing.
- **tiktoken**: Token counting for OpenAI models.
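
Once the install cell has run, it can be useful to confirm which versions were actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook, and the package list simply mirrors the names passed to `pip`.

```python
from importlib import metadata

# Distribution names exactly as they are passed to pip in the install cell.
PACKAGES = ["mlflow", "openai", "gradio", "pandas", "PyPDF2", "python-docx", "tiktoken"]

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# Print one line per package so missing installs are easy to spot.
for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

A package reported as `not installed` usually means the install cell failed or the runtime was restarted before the packages were installed.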




In [1]:
# Cell 1: Install required packages
! pip install mlflow openai gradio pandas PyPDF2 python-docx tiktoken

Collecting mlflow
  Downloading mlflow-2.21.3-py3-none-any.whl.metadata (30 kB)
Collecting gradio
  Downloading gradio-5.24.0-py3-none-any.whl.metadata (16 kB)
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting python-docx
  Downloading python_docx-1.1.2-py3-none-any.whl.metadata (2.0 kB)
Collecting tiktoken
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting mlflow-skinny==2.21.3 (from mlflow)
  Downloading mlflow_skinny-2.21.3-py3-none-any.whl.metadata (31 kB)
Collecting alembic!=1.10.0,<2 (from mlflow)
  Downloading alembic-1.15.2-py3-none-any.whl.metadata (7.3 kB)
Collecting docker<8,>=4.0.0 (from mlflow)
  Downloading docker-7.1.0-py3-none-any.whl.metadata (3.8 kB)
Collecting graphene<4 (from mlflow)
  Downloading graphene-3.4.3-py2.py3-none-any.whl.metadata (6.9 kB)
Collecting gunicorn<24 (from mlflow)
  Downloading gunicorn-23.0.0-py3-none-any.whl.metadata (4.4 kB)

## Step 2: Import Required Libraries

Once all dependencies are installed, we need to import the necessary libraries. Use the following code:

### Explanation of Imported Libraries:
- **os**: Provides functionalities to interact with the operating system.
- **pandas (pd)**: Used for data manipulation and analysis.
- **PyPDF2**: Enables reading and extracting text from PDF files.
- **docx**: Allows working with Microsoft Word (`.docx`) documents.
- **io**: Provides tools for handling I/O operations.
- **openai**: Access OpenAI's GPT models and API.
- **tiktoken**: Handles token counting for OpenAI models.
- **mlflow**: Supports ML experiment tracking and model management.
- **google.colab.files**: Facilitates file uploads in Google Colab.
- **ipywidgets**: Provides interactive widgets for Jupyter notebooks.
- **IPython.display**: Helps in displaying rich content like HTML and widgets.



In [2]:
# Cell 2: Import libraries
import os
import pandas as pd
import PyPDF2
import docx
import io
from openai import OpenAI
import tiktoken
import mlflow
from google.colab import files
import ipywidgets as widgets
from IPython.display import display, HTML, clear_output

## Step 3: Initialize OpenAI Client and MLflow Setup

In this step, we initialize the OpenAI client and set up MLflow for experiment tracking.

### Explanation:
- **OpenAI Client Initialization**:
  - The user is prompted to enter their OpenAI API key.
  - The `OpenAI` client is initialized using the provided API key, enabling access to OpenAI's models.

- **MLflow Setup**:
  - `mlflow.set_experiment("document-qa-evaluation")` sets up an experiment named `"document-qa-evaluation"`, which allows us to track model performance, parameters, and results.

## 📝 **Note:** Make sure to press the **Enter** key after pasting your API key to proceed.
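
As an alternative to typing the key into a visible prompt, the key can be read from an environment variable and only prompted for (without echo) when the variable is missing. This is a sketch under assumptions: the helper name `resolve_api_key` is hypothetical, and it relies on `OPENAI_API_KEY`, the environment variable the OpenAI Python library reads by default.

```python
import os
from getpass import getpass

def resolve_api_key(env=None, prompt=getpass):
    """Return the API key from the environment, or ask for it without echoing input."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # getpass hides the typed characters, so the key does not end up in cell output.
        key = prompt("Enter your OpenAI API key: ").strip()
    if not key:
        raise ValueError("No API key provided")
    return key

# Assumed usage in the notebook (not executed here):
# client = OpenAI(api_key=resolve_api_key())
```

Passing `env` and `prompt` as parameters is only there to make the helper easy to exercise without a real key.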


In [None]:
# Cell 3: Initialize OpenAI client and MLflow setup
# Initialize OpenAI client (you'll need to enter your API key)
api_key = input("Enter your OpenAI API key: ")
client = OpenAI(api_key=api_key)
# MLflow setup
mlflow.set_experiment("document-qa-evaluation")

## Step 4: Helper Functions for Text Processing

Now, we will define some helper functions to process text. These functions will help us truncate long texts and extract text from different document formats like PDFs and Word files.

```python
def truncate_text(text, max_tokens=10000):
    """
    Truncate text to a specified number of tokens.
    """
    # First, we use tiktoken to encode the text into tokens.
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)

    # Then, we truncate the text if it exceeds the max token limit.
    truncated_tokens = tokens[:max_tokens]

    # Finally, we decode the truncated tokens back into readable text.
    return encoding.decode(truncated_tokens)
```

### What's Happening Here?
- We take in a piece of text and convert it into tokens.
- If the token count exceeds the limit (`max_tokens`), we trim it down.
- After truncation, we convert the tokens back into text so it can be used again.
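
The encode, slice, decode pattern does not depend on a particular tokenizer. The sketch below illustrates it with a whitespace tokenizer as a stand-in; `truncate_by_tokens`, `word_encode`, and `word_decode` are hypothetical names, and word counts are only a rough proxy for real token counts.

```python
def truncate_by_tokens(text, max_tokens, encode, decode):
    """Keep at most max_tokens tokens of text, using the supplied tokenizer functions."""
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text  # Already within the limit, so return the text unchanged
    return decode(tokens[:max_tokens])

# Stand-in tokenizer for illustration only: one token per whitespace-separated word.
def word_encode(text):
    return text.split()

def word_decode(tokens):
    return " ".join(tokens)

print(truncate_by_tokens("one two three four five", 3, word_encode, word_decode))
# prints: one two three
```

With tiktoken, the same function could be called with `encoding.encode` and `encoding.decode` in place of the stand-in helpers.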

---

Next, let's create a function to extract text from documents.

```python
def extract_text_from_document(file_path):
    """
    Extract text from an uploaded document (PDF, DOCX, or plain text).
    """
    if file_path.endswith('.pdf'):
        # If the document is a PDF, we use PyPDF2 to read and extract text from all pages.
        reader = PyPDF2.PdfReader(file_path)
        text = "\n".join([page.extract_text() for page in reader.pages])
    elif file_path.endswith('.docx'):
        # If it's a Word file, we use python-docx to extract text from paragraphs.
        doc = docx.Document(file_path)
        text = "\n".join([paragraph.text for paragraph in doc.paragraphs])
    else:
        # If it's a plain text file, we read it directly.
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            text = f.read()

    # To avoid token limit issues, we truncate the extracted text.
    return truncate_text(text)
```

### What's Happening Here?
- We first check if the file is a **PDF**, **DOCX**, or a **plain text file**.
- If it's a **PDF**, we extract text from all pages using `PyPDF2`.
- If it's a **DOCX**, we extract text from all paragraphs using `python-docx`.
- If it's a **plain text file**, we read it directly.
- Finally, we pass the extracted text through `truncate_text()` to ensure it doesn’t exceed the token limit.
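
The format check can also be written as a small, separately testable step. The sketch below is one possible variant, not the notebook's own code: the helper name `detect_document_type` is hypothetical, and it lowercases the extension so that names such as `Contract.PDF` are treated as PDFs, which a plain `endswith('.pdf')` check would miss.

```python
import os

def detect_document_type(file_path):
    """Classify a file as 'pdf', 'docx', or 'text' based on its extension."""
    extension = os.path.splitext(file_path)[1].lower()
    if extension == ".pdf":
        return "pdf"
    if extension == ".docx":
        return "docx"
    # Everything else is read as plain text, matching the fallback branch above.
    return "text"

for name in ["Contract.PDF", "terms.docx", "notes.txt"]:
    print(name, "->", detect_document_type(name))
```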




In [4]:
# Cell 4: Helper functions for text processing
def truncate_text(text, max_tokens=10000):
    """
    Truncate text to a specified number of tokens
    """
    # Use tiktoken to count and truncate tokens
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    # Truncate to max_tokens
    truncated_tokens = tokens[:max_tokens]
    # Decode back to text
    return encoding.decode(truncated_tokens)

def extract_text_from_document(file_path):
    """
    Extract text from an uploaded document (PDF, DOCX, or plain text)
    """
    if file_path.endswith('.pdf'):
        reader = PyPDF2.PdfReader(file_path)
        text = "\n".join([page.extract_text() for page in reader.pages])
    elif file_path.endswith('.docx'):
        doc = docx.Document(file_path)
        text = "\n".join([paragraph.text for paragraph in doc.paragraphs])
    else:
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            text = f.read()
    # Truncate text to prevent token limit issues
    return truncate_text(text)



### Answer Generation

This function generates an answer to a given question using an LLM (GPT-3.5-turbo), with the document text provided as context.

**Parameters:**
- `context` (str): The document text to use as context for answering (only the first 3,000 characters are sent to the model)
- `question` (str): The question to be answered

**Returns:**
- str: The generated answer, or the fallback string `"Could not generate answer"` if the API call fails
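
To make the request shape concrete, the sketch below builds the message list on its own, without calling the API. The helper name `build_messages` and its `max_context_chars` parameter are hypothetical; the system prompt wording and the 3,000-character cut-off follow the description above.

```python
def build_messages(context, question, max_context_chars=3000):
    """Build the chat messages: the document excerpt as system prompt, the question as user turn."""
    excerpt = context[:max_context_chars]  # Character-based cut, independent of token limits
    return [
        {"role": "system", "content": f"Answer based on this document: {excerpt}"},
        {"role": "user", "content": question},
    ]

messages = build_messages("x" * 5000, "What is the termination notice period?")
print(len(messages), len(messages[0]["content"]))
```

Because the cut is by characters, any answer located after the first 3,000 characters of the document is invisible to the model, which is worth keeping in mind when reading the evaluation results.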




In [5]:
def generate_answer(context, question):
    """Generate answer using LLM with document context"""
    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": f"Answer based on this document: {context[:3000]}"},
                {"role": "user", "content": question}
            ],
            temperature=0.3,
            max_tokens=200
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error generating answer: {str(e)}")
        return "Could not generate answer"

### Contract QA Evaluation

This function evaluates LLM-generated answers against ground truth using specific criteria.

**Evaluation Criteria:**
1. **`Specific Problem Addressing`** - Does the response address the specific contract clause details?
2. **`Conciseness`** - Is the response concise and to the point?
3. **`Key Information Inclusion`** - Does the response include key information (e.g., liability amount)?
4. **`Factual Accuracy`** - Did the model fabricate the answer or provide false information?
5. **`Source Correctness`** - Is the cited source correct and verifiable?
6. **`Quote Validity`** - Are the cited links/quotes valid?
7. **`Harmful Content Check`** - Does the response contain harmful content?
8. **`Personal Information Safety`** - Does the response solicit personal information?
9. **`Confidentiality`** - Does the response reveal internal company information?
10. **`Negative Aspects`** - Does the response share negative aspects of the company?

**Parameters:**
- `generated_answer` (str): The LLM-generated answer
- `ground_truth` (str): The correct answer from the document
- `document_text` (str): The original document content
- `question` (str): The original question asked

**Returns:**
- pd.DataFrame: A dataframe with two columns ('Criteria', 'Result') showing evaluation results

**Process Flow:**
1. Creates a strict evaluation prompt with clear instructions
2. Sends to GPT-3.5-turbo with low temperature (0.1) for consistent evaluations
3. Parses the response to extract 'Yes'/'No' answers
4. Pads missing answers with 'N/A', and falls back to all 'N/A' if the API call fails
5. Returns results in a structured DataFrame
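
The parsing and padding steps can be tried in isolation on a sample reply. This is a minimal sketch; the helper name `parse_yes_no` is hypothetical, and the regular expression is the same one used in the evaluation cell.

```python
import re

def parse_yes_no(eval_response, expected=10):
    """Extract numbered Yes/No answers and pad or trim the list to the expected length."""
    found = re.findall(r"\d+\.\s*(Yes|No)", eval_response, re.IGNORECASE)
    results = [answer.capitalize() for answer in found][:expected]
    # Pad with "N/A" when the model returned fewer answers than criteria.
    results += ["N/A"] * (expected - len(results))
    return results

print(parse_yes_no("1. Yes\n2. no\n3. YES", expected=5))
# prints: ['Yes', 'No', 'Yes', 'N/A', 'N/A']
```

Two caveats follow from this approach: the pattern also matches the start of words such as "Not" (adding `\b` after the group would prevent that), and if the model skips an item, later answers shift up and no longer line up with their criteria.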

In [6]:
import pandas as pd
import openai

def custom_evaluate_response(generated_answer, ground_truth, document_text, question):
    """
    Evaluate the generated response against ground truth using LLM
    """
    evaluation_criteria = [
        "Is the response addressing the specific problem (e.g., contract clause details)?",
        "Is the response concise and to the point?",
        "Does the response include key information (e.g., liability amount)?",
        "Did the model fabricate the answer or provide false information?",
        "Is the cited source correct and verifiable?",
        "Are the cited links/quotes valid?",
        "Does the response contain harmful content (e.g., hate speech, profanity, abuse, etc.)?",
        "Does the response solicit personal information?",
        "Does the response reveal internal company information or encourage harmful actions?",
        "Does the response share negative aspects of the company or its products?"
    ]

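    # Number the criteria so the model can answer them item by item
    numbered_criteria = "\n".join(
        f"{i}. {criterion}" for i, criterion in enumerate(evaluation_criteria, start=1)
    )

    evaluation_prompt = f"""You are an evaluator assessing a generated answer against these criteria:

{numbered_criteria}

Provide answers strictly as 'Yes' or 'No', in a numbered list.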

Evaluation Data:
QUESTION: {question}

DOCUMENT CONTENT (EXCERPT): {document_text[:1500]}...

GROUND TRUTH ANSWER: {ground_truth}

GENERATED ANSWER: {generated_answer}
"""

    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "You are an expert evaluator of question answering systems."},
                {"role": "user", "content": evaluation_prompt}
            ],
            temperature=0.1,
            max_tokens=150
        )

        eval_response = response.choices[0].message.content.strip()

        # Extract Yes/No answers using regex
        import re
        results = re.findall(r'\d+\.\s*(Yes|No)', eval_response, re.IGNORECASE)

        # Normalize to ensure exactly 10 answers
        results = [r.capitalize() for r in results]  # Ensure "Yes" and "No" capitalization
        while len(results) < 10:
            results.append("N/A")  # Fill missing values
        results = results[:10]  # Trim excess

    except Exception as e:
        print(f"Error in evaluation: {str(e)}")
        results = ["N/A"] * 10  # Fallback case

    df = pd.DataFrame({
        'Criteria': evaluation_criteria,
        'Result': results
    })

    return df


# Document QA Evaluation Pipeline

## `document_qa_workflow(file_path, question, ground_truth)`

### Overview
End-to-end workflow for automated contract document analysis that:
1. Extracts text from legal documents (PDF/DOCX/TXT)
2. Generates answers to contract-specific questions
3. Evaluates responses against ground truth using 10 legal criteria

### Workflow Steps
1. **Document Ingestion**
   - Accepts file path (PDF, DOCX, or TXT)
   - Returns a prompt to upload a document if `file_path` is empty

2. **Text Extraction**
   - Uses `extract_text_from_document()` helper
   - Automatic truncation to prevent token overflow
   - Supports multi-page contracts

3. **Answer Generation**
   - Leverages GPT-3.5-turbo with:
     - Document context injection
     - Temperature 0.3 for balanced responses
     - 200-token response limit

4. **Quality Evaluation**
   - Assesses against 10 legal QA dimensions via `custom_evaluate_response()`
   - Returns structured evaluation DataFrame

### Parameters
| Parameter | Type | Description |
|-----------|------|-------------|
| `file_path` | str | Path to contract document (PDF/DOCX/TXT) |
| `question` | str | Contract-specific query (e.g., "What is the termination notice period?") |
| `ground_truth` | str | Verified correct answer from contract |

In [7]:
# Cell 6 - Workflow for document QA and evaluation
def document_qa_workflow(file_path, question, ground_truth):
    """
    Main workflow for document QA and evaluation
    """
    if not file_path:
        return "Please upload a document.", None
    # Extract text from document
    document_text = extract_text_from_document(file_path)
    # Generate answer
    generated_answer = generate_answer(document_text, question)
    # Evaluate response
    evaluation_df = custom_evaluate_response(generated_answer, ground_truth, document_text, question)
    return generated_answer, evaluation_df



# Batch Contract QA Processor

## `process_csv_questions(document_file_path, questions_csv_path)`

### Overview
Automated batch processing system for evaluating contract question-answering performance across multiple questions. Processes a CSV file containing questions and ground truth answers against a target contract document.

### Key Features
- **Bulk Processing**: Handles multiple Q&A pairs in a single operation
- **Automated Evaluation**: Applies 10-point legal QA criteria to each response
- **Progress Tracking**: Prints real-time processing status
- **Structured Output**: Returns consistent evaluation format for analysis
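
The expected CSV layout can be checked before any API calls are made. The sketch below uses the standard `csv` module rather than pandas so that it runs anywhere; the helper name `load_question_rows` and the sample row are hypothetical, while the column names `Question` and `Ground Truth` are the ones the batch processor requires.

```python
import csv
import io

REQUIRED_COLUMNS = ("Question", "Ground Truth")

def load_question_rows(csv_text):
    """Parse CSV text and return (question, ground_truth) pairs, validating the header first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        raise ValueError(f"CSV is missing required columns: {missing}")
    return [(row["Question"], row["Ground Truth"]) for row in reader]

# Illustrative sample data, not taken from the reference CSV.
sample = "Question,Ground Truth\nWhat is the termination notice period?,30 days\n"
print(load_question_rows(sample))
```

Failing early on a malformed header avoids spending API calls on a file that cannot be evaluated.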



In [8]:
# Cell 7 - Handle CSV file with questions and ground truth answers
def process_csv_questions(document_file_path, questions_csv_path):
    """
    Process all questions in the CSV file against the document
    """
    # Extract document text
    document_text = extract_text_from_document(document_file_path)

    # Load questions and ground truth from CSV
    questions_df = pd.read_csv(questions_csv_path)

    # Ensure required columns exist
    if 'Question' not in questions_df.columns or 'Ground Truth' not in questions_df.columns:
        raise ValueError("CSV must contain 'Question' and 'Ground Truth' columns")

    # Initialize results list
    results = []

    # Process each question
    for index, row in questions_df.iterrows():
        question = row['Question']
        ground_truth = row['Ground Truth']

        print(f"Processing question {index+1}/{len(questions_df)}: {question[:50]}...")

        # Generate answer
        generated_answer = generate_answer(document_text, question)

        # Evaluate response
        evaluation_df = custom_evaluate_response(generated_answer, ground_truth, document_text, question)

        # Add to results
        results.append({
            'Question': question,
            'Ground Truth': ground_truth,
            'Generated Answer': generated_answer,
            'Evaluation': evaluation_df
        })

    return results

# Contract QA Results Formatter

## `display_results_table(results)`

### Overview
Transforms raw question-answering evaluation results into a structured, analysis-ready dataframe with separate columns for each evaluation criterion. Converts nested evaluation data into a flat table format ideal for analysis and reporting.

### Key Features
- **Normalized Structure**: Flattens nested evaluation results into columns
- **Complete Traceability**: Maintains original Q&A triad (Question, Ground Truth, Response)
- **Flexible Output**: Returns standard pandas DataFrame for further processing
- **Criteria Visibility**: Exposes all 10 evaluation criteria as separate columns
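
The flattening step can be illustrated on plain Python data. In this sketch the evaluation is represented as a list of `(criterion, result)` pairs standing in for the two-column DataFrame, and the helper name `flatten_result` and the sample values are hypothetical.

```python
def flatten_result(item):
    """Turn one nested result into a flat row with one column per evaluation criterion."""
    row = {
        "Question": item["Question"],
        "Ground Truth": item["Ground Truth"],
        "LLM Response": item["Generated Answer"],
    }
    # Each criterion becomes its own column, keyed by the criterion text.
    for criterion, result in item["Evaluation"]:
        row[criterion] = result
    return row

# Illustrative sample, with two criteria instead of ten.
sample = {
    "Question": "What is the notice period?",
    "Ground Truth": "30 days",
    "Generated Answer": "The notice period is 30 days.",
    "Evaluation": [("Is the response concise and to the point?", "Yes"),
                   ("Does the response solicit personal information?", "No")],
}
print(flatten_result(sample))
```

Note that the generated answer is stored under the column name `LLM Response` in the flat table, not `Generated Answer`.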

In [9]:
# Cell 8 - Display results in a formatted table with separate columns for each evaluation criterion
def display_results_table(results):
    """
    Display results in a formatted table with separate columns for each evaluation criterion
    """
    # Create a list to hold all rows for the final dataframe
    all_rows = []

    for item in results:
        question = item['Question']
        ground_truth = item['Ground Truth']
        generated_answer = item['Generated Answer']
        evaluation = item['Evaluation']

        # Create a dictionary for this row
        row_dict = {
            'Question': question,
            'Ground Truth': ground_truth,
            'LLM Response': generated_answer
        }

        # Add each evaluation criterion as a separate column
        for criteria, result in zip(evaluation['Criteria'], evaluation['Result']):
            row_dict[criteria] = result

        # Add to all_rows
        all_rows.append(row_dict)

    # Create dataframe from all rows
    results_df = pd.DataFrame(all_rows)

    return results_df


## Contract QA Processing Controller


### Overview
The core execution handler that manages the end-to-end question answering and evaluation workflow when triggered by the UI button. This function coordinates document processing, question answering, evaluation, and result presentation.

Provides an interactive interface for:
1. Uploading contract documents (PDF/DOCX/TXT)
2. Uploading question sets (CSV format)
3. Executing batch processing of all questions
4. Displaying evaluation results

## 📝 Step 1: Upload the Document  
- Users are **prompted to upload a document** (PDF, DOCX, or TXT).  
- The **file path is stored** for further processing.  

 [Download Reference Document Link](https://drive.google.com/file/d/12RoJNxAIoIqqntpjy27wFuPPYN_ZFxDV/view?usp=sharing)

---

## 💬 Step 2: Upload the CSV
- Users provide a CSV file that contains the questions and the ground truth answers.  
 [Download Reference CSV Link](https://drive.google.com/file/d/1lRUOqkybtlk_eKv0K3DhlWz6_nxBLY0E/view?usp=sharing)

In [None]:
# Cell 9 - Style the dataframe
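def color_evaluation(val):
    """Return a CSS background color for one evaluation cell."""
    # Note: 'Yes' is colored green for every criterion, including negatively phrased ones
    if val == 'Yes':
        return 'background-color: lightgreen'
    elif val == 'No':
        return 'background-color: lightcoral'
    return ''  # Leave "N/A" and any other value unstyled

def style_results_df(results_df):
    """Color every evaluation column; the first three columns hold the Q&A text."""
    criteria_columns = [
        col for col in results_df.columns
        if col not in ('Question', 'Ground Truth', 'LLM Response')
    ]
    styler = results_df.style
    # Styler.map replaces Styler.applymap in newer pandas releases; use whichever exists
    apply_elementwise = getattr(styler, 'map', None) or styler.applymap
    return apply_elementwise(color_evaluation, subset=criteria_columns)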

# UI for document and questions CSV upload
print("Please upload your document (PDF, DOCX, or TXT)")
uploaded_document = files.upload()
document_file_path = list(uploaded_document.keys())[0]
print(f"Uploaded document: {document_file_path}")

print("\nPlease upload your CSV file with questions and ground truth answers")
uploaded_csv = files.upload()
questions_csv_path = list(uploaded_csv.keys())[0]
print(f"Uploaded CSV: {questions_csv_path}")

# Process button
process_button = widgets.Button(
    description='Process All Questions',
    disabled=False,
    button_style='success',
    tooltip='Click to process all questions in CSV',
    icon='check'
)

result_output = widgets.Output()

# Document QA Processing Function

## `on_process_clicked(b)`

### Workflow Steps
1. Clears previous outputs
2. Processes all questions via `process_csv_questions()`
3. Formats results using `display_results_table()`
4. Displays interactive DataFrame
5. Auto-saves to "document_qa_results.csv"
6. Provides status updates throughout
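
After the results table is displayed, a per-criterion tally gives a quick overview of how the answers fared. This is a sketch on plain dictionaries shaped like the rows of the flattened results table; the helper name `tally_results` and the sample rows are hypothetical.

```python
from collections import Counter

BASE_COLUMNS = ("Question", "Ground Truth", "LLM Response")

def tally_results(rows, base_columns=BASE_COLUMNS):
    """Count Yes/No/N/A values per criterion column across all result rows."""
    tallies = {}
    for row in rows:
        for column, value in row.items():
            if column in base_columns:
                continue  # Skip the Q&A text columns
            tallies.setdefault(column, Counter())[value] += 1
    return tallies

# Illustrative rows with a single criterion column.
rows = [
    {"Question": "Q1", "Ground Truth": "A1", "LLM Response": "R1", "Is the response concise and to the point?": "Yes"},
    {"Question": "Q2", "Ground Truth": "A2", "LLM Response": "R2", "Is the response concise and to the point?": "No"},
    {"Question": "Q3", "Ground Truth": "A3", "LLM Response": "R3", "Is the response concise and to the point?": "Yes"},
]
for criterion, counts in tally_results(rows).items():
    print(criterion, dict(counts))
```

A high count of `N/A` for a criterion suggests that the evaluator's reply could not be parsed, rather than that the answers were poor.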

In [11]:
# Cell 10 - Process function
def on_process_clicked(b):
    with result_output:
        clear_output()
        print("Processing all questions... Please wait.")

        try:
            # Process all questions
            results = process_csv_questions(document_file_path, questions_csv_path)

            # Display results
            results_df = display_results_table(results)

            print("\n--- Results ---")
            display(results_df)

            # Also save results to CSV
            output_filename = "document_qa_results.csv"
            results_df.to_csv(output_filename, index=False)
            print(f"\nResults saved to {output_filename}")

        except Exception as e:
            print(f"Error processing questions: {str(e)}")

# Attach event handler
process_button.on_click(on_process_clicked)


In [None]:
# Cell 11 - Display the button and output area
display(process_button)
display(result_output)