# AI Agent for Invoice Processing with RAG + Local LLM (Ollama)

This notebook implements a **Complete Invoice Processing Pipeline** using **RAG (Retrieval-Augmented Generation)** with local Ollama models.

## Overview

The notebook performs two main functions:
1. **Rule Extraction** - Extracts invoice processing rules from contract documents using RAG
2. **Invoice Processing** - Processes invoices against extracted rules with intelligent validation

**Version:** 2.0 - RAG Edition  
**Author:** r4 Technologies, Inc 2025

## Key Features:
- **RAG Architecture** - Retrieval-Augmented Generation for context-aware rule extraction
- **Local LLM** - Ollama gemma3:270m (no API keys needed)
- **Vector Store** - FAISS for fast semantic search
- **Document Processing** - Supports PDF, DOCX, and scanned images (OCR)
- **Invoice Validation** - Rule-based compliance checking with detailed reporting
- **Automatic Setup** - Auto-generates sample documents if needed
- **Cross-Platform** - Works on Windows, Mac, and Linux

## Notebook Structure:
- **Cells 1-4:** Documentation and setup requirements
- **Cells 5-6:** Package installation (document processing + RAG packages)
- **Cell 7:** Import all required packages
- **Cell 8:** Check and auto-generate sample documents if needed
- **Cell 9:** Test Ollama connection and initialize models
- **Cells 10-13:** Helper functions and RAG agent class definition
- **Cells 14-18:** Part 1 - Rule extraction from contracts
- **Cells 19-25:** Part 2 - Invoice processing and validation
- **Cells 26-33:** Complete pipeline test and reporting

## Quick Start Guide

### Execution Order:

1. **Run Cells 5-6:** Install all required packages
2. **Run Cell 7:** Import all libraries
3. **Run Cell 8:** Check for sample documents (auto-generates if missing)
4. **Run Cell 9:** Test Ollama connection (requires Ollama running)
5. **Run Cells 14-18:** Extract rules from contract documents
6. **Run Cells 19-25:** Process invoices using extracted rules
7. **Run Cell 29:** Complete pipeline test (extract rules + process invoices)

### Prerequisites:

- **Python 3.10+**
- **Ollama** installed and running (https://ollama.ai)
- **Tesseract OCR** binary (for scanned document processing)
- Required Ollama models:
  ```bash
  ollama pull gemma3:270m
  ollama pull nomic-embed-text
  ```

## Installation Requirements

### Python Dependencies
All dependencies are installed automatically by running the installation cells:

- **Cell 5:** Document processing packages
  - pdfplumber (PDF parsing)
  - python-docx (Word document parsing)
  - Pillow (image processing)
  - reportlab (PDF generation)
  - matplotlib (visualization)

- **Cell 6:** RAG and ML packages
  - langchain-core, langchain-community, langchain
  - langchain-ollama (Ollama integration)
  - faiss-cpu (vector store)
  - pytesseract (OCR wrapper)
  - numpy, pydantic, ipywidgets

### External Dependencies

**Tesseract OCR Binary** (required for scanned documents):
- **macOS:** `brew install tesseract`
- **Linux:** `sudo apt-get install tesseract-ocr`
- **Windows:** Download from https://github.com/UB-Mannheim/tesseract/wiki

**Ollama** (required for local LLM):
- Download and install from https://ollama.ai
- Pull required models (see Cell 9 for instructions)
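
Before running the notebook, it can help to confirm that both external binaries are discoverable. The sketch below is one way to do that with the standard library. The helper name `find_missing_binaries` is hypothetical (it is not part of the notebook), and the check assumes both tools use their default executable names (`tesseract`, `ollama`) and are on `PATH`.

```python
import shutil


def find_missing_binaries(names=("tesseract", "ollama")):
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = find_missing_binaries()
if missing:
    print(f"[WARN] Not found on PATH: {', '.join(missing)}")
else:
    print("[OK] tesseract and ollama found on PATH")
```

On Windows the Tesseract installer may not add itself to `PATH`; in that case pytesseract can be pointed at the executable through `pytesseract.pytesseract.tesseract_cmd`.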

## RAG Setup Requirements

### Required Packages
All RAG packages are installed automatically in **Cell 6**. The notebook uses:
- LangChain framework for RAG orchestration
- FAISS vector store for semantic search
- Ollama for local LLM processing (no API keys needed)
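
To make the semantic search step concrete, the sketch below ranks a few chunks against a query by Euclidean distance between vectors, which is conceptually what a flat (brute-force) vector index does. It is an illustration only: the three-dimensional vectors are made up and stand in for real `nomic-embed-text` embeddings, and `top_k` is a hypothetical helper, not a LangChain or FAISS function.

```python
import math


def top_k(query_vec, chunk_vecs, k=3):
    """Return indices of the k chunk vectors closest to the query (L2 distance)."""
    k = min(k, len(chunk_vecs))  # never ask for more chunks than exist
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: math.dist(query_vec, chunk_vecs[i]),
    )
    return ranked[:k]


# Toy data: made-up vectors, not real embeddings
chunks = [
    "Payment is due Net 30.",
    "Late fee is 1.5% per month.",
    "The vendor is based in Ohio.",
]
vectors = [(0.9, 0.1, 0.0), (0.7, 0.6, 0.1), (0.0, 0.1, 0.9)]
query = (1.0, 0.2, 0.0)  # pretend embedding of "What are the payment terms?"

hits = top_k(query, vectors, k=2)
for i in hits:
    print(chunks[i])
```

In the notebook, the retrieved chunks are joined and passed to the LLM as the `{context}` of the prompt.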

### Ollama Models
Make sure Ollama is running and you have the required models:

```bash
# Pull the LLM model (for rule extraction)
ollama pull gemma3:270m

# Pull the embedding model (for vector search)
ollama pull nomic-embed-text
```

**Note:** Cell 9 will test the Ollama connection and verify these models are available.
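
Cell 9 verifies the models by calling each one once. A lighter alternative is to ask the local Ollama server which models it has, sketched below. This assumes Ollama's default address `http://127.0.0.1:11434` and its `/api/tags` endpoint for listing local models; `fetch_tags` and `missing_models` are hypothetical helper names, not part of the notebook.

```python
import json
import urllib.request

REQUIRED_MODELS = ["gemma3:270m", "nomic-embed-text"]


def fetch_tags(base_url="http://127.0.0.1:11434", timeout=5):
    """Fetch the list of local models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def missing_models(tags_payload, required=REQUIRED_MODELS):
    """Return the required model names absent from an /api/tags payload."""
    installed = {m.get("name", "") for m in tags_payload.get("models", [])}
    missing = []
    for name in required:
        # A name without a tag is treated as matching its ":latest" tag
        candidates = {name} if ":" in name else {name, f"{name}:latest"}
        if not candidates & installed:
            missing.append(name)
    return missing


# Example with a hand-written payload (no server needed)
sample = {"models": [{"name": "gemma3:270m"}, {"name": "nomic-embed-text:latest"}]}
print(missing_models(sample))  # -> []
```

With Ollama running, `missing_models(fetch_tags())` returns the names that still need `ollama pull`.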

In [1]:
# Cell 5: Install document processing packages
import sys
import subprocess

# Platform-independent pip installation
result = subprocess.run(
    [
        sys.executable,
        "-m",
        "pip",
        "install",
        "-q",
        "--disable-pip-version-check",
        "pdfplumber",
        "python-docx",
        "Pillow",
        "reportlab",
        "matplotlib",
    ],
    capture_output=True,
    text=True,
)

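# Report success only if pip actually succeeded
if result.returncode == 0:
    print("[OK] Document processing packages installed!")
else:
    print(f"[ERROR] pip install failed:\n{result.stderr[-500:]}")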


[OK] Document processing packages installed!


In [2]:
# Cell 6: Install RAG packages (with pytesseract - stable and lightweight)
import sys
import subprocess
import warnings

warnings.filterwarnings("ignore")

def pip_install(*packages: str) -> subprocess.CompletedProcess:
    """Run a quiet `pip install` for the given requirement specifiers."""
    return subprocess.run(
        [
            sys.executable,
            "-m",
            "pip",
            "install",
            "-q",
            "--disable-pip-version-check",
            *packages,
        ],
        capture_output=True,
        text=True,
    )


# Install core packages with numpy constraint
core_result = pip_install(
    "numpy==1.26.4",
    "pdfplumber",
    "Pillow",
    "matplotlib",
    "python-docx",
    "reportlab",
    "langchain-core==0.3.6",
    "langchain-community==0.3.1",
    "langchain==0.3.1",
    "langchain-ollama==0.2.0",
    "faiss-cpu",
    "ipywidgets",
    "pydantic==2.9.2",
)

# Install pytesseract (lightweight, uses external Tesseract binary)
ocr_result = pip_install("pytesseract")

# Report success only if both pip runs exited cleanly
if core_result.returncode == 0 and ocr_result.returncode == 0:
    print("[OK] All packages installed with numpy 1.26.4")
    print("[OK] pytesseract installed (lightweight OCR)")
    print("[OK] No dependency conflicts!")
else:
    for label, res in (("core packages", core_result), ("pytesseract", ocr_result)):
        if res.returncode != 0:
            print(f"[ERROR] pip install failed for {label}:\n{res.stderr[-500:]}")
print("\n[INFO] OCR Note: pytesseract requires Tesseract binary to be installed:")
print("  - macOS: brew install tesseract")
print("  - Linux: sudo apt-get install tesseract-ocr")
print("  - Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki")
print("\nMake sure Ollama is running with models:")
print("  ollama pull gemma3:270m")
print("  ollama pull nomic-embed-text")


[OK] All packages installed with numpy 1.26.4
[OK] pytesseract installed (lightweight OCR)
[OK] No dependency conflicts!

[INFO] OCR Note: pytesseract requires Tesseract binary to be installed:
  - macOS: brew install tesseract
  - Linux: sudo apt-get install tesseract-ocr
  - Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki

Make sure Ollama is running with models:
  ollama pull gemma3:270m
  ollama pull nomic-embed-text


In [3]:
# Import all required packages
from pathlib import Path
import json
import logging
import re
import os
import sys
import warnings
from datetime import datetime, timedelta
from typing import List, Dict, Any, Optional
from collections import Counter
from contextlib import redirect_stderr
import multiprocessing

# Document processing
import pdfplumber  # For PDF parsing
from docx import Document  # For Word (.docx) parsing
from PIL import Image, ImageEnhance  # For image processing

# RAG and ML
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document as LangchainDocument

warnings.filterwarnings("ignore")

print("[OK] All packages imported successfully")

# Cell 7: Configure environment variables + Platform-specific settings
import platform  # os, sys, and warnings are already imported above

# Detect platform
PLATFORM = platform.system()  # 'Darwin' (Mac), 'Windows', 'Linux'
IS_MAC = PLATFORM == "Darwin"
IS_WINDOWS = PLATFORM == "Windows"
IS_APPLE_SILICON = IS_MAC and platform.machine() == "arm64"

print(f"Platform: {PLATFORM}")
if IS_APPLE_SILICON:
    print("[APPLE] Detected: Apple Silicon (ARM64)")
elif IS_MAC:
    print("[APPLE] Detected: macOS (Intel)")
elif IS_WINDOWS:
    print("[WIN] Detected: Windows")

# Environment variables (cross-platform)
os.environ["USER_AGENT"] = "InvoiceProcessingRAGAgent"

# Suppress warnings
warnings.filterwarnings("ignore", message=".*IProgress.*")
warnings.filterwarnings("ignore", category=DeprecationWarning)

print("[OK] Environment configured - Using pytesseract for image processing")


[OK] All packages imported successfully
Platform: Darwin
[APPLE] Detected: Apple Silicon (ARM64)
[OK] Environment configured - Using pytesseract for image processing


In [None]:
# Check if sample documents exist, generate if needed

from pathlib import Path
import subprocess
import sys

# Helper function to filter out temp/system files
def is_valid_file(file_path: Path) -> bool:
    """Check if a file is valid (not a temp/system file)."""
    name = file_path.name
    if name.startswith('.') or name.startswith('~$'):
        return False
    system_files = {'.DS_Store', 'Thumbs.db', 'desktop.ini', '.gitkeep', '.gitignore'}
    if name in system_files:
        return False
    temp_extensions = {'.tmp', '.bak', '.swp', '.~', '.old'}
    if file_path.suffix.lower() in temp_extensions:
        return False
    return True

# Define directories
data_dir = Path("docs")
contracts_dir = data_dir / "contracts"
invoices_dir = data_dir / "invoices"

# Create directories if they don't exist
contracts_dir.mkdir(parents=True, exist_ok=True)
invoices_dir.mkdir(parents=True, exist_ok=True)

# Check if directories contain any valid files (excluding temp/system files)
contracts_has_files = any(f.is_file() and is_valid_file(f) for f in contracts_dir.iterdir())
invoices_has_files = any(f.is_file() and is_valid_file(f) for f in invoices_dir.iterdir())

if not contracts_has_files or not invoices_has_files:
    print("=" * 70)
    print("Sample documents not found. Generating sample documents...")
    print("=" * 70)
    print("\nRunning Generate_Sample_Documents.ipynb...")
    
    # Execute the generation notebook using nbconvert
    try:
        result = subprocess.run(
            [sys.executable, "-m", "jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", "Generate_Sample_Documents.ipynb"],
            capture_output=True,
            text=True,
            cwd=Path.cwd()
        )
        
        if result.returncode == 0:
            print("\n[OK] Sample documents generated successfully!")
        else:
            print("\n[ERROR] Failed to generate documents.")
            print(f"Error: {result.stderr[:500] if result.stderr else 'Unknown error'}")
            print("\nPlease run Generate_Sample_Documents.ipynb manually.")
    except Exception as e:
        print(f"\n[WARN] Could not auto-generate documents: {e}")
        print("\nPlease run Generate_Sample_Documents.ipynb manually.")
else:
    # Count only valid files (excluding temp/system files)
    valid_contracts = [f for f in contracts_dir.glob('*.*') if f.is_file() and is_valid_file(f)]
    valid_invoices = [f for f in invoices_dir.glob('*.*') if f.is_file() and is_valid_file(f)]
    print("Sample documents already exist. Skipping generation.")
    print(f"  Contracts: {len(valid_contracts)} files")
    print(f"  Invoices: {len(valid_invoices)} files")


Sample documents already exist. Skipping generation.
  Contracts: 5 files
  Invoices: 10 files


In [None]:
# Cell 8: Import necessary libraries (Standard + RAG)

import json
import logging
import re
import io
from pathlib import Path
from typing import List, Dict, Any, Optional
from multiprocessing import Manager
from datetime import datetime, timedelta
from contextlib import redirect_stderr

import pdfplumber  # For PDF parsing
from docx import Document  # For Word (.docx) parsing
from PIL import ImageEnhance  # For contrast enhancement in scanned PDFs

# RAG imports
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document as LangchainDocument

# Set up logging (prevent duplicate handlers when re-running cells)
# Clear any existing handlers to prevent duplicates
for handler in logging.root.handlers[:]:
    logging.root.removeHandler(handler)

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s", force=True
)
logger = logging.getLogger(__name__)


# ============================================================================
# Helper Function: Filter out temp/system files
# ============================================================================

def is_valid_file(file_path: Path) -> bool:
    """
    Check if a file is valid (not a temp/system file).
    
    Filters out:
    - Hidden files (starting with .)
    - System files (.DS_Store, Thumbs.db, etc.)
    - Temp files (~$ files, .tmp, etc.)
    - Backup files
    
    Args:
        file_path: Path object to check
        
    Returns:
        True if file should be processed, False if it should be skipped
    """
    name = file_path.name
    
    # Skip hidden files (starting with .)
    if name.startswith('.'):
        return False
    
    # Skip temp files (starting with ~$)
    if name.startswith('~$'):
        return False
    
    # Skip system files
    system_files = {
        '.DS_Store',           # macOS
        'Thumbs.db',           # Windows
        'desktop.ini',         # Windows
        '.gitkeep',            # Git
        '.gitignore',          # Git
    }
    if name in system_files:
        return False
    
    # Skip temp/backup extensions
    temp_extensions = {'.tmp', '.bak', '.swp', '.~', '.old'}
    if file_path.suffix.lower() in temp_extensions:
        return False
    
    return True


def filter_valid_files(file_list: List[Path]) -> List[Path]:
    """
    Filter a list of files to exclude temp/system files.
    
    Args:
        file_list: List of Path objects
        
    Returns:
        Filtered list containing only valid files
    """
    return [f for f in file_list if is_valid_file(f)]


print("[OK] All libraries imported successfully (Standard + RAG components)")
print("[OK] File filtering helper functions defined")


[OK] All libraries imported successfully (Standard + RAG components)


In [6]:
# Cell 9: Test Ollama connection and initialize models (cross-platform)

try:
    # Test embeddings (suppress noise output)
    print("Testing Ollama embeddings...")
    with redirect_stderr(io.StringIO()):
        test_embedding = OllamaEmbeddings(model="nomic-embed-text")
        test_embedding.embed_query("test")
    print("[OK] Ollama embeddings working (nomic-embed-text)")

    # Initialize LLM with response length limit for faster generation
    print("Testing Ollama LLM...")
    with redirect_stderr(io.StringIO()):
        llm = ChatOllama(
            model="gemma3:270m",
            temperature=0,
            num_predict=100,  # Limit response length for speed
        )
        test_response = llm.invoke("Hello")
    print("[OK] Ollama LLM working (gemma3:270m)")

    # Initialize embeddings for later use
    embeddings = OllamaEmbeddings(model="nomic-embed-text")

    print("\n[OK] All Ollama models ready!")

except Exception as e:
    print(f"[ERROR] Ollama error: {e}")
    print("\nTroubleshooting:")
    print("  1. Make sure Ollama is running:")
    if IS_WINDOWS:
        print("     - Windows: Check system tray for Ollama icon")
        print("     - Or run: ollama serve")
    elif IS_MAC:
        print("     - Mac: Check menu bar for Ollama icon")
        print("     - Or run: ollama serve")
    else:
        print("     - Linux: run: ollama serve")

    print("\n  2. Pull required models:")
    print("     ollama pull gemma3:270m")
    print("     ollama pull nomic-embed-text")

    print("\n  3. Verify Ollama is accessible:")
    print("     ollama list")

    if IS_APPLE_SILICON:
        print("\n  4. Apple Silicon specific:")
        print("     - Make sure you have the ARM64 version of Ollama")
        print("     - Download from: https://ollama.ai/download")

    raise


Testing Ollama embeddings...


2025-11-06 19:01:42,077 - INFO - HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"


[OK] Ollama embeddings working (nomic-embed-text)
Testing Ollama LLM...


2025-11-06 19:01:48,692 - INFO - HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"


[OK] Ollama LLM working (gemma3:270m)

[OK] All Ollama models ready!


In [7]:
# Cell 10: Helper function to detect garbled text


def is_garbled_text(
    text: str, non_alpha_threshold: float = 0.4, min_word_length: int = 3
) -> bool:
    """
    Detect if text is likely garbled (low-confidence OCR output).

    Args:
        text (str): Extracted text to check.
        non_alpha_threshold (float): Max proportion of non-alphanumeric characters.
        min_word_length (int): Minimum average word length to consider valid.

    Returns:
        bool: True if text is likely garbled, False otherwise.
    """
    if not text.strip():
        return True

    # Check proportion of non-alphanumeric characters
    non_alpha_count = len(re.findall(r"[^a-zA-Z0-9\s]", text))
    if non_alpha_count / max(len(text), 1) > non_alpha_threshold:
        return True

    # Check average word length
    words = [w for w in text.split() if w.strip()]
    if not words:
        return True
    avg_word_length = sum(len(w) for w in words) / len(words)
    if avg_word_length < min_word_length:
        return True

    return False


print("[OK] Garbled text detection function defined")


[OK] Garbled text detection function defined
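
To show how the two heuristics behave, the sketch below runs a few hand-written strings through a condensed copy of `is_garbled_text` (restated so the block runs on its own; the logic and defaults follow the cell above). The sample strings are invented for illustration.

```python
import re


def is_garbled_text(text, non_alpha_threshold=0.4, min_word_length=3):
    """Condensed copy of the notebook helper, same checks and defaults."""
    if not text.strip():
        return True
    # Check 1: share of characters that are neither alphanumeric nor whitespace
    non_alpha = len(re.findall(r"[^a-zA-Z0-9\s]", text))
    if non_alpha / max(len(text), 1) > non_alpha_threshold:
        return True
    # Check 2: average word length
    words = text.split()
    avg_len = sum(len(w) for w in words) / len(words)
    return avg_len < min_word_length


samples = [
    "Payment is due within 30 days of the invoice date.",  # clean prose
    "#@ %% ^& ** (( ))",                                   # mostly symbols
    "a b c d e f",                                         # fragments of 1 char
    "",                                                    # nothing extracted
]
results = [is_garbled_text(s) for s in samples]
for s, r in zip(samples, results):
    print(f"{r!s:5}  {s!r}")
```

Only the first sample is accepted. The second fails the symbol-ratio check, the third fails the average-word-length check, and the empty string is rejected outright.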


In [8]:
# Cell 11: Helper function to validate invoice-related terms


def validate_invoice_terms(text: str, min_terms: int = 2) -> bool:
    """
    Validate if text contains enough invoice-related terms.

    Args:
        text (str): Extracted text to validate.
        min_terms (int): Minimum number of invoice-related terms required.

    Returns:
        bool: True if sufficient invoice-related terms are found, False otherwise.
    """
    invoice_keywords = [
        r"\bpayment\b",
        r"\binvoice\b",
        r"\bdue\b",
        r"\bnet\s*\d+\b",
        r"\bterms\b",
        r"\bapproval\b",
        r"\bpenalty\b",
        r"\bPO\s*number\b",
        r"\btax\b",
        r"\bbilling\b",
    ]
    found_terms = sum(
        1 for keyword in invoice_keywords if re.search(keyword, text, re.IGNORECASE)
    )
    return found_terms >= min_terms


print("[OK] Invoice terms validation function defined")


[OK] Invoice terms validation function defined
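
The `min_terms` threshold is easier to reason about with the raw count in view. The sketch below uses the same keyword patterns as `validate_invoice_terms` but returns the number of distinct patterns matched; `count_invoice_terms` is a hypothetical helper added for illustration, and the sample strings are invented.

```python
import re

# Same patterns as the notebook's validate_invoice_terms
INVOICE_KEYWORDS = [
    r"\bpayment\b", r"\binvoice\b", r"\bdue\b", r"\bnet\s*\d+\b", r"\bterms\b",
    r"\bapproval\b", r"\bpenalty\b", r"\bPO\s*number\b", r"\btax\b", r"\bbilling\b",
]


def count_invoice_terms(text):
    """Count how many keyword patterns match at least once (case-insensitive)."""
    return sum(
        1 for pattern in INVOICE_KEYWORDS if re.search(pattern, text, re.IGNORECASE)
    )


samples = {
    "contract": "Payment terms: Net 30. Invoice due on receipt.",
    "borderline": "Please send the invoice.",
    "unrelated": "The weather in Ohio was pleasant this week.",
}
counts = {label: count_invoice_terms(text) for label, text in samples.items()}
for label, n in counts.items():
    print(f"{label}: {n} pattern(s) matched, passes min_terms=2: {n >= 2}")
```

Each pattern counts once no matter how often it occurs, so a page that repeats "invoice" many times but contains no other keyword still fails the default threshold of 2.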


In [9]:
# Cell 12: InvoiceRuleExtractorAgent class definition (RAG-powered with FAISS vector store)


class InvoiceRuleExtractorAgent:
    """
    AI Agent for extracting invoice processing rules from contract documents using RAG.
    """

    def __init__(self, llm=None, embeddings=None):
        """
        Initialize the agent with RAG components.

        Args:
            llm: ChatOllama instance (defaults to gemma3:270m)
            embeddings: OllamaEmbeddings instance (defaults to nomic-embed-text)
        """
        logger.info("Initializing RAG-powered Invoice Rule Extractor Agent")

        # Use provided models or create defaults
        # Set num_predict to limit response length (faster generation)
        self.llm = (
            llm
            if llm
            else ChatOllama(
                model="gemma3:270m",
                temperature=0,
                num_predict=100,  # Limit to ~100 tokens for faster responses
            )
        )
        self.embeddings = (
            embeddings if embeddings else OllamaEmbeddings(model="nomic-embed-text")
        )

        # Expanded keyword patterns for better matching
        self.rule_keywords = [
            "payment",
            "terms",
            "due",
            "net",
            "days",
            "invoice",
            "approval",
            "submission",
            "requirement",
            "late",
            "fee",
            "penalty",
            "penalties",
            "PO",
            "purchase order",
            "tax",
            "dispute",
            "month",
            "overdue",
            "rejection",
        ]

        # RAG chain will be created after document parsing
        self.vectorstore = None
        self.retriever = None
        self.num_chunks = 0

    def parse_document(self, file_path: str) -> str:
        """
        Parse the contract document (PDF or Word), extract text, and create vector store for RAG.
        """
        file_path = Path(file_path)
        if not file_path.exists():
            raise FileNotFoundError(f"File not found: {file_path}")

        text = ""
        try:
            # Extract text from document
            if file_path.suffix.lower() == ".pdf":
                logger.info(f"Parsing PDF: {file_path}")
                with pdfplumber.open(file_path) as pdf:
                    for page in pdf.pages:
                        page_text = page.extract_text()
                        if page_text:
                            text += page_text + "\n"
                        else:
                            # Scanned page: fall back to OCR via pytesseract
                            import pytesseract

                            img = page.to_image().original
                            # Boost contrast and sharpness to help OCR
                            img = ImageEnhance.Contrast(img).enhance(2.0)
                            img = ImageEnhance.Sharpness(img).enhance(1.5)

                            try:
                                # Pass the PIL image directly: no temp file is
                                # needed, which also avoids deleting a file that
                                # is still open (a PermissionError on Windows)
                                extracted_text = pytesseract.image_to_string(
                                    img, config="--psm 6"
                                )
                                if extracted_text.strip():
                                    text += extracted_text + "\n"
                            except Exception as ocr_err:
                                logger.warning(f"OCR failed for page: {ocr_err}")

            elif file_path.suffix.lower() == ".docx":
                logger.info(f"Parsing Word doc: {file_path}")
                doc = Document(file_path)
                for para in doc.paragraphs:
                    if para.text.strip():
                        text += para.text + "\n"
            else:
                raise ValueError(
                    f"Unsupported file format: {file_path.suffix}. Use PDF or DOCX."
                )

            if not text.strip():
                raise ValueError(
                    "No text extracted from document. Check scan quality or OCR setup."
                )

            logger.info(f"Successfully parsed {len(text)} characters.")

            # Create document chunks for RAG
            logger.info("Creating vector store for RAG...")
            self._create_vectorstore(text)

            return text

        except Exception as e:
            logger.error(f"Error parsing document: {e}")
            raise

    def _create_vectorstore(self, text: str):
        """Create vector store from document text using FAISS."""
        from langchain_community.vectorstores import FAISS

        # Create a document object
        doc = LangchainDocument(page_content=text, metadata={"source": "contract"})

        # Split document into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=800,
            chunk_overlap=200,
            length_function=len,
        )
        splits = text_splitter.split_documents([doc])
        self.num_chunks = len(splits)
        logger.info(f"Created {self.num_chunks} document chunks")

        # Create FAISS vector store (fast and reliable)
        try:
            with redirect_stderr(io.StringIO()):
                self.vectorstore = FAISS.from_documents(
                    documents=splits, embedding=self.embeddings
                )
            logger.info("[OK] Vector store created with FAISS")

        except Exception as e:
            # Chain the original exception so the root cause stays in the traceback
            raise ValueError(f"Failed to create FAISS vector store: {e}") from e

        # Adaptive k: use min(3, num_chunks)
        k_value = min(3, self.num_chunks)
        self.retriever = self.vectorstore.as_retriever(search_kwargs={"k": k_value})
        logger.info(
            f"Vector store created successfully (retrieving top {k_value} chunks)"
        )

    def extract_rules(self, text: str) -> Dict[str, str]:
        """
        Use RAG to extract invoice-related rules from the document.
        """
        logger.info("Extracting rules using RAG...")

        if not self.retriever:
            raise ValueError(
                "Vector store not initialized. Call parse_document() first."
            )

        # Create RAG chain
        def format_docs(docs):
            return "\n\n".join(doc.page_content for doc in docs)

        prompt_template = ChatPromptTemplate.from_template(
            """Extract invoice processing rules from this contract.

Contract text:
{context}

Question: {question}

Answer concisely with key details only (1-2 sentences). If not found, say "Not specified"."""
        )

        rag_chain = (
            {"context": self.retriever | format_docs, "question": RunnablePassthrough()}
            | prompt_template
            | self.llm
            | StrOutputParser()
        )

        # Simplified questions for faster extraction
        questions = {
            "payment_terms": "What are the payment terms (Net days, PO requirements)?",
            "approval_process": "What is the invoice approval process?",
            "late_penalties": "What are the late payment penalties?",
            "submission_requirements": "What must be included on every invoice?",
        }

        raw_rules = {}
        for key, question in questions.items():
            try:
                with redirect_stderr(io.StringIO()):
                    answer = rag_chain.invoke(question)

                # Accept answer if it has substance
                if (
                    answer
                    and len(answer.strip()) > 15
                    and "not specified" not in answer.lower()
                ):
                    raw_rules[key] = answer.strip()
                    logger.info(f"Extracted {key}: {answer[:100]}...")
                else:
                    raw_rules[key] = "Not found"
                    logger.warning(f"Rule {key} not found in contract")

            except Exception as e:
                logger.warning(f"Error extracting {key}: {e}")
                raw_rules[key] = "Not found"

        return raw_rules

    def refine_rules(self, raw_rules: Dict[str, str]) -> List[Dict[str, Any]]:
        """
        Refine and structure the raw rules into a standardized format.
        """
        logger.info("Refining rules...")
        structured_rules = []
        rule_mapping = {
            "payment_terms": {"type": "payment_term", "priority": "high"},
            "approval_process": {"type": "approval", "priority": "medium"},
            "late_penalties": {"type": "penalty", "priority": "high"},
            "submission_requirements": {"type": "submission", "priority": "medium"},
        }

        for key, description in raw_rules.items():
            if key in rule_mapping and description != "Not found":
                # Accept if content is substantial (>15 chars)
                if len(description.strip()) > 15:
                    rule = {
                        "rule_id": key,
                        "type": rule_mapping[key]["type"],
                        "description": description.strip(),
                        "priority": rule_mapping[key]["priority"],
                        "confidence": "medium",
                    }
                    structured_rules.append(rule)
                    logger.info(
                        f"[OK] Structured rule: {rule['type']} - {rule['description'][:60]}..."
                    )
                else:
                    logger.warning(f"Rule {key} too short: '{description}'")

        return structured_rules

    def run(self, file_path: str) -> List[Dict[str, Any]]:
        """
        Main execution method for the agent.
        """
        try:
            text = self.parse_document(file_path)
            raw_rules = self.extract_rules(text)
            refined_rules = self.refine_rules(raw_rules)
            logger.info(f"Extraction complete. Found {len(refined_rules)} rules.")
            return refined_rules
        except Exception as e:
            logger.error(f"Agent run failed: {e}")
            raise


print("[OK] InvoiceRuleExtractorAgent class defined with FAISS vector store")


[OK] InvoiceRuleExtractorAgent class defined with FAISS vector store
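
The splitter settings in `_create_vectorstore` (`chunk_size=800`, `chunk_overlap=200`) mean neighbouring chunks can share up to 200 characters, so a rule that straddles a boundary is still likely to appear whole in one chunk. The sketch below shows the idea with a plain fixed-width sliding window. This is a deliberate simplification: `RecursiveCharacterTextSplitter` also prefers to break at separators such as paragraph and line breaks, so its chunk boundaries will generally differ. `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(text, chunk_size=800, chunk_overlap=200):
    """Split text into fixed-width chunks where neighbours share `chunk_overlap` chars."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


# Synthetic text of 1171 characters (an arbitrary length for illustration)
text = "".join(str(i % 10) for i in range(1171))
chunks = sliding_window_chunks(text)

print([len(c) for c in chunks])          # -> [800, 571]
print(chunks[0][-200:] == chunks[1][:200])  # -> True (shared overlap)
```

The agent then caps retrieval at `min(3, num_chunks)`, so a short contract that yields only two chunks is searched with `k=2`.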


---

## Part 1: Rule Extraction with RAG

This section extracts invoice processing rules from contract documents using RAG.

### Workflow:
1. **Cell 14:** Initialize the RAG-powered agent
2. **Cell 15:** Process a contract document and extract rules
3. **Cell 16:** Save extracted rules to JSON file (`extracted_rules.json`)
4. **Cell 17:** Display extracted rules in formatted output

### Input:
- Contract documents (PDF or DOCX) in `docs/contracts/` directory; scanned PDF pages are read with OCR

### Output:
- `extracted_rules.json` - Structured rules ready for invoice validation
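
As a rough picture of what ends up in the rules file, the sketch below writes and reloads a list of rule dictionaries with the five keys produced by `refine_rules` (`rule_id`, `type`, `description`, `priority`, `confidence`). The file layout (a bare JSON list) and the helper names `save_rules` and `load_rules` are assumptions for illustration; Cell 16 may store the rules differently. The example writes to a temporary directory so it does not overwrite a real `extracted_rules.json`.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"rule_id", "type", "description", "priority", "confidence"}


def save_rules(rules, path):
    """Validate that each rule has the expected keys, then write JSON."""
    for rule in rules:
        missing = REQUIRED_KEYS - rule.keys()
        if missing:
            raise ValueError(f"Rule {rule.get('rule_id')!r} is missing {sorted(missing)}")
    Path(path).write_text(json.dumps(rules, indent=2), encoding="utf-8")


def load_rules(path):
    """Read rules back from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


example_rules = [
    {
        "rule_id": "payment_terms",
        "type": "payment_term",
        "description": "Payment terms: Net 30 days from invoice date.",
        "priority": "high",
        "confidence": "medium",
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    rules_path = Path(tmp_dir) / "extracted_rules.json"
    save_rules(example_rules, rules_path)
    reloaded = load_rules(rules_path)

print(reloaded == example_rules)
```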

In [10]:
# Cell 14: Initialize the RAG-powered agent

# Use the global llm and embeddings initialized earlier
agent = InvoiceRuleExtractorAgent(llm=llm, embeddings=embeddings)
print("[OK] RAG-powered Agent initialized successfully")
print("  - LLM: gemma3:270m")
print("  - Embeddings: nomic-embed-text")
print("  - Vector Store: FAISS")


2025-11-06 19:01:48,840 - INFO - Initializing RAG-powered Invoice Rule Extractor Agent


[OK] RAG-powered Agent initialized successfully
  - LLM: gemma3:270m
  - Embeddings: nomic-embed-text
  - Vector Store: FAISS


In [11]:
# Cell 15: Process a contract document with RAG - WITH DIAGNOSTICS

# Use sample contract or specify your own path
file_path = "docs/contracts/sample_contract_net30.pdf"  # Change this to your file path

try:
    print(f"Processing contract: {file_path}")
    print("-" * 60)

    # Use the agent initialized in Cell 14 (faster - no re-initialization)
    # Note: If you need a clean state, uncomment the line below:
    # agent = InvoiceRuleExtractorAgent(llm=llm, embeddings=embeddings)

    rules = agent.run(file_path)

    print(f"\n[OK] Extracted {len(rules)} rules using RAG:")
    print("=" * 60)
    print(json.dumps(rules, indent=2))

except FileNotFoundError:
    print(f"[WARN] File not found: {file_path}")
    print("Please create sample documents first (run Generate_Sample_Documents.ipynb)")

except Exception as e:
    print(f"[ERROR] Error: {e}")

    # Provide manual fallback rules
    print("\nCreating fallback rules (manual extraction)...")
    rules = [
        {
            "rule_id": "payment_terms",
            "type": "payment_term",
            "description": "Payment terms: Net 30 days from invoice date. All invoices must include a valid Purchase Order (PO) number.",
            "priority": "high",
            "confidence": "high",
        },
        {
            "rule_id": "submission_requirements",
            "type": "submission",
            "description": "All invoices must include: Valid PO number (format: PO-YYYY-####), Invoice date and due date, Vendor tax identification number, Detailed description of services",
            "priority": "medium",
            "confidence": "high",
        },
        {
            "rule_id": "late_penalties",
            "type": "penalty",
            "description": "Late payment penalty: 1.5% per month on overdue balance. Missing PO number: Automatic rejection.",
            "priority": "high",
            "confidence": "high",
        },
        {
            "rule_id": "approval_process",
            "type": "approval",
            "description": "All invoices must be approved by the Project Manager within 5 business days. Finance department will process payment after approval.",
            "priority": "medium",
            "confidence": "high",
        },
    ]

    print(f"[OK] Created {len(rules)} fallback rules")
    print("\n" + "=" * 60)
    print(json.dumps(rules, indent=2))

    print("\n" + "=" * 60)
    print("[WARN] Using manually extracted rules due to error")
    print("NOTE: These fallback rules work fine for testing!")


2025-11-06 19:01:48,846 - INFO - Parsing PDF: docs/contracts/sample_contract_net30.pdf
2025-11-06 19:01:48,864 - INFO - Successfully parsed 1171 characters.
2025-11-06 19:01:48,864 - INFO - Creating vector store for RAG...
2025-11-06 19:01:48,865 - INFO - Created 2 document chunks


Processing contract: docs/contracts/sample_contract_net30.pdf
------------------------------------------------------------


2025-11-06 19:01:49,082 - INFO - HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"
2025-11-06 19:01:49,086 - INFO - Loading faiss.
2025-11-06 19:01:49,101 - INFO - Successfully loaded faiss.
2025-11-06 19:01:49,106 - INFO - [OK] Vector store created with FAISS
2025-11-06 19:01:49,106 - INFO - Vector store created successfully (retrieving top 2 chunks)
2025-11-06 19:01:49,107 - INFO - Extracting rules using RAG...
2025-11-06 19:01:49,137 - INFO - HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"
2025-11-06 19:01:49,526 - INFO - HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"
2025-11-06 19:01:49,694 - INFO - Extracted payment_terms: The payment terms are Net 30 days from invoice date and in monthly installments.
...
2025-11-06 19:01:49,723 - INFO - HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"
2025-11-06 19:01:49,969 - INFO - HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"
2025-11-06


[OK] Extracted 4 rules using RAG:
[
  {
    "rule_id": "payment_terms",
    "type": "payment_term",
    "description": "The payment terms are Net 30 days from invoice date and in monthly installments.",
    "priority": "high",
    "confidence": "medium"
  },
  {
    "rule_id": "approval_process",
    "type": "approval",
    "description": "The invoice approval process is to approve invoices by the Project Manager.",
    "priority": "medium",
    "confidence": "medium"
  },
  {
    "rule_id": "late_penalties",
    "type": "penalty",
    "description": "The late payment penalty is 1.5% per month on overdue balance.",
    "priority": "high",
    "confidence": "medium"
  },
  {
    "rule_id": "submission_requirements",
    "type": "submission",
    "description": "The invoice processing rules are:\n\n*   All invoices must include:\n    *   Valid PO number (format: PO-YYYY-####)\n    *   Detailed description of services\n    *   Invoice date and due date\n    *   Vendor tax identification 

In [12]:
# Cell 16: Save extracted rules to JSON file

output_file = "extracted_rules.json"

try:
    with open(output_file, "w") as f:
        json.dump(rules, f, indent=2)
    print(f"[OK] Rules saved to {output_file}")
except NameError:
    print("[WARN] No rules to save. Run Cell 15 first to extract rules.")


[OK] Rules saved to extracted_rules.json
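
The `payment_term` rule saved above carries its net-days value only as free text. The sketch below shows one way that number could be recovered and turned into an expected due date, using the same `net\s*(\d+)` pattern that `InvoiceProcessor._extract_payment_terms` applies later in this notebook. The helper name `net_days_from_description` and the invoice date are hypothetical and chosen only for illustration.

```python
import re
from datetime import datetime, timedelta
from typing import Optional


def net_days_from_description(description: str) -> Optional[int]:
    """Return N from a 'Net N' phrase in a rule description, or None if absent."""
    match = re.search(r"net\s*(\d+)", description, re.IGNORECASE)
    return int(match.group(1)) if match else None


# Description copied from the extracted payment_terms rule shown above
description = (
    "The payment terms are Net 30 days from invoice date and in monthly installments."
)
net_days = net_days_from_description(description)

# Hypothetical invoice date, for illustration only
invoice_date = datetime(2025, 1, 15)
expected_due = (
    invoice_date + timedelta(days=net_days) if net_days is not None else None
)

print(f"Net days: {net_days}")
print(f"Expected due date: {expected_due:%Y-%m-%d}")
```

A description without a "Net N" phrase yields `None`, in which case no due-date comparison can be made for that contract.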


In [13]:
# Cell 17: Display extracted rules in a formatted way

try:
    print("=" * 60)
    print("EXTRACTED INVOICE PROCESSING RULES")
    print("=" * 60)

    for i, rule in enumerate(rules, 1):
        print(f"\n[Rule {i}]")
        print(f"Type: {rule['type']}")
        print(f"Priority: {rule['priority']}")
        print(f"Description: {rule['description']}")
        print(f"Confidence: {rule['confidence']}")
        print("-" * 60)
except NameError:
    print("[WARN] No rules to display. Run Cell 15 first to extract rules.")


EXTRACTED INVOICE PROCESSING RULES

[Rule 1]
Type: payment_term
Priority: high
Description: The payment terms are Net 30 days from invoice date and in monthly installments.
Confidence: medium
------------------------------------------------------------

[Rule 2]
Type: approval
Priority: medium
Description: The invoice approval process is to approve invoices by the Project Manager.
Confidence: medium
------------------------------------------------------------

[Rule 3]
Type: penalty
Priority: high
Description: The late payment penalty is 1.5% per month on overdue balance.
Confidence: medium
------------------------------------------------------------

[Rule 4]
Type: submission
Priority: medium
Description: The invoice processing rules are:

*   All invoices must include:
    *   Valid PO number (format: PO-YYYY-####)
    *   Detailed description of services
    *   Invoice date and due date
    *   Vendor tax identification number
*   Invoices must be approved by the Project Manager
* 

---

## Part 2: Invoice Processor - Apply Extracted Rules

This section processes invoices against the extracted rules.

### Components:
- **Cell 19:** InvoiceContractMatcher class definition
- **Cell 20:** InvoiceProcessor class definition
- **Cell 21:** Initialize processor with extracted rules
- **Cell 22:** Process a single invoice (optional - for testing/debugging)
- **Cell 23:** Batch process all invoices (recommended for production use)
- **Cell 24:** Generate processing report

### When to use each:
- **Cell 22 (Single Invoice):** Use for testing, debugging, or when you need to process just one specific invoice with detailed output
- **Cell 23 (Batch Processing):** Use for production workflows - processes all invoices and generates summary statistics

### Input:
- Invoice documents (PDF, DOCX, PNG, JPG, TIFF, BMP) in `docs/invoices/` directory
- `extracted_rules.json` (generated in Part 1)

### Output:
- Validation results (APPROVED/FLAGGED/REJECTED)
- Detailed processing reports
- JSON output files
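
The `extracted_rules.json` input listed above must be in the multi-contract format: the loader in `InvoiceProcessor` (Cell 20) rejects anything that is not a JSON object with a top-level `contracts` list, whereas Cell 16 writes the rules as a flat list. The sketch below assumes the flat list has to be wrapped by hand; this section does not show whether the notebook does that conversion elsewhere. The function `wrap_rules_as_multi_contract` and every metadata value (contract ID, party name, program code, dates) are hypothetical placeholders.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, List


def wrap_rules_as_multi_contract(
    rules: List[Dict[str, Any]],
    contract_id: str,
    contract_path: str,
    parties: List[str],
    program_code: str,
    date_range: Dict[str, str],
) -> Dict[str, Any]:
    """Wrap a flat rule list in the structure documented in _load_rules_data."""
    return {
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        "contracts": [
            {
                "contract_id": contract_id,
                "contract_path": contract_path,
                "parties": parties,
                "program_code": program_code,
                "date_range": date_range,
                "rules": rules,
            }
        ],
    }


# One rule in the flat format produced by Part 1
flat_rules = [
    {
        "rule_id": "late_penalties",
        "type": "penalty",
        "description": "The late payment penalty is 1.5% per month on overdue balance.",
        "priority": "high",
        "confidence": "medium",
    }
]

rules_data = wrap_rules_as_multi_contract(
    flat_rules,
    contract_id="CONTRACT-001",                    # hypothetical
    contract_path="docs/contracts/sample_contract_net30.pdf",
    parties=["Example Vendor LLC"],                # hypothetical
    program_code="ABC",                            # hypothetical
    date_range={"start": "2025-01-01", "end": "2025-12-31"},  # hypothetical
)

print(json.dumps(rules_data, indent=2))
```

Writing `rules_data` with `json.dump` instead of the flat list gives a file that passes the three format checks in `_load_rules_data` (a dict, a `contracts` key, and a list value).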

In [None]:
# Cell 19: InvoiceContractMatcher Class Definition

# Import additional modules needed for contract matching
import re
from pathlib import Path
from typing import Tuple

# ============================================================================
# InvoiceContractMatcher Class
# ============================================================================

class InvoiceContractMatcher:
    """
    Matches invoices to their source contracts using multiple detection methods.
    
    Matching priority (stops at first successful match):
    1. PO Number (confidence: 0.95) - searches contract CONTENT
    2. Vendor + Program Code (confidence: 0.85)
    3. Vendor + Date Range (confidence: 0.80)
    4. Program Code only (confidence: 0.70)
    5. Vendor only (confidence: 0.60) - last resort, may be ambiguous
    6. No match → UNMATCHED status (requires manual review)
    """
    
    def __init__(self, contracts_data: Dict):
        """
        Initialize matcher with contracts data.
        
        Args:
            contracts_data: Dict with structure:
                {
                    "contracts": [
                        {
                            "contract_id": "...",
                            "contract_path": "...",
                            "parties": [...],
                            "program_code": "...",
                            "date_range": {"start": "...", "end": "..."},
                            ...
                        }
                    ]
                }
        """
        self.contracts = contracts_data.get("contracts", [])
        self.contract_index = self._build_contract_index()
        # Cache for parsed contract content (to avoid re-parsing)
        self._contract_content_cache = {}
    
    def _build_contract_index(self) -> Dict:
        """
        Build searchable index of contract metadata.
        
        Includes:
        - contract_id
        - contract_path
        - parties (normalized, lowercase)
        - program_code (uppercase)
        - date_range (start, end dates)
        """
        index = {}
        for contract in self.contracts:
            contract_id = contract.get("contract_id", "UNKNOWN")
            index[contract_id] = {
                "contract_id": contract_id,
                "contract_path": contract.get("contract_path", ""),
                # "or" guards against keys that are present but set to null in the JSON
                "parties": [p.lower() for p in (contract.get("parties") or [])],
                "program_code": (contract.get("program_code") or "").upper(),
                "date_range": contract.get("date_range"),  # {"start": "...", "end": "..."}
            }
        return index
    
    def match_invoice_to_contract(self, invoice_data: Dict) -> Dict:
        """
        Detect which contract an invoice belongs to.
        
        Args:
            invoice_data: Parsed invoice data from parse_invoice()
        
        Returns:
            {
                "contract_id": "..." or None,
                "contract_path": "..." or None,
                "match_method": "PO_NUMBER|VENDOR_PROGRAM|VENDOR_DATE|PROGRAM_CODE|VENDOR_ONLY|UNMATCHED",
                "confidence": 0.0-1.0,
                "status": "MATCHED|AMBIGUOUS|UNMATCHED",
                "matching_details": {...},
                "alternative_matches": [...]
            }
        """
        matches = []
        
        # 1. Try PO number matching (highest priority, unique identifier)
        po_matches = self._match_by_po_number(invoice_data)
        if po_matches:
            matches.extend(po_matches)
        
        # 2. Try vendor + program code (if no PO match)
        if not matches:
            vendor_program_matches = self._match_by_vendor_and_program(invoice_data)
            if vendor_program_matches:
                matches.extend(vendor_program_matches)
        
        # 3. Try vendor + date range (if no previous matches)
        if not matches:
            vendor_date_matches = self._match_by_vendor_and_date(invoice_data)
            if vendor_date_matches:
                matches.extend(vendor_date_matches)
        
        # 4. Try program code only (if no previous matches)
        if not matches:
            program_matches = self._match_by_program_code(invoice_data)
            if program_matches:
                matches.extend(program_matches)
        
        # 5. Try vendor only (last resort, lowest confidence)
        if not matches:
            vendor_matches = self._match_by_vendor_only(invoice_data)
            if vendor_matches:
                matches.extend(vendor_matches)
        
        # Build result
        result = {
            "contract_id": None,
            "contract_path": None,
            "match_method": None,
            "confidence": 0.0,
            "status": "UNMATCHED",
            "matching_details": {},
            "alternative_matches": [],
        }
        
        if len(matches) == 1:
            # Unique match
            contract_id, method, confidence = matches[0]
            contract_info = self.contract_index.get(contract_id, {})
            result["contract_id"] = contract_id
            result["contract_path"] = contract_info.get("contract_path", "")
            result["match_method"] = method
            result["confidence"] = confidence
            result["status"] = "MATCHED"
            result["matching_details"] = self._get_matching_details(invoice_data, contract_id)
        
        elif len(matches) > 1:
            # Multiple matches - ambiguous
            contract_id, method, confidence = matches[0]
            contract_info = self.contract_index.get(contract_id, {})
            result["contract_id"] = contract_id
            result["contract_path"] = contract_info.get("contract_path", "")
            result["match_method"] = method
            result["confidence"] = confidence
            result["status"] = "AMBIGUOUS"
            result["alternative_matches"] = [
                {"contract_id": m[0], "method": m[1], "confidence": m[2]}
                for m in matches[1:]
            ]
            result["matching_details"] = self._get_matching_details(invoice_data, contract_id)
        
        else:
            # No match - UNMATCHED (no fallback)
            result["status"] = "UNMATCHED"
            result["matching_details"] = {
                "reason": "No matching contract found. Manual review required.",
                "invoice_po": invoice_data.get("po_number"),
                "invoice_vendor": invoice_data.get("vendor_name"),
                "invoice_date": str(invoice_data.get("invoice_date")) if invoice_data.get("invoice_date") else None,
            }
        
        return result
    
    def _match_by_po_number(self, invoice_data: Dict) -> List[Tuple[str, str, float]]:
        """
        Match by PO number - searches contract CONTENT (not filenames).
        
        Requires parsing contract documents to search for PO references in content.
        
        Returns: List of (contract_id, method, confidence) tuples
        """
        invoice_po = invoice_data.get("po_number")
        if not invoice_po:
            return []
        
        matches = []
        invoice_po_upper = invoice_po.upper()
        
        # Parse each contract and search content for PO number
        for contract in self.contracts:
            contract_id = contract.get("contract_id", "UNKNOWN")
            contract_path = contract.get("contract_path", "")
            
            if not contract_path:
                continue
            
            # Parse contract document to get text content
            contract_text = self._parse_contract_content(contract_path)
            
            # Search for PO number in contract content
            if invoice_po_upper in contract_text.upper():
                matches.append((contract_id, "PO_NUMBER", 0.95))
        
        return matches
    
    def _parse_contract_content(self, contract_path: str) -> str:
        """
        Parse contract document and extract text content.
        Mirrors the parsing logic of InvoiceRuleExtractorAgent (duplicated here, not shared).
        
        Supports: PDF and DOCX formats
        """
        # Check cache first
        if contract_path in self._contract_content_cache:
            return self._contract_content_cache[contract_path]
        
        contract_file = Path(contract_path)
        if not contract_file.exists():
            logger.warning(f"Contract file not found: {contract_path}")
            return ""
        
        text = ""
        try:
            # Extract text from document
            if contract_file.suffix.lower() == ".pdf":
                logger.debug(f"Parsing PDF contract: {contract_path}")
                import pdfplumber
                with pdfplumber.open(contract_path) as pdf:
                    for page in pdf.pages:
                        page_text = page.extract_text()
                        if page_text:
                            text += page_text + "\n"
            
            elif contract_file.suffix.lower() == ".docx":
                # python-docx reads .docx only; legacy .doc files are reported as unsupported below
                logger.debug(f"Parsing DOCX contract: {contract_path}")
                from docx import Document
                doc = Document(contract_path)
                text = "\n".join([para.text for para in doc.paragraphs])
            
            else:
                logger.warning(f"Unsupported contract format: {contract_file.suffix}")
                return ""
            
            # Cache the parsed content
            self._contract_content_cache[contract_path] = text
            return text
            
        except Exception as e:
            logger.error(f"Error parsing contract {contract_path}: {e}")
            return ""
    
    def _match_by_vendor_and_program(self, invoice_data: Dict) -> List[Tuple[str, str, float]]:
        """
        Match by vendor AND program code - handles multiple contracts between same parties.
        
        Returns: List of (contract_id, method, confidence) tuples
        """
        invoice_vendor = (invoice_data.get("vendor_name") or "").lower()
        raw_text = (invoice_data.get("raw_text") or "").upper()
        
        if not invoice_vendor:
            return []
        
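        # Extract candidate program codes: any standalone 2-4 letter token.
        # raw_text is already uppercased, so ordinary short words (e.g. "INC", "DUE")
        # also match; they only count if they equal a contract's program_code below.
        program_codes = re.findall(r'\b([A-Z]{2,4})\b', raw_text)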
        
        if not program_codes:
            return []
        
        matches = []
        
        for contract_id, contract_info in self.contract_index.items():
            # Check vendor match
            parties = contract_info.get("parties", [])
            vendor_matches = any(
                party in invoice_vendor or invoice_vendor in party
                for party in parties
            )
            
            if not vendor_matches:
                continue
            
            # Check program code match
            contract_program = contract_info.get("program_code", "")
            if contract_program and contract_program in program_codes:
                matches.append((contract_id, "VENDOR_PROGRAM", 0.85))
        
        return matches
    
    def _match_by_vendor_and_date(self, invoice_data: Dict) -> List[Tuple[str, str, float]]:
        """
        Match by vendor AND date range - requires contract date range info.
        
        Checks if:
        1. Vendor matches contract party
        2. Invoice date falls within contract date range
        
        Returns: List of (contract_id, method, confidence) tuples
        """
        invoice_vendor = (invoice_data.get("vendor_name") or "").lower()
        invoice_date = invoice_data.get("invoice_date")
        
        if not invoice_vendor or not invoice_date:
            return []
        
        matches = []
        
        for contract_id, contract_info in self.contract_index.items():
            # Check vendor match
            parties = contract_info.get("parties", [])
            vendor_matches = any(
                party in invoice_vendor or invoice_vendor in party
                for party in parties
            )
            
            if not vendor_matches:
                continue
            
            # Check date range
            date_range = contract_info.get("date_range")
            if date_range and self._date_in_range(invoice_date, date_range):
                matches.append((contract_id, "VENDOR_DATE", 0.80))
        
        return matches
    
    def _date_in_range(self, date: datetime, date_range: Dict) -> bool:
        """Check if date falls within contract date range."""
        start_date = date_range.get("start")
        end_date = date_range.get("end")
        
        if not start_date or not end_date:
            return False
        
        # Parse dates if strings
        try:
            if isinstance(start_date, str):
                start_date = datetime.fromisoformat(start_date.split("T")[0])
            if isinstance(end_date, str):
                end_date = datetime.fromisoformat(end_date.split("T")[0])
            
            return start_date <= date <= end_date
        except (ValueError, AttributeError, TypeError) as e:
            # TypeError covers comparing mismatched types (e.g. a str invoice date)
            logger.warning(f"Error checking date range: {e}")
            return False
    
    def _match_by_program_code(self, invoice_data: Dict) -> List[Tuple[str, str, float]]:
        """
        Match by program code only.
        
        Returns: List of (contract_id, method, confidence) tuples
        """
        raw_text = invoice_data.get("raw_text", "").upper()
        if not raw_text:
            return []
        
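        # Extract candidate program codes: any standalone 2-4 letter token.
        # raw_text is already uppercased, so ordinary short words also match;
        # they only count if they equal a contract's program_code below.
        program_codes = re.findall(r'\b([A-Z]{2,4})\b', raw_text)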
        
        if not program_codes:
            return []
        
        matches = []
        
        for contract_id, contract_info in self.contract_index.items():
            contract_program = contract_info.get("program_code", "")
            if contract_program and contract_program in program_codes:
                matches.append((contract_id, "PROGRAM_CODE", 0.70))
        
        return matches
    
    def _match_by_vendor_only(self, invoice_data: Dict) -> List[Tuple[str, str, float]]:
        """
        Match by vendor only - last resort, low confidence.
        
        Returns: List of (contract_id, method, confidence) tuples
        """
        invoice_vendor = (invoice_data.get("vendor_name") or "").lower()
        if not invoice_vendor:
            return []
        
        matches = []
        
        for contract_id, contract_info in self.contract_index.items():
            parties = contract_info.get("parties", [])
            for party in parties:
                if party in invoice_vendor or invoice_vendor in party:
                    matches.append((contract_id, "VENDOR_ONLY", 0.60))
                    break
        
        return matches
    
    def _get_matching_details(self, invoice_data: Dict, contract_id: str) -> Dict:
        """Get details of why invoice matched this contract."""
        return {
            "po_number": invoice_data.get("po_number"),
            "vendor": invoice_data.get("vendor_name"),
            "invoice_date": str(invoice_data.get("invoice_date")) if invoice_data.get("invoice_date") else None,
            "contract_id": contract_id,
        }


print("[OK] InvoiceContractMatcher class defined")


[OK] InvoiceContractMatcher class defined
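
The matching cascade in `match_invoice_to_contract` can be summarized as: try each strategy in priority order, stop at the first one that returns any matches, then label the result by how many contracts that strategy returned. The sketch below is a simplified stand-in, not the class itself: `run_cascade`, `by_po`, `by_vendor`, and the two toy contracts are hypothetical, and only two of the five strategies are modeled.

```python
from typing import Callable, Dict, List, Tuple

Match = Tuple[str, str, float]  # (contract_id, method, confidence)

# Toy contract data, for illustration only
CONTRACTS = [
    {"contract_id": "C-1", "po_numbers": ["PO-2025-0001"], "party": "acme consulting"},
    {"contract_id": "C-2", "po_numbers": [], "party": "acme consulting"},
]


def by_po(invoice: Dict) -> List[Match]:
    """Highest-priority strategy: exact PO number lookup."""
    po = invoice.get("po_number")
    return [
        (c["contract_id"], "PO_NUMBER", 0.95)
        for c in CONTRACTS
        if po and po in c["po_numbers"]
    ]


def by_vendor(invoice: Dict) -> List[Match]:
    """Lowest-priority strategy: substring match on the vendor name."""
    vendor = (invoice.get("vendor_name") or "").lower()
    return [
        (c["contract_id"], "VENDOR_ONLY", 0.60)
        for c in CONTRACTS
        if vendor and vendor in c["party"]
    ]


def run_cascade(
    invoice: Dict, strategies: List[Callable[[Dict], List[Match]]]
) -> Dict:
    """Run strategies in order and stop at the first that yields matches."""
    matches: List[Match] = []
    for strategy in strategies:
        matches = strategy(invoice)
        if matches:
            break

    if not matches:
        return {"contract_id": None, "match_method": None, "confidence": 0.0,
                "status": "UNMATCHED", "alternative_matches": []}

    contract_id, method, confidence = matches[0]
    return {
        "contract_id": contract_id,
        "match_method": method,
        "confidence": confidence,
        # One hit is a clean match; several hits from the same strategy are ambiguous
        "status": "MATCHED" if len(matches) == 1 else "AMBIGUOUS",
        "alternative_matches": [m[0] for m in matches[1:]],
    }


strategies = [by_po, by_vendor]
with_po = run_cascade({"po_number": "PO-2025-0001", "vendor_name": "Acme Consulting"}, strategies)
vendor_only = run_cascade({"vendor_name": "Acme Consulting"}, strategies)
nothing = run_cascade({}, strategies)

for result in (with_po, vendor_only, nothing):
    print(result["status"], result["contract_id"], result["confidence"])
```

With a PO number the invoice resolves to a single contract. Without one, both toy contracts share the vendor, so the result is `AMBIGUOUS` with the first hit reported and the other listed as an alternative, which is the case the class docstring flags for the vendor-only method.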


In [None]:
# Cell 20: InvoiceProcessor Class Definition

class InvoiceProcessor:
    """
    AI-powered Invoice Processor that applies extracted rules to validate invoices.
    """

    def __init__(self, rules_file: str = "extracted_rules.json"):
        """
        Initialize the processor with extracted rules and contract matcher.

        Args:
            rules_file: Path to JSON file with extracted rules (multi-contract format)
        """
        self.rules_file = rules_file
        # Load rules structure (multi-contract format only)
        self.rules_data = self._load_rules_data(rules_file)
        
        # Initialize contract matcher
        if not isinstance(self.rules_data, dict) or "contracts" not in self.rules_data:
            raise ValueError(
                f"Invalid rules file format. Expected multi-contract format with 'contracts' key. "
                f"File: {rules_file}"
            )
        
        self.matcher = InvoiceContractMatcher(self.rules_data)
        logger.info(f"Invoice Processor initialized with {len(self.rules_data.get('contracts', []))} contract(s)")
        
        # Current contract-specific rules (loaded per invoice)
        self.current_contract_id = None
        self.current_rules = []
        self.payment_terms = None

    def _load_rules_data(self, rules_file: str) -> Dict[str, Any]:
        """
        Load rules structure from JSON file (multi-contract format only).
        
        Expected format:
        {
            "extracted_at": "...",
            "contracts": [
                {
                    "contract_id": "...",
                    "contract_path": "...",
                    "parties": [...],
                    "program_code": "...",
                    "date_range": {...},
                    "rules": [...]
                }
            ]
        }
        
        Returns:
            Rules structure dict
        """
        try:
            with open(rules_file, "r") as f:
                rules_data = json.load(f)
            
            # Validate format
            if not isinstance(rules_data, dict):
                raise ValueError(f"Rules file must be a JSON object, got {type(rules_data)}")
            
            if "contracts" not in rules_data:
                raise ValueError(
                    f"Invalid rules file format. Missing 'contracts' key. "
                    f"Expected multi-contract format. File: {rules_file}"
                )
            
            if not isinstance(rules_data["contracts"], list):
                raise ValueError(f"'contracts' must be a list, got {type(rules_data['contracts'])}")
            
            logger.info(f"Loaded rules data from {rules_file}: {len(rules_data.get('contracts', []))} contract(s)")
            return rules_data
            
        except FileNotFoundError:
            raise FileNotFoundError(
                f"Rules file not found: {rules_file}. "
                f"Please run rule extraction (Cell 28) first to generate extracted_rules.json"
            )
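        except json.JSONDecodeError as e:
            raise ValueError(f"Invalid JSON in rules file {rules_file}: {e}") from e
        except ValueError:
            # Format-validation errors raised above: propagate unchanged
            # instead of re-wrapping them as RuntimeError
            raise
        except Exception as e:
            raise RuntimeError(f"Error loading rules file {rules_file}: {e}") from e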

    def _load_contract_rules(self, contract_id: str) -> List[Dict[str, Any]]:
        """
        Load rules for a specific contract.
        
        Args:
            contract_id: Contract ID to load rules for
            
        Returns:
            List of rules for the specified contract
        """
        if not isinstance(self.rules_data, dict) or "contracts" not in self.rules_data:
            return []
        
        for contract in self.rules_data.get("contracts", []):
            if contract.get("contract_id") == contract_id:
                rules = contract.get("rules", [])
                logger.info(f"Loaded {len(rules)} rules for contract {contract_id}")
                return rules
        
        logger.warning(f"No rules found for contract {contract_id}")
        return []

    def _extract_payment_terms(self, rules: Optional[List[Dict[str, Any]]] = None) -> Optional[int]:
        """
        Extract net days from payment terms rule.
        
        Args:
            rules: Rules to search (defaults to self.current_rules)
        """
        if rules is None:
            rules = self.current_rules
            
        for rule in rules:
            if rule.get("type") == "payment_term":
                description = rule.get("description", "")
                # Look for "net 30", "net 60", etc.
                match = re.search(r"net\s*(\d+)", description, re.IGNORECASE)
                if match:
                    return int(match.group(1))
        return None

    def parse_invoice(self, invoice_path: str) -> Dict[str, Any]:
        """
        Parse invoice document and extract key fields.

        Args:
            invoice_path: Path to invoice PDF/image

        Returns:
            Dictionary with invoice data
        """
        logger.info(f"Parsing invoice: {invoice_path}")
        invoice_path = Path(invoice_path)

        if not invoice_path.exists():
            raise FileNotFoundError(f"Invoice not found: {invoice_path}")

        # Extract text from invoice
        text = ""

        # Handle image files (PNG, JPG, JPEG, TIFF, BMP) with pytesseract
        if invoice_path.suffix.lower() in [".png", ".jpg", ".jpeg", ".tiff", ".bmp"]:
            try:
                import pytesseract
                from PIL import Image, ImageEnhance

                logger.info(f"Using pytesseract for image file: {invoice_path.name}")

                # Load and optimize image for OCR
                img = Image.open(invoice_path)

                # Convert to RGB if needed
                if img.mode != "RGB":
                    img = img.convert("RGB")

                # Enhance image quality for better OCR
                img = ImageEnhance.Contrast(img).enhance(2.0)
                img = ImageEnhance.Sharpness(img).enhance(1.5)

                # Extract text using tesseract with optimized config
                # --psm 6: Assume a single uniform block of text
                # --oem 3: Default engine mode (uses whatever engine is available)
                text = pytesseract.image_to_string(img, config="--psm 6 --oem 3")

                logger.info(f"pytesseract extracted {len(text)} characters")

            except Exception as e:
                logger.error(f"pytesseract extraction failed: {e}")
                logger.info("Make sure Tesseract is installed:")
                logger.info("  macOS: brew install tesseract")
                logger.info("  Linux: sudo apt-get install tesseract-ocr")
                text = ""

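        # Handle PDF files
        elif invoice_path.suffix.lower() == ".pdf":
            with pdfplumber.open(invoice_path) as pdf:
                for page in pdf.pages:
                    page_text = page.extract_text()
                    if page_text:
                        text += page_text + "\n"

        # Handle DOCX files (python-docx), mirroring the contract parser in Cell 19
        elif invoice_path.suffix.lower() == ".docx":
            from docx import Document

            text = "\n".join(para.text for para in Document(str(invoice_path)).paragraphs)

        else:
            logger.warning(f"Unsupported invoice format: {invoice_path.suffix}")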

        # Extract key invoice fields using regex patterns
        invoice_data = {
            "file": invoice_path.name,
            "invoice_number": self._extract_field(
                text, r"invoice\s*#\s*:?\s*([A-Z0-9-]+)", "Invoice Number"
            ),
            "po_number": self._extract_field(
                text, r"(?:purchase\s+order\s+number|po\s*(?:number|#)?):?\s*(PO-[\w-]+)", "PO Number"
            ),
            "invoice_date": self._extract_date(
                text, r"invoice\s*date:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})"
            ),
            "due_date": self._extract_date(
                text, r"due\s*date:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})"
            ),
            "total_amount": self._extract_amount(text),
            "vendor_name": self._extract_vendor_name(text),
            "raw_text": text[:500],  # First 500 chars for reference
        }

        return invoice_data

    def _extract_field(self, text: str, pattern: str, field_name: str) -> Optional[str]:
        """Extract a field using regex pattern."""
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            return match.group(1).strip()
        logger.warning(f"{field_name} not found in invoice")
        return None

    def _extract_vendor_name(self, text: str) -> Optional[str]:
        """Extract vendor name from invoice with multiple pattern attempts."""
        patterns = [
            # Pattern 1: After "INVOICE" heading, capture text before "Invoice #"
            r"INVOICE\s*\n\s*(.+?)\s+Invoice\s*#",
            # Pattern 2: "From:" line (common in some formats)
            r"from:?\s*([^\n]+)",
            # Pattern 3: First line containing "Inc." or "LLC" or "Ltd" or "Corp"
            r"(?:^|\n)([^\n]*?(?:Inc\.|LLC|Ltd\.|Corp\.|Corporation|Company)[^\n]*?)(?:\s+Invoice|$)",
            # Pattern 4: Text between INVOICE and first address/date line
            r"INVOICE\s*\n\s*([^\n]+?)(?:\s+\d{1,4}\s|$)",
        ]

        for pattern in patterns:
            match = re.search(pattern, text, re.IGNORECASE | re.MULTILINE)
            if match:
                vendor = match.group(1).strip()
                # Clean up and validate
                # Remove trailing text after company name indicators
                vendor = re.sub(
                    r"\s+(Invoice|Tax|PO|Date).*$", "", vendor, flags=re.IGNORECASE
                )
                # Filter out invalid extractions
                if (
                    vendor
                    and len(vendor) > 3
                    and not vendor.lower().startswith("invoice")
                ):
                    return vendor

        logger.warning("Vendor not found in invoice")
        return None

    def _extract_date(self, text: str, pattern: str) -> Optional[datetime]:
        """Extract and parse a date field."""
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            date_str = match.group(1)
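            # Try common date formats in order. Month-first is tried before day-first,
            # so an ambiguous date such as 03/04/2025 is read as March 4.
            for fmt in [
                "%m/%d/%Y",
                "%d/%m/%Y",
                "%m-%d-%Y",
                "%d-%m-%Y",
                "%m/%d/%y",
                "%d/%m/%y",
                "%m-%d-%y",
                "%d-%m-%y",
            ]: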
                try:
                    return datetime.strptime(date_str, fmt)
                except ValueError:
                    continue
        return None

    def _extract_amount(self, text: str) -> Optional[float]:
        """Extract total amount from invoice."""
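        patterns = [
            # Labeled total; the leading \b keeps "Subtotal" from matching as "total"
            r"\b(?:total\s*amount\s*due|total|amount\s*due|balance\s*due)[:\s]*\$\s*([\d,]+\.?\d*)",
            # Fallback: first dollar amount that ends a line ($ matches at every line end under re.MULTILINE)
            r"\$\s*([\d,]+\.\d{2})\s*$",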
        ]

        for pattern in patterns:
            match = re.search(pattern, text, re.IGNORECASE | re.MULTILINE)
            if match:
                amount_str = match.group(1).replace(",", "")
                try:
                    return float(amount_str)
                except ValueError:
                    continue
        return None

    def validate_invoice(self, invoice_data: Dict[str, Any]) -> Dict[str, Any]:
        """
        Validate invoice against extracted rules.

        Args:
            invoice_data: Parsed invoice data

        Returns:
            Validation result with status and issues
        """
        logger.info(f"Validating invoice: {invoice_data['file']}")

        issues = []
        warnings = []

        # Check for required fields based on submission requirements rule
        required_fields = self._get_required_fields()
        for field in required_fields:
            if not invoice_data.get(field):
                issue_msg = f"Missing required field: {field}"
                issues.append(issue_msg)
                # Print critical validation issues to stdout (bypasses logging suppression)
                print(f"[!] VALIDATION ISSUE: {invoice_data['file']} - {issue_msg}")

        # Validate payment terms
        if (
            self.payment_terms
            and invoice_data.get("invoice_date")
            and invoice_data.get("due_date")
        ):
            expected_due = invoice_data["invoice_date"] + timedelta(
                days=self.payment_terms
            )
            actual_due = invoice_data["due_date"]

            if abs((actual_due - expected_due).days) > 2:  # Allow 2-day tolerance
                issue_msg = (
                    f"Due date mismatch: Expected {expected_due.strftime('%m/%d/%Y')}, "
                    f"got {actual_due.strftime('%m/%d/%Y')} (Net {self.payment_terms} terms)"
                )
                issues.append(issue_msg)
                print(f"[!] VALIDATION ISSUE: {invoice_data['file']} - {issue_msg}")

        # Check if invoice is overdue
        if invoice_data.get("due_date"):
            if invoice_data["due_date"] < datetime.now():
                days_overdue = (datetime.now() - invoice_data["due_date"]).days
                warnings.append(f"Invoice is {days_overdue} days overdue")

                # Check for late penalties
                penalty_rule = self._get_penalty_rule()
                if penalty_rule:
                    warnings.append(f"Late penalty may apply: {penalty_rule}")

        # Determine approval status
        if issues:
            status = "REJECTED"
            action = "Manual review required"
        elif warnings:
            status = "FLAGGED"
            action = "Review recommended"
        else:
            status = "APPROVED"
            action = "Auto-approved for payment"

        result = {
            "invoice_file": invoice_data["file"],
            "invoice_number": invoice_data.get("invoice_number"),
            "status": status,
            "action": action,
            "issues": issues,
            "warnings": warnings,
            "invoice_data": invoice_data,
            "validation_timestamp": datetime.now().isoformat(),
        }

        logger.info(f"Validation complete: {status}")
        return result

    def _get_required_fields(self) -> List[str]:
        """Extract required fields from submission requirements rule."""
        # Core required fields for any valid invoice
        required = ["invoice_number", "invoice_date", "total_amount", "vendor_name"]

        for rule in self.current_rules:
            if rule.get("type") == "submission":
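                description = rule.get("description", "").lower()
                # Match "po" as a whole word so that words like "report" or "support" do not trigger it
                if re.search(r"\bpo\b", description) or "purchase order" in description: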
                    required.append("po_number")

        return required

    def _get_penalty_rule(self) -> Optional[str]:
        """Get late payment penalty description."""
        for rule in self.current_rules:
            if rule.get("type") == "penalty":
                return rule.get("description")
        return None

    def process_invoice(self, invoice_path: str) -> Dict[str, Any]:
        """
        Complete invoice processing pipeline with contract matching.
        
        Args:
            invoice_path: Path to invoice file

        Returns:
            Processing result with validation and decision, including contract match info
        """
        try:
            # 1. Parse invoice
            invoice_data = self.parse_invoice(invoice_path)
            
            # 2. Match invoice to contract
            contract_match = self.matcher.match_invoice_to_contract(invoice_data)
            
            if contract_match["status"] in ("MATCHED", "AMBIGUOUS"):
                # Load contract-specific rules
                contract_id = contract_match["contract_id"]
                self.current_contract_id = contract_id
                self.current_rules = self._load_contract_rules(contract_id)
                self.payment_terms = self._extract_payment_terms()
                
                logger.info(f"Invoice matched to contract {contract_id} via {contract_match['match_method']} (confidence: {contract_match['confidence']})")
            elif contract_match["status"] == "UNMATCHED":
                # No contract match - cannot validate
                logger.warning(f"Invoice {invoice_data.get('file')} could not be matched to any contract")
                return {
                    "invoice_file": invoice_data["file"],
                    "invoice_number": invoice_data.get("invoice_number"),
                    "status": "UNMATCHED",
                    "action": "Manual review required - no matching contract found",
                    "issues": [contract_match["matching_details"].get("reason", "No matching contract found")],
                    "warnings": [],
                    "invoice_data": invoice_data,
                    "contract_match": contract_match,
                    "validation_timestamp": datetime.now().isoformat(),
                }
            
            # 3. Validate against rules
            result = self.validate_invoice(invoice_data)
            
            # 4. Add contract match info to result
            result["contract_match"] = contract_match
            result["contract_id"] = contract_match.get("contract_id")
            result["match_method"] = contract_match.get("match_method")
            result["match_confidence"] = contract_match.get("confidence")
            
            return result

        except Exception as e:
            logger.error(f"Error processing invoice: {e}")
            return {
                "invoice_file": str(invoice_path),
                "status": "ERROR",
                "action": "System error - manual review required",
                "issues": [str(e)],
                "warnings": [],
                "validation_timestamp": datetime.now().isoformat(),
            }

    def batch_process(self, invoice_folder: str):
        """
        Process multiple invoices from a folder.

        Args:
            invoice_folder: Path to folder containing invoices

        Returns:
            Tuple of (results list, summary dict)
        """
        folder = Path(invoice_folder)
        if not folder.exists():
            raise FileNotFoundError(f"Folder not found: {invoice_folder}")

        results = []
        # Collect all invoice files and filter out temp/system files
        all_files = (
            list(folder.glob("*.pdf"))
            + list(folder.glob("*.png"))
            + list(folder.glob("*.jpg"))
        )
        # Filter out temp/system files (using helper function from Cell 8)
        invoice_files = [f for f in all_files if f.is_file() and is_valid_file(f)]

        logger.info(f"Processing {len(invoice_files)} invoices from {invoice_folder}")

        for invoice_file in invoice_files:
            result = self.process_invoice(str(invoice_file))
            results.append(result)

        # Generate summary
        summary = {
            "total": len(results),
            "approved": sum(1 for r in results if r["status"] == "APPROVED"),
            "flagged": sum(1 for r in results if r["status"] == "FLAGGED"),
            "rejected": sum(1 for r in results if r["status"] == "REJECTED"),
            "errors": sum(1 for r in results if r["status"] == "ERROR"),
            # process_invoice() returns UNMATCHED when no contract is found
            "unmatched": sum(1 for r in results if r["status"] == "UNMATCHED"),
        }
        return results, summary


print("[OK] InvoiceProcessor class defined")


[OK] InvoiceProcessor class defined
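
The payment-terms check in `validate_invoice` compares the stated due date with the invoice date plus the contract's net terms, and tolerates a difference of up to two days. As a minimal sketch, the same arithmetic can be tried outside the class. `due_date_issue` and its `tolerance_days` parameter are hypothetical names introduced only for this illustration; they are not part of `InvoiceProcessor`.

```python
from datetime import datetime, timedelta
from typing import Optional


def due_date_issue(
    invoice_date: datetime,
    due_date: datetime,
    payment_terms: int,
    tolerance_days: int = 2,  # assumption: mirrors the 2-day tolerance in validate_invoice
) -> Optional[str]:
    """Return an issue message when the due date is off by more than the tolerance."""
    expected_due = invoice_date + timedelta(days=payment_terms)
    if abs((due_date - expected_due).days) > tolerance_days:
        return (
            f"Due date mismatch: Expected {expected_due.strftime('%m/%d/%Y')}, "
            f"got {due_date.strftime('%m/%d/%Y')} (Net {payment_terms} terms)"
        )
    return None


invoice_date = datetime(2024, 1, 1)
# Net 30 from Jan 1 is Jan 31: exact match, no issue
on_time = due_date_issue(invoice_date, datetime(2024, 1, 31), payment_terms=30)
# Two days late is still inside the tolerance
within_tolerance = due_date_issue(invoice_date, datetime(2024, 2, 2), payment_terms=30)
# A Net 60 style due date on a Net 30 contract is reported
mismatch = due_date_issue(invoice_date, datetime(2024, 3, 1), payment_terms=30)

print(on_time)
print(within_tolerance)
print(mismatch)
```

Because the comparison uses an absolute difference, a due date that is too early is reported in the same way as one that is too late.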


In [None]:
# Cell 20.5: Test InvoiceContractMatcher

# Test the contract matching logic before integrating with InvoiceProcessor

print("=" * 80)
print("TESTING InvoiceContractMatcher")
print("=" * 80)

# Use REAL contracts from docs/contracts directory
# Check what contracts we actually have
from pathlib import Path
import re

contracts_dir = Path("docs/contracts")
all_contracts = list(contracts_dir.glob("*.pdf")) + list(contracts_dir.glob("*.docx"))
# Filter out temp/system files (using helper function from Cell 8)
real_contracts = [f for f in all_contracts if f.is_file() and is_valid_file(f)]
print(f"\nFound {len(real_contracts)} contract file(s) in docs/contracts/")
for contract in real_contracts:
    print(f"  - {contract.name}")

# Helper function to extract basic metadata from contract files
def extract_contract_metadata(contract_path: Path) -> Dict:
    """Extract parties, program codes, and dates from contract file."""
    metadata = {
        "contract_id": f"CONTRACT_{contract_path.stem.upper()}",
        "contract_path": str(contract_path),
        "parties": [],
        "program_code": "",
        "date_range": None,
    }
    
    # Try to parse contract content to extract metadata
    try:
        text = ""
        if contract_path.suffix.lower() == ".pdf":
            import pdfplumber
            with pdfplumber.open(contract_path) as pdf:
                for page in pdf.pages[:3]:  # First 3 pages usually have party info
                    page_text = page.extract_text()
                    if page_text:
                        text += page_text + "\n"
        elif contract_path.suffix.lower() == ".docx":  # python-docx cannot read legacy .doc files
            from docx import Document
            doc = Document(contract_path)
            text = "\n".join([para.text for para in doc.paragraphs[:20]])  # First 20 paragraphs
        
        text_upper = text.upper()
        
        # Extract program code from contract content first (more reliable)
        program_match = re.search(r'PROGRAM CODE:\s*([A-Z]{2,4})', text_upper)
        if program_match:
            metadata["program_code"] = program_match.group(1)
        else:
            # Fallback: scan filename tokens, split on non-letters
            # ("_" counts as a word character, so \b would not separate "BCH_CONTRACT")
            tokens = re.split(r"[^A-Z]+", contract_path.stem.upper())
            # Common words that are not program codes
            stop_words = {"FOR", "PDF", "SOW", "MSA", "THE", "NET"}
            for code in tokens:
                if 2 <= len(code) <= 4 and code not in stop_words:
                    metadata["program_code"] = code
                    break
        
        # Extract parties (common patterns)
        parties = set()
        if "BAYER" in text_upper:
            parties.add("BAYER")
        if "R4" in text_upper:  # also covers "R4 TECHNOLOGIES"
            parties.add("R4 Technologies")
        if "ACME" in text_upper:
            parties.add("ACME Corp")
        if "CLIENT INC" in text_upper:
            parties.add("Client Inc")
        
        # Extract dates from text (YYYY-MM-DD or YYYY/MM/DD)
        date_patterns = [
            r'\b(20\d{2}[-/]\d{2}[-/]\d{2})\b',
            r'\b(20\d{2})\b',  # Year only
        ]
        dates = []
        for pattern in date_patterns:
            matches = re.findall(pattern, text)
            dates.extend(matches)
        
        if dates:
            # Try to create date range from extracted dates
            years = [int(d[:4]) for d in dates if len(d) >= 4]
            if years:
                start_year = min(years)
                end_year = max(years) + 1  # Assume 1 year contract
                metadata["date_range"] = {
                    "start": f"{start_year}-01-01",
                    "end": f"{end_year}-12-31"
                }
        
        metadata["parties"] = list(parties)
        
    except Exception as e:
        logger.debug(f"Could not extract metadata from {contract_path.name}: {e}")
    
    return metadata

# Extract real metadata from each contract file
print("\n" + "-" * 80)
print("Extracting metadata from real contract files...")
print("-" * 80)

real_contracts_data = {
    "extracted_at": datetime.now().isoformat(),
    "contracts": []
}

for contract_file in real_contracts:
    print(f"\n📄 {contract_file.name}:")
    metadata = extract_contract_metadata(contract_file)
    
    # Add default rules (will be replaced with actual extracted rules later)
    metadata["extracted_at"] = datetime.now().isoformat()
    metadata["rules"] = [
        {
            "rule_id": "payment_terms",
            "type": "payment_term",
            "description": "Payment terms extracted from contract",
            "priority": "high"
        }
    ]
    
    real_contracts_data["contracts"].append(metadata)
    
    print(f"  Contract ID: {metadata['contract_id']}")
    print(f"  Parties: {metadata['parties']}")
    print(f"  Program Code: {metadata['program_code']}")
    print(f"  Date Range: {metadata['date_range']}")

if not real_contracts_data["contracts"]:
    print("\n⚠ No contracts found. Using sample data for testing...")
    # Fallback to sample data if no contracts found
    real_contracts_data = {
        "extracted_at": datetime.now().isoformat(),
        "contracts": [
            {
                "contract_id": "CONTRACT_NET30",
                "contract_path": "docs/contracts/sample_contract_net30.pdf",
                "parties": ["BAYER", "R4 Technologies"],
                "program_code": "BCH",
                "date_range": {"start": "2021-01-01", "end": "2023-12-31"},
                "extracted_at": datetime.now().isoformat(),
                "rules": [{"rule_id": "payment_terms", "type": "payment_term", "description": "Net 30 days", "priority": "high"}]
            }
        ]
    }

matcher = InvoiceContractMatcher(real_contracts_data)

# Test with REAL invoice files
print("\n" + "-" * 80)
print("TEST 1: Testing with Real Invoice Files")
print("-" * 80)

# Find real invoice files
invoice_dir = Path("docs/invoices")
all_invoice_files = (
    list(invoice_dir.glob("*.pdf")) + 
    list(invoice_dir.glob("*.docx")) + 
    list(invoice_dir.glob("*.png"))
)
# Filter out temp/system files
real_invoice_files = [f for f in all_invoice_files if f.is_file() and is_valid_file(f)]

print(f"\nFound {len(real_invoice_files)} real invoice file(s):")
for inv_file in real_invoice_files[:5]:  # Show first 5
    print(f"  - {inv_file.name}")

if real_invoice_files:
    # Initialize a temporary processor just for parsing invoices
    try:
        processor_temp = InvoiceProcessor()
    except Exception:
        # If processor not initialized, create minimal parser
        print("\n⚠ InvoiceProcessor not initialized. Using basic parsing...")
        processor_temp = None
    
    print("\n" + "-" * 80)
    print("Testing each real invoice file:")
    print("-" * 80)
    
    for inv_file in real_invoice_files[:5]:  # Test first 5 files
        try:
            print(f"\n📄 {inv_file.name}:")
            
            # Parse the invoice
            if processor_temp:
                invoice_data = processor_temp.parse_invoice(str(inv_file))
            else:
                # Basic parsing fallback
                print("  ⚠ Skipping - InvoiceProcessor not available")
                continue
            
            # Match to contract
            match_result = matcher.match_invoice_to_contract(invoice_data)
            
            # Display results
            status_icon = {
                "MATCHED": "✓",
                "AMBIGUOUS": "⚠",
                "UNMATCHED": "✗"
            }.get(match_result['status'], "?")
            
            print(f"  {status_icon} Status: {match_result['status']}")
            print(f"     Contract: {match_result['contract_id']}")
            print(f"     Method: {match_result['match_method']}")
            print(f"     Confidence: {match_result['confidence']}")
            if match_result.get('matching_details'):
                details = match_result['matching_details']
                if details.get('po_number'):
                    print(f"     PO: {details['po_number']}")
                if details.get('vendor'):
                    print(f"     Vendor: {details['vendor']}")
                    
        except Exception as e:
            print(f"  ✗ Error processing {inv_file.name}: {e}")
            import traceback
            traceback.print_exc()
else:
    print("⚠ No invoice files found in docs/invoices/")
    print("Using simulated data for testing...")
    
    # Fallback to simulated tests if no real files
    print("\n" + "-" * 80)
    print("TEST 1: PO Number Matching (Simulated)")
    print("-" * 80)
    
    invoice_with_po = {
        "file": "invoice_001.pdf",
        "invoice_number": "INV-001",
        "po_number": "PO-2021-1234",
        "vendor_name": "R4 Technologies",
        "invoice_date": datetime(2022, 6, 15),
        "raw_text": "INVOICE\nR4 Technologies\nPO Number: PO-2021-1234\nAmount: $1000"
    }
    
    match_result = matcher.match_invoice_to_contract(invoice_with_po)
    print(f"  Status: {match_result['status']}")
    print(f"  Contract ID: {match_result['contract_id']}")
    print(f"  Match Method: {match_result['match_method']}")

# Test 2: Test with actual extracted_rules.json (if exists)
print("\n" + "-" * 80)
print("TEST 2: Test with Actual extracted_rules.json")
print("-" * 80)

try:
    with open("extracted_rules.json", "r") as f:
        actual_rules_data = json.load(f)
    
    # Convert old format to new format if needed
    if isinstance(actual_rules_data, list):
        print("Converting very old format (list) to new format...")
        actual_rules_data = {
            "version": "2.0",
            "extracted_at": datetime.now().isoformat(),
            "contracts": [
                {
                    "contract_id": "CONTRACT_DEFAULT",
                    "contract_path": "docs/contracts/sample_contract_net30.pdf",
                    "parties": [],
                    "program_code": "",
                    "date_range": None,
                    "extracted_at": datetime.now().isoformat(),
                    "rules": actual_rules_data
                }
            ]
        }
        print("✓ Converted list format to new format")
    elif isinstance(actual_rules_data, dict):
        if "contract_path" in actual_rules_data and "contracts" not in actual_rules_data:
            print("Converting old format (single contract) to new format...")
            actual_rules_data = {
                "version": "2.0",
                "extracted_at": actual_rules_data.get("extracted_at", datetime.now().isoformat()),
                "contracts": [
                    {
                        "contract_id": "CONTRACT_DEFAULT",
                        "contract_path": actual_rules_data.get("contract_path", "docs/contracts/sample_contract_net30.pdf"),
                        "parties": [],
                        "program_code": "",
                        "date_range": None,
                        "extracted_at": actual_rules_data.get("extracted_at", datetime.now().isoformat()),
                        "rules": actual_rules_data.get("rules", [])
                    }
                ]
            }
            print("✓ Converted single contract format to new format")
        else:
            print("✓ Already in new format")
    
    actual_matcher = InvoiceContractMatcher(actual_rules_data)
    print(f"✓ Loaded {len(actual_rules_data.get('contracts', []))} contract(s)")
    
    # Test with a real invoice file if available
    if real_invoice_files and processor_temp:
        test_invoice_file = real_invoice_files[0]
        print(f"\nTesting with real invoice: {test_invoice_file.name}")
        test_invoice_data = processor_temp.parse_invoice(str(test_invoice_file))
        test_match = actual_matcher.match_invoice_to_contract(test_invoice_data)
        print(f"  Status: {test_match['status']}")
        print(f"  Contract ID: {test_match['contract_id']}")
    else:
        # Fallback to simulated test
        test_invoice = {
            "file": "test_invoice.pdf",
            "invoice_number": "TEST-001",
            "po_number": None,
            "vendor_name": "Test Vendor",
            "invoice_date": datetime.now(),
            "raw_text": "TEST INVOICE"
        }
        test_match = actual_matcher.match_invoice_to_contract(test_invoice)
        print(f"\nTest Match Result (simulated):")
        print(f"  Status: {test_match['status']}")
        print(f"  Contract ID: {test_match['contract_id']}")
    
except FileNotFoundError:
    print("⚠ extracted_rules.json not found - skipping test with actual data")
except Exception as e:
    print(f"⚠ Error testing with actual data: {e}")
    import traceback
    traceback.print_exc()

print("\n" + "=" * 80)
print("MATCHER TESTING COMPLETE")
print("=" * 80)



TESTING InvoiceContractMatcher

Found 4 contract file(s) in docs/contracts/
  - sample_contract_net60.pdf
  - sample_contract_net30.pdf
  - sample_contract_net60.docx
  - sample_contract_net30.docx

--------------------------------------------------------------------------------
TEST 1: PO Number Matching
--------------------------------------------------------------------------------

Invoice PO: PO-2021-1234
Invoice Vendor: R4 Technologies

Match Result:
  Status: MATCHED
  Contract ID: CONTRACT_NET30
  Match Method: VENDOR_DATE
  Confidence: 0.8
  Details: {'po_number': 'PO-2021-1234', 'vendor': 'R4 Technologies', 'invoice_date': '2022-06-15 00:00:00', 'contract_id': 'CONTRACT_NET30'}

--------------------------------------------------------------------------------
TEST 2: Vendor + Program Code Matching
--------------------------------------------------------------------------------

Invoice Vendor: R4 Technologies
Invoice Text contains: BCH

Match Result:
  Status: MATCHED
  Contra

## Usage: Process Invoices with Extracted Rules

After extracting rules from contracts (Part 1), use these cells to process invoices:

- **Cell 21:** Initialize Invoice Processor (loads rules from `extracted_rules.json`)
- **Cell 22:** Process a single invoice file (optional - for testing/debugging)
  - Useful for: Testing specific invoices, debugging issues, learning how validation works
  - Shows detailed output for one invoice
- **Cell 23:** Batch process all invoices (recommended for production)
  - Processes all invoices in `docs/invoices/` directory
  - Generates summary statistics and saves results to JSON
  - Use this for normal workflow
- **Cell 24:** Generate a summary report of all processed invoices
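
The summary statistics produced by batch processing amount to a tally of result statuses. The sketch below shows one way to compute such a tally with `collections.Counter`. `summarize_results` is a hypothetical helper, and the `unmatched` and `approval_rate` keys are assumptions added for illustration; the notebook's own summary is built inside `batch_process`.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_results(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Tally processing results by status (hypothetical helper)."""
    counts = Counter(r["status"] for r in results)
    total = len(results)
    return {
        "total": total,
        "approved": counts["APPROVED"],
        "flagged": counts["FLAGGED"],
        "rejected": counts["REJECTED"],
        "errors": counts["ERROR"],
        # Assumption: invoices without a matching contract are counted separately
        "unmatched": counts["UNMATCHED"],
        # Guard against division by zero for an empty folder
        "approval_rate": (counts["APPROVED"] / total * 100) if total else 0.0,
    }


sample_results = [
    {"invoice_file": "a.pdf", "status": "APPROVED"},
    {"invoice_file": "b.pdf", "status": "APPROVED"},
    {"invoice_file": "c.pdf", "status": "REJECTED"},
    {"invoice_file": "d.pdf", "status": "UNMATCHED"},
]
summary_example = summarize_results(sample_results)
print(summary_example)
```

A `Counter` returns 0 for statuses that never occur, so no key needs special handling.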

### Recommendation:
- **For production:** Use Cell 23 (batch processing) - it's more efficient and provides summary statistics
- **For testing/debugging:** Use Cell 22 (single invoice) - easier to see detailed output for one file

### Validation Status:
- **APPROVED** - Invoice meets all requirements
- **FLAGGED** - Invoice has warnings but may be acceptable
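- **REJECTED** - Invoice fails critical requirements (e.g., missing PO number)
- **UNMATCHED** - No matching contract was found, so the invoice cannot be validated
- **ERROR** - Processing failed with a system error; manual review required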
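
APPROVED, FLAGGED, and REJECTED follow from the two lists that validation collects: any issue rejects the invoice, warnings alone flag it, and an invoice with neither is approved. A minimal sketch of that decision, with `decide_status` as a hypothetical standalone function that mirrors the logic in `validate_invoice`:

```python
from typing import List, Tuple


def decide_status(issues: List[str], warnings: List[str]) -> Tuple[str, str]:
    """Map collected issues and warnings to a (status, action) pair."""
    if issues:
        # Issues take precedence over warnings
        return "REJECTED", "Manual review required"
    if warnings:
        return "FLAGGED", "Review recommended"
    return "APPROVED", "Auto-approved for payment"


clean = decide_status([], [])
overdue = decide_status([], ["Invoice is 12 days overdue"])
missing_po = decide_status(
    ["Missing required field: po_number"], ["Invoice is 12 days overdue"]
)
print(clean, overdue, missing_po, sep="\n")
```

Issues are checked first, so an invoice that has both issues and warnings is rejected.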

In [None]:
# Cell 21: Initialize Invoice Processor (with robust error handling)

import os

# Check if rules file exists and is valid
rules_file = "extracted_rules.json"

# Default rules, written when the rules file is missing, empty, or not valid JSON
default_rules = [
    {
        "rule_id": "payment_terms",
        "type": "payment_term",
        "description": "Payment terms: Net 30 days from invoice date. All invoices must include a valid Purchase Order (PO) number.",
        "priority": "high",
        "confidence": "high",
    },
    {
        "rule_id": "submission_requirements",
        "type": "submission",
        "description": "All invoices must include: Valid PO number (format: PO-YYYY-####), Invoice date and due date, Vendor tax identification number",
        "priority": "medium",
        "confidence": "high",
    },
    {
        "rule_id": "late_penalties",
        "type": "penalty",
        "description": "Late payment penalty: 1.5% per month on overdue balance. Missing PO number: Automatic rejection.",
        "priority": "high",
        "confidence": "high",
    },
]


def write_default_rules(path: str) -> None:
    """Write the default rules to the given path as JSON."""
    print("\nCreating default rules file...")
    with open(path, "w") as f:
        json.dump(default_rules, f, indent=2)
    print(f"[OK] Created {path} with {len(default_rules)} default rules")


if not os.path.exists(rules_file):
    print(f"[WARN] Rules file not found: {rules_file}")
    write_default_rules(rules_file)
else:
    # Check if file is empty or invalid
    try:
        with open(rules_file, "r") as f:
            content = f.read().strip()
        if not content:
            raise ValueError("File is empty")
        # Try to parse JSON
        json.loads(content)
    except (ValueError, json.JSONDecodeError) as e:
        print(f"[WARN] Invalid JSON in {rules_file}: {e}")
        write_default_rules(rules_file)

# Now initialize processor
try:
    processor = InvoiceProcessor(rules_file=rules_file)

    # Display loaded rules
    print("\n" + "=" * 60)
    print("Loaded Contract Rules:")
    print("=" * 60)
    for rule in processor.rules:
        print(f"\n[{rule['type'].upper()}] - Priority: {rule['priority']}")
        print(f"Description: {rule['description'][:100]}...")

    if processor.payment_terms:
        print(f"\n[OK] Payment Terms: Net {processor.payment_terms} days")
    else:
        print("\n[WARN] No payment terms found in rules")

    print("\n[OK] Invoice Processor ready")

except Exception as e:
    print(f"[ERROR] Error initializing processor: {e}")
    print("\nTroubleshooting:")
    print("  1. Run Cell 15 to extract rules from contract")
    print("  2. Or run Generate_Sample_Documents.ipynb to create sample documents first")
    print("  3. Or run Cell 28 for complete pipeline test")
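
Cell 21 writes its default rules as a bare JSON list, the layout that Cell 20.5 calls the "very old format" and converts before building a matcher. The sketch below gathers that conversion into one function. `normalize_rules_data` and `DEFAULT_CONTRACT_PATH` are hypothetical names, and treating every dict without a `contracts` key as a single-contract file is an assumption made for this illustration.

```python
from datetime import datetime
from typing import Any, Dict, List, Union

# Assumption: same placeholder path that Cell 20.5 uses for legacy files
DEFAULT_CONTRACT_PATH = "docs/contracts/sample_contract_net30.pdf"


def normalize_rules_data(data: Union[List[dict], Dict[str, Any]]) -> Dict[str, Any]:
    """Convert legacy rules layouts to the multi-contract layout (hypothetical helper)."""
    if isinstance(data, dict) and "contracts" in data:
        return data  # already in the multi-contract layout

    if isinstance(data, list):
        # Very old layout: a bare list of rules
        rules, contract_path, extracted_at = data, DEFAULT_CONTRACT_PATH, None
    elif isinstance(data, dict):
        # Old layout: a single contract with its rules
        rules = data.get("rules", [])
        contract_path = data.get("contract_path", DEFAULT_CONTRACT_PATH)
        extracted_at = data.get("extracted_at")
    else:
        raise TypeError(f"Unsupported rules data type: {type(data).__name__}")

    extracted_at = extracted_at or datetime.now().isoformat()
    return {
        "version": "2.0",
        "extracted_at": extracted_at,
        "contracts": [
            {
                "contract_id": "CONTRACT_DEFAULT",
                "contract_path": contract_path,
                "parties": [],
                "program_code": "",
                "date_range": None,
                "extracted_at": extracted_at,
                "rules": rules,
            }
        ],
    }


legacy_rules = [{"rule_id": "payment_terms", "type": "payment_term"}]
normalized = normalize_rules_data(legacy_rules)
print(normalized["contracts"][0]["contract_id"], len(normalized["contracts"][0]["rules"]))
```

Data that already carries a `contracts` key is returned unchanged, so the function is safe to call more than once.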


In [None]:
# Cell 22: Process a Single Invoice

# NOTE: This cell is OPTIONAL - for testing/debugging individual invoices
# For production use, skip to Cell 23 (Batch Processing) which processes all invoices
# Use this cell when you need to:
#   - Test a specific invoice
#   - Debug validation issues
#   - See detailed output for one invoice


# Process a single invoice file
invoice_file = "docs/invoices/invoice_005_ocr_valid.png"  # Change to your invoice file

try:
    result = processor.process_invoice(invoice_file)

    # Display results
    print("=" * 70)
    print("INVOICE VALIDATION RESULT")
    print("=" * 70)
    print(f"\nInvoice File: {result['invoice_file']}")
    print(f"Invoice Number: {result.get('invoice_number', 'N/A')}")
    print(f"\nStatus: {result['status']}")
    print(f"Action: {result['action']}")

    if result["issues"]:
        print(f"\n[FAIL] ISSUES ({len(result['issues'])}):")
        for i, issue in enumerate(result["issues"], 1):
            print(f"  {i}. {issue}")

    if result["warnings"]:
        print(f"\n[WARN] WARNINGS ({len(result['warnings'])}):")
        for i, warning in enumerate(result["warnings"], 1):
            print(f"  {i}. {warning}")

    if result["status"] == "APPROVED":
        print("\n[OK] Invoice approved for payment")

    print(f"\nValidation Timestamp: {result['validation_timestamp']}")
    print("=" * 70)

    # Display invoice data
    if "invoice_data" in result:
        print("\nExtracted Invoice Data:")
        inv_data = result["invoice_data"]
        print(f"  Invoice Date: {inv_data.get('invoice_date', 'N/A')}")
        print(f"  Due Date: {inv_data.get('due_date', 'N/A')}")
        total_amount = inv_data.get("total_amount")
        if total_amount:
            print(f"  Total Amount: ${total_amount:.2f}")
        else:
            print("  Total Amount: N/A")
        print(f"  PO Number: {inv_data.get('po_number', 'N/A')}")
        print(f"  Vendor: {inv_data.get('vendor_name', 'N/A')}")

except FileNotFoundError:
    print(f"[WARN] Invoice file not found: {invoice_file}")
    print("Please create sample documents first (run Generate_Sample_Documents.ipynb)")
except Exception as e:
    print(f"[ERROR] Error processing invoice: {e}")


In [None]:
# Cell 23: Batch Process Multiple Invoices

# Process multiple invoices from a folder
invoice_folder = "docs/invoices"  # Change to your invoices folder

try:
    results, summary = processor.batch_process(invoice_folder)

    # Display summary
    print("=" * 70)
    print("BATCH PROCESSING SUMMARY")
    print("=" * 70)
    print(f"\nTotal Invoices Processed: {summary['total']}")
    print(f"\n[OK] Approved: {summary['approved']}")
    print(f"[WARN] Flagged for Review: {summary['flagged']}")
    print(f"[FAIL] Rejected: {summary['rejected']}")
    print(f"[ERROR] Errors: {summary['errors']}")

    # Calculate approval rate
    if summary["total"] > 0:
        approval_rate = (summary["approved"] / summary["total"]) * 100
        print(f"\nApproval Rate: {approval_rate:.1f}%")

    # Display individual results
    print("\n" + "=" * 70)
    print("INDIVIDUAL INVOICE RESULTS")
    print("=" * 70)

    for i, result in enumerate(results, 1):
        status_icon = {
            "APPROVED": "[OK]",
            "FLAGGED": "[WARN]",
            "REJECTED": "[FAIL]",
            "ERROR": "[ERROR]",
        }.get(result["status"], "[?]")

        print(f"\n{i}. {status_icon} {result['invoice_file']}")
        print(f"   Status: {result['status']} - {result['action']}")

        if result["issues"]:
            print(f"   Issues: {', '.join(result['issues'][:2])}")
        if result["warnings"]:
            print(f"   Warnings: {', '.join(result['warnings'][:2])}")

    # Save results to JSON
    output_file = "invoice_processing_results.json"
    with open(output_file, "w") as f:
        json.dump(
            {
                "summary": summary,
                "results": results,
                "processed_at": datetime.now().isoformat(),
            },
            f,
            indent=2,
            default=str,
        )

    print(f"\n[OK] Results saved to {output_file}")

except FileNotFoundError:
    print(f"[WARN] Invoice folder not found: {invoice_folder}")
    print("Please create sample documents first (run Generate_Sample_Documents.ipynb)")
except Exception as e:
    print(f"[FAIL] Error in batch processing: {e}")


In [None]:
# Cell 24: Generate Processing Report


def generate_processing_report(results_file: str = "invoice_processing_results.json"):
    """Generate a detailed processing report with statistics and insights."""

    try:
        with open(results_file, "r") as f:
            data = json.load(f)

        summary = data["summary"]
        results = data["results"]

        print("=" * 80)
        print("INVOICE PROCESSING REPORT")
        print("=" * 80)
        print(f"\nGenerated: {data.get('processed_at', 'N/A')}")

        # Overall Statistics
        print("\nOVERALL STATISTICS")
        print("-" * 80)
        print(f"Total Invoices: {summary['total']}")
        print(
            f"Approved: {summary['approved']} ({summary['approved']/max(summary['total'],1)*100:.1f}%)"
        )
        print(
            f"Flagged: {summary['flagged']} ({summary['flagged']/max(summary['total'],1)*100:.1f}%)"
        )
        print(
            f"Rejected: {summary['rejected']} ({summary['rejected']/max(summary['total'],1)*100:.1f}%)"
        )
        print(
            f"Errors: {summary['errors']} ({summary['errors']/max(summary['total'],1)*100:.1f}%)"
        )

        # Most common issues and warnings (same tally logic for both)
        from collections import Counter

        for title, key in (("ISSUES", "issues"), ("WARNINGS", "warnings")):
            print(f"\nMOST COMMON {title}")
            print("-" * 80)
            counts = Counter(
                entry for result in results for entry in result.get(key, [])
            )
            if counts:
                for entry, count in counts.most_common(5):
                    print(f"  • {entry}: {count} occurrence(s)")
            else:
                print(f"  No {key} found")

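        # Recommended Actions
        print("\nRECOMMENDED ACTIONS")
        print("-" * 80)
        # Collect actions first so the numbering stays sequential
        actions = []
        if summary["rejected"] > 0:
            actions.append(f"Review {summary['rejected']} rejected invoice(s) manually")
        if summary["flagged"] > 0:
            actions.append(f"Investigate {summary['flagged']} flagged invoice(s)")
        if summary["errors"] > 0:
            actions.append(f"Fix processing errors for {summary['errors']} invoice(s)")
        for n, action in enumerate(actions, 1):
            print(f"  {n}. {action}")
        # Guard against an empty batch, where approved == total == 0
        if summary["total"] > 0 and summary["approved"] == summary["total"]:
            print("  [OK] All invoices approved - ready for payment processing")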

        print("\n" + "=" * 80)

    except FileNotFoundError:
        print(f"[WARN] Results file not found: {results_file}")
        print("Please run batch processing first (Cell 23)")
    except Exception as e:
        print(f"[FAIL] Error generating report: {e}")


# Run the report if results exist
generate_processing_report()


## Summary: Complete AI Agent Pipeline

This notebook provides a complete end-to-end invoice processing solution:

1. **Rule Extraction** (Cells 14-17) - Extract rules from contracts using RAG
2. **Invoice Processing** (Cells 19-24) - Validate invoices against rules
3. **Complete Pipeline** (Cell 28) - Run both steps together
4. **Reporting** (Cell 29) - Export results to JSON reports

### Key Files Generated:
- `extracted_rules.json` - Extracted invoice processing rules
- `invoice_processing_results.json` - Processing results and validation status
- `validation_report.json` - Detailed validation report
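
As a rough illustration of the multi-contract layout used by `extracted_rules.json` (a top-level `contracts` list whose entries carry `contract_id` and `rules`), the sketch below counts rules per contract. The helper name `count_rules_per_contract` is hypothetical and not part of the notebook; the field names are taken from the pipeline cell, and the inline sample stands in for the real file.

```python
import json
from typing import Dict


def count_rules_per_contract(rules_data: dict) -> Dict[str, int]:
    """Map each contract_id to the number of rules stored for it."""
    counts = {}
    for contract in rules_data.get("contracts", []):
        contract_id = contract.get("contract_id", "Unknown")
        counts[contract_id] = len(contract.get("rules", []))
    return counts


# Inline sample mirroring the file layout; in the notebook this would come
# from json.load() on extracted_rules.json instead.
sample = json.loads("""
{
  "contracts": [
    {"contract_id": "CONTRACT_A", "rules": [{"rule_id": "payment_terms"}]},
    {"contract_id": "CONTRACT_B", "rules": []}
  ]
}
""")
print(count_rules_per_contract(sample))
```

A contract that failed extraction is still listed with an empty `rules` list, so a count of zero is a quick way to spot it.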

### Sample Documents:
Sample contracts and invoices are automatically generated if the `docs/` directories are empty (see Cell 8).

# Cell 26: Sample Document Generation

**Note:** Sample document generation has been moved to a separate notebook.

**To generate sample documents:**
- Run the notebook: **`Generate_Sample_Documents.ipynb`**
- Or let the main notebook auto-generate them (check runs at startup)

The main notebook automatically checks for sample documents at startup (Cell 8) and will prompt you to generate them if the folders are empty.
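
A minimal sketch of what such a startup check might look like, assuming it only needs to know whether a folder holds at least one supported file. The function name `folder_has_documents` and the default suffix list are assumptions for illustration; the actual check in Cell 8 may differ.

```python
from pathlib import Path
from typing import Iterable


def folder_has_documents(folder: str, suffixes: Iterable[str] = (".pdf", ".docx")) -> bool:
    """Return True if the folder exists and holds at least one file with a listed suffix."""
    path = Path(folder)
    if not path.is_dir():
        return False
    allowed = {s.lower() for s in suffixes}
    return any(p.is_file() and p.suffix.lower() in allowed for p in path.iterdir())


# Report the state of the two sample-document folders used by the notebook
for folder in ("docs/contracts", "docs/invoices"):
    state = "has documents" if folder_has_documents(folder) else "is empty or missing"
    print(f"{folder} {state}")
```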

## Test the Complete Pipeline

**Cell 28** runs the complete pipeline:
1. Extracts rules from contract documents
2. Processes all invoices in `docs/invoices/`
3. Validates invoices against extracted rules
4. Generates comprehensive results

This is the recommended way to test the entire system end-to-end.
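
To make the validation step more concrete, here is a small sketch of two checks in the spirit of the notebook's fallback rules (a PO number in the `PO-YYYY-####` format and Net 30 payment terms). It is not the `InvoiceProcessor` implementation; the function name `check_invoice`, the issue messages, and the exact comparison are assumptions.

```python
import re
from datetime import date, timedelta

# PO-YYYY-#### as described in the fallback submission rule
PO_PATTERN = re.compile(r"^PO-\d{4}-\d{4}$")


def check_invoice(po_number, invoice_date: date, due_date: date, net_days: int = 30):
    """Return a list of issue strings; an empty list means both checks passed."""
    issues = []
    if not po_number or not PO_PATTERN.match(po_number):
        issues.append("Missing or malformed PO number")
    # Assumption: a due date later than invoice date + net_days violates the terms
    if due_date > invoice_date + timedelta(days=net_days):
        issues.append(f"Due date exceeds Net {net_days} terms")
    return issues


print(check_invoice("PO-2025-0042", date(2025, 1, 1), date(2025, 1, 31)))
print(check_invoice(None, date(2025, 1, 1), date(2025, 3, 1)))
```

Returning a list of issues rather than a boolean mirrors the `issues` field that the processing results carry, so each failed check can be reported individually.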

In [None]:
# Cell 28: Complete RAG Pipeline Test - Extract Rules from ALL Contracts and Process Invoices
from datetime import datetime

# Temporarily reduce logging noise for cleaner output
import logging

old_level = logging.getLogger().level
logging.getLogger().setLevel(
    logging.ERROR
)  # Only show errors (suppresses INFO and WARNING)

print("=" * 80)
print("=" * 80)
print("COMPLETE RAG PIPELINE TEST - MULTI-CONTRACT SUPPORT")
print("=" * 80)
print("=" * 80)

# Step 1: Extract rules from ALL contracts using RAG
print("\nStep 1: Extracting rules from ALL contracts using RAG...")
print("-" * 80)

# Helper function to extract contract metadata (reused from Cell 21)
def extract_contract_metadata(contract_path: Path) -> Dict:
    """Extract parties, program codes, and dates from contract file."""
    metadata = {
        "contract_id": f"CONTRACT_{contract_path.stem.upper()}",
        "contract_path": str(contract_path),
        "parties": [],
        "program_code": "",
        "date_range": None,
    }
    
    # Try to parse contract content to extract metadata
    try:
        text = ""
        if contract_path.suffix.lower() == ".pdf":
            import pdfplumber
            with pdfplumber.open(contract_path) as pdf:
                for page in pdf.pages[:3]:  # First 3 pages usually have party info
                    page_text = page.extract_text()
                    if page_text:
                        text += page_text + "\n"
        elif contract_path.suffix.lower() == ".docx":  # python-docx cannot read legacy .doc files
            from docx import Document
            doc = Document(contract_path)
            text = "\n".join([para.text for para in doc.paragraphs[:20]])  # First 20 paragraphs
        
        text_upper = text.upper()
        
        # Extract program code from contract content first (more reliable)
        program_match = re.search(r'PROGRAM CODE:\s*([A-Z]{2,4})', text_upper)
        if program_match:
            metadata["program_code"] = program_match.group(1)
        else:
            # Fallback: take the first 2-4 letter token in the filename that is
            # not a common word. Lookarounds replace \b because "_" counts as a
            # word character and would hide the code in "CONTRACT_ABC.PDF".
            filename_upper = contract_path.name.upper()
            stop_words = {"FOR", "PDF", "DOC", "DOCX", "SOW", "MSA", "THE", "NET"}
            for code in re.findall(r'(?<![A-Z0-9])([A-Z]{2,4})(?![A-Z0-9])', filename_upper):
                if code not in stop_words:
                    metadata["program_code"] = code
                    break
        
        # Extract parties (common patterns - update based on actual contract content)
        parties = set()
        # Look for common party patterns in text
        # Pattern: "BETWEEN:" or "AND:" followed by company names
        party_patterns = [
            r'(?:BETWEEN|AND):\s*([A-Z][^,\n]+(?:Inc\.|LLC|Ltd\.|Corp\.|Corporation|Company))',
            r'(?:Client|Vendor):\s*([A-Z][^,\n]+(?:Inc\.|LLC|Ltd\.|Corp\.|Corporation|Company))',
        ]
        for pattern in party_patterns:
            matches = re.findall(pattern, text, re.IGNORECASE)
            for match in matches:
                party = match.strip()
                if len(party) > 3:
                    parties.add(party)
        
        # Extract dates from text (YYYY-MM-DD or YYYY/MM/DD)
        date_patterns = [
            r'\b(20\d{2}[-/]\d{2}[-/]\d{2})\b',
            r'\b(20\d{2})\b',  # Year only
        ]
        dates = []
        for pattern in date_patterns:
            matches = re.findall(pattern, text)
            dates.extend(matches)
        
        if dates:
            # Try to create date range from extracted dates
            years = [int(d[:4]) for d in dates if len(d) >= 4]
            if years:
                start_year = min(years)
                end_year = max(years) + 1  # Assume 1 year contract
                metadata["date_range"] = {
                    "start": f"{start_year}-01-01",
                    "end": f"{end_year}-12-31"
                }
        
        metadata["parties"] = list(parties)
        
    except Exception as e:
        logger.debug(f"Could not extract metadata from {contract_path.name}: {e}")
    
    return metadata

# Find all contract files
contracts_dir = Path("docs/contracts")
all_contracts = list(contracts_dir.glob("*.pdf")) + list(contracts_dir.glob("*.docx"))
# Filter out temp/system files
contract_files = [f for f in all_contracts if f.is_file() and is_valid_file(f)]

if not contract_files:
    print("[WARN] No contract files found in docs/contracts/")
    print("Using fallback rules...")
    # Fallback to multi-contract format
    all_contracts_data = {
        "extracted_at": datetime.now().isoformat(),
        "contracts": [
            {
                "contract_id": "CONTRACT_DEFAULT",
                "contract_path": "Unknown",
                "parties": [],
                "program_code": "",
                "date_range": None,
                "extracted_at": datetime.now().isoformat(),
                "rules": [
                    {
                        "rule_id": "payment_terms",
                        "type": "payment_term",
                        "description": "Payment terms: Net 30 days from invoice date. All invoices must include a valid Purchase Order (PO) number.",
                        "priority": "high",
                        "confidence": "high",
                    },
                    {
                        "rule_id": "submission_requirements",
                        "type": "submission",
                        "description": "All invoices must include: Valid PO number (format: PO-YYYY-####), Invoice date and due date, Vendor tax identification number",
                        "priority": "medium",
                        "confidence": "high",
                    },
                    {
                        "rule_id": "late_penalties",
                        "type": "penalty",
                        "description": "Late payment penalty: 1.5% per month on overdue balance. Missing PO number: Automatic rejection.",
                        "priority": "high",
                        "confidence": "high",
                    },
                ]
            }
        ]
    }
    with open("extracted_rules.json", "w") as f:
        json.dump(all_contracts_data, f, indent=2)
    print(f"Created {len(all_contracts_data['contracts'][0]['rules'])} fallback rules")
else:
    print(f"Found {len(contract_files)} contract file(s) to process")
    
    # Initialize RAG-powered agent
    rag_agent = InvoiceRuleExtractorAgent(llm=llm, embeddings=embeddings)
    
    # Process each contract
    all_contracts_data = {
        "extracted_at": datetime.now().isoformat(),
        "contracts": []
    }
    
    for contract_file in contract_files:
        try:
            print(f"\nProcessing: {contract_file.name}")
            
            # Extract metadata
            metadata = extract_contract_metadata(contract_file)
            print(f"  Contract ID: {metadata['contract_id']}")
            print(f"  Parties: {metadata['parties']}")
            print(f"  Program Code: {metadata['program_code']}")
            print(f"  Date Range: {metadata['date_range']}")
            
            # Extract rules using RAG
            print("  Extracting rules... (this takes ~4-5 seconds)")
            with redirect_stderr(io.StringIO()):
                rules = rag_agent.run(str(contract_file))
            
            # Combine metadata with rules
            contract_data = {
                "contract_id": metadata["contract_id"],
                "contract_path": metadata["contract_path"],
                "parties": metadata["parties"],
                "program_code": metadata["program_code"],
                "date_range": metadata["date_range"],
                "extracted_at": datetime.now().isoformat(),
                "rules": rules
            }
            
            all_contracts_data["contracts"].append(contract_data)
            print(f"  ✓ Extracted {len(rules)} rules")
            
        except Exception as e:
            print(f"  ✗ Error processing {contract_file.name}: {e}")
            # Add contract with empty rules
            metadata = extract_contract_metadata(contract_file)
            all_contracts_data["contracts"].append({
                "contract_id": metadata["contract_id"],
                "contract_path": metadata["contract_path"],
                "parties": metadata["parties"],
                "program_code": metadata["program_code"],
                "date_range": metadata["date_range"],
                "extracted_at": datetime.now().isoformat(),
                "rules": []
            })
    
    # Save all contracts in multi-contract format
    with open("extracted_rules.json", "w") as f:
        json.dump(all_contracts_data, f, indent=2)
    
    print(f"\n✓ Extracted rules from {len(all_contracts_data['contracts'])} contract(s)")
    print("✓ Saved to extracted_rules.json (multi-contract format)")

# Step 2: Process sample invoices
print("\nStep 2: Processing sample invoices...")
print("-" * 80)

try:
    # Initialize invoice processor
    processor = InvoiceProcessor(rules_file="extracted_rules.json")

    # Process all invoices found in the invoices directory
    # Scan invoices directory for all invoice files (excluding temp/system files)
    invoice_dir = Path("docs/invoices")
    all_invoice_files = list(invoice_dir.glob("*.*"))
    # Filter out temp/system files and only keep valid invoice formats
    invoice_files = [
        str(f) for f in all_invoice_files
        if f.is_file() 
        and is_valid_file(f)
        and f.suffix.lower() in [".pdf", ".docx", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"]
    ]

    if not invoice_files:
        print("[WARN] No invoice files found in docs/invoices/")
        print("Please add invoice files or run Generate_Sample_Documents.ipynb")
    else:
        print(f"Found {len(invoice_files)} invoice file(s) to process")
        print(f"Files: {[Path(f).name for f in invoice_files]}")

        results = []
        for invoice_file in invoice_files:
            try:
                result = processor.process_invoice(invoice_file)
                results.append(result)

                # Display result
                status_icon = {
                    "APPROVED": "[OK]",
                    "FLAGGED": "[WARN]",
                    "REJECTED": "[FAIL]",
                    "ERROR": "[ERROR]",
                }.get(result["status"], "[?]")

                print(f"\n{status_icon} {Path(invoice_file).name}:")
                print(f"   Status: {result['status']}")
                print(f"   Action: {result['action']}")
                
                # Show contract matching info
                if result.get("contract_match"):
                    match_info = result["contract_match"]
                    if match_info.get("status") == "MATCHED":
                        print(f"   Contract: {result.get('contract_id', 'N/A')} (via {result.get('match_method', 'N/A')}, confidence: {result.get('match_confidence', 0):.2f})")
                    elif match_info.get("status") == "UNMATCHED":
                        print(f"   Contract: UNMATCHED - Manual review required")
                    elif match_info.get("status") == "AMBIGUOUS":
                        print(f"   Contract: AMBIGUOUS - Multiple matches found")
                
                if result.get("issues"):
                    print(f"   Issues: {', '.join(result['issues'])}")
                if result.get("warnings"):
                    print(f"   Warnings: {', '.join(result['warnings'])}")
            except FileNotFoundError:
                print(f"\n[ERROR] {Path(invoice_file).name}: File not found (skipping)")

        if results:
            # Summary (ERROR results are counted so the totals add up)
            approved = sum(1 for r in results if r["status"] == "APPROVED")
            flagged = sum(1 for r in results if r["status"] == "FLAGGED")
            rejected = sum(1 for r in results if r["status"] == "REJECTED")
            errors = sum(1 for r in results if r["status"] == "ERROR")

            print("\n" + "=" * 80)
            print("PIPELINE TEST RESULTS")
            print("=" * 80)
            print(f"Total Invoices: {len(results)}")
            print(f"[OK] Approved: {approved}")
            print(f"[WARN] Flagged: {flagged}")
            print(f"[FAIL] Rejected: {rejected}")
            print(f"[ERROR] Errors: {errors}")
            
            # Contract matching statistics
            matched = sum(1 for r in results if r.get("contract_match", {}).get("status") == "MATCHED")
            unmatched = sum(1 for r in results if r.get("contract_match", {}).get("status") == "UNMATCHED")
            ambiguous = sum(1 for r in results if r.get("contract_match", {}).get("status") == "AMBIGUOUS")
            print(f"\nContract Matching:")
            print(f"  [OK] Matched: {matched}")
            print(f"  [WARN] Unmatched: {unmatched}")
            print(f"  [WARN] Ambiguous: {ambiguous}")
            
            if len(results) > 0:
                print(f"\nSuccess Rate: {approved/len(results)*100:.1f}%")
        else:
            print("\n[WARN] No invoices processed. Create sample documents first (Generate_Sample_Documents.ipynb)")

        print("\n" + "=" * 80)
        print("[OK] Pipeline test complete!")

except Exception as e:
    print(f"[ERROR] Error in invoice processing: {e}")
    import traceback
    traceback.print_exc()

finally:
    # Restore original logging level
    logging.getLogger().setLevel(old_level)



In [None]:
# Cell 29: Export Pipeline Results to Report

import json
from datetime import datetime
from collections import Counter


# Convert datetime objects to strings for JSON serialization
def serialize_result(result):
    """Convert result dict to JSON-serializable format."""
    serialized = result.copy()
    if "invoice_data" in serialized:
        invoice_data = serialized["invoice_data"].copy()
        # Convert datetime objects to ISO format strings
        for key in ["invoice_date", "due_date"]:
            if key in invoice_data and invoice_data[key]:
                if isinstance(invoice_data[key], datetime):
                    invoice_data[key] = invoice_data[key].isoformat()
        serialized["invoice_data"] = invoice_data
    return serialized


# Save results to JSON for reporting
# Read contracts info from extracted_rules.json (multi-contract format)
try:
    with open("extracted_rules.json", "r") as f:
        rules_data = json.load(f)
        # Multi-contract format
        if isinstance(rules_data, dict) and "contracts" in rules_data:
            contracts_info = {
                "total_contracts": len(rules_data.get("contracts", [])),
                "contracts": [
                    {
                        "contract_id": c.get("contract_id", "Unknown"),
                        "contract_path": c.get("contract_path", "Unknown"),
                        "rules_count": len(c.get("rules", []))
                    }
                    for c in rules_data.get("contracts", [])
                ]
            }
            total_rules = sum(len(c.get("rules", [])) for c in rules_data.get("contracts", []))
        else:
            contracts_info = {"total_contracts": 0, "contracts": []}
            total_rules = 0
except (FileNotFoundError, json.JSONDecodeError, KeyError):
    contracts_info = {"total_contracts": 0, "contracts": []}
    total_rules = 0

# Calculate contract matching statistics
contract_matching_stats = {
    "matched": sum(1 for r in results if r.get("contract_match", {}).get("status") == "MATCHED"),
    "unmatched": sum(1 for r in results if r.get("contract_match", {}).get("status") == "UNMATCHED"),
    "ambiguous": sum(1 for r in results if r.get("contract_match", {}).get("status") == "AMBIGUOUS"),
    "no_match_info": sum(1 for r in results if not r.get("contract_match"))
}

results_data = {
    "processed_at": datetime.now().isoformat(),
    "contracts_info": contracts_info,
    "total_rules_extracted": total_rules,
    "contract_matching": contract_matching_stats,
    "summary": {
        "total": len(results),
        "approved": sum(1 for r in results if r["status"] == "APPROVED"),
        "flagged": sum(1 for r in results if r["status"] == "FLAGGED"),
        "rejected": sum(1 for r in results if r["status"] == "REJECTED"),
        "errors": sum(1 for r in results if r["status"] == "ERROR"),
        "unmatched": contract_matching_stats["unmatched"],
    },
    "results": [serialize_result(r) for r in results],
}

# Save to file
with open("invoice_processing_results.json", "w") as f:
    # default=str covers any remaining non-JSON values (e.g. nested datetimes)
    json.dump(results_data, f, indent=2, default=str)

print("[OK] Results saved to: invoice_processing_results.json")

# Generate detailed report
print("\n" + "=" * 80)
print("DETAILED PROCESSING REPORT")
print("=" * 80)
print(f"\nGenerated: {results_data['processed_at']}")
print(f"Contracts Processed: {results_data['contracts_info']['total_contracts']}")
print(f"Total Rules Extracted: {results_data['total_rules_extracted']}")

# Show contract matching statistics
print("\nCONTRACT MATCHING STATISTICS")
print("-" * 80)
matching = results_data["contract_matching"]
print(f"Matched: {matching['matched']} ({matching['matched']/max(len(results),1)*100:.1f}%)")
print(f"Unmatched: {matching['unmatched']} ({matching['unmatched']/max(len(results),1)*100:.1f}%)")
if matching['ambiguous'] > 0:
    print(f"Ambiguous: {matching['ambiguous']} ({matching['ambiguous']/max(len(results),1)*100:.1f}%)")

# Overall Statistics
print("\nOVERALL STATISTICS")
print("-" * 80)
summary = results_data["summary"]
print(f"Total Invoices: {summary['total']}")
print(
    f"[OK] Approved: {summary['approved']} ({summary['approved']/max(summary['total'],1)*100:.1f}%)"
)
print(
    f"[WARN] Flagged: {summary['flagged']} ({summary['flagged']/max(summary['total'],1)*100:.1f}%)"
)
print(
    f"[FAIL] Rejected: {summary['rejected']} ({summary['rejected']/max(summary['total'],1)*100:.1f}%)"
)
print(
    f"[ERROR] Errors: {summary['errors']} ({summary['errors']/max(summary['total'],1)*100:.1f}%)"
)

# Invoice-by-Invoice Details
print("\nINVOICE DETAILS")
print("-" * 80)
for i, result in enumerate(results, 1):
    status_icon = {
        "APPROVED": "[OK]",
        "FLAGGED": "[WARN]",
        "REJECTED": "[FAIL]",
        "ERROR": "[ERROR]",
    }.get(result["status"], "[?]")

    # Path(...).name works with both "/" and "\" separators (Windows-safe)
    print(f"\n{i}. {status_icon} {Path(result['invoice_file']).name}")
    print(f"   Invoice #: {result.get('invoice_number', 'N/A')}")
    print(f"   Status: {result['status']}")
    print(f"   Action: {result['action']}")
    
    # Show contract matching info
    if result.get("contract_match"):
        match_info = result["contract_match"]
        if match_info.get("status") == "MATCHED":
            print(f"   Contract: {result.get('contract_id', 'N/A')} (via {result.get('match_method', 'N/A')}, confidence: {result.get('match_confidence', 0):.2f})")
        elif match_info.get("status") == "UNMATCHED":
            print(f"   Contract: UNMATCHED - Manual review required")
        elif match_info.get("status") == "AMBIGUOUS":
            print(f"   Contract: AMBIGUOUS - Multiple matches found")
            if match_info.get("alternative_matches"):
                print(f"      Alternative matches: {len(match_info['alternative_matches'])}")
    elif result.get("status") == "UNMATCHED":
        print(f"   Contract: UNMATCHED - No matching contract found")

    if result.get("issues"):
        print(f"   Issues: {'; '.join(result['issues'])}")
    if result.get("warnings"):
        print(f"   Warnings: {'; '.join(result['warnings'])}")

# Most Common Issues
print("\nMOST COMMON ISSUES")
print("-" * 80)
all_issues = []
for result in results:
    all_issues.extend(result.get("issues", []))

if all_issues:
    issue_counts = Counter(all_issues)
    for issue, count in issue_counts.most_common(5):
        print(f"  • {issue}: {count} occurrence(s)")
else:
    print("  [OK] No issues found")

# Most Common Warnings
print("\nMOST COMMON WARNINGS")
print("-" * 80)
all_warnings = []
for result in results:
    all_warnings.extend(result.get("warnings", []))

if all_warnings:
    warning_counts = Counter(all_warnings)
    for warning, count in warning_counts.most_common(5):
        print(f"  • {warning}: {count} occurrence(s)")
else:
    print("  [OK] No warnings found")

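# Extracted Rules Summary
print("\nEXTRACTED RULES (from RAG)")
print("-" * 80)
# Read the saved rules file so every contract is listed; the bare `rules`
# variable from the pipeline cell only holds the last contract processed
# and is undefined when the fallback rules were used.
try:
    with open("extracted_rules.json", "r") as f:
        saved_contracts = json.load(f).get("contracts", [])
except (FileNotFoundError, json.JSONDecodeError, AttributeError):
    saved_contracts = []

rule_number = 0
for contract in saved_contracts:
    for rule in contract.get("rules", []):
        rule_number += 1
        print(f"\n{rule_number}. {str(rule.get('type', 'unknown')).upper()} [{contract.get('contract_id', 'Unknown')}]")
        print(f"   Priority: {rule.get('priority', 'N/A')}")
        print(f"   Description: {str(rule.get('description', ''))[:80]}...")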

# Recommended Actions
print("\nRECOMMENDED ACTIONS")
print("-" * 80)
# Collect actions first so the numbering stays sequential
actions = []
if summary["rejected"] > 0:
    actions.append(f"Review {summary['rejected']} rejected invoice(s) manually")
if summary["flagged"] > 0:
    actions.append(f"Investigate {summary['flagged']} flagged invoice(s)")
if summary["errors"] > 0:
    actions.append(f"Fix processing errors for {summary['errors']} invoice(s)")
for n, action in enumerate(actions, 1):
    print(f"  {n}. {action}")

# Guard against an empty run, where approved == total == 0
all_approved = summary["total"] > 0 and summary["approved"] == summary["total"]
if all_approved:
    print("  [OK] All invoices approved - ready for payment processing")

if not actions and not all_approved:
    print("  [OK] No action required at this time")

print("\n" + "=" * 80)
print("[OK] Report generated successfully!")
print(f"Full results saved to: invoice_processing_results.json")
print("=" * 80)


### Benefits:

1. **No API Keys Required** - Uses local Ollama models
2. **Fast Processing** - FAISS vector store for efficient semantic search
3. **Comprehensive Validation** - Checks multiple rule types (payment terms, PO requirements, penalties, etc.)
4. **Detailed Reporting** - JSON output with validation status and issues
5. **Cross-Platform** - Works on Windows, Mac, and Linux
6. **OCR Support** - Handles scanned documents and images
7. **Automatic Setup** - Auto-generates sample documents if needed
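
As an illustration of the reporting output, the `summary` block written to `invoice_processing_results.json` can be derived from the per-invoice `status` values alone. The sketch below shows one way to do that; the helper name `summarize_statuses` is hypothetical, and counting a result with no `status` key as an error is an assumption.

```python
from collections import Counter


def summarize_statuses(results):
    """Build the summary counts from a list of per-invoice result dicts."""
    # Assumption: a result without a "status" key is treated as an error
    counts = Counter(r.get("status", "ERROR") for r in results)
    return {
        "total": len(results),
        "approved": counts["APPROVED"],
        "flagged": counts["FLAGGED"],
        "rejected": counts["REJECTED"],
        "errors": counts["ERROR"],
    }


sample_results = [
    {"invoice_file": "inv_001.pdf", "status": "APPROVED"},
    {"invoice_file": "inv_002.pdf", "status": "REJECTED"},
    {"invoice_file": "inv_003.pdf", "status": "APPROVED"},
]
print(summarize_statuses(sample_results))
```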

## Usage Example

### Step-by-Step Execution:

1. **Setup** (run once):
   - Cells 5-6: Install packages
   - Cell 7: Import libraries
   - Cell 8: Generate sample documents (if needed)
   - Cell 9: Test Ollama connection

2. **Extract Rules**:
   - Cell 14: Initialize RAG agent
   - Cell 15: Process contract and extract rules
   - Cell 16: Save rules to JSON

3. **Process Invoices** (choose one):
   - **Option A (Recommended):** Cell 23 - Batch process all invoices
   - **Option B (Testing):** Cell 21 → Cell 22 - Process single invoice for debugging

4. **Complete Pipeline**:
   - Cell 28: Run complete pipeline (extract rules + process invoices)
   - Cell 29: Export results to report

### Quick Start (Production):
Run Cells 5-9, then Cell 28 (complete pipeline) for end-to-end processing.

In [None]:
# Cell 32: Verification Cell - Check Current Rules
print("=" * 70)
print("CURRENT PROCESSOR STATE")
print("=" * 70)

# Check if processor exists
try:
    print(f"\n[OK] Processor Status: Initialized")
    print(f"Total Rules Loaded: {len(processor.rules)}")
    print(
        f"Payment Terms: Net {processor.payment_terms} days"
        if processor.payment_terms
        else "Payment Terms: Not specified"
    )

    print("\nCurrently Loaded Rules:")
    for i, rule in enumerate(processor.rules, 1):
        print(f"\n  {i}. [{rule['type'].upper()}]")
        print(f"     ID: {rule['rule_id']}")
        print(f"     Priority: {rule['priority']}")
        print(
            f"     Source: {'RAG-extracted' if rule.get('confidence') == 'medium' else 'Default'}"
        )
        print(f"     Description: {rule['description'][:70]}...")

    print("\n" + "=" * 70)
    print("[OK] Ready to process invoices!")
    print("\nNext steps:")
    print("  1. Run Cell 22 to process a single invoice")
    print("  2. Run Cell 23 to batch process all invoices")
    print("  3. Run Cell 28 for complete pipeline with RAG extraction")

except NameError:
    print("[FAIL] Processor not initialized. Run Cell 21 first.")
