# OCR Invoice Processing Demo

This notebook demonstrates the enhanced OCR functionality of the Invoice & Billing system. It sends the sample invoice images to the OCR endpoint, prints the extracted text and parsed fields, and compares the parsed fields with the values we expect.

## Setup and Imports

Let's import the required libraries for testing our OCR functionality.

In [None]:
import requests
import json
import os
from pathlib import Path
import base64
from PIL import Image
import matplotlib.pyplot as plt

# API Configuration
API_BASE_URL = 'http://localhost:8000'
OCR_ENDPOINT = f'{API_BASE_URL}/invoices/ocr/extract'

print("OCR Demo Setup Complete!")
print(f"API Base URL: {API_BASE_URL}")
print(f"OCR Endpoint: {OCR_ENDPOINT}")

## Test Backend Connection

First, let's verify that our backend server is running by calling the `/students` endpoint. The OCR endpoint itself is exercised later, when the first image is uploaded.

In [None]:
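def test_backend_connection():
    """Test if the backend server is running"""
    try:
        # Test a simple endpoint first; the timeout keeps the cell from hanging
        response = requests.get(f'{API_BASE_URL}/students', timeout=5)
        response.raise_for_status()  # treat 4xx/5xx responses as failures
        print("✅ Backend connection successful!")
        print(f"   Status: {response.status_code}")
        print(f"   Students found: {len(response.json())}")
        return True
    except (requests.RequestException, ValueError) as e:
        # RequestException covers connection, timeout and HTTP errors;
        # ValueError covers a response body that is not valid JSON
        print(f"❌ Backend connection failed: {e}")
        return False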

# Test the connection
backend_available = test_backend_connection()
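
A connection check is more useful when it distinguishes a server that cannot be reached from one that answers with an error status. The sketch below shows one way to do that. It uses the standard library's `urllib` instead of `requests` so that it runs without extra dependencies, and `probe_backend` is a hypothetical helper name, not part of the project.

```python
import json
import urllib.error
import urllib.request


def probe_backend(url, timeout=3.0):
    """Return (ok, detail) for a GET request to `url`.

    Hypothetical helper: the name and the return shape are assumptions
    made for this sketch, not part of the backend's API.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Parse the body only to confirm that it is valid JSON
            json.loads(response.read().decode('utf-8'))
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as e:
        # The server answered, but with a 4xx/5xx status
        return False, f"server reachable but returned HTTP {e.code}"
    except OSError as e:
        # URLError is an OSError subclass; this also covers read timeouts
        return False, f"server unreachable: {getattr(e, 'reason', e)}"
    except ValueError as e:
        # The body was not valid UTF-8 or not valid JSON
        return False, f"unexpected response body: {e}"
```

With this notebook's configuration it could be called as `probe_backend(f'{API_BASE_URL}/students')`.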

## Sample Invoice Images

Let's examine our sample invoice images that we'll use for OCR testing.

In [None]:
# List available sample files
sample_dir = Path('sample_invoices')
if sample_dir.exists():
    sample_files = list(sample_dir.glob('*'))
    print(f"📁 Found {len(sample_files)} sample files:")
    for file in sample_files:
        size = file.stat().st_size
        print(f"   📄 {file.name} ({size:,} bytes)")
else:
    print("❌ Sample invoices directory not found!")
    print("   Please run the sample invoice creation scripts first.")
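
The listing above shows every file in the directory, whatever its type. If only uploadable files are of interest, the suffix can be compared case-insensitively, which also picks up names such as `SCAN.PNG` that a `*.png` glob may miss on case-sensitive filesystems. This is a minimal sketch: `find_invoice_files` and the extension set are assumptions for illustration.

```python
from pathlib import Path

# Assumed set of upload formats (PNG, JPG and PDF are named in this notebook;
# '.jpeg' is added here as a common alias)
SUPPORTED_EXTENSIONS = {'.png', '.jpg', '.jpeg', '.pdf'}


def find_invoice_files(directory, extensions=SUPPORTED_EXTENSIONS):
    """Return the supported files in `directory`, sorted by path.

    Hypothetical helper: lower-cases each suffix before comparing, so
    'SCAN.PNG' and 'scan.png' are both included. Subdirectories are skipped.
    """
    directory = Path(directory)
    if not directory.is_dir():
        return []
    return sorted(
        path for path in directory.iterdir()
        if path.is_file() and path.suffix.lower() in extensions
    )
```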

## Display Sample Images

Let's visualize our sample invoice images to see what the OCR system is processing.

In [None]:
def display_sample_images():
    """Display up to three sample invoice images side by side"""
    image_files = (list(sample_dir.glob('*.png')) + list(sample_dir.glob('*.jpg')))[:3]
    
    if not image_files:
        print("No image files found!")
        return
    
    # squeeze=False always returns a 2-D array of axes, even for a single image,
    # so the one-image and many-image cases share the same code path
    fig, axes = plt.subplots(1, len(image_files), figsize=(15, 5), squeeze=False)
    
    for ax, img_file in zip(axes[0], image_files):
        ax.axis('off')
        try:
            img = Image.open(img_file)
            ax.imshow(img)
            ax.set_title(img_file.name)
        except Exception as e:
            print(f"❌ Could not display {img_file.name}: {e}")
    
    plt.tight_layout()
    plt.show()

if sample_dir.exists():
    display_sample_images()
else:
    print("Sample directory not available for image display.")

## OCR Processing Function

Let's create a function to test OCR processing with our sample images.

In [None]:
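import mimetypes

def test_ocr_processing(image_path):
    """Test OCR processing on a specific image file"""
    try:
        print(f"🔍 Processing: {image_path.name}")
        print("-" * 50)
        
        # Derive the content type from the file extension instead of assuming PNG
        content_type = mimetypes.guess_type(image_path.name)[0] or 'application/octet-stream'
        
        # Prepare the file for upload
        with open(image_path, 'rb') as f:
            files = {'file': (image_path.name, f, content_type)}
            data = {
                'image_type': 'invoice_upload',
                'extract_data': 'true'
            }
            
            # Make the OCR request; OCR can be slow, so allow a generous timeout
            response = requests.post(OCR_ENDPOINT, files=files, data=data, timeout=60)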
        
        if response.status_code == 200:
            result = response.json()
            
            print(f"✅ OCR Processing Successful!")
            print(f"   Confidence: {result.get('confidence_score', 0)*100:.1f}%")
            print(f"   Text Length: {len(result.get('ocr_text', ''))} characters")
            
            # Display extracted data
            extracted = result.get('extracted_data', {})
            if extracted:
                print("\n📋 Extracted Information:")
                print(f"   Student Name: {extracted.get('student_name', 'Not found')}")
                print(f"   Student ID: {extracted.get('student_id', 'Not found')}")
                print(f"   Email: {extracted.get('student_email', 'Not found')}")
                print(f"   Department: {extracted.get('department', 'Not found')}")
                print(f"   Due Date: {extracted.get('due_date', 'Not found')}")
                print(f"   Invoice #: {extracted.get('invoice_number', 'Not found')}")
                print(f"   Items Found: {len(extracted.get('items', []))}")
                
                # Show first few items
                items = extracted.get('items', [])
                if items:
                    print("\n🛠️ Equipment Items:")
                    for i, item in enumerate(items[:3]):
                        print(f"   {i+1}. {item.get('name', 'Unknown')} ({item.get('sku', 'No SKU')})")
                        # `or 0` guards against a null unit_value in the response
                        print(f"      Qty: {item.get('quantity', 0)}, Value: ${float(item.get('unit_value') or 0):.2f}")
            
            # Show sample of OCR text
            ocr_text = result.get('ocr_text', '')
            if ocr_text:
                print("\n📝 Raw OCR Text (first 300 characters):")
                print(f"   {ocr_text[:300]}{'...' if len(ocr_text) > 300 else ''}")
            
            return result
            
        else:
            print(f"❌ OCR Processing Failed!")
            print(f"   Status: {response.status_code}")
            print(f"   Response: {response.text}")
            return None
            
    except Exception as e:
        print(f"❌ Error during OCR processing: {e}")
        return None

print("OCR testing function ready!")
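
The function above reads the response with chained `.get()` calls and inline defaults. An alternative is to normalise the response once into typed objects, so later cells do not have to repeat those defaults. The sketch below assumes the key names read by `test_ocr_processing` (`confidence_score`, `ocr_text`, `extracted_data`, `items`, `name`, `sku`, `quantity`, `unit_value`); the classes and `parse_ocr_response` are hypothetical and cover only a subset of the fields.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InvoiceItem:
    name: str = 'Unknown'
    sku: Optional[str] = None
    quantity: int = 0
    unit_value: float = 0.0


@dataclass
class OcrResult:
    confidence: float = 0.0
    text: str = ''
    student_name: Optional[str] = None
    student_id: Optional[str] = None
    invoice_number: Optional[str] = None
    items: List[InvoiceItem] = field(default_factory=list)


def _to_float(value, default=0.0):
    """Convert numbers or numeric strings to a finite float, else `default`."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return default
    return number if math.isfinite(number) else default


def parse_ocr_response(payload):
    """Build an OcrResult from the JSON dict returned by the OCR endpoint.

    Missing or null values fall back to the dataclass defaults.
    """
    payload = payload or {}
    extracted = payload.get('extracted_data') or {}
    items = [
        InvoiceItem(
            name=raw.get('name') or 'Unknown',
            sku=raw.get('sku'),
            quantity=int(_to_float(raw.get('quantity'))),
            unit_value=_to_float(raw.get('unit_value')),
        )
        for raw in extracted.get('items') or []
        if isinstance(raw, dict)
    ]
    return OcrResult(
        confidence=_to_float(payload.get('confidence_score')),
        text=payload.get('ocr_text') or '',
        student_name=extracted.get('student_name'),
        student_id=extracted.get('student_id'),
        invoice_number=extracted.get('invoice_number'),
        items=items,
    )
```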

## Test OCR with Sample Images

Now let's test the OCR processing with our sample invoice images, starting with the optimized simple image.

In [None]:
# Test with the simple invoice (best for OCR)
if sample_dir.exists() and backend_available:
    simple_invoice = sample_dir / 'sample_invoice_simple.png'
    if simple_invoice.exists():
        print("🎯 Testing with OCR-optimized simple invoice:\n")
        simple_result = test_ocr_processing(simple_invoice)
    else:
        print("❌ Simple invoice file not found!")
else:
    print("❌ Cannot run OCR test - backend not available or samples missing")

## Test with Complex Layout Invoice

Let's test with a more complex invoice layout to see how well the enhanced parsing handles detailed invoices.

In [None]:
# Test with a more complex invoice
if sample_dir.exists() and backend_available:
    complex_invoice = sample_dir / 'sample_invoice_004.png'
    if complex_invoice.exists():
        print("🎯 Testing with complex invoice layout:\n")
        complex_result = test_ocr_processing(complex_invoice)
    else:
        print("❌ Complex invoice file not found!")
else:
    print("❌ Cannot run OCR test - backend not available or samples missing")
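
The two test cells repeat the same steps for different files. A small loop can run every sample image through one processing function and collect a per-file summary. The sketch below takes that function as a parameter, so it can be tried without a running backend; `run_ocr_batch` is a hypothetical name, and the summary keys are chosen only for illustration.

```python
from pathlib import Path


def run_ocr_batch(image_paths, process):
    """Run `process(path)` for each path and summarise the results.

    `process` is expected to behave like test_ocr_processing: it returns
    the response dict on success and None on failure.
    """
    summary = []
    for path in image_paths:
        path = Path(path)
        result = process(path)
        extracted = (result or {}).get('extracted_data') or {}
        summary.append({
            'file': path.name,
            'ok': result is not None,
            'confidence': (result or {}).get('confidence_score'),
            'items': len(extracted.get('items') or []),
        })
    return summary
```

In this notebook it could be called as `run_ocr_batch(sorted(sample_dir.glob('*.png')), test_ocr_processing)`.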

## OCR Accuracy Analysis

Let's analyze the accuracy of our OCR extraction by comparing expected vs extracted data.

In [None]:
def analyze_ocr_accuracy(ocr_result, expected_data):
    """Analyze OCR accuracy by comparing with expected data"""
    if not ocr_result or 'extracted_data' not in ocr_result:
        print("❌ No OCR result to analyze")
        return
    
    extracted = ocr_result['extracted_data']
    accuracy_score = 0
    total_fields = len(expected_data)
    
    print("📊 OCR Accuracy Analysis:")
    print("-" * 40)
    
    for field, expected_value in expected_data.items():
        extracted_value = extracted.get(field, '')
        
        if field == 'items':
            # Special handling for items
            expected_count = len(expected_value) if expected_value else 0
            extracted_count = len(extracted_value) if extracted_value else 0
            match = expected_count == extracted_count
            print(f"   Items: Expected {expected_count}, Got {extracted_count} {'✅' if match else '❌'}")
        else:
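            # String comparison (case insensitive, partial match in either direction)
            if expected_value and extracted_value:
                # str() keeps the comparison working if the API returns a non-string value
                expected_text = str(expected_value).lower()
                extracted_text = str(extracted_value).lower()
                match = expected_text in extracted_text or extracted_text in expected_text
            else:
                match = not expected_value and not extracted_value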
            
            status = '✅' if match else '❌'
            print(f"   {field.replace('_', ' ').title()}: {status}")
            print(f"      Expected: '{expected_value}'")
            print(f"      Extracted: '{extracted_value}'")
        
        if match:
            accuracy_score += 1
    
    # Guard against an empty expected_data dict
    accuracy_percentage = (accuracy_score / total_fields) * 100 if total_fields else 0.0
    print(f"\n🎯 Overall Accuracy: {accuracy_percentage:.1f}% ({accuracy_score}/{total_fields} fields correct)")
    
    return accuracy_percentage

# Expected data for the simple invoice
expected_simple = {
    'student_name': 'Alex Rodriguez',
    'student_id': 'STU98765',
    'student_email': 'alex.rodriguez@university.edu',
    'department': 'Chemistry',
    'invoice_number': 'INV-2025-005',
    'due_date': 'September 25, 2025',
    'items': [  # Expected 4 items
        'Analytical Balance',
        'pH Meter',
        'Beaker Set',
        'Safety Goggles'
    ]
}

# Analyze simple invoice if we have the result
if 'simple_result' in locals() and simple_result:
    simple_accuracy = analyze_ocr_accuracy(simple_result, expected_simple)
else:
    print("No simple invoice result available for analysis")
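
The analysis above accepts any substring match, so an extracted department of `'C'` would count as a match for `'Chemistry'`, and items are compared by count only. A stricter option is to score each field with a similarity ratio and to check item names individually. The sketch below uses `difflib.SequenceMatcher` from the standard library; the function names and the 0.8 threshold are assumptions, not part of the notebook's scoring.

```python
from difflib import SequenceMatcher


def normalize(value):
    """Lower-case and collapse whitespace so formatting differences are ignored."""
    return ' '.join(str(value or '').lower().split())


def field_similarity(expected, extracted):
    """Return a similarity ratio between 0.0 and 1.0 for two field values."""
    a, b = normalize(expected), normalize(extracted)
    if not a and not b:
        return 1.0  # both empty counts as agreement
    return SequenceMatcher(None, a, b).ratio()


def item_recall(expected_names, extracted_items, threshold=0.8):
    """Fraction of expected item names found among the extracted items.

    `threshold` is an assumed cut-off for treating two names as the same item.
    """
    if not expected_names:
        return 1.0
    extracted_names = [item.get('name', '') for item in extracted_items or []]
    found = sum(
        1 for name in expected_names
        if any(field_similarity(name, other) >= threshold for other in extracted_names)
    )
    return found / len(expected_names)
```

A field could then be counted as correct when `field_similarity(expected, extracted)` reaches the chosen threshold, which tolerates a single misread character but rejects very short fragments.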

## Performance Summary

Let's summarize the OCR performance improvements and capabilities.

In [None]:
def display_performance_summary():
    """Display a summary of OCR performance and capabilities"""
    print("🚀 OCR System Performance Summary")
    print("=" * 50)
    
    print("\n✅ Implemented Features:")
    features = [
        "Tesseract OCR engine integration",
        "Multi-format support (PNG, JPG, PDF)",
        "Image preprocessing for better recognition",
        "Multiple PSM modes for different layouts",
        "Enhanced text parsing with regex patterns",
        "Structured data extraction",
        "Confidence scoring",
        "Equipment items parsing",
        "Error handling and fallback strategies"
    ]
    
    for i, feature in enumerate(features, 1):
        print(f"   {i:2d}. {feature}")
    
    print("\n📊 Extraction Capabilities:")
    capabilities = {
        "Student Information": ["Name", "ID", "Email", "Department"],
        "Invoice Details": ["Invoice Number", "Due Date", "Issue Date"],
        "Equipment Items": ["Names", "SKUs", "Quantities", "Values"],
        "Financial Data": ["Unit Values", "Total Values", "Line Items"]
    }
    
    for category, items in capabilities.items():
        print(f"   📋 {category}: {', '.join(items)}")
    
    print("\n🎯 Recommended Usage:")
    recommendations = [
        "Use 'sample_invoice_simple.png' for best OCR results",
        "Ensure good image quality and contrast",
        "Clear, readable fonts work best",
        "Structured layouts improve parsing accuracy",
        "Review extracted data before creating invoices"
    ]
    
    for i, rec in enumerate(recommendations, 1):
        print(f"   {i}. {rec}")
    
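    # locals() inside a function only sees the function's own variables,
    # so look the notebook-level result up in globals() instead
    accuracy = globals().get('simple_accuracy')
    if accuracy is not None:
        print(f"\n📈 Test Results: {accuracy:.1f}% accuracy achieved")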
    
    print("\n🔗 Integration Status:")
    print(f"   Backend Server: {'🟢 Running' if backend_available else '🔴 Not Available'}")
    print(f"   Sample Files: {'🟢 Available' if sample_dir.exists() else '🔴 Missing'}")
    print("   OCR Engine: Tesseract (runs on the backend; not checked from this notebook)")

display_performance_summary()
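
The feature list mentions text parsing with regex patterns, but the backend's expressions are not shown in this notebook. As an illustration of the idea only, the sketch below pulls three fields out of raw OCR text. The patterns are assumptions inferred from the sample values used here (`INV-2025-005`, `STU98765`), and `extract_fields` is a hypothetical name.

```python
import re

# Assumed patterns, inferred from the sample values in this notebook;
# the backend's real expressions are not shown here and may differ.
FIELD_PATTERNS = {
    'invoice_number': re.compile(r'\bINV-\d{4}-\d{3,}\b', re.IGNORECASE),
    'student_id': re.compile(r'\bSTU\d{5,}\b', re.IGNORECASE),
    'student_email': re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+'),
}


def extract_fields(ocr_text):
    """Return the first match for each pattern, or None when a field is absent."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text or '')
        fields[name] = match.group(0) if match else None
    return fields
```

Returning `None` for a missing field, rather than an empty string, keeps "not found" distinguishable from "found but empty" in later checks.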

## Next Steps

To further improve the OCR system, consider these enhancements:

1. **Image Preprocessing**: Add more advanced preprocessing techniques
2. **Machine Learning**: Train custom models for invoice-specific text recognition
3. **Template Matching**: Create templates for common invoice formats
4. **Validation**: Add data validation and correction suggestions
5. **User Feedback**: Implement learning from user corrections
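
As a starting point for the validation item, the sketch below checks an `extracted_data` dict and returns a list of problems to show the user before an invoice is created. It covers validation only, not correction suggestions. The date formats, the email pattern and the function names are assumptions; the field names are the ones used earlier in this notebook.

```python
import re
from datetime import datetime

# Assumed formats; adjust them to whatever the invoices actually use
DATE_FORMATS = ('%B %d, %Y', '%Y-%m-%d', '%m/%d/%Y')
EMAIL_RE = re.compile(r'^[\w.+-]+@[\w-]+(?:\.[\w-]+)+$')


def parse_due_date(text):
    """Try each known format and return a date, or None if none fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(str(text).strip(), fmt).date()
        except ValueError:
            continue
    return None


def validate_extracted(extracted):
    """Return a list of human-readable problems found in the extracted fields."""
    problems = []
    for required in ('student_name', 'student_id', 'invoice_number'):
        if not extracted.get(required):
            problems.append(f'{required} is missing')
    email = extracted.get('student_email')
    if email and not EMAIL_RE.match(email):
        problems.append(f'student_email looks malformed: {email!r}')
    due_date = extracted.get('due_date')
    if due_date and parse_due_date(due_date) is None:
        problems.append(f'due_date is not in a recognised format: {due_date!r}')
    for index, item in enumerate(extracted.get('items') or [], start=1):
        quantity = item.get('quantity')
        if not isinstance(quantity, (int, float)) or quantity <= 0:
            problems.append(f'item {index} has a missing or non-positive quantity')
    return problems
```

An empty list means no problems were detected by these particular checks; it does not guarantee that the OCR output is correct.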

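The current system provides a solid foundation for invoice OCR processing. It works best on well-formatted documents such as the OCR-optimized simple sample, whose measured accuracy is reported by the analysis cell above; more complex layouts should be reviewed manually before invoices are created.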