# Function Integration - Complete Pipeline

This notebook integrates the functions from notebooks 2 and 3 into a complete document processing pipeline.

## Functions Moved to src/ Modules

All functions from notebooks 2 and 3 have been moved to the `src/` folder structure:
- **OCR functions** → `src/ocr/`
- **LLM functions** → `src/llm/`
- **Storage functions** → `src/storage/`

Keeping the functions in `src/` means this notebook, later notebooks, and any service code can import the same implementations instead of copying them.

In [1]:
# Standard Library Imports
import asyncio
import json
import logging
import sys
import re
from datetime import datetime
from typing import Dict, Any, List, Protocol
from pathlib import Path

# Add project root to Python path so we can import from src/
project_root = Path.cwd().parent.parent
sys.path.insert(0, str(project_root))

# Drop cached copies so that edits to these modules are picked up by the imports below
for module_name in ('src.ocr.postprocess', 'src.ocr.easyocr_client'):
    sys.modules.pop(module_name, None)

# Import functions from notebook 02 - OCR Text Extraction
from src.ocr.easyocr_client import extract_text_bboxes_with_ocr
from src.ocr.postprocess import normalize_ocr_lines, convert_numpy_types

# Import functions from notebook 03 - LLM Field Extraction  
from src.llm.field_extractor import extract_fields_with_llm
from src.llm.client import OllamaClient, GenerativeLlm
from src.llm.config import load_document_config

# Import storage functions
from src.storage.storage import get_storage, Stage
from src.storage.blob_operations import write_ocr_results_to_bucket, read_ocr_results_from_bucket

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
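
Removing entries from `sys.modules` is one way to pick up edited source files. A possible alternative is `importlib.reload`, which re-executes a module that is already loaded. The sketch below is only an illustration: the helper name `reload_loaded` is made up here, and the stdlib module `colorsys` stands in for the project's `src.ocr` modules.

```python
import importlib
import sys
from typing import Iterable, List

import colorsys  # stdlib module used only to demonstrate the helper


def reload_loaded(module_names: Iterable[str]) -> List[str]:
    """Reload each named module that is already imported; return the names reloaded."""
    reloaded = []
    for name in module_names:
        module = sys.modules.get(name)
        if module is not None:
            # Re-executes the module's code in the existing module object
            importlib.reload(module)
            reloaded.append(name)
    return reloaded


# The second name was never imported, so it is skipped rather than raising
result = reload_loaded(["colorsys", "module_that_was_never_imported"])
```

Names bound earlier with `from module import name` keep pointing at the old objects after a reload, so a helper like this would have to run before the `from src... import ...` lines, just as the `sys.modules` cleanup does.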

In [2]:
# Upload the loan application PDF to blob storage using the storage client
document_id = "test-document-001"
filename = "loan_application.pdf"

# Load file from data folder
file_path = project_root / "data" / filename
with open(file_path, "rb") as f:
    file_data = f.read()

# Upload to blob storage using the storage client
storage_client = get_storage()
storage_client.upload_blob(
    uuid=document_id,
    stage=Stage.RAW,
    ext=".pdf",
    data=file_data,
    overwrite=True
)

blob_path = storage_client.blob_path(document_id, Stage.RAW, ".pdf")
print(f"File successfully uploaded to blob storage at: {Stage.RAW.value}/{blob_path}")

INFO:azure.core.pipeline.policies.http_logging_policy:Request URL: 'http://127.0.0.1:10000/devstoreaccount1/raw?restype=REDACTED'
Request method: 'PUT'
Request headers:
    'x-ms-version': 'REDACTED'
    'Accept': 'application/xml'
    'User-Agent': 'azsdk-python-storage-blob/12.26.0 Python/3.10.16 (macOS-15.1-arm64-arm-64bit)'
    'x-ms-date': 'REDACTED'
    'x-ms-client-request-id': 'c396729e-89eb-11f0-9d9c-a526edf9c303'
    'Authorization': 'REDACTED'
No body was attached to the request
INFO:azure.core.pipeline.policies.http_logging_policy:Response status: 409
Response headers:
    'Server': 'Azurite-Blob/3.34.0'
    'x-ms-error-code': 'ContainerAlreadyExists'
    'x-ms-request-id': '90b0cff2-d89b-4b7d-9bab-5b990b2726f7'
    'content-type': 'application/xml'
    'Date': 'Fri, 05 Sep 2025 00:03:34 GMT'
    'Connection': 'keep-alive'
    'Keep-Alive': 'REDACTED'
    'Transfer-Encoding': 'chunked'
INFO:azure.core.pipeline.policies.http_logging_policy:Request URL: 'http://127.0.0.1:1000

BlobStorage initialized with multiple containers
Container 'raw' already exists
Uploaded blob: raw/test-document-001.pdf
File successfully uploaded to blob storage at: raw/test-document-001.pdf
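
The request and response dumps in the output above are emitted by the Azure SDK's HTTP logging policy, which becomes visible once the root logger is set to INFO. One possible way to keep the notebook's own INFO messages while hiding those dumps is to raise the level of that single logger. The logger name is copied from the log lines above; the helper name `quiet_logger` is made up for this sketch.

```python
import logging

# Logger name as it appears in the log output above
AZURE_HTTP_LOGGER = "azure.core.pipeline.policies.http_logging_policy"


def quiet_logger(name: str, level: int = logging.WARNING) -> logging.Logger:
    """Raise the threshold of one named logger without touching the root logger."""
    noisy = logging.getLogger(name)
    noisy.setLevel(level)
    return noisy


logging.basicConfig(level=logging.INFO)
azure_http_logger = quiet_logger(AZURE_HTTP_LOGGER)
```

Warnings and errors from the SDK would still be shown; only its INFO-level request and response dumps are filtered out.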


## Configuration Loading

Load the LLM settings (server URL and model name) from the system config file.

In [3]:
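def load_system_config() -> Dict[str, Any]:
    """Load the LLM settings from the system config file."""
    # Resolve the path from project_root so the result does not depend on the working directory
    config_path = project_root / "config" / "credit-ocr-system.conf"
    content = config_path.read_text()

    # Match the block: generative_llm { url = "..." model_name = "..." }
    llm_match = re.search(
        r'generative_llm\s*\{\s*url\s*=\s*"([^"]+)"\s*model_name\s*=\s*"([^"]+)"\s*\}',
        content,
    )
    if llm_match is None:
        # Fail here with a clear message instead of a KeyError later in the pipeline
        raise ValueError(f"No generative_llm block found in {config_path}")
    return {
        'llm': {
            'url': llm_match.group(1),
            'model_name': llm_match.group(2)
        }
    }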
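
To see what the regular expression in `load_system_config` accepts, it can be run against a small sample. The sample text below is made up for illustration, since the real `credit-ocr-system.conf` is not shown in this notebook; the URL, the model name, and the helper name `parse_llm_block` are all assumptions.

```python
import re
from typing import Dict, Optional

# Same pattern as in load_system_config
LLM_BLOCK = re.compile(
    r'generative_llm\s*\{\s*url\s*=\s*"([^"]+)"\s*model_name\s*=\s*"([^"]+)"\s*\}'
)

# Made-up sample content; not the project's actual config file
SAMPLE_CONFIG = '''
generative_llm {
    url = "http://localhost:11434"
    model_name = "example-model"
}
'''


def parse_llm_block(content: str) -> Optional[Dict[str, str]]:
    """Return the url and model_name from a generative_llm block, or None if absent."""
    match = LLM_BLOCK.search(content)
    if match is None:
        return None
    return {"url": match.group(1), "model_name": match.group(2)}


parsed = parse_llm_block(SAMPLE_CONFIG)

# The pattern expects url first and model_name second, so a swapped block does not match
swapped = parse_llm_block('generative_llm { model_name = "m" url = "u" }')
```

Because the pattern is order-sensitive and allows no extra keys inside the block, a config file that grows beyond these two settings would probably be better served by a real config parser.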

## Integrated Pipeline

This function combines all the processing steps into a single pipeline that processes a document from start to finish.

In [4]:
async def integrated_pipeline(document_id: str, filename: str, blob_path: str):
    """Complete integrated pipeline combining document loading from blob storage, OCR and LLM processing."""
    
    print(f"Starting integrated pipeline for document: {document_id}")
    print(f"  - Filename: {filename}")
    print(f"  - Blob path: {blob_path}")
    
    # Step 1: Load document from blob storage
    print("Step 1: Loading document from blob storage...")
    storage_client = get_storage()
    pdf_data = storage_client.download_blob(document_id, Stage.RAW, ".pdf")
    if pdf_data is None:
        raise FileNotFoundError(f"Document not found in blob storage: {document_id}")
    print(f"  - Loaded {filename} from blob storage ({len(pdf_data)} bytes)")
    
    # Step 2: OCR Processing (from notebook 2)
    print("Step 2: OCR Processing...")
    ocr_results, pdf_images = extract_text_bboxes_with_ocr(pdf_data)
    print(f"  - Extracted {len(ocr_results)} text elements")
    
    # Step 3: Normalize OCR results
    print("Step 3: Normalizing OCR results...")
    normalized_results = normalize_ocr_lines(ocr_results)
    print(f"  - Normalized to {len(normalized_results)} structured items")
    
    # Step 4: Convert NumPy types
    print("Step 4: Converting NumPy types...")
    ocr_results_converted = convert_numpy_types(ocr_results)
    normalized_results_converted = convert_numpy_types(normalized_results)
    
    # Step 5: LLM Processing (from notebook 3)
    print("Step 5: LLM Field Extraction...")
    
    # Load configuration
    system_config = load_system_config()
    doc_config = load_document_config("../../config/document_types.conf")
    
    # Initialize LLM client
    llm_client = OllamaClient(
        base_url=system_config['llm']['url'],
        model_name=system_config['llm']['model_name']
    )
    
    # Extract fields using LLM
    extraction_result = await extract_fields_with_llm(
        ocr_lines=normalized_results_converted,
        doc_config=doc_config["credit_request"],
        llm_client=llm_client,
        original_ocr_lines=ocr_results_converted
    )
    
    # Step 6: Prepare final results
    final_results = {
        "document_id": document_id,
        "timestamp": datetime.now().isoformat(),
        "document_info": {
            "filename": filename,
            "file_size": len(pdf_data),
            "blob_path": blob_path
        },
        "ocr_results": {
            "original_lines": ocr_results_converted,
            "normalized_lines": normalized_results_converted,
            "total_elements": len(ocr_results_converted),
            "structured_items": len(normalized_results_converted)
        },
        "llm_results": extraction_result,
        "status": "completed"
    }
    
    print("Step 6: Pipeline completed successfully!")
    print(f"  - Extracted {len(extraction_result.get('extracted_fields', {}))} fields")
    print(f"  - Missing {len(extraction_result.get('missing_fields', []))} fields")
    
    return final_results
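
Step 4 matters because OCR libraries typically return NumPy values, which `json.dumps` cannot serialize. The real `convert_numpy_types` lives in `src/ocr/postprocess` and is not shown here, so the following is only a sketch of the idea. It relies on the fact that NumPy arrays and NumPy scalars both expose `tolist()`. The names `to_builtin` and `FakeArray` and the sample OCR line are invented for this example; `FakeArray` stands in for a NumPy value so the sketch runs without third-party imports.

```python
import json
from typing import Any


def to_builtin(value: Any) -> Any:
    """Recursively replace array-like values with plain Python objects."""
    if isinstance(value, dict):
        return {key: to_builtin(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_builtin(item) for item in value]
    if hasattr(value, "tolist"):
        # NumPy arrays and NumPy scalars both provide tolist()
        return value.tolist()
    return value


class FakeArray:
    """Hypothetical stand-in for a NumPy array or scalar."""

    def __init__(self, data: Any) -> None:
        self._data = data

    def tolist(self) -> Any:
        return self._data


# Invented sample record, shaped loosely like one OCR line
ocr_line = {
    "text": "Loan amount",
    "confidence": FakeArray(0.93),
    "bbox": FakeArray([[10, 20], [110, 20]]),
}

serialized = json.dumps(to_builtin(ocr_line))
```

Calling `json.dumps(ocr_line)` directly raises `TypeError`, which is the failure the conversion step is meant to prevent before `final_results` is stored or sent on.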

## Run the Pipeline

Execute the complete integrated pipeline to process the loan application document.
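
The cell below uses a top-level `await`, which works in Jupyter and IPython because an event loop is already running. In a plain Python script the coroutine would instead need to be started with `asyncio.run`. The sketch below shows that pattern; `fake_pipeline` is a hypothetical stand-in for `integrated_pipeline` so that the example runs on its own.

```python
import asyncio
from typing import Any, Dict


async def fake_pipeline(document_id: str) -> Dict[str, Any]:
    """Hypothetical stand-in for integrated_pipeline."""
    await asyncio.sleep(0)  # placeholder for the awaited LLM call
    return {"document_id": document_id, "status": "completed"}


def main() -> Dict[str, Any]:
    # asyncio.run creates an event loop, runs the coroutine to completion, then closes the loop
    return asyncio.run(fake_pipeline("test-document-001"))


if __name__ == "__main__":
    print(main()["status"])
```

`asyncio.run` raises `RuntimeError` when called from a running event loop, so this form belongs in scripts, while the notebook keeps the plain `await`.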

In [None]:
# Run the integrated pipeline with document parameters
print("=" * 50)
print("INTEGRATED PIPELINE TEST")
print("=" * 50)

# Use the document that was uploaded to blob storage
document_id = "test-document-001"
filename = "loan_application.pdf"
blob_path = f"{Stage.RAW.value}/{document_id}.pdf"

results = await integrated_pipeline(document_id, filename, blob_path)
print("=" * 50)
print("PIPELINE COMPLETED SUCCESSFULLY!")
print("=" * 50)

INFO:azure.core.pipeline.policies.http_logging_policy:Request URL: 'http://127.0.0.1:10000/devstoreaccount1/raw/test-document-001.pdf'
Request method: 'GET'
Request headers:
    'x-ms-range': 'REDACTED'
    'x-ms-version': 'REDACTED'
    'Accept': 'application/xml'
    'User-Agent': 'azsdk-python-storage-blob/12.26.0 Python/3.10.16 (macOS-15.1-arm64-arm-64bit)'
    'x-ms-date': 'REDACTED'
    'x-ms-client-request-id': 'c39cad62-89eb-11f0-9d9c-a526edf9c303'
    'Authorization': 'REDACTED'
No body was attached to the request
INFO:azure.core.pipeline.policies.http_logging_policy:Response status: 206
Response headers:
    'Server': 'Azurite-Blob/3.34.0'
    'last-modified': 'Fri, 05 Sep 2025 00:03:34 GMT'
    'x-ms-creation-time': 'REDACTED'
    'content-length': '147568'
    'content-type': 'application/octet-stream'
    'content-range': 'REDACTED'
    'etag': '"0x23DB91A2D5C27A0"'
    'x-ms-blob-type': 'REDACTED'
    'x-ms-lease-state': 'REDACTED'
    'x-ms-lease-status': 'REDACTED'
    

INTEGRATED PIPELINE TEST
Starting integrated pipeline for document: test-document-001
  - Filename: loan_application.pdf
  - Blob path: raw/test-document-001.pdf
Step 1: Loading document from blob storage...
Downloaded blob: raw/test-document-001.pdf
  - Loaded loan_application.pdf from blob storage (147568 bytes)
Step 2: OCR Processing...


INFO:src.ocr.easyocr_client:Processing PDF from bytes (size: 147568 bytes)
INFO:src.ocr.easyocr_client:Successfully converted PDF to 1 images


  - Extracted 62 text elements
Step 3: Normalizing OCR results...
  - Normalized to 26 structured items
Step 4: Converting NumPy types...
Step 5: LLM Field Extraction...
