## 1. Text Cleaning and Validation


> Use notebook 01_setup.ipynb to set up the environment and start the Docker services before running the following sections.

### 1.1 Import Libraries

In [1]:
from abc import ABC, abstractmethod
import aiohttp
import logging
from typing import Dict, List, Any, Optional
import json
import re
from dataclasses import asdict, dataclass
from pathlib import Path
from datetime import datetime

We're creating a dataclass:

**`DocumentTypeConfig`**: Represents the configuration for a specific document type (like "credit_request")
   - Contains field definitions, validation rules, and descriptions
   - Makes it easy to access structured configuration data
   - Provides type safety when working with document configurations

This automatically gives us:
- A constructor that takes these 4 parameters
- A nice string representation when we print the object
- Type hints for better IDE support

In [2]:
@dataclass
class DocumentTypeConfig:
    name: str
    expected_fields: List[str]
    field_descriptions: Dict[str, str]
    validation_rules: Dict[str, Any]
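
To make the generated behavior concrete, here is a minimal sketch that instantiates the class and prints it. The field values are made-up sample data, not taken from the real configuration file.

```python
from dataclasses import dataclass, asdict
from typing import Any, Dict, List


@dataclass
class DocumentTypeConfig:
    name: str
    expected_fields: List[str]
    field_descriptions: Dict[str, str]
    validation_rules: Dict[str, Any]


# Hypothetical sample values, for illustration only
sample_config = DocumentTypeConfig(
    name="credit_request",
    expected_fields=["company_name"],
    field_descriptions={"company_name": "Company Name"},
    validation_rules={"company_name": {"type": "string"}},
)

print(sample_config)          # generated __repr__ lists every field
print(asdict(sample_config))  # plain dict, ready for json.dumps
```

`asdict` is handy when a configuration needs to be logged or serialized, since a dataclass instance itself is not JSON-serializable.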

The following function loads the document type configuration from a `.conf` file in the config folder. Despite the extension, the file contains JSON and is parsed with `json.load`:

In [3]:
def load_document_config(config_path: str) -> Dict[str, DocumentTypeConfig]:
    """Load document configuration from JSON file."""
    with open(config_path, 'r', encoding='utf-8') as f:
        config_data = json.load(f)

    document_types = {}
    for doc_type, doc_config in config_data.items():
        document_types[doc_type] = DocumentTypeConfig(
            name=doc_config['name'],
            expected_fields=doc_config['expected_fields'],
            field_descriptions=doc_config['field_descriptions'],
            validation_rules=doc_config['validation_rules'],
        )

    return document_types

In [4]:
doc_config = load_document_config("../../config/document_types.conf")
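
The real `document_types.conf` is not shown in this notebook. Judging only by the four keys the loader reads (`name`, `expected_fields`, `field_descriptions`, `validation_rules`), a file of roughly the following shape should load. The field names and rule values below are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical content; only the four keys per document type are known
# from load_document_config
sample = {
    "credit_request": {
        "name": "Credit Request",
        "expected_fields": ["company_name", "purchase_price"],
        "field_descriptions": {
            "company_name": "Company Name",
            "purchase_price": "Purchase Price",
        },
        "validation_rules": {
            "purchase_price": {"type": "number", "min": 0},
        },
    }
}

REQUIRED_KEYS = {"name", "expected_fields", "field_descriptions", "validation_rules"}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "document_types.conf"
    path.write_text(json.dumps(sample, indent=2), encoding="utf-8")

    # The .conf extension is only a name; the content is parsed as JSON
    loaded = json.loads(path.read_text(encoding="utf-8"))

# Report document types that lack one of the keys the loader accesses
missing = {
    doc_type: sorted(REQUIRED_KEYS - set(cfg))
    for doc_type, cfg in loaded.items()
    if REQUIRED_KEYS - set(cfg)
}
print(missing or "all document types have the required keys")
```

A check like this before loading gives a clearer message than the `KeyError` the loader would raise on an incomplete entry.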

## 2. LLM Client Architecture

We're creating an abstract `LLMClient` class and a concrete `OllamaClient` implementation.

**Benefits:**
- **Abstraction**: Common interface for any LLM service
- **Flexibility**: Easy to swap between providers (Ollama, OpenAI, etc.)
- **Testability**: Can create mock clients for testing
- **Consistency**: All LLM clients use the same interface

**Design Pattern:**
- `LLMClient` (abstract): Defines the contract with `generate()` method
- `OllamaClient` (concrete): Implements actual API calls to Ollama

This allows us to easily add support for other LLM providers in the future without changing our core field extraction logic.
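
As an illustration of the testability point, here is a minimal sketch of a test double that satisfies the same contract. `FakeLLMClient` is a hypothetical name and is not part of this notebook; the abstract class is repeated so the example runs on its own.

```python
import asyncio
import json
from abc import ABC, abstractmethod


class LLMClient(ABC):
    """Same contract as described above: one async generate() method."""

    @abstractmethod
    async def generate(self, prompt: str) -> str:
        ...


class FakeLLMClient(LLMClient):
    """Hypothetical test double that returns a canned response."""

    def __init__(self, canned_response: str):
        self.canned_response = canned_response
        self.prompts = []  # record prompts so tests can inspect them

    async def generate(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self.canned_response


fake = FakeLLMClient(json.dumps({"extracted_fields": {"company_name": "DemoTech GmbH"}}))

# In a notebook cell, use `await fake.generate(...)` instead of asyncio.run,
# because the notebook already runs an event loop
reply = asyncio.run(fake.generate("Extract the fields"))
print(json.loads(reply)["extracted_fields"])
```

Because the extraction logic only depends on `generate()`, a client like this lets the parsing and validation steps be exercised without a running Ollama service.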

We're using `@dataclass` for `GenerativeLlm` because:

**Simple Data Container:**
- Just stores configuration (URL and model name)
- No complex logic needed
- Perfect use case for dataclasses

In [5]:
GENERATIVE_MODEL_URL = "http://127.0.0.1:11435"
MODEL_NAME = "llama3.1:8b"

In [6]:
@dataclass
class GenerativeLlm:
    url: str
    model_name: str
    

class LLMClient(ABC):
    """Abstract base class for LLM clients."""
    
    @abstractmethod
    async def generate(self, prompt: str) -> str:
        """Generate a response from the LLM."""
        pass
    

class OllamaClient(LLMClient):
    """Client for Ollama LLM service."""
    
    def __init__(self, base_url: str, model_name: str):
        self.base_url = base_url.rstrip('/')
        self.model_name = model_name
        
    async def generate(self, prompt: str) -> str:
        """Generate a response from Ollama."""
        timeout = aiohttp.ClientTimeout(total=120)  # 2 minutes timeout
        async with aiohttp.ClientSession(timeout=timeout) as session:
            try:
                async with session.post(
                    f"{self.base_url}/api/generate",
                    json={
                        "model": self.model_name,
                        "prompt": prompt,
                        "stream": False
                    }
                ) as response:
                    if response.status != 200:
                        error_text = await response.text()
                        raise RuntimeError(
                            f"Ollama API error (HTTP {response.status}): {error_text}"
                        )
                    
                    result = await response.json()
                    return result.get("response", "")
                    
            except Exception as e:
                # Report the cause before re-raising so failures are easier to debug
                print(f"Error calling Ollama API: {e}")
                raise

In [7]:
generative_llm = GenerativeLlm(
    url=GENERATIVE_MODEL_URL,
    model_name=MODEL_NAME,
)

In [8]:
llm_client = OllamaClient(
    base_url=generative_llm.url,
    model_name=generative_llm.model_name
)

**Core Functions for LLM Field Extraction**

These functions handle the complete pipeline from OCR data to structured field extraction:

- **`clean_value()`**: Converts and validates field values based on type
- **`extract_fields_with_llm()`**: Main async function that uses LLM to extract fields
- **`create_extraction_prompt()`**: Generates prompts for the LLM
- **`validate_field()`** & **`validate_extracted_fields()`**: Validate extracted data against business rules
- **`extract_json_from_response()`**: Parses LLM responses safely

The pipeline: OCR data → LLM extraction → Field validation → Structured output
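
The functions in the next cell read only a handful of keys from each OCR line. The sketch below shows the two line shapes that code handles (`label_value` and `text_line`) with made-up values. The full schema written by notebook 02 is not shown here, so treat this as an approximation; `format_for_prompt` is a hypothetical helper that mirrors the loop inside `create_extraction_prompt()`.

```python
from typing import Any, Dict, List

# Made-up sample lines; only the key names mirror what the notebook's code reads
ocr_lines: List[Dict[str, Any]] = [
    {"type": "label_value", "label": "Company Name",
     "value": "DemoTech Solutions GmbH", "confidence": 0.99, "page": 1},
    {"type": "text_line", "text": "Credit Request Form",
     "confidence": 0.95, "page": 1},
]


def format_for_prompt(lines: List[Dict[str, Any]]) -> List[str]:
    """Flatten OCR lines into the text that is embedded in the LLM prompt."""
    formatted = []
    for line in lines:
        if line["type"] == "label_value":
            formatted.append(f"{line['label']}: {line['value']}")
        elif line["type"] in ("text_line", "line"):
            formatted.append(line["text"])
    return formatted


print("\n".join(format_for_prompt(ocr_lines)))
```

Only the text reaches the LLM; confidence, bounding box, and page stay with the OCR line and are re-attached after the LLM has mapped values to field names.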

In [9]:
def clean_value(value: str, field_type: str) -> Any:
    """Clean and convert value based on field type."""
    if not value:
        return None

    if field_type == "string":
        return value.strip()
    
    elif field_type == "date":
        # Ensure date format DD.MM.YYYY
        if re.match(r"^\d{2}\.\d{2}\.\d{4}$", value):
            return value
        return None
    
    elif field_type == "currency":
        # Remove currency symbols, spaces, and convert comma to dot
        cleaned = value.replace("€", "").replace(" ", "").replace(",", ".")
        # Remove any non-numeric characters except decimal point
        cleaned = ''.join(c for c in cleaned if c.isdigit() or c == '.')
        try:
            return float(cleaned) if cleaned else None
        except ValueError:
            # More than one separator (e.g. "2,500,000") cannot be parsed here
            return None
    
    elif field_type == "area":
        # Remove unit and spaces
        cleaned = value.replace("m²", "").replace(" ", "")
        try:
            return float(cleaned) if cleaned else None
        except ValueError:
            # Not a plain number (e.g. a decimal comma such as "12,5")
            return None
    
    elif field_type == "number":
        # Remove any non-numeric characters
        cleaned = ''.join(c for c in value if c.isdigit())
        return int(cleaned) if cleaned else None
    
    elif field_type == "boolean":
        return "[x]" in value.lower()
    
    return value

def extract_fields_with_rules(ocr_lines: List[Dict[str, Any]], document_type: str = "credit_request") -> Dict[str, Any]:
    """
    Extract fields from OCR lines using configuration-based rules (no LLM).
    Returns a dictionary of field names to their values.

    This rule-based variant has its own name so that it is not shadowed by
    the async extract_fields_with_llm() defined later in this cell.
    """
    # Load document configuration
    config_path = Path("config/document_types.conf")
    if not config_path.exists():
        raise FileNotFoundError(f"Configuration file not found: {config_path}")
    
    config = load_document_config(str(config_path))
    if document_type not in config:
        raise ValueError(f"Unknown document type: {document_type}")
    
    # Extract fields from OCR lines
    extracted_fields = {}
    # Assumption: the per-field rules (optional "label" and "type") are the
    # entries of validation_rules; adjust if your config stores them elsewhere
    field_config = config[document_type].validation_rules
    
    # Map OCR lines to fields
    for line in ocr_lines:
        if line["type"] != "line":
            continue

        text = line["text"].strip()
        confidence = line.get("confidence", 0.5)

        # Check each field's label in the configuration
        for field_name, field_rules in field_config.items():
            label = field_rules.get("label")
            if label and label in text:
                # Extract value by removing the label
                value = text.replace(label, "").strip()
                # Clean and convert value based on field type
                field_type = field_rules.get("type", "string")
                cleaned_value = clean_value(value, field_type)
                if cleaned_value is not None:
                    extracted_fields[field_name] = cleaned_value
                break

    return extracted_fields

def extract_json_from_response(response: str) -> Dict[str, Any]:
    """Extract JSON from LLM response, handling potential text prefixes and comments."""
    try:
        # Find JSON between code blocks if present
        if "```" in response:
            # Find the first code block
            start = response.find("```")
            if start != -1:
                # Skip the opening ```
                start = response.find("\n", start) + 1
                # Find the closing ```
                end = response.find("```", start)
                if end != -1:
                    response = response[start:end].strip()
        
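        # Try to parse the JSON as-is first
        try:
            return json.loads(response)
        except json.JSONDecodeError:
            pass
        
        # Fall back to removing // comments. This is a heuristic: it also
        # truncates string values that contain "//", such as URLs
        lines = []
        for line in response.split('\n'):
            if '//' in line:
                line = line[:line.find('//')]
            lines.append(line)
        response = '\n'.join(lines)
        
        # Try to parse the cleaned JSON
        return json.loads(response)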
    except json.JSONDecodeError as e:
        raise ValueError(f"Invalid JSON in response: {e}")

def create_extraction_prompt(ocr_lines: List[Dict[str, Any]], config: DocumentTypeConfig) -> str:
    """Create a prompt for field extraction."""
    # Get field descriptions (attribute or dict)
    field_descs = (
        config.field_descriptions
        if hasattr(config, "field_descriptions")
        else config.get("field_descriptions", {})
    )

    # Format field descriptions: "<db_key>: <human label>"
    field_descriptions = [f"- {field}: {desc}" for field, desc in field_descs.items()]

    formatted_lines = []
    for line in ocr_lines:
        if line["type"] == "label_value":
            formatted_lines.append(f"{line['label']}: {line['value']}")
        elif line["type"] in ("text_line", "line"):
            formatted_lines.append(line["text"])

    # Construct the prompt
    prompt = f"""Extract the following fields from the document content below. Return a valid JSON object with the extracted fields.

Field Descriptions:
{chr(10).join(field_descriptions)}

Document Content:
{chr(10).join(formatted_lines)}

Instructions:
1. Return a valid JSON object with the extracted fields
2. Use the exact field names from the field descriptions above
3. Include only fields that are present in the document
4. For fields with units (e.g., years, currency), include the unit in the value
5. For boolean fields, return true/false
6. For dates, use the format DD.MM.YYYY
7. For numbers, include any units or currency symbols

Example response format:
{{
    "extracted_fields": {{
        "company_name": "DemoTech GmbH",
        "legal_form": "GmbH",
        "founding_date": "01.01.2020",
        "business_address": "Musterstraße 123, 12345 Berlin",
        "purchase_price": "€500.000",
        "term": "20 Years",
        "interest_rate": "3,5%"
    }},
    "missing_fields": ["website", "vat_id"],
    "validation_results": {{
        "company_name": {{"valid": true}},
        "legal_form": {{"valid": true}},
        "founding_date": {{"valid": true}}
    }}
}}

Please extract the fields from the document content above and return a JSON object in this format."""
    return prompt

def validate_field(value: Any, rules: Dict[str, Any]) -> Dict[str, Any]:
    """Validate a field value against validation rules."""
    validation_result = {
        "is_valid": True,
        "errors": []
    }
    
    if not isinstance(value, dict) or "value" not in value:
        validation_result["is_valid"] = False
        validation_result["errors"].append("Invalid field format")
        return validation_result
    
    field_value = value["value"]
    
    # Type validation
    expected_type = rules.get("type")
    numeric_value = None  # Parsed once here and reused by the range checks
    if expected_type == "number":
        try:
            # Handle German number format (1.234,56). Normalize a copy only
            # once: a second pass would strip the decimal point again.
            normalized = field_value
            if isinstance(normalized, str):
                normalized = normalized.replace(".", "").replace(",", ".")
            numeric_value = float(normalized)
        except (ValueError, TypeError):
            validation_result["is_valid"] = False
            validation_result["errors"].append("Value must be a number")
    elif expected_type == "boolean":
        if str(field_value).lower() not in ["true", "false"]:
            validation_result["is_valid"] = False
            validation_result["errors"].append("Value must be a boolean")
    # Dates get no type check here; use a "pattern" rule to validate them
    
    # Range validation (only for values that parsed as numbers)
    if numeric_value is not None:
        if "min" in rules and numeric_value < rules["min"]:
            validation_result["is_valid"] = False
            validation_result["errors"].append(f"Value must be at least {rules['min']}")
        if "max" in rules and numeric_value > rules["max"]:
            validation_result["is_valid"] = False
            validation_result["errors"].append(f"Value must be at most {rules['max']}")
    
    # Pattern validation (re is imported at the top of the notebook)
    if "pattern" in rules:
        if not re.match(rules["pattern"], str(field_value)):
            validation_result["is_valid"] = False
            validation_result["errors"].append("Value does not match required pattern")
    
    return validation_result

def validate_extracted_fields(fields: Dict[str, Any], doc_config: DocumentTypeConfig) -> Dict[str, Any]:
    """Validate all extracted fields against their validation rules."""
    validation_results = {}
    for field_name, field_data in fields.items():
        if field_name in doc_config.validation_rules:
            validation_results[field_name] = validate_field(field_data, doc_config.validation_rules[field_name])
    return validation_results

async def extract_fields_with_llm(
    ocr_lines: List[Dict[str, Any]],
    doc_config: DocumentTypeConfig,
    original_ocr_lines: Optional[List[Dict[str, Any]]] = None
) -> Dict[str, Any]:
    """
    Extract fields from OCR lines using LLM.
    The LLM is only used to map OCR text to field names.
    Original OCR data (value, confidence, bounding box, page) is preserved.
    
    Uses the module-level `llm_client` created earlier in this notebook.
    
    Args:
        ocr_lines: List of OCR lines with text and metadata
        doc_config: Document type configuration
        original_ocr_lines: Optional list of original OCR lines for reference
        
    Returns:
        Dictionary containing extracted fields, missing fields, and validation results
    """
    if not ocr_lines:
        return {
            "extracted_fields": {},
            "missing_fields": list(doc_config.expected_fields),
            "validation_results": {}
        }
        
    # Step 1: Let LLM map OCR text to field names
    prompt = create_extraction_prompt(ocr_lines, doc_config)
    response = await llm_client.generate(prompt)
    
    # A ValueError raised for malformed JSON propagates to the caller
    llm_result = extract_json_from_response(response)
        
    # Step 2: Process extracted fields
    extracted_fields = {}
    for field_name, field_data in llm_result.get("extracted_fields", {}).items():
        # Ensure field data is a dictionary
        if not isinstance(field_data, dict):
            field_data = {"value": field_data}
            
        # Ensure required keys exist
        if "value" not in field_data:
            field_data["value"] = None
            
        # Step 3: Find matching normalized label-value pair
        if field_data["value"] is not None:
            value_str = str(field_data["value"]).lower()
            
            # Get all possible labels for this field
            # Use the DB key and its human description as candidate labels
            df_field_names = []
            try:
                field_desc = doc_config.field_descriptions.get(field_name, "")
            except AttributeError:
                # If doc_config is a dict
                field_desc = (doc_config.get("field_descriptions", {}) or {}).get(field_name, "")
            df_field_names = [field_name.lower()]
            if field_desc:
                df_field_names.append(str(field_desc).lower())
            
            # First try to find a matching label-value pair
            matching_pair = None
            for line in ocr_lines:
                if line["type"] == "label_value":
                    line_label = line["label"].lower()
                    line_value = line["value"].lower()
                    
                    # Match if either the label or value matches
                    if (any(label in line_label for label in df_field_names) or 
                        value_str in line_value):
                        matching_pair = line
                        break
            
            if matching_pair:
                # Use the label-value pair's data directly
                extracted_fields[field_name] = {
                    "value": matching_pair["value"],
                    "confidence": matching_pair.get("confidence", 0.5),
                    "bounding_box": matching_pair.get("bounding_box"),
                    "page": matching_pair.get("page")
                }
            else:
                # If no matching pair found, try to find matching OCR line
                matching_line = None
                if original_ocr_lines:
                    for line in original_ocr_lines:
                        line_text = line["text"].lower()
                        
                        # Match if line contains either the value or any of the field's labels
                        if value_str in line_text or any(label in line_text for label in df_field_names):
                            matching_line = line
                            break
                
                if matching_line:
                    # Use the OCR line's data directly
                    extracted_fields[field_name] = {
                        "value": matching_line["text"],
                        "confidence": matching_line.get("confidence", 0.5),
                        "bounding_box": matching_line.get("bounding_box"),
                        "page": matching_line.get("page")
                    }
                else:
                    # If no matching line found, use LLM output with default confidence
                    extracted_fields[field_name] = {
                        "value": field_data["value"],
                        "confidence": 0.5
                    }
        else:
            # If no value provided, use LLM output with default confidence
            extracted_fields[field_name] = {
                "value": field_data["value"],
                "confidence": 0.5
            }
            
    # Step 4: Apply field mappings
    mapped_fields = extracted_fields
            
    # Step 5: Validate fields
    validation_results = validate_extracted_fields(mapped_fields, doc_config)
    
    # Prepare final result
    result = {
        "extracted_fields": mapped_fields,
        "missing_fields": llm_result.get("missing_fields", []),
        "validation_results": validation_results
    }
    
    return result 
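
LLM replies often wrap the JSON in prose or a fenced block. The following is a minimal, self-contained sketch of that parsing step, not the notebook's own function: `parse_llm_json` is a hypothetical name. Unlike `extract_json_from_response()`, it never strips `//` sequences, since those also occur inside URL values.

```python
import json
import re
from typing import Any, Dict

FENCE = "`" * 3  # built dynamically so this example can itself sit in a fenced block

# Matches the body of the first fenced block, with an optional language tag
FENCED_BLOCK = re.compile(FENCE + r"[a-zA-Z]*\s*\n(.*?)" + FENCE, re.DOTALL)


def parse_llm_json(response: str) -> Dict[str, Any]:
    """Parse JSON from an LLM reply that may wrap it in prose or a fence."""
    match = FENCED_BLOCK.search(response)
    candidate = match.group(1) if match else response
    # Fall back to the outermost braces if prose surrounds the object
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in response")
    try:
        return json.loads(candidate[start:end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"Invalid JSON in response: {exc}") from exc


reply = (
    "Here is the result:\n"
    f"{FENCE}json\n"
    '{"extracted_fields": {"website": "https://example.com"}}\n'
    f"{FENCE}\n"
)
print(parse_llm_json(reply))
```

Raising `ValueError` in both failure cases keeps the caller's error handling simple: one exception type means "the model did not return usable JSON".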

## 3. Load OCR Data from Blob Storage

Now let's load the OCR output that was stored by notebook 02 and apply our LLM functions to extract fields.

**Document Processing State Machine**

We use a state machine to track documents through the processing pipeline:

**States:**
- **`RAW`**: Raw document input (PDFs, images)
- **`OCR`**: Documents processed with OCR, ready for LLM analysis
- **`LLM`**: Documents processed by LLM, fields extracted and validated

**Why Use This Pattern:**
- **Traceability**: Track each document's processing status
- **Storage Organization**: Separate containers for each stage
- **Error Recovery**: Can restart from any stage if processing fails
- **Scalability**: Easy to add new processing stages

**Implementation:**
- Each stage has its own blob storage container
- Documents move through stages as they're processed
- Thread-safe singleton ensures consistent state management
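
The stage progression described above can be sketched as a small transition table. `NEXT_STAGE`, `next_stage`, and `blob_path` are hypothetical helpers for illustration; the notebook itself only defines the `Stage` enum and derives the container name from its value.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    RAW = "raw"
    OCR = "ocr"
    LLM = "llm"


# Allowed forward transitions; None marks the final stage
NEXT_STAGE = {Stage.RAW: Stage.OCR, Stage.OCR: Stage.LLM, Stage.LLM: None}


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows `stage`, or None if it is the last one."""
    return NEXT_STAGE[stage]


def blob_path(document_uuid: str, stage: Stage, ext: str = ".json") -> str:
    """Container name comes from the stage value, blob name from the UUID."""
    return f"{stage.value}/{document_uuid}{ext}"


print(next_stage(Stage.OCR))
print(blob_path("example-uuid", Stage.LLM))
```

Keeping the same blob name in every container means a document's progress can be checked simply by testing which containers hold its UUID.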

In [10]:
# Local connection string to Azurite
connection_string = "DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;"

In [11]:
import os
import threading
from enum import Enum
from pathlib import PurePosixPath
from typing import Optional

from azure.storage.blob import BlobServiceClient, BlobClient
from azure.core.exceptions import ResourceExistsError


class Stage(Enum):
    """Processing stages for credit documents."""
    RAW = "raw"
    OCR = "ocr"
    LLM = "llm"


class BlobStorage:
    """Thread-safe singleton for blob storage operations."""
    
    _instance: Optional['BlobStorage'] = None
    _lock = threading.Lock()
    
    def __new__(cls):
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
        return cls._instance
    
    def __init__(self):
        if hasattr(self, '_initialized'):
            return
        
        self._connection_string = connection_string
        self._blob_service_client = None
        self._initialized_containers = set()
        self._container_lock = threading.Lock()
        self._initialized = True
    
    @property
    def blob_service_client(self) -> BlobServiceClient:
        """Get blob service client, initializing if needed."""
        if self._blob_service_client is None:
            self._blob_service_client = BlobServiceClient.from_connection_string(self._connection_string)
        return self._blob_service_client
    
    def _ensure_container_exists(self, container_name: str) -> None:
        """Ensure a specific container exists."""
        if container_name in self._initialized_containers:
            return
            
        with self._container_lock:
            if container_name in self._initialized_containers:
                return
                
            try:
                container_client = self.blob_service_client.get_container_client(container_name)
                container_client.create_container()
            except ResourceExistsError:
                pass
            
            self._initialized_containers.add(container_name)
    
    def blob_exists(self, uuid: str, stage: Stage, ext: str) -> bool:
        """Check if a blob exists at a specific stage."""
        try:
            container_name = stage.value
            self._ensure_container_exists(container_name)
            container_client = self.blob_service_client.get_container_client(container_name)
            blob_client = container_client.get_blob_client(f"{uuid}{ext}")
            blob_client.get_blob_properties()
            return True
        except Exception:
            return False
    
    def download_blob(self, uuid: str, stage: Stage, ext: str) -> Optional[bytes]:
        """Download data from a blob at a specific stage."""
        try:
            container_name = stage.value
            self._ensure_container_exists(container_name)
            container_client = self.blob_service_client.get_container_client(container_name)
            blob_client = container_client.get_blob_client(f"{uuid}{ext}")
            blob_data = blob_client.download_blob()
            data = blob_data.readall()
            return data
        except Exception as e:
            print(f"Failed to download blob: {e}")
            return None
    
    def list_blobs_in_stage(self, stage: Stage) -> list[str]:
        """List all blobs in a specific stage container."""
        container_name = stage.value
        self._ensure_container_exists(container_name)
        container_client = self.blob_service_client.get_container_client(container_name)
        
        blob_names = []
        try:
            blob_list = container_client.list_blobs()
            for blob in blob_list:
                blob_names.append(blob.name)
            return blob_names
        except Exception as e:
            print(f"Failed to list blobs: {e}")
            return []


def get_storage() -> BlobStorage:
    """Get the singleton BlobStorage instance."""
    return BlobStorage()


def read_ocr_results_from_bucket(document_uuid: str) -> Optional[Dict[str, Any]]:
    """Read OCR results from blob storage bucket."""
    storage_client = get_storage()
    
    if not storage_client.blob_exists(document_uuid, Stage.OCR, ".json"):
        print(f"OCR results not found for document: {document_uuid}")
        return None
    
    blob_data = storage_client.download_blob(document_uuid, Stage.OCR, ".json")
    
    if blob_data is None:
        return None
    
    try:
        json_string = blob_data.decode('utf-8')
        ocr_data = json.loads(json_string)
        return ocr_data
    except Exception as e:
        print(f"Failed to parse OCR results: {e}")
        return None


def list_ocr_results_in_bucket() -> list[str]:
    """List all OCR result files in the bucket."""
    storage_client = get_storage()
    
    try:
        blob_names = storage_client.list_blobs_in_stage(Stage.OCR)
        
        document_uuids = []
        for blob_name in blob_names:
            if blob_name.endswith('.json'):
                # Strip only the trailing extension, not other ".json" occurrences
                uuid_part = blob_name[:-len('.json')]
                document_uuids.append(uuid_part)
        
        return document_uuids
    except Exception as e:
        print(f"Failed to list OCR results: {e}")
        return []
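
To see why the lock inside `__new__` matters, here is a minimal sketch of the same singleton approach exercised from several threads. `Registry` is a hypothetical stand-in for `BlobStorage` so the example runs without Azure dependencies.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Registry:
    """Hypothetical singleton using the same lock-in-__new__ approach."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        with cls._lock:  # only one thread at a time may create the instance
            if cls._instance is None:
                cls._instance = super().__new__(cls)
        return cls._instance


with ThreadPoolExecutor(max_workers=8) as pool:
    instances = list(pool.map(lambda _: Registry(), range(100)))

# Count the distinct objects returned across all calls
print(len({id(obj) for obj in instances}))
```

Note that Python still calls `__init__` on every `Registry()` call, which is why `BlobStorage.__init__` returns early when `_initialized` is already set.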

**Using the storage functions**

In [12]:
# List the OCR results available in the "ocr" stage container
print("Looking for OCR results in blob storage...")
stored_documents = list_ocr_results_in_bucket()

if not stored_documents:
    print("No OCR results found in storage. Please run notebook 02 first to generate OCR data.")
else:
    print(f"Found {len(stored_documents)} stored documents:")
    for i, doc_uuid in enumerate(stored_documents):
        print(f"  {i+1}. {doc_uuid}")
    
    # Load the first document
    document_uuid = stored_documents[0]
    print(f"\nLoading OCR results for document: {document_uuid}")
    
    ocr_data = read_ocr_results_from_bucket(document_uuid)
    
    if ocr_data:
        print("Successfully loaded OCR data from blob storage")
        print(f"Document UUID: {ocr_data['document_uuid']}")
        print(f"Timestamp: {ocr_data['timestamp']}")
        
        # Extract the OCR results
        stored_ocr_results = ocr_data['ocr_results']
        normalized_lines = stored_ocr_results['normalized_lines']
        
        print(f"\nOCR Data Structure:")
        print(f"  - Normalized lines: {len(normalized_lines)}")
        
        # Count different types
        label_value_count = sum(1 for item in normalized_lines if item['type'] == 'label_value')
        text_line_count = sum(1 for item in normalized_lines if item['type'] == 'text_line')
        
        print(f"  - Label-value pairs: {label_value_count}")
        print(f"  - Text lines: {text_line_count}")
        
        # Show sample data
        print(f"\nSample label-value pairs:")
        label_value_pairs = [item for item in normalized_lines if item['type'] == 'label_value']
        for i, pair in enumerate(label_value_pairs[:5]):
            print(f"  {i+1}. {pair['label']} → {pair['value'][:30]}{'...' if len(pair['value']) > 30 else ''}")
    else:
        print("Failed to load OCR data")

Looking for OCR results in blob storage...
Found 2 stored documents:
  1. 945a6da1-a563-434e-ac6a-b992166b17fe
  2. de196c17-0f61-4a03-96b4-b17b0fd4102c

Loading OCR results for document: 945a6da1-a563-434e-ac6a-b992166b17fe
Successfully loaded OCR data from blob storage
Document UUID: 945a6da1-a563-434e-ac6a-b992166b17fe
Timestamp: 2025-09-04T22:15:14.746600

OCR Data Structure:
  - Normalized lines: 43
  - Label-value pairs: 26
  - Text lines: 17

Sample label-value pairs:
  1. 1 → Applicant
  2. Company Name → DemoTech Solutions GmbH
  3. Legal Form → Limited Liability Company (Gmb...
  4. Date of Incorporation → 12/05/2018
  5. Business Address → Main Street 123, 70173 Stuttga...


## 4. Apply LLM Functions to Extract Fields

Now let's apply our LLM functions to the loaded OCR data to extract structured fields.

In [13]:
# Apply LLM extraction to the loaded OCR data
if 'normalized_lines' in locals():
    print("Applying LLM field extraction to OCR data...")
    
    # Use the existing LLM client and document config
    # try:
        # Extract fields using LLM
    extraction_result = await extract_fields_with_llm(
        ocr_lines=normalized_lines,
        doc_config=doc_config["credit_request"],
        original_ocr_lines=stored_ocr_results.get('original_lines', [])
    )
    
    print("LLM field extraction completed successfully!")
    
    # Display results
    extracted_fields = extraction_result['extracted_fields']
    missing_fields = extraction_result['missing_fields']
    validation_results = extraction_result['validation_results']
    
    print(f"\nExtraction Results:")
    print(f"  - Extracted fields: {len(extracted_fields)}")
    print(f"  - Missing fields: {len(missing_fields)}")
    print(f"  - Validation results: {len(validation_results)}")
    
    print(f"\nExtracted Fields:")
    for field_name, field_data in extracted_fields.items():
        value = field_data.get('value', 'N/A')
        confidence = field_data.get('confidence', 0.0)
        print(f"  • {field_name}: {value} (confidence: {confidence:.3f})")
    
    if missing_fields:
        print(f"\nMissing Fields:")
        for field in missing_fields:
            print(f"  • {field}")
    
    if validation_results:
        print(f"\nValidation Results:")
        for field_name, validation in validation_results.items():
            is_valid = validation.get('is_valid', False)
            errors = validation.get('errors', [])
            status = "Valid" if is_valid else "Invalid"
            print(f"  • {field_name}: {status}")
            if errors:
                for error in errors:
                    print(f"    - {error}")
        
    # except Exception as e:
    #     print(f"Error during LLM extraction: {e}")
else:
    print("No OCR data loaded. Please run the previous cell first.")

Applying LLM field extraction to OCR data...
LLM field extraction completed successfully!

Extraction Results:
  - Extracted fields: 21
  - Missing fields: 0
  - Validation results: 21

Extracted Fields:
  • company_name: DemoTech Solutions GmbH (confidence: 0.994)
  • legal_form: Limited Liability Company (GmbH) (confidence: 0.803)
  • founding_date: 12.05.2018 (confidence: 0.500)
  • business_address: Main Street 123, 70173 Stuttgart, Germany (confidence: 0.721)
  • commercial_register: HRB 123456 / Stuttgart Local Court (confidence: 0.978)
  • vat_id: DE123456789 (confidence: 0.919)
  • property_type: Office and Commercial Building (confidence: 0.995)
  • property_name: Innovation Center Stuttgart (confidence: 0.595)
  • property_address: Tech Park 45, 70191 Stuttgart, Germany (confidence: 0.683)
  • purchase_price: €2,500,000 (confidence: 0.500)
  • requested_amount: €2,000,000 (confidence: 0.715)
  • purpose: Purchase and Renovation (confidence: 0.827)
  • equity_share: €500,000 (

## 5. Save LLM Results to Blob Storage

Let's save the LLM extraction results back to blob storage for the next processing stage.

In [14]:
# Save LLM results to blob storage
if 'extraction_result' in locals():
    print("Saving LLM extraction results to blob storage...")
    
    # Prepare LLM results data
    llm_results = {
        "document_uuid": document_uuid,
        "timestamp": datetime.utcnow().isoformat(),
        "llm_results": extraction_result,
        "metadata": {
            "source_ocr_uuid": document_uuid,
            "stage": "llm",
            "model_name": MODEL_NAME,
            "extraction_method": "llm_field_extraction"
        }
    }
    
    # Convert to JSON and save
    try:
        llm_data_bytes = json.dumps(llm_results, indent=2, ensure_ascii=False).encode('utf-8')
        
        storage_client = get_storage()
        container_name = Stage.LLM.value
        storage_client._ensure_container_exists(container_name)
        container_client = storage_client.blob_service_client.get_container_client(container_name)
        blob_client = container_client.get_blob_client(f"{document_uuid}.json")
        blob_client.upload_blob(llm_data_bytes, overwrite=True)
        
        print(f"LLM results saved to blob storage: {container_name}/{document_uuid}.json")
        
        # Verify the save
        if storage_client.blob_exists(document_uuid, Stage.LLM, ".json"):
            print("Verification: LLM results confirmed in storage")
        else:
            print("Verification failed: LLM results not found in storage")
            
    except Exception as e:
        print(f"Error saving LLM results: {e}")
else:
    print("No LLM extraction results to save. Please run the previous cell first.")

Saving LLM extraction results to blob storage...
LLM results saved to blob storage: llm/945a6da1-a563-434e-ac6a-b992166b17fe.json
Verification: LLM results confirmed in storage
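
The serialization step can be checked without blob storage. The sketch below builds a shortened, hypothetical payload in the same shape as `llm_results` and round-trips it through bytes. It uses `datetime.now(timezone.utc)` as a timezone-aware alternative to `datetime.utcnow()`, which is deprecated since Python 3.12; that substitution is a suggestion, not what the cell above does.

```python
import json
from datetime import datetime, timezone

# Hypothetical, shortened payload in the same shape as `llm_results`
payload = {
    "document_uuid": "example-uuid",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "llm_results": {
        "extracted_fields": {
            "business_address": {
                "value": "Musterstraße 123, 12345 Berlin",
                "confidence": 0.5,
            }
        },
        "missing_fields": [],
        "validation_results": {},
    },
}

# ensure_ascii=False keeps characters such as "ß" and "€" readable in the blob
data = json.dumps(payload, indent=2, ensure_ascii=False).encode("utf-8")

# A later stage decodes the bytes and gets the same structure back
restored = json.loads(data.decode("utf-8"))
print(restored["llm_results"]["extracted_fields"]["business_address"]["value"])
```

Encoding explicitly as UTF-8 on write and decoding as UTF-8 on read keeps non-ASCII field values intact across stages.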


## 6. Summary

### What We've Accomplished

- **Loaded OCR Data**: Retrieved structured OCR results from blob storage
- **LLM Field Extraction**: Applied LLM functions to extract structured fields
- **Field Validation**: Validated extracted fields against business rules
- **Results Storage**: Saved LLM results back to blob storage
- **Complete Pipeline**: Connected OCR processing with LLM analysis

### Processing Pipeline

1. **OCR Processing** (Notebook 02) → Extract text with spatial analysis
2. **Storage** (Notebook 02) → Save OCR results to blob storage
3. **Data Loading** (This notebook) → Retrieve OCR data from storage
4. **LLM Analysis** (This notebook) → Extract structured fields using LLM
5. **Results Storage** (This notebook) → Save LLM results for next stage

### Next Steps

- **Database Integration**: Store extracted fields in PostgreSQL
- **API Development**: Create endpoints to serve processed data
- **Frontend Integration**: Display results with confidence scores
- **Validation Workflows**: Implement manual review processes

The document processing pipeline is now complete from OCR extraction through LLM analysis!