# 🇺🇦 Ukrainian OCR Pipeline - Local Environment Demo

This notebook demonstrates the **Ukrainian OCR Pipeline Package** in a local development environment.

## Features:
- **GPU/CPU Auto-detection** with performance optimization
- **Two-Stage Processing**: Segmentation → Recognition & Enhancement
- **Complete Pipeline**: Kraken → TrOCR → NER → Surname Matching → Enhanced ALTO
- **Professional Output**: ALTO XML v4, visualizations, and detailed reports

## Requirements:
- Ukrainian OCR package installed (`pip install -e .` from the package directory)
- Document image in a supported format (JPG, PNG, TIFF)
- Python 3.8+ with PyTorch
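
The requirements above can be verified before running the rest of the notebook. The following is a minimal sketch using only the standard library; the helper name `check_environment` is hypothetical, and the module list is simply taken from the imports this notebook uses.

```python
import importlib.util
import sys

# Import names used by this notebook (OpenCV is imported as `cv2`, Pillow as `PIL`)
REQUIRED_MODULES = ["torch", "numpy", "cv2", "PIL", "matplotlib", "ukrainian_ocr"]


def check_environment(modules=REQUIRED_MODULES, min_python=(3, 8)):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, found {sys.version.split()[0]}"
        )
    for name in modules:
        # find_spec locates a top-level module without importing it
        if importlib.util.find_spec(name) is None:
            problems.append(f"Missing module: {name}")
    return problems


for problem in check_environment():
    print(f"❌ {problem}")
```

If nothing is printed, the interpreter version is sufficient and every listed module can be located. This does not confirm that the installed versions are compatible with each other.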

## 📦 Package Import & Setup

In [None]:
import os
import sys
import time
import json
from pathlib import Path
from typing import Dict, List, Optional

# Core libraries
import torch
import numpy as np
import cv2
from PIL import Image
import matplotlib.pyplot as plt

# Add package to path if running from examples directory
if os.path.exists('../ukrainian_ocr'):
    sys.path.insert(0, '..')
    print("✅ Using local development version")

# Import Ukrainian OCR Package
from ukrainian_ocr import UkrainianOCRPipeline
from ukrainian_ocr.core.config import OCRPipelineConfig

print("✅ Ukrainian OCR Package loaded")
print(f"📍 Package location: {sys.modules['ukrainian_ocr'].__file__}")

## 🖥️ Hardware Detection & Configuration

In [None]:
print("🖥️ Hardware Detection:")
print(f"PyTorch version: {torch.__version__}")

# GPU detection and optimization
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"✅ GPU: {gpu_name} ({gpu_memory:.1f}GB VRAM)")
    device = 'cuda'
    batch_size = 8 if gpu_memory > 12 else 4  # Optimize batch size for local GPU
else:
    print("⚠️ GPU not available - using CPU")
    device = 'cpu'
    batch_size = 1

print(f"🎯 Selected device: {device}")
print(f"📦 Batch size: {batch_size}")

# Create optimized configuration for local environment
config = {
    'device': device,
    'batch_size': batch_size,
    'verbose': True,
    'save_intermediate': True,  # Save all intermediate files locally
    
    'ocr': {
        'model_path': 'cyrillic-trocr/trocr-handwritten-cyrillic',
        'device': device,
        'batch_size': batch_size
    },
    
    'ner': {
        'backend': 'transformers',  # Best quality for local processing
        'device': device,
        'confidence_threshold': 0.7
    },
    
    'surname_matching': {
        'enabled': True,
        'threshold': 0.8,
        'use_phonetic': True,
        'export_matches': True,
        'surnames': [
            'Шевченко', 'Коваленко', 'Бондаренко', 'Ткаченко', 'Кравченко',
            'Петренко', 'Іваненко', 'Михайленко', 'Василенко', 'Григоренко',
            'Ковальчук', 'Савченко', 'Левченко', 'Павленко', 'Марченко',
            'Мельник', 'Коваль', 'Гончар', 'Кравець', 'Швець',
            'Жук', 'Козлов', 'Мороз', 'Терещенко', 'Рибалко'
        ]
    },
    
    'post_processing': {
        'extract_person_regions': True,
        'clustering_eps': 300
    }
}

print("✅ Configuration optimized for local environment")
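
The batch-size rule used in the cell above can be isolated into a small function, which makes it easier to test and adjust. This is a sketch only: the name `choose_batch_size` is hypothetical, and the thresholds are copied from the cell above rather than tuned values.

```python
from typing import Optional


def choose_batch_size(vram_gb: Optional[float]) -> int:
    """Mirror the notebook's heuristic: 1 on CPU, 4 on GPUs up to 12 GB, 8 above that."""
    if vram_gb is None:  # no CUDA device available
        return 1
    return 8 if vram_gb > 12 else 4


print(choose_batch_size(None), choose_batch_size(8.0), choose_batch_size(24.0))
# prints: 1 4 8
```

If OCR runs out of GPU memory on large pages, lowering the returned value is usually the first thing to try.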

## 📄 Load Test Document

In [None]:
# Define your test image path here
test_image_paths = [
    "/home/maria/ssd990/projects/tarkovsky/Н-2982_4_1001/804-03494422-l-m-a-n2982-4-1001-00002.jpg",
    "../test_images/ukrainian_document.jpg",
    "./sample_document.jpg",
    # Add your document paths here
]

test_image_path = None
for path in test_image_paths:
    if os.path.exists(path):
        test_image_path = path
        break

if test_image_path:
    # Load and display document information
    image_size = os.path.getsize(test_image_path) / (1024 * 1024)
    
    with Image.open(test_image_path) as img:
        width, height = img.size
        
        print(f"📄 Document: {os.path.basename(test_image_path)}")
        print(f"📊 Size: {image_size:.1f} MB")
        print(f"📐 Dimensions: {width} x {height} pixels")
        print(f"📝 Format: {img.format}")
        
        # Display preview
        plt.figure(figsize=(12, 8))
        plt.imshow(img)
        plt.title(f"Document: {os.path.basename(test_image_path)}")
        plt.axis('off')
        plt.show()
else:
    print("❌ No test document found!")
    print("Please update the test_image_paths list with your document path.")
    print("Supported formats: JPG, PNG, TIFF")

## 🏁 Stage 1: Document Segmentation & Basic ALTO

In [None]:
if not test_image_path:
    print("❌ Cannot proceed without a document. Please check the cell above.")
else:
    print("🚀 STAGE 1: Document Segmentation & Basic ALTO Creation")
    print("=" * 60)
    
    # Initialize pipeline
    pipeline_config = OCRPipelineConfig.from_dict(config)
    pipeline = UkrainianOCRPipeline(config=pipeline_config)
    
    print(f"✅ Pipeline ready (device: {pipeline.device})")
    
    # Create output directory
    output_dir = Path("./ukrainian_ocr_output")
    output_dir.mkdir(exist_ok=True)
    
    # Load and process image
    print("\n🔍 Kraken Segmentation...")
    start_time = time.time()
    
    # Initialize components
    pipeline._init_components()
    
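    # Load image (cv2.imread returns None instead of raising on failure)
    image = cv2.imread(test_image_path)
    if image is None:
        raise ValueError(f"Could not read image: {test_image_path}")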
    
    # Segment image
    lines = pipeline.segmenter.segment_image(image)
    seg_time = time.time() - start_time
    
    print(f"✅ Segmentation complete: {seg_time:.2f}s")
    print(f"📊 Detected {len(lines)} text lines")
    
    # Create basic ALTO XML
    print("\n📄 Creating basic ALTO XML...")
    basic_alto_xml = pipeline._create_alto_xml(Path(test_image_path), image, lines)
    
    basic_alto_path = output_dir / f"{Path(test_image_path).stem}_basic_alto.xml"
    with open(basic_alto_path, 'w', encoding='utf-8') as f:
        f.write(basic_alto_xml)
    
    print(f"✅ Basic ALTO created: {basic_alto_path}")
    
    # Create visualization
    print("\n🎨 Creating segmentation visualization...")
    vis_image = image.copy()
    colors = [(0, 255, 0), (255, 0, 0), (0, 0, 255), (255, 255, 0)]
    
    for idx, line in enumerate(lines[:200]):  # Show first 200 lines
        color = colors[idx % len(colors)]
        polygon = line.get('polygon', [])
        if polygon and len(polygon) >= 3:
            pts = np.array(polygon, np.int32)
            cv2.polylines(vis_image, [pts], True, color, 2)
    
    # Save and display visualization
    vis_path = output_dir / f"{Path(test_image_path).stem}_segmentation.png"
    cv2.imwrite(str(vis_path), vis_image)
    
    plt.figure(figsize=(15, 10))
    plt.imshow(cv2.cvtColor(vis_image, cv2.COLOR_BGR2RGB))
    plt.title(f"Segmentation: {len(lines)} lines detected")
    plt.axis('off')
    plt.show()
    
    print("✅ Stage 1 complete - ready for text recognition")
    
    # Store results for Stage 2
    stage1_results = {
        'image': image,
        'lines': lines,
        'segmentation_time': seg_time,
        'basic_alto_path': basic_alto_path,
        'output_dir': output_dir
    }
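
One way to sanity-check the basic ALTO file is to count its `TextLine` elements and compare the result with the number of detected lines. The sketch below uses only the standard library and ignores the XML namespace, so it does not depend on the exact ALTO version. The function name `count_alto_elements` is hypothetical, and the sample string is hand-written for illustration, not real pipeline output.

```python
import xml.etree.ElementTree as ET


def count_alto_elements(alto_xml: str, tag: str = "TextLine") -> int:
    """Count elements with the given local name, ignoring the XML namespace."""
    root = ET.fromstring(alto_xml)
    # ElementTree reports namespaced tags as "{namespace}LocalName"
    return sum(1 for el in root.iter() if el.tag.rsplit("}", 1)[-1] == tag)


# Tiny ALTO-like sample for illustration only
sample = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine ID="line_0"/><TextLine ID="line_1"/>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""
print(count_alto_elements(sample))
# prints: 2
```

For the file written in Stage 1, the text could be read with `basic_alto_path.read_text(encoding='utf-8')` and the count compared with `len(lines)`.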

## 🤖 Stage 2: Text Recognition, NER & Enhancement

In [None]:
if 'stage1_results' not in locals():
    print("❌ Please run Stage 1 first")
else:
    print("🚀 STAGE 2: Text Recognition, NER & Enhancement")
    print("=" * 60)
    
    image = stage1_results['image']
    lines = stage1_results['lines']
    output_dir = stage1_results['output_dir']
    
    print(f"📄 Processing {len(lines)} lines")
    
    # Text Recognition with TrOCR
    print("\n🤖 TrOCR Text Recognition...")
    if device == 'cpu':
        print("⏳ CPU processing - this may take several minutes")
    
    start_time = time.time()
    lines_with_text = pipeline.ocr_processor.process_lines(image, lines)
    ocr_time = time.time() - start_time
    
    recognized_lines = [l for l in lines_with_text if l.get('text', '').strip()]
    print(f"✅ OCR complete: {ocr_time:.2f}s")
    print(f"📊 Text in {len(recognized_lines)}/{len(lines)} lines")
    
    # Show sample text
    print("\n📝 Sample recognized text:")
    for i, line in enumerate(recognized_lines[:8]):
        text = line.get('text', '')
        conf = line.get('confidence', 0)
        print(f"  {i+1}. '{text}' ({conf:.2f})")
    
    # Named Entity Recognition
    print("\n🏷️ Named Entity Recognition...")
    start_time = time.time()
    
    ner_results = pipeline.ner_extractor.extract_entities_from_lines(lines_with_text)
    ner_time = time.time() - start_time
    
    all_entities = ner_results.get('all_entities', [])
    print(f"✅ NER complete: {ner_time:.2f}s")
    print(f"🏷️ Found {len(all_entities)} entities")
    print(f"🧠 Backend: {ner_results.get('backend', 'unknown')}")
    
    if all_entities:
        # Group by type
        entity_types = {}
        for entity in all_entities:
            label = entity.get('label', 'UNKNOWN')
            entity_types[label] = entity_types.get(label, 0) + 1
        
        print("\nEntity types:")
        for etype, count in entity_types.items():
            print(f"  {etype}: {count}")
        
        print("\nSample entities:")
        for entity in all_entities[:8]:
            print(f"  '{entity.get('text', '')}' -> {entity.get('label', '')}")
    
    # Surname Matching
    print("\n👥 Surname Matching...")
    start_time = time.time()
    
    surname_matches = pipeline.surname_matcher.find_in_lines(lines_with_text)
    surname_time = time.time() - start_time
    
    print(f"✅ Surname matching: {surname_time:.2f}s")
    print(f"👥 Found {len(surname_matches)} matches")
    
    if surname_matches:
        unique_surnames = set(m.matched_surname for m in surname_matches)
        print(f"📊 Unique surnames: {len(unique_surnames)}")
        
        print("\nSample matches:")
        for match in surname_matches[:8]:
            print(f"  '{match.found_text}' -> '{match.matched_surname}' ({match.confidence:.2f})")
        
        # Export matches
        matches_file = output_dir / f"{Path(test_image_path).stem}_surnames.json"
        pipeline.surname_matcher.export_matches(surname_matches, str(matches_file))
        print(f"💾 Surnames saved: {matches_file}")
    
    # Create Enhanced ALTO
    print("\n✨ Creating Enhanced ALTO...")
    
    # Complete ALTO with text
    complete_alto_xml = pipeline._create_alto_xml(Path(test_image_path), image, lines_with_text)
    complete_alto_path = output_dir / f"{Path(test_image_path).stem}_complete_alto.xml"
    
    with open(complete_alto_path, 'w', encoding='utf-8') as f:
        f.write(complete_alto_xml)
    
    print(f"✅ Complete ALTO: {complete_alto_path}")
    
    # Enhanced ALTO with NER (if entities found)
    enhanced_alto_path = None
    if all_entities:
        # Map entities to lines
        entities_by_line_id = {}
        for idx, line in enumerate(lines_with_text):
            line_id = f"line_{idx}"
            line_text = line.get('text', '')
            
            # Skip entities with empty text: '' is a substring of every line
            line_entities = [
                entity for entity in all_entities
                if entity.get('text') and entity['text'] in line_text
            ]
            
            if line_entities:
                entities_by_line_id[line_id] = {'entities': line_entities}
        
        if entities_by_line_id:
            enhanced_alto_path = output_dir / f"{Path(test_image_path).stem}_enhanced_alto.xml"
            pipeline.alto_enhancer.enhance_alto_with_ner(
                str(complete_alto_path), entities_by_line_id, str(enhanced_alto_path)
            )
            print(f"✅ Enhanced ALTO: {enhanced_alto_path}")
    
    # Final Summary
    total_time = stage1_results['segmentation_time'] + ocr_time + ner_time + surname_time
    
    print("\n📊 PROCESSING COMPLETE")
    print("=" * 40)
    print(f"⏱️ Total time: {total_time:.2f}s")
    print(f"🔍 Lines detected: {len(lines)}")
    print(f"📝 Lines with text: {len(recognized_lines)}")
    print(f"🏷️ Entities found: {len(all_entities)}")
    print(f"👥 Surname matches: {len(surname_matches)}")
    
    # List output files
    print("\n📁 Generated files:")
    for file_path in sorted(output_dir.glob("*")):
        size_kb = file_path.stat().st_size / 1024
        print(f"  📄 {file_path.name} ({size_kb:.1f} KB)")
    
    print(f"\n✅ All files saved in: {output_dir}")
    print("🎉 Ukrainian OCR processing completed!")
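
The per-label tally in the cell above can also be written with `collections.Counter`. The sketch below assumes entities are dicts with `'text'` and `'label'` keys, as the cell above reads them. The sample entities and the `PER`/`LOC` label names are made up for illustration, and the helper name `count_entity_labels` is hypothetical.

```python
from collections import Counter

# Hypothetical entities shaped like the dicts read in the cell above
sample_entities = [
    {"text": "Шевченко", "label": "PER"},
    {"text": "Київ", "label": "LOC"},
    {"text": "Коваленко", "label": "PER"},
    {"text": "1897", "label": None},
]


def count_entity_labels(entities):
    """Count entities per label, falling back to 'UNKNOWN' when the label is missing or empty."""
    return Counter(entity.get("label") or "UNKNOWN" for entity in entities)


for label, count in count_entity_labels(sample_entities).most_common():
    print(f"  {label}: {count}")
```

Unlike the manual dictionary loop, `most_common()` lists the most frequent labels first, which is convenient when a page yields many entity types.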

## 📋 Results Summary

### Generated Files:
- **Basic ALTO XML**: Segmentation with coordinates
- **Complete ALTO XML**: Full transcription with confidence scores
- **Enhanced ALTO XML**: With NER semantic annotations (if entities found)
- **Surname Matches JSON**: Genealogical findings with fuzzy matching
- **Segmentation PNG**: Visual representation of detected lines

### Next Steps:
1. **Review ALTO files** in an XML editor or eScriptorium
2. **Analyze surname matches** for genealogical research
3. **Process additional documents** with the same configuration
4. **Customize surname lists** for specific family names
5. **Adjust confidence thresholds** for better precision/recall balance
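
Processing additional documents with the same configuration amounts to looping over a folder of scans. The sketch below shows one possible shape for such a loop. `process_directory` and `process_fn` are hypothetical names: `process_fn` stands for a per-document function you would write yourself by wrapping the Stage 1 and Stage 2 cells, and it is not part of the package. The suffix set is an assumption based on the supported formats listed in this notebook.

```python
from pathlib import Path
from typing import Callable, Dict, Union

# Assumed suffixes for the supported formats (JPG, PNG, TIFF)
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def process_directory(
    input_dir: Union[str, Path], process_fn: Callable[[Path], object]
) -> Dict[str, object]:
    """Apply process_fn to every supported image in input_dir (non-recursive).

    Returns {file name: result}. A failed file maps to the exception it raised,
    so one unreadable scan does not stop the batch.
    """
    results: Dict[str, object] = {}
    for path in sorted(Path(input_dir).iterdir()):
        if not path.is_file() or path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        try:
            results[path.name] = process_fn(path)
        except Exception as exc:  # record the failure and continue
            results[path.name] = exc
    return results
```

Creating the pipeline once and reusing it inside `process_fn` avoids reloading the models for every document.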

### Performance Tips:
- Use a **GPU** for faster processing (especially the OCR step)
- Increase **batch_size** for better GPU utilization, as far as available VRAM allows
- Try different **NER backends** (transformers/spacy/rule-based)
- Adjust **surname matching threshold** (0.7-0.9 range)
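
To get a feel for what a matching threshold in the 0.7-0.9 range means, the sketch below scores a word against a short surname list with `difflib.SequenceMatcher` from the standard library. This is an illustration only and is not the package's matcher: the real implementation may score differently, and phonetic matching is not covered here. The name `best_surname_match` is hypothetical.

```python
from difflib import SequenceMatcher

KNOWN_SURNAMES = ["Шевченко", "Коваленко", "Мельник"]


def best_surname_match(word, surnames=KNOWN_SURNAMES, threshold=0.8):
    """Return (surname, score) for the closest surname at or above threshold, else None."""
    if not surnames:
        return None
    scored = [
        (s, SequenceMatcher(None, word.lower(), s.lower()).ratio()) for s in surnames
    ]
    surname, score = max(scored, key=lambda pair: pair[1])
    return (surname, score) if score >= threshold else None


# "Шевченка" (an inflected form) shares 7 of 8 characters with "Шевченко"
print(best_surname_match("Шевченка"))                 # ('Шевченко', 0.875)
print(best_surname_match("Шевченка", threshold=0.9))  # None
```

With this scoring, a threshold of 0.8 accepts the inflected form while 0.9 rejects it, which is the precision/recall trade-off the tip refers to.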

---
**🇺🇦 Ukrainian OCR Pipeline** - Optimized for local development environments