# 🇺🇦 Ukrainian OCR Pipeline - Google Colab Demo

This notebook demonstrates the **Ukrainian OCR Pipeline Package** in a Google Colab environment.

## Features:
- **Automatic GPU Detection** and optimization for Colab T4/V100/A100
- **GitHub Integration** with automatic repository cloning
- **Two-Stage Processing**: Segmentation → Recognition & Enhancement
- **Complete Pipeline**: Kraken → TrOCR → NER → Surname Matching → Enhanced ALTO
- **Download Support**: Results automatically downloadable from Colab

## Quick Start:
1. **Upload your document** using the file uploader
2. **Run all cells** for complete processing
3. **Download results** from the generated files section

---
**⚡ Optimized for Google Colab Free/Pro/Pro+ GPUs**

## 🔧 Environment Setup & Package Installation

In [None]:
# Check if we're in Google Colab
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print("🚀 Running in Google Colab")
    
    # Mount Google Drive (optional)
    from google.colab import drive
    try:
        drive.mount('/content/drive')
        print("✅ Google Drive mounted")
    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
        print("⚠️ Google Drive mount failed (optional)")
    
    # Install system dependencies
    print("📦 Installing system dependencies...")
    !apt-get update -qq
    !apt-get install -y -qq libgl1-mesa-glx libglib2.0-0 libsm6 libxext6 libxrender-dev libgomp1
    
    # Clone the repository
    print("📥 Cloning Ukrainian OCR repository...")
    !git clone https://github.com/your-repo/ukrainian-ocr-package.git /content/ukrainian_ocr_package
    
    # Install the package
    print("⚙️ Installing Ukrainian OCR package...")
    !cd /content/ukrainian_ocr_package && pip install -e .
    
    print("✅ Installation complete!")
else:
    print("💻 Running in local environment")
    print("⚠️ This notebook is optimized for Google Colab")
    print("📝 For local development, use Ukrainian_OCR_Local_Demo.ipynb")

## 📦 Package Import & Hardware Detection

In [None]:
import os
import sys
import time
import json
from pathlib import Path
from typing import Dict, List, Optional

# Core libraries
import torch
import numpy as np
import cv2
from PIL import Image
import matplotlib.pyplot as plt

# Import Ukrainian OCR Package
if IN_COLAB:
    sys.path.insert(0, '/content/ukrainian_ocr_package')

from ukrainian_ocr import UkrainianOCRPipeline
from ukrainian_ocr.core.config import OCRPipelineConfig

print("✅ Ukrainian OCR Package loaded")
print(f"📍 Package location: {__import__('ukrainian_ocr').__file__}")

# Google Colab specific imports
if IN_COLAB:
    from google.colab import files
    from IPython.display import display, HTML

In [None]:
print("🖥️ Hardware Detection:")
print(f"PyTorch version: {torch.__version__}")

# GPU detection optimized for Colab
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"✅ GPU: {gpu_name} ({gpu_memory:.1f}GB VRAM)")
    
    # Optimize for different Colab GPU types
    if 'T4' in gpu_name:
        device = 'cuda'
        batch_size = 4  # Conservative for T4
        print("🎯 Optimized for Colab T4 GPU")
    elif 'V100' in gpu_name or 'A100' in gpu_name:
        device = 'cuda'
        batch_size = 8  # More aggressive for Pro/Pro+
        print("🎯 Optimized for Colab Pro/Pro+ GPU")
    else:
        device = 'cuda'
        batch_size = 6  # Default for other GPUs
        print("🎯 Optimized for GPU")
else:
    print("⚠️ GPU not available - using CPU (will be slower)")
    device = 'cpu'
    batch_size = 1

print(f"🎯 Selected device: {device}")
print(f"📦 Batch size: {batch_size}")

# Create Colab-optimized configuration
config = {
    'device': device,
    'batch_size': batch_size,
    'verbose': True,
    'save_intermediate': True,
    
    'ocr': {
        'model_path': 'cyrillic-trocr/trocr-handwritten-cyrillic',
        'device': device,
        'batch_size': batch_size
    },
    
    'ner': {
        'backend': 'spacy',  # Most stable for Colab
        'device': device,
        'confidence_threshold': 0.7
    },
    
    'surname_matching': {
        'enabled': True,
        'threshold': 0.8,
        'use_phonetic': True,
        'export_matches': True,
        'surnames': [
            'Шевченко', 'Коваленко', 'Бондаренко', 'Ткаченко', 'Кравченко',
            'Петренко', 'Іваненко', 'Михайленко', 'Василенко', 'Григоренко',
            'Ковальчук', 'Савченко', 'Левченко', 'Павленко', 'Марченко',
            'Мельник', 'Коваль', 'Гончар', 'Кравець', 'Швець',
            'Жук', 'Козлов', 'Мороз', 'Терещенко', 'Рибалко'
        ]
    },
    
    'post_processing': {
        'extract_person_regions': True,
        'clustering_eps': 300
    }
}

print("✅ Configuration optimized for Google Colab")

## 📤 Upload Document for Processing

In [None]:
if IN_COLAB:
    print("📤 Upload your Ukrainian document (JPG, PNG, TIFF):")
    uploaded = files.upload()
    
    if uploaded:
        # Get the uploaded file
        test_image_path = list(uploaded.keys())[0]
        print(f"✅ Document uploaded: {test_image_path}")
        
        # Move to content directory for easier access
        import shutil
        dest_path = f"/content/{os.path.basename(test_image_path)}"
        if os.path.exists(test_image_path):
            # Compare absolute paths: the uploaded name is relative to the working directory
            if os.path.abspath(test_image_path) != dest_path:
                shutil.move(test_image_path, dest_path)
            test_image_path = dest_path
    else:
        print("❌ No file uploaded")
        test_image_path = None
else:
    # For local testing
    test_image_paths = [
        "./sample_document.jpg",
        "../test_images/ukrainian_document.jpg"
    ]
    
    test_image_path = None
    for path in test_image_paths:
        if os.path.exists(path):
            test_image_path = path
            break

if test_image_path and os.path.exists(test_image_path):
    # Display document information
    image_size = os.path.getsize(test_image_path) / (1024 * 1024)
    
    with Image.open(test_image_path) as img:
        width, height = img.size
        
        print(f"📄 Document: {os.path.basename(test_image_path)}")
        print(f"📊 Size: {image_size:.1f} MB")
        print(f"📐 Dimensions: {width} x {height} pixels")
        print(f"📝 Format: {img.format}")
        
        # Display preview
        plt.figure(figsize=(12, 8))
        plt.imshow(img)
        plt.title(f"Document: {os.path.basename(test_image_path)}")
        plt.axis('off')
        plt.show()
else:
    print("❌ No document available for processing")
    if IN_COLAB:
        print("Please run the upload cell above")
    else:
        print("Please add a document to the test_image_paths list")

## 🏁 Stage 1: Document Segmentation & Basic ALTO

In [None]:
if not test_image_path or not os.path.exists(test_image_path):
    print("❌ Cannot proceed without a document. Please upload a file.")
else:
    print("🚀 STAGE 1: Document Segmentation & Basic ALTO Creation")
    print("=" * 60)
    
    # Initialize pipeline
    pipeline_config = OCRPipelineConfig.from_dict(config)
    
    # Update for Colab environment
    pipeline_config.update_for_colab()
    
    pipeline = UkrainianOCRPipeline(config=pipeline_config)
    
    print(f"✅ Pipeline ready (device: {pipeline.device})")
    
    # Create output directory
    output_dir = Path("/content/ukrainian_ocr_output")
    output_dir.mkdir(parents=True, exist_ok=True)  # parents=True: /content may not exist outside Colab
    
    # Load and process image
    print("\n🔍 Kraken Segmentation...")
    if device == 'cpu':
        print("⏳ CPU processing detected - this may take a few minutes")
    
    start_time = time.time()
    
    # Initialize components
    pipeline._init_components()
    
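    # Load image (cv2.imread returns None instead of raising when the file cannot be read)
    image = cv2.imread(test_image_path)
    if image is None:
        raise ValueError(f"Could not read image: {test_image_path}")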
    
    # Segment image
    lines = pipeline.segmenter.segment_image(image)
    seg_time = time.time() - start_time
    
    print(f"✅ Segmentation complete: {seg_time:.2f}s")
    print(f"📊 Detected {len(lines)} text lines")
    
    # Create basic ALTO XML
    print("\n📄 Creating basic ALTO XML...")
    basic_alto_xml = pipeline._create_alto_xml(Path(test_image_path), image, lines)
    
    basic_alto_path = output_dir / f"{Path(test_image_path).stem}_basic_alto.xml"
    with open(basic_alto_path, 'w', encoding='utf-8') as f:
        f.write(basic_alto_xml)
    
    print(f"✅ Basic ALTO created: {basic_alto_path}")
    
    # Create visualization
    print("\n🎨 Creating segmentation visualization...")
    vis_image = image.copy()
    colors = [(0, 255, 0), (255, 0, 0), (0, 0, 255), (255, 255, 0)]
    
    for idx, line in enumerate(lines[:200]):  # Show first 200 lines to avoid clutter
        color = colors[idx % len(colors)]
        polygon = line.get('polygon', [])
        if polygon and len(polygon) >= 3:
            pts = np.array(polygon, np.int32)
            cv2.polylines(vis_image, [pts], True, color, 2)
    
    # Save and display visualization
    vis_path = output_dir / f"{Path(test_image_path).stem}_segmentation.png"
    cv2.imwrite(str(vis_path), vis_image)
    
    plt.figure(figsize=(15, 10))
    plt.imshow(cv2.cvtColor(vis_image, cv2.COLOR_BGR2RGB))
    plt.title(f"Segmentation: {len(lines)} lines detected")
    plt.axis('off')
    plt.show()
    
    print("✅ Stage 1 complete - ready for text recognition")
    
    # Store results for Stage 2
    stage1_results = {
        'image': image,
        'lines': lines,
        'segmentation_time': seg_time,
        'basic_alto_path': basic_alto_path,
        'output_dir': output_dir
    }

## 🤖 Stage 2: Text Recognition, NER & Enhancement

In [None]:
if 'stage1_results' not in locals():
    print("❌ Please run Stage 1 first")
else:
    print("🚀 STAGE 2: Text Recognition, NER & Enhancement")
    print("=" * 60)
    
    image = stage1_results['image']
    lines = stage1_results['lines']
    output_dir = stage1_results['output_dir']
    
    print(f"📄 Processing {len(lines)} lines")
    
    # Text Recognition with TrOCR
    print("\n🤖 TrOCR Text Recognition...")
    if device == 'cpu':
        print("⏳ CPU processing - this will take several minutes")
    elif 'T4' in torch.cuda.get_device_name(0):
        print("⚡ T4 GPU detected - optimized processing")
    
    start_time = time.time()
    lines_with_text = pipeline.ocr_processor.process_lines(image, lines)
    ocr_time = time.time() - start_time
    
    recognized_lines = [l for l in lines_with_text if l.get('text', '').strip()]
    print(f"✅ OCR complete: {ocr_time:.2f}s")
    print(f"📊 Text in {len(recognized_lines)}/{len(lines)} lines")
    
    # Show sample text
    print("\n📝 Sample recognized text:")
    for i, line in enumerate(recognized_lines[:8]):
        text = line.get('text', '')
        conf = line.get('confidence', 0)
        print(f"  {i+1}. '{text}' ({conf:.2f})")
    
    # Named Entity Recognition
    print("\n🏷️ Named Entity Recognition...")
    start_time = time.time()
    
    ner_results = pipeline.ner_extractor.extract_entities_from_lines(lines_with_text)
    ner_time = time.time() - start_time
    
    all_entities = ner_results.get('all_entities', [])
    print(f"✅ NER complete: {ner_time:.2f}s")
    print(f"🏷️ Found {len(all_entities)} entities")
    print(f"🧠 Backend: {ner_results.get('backend', 'unknown')}")
    
    if all_entities:
        # Group by type
        entity_types = {}
        for entity in all_entities:
            label = entity.get('label', 'UNKNOWN')
            entity_types[label] = entity_types.get(label, 0) + 1
        
        print("\nEntity types:")
        for etype, count in entity_types.items():
            print(f"  {etype}: {count}")
        
        print("\nSample entities:")
        for entity in all_entities[:8]:
            print(f"  '{entity.get('text', '')}' -> {entity.get('label', '')}")
    
    # Surname Matching
    print("\n👥 Surname Matching...")
    start_time = time.time()
    
    surname_matches = pipeline.surname_matcher.find_in_lines(lines_with_text)
    surname_time = time.time() - start_time
    
    print(f"✅ Surname matching: {surname_time:.2f}s")
    print(f"👥 Found {len(surname_matches)} matches")
    
    if surname_matches:
        unique_surnames = set(m.matched_surname for m in surname_matches)
        print(f"📊 Unique surnames: {len(unique_surnames)}")
        
        print("\nSample matches:")
        for match in surname_matches[:8]:
            print(f"  '{match.found_text}' -> '{match.matched_surname}' ({match.confidence:.2f})")
        
        # Export matches
        matches_file = output_dir / f"{Path(test_image_path).stem}_surnames.json"
        pipeline.surname_matcher.export_matches(surname_matches, str(matches_file))
        print(f"💾 Surnames saved: {matches_file}")
    
    # Create Enhanced ALTO
    print("\n✨ Creating Enhanced ALTO...")
    
    # Complete ALTO with text
    complete_alto_xml = pipeline._create_alto_xml(Path(test_image_path), image, lines_with_text)
    complete_alto_path = output_dir / f"{Path(test_image_path).stem}_complete_alto.xml"
    
    with open(complete_alto_path, 'w', encoding='utf-8') as f:
        f.write(complete_alto_xml)
    
    print(f"✅ Complete ALTO: {complete_alto_path}")
    
    # Enhanced ALTO with NER (if entities found)
    enhanced_alto_path = None
    if all_entities:
        # Map entities to lines
        entities_by_line_id = {}
        for idx, line in enumerate(lines_with_text):
            line_id = f"line_{idx}"
            line_text = line.get('text', '')
            
            line_entities = []
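            for entity in all_entities:
                entity_text = entity.get('text', '')
                # Skip empty entity text: '' is a substring of every line
                if entity_text and entity_text in line_text:
                    line_entities.append(entity)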
            
            if line_entities:
                entities_by_line_id[line_id] = {'entities': line_entities}
        
        if entities_by_line_id:
            enhanced_alto_path = output_dir / f"{Path(test_image_path).stem}_enhanced_alto.xml"
            pipeline.alto_enhancer.enhance_alto_with_ner(
                str(complete_alto_path), entities_by_line_id, str(enhanced_alto_path)
            )
            print(f"✅ Enhanced ALTO: {enhanced_alto_path}")
    
    # Final Summary
    total_time = stage1_results['segmentation_time'] + ocr_time + ner_time + surname_time
    
    print("\n📊 PROCESSING COMPLETE")
    print("=" * 40)
    print(f"⏱️ Total time: {total_time:.2f}s")
    print(f"🔍 Lines detected: {len(lines)}")
    print(f"📝 Lines with text: {len(recognized_lines)}")
    print(f"🏷️ Entities found: {len(all_entities)}")
    print(f"👥 Surname matches: {len(surname_matches)}")
    
    # List output files
    print("\n📁 Generated files:")
    generated_files = []
    for file_path in output_dir.glob("*"):
        size_kb = file_path.stat().st_size / 1024
        print(f"  📄 {file_path.name} ({size_kb:.1f} KB)")
        generated_files.append(str(file_path))
    
    print(f"\n✅ All files saved in: {output_dir}")
    print("🎉 Ukrainian OCR processing completed!")
    
    # Store for download section
    processing_results = {
        'output_dir': output_dir,
        'generated_files': generated_files,
        'total_time': total_time,
        'stats': {
            'lines_detected': len(lines),
            'lines_with_text': len(recognized_lines),
            'entities_found': len(all_entities),
            'surname_matches': len(surname_matches)
        }
    }

## 📥 Download Results (Colab Only)

In [None]:
if IN_COLAB and 'processing_results' in locals():
    print("📥 Download your OCR results:")
    
    output_dir = processing_results['output_dir']
    
    # Create a zip file with all results
    import zipfile
    zip_path = "/content/ukrainian_ocr_results.zip"
    
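    # ZIP_DEFLATED compresses the archive; the default (ZIP_STORED) only bundles files
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf: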
        for file_path in output_dir.glob("*"):
            zipf.write(file_path, file_path.name)
    
    print(f"📦 Created results archive: {zip_path}")
    
    # Download the zip file
    files.download(zip_path)
    
    # Also offer individual file downloads
    print("\n📄 Or download individual files:")
    
    # Create download buttons for each file
    for file_path in sorted(output_dir.glob("*")):
        file_size = file_path.stat().st_size / 1024
        print(f"\n📄 {file_path.name} ({file_size:.1f} KB)")
        
        # Create download button
        display(HTML(f'''
        <button onclick="window.open('/files/{file_path}', '_blank')">
            Download {file_path.name}
        </button>
        '''))
    
    print("\n✅ All files are ready for download!")
    
elif IN_COLAB:
    print("❌ No processing results found. Please run the processing cells first.")
else:
    print("💻 Running locally - files are saved to the output directory")
    if 'processing_results' in locals():
        print(f"📁 Results saved in: {processing_results['output_dir']}")

## 📋 Results Summary & Usage Tips

### Generated Files:
- **Basic ALTO XML**: Segmentation with coordinates only
- **Complete ALTO XML**: Full transcription with confidence scores
- **Enhanced ALTO XML**: With NER semantic annotations (if entities found)
- **Surname Matches JSON**: Genealogical findings with fuzzy matching
- **Segmentation PNG**: Visual representation of detected text lines
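
To inspect the ALTO files listed above without an XML editor, the transcription can be read back with the standard library. This is a minimal sketch, assuming the output follows the usual ALTO layout of `TextLine` elements containing `String` elements with a `CONTENT` attribute; the helper name `alto_line_texts` and the sample document are illustrative and not part of the package.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Drop a leading '{namespace}' so the code works across ALTO schema versions."""
    return tag.rsplit('}', 1)[-1]

def alto_line_texts(root):
    """Return the text of each TextLine under root, in document order."""
    texts = []
    for element in root.iter():
        if local_name(element.tag) != 'TextLine':
            continue
        words = [
            child.get('CONTENT', '')
            for child in element
            if local_name(child.tag) == 'String'
        ]
        texts.append(' '.join(word for word in words if word))
    return texts

# Hand-written sample for illustration only, not real pipeline output.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine ID="line_0"><String CONTENT="Шевченко"/><String CONTENT="Іван"/></TextLine>
    <TextLine ID="line_1"/>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(alto_line_texts(ET.fromstring(SAMPLE_ALTO)))
```

For a file on disk, pass `ET.parse(path).getroot()` instead of the parsed sample. Lines without recognized text come back as empty strings, which makes it easy to count them.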

### Google Colab Tips:
1. **GPU Usage**: This notebook detects the GPU and sets the batch size accordingly (4 on T4, 8 on V100/A100, 6 on other GPUs, 1 on CPU)
2. **Runtime Management**: Long-running jobs may time out, so save intermediate results
3. **File Persistence**: Files are saved in `/content/` and will be lost when the runtime disconnects
4. **Memory Management**: Large documents may require Pro/Pro+ for sufficient RAM
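
Because files under `/content/` disappear with the runtime, one option is to copy the output directory to Google Drive after each run. The sketch below assumes Drive was mounted at `/content/drive` in the setup cell and that `MyDrive` is the target folder; `backup_outputs` is a hypothetical helper, not a package function.

```python
import shutil
from pathlib import Path

def backup_outputs(output_dir, backup_root):
    """Copy output_dir (with its files) to backup_root/<output_dir name>."""
    source = Path(output_dir)
    target = Path(backup_root) / source.name
    # dirs_exist_ok=True lets a repeated call overwrite an earlier backup (Python 3.8+).
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target

# In Colab, after Stage 2 has finished (paths are assumptions):
# backup_outputs('/content/ukrainian_ocr_output', '/content/drive/MyDrive')
```

Calling it again after processing another document refreshes the same backup folder rather than failing on the existing directory.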

### Next Steps:
1. **Review ALTO files** in an XML editor or import them into eScriptorium
2. **Analyze surname matches** for genealogical research projects
3. **Process multiple documents** by re-running with different uploads
4. **Customize surname lists** by modifying the configuration above
5. **Adjust confidence thresholds** for better precision/recall balance
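
To get a feel for how a matching threshold trades precision against recall, the following sketch scores a word against a surname list with `difflib.SequenceMatcher` from the standard library. It is only an illustration: the package's own surname matcher may use a different similarity measure, and `match_surname` is a hypothetical helper.

```python
from difflib import SequenceMatcher

def match_surname(word, surnames, threshold=0.8):
    """Return (best_surname, score) if the best score reaches threshold, else None."""
    best_name, best_score = None, 0.0
    for surname in surnames:
        score = SequenceMatcher(None, word.lower(), surname.lower()).ratio()
        if score > best_score:
            best_name, best_score = surname, score
    if best_name is not None and best_score >= threshold:
        return best_name, best_score
    return None

surnames = ['Шевченко', 'Мороз']
# 'Шевченка' is an inflected form that differs from 'Шевченко' by one letter.
print(match_surname('Шевченка', surnames, threshold=0.8))  # accepted at this threshold
print(match_surname('Шевченка', surnames, threshold=0.9))  # rejected: prints None
```

A lower threshold catches more inflected or misread forms at the cost of more false matches, so it is worth checking a sample of matches by hand after each change.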

### Performance Optimization:
- **Batch Processing**: Upload multiple documents and process in sequence
- **Model Caching**: Models are cached between runs in the same session
- **GPU Memory**: Restart runtime if you encounter CUDA out-of-memory errors
- **Image Size**: Very large images may need resizing for optimal processing
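
For the image-size tip, a very large scan can be downscaled before upload or processing. This is a sketch under assumptions: the 4000-pixel limit is an arbitrary illustrative value rather than a documented package limit, and both helper names are hypothetical. Aggressive downscaling can hurt recognition of small handwriting, so compare results on a sample page.

```python
def fit_within(width, height, max_side=4000):
    """Return (new_width, new_height) scaled so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))

def downscale_image(src_path, dst_path, max_side=4000):
    """Save a downscaled copy of src_path to dst_path; requires Pillow."""
    from PIL import Image  # imported lazily so fit_within works without Pillow
    with Image.open(src_path) as img:
        new_size = fit_within(*img.size, max_side=max_side)
        img.resize(new_size, Image.LANCZOS).save(dst_path)
    return new_size

print(fit_within(8000, 6000))  # aspect ratio is preserved
```

Keeping the size calculation separate from the Pillow call makes it easy to check the target dimensions before any file is written.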

---
**🇺🇦 Ukrainian OCR Pipeline** - Optimized for Google Colab environments

**🔗 Repository**: [GitHub Repository Link]

**📚 Documentation**: [Package Documentation Link]