# 🇺🇦 Ukrainian OCR Pipeline - Google Colab Demo

High-performance OCR pipeline for historical Ukrainian documents with Named Entity Recognition (NER).

**Features:**
- ⚡ GPU-accelerated TrOCR for Cyrillic handwriting
- 🎯 Named Entity Recognition for persons and locations
- 📋 ALTO XML output for archival standards
- 🎨 Person-dense region extraction
- 📊 Progress tracking and performance monitoring


## 🚀 Setup and Installation

First, let's install the Ukrainian OCR pipeline package and check GPU availability.

In [None]:
# Install the Ukrainian OCR pipeline package
!pip install "ukrainian-ocr-pipeline[colab]" --quiet

# Install additional dependencies for Colab
!pip install ipywidgets --quiet

print("✅ Installation complete!")

In [None]:
# Check GPU availability and setup
import ukrainian_ocr
from ukrainian_ocr.utils.gpu import setup_colab_gpu

# Check GPU
gpu_info = setup_colab_gpu()

if gpu_info['cuda_available']:
    print(f"🎉 GPU detected: {gpu_info['gpu_names'][0]}")
    print(f"💾 GPU Memory: {gpu_info['gpu_memory'][0]:.1f}GB")
    print(f"🔥 Recommended device: {gpu_info['recommended_device']}")
else:
    print("⚠️ No GPU detected. Enable GPU: Runtime -> Change runtime type -> GPU")
    print("💻 Will use CPU (slower processing)")
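
As a cross-check that does not depend on the package's own helper, CUDA visibility can also be queried through PyTorch directly. The sketch below assumes PyTorch is the backend in use and falls back gracefully when it is not installed; `describe_cuda` is a hypothetical helper name, not part of `ukrainian_ocr`.

```python
def describe_cuda():
    """Return a small dict describing CUDA visibility via PyTorch, if installed."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch there is nothing to query
        return {"torch_installed": False, "cuda_available": False, "devices": []}

    devices = []
    if torch.cuda.is_available():
        for idx in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(idx)
            devices.append({
                "name": props.name,
                "memory_gb": props.total_memory / 1024**3,
            })
    return {
        "torch_installed": True,
        "cuda_available": bool(devices),
        "devices": devices,
    }

info = describe_cuda()
print(info)
```

If this reports no devices while the runtime type is set to GPU, restarting the runtime is usually the first thing to try.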

## 📁 Upload Images

Upload your historical Ukrainian document images for processing.

In [None]:
from google.colab import files
import os

# Create upload directory
os.makedirs('/content/images', exist_ok=True)

# Upload files
print("📤 Select your historical document images to upload:")
uploaded = files.upload()

# Move uploaded files to images directory
for filename in uploaded.keys():
    os.rename(filename, f'/content/images/{filename}')
    print(f"✅ Uploaded: {filename}")

# List uploaded files
image_files = [f'/content/images/{f}' for f in os.listdir('/content/images')]
print(f"\n📊 Total images uploaded: {len(image_files)}")
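
The listing above picks up every file in `/content/images`, including anything that is not an image. A minimal sketch of filtering by extension follows; the extension set and the helper name `list_image_files` are assumptions for illustration, so adjust them to whatever formats the pipeline actually accepts.

```python
import tempfile
from pathlib import Path

# Assumed set of extensions; not taken from the package documentation
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp"}

def list_image_files(directory):
    """Return sorted paths of files in `directory` whose extension looks like an image."""
    return sorted(
        str(p) for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )

# Small self-contained demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("page_001.JPG", "page_002.tif", "notes.txt"):
        (Path(tmp) / name).touch()
    demo_files = [Path(p).name for p in list_image_files(tmp)]
    print(demo_files)
```

In this notebook the call would be `image_files = list_image_files('/content/images')`.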

## ⚙️ Configuration

Configure the OCR pipeline for optimal performance in Colab.

In [None]:
from ukrainian_ocr import UkrainianOCRPipeline, OCRConfig

# Create optimized configuration for Colab
config = OCRConfig()
config.update_for_colab()  # Optimize for Colab environment

# Customize settings if needed
config.verbose = True  # Enable progress bars
config.save_intermediate = True  # Save visualization images
config.post_processing.extract_person_regions = True  # Extract person-dense regions

print("⚙️ Configuration settings:")
print(f"  Device: {config.device}")
print(f"  Batch size: {config.batch_size}")
print(f"  NER backend: {config.ner.backend}")
print(f"  Extract person regions: {config.post_processing.extract_person_regions}")

## 🔄 Processing Pipeline

Initialize the pipeline and process your documents.

In [None]:
# Initialize the OCR pipeline
print("🚀 Initializing Ukrainian OCR Pipeline...")
pipeline = UkrainianOCRPipeline(
    config=config,
    device='auto',  # Auto-detect best device
    verbose=True
)

print("✅ Pipeline initialized successfully!")

In [None]:
# Process all uploaded images
import time

print(f"🔄 Processing {len(image_files)} image(s)...\n")

# Create output directory
output_dir = '/content/ocr_results'
os.makedirs(output_dir, exist_ok=True)

# Start processing
start_time = time.time()

if len(image_files) == 1:
    # Single image processing
    results = [pipeline.process_single_image(
        image_files[0], 
        output_dir=output_dir,
        save_intermediate=True
    )]
else:
    # Batch processing
    results = pipeline.process_batch(
        image_files, 
        output_dir=output_dir,
        save_intermediate=True
    )

total_time = time.time() - start_time

# Display results summary
successful = sum(1 for r in results if r['success'])
failed = len(results) - successful

print(f"\n🎉 Processing complete!")
print(f"✅ Successful: {successful}/{len(results)}")
print(f"❌ Failed: {failed}/{len(results)}")
print(f"⏱️ Total time: {total_time:.1f}s")
# Guard against an empty result list to avoid a ZeroDivisionError
print(f"📊 Average per image: {total_time / max(len(results), 1):.1f}s")

# Show pipeline statistics
stats = pipeline.get_stats()
print(f"\n📈 Pipeline Statistics:")
print(f"  Images processed: {stats['images_processed']}")
print(f"  Total processing time: {stats['total_processing_time']:.1f}s")
print(f"  Average time per image: {stats['average_time_per_image']:.1f}s")
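
The summary logic in the cell above can be pulled into a small reusable function. This is a sketch under the assumption that each result is a dict with a `success` key, as used in this notebook; `summarize_results` is a hypothetical helper, and the demo dicts are made up.

```python
def summarize_results(results, total_time):
    """Summarize a list of per-image result dicts without dividing by zero."""
    successful = sum(1 for r in results if r.get("success"))
    count = len(results)
    return {
        "total": count,
        "successful": successful,
        "failed": count - successful,
        "average_time": total_time / count if count else 0.0,
    }

# Hypothetical result dicts, shaped like the ones used in this notebook
demo_results = [
    {"success": True, "processing_time": 4.0},
    {"success": False, "error": "file not readable"},
]
summary = summarize_results(demo_results, total_time=10.0)
empty_summary = summarize_results([], total_time=0.0)
print(summary)
print(empty_summary)
```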

## 📊 Results Analysis

Analyze the processing results and view extracted entities.

In [None]:
# Analyze results for each processed image
from IPython.display import display, Image, HTML
import xml.etree.ElementTree as ET

for i, result in enumerate(results):
    if result['success']:
        print(f"\n📄 Image {i+1}: {result['image_path']}")
        print(f"⏱️ Processing time: {result['processing_time']:.2f}s")
        print(f"📏 Lines detected: {result['lines_detected']}")
        print(f"📝 Lines with text: {result['lines_with_text']}")
        
        # Show output files
        print(f"\n📁 Output files:")
        for file_type, path in result['output_paths'].items():
            if path and os.path.exists(path):
                size_mb = os.path.getsize(path) / 1024 / 1024
                print(f"  {file_type}: {os.path.basename(path)} ({size_mb:.1f}MB)")
                
        # Try to extract entities from enhanced ALTO
        alto_enhanced = result['output_paths'].get('alto_enhanced')
        if alto_enhanced and os.path.exists(alto_enhanced):
            try:
                tree = ET.parse(alto_enhanced)
                root = tree.getroot()
                
                # Count entity lines
                person_lines = len(root.findall('.//*[@ENTITY_TYPES="PERSON"]'))
                location_lines = len(root.findall('.//*[@ENTITY_TYPES="LOCATION"]'))
                
                print(f"\n🎯 Entities extracted:")
                print(f"  👤 Person lines: {person_lines}")
                print(f"  📍 Location lines: {location_lines}")
                
                # Check for person-dense regions.
                # "{*}" matches TextBlock in any (or no) XML namespace (Python 3.8+)
                dense_blocks = root.findall('.//{*}TextBlock[@PERSON_LINES_COUNT]')
                if dense_blocks:
                    for block in dense_blocks:
                        person_count = block.get('PERSON_LINES_COUNT', 0)
                        print(f"  🎯 Person-dense region: {person_count} person lines")
                        
            except Exception as e:
                print(f"  ⚠️ Could not parse ALTO file: {e}")
    else:
        print(f"\n❌ Image {i+1} failed: {result.get('error', 'Unknown error')}")
        
    print("-" * 50)
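
ALTO files normally declare an XML namespace, and a line may carry more than one entity type. The sketch below counts entity lines in a namespace-agnostic way on a made-up ALTO fragment. The `ENTITY_TYPES` and `PERSON_LINES_COUNT` attribute names come from the cell above; the comma-or-space separator for multiple types is an assumption, and `count_entity_lines` is a hypothetical helper.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical enhanced-ALTO fragment using the ALTO v4 namespace
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1" PERSON_LINES_COUNT="2">
      <TextLine ID="l1" ENTITY_TYPES="PERSON"/>
      <TextLine ID="l2" ENTITY_TYPES="PERSON,LOCATION"/>
      <TextLine ID="l3"/>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""

def count_entity_lines(root):
    """Count elements per entity type, ignoring the XML namespace."""
    counts = {}
    for element in root.iter():
        value = element.get("ENTITY_TYPES")
        if not value:
            continue
        # Assumption: multiple types may be separated by commas or spaces
        for entity_type in re.split(r"[,\s]+", value.strip()):
            counts[entity_type] = counts.get(entity_type, 0) + 1
    return counts

root = ET.fromstring(SAMPLE_ALTO)
entity_counts = count_entity_lines(root)
# "{*}" matches any namespace (Python 3.8+)
dense_blocks = root.findall(".//{*}TextBlock[@PERSON_LINES_COUNT]")
print(entity_counts, len(dense_blocks))
```

Compared with an exact attribute match such as `@ENTITY_TYPES="PERSON"`, this also counts lines tagged with several types at once, if the pipeline writes them that way.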

## 🎨 Visualization

Display processing visualizations and person-dense regions.

In [None]:
from IPython.display import display, Image as IPImage, HTML
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# Display visualizations for successful results
for i, result in enumerate(results[:3]):  # Limit to first 3 images
    if result['success']:
        print(f"\n🎨 Visualizations for Image {i+1}")
        
        # Show segmentation visualization if available
        viz_path = result['output_paths'].get('visualization')
        if viz_path and os.path.exists(viz_path):
            try:
                plt.figure(figsize=(12, 8))
                img = mpimg.imread(viz_path)
                plt.imshow(img)
                plt.title(f"Segmentation Results - {os.path.basename(result['image_path'])}")
                plt.axis('off')
                plt.tight_layout()
                plt.show()
            except Exception as e:
                print(f"  ⚠️ Could not display visualization: {e}")
                
        # Show person region if available
        person_regions_path = result['output_paths'].get('person_regions')
        if person_regions_path and os.path.exists(person_regions_path):
            try:
                plt.figure(figsize=(10, 6))
                img = mpimg.imread(person_regions_path)
                plt.imshow(img)
                plt.title(f"Person-Dense Region - {os.path.basename(result['image_path'])}")
                plt.axis('off')
                plt.tight_layout()
                plt.show()
            except Exception as e:
                print(f"  ⚠️ Could not display person region: {e}")

## 💾 Download Results

Package and download your processing results.

In [None]:
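import zipfile

# Create a zip file with all results
zip_path = '/content/ukrainian_ocr_results.zip'

with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
    # Add all files from the results directory.
    # The loop variables are named so they do not shadow `files`
    # (google.colab) or `root` (the parsed ALTO tree) from earlier cells;
    # shadowing `files` would break files.download() below.
    for dir_path, _dir_names, file_names in os.walk(output_dir):
        for file_name in file_names:
            file_path = os.path.join(dir_path, file_name)
            arc_path = os.path.relpath(file_path, '/content')
            zipf.write(file_path, arc_path)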

zip_size_mb = os.path.getsize(zip_path) / 1024 / 1024
print(f"📦 Created results archive: ukrainian_ocr_results.zip ({zip_size_mb:.1f}MB)")

# Download the zip file
files.download(zip_path)
print("✅ Results downloaded successfully!")
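
Before downloading, it can be useful to confirm what actually ended up in the archive. The following sketch wraps the zipping step in a function and returns the archive listing; `zip_directory` is a hypothetical helper, and the demo files are throwaway placeholders.

```python
import os
import tempfile
import zipfile

def zip_directory(source_dir, zip_path):
    """Zip every file under `source_dir`, storing paths relative to its parent."""
    base_dir = os.path.dirname(os.path.abspath(source_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for dir_path, _dir_names, file_names in os.walk(source_dir):
            for file_name in sorted(file_names):
                file_path = os.path.join(dir_path, file_name)
                archive.write(file_path, os.path.relpath(file_path, base_dir))
    # Re-open the archive to confirm what was actually written
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(archive.namelist())

# Demonstration with throwaway files
with tempfile.TemporaryDirectory() as tmp:
    results_dir = os.path.join(tmp, "ocr_results")
    os.makedirs(os.path.join(results_dir, "alto"))
    for rel in ("alto/page_001.xml", "page_001_viz.png"):
        with open(os.path.join(results_dir, rel), "w") as handle:
            handle.write("demo")
    archived = zip_directory(results_dir, os.path.join(tmp, "results.zip"))
    print(archived)
```

Writing the zip file outside the directory being archived, as here and in the cell above, avoids the archive including itself.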

## 🧹 Cleanup

Release pipeline resources and report final GPU memory usage.

In [None]:
# Clean up pipeline resources
pipeline.cleanup()

# Show final memory usage
from ukrainian_ocr.utils.gpu import monitor_gpu_memory

if gpu_info['cuda_available']:
    memory_stats = monitor_gpu_memory()
    for gpu_id, stats in memory_stats.items():
        print(f"📊 {gpu_id.upper()} Memory Usage:")
        print(f"  Allocated: {stats['allocated']:.1f}GB")
        print(f"  Utilization: {stats['utilization']:.1f}%")

print("\n🎉 Ukrainian OCR Pipeline processing complete!")
print("📚 Check the downloaded archive for all your results.")

## 🚀 Next Steps

**What you can do with the results:**

1. **ALTO XML files** - Import into eScriptorium or other document analysis tools
2. **Enhanced ALTO** - Contains named entity annotations for persons and locations
3. **Person-dense regions** - Cropped images focusing on genealogically valuable content
4. **Entity extraction** - Use the identified persons and locations for genealogical research
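
As one way to use the extracted entities, the annotated lines can be flattened into CSV for a spreadsheet or genealogy database. This is a rough sketch on a made-up ALTO fragment: it assumes `ENTITY_TYPES` is set on `TextLine` elements and that line text lives in the `CONTENT` attribute of child `String` elements, as in standard ALTO; `entity_rows` is a hypothetical helper.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical enhanced-ALTO fragment
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock ID="b1">
    <TextLine ID="l1" ENTITY_TYPES="PERSON">
      <String CONTENT="Іван"/><SP/><String CONTENT="Петренко"/>
    </TextLine>
    <TextLine ID="l2"><String CONTENT="року"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

def entity_rows(root):
    """Return one row per element that carries an ENTITY_TYPES attribute."""
    rows = []
    for line in root.iter():
        entity_types = line.get("ENTITY_TYPES")
        if not entity_types:
            continue
        # "{*}String" matches direct String children in any namespace (Python 3.8+)
        words = [s.get("CONTENT", "") for s in line.findall("{*}String")]
        rows.append({
            "line_id": line.get("ID", ""),
            "entity_types": entity_types,
            "text": " ".join(words),
        })
    return rows

rows = entity_rows(ET.fromstring(SAMPLE))

# Write to an in-memory buffer; swap in open("entities.csv", "w", newline="") for a file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["line_id", "entity_types", "text"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```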

**For production use:**
- Install locally: `pip install "ukrainian-ocr-pipeline[all]"` (the quotes keep shells such as zsh from interpreting the brackets)
- Use CLI interface: `ukrainian-ocr --help`
- Customize configuration files
- Integrate with existing workflows

**Need help?**
- 📖 Documentation: [Link to docs]
- 🐛 Issues: [Link to GitHub issues]
- 💬 Discussions: [Link to discussions]
