# LLM OCR Package - Real Workflow Demo

This notebook demonstrates the complete OCR workflow using real API keys.

## Prerequisites

1. Set environment variables for API keys:
   ```bash
   export ANTHROPIC_API_KEY="your-anthropic-key"
   export OPENAI_API_KEY="your-openai-key"
   export GEMINI_API_KEY="your-gemini-key"
   export TOGETHER_API_KEY="your-together-key"
   ```

2. Or create a `.env` file in the project root with the same variables.

3. Make sure you have test data (ALTO XML, images, ground truth) in the `ground_truth/` directory.

## Setup and Imports

In [2]:
from llm_ocr.workflow import OCRPipelineWorkflow
from llm_ocr.models import ProcessingMode
from llm_ocr.prompts.prompt import PromptVersion 

workflow = OCRPipelineWorkflow(
    id="0e2a73f13785",
    folder="ground_truth",
    ocr_model_name="gemini-1.5-flash",
    correction_model_name="gpt-4.1-2025-04-14",
    modes=[ProcessingMode.FULL_PAGE],
    output_dir="outputs",
    prompt_version=PromptVersion.V1,
    evaluation=True,
    rerun=False
)
workflow.run_pipeline()

2025-05-24 15:39:53,544 - root - INFO - Model initialized: GeminiOCRModel
2025-05-24 15:39:53,545 - root - INFO - Saved OCR results to outputs/0e2a73f13785_ocr_results.json
2025-05-24 15:39:53,545 - root - INFO - STEP 1: Processing document: 0e2a73f13785 with OCR model: gemini-1.5-flash
2025-05-24 15:39:53,546 - root - INFO - Processing with mode: fullpage
2025-05-24 15:39:53,631 - OCRPipeline - INFO - Processing with model: GeminiOCRModel


Model type for 'gemini-1.5-flash': ModelType.GEMINI
Creating Gemini model...
Formatting prompt with kwargs: 
            Task: Extract OCR text from this full page of an 18th century Russian book.
            Requirements:
            - Preserve original Old Russian orthography
            - Process each line independently
            Respond with ONLY a JSON array where each object has a 'line' field containing the corrected text.
            No explanations needed.
            


2025-05-24 15:39:57,494 - llm_ocr.llm.gemini - INFO - Full page response text received: ```json
[
  {
    "line": "330"
  },
  {
    "line": "чемъ известковатая земля селениша"
  },
  {
    "line": "упадаетъ на дно, а масло не могши"
  },
  {
    "line": "соединиться съ водою, разсѣваетс...
2025-05-24 15:39:57,497 - llm_ocr.llm.gemini - INFO - Combined text: 330
чемъ известковатая земля селениша
упадаетъ на дно, а масло не могши
соединиться съ водою, разсѣвается въ
видѣ маленькихъ звѣздочекъ, чему все-
му и великому множеству другихъ
подобныхъ явленій ясн...
2025-05-24 15:39:57,500 - root - INFO - Processed 9 lines with mode fullpage in 3.95 seconds
2025-05-24 15:39:57,505 - root - INFO - Saved OCR results to outputs/0e2a73f13785_ocr_results.json
2025-05-24 15:39:57,514 - root - INFO - STEP 2: Evaluating OCR results for model: gemini-1.5-flash
2025-05-24 15:39:57,516 - root - INFO - Evaluating mode: fullpage
2025-05-24 15:39:57,519 - root - INFO - Converting raw results: [{'ground_trut

Model type for 'gpt-4.1-2025-04-14': ModelType.GPT
Creating OpenAI model...
Formatting prompt with kwargs: 
            You are an expert in Old Russian text correction.
            Correct this OCR text from an 18th century Russian book and format it as a single continuous paragraph.
            
            Requirements:
            - Preserve original Old Russian orthography
            - Join hyphenated words at line breaks (e.g., "вне-" + "дряться" -> "внедряться")
            - Remove all line breaks
            - Keep original punctuation
            - Maintain a single continuous paragraph
            - Keep the page number at the start if present
            
            Original text:
            330
чемъ известковатая земля селениша
упадаетъ на дно, а масло не могши
соединиться съ водою, разсѣвается въ
видѣ маленькихъ звѣздочекъ, чему все-
му и великому множеству другихъ
подобныхъ явленій ясно можно видѣть
причину въ слѣдующей Части.
Конецъ третей Части.
            
       

2025-05-24 15:40:10,883 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-24 15:40:10,916 - llm_ocr.pipelines.correction - INFO - Correction completed in 13.20 seconds
2025-05-24 15:40:10,917 - root - INFO - Mode 'line' completed in 13.20 seconds
2025-05-24 15:40:10,918 - root - INFO - Saved correction results to outputs/0e2a73f13785_correction_results.json
2025-05-24 15:40:10,918 - root - INFO - STEP 4: Evaluating correction results for combination 'gemini-1.5-flash_ocr__gpt-4.1-2025-04-14_correction', modes: ['line']
2025-05-24 15:40:10,919 - root - INFO - Evaluating mode: line
2025-05-24 15:40:10,920 - root - INFO - Converting raw results: [{'ground_truth_text': '330 чемъ известковатая земля селенита упадаетъ на дно , а масло не могши соединиться съ водою, разсѣвается въ видѣ маленькихъ звѣздочекъ, чему всему и великому множеству другихъ подобныхъ явленїй ясно можно видѣть причину въ слѣдующей Части. Конецъ третей Части. ', 'ext

In [1]:
import os
import sys
import logging
from pathlib import Path
from pprint import pprint

# Add the package to Python path
sys.path.insert(0, str(Path.cwd()))

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

print("✅ Basic imports complete")

✅ Basic imports complete


## Check Environment and API Keys

In [2]:
# Check for .env file
env_file = Path('.env')
if env_file.exists():
    print("📁 Found .env file, loading environment variables...")
    from dotenv import load_dotenv
    load_dotenv()
else:
    print("⚠️  No .env file found, using system environment variables")

# Check API keys
api_keys = {
    'ANTHROPIC_API_KEY': os.getenv('ANTHROPIC_API_KEY'),
    'OPENAI_API_KEY': os.getenv('OPENAI_API_KEY'),
    'GEMINI_API_KEY': os.getenv('GEMINI_API_KEY'),
    'TOGETHER_API_KEY': os.getenv('TOGETHER_API_KEY')
}

print("\n🔐 API Key Status:")
for key, value in api_keys.items():
    status = "✅ Set" if value else "❌ Missing"
    masked = f"{value[:8]}...{value[-4:]}" if value and len(value) > 12 else "Not set"
    print(f"  {key}: {status} ({masked})")

available_providers = [k.replace('_API_KEY', '').lower() for k, v in api_keys.items() if v]
print(f"\n🤖 Available providers: {available_providers}")

📁 Found .env file, loading environment variables...

🔐 API Key Status:
  ANTHROPIC_API_KEY: ✅ Set (sk-ant-a...CwAA)
  OPENAI_API_KEY: ✅ Set (sk-ssj4t...osiX)
  GEMINI_API_KEY: ✅ Set (AIzaSyBZ...zR5A)
  TOGETHER_API_KEY: ✅ Set (tgp_v1_H...igsQ)

🤖 Available providers: ['anthropic', 'openai', 'gemini', 'together']


## Import LLM OCR Package

In [3]:
try:
    from llm_ocr.workflow import OCRPipelineWorkflow
    from llm_ocr.models import ProcessingMode
    from llm_ocr.prompts.prompt import PromptVersion
    from llm_ocr.config import EvaluationConfig
    from llm_ocr.evaluators.evaluator import OCREvaluator
    from llm_ocr.model_factory import create_model
    
    print("✅ LLM OCR package imported successfully!")
except ImportError as e:
    print(f"❌ Failed to import LLM OCR package: {e}")
    print("Make sure you're running this from the package root directory")
    raise

✅ LLM OCR package imported successfully!


## Check Test Data

In [4]:
# Check for ground truth data
ground_truth_dir = Path('ground_truth')
if ground_truth_dir.exists():
    files = list(ground_truth_dir.glob('*'))
    print(f"📂 Found {len(files)} files in ground_truth directory:")
    
    # Group files by document ID
    documents = {}
    for file in files:
        doc_id = file.stem
        ext = file.suffix
        if doc_id not in documents:
            documents[doc_id] = {}
        documents[doc_id][ext] = file
    
    print("\n📄 Available documents:")
    complete_docs = []
    for doc_id, files in documents.items():
        has_xml = '.xml' in files
        has_image = any(ext in files for ext in ['.jpg', '.jpeg', '.png', '.tiff'])
        status = "✅ Complete" if has_xml and has_image else "⚠️  Incomplete"
        print(f"  {doc_id}: {status} (XML: {has_xml}, Image: {has_image})")
        if has_xml and has_image:
            complete_docs.append(doc_id)
    
    if complete_docs:
        selected_doc = complete_docs[0]
        print(f"\n🎯 Will use document: {selected_doc}")
    else:
        print("\n❌ No complete documents found!")
        print("Each document needs both XML and image files.")
else:
    print("❌ No ground_truth directory found!")
    print("Please create ground_truth/ directory with test ALTO XML and image files.")

📂 Found 4 files in ground_truth directory:

📄 Available documents:
  0e2a73f13785: ✅ Complete (XML: True, Image: True)
  0e2a73f13785_line: ⚠️  Incomplete (XML: False, Image: False)

🎯 Will use document: 0e2a73f13785


## Test Basic Components

In [5]:
# Test OCR Evaluator
print("🧪 Testing OCR Evaluator...")
config = EvaluationConfig()
evaluator = OCREvaluator(config)

# Test with sample data
ground_truth = "This is a test sentence."
extracted = "This is a tost sentence."

metrics = evaluator.evaluate_line(ground_truth, extracted)
print(f"✅ Character accuracy: {metrics.char_accuracy:.3f}")
print(f"✅ Word accuracy: {metrics.word_accuracy:.3f}")
print(f"✅ Case accuracy: {metrics.case_accuracy:.3f}")

🧪 Testing OCR Evaluator...
✅ Character accuracy: 0.958
✅ Word accuracy: 0.958
✅ Case accuracy: 1.000


## Test Model Creation

In [6]:
# Test model creation with available providers
model_configs = {
    'anthropic': 'claude-3-haiku-20240307',
    'openai': 'gpt-4o-mini',
    'gemini': 'gemini-1.5-flash',
    'together': 'meta-llama/Llama-3.2-3B-Instruct-Turbo'
}

selected_model = "gemini-1.5-flash"
selected_provider = "gemini"
selected_model_name = "gemini-1.5-flash"

print("🤖 Testing model creation...")
for provider in available_providers:
    if provider in model_configs:
        model_name = model_configs[provider]
        try:
            print(f"\n  Testing {provider} with {model_name}...")
            model = create_model(model_name, PromptVersion.V1)
            print(f"  ✅ {provider} model created successfully")
            if not selected_model:  # Use first successful model
                selected_model = model
                selected_provider = provider
                selected_model_name = model_name
        except Exception as e:
            print(f"  ❌ {provider} failed: {e}")

if selected_model:
    print(f"\n🎯 Will use {selected_provider} ({selected_model_name}) for testing")
else:
    print("\n❌ No models could be created. Check your API keys.")

🤖 Testing model creation...

  Testing anthropic with claude-3-haiku-20240307...
Model type for 'claude-3-haiku-20240307': ModelType.CLAUDE
Creating Claude model...
Model name: claude-3-haiku-20240307
Model type: ModelType.CLAUDE
Prompt version: PromptVersion.V1
  ✅ anthropic model created successfully

  Testing openai with gpt-4o-mini...
Model type for 'gpt-4o-mini': ModelType.GPT
Creating OpenAI model...
  ✅ openai model created successfully

  Testing gemini with gemini-1.5-flash...
Model type for 'gemini-1.5-flash': ModelType.GEMINI
Creating Gemini model...
  ✅ gemini model created successfully

  Testing together with meta-llama/Llama-3.2-3B-Instruct-Turbo...
Model type for 'meta-llama/Llama-3.2-3B-Instruct-Turbo': ModelType.TOGETHER
Creating Together model...
  ✅ together model created successfully

🎯 Will use gemini (gemini-1.5-flash) for testing


## Test Simple Text Correction

In [12]:
if selected_model:
    print(f"🔤 Testing text correction with {selected_provider}...")
    
    # Create a simple base64 image (1x1 white pixel)
    import base64
    simple_image = base64.b64encode(b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x01\x00\x00\x00\x01\x08\x02\x00\x00\x00\x90wS\xde\x00\x00\x00\x0bIDATx\x9cc```\x00\x00\x00\x04\x00\x01\x827\x9a\xd1\x00\x00\x00\x00IEND\xaeB`\x82').decode('utf-8')
    
    test_text = "Helo wrold, this is a tost with sme errors."
    
    try:
        print(f"\n📝 Original text: {test_text}")
        print(f"🔄 Sending to {selected_provider} for correction...")
        
        corrected = selected_model.correct_text(test_text, simple_image)
        
        print(f"✅ Corrected text: {corrected}")
        
        # Evaluate the correction
        expected = "Hello world, this is a test with some errors."
        original_metrics = evaluator.evaluate_line(expected, test_text)
        corrected_metrics = evaluator.evaluate_line(expected, corrected)
        
        print("\n📊 Correction Results:")
        print(f"  Original accuracy: {original_metrics.char_accuracy:.3f}")
        print(f"  Corrected accuracy: {corrected_metrics.char_accuracy:.3f}")
        improvement = corrected_metrics.char_accuracy - original_metrics.char_accuracy
        print(f"  Improvement: {improvement:+.3f}")
        
    except Exception as e:
        print(f"❌ Text correction failed: {e}")
        import traceback
        traceback.print_exc()
else:
    print("⏭️  Skipping text correction test - no model available")

🔤 Testing text correction with gemini...

📝 Original text: Helo wrold, this is a tost with sme errors.
🔄 Sending to gemini for correction...
❌ Text correction failed: 'str' object has no attribute 'correct_text'


Traceback (most recent call last):
  File "/tmp/ipykernel_3635186/1298088697.py", line 14, in <module>
    corrected = selected_model.correct_text(test_text, simple_image)
AttributeError: 'str' object has no attribute 'correct_text'


## Test Full Workflow (if test data available)

In [7]:
if 'complete_docs' in locals() and complete_docs and selected_model:
    print(f"🚀 Testing full workflow with document: {selected_doc}")
    
    try:
        # Create output directory
        output_dir = Path('workflow_output')
        output_dir.mkdir(exist_ok=True)
        
        # Initialize workflow
        workflow = OCRPipelineWorkflow(
            id=selected_doc,
            folder=str(ground_truth_dir),
            model_name=selected_model_name,
            modes=[ProcessingMode.FULL_PAGE],
            output_dir=str(output_dir),
            prompt_version=PromptVersion.V1,
            evaluation=True,
            rerun=False
        )
        
        print("✅ Workflow initialized")
        print(f"   Document ID: {workflow.id}")
        print(f"   Model: {workflow.model_name}")
        print(f"   Modes: {workflow.modes}")
        print(f"   Output: {workflow.output_dir}")
        
        # Note: Uncomment the next lines to run the full pipeline
        # WARNING: This will make API calls and consume credits!
        
        # print("\n⚠️  To run the full pipeline, uncomment the following lines:")
        # print("    # results = workflow.run_pipeline()")
        # print("    # print('Pipeline completed!')")
        # print("    # pprint(results)")
        
        # Uncomment these lines when you're ready to test with real API calls:
        workflow.run_pipeline()
        print("\n🎉 Pipeline completed successfully!")
        print("\n📊 Results summary:")
        pprint(workflow.results)
        if 'models' in workflow.results and selected_model_name in workflow.results['models']:
            model_results = workflow.results['models'][selected_model_name]
            if 'metrics' in model_results:
                metrics = model_results['metrics']
                print(f"   Character accuracy: {metrics.get('character_accuracy', 'N/A')}")
                print(f"   Word accuracy: {metrics.get('word_accuracy', 'N/A')}")
        
    except Exception as e:
        print(f"❌ Workflow test failed: {e}")
        import traceback
        traceback.print_exc()
        
else:
    print("⏭️  Skipping full workflow test")
    if 'complete_docs' not in locals() or not complete_docs:
        print("   Reason: No complete test documents available")
    if not selected_model:
        print("   Reason: No model available")

2025-05-24 12:10:55,575 - root - INFO - Loaded existing results file: workflow_output/0e2a73f13785_results.json
2025-05-24 12:10:55,576 - root - INFO - Model initialized: GeminiOCRModel
2025-05-24 12:10:55,578 - root - INFO - Saved results to workflow_output/0e2a73f13785_results.json
2025-05-24 12:10:55,578 - root - INFO - STEP 1: Processing document: 0e2a73f13785 with model: gemini-1.5-flash
2025-05-24 12:10:55,579 - root - INFO - Processing with mode: fullpage
2025-05-24 12:10:55,579 - root - INFO - Mode fullpage already processed. Skipping.
2025-05-24 12:10:55,579 - root - INFO - STEP 2: Evaluating OCR results for model: gemini-1.5-flash
2025-05-24 12:10:55,580 - root - INFO - Evaluating mode: fullpage
2025-05-24 12:10:55,580 - root - INFO - Converting raw results: [{'ground_truth_text': '330\nчемъ известковатая земля селенита\nупадаетъ на дно , а масло не могши\nсоединиться съ водою, разсѣвается въ\nвидѣ маленькихъ звѣздочекъ, чему все-\nму и великому множеству другихъ\nподобныхъ я

🚀 Testing full workflow with document: 0e2a73f13785
Model type for 'gemini-1.5-flash': ModelType.GEMINI
Creating Gemini model...
✅ Workflow initialized
   Document ID: 0e2a73f13785
   Model: gemini-1.5-flash
   Modes: [<ProcessingMode.FULL_PAGE: 'fullpage'>]
   Output: workflow_output
line_filename: ground_truth/0e2a73f13785_line.txt

🎉 Pipeline completed successfully!

📊 Results summary:
{'document_info': {'document_name': '0e2a73f13785',
                   'ground_truth': '330\n'
                                   'чемъ известковатая земля селенита\n'
                                   'упадаетъ на дно , а масло не могши\n'
                                   'соединиться съ водою, разсѣвается въ\n'
                                   'видѣ маленькихъ звѣздочекъ, чему все-\n'
                                   'му и великому множеству другихъ\n'
                                   'подобныхъ явленїй ясно можно видѣть\n'
                                   'причину въ слѣдующей Части.\n'


## Summary and Next Steps

In [8]:
print("📋 Test Summary:")
print(f"  Available API providers: {len(available_providers)}")
print(f"  Model creation: {'✅ Success' if selected_model else '❌ Failed'}")
print(f"  Test documents: {len(complete_docs) if 'complete_docs' in locals() else 0}")
print("  Basic evaluator: ✅ Working")

print("\n🎯 Next Steps:")
print("1. Ensure you have API keys set up for at least one provider")
print("2. Add test documents to ground_truth/ directory (XML + image pairs)")
print("3. Uncomment the full workflow test to run real OCR processing")
print("4. Monitor API usage and costs when running with real data")

print("\n📁 Files created:")
if Path('workflow_output').exists():
    output_files = list(Path('workflow_output').glob('*'))
    for file in output_files:
        print(f"  {file}")
else:
    print("  No output files (workflow not run)")

📋 Test Summary:
  Available API providers: 4
  Model creation: ✅ Success
  Test documents: 1
  Basic evaluator: ✅ Working

🎯 Next Steps:
1. Ensure you have API keys set up for at least one provider
2. Add test documents to ground_truth/ directory (XML + image pairs)
3. Uncomment the full workflow test to run real OCR processing
4. Monitor API usage and costs when running with real data

📁 Files created:
  workflow_output/0e2a73f13785_results.json
