🚀 Advanced OCR Benchmark Suite

A comprehensive comparison system for traditional OCR vs modern Vision Language Models (VLMs), featuring structured JSON extraction from complex dashboard images.

Python 3.9+ | OpenRouter | License: MIT

🎯 Overview

This project demonstrates how modern Vision Language Models significantly outperform traditional OCR systems for complex structured data extraction. It provides a complete testing framework comparing 13 VLM models against traditional OCR engines.

Key Features

  • 🤖 13 VLM Models: OpenAI GPT-4 series, Anthropic Claude, Google Gemini
  • ⚙️ Traditional OCR: EasyOCR, PaddleOCR, Tesseract comparison
  • 📊 Structured Extraction: JSON schema-based data extraction
  • 🧪 Automated Testing: Comprehensive test suite with performance metrics
  • 🎭 Multiple Interfaces: Streamlit and Gradio web applications
  • 📈 Quality Assessment: LLM-powered extraction quality analysis

🚀 Quick Start

Prerequisites

# Python 3.9+ required
python --version

# Get OpenRouter API key from https://openrouter.ai/

Installation

# Clone the repository
git clone <repository-url>
cd ocr

# Install dependencies  
pip install -r requirements.txt

# Set up environment
cp .env.example .env
# Edit .env and add your OPENROUTER_API_KEY

Run Applications

# Modern Gradio interface (recommended)
python gradio_main.py

# Streamlit interface  
streamlit run structured_benchmark.py

# Traditional OCR comparison
python ocr_tester.py

# Comprehensive testing
python run_tests.py --mode quick

📊 Supported Models

🤖 Vision Language Models (13 models)

OpenAI

  • GPT-4o - Best general performance ($0.005/1k tokens)
  • GPT-4o Mini - Cost-effective alternative ($0.00015/1k tokens)
  • GPT-4.1 - Latest generation ($0.008/1k tokens)
  • GPT-4.1 Mini/Nano - Efficient variants ($0.0001-0.0002/1k tokens)

Anthropic

  • Claude 3.5 Sonnet - Excellent reasoning ($0.003/1k tokens)
  • Claude Sonnet 4 - Latest generation ($0.003/1k tokens)
  • Claude 3.7 Sonnet - Improved reasoning ($0.003/1k tokens)

Google

  • Gemini 2.5 Pro - Latest generation ($0.001/1k tokens)
  • Gemini 2.5 Flash - Ultra fast/cheap ($0.000075/1k tokens)
  • Gemini 2.5 Flash Lite - Most economical ($0.00005/1k tokens)
  • Gemini 1.5 Pro/Flash - Previous generation

⚙️ Traditional OCR

  • EasyOCR - General purpose OCR
  • PaddleOCR - Baidu's OCR toolkit, strongest on Chinese with solid English support
  • Tesseract - Long-standing open-source OCR engine (originally from HP, later sponsored by Google)

🏗️ Architecture

Core Components

src/
├── config.py              # Configuration and model definitions
├── schemas.py              # Pydantic data models
├── factory.py              # Provider factory pattern
└── providers/
    ├── base.py             # Abstract base provider
    ├── structured_provider.py  # VLM structured extraction
    ├── openrouter_provider.py  # OpenRouter API integration
    └── traditional_providers.py # OCR engines
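
Providers are constructed through the factory, so benchmark code never depends on a concrete engine. The following is a minimal, self-contained sketch of that pattern; the class and function names are illustrative and may not match the exact API in factory.py and providers/base.py.

from abc import ABC, abstractmethod

class BaseProvider(ABC):
    """Stand-in for the abstract base provider (providers/base.py)."""

    @abstractmethod
    def extract(self, image_path: str) -> dict:
        ...

class TraditionalOCRProvider(BaseProvider):
    def extract(self, image_path: str) -> dict:
        # Traditional engines return plain text with no schema awareness
        return {"engine": "tesseract", "text": f"<raw OCR text from {image_path}>"}

class VLMProvider(BaseProvider):
    def __init__(self, model: str):
        self.model = model

    def extract(self, image_path: str) -> dict:
        # A real implementation would call OpenRouter here and validate the JSON reply
        return {"engine": self.model, "data": {}}

def create_provider(name: str) -> BaseProvider:
    """Factory: map a model/engine name from config.py to a provider instance."""
    if name in {"easyocr", "paddleocr", "tesseract"}:
        return TraditionalOCRProvider()
    return VLMProvider(name)

provider = create_provider("gpt-4o")
print(provider.extract("dashboard.png"))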

Applications

├── gradio_main.py          # Modern Gradio interface  
├── structured_benchmark.py # Streamlit interface
├── ocr_tester.py          # Traditional OCR testing
├── comprehensive_test_suite.py # Automated testing
└── run_tests.py           # Test runner with multiple modes

🧪 Testing Framework

Automated Testing

# Quick test (2-3 minutes) - subset of models
python run_tests.py --mode quick

# Full test (10-15 minutes) - all models and images  
python run_tests.py --mode full

# Provider-specific testing
python run_tests.py --mode provider --provider openai
python run_tests.py --mode provider --provider anthropic
python run_tests.py --mode provider --provider google

Test Coverage

  • VLM Structured Extraction - JSON schema validation
  • VLM Gradio Mode - Simplified JSON object extraction
  • Traditional OCR - Text extraction comparison
  • Error Handling - Fallback mechanisms and graceful failures
  • Performance Metrics - Execution time and success rates

Generated Reports

  • JSON Results - Machine-readable test data
  • Markdown Reports - Human-readable summaries with success rates
  • Performance Logs - Detailed execution and error logs

🎯 Use Cases

Dashboard Data Extraction

Extract structured data from complex analytics dashboards:

  • 📈 Chart data points and values
  • 📊 Key metrics and KPIs
  • 🕒 Time series data
  • 🏷️ Labels and categories
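
A hypothetical Pydantic target schema for this kind of dashboard data might look like the sketch below; the field names are illustrative, not the exact models defined in src/schemas.py.

from typing import Optional
from pydantic import BaseModel

class Metric(BaseModel):
    name: str                     # e.g. "Monthly Active Users"
    value: float
    unit: Optional[str] = None

class ChartPoint(BaseModel):
    label: str                    # x-axis label, e.g. "2024-01"
    value: float

class DashboardExtraction(BaseModel):
    title: str
    metrics: list[Metric]         # headline KPIs
    time_series: list[ChartPoint] # chart data points
    categories: list[str]         # labels and categories

# A VLM response can then be validated in one call (Pydantic v2 API):
# DashboardExtraction.model_validate_json(raw_json)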

Document Processing

  • 📄 Invoice and receipt processing
  • 📋 Form data extraction
  • 📊 Table and spreadsheet analysis
  • 🖼️ Image-based document understanding

Quality Assurance

  • 🔍 Automated extraction accuracy testing
  • 📊 Model performance comparison
  • 💰 Cost vs accuracy analysis
  • ⏱️ Processing speed benchmarks

📈 Performance Results

Based on testing across 3 complex dashboard images:

Provider          Success Rate   Avg Accuracy   Avg Speed   Cost Efficiency
VLM Models        85-95%         8.5-9.2/10     2-15s       High value
Traditional OCR   45-65%         3.2-5.1/10     0.5-2s      Low cost

Key Findings

  • VLM superiority: 70-80% more accurate on complex layouts
  • Context understanding: Filters watermarks, understands relationships
  • Cost effectiveness: Gemini Flash offers best value/performance ratio
  • Schema compliance: Varies by provider (strict vs fallback modes)

🔧 Technical Features

Advanced Compatibility

  • Schema Fallback - Automatic fallback from strict JSON schema to flexible mode
  • Markdown Parsing - Handles Claude's markdown-wrapped JSON responses
  • Model-Specific Handling - Optimized parameters per provider
  • Error Recovery - Graceful handling of API failures and timeouts
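
The markdown-parsing fallback can be approximated with a few lines of standard-library code; this is a simplified stand-in, not the exact logic in structured_provider.py.

import json
import re

def parse_model_json(raw: str) -> dict:
    """Parse a model reply that may wrap its JSON in ```json ... ``` fences."""
    fenced = re.search(r"```(?:json)?\s*(.*?)\s*```", raw, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Fallback: salvage the outermost {...} block if prose surrounds it
        start, end = candidate.find("{"), candidate.rfind("}")
        if start != -1 and end > start:
            return json.loads(candidate[start:end + 1])
        raise

print(parse_model_json('```json\n{"kpi": 42}\n```'))   # {'kpi': 42}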

Quality Assessment

  • Dual-LLM Pipeline - Separate models for extraction and quality assessment
  • Structured Validation - Pydantic schema enforcement
  • Confidence Scoring - Completeness, accuracy, and structure metrics
  • Automated Recommendations - Specific improvement suggestions
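
The confidence scores from the second ("judge") LLM can be captured in a small Pydantic model, roughly as follows; the field names and weighting are assumptions for illustration, not the project's exact schema.

from pydantic import BaseModel, Field

class QualityAssessment(BaseModel):
    completeness: float = Field(ge=0, le=10)   # how much of the dashboard was captured
    accuracy: float = Field(ge=0, le=10)       # extracted values match the image
    structure: float = Field(ge=0, le=10)      # output conforms to the requested schema
    recommendations: list[str] = []

    @property
    def overall(self) -> float:
        return round((self.completeness + self.accuracy + self.structure) / 3, 1)

report = QualityAssessment(completeness=9, accuracy=8.5, structure=10,
                           recommendations=["Include axis units for the revenue chart"])
print(report.overall)   # 9.2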

Scalability

  • Async Processing - Concurrent model execution
  • Rate Limiting - Built-in API quota management
  • Caching - Response caching for development
  • Monitoring - Comprehensive logging and metrics
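
The concurrency model can be approximated with asyncio plus a semaphore acting as a simple rate limiter; this is a sketch, not the project's exact implementation.

import asyncio

async def run_model(model: str, image_path: str, sem: asyncio.Semaphore) -> dict:
    async with sem:
        # A real implementation would await the OpenRouter request here
        await asyncio.sleep(0.1)   # stand-in for network latency
        return {"model": model, "image": image_path, "status": "ok"}

async def benchmark(models: list[str], image_path: str, max_concurrent: int = 3) -> list[dict]:
    sem = asyncio.Semaphore(max_concurrent)   # cap in-flight API requests
    return await asyncio.gather(*(run_model(m, image_path, sem) for m in models))

results = asyncio.run(benchmark(
    ["gpt-4o", "claude-3.5-sonnet", "gemini-2.5-flash"], "dashboard.png"))
print(results)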

🛠️ Configuration

Environment Variables

OPENROUTER_API_KEY=your_api_key_here
MAX_RETRIES=3
TIMEOUT_SECONDS=30
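
These variables are typically loaded at startup; a sketch using python-dotenv (assumed here as the loading mechanism) would look like this.

import os
from dotenv import load_dotenv   # pip install python-dotenv

load_dotenv()                    # reads .env from the project root

OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"]      # fail fast if missing
MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))
TIMEOUT_SECONDS = int(os.getenv("TIMEOUT_SECONDS", "30"))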

Model Configuration

# In src/config.py
available_models = {
    "gpt-4o": {
        "cost_per_1k_tokens": 0.005,
        "supports_strict_json_schema": True,
        "supports_vision": True
    },
    # ... additional models
}

📚 Documentation

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/new-provider)
  3. Add your changes with tests
  4. Run the test suite (python run_tests.py --mode full)
  5. Submit a pull request

Adding New Models

  1. Update model list in src/config.py
  2. Set appropriate capability flags (supports_strict_json_schema, etc.)
  3. Test with the comprehensive test suite
  4. Update documentation
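
For example, a hypothetical new entry would only need its pricing and capability flags; the keys should mirror the ones already used in config.py.

# In src/config.py -- hypothetical entry for a newly added model
available_models["claude-3.5-haiku"] = {
    "cost_per_1k_tokens": 0.0008,
    "supports_strict_json_schema": False,   # provider falls back to flexible JSON mode
    "supports_vision": True,
}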

📊 Monitoring & Analytics

Success Rate Tracking

Monitor model performance over time:

  • Provider-specific success rates
  • Schema compliance metrics
  • Response time trends
  • Cost optimization opportunities

Quality Metrics

  • Data extraction completeness
  • Accuracy vs ground truth
  • Schema validation success
  • Error pattern analysis

🔮 Future Enhancements

Short Term

  • Batch API optimization for cost reduction
  • Custom prompting for domain-specific use cases
  • Error correction pipeline with human feedback
  • Template matching for common dashboard types

Long Term

  • Fine-tuning models for OCR-specific tasks
  • Hybrid approaches combining VLM + traditional OCR
  • Real-time processing for video streams
  • Enterprise API service deployment

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

  • OpenRouter for unified VLM API access
  • Pydantic for robust data validation
  • Streamlit & Gradio for rapid UI development
  • Traditional OCR libraries for baseline comparison

Ready to revolutionize your document processing? 🚀

Get started | View documentation | Run tests
