A comprehensive comparison system for traditional OCR vs modern Vision Language Models (VLMs), featuring structured JSON extraction from complex dashboard images.
This project demonstrates how modern Vision Language Models significantly outperform traditional OCR systems for complex structured data extraction. It provides a complete testing framework comparing 13 VLM models against traditional OCR engines.
- 🤖 13 VLM Models: OpenAI GPT-4 series, Anthropic Claude, Google Gemini
- ⚙️ Traditional OCR: EasyOCR, PaddleOCR, Tesseract comparison
- 📊 Structured Extraction: JSON schema-based data extraction
- 🧪 Automated Testing: Comprehensive test suite with performance metrics
- 🎭 Multiple Interfaces: Streamlit and Gradio web applications
- 📈 Quality Assessment: LLM-powered extraction quality analysis
```bash
# Python 3.9+ required
python --version

# Get OpenRouter API key from https://openrouter.ai/
```

```bash
# Clone the repository
git clone <repository-url>
cd ocr

# Install dependencies
pip install -r requirements.txt

# Set up environment
cp .env.example .env
# Edit .env and add your OPENROUTER_API_KEY
```

```bash
# Modern Gradio interface (recommended)
python gradio_main.py

# Streamlit interface
streamlit run structured_benchmark.py

# Traditional OCR comparison
python ocr_tester.py

# Comprehensive testing
python run_tests.py --mode quick
```

- GPT-4o - Best general performance ($0.005/1k tokens)
- GPT-4o Mini - Cost-effective alternative ($0.00015/1k tokens)
- GPT-4.1 - Latest generation ($0.008/1k tokens)
- GPT-4.1 Mini/Nano - Efficient variants ($0.0001-0.0002/1k tokens)
- Claude 3.5 Sonnet - Excellent reasoning ($0.003/1k tokens)
- Claude Sonnet 4 - Latest generation ($0.003/1k tokens)
- Claude 3.7 Sonnet - Improved reasoning ($0.003/1k tokens)
- Gemini 2.5 Pro - Latest generation ($0.001/1k tokens)
- Gemini 2.5 Flash - Ultra fast/cheap ($0.000075/1k tokens)
- Gemini 2.5 Flash Lite - Most economical ($0.00005/1k tokens)
- Gemini Pro/Flash 1.5 - Previous generation
- EasyOCR - General purpose OCR
- PaddleOCR - Chinese-focused with English support
- Tesseract - Google's OCR engine
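All of the VLM models above are reached through OpenRouter's unified chat-completions endpoint. The sketch below shows what a single structured-extraction call might look like, using only the standard library; the prompt text, model slug, and image URL are placeholders, and the message shape follows the OpenAI-style vision format that OpenRouter accepts.

```python
import json
import os
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_extraction_request(model: str, image_url: str) -> dict:
    """Build an OpenRouter chat-completions payload that asks for a JSON object."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Extract the dashboard's key metrics as a JSON object."},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        # Plain JSON mode; models with strict schema support can use json_schema.
        "response_format": {"type": "json_object"},
    }

payload = build_extraction_request("openai/gpt-4o-mini",
                                   "https://example.com/dash.png")

# Only perform the network call when a key is configured.
api_key = os.getenv("OPENROUTER_API_KEY")
if api_key:
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    extracted = json.loads(body["choices"][0]["message"]["content"])
```

Swapping the model slug is all it takes to benchmark a different provider, which is what makes a unified endpoint convenient for a 13-model comparison.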
```
.
├── src/
│   ├── config.py                    # Configuration and model definitions
│   ├── schemas.py                   # Pydantic data models
│   ├── factory.py                   # Provider factory pattern
│   └── providers/
│       ├── base.py                  # Abstract base provider
│       ├── structured_provider.py   # VLM structured extraction
│       ├── openrouter_provider.py   # OpenRouter API integration
│       └── traditional_providers.py # OCR engines
├── gradio_main.py                   # Modern Gradio interface
├── structured_benchmark.py          # Streamlit interface
├── ocr_tester.py                    # Traditional OCR testing
├── comprehensive_test_suite.py      # Automated testing
└── run_tests.py                     # Test runner with multiple modes
```
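The factory pattern mentioned for `src/factory.py` can be sketched as a simple registry lookup. All names below are illustrative, not the project's actual identifiers; the real abstract base lives in `src/providers/base.py`.

```python
from abc import ABC, abstractmethod

class BaseProvider(ABC):
    """Illustrative stand-in for the abstract base provider."""

    @abstractmethod
    def extract(self, image_path: str) -> dict:
        """Extract structured data from one image."""

class TesseractProvider(BaseProvider):
    def extract(self, image_path: str) -> dict:
        # A real implementation would call the OCR engine here; stubbed for the sketch.
        return {"engine": "tesseract", "text": ""}

# Registry mapping provider names to implementation classes.
_REGISTRY = {"tesseract": TesseractProvider}

def create_provider(name: str) -> BaseProvider:
    """Factory entry point: resolve a provider name to an instance."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(f"Unknown provider: {name!r}") from None
```

The benefit for a benchmark like this one is that the test runner only ever touches `create_provider`, so adding a fourteenth model means registering one more class.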
```bash
# Quick test (2-3 minutes) - subset of models
python run_tests.py --mode quick

# Full test (10-15 minutes) - all models and images
python run_tests.py --mode full

# Provider-specific testing
python run_tests.py --mode provider --provider openai
python run_tests.py --mode provider --provider anthropic
python run_tests.py --mode provider --provider google
```

- ✅ VLM Structured Extraction - JSON schema validation
- ✅ VLM Gradio Mode - Simplified JSON object extraction
- ✅ Traditional OCR - Text extraction comparison
- ✅ Error Handling - Fallback mechanisms and graceful failures
- ✅ Performance Metrics - Execution time and success rates
- JSON Results - Machine-readable test data
- Markdown Reports - Human-readable summaries with success rates
- Performance Logs - Detailed execution and error logs
Extract structured data from complex analytics dashboards:
- 📈 Chart data points and values
- 📊 Key metrics and KPIs
- 🕒 Time series data
- 🏷️ Labels and categories
- 📄 Invoice and receipt processing
- 📋 Form data extraction
- 📊 Table and spreadsheet analysis
- 🖼️ Image-based document understanding
- 🔍 Automated extraction accuracy testing
- 📊 Model performance comparison
- 💰 Cost vs accuracy analysis
- ⏱️ Processing speed benchmarks
Based on testing across 3 complex dashboard images:
| Provider | Success Rate | Avg Accuracy | Avg Speed | Cost Efficiency |
|---|---|---|---|---|
| VLM Models | 85-95% | 8.5-9.2/10 | 2-15s | High value |
| Traditional OCR | 45-65% | 3.2-5.1/10 | 0.5-2s | Low cost |
- VLM superiority: 70-80% more accurate on complex layouts
- Context understanding: Filters watermarks, understands relationships
- Cost effectiveness: Gemini Flash offers best value/performance ratio
- Schema compliance: Varies by provider (strict vs fallback modes)
- Schema Fallback - Automatic fallback from strict JSON schema to flexible mode
- Markdown Parsing - Handles Claude's markdown-wrapped JSON responses
- Model-Specific Handling - Optimized parameters per provider
- Error Recovery - Graceful handling of API failures and timeouts
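Markdown-wrapped JSON is the most mechanical of these robustness problems: Claude often returns its JSON inside a fenced code block. A minimal parser for that case might look like this (the function name is illustrative):

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Parse a model response that may wrap its JSON in ```json fences."""
    # Pull the object out of a fenced block if one is present;
    # otherwise assume the response is bare JSON.
    match = re.search(r"```(?:json)?\s*(\{.*\})\s*```", raw, re.DOTALL)
    candidate = match.group(1) if match else raw
    return json.loads(candidate)
```

Letting `json.loads` raise on anything unparseable keeps the error-recovery decision (retry, fallback mode, or skip) in the caller's hands.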
- Dual-LLM Pipeline - Separate models for extraction and quality assessment
- Structured Validation - Pydantic schema enforcement
- Confidence Scoring - Completeness, accuracy, and structure metrics
- Automated Recommendations - Specific improvement suggestions
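The Pydantic enforcement step can be pictured with a toy schema. These models are hypothetical (the real definitions live in `src/schemas.py`), but they show the mechanism: a dict straight from the model response either coerces cleanly into typed objects or raises a validation error.

```python
from typing import List, Optional

from pydantic import BaseModel, Field

class Metric(BaseModel):
    """One KPI read off a dashboard (fields are illustrative)."""
    label: str
    value: float
    unit: Optional[str] = None

class DashboardExtraction(BaseModel):
    """Hypothetical top-level schema for one extracted dashboard."""
    title: str
    metrics: List[Metric] = Field(default_factory=list)

# Raw dicts (e.g. parsed model output) are coerced into typed Metric objects.
doc = DashboardExtraction(
    title="Sales Overview",
    metrics=[{"label": "Revenue", "value": 1200000.0, "unit": "USD"}],
)
```

Because validation happens at construction time, schema compliance becomes a boolean per extraction, which is what feeds the success-rate numbers in the reports.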
- Async Processing - Concurrent model execution
- Rate Limiting - Built-in API quota management
- Caching - Response caching for development
- Monitoring - Comprehensive logging and metrics
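Concurrent execution with a concurrency cap can be sketched in a few lines of `asyncio`. The API call is simulated with a sleep; in the real pipeline each task would hit OpenRouter, and the semaphore is one simple way to stay inside API quotas.

```python
import asyncio
from typing import Dict, List, Tuple

async def run_model(model: str, sem: asyncio.Semaphore) -> Tuple[str, str]:
    """Stand-in for one extraction call; the semaphore caps in-flight requests."""
    async with sem:
        await asyncio.sleep(0.01)  # placeholder for the real API round-trip
        return model, "ok"

async def run_all(models: List[str], max_concurrent: int = 3) -> Dict[str, str]:
    """Fan out across all models concurrently; collect results keyed by model."""
    sem = asyncio.Semaphore(max_concurrent)
    pairs = await asyncio.gather(*(run_model(m, sem) for m in models))
    return dict(pairs)

results = asyncio.run(run_all(["gpt-4o", "claude-3.5-sonnet", "gemini-2.5-flash"]))
```

With 13 models per image, running them concurrently rather than sequentially is what keeps the quick test in the 2-3 minute range.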
```bash
OPENROUTER_API_KEY=your_api_key_here
MAX_RETRIES=3
TIMEOUT_SECONDS=30
```

```python
# In src/config.py
available_models = {
    "gpt-4o": {
        "cost_per_1k_tokens": 0.005,
        "supports_strict_json_schema": True,
        "supports_vision": True,
    },
    # ... additional models
}
```

- `CLAUDE.md` - Detailed technical documentation (Italian)
- `TEST_SUITE_README.md` - Testing framework guide
- `src/schemas.py` - Data model definitions
- `src/config.py` - Configuration reference
- Fork the repository
- Create a feature branch (`git checkout -b feature/new-provider`)
- Add your changes with tests
- Run the test suite (`python run_tests.py --mode full`)
- Submit a pull request

- Update the model list in `src/config.py`
- Set appropriate capability flags (`supports_strict_json_schema`, etc.)
- Test with the comprehensive test suite
- Update documentation
Monitor model performance over time:
- Provider-specific success rates
- Schema compliance metrics
- Response time trends
- Cost optimization opportunities
- Data extraction completeness
- Accuracy vs ground truth
- Schema validation success
- Error pattern analysis
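Since the test suite already emits machine-readable JSON results, per-provider success rates can be computed with a small aggregation pass. The record shape below is an assumption, not the suite's actual output format:

```python
from collections import defaultdict
from typing import Dict, List

def success_rates(records: List[dict]) -> Dict[str, float]:
    """Per-provider success rate from flat test records.

    Assumed record shape: {"provider": str, "success": bool}.
    """
    counts = defaultdict(lambda: [0, 0])  # provider -> [total, successes]
    for rec in records:
        counts[rec["provider"]][0] += 1
        counts[rec["provider"]][1] += int(rec["success"])
    return {p: ok / total for p, (total, ok) in counts.items()}

sample = [
    {"provider": "gpt-4o", "success": True},
    {"provider": "gpt-4o", "success": True},
    {"provider": "tesseract", "success": False},
    {"provider": "tesseract", "success": True},
]
rates = success_rates(sample)
```

Running this over results collected across releases is enough to spot the response-time and compliance trends listed above.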
- Batch API optimization for cost reduction
- Custom prompting for domain-specific use cases
- Error correction pipeline with human feedback
- Template matching for common dashboard types
- Fine-tuning models for OCR-specific tasks
- Hybrid approaches combining VLM + traditional OCR
- Real-time processing for video streams
- Enterprise API service deployment
MIT License - see LICENSE file for details.
- OpenRouter for unified VLM API access
- Pydantic for robust data validation
- Streamlit & Gradio for rapid UI development
- Traditional OCR libraries for baseline comparison
Ready to revolutionize your document processing? 🚀