# ImageToText - OCR Text Extraction

This notebook provides an interactive interface for extracting text from images using multiple OCR engines.

## Features
- Support for Tesseract and EasyOCR engines
- Multiple input formats (PNG, JPG, and PDF pages rendered to images)
- Batch processing
- Confidence scoring
- Multi-language support

## Quick Setup

Run this cell to import the required libraries. If a package is missing, uncomment the `pip install` line first:

In [None]:
# Install required packages (uncomment if needed)
# !pip install pytesseract easyocr Pillow opencv-python PyMuPDF numpy

# Import required libraries
import sys
import os
from pathlib import Path
import json
import time
from datetime import datetime
from typing import List, Dict, Optional
from dataclasses import dataclass, asdict
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from IPython.display import display, Image, HTML
import pandas as pd

# Import OCR libraries with graceful fallbacks
try:
    import pytesseract
    from PIL import Image as PILImage, ImageEnhance
    TESSERACT_AVAILABLE = True
    print("✓ pytesseract available (the Tesseract binary must be installed separately)")
except ImportError:
    TESSERACT_AVAILABLE = False
    print("✗ Tesseract OCR not available. Install with: pip install pytesseract pillow")

try:
    import easyocr
    EASYOCR_AVAILABLE = True
    print("✓ EasyOCR available")
except ImportError:
    EASYOCR_AVAILABLE = False
    print("✗ EasyOCR not available. Install with: pip install easyocr")

try:
    import fitz  # PyMuPDF
    PDF_SUPPORT = True
    print("✓ PDF support available")
except ImportError:
    PDF_SUPPORT = False
    print("✗ PDF support not available. Install with: pip install PyMuPDF")

print("\n🚀 ImageToText notebook ready!")
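
A successful `import pytesseract` only confirms that the Python wrapper is installed; the Tesseract executable itself also has to be on the system `PATH`. The following is a minimal sketch of that check using only the standard library. The helper name `find_ocr_binary` is hypothetical and not part of this notebook.

```python
import shutil
from typing import Optional


def find_ocr_binary(name: str = "tesseract") -> Optional[str]:
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


binary_path = find_ocr_binary()
if binary_path:
    print(f"Tesseract binary found at: {binary_path}")
else:
    print("Tesseract binary not found on PATH; install it with your system package manager")
```

If the binary lives outside `PATH`, pytesseract can be pointed at it by setting `pytesseract.pytesseract.tesseract_cmd` to the full path.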

## Core Classes and Functions

Define the main OCR functionality:

In [None]:
@dataclass
class OCRResult:
    """Container for OCR results"""
    text: str
    confidence: float
    engine: str
    language: str
    processing_time: float
    image_path: str
    timestamp: str

class ImagePreprocessor:
    """Image preprocessing utilities"""
    
    @staticmethod
    def enhance_image(image: "PILImage.Image", enhance_contrast: bool = True,
                     enhance_sharpness: bool = True) -> "PILImage.Image":
        """Apply image enhancements"""
        # Annotations are quoted so this class still defines when Pillow is missing
        if enhance_contrast:
            enhancer = ImageEnhance.Contrast(image)
            image = enhancer.enhance(1.5)
        
        if enhance_sharpness:
            enhancer = ImageEnhance.Sharpness(image)
            image = enhancer.enhance(1.2)
        
        return image
    
    @staticmethod
    def convert_to_grayscale(image: "PILImage.Image") -> "PILImage.Image":
        """Convert to grayscale"""
        return image.convert('L')

class TesseractOCR:
    """Tesseract OCR wrapper"""
    
    def __init__(self, language: str = 'eng'):
        if not TESSERACT_AVAILABLE:
            raise ImportError("Tesseract OCR not available")
        self.language = language
    
    def extract_text(self, image_path: str) -> OCRResult:
        """Extract text using Tesseract"""
        start_time = time.time()
        
        try:
            image = PILImage.open(image_path)
            
            # Preprocess image
            preprocessor = ImagePreprocessor()
            image = preprocessor.enhance_image(image)
            image = preprocessor.convert_to_grayscale(image)
            
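            # Get per-word confidence data; Tesseract reports -1 for boxes without a word
            data = pytesseract.image_to_data(image, lang=self.language,
                                           output_type=pytesseract.Output.DICT)
            # float() handles confidences reported as ints, floats, or numeric strings
            confidences = [float(conf) for conf in data['conf'] if float(conf) >= 0]
            avg_confidence = sum(confidences) / len(confidences) if confidences else 0.0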
            
            # Extract text
            text = pytesseract.image_to_string(image, lang=self.language)
            
            return OCRResult(
                text=text.strip(),
                confidence=avg_confidence,
                engine='tesseract',
                language=self.language,
                processing_time=time.time() - start_time,
                image_path=image_path,
                timestamp=datetime.now().isoformat()
            )
            
        except Exception as e:
            print(f"Error with Tesseract OCR: {e}")
            return OCRResult(
                text="", confidence=0.0, engine='tesseract',
                language=self.language, processing_time=time.time() - start_time,
                image_path=image_path, timestamp=datetime.now().isoformat()
            )

class EasyOCREngine:
    """EasyOCR wrapper"""
    
    def __init__(self, languages: Optional[List[str]] = None):
        if not EASYOCR_AVAILABLE:
            raise ImportError("EasyOCR not available")
        # Avoid a mutable default argument; fall back to English
        self.languages = list(languages) if languages else ['en']
        print(f"Initializing EasyOCR with languages: {self.languages}...")
        self.reader = easyocr.Reader(self.languages)
        print("EasyOCR initialized!")
    
    def extract_text(self, image_path: str) -> OCRResult:
        """Extract text using EasyOCR"""
        start_time = time.time()
        
        try:
            results = self.reader.readtext(image_path, detail=1)
            
            text_parts = []
            confidences = []
            
            for (bbox, text, confidence) in results:
                text_parts.append(text)
                confidences.append(confidence)
            
            combined_text = '\n'.join(text_parts)
            avg_confidence = sum(confidences) / len(confidences) if confidences else 0
            
            return OCRResult(
                text=combined_text.strip(),
                confidence=avg_confidence * 100,
                engine='easyocr',
                language=','.join(self.languages),
                processing_time=time.time() - start_time,
                image_path=image_path,
                timestamp=datetime.now().isoformat()
            )
            
        except Exception as e:
            print(f"Error with EasyOCR: {e}")
            return OCRResult(
                text="", confidence=0.0, engine='easyocr',
                language=','.join(self.languages), processing_time=time.time() - start_time,
                image_path=image_path, timestamp=datetime.now().isoformat()
            )

print("✓ OCR classes defined")
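
Both wrappers reduce per-word scores to a single average. For Tesseract, negative entries are skipped because the engine reports `-1` for layout boxes that contain no word, while EasyOCR scores lie in the 0 to 1 range and are scaled to percent. The sketch below isolates the Tesseract-style averaging so it can be tried without an OCR engine; the function name `mean_word_confidence` is hypothetical.

```python
from typing import Iterable, Union


def mean_word_confidence(raw_confidences: Iterable[Union[int, float, str]]) -> float:
    """Average per-word confidences, skipping Tesseract's -1 placeholders."""
    values = [float(c) for c in raw_confidences]
    words = [v for v in values if v >= 0]  # -1 marks boxes that contain no word
    return sum(words) / len(words) if words else 0.0


# Mixed types mimic what different pytesseract versions may return
print(mean_word_confidence([-1, -1, 96, "88.5", 91]))
```

Returning `0.0` for an empty list keeps the result usable in later arithmetic when an image contains no detectable text.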

## Helper Functions

Utility functions for visualization and analysis:

In [None]:
def display_image_with_text(image_path: str, result: OCRResult, max_width: int = 800):
    """Display image alongside extracted text"""
    
    # Display image
    print(f"📷 Image: {Path(image_path).name}")
    display(Image(image_path, width=max_width))
    
    # Display metadata
    print(f"\n📊 OCR Results:")
    print(f"Engine: {result.engine}")
    print(f"Language: {result.language}")
    print(f"Confidence: {result.confidence:.1f}%")
    print(f"Processing Time: {result.processing_time:.2f}s")
    
    # Display extracted text
    print(f"\n📝 Extracted Text:")
    print("-" * 50)
    print(result.text if result.text else "[No text detected]")
    print("-" * 50)

def compare_ocr_engines(image_path: str, languages: Optional[List[str]] = None):
    """Compare results from different OCR engines"""
    languages = languages or ['eng']  # avoid a mutable default argument
    results = {}
    
    # Test Tesseract
    if TESSERACT_AVAILABLE:
        try:
            tesseract = TesseractOCR(languages[0] if languages else 'eng')
            results['tesseract'] = tesseract.extract_text(image_path)
        except Exception as e:
            print(f"Tesseract failed: {e}")
    
    # Test EasyOCR
    if EASYOCR_AVAILABLE:
        try:
            # Map Tesseract language codes to EasyOCR codes; unknown codes pass through
            lang_map = {'eng': 'en', 'fra': 'fr', 'deu': 'de', 'spa': 'es'}
            easy_langs = [lang_map.get(lang, lang) for lang in languages]
            easyocr_engine = EasyOCREngine(easy_langs)
            results['easyocr'] = easyocr_engine.extract_text(image_path)
        except Exception as e:
            print(f"EasyOCR failed: {e}")
    
    # Display comparison
    print(f"\n🔍 OCR Engine Comparison for {Path(image_path).name}")
    print("=" * 60)
    
    for engine, result in results.items():
        print(f"\n{engine.upper()}:")
        print(f"  Confidence: {result.confidence:.1f}%")
        print(f"  Time: {result.processing_time:.2f}s")
        print(f"  Text Length: {len(result.text)} characters")
        print(f"  Preview: {result.text[:100]}{'...' if len(result.text) > 100 else ''}")
    
    return results

def batch_process_images(image_paths: List[str], engine: str = 'tesseract', language: str = 'eng') -> pd.DataFrame:
    """Process multiple images and return results as DataFrame"""
    results = []
    
    # Initialize OCR engine
    if engine == 'tesseract' and TESSERACT_AVAILABLE:
        ocr = TesseractOCR(language)
    elif engine == 'easyocr' and EASYOCR_AVAILABLE:
        lang_map = {'eng': 'en', 'fra': 'fr', 'deu': 'de', 'spa': 'es'}
        easy_lang = lang_map.get(language, language)
        ocr = EasyOCREngine([easy_lang])
    else:
        raise ValueError(f"OCR engine '{engine}' is unknown or its library is not installed")
    
    # Process images
    for i, image_path in enumerate(image_paths):
        print(f"Processing {i+1}/{len(image_paths)}: {Path(image_path).name}")
        
        if not os.path.exists(image_path):
            print(f"  ⚠️ File not found: {image_path}")
            continue
            
        try:
            result = ocr.extract_text(image_path)
            results.append(asdict(result))
            print(f"  ✓ Processed ({result.confidence:.1f}% confidence)")
        except Exception as e:
            print(f"  ❌ Failed: {e}")
    
    # Convert to DataFrame (an empty list yields an empty DataFrame)
    return pd.DataFrame(results)

def analyze_results(df: pd.DataFrame):
    """Analyze batch processing results"""
    if df.empty:
        print("No results to analyze")
        return
    
    print("📊 Batch Processing Analysis")
    print("=" * 40)
    print(f"Total Images: {len(df)}")
    print(f"Average Confidence: {df['confidence'].mean():.1f}%")
    print(f"Average Processing Time: {df['processing_time'].mean():.2f}s")
    print(f"Total Processing Time: {df['processing_time'].sum():.2f}s")
    
    # Confidence distribution
    high_conf = df[df['confidence'] >= 80]
    medium_conf = df[(df['confidence'] >= 60) & (df['confidence'] < 80)]
    low_conf = df[df['confidence'] < 60]
    
    print(f"\nConfidence Distribution:")
    print(f"  High (80%+): {len(high_conf)} images")
    print(f"  Medium (60-80%): {len(medium_conf)} images")
    print(f"  Low (<60%): {len(low_conf)} images")
    
    # Plot confidence distribution
    plt.figure(figsize=(10, 6))
    
    plt.subplot(1, 2, 1)
    plt.hist(df['confidence'], bins=20, alpha=0.7, color='skyblue')
    plt.xlabel('Confidence (%)')
    plt.ylabel('Number of Images')
    plt.title('OCR Confidence Distribution')
    
    plt.subplot(1, 2, 2)
    plt.scatter(df['processing_time'], df['confidence'], alpha=0.7)
    plt.xlabel('Processing Time (s)')
    plt.ylabel('Confidence (%)')
    plt.title('Processing Time vs Confidence')
    
    plt.tight_layout()
    plt.show()

print("✓ Helper functions defined")
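
Tesseract and EasyOCR name languages differently (for example `eng` versus `en`), so the helpers translate codes before creating an EasyOCR reader. A minimal sketch of that translation as a single reusable function follows. The names `TESSERACT_TO_EASYOCR` and `to_easyocr_langs` are hypothetical, and the table only covers the four languages this notebook already maps.

```python
from typing import Dict, List

# Partial mapping from Tesseract codes to EasyOCR codes (extend as needed)
TESSERACT_TO_EASYOCR: Dict[str, str] = {
    "eng": "en",
    "fra": "fr",
    "deu": "de",
    "spa": "es",
}


def to_easyocr_langs(tesseract_langs: List[str]) -> List[str]:
    """Translate Tesseract codes to EasyOCR codes, passing unknown codes through."""
    return [TESSERACT_TO_EASYOCR.get(code, code) for code in tesseract_langs]


print(to_easyocr_langs(["eng", "fra", "ja"]))  # → ['en', 'fr', 'ja']
```

Passing unknown codes through unchanged lets callers supply EasyOCR codes directly, at the cost of deferring any error to the reader's initialization.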

## 🚀 Quick Start Examples

### Example 1: Single Image Processing

Upload an image and extract text:

In [None]:
# Example: Process a single image
# Replace the path below with the path to your own image

image_path = "examples/sample_image.png"  # Change this to your image path

# Check if file exists
if os.path.exists(image_path):
    # Using Tesseract
    if TESSERACT_AVAILABLE:
        tesseract = TesseractOCR('eng')  # English
        result = tesseract.extract_text(image_path)
        display_image_with_text(image_path, result)
    else:
        print("Tesseract not available. Please install it first.")
else:
    print(f"Image file not found: {image_path}")
    print("Please update the image_path variable with a valid image file.")
    print("\nYou can upload images to the same directory as this notebook or use full paths.")

### Example 2: Compare OCR Engines

Test both Tesseract and EasyOCR on the same image:

In [None]:
# Compare different OCR engines
image_path = "examples/sample_image.png"  # Change this to your image path

if os.path.exists(image_path):
    results = compare_ocr_engines(image_path, ['eng'])
    
    # Display the image
    display(Image(image_path, width=600))
else:
    print(f"Image file not found: {image_path}")
    print("Please update the image_path variable with a valid image file.")
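
The comparison prints confidence and timing per engine, but it does not say how much the two transcriptions agree. One rough way to quantify that is a character-level similarity ratio from the standard library's `difflib`. This is a sketch under that assumption rather than a standard OCR metric, and the function name `text_agreement` is hypothetical.

```python
from difflib import SequenceMatcher


def text_agreement(text_a: str, text_b: str) -> float:
    """Return a similarity ratio in [0, 1] after normalizing whitespace and case."""
    normalized_a = " ".join(text_a.lower().split())
    normalized_b = " ".join(text_b.lower().split())
    if not normalized_a and not normalized_b:
        return 1.0  # two empty outputs agree trivially
    return SequenceMatcher(None, normalized_a, normalized_b).ratio()


# Hypothetical outputs from two engines for the same image
print(f"{text_agreement('Invoice  No. 1234', 'invoice no. 1234'):.2f}")  # → 1.00
```

With the dictionary returned by `compare_ocr_engines`, the two inputs would be `results['tesseract'].text` and `results['easyocr'].text`, provided both engines ran. A low ratio suggests that at least one engine misread the image and the output deserves a manual check.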

### Example 3: Multi-language OCR

Process text in different languages:

In [None]:
# Multi-language example
# For French text
image_path = "examples/french_text.png"  # Change this to your French image

if not TESSERACT_AVAILABLE:
    print("Tesseract not available. Please install it first.")
elif os.path.exists(image_path):
    # Tesseract with French (requires the 'fra' language data to be installed)
    tesseract_fr = TesseractOCR('fra')  # French
    result_fr = tesseract_fr.extract_text(image_path)
    display_image_with_text(image_path, result_fr)
else:
    print(f"French image not found: {image_path}")
    print("Listing the languages installed for Tesseract instead...")
    
    # Show available languages
    try:
        langs = pytesseract.get_languages()
        print("Available Tesseract languages:")
        print(", ".join(sorted(langs)))
    except Exception:
        print("Could not retrieve language list")
        print("Common language codes: eng, fra, deu, spa, ita, por, rus, chi_sim, jpn")

### Example 4: Batch Processing

Process multiple images at once:

In [None]:
# Batch processing example
from glob import glob

# Find all images in examples directory
image_patterns = [
    "examples/*.png",
    "examples/*.jpg", 
    "examples/*.jpeg"
]

all_images = []
for pattern in image_patterns:
    all_images.extend(glob(pattern))

if all_images:
    print(f"Found {len(all_images)} images to process")
    
    # Process with Tesseract
    if TESSERACT_AVAILABLE:
        df = batch_process_images(all_images, engine='tesseract', language='eng')
        
        if not df.empty:
            # Show results summary
            analyze_results(df)
            
            # Show detailed results
            print("\n📋 Detailed Results:")
            display(df[['image_path', 'engine', 'confidence', 'processing_time']].round(2))
        else:
            print("No images were successfully processed")
    else:
        print("Tesseract not available for batch processing")
else:
    print("No images found in examples directory")
    print("Add some .png, .jpg, or .jpeg files to the examples/ directory")
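
Glob patterns such as `examples/*.png` match case-sensitively on case-sensitive file systems, so a file named `scan.PNG` would be skipped there. The sketch below collects images by comparing lowercased suffixes instead. The helper `find_images` and the constant `IMAGE_SUFFIXES` are hypothetical names, not part of this notebook.

```python
from pathlib import Path
from typing import Iterable, List

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def find_images(directory: str, suffixes: Iterable[str] = tuple(IMAGE_SUFFIXES)) -> List[str]:
    """Return sorted image paths in `directory`, matching suffixes case-insensitively."""
    wanted = {s.lower() for s in suffixes}
    folder = Path(directory)
    if not folder.is_dir():
        return []  # a missing directory simply yields no images
    return sorted(
        str(path) for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() in wanted
    )


print(find_images("examples"))
```

The returned list can be passed straight to `batch_process_images`. Sorting keeps the processing order stable between runs.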

### Example 5: PDF Processing

Extract text from PDF files:

In [None]:
# PDF processing example
pdf_path = "examples/sample_document.pdf"  # Change this to your PDF path

if PDF_SUPPORT and os.path.exists(pdf_path):
    import fitz  # PyMuPDF
    
    print(f"Processing PDF: {Path(pdf_path).name}")
    
    # Open PDF
    doc = fitz.open(pdf_path)
    print(f"PDF has {doc.page_count} pages")
    
    # Process first page as example
    if doc.page_count > 0 and TESSERACT_AVAILABLE:
        page = doc[0]  # First page
        
        # Convert to image
        mat = fitz.Matrix(2, 2)  # Zoom factor
        pix = page.get_pixmap(matrix=mat)
        img_data = pix.tobytes("png")
        
        # Save to a temporary file (portable; "/tmp" does not exist on Windows)
        import tempfile
        with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as f:
            f.write(img_data)
            temp_path = f.name
        
        # OCR the page
        tesseract = TesseractOCR('eng')
        result = tesseract.extract_text(temp_path)
        result.image_path = f"{pdf_path} (page 1)"
        
        # Display results
        print(f"\n📄 Page 1 Results:")
        print(f"Confidence: {result.confidence:.1f}%")
        print(f"Processing Time: {result.processing_time:.2f}s")
        print(f"\nExtracted Text:")
        print("-" * 50)
        print(result.text[:500] + ("..." if len(result.text) > 500 else ""))
        print("-" * 50)
        
        # Clean up
        os.remove(temp_path)
    
    doc.close()
    
else:
    if not PDF_SUPPORT:
        print("PDF support not available. Install with: pip install PyMuPDF")
    else:
        print(f"PDF file not found: {pdf_path}")
        print("Add a PDF file to the examples/ directory")
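
The zoom factor in `fitz.Matrix(2, 2)` scales the PDF's native resolution of 72 points per inch, so the page above is rendered at 144 DPI. OCR engines generally do better on higher-resolution input, and around 300 DPI is a common recommendation. The sketch below converts between zoom and DPI; both helper names are hypothetical.

```python
PDF_POINTS_PER_INCH = 72  # PDF user space is defined at 72 points per inch


def zoom_for_dpi(target_dpi: float) -> float:
    """Return the zoom factor that renders a PDF page at `target_dpi`."""
    if target_dpi <= 0:
        raise ValueError("target_dpi must be positive")
    return target_dpi / PDF_POINTS_PER_INCH


def dpi_for_zoom(zoom: float) -> float:
    """Return the effective resolution produced by a given zoom factor."""
    return zoom * PDF_POINTS_PER_INCH


print(dpi_for_zoom(2))    # → 144
print(zoom_for_dpi(300))  # zoom factor for a 300 DPI render
```

To try a higher resolution, compute `zoom = zoom_for_dpi(300)` and build the matrix as `fitz.Matrix(zoom, zoom)`. Larger renders take more memory and OCR time, so this is a trade-off rather than a free improvement.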

## 🛠️ Custom Processing

Use this cell for your own image processing experiments:

In [None]:
# Your custom processing code here
# Example: Upload and process your own image

# Step 1: Upload your image file to this notebook's directory
# Step 2: Update the path below
my_image = "your_image.png"  # Replace with your image filename

if os.path.exists(my_image):
    # Configure settings
    engine = 'tesseract'  # or 'easyocr'
    language = 'eng'      # or 'fra', 'deu', 'spa', etc.
    
    # Process the image
    if engine == 'tesseract' and TESSERACT_AVAILABLE:
        ocr = TesseractOCR(language)
        result = ocr.extract_text(my_image)
        display_image_with_text(my_image, result)
    elif engine == 'easyocr' and EASYOCR_AVAILABLE:
        # Map Tesseract language codes to EasyOCR codes
        lang_map = {'eng': 'en', 'fra': 'fr', 'deu': 'de', 'spa': 'es'}
        easy_lang = lang_map.get(language, language)
        ocr = EasyOCREngine([easy_lang])
        result = ocr.extract_text(my_image)
        display_image_with_text(my_image, result)
    else:
        print(f"OCR engine '{engine}' not available")
else:
    print(f"Image file not found: {my_image}")
    print("\n📁 Upload an image file to this notebook's directory and update the 'my_image' variable")

## 📊 Export Results

Save your OCR results in various formats:

In [None]:
# Export results to files
# Assuming you have a DataFrame from batch processing

# If you have batch processing results, uncomment and modify:
# df.to_csv('ocr_results.csv', index=False)
# df.to_json('ocr_results.json', orient='records', indent=2)

# Example: Create sample results for demonstration
sample_results = [
    {
        'image_path': 'sample1.png',
        'text': 'This is sample extracted text from the first image.',
        'confidence': 95.2,
        'engine': 'tesseract',
        'language': 'eng',
        'processing_time': 1.23,
        'timestamp': datetime.now().isoformat()
    },
    {
        'image_path': 'sample2.png', 
        'text': 'This is sample extracted text from the second image.',
        'confidence': 87.6,
        'engine': 'tesseract',
        'language': 'eng',
        'processing_time': 0.98,
        'timestamp': datetime.now().isoformat()
    }
]

sample_df = pd.DataFrame(sample_results)

print("Sample results DataFrame:")
display(sample_df)

# Export options
print("\n💾 Export options:")
print("1. CSV: sample_df.to_csv('results.csv', index=False)")
print("2. JSON: sample_df.to_json('results.json', orient='records', indent=2)")
print("3. Excel: sample_df.to_excel('results.xlsx', index=False)  (requires openpyxl)")

# Uncomment to actually save files:
# sample_df.to_csv('ocr_results_sample.csv', index=False)
# sample_df.to_json('ocr_results_sample.json', orient='records', indent=2)
# print("✓ Sample results exported")
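
When OCR output contains accented or non-Latin text, it helps to write JSON as UTF-8 without escaping so the file stays readable. Below is a minimal sketch using the standard library's `json` module as an alternative to the pandas export. The function name `save_results_json` and the sample record are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


def save_results_json(results: List[Dict], output_path: str) -> Path:
    """Write OCR result records to a UTF-8 JSON file and return its path."""
    path = Path(output_path)
    with path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps accented and non-Latin text readable in the file
        json.dump(results, f, ensure_ascii=False, indent=2)
    return path


# Hypothetical record, shaped like the fields of OCRResult
records = [{"image_path": "french_text.png", "text": "Café à Genève", "confidence": 91.4}]
print(save_results_json(records, "ocr_results_demo.json"))
```

Records for a batch can be obtained from the DataFrame with `df.to_dict(orient="records")`.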

## 🎯 Next Steps

1. **Add your own images** to the `examples/` directory
2. **Experiment with different OCR engines** and languages
3. **Try batch processing** on multiple images
4. **Adjust preprocessing settings** for better accuracy
5. **Export results** in your preferred format

## 🔧 Troubleshooting

- **Missing libraries**: Install with `pip install pytesseract easyocr pillow opencv-python PyMuPDF`
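- **Tesseract not found**: `pytesseract` is only a wrapper; install the Tesseract binary itself (`brew install tesseract` on macOS, `sudo apt install tesseract-ocr` on Debian/Ubuntu)
- **Poor accuracy**: Try the other engine, check that the language code matches the text, or adjust the contrast and sharpness factors in `ImagePreprocessor`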
- **Memory issues**: Process images in smaller batches
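
For the memory tip above, one simple approach is to split the list of paths into fixed-size chunks and process one chunk at a time. This is a sketch assuming plain Python lists of paths; the helper name `chunked` and the file names are hypothetical.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


paths = [f"page_{i}.png" for i in range(5)]  # hypothetical file names
for batch in chunked(paths, 2):
    print(batch)
```

Each batch could be passed to `batch_process_images`, with the resulting DataFrames combined afterwards using `pd.concat`.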

## 📚 Resources

- [Tesseract Language Codes](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html)
- [EasyOCR Documentation](https://github.com/JaidedAI/EasyOCR)
- [PIL/Pillow Documentation](https://pillow.readthedocs.io/)

---

**Happy OCR processing! 📖✨**