# Batch OCR for arXiv PDFs

This notebook provides an interactive interface for batch OCR processing of arXiv PDFs using Tesseract with layout preservation.

## Features
- PDF download with rate limiting
- High-quality image conversion (300 DPI)
- Tesseract OCR with layout preservation
- Progress tracking and error handling
- Structured text output in `pdf_ocr/` folder
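
As a rough illustration of the OCR step, the sketch below builds a Tesseract command line for one 300 DPI page image. The function name `build_tesseract_command`, the file paths, and the choice of `--psm 1` with `preserve_interword_spaces=1` are assumptions made for illustration; the settings that `ArxivPDFOCR` actually uses are not shown in this notebook.

```python
def build_tesseract_command(image_path, output_base, lang="eng", dpi=300):
    """Return the argument list for one Tesseract CLI call on a page image.

    Assumed settings (not taken from pdf_ocr_processor):
      --psm 1                         automatic page segmentation with
                                      orientation and script detection
      -c preserve_interword_spaces=1  keep horizontal spacing between words
    """
    return [
        "tesseract",
        str(image_path),
        str(output_base),  # Tesseract appends ".txt" to this base name
        "-l", lang,
        "--dpi", str(dpi),
        "--psm", "1",
        "-c", "preserve_interword_spaces=1",
    ]


# Hypothetical paths, used only to show the resulting command
cmd = build_tesseract_command("pdf_ocr/images/page_001.png",
                              "pdf_ocr/texts/page_001")
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would run the OCR, provided the `tesseract` binary is installed and the image exists.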

## Setup and Imports

In [None]:
import sys
import os
import json
import time
from pathlib import Path
from datetime import datetime
import matplotlib.pyplot as plt
import pandas as pd
from IPython.display import display, HTML, clear_output
import ipywidgets as widgets
from tqdm.notebook import tqdm

# Import our OCR processor
from pdf_ocr_processor import ArxivPDFOCR

print("✅ All imports successful!")
print(f"Current working directory: {os.getcwd()}")

## Initialize OCR Processor

In [None]:
# Initialize the OCR processor
processor = ArxivPDFOCR(output_dir="pdf_ocr")

print(f"📁 Output directory: {processor.output_dir}")
print(f"📄 PDFs will be saved to: {processor.pdfs_dir}")
print(f"🖼️  Images will be saved to: {processor.images_dir}")
print(f"📝 Text files will be saved to: {processor.texts_dir}")

# Load paper data
papers = processor.load_paper_data("arxiv_clean.json")
print(f"\n📚 Loaded {len(papers)} papers from arxiv_clean.json")

if papers:
    print(f"\nFirst paper: {papers[0]['title'][:60]}...")
    print(f"Authors: {', '.join(papers[0]['authors'][:3])}{'...' if len(papers[0]['authors']) > 3 else ''}")

## Processing Controls

In [None]:
# Interactive controls for processing
max_papers_widget = widgets.IntSlider(
    value=5,
    min=1,
    max=len(papers) if papers else 10,
    step=1,
    description='Max Papers:',
    style={'description_width': 'initial'}
)

start_button = widgets.Button(
    description='Start OCR Processing',
    button_style='success',
    icon='play'
)

stop_button = widgets.Button(
    description='Stop Processing',
    button_style='danger',
    icon='stop',
    disabled=True
)

progress_bar = widgets.IntProgress(
    value=0,
    min=0,
    max=100,
    description='Progress:',
    bar_style='info',
    orientation='horizontal'
)

status_text = widgets.HTML(value="<b>Ready to start processing</b>")

display(widgets.VBox([
    max_papers_widget,
    widgets.HBox([start_button, stop_button]),
    progress_bar,
    status_text
]))

## Processing Function with Progress Tracking

In [None]:
import html

# Global flag checked between papers; set by the stop button
should_stop = False

def update_progress(current, total, paper_title=""):
    """Update the progress bar and status line for paper `current` of `total`."""
    # Guard against an empty batch to avoid dividing by zero
    percentage = int((current / total) * 100) if total else 0
    progress_bar.value = percentage
    
    # Escape the title: paper titles can contain characters such as '<' or '&'
    title = html.escape(paper_title[:80]) + ('...' if len(paper_title) > 80 else '')
    failed = processor.stats['failed_downloads'] + processor.stats['failed_ocr']
    status_html = f"""
    <b>Processing {current}/{total}</b><br>
    <i>{title}</i><br>
    <small>Downloaded: {processor.stats['downloaded']} | 
    Processed: {processor.stats['processed']} | 
    Failed: {failed}</small>
    """
    status_text.value = status_html

def process_papers_interactive(max_papers):
    """Process papers with interactive progress updates."""
    global should_stop
    should_stop = False
    
    # Reset statistics
    processor.stats = {
        'total_papers': min(max_papers, len(papers)),
        'downloaded': 0,
        'processed': 0,
        'failed_downloads': 0,
        'failed_ocr': 0,
        'start_time': time.time(),
        'processing_log': []
    }
    
    start_button.disabled = True
    stop_button.disabled = False
    
    try:
        papers_to_process = papers[:max_papers]
        
        for i, paper in enumerate(papers_to_process, 1):
            if should_stop:
                status_text.value = "<b style='color: orange;'>Processing stopped by user</b>"
                break
                
            paper_title = paper.get('title', 'Unknown Title')
            update_progress(i, len(papers_to_process), paper_title)
            
            # Process the paper
            success = processor.process_paper(paper)
            
            # Save progress periodically
            if i % 5 == 0:
                processor.save_processing_log()
        
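        if not should_stop:
            # Report failures instead of always claiming success
            failed = processor.stats['failed_downloads'] + processor.stats['failed_ocr']
            if failed:
                status_text.value = f"<b style='color: orange;'>Processing finished with {failed} failure(s)</b>"
            else:
                status_text.value = "<b style='color: green;'>✅ Processing completed successfully!</b>"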
            
    except Exception as e:
        status_text.value = f"<b style='color: red;'>❌ Error: {str(e)}</b>"
    
    finally:
        start_button.disabled = False
        stop_button.disabled = True
        processor.save_processing_log()

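def on_start_click(b):
    """Handle start button click."""
    import threading  # local import keeps this cell self-contained

    max_papers = max_papers_widget.value
    # Run in a background thread: a blocking loop inside this callback would
    # keep the kernel busy, so the stop button click could not be handled
    # until the whole batch had finished.
    threading.Thread(
        target=process_papers_interactive,
        args=(max_papers,),
        daemon=True,
    ).start()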

def on_stop_click(b):
    """Handle stop button click."""
    global should_stop
    should_stop = True
    status_text.value = "<b style='color: orange;'>Stopping after current paper...</b>"

# Connect button handlers
start_button.on_click(on_start_click)
stop_button.on_click(on_stop_click)

print("🎮 Interactive controls ready! Use the buttons above to start processing.")

## Alternative: Run Processing Directly

In [None]:
# Alternative: Run processing directly without interactive controls
# Uncomment and run this cell if you prefer direct processing

# MAX_PAPERS = 3  # Set number of papers to process
# processor.run_batch_processing(max_papers=MAX_PAPERS)

## Results and Statistics
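
The statistics in this section are derived from the counters stored in `processing_log.json`. As a minimal sketch of that arithmetic, the helper below computes the success rate and the average time per paper from a dictionary with the same keys, returning `None` where a value is undefined. The name `summarize` and the example numbers are made up for illustration.

```python
def summarize(stats):
    """Derive summary figures from the counters in a processing log dict."""
    total = stats.get("total_papers", 0)
    processed = stats.get("processed", 0)
    failed = stats.get("failed_downloads", 0) + stats.get("failed_ocr", 0)
    elapsed = stats.get("total_time_seconds")

    return {
        # Undefined when no papers were scheduled
        "success_rate_pct": 100.0 * processed / total if total else None,
        "failed": failed,
        # Undefined when no timing was recorded or nothing was processed
        "avg_seconds_per_paper": (
            elapsed / processed if elapsed is not None and processed else None
        ),
    }


# Invented example values, not real results
example = {
    "total_papers": 4,
    "processed": 3,
    "failed_downloads": 1,
    "failed_ocr": 0,
    "total_time_seconds": 90.0,
}
summary = summarize(example)
print(summary)
```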

In [None]:
# Display processing results and statistics
def show_processing_stats():
    """Display processing statistics and results."""
    
    # Load processing log
    log_file = processor.output_dir / "processing_log.json"
    
    if not log_file.exists():
        print("❌ No processing log found. Run processing first.")
        return
    
    with open(log_file, 'r', encoding='utf-8') as f:
        stats = json.load(f)
    
    # Display summary statistics
    print("📊 PROCESSING STATISTICS")
    print("=" * 50)
    print(f"Total papers: {stats.get('total_papers', 0)}")
    print(f"Successfully downloaded: {stats.get('downloaded', 0)}")
    print(f"Successfully processed: {stats.get('processed', 0)}")
    print(f"Failed downloads: {stats.get('failed_downloads', 0)}")
    print(f"Failed OCR: {stats.get('failed_ocr', 0)}")
    
    if 'total_time_seconds' in stats:
        total_time = stats['total_time_seconds']
        print(f"Total processing time: {total_time/60:.1f} minutes")
        if stats.get('processed', 0) > 0:
            avg_time = total_time / stats['processed']
            print(f"Average time per paper: {avg_time:.1f} seconds")
    
    # Show file counts
    txt_files = list(processor.texts_dir.glob("*.txt"))
    pdf_files = list(processor.pdfs_dir.glob("*.pdf"))
    
    print(f"\n📁 OUTPUT FILES")
    print(f"Text files created: {len(txt_files)}")
    print(f"PDFs downloaded: {len(pdf_files)}")
    
    # Show success rate visualization
    if stats.get('total_papers', 0) > 0:
        success_rate = (stats.get('processed', 0) / stats['total_papers']) * 100
        
        plt.figure(figsize=(10, 4))
        
        # Pie chart of results
        plt.subplot(1, 2, 1)
        labels = ['Successful', 'Failed Downloads', 'Failed OCR']
        sizes = [stats.get('processed', 0), 
                stats.get('failed_downloads', 0), 
                stats.get('failed_ocr', 0)]
        colors = ['#2ecc71', '#e74c3c', '#f39c12']
        
        # A pie chart cannot be drawn when every count is zero
        if sum(sizes) > 0:
            plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
        plt.title('Processing Results')
        
        # Status breakdown from the per-paper log (if available)
        plt.subplot(1, 2, 2)
        if 'processing_log' in stats and stats['processing_log']:
            log_df = pd.DataFrame(stats['processing_log'])
            status_counts = log_df['status'].value_counts()
            
            plt.bar(status_counts.index, status_counts.values, 
                   color=['#2ecc71' if x == 'success' else '#e74c3c' for x in status_counts.index])
            plt.title('Processing Status Breakdown')
            plt.xticks(rotation=45)
        
        plt.tight_layout()
        plt.show()
        
        print(f"\n✅ Overall success rate: {success_rate:.1f}%")
    
    return stats

# Button to show stats
stats_button = widgets.Button(
    description='Show Statistics',
    button_style='info',
    icon='chart-bar'
)

def on_stats_click(b):
    with output_widget:
        clear_output()
        show_processing_stats()

stats_button.on_click(on_stats_click)

output_widget = widgets.Output()

display(widgets.VBox([stats_button, output_widget]))

## Sample Output Viewer

In [None]:
# View sample OCR results
def show_sample_outputs(num_samples=3):
    """Display sample OCR output files."""
    
    # Sort so that the same files are previewed on every run
    txt_files = sorted(processor.texts_dir.glob("*.txt"))
    
    if not txt_files:
        print("❌ No text files found. Run processing first.")
        return
    
    print(f"📝 SAMPLE OCR OUTPUTS ({min(num_samples, len(txt_files))} of {len(txt_files)} files)")
    print("=" * 80)
    
    for i, txt_file in enumerate(txt_files[:num_samples]):
        print(f"\n🔍 FILE: {txt_file.name}")
        print("-" * 40)
        
        try:
            with open(txt_file, 'r', encoding='utf-8') as f:
                content = f.read()
            
            # Show first 1000 characters
            preview = content[:1000]
            if len(content) > 1000:
                preview += "\n\n... (truncated) ..."
            
            print(preview)
            print(f"\n📊 File size: {len(content):,} characters")
            
        except Exception as e:
            print(f"❌ Error reading file: {e}")
        
        if i < min(num_samples, len(txt_files)) - 1:
            print("\n" + "=" * 80)

# Button to show samples
samples_button = widgets.Button(
    description='Show Sample Outputs',
    button_style='primary',
    icon='file-text'
)

def on_samples_click(b):
    with samples_output:
        clear_output()
        show_sample_outputs()

samples_button.on_click(on_samples_click)

samples_output = widgets.Output()

display(widgets.VBox([samples_button, samples_output]))

## File Management

In [None]:
# File management utilities
def show_file_structure():
    """Display the output directory structure."""
    
    print("📁 OUTPUT DIRECTORY STRUCTURE")
    print("=" * 40)
    
    def print_tree(path, prefix="", is_last=True):
        if path.is_dir():
            files = sorted(path.iterdir(), key=lambda x: (not x.is_dir(), x.name))
            print(f"{prefix}{'└── ' if is_last else '├── '}{path.name}/")
            
            shown = files[:10]  # Limit to first 10 items
            truncated = len(files) > len(shown)
            new_prefix = prefix + ("    " if is_last else "│   ")
            
            for i, file in enumerate(shown):
                # The last shown entry closes the branch only if no "... more" line follows
                is_last_item = i == len(shown) - 1 and not truncated
                
                if file.is_dir():
                    print_tree(file, new_prefix, is_last_item)
                else:
                    size = file.stat().st_size
                    size_str = f"{size:,} bytes" if size < 1024 else f"{size/1024:.1f} KB"
                    print(f"{new_prefix}{'└── ' if is_last_item else '├── '}{file.name} ({size_str})")
            
            if truncated:
                print(f"{new_prefix}└── ... and {len(files) - len(shown)} more items")
    
    print_tree(processor.output_dir)
    
    # Summary
    txt_count = len(list(processor.texts_dir.glob("*.txt")))
    pdf_count = len(list(processor.pdfs_dir.glob("*.pdf")))
    img_count = len(list(processor.images_dir.glob("*.png")))
    
    print(f"\n📊 SUMMARY:")
    print(f"Text files: {txt_count}")
    print(f"PDF files: {pdf_count}")
    print(f"Image files: {img_count}")

# Clean up temporary files
def cleanup_temp_files():
    """Remove temporary image files to save space."""
    img_files = list(processor.images_dir.glob("*.png"))
    
    if not img_files:
        print("✅ No temporary image files to clean up.")
        return
    
    print(f"🧹 Cleaning up {len(img_files)} temporary image files...")
    
    removed_count = 0
    for img_file in img_files:
        try:
            img_file.unlink()
            removed_count += 1
        except Exception as e:
            print(f"❌ Error removing {img_file}: {e}")
    
    print(f"✅ Removed {removed_count} temporary files.")

# File management buttons
structure_button = widgets.Button(
    description='Show File Structure',
    button_style='info',
    icon='folder'
)

cleanup_button = widgets.Button(
    description='Cleanup Temp Files',
    button_style='warning',
    icon='trash'
)

def on_structure_click(b):
    with files_output:
        clear_output()
        show_file_structure()

def on_cleanup_click(b):
    with files_output:
        clear_output()
        cleanup_temp_files()

structure_button.on_click(on_structure_click)
cleanup_button.on_click(on_cleanup_click)

files_output = widgets.Output()

display(widgets.VBox([
    widgets.HBox([structure_button, cleanup_button]),
    files_output
]))

## Summary

This notebook provides a complete interactive interface for batch OCR processing of arXiv PDFs with the following features:

### 🔧 **Core Functionality**
- **PDF Download**: Automatic download with arXiv rate limiting (16 seconds between requests)
- **Image Conversion**: High-quality PDF to image conversion (300 DPI)
- **OCR Processing**: Tesseract with layout preservation settings
- **Text Enhancement**: Post-processing to improve structure and readability
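
The download delay can be pictured as a small limiter that sleeps until a minimum interval has passed since the previous request. The sketch below is one possible shape for it, not the code inside `ArxivPDFOCR`; the class name `RateLimiter` and its injectable `clock` and `sleep` parameters are assumptions, and the 16-second default simply mirrors the interval quoted above.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval=16.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous request; None before the first

    def wait(self):
        """Block until min_interval has elapsed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `limiter.wait()` immediately before each download keeps requests at least `min_interval` seconds apart, while the first request goes out without delay.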

### 📊 **Interactive Features**
- **Progress Tracking**: Real-time progress bars and status updates
- **Control Panel**: Start/stop processing with customizable batch sizes
- **Statistics Dashboard**: Processing success rates and timing analysis
- **Sample Viewer**: Preview OCR results directly in the notebook
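
The start/stop control relies on a flag that is checked between papers. One way to keep such a flag responsive is to run the loop in a worker thread and signal it with a `threading.Event`; the sketch below shows that pattern in isolation. `run_batch` and the toy `process` function are hypothetical stand-ins for the notebook's processing loop.

```python
import threading


def run_batch(items, process, stop_event, on_progress=None):
    """Process items in order until finished or until stop_event is set.

    Returns the number of items that were processed.
    """
    done = 0
    for item in items:
        if stop_event.is_set():  # checked between items, never mid-item
            break
        process(item)
        done += 1
        if on_progress is not None:
            on_progress(done, len(items))
    return done


stop_event = threading.Event()
results = []


def process(item):
    """Toy work function; setting the event stands in for a stop click."""
    results.append(item)
    if item == 2:
        stop_event.set()


worker = threading.Thread(target=run_batch,
                          args=([1, 2, 3, 4], process, stop_event))
worker.start()
worker.join()
print(results)  # [1, 2]
```

The item that is in progress always finishes before the loop exits, which matches the "Stopping after current paper..." message shown by the stop button.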

### 📁 **Output Structure**
```
pdf_ocr/
├── texts/          # Final OCR text files (.txt)
├── pdfs/           # Downloaded PDF files
├── images/         # Temporary image files (can be cleaned up)
└── processing_log.json  # Detailed processing statistics
```
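
For reference, a layout like the one above can be created with `pathlib`. This is a minimal sketch under the assumption that the folder names match the tree shown here; `create_output_layout` is a hypothetical helper, not a method of `ArxivPDFOCR`.

```python
from pathlib import Path


def create_output_layout(output_dir="pdf_ocr"):
    """Create the texts/, pdfs/ and images/ folders and return their paths."""
    root = Path(output_dir)
    layout = {name: root / name for name in ("texts", "pdfs", "images")}
    for path in layout.values():
        # parents=True also creates the root; exist_ok=True makes reruns safe
        path.mkdir(parents=True, exist_ok=True)
    # The log file is written later; only its location is recorded here
    layout["log"] = root / "processing_log.json"
    return layout
```

Because `exist_ok=True` is set, calling the helper again on an existing output folder leaves its contents untouched.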

### 🎯 **Quality Features**
- **Error Handling**: Graceful failure recovery with detailed logging
- **Memory Management**: Automatic cleanup of temporary files
- **Resume Capability**: Skip already processed papers
- **Layout Preservation**: OCR configured to maintain document structure
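
Resuming can be as simple as skipping every paper whose text file already exists. The sketch below illustrates that idea; the helper `pending_papers`, the `id` key, and the `<id>.txt` naming scheme are assumptions, since the notebook does not show how `ArxivPDFOCR` names its output files.

```python
from pathlib import Path


def pending_papers(papers, texts_dir, id_key="id"):
    """Return the papers that have no non-empty .txt file in texts_dir yet."""
    texts_dir = Path(texts_dir)
    pending = []
    for paper in papers:
        # Assumed naming scheme: one text file per paper, named after its id
        txt_path = texts_dir / f"{paper[id_key]}.txt"
        if txt_path.exists() and txt_path.stat().st_size > 0:
            continue  # already processed in an earlier run
        pending.append(paper)
    return pending
```

Treating empty files as unprocessed means that a run interrupted before any text was written is retried rather than silently skipped.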

Use the controls above to process your arXiv papers and monitor the results!