# Task 2: Batch OCR for arXiv PDFs

**Objective**: Convert PDFs from arXiv (same paper set as Task 1) to text using Tesseract OCR.

**Core Tools**: `pytesseract`, `pdf2image`

**Deliverables**: 
- `pdf_ocr/` folder with TXT files (one per paper)
- This notebook

**Features**:
- Downloads PDFs from arXiv based on `arxiv_clean.json`
- Converts each PDF page to images using `pdf2image` (requires Poppler)
- Runs Tesseract OCR on each page
- Separates pages with page-break markers and keeps inter-word spacing (`preserve_interword_spaces=1`)
- Caches downloaded PDFs to avoid re-downloading

## 1. Install Dependencies

Before the first run, install the required Python packages:
```bash
pip install pytesseract pdf2image pillow requests
```

**System Requirements**:
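- **Tesseract OCR**: On Windows, download the installer from [GitHub](https://github.com/UB-Mannheim/tesseract/wiki). On Linux/macOS, install it with your package manager (e.g. `apt install tesseract-ocr` or `brew install tesseract`).
- **Poppler**: On Windows, this notebook downloads Poppler automatically if it is not found. On Linux/macOS, install it with your package manager (e.g. `apt install poppler-utils` or `brew install poppler`).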
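
A quick way to confirm that both system dependencies are reachable is to look for their executables on `PATH`. The sketch below is an illustration rather than part of the original notebook: `check_system_deps` is a hypothetical helper, and it assumes the binaries are named `tesseract` and `pdftoppm`. On Windows they may be installed outside `PATH`, which the configuration cells in section 2 handle separately.

```python
import shutil
from typing import Dict, Optional


def check_system_deps(names=("tesseract", "pdftoppm")) -> Dict[str, Optional[str]]:
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in check_system_deps().items():
    status = f"found at {path}" if path else "NOT found on PATH"
    print(f"{name}: {status}")
```

A `None` entry only means the executable is not on `PATH`; it can still be usable if its location is configured explicitly.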

In [None]:
# Import required libraries
import os
import sys
import json
import re
import time
from pathlib import Path
from typing import List, Optional, Tuple

import requests
import pytesseract
from pdf2image import convert_from_path
from PIL import Image

print('✓ Imports successful')

## 2. Configure Paths and Environment

In [None]:
# Base directories
BASE_DIR = Path.cwd()
DATA_FILE = BASE_DIR / 'arxiv_clean.json'
OUT_DIR = BASE_DIR / 'pdf_ocr'
PDF_DIR = OUT_DIR / 'pdfs'

# Create output directories
OUT_DIR.mkdir(exist_ok=True)
PDF_DIR.mkdir(parents=True, exist_ok=True)

print(f'Output directory: {OUT_DIR}')
print(f'PDF cache: {PDF_DIR}')

In [None]:
# Auto-configure Tesseract (Windows default install location)
if os.name == 'nt':
    default_tesseract = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
    if os.path.exists(default_tesseract):
        pytesseract.pytesseract.tesseract_cmd = default_tesseract

# Verify that the Tesseract binary can actually be invoked on this system
try:
    print(f'✓ Tesseract {pytesseract.get_tesseract_version()} is available')
except pytesseract.TesseractNotFoundError:
    print('⚠ Tesseract not found. Please install it or set pytesseract.pytesseract.tesseract_cmd')

In [None]:
# Auto-configure Poppler (Windows)
poppler_path = None

if os.name == 'nt':
    # Check common install locations
    candidates = [
        BASE_DIR / 'vendor_poppler' / 'poppler-23.08.0' / 'Library' / 'bin',
        Path(r"C:\ProgramData\chocolatey\lib\poppler\tools"),
        Path(r"C:\Program Files\poppler\Library\bin"),
        Path(r"C:\poppler\Library\bin"),
    ]
    
    for candidate in candidates:
        if candidate.exists() and (candidate / 'pdftoppm.exe').exists():
            poppler_path = str(candidate)
            print(f'✓ Poppler found: {poppler_path}')
            break
    
    # If not found, download it
    if not poppler_path:
        print('⚠ Poppler not found. Downloading...')
        import zipfile
        
        POPPLER_ZIP_URL = 'https://github.com/oschwartz10612/poppler-windows/releases/download/v23.08.0-0/Release-23.08.0-0.zip'
        VENDOR_DIR = BASE_DIR / 'vendor_poppler'
        VENDOR_DIR.mkdir(exist_ok=True)
        
        try:
            tmp_zip = VENDOR_DIR / 'poppler.zip'
            print(f'  Downloading to {tmp_zip}...')
            
            with requests.get(POPPLER_ZIP_URL, stream=True, timeout=120) as r:
                r.raise_for_status()
                with open(tmp_zip, 'wb') as f:
                    for chunk in r.iter_content(8192):
                        f.write(chunk)
            
            print('  Extracting...')
            with zipfile.ZipFile(tmp_zip, 'r') as zf:
                zf.extractall(VENDOR_DIR)
            
            # Find pdftoppm.exe
            for root, dirs, files in os.walk(VENDOR_DIR):
                if 'pdftoppm.exe' in files:
                    poppler_path = root
                    print(f'✓ Poppler installed: {poppler_path}')
                    break
            
            if not poppler_path:
                print('⚠ Could not locate pdftoppm.exe after extraction')
        
        except Exception as e:
            print(f'✗ Poppler download failed: {e}')
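else:
    import shutil
    # pdf2image calls pdftoppm from PATH when poppler_path is None
    if shutil.which('pdftoppm'):
        print('✓ Using system Poppler (Linux/Mac)')
    else:
        print('⚠ pdftoppm not found on PATH. Please install Poppler (e.g. poppler-utils)')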

## 3. Define Helper Functions

In [None]:
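# OCR configuration for better layout preservation
#   --oem 1: LSTM neural-network engine
#   --psm 1: automatic page segmentation with orientation and script detection
#   preserve_interword_spaces=1: keep runs of spaces between words
TESS_LANG = 'eng'
TESS_CONFIG = '--oem 1 --psm 1 -c preserve_interword_spaces=1'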

# Limit pages per document (None for all pages)
MAX_PAGES_PER_DOC: Optional[int] = None  # Set to e.g., 5 for quick testing

# arXiv URL pattern: new-style IDs (2301.00001v2) and old-style IDs (hep-th/9901001)
ARXIV_ABS_RE = re.compile(r"https?://arxiv\.org/abs/((?:[\w.-]+/)?[\w.-]+)")


def derive_id_and_pdf(url: str) -> Tuple[str, str]:
    """Extract arXiv ID from URL and construct PDF URL."""
    m = ARXIV_ABS_RE.match(url)
    if m:
        arxiv_id = m.group(1)
    else:
        # Assume URL ends with ID; ignore a trailing slash
        arxiv_id = url.rstrip('/').split('/')[-1]
    
    pdf_url = f"https://arxiv.org/pdf/{arxiv_id}.pdf"
    return arxiv_id, pdf_url


def safe_filename(name: str) -> str:
    """Sanitize filename while preserving arXiv ID format."""
    return re.sub(r"[^\w\.-]+", "_", name).strip('_')


def download_pdf(pdf_url: str, dest: Path, retries: int = 3) -> bool:
    """Download PDF with retry logic, leaving no partial file in the cache on failure."""
    tmp = dest.with_name(dest.name + '.part')
    for i in range(retries):
        try:
            with requests.get(pdf_url, stream=True, timeout=30) as r:
                r.raise_for_status()
                with open(tmp, 'wb') as f:
                    for chunk in r.iter_content(chunk_size=8192):
                        if chunk:
                            f.write(chunk)
            tmp.replace(dest)  # Only a complete download reaches the cache path
            return True
        except Exception as e:
            tmp.unlink(missing_ok=True)  # Requires Python 3.8+
            if i < retries - 1:
                time.sleep(2 ** i)  # Exponential backoff
            else:
                print(f"  ✗ Download failed: {e}")
    return False


def pdf_to_images(pdf_path: Path, dpi: int = 300) -> List[Image.Image]:
    """Convert PDF pages to images using pdf2image."""
    kwargs = {'dpi': dpi}
    if poppler_path:
        kwargs['poppler_path'] = poppler_path
    if MAX_PAGES_PER_DOC is not None:
        # Render only the pages that will be OCR'd instead of slicing afterwards
        kwargs['last_page'] = MAX_PAGES_PER_DOC
    
    return convert_from_path(str(pdf_path), **kwargs)


def ocr_image(img: Image.Image) -> str:
    """Run Tesseract OCR on a single image."""
    return pytesseract.image_to_string(img, lang=TESS_LANG, config=TESS_CONFIG)


def ocr_images_to_text(images: List[Image.Image]) -> str:
    """OCR all images and combine with page break markers."""
    texts = []
    for idx, img in enumerate(images, 1):
        txt = ocr_image(img)
        texts.append(f"\n{'='*80}\n PAGE {idx}\n{'='*80}\n\n{txt}")
    return '\n'.join(texts)


print('✓ Helper functions defined')
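
Rendering every page of a long paper at 300 DPI keeps all page images in memory at once. One possible alternative, sketched below under stated assumptions, is to render and OCR a few pages at a time using the `first_page` and `last_page` arguments of `convert_from_path`. The names `page_ranges` and `ocr_pdf_in_batches`, and the default `batch_size` of 4, are hypothetical and not part of the notebook.

```python
from typing import Iterator, List, Optional, Tuple


def page_ranges(n_pages: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield 1-based inclusive (first_page, last_page) ranges covering n_pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for first in range(1, n_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, n_pages)


def ocr_pdf_in_batches(pdf_path: str, dpi: int = 300, batch_size: int = 4,
                       poppler_path: Optional[str] = None) -> List[str]:
    """OCR a PDF a few pages at a time; returns one text string per page."""
    # Imported lazily so page_ranges can be used without the OCR stack installed
    import pytesseract
    from pdf2image import convert_from_path, pdfinfo_from_path

    extra = {"poppler_path": poppler_path} if poppler_path else {}
    n_pages = int(pdfinfo_from_path(pdf_path, **extra)["Pages"])
    texts: List[str] = []
    for first, last in page_ranges(n_pages, batch_size):
        # Only the pages in this batch are rendered and held in memory
        images = convert_from_path(pdf_path, dpi=dpi, first_page=first,
                                   last_page=last, **extra)
        texts.extend(pytesseract.image_to_string(img, lang="eng") for img in images)
    return texts


print(list(page_ranges(10, 4)))  # [(1, 4), (5, 8), (9, 10)]
```

Smaller batches lower peak memory at the cost of more `pdftoppm` invocations per document.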

## 4. Batch Processing Function

In [None]:
def load_paper_list(data_file: Path) -> List[dict]:
    """Load paper metadata from arxiv_clean.json."""
    with open(data_file, 'r', encoding='utf-8') as f:
        return json.load(f)


def process_one_paper(item: dict) -> Optional[Path]:
    """Download, convert, OCR, and save text for one paper."""
    url = item.get('url') or ''
    if not url:
        return None
    
    # Get arXiv ID and PDF URL
    arxiv_id, pdf_url = derive_id_and_pdf(url)
    base = safe_filename(arxiv_id)
    
    # Output paths
    out_txt = OUT_DIR / f"{base}.txt"
    
    # Skip if already processed
    if out_txt.exists():
        return out_txt
    
    # Download PDF if needed
    pdf_path = PDF_DIR / f"{base}.pdf"
    if not pdf_path.exists():
        if not download_pdf(pdf_url, pdf_path):
            return None
    
    # Convert PDF to images
    try:
        images = pdf_to_images(pdf_path)
    except Exception as e:
        print(f"  ✗ PDF conversion failed: {e}")
        return None
    
    # Run OCR
    text = ocr_images_to_text(images)
    
    # Save text file
    out_txt.write_text(text, encoding='utf-8')
    
    return out_txt


def run_batch_ocr(limit: Optional[int] = None):
    """Process all papers from arxiv_clean.json."""
    if not DATA_FILE.exists():
        raise FileNotFoundError(f"Missing {DATA_FILE}")
    
    items = load_paper_list(DATA_FILE)
    if limit is not None:
        items = items[:limit]
    
    print(f"\n{'='*80}")
    print(f"Starting batch OCR: {len(items)} papers")
    print(f"{'='*80}\n")
    
    success, failed = 0, 0
    
    for i, item in enumerate(items, 1):
        title = (item.get('title') or 'Unknown')[:60]  # Also covers a null title
        print(f"[{i}/{len(items)}] {title}...")
        
        try:
            result = process_one_paper(item)
            if result:
                success += 1
                print(f"  ✓ Saved: {result.name}")
            else:
                failed += 1
                print("  ✗ Failed")
        except KeyboardInterrupt:
            print("\n⚠ Interrupted by user")
            break
        except Exception as e:
            failed += 1
            print(f"  ✗ Error: {e}")
    
    print(f"\n{'='*80}")
    print(f"Batch complete: {success} successful, {failed} failed")
    print(f"Output directory: {OUT_DIR}")
    print(f"{'='*80}")


print('✓ Batch processing functions defined')
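
Because `process_one_paper` skips papers whose TXT file already exists and reuses cached PDFs, a rerun only does the remaining work. The sketch below illustrates those three states on a throwaway directory. `plan_batch` is a hypothetical helper, the IDs are made up, and it assumes file names match the sanitized arXiv ID as in the functions above.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def plan_batch(ids: List[str], out_dir: Path, pdf_dir: Path) -> Dict[str, List[str]]:
    """Group paper IDs by the work still needed for each one."""
    plan: Dict[str, List[str]] = {"done": [], "ocr_only": [], "download_and_ocr": []}
    for paper_id in ids:
        if (out_dir / f"{paper_id}.txt").exists():
            plan["done"].append(paper_id)        # text already written, skipped
        elif (pdf_dir / f"{paper_id}.pdf").exists():
            plan["ocr_only"].append(paper_id)    # PDF cached, only OCR remains
        else:
            plan["download_and_ocr"].append(paper_id)
    return plan


# Demonstration on a temporary directory with made-up IDs
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    pdf_dir = out_dir / "pdfs"
    pdf_dir.mkdir()
    (out_dir / "0000.00001.txt").write_text("ocr text", encoding="utf-8")
    (pdf_dir / "0000.00002.pdf").write_bytes(b"%PDF-")
    demo_plan = plan_batch(["0000.00001", "0000.00002", "0000.00003"], out_dir, pdf_dir)
    print(demo_plan)
```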

## 5. Run Batch OCR

Execute the cell below to process all papers from `arxiv_clean.json`.

**Options**:
- Set `LIMIT = None` to process all papers
- Set `LIMIT = 5` to test with just 5 papers
- Adjust `MAX_PAGES_PER_DOC` in section 3 to limit pages per document

In [None]:
# Run batch OCR
LIMIT = None  # Set to a number for testing, or None for all papers

run_batch_ocr(limit=LIMIT)

## 6. Verify Results

Check the output directory and preview a sample file.

In [None]:
# List all TXT files in output directory
txt_files = sorted(OUT_DIR.glob('*.txt'))
print(f"Total TXT files generated: {len(txt_files)}\n")

for txt_file in txt_files:
    size_kb = txt_file.stat().st_size / 1024
    print(f"  {txt_file.name:40s} ({size_kb:>8.1f} KB)")

In [None]:
# Preview first TXT file (first 2000 characters)
if txt_files:
    sample_file = txt_files[0]
    print(f"Preview of {sample_file.name}:")
    print("=" * 80)
    content = sample_file.read_text(encoding='utf-8')
    print(content[:2000])
    if len(content) > 2000:
        print(f"\n... (truncated, total {len(content)} characters)")
else:
    print("No TXT files found.")
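
The page-break markers also make it possible to recover individual pages from a saved TXT file. The sketch below assumes the marker format written by `ocr_images_to_text` (a line of 80 `=` characters, ` PAGE n`, and another line of 80 `=` characters). `split_pages` and `PAGE_MARKER_RE` are hypothetical names, and the sample text is made up for the demonstration.

```python
import re
from typing import Dict

PAGE_MARKER_RE = re.compile(r"^={80}\n PAGE (\d+)\n={80}\n", re.MULTILINE)


def split_pages(text: str) -> Dict[int, str]:
    """Split combined OCR output into {page_number: page_text}."""
    parts = PAGE_MARKER_RE.split(text)
    # parts = [preamble, '1', body1, '2', body2, ...]; skip the preamble
    return {int(num): body.strip() for num, body in zip(parts[1::2], parts[2::2])}


# Build a small sample in the same format the notebook writes
bar = "=" * 80
sample = "\n".join(
    f"\n{bar}\n PAGE {i}\n{bar}\n\n{body}"
    for i, body in enumerate(["Title and abstract", "1 Introduction"], 1)
)
pages = split_pages(sample)
print(len(pages), "pages;", "page 2 starts with:", pages[2][:14])
```

For a real file, `split_pages(sample_file.read_text(encoding='utf-8'))` would be the equivalent call.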

---

## Summary

**Task 2 Complete!**

✓ Downloaded arXiv PDFs based on `arxiv_clean.json`  
✓ Converted each PDF to images using `pdf2image` (Poppler)  
✓ Ran Tesseract OCR on all pages  
✓ Saved TXT files to `pdf_ocr/` folder with page-break markers between pages  

**Output**: `pdf_ocr/` folder containing one `.txt` file per paper

**Next Steps**:
- Adjust OCR parameters in section 3 if needed (DPI, page limits, Tesseract config)
- For better layout preservation, consider using `pytesseract.image_to_pdf_or_hocr()` to generate hOCR files
- Process additional papers by updating `arxiv_clean.json`
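
As a rough sketch of the hOCR suggestion above: `pytesseract.image_to_pdf_or_hocr(img, extension='hocr')` returns hOCR markup as bytes, in which each recognized word carries a bounding box in its `title` attribute. The helper names below (`image_to_hocr`, `parse_hocr_words`, `HocrWordParser`) are hypothetical, and the sample markup is a hand-written fragment in the style of Tesseract's output, not real OCR output.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]


class HocrWordParser(HTMLParser):
    """Collect (word, bbox) pairs from elements with class 'ocrx_word'."""

    def __init__(self) -> None:
        super().__init__()
        self.words: List[Tuple[str, Box]] = []
        self._bbox: Optional[Box] = None
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if "ocrx_word" in (attr.get("class") or "").split():
            # title looks like "bbox x0 y0 x1 y1; x_wconf 96"
            for field in (attr.get("title") or "").split(";"):
                tokens = field.split()
                if len(tokens) == 5 and tokens[0] == "bbox":
                    self._bbox = tuple(int(t) for t in tokens[1:])
                    self._chunks = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        # Nested tags such as <em> are ignored; the word ends at its </span>
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._chunks).strip(), self._bbox))
            self._bbox = None


def image_to_hocr(img) -> str:
    """Run Tesseract on a PIL image and return hOCR markup as text."""
    import pytesseract  # lazy import: the parser above works without it
    return pytesseract.image_to_pdf_or_hocr(img, extension="hocr").decode("utf-8")


def parse_hocr_words(hocr: str) -> List[Tuple[str, Box]]:
    """Extract (word, (x0, y0, x1, y1)) pairs from hOCR markup."""
    parser = HocrWordParser()
    parser.feed(hocr)
    return parser.words


# Hand-written fragment used only to exercise the parser
sample_hocr = (
    "<span class='ocr_line' title='bbox 36 92 580 122'>"
    "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 96'>Batch</span> "
    "<span class='ocrx_word' title='bbox 109 92 171 116; x_wconf 95'>OCR</span>"
    "</span>"
)
hocr_words = parse_hocr_words(sample_hocr)
print(hocr_words)
```

With word positions available, columns and tables could in principle be reconstructed from coordinates instead of relying on plain-text spacing.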