# Find Non-OCR'd PDFs in Zotero Library

This notebook identifies PDFs in your Zotero library that lack an embedded text layer and need OCR preprocessing.

**Strategy:**
1. Scan a local copy of the Zotero storage directory for PDFs
2. Analyze each PDF for extractable text (fast PyMuPDF scan, PDFMiner verification for borderline cases)
3. Flag PDFs that are image-only or contain minimal text (need OCR)
4. Optionally look up metadata via the pyzotero API (read-only) and generate a batch OCR script

In [3]:
# Install required packages if needed
# !pip install pyzotero langchain-community pymupdf pdfminer.six python-dotenv

import os
import time
import json
from pathlib import Path
from typing import List, Dict, Tuple
import re

# PDF processing - Fast scan + PDFMiner verification
from langchain_community.document_loaders import PDFMinerLoader

try:
    import fitz  # PyMuPDF for fast scanning

    print("✅ PyMuPDF available for fast scanning")
except ImportError:
    print("⚠️  PyMuPDF not available - will use PDFMiner only (slower)")

# Zotero API
from pyzotero import zotero

print("📚 Zotero Non-OCR PDF Detection System")
print("⚡ Fast PyMuPDF scan + PDFMiner verification strategy")
print("🎯 Only verifies borderline cases for maximum speed")
print("=" * 50)

✅ PyMuPDF available for fast scanning
📚 Zotero Non-OCR PDF Detection System
⚡ Fast PyMuPDF scan + PDFMiner verification strategy
🎯 Only verifies borderline cases for maximum speed


## Configuration

Set up paths and detection parameters for your local Zotero storage scan.
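
In the configuration cell, `STORAGE_PATH` is assigned to `SAFE_COPY_PATH` directly, so `USE_SAFE_COPY` only changes the printed banner. One way to let the flag actually select the path is sketched here. `resolve_storage_path` is a hypothetical helper, not part of the notebook, and it assumes both candidate paths are passed in explicitly.

```python
from pathlib import Path


def resolve_storage_path(use_safe_copy: bool, safe_copy: Path, live: Path) -> Path:
    """Return the directory to scan, failing early if it does not exist.

    Hypothetical helper: the notebook assigns STORAGE_PATH directly instead.
    """
    chosen = (safe_copy if use_safe_copy else live).expanduser()
    if not chosen.is_dir():
        raise FileNotFoundError(f"Storage directory not found: {chosen}")
    return chosen
```

Raising instead of printing means a missing copy stops the notebook before any scan cell runs against the wrong directory.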

In [4]:
# Configuration - SAFE COPY APPROACH
# Step 1: Copy your Zotero storage to a safe location first!
# cp -r ~/snap/zotero-snap/common/Zotero/storage ~/Desktop/zotero-storage-copy

# Option 1: Work on a SAFE COPY (RECOMMENDED!)
USE_SAFE_COPY = True
SAFE_COPY_PATH = Path("~/Desktop/zotero-storage-copy").expanduser()

# Option 2: Work on live storage (RISKY - not recommended)
# LIVE_STORAGE_PATH = Path("~/snap/zotero-snap/common/Zotero/storage/").expanduser()

# Choose which path to use
STORAGE_PATH = SAFE_COPY_PATH

# Optional: Zotero API for metadata lookup (read-only)
import os
import dotenv

dotenv.load_dotenv()

ZOTERO_USER_ID = os.getenv("ZOTERO_USER")  # Only needed for metadata lookup
ZOTERO_API_KEY = os.getenv("ZOTERO_API_KEY")  # Only needed for metadata lookup
ZOTERO_LIBRARY_TYPE = "user"  # "user" or "group"

# Detection Parameters
MIN_WORDS_PER_PAGE = 50  # Minimum words per page to consider "has text"
MAX_SAMPLE_PAGES = 5  # Number of pages to sample for detection

# Output Configuration
OUTPUT_DIR = Path("./ocr_analysis")
OUTPUT_DIR.mkdir(exist_ok=True)

print("🛡️  SAFE COPY MODE" if USE_SAFE_COPY else "⚠️  LIVE STORAGE MODE")
print(f"📁 Storage path: {STORAGE_PATH}")
print(f"📁 Output directory: {OUTPUT_DIR}")
print(f"⚙️  Detection threshold: {MIN_WORDS_PER_PAGE} words/page")

# Verify paths (FAST CHECK - don't enumerate all folders!)
if USE_SAFE_COPY:
    if not SAFE_COPY_PATH.exists():
        print(f"❌ Safe copy not found: {SAFE_COPY_PATH}")
        print("   Please run first:")
        print(f"   cp -r ~/snap/zotero-snap/common/Zotero/storage {SAFE_COPY_PATH}")
    else:
        # Fast check - just verify it's a directory, don't count folders
        print(f"✅ Safe copy found at: {SAFE_COPY_PATH}")
        print("✅ Safe to experiment without affecting live library!")
        print("📊 Folder counting will happen during scan (to avoid long startup)")
else:
    print("⚠️  WARNING: Working on live Zotero storage!")
    if STORAGE_PATH.exists():
        print(f"✅ Storage path found: {STORAGE_PATH}")
    else:
        print(f"❌ Storage path not found: {STORAGE_PATH}")

print("🚀 Configuration complete! Ready to scan.")

🛡️  SAFE COPY MODE
📁 Storage path: /home/nathan/Desktop/zotero-storage-copy
📁 Output directory: ocr_analysis
⚙️  Detection threshold: 50 words/page
✅ Safe copy found at: /home/nathan/Desktop/zotero-storage-copy
✅ Safe to experiment without affecting live library!
📊 Folder counting will happen during scan (to avoid long startup)
🚀 Configuration complete! Ready to scan.


## 🛡️ Super Safe Workflow

**STEP 1: Create Safe Copy**
```bash
# Copy Zotero storage to desktop (safe location)
cp -r ~/snap/zotero-snap/common/Zotero/storage ~/Desktop/zotero-storage-copy
```

**STEP 2: Run Analysis on Copy**
- This notebook will analyze the **copy** (no risk to live library)
- Generate list of PDFs needing OCR
- Test OCR on sample files safely

**STEP 3: Selective Integration**
- OCR only the files you want to improve
- Copy back only the successfully processed files
- Your live library stays untouched until you explicitly choose to update it

**Why This Works Well:**
- ✅ No risk to the synced library
- ✅ OCR quality can be tested first
- ✅ Updates are selective
- ✅ Practical for large libraries (e.g. 23 GB)
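
Steps 1 and 3 can also be done from Python instead of `cp`. This is a minimal sketch, assuming the `storage/<item key>/<file>.pdf` layout used throughout this notebook. `make_safe_copy` and `copy_back` are hypothetical helper names, and `copy_back` only writes into item folders that already exist in the target.

```python
import shutil
from pathlib import Path
from typing import List


def make_safe_copy(live_storage: Path, safe_copy: Path) -> int:
    """Copy the storage tree and return the number of PDFs in the copy."""
    if not live_storage.is_dir():
        raise FileNotFoundError(f"Live storage not found: {live_storage}")
    # dirs_exist_ok (Python 3.8+) lets a rerun refresh an existing copy.
    shutil.copytree(live_storage, safe_copy, dirs_exist_ok=True)
    return sum(1 for p in safe_copy.rglob("*") if p.suffix.lower() == ".pdf")


def copy_back(ocr_dir: Path, live_storage: Path, keys: List[str]) -> List[Path]:
    """Copy OCR'd PDFs for the chosen item keys into the live storage."""
    copied = []
    for key in keys:
        for pdf in sorted((ocr_dir / key).glob("*.pdf")):
            dest = live_storage / key / pdf.name
            if dest.parent.is_dir():  # never create new item folders
                shutil.copy2(pdf, dest)
                copied.append(dest)
    return copied
```

Passing an explicit list of keys to `copy_back` keeps the update selective: nothing reaches the live library unless its item key was named.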

## Text Detection Functions

Fast PyMuPDF scan with PDFMiner verification for borderline cases, so results stay consistent with the PDFMiner-based processing pipeline.
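
The fast scan in the next cell makes a three-way decision from the sampled words-per-page figure. The rule can be isolated as a pure function, which makes the thresholds easy to test. `triage` is a hypothetical name; the thresholds (twice `MIN_WORDS_PER_PAGE` for clear text, fewer than 5 words per page for image-only) are the ones the notebook code uses.

```python
def triage(words_per_page: float, min_words_per_page: int = 50) -> str:
    """Classify a sampled words-per-page value the way the fast scan does."""
    if words_per_page >= min_words_per_page * 2:
        return "has_text"  # clearly text-based, no verification needed
    if words_per_page < 5:
        return "image_only"  # clearly scanned, no verification needed
    return "borderline"  # hand off to PDFMiner for a full extraction
```

Only the middle band pays for the slower PDFMiner pass, which is where the speed-up comes from.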

In [None]:
def analyze_pdf_text_content(pdf_path: str) -> Dict:
    """
    Fast PDF analysis with PDFMiner verification for accuracy.

    Strategy:
    1. Quick scan with PyMuPDF (much faster than PDFMiner)
    2. Only use PDFMiner for borderline cases
    3. Ensures compatibility with your clean_ocr_auto pipeline

    Returns:
        Dict with analysis results
    """
    from random import choice

    results = {
        "file_path": pdf_path,
        "file_size_mb": 0,
        "total_pages": 0,
        "has_extractable_text": False,
        "avg_words_per_page": 0,
        "total_characters": 0,
        "detection_method": "none",
        "sample_text": "",
        "error": None,
    }

    try:
        # Get file info
        file_stats = Path(pdf_path).stat()
        results["file_size_mb"] = file_stats.st_size / (1024 * 1024)

        # STEP 1: Fast scan with PyMuPDF
        try:
            import fitz

            doc = fitz.open(pdf_path)
            results["total_pages"] = len(doc)

            # Sample up to MAX_SAMPLE_PAGES distinct random pages for speed
            from random import sample

            sample_pages = min(MAX_SAMPLE_PAGES, len(doc))
            total_text = ""

            for page_num in sample(range(len(doc)), sample_pages):
                page = doc[page_num]
                text = page.get_text()
                total_text += text

            doc.close()

            if total_text.strip():
                word_count = len(total_text.split())
                quick_words_per_page = word_count / sample_pages

                # CLEAR CASES - no need for PDFMiner verification
                if (
                    quick_words_per_page >= MIN_WORDS_PER_PAGE * 2
                ):  # Clearly has lots of text
                    results["avg_words_per_page"] = quick_words_per_page
                    results["total_characters"] = len(total_text)
                    results["sample_text"] = (
                        total_text[:200] + "..."
                        if len(total_text) > 200
                        else total_text
                    )
                    results["detection_method"] = "pymupdf_fast"
                    results["has_extractable_text"] = True
                    print(
                        f"   ⚡ FAST: Clearly has text ({quick_words_per_page:.1f} words/page)"
                    )
                    return results

                elif quick_words_per_page < 5:  # Clearly image-only
                    results["avg_words_per_page"] = quick_words_per_page
                    results["detection_method"] = "pymupdf_fast"
                    results["has_extractable_text"] = False
                    print(
                        f"   ⚡ FAST: Clearly image-only ({quick_words_per_page:.1f} words/page)"
                    )
                    return results

                else:
                    # BORDERLINE CASE - verify with PDFMiner
                    print(
                        f"   🔍 BORDERLINE: Verifying with PDFMiner ({quick_words_per_page:.1f} words/page)..."
                    )

            else:
                # No text found with PyMuPDF
                results["detection_method"] = "pymupdf_fast"
                results["has_extractable_text"] = False
                print(f"   ⚡ FAST: No text detected")
                return results

        except ImportError:
            print("   ⚠️  PyMuPDF not available, using PDFMiner only")
        except Exception as e:
            print(f"   ⚠️  PyMuPDF failed: {e}, trying PDFMiner")

        # STEP 2: PDFMiner verification (only for borderline cases or PyMuPDF failures)
        print(f"   🔍 PDFMiner verification...")
        loader = PDFMinerLoader(pdf_path)
        documents = loader.load()

        if documents and documents[0].page_content:
            total_text = documents[0].page_content.strip()

            if total_text:
                # Use actual page count from PyMuPDF if available
                estimated_pages = (
                    results["total_pages"]
                    if results["total_pages"] > 0
                    else max(1, len(total_text) // 3000)
                )
                results["total_pages"] = estimated_pages

                word_count = len(total_text.split())
                results["avg_words_per_page"] = word_count / estimated_pages
                results["total_characters"] = len(total_text)
                results["sample_text"] = (
                    total_text[:200] + "..." if len(total_text) > 200 else total_text
                )
                results["detection_method"] = "pdfminer_verified"

                if results["avg_words_per_page"] >= MIN_WORDS_PER_PAGE:
                    results["has_extractable_text"] = True
                    print(
                        f"      ✅ PDFMiner confirms: {word_count} words, {results['avg_words_per_page']:.1f} words/page"
                    )
                else:
                    print(
                        f"      ⚠️  PDFMiner: Minimal text ({results['avg_words_per_page']:.1f} words/page)"
                    )
            else:
                print(f"      ❌ PDFMiner: No extractable text")
        else:
            print(f"      ❌ PDFMiner: Could not load document")

    except Exception as e:
        results["error"] = str(e)
        print(f"      ❌ Analysis failed: {e}")

    return results


def classify_pdf_type(analysis: Dict) -> str:
    """
    Classify PDF based on analysis results.
    """
    if analysis["error"]:
        return "ERROR"
    elif analysis["has_extractable_text"]:
        return "HAS_TEXT"
    elif analysis["avg_words_per_page"] > 0:
        return "MINIMAL_TEXT"
    else:
        return "IMAGE_ONLY"


print("✅ Fast PDF detection with PDFMiner verification ready!")
print("⚡ Uses PyMuPDF for speed, PDFMiner for accuracy")
print("🎯 Only verifies borderline cases - much faster overall!")

✅ Fast PDF detection with PDFMiner verification ready!
⚡ Uses PyMuPDF for speed, PDFMiner for accuracy
🎯 Only verifies borderline cases - much faster overall!


## Local PDF Testing

Test the detection system on your local PDFs first.
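
After a test run it can help to tally how many files landed in each class. This is a small sketch that assumes each entry of `test_results` carries a `"classification"` key, as built in the cell below. `tally_classifications` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, List


def tally_classifications(results: List[Dict]) -> Dict[str, int]:
    """Count analysis results per classification label."""
    return dict(Counter(r["classification"] for r in results))
```

For example, `tally_classifications(test_results)` can be called once the test loop has finished.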

In [11]:
# Test on your local PDFs first
test_pdfs = [
    "./brantome.pdf",
    "./medici.pdf",
    "./brotton.pdf",
    "./vankley.pdf",
    "./henriiv.pdf",
    "./huguenots.pdf",
]

print("🧪 Testing detection system on local PDFs")
print("=" * 50)

test_results = []
for pdf_path in test_pdfs:
    if Path(pdf_path).exists():
        print(f"\n📄 Analyzing: {pdf_path}")
        analysis = analyze_pdf_text_content(pdf_path)
        classification = classify_pdf_type(analysis)

        print(f"   📊 Result: {classification}")
        print(f"   📏 Pages: {analysis['total_pages']}")
        print(f"   📝 Words/page: {analysis['avg_words_per_page']:.1f}")
        print(f"   🔧 Method: {analysis['detection_method']}")
        print(f"   💾 Size: {analysis['file_size_mb']:.1f} MB")

        if analysis["sample_text"]:
            print(f"   📖 Sample: {analysis['sample_text'][:100]}...")

        test_results.append({**analysis, "classification": classification})
    else:
        print(f"⚠️  File not found: {pdf_path}")

print(f"\n✅ Local testing complete! Analyzed {len(test_results)} PDFs")

🧪 Testing detection system on local PDFs

📄 Analyzing: ./brantome.pdf
   ⚡ FAST: Clearly has text (138.3 words/page)
   📊 Result: HAS_TEXT
   📏 Pages: 749
   📝 Words/page: 138.3
   🔧 Method: pymupdf_fast
   💾 Size: 27.0 MB
   📖 Sample: Source gallica.bnf.fr / Bibliothèque nationale de France
Oeuvres complètes de Pierre
de Bourdeille s...

📄 Analyzing: ./medici.pdf
   ⚡ FAST: Clearly has text (125.7 words/page)
   📊 Result: HAS_TEXT
   📏 Pages: 640
   📝 Words/page: 125.7
   🔧 Method: pymupdf_fast
   💾 Size: 125.8 MB
   📖 Sample: Source gallica.bnf.fr / Bibliothèque nationale de France
Lettres de Catherine de
Médicis
Catherine d...

📄 Analyzing: ./brotton.pdf
   ⚡ FAST: Clearly image-only (0.7 words/page)
   📊 Result: MINIMAL_TEXT
   📏 Pages: 209
   📝 Words/page: 0.7
   🔧 Method: pymupdf_fast
   💾 Size: 28.5 MB

📄 Analyzing: ./vankley.pdf
   ⚡ FAST: Clearly has text (384.7 words/page)
   📊 Result: HAS_TEXT
   📏 Pages: 20
   📝 Words/page: 384.7
   🔧 Method: pymupdf_fast
   💾 Size: 2.6 MB


## Local Storage Scanner

Recursively scan your Zotero storage directory to find all PDFs and test their OCR status.
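
The scanner below uses `folder.glob("*.pdf")` and analyzes only the first match. On case-sensitive filesystems that pattern skips files ending in `.PDF`. This is a hedged sketch of a case-insensitive listing that also keeps every PDF per folder. `find_pdfs_by_key` is a hypothetical helper, not the notebook's scanner.

```python
from pathlib import Path
from typing import Dict, List


def find_pdfs_by_key(storage: Path) -> Dict[str, List[Path]]:
    """Map each item-key folder to its PDFs, matching .pdf in any case."""
    found: Dict[str, List[Path]] = {}
    for folder in sorted(p for p in storage.iterdir() if p.is_dir()):
        pdfs = sorted(
            f for f in folder.iterdir()
            if f.is_file() and f.suffix.lower() == ".pdf"
        )
        if pdfs:  # folders without PDFs are left out
            found[folder.name] = pdfs
    return found
```

Sorting the folders and files makes repeated scans visit items in the same order, which simplifies comparing two runs.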

In [5]:
def scan_zotero_storage_directory():
    """
    Recursively scan Zotero storage directory for PDFs and analyze OCR status.

    Zotero storage structure:
    storage/
    ├── AB12CD34/  <- Zotero item key (8 characters)
    │   ├── document.pdf
    │   └── .zotero-ft-cache (full-text cache)
    ├── EF56GH78/
    │   └── another.pdf
    ...

    Returns:
        Dict with analysis results keyed by Zotero item key
    """
    print("🔍 Scanning Zotero storage directory...")
    print("=" * 50)

    if not STORAGE_PATH.exists():
        raise FileNotFoundError(f"Zotero storage not found: {STORAGE_PATH}")

    results = {
        "all_pdfs": {},
        "needs_ocr": [],
        "has_text": [],
        "errors": [],
        "no_pdfs": [],
        "total_folders": 0,
        "total_pdfs": 0,
    }

    # Get all storage folders (each is a Zotero item key)
    storage_folders = [f for f in STORAGE_PATH.iterdir() if f.is_dir()]
    results["total_folders"] = len(storage_folders)

    print(f"📁 Found {len(storage_folders)} storage folders to analyze")

    for i, folder in enumerate(storage_folders):
        folder_name = folder.name  # This is the Zotero item key

        if i % 100 == 0:  # Progress update every 100 folders
            print(f"   📊 Progress: {i}/{len(storage_folders)} folders...")

        # Find PDF files in this folder
        pdf_files = list(folder.glob("*.pdf"))

        if not pdf_files:
            results["no_pdfs"].append(folder_name)
            continue

        # Analyze the first PDF found (usually only one per folder)
        pdf_path = pdf_files[0]
        results["total_pdfs"] += 1

        try:
            analysis = analyze_pdf_text_content(str(pdf_path))
            analysis["zotero_key"] = folder_name
            analysis["folder_path"] = str(folder)
            analysis["pdf_filename"] = pdf_path.name

            classification = classify_pdf_type(analysis)
            analysis["classification"] = classification

            results["all_pdfs"][folder_name] = analysis

            # Categorize results
            if classification in ["IMAGE_ONLY", "MINIMAL_TEXT"]:
                results["needs_ocr"].append(folder_name)
            elif classification == "HAS_TEXT":
                results["has_text"].append(folder_name)
            else:
                results["errors"].append(folder_name)

        except Exception as e:
            print(f"   ❌ Error analyzing {folder_name}: {e}")
            error_analysis = {
                "zotero_key": folder_name,
                "folder_path": str(folder),
                "error": str(e),
                "classification": "ERROR",
            }
            results["all_pdfs"][folder_name] = error_analysis
            results["errors"].append(folder_name)

    # Summary
    print(f"\n📊 SCAN COMPLETE")
    print(f"   📁 Total folders: {results['total_folders']}")
    print(f"   📄 Total PDFs found: {results['total_pdfs']}")
    print(f"   ✅ PDFs with text: {len(results['has_text'])}")
    print(f"   🔍 PDFs needing OCR: {len(results['needs_ocr'])}")
    print(f"   ❌ Analysis errors: {len(results['errors'])}")
    print(f"   📂 Folders with no PDFs: {len(results['no_pdfs'])}")

    return results


def lookup_zotero_metadata(item_keys: List[str]) -> Dict:
    """
    Use Zotero API to get metadata for specific item keys.
    For PDF attachments, also fetches parent item metadata (books, articles, etc.)
    READ-ONLY operation - no writes to library.
    """
    if not ZOTERO_API_KEY or ZOTERO_API_KEY == "YOUR_API_KEY":
        print("⚠️  No Zotero API credentials - skipping metadata lookup")
        return {}

    print(f"🔍 Looking up metadata for {len(item_keys)} items...")

    try:
        zot = zotero.Zotero(ZOTERO_USER_ID, ZOTERO_LIBRARY_TYPE, ZOTERO_API_KEY)

        # Look up items in batches (the Zotero API accepts up to 50 item keys per request)
        batch_size = 50
        metadata = {}
        parent_keys_to_fetch = set()

        # STEP 1: Get attachment metadata and identify parent items
        for i in range(0, len(item_keys), batch_size):
            batch = item_keys[i : i + batch_size]
            print(
                f"   📥 Fetching batch {i//batch_size + 1}/{(len(item_keys)-1)//batch_size + 1} (attachments)"
            )

            try:
                # Get items by keys
                items = zot.items(itemKey=",".join(batch))

                for item in items:
                    key = item["key"]
                    data = item["data"]

                    # Store attachment metadata
                    attachment_metadata = {
                        "attachment_title": data.get("title", "Unknown Title"),
                        "attachment_type": data.get("itemType", "unknown"),
                        "parent_item_key": data.get("parentItem", ""),
                        "filename": data.get("filename", ""),
                        "content_type": data.get("contentType", ""),
                        "date_added": data.get("dateAdded", ""),
                        "date_modified": data.get("dateModified", ""),
                    }

                    metadata[key] = attachment_metadata

                    # If this is an attachment with a parent, we'll fetch parent metadata
                    if data.get("parentItem"):
                        parent_keys_to_fetch.add(data["parentItem"])

            except Exception as e:
                print(f"   ⚠️  Batch failed: {e}")
                continue

        # STEP 2: Get parent item metadata (the actual books/articles/etc.)
        if parent_keys_to_fetch:
            print(f"   📚 Fetching {len(parent_keys_to_fetch)} parent items...")
            parent_metadata = {}

            parent_keys_list = list(parent_keys_to_fetch)
            for i in range(0, len(parent_keys_list), batch_size):
                batch = parent_keys_list[i : i + batch_size]
                print(
                    f"   📥 Parent batch {i//batch_size + 1}/{(len(parent_keys_list)-1)//batch_size + 1}"
                )

                try:
                    parent_items = zot.items(itemKey=",".join(batch))

                    for item in parent_items:
                        parent_key = item["key"]
                        data = item["data"]

                        parent_metadata[parent_key] = {
                            "title": data.get("title", "Unknown Title"),
                            "item_type": data.get("itemType", "unknown"),
                            "language": data.get("language", "unknown"),
                            # Creators have either firstName/lastName or a single "name" field
                            "authors": [
                                creator.get("name")
                                or f"{creator.get('firstName', '')} {creator.get('lastName', '')}".strip()
                                for creator in data.get("creators", [])
                            ],
                            "publication_title": data.get("publicationTitle", ""),
                            "publisher": data.get("publisher", ""),
                            "date": data.get("date", ""),
                            "pages": data.get("pages", ""),
                            "volume": data.get("volume", ""),
                            "issue": data.get("issue", ""),
                            "url": data.get("url", ""),
                            "abstract": (
                                data.get("abstractNote", "")[:200] + "..."
                                if data.get("abstractNote", "")
                                else ""
                            ),
                        }

                except Exception as e:
                    print(f"   ⚠️  Parent batch failed: {e}")
                    continue

            # STEP 3: Merge parent metadata with attachment metadata
            for attachment_key, attachment_data in metadata.items():
                parent_key = attachment_data.get("parent_item_key")
                if parent_key and parent_key in parent_metadata:
                    # Add parent metadata to attachment record
                    attachment_data.update(parent_metadata[parent_key])

                    # Override title with parent title (more useful than attachment filename)
                    attachment_data["title"] = parent_metadata[parent_key]["title"]
                else:
                    # No parent found - use attachment data as fallback
                    attachment_data["title"] = attachment_data.get(
                        "attachment_title", "Unknown Title"
                    )
                    attachment_data["item_type"] = attachment_data.get(
                        "attachment_type", "attachment"
                    )
                    attachment_data["language"] = "unknown"
                    attachment_data["authors"] = []

        print(f"   ✅ Retrieved metadata for {len(metadata)} attachments")
        if parent_keys_to_fetch:
            print(
                f"   📚 Retrieved parent metadata for {len(parent_keys_to_fetch)} items"
            )
        return metadata

    except Exception as e:
        print(f"❌ Metadata lookup failed: {e}")
        return {}


print("✅ Storage scanner functions ready!")

✅ Storage scanner functions ready!


## Run Storage Analysis

Scan your entire Zotero storage directory (this is much faster than API downloads!).
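
Once the scan finishes, the raw lists can be reduced to percentage shares. The sketch below guards against an empty scan, where `total_pdfs` is 0 and a plain division would fail. `summarize_scan` is a hypothetical helper that assumes the result keys produced by `scan_zotero_storage_directory`.

```python
from typing import Dict


def summarize_scan(results: Dict) -> Dict[str, float]:
    """Return the percentage share per category, or zeros for an empty scan."""
    total = results.get("total_pdfs", 0)

    def pct(key: str) -> float:
        count = len(results.get(key, []))
        return round(100 * count / total, 1) if total else 0.0

    return {key: pct(key) for key in ("has_text", "needs_ocr", "errors")}
```

With the counts reported later in this notebook (2165 PDFs: 1971 with text, 193 needing OCR, 1 error), this gives 91.0, 8.9 and 0.0 percent.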

In [17]:
# Run the storage directory scan
print("🚀 Starting Zotero storage analysis...")
print("   This scans local files only - no downloads, no writes!")
print()

start_time = time.time()
storage_results = scan_zotero_storage_directory()
scan_time = time.time() - start_time

print(f"\n⏱️  Scan completed in {scan_time:.1f} seconds")
print(
    f"📊 Processing rate: {storage_results['total_pdfs']/scan_time:.1f} PDFs per second"
)

# Save raw results
raw_results_file = OUTPUT_DIR / "storage_scan_results.json"
with open(raw_results_file, "w") as f:
    json.dump(storage_results, f, indent=2, default=str)

print(f"💾 Raw results saved to: {raw_results_file}")

🚀 Starting Zotero storage analysis...
   This scans local files only - no downloads, no writes!

🔍 Scanning Zotero storage directory...
📁 Found 3143 storage folders to analyze
   📊 Progress: 0/3143 folders...
   ⚡ FAST: Clearly has text (618.0 words/page)
   ⚡ FAST: Clearly has text (523.7 words/page)
   ⚡ FAST: Clearly has text (382.3 words/page)
   ⚡ FAST: Clearly has text (380.0 words/page)
   ⚡ FAST: Clearly has text (422.0 words/page)
   ⚡ FAST: Clearly has text (475.3 words/page)
   ⚡ FAST: Clearly has text (267.7 words/page)
   ⚡ FAST: No text detected
   ⚡ FAST: Clearly has text (447.0 words/page)
   ⚡ FAST: Clearly has text (348.0 words/page)
   ⚡ FAST: Clearly has text (435.0 words/page)
   ⚡ FAST: Clearly has text (297.0 words/page)
   ⚡ FAST: Clearly has text (162.3 words/page)
   ⚡ FAST: Clearly has text (430.0 words/page)
   🔍 BORDERLINE: Verifying with PDFMiner (71.0 words/page)...
   🔍 PDFMiner verification...
      ✅ PDFMiner confirms: 101 words, 50.5 words/page
   ⚡ F

In [28]:
scan_results = "./ocr_analysis/storage_scan_results.json"
with open(scan_results, "r") as f:
    storage_results = json.load(f)
print(f"📥 Loaded existing scan results from: {scan_results}")
print(f"📊 Found {len(storage_results['needs_ocr'])} PDFs needing OCR")
for item in storage_results["needs_ocr"]:
    print(item)

📥 Loaded existing scan results from: ./ocr_analysis/storage_scan_results.json
📊 Found 193 PDFs needing OCR
ZUVZNVA3
MCQRDT29
C6QQX25V
JIC38VUU
PVCTI8ZP
8IF4KLRD
R2GIIFIV
JWZ7SGZE
8F7FTAYQ
8EN5CRZ4
H57Q8F5M
XSNGVWXD
4G888FDC
S4YPN4RV
3HBMR8NQ
VRNCPXU7
E8RWP3FA
3LXTWQYH
M722SCS2
6T4TWQGD
9X3K4NSM
WFBC8JM2
EV4ICQST
4ZUW433W
6LAGL93T
HNS77AT9
AGX42NX7
27Q8EZ55
4BWWCABL
WK5SPVN4
ITQG8UQD
WQPWABRQ
RCQSFD5W
B4ZII4VX
EWB5G63T
E7QUBUE3
WEWRBZKI
EX6X34ZR
2NSJ2AFB
SGBYINIE
XWNEY3VS
RR27EW2V
CB2R2ZG6
GJMJ8Q4V
T4XQ4QKK
YK3PE754
GYFIXBGZ
E5JNN4Z6
FFIBP2NA
MVZAR3EP
UPEMRQ2N
SHHITQCB
9AVIX9QT
Q54SLTSI
T7I848CX
NQ7E2N7W
F2IGQ6ND
ZMBZ7N8B
MD7KBY55
9DEFUPBB
RQ4VBP6B
VZEJ8UNN
EKKRC49J
85XR9QK8
P4RHKKHL
XZKGVQSA
3MHUC7ET
23WYT9X8
G2VI7RFW
K3JLL2GT
9MGLNI3B
XAWTQXYS
EF7CTRYN
THBFU9JY
C7PIXBT8
K954RUH2
78PPQAIY
9PLDEPEZ
AUFJHBJF
V3RT8IFJ
B9CHN9XR
GQZ3KV5R
VIWEG2SN
93QXHL5F
S7LUWMCR
J82TM6Z4
53T2GUZ3
PI4AINWG
TK3AB8KU
6NRLQX8D
QKP67EI3
TN5EITB3
JG2TS3WB
F3PCQRK6
YUW6MMFA
9SM8J6C5
UJFI55VP
IZTFBBPY
FD6HZ6T7
RL

## Generate OCR Task Lists

Create actionable lists of items that need OCR, with optional metadata lookup.
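
The cell below generates a bash script with one `ocrmypdf` call per item. Two pieces of it can be sketched in isolation: choosing the OCR language and building a shell-safe command line. `pick_ocr_language` and `ocr_command` are hypothetical helpers. The title keywords and `ocrmypdf` flags are copied from the notebook code; `shlex.join` requires Python 3.8 or later.

```python
import shlex


def pick_ocr_language(language: str, title: str) -> str:
    """Return a Tesseract language code using the notebook's rules."""
    lowered = title.lower()
    turkish_hints = ("muhim", "defter", "sefaretnamesi")
    if language == "tr" or any(hint in lowered for hint in turkish_hints):
        return "tur"
    if language == "fr":
        return "fra"
    return "eng"  # default when no language is recorded


def ocr_command(source: str, target: str, lang_code: str) -> str:
    """Build one ocrmypdf command with every argument shell-quoted."""
    args = [
        "ocrmypdf", "--rotate-pages", "--deskew", "--clean",
        "--optimize", "3", "--force-ocr", "--oversample", "300",
        "--language", lang_code, source, target,
    ]
    return shlex.join(args)
```

Quoting each argument matters here because many filenames in the library contain apostrophes and spaces, for example `Ladurie - 1974 - L'histoire immobile.pdf`.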

In [39]:
# My Zotero Checker
import json
import shlex

scan_file = "./ocr_analysis/storage_scan_results.json"
with open(scan_file, "r") as f:
    storage_results = json.load(f)
print("🔍 Getting metadata for items that need OCR...")
ocr_metadata = lookup_zotero_metadata(storage_results["needs_ocr"])

needs_ocr = storage_results["needs_ocr"]
has_text = storage_results["has_text"]
errors = storage_results["errors"]
total_pdfs = storage_results["total_pdfs"]

OCR_DIR = "./ocr_folders"
if not os.path.exists(OCR_DIR):
    os.mkdir(OCR_DIR)

print(f"📚 Total PDFs analyzed: {total_pdfs}")
print(f"✅ PDFs with good text: {len(has_text)} ({len(has_text)/total_pdfs*100:.1f}%)")
print(f"🔍 PDFs needing OCR: {len(needs_ocr)} ({len(needs_ocr)/total_pdfs*100:.1f}%)")
print(f"❌ PDFs with errors: {len(errors)} ({len(errors)/total_pdfs*100:.1f}%)")

csv_lines = []
headers = ["key", "title", "type", "language", "size_mb", "words_per_page", "filename"]

if not Path("./ocr_scripts").exists():
    os.mkdir("./ocr_scripts")

with open("./ocr_scripts/ocr_bash.sh", "w") as f:
    f.write("#!/bin/bash\n\n")
    f.write("set -e  # Exit on any error\n")
    f.write("set -u  # Exit on undefined variables\n\n")

    # Progress tracking
    f.write("TOTAL_FILES={}\n".format(len(ocr_metadata.keys())))
    f.write("CURRENT=0\n")
    f.write("SUCCESS=0\n")
    f.write("FAILED=0\n\n")

    log_dir = Path("./ocr_scripts").resolve()
    if not log_dir.exists():
        os.mkdir(log_dir)
    # Log file
    logfile = Path("./ocr_scripts/ocr_log.txt").resolve()
    f.write(f"LOG_FILE={shlex.quote(str(logfile))}\n")
    f.write('echo "OCR Processing Started: $(date)" > "$LOG_FILE"\n\n')

    dir = Path("~/Desktop/zotero-storage-copy").expanduser()
    target_dir = Path("./ocr_folders").resolve()

    for idx, item in enumerate(ocr_metadata.items()):
        key, meta = item
        title = meta.get("title", "Unknown Title")
        language = meta.get("language", "unknown")

        if "turc 130" in title.lower():
            continue

        # Get the ORIGINAL filename from your storage results
        original_filename = [
            f for f in os.listdir(f"{dir}/{key}") if f.endswith(".pdf")
        ][0]
        safe_filename = shlex.quote(original_filename)
        # Clean filename for bash - escape ALL problematic characters
        original_path = f"{dir}/{key}/{original_filename}"
        safe_path = shlex.quote(original_path)

        target_path = f"{target_dir}/{key}/{original_filename}"
        safe_target_path = shlex.quote(target_path)
        # Progress counter
        f.write("CURRENT=$((CURRENT + 1))\n")
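        # Quote the label so quotes, $ or backticks in a title cannot break the script
        label = shlex.quote(f"{key} - {title}...")
        f.write(f'echo "[$CURRENT/$TOTAL_FILES] Processing" {label}\n')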

        # Check if source file exists
        # f.write(f"if [ ! -f {safe_path} ]; then\n")
        # f.write(
        #     f"""    echo "ERROR: Source file not found: {safe_path}" | tee -a "$LOG_FILE"\n"""
        # )
        # f.write(f"    FAILED=$((FAILED + 1))\n")
        # f.write(f"else\n")  # ← ADD THIS
        # f.write(f"""    echo "Found source file: {safe_path}"\n""")
        # f.write(f"fi\n\n")

        # f.write(f"""if [ ! -d "{target_dir}/{key}" ]; then\n""")
        # f.write(f"""    mkdir -p "{target_dir}/{key}"\n""")
        # f.write(f"""    echo "Created target directory: {target_dir}/{key}"\n""")
        # f.write(f"else\n")
        # f.write(f"""    echo "Target directory already exists: {target_dir}/{key}"\n""")
        # f.write(f"fi\n\n")  # ← This closes the if statement properly!
        # # Check if output already exists (skip if exists)
        # f.write(f"""if [ -f {safe_target_path} ]; then\n""")
        # f.write(
        #     f"""    echo "SKIP: Output already exists: {original_filename}" | tee -a "$LOG_FILE"\n"""
        # )
        # f.write(f"    SUCCESS=$((SUCCESS + 1))\n")
        # f.write(f"""else\n""")
        # f.write(f"""    echo "Processing file: {safe_path}"\n""")
        # f.write(f"fi\n\n")

        f.write(f"""cd "{dir}/{key}"\n""")

        # Start timing
        f.write("START_TIME=$(date +%s)\n")

        # Choose language and run OCR with error handling
        if (
            language == "tr"
            or "muhim" in title.lower()
            or "defter" in title.lower()
            or "sefaretnamesi" in title.lower()
        ):
            lang_code = "tur"
        elif language == "fr":
            lang_code = "fra"
        else:
            lang_code = "eng"

        f.write(
            f"""
        source_pdf{idx}=$(find . -name "*.pdf" -type f | head -1)
        source_file{idx}="{dir}/{key}/$source_pdf{idx}"
        target_file{idx}="{target_dir}/{key}/$source_pdf{idx}"
        file_name{idx}="$source_pdf{idx}"
        if [ -z "$source_pdf{idx}" ]; then
            echo "ERROR: Source file not found: $source_file{idx}" | tee -a "$LOG_FILE"
            FAILED=$((FAILED + 1))
        else
            echo "Found source file: $source_file{idx}"
            
            # Create target directory
            if [ ! -d "{target_dir}/{key}" ]; then
                mkdir -p "{target_dir}/{key}"
                echo "Created target directory: {target_dir}/{key}"
            else
                echo "Target directory already exists: {target_dir}/{key}"
            fi
            
            # OCR processing
            if [ -f "$target_file{idx}" ]; then
                echo "SKIP: Output already exists: $file_name{idx}" | tee -a "$LOG_FILE"
                SUCCESS=$((SUCCESS + 1))
            else
                echo "Processing file: $source_file{idx}"
                cd "{dir}/{key}"
                START_TIME=$(date +%s)
                
                if ocrmypdf --rotate-pages --deskew --clean --optimize 3 --force-ocr --oversample 300 --language {lang_code} "$source_file{idx}" "$target_file{idx}"; then
                    END_TIME=$(date +%s)
                    DURATION=$((END_TIME - START_TIME))
                    echo "SUCCESS: $file_name{idx} completed in ${{DURATION}}s" | tee -a "$LOG_FILE"
                    SUCCESS=$((SUCCESS + 1))
                else
                    echo "ERROR: Failed to process $source_file{idx}" | tee -a "$LOG_FILE"
                    FAILED=$((FAILED + 1))
                fi
            fi
        fi

        """
        )
    # Final summary
    # bash's echo does not expand \n without -e, so emit blank lines explicitly
    f.write('echo ""\n')
    f.write("""echo "=== OCR PROCESSING COMPLETE ==="\n""")
    f.write("""echo "Total files: $TOTAL_FILES"\n""")
    f.write("""echo "Successful: $SUCCESS"\n""")
    f.write("""echo "Failed: $FAILED"\n""")
    f.write("""echo "Completion: $(date)"\n""")
    f.write('echo "" >> "$LOG_FILE"\n')
    f.write("""echo "Processing Summary: $(date)" >> "$LOG_FILE"\n""")
    f.write("""echo "Success: $SUCCESS, Failed: $FAILED" >> "$LOG_FILE"\n""")

🔍 Getting metadata for items that need OCR...
🔍 Looking up metadata for 193 items...
   📥 Fetching batch 1/4 (attachments)
   📥 Fetching batch 2/4 (attachments)
   📥 Fetching batch 3/4 (attachments)
   📥 Fetching batch 4/4 (attachments)
   📚 Fetching 98 parent items...
   📥 Parent batch 1/2
   📥 Parent batch 2/2
   ✅ Retrieved metadata for 164 attachments
   📚 Retrieved parent metadata for 98 items
📚 Total PDFs analyzed: 2165
✅ PDFs with good text: 1971 (91.0%)
🔍 PDFs needing OCR: 193 (8.9%)
❌ PDFs with errors: 1 (0.0%)


In [10]:
# Flag item folders that contain more than one PDF
for folder in storage_results["needs_ocr"]:
    pdf_names = [
        name for name in os.listdir(f"{dir}/{folder}") if name.lower().endswith(".pdf")
    ]
    if len(pdf_names) > 1:
        print(f"⚠️  WARNING: More than one PDF in {folder} - skipping copy")
        continue

In [24]:
for i in range(5):
    item = ocr_metadata[needs_ocr[i]]
    print(item)

{'attachment_title': 'Fleischer_1986_Bureaucrat and intellectual in the Ottoman Empire.pdf', 'attachment_type': 'attachment', 'parent_item_key': 'KTH5RHWP', 'filename': 'Fleischer_1986_Bureaucrat and intellectual in the Ottoman Empire.pdf', 'content_type': 'application/pdf', 'date_added': '2018-04-10T13:16:15Z', 'date_modified': '2018-04-10T13:17:13Z', 'title': 'Bureaucrat and intellectual in the Ottoman Empire: the historian Mustafa Ali (1541-1600)', 'item_type': 'book', 'language': '', 'authors': ['Cornell H. Fleischer'], 'publication_title': '', 'publisher': 'Princeton University Press', 'date': '1986', 'pages': '', 'volume': '', 'issue': '', 'url': '', 'abstract': ''}
{'attachment_title': 'Muhimme 141 (1148).pdf', 'attachment_type': 'attachment', 'parent_item_key': '', 'filename': 'Muhimme 141 (1148).pdf', 'content_type': 'application/pdf', 'date_added': '2022-09-29T01:21:50Z', 'date_modified': '2022-09-29T01:21:50Z', 'title': 'Muhimme 141 (1148).pdf', 'item_type': 'attachment', 'l

In [26]:
len(ocr_metadata.keys())

164

d'Arbaleste - 2009 - Escape from the Massacre, 1572.pdf
Yvelise - 1991 - D'Alexandrie à Istanbul Jean Palerne - Pérégrinat.pdf
Tebeau - 2010 - Sculpted Landscapes Art & Place in Cleveland's Cu.pdf
Collins - 1979 - Sur l'histoire fiscale du XVIIe siècle les impôts.pdf
Hamilton - 2017 - Pierre de L'Estoile and his World in the Wars of R.pdf
Ladurie - 1974 - L'histoire immobile.pdf
L''Histoire universelle du sieur d''Aubigné....pdf
Tinguely_2000_L'écriture du Levant à la Renaissance.pdf
Smither - 1991 - The St. Bartholomew's Day Massacre and Images of K.pdf
Watt - 2020 - The Consistory and Social Discipline in Calvin's Geneva.pdf
Bremond d'Ars - 1884 - Le père de Madame de Rambouillet. Jean de Vivonne,.pdf
Lesure_1986_Les Relations Franco-Ottomanes a L'Épreuve des Guerres de Religion (1560-1594).pdf
Isom-Verhaaren - 2006 - Royal French Women in the Ottoman Sultans' Harem .pdf
Garnier - 2008 - L'Alliance impie François Ier et Soliman le Magni.pdf
Auchterlonie - 2000 - A Turk of the West Si

108