# PaperChecker - Medical Research Paper Extraction

Automated pipeline for extracting structured data from medical research PDFs.

**Quick Start:**
1. Run **Cell 1** (Setup) - installs dependencies, clones the repo, mounts Drive, and loads API keys
2. Run **Cell 2** (Configure) - set your PDF folder path
3. Run **Cell 3** (Run) - extract data from PDFs
4. Run **Cell 4** (Download) - get your results

**First time?** Add your API keys to Colab Secrets (key icon in sidebar):
- `OPENAI_API_KEY`
- `GOOGLE_API_KEY`
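
If you also want to run the code outside Colab, the key lookup can be sketched with a fallback to environment variables. This is a minimal sketch, not what Cell 1 does: `load_api_key` is a hypothetical helper name, and the environment-variable fallback is an assumption for local use.

```python
import os
from typing import Optional


def load_api_key(name: str) -> Optional[str]:
    """Return the secret `name` from Colab Secrets, else from the environment.

    Hypothetical helper for illustration; Cell 1 reads Colab Secrets only.
    """
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
        value = userdata.get(name)
        if value:
            return value
    except Exception:
        # Not in Colab, secret not defined, or notebook access not granted.
        pass
    # Assumed fallback: a regular environment variable with the same name.
    return os.environ.get(name) or None


# Report which keys are available without printing their values.
for key_name in ("OPENAI_API_KEY", "GOOGLE_API_KEY"):
    status = "found" if load_api_key(key_name) else "missing"
    print(f"{key_name}: {status}")
```

Returning `None` instead of raising keeps the check non-fatal, which matches how Cell 1 reports missing keys and continues.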

In [None]:
#@title **1. SETUP** (run once per session)
#@markdown Installs dependencies, clones repo, mounts Drive, loads API keys.

# Install dependencies
!pip install -U openai google-genai pymupdf python-docx openpyxl jsonschema -q

# Clone PaperChecker from GitHub
import os
if not os.path.exists('/content/paperchecker'):
    !git clone https://github.com/maxrusse/paperchecker.git /content/paperchecker
else:
    !cd /content/paperchecker && git pull

%cd /content/paperchecker

# Mount Google Drive (for PDF access)
from google.colab import drive
drive.mount('/content/drive')

# Load API keys from Colab Secrets
from google.colab import userdata
try:
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
    GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
    if OPENAI_API_KEY and GOOGLE_API_KEY:
        print("[OK] API keys loaded from Colab Secrets")
    else:
        print("[!] Missing API keys - add OPENAI_API_KEY and GOOGLE_API_KEY to Colab Secrets")
except Exception:
    OPENAI_API_KEY = None
    GOOGLE_API_KEY = None
    print("[!] Add OPENAI_API_KEY and GOOGLE_API_KEY to Colab Secrets (key icon in sidebar)")

print("[OK] Setup complete!")

In [None]:
#@title **2. CONFIGURE** (set your PDF folder)
#@markdown Set the path to your PDF folder in Google Drive.

#@markdown **Option A:** Use Google Drive folder
PDF_FOLDER = "/content/drive/MyDrive/paperchecker/pdfs"  #@param {type:"string"}

#@markdown **Option B:** Upload PDFs directly (uncomment below)
# from google.colab import files
# uploaded = files.upload()
# PDF_FOLDER = "/content/paperchecker"  # files.upload() saves to the working directory set in Cell 1

import os, glob
pdf_paths = sorted(glob.glob(os.path.join(PDF_FOLDER, "*.pdf")))
print(f"\nFound {len(pdf_paths)} PDF(s) in {PDF_FOLDER}:")
for i, p in enumerate(pdf_paths, 1):
    print(f"  {i}. {os.path.basename(p)}")

if not pdf_paths:
    print("\n[!] No PDFs found. Check your PDF_FOLDER path.")
    print("    Example: /content/drive/MyDrive/your_folder")

In [None]:
#@title **3. RUN EXTRACTION** (main cell)
#@markdown Runs the extraction pipeline on all PDFs.

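import os, glob
from datetime import datetime, timezone  # timezone.utc also works on Python < 3.11

# Reload script module (picks up any code changes)
import importlib
import script
importlib.reload(script)

# Verify PDFs (pdf_paths is defined in Cell 2; avoid a NameError if Cell 2 was skipped)
if not globals().get("pdf_paths"):
    raise RuntimeError("No PDFs found! Run Cell 2 first and check PDF_FOLDER path.")

# Setup output paths (UTC timestamp keeps each run's files distinct)
os.makedirs("output", exist_ok=True)
timestamp = datetime.now(timezone.utc).strftime('%Y%m%d_%H%M%S')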
OUTPUT_XLSX = f"output/extraction_{timestamp}.xlsx"
OUTPUT_DOCX = f"output/review_log_{timestamp}.docx"

print(f"Processing {len(pdf_paths)} PDF(s)...")
print(f"Output: {OUTPUT_XLSX}")
print("=" * 50)

# Run pipeline
results = script.run_pipeline(
    pdf_paths=pdf_paths,
    out_xlsx=OUTPUT_XLSX,
    out_docx=OUTPUT_DOCX,
    openai_api_key=OPENAI_API_KEY,
    google_api_key=GOOGLE_API_KEY,
    progress_fn=print,
)

print("\n" + "=" * 50)
print(f"DONE! Processed {len(results)} paper(s)")
for i, r in enumerate(results, 1):
    pid = r.get('paper_id', {})
    study_type = r.get('study_type', 'Unknown')
    needs_review = r.get('validation', {}).get('needs_human_review', False)
    print(f"  {i}. PMID={pid.get('pmid', 'N/A')} | {study_type} | review={needs_review}")

In [None]:
#@title **4. DOWNLOAD RESULTS**
#@markdown Downloads Excel, Word, and audit files.

import glob
import os  # needed for os.path.exists below

from google.colab import files

print("Downloading results...\n")

# Download main outputs
for f in [OUTPUT_XLSX, OUTPUT_DOCX]:
    if os.path.exists(f):
        print(f"  {f}")
        files.download(f)

# Download audit files
audit_files = glob.glob("output/*.audit_*.json")
if audit_files:
    print(f"\n  + {len(audit_files)} audit file(s)")
    for f in audit_files:
        files.download(f)

print("\nDone!")

---
## Troubleshooting

**API Keys:** Add `OPENAI_API_KEY` and `GOOGLE_API_KEY` under Secrets in the Colab sidebar (key icon), enable notebook access for each, then re-run Cell 1.

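**No PDFs found:** Check the `PDF_FOLDER` path in Cell 2. It should point to a folder on the mounted Google Drive, and Cell 2 only matches files ending in lowercase `.pdf` directly inside that folder (subfolders are not searched).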
- Example: `/content/drive/MyDrive/my_papers`
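
If the path looks right but nothing is found, a stricter check can help. The sketch below is an illustrative alternative to the `glob` call in Cell 2, not part of PaperChecker: `find_pdfs` is a hypothetical helper that fails loudly when the folder does not exist, matches the extension case-insensitively (so `.PDF` is included), and can optionally search subfolders.

```python
from pathlib import Path


def find_pdfs(folder: str, recursive: bool = False) -> list[str]:
    """List PDF files in `folder`, ignoring the case of the extension.

    Hypothetical helper for illustration; Cell 2 uses glob("*.pdf") instead.
    """
    root = Path(folder)
    if not root.is_dir():
        # Usually a typo in the path or a Drive that is not mounted.
        raise FileNotFoundError(f"Folder does not exist: {folder}")
    # rglob walks subfolders; iterdir looks at the top level only.
    candidates = root.rglob("*") if recursive else root.iterdir()
    return sorted(
        str(p) for p in candidates
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

For example, `pdf_paths = find_pdfs(PDF_FOLDER, recursive=True)` would produce a list that Cell 3 can use in place of the one built in Cell 2.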

**Model errors:** The default models are `gpt-5.2` and `gemini-3-pro-preview`. To use different models, edit lines 58-59 of `script.py`.

**Update code:** Run `!cd /content/paperchecker && git pull` to get the latest version, then re-run Cell 3 so that `script` is reloaded.
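
Python caches imported modules, so a `git pull` alone does not change code the runtime has already imported; Cell 3 handles this with `importlib.reload(script)`. The effect can be illustrated with a throwaway module. This is a minimal sketch: `demo_mod` and `demo_reload` are hypothetical names, and the file rewrite stands in for what `git pull` does to `script.py`.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo_reload() -> tuple[int, int]:
    """Import a throwaway module, change it on disk, then reload it."""
    old_flag = sys.dont_write_bytecode
    sys.dont_write_bytecode = True  # keep the demo free of cached .pyc files
    with tempfile.TemporaryDirectory() as tmp:
        module_file = Path(tmp) / "demo_mod.py"  # hypothetical stand-in for script.py
        module_file.write_text("VERSION = 1\n")
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            import demo_mod
            before = demo_mod.VERSION
            # Simulate an update arriving on disk (as git pull would do).
            module_file.write_text("VERSION = 20\n")
            importlib.reload(demo_mod)  # re-executes the updated source
            after = demo_mod.VERSION
        finally:
            # Clean up so the demo leaves no trace in the session.
            sys.path.remove(tmp)
            sys.modules.pop("demo_mod", None)
            sys.dont_write_bytecode = old_flag
    return before, after


print(demo_reload())
```

Without the `importlib.reload` call, `demo_mod.VERSION` would keep its first value until the runtime is restarted.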