# PaperChecker - Medical Research Paper Extraction

Automated pipeline for extracting structured data from medical research PDFs.

**Quick Start:**
1. Run **Cell 1** (Setup) - once per session
2. Run **Cell 2** (Load ZIP) - upload the PaperChecker code ZIP
3. Run **Cell 3** (Run) - set `PDF_FOLDER`, then run the extraction
4. Run **Cell 4** (Download) - download the Excel, Word, and audit files

In [None]:
#@title **1. SETUP** (run once per session)
#@markdown Installs dependencies, mounts Google Drive, loads API keys.

# Install dependencies
!pip install -U openai google-genai pymupdf python-docx openpyxl jsonschema -q

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Load API keys from Colab Secrets
from google.colab import userdata

def _get_secret(name):
    # userdata.get raises if the secret is missing or notebook access is off
    try:
        return userdata.get(name)
    except Exception:
        return None

OPENAI_API_KEY = _get_secret('OPENAI_API_KEY')
GOOGLE_API_KEY = _get_secret('GOOGLE_API_KEY')

if OPENAI_API_KEY and GOOGLE_API_KEY:
    print("[OK] Setup complete! API keys loaded.")
else:
    print("[!] WARNING: Add OPENAI_API_KEY and GOOGLE_API_KEY to Colab Secrets!")

In [None]:
#@title **2. LOAD ZIP** (run when code changes)
#@markdown Upload and extract the PaperChecker ZIP file.

from google.colab import files
import os, shutil, zipfile

project_dir = "/content/paperchecker"

# Clean previous install
if os.path.exists(project_dir):
    shutil.rmtree(project_dir)

print("Upload the PaperChecker ZIP file...")
uploaded = files.upload()

if not uploaded:
    raise RuntimeError("No file uploaded!")

zip_path = next(iter(uploaded.keys()))

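with zipfile.ZipFile(zip_path, "r") as zf:
    # Skip macOS metadata so it does not hide a single wrapper folder
    names = [n for n in zf.namelist() if n.strip() and not n.startswith("__MACOSX/")]
    top_levels = {n.split("/")[0] for n in names}
    # Nested: everything sits under one top-level folder (e.g. a GitHub ZIP)
    nested = len(top_levels) == 1 and all("/" in n for n in names)
    # Flat ZIPs are extracted straight into project_dir so the %cd below works
    zf.extractall("/content" if nested else project_dir, members=names)

# Handle nested folder from GitHub ZIP
if nested:
    extracted = os.path.join("/content", next(iter(top_levels)))
    if extracted != project_dir:
        os.rename(extracted, project_dir)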

%cd /content/paperchecker
print("[OK] Code loaded!")

In [None]:
#@title **3. RUN** (main cell - re-run for new PDFs)
#@markdown Set your PDF folder path and run extraction.

PDF_FOLDER = "/content/drive/MyDrive/paperchecker/pdf"  #@param {type:"string"}

import os, glob
from datetime import datetime, timezone
UTC = timezone.utc  # datetime.UTC only exists on Python 3.11+

# Reload script module (picks up any code changes)
import importlib
import script
importlib.reload(script)

# Find PDFs
pdf_paths = sorted(glob.glob(os.path.join(PDF_FOLDER, "*.pdf")))
print(f"Found {len(pdf_paths)} PDF(s) in {PDF_FOLDER}")
for i, p in enumerate(pdf_paths, 1):
    print(f"  {i}. {os.path.basename(p)}")

if not pdf_paths:
    raise RuntimeError("No PDFs found! Check PDF_FOLDER path.")

# Setup output paths
os.makedirs("output", exist_ok=True)
timestamp = datetime.now(UTC).strftime('%Y%m%d_%H%M%S')
OUTPUT_XLSX = f"output/mronj_extraction_{timestamp}.xlsx"
OUTPUT_DOCX = f"output/mronj_review_log_{timestamp}.docx"

print(f"\nOutput: {OUTPUT_XLSX}")
print("="*50)

# Run pipeline (template auto-generated from EXCEL_MAP)
results = script.run_pipeline(
    pdf_paths=pdf_paths,
    out_xlsx=OUTPUT_XLSX,
    out_docx=OUTPUT_DOCX,
    openai_api_key=OPENAI_API_KEY,
    google_api_key=GOOGLE_API_KEY,
    progress_fn=print,
)

print("\n" + "="*50)
print(f"DONE! Processed {len(results)} paper(s)")
for i, r in enumerate(results, 1):
    pid = r.get('paper_id', {})
    print(f"  {i}. PMID={pid.get('pmid')} | {r.get('study_type')} | review={r.get('validation',{}).get('needs_human_review')}")

In [None]:
#@title **4. DOWNLOAD** (get results)
#@markdown Download Excel, Word, and audit files.

from google.colab import files
import glob, os  # os is needed for os.path.exists below

print("Downloading results...\n")

for f in [OUTPUT_XLSX, OUTPUT_DOCX] + glob.glob("output/*.audit_*.json"):
    if os.path.exists(f):
        print(f"  {f}")
        files.download(f)

print("\nDone!")

---
## Troubleshooting

**API Keys:** Add `OPENAI_API_KEY` and `GOOGLE_API_KEY` in Colab sidebar > Secrets, enable notebook access for both, then re-run Cell 1
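
To find out which of the two secrets is actually missing, a small helper along these lines may help. This is a minimal sketch, not part of PaperChecker: `load_secrets` is a hypothetical name, and it takes the lookup function as a parameter so that in Colab you would pass `userdata.get`. It assumes a lookup that raises or returns an empty value means the secret is unavailable.

```python
from typing import Callable, Iterable, Optional


def load_secrets(
    get: Callable[[str], Optional[str]], names: Iterable[str]
) -> tuple[dict[str, str], list[str]]:
    """Return (found, missing) for the requested secret names.

    `get` is any lookup function; in Colab, pass `userdata.get`.
    A lookup that raises or returns an empty value counts as missing.
    """
    found: dict[str, str] = {}
    missing: list[str] = []
    for name in names:
        try:
            value = get(name)
        except Exception:  # e.g. secret not defined, or notebook access not granted
            value = None
        if value:
            found[name] = value
        else:
            missing.append(name)
    return found, missing


# Stand-in store for demonstration only; replace the getter with `userdata.get` in Colab.
fake_store = {"OPENAI_API_KEY": "sk-example"}
found, missing = load_secrets(fake_store.__getitem__, ["OPENAI_API_KEY", "GOOGLE_API_KEY"])
print("Missing secrets:", missing)
```

Passing the getter in keeps the helper runnable outside Colab, which makes it easy to try before wiring it into Cell 1.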

**No PDFs:** Check that `PDF_FOLDER` points to a folder containing `.pdf` files (e.g., `/content/drive/MyDrive/your_folder`) and that Google Drive was mounted in Cell 1
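
One possible cause on a Linux runtime such as Colab's: the `*.pdf` glob pattern is case-sensitive, so files ending in `.PDF` are skipped. The sketch below lists PDFs regardless of extension case and fails early when the folder itself does not exist. `find_pdfs` is a hypothetical helper for diagnosis, not a function in `script.py`.

```python
import tempfile
from pathlib import Path


def find_pdfs(folder: str) -> list[str]:
    """Return sorted paths of files in `folder` with a .pdf extension in any case."""
    root = Path(folder)
    if not root.is_dir():
        raise FileNotFoundError(f"Folder not found: {folder}")
    return sorted(
        str(p) for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Demonstration on a temporary folder; in the notebook, call find_pdfs(PDF_FOLDER).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "B.PDF", "notes.txt"):
        (Path(tmp) / name).touch()
    demo = [Path(p).name for p in find_pdfs(tmp)]

print(demo)
```

If this finds files that Cell 3 does not, renaming them to a lowercase `.pdf` extension is the simplest fix.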

**Model errors:** To change models, edit lines 58-59 of `script.py`, then re-run Cell 3, which reloads the module