# PaperChecker - Medical Research Paper Extraction

This notebook runs **PaperChecker v2** - an automated pipeline for extracting structured data from medical research PDFs (focused on MRONJ prevention studies).

## What it does:
- Extracts metadata, population data, drug information, interventions, and outcomes from PDFs
- Fills an Excel template with standardized data
- Verifies extracted information using a second LLM pass
- Generates a Word review log documenting all decisions

## Requirements:
- OpenAI API key
- Google AI (Gemini) API key
- PDF files to process

---
## Step 1: Clone Repository & Install Dependencies

In [None]:
# Clone the PaperChecker repository
!git clone https://github.com/maxrusse/paperchecker.git
%cd paperchecker

# Install required dependencies
!pip install -U openai google-genai pymupdf python-docx openpyxl jsonschema -q

print("\nSetup complete!")

---
## Step 2: Enter Your API Keys

Enter your API keys below. The input is not echoed to the screen, and the keys are kept in notebook variables for this session only.

In [None]:
from getpass import getpass

# Read the API keys without echoing them; strip stray whitespace left over from pasting
OPENAI_API_KEY = getpass("Enter your OpenAI API Key: ").strip()
GOOGLE_API_KEY = getpass("Enter your Google AI (Gemini) API Key: ").strip()

if OPENAI_API_KEY and GOOGLE_API_KEY:
    print("API keys saved for this session!")
else:
    print("Warning: One or both API keys are missing!")

---
## Step 3: Mount Google Drive & Set PDF Folder Path

Mount your Google Drive to access PDF files stored there.

In [None]:
from google.colab import drive
import os
import glob

# Mount Google Drive
drive.mount('/content/drive')

# ============================================================
# SET YOUR PDF FOLDER PATH HERE
# ============================================================
PDF_FOLDER = "/content/drive/MyDrive/papers"  # <-- Change this to your folder path
# ============================================================

# Find all PDFs in the folder
pdf_paths = sorted(glob.glob(os.path.join(PDF_FOLDER, "*.pdf")))

print(f"PDF Folder: {PDF_FOLDER}")
print(f"Found {len(pdf_paths)} PDF file(s):\n")

for i, p in enumerate(pdf_paths, 1):
    print(f"  {i}. {os.path.basename(p)}")

if not pdf_paths:
    print("  No PDFs found! Check your folder path.")

---
## Step 4: Configure & Run PaperChecker

Run the extraction pipeline on the PDFs found in Step 3.
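
Before starting a long run, it can help to confirm that the inputs exist. The following is a minimal sketch, not part of PaperChecker itself: `preflight_problems` is a hypothetical helper name, and it only checks that the PDF paths and the Excel template are present on disk.

```python
import os


def preflight_problems(pdf_paths, template_xlsx):
    """Return a list of human-readable problems; an empty list means the inputs look usable."""
    problems = []
    if not pdf_paths:
        problems.append("No PDF paths were provided.")
    for path in pdf_paths:
        if not os.path.isfile(path):
            problems.append(f"PDF not found: {path}")
    if not os.path.isfile(template_xlsx):
        problems.append(f"Template not found: {template_xlsx}")
    return problems
```

In this notebook it could be called as `preflight_problems(pdf_paths, TEMPLATE_XLSX)` once the configuration values in the cell below are defined; any returned message points to something to fix before calling `script.run_pipeline`.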

In [None]:
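from datetime import datetime, timezone

# Import the PaperChecker script
import script

# Configuration
TEMPLATE_XLSX = "Prevention of MRONJ_Extraction Sheet (Oli).xlsx"
# timezone.utc also works on Python versions older than 3.11, where datetime.UTC is unavailable
timestamp = datetime.now(timezone.utc).strftime('%Y%m%d_%H%M%S')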
OUTPUT_XLSX = f"output/mronj_extraction_{timestamp}.xlsx"
OUTPUT_DOCX = f"output/mronj_review_log_{timestamp}.docx"

# Create output directory
os.makedirs("output", exist_ok=True)

print("="*60)
print("PaperChecker v2 - Starting Pipeline")
print("="*60)
print(f"PDFs to process: {len(pdf_paths)}")
print(f"Template: {TEMPLATE_XLSX}")
print(f"Output Excel: {OUTPUT_XLSX}")
print(f"Output Word: {OUTPUT_DOCX}")
print("="*60 + "\n")

# Run the pipeline
try:
    results = script.run_pipeline(
        pdf_paths=pdf_paths,
        template_xlsx=TEMPLATE_XLSX,
        out_xlsx=OUTPUT_XLSX,
        out_docx=OUTPUT_DOCX,
        openai_api_key=OPENAI_API_KEY,
        google_api_key=GOOGLE_API_KEY,
        progress_fn=print,
        use_gemini_driver=False,    # Use OpenAI for extraction
        use_openai_verifier=False,  # Use Gemini for verification
    )
    
    print("\n" + "="*60)
    print("Pipeline completed successfully!")
    print("="*60)
    print(f"\nProcessed {len(results)} paper(s)")
    
    # Summary
    for i, r in enumerate(results):
        pid = r.get('paper_id', {})
        val = r.get('validation', {})
        print(f"\n  Paper {i+1}:")
        print(f"    PMID: {pid.get('pmid', 'N/A')}")
        print(f"    Study Type: {r.get('study_type', 'N/A')}")
        print(f"    Needs Human Review: {val.get('needs_human_review', 'N/A')}")

except Exception as e:
    print(f"\nError: {e}")
    raise
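
The summary above prints every paper. To list only the ones that need attention, a small filter over `results` may be enough. This is a sketch that assumes each result is a dict shaped as the summary loop reads it (`paper_id.pmid`, `validation.needs_human_review`); `papers_needing_review` is a hypothetical helper, and treating a missing flag as "needs review" is a cautious assumption rather than PaperChecker's documented behavior.

```python
def papers_needing_review(results):
    """Return (position, pmid) pairs for results that are flagged or carry no flag."""
    flagged = []
    for position, result in enumerate(results, 1):
        validation = result.get("validation") or {}
        paper_id = result.get("paper_id") or {}
        # Assumption: a missing flag is treated as "needs review" to stay on the safe side.
        if validation.get("needs_human_review", True):
            flagged.append((position, paper_id.get("pmid", "N/A")))
    return flagged
```

The positions are 1-based so they match the "Paper N" labels printed by the summary.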

---
## Step 5: Download Results

Download the generated Excel and Word files.
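
Browsers sometimes block several downloads triggered in a row. One workaround is to bundle the whole `output/` directory into a single archive first. The sketch below uses the standard library's `shutil.make_archive`; `zip_outputs` and the archive name `paperchecker_output` are hypothetical names chosen for illustration.

```python
import shutil


def zip_outputs(output_dir="output", archive_base="paperchecker_output"):
    """Pack every file in output_dir into archive_base.zip and return the archive path."""
    # Keep archive_base outside output_dir so the archive does not try to include itself.
    return shutil.make_archive(archive_base, "zip", root_dir=output_dir)
```

In Colab, `files.download(zip_outputs())` would then fetch one file containing the Excel sheet, the Word log, and any audit JSON files.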

In [None]:
from google.colab import files
import glob

print("Downloading output files...\n")

# Download Excel file
if os.path.exists(OUTPUT_XLSX):
    print(f"  Downloading: {OUTPUT_XLSX}")
    files.download(OUTPUT_XLSX)

# Download Word file
if os.path.exists(OUTPUT_DOCX):
    print(f"  Downloading: {OUTPUT_DOCX}")
    files.download(OUTPUT_DOCX)

# Download audit JSON files
audit_files = glob.glob("output/*.audit_*.json")
for audit_file in audit_files:
    print(f"  Downloading: {audit_file}")
    files.download(audit_file)

print("\nDownload complete!")

---
## Optional: View Output Files in Colab

In [None]:
# List all generated output files
import os

print("Generated files in output/ directory:\n")
for f in sorted(os.listdir("output")):
    filepath = os.path.join("output", f)
    size_kb = os.path.getsize(filepath) / 1024
    print(f"  {f} ({size_kb:.1f} KB)")

---
## Troubleshooting

**API Key Errors:**
- Make sure you have valid API keys for both OpenAI and Google AI
- OpenAI: https://platform.openai.com/api-keys
- Google AI: https://makersuite.google.com/app/apikey
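
Many key errors come from a broken paste rather than an invalid key. The sketch below checks only for that: it does not contact either API and cannot tell whether a key is actually valid. `key_problems` is a hypothetical helper name.

```python
def key_problems(label, key):
    """Return a list of paste-related problems found in an API key string."""
    problems = []
    if not key:
        problems.append(f"{label}: key is empty.")
    elif key != key.strip():
        problems.append(f"{label}: key has leading or trailing whitespace.")
    elif any(ch.isspace() for ch in key):
        problems.append(f"{label}: key contains whitespace, which suggests a broken paste.")
    return problems
```

For example, `key_problems("OpenAI", OPENAI_API_KEY) + key_problems("Google AI", GOOGLE_API_KEY)` returns an empty list when neither key shows these symptoms.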

**Model Access:**
- The default models are `gpt-5.2` and `gemini-3-pro-preview`
- If you don't have access, edit `script.py` lines 47-48 to use models you have access to
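
To see what those lines currently contain before editing them, a few lines of Python can print a range of a file. This is a generic sketch; `show_lines` is a hypothetical helper, and the line numbers 47-48 are taken from the note above and may shift if `script.py` changes.

```python
def show_lines(path, first, last):
    """Return the 1-indexed lines first..last of a text file as (number, text) pairs."""
    with open(path, encoding="utf-8") as handle:
        lines = handle.read().splitlines()
    # Clamp the upper bound so a short file does not raise an IndexError.
    return [(n, lines[n - 1]) for n in range(first, min(last, len(lines)) + 1)]
```

Usage inside the cloned repository: `for n, text in show_lines("script.py", 47, 48): print(n, text)`.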

**No PDFs Found:**
- Check that `PDF_FOLDER` path is correct
- Path should be like `/content/drive/MyDrive/your_folder`
- Make sure files have `.pdf` extension
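
The `*.pdf` pattern in Step 3 is typically case-sensitive on Colab's Linux file system and does not look into subfolders, so files named `.PDF` or stored one level deeper are missed. The following sketch is one way to widen the search; `find_pdfs` is a hypothetical helper, not part of PaperChecker.

```python
from pathlib import Path


def find_pdfs(folder, recursive=False):
    """Return sorted PDF paths under folder, matching the extension case-insensitively."""
    root = Path(folder)
    if not root.is_dir():
        raise FileNotFoundError(f"Folder does not exist: {folder}")
    candidates = root.rglob("*") if recursive else root.iterdir()
    return sorted(str(p) for p in candidates if p.is_file() and p.suffix.lower() == ".pdf")
```

Assigning `pdf_paths = find_pdfs(PDF_FOLDER, recursive=True)` would replace the `glob` call, and the `FileNotFoundError` makes a mistyped folder path obvious instead of silently returning zero files.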

**Google Drive Not Mounted:**
- Re-run Step 3 and authorize access when prompted

**Processing Errors:**
- Check the error message for specific issues
- Ensure PDFs are valid and contain readable text
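
Both checks in the last bullet can be approximated in code. The sketch below is an illustration under stated assumptions: `looks_like_pdf` is a strict test for the `%PDF-` header at the very start of the file, and `count_text_chars` uses PyMuPDF (installed in Step 1) to count extractable characters. Both function names are hypothetical, and a count near zero only suggests a scanned or image-only PDF; it does not prove one.

```python
def looks_like_pdf(path):
    """Return True if the file begins with the PDF header marker."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"


def count_text_chars(path):
    """Return the number of non-whitespace characters PyMuPDF can extract from the file."""
    import fitz  # PyMuPDF; imported here so looks_like_pdf works without it

    with fitz.open(path) as doc:
        return sum(len("".join(page.get_text().split())) for page in doc)
```

Running both over `pdf_paths` before Step 4 helps separate files that are not PDFs at all from PDFs that open but contain little or no text layer.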