# Dataset Extraction from Physics Papers

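This notebook extracts dataset metadata (names, official paths, DOIs, event counts, sizes) from PDF physics papers using the Anthropic Claude API and saves the results to a CSV file.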

## Requirements
```
pip install anthropic pdfplumber pandas python-dotenv openpyxl
```

## 1. Configuration

Edit these settings for your needs:

In [1]:
# ============================================================
# CONFIGURATION - Edit these values for your setup
# ============================================================
CONFIG = {
    "input_folder": r"C:/Users/ejren/OneDrive/DPOA_papers",
    "output_file": "dataset_info.csv",
    "model": "claude-sonnet-4-20250514",
    "max_pages": None,  # Set to a number to limit pages per PDF, None for all pages
    "verbose": True  # Set to False for less logging
}
# ============================================================
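A quick sanity check on these settings can catch typos before any API calls are made. The sketch below is an illustration only: `validate_config` is a hypothetical helper, not part of the notebook, and the rules it enforces (non-empty strings, `max_pages` either `None` or a positive integer) are assumptions about what counts as valid.

```python
from typing import Any


def validate_config(config: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper: the rules below are assumptions, not requirements
    imposed by the notebook itself.
    """
    problems = []

    # These keys are read by later cells, so they must be non-empty strings.
    for key in ("input_folder", "output_file", "model"):
        value = config.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{key} must be a non-empty string")

    # max_pages is either None (all pages) or a positive integer.
    max_pages = config.get("max_pages")
    if max_pages is not None:
        if isinstance(max_pages, bool) or not isinstance(max_pages, int) or max_pages < 1:
            problems.append("max_pages must be None or a positive integer")

    if not isinstance(config.get("verbose"), bool):
        problems.append("verbose must be True or False")

    return problems


example = {
    "input_folder": "papers",
    "output_file": "dataset_info.csv",
    "model": "claude-sonnet-4-20250514",
    "max_pages": 0,  # deliberately invalid for the demonstration
    "verbose": True,
}
print(validate_config(example))
```

Returning a list of problems rather than raising on the first one lets a single run report every misconfigured key.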

## 2. Import Libraries

In [None]:
import os
import json
import logging
from pathlib import Path
from typing import Optional
import csv
import re

import anthropic
import pdfplumber
import pandas as pd
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

print("✓ Libraries imported successfully")

## 3. Setup Logging

In [None]:
# Configure logging
log_level = logging.DEBUG if CONFIG["verbose"] else logging.INFO
logging.basicConfig(
    level=log_level,
    format='%(asctime)s - %(levelname)s - %(message)s',
    force=True  # This ensures logging reconfigures in Jupyter
)
logger = logging.getLogger(__name__)

print("✓ Logging configured")

## 4. Define Extraction Prompt

In [None]:
# Prompt template for Claude to extract dataset information
EXTRACTION_PROMPT = """You are a scientific data extraction assistant specializing in high-energy physics papers.

Analyze the following physics paper text and extract ALL dataset information mentioned. Focus on:
- CMS Open Data datasets
- Monte Carlo simulation samples
- Real collision data samples
- Any datasets with DOIs or official citation paths

For EACH dataset found, extract these fields (use "null" if not available):
1. dataset_name: The name or identifier of the dataset
2. dataset_type: "Real Data" or "Simulated MC" 
3. official_path: The official citation path (e.g., /Jet/Run2010B-Apr21ReReco-v1/AOD)
4. events_total: Total number of events in the dataset
5. events_used: Number of events actually used in the analysis
6. collision_energy_tev: Center-of-mass energy in TeV
7. generator: MC generator used (e.g., Pythia, Madgraph) or "N/A (Real Data)"
8. doi: The DOI identifier (just the DOI, not the full URL)
9. size_bytes: Dataset size in bytes (convert from TB, GB, MB if needed: 1TB=1e12, 1GB=1e9, 1MB=1e6). If multiple sizes given (e.g., raw vs compressed), use the original/raw size.
10. notes: Any other important details (luminosity, run period, selection criteria, etc.)

Return your response as a JSON array of objects. Each object represents one dataset.
If the paper mentions multiple pT bins or variants of the same base dataset, list each separately.

Example output format:
[
    {
        "dataset_name": "Jet Primary Dataset",
        "dataset_type": "Real Data",
        "official_path": "/Jet/Run2010B-Apr21ReReco-v1/AOD",
        "events_total": "20022826",
        "events_used": "768687",
        "collision_energy_tev": "7",
        "generator": "N/A (Real Data)",
        "doi": "10.7483/OPENDATA.CMS.3S7F.2E9W",
        "size_bytes": "2000000000000",
        "notes": "Run 2010B, 31.8 pb-1 integrated luminosity, 2.0 TB original size"
    }
]

CRITICAL: Return ONLY the JSON array, no other text, no markdown formatting, no code blocks, no explanation. Just the raw JSON array starting with [ and ending with ].

Paper text to analyze:
---
{paper_text}
---
"""

print("✓ Extraction prompt defined")
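One detail worth checking when filling this template: it contains literal `{` and `}` characters in the JSON example, and `str.format` interprets every brace pair as a replacement field. The sketch below uses a shortened stand-in template (`TEMPLATE` and `fill_prompt` are illustrative names, not from the notebook) to show the failure and one way around it, a plain `str.replace` on the placeholder.

```python
# Shortened stand-in for a prompt that mixes a JSON example with a placeholder.
TEMPLATE = """Example output format:
[
    {
        "dataset_name": "Jet Primary Dataset"
    }
]

Paper text to analyze:
---
{paper_text}
---
"""


def fill_prompt(template: str, paper_text: str) -> str:
    """Substitute the placeholder without interpreting any other braces."""
    return template.replace("{paper_text}", paper_text)


# str.format treats the JSON braces as replacement fields and fails.
try:
    TEMPLATE.format(paper_text="some text")
    format_failed = False
except (KeyError, ValueError, IndexError):
    format_failed = True

print("str.format failed:", format_failed)
print(fill_prompt(TEMPLATE, "some text"))
```

Doubling the braces in the template (`{{` and `}}`) would also work with `str.format`, at the cost of making the JSON example harder to read and edit.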

## 5. Define Helper Functions

In [None]:
def extract_text_from_pdf(pdf_path: str, max_pages: Optional[int] = None) -> str:
    """
    Extract text content from a PDF file.
    
    Args:
        pdf_path: Path to the PDF file
        max_pages: Maximum number of pages to extract (None for all)
    
    Returns:
        Extracted text as a string
    """
    logger.info(f"Extracting text from: {pdf_path}")
    
    text_parts = []
    
    try:
        with pdfplumber.open(pdf_path) as pdf:
            pages_to_process = pdf.pages[:max_pages] if max_pages else pdf.pages
            
            for i, page in enumerate(pages_to_process):
                page_text = page.extract_text()
                if page_text:
                    text_parts.append(f"--- Page {i+1} ---\n{page_text}")
                    
                # Also try to extract tables as they often contain dataset info
                tables = page.extract_tables()
                for j, table in enumerate(tables):
                    if table:
                        table_text = "\n".join(["\t".join([str(cell) if cell else "" for cell in row]) for row in table])
                        text_parts.append(f"--- Table {j+1} on Page {i+1} ---\n{table_text}")
        
        full_text = "\n\n".join(text_parts)
        logger.info(f"Extracted {len(full_text)} characters from {len(pages_to_process)} pages")
        return full_text
        
    except Exception as e:
        logger.error(f"Error extracting text from {pdf_path}: {e}")
        raise


def clean_json_response(response_text: str) -> str:
    """
    Clean up Claude's response to extract valid JSON.
    
    Args:
        response_text: Raw response from Claude
        
    Returns:
        Cleaned JSON string
    """
    # Remove any leading/trailing whitespace
    response_text = response_text.strip()
    
    # Remove markdown code blocks (```json ... ``` or ``` ... ```)
    if response_text.startswith("```"):
        lines = response_text.split("\n")
        # Remove first line (```json or ```)
        lines = lines[1:]
        # Remove last line if it's ```
        if lines and lines[-1].strip() == "```":
            lines = lines[:-1]
        response_text = "\n".join(lines).strip()
    
    # Try to extract JSON array if there's surrounding text
    if not response_text.startswith("["):
        # Use regex to find JSON array
        match = re.search(r'\[.*\]', response_text, re.DOTALL)
        if match:
            response_text = match.group(0)
    
    return response_text


print("✓ Helper functions defined")
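The regex fallback in `clean_json_response` is greedy: it spans from the first `[` to the last `]`, so it can over-capture when the reply contains another bracket after the array. As an alternative sketch (not the notebook's method), the standard library's `json.JSONDecoder.raw_decode` can attempt a parse at each `[` and ignore whatever follows the array. `find_json_array` is a hypothetical name.

```python
import json
from typing import Optional


def find_json_array(text: str) -> Optional[list]:
    """Return the first JSON array embedded in text, or None if there is none.

    Tries raw_decode at every '[' so that prose before or after the
    array (including stray brackets) does not break parsing.
    """
    decoder = json.JSONDecoder()
    start = text.find("[")
    while start != -1:
        try:
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, list):
            return value
        start = text.find("[", start + 1)
    return None


reply = 'Here you go:\n```json\n[{"doi": "10.7483/OPENDATA.CMS.3S7F.2E9W"}]\n```\nSee note [1].'
print(find_json_array(reply))
```

Because the decoder stops at the end of the first complete array, the trailing fence and the `[1]` in the closing remark are never parsed.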

## 6. Claude API Extraction Function

In [None]:
def extract_datasets_with_claude(
    paper_text: str, 
    paper_name: str,
    api_key: Optional[str] = None,
    model: str = "claude-sonnet-4-20250514"
) -> list[dict]:
    """
    Use Claude API to extract dataset information from paper text.
    
    Args:
        paper_text: The extracted text from the paper
        paper_name: Name of the paper (for logging and output)
        api_key: Anthropic API key (uses env var if not provided)
        model: Claude model to use
    
    Returns:
        List of dictionaries containing dataset information
    """
    logger.info(f"Sending paper to Claude for analysis: {paper_name}")
    
    # Initialize client
    try:
        client = anthropic.Anthropic(api_key=api_key) if api_key else anthropic.Anthropic()
    except Exception as e:
        logger.error(f"Failed to initialize Anthropic client: {e}")
        logger.error("Make sure ANTHROPIC_API_KEY is set in your environment or .env file")
        raise
    
    # Truncate text if too long (keeping first and last parts for context)
    max_chars = 150000  # Leave room for prompt and response
    if len(paper_text) > max_chars:
        half = max_chars // 2
        paper_text = paper_text[:half] + "\n\n[... middle section truncated ...]\n\n" + paper_text[-half:]
        logger.warning(f"Paper text truncated to {max_chars} characters")
    
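    # The template contains literal JSON braces, so str.format() would raise
    # KeyError on them; substitute the placeholder directly instead.
    prompt = EXTRACTION_PROMPT.replace("{paper_text}", paper_text)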
    
    try:
        message = client.messages.create(
            model=model,
            max_tokens=4096,
            messages=[
                {"role": "user", "content": prompt}
            ]
        )
        
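        response_text = message.content[0].text

        # A reply cut off at max_tokens is usually incomplete JSON; flag it early
        if message.stop_reason == "max_tokens":
            logger.warning(f"Response for {paper_name} hit max_tokens and may be truncated")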
        
        # Log the raw response for debugging (first 500 chars)
        logger.debug(f"Raw Claude response (first 500 chars): {response_text[:500]}")
        
        # Clean up the response
        cleaned_response = clean_json_response(response_text)
        
        logger.debug(f"Cleaned response (first 500 chars): {cleaned_response[:500]}")
        
        # Parse JSON
        datasets = json.loads(cleaned_response)
        
        # Validate it's a list
        if not isinstance(datasets, list):
            logger.error(f"Expected JSON array, got {type(datasets)}")
            logger.error(f"Response: {cleaned_response[:1000]}")
            return []
        
        # Add paper name to each dataset
        for dataset in datasets:
            if isinstance(dataset, dict):
                dataset["paper"] = paper_name
            else:
                logger.warning(f"Unexpected dataset format: {dataset}")
        
        logger.info(f"Successfully extracted {len(datasets)} datasets from {paper_name}")
        return datasets
        
    except json.JSONDecodeError as e:
        logger.error(f"Failed to parse Claude response as JSON: {e}")
        logger.error(f"Problematic response (first 1000 chars): {cleaned_response[:1000]}")
        
        # Save the problematic response to a file for inspection
        error_file = f"error_response_{paper_name}.txt"
        try:
            with open(error_file, "w", encoding="utf-8") as f:
                f.write(f"Original response:\n{response_text}\n\n")
                f.write(f"Cleaned response:\n{cleaned_response}\n\n")
                f.write(f"Error: {e}\n")
            logger.error(f"Full response saved to {error_file} for debugging")
        except Exception as write_error:
            logger.error(f"Could not save error file: {write_error}")
        
        return []
        
    except Exception as e:
        logger.error(f"Error calling Claude API: {e}")
        return []


print("✓ Claude extraction function defined")
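The truncation step in `extract_datasets_with_claude` keeps the head and tail of a long paper and drops the middle. Isolated as a standalone function it is easier to test. In this sketch `truncate_middle` and its `marker` parameter are illustrative names, and the character budget is passed in rather than fixed at 150000.

```python
def truncate_middle(
    text: str,
    max_chars: int,
    marker: str = "\n\n[... middle section truncated ...]\n\n",
) -> str:
    """Keep the first and last max_chars // 2 characters of an over-long text.

    Texts at or under the budget are returned unchanged. The marker is
    added on top of the budget, mirroring the notebook's behaviour.
    """
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    if half == 0:
        # text[-0:] would return the whole string, so handle this case explicitly.
        return marker
    return text[:half] + marker + text[-half:]


sample = "A" * 50 + "B" * 50 + "C" * 50
short = truncate_middle(sample, max_chars=20)
print(short)
```

Note that the budget counts characters, not tokens, so it only approximates how much of the model's context window the paper will occupy.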

## 7. Main Processing Function

In [None]:
def process_papers_folder(
    input_folder: str,
    output_file: str,
    api_key: Optional[str] = None,
    model: str = "claude-sonnet-4-20250514",
    max_pages: Optional[int] = None
) -> pd.DataFrame:
    """
    Process all PDF papers in a folder and extract dataset information.
    
    Args:
        input_folder: Path to folder containing PDF papers
        output_file: Path for output CSV file
        api_key: Anthropic API key
        model: Claude model to use
        max_pages: Maximum pages to process per PDF
    
    Returns:
        DataFrame with all extracted datasets
    """
    input_path = Path(input_folder)
    
    if not input_path.exists():
        raise FileNotFoundError(f"Input folder not found: {input_folder}")
    
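    # Find all PDF files. Compare suffixes case-insensitively so that a file
    # is not listed twice on case-insensitive filesystems such as Windows.
    pdf_files = sorted(
        p for p in input_path.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )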
    
    if not pdf_files:
        logger.warning(f"No PDF files found in {input_folder}")
        return pd.DataFrame()
    
    logger.info(f"Found {len(pdf_files)} PDF files to process")
    
    all_datasets = []
    successful_papers = 0
    failed_papers = 0
    
    for pdf_file in pdf_files:
        try:
            logger.info(f"\n{'='*60}")
            logger.info(f"Processing: {pdf_file.name}")
            logger.info(f"{'='*60}")
            
            # Extract text from PDF
            paper_text = extract_text_from_pdf(str(pdf_file), max_pages)
            
            # Get paper name (filename without extension)
            paper_name = pdf_file.stem
            
            # Extract datasets using Claude
            datasets = extract_datasets_with_claude(
                paper_text=paper_text,
                paper_name=paper_name,
                api_key=api_key,
                model=model
            )
            
            if datasets:
                all_datasets.extend(datasets)
                successful_papers += 1
            else:
                logger.warning(f"No datasets extracted from {pdf_file.name}")
                failed_papers += 1
            
        except Exception as e:
            logger.error(f"Failed to process {pdf_file.name}: {e}")
            failed_papers += 1
            continue
    
    logger.info(f"\n{'='*60}")
    logger.info("Processing Summary:")
    logger.info(f"  Successfully processed: {successful_papers} papers")
    logger.info(f"  Failed: {failed_papers} papers")
    logger.info(f"  Total datasets extracted: {len(all_datasets)}")
    logger.info(f"{'='*60}\n")
    
    if not all_datasets:
        logger.warning("No datasets extracted from any papers")
        return pd.DataFrame()
    
    # Create DataFrame with consistent column order
    columns = [
        "paper",
        "dataset_name", 
        "dataset_type",
        "official_path",
        "events_total",
        "events_used",
        "collision_energy_tev",
        "generator",
        "doi",
        "size_bytes",
        "notes"
    ]
    
    df = pd.DataFrame(all_datasets)
    
    # Ensure all columns exist
    for col in columns:
        if col not in df.columns:
            df[col] = "N/A"
    
    # Reorder columns
    df = df[columns]
    
    # Save to CSV
    df.to_csv(output_file, index=False, quoting=csv.QUOTE_ALL)
    logger.info(f"Saved {len(df)} datasets to {output_file}")
    
    return df


print("✓ Main processing function defined")
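The column handling at the end of `process_papers_folder` (fixed order, a filler for missing columns, extra keys dropped) can also be expressed with the standard library alone. The sketch below uses `csv.DictWriter` on a shortened column list. `rows_to_csv` and `COLUMNS` are hypothetical names. It is similar in spirit but not identical: pandas writes `"N/A"` only when a column is absent from every row, and per-row gaps become NaN, whereas `restval` fills every missing cell.

```python
import csv
import io

COLUMNS = ["paper", "dataset_name", "doi"]  # shortened column list for the example


def rows_to_csv(rows: list[dict], columns: list[str]) -> str:
    """Serialize rows with a fixed column order, quoting every field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        restval="N/A",           # value for keys absent from a row
        extrasaction="ignore",   # silently skip keys not in `columns`
        quoting=csv.QUOTE_ALL,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"paper": "paper_a", "dataset_name": "Jet Primary Dataset", "extra": "dropped"},
    {"paper": "paper_b", "doi": "10.7483/OPENDATA.CMS.3S7F.2E9W"},
]
print(rows_to_csv(rows, COLUMNS))
```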

## 8. Check API Key

In [None]:
# Check for API key
api_key = os.environ.get("ANTHROPIC_API_KEY")

if not api_key:
    print("=" * 60)
    print("⚠️  WARNING: No API key found!")
    print("=" * 60)
    print("Please set your ANTHROPIC_API_KEY in one of these ways:")
    print("1. Create a .env file with: ANTHROPIC_API_KEY=your-key-here")
    print("2. Run this in a cell: os.environ['ANTHROPIC_API_KEY'] = 'your-key-here'")
    print("3. Set it as a system environment variable")
    print("=" * 60)
else:
    print("✓ API key found")
    # Show only the last 4 characters so the key is not leaked via saved cell output
    print(f"   Key ends with: ...{api_key[-4:]}")
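Notebook outputs are saved with the file, so any key material printed here travels with the notebook when it is shared. A small masking helper keeps the confirmation message useful without exposing much of the secret. `mask_secret` is a hypothetical helper and the sample string is not a real key.

```python
def mask_secret(secret: str, visible: int = 4) -> str:
    """Replace all but the last `visible` characters of a secret with '*'."""
    if visible <= 0 or len(secret) <= visible * 2:
        # Too short to reveal anything safely: mask everything.
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


sample_key = "example-not-a-real-key-1234"
print(mask_secret(sample_key))
```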

## 9. Run the Extraction

Execute this cell to start processing all PDFs in your folder:

In [None]:
print("=" * 60)
print("STARTING DATASET EXTRACTION")
print("=" * 60)
print(f"Input folder: {CONFIG['input_folder']}")
print(f"Output file: {CONFIG['output_file']}")
print(f"Model: {CONFIG['model']}")
print(f"Max pages per PDF: {CONFIG['max_pages'] or 'All'}")
print("=" * 60)
print()

try:
    df = process_papers_folder(
        input_folder=CONFIG["input_folder"],
        output_file=CONFIG["output_file"],
        api_key=api_key,
        model=CONFIG["model"],
        max_pages=CONFIG["max_pages"]
    )
    
    if not df.empty:
        print()
        print("=" * 60)
        print("✅ EXTRACTION COMPLETE!")
        print("=" * 60)
        print(f"Total datasets extracted: {len(df)}")
        print(f"Papers processed: {df['paper'].nunique()}")
        print(f"Output saved to: {CONFIG['output_file']}")
        print()
        print("Dataset types found:")
        print(df['dataset_type'].value_counts().to_string())
        print()
        print("Collision energies:")
        print(df['collision_energy_tev'].value_counts().to_string())
        print()
        print("=" * 60)
    else:
        print()
        print("=" * 60)
        print("⚠️  WARNING: No datasets were extracted.")
        print("=" * 60)
        print("Check the log messages above for errors.")
        df = None
        
except Exception as e:
    print()
    print("=" * 60)
    print(f"❌ EXTRACTION FAILED: {e}")
    print("=" * 60)
    import traceback
    traceback.print_exc()
    df = None

## 10. Display Results

View the extracted datasets:

In [None]:
if df is not None and not df.empty:
    print(f"\nShowing first 10 rows of {len(df)} total datasets:\n")
    display(df.head(10))
    
    print("\n\nDataFrame info:")
    df.info()
else:
    print("No data to display. Please check the extraction step above.")

## 11. Optional: Filter or Analyze Results

You can further analyze the extracted data:

In [None]:
if df is not None and not df.empty:
    # Example: Filter for Real Data only
    real_data = df[df['dataset_type'] == 'Real Data']
    print(f"Real Data datasets: {len(real_data)}")
    
    # Example: Filter for Simulated MC
    simulated = df[df['dataset_type'] == 'Simulated MC']
    print(f"Simulated MC datasets: {len(simulated)}")
    
    # Example: Group by paper
    print("\nDatasets per paper:")
    print(df.groupby('paper').size().sort_values(ascending=False))
    
    # Example: Show datasets with DOIs
    with_doi = df[df['doi'].notna() & (df['doi'] != 'N/A') & (df['doi'] != 'null')]
    print(f"\nDatasets with DOIs: {len(with_doi)}")
else:
    print("No data to analyze.")
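The prompt's example returns every field as a string, with "null" for missing values, so numeric columns such as `events_total` and `size_bytes` need converting before they can be summed or compared. The sketch below assumes the placeholder strings "null" and "N/A" and tolerates thousands separators. `to_int_or_none` and `human_size` are hypothetical helpers.

```python
from typing import Optional

MISSING = {"", "null", "none", "n/a", "nan"}


def to_int_or_none(value) -> Optional[int]:
    """Convert a field like "20022826" or "2e12" to int; None if missing or invalid."""
    if value is None:
        return None
    text = str(value).strip().replace(",", "")
    if text.lower() in MISSING:
        return None
    try:
        return int(text)  # exact for plain integer strings
    except ValueError:
        pass
    try:
        return int(float(text))  # handles forms such as "2e12" or "2.0"
    except (ValueError, OverflowError):
        return None


def human_size(num_bytes: int) -> str:
    """Format bytes using decimal units (1 TB = 1e12), as in the prompt."""
    units = ("B", "kB", "MB", "GB", "TB")
    size = float(num_bytes)
    index = 0
    while size >= 1000 and index < len(units) - 1:
        size /= 1000
        index += 1
    return f"{size:g} {units[index]}"


for raw in ["2000000000000", "null", "N/A", "20,022,826"]:
    print(repr(raw), "->", to_int_or_none(raw))
print(human_size(2_000_000_000_000))
```

Applied to the DataFrame, this could look like `df["size_bytes"].map(to_int_or_none)`, after which pandas aggregation such as `.sum()` works on the converted column.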

## 12. Optional: Export to Excel

If you prefer Excel format:

In [None]:
if df is not None and not df.empty:
    # with_suffix swaps only the extension, even if output_file does not end in .csv
    excel_file = str(Path(CONFIG['output_file']).with_suffix('.xlsx'))
    df.to_excel(excel_file, index=False, engine='openpyxl')
    print(f"✓ Also saved to Excel: {excel_file}")
else:
    print("No data to export.")