# Validation Dataset Construction Workflow

**How the ground truth dataset was built and filtered to Public Health reviews**

This notebook documents the complete pipeline for constructing the validation dataset used to evaluate LLM performance on systematic review screening.

---

## Pipeline Overview

The validation dataset was constructed through a **6-stage pipeline**, with each stage handled by a dedicated notebook:

| Stage | Notebook | Input | Output |
|-------|----------|-------|--------|
| 1 | `00_obtain_cochrane_abstracts` | PubMed API | 17,298 Cochrane review records |
| 2 | `02_fetch_cochrane_pdfs` | Wiley TDM API | 16,588 PDFs downloaded |
| 3 | `03_extract_metadata_and_references` | Local PDFs | 629,561 categorized references |
| 4 | `04_fetch_referenced_abstracts` | CrossRef + PubMed APIs | 47,518 matched papers |
| 5 | `05_build_ground_truth` | All above datasets | 41,692 validation records |
| 6 | `06_evaluate_llms` | Stage 5 dataset, filtered to Public Health | 4,089 records for evaluation |

---

## Stage 1: Obtain Cochrane Review Abstracts

**Notebook**: `00_obtain_cochrane_abstracts.ipynb`

**Source**: PubMed API (NCBI Entrez)

**Process**:
1. Query PubMed for all publications from the "Cochrane Database of Systematic Reviews"
2. Fetch metadata including: PMID, DOI, title, abstract, publication date, authors
3. Parse DOIs to extract Cochrane review identifiers

**Output**: `cochrane_pubmed_abstracts.csv` (17,298 reviews)
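Step 3 (parsing DOIs into review identifiers) can be sketched as follows. This is a minimal illustration, assuming Cochrane review DOIs follow the usual `10.1002/14651858.CD######[.pub#]` shape; the function name, the returned tuple, and the choice to treat a missing `.pub` suffix as version 1 are assumptions, not details taken from the notebook.

```python
import re

# Assumed DOI shape: 10.1002/14651858.CD001234 or 10.1002/14651858.CD001234.pub3
COCHRANE_DOI = re.compile(
    r"10\.1002/14651858\.(?P<review_id>[A-Z]{2}\d{6})(?:\.pub(?P<version>\d+))?",
    re.IGNORECASE,
)


def parse_cochrane_doi(doi):
    """Return (review_id, version), or None if the DOI does not look like a Cochrane review DOI."""
    match = COCHRANE_DOI.search(doi or "")
    if match is None:
        return None
    # Assumption: a DOI without a ".pubN" suffix is the first published version.
    version = int(match.group("version")) if match.group("version") else 1
    return match.group("review_id").upper(), version
```

Keeping the version separate from the review identifier makes it possible to group successive updates of the same review, should that be needed later in the pipeline.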

---

## Stage 2: Download Cochrane PDFs

**Notebook**: `02_fetch_cochrane_pdfs.ipynb`

**Source**: Wiley Text and Data Mining (TDM) API

**Process**:
1. For each Cochrane DOI, construct the Wiley TDM download URL
2. Authenticate with TDM token (requires institutional access)
3. Download full-text PDFs to local storage

**Output**: `cochrane_pdfs/` directory (16,588 PDFs, ~50GB)

**Note**: PDFs are not committed to git due to size and licensing.
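Steps 1 and 2 above amount to building a URL and an authentication header per DOI. The sketch below shows only that construction and performs no network request. The base URL and the `Wiley-TDM-Client-Token` header name are assumptions based on Wiley's public TDM documentation and should be checked against the current documentation; `build_tdm_request` is a hypothetical helper, not the notebook's own code.

```python
from urllib.parse import quote

# Assumption: endpoint and header name as described in Wiley's TDM documentation.
WILEY_TDM_BASE = "https://api.wiley.com/onlinelibrary/tdm/v1/articles/"


def build_tdm_request(doi, token):
    """Return (url, headers) for one PDF download; no network I/O happens here."""
    # The DOI contains "/", so it is percent-encoded before being appended.
    url = WILEY_TDM_BASE + quote(doi, safe="")
    headers = {"Wiley-TDM-Client-Token": token}
    return url, headers
```

The returned pair can be passed to any HTTP client; the download itself, retry handling, and rate limiting are left out of this sketch.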

---

## Stage 3: Extract Metadata and References from PDFs

**Notebook**: `03_extract_metadata_and_references.ipynb`

**Source**: Local PDF processing using PyMuPDF (fitz)

**Process**:
1. Parse each PDF to extract full text
2. Extract metadata: title, authors, abstract, **Cochrane review group**
3. Identify reference sections: "References to studies included", "References to studies excluded"
4. Parse references to extract: author, year, title, DOI (if present), PMID (if present)
5. Categorize each reference as: `included`, `excluded`, `awaiting`, or `ongoing`
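Steps 3 and 5 can be illustrated with a small line scanner that tracks which reference section it is in. This is a sketch under assumptions: the heading fragments reflect the standard Cochrane section titles, the stop headings are hypothetical additions that prevent unrelated references from inheriting a category, and real PDF text will need more robust handling of line breaks and casing.

```python
# Assumed heading fragments; actual PDF text may vary in wording and casing.
SECTION_CATEGORIES = [
    ("references to studies included", "included"),
    ("references to studies excluded", "excluded"),
    ("references to studies awaiting", "awaiting"),
    ("references to ongoing studies", "ongoing"),
]
# Hypothetical headings that end the categorized reference sections.
STOP_HEADINGS = ("additional references", "references to other published versions")


def categorize_heading(line):
    """Return the category a heading line introduces, or None for any other line."""
    normalized = " ".join(line.lower().split())
    for fragment, category in SECTION_CATEGORIES:
        if normalized.startswith(fragment):
            return category
    return None


def label_reference_lines(lines):
    """Yield (category, text) for each non-empty line under a recognized heading."""
    current = None
    for line in lines:
        normalized = " ".join(line.lower().split())
        if normalized.startswith(STOP_HEADINGS):
            current = None
            continue
        category = categorize_heading(line)
        if category is not None:
            current = category
            continue
        if current is not None and line.strip():
            yield current, line.strip()
```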

**Key Data Point**: The **Cochrane review group** (e.g., "Public Health") is extracted from the PDF header text using regex patterns like:
- `Cochrane ([A-Za-z\s&]+?) Group`
- `Cochrane ([A-Za-z\s&]+?) Review Group`
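The two patterns interact: because the capture group is non-greedy and allows spaces, the shorter `... Group` pattern also matches text such as "Cochrane Drugs and Alcohol Review Group" and captures "Drugs and Alcohol Review". The source does not state the order in which the notebook tries them; the sketch below assumes the more specific pattern is tried first. `extract_cochrane_group` is a hypothetical helper name.

```python
import re

# The two patterns quoted above, ordered most-specific first (an assumption).
GROUP_PATTERNS = [
    re.compile(r"Cochrane ([A-Za-z\s&]+?) Review Group"),
    re.compile(r"Cochrane ([A-Za-z\s&]+?) Group"),
]


def extract_cochrane_group(header_text):
    """Return the review group name found in the header text, or None."""
    for pattern in GROUP_PATTERNS:
        match = pattern.search(header_text)
        if match:
            # \s can match line breaks in PDF text, so collapse whitespace.
            return " ".join(match.group(1).split())
    return None
```

Since the Stage 6 filter is an exact string comparison on `cochrane_group`, normalizing whitespace and stripping the trailing "Review" at extraction time avoids silently losing records.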

**Output**: 
- `review_metadata.csv` (16,588 reviews with cochrane_group)
- `categorized_references.csv` (629,561 references with category labels)

---

## Stage 4: Match References to PubMed and Fetch Abstracts

**Notebook**: `04_fetch_referenced_abstracts.ipynb`

**Source**: CrossRef API + PubMed API (NCBI Entrez)

**Process**:
1. For references with a DOI: fetch the record directly from PubMed
2. For references without a DOI: query CrossRef with a bibliographic search (author + title + year)
3. Validate matches by comparing titles (fuzzy matching)
4. Fetch abstracts from PubMed using matched PMIDs

**Match Methods** (share of matched papers):
- 99.2% matched via CrossRef bibliographic search
- 0.8% matched via direct DOI extraction
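The title validation in step 3 of the process could look like the sketch below. The source does not say which fuzzy-matching library or threshold the notebook uses; this example substitutes the standard library's `difflib.SequenceMatcher`, and the `0.9` threshold is a hypothetical value chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", title.lower()).split())


def title_similarity(a, b):
    """Similarity in [0, 1] between two titles after normalization."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


def is_valid_match(reference_title, candidate_title, threshold=0.9):
    # threshold is an assumed value, not taken from the notebook
    return title_similarity(reference_title, candidate_title) >= threshold
```

Because 99.2% of matches come from bibliographic search rather than a DOI, the strictness of this check largely determines how many mislabeled pairs enter the ground truth.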

**Output**: `referenced_paper_abstracts.csv` (47,518 papers with abstracts)

---

## Stage 5: Build Ground Truth Validation Dataset

**Notebook**: `05_build_ground_truth.ipynb`

**Source**: Joins data from all previous stages

**Process**:
1. Join `review_metadata.csv` with `cochrane_pubmed_abstracts.csv` to get review title/abstract
2. Join with `referenced_paper_abstracts.csv` to get paper title/abstract
3. Map category to binary label:
   - `included` → **label = 1** (INCLUDE)
   - `excluded` → **label = 0** (EXCLUDE)
4. Filter to records with both review abstract AND paper abstract (required for LLM evaluation)
5. Add `cochrane_group` from review metadata
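The five steps can be sketched with pandas as below. The join keys (`review_doi`, `study_id`) and the assumption that each input frame already carries the column names from the final schema are illustrative; the notebook's actual keys are not given in this document. References categorized as `awaiting` or `ongoing` receive no label in this sketch and are dropped, which is an assumption consistent with the binary mapping in step 3.

```python
import pandas as pd

LABELS = {"included": 1, "excluded": 0}


def build_ground_truth(review_meta, review_abstracts, references, paper_abstracts):
    """Join the stage outputs and keep only rows usable for LLM evaluation."""
    # Steps 1 and 5: review metadata (with cochrane_group) plus title/abstract.
    reviews = review_meta.merge(review_abstracts, on="review_doi", how="inner")
    # Step 2: attach each categorized reference to its review and matched paper.
    df = references.merge(reviews, on="review_doi", how="inner").merge(
        paper_abstracts, on="study_id", how="inner"
    )
    # Step 3: categories outside the mapping become NaN.
    df["label"] = df["category"].map(LABELS)
    # Step 4: require a label and both abstracts.
    df = df.dropna(subset=["label", "review_abstract", "paper_abstract"])
    return df.astype({"label": int}).reset_index(drop=True)
```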

**Final Schema**:
| Column | Description |
|--------|-------------|
| `review_doi` | Cochrane review DOI |
| `review_title` | Review title (defines inclusion criteria) |
| `review_abstract` | Review abstract |
| `cochrane_group` | Cochrane review group (e.g., "Public Health") |
| `study_id` | Unique identifier for the referenced paper |
| `paper_title` | Title of the candidate paper |
| `paper_abstract` | Abstract of the candidate paper |
| `category` | Original category (included/excluded) |
| `label` | Binary label: 1=INCLUDE, 0=EXCLUDE |

**Output**: `ground_truth_validation_dataset.csv` (41,692 records)

---

## Stage 6: Filter to Public Health Reviews

**Notebook**: `06_evaluate_llms.ipynb`

**Filtering Criteria**: `cochrane_group == 'Public Health'`

**Why Public Health?**
1. **Domain relevance**: Aligns with UKHSA's public health mission
2. **Manageable size**: 4,089 records suitable for local LLM inference
3. **Representative imbalance**: ~80% EXCLUDE mirrors real screening scenarios
4. **Focused evaluation**: Tests LLM performance on a coherent domain

**Evaluation Subset Statistics**:
- Total records: 4,089
- INCLUDE (label=1): ~20%
- EXCLUDE (label=0): ~80%

---

## LLM Evaluation Prompts

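Two prompt strategies were evaluated. Both templates contain the `{few_shot_examples}` placeholder, so the "zero-shot" label distinguishes the direct one-word answer from the step-by-step CoT format rather than indicating that no examples are given: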

### 1. Zero-Shot Prompt

```
You are screening papers for a Cochrane systematic review. Your job is to EXCLUDE papers that don't match.

CRITICAL CALIBRATION:
- In this dataset, only ~20% of papers should be INCLUDED
- Most papers (~80%) should be EXCLUDED
- If in doubt, EXCLUDE - false positives waste reviewer time

EXCLUSION CRITERIA (if ANY apply → EXCLUDE):
1. WRONG POPULATION - Paper studies different population than the review targets
2. WRONG INTERVENTION - Paper doesn't evaluate the intervention/exposure the review examines
3. WRONG STUDY DESIGN - Paper is observational/descriptive when review needs trials
4. WRONG TOPIC - Paper addresses a tangentially related but different question

INCLUSION CRITERIA (ALL must apply → INCLUDE):
✓ Population matches the review's target population
✓ Intervention/exposure matches what the review examines
✓ Study design is appropriate (usually trials for Cochrane reviews)
✓ Outcomes are relevant to the review question

{few_shot_examples}

=== REVIEW BEING CONDUCTED ===
Title: {review_title}
Criteria: {review_abstract}

=== PAPER TO SCREEN ===
Title: {paper_title}
Abstract: {paper_abstract}

=== YOUR DECISION ===
Does this paper match ALL inclusion criteria? Most papers do NOT.

Respond with exactly one word: INCLUDE or EXCLUDE
```

### 2. Chain-of-Thought (CoT) Prompt

```
You are screening papers for a Cochrane systematic review. Your job is to EXCLUDE papers that don't match.

CRITICAL CALIBRATION:
- In this dataset, only ~20% of papers should be INCLUDED
- Most papers (~80%) should be EXCLUDED  
- If in doubt, EXCLUDE - false positives waste reviewer time

EXCLUSION CHECKLIST (if ANY = NO → EXCLUDE):
□ Population match? Paper studies the same population as the review?
□ Intervention match? Paper evaluates the intervention/exposure the review examines?
□ Study design match? Paper is the right type (trials for most Cochrane reviews)?

{few_shot_examples}

=== REVIEW BEING CONDUCTED ===
Title: {review_title}
Criteria: {review_abstract}

=== PAPER TO SCREEN ===
Title: {paper_title}
Abstract: {paper_abstract}

=== EXCLUSION-FIRST ANALYSIS ===

Step 1 - POPULATION CHECK:
- Review targets: [identify the population]
- Paper studies: [identify the population]  
- Match? [YES/NO] - If NO → EXCLUDE

Step 2 - INTERVENTION CHECK:
- Review examines: [identify intervention/exposure]
- Paper examines: [identify what the paper studies]
- Match? [YES/NO] - If NO → EXCLUDE

Step 3 - STUDY DESIGN CHECK:
- Review requires: [usually RCTs or intervention trials]
- Paper design: [identify: RCT, cohort, cross-sectional, survey, review, etc.]
- Match? [YES/NO] - If observational when trials needed → EXCLUDE

Step 4 - FINAL DECISION:
- If ANY check = NO → EXCLUDE
- If ALL checks = YES → INCLUDE

DECISION: [INCLUDE or EXCLUDE]
```
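Both prompts end by asking for INCLUDE or EXCLUDE, but the CoT response also contains those words inside its reasoning steps. The sketch below shows one way to turn raw model output into the dataset's binary label. It is an assumption about post-processing, not the notebook's parser: it prefers an explicit `DECISION:` line and otherwise falls back to the last keyword in the response.

```python
import re

KEYWORD_RE = re.compile(r"\b(INCLUDE|EXCLUDE)\b", re.IGNORECASE)
TAGGED_RE = re.compile(r"DECISION:\s*\[?\s*(INCLUDE|EXCLUDE)\b", re.IGNORECASE)


def parse_decision(response):
    """Map raw model output to 1 (INCLUDE), 0 (EXCLUDE), or None if unparseable."""
    # Prefer the explicit "DECISION:" line used by the CoT format.
    candidates = TAGGED_RE.findall(response) or KEYWORD_RE.findall(response)
    if not candidates:
        return None
    # Take the last occurrence, since reasoning text precedes the final answer.
    return 1 if candidates[-1].upper() == "INCLUDE" else 0
```

Returning `None` for unparseable output keeps format failures visible instead of silently counting them as one of the two classes.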

### Few-Shot Examples (included in both prompts)

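Both prompts receive the same five examples through the `{few_shot_examples}` placeholder: 3 EXCLUDE and 2 INCLUDE. The 3:2 split leans toward EXCLUDE but does not itself reproduce the 80/20 class distribution, which the prompts state explicitly in their calibration section: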

| Type | Review | Paper | Decision | Reason |
|------|--------|-------|----------|--------|
| EXCLUDE | "Interventions for preventing obesity in children" | "Association between screen time and childhood obesity: systematic review" | EXCLUDE | This is a SYSTEMATIC REVIEW of associations, not an intervention study |
| EXCLUDE | "Interventions for preventing obesity in children" | "Workplace wellness programs for obese adults" | EXCLUDE | WRONG POPULATION (adults) and WRONG SETTING (workplace) |
| EXCLUDE | "Interventions for preventing obesity in children" | "Prevalence of childhood obesity: cross-sectional survey" | EXCLUDE | PREVALENCE SURVEY - no intervention tested |
| INCLUDE | "Interventions for preventing obesity in children" | "School-based nutrition education program: RCT" | INCLUDE | Intervention + correct population + RCT design |
| INCLUDE | "School-based physical activity programs" | "Daily PE classes on fitness: cluster RCT" | INCLUDE | Intervention + correct population + correct setting |
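The table above can be rendered into the `{few_shot_examples}` placeholder with plain string formatting. The layout of each example (the `Example N:` header and field labels) and the shortened `TEMPLATE` are hypothetical; the document does not show how the notebook serializes the examples.

```python
REVIEW_OBESITY = "Interventions for preventing obesity in children"

FEW_SHOT_EXAMPLES = [
    (REVIEW_OBESITY, "Association between screen time and childhood obesity: systematic review",
     "EXCLUDE", "This is a SYSTEMATIC REVIEW of associations, not an intervention study"),
    (REVIEW_OBESITY, "Workplace wellness programs for obese adults",
     "EXCLUDE", "WRONG POPULATION (adults) and WRONG SETTING (workplace)"),
    (REVIEW_OBESITY, "Prevalence of childhood obesity: cross-sectional survey",
     "EXCLUDE", "PREVALENCE SURVEY - no intervention tested"),
    (REVIEW_OBESITY, "School-based nutrition education program: RCT",
     "INCLUDE", "Intervention + correct population + RCT design"),
    ("School-based physical activity programs", "Daily PE classes on fitness: cluster RCT",
     "INCLUDE", "Intervention + correct population + correct setting"),
]

# Shortened stand-in for the full prompt templates shown earlier.
TEMPLATE = (
    "{few_shot_examples}\n\n"
    "=== REVIEW BEING CONDUCTED ===\nTitle: {review_title}\nCriteria: {review_abstract}\n\n"
    "=== PAPER TO SCREEN ===\nTitle: {paper_title}\nAbstract: {paper_abstract}\n"
)


def render_few_shot(examples):
    """Serialize (review, paper, decision, reason) tuples into one text block."""
    blocks = [
        f"Example {i}:\nReview: {review}\nPaper: {paper}\nDecision: {decision} ({reason})"
        for i, (review, paper, decision, reason) in enumerate(examples, start=1)
    ]
    return "=== EXAMPLES ===\n" + "\n\n".join(blocks)


def build_prompt(row, template=TEMPLATE, examples=FEW_SHOT_EXAMPLES):
    """Fill the template from one dataset row (a dict or pandas Series)."""
    return template.format(
        few_shot_examples=render_few_shot(examples),
        review_title=row["review_title"],
        review_abstract=row["review_abstract"],
        paper_title=row["paper_title"],
        paper_abstract=row["paper_abstract"],
    )
```

`str.format` only interprets braces in the template itself, so abstracts that happen to contain `{` or `}` are inserted unchanged.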

---

## Verify Dataset Statistics

```python
import pandas as pd
from pathlib import Path

# Setup paths
notebook_dir = Path.cwd()
project_root = notebook_dir.parent if 'Jupyter Notebooks' in str(notebook_dir) else notebook_dir
DATA_DIR = project_root / 'Data'

# Load validation dataset
gt = pd.read_csv(DATA_DIR / 'ground_truth_validation_dataset.csv', dtype=str, low_memory=False)
gt['label'] = gt['label'].astype(int)

print("=" * 70)
print("VALIDATION DATASET OVERVIEW")
print("=" * 70)
print(f"Total records:           {len(gt):,}")
print(f"Unique Cochrane reviews: {gt['review_doi'].nunique():,}")
print()

print("Distribution by Cochrane Group:")
print("-" * 50)
for group, count in gt['cochrane_group'].value_counts().items():
    include_rate = (gt[gt['cochrane_group']==group]['label']==1).mean()*100
    print(f"  {group:30s} {count:6,} records ({include_rate:5.1f}% INCLUDE)")

print()
print("=" * 70)
print("PUBLIC HEALTH SUBSET (used for LLM evaluation)")
print("=" * 70)
ph = gt[gt['cochrane_group'] == 'Public Health']
print(f"Total records:    {len(ph):,}")
print(f"  INCLUDE:        {(ph['label']==1).sum():,} ({(ph['label']==1).mean()*100:.1f}%)")
print(f"  EXCLUDE:        {(ph['label']==0).sum():,} ({(ph['label']==0).mean()*100:.1f}%)")
```

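Output:

```
======================================================================
VALIDATION DATASET OVERVIEW
======================================================================
Total records:           41,692
Unique Cochrane reviews: 1,228

Distribution by Cochrane Group:
--------------------------------------------------
  Acute Respiratory Infections   11,455 records ( 35.6% INCLUDE)
  Tobacco Addiction              10,198 records ( 42.8% INCLUDE)
  Infectious Diseases             8,516 records ( 33.5% INCLUDE)
  Drugs and Alcohol               6,754 records ( 34.1% INCLUDE)
  Public Health                   4,089 records ( 20.7% INCLUDE)
  STI                               680 records ( 42.8% INCLUDE)

======================================================================
PUBLIC HEALTH SUBSET (used for LLM evaluation)
======================================================================
Total records:    4,089
  INCLUDE:        848 (20.7%)
  EXCLUDE:        3,241 (79.3%)
```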
