# 05: Build Ground Truth Validation Dataset

## Objective
Build a validation dataset to evaluate whether an LLM can correctly determine if a paper should be **included** or **excluded** from a Cochrane systematic review.

## What the LLM needs to decide
Given:
- **Review context** (title and abstract of the Cochrane systematic review)
- **Paper abstract** (a candidate paper being screened)

Predict:
- **Label = 1** → Paper should be INCLUDED in the review
- **Label = 0** → Paper should be EXCLUDED from the review
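
The decision above can be sketched as a prompt builder plus a reply parser. This is only an illustration: the function names `build_screening_prompt` and `parse_label` and the prompt wording are hypothetical and not taken from this project.

```python
def build_screening_prompt(review_title, review_abstract, paper_title, paper_abstract):
    """Assemble one screening prompt (hypothetical template, not the project's own)."""
    return (
        "You are screening studies for a Cochrane systematic review.\n\n"
        f"REVIEW TITLE: {review_title}\n"
        f"REVIEW ABSTRACT: {review_abstract}\n\n"
        f"PAPER TITLE: {paper_title}\n"
        f"PAPER ABSTRACT: {paper_abstract}\n\n"
        "Answer with a single digit: 1 to INCLUDE the paper, 0 to EXCLUDE it."
    )


def parse_label(response):
    """Return 1 or 0 from a model reply, or None if the reply is unusable."""
    text = response.strip()
    if text[:1] in ("0", "1"):
        return int(text[0])
    return None
```

Returning `None` for unparseable replies keeps them separate from genuine exclusions, so they can be counted or retried rather than silently scored as label 0.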

## Data structure
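Each row in the validation set contains the following core fields (the saved CSV also carries `review_doi`, `cochrane_group`, `study_id`, and `category`):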
1. `review_title` - Title of the Cochrane systematic review
2. `review_abstract` - Abstract/objective of the review (screening criteria)
3. `paper_title` - Title of the candidate paper
4. `paper_abstract` - Abstract of the candidate paper
5. `label` - Human reviewer decision (1=included, 0=excluded)
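
A small schema check can catch a malformed file before any LLM calls are made. The sketch below assumes the five column names listed above; the helper `check_schema` and the toy row are hypothetical.

```python
import pandas as pd

REQUIRED_COLUMNS = [
    "review_title", "review_abstract", "paper_title", "paper_abstract", "label",
]


def check_schema(df):
    """Return a list of problems found; an empty list means the frame looks valid."""
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems  # the remaining checks need every column
    if not df["label"].isin([0, 1]).all():
        problems.append("label contains values other than 0 and 1")
    for col in ("review_abstract", "paper_abstract"):
        if df[col].isna().any():
            problems.append(f"{col} has missing values")
    return problems


# Toy row for illustration only; not real data from the dataset
toy = pd.DataFrame({
    "review_title": ["Toy review"],
    "review_abstract": ["Toy review abstract describing the screening criteria."],
    "paper_title": ["Toy paper"],
    "paper_abstract": ["Toy paper abstract."],
    "label": [1],
})
print(check_schema(toy))  # prints []
```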

## Input Files
- `Data/referenced_paper_abstracts.csv` - Matched candidate papers with abstracts
- `Data/cochrane_pubmed_abstracts.csv` - Cochrane review abstracts
- `Data/review_metadata.csv` - Review metadata including Cochrane group

## Output
- `Data/ground_truth_validation_dataset.csv` - Full validation dataset with `cochrane_group` column for filtering
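
Filtering on `cochrane_group` might look like the sketch below. The helper `filter_by_group` is hypothetical, and a toy frame stands in for the real CSV so the example runs on its own.

```python
import pandas as pd


def filter_by_group(df, group):
    """Keep only rows whose cochrane_group equals `group` (exact string match)."""
    return df[df["cochrane_group"] == group].reset_index(drop=True)


# Toy frame for illustration; the real file has one row per (study_id, review_doi)
toy = pd.DataFrame({
    "cochrane_group": ["Tobacco Addiction", "STI", "Tobacco Addiction", None],
    "label": [1, 0, 0, 1],
})
tobacco = filter_by_group(toy, "Tobacco Addiction")
print(len(tobacco))  # prints 2
```

With the real output, the same call would be applied to `pd.read_csv("Data/ground_truth_validation_dataset.csv")`. Rows with a missing `cochrane_group` never match an equality filter, so they are dropped by any group selection.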

In [40]:
# Install required packages
%pip install -q pandas

Note: you may need to restart the kernel to use updated packages.


In [46]:
# Load data
from pathlib import Path
import pandas as pd

notebook_dir = Path.cwd()
project_root = notebook_dir if (notebook_dir / "Data").exists() else notebook_dir.parent
DATA_DIR = project_root / "Data"

# Input files
PAPERS_CSV = DATA_DIR / "referenced_paper_abstracts.csv"
REVIEWS_CSV = DATA_DIR / "cochrane_pubmed_abstracts.csv"
META_CSV = DATA_DIR / "review_metadata.csv"
OUTPUT_CSV = DATA_DIR / "ground_truth_validation_dataset.csv"

# Load candidate papers (from notebook 04 - now includes review_doi!)
print("Loading candidate papers...")
papers = pd.read_csv(PAPERS_CSV)
print(f"Total papers: {len(papers):,}")
print(f"Columns: {papers.columns.tolist()}")

# Verify review_doi is present (critical fix from notebook 04)
assert 'review_doi' in papers.columns, "ERROR: review_doi missing! Re-run notebook 04."
print(f"✓ review_doi present - {papers['review_doi'].notna().sum():,} papers have review_doi")

# Load Cochrane review abstracts (the screening criteria context)
print("\nLoading Cochrane review abstracts...")
reviews = pd.read_csv(REVIEWS_CSV)
print(f"Total review abstracts: {len(reviews):,}")

Loading candidate papers...
Total papers: 47,518
Columns: ['study_id', 'review_doi', 'category', 'original_title', 'original_authors', 'original_year', 'pmid', 'doi', 'matched_title', 'matched_authors', 'matched_year', 'abstract', 'match_method']
✓ review_doi present - 47,518 papers have review_doi

Loading Cochrane review abstracts...
Total review abstracts: 17,298


In [47]:
# Filter papers to included/excluded with abstracts
print("Filtering to included/excluded papers with abstracts...")

# Keep only included/excluded studies (drop awaiting/ongoing)
papers_filtered = papers[papers['category'].isin(['included', 'excluded'])].copy()
print(f"Included + Excluded: {len(papers_filtered):,}")

# Must have paper abstract
papers_filtered = papers_filtered[
    papers_filtered['abstract'].notna() & 
    (papers_filtered['abstract'].str.len() > 50)
]
print(f"With paper abstracts (>50 chars): {len(papers_filtered):,}")

# Add binary label
papers_filtered['label'] = (papers_filtered['category'] == 'included').astype(int)

print("\nLabel distribution:")
print(f"  Included (label=1): {(papers_filtered['label'] == 1).sum():,}")
print(f"  Excluded (label=0): {(papers_filtered['label'] == 0).sum():,}")

Filtering to included/excluded papers with abstracts...
Included + Excluded: 46,305
With paper abstracts (>50 chars): 41,720

Label distribution:
  Included (label=1): 14,747
  Excluded (label=0): 26,973


In [49]:
# =============================================================================
# Join with review context (FIXED - using review_doi from papers directly!)
# =============================================================================
# Previous bug: Joined categorized_references on study_id alone, causing cartesian product
# Fix: papers now include review_doi directly from notebook 04

print("Joining papers with review context...")
print(f"Papers with review_doi: {papers_filtered['review_doi'].notna().sum():,}")

# Load metadata to get cochrane_group for each review
review_metadata = pd.read_csv(META_CSV)
review_groups = review_metadata[['doi', 'cochrane_group']].rename(columns={'doi': 'review_doi'})
print(f"Review metadata loaded: {len(review_groups):,} reviews with cochrane_group")

# Clean up review abstracts for joining
# IMPORTANT: Deduplicate reviews to avoid cartesian product!
reviews_clean = reviews[['doi', 'title', 'abstract']].rename(columns={
    'doi': 'review_doi',
    'title': 'review_title', 
    'abstract': 'review_abstract'
}).drop_duplicates(subset='review_doi', keep='first')

# Add cochrane_group to reviews
reviews_clean = reviews_clean.merge(review_groups, on='review_doi', how='left')

print(f"Unique reviews: {len(reviews_clean):,}")

# Also deduplicate papers: notebook 04's dedup logic leaves some repeated (study_id, review_doi) rows
papers_dedup = papers_filtered.drop_duplicates(subset=['study_id', 'review_doi'], keep='first')
print(f"Unique (study_id, review_doi) in papers: {len(papers_dedup):,}")

# Simple inner join on review_doi; no intermediate mapping table is needed.
# The review_doi carried by papers_dedup links each paper directly to its review.
validation_df = papers_dedup.merge(reviews_clean, on='review_doi', how='inner')

print(f"\nPapers matched to reviews: {len(validation_df):,}")
print(f"Unique (study_id, review_doi) pairs: {validation_df.groupby(['study_id', 'review_doi']).ngroups:,}")

# Verify the join produced exactly one row per (study_id, review_doi) pair
assert len(validation_df) == validation_df.groupby(['study_id', 'review_doi']).ngroups, \
    "ERROR: Still have duplicates!"
print("\n✓ Join successful - exactly 1 row per (study_id, review_doi) pair")

Joining papers with review context...
Papers with review_doi: 41,720
Review metadata loaded: 16,588 reviews with cochrane_group
Unique reviews: 16,617
Unique (study_id, review_doi) in papers: 41,692

Papers matched to reviews: 41,692
Unique (study_id, review_doi) pairs: 41,692

✓ Join successful - exactly 1 row per (study_id, review_doi) pair
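
As an alternative guard against the cartesian-product problem, pandas can enforce key uniqueness during the merge itself. The sketch below is not what the notebook does; it shows the `validate` and `indicator` parameters of `DataFrame.merge` on toy frames (`papers_toy` and `reviews_toy` are made up for illustration).

```python
import pandas as pd

papers_toy = pd.DataFrame({
    "study_id": ["S1", "S2", "S3"],
    "review_doi": ["doi-A", "doi-A", "doi-B"],
})
reviews_toy = pd.DataFrame({
    "review_doi": ["doi-A", "doi-B", "doi-B"],  # doi-B duplicated on purpose
    "review_title": ["Review A", "Review B", "Review B (dup)"],
})

# validate="many_to_one" raises MergeError when the right-hand keys are not unique
try:
    papers_toy.merge(reviews_toy, on="review_doi", how="inner", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

# After deduplicating the reviews, the same merge passes validation
reviews_unique = reviews_toy.drop_duplicates(subset="review_doi", keep="first")
joined = papers_toy.merge(
    reviews_unique, on="review_doi", how="left",
    validate="many_to_one", indicator=True,
)

# indicator=True adds a _merge column; "left_only" marks papers with no review match
unmatched = int((joined["_merge"] == "left_only").sum())
print(raised, len(joined), unmatched)  # prints True 3 0
```

Using `how="left"` with `indicator=True` also makes it visible how many papers fail to match a review, which an inner join drops silently.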


In [50]:
# Build final validation dataset (includes cochrane_group for filtering)
ground_truth = validation_df[[
    'review_doi',
    'review_title',
    'review_abstract',
    'cochrane_group',
    'study_id',
    'original_title',
    'abstract',
    'category',
    'label'
]].rename(columns={
    'original_title': 'paper_title',
    'abstract': 'paper_abstract'
}).copy()

# Filter to rows where review has an abstract (context for LLM)
ground_truth = ground_truth[
    ground_truth['review_abstract'].notna() & 
    (ground_truth['review_abstract'].str.len() > 100)
]

print("VALIDATION DATASET STRUCTURE")
print("=" * 60)
print(f"Total examples: {len(ground_truth):,}")
print("\nColumns:")
for col in ground_truth.columns:
    print(f"  • {col}")

print("\nLabel distribution:")
print(f"  Included (label=1): {(ground_truth['label'] == 1).sum():,}")
print(f"  Excluded (label=0): {(ground_truth['label'] == 0).sum():,}")
print(f"\nUnique reviews: {ground_truth['review_doi'].nunique():,}")

print("\n" + "=" * 60)
print("SAMPLE ROW (what the LLM will see):")
print("=" * 60)
sample = ground_truth.iloc[0]
print(f"\nREVIEW TITLE: {sample['review_title'][:100]}...")
print(f"\nREVIEW ABSTRACT (context): {sample['review_abstract'][:300]}...")
print(f"\nPAPER TITLE: {sample['paper_title']}")
print(f"\nPAPER ABSTRACT: {sample['paper_abstract'][:300]}...")
print(f"\nLABEL: {sample['label']} ({'INCLUDED' if sample['label']==1 else 'EXCLUDED'})")

VALIDATION DATASET STRUCTURE
Total examples: 41,692

Columns:
  • review_doi
  • review_title
  • review_abstract
  • cochrane_group
  • study_id
  • paper_title
  • paper_abstract
  • category
  • label

Label distribution:
  Included (label=1): 14,738
  Excluded (label=0): 26,954

Unique reviews: 1,228

SAMPLE ROW (what the LLM will see):

REVIEW TITLE: Acupuncture for smoking cessation....

REVIEW ABSTRACT (context): Acupuncture is promoted as a treatment for smoking cessation, and is believed to reduce withdrawal symptoms. The objective of this review is to determine the effectiveness of acupuncture in smoking cessation in comparison with: a) sham acupuncture b) other interventions c) no intervention. We search...

PAPER TITLE: Laser acupuncture for adolescent smokers - a randomized double-blind controlled trial

PAPER ABSTRACT: A double blind, randomized, placebo-controlled clinical study was conducted to evaluate the efficacy of laser acupuncture treatment in adolescent smokers. 

In [51]:
# =============================================================================
# Save Validation Dataset (all Cochrane groups; filter with the cochrane_group column)
# =============================================================================

ground_truth.to_csv(OUTPUT_CSV, index=False)

print("GROUND TRUTH VALIDATION DATASET COMPLETE")
print("=" * 60)
print(f"\n✓ Saved to {OUTPUT_CSV.name}")
print(f"  Total examples: {len(ground_truth):,}")
print(f"  Included (label=1): {(ground_truth['label'] == 1).sum():,}")
print(f"  Excluded (label=0): {(ground_truth['label'] == 0).sum():,}")
print(f"  Unique reviews: {ground_truth['review_doi'].nunique():,}")

print("\nCochrane groups (filter using 'cochrane_group' column):")
print(ground_truth['cochrane_group'].value_counts().to_string())
print("\n✓ Ready for LLM evaluation!")

GROUND TRUTH VALIDATION DATASET COMPLETE

✓ Saved to ground_truth_validation_dataset.csv
  Total examples: 41,692
  Included (label=1): 14,738
  Excluded (label=0): 26,954
  Unique reviews: 1,228

Cochrane groups (filter using 'cochrane_group' column):
cochrane_group
Acute Respiratory Infections    11455
Tobacco Addiction               10198
Infectious Diseases              8516
Drugs and Alcohol                6754
Public Health                    4089
STI                               680

✓ Ready for LLM evaluation!
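
Once predictions exist, they can be scored against `label`. The sketch below is one possible scoring routine, not part of this notebook: the function `screening_metrics` is hypothetical, and it treats label 1 (included) as the positive class. It also computes the accuracy of an always-exclude baseline from the label counts printed above, which is a useful floor given the class imbalance.

```python
def screening_metrics(y_true, y_pred):
    """Precision, recall, F1 (positive class = 1) and accuracy for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy labels and predictions for illustration only
metrics = screening_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])

# Always predicting "excluded" on the saved dataset: 26,954 of 41,692 rows are label 0
baseline_accuracy = 26_954 / 41_692  # roughly 0.65
```

A model therefore needs to beat roughly 65% accuracy to improve on the trivial baseline. Recall on the included class is worth reporting separately, since in screening a missed relevant paper is usually costlier than an extra one passed on for manual review.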
