# 06: Build Ground Truth Validation Dataset

## Objective
Build a validation dataset to evaluate whether an LLM can correctly determine if a paper should be **included** or **excluded** from a Cochrane systematic review.

## Revised Label Logic
- **Included (label=1)**: ALL papers referenced in any Cochrane review, regardless of their
  internal categorization (included, excluded, awaiting, ongoing). If human reviewers
  mentioned it, it was relevant enough to consider.
- **Excluded (label=0)**: Papers retrieved by the review's PubMed search strategy but
  NOT referenced in any Cochrane review. These are treated as negatives: the search found
  them, but reviewers did not cite them in any category.

## What the LLM needs to decide
Given:
- **Review context** (title and abstract of the Cochrane systematic review)
- **Candidate paper** (title and abstract of the paper being screened)

Predict:
- **Label = 1** → Paper should be INCLUDED in the review
- **Label = 0** → Paper should be EXCLUDED from the review
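
As a rough sketch of how one dataset row might be turned into a model input, the helper below formats the review context and the candidate paper into a single prompt and parses a one-digit answer. The function names `build_prompt` and `parse_label` and the prompt wording are assumptions made for illustration; the prompt actually used in notebook 07 may differ.

```python
def build_prompt(row):
    """Format one dataset row as a screening prompt (wording is illustrative)."""
    return (
        "You are screening papers for a Cochrane systematic review.\n\n"
        f"REVIEW TITLE: {row['review_title']}\n"
        f"REVIEW ABSTRACT: {row['review_abstract']}\n\n"
        f"PAPER TITLE: {row['paper_title']}\n"
        f"PAPER ABSTRACT: {row['paper_abstract']}\n\n"
        "Answer with a single digit: 1 to include the paper, 0 to exclude it."
    )


def parse_label(response):
    """Map a model response to 1 or 0; return None if it is neither."""
    text = response.strip()
    return int(text) if text in {"0", "1"} else None


# Toy row; the keys match the dataset columns described in this notebook.
example_row = {
    "review_title": "Example review",
    "review_abstract": "Objectives of the example review.",
    "paper_title": "Example trial",
    "paper_abstract": "Abstract of the example trial.",
}
prompt = build_prompt(example_row)
print(prompt)
```

Returning `None` for unparseable responses keeps them separate from genuine exclusions, so they can be counted rather than silently scored as label 0.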

## Data structure
Each row in the validation set contains:
1. `review_doi` - DOI of the Cochrane systematic review
2. `review_title` - Title of the Cochrane systematic review
3. `review_abstract` - Abstract/objective of the review (screening criteria)
4. `cochrane_group` - Cochrane Review Group (for filtering)
5. `paper_title` - Title of the candidate paper
6. `paper_abstract` - Abstract of the candidate paper
7. `label` - Ground truth (1=included, 0=excluded)
8. `source` - Origin of the paper ("cochrane_ref" or "pubmed_search")
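
A lightweight schema check along these lines could be run before saving or after reloading the dataset. It is a sketch, not part of the original pipeline: `EXPECTED_COLUMNS`, `ALLOWED_SOURCES`, and `check_schema` are hypothetical names, and the label/source consistency rule is inferred from the label logic above.

```python
import pandas as pd

EXPECTED_COLUMNS = [
    "review_doi", "review_title", "review_abstract", "cochrane_group",
    "paper_title", "paper_abstract", "label", "source",
]
ALLOWED_SOURCES = {"cochrane_ref", "pubmed_search"}


def check_schema(df):
    """Return a list of problems; an empty list means the frame looks valid."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    if not df["label"].isin([0, 1]).all():
        problems.append("label contains values other than 0 and 1")
    if not df["source"].isin(ALLOWED_SOURCES).all():
        problems.append("source contains unexpected values")
    # Under the label logic, cochrane_ref rows are label=1 and pubmed_search rows are label=0.
    expected_label = (df["source"] == "cochrane_ref").astype(int)
    if not (df["label"] == expected_label).all():
        problems.append("label and source disagree")
    return problems


# Toy frame: the second row deliberately violates the label/source rule.
toy = pd.DataFrame({
    "review_doi": ["r1", "r1"],
    "review_title": ["T", "T"],
    "review_abstract": ["A", "A"],
    "cochrane_group": ["Public Health", "Public Health"],
    "paper_title": ["p1", "p2"],
    "paper_abstract": ["a1", "a2"],
    "label": [1, 1],
    "source": ["cochrane_ref", "pubmed_search"],
})
print(check_schema(toy))
```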

## Input Files
- `Data/referenced_paper_abstracts.csv` - Papers referenced in Cochrane reviews (notebook 04) → **all label=1**
- `Data/pubmed_excluded_abstracts.csv` - PubMed search-only papers (notebook 05) → **all label=0**
- `Data/cochrane_pubmed_abstracts.csv` - Cochrane review abstracts (notebook 00)
- `Data/review_metadata.csv` - Review metadata including Cochrane group (notebook 03)

## Output
- `Data/ground_truth_validation_dataset.csv` - Validation dataset, restricted to the Public Health group in this run, with the `cochrane_group` column kept for filtering

In [1]:
# Install required packages
%pip install -q pandas

Note: you may need to restart the kernel to use updated packages.


In [2]:
# =============================================================================
# Load All Data Sources
# =============================================================================
from pathlib import Path
import pandas as pd
import re

notebook_dir = Path.cwd()
project_root = notebook_dir if (notebook_dir / "Data").exists() else notebook_dir.parent
DATA_DIR = project_root / "Data"

# Input files
INCLUDED_CSV  = DATA_DIR / "referenced_paper_abstracts.csv"   # from notebook 04
EXCLUDED_CSV  = DATA_DIR / "pubmed_excluded_abstracts.csv"     # from notebook 05
REVIEWS_CSV   = DATA_DIR / "cochrane_pubmed_abstracts.csv"     # from notebook 00
META_CSV      = DATA_DIR / "review_metadata.csv"               # from notebook 03
OUTPUT_CSV    = DATA_DIR / "ground_truth_validation_dataset.csv"

# --- Review metadata (filter to latest versions) ---
print("Loading review metadata...")
meta = pd.read_csv(META_CSV)
print(f"  Total reviews: {len(meta):,}")

# Version deduplication: keep only latest version of each Cochrane review
if 'cd_number' not in meta.columns or 'version' not in meta.columns:
    _vp = meta['doi'].str.extract(r'(CD\d+)(?:\.pub(\d+))?', flags=re.I)
    meta['cd_number'] = _vp[0].str.upper()
    meta['version'] = _vp[1].fillna(1).astype(int)

_has_cd = meta[meta['cd_number'].notna()]
_latest_idx = _has_cd.groupby('cd_number')['version'].idxmax()
_no_cd = meta[meta['cd_number'].isna()]
_before = len(meta)
meta = pd.concat([meta.loc[_latest_idx], _no_cd], ignore_index=True)
print(f"  After version dedup: {len(meta):,} (removed {_before - len(meta):,} superseded)")

# Set of latest-version DOIs (used to filter included/excluded)
latest_review_dois = set(meta['doi'].dropna())

# --- Included papers (ALL Cochrane references) ---
print("\nLoading included papers (Cochrane references)...")
included_raw = pd.read_csv(INCLUDED_CSV)
print(f"  Total rows: {len(included_raw):,}")
assert 'review_doi' in included_raw.columns, "ERROR: review_doi missing! Re-run notebook 04."
included_raw = included_raw[included_raw['review_doi'].isin(latest_review_dois)]
print(f"  After filtering to latest reviews: {len(included_raw):,}")

# --- Excluded papers (PubMed search-only) ---
print("\nLoading excluded papers (PubMed search results)...")
excluded_raw = pd.read_csv(EXCLUDED_CSV)
print(f"  Total rows: {len(excluded_raw):,}")
excluded_raw = excluded_raw[excluded_raw['review_doi'].isin(latest_review_dois)]
print(f"  After filtering to latest reviews: {len(excluded_raw):,}")

# --- Review abstracts ---
print("\nLoading Cochrane review abstracts...")
reviews = pd.read_csv(REVIEWS_CSV)
print(f"  Total review abstracts: {len(reviews):,}")
print(f"  Review metadata (latest): {len(meta):,} reviews")

Loading review metadata...
  Total reviews: 16,618
  After version dedup: 9,968 (removed 6,650 superseded)

Loading included papers (Cochrane references)...
  Total rows: 3,241
  After filtering to latest reviews: 3,241

Loading excluded papers (PubMed search results)...
  Total rows: 148,822
  After filtering to latest reviews: 148,822

Loading Cochrane review abstracts...
  Total review abstracts: 17,328
  Review metadata (latest): 9,968 reviews
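
To make the version deduplication step concrete, here is a small worked example on toy DOIs. It follows the same regex and `idxmax` approach as the cell above, except that it parses the version with `pd.to_numeric`, which is a substitution made for this sketch so the conversion does not depend on how pandas stores extracted strings. The DOIs are illustrative values, not rows from `review_metadata.csv`.

```python
import re
import pandas as pd

toy = pd.DataFrame({"doi": [
    "10.1002/14651858.CD001871",        # no .pubN suffix, treated as version 1
    "10.1002/14651858.CD001871.pub2",
    "10.1002/14651858.CD001871.pub4",
    "10.1002/14651858.CD007651.pub3",
]})

# Capture the CD number and, if present, the .pubN version suffix.
parts = toy["doi"].str.extract(r"(CD\d+)(?:\.pub(\d+))?", flags=re.I)
toy["cd_number"] = parts[0].str.upper()
toy["version"] = pd.to_numeric(parts[1]).fillna(1).astype(int)

# Keep the row with the highest version for each CD number.
latest = toy.loc[toy.groupby("cd_number")["version"].idxmax()]
print(latest["doi"].tolist())
```

Only the `.pub4` row survives for `CD001871`, and the single `CD007651` row is kept as is.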


In [3]:
# =============================================================================
# Prepare Included Papers (label=1)
# =============================================================================
# ALL papers referenced in Cochrane reviews — regardless of internal category
# (included, excluded, awaiting, ongoing) — are label=1.

print("INCLUDED PAPERS (label=1)")
print("=" * 60)

# Filter to rows with usable abstracts
included = included_raw[
    included_raw['abstract'].notna() &
    (included_raw['abstract'].str.len() > 50)
].copy()
print(f"With abstracts (>50 chars): {len(included):,}")

# Deduplicate (study_id, review_doi)
included = included.drop_duplicates(subset=['study_id', 'review_doi'], keep='first')
print(f"After dedup: {len(included):,}")

# Standardize columns for merge
included_clean = included[['review_doi', 'original_title', 'abstract']].rename(columns={
    'original_title': 'paper_title',
    'abstract': 'paper_abstract',
})
included_clean['label'] = 1
included_clean['source'] = 'cochrane_ref'

# Also keep PMID as a unique paper ID where available
if 'pmid' in included.columns:
    included_clean['pmid'] = included['pmid'].values

print(f"\nIncluded papers prepared: {len(included_clean):,}")
print(f"  Unique reviews: {included_clean['review_doi'].nunique():,}")
print(f"  Original categories: {dict(included['category'].value_counts())}")

INCLUDED PAPERS (label=1)
With abstracts (>50 chars): 3,135
After dedup: 3,134

Included papers prepared: 3,134
  Unique reviews: 58
  Original categories: {'excluded': np.int64(2257), 'included': np.int64(638), 'awaiting': np.int64(156), 'ongoing': np.int64(83)}


In [4]:
# =============================================================================
# Prepare Excluded Papers (label=0)
# =============================================================================
# Papers found by PubMed search strategies but NOT in any Cochrane review.

print("EXCLUDED PAPERS (label=0)")
print("=" * 60)

# Excluded already filtered to abstract > 50 chars in notebook 05
excluded = excluded_raw.copy()
print(f"Excluded papers loaded: {len(excluded):,}")

# Standardize columns
excluded_clean = excluded[['review_doi', 'title', 'abstract']].rename(columns={
    'title': 'paper_title',
    'abstract': 'paper_abstract',
})
excluded_clean['label'] = 0
excluded_clean['source'] = 'pubmed_search'

if 'pmid' in excluded.columns:
    excluded_clean['pmid'] = excluded['pmid'].values

print(f"Excluded papers prepared: {len(excluded_clean):,}")
print(f"  Unique reviews: {excluded_clean['review_doi'].nunique():,}")

# =============================================================================
# Combine Included + Excluded
# =============================================================================

print("\n" + "=" * 60)
print("COMBINING DATASETS")
print("=" * 60)

combined = pd.concat([included_clean, excluded_clean], ignore_index=True)
print(f"Combined total: {len(combined):,}")
print(f"  Included (label=1): {(combined['label'] == 1).sum():,}")
print(f"  Excluded (label=0): {(combined['label'] == 0).sum():,}")

EXCLUDED PAPERS (label=0)
Excluded papers loaded: 148,822
Excluded papers prepared: 148,822
  Unique reviews: 18

COMBINING DATASETS
Combined total: 151,956
  Included (label=1): 3,134
  Excluded (label=0): 148,822
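
The concatenation does not itself check whether the same paper appears under both labels for one review; that is presumably handled upstream in notebook 05, which is not shown here. A minimal check, assuming a `pmid` column is present and stored with the same dtype in both sources, might look like the following. `find_label_conflicts` is a hypothetical helper.

```python
import pandas as pd


def find_label_conflicts(df, keys=("review_doi", "pmid")):
    """Return rows whose (review, paper) key appears with more than one label."""
    keyed = df.dropna(subset=list(keys))
    n_labels = keyed.groupby(list(keys))["label"].transform("nunique")
    return keyed[n_labels > 1]


# Toy data: PMID 10 is both referenced (label=1) and search-only (label=0) for review r1.
toy = pd.DataFrame({
    "review_doi": ["r1", "r1", "r1", "r2"],
    "pmid":       [10,   10,   11,   10],
    "label":      [1,    0,    0,    0],
})
conflicts = find_label_conflicts(toy)
print(conflicts)
```

The same PMID under a different review (`r2` here) is not a conflict, since labels are defined per review.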


In [5]:
# =============================================================================
# Join with Review Context (title, abstract, cochrane_group)
# =============================================================================

print("JOINING WITH REVIEW CONTEXT")
print("=" * 60)

# Prepare review context (deduplicated)
review_groups = meta[['doi', 'cochrane_group']].rename(columns={'doi': 'review_doi'})

reviews_clean = reviews[['doi', 'title', 'abstract']].rename(columns={
    'doi': 'review_doi',
    'title': 'review_title',
    'abstract': 'review_abstract',
}).drop_duplicates(subset='review_doi', keep='first')

reviews_clean = reviews_clean.merge(review_groups, on='review_doi', how='left')
print(f"Unique reviews with context: {len(reviews_clean):,}")

# Join combined papers with review context
ground_truth = combined.merge(reviews_clean, on='review_doi', how='inner')
print(f"Papers matched to reviews: {len(ground_truth):,}")

# ── Filter to Public Health group only ─────────────────────────────────────────
ground_truth = ground_truth[ground_truth['cochrane_group'] == 'Public Health']
print(f"After Public Health filter: {len(ground_truth):,}")

# Filter to reviews with meaningful abstracts
ground_truth = ground_truth[
    ground_truth['review_abstract'].notna() &
    (ground_truth['review_abstract'].str.len() > 100)
]
print(f"With review abstract > 100 chars: {len(ground_truth):,}")

# Final column selection
ground_truth = ground_truth[[
    'review_doi',
    'review_title',
    'review_abstract',
    'cochrane_group',
    'paper_title',
    'paper_abstract',
    'label',
    'source',
]].copy()

print(f"\nVALIDATION DATASET STRUCTURE")
print("=" * 60)
print(f"Total examples: {len(ground_truth):,}")
print(f"\nLabel distribution:")
print(f"  Included (label=1): {(ground_truth['label'] == 1).sum():,}")
print(f"  Excluded (label=0): {(ground_truth['label'] == 0).sum():,}")
print(f"  Ratio (excl/incl):  {(ground_truth['label'] == 0).sum() / max((ground_truth['label'] == 1).sum(), 1):.1f}x")
print(f"\nUnique reviews: {ground_truth['review_doi'].nunique():,}")

print("\n" + "=" * 60)
print("SAMPLE ROW (what the LLM will see):")
print("=" * 60)
sample = ground_truth.iloc[0]
print(f"\nREVIEW TITLE: {sample['review_title'][:100]}...")
print(f"\nREVIEW ABSTRACT: {sample['review_abstract'][:300]}...")
print(f"\nPAPER TITLE: {sample['paper_title']}")
print(f"\nPAPER ABSTRACT: {sample['paper_abstract'][:300]}...")
print(f"\nLABEL: {sample['label']} ({'INCLUDED' if sample['label']==1 else 'EXCLUDED'})")
print(f"SOURCE: {sample['source']}")

JOINING WITH REVIEW CONTEXT
Unique reviews with context: 16,647
Papers matched to reviews: 151,956
After Public Health filter: 151,956
With review abstract > 100 chars: 151,956

VALIDATION DATASET STRUCTURE
Total examples: 151,956

Label distribution:
  Included (label=1): 3,134
  Excluded (label=0): 148,822
  Ratio (excl/incl):  47.5x

Unique reviews: 59

SAMPLE ROW (what the LLM will see):

REVIEW TITLE: Interventions for preventing obesity in children....

REVIEW ABSTRACT: See https://doi.org/10.1002/14651858.CD015328.pub2, https://doi.org/10.1002/14651858.CD015330.pub2 and https://doi.org/10.1002/14651858.CD015326.pub2 for more recent reviews that cover this topic. Prevention of childhood obesity is an international public health priority given the significant impact...

PAPER TITLE: Kalèdo, a new educational board-game, gives nutritional rudiments and encourages healthy eating in children: a pilot cluster randomized trial

PAPER ABSTRACT: Prevention of obesity and overweight is an
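
The inner join in the cell above silently drops any paper whose `review_doi` has no matching review abstract. In this run no rows were lost (151,956 before and after), but a left join with `indicator=True` makes such losses visible, and `validate="many_to_one"` guards against duplicated reviews on the right-hand side. The sketch below uses toy frames rather than the notebook's data.

```python
import pandas as pd

papers = pd.DataFrame({
    "review_doi": ["r1", "r1", "r2", "r3"],
    "label":      [1,    0,    0,    1],
})
reviews = pd.DataFrame({
    "review_doi":   ["r1", "r2"],
    "review_title": ["A",  "B"],
})

# validate="many_to_one" raises MergeError if a review_doi repeats in `reviews`.
checked = papers.merge(reviews, on="review_doi", how="left",
                       indicator=True, validate="many_to_one")

# Reviews that have papers but no review context.
unmatched = checked.loc[checked["_merge"] == "left_only", "review_doi"].unique().tolist()
print(unmatched)

# Equivalent to the inner join once unmatched rows have been inspected.
matched = checked[checked["_merge"] == "both"].drop(columns="_merge")
```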

In [6]:
# =============================================================================
# Save Validation Dataset
# =============================================================================

ground_truth.to_csv(OUTPUT_CSV, index=False)

print("GROUND TRUTH VALIDATION DATASET COMPLETE")
print("=" * 60)
print(f"\n✓ Saved to {OUTPUT_CSV.name}")
print(f"  Total examples:    {len(ground_truth):,}")
print(f"  Included (label=1): {(ground_truth['label'] == 1).sum():,}")
print(f"  Excluded (label=0): {(ground_truth['label'] == 0).sum():,}")
print(f"  Unique reviews:    {ground_truth['review_doi'].nunique():,}")

print(f"\nCochrane groups:")
print(ground_truth['cochrane_group'].value_counts().to_string())

print(f"\nSource breakdown:")
print(ground_truth['source'].value_counts().to_string())

print(f"\n✓ Ready for notebook 07 (LLM evaluation)!")

GROUND TRUTH VALIDATION DATASET COMPLETE

✓ Saved to ground_truth_validation_dataset.csv
  Total examples:    151,956
  Included (label=1): 3,134
  Excluded (label=0): 148,822
  Unique reviews:    59

Cochrane groups:
cochrane_group
Public Health    151956

Source breakdown:
source
pubmed_search    148822
cochrane_ref       3134

✓ Ready for notebook 07 (LLM evaluation)!
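
The outputs above show label=0 papers coming from only 18 reviews, while the final dataset spans 59, so at least 41 reviews contain no excluded papers at all. A per-review breakdown such as the one sketched below may be worth inspecting before evaluation. `per_review_counts` is a hypothetical helper and the frame is toy data.

```python
import pandas as pd


def per_review_counts(df):
    """Count label=0 and label=1 rows per review and flag single-class reviews."""
    counts = pd.crosstab(df["review_doi"], df["label"]).reindex(columns=[0, 1], fill_value=0)
    counts.columns = ["n_excluded", "n_included"]
    counts["single_class"] = (counts["n_excluded"] == 0) | (counts["n_included"] == 0)
    return counts


toy = pd.DataFrame({
    "review_doi": ["r1", "r1", "r1", "r2", "r2"],
    "label":      [1,    0,    0,    1,    1],
})
counts = per_review_counts(toy)
print(counts)
```

Reviews flagged as `single_class` cannot contribute to per-review metrics that need both classes, which is one reason to look at this table before computing aggregate scores.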
