# Building the Ground Truth Validation Set

**Summary:** In this notebook, I create an annotated dataset to evaluate how well LLMs can screen papers for systematic reviews. I need both positive examples (papers that were included in a review) and negative examples (papers that were not included).

**What I do:**
1. I load the Cochrane reviews, their references, and the fetched paper abstracts
2. I extract inclusion/exclusion criteria from each Cochrane review abstract
3. For positive examples: I sample papers that appear in a review's reference list, which I treat as having passed screening
4. For negative examples: I sample papers cited by related reviews (reviews that share at least one reference with this one) but NOT by this review, as realistic "near-miss" papers
5. I create a balanced dataset of 1,000 records (500 included, 500 excluded)

**Output:** `ground_truth_validation_set.csv` with review criteria, paper abstracts, and include/exclude labels
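
For anyone consuming the output file downstream, here is a minimal sketch of loading it and checking its basic shape. The column names are the ones used when the records are built later in this notebook; the helper `load_validation_set` and the one-row demo file are hypothetical and only stand in for the real CSV.

```python
import io
import pandas as pd

# Column names taken from the records built later in this notebook.
EXPECTED_COLUMNS = [
    "review_pmid", "review_title", "review_objectives", "review_criteria",
    "paper_pmid", "paper_title", "paper_abstract", "label",
]

def load_validation_set(source) -> pd.DataFrame:
    """Hypothetical helper: read the CSV and check its basic shape."""
    # PMIDs are identifiers, not numbers, so keep them as strings.
    df = pd.read_csv(source, dtype={"review_pmid": str, "paper_pmid": str})
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    if not set(df["label"].unique()) <= {0, 1}:
        raise ValueError("label must be 0 (excluded) or 1 (included)")
    return df

# Tiny made-up file standing in for ground_truth_validation_set.csv
demo_csv = io.StringIO(
    "review_pmid,review_title,review_objectives,review_criteria,"
    "paper_pmid,paper_title,paper_abstract,label\n"
    "00123,Toy review,OBJECTIVES: demo,SELECTION CRITERIA: RCTs,"
    "00456,Toy paper,Some abstract,1\n"
)
demo_df = load_validation_set(demo_csv)
print(demo_df[["review_pmid", "paper_pmid", "label"]])
```

Reading the PMID columns as strings mirrors how the input files are loaded below and avoids losing leading zeros.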

In [1]:
# I load the required libraries and set up file paths
import pandas as pd
import numpy as np
from pathlib import Path
from collections import defaultdict
import re
import random

random.seed(42)
np.random.seed(42)

# Use ./Data if it exists, otherwise fall back to ../Data
DATA_DIR = Path.cwd() / "Data"
if not DATA_DIR.exists():
    DATA_DIR = Path.cwd().parent / "Data"
ABSTRACTS_CSV = DATA_DIR / "cochrane_pubmed_abstracts.csv"
REFERENCES_CSV = DATA_DIR / "cochrane_pubmed_references.csv"
REF_ABSTRACTS_CSV = DATA_DIR / "referenced_paper_abstracts.csv"
OUTPUT_CSV = DATA_DIR / "ground_truth_validation_set.csv"

print(f"Data directory: {DATA_DIR}")

Data directory: c:\Users\juanx\Documents\LSE-UKHSA Project\Data


In [2]:
# I load all three datasets: Cochrane reviews, references, and paper abstracts
print("Loading datasets...")
cochrane = pd.read_csv(ABSTRACTS_CSV, dtype={"pmid": str, "year": str})
refs = pd.read_csv(REFERENCES_CSV, dtype={"citing_pmid": str, "ref_pmid": str})
ref_abstracts = pd.read_csv(REF_ABSTRACTS_CSV, dtype={"pmid": str, "year": str})

print(f"Cochrane reviews: {len(cochrane):,}")
print(f"Reference edges: {len(refs):,}")
print(f"Referenced paper abstracts: {len(ref_abstracts):,}")

Loading datasets...
Cochrane reviews: 17,092
Reference edges: 1,182,678
Referenced paper abstracts: 491,529


In [3]:
# I build mappings: which papers are included in each review, and which reviews include each paper
refs_valid = refs[refs["ref_pmid"].notna() & (refs["ref_pmid"] != "")].copy()
print(f"Valid reference edges (with PMID): {len(refs_valid):,}")

review_to_included = refs_valid.groupby("citing_pmid")["ref_pmid"].apply(set).to_dict()
print(f"Reviews with valid references: {len(review_to_included):,}")

paper_to_reviews = defaultdict(set)
for review, papers in review_to_included.items():
    for paper in papers:
        paper_to_reviews[paper].add(review)

print(f"Unique papers in reference graph: {len(paper_to_reviews):,}")

Valid reference edges (with PMID): 848,607
Reviews with valid references: 10,077
Unique papers in reference graph: 491,531
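
To make the two mappings concrete, here is a minimal sketch in plain Python on a handful of made-up edges. The review and paper IDs are placeholders, not real PMIDs, and the sketch builds both maps in one pass rather than with `groupby`.

```python
from collections import defaultdict

# Made-up (citing review, referenced paper) edges; the IDs are placeholders.
edges = [
    ("R1", "P1"), ("R1", "P2"),
    ("R2", "P2"), ("R2", "P3"),
    ("R3", "P4"),
]

# Forward map: review -> set of papers it cites
review_to_included = defaultdict(set)
# Inverse map: paper -> set of reviews that cite it
paper_to_reviews = defaultdict(set)
for review, paper in edges:
    review_to_included[review].add(paper)
    paper_to_reviews[paper].add(review)

print(dict(review_to_included))
print(dict(paper_to_reviews))
```

In this toy graph, `P2` is the only paper cited by two reviews, so it is the only link between `R1` and `R2`. `R3` is not connected to either.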


In [4]:
# I filter to only keep papers that have abstracts available
has_abstract = ref_abstracts["abstract"].notna() & (ref_abstracts["abstract"] != "")
papers_with_abstracts = set(ref_abstracts.loc[has_abstract, "pmid"])
print(f"Papers with abstracts: {len(papers_with_abstracts):,}")

review_to_included_filtered = {
    review: papers & papers_with_abstracts
    for review, papers in review_to_included.items()
}

MIN_INCLUDED = 5
eligible_reviews = {r: p for r, p in review_to_included_filtered.items() if len(p) >= MIN_INCLUDED}
print(f"Reviews with >= {MIN_INCLUDED} included papers (with abstracts): {len(eligible_reviews):,}")

Papers with abstracts: 443,977
Reviews with >= 5 included papers (with abstracts): 10,013


In [5]:
# I define a function to extract inclusion criteria from Cochrane abstract text
def extract_criteria_from_abstract(abstract: str) -> dict:
    """Extract the objectives and selection criteria sections from a Cochrane abstract."""
    if not abstract or pd.isna(abstract):
        return {"objectives": "", "criteria": "", "full_context": ""}

    result = {"objectives": "", "criteria": "", "full_context": abstract}

    # Match case-insensitively on the original text rather than on an upper-cased
    # copy: str.upper() can change a string's length (e.g. "ß" -> "SS"), which
    # would shift the match offsets relative to the original abstract.
    flags = re.DOTALL | re.IGNORECASE

    criteria_match = re.search(
        r"(SELECTION CRITERIA|ELIGIBILITY CRITERIA|TYPES OF STUDIES|INCLUSION CRITERIA)[:\s]*(.*?)(?=(SEARCH METHODS|DATA COLLECTION|MAIN RESULTS|AUTHORS|$))",
        abstract, flags
    )
    if criteria_match:
        # group(0) keeps the section heading together with its text
        result["criteria"] = criteria_match.group(0).strip()

    obj_match = re.search(
        r"(OBJECTIVES?|RATIONALE|BACKGROUND)[:\s]*(.*?)(?=(SELECTION CRITERIA|SEARCH METHODS|ELIGIBILITY|TYPES OF|DATA COLLECTION|$))",
        abstract, flags
    )
    if obj_match:
        result["objectives"] = obj_match.group(0).strip()

    return result
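
To show the idea behind the section extraction on a small input, here is a minimal sketch with a simplified pattern and a made-up abstract. `CRITERIA_PATTERN` and `toy_abstract` are hypothetical names for illustration. The pattern handles only one heading and returns the text after it, without the heading itself.

```python
import re

# Hypothetical, simplified pattern: capture the text that follows a
# "SELECTION CRITERIA:" heading up to the next known heading or end of text.
CRITERIA_PATTERN = re.compile(
    r"SELECTION CRITERIA[:\s]*(.*?)(?=SEARCH METHODS|DATA COLLECTION|MAIN RESULTS|$)",
    re.DOTALL | re.IGNORECASE,
)

# Made-up abstract in the structured style of a Cochrane review
toy_abstract = (
    "OBJECTIVES: To assess treatment A versus treatment B. "
    "SELECTION CRITERIA: Randomised controlled trials in adults. "
    "DATA COLLECTION AND ANALYSIS: Two authors extracted data."
)

match = CRITERIA_PATTERN.search(toy_abstract)
criteria_text = match.group(1).strip() if match else ""
print(criteria_text)  # Randomised controlled trials in adults.
```

The lazy `(.*?)` combined with the lookahead stops the capture at the first following heading, so text from later sections does not leak into the criteria.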

In [6]:
# I extract criteria for all eligible reviews
cochrane_lookup = cochrane.set_index("pmid").to_dict("index")

review_criteria = {}
for review_pmid in eligible_reviews.keys():
    if review_pmid in cochrane_lookup:
        review_data = cochrane_lookup[review_pmid]
        criteria = extract_criteria_from_abstract(review_data.get("abstract", ""))
        review_criteria[review_pmid] = {
            "title": review_data.get("title", ""),
            "objectives": criteria["objectives"],
            "criteria": criteria["criteria"],
            "full_abstract": review_data.get("abstract", ""),
        }

print(f"Extracted criteria for {len(review_criteria):,} reviews")
has_criteria = sum(1 for v in review_criteria.values() if v["criteria"])
print(f"Reviews with parseable criteria section: {has_criteria:,} ({100*has_criteria/len(review_criteria):.1f}%)")

Extracted criteria for 10,013 reviews
Reviews with parseable criteria section: 9,660 (96.5%)


In [7]:
# I define a function to get negative examples: papers from related reviews that weren't included in this one
def get_negative_candidates(review_pmid: str, included_papers: set) -> list:
    """Get papers included in related reviews but NOT in this review (realistic near-misses)."""
    related_papers = set()
    for paper in included_papers:
        other_reviews = paper_to_reviews.get(paper, set()) - {review_pmid}
        for other_review in other_reviews:
            related_papers.update(review_to_included_filtered.get(other_review, set()))
    
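    negative_candidates = (related_papers - included_papers) & papers_with_abstracts
    # Sort so that seeded sampling sees the same order on every run
    # (iteration order of a set of strings can change between Python sessions)
    return sorted(negative_candidates)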
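
The near-miss logic is easier to follow on a toy graph. Below is a minimal sketch that restates it with explicit arguments instead of notebook-level variables. The function name `near_miss_candidates` and all IDs are made up for illustration.

```python
def near_miss_candidates(review, review_to_papers, paper_to_reviews):
    """Papers cited by reviews that share a reference with `review`,
    excluding the papers `review` itself cites (toy restatement)."""
    own = review_to_papers[review]
    related = set()
    for paper in own:
        # Other reviews citing this paper count as "related"
        for other in paper_to_reviews[paper] - {review}:
            related |= review_to_papers[other]
    return sorted(related - own)

review_to_papers = {
    "R1": {"P1", "P2"},
    "R2": {"P2", "P3"},   # shares P2 with R1
    "R3": {"P4"},         # shares nothing with R1
}
paper_to_reviews = {
    "P1": {"R1"}, "P2": {"R1", "R2"}, "P3": {"R2"}, "P4": {"R3"},
}

print(near_miss_candidates("R1", review_to_papers, paper_to_reviews))  # ['P3']
```

`P3` qualifies for `R1` because `R2` cites it and `R2` shares `P2` with `R1`. `P4` never qualifies, since `R3` has no reference in common with `R1`. A review with no related reviews gets an empty candidate list.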

In [8]:
# I set up sampling parameters and select which reviews to use
N_REVIEWS = 100
POSITIVES_PER_REVIEW = 5
NEGATIVES_PER_REVIEW = 5

reviews_with_criteria = [r for r in eligible_reviews.keys() if r in review_criteria and review_criteria[r]["criteria"]]
reviews_with_enough = [r for r in reviews_with_criteria if len(eligible_reviews[r]) >= POSITIVES_PER_REVIEW]
sampled_reviews = random.sample(reviews_with_enough, min(N_REVIEWS, len(reviews_with_enough)))

print(f"Eligible reviews with criteria: {len(reviews_with_criteria):,}")
print(f"Reviews with >= {POSITIVES_PER_REVIEW} included papers: {len(reviews_with_enough):,}")
print(f"Sampled {len(sampled_reviews)} reviews for validation set")

Eligible reviews with criteria: 9,660
Reviews with >= 5 included papers: 9,660
Sampled 100 reviews for validation set
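
Seeding `random` fixes the sequence of random draws, but not the order of the items being drawn from. Python randomises string hashes per process by default, so iterating over a set of PMID strings can yield a different order from one session to the next. Here is a minimal sketch of one way to guard against that, assuming a hypothetical helper `reproducible_sample` that is not part of this notebook:

```python
import random

def reproducible_sample(items, k, seed=42):
    """Hypothetical helper: sample up to k items with a repeatable result."""
    ordered = sorted(items)      # fix the order before sampling
    rng = random.Random(seed)    # local generator, independent of global state
    return rng.sample(ordered, min(k, len(ordered)))

pmids = {"111", "222", "333", "444", "555", "666"}  # placeholder PMIDs
first = reproducible_sample(pmids, 3)
second = reproducible_sample(set(reversed(sorted(pmids))), 3)
print(first == second)  # True
```

Sorting costs little at this scale and makes the sample depend only on the set's contents and the seed.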


In [9]:
# I generate the validation set by sampling positive and negative examples for each review
paper_lookup = ref_abstracts.set_index("pmid").to_dict("index")
validation_records = []

for review_pmid in sampled_reviews:
    included = sorted(eligible_reviews[review_pmid])  # sorted so seeded sampling is repeatable across sessions
    criteria_data = review_criteria[review_pmid]
    
    pos_sample = random.sample(included, min(POSITIVES_PER_REVIEW, len(included)))
    neg_candidates = get_negative_candidates(review_pmid, set(included))
    neg_sample = random.sample(neg_candidates, min(NEGATIVES_PER_REVIEW, len(neg_candidates))) if neg_candidates else []
    
    # Emit one record per sampled paper: label 1 for included, 0 for excluded
    for sample, label in ((pos_sample, 1), (neg_sample, 0)):
        for paper_pmid in sample:
            if paper_pmid not in paper_lookup:
                continue
            paper = paper_lookup[paper_pmid]
            validation_records.append({
                "review_pmid": review_pmid,
                "review_title": criteria_data["title"],
                "review_objectives": criteria_data["objectives"],
                "review_criteria": criteria_data["criteria"],
                "paper_pmid": paper_pmid,
                "paper_title": paper.get("title", ""),
                "paper_abstract": paper.get("abstract", ""),
                "label": label,
            })

print(f"Generated {len(validation_records):,} validation records")

Generated 1,000 validation records


In [10]:
# I convert to DataFrame and check the label balance
validation_df = pd.DataFrame(validation_records)

print("Validation Set Summary:")
print(f"  Total records: {len(validation_df):,}")
print(f"  Unique reviews: {validation_df['review_pmid'].nunique()}")
print(f"  Unique papers: {validation_df['paper_pmid'].nunique()}")
print("\nLabel distribution:")
print(validation_df["label"].value_counts().rename({1: "Included", 0: "Excluded"}))
print(f"\nBalance: {validation_df['label'].mean():.1%} positive")

Validation Set Summary:
  Total records: 1,000
  Unique reviews: 100
  Unique papers: 993

Label distribution:
label
Included    500
Excluded    500
Name: count, dtype: int64

Balance: 50.0% positive
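
The summary shows 993 unique papers across 1,000 records, so a few papers appear in more than one record. Within a single review, positives and negatives are drawn without replacement from disjoint sets, so any repeats should come from the same paper being sampled under different reviews, possibly with different labels. Here is a minimal sketch of checks that would confirm this, run on toy records. The helper `check_records` is hypothetical and expects a list of dicts shaped like `validation_records`.

```python
from collections import Counter

def check_records(records):
    """Hypothetical sanity checks on a list of validation records."""
    # The same (review, paper) pair should never occur twice
    pairs = Counter((r["review_pmid"], r["paper_pmid"]) for r in records)
    duplicate_pairs = [pair for pair, n in pairs.items() if n > 1]

    # Papers that appear under more than one review
    reviews_per_paper = {}
    for r in records:
        reviews_per_paper.setdefault(r["paper_pmid"], set()).add(r["review_pmid"])
    shared_papers = sorted(p for p, revs in reviews_per_paper.items() if len(revs) > 1)

    # Label counts per review, to spot reviews that fell short of the quota
    per_review = Counter((r["review_pmid"], r["label"]) for r in records)
    return duplicate_pairs, shared_papers, per_review

toy_records = [
    {"review_pmid": "R1", "paper_pmid": "P1", "label": 1},
    {"review_pmid": "R1", "paper_pmid": "P3", "label": 0},
    {"review_pmid": "R2", "paper_pmid": "P3", "label": 1},
    {"review_pmid": "R2", "paper_pmid": "P1", "label": 0},
]
duplicate_pairs, shared_papers, per_review = check_records(toy_records)
print(duplicate_pairs)   # []
print(shared_papers)     # ['P1', 'P3']
```

A paper labelled 1 under one review and 0 under another is not a conflict, because the label is defined relative to a specific review's criteria.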


In [12]:
# I save the validation set to CSV (skip if file already exists)
if OUTPUT_CSV.exists():
    print(f"Validation set already exists at: {OUTPUT_CSV}")
    print(f"File size: {OUTPUT_CSV.stat().st_size / 1024 / 1024:.2f} MB")
    print("Skipping save. Delete the file to regenerate.")
else:
    validation_df.to_csv(OUTPUT_CSV, index=False)
    print(f"Saved validation set to: {OUTPUT_CSV}")
    print(f"File size: {OUTPUT_CSV.stat().st_size / 1024 / 1024:.2f} MB")

Validation set already exists at: c:\Users\juanx\Documents\LSE-UKHSA Project\Data\ground_truth_validation_set.csv
File size: 2.89 MB
Skipping save. Delete the file to regenerate.


In [13]:
# I preview sample records to verify the data looks correct
print("Sample INCLUDED record:")
sample_pos = validation_df[validation_df["label"] == 1].iloc[0]
print(f"Review: {sample_pos['review_title'][:80]}...")
print(f"Criteria: {sample_pos['review_criteria'][:200]}...")
print(f"Paper: {sample_pos['paper_title']}")
print(f"Abstract: {sample_pos['paper_abstract'][:200]}...")

print("\n" + "="*60)
print("\nSample EXCLUDED record:")
sample_neg = validation_df[validation_df["label"] == 0].iloc[0]
print(f"Review: {sample_neg['review_title'][:80]}...")
print(f"Criteria: {sample_neg['review_criteria'][:200]}...")
print(f"Paper: {sample_neg['paper_title']}")
print(f"Abstract: {sample_neg['paper_abstract'][:200]}...")

Sample INCLUDED record:
Review: Dynamic compression plating versus locked intramedullary nailing for humeral sha...
Criteria: SELECTION CRITERIA: Randomised and quasi-randomised controlled trials comparing compression plates and locked intramedullary nail fixation for humeral shaft fractures in adults....
Paper: Operative treatment of humeral shaft fractures.
Abstract: The results of the operative treatment of 27 humeral shaft fractures treated at the University of Louisville during a 2-year period were reviewed. The aim of this study was to analyze 1) the indicatio...


Sample EXCLUDED record:
Review: Dynamic compression plating versus locked intramedullary nailing for humeral sha...
Criteria: SELECTION CRITERIA: Randomised and quasi-randomised controlled trials comparing compression plates and locked intramedullary nail fixation for humeral shaft fractures in adults....
Paper: Functional results following fractures of the proximal humerus. A controlled clinical study comparing two pe