# Understanding the "Near-Miss" Approach for Negative Examples

This notebook provides a detailed, step-by-step explanation of how the validation dataset selects **negative examples** (papers that should be excluded) for evaluating LLM screening performance.

## The Problem

To evaluate how well an LLM can screen papers for systematic reviews, we need:
- **Positive examples**: Papers that SHOULD be included (easy — these are cited in the review)
- **Negative examples**: Papers that SHOULD be excluded (harder — we don't have a list of "rejected" papers)

## The Near-Miss Solution

Instead of using random papers (too easy to distinguish), we use papers from **related reviews** that were NOT included in the target review. These are "near-misses" — papers that are topically similar but don't meet the specific inclusion criteria.

---

In [None]:
# Setup: Load the same data used in the ground truth construction
import pandas as pd
import numpy as np
from pathlib import Path
from collections import defaultdict
import random

random.seed(42)
np.random.seed(42)

# Use ./Data if it exists, otherwise fall back to ../Data
DATA_DIR = Path.cwd() / "Data"
if not DATA_DIR.exists():
    DATA_DIR = Path.cwd().parent / "Data"
ABSTRACTS_CSV = DATA_DIR / "cochrane_pubmed_abstracts.csv"
REFERENCES_CSV = DATA_DIR / "cochrane_pubmed_references.csv"
REF_ABSTRACTS_CSV = DATA_DIR / "referenced_paper_abstracts.csv"

# Load the datasets
cochrane = pd.read_csv(ABSTRACTS_CSV, dtype={"pmid": str, "year": str})
refs = pd.read_csv(REFERENCES_CSV, dtype={"citing_pmid": str, "ref_pmid": str})
ref_abstracts = pd.read_csv(REF_ABSTRACTS_CSV, dtype={"pmid": str, "year": str})

print(f"Loaded {len(cochrane):,} Cochrane reviews")
print(f"Loaded {len(refs):,} reference edges")
print(f"Loaded {len(ref_abstracts):,} paper abstracts")

---

## Step 1: Build the Review-Paper Graph

The key data structure is a **bipartite graph** connecting:
- **Cochrane Reviews** (nodes on one side)
- **Papers** (nodes on the other side)
- **Edges**: A review cites a paper (this notebook treats a citation as meaning the paper was "included" in that review)
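
As a minimal sketch of this structure, the graph can be stored as two dictionaries of sets, one per direction. The edge list below is a made-up toy example: `toy_edges`, `toy_review_to_papers`, `toy_paper_to_reviews`, and the review and paper names are illustrative assumptions, not rows from the dataset.

```python
from collections import defaultdict

# Hypothetical (review, paper) citation edges; the input may contain duplicates
toy_edges = [
    ("Review A", "Paper 1"),
    ("Review A", "Paper 2"),
    ("Review C", "Paper 2"),
    ("Review C", "Paper 6"),
    ("Review A", "Paper 2"),  # duplicate edge, absorbed by the set
]

toy_review_to_papers = defaultdict(set)  # review -> papers it cites
toy_paper_to_reviews = defaultdict(set)  # paper -> reviews that cite it

for review, paper in toy_edges:
    toy_review_to_papers[review].add(paper)
    toy_paper_to_reviews[paper].add(review)

print(sorted(toy_review_to_papers["Review A"]))  # ['Paper 1', 'Paper 2']
print(sorted(toy_paper_to_reviews["Paper 2"]))   # ['Review A', 'Review C']
```

Keeping both directions makes each lookup a single dictionary access: "what does this review include?" and "who else includes this paper?".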

### Visual Representation

```
    REVIEWS                    PAPERS
    ═══════                    ══════
    
    Review A ─────────────────► Paper 1
        │                          │
        ├─────────────────────► Paper 2 ◄──────── Review B
        │                                              │
        └─────────────────────► Paper 3                │
                                   │                   │
    Review B ─────────────────► Paper 4                │
        │                                              │
        └─────────────────────► Paper 5 ◄─────────────┘
                                   │
    Review C ─────────────────► Paper 6
        │
        └─────────────────────► Paper 2  (shared with Review A!)
```

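Notice that **Paper 2** is shared: Review A and Review C both include it, and in the diagram Review B (drawn a second time on the right) points to it as well. Every shared paper creates a "relatedness" connection between the reviews that include it.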

In [None]:
# Build the two key mappings:
# 1. review_to_included: For each review, which papers are included?
# 2. paper_to_reviews: For each paper, which reviews include it?

refs_valid = refs[refs["ref_pmid"].notna() & (refs["ref_pmid"] != "")].copy()

# Mapping 1: Review → Set of included papers
review_to_included = refs_valid.groupby("citing_pmid")["ref_pmid"].apply(set).to_dict()

# Mapping 2: Paper → Set of reviews that include it
paper_to_reviews = defaultdict(set)
for review, papers in review_to_included.items():
    for paper in papers:
        paper_to_reviews[paper].add(review)

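# Count unique (review, paper) pairs; len(refs_valid) would also count duplicate rows
n_edges = sum(len(papers) for papers in review_to_included.values())
print("Built graph with:")
print(f"  - {len(review_to_included):,} reviews")
print(f"  - {len(paper_to_reviews):,} unique papers")
print(f"  - {n_edges:,} edges (citations)")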

---

## Step 2: Finding Related Reviews

Two reviews are "related" if they **share at least one included paper**.

### Example

```
Review A: "Interventions for treating depression after stroke"
  └── Includes papers: {P1, P2, P3, P4, P5}

Review B: "Pharmacological interventions for depression in adults"
  └── Includes papers: {P2, P6, P7, P8, P9}  ← Shares P2 with Review A!

Review C: "Exercise for treating depression"
  └── Includes papers: {P3, P10, P11, P12}   ← Shares P3 with Review A!

Review D: "Treatments for chronic pain"
  └── Includes papers: {P20, P21, P22}       ← No overlap with Review A
```

For Review A:
- **Related reviews**: B (via P2), C (via P3)
- **Unrelated review**: D (no shared papers)
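
The example above can be checked with a few lines of plain Python. This is only a sketch on the toy data: `toy_reviews` and `find_related` are hypothetical names introduced here, and the paper IDs are placeholders.

```python
# Toy data mirroring the example above
toy_reviews = {
    "A": {"P1", "P2", "P3", "P4", "P5"},
    "B": {"P2", "P6", "P7", "P8", "P9"},
    "C": {"P3", "P10", "P11", "P12"},
    "D": {"P20", "P21", "P22"},
}

def find_related(target, reviews):
    """Map each related review to the sorted list of papers it shares with `target`."""
    shared = {}
    for other, papers in reviews.items():
        if other == target:
            continue
        overlap = reviews[target] & papers
        if overlap:  # at least one shared paper makes the two reviews "related"
            shared[other] = sorted(overlap)
    return shared

print(find_related("A", toy_reviews))  # {'B': ['P2'], 'C': ['P3']}
```

This version compares the target against every other review directly, which is fine for four reviews. The notebook code uses the `paper_to_reviews` index instead, so it only touches reviews that actually share a paper.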

In [None]:
# Let's pick a real example and trace through the algorithm

# Find a review with a reasonable number of included papers
papers_with_abstracts = set(ref_abstracts[ref_abstracts["abstract"].notna()]["pmid"])
review_to_included_filtered = {
    r: p & papers_with_abstracts 
    for r, p in review_to_included.items()
}

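# Pick an example review with 10-20 included papers that also has metadata in `cochrane`
known_reviews = set(cochrane["pmid"])
example_reviews = [
    r for r, p in review_to_included_filtered.items()
    if 10 <= len(p) <= 20 and r in known_reviews
]
if not example_reviews:
    # Fall back to any review that has metadata, regardless of size
    example_reviews = [r for r in review_to_included_filtered if r in known_reviews]
EXAMPLE_REVIEW = example_reviews[0]

# Get the review details
review_info = cochrane[cochrane["pmid"] == EXAMPLE_REVIEW].iloc[0]
included_papers = review_to_included_filtered[EXAMPLE_REVIEW]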

print("="*80)
print("EXAMPLE REVIEW")
print("="*80)
print(f"\nPMID: {EXAMPLE_REVIEW}")
print(f"Title: {review_info['title']}")
print(f"\nNumber of included papers: {len(included_papers)}")
print(f"\nSample of included paper PMIDs: {list(included_papers)[:5]}")

In [None]:
# Step 2a: Find which reviews share papers with our example review

related_reviews = set()
paper_connections = defaultdict(list)  # related review -> papers it shares with the example

for paper in included_papers:
    # Which OTHER reviews also include this paper?
    other_reviews = paper_to_reviews.get(paper, set()) - {EXAMPLE_REVIEW}
    for other_review in other_reviews:
        related_reviews.add(other_review)
        paper_connections[other_review].append(paper)

print(f"Found {len(related_reviews)} related reviews that share papers with our example")
print("\n" + "-"*80)
print("Sample of related reviews and their connections:")
print("-"*80)

for i, (related_pmid, shared_papers) in enumerate(list(paper_connections.items())[:5]):
    related_info = cochrane[cochrane["pmid"] == related_pmid]
    if len(related_info) > 0:
        title = related_info.iloc[0]["title"][:70]
        print(f"\n{i+1}. {related_pmid}: {title}...")
        print(f"   Shared papers: {len(shared_papers)} (PMIDs: {shared_papers[:3]}...)")

---

## Step 3: The Near-Miss Algorithm

Now we can construct the **negative candidates** for our example review:

### Algorithm

```
For a target review R with included papers {P1, P2, P3}:

1. Find all papers included in R: included_papers = {P1, P2, P3}

2. For each included paper, find other reviews that also include it:
   - P1 is in: {R, Review_X}
   - P2 is in: {R, Review_Y, Review_Z}
   - P3 is in: {R, Review_Y}
   
3. Collect ALL papers from those related reviews:
   - Review_X includes: {P1, P10, P11, P12}
   - Review_Y includes: {P2, P3, P20, P21}
   - Review_Z includes: {P2, P30, P31, P32}
   
   related_papers = {P1, P2, P3, P10, P11, P12, P20, P21, P30, P31, P32}

4. Remove papers already included in R:
   negative_candidates = related_papers - included_papers
                       = {P10, P11, P12, P20, P21, P30, P31, P32}

5. Sample from negative_candidates for the validation set
```
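
The walkthrough above can be run on its own toy data. The sketch below uses hypothetical names (`toy_included`, `toy_index`, `toy_negative_candidates`) and follows steps 1 to 4 only; sampling and the abstract filter are left out.

```python
from collections import defaultdict

# Toy graph from the walkthrough: review -> included papers
toy_included = {
    "R": {"P1", "P2", "P3"},
    "Review_X": {"P1", "P10", "P11", "P12"},
    "Review_Y": {"P2", "P3", "P20", "P21"},
    "Review_Z": {"P2", "P30", "P31", "P32"},
}

# Inverted index: paper -> reviews that include it
toy_index = defaultdict(set)
for review, papers in toy_included.items():
    for paper in papers:
        toy_index[paper].add(review)

def toy_negative_candidates(target):
    """Papers from reviews that share a paper with `target`, minus target's own papers."""
    included = toy_included[target]
    related_papers = set()
    for paper in included:
        for other in toy_index[paper] - {target}:
            related_papers |= toy_included[other]
    return related_papers - included

toy_result = toy_negative_candidates("R")
print(sorted(toy_result, key=lambda p: int(p[1:])))
# ['P10', 'P11', 'P12', 'P20', 'P21', 'P30', 'P31', 'P32']
```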

### Why This Works

The negative candidates are papers that:
- ✅ Are from the same general **medical domain** (depression treatments, cancer screening, etc.)
- ✅ Were **included in some systematic review** (so they're legitimate research papers)
- ❌ Were **NOT included in our target review** (for some reason: wrong population, wrong intervention, wrong study design, etc.)

This creates realistic "hard negatives" that an LLM must carefully evaluate against the specific inclusion criteria.
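
Step 5 of the algorithm (sampling) is not shown in this notebook. One way it might look is sketched below. The function name `sample_negatives`, the default of one negative per positive, and the seed are all assumptions for illustration, not the settings used in `03_build_ground_truth.ipynb`. Candidates are sorted before sampling because the iteration order of a set of strings can differ between Python sessions, which would make a seeded sample irreproducible.

```python
import random

def sample_negatives(candidates, n_positives, ratio=1.0, seed=42):
    """Draw up to ratio * n_positives negatives in a reproducible way."""
    pool = sorted(candidates)  # stable order regardless of set hashing
    k = min(len(pool), int(round(ratio * n_positives)))
    rng = random.Random(seed)  # local RNG: leaves the global random state untouched
    return rng.sample(pool, k)

toy_candidates = {"P10", "P11", "P12", "P20", "P21", "P30", "P31", "P32"}
negatives = sample_negatives(toy_candidates, n_positives=3)
print(len(negatives))  # 3
```

When a review has fewer candidates than requested, the function returns all of them rather than raising an error.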

In [None]:
# Implement the near-miss algorithm step by step

def get_negative_candidates_detailed(review_pmid: str, included_papers: set, verbose=True):
    """
    Get papers from related reviews that weren't included in this review.
    This version includes detailed logging for understanding the process.
    """
    if verbose:
        print(f"\n{'='*80}")
        print(f"FINDING NEGATIVE CANDIDATES FOR REVIEW {review_pmid}")
        print(f"{'='*80}")
        print(f"\nStep 1: This review includes {len(included_papers)} papers")
    
    # Step 2: Find related reviews through shared papers
    related_reviews = set()
    for paper in included_papers:
        other_reviews = paper_to_reviews.get(paper, set()) - {review_pmid}
        related_reviews.update(other_reviews)
    
    if verbose:
        print(f"\nStep 2: Found {len(related_reviews)} related reviews (share at least 1 paper)")
    
    # Step 3: Collect all papers from related reviews
    related_papers = set()
    for other_review in related_reviews:
        related_papers.update(review_to_included_filtered.get(other_review, set()))
    
    if verbose:
        print(f"\nStep 3: Related reviews collectively include {len(related_papers)} papers")
    
    # Step 4: Remove papers already in this review
    negative_candidates = related_papers - included_papers
    
    # Step 5: Keep only papers with abstracts
    negative_candidates = negative_candidates & papers_with_abstracts
    
    if verbose:
        # This count is taken after BOTH step 4 and step 5 have been applied
        print(f"\nSteps 4-5: After removing included papers and papers without abstracts: {len(negative_candidates)} candidates")
        print("\n✓ These are the 'near-miss' papers: topically similar but NOT in this review")
    
    return list(negative_candidates)

# Run the algorithm on our example
negative_candidates = get_negative_candidates_detailed(EXAMPLE_REVIEW, included_papers)

In [None]:
# Let's look at some example negative candidates and compare them to included papers

print("="*80)
print("COMPARISON: INCLUDED vs NEAR-MISS PAPERS")
print("="*80)

# Get abstracts for comparison.
# to_dict("index") requires a unique index, so drop duplicate PMIDs first.
abstract_lookup = ref_abstracts.drop_duplicates(subset="pmid").set_index("pmid").to_dict("index")

def show_papers(pmids, header):
    """Print the title and the start of the abstract for each PMID found in abstract_lookup."""
    print("\n" + "-"*80)
    print(header)
    print("-"*80)
    for i, pmid in enumerate(pmids):
        if pmid in abstract_lookup:
            paper = abstract_lookup[pmid]
            title = paper.get("title")
            abstract = paper.get("abstract")
            # Missing CSV fields load as NaN (a float), which cannot be sliced
            title = title if isinstance(title, str) else "N/A"
            print(f"\n{i+1}. PMID {pmid}")
            print(f"   Title: {title[:80]}")
            if isinstance(abstract, str):
                print(f"   Abstract: {abstract[:150]}...")

show_papers(list(included_papers)[:3], "INCLUDED PAPERS (Label = 1, should be included)")
show_papers(negative_candidates[:3], "NEAR-MISS PAPERS (Label = 0, should be excluded)")

---

## Visual Diagram: The Complete Process

```
                         ┌─────────────────────────────────────────────────────────┐
                         │              TARGET REVIEW: "Depression after stroke"   │
                         │                                                         │
                         │  INCLUDED PAPERS: {P1, P2, P3, P4, P5}                  │
                         │                    ↓                                    │
                         │  Selection Criteria: "RCTs of pharmacological           │
                         │  interventions for depression in stroke patients"       │
                         └─────────────────────────────────────────────────────────┘
                                              │
                                              │ Papers P2 and P3 also appear in:
                                              ▼
     ┌────────────────────────────────────────┴────────────────────────────────────────┐
     │                                                                                  │
     ▼                                                                                  ▼
┌─────────────────────────────────────┐                    ┌─────────────────────────────────────┐
│  RELATED REVIEW A:                  │                    │  RELATED REVIEW B:                  │
│  "Antidepressants for elderly"      │                    │  "Exercise for depression"          │
│                                     │                    │                                     │
│  PAPERS: {P2, P10, P11, P12}        │                    │  PAPERS: {P3, P20, P21, P22}        │
│           ↑                         │                    │           ↑                         │
│      shared!                        │                    │      shared!                        │
└─────────────────────────────────────┘                    └─────────────────────────────────────┘
              │                                                           │
              │                                                           │
              └──────────────────────────┬────────────────────────────────┘
                                         │
                                         ▼
                    ┌────────────────────────────────────────────┐
                    │  NEGATIVE CANDIDATES (Near-Misses):        │
                    │                                            │
                    │  {P10, P11, P12, P20, P21, P22}            │
                    │                                            │
                    │  These papers are:                         │
                    │  ✓ From related medical topics             │
                    │  ✓ High-quality (included in SOME review)  │
                    │  ✗ NOT in target review (don't match       │
                    │   specific criteria like "stroke patients")│
                    └────────────────────────────────────────────┘
```

---

## Why Near-Misses Are "Hard" Negatives

Consider this example:

| Review | Criteria | Paper | Why Included/Excluded |
|--------|----------|-------|----------------------|
| "SSRIs for depression in stroke patients" | RCTs, stroke survivors, SSRI treatment | "Fluoxetine for post-stroke depression" | ✅ **INCLUDED** - RCT, stroke patients, SSRI |
| | | "Fluoxetine for depression in elderly" | ❌ **NEAR-MISS** - RCT, SSRI, but NOT stroke patients |
| | | "Exercise therapy for post-stroke depression" | ❌ **NEAR-MISS** - RCT, stroke patients, but NOT pharmacological |
| | | "A review of pizza recipes" | ❌ **RANDOM** - Obviously unrelated (too easy!) |

The near-miss papers require the LLM to **carefully read the criteria** and identify the specific reason for exclusion. Random papers would be too easy to reject.

In [None]:
# Summary statistics on the near-miss approach

print("="*80)
print("SUMMARY: NEAR-MISS STATISTICS")
print("="*80)

# Calculate statistics across all reviews
negative_counts = []

for review_pmid, included in review_to_included_filtered.items():
    if len(included) >= 5:  # Only reviews with enough included papers
        negatives = get_negative_candidates_detailed(review_pmid, included, verbose=False)
        negative_counts.append(len(negatives))

total_negatives = sum(negative_counts)
reviews_with_negatives = sum(1 for n in negative_counts if n > 0)

print(f"\nReviews analyzed: {len(negative_counts):,}")
if negative_counts:  # Guard: the percentage and np.min/np.max fail on an empty list
    print(f"Reviews with ≥1 negative candidate: {reviews_with_negatives:,} ({100*reviews_with_negatives/len(negative_counts):.1f}%)")
    print(f"Total negative candidates (summed over reviews): {total_negatives:,}")
    print(f"\nNegative candidates per review:")
    print(f"  Mean: {np.mean(negative_counts):.1f}")
    print(f"  Median: {np.median(negative_counts):.1f}")
    print(f"  Min: {np.min(negative_counts)}")
    print(f"  Max: {np.max(negative_counts):,}")

---

## Limitations of the Near-Miss Approach

### What It Does Well ✅
- Creates **realistic negative examples** that are topically similar
- Uses papers that are **legitimate research** (included in some systematic review)
- Forces the LLM to **carefully evaluate criteria** rather than just topic matching

### Limitations ⚠️

1. **Not "True" Excluded Papers**
   - These papers weren't explicitly considered and rejected for the target review
   - We don't know the actual reason they're not included
   - Some might actually meet the criteria if evaluated

2. **Selection Bias**
   - Only papers included in OTHER reviews are candidates
   - Papers that were screened and rejected from ALL reviews are not captured

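3. **Topic Drift**
   - Related reviews might be on somewhat different topics, because a single shared paper is enough to link two reviews
   - E.g., a review on "stroke" might share one trial with a broad review on "cardiovascular disease", which then contributes candidates about other conditions such as diabetes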
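
One possible way to limit topic drift, which is not part of the pipeline described here, is to require more than one shared paper before treating two reviews as related. The sketch below assumes a hypothetical `min_shared` threshold and made-up toy data (`drift_reviews`). A suitable value would need to be checked against the real graph, since a higher threshold also shrinks the pool of negative candidates.

```python
# Toy data: one closely related review and one loosely related review
drift_reviews = {
    "stroke": {"P1", "P2", "P3"},
    "post_stroke_rehab": {"P1", "P2", "P40"},   # shares two papers with "stroke"
    "cardiovascular": {"P3", "P50", "P51"},     # shares only one paper
}

def related_with_threshold(target, included, min_shared=2):
    """Return the reviews that share at least `min_shared` papers with `target`."""
    target_papers = included[target]
    return {
        other
        for other, papers in included.items()
        if other != target and len(target_papers & papers) >= min_shared
    }

print(sorted(related_with_threshold("stroke", drift_reviews, min_shared=1)))
# ['cardiovascular', 'post_stroke_rehab']
print(sorted(related_with_threshold("stroke", drift_reviews, min_shared=2)))
# ['post_stroke_rehab']
```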

### Better Alternative: Actual Excluded Studies

Cochrane reviews include a **"Characteristics of Excluded Studies"** table that lists:
- Papers that were **explicitly screened and rejected**
- The **reason for exclusion** (wrong population, wrong intervention, etc.)

This would provide true ground-truth negatives, but requires access to full Cochrane Library data.

In [None]:
# Final summary: The complete code used in the validation set construction

print("="*80)
print("THE FINAL FUNCTION USED IN 03_build_ground_truth.ipynb")
print("="*80)

code = '''
def get_negative_candidates(review_pmid: str, included_papers: set) -> list:
    """
    Get papers included in related reviews but NOT in this review.
    These are "near-miss" papers - topically similar but not matching
    the specific inclusion criteria of the target review.
    
    Algorithm:
    1. For each paper included in this review, find other reviews that include it
    2. Collect all papers from those related reviews
    3. Remove papers already included in this review
    4. Keep only papers with available abstracts
    
    Returns: List of PMIDs for negative candidate papers
    """
    # Step 1 & 2: Find all papers from related reviews
    related_papers = set()
    for paper in included_papers:
        # Find other reviews that also include this paper
        other_reviews = paper_to_reviews.get(paper, set()) - {review_pmid}
        # Add all papers from those related reviews
        for other_review in other_reviews:
            related_papers.update(review_to_included_filtered.get(other_review, set()))
    
    # Step 3: Remove papers already in this review
    negative_candidates = related_papers - included_papers
    
    # Step 4: Keep only papers with abstracts available
    negative_candidates = negative_candidates & papers_with_abstracts
    
    return list(negative_candidates)
'''

print(code)

---

## Summary

The **near-miss approach** creates negative examples by:

1. **Finding related reviews** through shared citations (bipartite graph traversal)
2. **Collecting papers from those reviews** that are NOT in the target review
3. **Using these as negative examples** for LLM evaluation

This creates **challenging test cases** where the LLM must carefully evaluate inclusion criteria rather than just matching topics.

---

*Notebook created to explain the methodology in `03_build_ground_truth.ipynb`*