# 05: Execute PubMed Searches & Identify Excluded Papers

## Objective
Execute the translated PubMed search queries (from notebook 03) against the PubMed API
to identify papers that were retrieved by systematic review searches but are **NOT** referenced
in any Cochrane review. These form the **excluded** set for ground truth evaluation.

## Revised Ground Truth Logic
- **Included (label=1)**: ALL papers referenced in any Cochrane review
  (regardless of internal categorization as included/excluded/awaiting/ongoing)
- **Excluded (label=0)**: Papers that appear in PubMed search results but are
  NOT referenced in any Cochrane review
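
The labeling rule above reduces to set arithmetic on PMIDs. A minimal sketch with made-up PMIDs follows; `build_labels` is a hypothetical helper used only for illustration, not a function defined in this notebook.

```python
def build_labels(cochrane_pmids, search_pmids):
    """Map each PMID to its ground-truth label.

    1: referenced in any Cochrane review (whatever its internal category)
    0: returned by a PubMed search but never referenced in a review
    """
    cochrane = set(cochrane_pmids)
    labels = {pmid: 1 for pmid in cochrane}
    labels.update({pmid: 0 for pmid in set(search_pmids) - cochrane})
    return labels


# Made-up PMIDs: "102" is both retrieved and referenced, so it stays label 1
example_labels = build_labels(
    cochrane_pmids={"102", "999"},
    search_pmids={"101", "102", "103"},
)
print(sorted(example_labels.items()))
```

A paper that is both retrieved and referenced is never labeled 0, because the set difference removes it from the excluded side.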

## Pipeline
1. Load translated search strategies from `search_strategies.csv`
2. Build "known papers" set from Cochrane reviews (PMIDs from notebooks 03 + 04)
3. Execute each PubMed query via NCBI Entrez API
4. Collect PMIDs from search results, track source review
5. Filter out papers that appear in any Cochrane review
6. Fetch abstracts for excluded papers
7. Save results

## Input Files
- `Data/search_strategies.csv` — Translated PubMed queries (from notebook 03)
- `Data/categorized_references.csv` — References extracted from PDFs (notebook 03)
- `Data/referenced_paper_abstracts.csv` — Matched references with PMIDs (notebook 04)
- `Data/doi_pmid_cache.csv` — DOI→PMID mappings (notebook 04)

## Output Files
- `Data/pubmed_search_results.csv` — Search execution log (counts, errors, PMIDs per query)
- `Data/pubmed_excluded_abstracts.csv` — Abstracts for excluded-only papers

In [1]:
%pip install -q biopython pandas tqdm

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np
import os
import time
import json
import re
from pathlib import Path
from tqdm.notebook import tqdm
from Bio import Entrez
from collections import Counter, defaultdict

# =============================================================================
# Configuration
# =============================================================================

# NCBI E-utilities allow 3 requests/sec without an API key and 10 requests/sec with one
Entrez.email = os.environ.get("NCBI_EMAIL") or None      # None lets Biopython warn that no email is set
Entrez.api_key = os.environ.get("NCBI_API_KEY") or None

NCBI_RATE = 0.11 if Entrez.api_key else 0.34  # seconds between requests
SEARCH_RETMAX = 10000  # max PMIDs to retrieve per query

# Paths
notebook_dir = Path.cwd()
project_root = notebook_dir if (notebook_dir / "Data").exists() else notebook_dir.parent
DATA_DIR = project_root / "Data"

# Input files
STRATEGIES_CSV  = DATA_DIR / "search_strategies.csv"
REFS_CSV        = DATA_DIR / "categorized_references.csv"
ABSTRACTS_CSV   = DATA_DIR / "referenced_paper_abstracts.csv"
DOI_PMID_CACHE  = DATA_DIR / "doi_pmid_cache.csv"

# Output files
SEARCH_RESULTS_CSV    = DATA_DIR / "pubmed_search_results.csv"
SEARCH_PROGRESS_CSV   = DATA_DIR / "pubmed_search_progress.csv"
EXCLUDED_ABSTRACTS_CSV = DATA_DIR / "pubmed_excluded_abstracts.csv"

print(f"Data directory: {DATA_DIR}")
print(f"NCBI API key configured: {bool(Entrez.api_key)}")
print(f"Rate limit: {1/NCBI_RATE:.0f} requests/sec")
print(f"Max PMIDs per query: {SEARCH_RETMAX:,}")

Data directory: c:\Users\juanx\Documents\LSE-UKHSA Project\Data
NCBI API key configured: True
Rate limit: 9 requests/sec
Max PMIDs per query: 10,000


In [3]:
# =============================================================================
# Load Search Strategies
# =============================================================================

strategies = pd.read_csv(STRATEGIES_CSV)
print(f"Total strategies: {len(strategies):,}")
print(f"\nTranslation notes:")
print(strategies['translation_notes'].value_counts().head(10).to_string())

# Filter to successfully translated queries
translated = strategies[strategies['pubmed_query'].notna()].copy()
print(f"\nTranslated queries: {len(translated):,}")

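# Deduplicate identical queries (different versions of the same review often share a search)
# Use a content digest rather than built-in hash(): str hashes are salted per process,
# so hash() values would not match the checkpoint file after a kernel restart.
import hashlib
translated['query_hash'] = translated['pubmed_query'].apply(
    lambda q: hashlib.sha1(str(q).encode('utf-8')).hexdigest())
unique_queries = translated.drop_duplicates(subset='query_hash').copy()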

# Build mapping: query_hash → list of review DOIs
query_to_reviews = translated.groupby('query_hash')['doi'].apply(list).to_dict()

print(f"Unique queries after dedup: {len(unique_queries):,}")
print(f"  (covers {len(translated):,} review-strategy pairs)")
print(f"\nQuery length stats:")
print(unique_queries['pubmed_query'].str.len().describe().to_string())

Total strategies: 16,906

Translation notes:
translation_notes
narrative_only                                       8785
success                                              1921
adjacency_converted_to_AND                           1657
filtered_not_a_query                                 1329
mp_approximated; adjacency_converted_to_AND           651
mp_approximated                                       561
no_numbered_lines                                     548
wildcard_approximated; adjacency_converted_to_AND     249
adjacency_converted_to_AND; wildcard_approximated     213
wildcard_approximated                                 187

Translated queries: 6,197
Unique queries after dedup: 4,869
  (covers 6,197 review-strategy pairs)

Query length stats:
count    4869.000000
mean      462.298008
std       485.262759
min         6.000000
25%       117.000000
50%       275.000000
75%       674.000000
max      3970.000000
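
The deduplication step keys every strategy by a digest of its query text and then groups review DOIs under that key. The sketch below replays the idea on three invented strategies. It uses a `hashlib` digest as the key, on the assumption that the key should be identical across kernel restarts (the built-in `hash()` of a string is salted per process by default). The DOIs, queries, and the `stable_query_hash` name are all made up for illustration.

```python
import hashlib

import pandas as pd


def stable_query_hash(query: str) -> str:
    """Digest of the query text that is the same in every Python process."""
    return hashlib.sha1(query.encode("utf-8")).hexdigest()


# Invented strategies: two versions of one review share a query
toy = pd.DataFrame({
    "doi": ["10.1002/A.pub2", "10.1002/A.pub3", "10.1002/B"],
    "pubmed_query": ["asthma AND steroids", "asthma AND steroids", "stroke[tiab]"],
})
toy["query_hash"] = toy["pubmed_query"].apply(stable_query_hash)

# One row per distinct query, plus a lookup from query key to every review using it
unique_toy = toy.drop_duplicates(subset="query_hash")
toy_query_to_reviews = toy.groupby("query_hash")["doi"].apply(list).to_dict()

shared_key = stable_query_hash("asthma AND steroids")
print(len(unique_toy), toy_query_to_reviews[shared_key])
```

Each unique query is executed once, and its results can later be attributed to every review in the lookup.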


In [4]:
# =============================================================================
# Build Known PMID Set (papers already in Cochrane reviews)
# =============================================================================
# Union of all PMIDs from: direct extraction (notebook 03), CrossRef matching
# (notebook 04), and DOI→PMID cache (notebook 04).

known_pmids = set()

# Source 1: Direct PMIDs from reference extraction
if REFS_CSV.exists():
    refs = pd.read_csv(REFS_CSV, usecols=['pmid'], dtype=str)
    direct = {p.strip() for p in refs['pmid'].dropna() if p.strip().isdigit()}
    known_pmids |= direct
    print(f"Source 1 — Direct extraction:    {len(direct):>8,} PMIDs")

# Source 2: Matched PMIDs from abstract fetching
if ABSTRACTS_CSV.exists():
    abs_df = pd.read_csv(ABSTRACTS_CSV, usecols=['pmid'], dtype=str)
    matched = {p.strip() for p in abs_df['pmid'].dropna() if p.strip().isdigit()}
    known_pmids |= matched
    print(f"Source 2 — Matched abstracts:    {len(matched):>8,} PMIDs")

# Source 3: DOI→PMID cache
if DOI_PMID_CACHE.exists():
    cache = pd.read_csv(DOI_PMID_CACHE, dtype=str)
    cached = {p.strip() for p in cache[cache['pmid'] != 'NO_PMID']['pmid'].dropna() if p.strip().isdigit()}
    known_pmids |= cached
    print(f"Source 3 — DOI→PMID cache:      {len(cached):>8,} PMIDs")

print(f"\n{'='*50}")
print(f"Total known PMIDs (in Cochrane reviews): {len(known_pmids):,}")

Source 1 — Direct extraction:      34,881 PMIDs
Source 2 — Matched abstracts:      23,182 PMIDs
Source 3 — DOI→PMID cache:        23,182 PMIDs

Total known PMIDs (in Cochrane reviews): 56,814
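
All three sources pass through the same cleaning step: strip whitespace, keep digit-only strings, and union the results. A minimal sketch of that step as one reusable function, run on made-up values; `clean_pmids` is a hypothetical name, and `NO_PMID` is the sentinel used in the cache file.

```python
import pandas as pd


def clean_pmids(series: pd.Series) -> set:
    """Return the set of well-formed PMIDs (digit-only strings) in a column."""
    stripped = series.dropna().astype(str).str.strip()
    return set(stripped[stripped.str.isdigit()])


# Made-up columns standing in for the three input files
toy_direct = clean_pmids(pd.Series(["123", " 456 ", None, "n/a"]))
toy_matched = clean_pmids(pd.Series(["456", "789"]))
toy_cached = clean_pmids(pd.Series(["789", "NO_PMID"]))

toy_known = toy_direct | toy_matched | toy_cached
print(sorted(toy_known))
```

Because the digit check already rejects `NO_PMID`, the sentinel needs no separate filter in this version.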


In [None]:
# =============================================================================
# PubMed Search & Abstract Fetch Functions
# =============================================================================

def execute_pubmed_search(query, retmax=10000, max_retries=3):
    """Execute a PubMed search and return PMIDs + total count.

    Returns:
        dict with keys: pmids (list[str]), count (int), error (str|None),
              retmax_hit (bool — True if count > retmax, results truncated)
    """
    for attempt in range(max_retries):
        try:
            handle = Entrez.esearch(
                db="pubmed",
                term=query,
                retmax=retmax,
                usehistory="y",
            )
            results = Entrez.read(handle)
            handle.close()

            count = int(results.get('Count', 0))
            pmids = [str(p) for p in results.get('IdList', [])]

            return {
                'pmids': pmids,
                'count': count,
                'error': None,
                'retmax_hit': count > retmax,
            }
        except Exception as e:
            if attempt < max_retries - 1:
                time.sleep(2 ** (attempt + 1))
            else:
                return {'pmids': [], 'count': 0, 'error': str(e), 'retmax_hit': False}


def extract_pubmed_record(record):
    """Extract key fields from a PubMed XML article record."""
    try:
        cit = record['MedlineCitation']
        art = cit['Article']
        pmid = str(cit['PMID'])
        title = str(art.get('ArticleTitle', ''))

        # Abstract
        abstract = ''
        if 'Abstract' in art and 'AbstractText' in art['Abstract']:
            parts = art['Abstract']['AbstractText']
            abstract = ' '.join(str(p) for p in parts) if isinstance(parts, list) else str(parts)

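        # Year (fall back to the first 4-digit year in MedlineDate, e.g. "1998 Dec-1999 Jan")
        year = ''
        if 'Journal' in art and 'JournalIssue' in art['Journal']:
            pub_date = art['Journal']['JournalIssue'].get('PubDate', {})
            year = str(pub_date.get('Year', ''))
            if not year:
                m = re.search(r'\d{4}', str(pub_date.get('MedlineDate', '')))
                year = m.group(0) if m else ''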

        # Authors
        authors = []
        if 'AuthorList' in art:
            for auth in art['AuthorList']:
                if 'LastName' in auth:
                    name = auth['LastName']
                    if auth.get('Initials'):
                        name += ' ' + auth['Initials']
                    authors.append(name)

        # DOI
        doi = ''
        if 'ELocationID' in art:
            for loc in art['ELocationID']:
                if hasattr(loc, 'attributes') and loc.attributes.get('EIdType') == 'doi':
                    doi = str(loc)
                    break

        return {'pmid': pmid, 'title': title, 'abstract': abstract,
                'year': year, 'authors': '; '.join(authors), 'doi': doi}
    except Exception:
        return None


def fetch_abstracts_batch(pmids, batch_size=200, max_retries=3):
    """Batch-fetch PubMed records. Returns {pmid: record_dict}."""
    results = {}
    pmid_list = [str(p) for p in pmids if str(p).isdigit()]

    batches = range(0, len(pmid_list), batch_size)
    for i in tqdm(batches, desc="Fetching abstracts"):
        batch = pmid_list[i:i + batch_size]

        for attempt in range(max_retries):
            try:
                time.sleep(NCBI_RATE * (attempt + 1))
                handle = Entrez.efetch(
                    db="pubmed", id=",".join(batch),
                    rettype="xml", retmode="xml"
                )
                records = Entrez.read(handle)
                handle.close()

                for article in records.get('PubmedArticle', []):
                    data = extract_pubmed_record(article)
                    if data:
                        results[data['pmid']] = data
                break  # success
            except Exception as e:
                if attempt == max_retries - 1:
                    print(f"  Batch {i//batch_size + 1} failed after {max_retries} attempts: {e}")

    return results


print("Functions defined:")
print("  • execute_pubmed_search() — Run a PubMed query, get PMIDs")
print("  • fetch_abstracts_batch()  — Batch fetch abstracts by PMID")

Functions defined:
  • execute_pubmed_search() — Run a PubMed query, get PMIDs
  • fetch_abstracts_batch()  — Batch fetch abstracts by PMID
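
Both functions retry failed requests with a growing delay. The pattern can be isolated as in the sketch below. `call_with_retries` is a hypothetical helper, not part of Biopython, and its `sleep` parameter exists only so the example runs without real waiting.

```python
import time


def call_with_retries(fn, max_retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fn() up to max_retries times, doubling the delay after each failure.

    Returns (result, None) on success, or (None, error_message) once every
    attempt has failed.
    """
    last_error = None
    for attempt in range(max_retries):
        try:
            return fn(), None
        except Exception as exc:  # e.g. network errors or malformed responses
            last_error = str(exc)
            if attempt < max_retries - 1:
                sleep(base_delay * 2 ** attempt)
    return None, last_error


# Simulated flaky call: fails twice, then succeeds
attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


delays = []  # records the requested delays instead of sleeping
result, error = call_with_retries(flaky, sleep=delays.append)
print(result, error, delays)
```

Returning the error message instead of raising keeps a long batch run going, at the cost of having to check the error field afterwards.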


In [6]:
# =============================================================================
# Execute PubMed Searches (with checkpoint every 100 queries)
# =============================================================================

print("EXECUTING PUBMED SEARCHES")
print("=" * 60)

# Resume from checkpoint if available
if SEARCH_PROGRESS_CSV.exists():
    progress = pd.read_csv(SEARCH_PROGRESS_CSV)
    done_hashes = set(progress['query_hash'])
    print(f"Resuming from checkpoint: {len(done_hashes):,} queries already done")
else:
    progress = pd.DataFrame()
    done_hashes = set()

remaining = unique_queries[~unique_queries['query_hash'].isin(done_hashes)]
print(f"Queries to execute: {len(remaining):,}")
# Lower bound only: counts the rate-limit sleeps, not request latency or download time
print(f"Estimated time: ~{len(remaining) * NCBI_RATE / 60:.0f} min")

# --- Main loop ---
batch_results = []
total_pmids = 0
errors = 0
capped = 0
start_time = time.time()

for idx, (_, row) in enumerate(tqdm(remaining.iterrows(),
                                     total=len(remaining), desc="PubMed Search")):
    query = row['pubmed_query']
    review_dois = query_to_reviews.get(row['query_hash'], [row['doi']])

    result = execute_pubmed_search(query, retmax=SEARCH_RETMAX)
    time.sleep(NCBI_RATE)

    batch_results.append({
        'query_hash': row['query_hash'],
        'review_doi': row['doi'],
        'all_review_dois': '|'.join(review_dois),
        'query_length': len(query),
        'result_count': result['count'],
        'pmids_retrieved': len(result['pmids']),
        'retmax_hit': result['retmax_hit'],
        'error': result['error'],
        'pmids_json': json.dumps(result['pmids']),
    })

    total_pmids += len(result['pmids'])
    if result['error']:
        errors += 1
    if result['retmax_hit']:
        capped += 1

    # Checkpoint every 100 queries
    if (idx + 1) % 100 == 0:
        batch_df = pd.DataFrame(batch_results)
        combined = pd.concat([progress, batch_df], ignore_index=True)
        combined.to_csv(SEARCH_PROGRESS_CSV, index=False)
        elapsed = time.time() - start_time
        rate = (idx + 1) / elapsed * 60
        print(f"  [{idx+1:,}/{len(remaining):,}] "
              f"{total_pmids:,} PMIDs | {errors} errors | "
              f"{capped} capped | {rate:.0f} q/min")

# Final save
if batch_results:
    batch_df = pd.DataFrame(batch_results)
    combined = pd.concat([progress, batch_df], ignore_index=True)
    combined.to_csv(SEARCH_PROGRESS_CSV, index=False)

elapsed = time.time() - start_time
print(f"\n{'='*60}")
print(f"Complete: {len(remaining):,} queries in {elapsed/60:.1f} min")
print(f"  Total PMIDs retrieved: {total_pmids:,}")
print(f"  Errors: {errors:,}")
print(f"  Queries exceeding retmax ({SEARCH_RETMAX:,}): {capped:,}")

EXECUTING PUBMED SEARCHES
Queries to execute: 4,869
Estimated time: ~9 min


PubMed Search:   0%|          | 0/4869 [00:00<?, ?it/s]

  [100/4,869] 576,029 PMIDs | 0 errors | 53 capped | 26 q/min
  [200/4,869] 1,107,753 PMIDs | 0 errors | 96 capped | 26 q/min
  [300/4,869] 1,563,219 PMIDs | 0 errors | 136 capped | 25 q/min
  [400/4,869] 2,075,111 PMIDs | 0 errors | 180 capped | 24 q/min
  [500/4,869] 2,442,827 PMIDs | 0 errors | 212 capped | 25 q/min
  [600/4,869] 2,922,256 PMIDs | 0 errors | 252 capped | 24 q/min
  [700/4,869] 3,491,760 PMIDs | 0 errors | 301 capped | 24 q/min
  [800/4,869] 4,074,792 PMIDs | 0 errors | 354 capped | 24 q/min
  [900/4,869] 4,621,398 PMIDs | 0 errors | 399 capped | 24 q/min
  [1,000/4,869] 5,035,869 PMIDs | 0 errors | 430 capped | 24 q/min
  [1,100/4,869] 5,551,841 PMIDs | 0 errors | 471 capped | 24 q/min
  [1,200/4,869] 5,979,161 PMIDs | 0 errors | 508 capped | 24 q/min
  [1,300/4,869] 6,459,396 PMIDs | 0 errors | 549 capped | 25 q/min
  [1,400/4,869] 6,982,257 PMIDs | 0 errors | 594 capped | 25 q/min
  [1,500/4,869] 7,493,611 PMIDs | 0 errors | 637 capped | 24 q/min
  [1,600/4,869] 8
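
The loop above writes its progress to disk periodically so that an interrupted run can resume where it stopped. A self-contained sketch of that checkpoint-and-resume pattern, using a temporary file; `run_with_checkpoint` and `process` are hypothetical stand-ins for the notebook's loop and the PubMed call.

```python
import tempfile
from pathlib import Path

import pandas as pd


def run_with_checkpoint(keys, process, checkpoint_path, every=100):
    """Process keys, skipping any already recorded in checkpoint_path.

    Returns the number of keys processed in this call.
    """
    path = Path(checkpoint_path)
    if path.exists():
        rows = pd.read_csv(path, dtype={"key": str}).to_dict("records")
    else:
        rows = []
    done_keys = {row["key"] for row in rows}

    processed = 0
    for key in keys:
        if key in done_keys:
            continue
        rows.append({"key": key, "value": process(key)})
        processed += 1
        if processed % every == 0:  # periodic checkpoint
            pd.DataFrame(rows, columns=["key", "value"]).to_csv(path, index=False)

    # Final save covers the tail that did not reach a checkpoint boundary
    pd.DataFrame(rows, columns=["key", "value"]).to_csv(path, index=False)
    return processed


with tempfile.TemporaryDirectory() as tmp:
    progress_file = Path(tmp) / "progress.csv"
    first_run = run_with_checkpoint(["a", "b"], len, progress_file, every=1)
    second_run = run_with_checkpoint(["a", "b", "c"], len, progress_file, every=1)
    saved_rows = len(pd.read_csv(progress_file))

print(first_run, second_run, saved_rows)
```

Resuming only works if the keys are identical between runs, so the key should be derived from stable content rather than from anything that changes per process.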

In [8]:
# =============================================================================
# Filter Queries & Identify Excluded PMIDs
# =============================================================================

print("IDENTIFYING EXCLUDED PAPERS")
print("=" * 60)

# Load full search results
results_df = pd.read_csv(SEARCH_PROGRESS_CSV)
print(f"Total queries executed: {len(results_df):,}")
print(f"  With results: {(results_df['result_count'] > 0).sum():,}")
print(f"  Errors: {results_df['error'].notna().sum():,}")
print(f"  Hit retmax cap ({SEARCH_RETMAX:,}): {results_df['retmax_hit'].sum():,}")

# ---- QUALITY FILTER ----
# Queries that hit the retmax cap are overly broad (100K–35M results).
# Real systematic review searches typically return a few hundred to a few
# thousand results. Restrict to error-free queries that returned ALL their
# results (i.e., result_count <= MAX_RESULTS) and have at least 1 result.
MAX_RESULTS = SEARCH_RETMAX  # PMIDs beyond this cap were never retrieved

successful = results_df[
    (results_df['error'].isna()) &
    (results_df['result_count'] > 0) &
    (results_df['result_count'] <= MAX_RESULTS)
].copy()

print(f"\nAfter quality filter (non-capped, with results):")
print(f"  Retained queries: {len(successful):,}")
print(f"  Dropped (capped/empty): {len(results_df) - len(successful):,}")
print(f"\nResult count stats for retained queries:")
print(successful['result_count'].describe().to_string())

# Collect all unique PMIDs and track source reviews
all_search_pmids = set()
pmid_to_reviews = defaultdict(set)  # pmid → set of review DOIs

for _, row in tqdm(successful.iterrows(), total=len(successful), desc="Collecting PMIDs"):
    if pd.isna(row['pmids_json']) or row['pmids_json'] == '[]':
        continue
    pmids = json.loads(row['pmids_json'])
    review_dois = row['all_review_dois'].split('|')
    for pmid in pmids:
        all_search_pmids.add(pmid)
        for rdoi in review_dois:
            pmid_to_reviews[pmid].add(rdoi)

print(f"\nTotal unique PMIDs from filtered searches: {len(all_search_pmids):,}")

# Partition into included (in Cochrane) and excluded (search-only)
overlap_pmids  = all_search_pmids & known_pmids
excluded_pmids = all_search_pmids - known_pmids

print(f"\nOverlap with Cochrane reviews:  {len(overlap_pmids):,} PMIDs")
print(f"Excluded (search-only):         {len(excluded_pmids):,} PMIDs")
print(f"Exclusion rate:                 {len(excluded_pmids)/max(len(all_search_pmids),1)*100:.1f}%")

# Build excluded (review_doi, pmid) pairs for ground truth
excluded_pairs = []
for pmid in excluded_pmids:
    for rdoi in pmid_to_reviews[pmid]:
        excluded_pairs.append({'review_doi': rdoi, 'pmid': pmid})

excluded_pairs_df = pd.DataFrame(excluded_pairs).drop_duplicates()
print(f"\nExcluded (review, pmid) pairs:  {len(excluded_pairs_df):,}")
print(f"Unique reviews with excluded:   {excluded_pairs_df['review_doi'].nunique():,}")

IDENTIFYING EXCLUDED PAPERS
Total queries executed: 4,869
  With results: 4,308
  Errors: 0
  Hit retmax cap (10,000): 2,159

After quality filter (non-capped, with results):
  Retained queries: 2,149
  Dropped (capped/empty): 2,720

Result count stats for retained queries:
count    2149.000000
mean     1666.276408
std      2319.338814
min         1.000000
25%        66.000000
50%       541.000000
75%      2363.000000
max      9995.000000


Collecting PMIDs:   0%|          | 0/2149 [00:00<?, ?it/s]


Total unique PMIDs from filtered searches: 2,661,996

Overlap with Cochrane reviews:  19,244 PMIDs
Excluded (search-only):         2,642,752 PMIDs
Exclusion rate:                 99.3%

Excluded (review, pmid) pairs:  4,146,832
Unique reviews with excluded:   2,402
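
The collection, partition, and pair-building steps can also be expressed with pandas operations instead of nested Python loops. A sketch on made-up data, assuming the same pipe-separated `all_review_dois` and JSON-encoded `pmids_json` column formats as the search results file; `toy_known` stands in for `known_pmids`.

```python
import json

import pandas as pd

# Two made-up queries; the first is shared by reviews A and B
toy_results = pd.DataFrame({
    "all_review_dois": ["10.1002/A|10.1002/B", "10.1002/C"],
    "pmids_json": [json.dumps(["1", "2"]), json.dumps(["2", "3"])],
})
toy_known = {"2"}  # PMIDs referenced in some Cochrane review

toy_pairs = (
    toy_results
    .assign(
        review_doi=toy_results["all_review_dois"].str.split("|"),
        pmid=toy_results["pmids_json"].apply(json.loads),
    )
    .explode("review_doi")      # one row per (query, review)
    .reset_index(drop=True)
    .explode("pmid")            # one row per (review, pmid)
    [["review_doi", "pmid"]]
    .drop_duplicates()
)

# Keep only pairs whose PMID is not referenced in any review
toy_excluded = toy_pairs[~toy_pairs["pmid"].isin(toy_known)].reset_index(drop=True)
print(toy_excluded)
```

PMID "2" is dropped for every review, including reviews that never cited it, because exclusion is decided against the global set of referenced PMIDs.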


In [9]:
# =============================================================================
# Fetch Abstracts for Excluded Papers
# =============================================================================

print("FETCHING ABSTRACTS FOR EXCLUDED PAPERS")
print("=" * 60)

# If the excluded set is very large, sample to keep manageable
MAX_ABSTRACT_FETCH = 200_000  # adjust as needed
# Sort first: set iteration order varies between runs, which would defeat the fixed seed
pmids_to_fetch = sorted(excluded_pmids)

if len(pmids_to_fetch) > MAX_ABSTRACT_FETCH:
    print(f"Excluded PMIDs ({len(pmids_to_fetch):,}) exceeds cap ({MAX_ABSTRACT_FETCH:,}).")
    print(f"Randomly sampling {MAX_ABSTRACT_FETCH:,} for abstract fetching.")
    rng = np.random.default_rng(42)
    # .tolist() converts NumPy strings back to plain Python str
    pmids_to_fetch = rng.choice(pmids_to_fetch, size=MAX_ABSTRACT_FETCH, replace=False).tolist()
else:
    print(f"Fetching abstracts for all {len(pmids_to_fetch):,} excluded PMIDs")

est_batches = len(pmids_to_fetch) // 200 + 1
# Lower bound only: counts the rate-limit sleeps, not the time each efetch request takes
est_min = est_batches * NCBI_RATE / 60
print(f"Estimated: {est_batches:,} batches, ~{max(est_min, 1):.0f} min\n")

start_time = time.time()
excluded_records = fetch_abstracts_batch(pmids_to_fetch, batch_size=200)
elapsed = time.time() - start_time

with_abstract = sum(1 for r in excluded_records.values() if r.get('abstract'))
print(f"\nFetched {len(excluded_records):,} records in {elapsed/60:.1f} min")
print(f"  With abstracts:    {with_abstract:,} ({with_abstract/max(len(excluded_records),1)*100:.1f}%)")
print(f"  Without abstracts: {len(excluded_records) - with_abstract:,}")

FETCHING ABSTRACTS FOR EXCLUDED PAPERS
Excluded PMIDs (2,642,752) exceeds cap (200,000).
Randomly sampling 200,000 for abstract fetching.
Estimated: 1,001 batches, ~2 min



Fetching abstracts:   0%|          | 0/1001 [00:00<?, ?it/s]


Fetched 199,645 records in 141.1 min
  With abstracts:    175,578 (87.9%)
  Without abstracts: 24,067
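
The printed estimate (about 2 min) counts only the rate-limit sleeps, while the run above took 141.1 min for 1,001 batches. A rough way to project fetch time from that observed pace is sketched below. `projected_minutes` is a hypothetical helper, and the projection assumes per-batch latency stays roughly constant.

```python
import math


def projected_minutes(n_pmids, batch_size, seconds_per_batch):
    """Project wall-clock minutes for a batched fetch at a measured pace."""
    n_batches = math.ceil(n_pmids / batch_size)
    return n_batches * seconds_per_batch / 60


# Observed above: 1,001 batches in 141.1 min
observed_seconds_per_batch = 141.1 * 60 / 1001

# Projection for the full excluded set of 2,642,752 PMIDs
full_run_minutes = projected_minutes(2_642_752, 200, observed_seconds_per_batch)
print(f"{observed_seconds_per_batch:.1f} s/batch, "
      f"~{full_run_minutes / 60:.0f} h for all excluded PMIDs")
```

At roughly 8.5 seconds per batch, fetching every excluded PMID would take on the order of 31 hours, which is the practical reason for sampling.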


In [10]:
# =============================================================================
# Build & Save Final Output
# =============================================================================

print("BUILDING EXCLUDED PAPERS DATASET")
print("=" * 60)

# Join excluded pairs with fetched abstracts
output_rows = []
no_record = 0

for _, pair in tqdm(excluded_pairs_df.iterrows(), total=len(excluded_pairs_df),
                     desc="Building output"):
    pmid = pair['pmid']
    rec = excluded_records.get(pmid)
    if rec is None:
        # Not fetched: the PMID fell outside the abstract-fetch sample or its batch failed
        no_record += 1
        continue

    output_rows.append({
        'review_doi': pair['review_doi'],
        'pmid': pmid,
        'title': rec.get('title', ''),
        'abstract': rec.get('abstract', ''),
        'authors': rec.get('authors', ''),
        'year': rec.get('year', ''),
        'doi': rec.get('doi', ''),
    })

excluded_df = pd.DataFrame(output_rows)

# Filter to rows that actually have an abstract (required for LLM evaluation)
excluded_with_abs = excluded_df[
    excluded_df['abstract'].notna() &
    (excluded_df['abstract'].str.len() > 50)
].copy()

print(f"\nTotal excluded pairs with records: {len(excluded_df):,}")
print(f"With abstracts (>50 chars):        {len(excluded_with_abs):,}")
print(f"Skipped (no PubMed record):        {no_record:,}")
print(f"\nUnique excluded PMIDs with abstract: {excluded_with_abs['pmid'].nunique():,}")
print(f"Unique reviews represented:          {excluded_with_abs['review_doi'].nunique():,}")

# Save
excluded_with_abs.to_csv(EXCLUDED_ABSTRACTS_CSV, index=False)
print(f"\n✓ Saved to {EXCLUDED_ABSTRACTS_CSV.name}")

# Also save the search results summary (without PMIDs JSON for smaller file)
search_summary = results_df.drop(columns=['pmids_json'], errors='ignore')
search_summary.to_csv(SEARCH_RESULTS_CSV, index=False)
print(f"✓ Search summary saved to {SEARCH_RESULTS_CSV.name}")

BUILDING EXCLUDED PAPERS DATASET


Building output:   0%|          | 0/4146832 [00:00<?, ?it/s]


Total excluded pairs with records: 313,024
With abstracts (>50 chars):        279,373
Skipped (no PubMed record):        3,833,808

Unique excluded PMIDs with abstract: 175,562
Unique reviews represented:          2,147

✓ Saved to pubmed_excluded_abstracts.csv
✓ Search summary saved to pubmed_search_results.csv
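
Iterating over millions of pairs with `iterrows()` is slow. The same join can be sketched as a single `merge`, shown here on made-up data; `toy_pairs` and `toy_records` are stand-ins for `excluded_pairs_df` and `excluded_records`.

```python
import pandas as pd

toy_pairs = pd.DataFrame({
    "review_doi": ["10.1002/A", "10.1002/B", "10.1002/B"],
    "pmid": ["1", "1", "2"],
})
# Fetched records keyed by PMID; PMID "2" was never fetched
toy_records = {
    "1": {"pmid": "1", "title": "T1", "abstract": "x" * 60,
          "year": "2001", "authors": "Doe J", "doi": ""},
}

records_df = pd.DataFrame(list(toy_records.values()))

# Left join keeps every pair; the indicator column marks pairs without a record
joined = toy_pairs.merge(records_df, on="pmid", how="left", indicator=True)

toy_no_record = int((joined["_merge"] == "left_only").sum())
toy_with_record = joined[joined["_merge"] == "both"].drop(columns="_merge")
toy_with_abstract = toy_with_record[toy_with_record["abstract"].str.len() > 50]

print(len(toy_with_record), len(toy_with_abstract), toy_no_record)
```

The `indicator=True` column reproduces the skipped-pair count without a Python-level loop.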


In [11]:
# =============================================================================
# Summary Statistics
# =============================================================================

print("=" * 60)
print("PIPELINE SUMMARY")
print("=" * 60)

print(f"\nInput:")
print(f"  Translated search queries:     {len(unique_queries):,}")
print(f"  Known PMIDs (Cochrane refs):   {len(known_pmids):,}")

print(f"\nPubMed Searches:")
print(f"  Queries executed:              {len(results_df):,}")
print(f"  Successful:                    {len(successful):,}")
print(f"  Unique PMIDs found:            {len(all_search_pmids):,}")

print(f"\nExcluded Papers:")
print(f"  Excluded PMIDs (total):        {len(excluded_pmids):,}")
print(f"  With abstracts fetched:        {excluded_with_abs['pmid'].nunique():,}")
print(f"  (review, paper) pairs:         {len(excluded_with_abs):,}")

print(f"\nOutput Files:")
print(f"  {EXCLUDED_ABSTRACTS_CSV.name}")
print(f"  {SEARCH_RESULTS_CSV.name}")

print(f"\n✓ Ready for notebook 06 (build ground truth)")

PIPELINE SUMMARY

Input:
  Translated search queries:     4,869
  Known PMIDs (Cochrane refs):   56,814

PubMed Searches:
  Queries executed:              4,869
  Successful:                    2,149
  Unique PMIDs found:            2,661,996

Excluded Papers:
  Excluded PMIDs (total):        2,642,752
  With abstracts fetched:        175,562
  (review, paper) pairs:         279,373

Output Files:
  pubmed_excluded_abstracts.csv
  pubmed_search_results.csv

✓ Ready for notebook 06 (build ground truth)
