# 04: Fetch Abstracts for Referenced Studies

## Objective
Match our extracted references to PubMed records and fetch abstracts for LLM evaluation.

## Matching Strategy (CrossRef-enhanced)

This workflow uses a 3-phase approach for maximum match rate:

1. **Direct Extraction** - Extract DOI/PMID directly from reference text using regex
2. **CrossRef API** - For references without DOI, query CrossRef's bibliographic search
3. **PubMed Lookup** - Use DOIs to fetch PMIDs, then batch-fetch abstracts

## Why CrossRef?
- CrossRef API (`query.bibliographic`) is specifically designed to match reference strings to DOIs
- Handles fuzzy matching internally (typos, ligatures, formatting issues)
- Has 130M+ DOIs - broader coverage than PubMed-only search
- DOI → PMID lookup is highly reliable (~95%+ when the paper exists in PubMed)

## Output
- `Data/referenced_paper_abstracts.csv` - Abstracts with match confidence scores

In [2]:
%pip install -q biopython pandas tqdm requests rapidfuzz

Note: you may need to restart the kernel to use updated packages.


In [3]:
import pandas as pd
import numpy as np
import re
import time
import os
import requests
import unicodedata
from pathlib import Path
from tqdm.notebook import tqdm
from Bio import Entrez
from collections import Counter
from rapidfuzz import fuzz  # Fuzzy string matching for improved reference matching

# =============================================================================
# Configuration
# =============================================================================

# NCBI API configuration - with API key allows 10 requests/sec
# Credentials loaded from .env file (never hardcode!)
Entrez.email = os.environ.get("NCBI_EMAIL", "")
Entrez.api_key = os.environ.get("NCBI_API_KEY", "")

# Rate limits
NCBI_RATE = 0.11 if Entrez.api_key else 0.34  # seconds between NCBI requests
CROSSREF_RATE = 0.5  # seconds between CrossRef requests (polite)

# Fuzzy matching configuration
FUZZY_TITLE_THRESHOLD = 75  # Minimum title similarity score (0-100) to accept a CrossRef match
FUZZY_AUTHOR_WEIGHT = 0.3   # How much to weight author match vs title match

# Setup paths
notebook_dir = Path.cwd()
project_root = notebook_dir if (notebook_dir / "Data").exists() else notebook_dir.parent
DATA_DIR = project_root / "Data"

# Input
REFS_CSV = DATA_DIR / "categorized_references.csv"
META_CSV = DATA_DIR / "review_metadata.csv"

# Output
OUTPUT_CSV = DATA_DIR / "referenced_paper_abstracts.csv"
PROGRESS_CSV = DATA_DIR / "crossref_matching_progress.csv"

print(f"Data directory: {DATA_DIR}")
print(f"NCBI API key configured: {bool(Entrez.api_key)}")
print(f"NCBI rate: {1/NCBI_RATE:.0f} requests/sec")
print(f"CrossRef rate: {1/CROSSREF_RATE:.0f} requests/sec")
print(f"Fuzzy match threshold: {FUZZY_TITLE_THRESHOLD}%")

Data directory: c:\Users\juanx\Documents\LSE-UKHSA Project\Data
NCBI API key configured: True
NCBI rate: 9 requests/sec
CrossRef rate: 2 requests/sec
Fuzzy match threshold: 75%


In [4]:
# =============================================================================
# Load and Filter to Public Health Reviews (Latest Versions Only)
# =============================================================================

# Load all references
refs_df = pd.read_csv(REFS_CSV)
print(f"Total references: {len(refs_df):,}")

# Load metadata for filtering
meta_df = pd.read_csv(META_CSV)
print(f"Total reviews in metadata: {len(meta_df):,}")

# ── Version deduplication ──────────────────────────────────────────────────────
# Cochrane DOIs: 10.1002/14651858.CD000004.pub2 → CD000004, version 2
# Keep only the most recent version of each review

if 'cd_number' not in meta_df.columns or 'version' not in meta_df.columns:
    # Compute version info if not already in the CSV (backward compatible)
    _vp = meta_df['doi'].str.extract(r'(CD\d+)(?:\.pub(\d+))?', flags=re.I)
    meta_df['cd_number'] = _vp[0].str.upper()
    meta_df['version'] = _vp[1].fillna(1).astype(int)

has_cd = meta_df[meta_df['cd_number'].notna()]
latest_idx = has_cd.groupby('cd_number')['version'].idxmax()
no_cd = meta_df[meta_df['cd_number'].isna()]

# Keep latest versions + any non-CD reviews
meta_df = pd.concat([meta_df.loc[latest_idx], no_cd], ignore_index=True)
print(f"After version dedup: {len(meta_df):,} reviews (removed {len(has_cd) - len(latest_idx):,} superseded)")

# ── Filter to Public Health group ─────────────────────────────────────────────
# Strict filter: only the Cochrane "Public Health" group, aligned with UKHSA's remit
PUBLIC_HEALTH_GROUPS = [
    'Public Health',
]

# Filter to PH reviews
ph_reviews = meta_df[meta_df['cochrane_group'].isin(PUBLIC_HEALTH_GROUPS)]
ph_review_dois = set(ph_reviews['doi'].dropna())

print(f"\nPublic health reviews (latest only): {len(ph_reviews):,}")
print("Groups:", ", ".join([f"{g} ({(ph_reviews['cochrane_group']==g).sum()})" 
                            for g in PUBLIC_HEALTH_GROUPS if (ph_reviews['cochrane_group']==g).sum() > 0]))

# Filter references
refs_df = refs_df[refs_df['review_doi'].isin(ph_review_dois)].copy()
print(f"\nReferences after PH filter: {len(refs_df):,}")

Total references: 630,032
Total reviews in metadata: 16,618
After version dedup: 9,968 reviews (removed 6,650 superseded)

Public health reviews (latest only): 61
Groups: Public Health (61)

References after PH filter: 5,876


In [5]:
# =============================================================================
# Deduplicate References
# =============================================================================

def create_ref_signature(row):
    """Create a normalized signature for deduplication."""
    title = str(row.get('title', '')).lower().strip()[:100]
    year = str(row.get('year', ''))
    first_author = str(row.get('authors', '')).split()[0].lower() if row.get('authors') else ''
    return f"{first_author}|{year}|{title[:50]}"

refs_df['signature'] = refs_df.apply(create_ref_signature, axis=1)
unique_refs = refs_df.drop_duplicates(subset='signature').copy()

print(f"Total references: {len(refs_df):,}")
print(f"Unique references: {len(unique_refs):,}")
print(f"Deduplication ratio: {len(refs_df)/len(unique_refs):.1f}x")
print(f"\nCategories:")
print(unique_refs['category'].value_counts().to_string())

Total references: 5,876
Unique references: 5,690
Deduplication ratio: 1.0x

Categories:
category
excluded    4173
included    1108
awaiting     268
ongoing      141


In [6]:
# =============================================================================
# PHASE 1: Extract DOI/PMID directly from reference text
# =============================================================================
# Many references already have DOI or PMID embedded in the text

def extract_pmid(text):
    """Extract PMID from reference text."""
    if pd.isna(text):
        return None
    match = re.search(r"(PMID|PUBMED|MEDLINE)[:\s]*(\d+)", str(text), flags=re.I)
    if match:
        return match.group(2)
    return None

def extract_doi(text):
    """Extract DOI from reference text."""
    if pd.isna(text):
        return None
    match = re.search(
        r"(?:DOI[:\s]*|HTTPS?://(?:DX\.)?DOI\.ORG/)?(10\.\d{4,9}/[-._;()/:A-Z0-9]+)", 
        str(text), flags=re.I
    )
    if match:
        doi = match.group(1)
        return doi.rstrip(".,;)")  # remove trailing punctuation
    return None

# Build full reference text for extraction
unique_refs['full_ref'] = (
    unique_refs['title'].fillna('') + ' ' + 
    unique_refs['authors'].fillna('') + ' ' +
    unique_refs['year'].fillna('').astype(str)
)

# Extract DOI and PMID
unique_refs['extracted_doi'] = unique_refs['full_ref'].apply(extract_doi)
unique_refs['extracted_pmid'] = unique_refs['full_ref'].apply(extract_pmid)

# Also use ref_doi and pmid columns if available
unique_refs['final_doi'] = unique_refs['extracted_doi'].combine_first(unique_refs.get('ref_doi'))
unique_refs['final_pmid'] = unique_refs['extracted_pmid'].combine_first(unique_refs.get('pmid'))

has_doi = unique_refs['final_doi'].notna().sum()
has_pmid = unique_refs['final_pmid'].notna().sum()

print("PHASE 1: Direct Extraction")
print("=" * 60)
print(f"References with DOI: {has_doi:,} ({has_doi/len(unique_refs)*100:.1f}%)")
print(f"References with PMID: {has_pmid:,} ({has_pmid/len(unique_refs)*100:.1f}%)")
print(f"Need CrossRef lookup: {len(unique_refs) - has_doi:,}")

PHASE 1: Direct Extraction
References with DOI: 742 (13.0%)
References with PMID: 32 (0.6%)
Need CrossRef lookup: 4,948


  unique_refs['final_pmid'] = unique_refs['extracted_pmid'].combine_first(unique_refs.get('pmid'))


In [7]:
# =============================================================================
# CrossRef and PubMed API Functions (Enhanced with fuzzy matching)
# =============================================================================

# ── Text Cleaning Utilities ────────────────────────────────────────────────────

def normalize_text(text):
    """Normalize unicode ligatures and special characters from PDFs."""
    if not text:
        return ""
    # NFKD normalization handles ligatures (fi→fi, fl→fl) and other composed chars
    text = unicodedata.normalize("NFKD", str(text))
    return text


def clean_text_for_matching(text):
    """Clean text for fuzzy matching comparison."""
    if not text:
        return ""
    text = normalize_text(text)
    # Remove punctuation except hyphens (keep compound words)
    text = re.sub(r"[^\w\s\-]", "", text.lower())
    return text.strip()


def clean_reference(ref):
    """Clean reference text for CrossRef query."""
    if pd.isna(ref):
        return ""
    ref = str(ref)
    # Unicode normalization (handles PDF ligatures)
    ref = normalize_text(ref)
    # Remove Cochrane-specific text
    ref = re.sub(r"\(REVIEW\).*", "", ref, flags=re.I)
    ref = re.sub(r"TRUSTED EVIDENCE.*", "", ref, flags=re.I)
    ref = re.sub(r"COCHRANE.*?LIBRARY", "", ref, flags=re.I)
    # Remove page markers
    ref = re.sub(r"---\s*Page\s*\d+\s*---", " ", ref)
    # Remove excessive whitespace
    ref = re.sub(r"\s{2,}", " ", ref)
    return ref.strip()


def clean_title_for_search(title):
    """Normalize title for CrossRef API search."""
    title = normalize_text(title)
    title = title.replace("/", " ")
    title = re.sub(r"\s+", " ", title).strip()
    return title


def extract_jats_abstract(text):
    """Strip JATS/HTML tags from CrossRef abstract text."""
    if not text:
        return None
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text if text else None


# ── CrossRef Utilities ─────────────────────────────────────────────────────────

def format_crossref_authors(work):
    """Format author names from CrossRef response."""
    authors = work.get("author", [])
    parts = []
    for a in authors:
        last = a.get("family", "")
        given = a.get("given", "")
        initials = "".join(w[0] for w in given.split()) if given else ""
        parts.append(f"{last} {initials}".strip())
    return ", ".join(parts) if parts else None


def score_crossref_candidates(candidates, query_title, query_authors):
    """
    Score CrossRef candidates using fuzzy matching.
    Returns (best_work, title_score, author_score) or (None, None, None).
    """
    clean_query_title = clean_text_for_matching(query_title)
    clean_query_authors = clean_text_for_matching(query_authors)
    
    scored = []
    for item in candidates:
        # Get title from CrossRef (usually a list with one element)
        cr_titles = item.get("title", [])
        cr_title = cr_titles[0] if cr_titles else ""
        
        # Score title similarity
        title_score = fuzz.token_sort_ratio(clean_query_title, clean_text_for_matching(cr_title))
        
        # Score author similarity
        cr_authors = format_crossref_authors(item) or ""
        author_score = fuzz.token_sort_ratio(clean_query_authors, clean_text_for_matching(cr_authors))
        
        scored.append((item, title_score, author_score))
    
    # Filter to candidates above threshold
    above_threshold = [(item, ts, aus) for item, ts, aus in scored if ts >= FUZZY_TITLE_THRESHOLD]
    
    if not above_threshold:
        return None, None, None
    
    # Return best match: highest title score, break ties with author score
    best = max(above_threshold, key=lambda x: (x[1], x[2]))
    return best


# ── Main CrossRef Function (Enhanced) ──────────────────────────────────────────

def get_doi_from_crossref(ref, title=None, authors=None, timeout=(5, 30)):
    """
    Query CrossRef API to get DOI from bibliographic reference.
    
    Enhanced with:
    1. Fuzzy matching to validate results
    2. Short-title fallback (searches before ':' or '-')
    3. Returns match metadata for quality assessment
    
    Returns: (doi, match_method, title_score, author_score)
    
    For backward compatibility, can also be called with just `ref` and will
    return just the DOI (legacy behavior).
    """
    # Legacy mode: if only ref is provided, return just the DOI
    legacy_mode = (title is None and authors is None)
    
    if not ref or len(ref) < 20:
        return None if legacy_mode else (None, None, None, None)
    
    headers = {
        "User-Agent": f"LSE-UKHSA-Project/1.0 (mailto:{Entrez.email})" if Entrez.email else "LSE-UKHSA-Project/1.0"
    }
    
    # Extract title/authors from ref if not provided
    if title is None:
        title = ref
    if authors is None:
        authors = ""
    
    clean_title = clean_title_for_search(title)
    
    def try_search(search_term, method_name):
        """Helper to search CrossRef and score results."""
        try:
            url = "https://api.crossref.org/works"
            params = {
                "query.bibliographic": search_term,
                "rows": 10,  # Get top 10 for fuzzy matching
                "select": "DOI,title,author,abstract"
            }
            r = requests.get(url, params=params, headers=headers, timeout=timeout)
            r.raise_for_status()
            
            items = r.json().get("message", {}).get("items", [])
            if not items:
                return None, None, None, None
            
            # In legacy mode, just return first result (backward compatible)
            if legacy_mode:
                return items[0].get("DOI"), None, None, None
            
            # Score candidates
            best_item, title_score, author_score = score_crossref_candidates(items, title, authors)
            
            if best_item:
                return best_item.get("DOI"), method_name, title_score, author_score
            
            # If no fuzzy match above threshold, check if first result is exact-ish match
            first_title = (items[0].get("title") or [""])[0]
            if clean_text_for_matching(first_title) == clean_text_for_matching(title):
                return items[0].get("DOI"), f"{method_name}_exact", 100, None
                
        except Exception:
            pass
        return None, None, None, None
    
    # Strategy 1: Full bibliographic search
    doi, method, ts, aus = try_search(ref, "crossref_fuzzy")
    if doi:
        return doi if legacy_mode else (doi, method, ts, aus)
    
    # Strategy 2: Title-only search (sometimes more effective)
    doi, method, ts, aus = try_search(clean_title, "crossref_title")
    if doi:
        return doi if legacy_mode else (doi, method, ts, aus)
    
    # Strategy 3: Short title (before : or -)
    short_title = re.split(r"\s*[:\-]\s*", clean_title)[0].strip()
    if short_title.lower() != clean_title.lower() and len(short_title) > 20:
        doi, method, ts, aus = try_search(short_title, "crossref_short")
        if doi:
            return doi if legacy_mode else (doi, method, ts, aus)
    
    return None if legacy_mode else (None, "no_match", None, None)


# ── CrossRef Abstract Fetch ────────────────────────────────────────────────────

def fetch_crossref_abstract(doi, timeout=(5, 30)):
    """
    Fetch abstract from CrossRef for a given DOI.
    Returns cleaned abstract text or None.
    """
    if not doi:
        return None
    
    headers = {
        "User-Agent": f"LSE-UKHSA-Project/1.0 (mailto:{Entrez.email})" if Entrez.email else "LSE-UKHSA-Project/1.0"
    }
    
    try:
        url = f"https://api.crossref.org/works/{doi.strip()}"
        r = requests.get(url, headers=headers, timeout=timeout)
        if r.status_code == 200:
            work = r.json().get("message", {})
            raw_abstract = work.get("abstract", "")
            return extract_jats_abstract(raw_abstract)
    except Exception:
        pass
    return None


# ── PubMed Functions ───────────────────────────────────────────────────────────

def get_pmid_from_doi(doi, api_key=None):
    """Look up PMID from DOI via PubMed."""
    if not doi:
        return None
    
    try:
        params = {
            "db": "pubmed",
            "term": f"{doi}[DOI]",
            "retmode": "json"
        }
        if api_key:
            params["api_key"] = api_key
            
        r = requests.get(
            "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
            params=params, timeout=(5, 30)
        )
        data = r.json()
        idlist = data.get("esearchresult", {}).get("idlist", [])
        return idlist[0] if idlist else None
    except Exception:
        return None


def fetch_abstracts_batch(pmids, batch_size=200, max_retries=3):
    """Batch fetch PubMed records (200 per request) with retry logic."""
    results = {}
    pmid_list = [str(int(p)) for p in pmids if pd.notna(p) and str(p).isdigit()]
    failed_batches = []
    
    for i in range(0, len(pmid_list), batch_size):
        batch = pmid_list[i:i + batch_size]
        success = False
        
        for attempt in range(max_retries):
            try:
                time.sleep(NCBI_RATE * (attempt + 1))  # Exponential backoff
                handle = Entrez.efetch(db="pubmed", id=",".join(batch), rettype="xml", retmode="xml")
                records = Entrez.read(handle)
                handle.close()
                
                for article in records.get('PubmedArticle', []):
                    data = extract_record_data(article)
                    if data:
                        results[data['pmid']] = data
                success = True
                break
            except Exception as e:
                if attempt < max_retries - 1:
                    print(f"Batch {i//batch_size + 1} attempt {attempt + 1} failed: {e}, retrying...")
                else:
                    print(f"Batch {i//batch_size + 1} failed after {max_retries} attempts: {e}")
                    failed_batches.append(batch)
    
    if failed_batches:
        print(f"Warning: {len(failed_batches)} batch(es) failed permanently ({sum(len(b) for b in failed_batches)} PMIDs)")
    
    return results


def extract_record_data(record):
    """Extract fields from PubMed record."""
    try:
        article = record['MedlineCitation']['Article']
        pmid = str(record['MedlineCitation']['PMID'])
        title = str(article.get('ArticleTitle', ''))
        
        # Abstract
        abstract = ''
        if 'Abstract' in article and 'AbstractText' in article['Abstract']:
            parts = article['Abstract']['AbstractText']
            abstract = ' '.join([str(p) for p in parts]) if isinstance(parts, list) else str(parts)
        
        # Year
        year = ''
        if 'Journal' in article and 'JournalIssue' in article['Journal']:
            year = article['Journal']['JournalIssue'].get('PubDate', {}).get('Year', '')
        
        # Authors
        authors = []
        if 'AuthorList' in article:
            for auth in article['AuthorList']:
                if 'LastName' in auth:
                    name = auth['LastName'] + (' ' + auth.get('Initials', ''))
                    authors.append(name.strip())
        
        # DOI
        doi = ''
        if 'ELocationID' in article:
            for loc in article['ELocationID']:
                if loc.attributes.get('EIdType') == 'doi':
                    doi = str(loc)
                    break
        
        return {
            'pmid': pmid, 'title': title, 'abstract': abstract,
            'year': year, 'authors': '; '.join(authors), 'doi': doi
        }
    except:
        return None


print("API functions defined (enhanced with fuzzy matching):")
print("  • get_doi_from_crossref() - CrossRef search with fuzzy matching + short-title fallback")
print("  • fetch_crossref_abstract() - CrossRef abstract fetch (for PubMed fallback)")
print("  • get_pmid_from_doi() - DOI to PMID lookup")
print("  • fetch_abstracts_batch() - Batch PubMed fetch")
print(f"  • Fuzzy threshold: {FUZZY_TITLE_THRESHOLD}% title similarity required")

API functions defined (enhanced with fuzzy matching):
  • get_doi_from_crossref() - CrossRef search with fuzzy matching + short-title fallback
  • fetch_crossref_abstract() - CrossRef abstract fetch (for PubMed fallback)
  • get_pmid_from_doi() - DOI to PMID lookup
  • fetch_abstracts_batch() - Batch PubMed fetch
  • Fuzzy threshold: 75% title similarity required


In [8]:
# =============================================================================
# PHASE 2: CrossRef lookup for missing DOIs
# =============================================================================
# Query CrossRef API for references without DOI

print("PHASE 2: CrossRef DOI Lookup")
print("=" * 60)

# References that need CrossRef lookup
refs_need_crossref = unique_refs[unique_refs['final_doi'].isna()].copy()
print(f"References needing CrossRef: {len(refs_need_crossref):,}")

# Time estimate
est_hours = len(refs_need_crossref) * CROSSREF_RATE / 3600
print(f"Estimated time: ~{est_hours:.1f} hours")

refs_to_process = refs_need_crossref.copy()
print(f"References to process: {len(refs_to_process):,}")

PHASE 2: CrossRef DOI Lookup
References needing CrossRef: 4,948
Estimated time: ~0.7 hours
References to process: 4,948


In [9]:
# =============================================================================
# PHASE 2 EXECUTION: CrossRef DOI Lookup
# =============================================================================

results = []
start_time = time.time()
matched_count = 0

for idx, (_, row) in enumerate(tqdm(refs_to_process.iterrows(), total=len(refs_to_process), desc="CrossRef")):
    # Build reference string for CrossRef query
    ref_text = clean_reference(f"{row['title']} {row['authors']} {row['year']}")
    
    # Query CrossRef with fuzzy matching (pass title + authors explicitly)
    crossref_doi, match_method, title_score, author_score = get_doi_from_crossref(
        ref_text, title=row['title'], authors=row.get('authors', '')
    )
    time.sleep(CROSSREF_RATE)
    
    # Record result
    results.append({
        'study_id': row['study_id'],
        'category': row['category'],
        'original_title': row['title'],
        'original_authors': row['authors'],
        'original_year': row['year'],
        'crossref_doi': crossref_doi,
        'match_method': match_method or 'no_match',
        'title_score': title_score,
        'author_score': author_score
    })
    
    if crossref_doi:
        matched_count += 1
    
    # Print progress every 100 refs
    if (idx + 1) % 100 == 0:
        elapsed = time.time() - start_time
        rate = (idx + 1) / elapsed * 3600
        print(f"\n[{idx+1:,}/{len(refs_to_process):,}] Match rate: {matched_count/(idx+1)*100:.1f}% ({rate:.0f}/hr)")

# Save final results
crossref_results_df = pd.DataFrame(results)
crossref_results_df.to_csv(PROGRESS_CSV, index=False)

print(f"\n✓ CrossRef complete: {matched_count:,} DOIs found ({matched_count/len(refs_to_process)*100:.1f}%)")

CrossRef:   0%|          | 0/4948 [00:00<?, ?it/s]


[100/4,948] Match rate: 83.0% (2635/hr)

[200/4,948] Match rate: 86.0% (2705/hr)

[300/4,948] Match rate: 86.3% (2754/hr)

[400/4,948] Match rate: 86.2% (2742/hr)

[500/4,948] Match rate: 85.8% (2731/hr)

[600/4,948] Match rate: 86.5% (2728/hr)

[700/4,948] Match rate: 86.6% (2741/hr)

[800/4,948] Match rate: 86.4% (2728/hr)

[900/4,948] Match rate: 85.8% (2692/hr)

[1,000/4,948] Match rate: 84.7% (2667/hr)

[1,100/4,948] Match rate: 83.9% (2650/hr)

[1,200/4,948] Match rate: 81.9% (2608/hr)

[1,300/4,948] Match rate: 82.5% (2629/hr)

[1,400/4,948] Match rate: 80.2% (2578/hr)

[1,500/4,948] Match rate: 79.1% (2558/hr)

[1,600/4,948] Match rate: 77.9% (2528/hr)

[1,700/4,948] Match rate: 78.0% (2531/hr)

[1,800/4,948] Match rate: 77.9% (2526/hr)

[1,900/4,948] Match rate: 78.2% (2525/hr)

[2,000/4,948] Match rate: 78.5% (2519/hr)

[2,100/4,948] Match rate: 78.6% (2516/hr)

[2,200/4,948] Match rate: 78.4% (2511/hr)

[2,300/4,948] Match rate: 78.5% (2518/hr)

[2,400/4,948] Match rate: 78

In [10]:
# =============================================================================
# PHASE 3: Convert DOIs to PMIDs
# =============================================================================

print("PHASE 3: DOI → PMID Conversion")
print("=" * 60)

# Reload unique_refs fresh (to avoid column conflicts from previous runs)
unique_refs_fresh = refs_df.drop_duplicates(subset='signature').copy()

# Re-extract DOIs (Phase 1 logic)
unique_refs_fresh['full_ref'] = (
    unique_refs_fresh['title'].fillna('') + ' ' + 
    unique_refs_fresh['authors'].fillna('') + ' ' +
    unique_refs_fresh['year'].fillna('').astype(str)
)
unique_refs_fresh['extracted_doi'] = unique_refs_fresh['full_ref'].apply(extract_doi)
unique_refs_fresh['final_doi'] = unique_refs_fresh['extracted_doi']

# Combine with CrossRef DOIs
crossref_results = pd.read_csv(PROGRESS_CSV) if PROGRESS_CSV.exists() else pd.DataFrame()

if len(crossref_results) > 0:
    # Merge on normalized title (not study_id!) to correctly match different papers
    crossref_results['title_normalized'] = crossref_results['original_title'].str.lower().str.strip()
    unique_refs_fresh['title_normalized'] = unique_refs_fresh['title'].str.lower().str.strip()
    
    crossref_doi_map = crossref_results[['title_normalized', 'crossref_doi']].drop_duplicates()
    unique_refs_fresh = unique_refs_fresh.merge(
        crossref_doi_map,
        on='title_normalized',
        how='left'
    )
    unique_refs_fresh['final_doi'] = unique_refs_fresh['final_doi'].combine_first(unique_refs_fresh['crossref_doi'])

# Update the main variable
unique_refs = unique_refs_fresh

# Get all unique DOIs
all_dois = unique_refs[unique_refs['final_doi'].notna()]['final_doi'].str.lower().unique()
print(f"Total unique DOIs: {len(all_dois):,}")

# DOI→PMID output file (used by notebook 05)
DOI_PMID_CACHE = DATA_DIR / "doi_pmid_cache.csv"

# Look up ALL DOIs fresh
doi_to_pmid = {}
print(f"DOIs to look up: {len(all_dois):,}")

if len(all_dois) > 0:
    print("Converting DOIs to PMIDs...")
    new_mappings = 0
    for doi in tqdm(all_dois, desc="DOI→PMID"):
        pmid = get_pmid_from_doi(doi, Entrez.api_key)
        if pmid:
            doi_to_pmid[doi.lower()] = pmid
            new_mappings += 1
        else:
            doi_to_pmid[doi.lower()] = None
        time.sleep(NCBI_RATE)
    
    # Save results for use by notebook 05
    cache_rows = [{'doi': k, 'pmid': v if v else 'NO_PMID'} for k, v in doi_to_pmid.items()]
    cache_df = pd.DataFrame(cache_rows)
    cache_df.to_csv(DOI_PMID_CACHE, index=False)
    print(f"\n✓ Found {new_mappings:,} PMIDs, saved to {DOI_PMID_CACHE.name}")

# Count PMIDs found
pmids_for_current = sum(1 for d in all_dois if doi_to_pmid.get(d.lower()) is not None)
print(f"\n✓ PMIDs found: {pmids_for_current:,} / {len(all_dois):,} ({pmids_for_current/len(all_dois)*100:.1f}%)")

PHASE 3: DOI → PMID Conversion
Total unique DOIs: 3,633
DOIs to look up: 3,633
Converting DOIs to PMIDs...


DOI→PMID:   0%|          | 0/3633 [00:00<?, ?it/s]


✓ Found 2,744 PMIDs, saved to doi_pmid_cache.csv

✓ PMIDs found: 2,744 / 3,633 (75.5%)


In [11]:
# =============================================================================
# PHASE 4: Batch Fetch Abstracts
# =============================================================================

print("PHASE 4: Fetch Abstracts")
print("=" * 60)

# Map DOIs to PMIDs
unique_refs['doi_lower'] = unique_refs['final_doi'].str.lower()
unique_refs['pmid_from_doi'] = unique_refs['doi_lower'].map(doi_to_pmid)

# Initialize final_pmid if it doesn't exist (fresh dataframe)
if 'final_pmid' not in unique_refs.columns:
    unique_refs['final_pmid'] = None
unique_refs['final_pmid'] = unique_refs['final_pmid'].combine_first(unique_refs['pmid_from_doi'])

# Get all unique PMIDs
all_pmids = unique_refs[unique_refs['final_pmid'].notna()]['final_pmid'].unique()
print(f"Total unique PMIDs: {len(all_pmids):,}")

# Batch fetch abstracts
print("Fetching abstracts in batches...")
start = time.time()
abstract_records = fetch_abstracts_batch(all_pmids, batch_size=200)
elapsed = time.time() - start

print(f"\n✓ Fetched {len(abstract_records):,} records in {elapsed:.0f}s")
with_abstract = sum(1 for r in abstract_records.values() if r.get('abstract'))
print(f"  With abstracts: {with_abstract:,} ({with_abstract/len(abstract_records)*100:.1f}%)")

PHASE 4: Fetch Abstracts
Total unique PMIDs: 2,744
Fetching abstracts in batches...

✓ Fetched 2,744 records in 98s
  With abstracts: 2,632 (95.9%)


In [12]:
# =============================================================================
# PHASE 4b: CrossRef Abstract Fallback + Compile Final Results
# =============================================================================
# For papers with DOI but no PubMed abstract, try fetching from CrossRef

print("PHASE 4b: CrossRef Abstract Fallback")
print("=" * 60)

# Find DOIs with no abstract
dois_needing_fallback = []
for _, row in unique_refs.iterrows():
    doi = str(row.get('final_doi', '')).lower() if pd.notna(row.get('final_doi')) else None
    if not doi or doi == 'nan':
        continue
    
    pmid = str(row.get('final_pmid')) if pd.notna(row.get('final_pmid')) else None
    record = abstract_records.get(pmid, {}) if pmid else {}
    
    # If no abstract from PubMed, this DOI needs CrossRef fallback
    if not record.get('abstract'):
        dois_needing_fallback.append(doi)

dois_needing_fallback = list(set(dois_needing_fallback))
print(f"DOIs needing CrossRef abstract lookup: {len(dois_needing_fallback):,}")

# Fetch abstracts from CrossRef
doi_to_crossref_abstract = {}
if dois_needing_fallback:
    print("Fetching abstracts from CrossRef...")
    for doi in tqdm(dois_needing_fallback, desc="CrossRef abstracts"):
        abstract = fetch_crossref_abstract(doi)
        if abstract and len(abstract) > 50:  # Only keep meaningful abstracts
            doi_to_crossref_abstract[doi.lower()] = abstract
        time.sleep(0.2)  # Polite rate limit
    
    print(f"\n✓ Found {len(doi_to_crossref_abstract):,} abstracts from CrossRef ({len(doi_to_crossref_abstract)/len(dois_needing_fallback)*100:.1f}%)")
else:
    print("✓ No references need CrossRef abstract fallback")

# =============================================================================
# Compile Final Results
# =============================================================================
# Re-expand abstract data to ALL (study_id, review_doi) pairs via signature

print("\nCompiling final results...")
print("=" * 60)

# Build abstract data keyed by signature (from unique_refs)
signature_to_abstract = {}

for _, row in unique_refs.iterrows():
    sig = row['signature']
    pmid = str(row['final_pmid']) if pd.notna(row['final_pmid']) else None
    record = abstract_records.get(pmid, {}) if pmid else {}
    doi = str(row['final_doi']).lower() if pd.notna(row['final_doi']) else None
    
    # Get abstract: prefer PubMed, fallback to CrossRef
    abstract = record.get('abstract', '')
    abstract_source = 'pubmed' if abstract else None
    
    if not abstract and doi and doi in doi_to_crossref_abstract:
        abstract = doi_to_crossref_abstract[doi]
        abstract_source = 'crossref'
    
    # Determine match method
    if pd.notna(row.get('extracted_pmid')):
        method = 'pmid_direct'
    elif pd.notna(row.get('extracted_doi')) or pd.notna(row.get('ref_doi')):
        method = 'doi_direct'
    elif pd.notna(row.get('crossref_doi')):
        method = 'crossref'
    else:
        method = 'no_match'
    
    signature_to_abstract[sig] = {
        'pmid': pmid,
        'doi': row['final_doi'],
        'matched_title': record.get('title', ''),
        'matched_authors': record.get('authors', ''),
        'matched_year': record.get('year', ''),
        'abstract': abstract,
        'abstract_source': abstract_source,
        'match_method': method if pmid or (doi and abstract) else 'no_match'
    }

print(f"Abstract data built for {len(signature_to_abstract):,} unique signatures")

# Count abstracts by source
pubmed_abstracts = sum(1 for v in signature_to_abstract.values() if v.get('abstract_source') == 'pubmed')
crossref_abstracts_found = sum(1 for v in signature_to_abstract.values() if v.get('abstract_source') == 'crossref')
print(f"  Abstracts from PubMed: {pubmed_abstracts:,}")
print(f"  Abstracts from CrossRef (fallback): {crossref_abstracts_found:,}")

# Re-expand to ALL original (study_id, review_doi) pairs
output_rows = []

for _, row in refs_df.iterrows():
    sig = row['signature']
    abstract_data = signature_to_abstract.get(sig, {})
    
    output_rows.append({
        'study_id': row['study_id'],
        'review_doi': row['review_doi'],
        'category': row['category'],
        'original_title': row['title'],
        'original_authors': row['authors'],
        'original_year': row['year'],
        'pmid': abstract_data.get('pmid'),
        'doi': abstract_data.get('doi'),
        'matched_title': abstract_data.get('matched_title', ''),
        'matched_authors': abstract_data.get('matched_authors', ''),
        'matched_year': abstract_data.get('matched_year', ''),
        'abstract': abstract_data.get('abstract', ''),
        'abstract_source': abstract_data.get('abstract_source', ''),
        'match_method': abstract_data.get('match_method', 'no_match')
    })

results_df = pd.DataFrame(output_rows)

print(f"\nTotal references: {len(results_df):,}")
print(f"Unique (study_id, review_doi) pairs: {results_df.groupby(['study_id', 'review_doi']).ngroups:,}")
print(f"\nMatch methods:")
print(results_df['match_method'].value_counts().to_string())

PHASE 4b: CrossRef Abstract Fallback
DOIs needing CrossRef abstract lookup: 1,001
Fetching abstracts from CrossRef...


CrossRef abstracts:   0%|          | 0/1001 [00:00<?, ?it/s]


✓ Found 228 abstracts from CrossRef (22.8%)

Compiling final results...
Abstract data built for 5,690 unique signatures
  Abstracts from PubMed: 2,781
  Abstracts from CrossRef (fallback): 235

Total references: 5,876
Unique (study_id, review_doi) pairs: 5,874

Match methods:
match_method
crossref      3237
no_match      2608
doi_direct      31


In [13]:
# =============================================================================
# Save Final Output (with review_doi preserved!)
# =============================================================================

# Filter to matched only
matched_refs = results_df[results_df['match_method'] != 'no_match'].copy()

print("FINAL RESULTS")
print("=" * 60)
print(f"Total matched: {len(matched_refs):,} / {len(results_df):,} ({len(matched_refs)/len(results_df)*100:.1f}%)")
print(f"Unique (study_id, review_doi) pairs: {matched_refs.groupby(['study_id', 'review_doi']).ngroups:,}")
print(f"\nBy category:")
print(matched_refs['category'].value_counts().to_string())

has_abstract = matched_refs['abstract'].str.len() > 0
print(f"\nWith abstracts: {has_abstract.sum():,} ({has_abstract.mean()*100:.1f}%)")

# Verify review_doi is in output
print(f"\n✓ Columns in output: {matched_refs.columns.tolist()}")

# Save
matched_refs.to_csv(OUTPUT_CSV, index=False)
print(f"\n✓ Saved to {OUTPUT_CSV}")

FINAL RESULTS
Total matched: 3,268 / 5,876 (55.6%)
Unique (study_id, review_doi) pairs: 3,267

By category:
category
excluded    2367
included     660
awaiting     157
ongoing       84

With abstracts: 3,152 (96.5%)

✓ Columns in output: ['study_id', 'review_doi', 'category', 'original_title', 'original_authors', 'original_year', 'pmid', 'doi', 'matched_title', 'matched_authors', 'matched_year', 'abstract', 'abstract_source', 'match_method']

✓ Saved to c:\Users\juanx\Documents\LSE-UKHSA Project\Data\referenced_paper_abstracts.csv


In [14]:
# =============================================================================
# Summary Statistics
# =============================================================================

print("=" * 60)
print("PIPELINE SUMMARY")
print("=" * 60)

print(f"\nInput: {len(unique_refs):,} unique references from public health reviews")

print(f"\nPhase 1 - Direct extraction:")
direct_doi = unique_refs['extracted_doi'].notna().sum() if 'extracted_doi' in unique_refs.columns else 0
print(f"  DOIs extracted: {direct_doi:,}")

if PROGRESS_CSV.exists():
    crossref_df = pd.read_csv(PROGRESS_CSV)
    crossref_found = crossref_df['crossref_doi'].notna().sum()
    print(f"\nPhase 2 - CrossRef DOI lookup (with fuzzy matching):")
    print(f"  DOIs found: {crossref_found:,} / {len(crossref_df):,} ({crossref_found/len(crossref_df)*100:.1f}%)")

print(f"\nPhase 3 - DOI → PMID:")
actual_pmids = sum(1 for v in doi_to_pmid.values() if v is not None)
print(f"  PMIDs found: {actual_pmids:,}")

print(f"\nPhase 4 - Abstract fetch:")
print(f"  PubMed records fetched: {len(abstract_records):,}")
pubmed_with_abstract = sum(1 for r in abstract_records.values() if r.get('abstract'))
print(f"  With PubMed abstracts: {pubmed_with_abstract:,}")

print(f"\nPhase 4b - CrossRef abstract fallback:")
print(f"  Additional abstracts from CrossRef: {len(doi_to_crossref_abstract):,}")

total_abstracts = pubmed_with_abstract + len(doi_to_crossref_abstract)
print(f"\nTotal abstracts recovered: {total_abstracts:,}")

print(f"\nFinal output: {OUTPUT_CSV.name}")
print(f"  Matched: {len(matched_refs):,} ({len(matched_refs)/len(unique_refs)*100:.1f}%)")
print(f"  By category: {dict(matched_refs['category'].value_counts())}")

# Abstract coverage
with_abstract = matched_refs['abstract'].str.len() > 0 
print(f"\nAbstract coverage:")
print(f"  References with abstracts: {with_abstract.sum():,} / {len(matched_refs):,} ({with_abstract.mean()*100:.1f}%)")
if 'abstract_source' in matched_refs.columns:
    print(f"  Source breakdown: {dict(matched_refs[matched_refs['abstract'].str.len() > 0]['abstract_source'].value_counts())}")

PIPELINE SUMMARY

Input: 5,694 unique references from public health reviews

Phase 1 - Direct extraction:
  DOIs extracted: 12

Phase 2 - CrossRef DOI lookup (with fuzzy matching):
  DOIs found: 3,795 / 4,948 (76.7%)

Phase 3 - DOI → PMID:
  PMIDs found: 2,744

Phase 4 - Abstract fetch:
  PubMed records fetched: 2,744
  With PubMed abstracts: 2,632

Phase 4b - CrossRef abstract fallback:
  Additional abstracts from CrossRef: 228

Total abstracts recovered: 2,860

Final output: referenced_paper_abstracts.csv
  Matched: 3,268 (57.4%)
  By category: {'excluded': np.int64(2367), 'included': np.int64(660), 'awaiting': np.int64(157), 'ongoing': np.int64(84)}

Abstract coverage:
  References with abstracts: 3,152 / 3,268 (96.5%)
  Source breakdown: {'pubmed': np.int64(2914), 'crossref': np.int64(238)}
