# 00: Obtain Cochrane Review Abstracts from PubMed

## Summary
This notebook fetches Cochrane Systematic Review abstracts from PubMed/NCBI using the Entrez API.

**Pipeline Position:** First notebook - obtains source data for the project.

**What this notebook does:**
1. Queries PubMed for every record published in the *Cochrane Database of Systematic Reviews* using a journal filter (`"Cochrane Database Syst Rev"[Journal]`)
2. Fetches record metadata (PMID, title, abstract, DOI, year)
3. Saves data to CSV for downstream processing

**Output:** `Data/cochrane_pubmed_abstracts.csv`

**Requirements:**
- NCBI API key (optional, but recommended: it raises the E-utilities rate limit from 3 to 10 requests per second)
- `.env` file with `NCBI_EMAIL` and, optionally, `NCBI_API_KEY`

**Note:** This notebook ONLY fetches abstracts. Reference extraction is handled separately via Wiley TDM PDFs, because the references in PubMed XML are not categorised as included or excluded studies.

In [1]:
# Install required packages
%pip install -q biopython python-dotenv pandas tqdm

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Set up paths and load environment variables
import os
from pathlib import Path
from dotenv import load_dotenv
from Bio import Entrez
import pandas as pd
from tqdm.notebook import tqdm
import time

notebook_dir = Path.cwd()
project_root = notebook_dir if (notebook_dir / ".env").exists() else notebook_dir.parent
env_path = project_root / ".env"

load_dotenv(env_path, override=True)

Entrez.email = os.getenv("NCBI_EMAIL", "")
# Leave the key unset (None) when it is missing, rather than sending an empty api_key
Entrez.api_key = os.getenv("NCBI_API_KEY") or None

DATA_DIR = project_root / "Data"
DATA_DIR.mkdir(exist_ok=True)

print(f"Project root: {project_root}")
print(f"Data directory: {DATA_DIR}")
print(f"NCBI email configured: {bool(Entrez.email)}")
print(f"NCBI API key configured: {bool(Entrez.api_key)}")

Project root: c:\Users\juanx\Documents\LSE-UKHSA Project
Data directory: c:\Users\juanx\Documents\LSE-UKHSA Project\Data
NCBI email configured: True
NCBI API key configured: True
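
The setup cell falls back to an empty string when `NCBI_EMAIL` is missing, so a misconfigured `.env` file only shows up as `False` in the printout. A minimal sketch of a stricter check follows. The helper names `require_env` and `optional_env` are hypothetical and are not part of this notebook, python-dotenv, or Biopython.

```python
import os


def require_env(name, env=None):
    """Return a stripped environment value, or raise if it is missing or blank.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    value = (env.get(name) or "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to the project's .env file")
    return value


def optional_env(name, env=None):
    """Return a stripped environment value, or None if it is missing or blank."""
    env = os.environ if env is None else env
    value = (env.get(name) or "").strip()
    return value or None
```

After `load_dotenv(...)` has run, these could be used as `Entrez.email = require_env("NCBI_EMAIL")` and `Entrez.api_key = optional_env("NCBI_API_KEY")`, so the notebook stops early instead of sending requests without a contact address.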


In [16]:
# Search PubMed for ALL Cochrane Database entries (reviews, protocols, overviews, etc.)
# Using only journal filter to get everything published in Cochrane Database of Systematic Reviews
SEARCH_QUERY = '"Cochrane Database Syst Rev"[Journal]'

# First get total count
handle = Entrez.esearch(db="pubmed", term=SEARCH_QUERY, retmax=0)
result = Entrez.read(handle)
handle.close()

total_count = int(result["Count"])
print(f"Total Cochrane Database entries found: {total_count:,}")

Total Cochrane Database entries found: 17,298


In [17]:
# Fetch PMIDs in batches of 9999 (API limit) using date ranges
# Then fetch full metadata for each PMID batch

all_pmids = []

# Get PMIDs by year to bypass the 9999 limit
years = list(range(1995, 2027))  # Cochrane started mid-1990s

for year in tqdm(years, desc="Fetching PMIDs by year"):
    year_query = f'{SEARCH_QUERY} AND {year}[pdat]'
    handle = Entrez.esearch(db="pubmed", term=year_query, retmax=9999)
    result = Entrez.read(handle)
    handle.close()
    all_pmids.extend(result["IdList"])
    time.sleep(0.1)

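# Drop any PMID returned for more than one year, keeping first-seen order
all_pmids = list(dict.fromkeys(all_pmids))
print(f"Retrieved {len(all_pmids):,} unique PMIDs")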

Retrieved 17,298 unique PMIDs
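
The per-year loop only retrieves everything if no single year has more hits than `retmax`. Here the retrieved total matches the earlier count, but a sketch of an explicit check may be useful if the query changes. The function `check_pmid_coverage` and the `counts_by_year` mapping are hypothetical names introduced for illustration.

```python
def check_pmid_coverage(counts_by_year, retrieved_pmids, expected_total, retmax=9999):
    """Summarise whether per-year searches plausibly covered the full result set.

    counts_by_year  : dict mapping year -> hit count reported for that year's query
    retrieved_pmids : every PMID collected across all years
    expected_total  : hit count reported for the unrestricted query
    """
    unique_count = len(set(retrieved_pmids))
    return {
        # Years where more hits exist than the number of IDs requested
        "truncated_years": sorted(
            year for year, count in counts_by_year.items() if count > retmax
        ),
        "unique_pmids": unique_count,
        # Positive when some records were not retrieved by any year query
        "shortfall": expected_total - unique_count,
    }
```

Inside the loop, `counts_by_year[year] = int(result["Count"])` would fill the mapping, and `check_pmid_coverage(counts_by_year, all_pmids, total_count)` could then be inspected before fetching metadata.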


In [6]:
# Define function to parse PubMed record metadata

def parse_pubmed_record(article):
    """Extract key fields from a PubMed article record."""
    medline = article.get('MedlineCitation', {})
    article_data = medline.get('Article', {})
    
    pmid = str(medline.get('PMID', ''))
    title = article_data.get('ArticleTitle', '')
    
    # Extract abstract
    abstract_data = article_data.get('Abstract', {})
    abstract_texts = abstract_data.get('AbstractText', [])
    if isinstance(abstract_texts, list):
        abstract = ' '.join([str(t) for t in abstract_texts])
    else:
        abstract = str(abstract_texts)
    
    # Extract DOI
    doi = ''
    article_ids = article.get('PubmedData', {}).get('ArticleIdList', [])
    for aid in article_ids:
        if aid.attributes.get('IdType') == 'doi':
            doi = str(aid)
            break
    
    # Extract year
    pub_date = article_data.get('Journal', {}).get('JournalIssue', {}).get('PubDate', {})
    year = pub_date.get('Year', '')
    if not year:
        medline_date = pub_date.get('MedlineDate', '')
        if medline_date:
            year = medline_date[:4]
    
    return {
        'pmid': pmid,
        'title': title,
        'abstract': abstract,
        'doi': doi,
        'year': year
    }
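
Cochrane abstracts in PubMed are usually structured, and each `AbstractText` section carries a `Label` attribute (for example `BACKGROUND`). The parser above joins the section text and drops those labels. The sketch below shows one way to keep them, assuming each section is a string with an `attributes` dict, which is the same assumption the DOI loop above makes for article IDs. `join_labelled_abstract` and `FakeSection` are hypothetical names; `FakeSection` only stands in for a parsed section so the example runs without Biopython.

```python
class FakeSection(str):
    """Illustrative stand-in for a parsed section: a str with an `attributes` dict."""

    def __new__(cls, value, attributes=None):
        obj = super().__new__(cls, value)
        obj.attributes = attributes or {}
        return obj


def join_labelled_abstract(abstract_texts):
    """Join abstract sections, prefixing each with its Label when one is present."""
    # A single section may arrive as one string rather than a list
    if isinstance(abstract_texts, str):
        abstract_texts = [abstract_texts]
    parts = []
    for section in abstract_texts:
        attributes = getattr(section, "attributes", None) or {}
        label = attributes.get("Label", "")
        text = str(section).strip()
        if not text:
            continue  # skip empty sections
        parts.append(f"{label}: {text}" if label else text)
    return " ".join(parts)


sections = [
    FakeSection("Pain is common.", {"Label": "BACKGROUND"}),
    FakeSection("We searched CENTRAL.", {"Label": "SEARCH METHODS"}),
]
print(join_labelled_abstract(sections))
```

Keeping the labels makes it possible to split an abstract back into its sections downstream. Whether that is worth the extra text in the CSV depends on how later notebooks use the `abstract` column.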

In [18]:
# Fetch full metadata for all PMIDs in batches
FETCH_BATCH = 200

records = []
for i in tqdm(range(0, len(all_pmids), FETCH_BATCH), desc="Fetching metadata"):
    batch = all_pmids[i:i + FETCH_BATCH]
    
    try:
        handle = Entrez.efetch(db="pubmed", id=",".join(batch), retmode="xml")
        fetched = Entrez.read(handle)
        handle.close()
        
        for article in fetched.get('PubmedArticle', []):
            records.append(parse_pubmed_record(article))
    except Exception as e:
        print(f"Error at batch {i}: {e}")
    
    time.sleep(0.15)  # Rate limiting

print(f"Fetched metadata for {len(records):,} records")

Fetched metadata for 17,298 records
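
In the fetch loop above, a batch that raises an exception is reported and then skipped, so a transient network error can silently drop up to 200 records. A minimal sketch of retrying with exponential backoff is shown below. `fetch_with_retry` is a hypothetical helper, and the `fetch` argument is assumed to be any callable that takes a batch of PMIDs and returns the parsed result.

```python
import time


def fetch_with_retry(fetch, batch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(batch), retrying with exponential backoff on any exception.

    Waits base_delay, then 2 * base_delay, and so on between attempts.
    Re-raises the last exception once max_attempts have failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(batch)
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter exists only so the delay can be replaced in tests. In the notebook, the body of the `try` block could be wrapped in a small function and passed as `fetch`, with batches that still fail collected in a list for a later rerun.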


In [19]:
# Create DataFrame and save to CSV
df = pd.DataFrame(records)

output_file = DATA_DIR / "cochrane_pubmed_abstracts.csv"
df.to_csv(output_file, index=False)

print(f"Saved {len(df):,} Cochrane review abstracts to {output_file.name}")
print(f"\nDataset preview:")
df.head()

Saved 17,298 Cochrane review abstracts to cochrane_pubmed_abstracts.csv

Dataset preview:


Unnamed: 0,pmid,title,abstract,doi,year
0,17636697,WITHDRAWN: Aldose reductase inhibitors for the...,Diabetic peripheral neuropathy is a common com...,10.1002/14651858.CD002182,1996
1,17636615,WITHDRAWN: Antiplatelet therapy for preventing...,People with nonrheumatic atrial fibrillation w...,10.1002/14651858.CD000186.pub2,1996
2,17636606,WITHDRAWN: Kinin-enhancing drugs for unexplain...,Oligo-astheno-teratospermia (sperm of low conc...,10.1002/14651858.CD000153,1996
3,17636589,WITHDRAWN: Human chorionic gonadotrophin for r...,There may be an association between recurrent ...,10.1002/14651858.CD000101.pub2,1996
4,17636588,WITHDRAWN: Gonadotrophin-releasing hormone ana...,Elevation of endogenous LH levels may result i...,10.1002/14651858.CD000097.pub2,1996


In [None]:
# Quick quality check
# Note: parse_pubmed_record stores missing abstracts and DOIs as empty strings,
# so notna() counts every row here, not only the populated ones.
print(f"Records with abstracts: {df['abstract'].notna().sum():,}")
print(f"Records with DOIs: {df['doi'].notna().sum():,}")
print(f"Year range: {df['year'].min()} - {df['year'].max()}")
print("\nNext step: Run notebook 01 for exploratory data analysis of the reviews")

Records with abstracts: 17,298
Records with DOIs: 17,298
Year range: 1996 - 2026
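
Because `parse_pubmed_record` returns `''` for a missing abstract or DOI, the `notna()` counts above equal the row count by construction. A stricter check counts values that are non-empty after stripping whitespace. The sketch below works on the `records` list of dicts; `count_non_empty` is a hypothetical helper name.

```python
def count_non_empty(records, field):
    """Count records whose `field` holds a non-blank value."""
    return sum(1 for record in records if str(record.get(field) or "").strip())


# Small illustrative sample, not real PubMed data
sample = [
    {"pmid": "1", "abstract": "Some text.", "doi": "10.1002/example"},
    {"pmid": "2", "abstract": "", "doi": "10.1002/example2"},
    {"pmid": "3", "abstract": "   ", "doi": ""},
]
print(count_non_empty(sample, "abstract"))
print(count_non_empty(sample, "doi"))
```

On the DataFrame, an equivalent expression would be `df['abstract'].fillna('').str.strip().ne('').sum()`.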
