# 00: Obtain Cochrane Review Abstracts from PubMed

## Summary
This notebook fetches Cochrane Systematic Review abstracts from PubMed/NCBI using the Entrez API.

**Pipeline Position:** First notebook - obtains source data for the project.

**What this notebook does:**
1. Queries PubMed for Cochrane reviews using official publication type filters
2. Fetches abstract metadata for each review (PMID, title, abstract, DOI, year)
3. Saves data to CSV for downstream processing

**Output:** `Data/cochrane_pubmed_abstracts.csv`

**Requirements:**
- NCBI API key (optional but recommended for rate limits)
- `.env` file with NCBI_EMAIL and optionally NCBI_API_KEY

**Note:** This notebook ONLY fetches abstracts. Reference extraction is handled separately via Wiley TDM PDFs because PubMed XML references are not categorised (included vs excluded studies).
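
The rate-limit remark in the requirements can be made concrete. NCBI documents a limit of roughly 3 requests per second without an API key and 10 per second with one. The sketch below shows how a polite pause between calls might be chosen; `min_request_interval` is a hypothetical helper introduced here for illustration and is not part of this notebook or of Biopython.

```python
import os

# Documented NCBI E-utilities limits: about 3 requests/second without an
# API key and about 10 requests/second with one.
REQUESTS_PER_SECOND_NO_KEY = 3
REQUESTS_PER_SECOND_WITH_KEY = 10


def min_request_interval(api_key):
    """Return the minimum pause in seconds between two E-utilities calls.

    Hypothetical helper for illustration; not part of Biopython.
    """
    rate = REQUESTS_PER_SECOND_WITH_KEY if api_key else REQUESTS_PER_SECOND_NO_KEY
    return 1.0 / rate


# Example: read the key the same way the notebook does and pick a delay.
delay = min_request_interval(os.getenv("NCBI_API_KEY", ""))
print(f"Pause between requests: {delay:.2f} s")
```

A value like this could replace the fixed `time.sleep(0.1)` calls when no API key is configured.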

In [None]:
# Install required packages
%pip install -q biopython python-dotenv pandas tqdm

In [None]:
# Set up paths and load environment variables
import os
from pathlib import Path
from dotenv import load_dotenv
from Bio import Entrez
import pandas as pd
from tqdm.notebook import tqdm
import time

notebook_dir = Path.cwd()
project_root = notebook_dir if (notebook_dir / ".env").exists() else notebook_dir.parent
env_path = project_root / ".env"

load_dotenv(env_path, override=True)

Entrez.email = os.getenv("NCBI_EMAIL", "")
Entrez.api_key = os.getenv("NCBI_API_KEY", "")

DATA_DIR = project_root / "Data"
DATA_DIR.mkdir(exist_ok=True)

print(f"Project root: {project_root}")
print(f"Data directory: {DATA_DIR}")
print(f"NCBI email configured: {bool(Entrez.email)}")
print(f"NCBI API key configured: {bool(Entrez.api_key)}")

In [None]:
# Define search query for Cochrane Systematic Reviews
SEARCH_QUERY = '"Cochrane Database Syst Rev"[Journal] AND systematic review[pt]'

# Perform search to get count
handle = Entrez.esearch(db="pubmed", term=SEARCH_QUERY, retmax=0)
result = Entrez.read(handle)
handle.close()

total_count = int(result["Count"])
print(f"Total Cochrane reviews found: {total_count:,}")

In [None]:
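# Fetch all PMIDs in batches.
# Note: PubMed's ESearch only serves the first 10,000 records of a query, so
# results beyond that cannot be reached by raising retstart. If total_count is
# larger, split the query (for example by publication date) and page each part.
BATCH_SIZE = 10000

all_pmids = []
for start in tqdm(range(0, total_count, BATCH_SIZE), desc="Fetching PMIDs"):
    handle = Entrez.esearch(
        db="pubmed",
        term=SEARCH_QUERY,
        retstart=start,
        retmax=BATCH_SIZE,
    )
    result = Entrez.read(handle)
    handle.close()
    all_pmids.extend(result["IdList"])
    time.sleep(0.1)

print(f"Retrieved {len(all_pmids):,} PMIDs")
if len(all_pmids) < total_count:
    print(f"Warning: expected {total_count:,} PMIDs; the query likely hit the 10,000-record ESearch limit.")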
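
PubMed's ESearch serves at most the first 10,000 records of a query, so a larger result set has to be split before paging. One possible approach is to bisect the publication-year range until every slice fits under a chosen limit. The sketch below shows only the splitting logic and runs offline: `split_year_ranges` and `fake_count` are hypothetical names, the yearly counts are made up, and in a real run the count function would wrap an `Entrez.esearch(..., retmax=0)` call.

```python
def split_year_ranges(count_for_range, start_year, end_year, limit=9500):
    """Split [start_year, end_year] into ranges whose hit count is <= limit.

    count_for_range(start, end) must return the number of matching records.
    A single year that still exceeds the limit is returned as-is.
    """
    ranges = []
    stack = [(start_year, end_year)]
    while stack:
        start, end = stack.pop()
        count = count_for_range(start, end)
        if count == 0:
            continue  # nothing published in this range
        if count <= limit or start == end:
            ranges.append((start, end, count))
        else:
            mid = (start + end) // 2
            stack.append((mid + 1, end))
            stack.append((start, mid))  # popped first, keeps ranges in order
    return ranges


# Offline demonstration with made-up yearly counts (not real PubMed data).
fake_counts = {2000: 4000, 2001: 5000, 2002: 6000, 2003: 3000}


def fake_count(start, end):
    return sum(n for year, n in fake_counts.items() if start <= year <= end)


for start, end, count in split_year_ranges(fake_count, 2000, 2003):
    print(f'("{start}"[PDAT] : "{end}"[PDAT]) -> {count} records')
```

Each returned range can then be appended to the search term as a date clause and paged separately with `retstart`.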

In [None]:
# Define function to parse PubMed record metadata

def parse_pubmed_record(article):
    """Extract key fields from a PubMed article record."""
    medline = article.get('MedlineCitation', {})
    article_data = medline.get('Article', {})
    
    pmid = str(medline.get('PMID', ''))
    title = article_data.get('ArticleTitle', '')
    
    # Extract abstract
    abstract_data = article_data.get('Abstract', {})
    abstract_texts = abstract_data.get('AbstractText', [])
    if isinstance(abstract_texts, list):
        abstract = ' '.join([str(t) for t in abstract_texts])
    else:
        abstract = str(abstract_texts)
    
    # Extract DOI
    doi = ''
    article_ids = article.get('PubmedData', {}).get('ArticleIdList', [])
    for aid in article_ids:
        if aid.attributes.get('IdType') == 'doi':
            doi = str(aid)
            break
    
    # Extract year
    pub_date = article_data.get('Journal', {}).get('JournalIssue', {}).get('PubDate', {})
    year = pub_date.get('Year', '')
    if not year:
        medline_date = pub_date.get('MedlineDate', '')
        if medline_date:
            year = medline_date[:4]
    
    return {
        'pmid': pmid,
        'title': title,
        'abstract': abstract,
        'doi': doi,
        'year': year
    }
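
The year fallback above takes the first four characters of `MedlineDate`, which works for values such as `1998 Dec-1999 Jan` but not if the string starts with a season or month. A slightly more defensive variant, sketched here under the assumption that the first four-digit number in the string is the publication year, could look like this. `extract_year` is a hypothetical helper, not part of the notebook.

```python
import re


def extract_year(pub_date):
    """Return a four-digit year from a PubDate-like mapping, or ''.

    Prefers the 'Year' field and falls back to the first four-digit
    number found in 'MedlineDate' (e.g. '1998 Dec-1999 Jan').
    """
    year = str(pub_date.get("Year", "")).strip()
    if year:
        return year
    match = re.search(r"\b(\d{4})\b", str(pub_date.get("MedlineDate", "")))
    return match.group(1) if match else ""


print(extract_year({"Year": "2020"}))
print(extract_year({"MedlineDate": "1998 Dec-1999 Jan"}))
print(extract_year({}))
```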

In [None]:
# Fetch full metadata in batches
FETCH_BATCH = 200

records = []
for i in tqdm(range(0, len(all_pmids), FETCH_BATCH), desc="Fetching metadata"):
    batch = all_pmids[i:i + FETCH_BATCH]
    
    handle = Entrez.efetch(
        db="pubmed",
        id=",".join(batch),
        retmode="xml",
    )
    fetched = Entrez.read(handle)
    handle.close()
    
    for article in fetched.get('PubmedArticle', []):
        records.append(parse_pubmed_record(article))
    
    time.sleep(0.1)

print(f"Fetched metadata for {len(records):,} records")

In [None]:
# Create DataFrame and save to CSV
df = pd.DataFrame(records)

output_file = DATA_DIR / "cochrane_pubmed_abstracts.csv"
df.to_csv(output_file, index=False)

print(f"Saved {len(df):,} Cochrane review abstracts to {output_file.name}")
print("\nDataset preview:")
df.head()

In [None]:
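# Quick quality check (missing values are stored as empty strings, not NaN)
print(f"Records with abstracts: {(df['abstract'].astype(str).str.strip() != '').sum():,}")
print(f"Records with DOIs: {(df['doi'].astype(str).str.strip() != '').sum():,}")
years = pd.to_numeric(df['year'], errors='coerce').dropna()
print(f"Year range: {years.min():.0f} - {years.max():.0f}")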

# 00: Download Cochrane Review Abstracts from PubMed

## Summary
This notebook downloads all Cochrane systematic review abstracts from PubMed. Cochrane reviews are widely regarded as gold-standard systematic reviews of health research; each one screens candidate studies against predefined criteria and records which were included and which were excluded.

**Pipeline Position:** First notebook - obtains the Cochrane reviews that will be used to fetch included/excluded studies via Wiley TDM.

**What this notebook does:**
1. Searches PubMed for all Cochrane Database of Systematic Reviews articles with abstracts
2. Fetches abstracts and metadata for each review (~17,000 reviews)
3. Saves to CSV for use in downstream notebooks

**Output:** `Data/cochrane_pubmed_abstracts.csv`

**Note:** Reference extraction is NOT done here. The PubMed XML reference list does not properly categorize included/excluded studies. Use notebook 02 to fetch categorized references via Wiley TDM.

In [None]:
# Install required packages for PubMed access
%pip install -q biopython python-dotenv pandas

In [None]:
# Set up environment: load credentials and configure paths
import os
from pathlib import Path
from dotenv import load_dotenv
from Bio import Entrez
import csv
import time

notebook_dir = Path.cwd()
project_root = notebook_dir if (notebook_dir / ".env").exists() else notebook_dir.parent
env_path = project_root / ".env"
load_dotenv(env_path, override=True)

Entrez.email = os.getenv("NCBI_EMAIL", "")
Entrez.api_key = os.getenv("NCBI_API_KEY", "")

print(f"Loaded .env from: {env_path}")
print(f"NCBI_EMAIL present: {'yes' if Entrez.email else 'no'}")

QUERY = '("Cochrane Database Syst Rev"[Journal]) AND hasabstract[text]'
OUT_CSV = project_root / "Data" / "cochrane_pubmed_abstracts.csv"
BATCH_SIZE = 50
SLEEP = 0.9
MAX_RECORDS = None  # Set to small number for testing

if not Entrez.email or "example.com" in Entrez.email:
    raise ValueError(f"NCBI_EMAIL not set. Create a .env file at {env_path} with your email.")

In [None]:
# Define helper functions for searching and fetching PubMed records
from urllib.error import HTTPError
from io import StringIO
from Bio import Medline

MAX_PUBMED_RETRIEVAL = 9500

def esearch_count(query: str) -> int:
    rec = Entrez.read(Entrez.esearch(db="pubmed", term=query, retmax=0))
    return int(rec["Count"])

def split_query_by_year(query: str, start_year: int, end_year: int) -> list:
    stack = [(start_year, end_year)]
    slices = []
    while stack:
        s, e = stack.pop()
        date_clause = f'("{s}"[PDAT] : "{e}"[PDAT])'
        q = f"({query}) AND {date_clause}"
        cnt = esearch_count(q)
        if cnt <= MAX_PUBMED_RETRIEVAL:
            slices.append((q, s, e))
        elif s >= e:
            # A single year cannot be split further; keep it even if it exceeds the limit
            slices.append((q, s, e))
        else:
            mid = (s + e) // 2
            stack.append((s, mid))
            stack.append((mid + 1, e))
    return slices

def esearch_all_ids(query: str, max_records=None, start_year: int = 1900, end_year: int = 2035):
    slices = split_query_by_year(query, start_year, end_year)
    pmids = []
    total = 0
    for q, s, e in slices:
        rec0 = Entrez.read(Entrez.esearch(db="pubmed", term=q, retmax=0))
        count_slice = int(rec0["Count"])
        limit_slice = count_slice if max_records is None else min(count_slice, max_records - total)
        for start in range(0, limit_slice, 1000):
            retmax = min(1000, limit_slice - start)
            rec = Entrez.read(Entrez.esearch(db="pubmed", term=q, retstart=start, retmax=retmax))
            pmids.extend(rec["IdList"])
            time.sleep(SLEEP)
        total += limit_slice
        if max_records is not None and total >= max_records:
            break
    return total, pmids

def efetch_medline(id_chunk):
    for attempt in range(3):
        try:
            handle = Entrez.efetch(db="pubmed", id=",".join(id_chunk), rettype="medline", retmode="text")
            return handle.read()
        except HTTPError:
            if attempt == 2:
                raise
            time.sleep(SLEEP * (attempt + 2))

def medline_to_rows(medline_text: str):
    for record in Medline.parse(StringIO(medline_text)):
        yield {
            "pmid": record.get("PMID", ""),
            "title": record.get("TI", ""),
            "abstract": record.get("AB", ""),
            "journal": record.get("JT", ""),
            "year": record.get("DP", "").split(" ")[0],
            "authors": "; ".join(record.get("AU", [])),
        }

def write_pubmed_to_csv(query: str, out_path: Path, batch_size: int, max_records=None):
    count, pmids = esearch_all_ids(query, max_records=max_records)
    print(f"Found {count} records; fetching {len(pmids)} IDs...")
    
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["pmid", "title", "abstract", "journal", "year", "authors"])
        writer.writeheader()
        for i in range(0, len(pmids), batch_size):
            chunk = pmids[i : i + batch_size]
            medline_chunk = efetch_medline(chunk)
            for row in medline_to_rows(medline_chunk):
                writer.writerow(row)
            time.sleep(SLEEP)
            if (i // batch_size) % 20 == 0:
                print(f"  Progress: {i + len(chunk)}/{len(pmids)}")
    
    print(f"Saved abstracts to {out_path.resolve()}")
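
The retry pattern in `efetch_medline` can be factored into a small reusable helper and exercised without network access. The sketch below is illustrative only: `call_with_retries` and `flaky` are hypothetical names, and the pause schedule simply mirrors the `SLEEP * (attempt + 2)` rule used above. In the notebook the retried exception would be `urllib.error.HTTPError`.

```python
import time


def call_with_retries(func, attempts=3, base_delay=0.9, retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying on the given exceptions with a growing pause.

    The pause before retry k (0-based) is base_delay * (k + 2).
    The last failure is re-raised.
    """
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (attempt + 2))


# Offline demonstration: a function that fails twice, then succeeds.
calls = {"n": 0}
pauses = []


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("temporary failure")
    return "ok"


# Record the requested pauses instead of actually sleeping.
result = call_with_retries(flaky, retry_on=(OSError,), sleep=pauses.append)
print(result, calls["n"], pauses)
```

Passing `sleep` as a parameter keeps the helper testable: the demonstration records the pauses rather than waiting for them.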

In [None]:
# Execute the download (skips if file already exists)
import pandas as pd

if OUT_CSV.exists():
    print("Data file already exists - skipping download.")
else:
    write_pubmed_to_csv(QUERY, OUT_CSV, BATCH_SIZE, max_records=MAX_RECORDS)

print("\nAbstracts preview:")
df = pd.read_csv(OUT_CSV)
print(f"Total reviews: {len(df):,}")
print(df.head())

# Downloading Cochrane Reviews from PubMed

**Summary:** In this notebook, I download all Cochrane systematic reviews and their reference lists from PubMed. Cochrane reviews are high-quality systematic reviews of health research, and their reference lists contain the papers that were "included" in each review after screening, alongside excluded studies and other cited works. PubMed's reference list does not label which is which.

**What I do:**
1. I search PubMed for all Cochrane Database of Systematic Reviews articles with abstracts
2. I fetch the abstracts and metadata for each review (~17,000 reviews)
3. I fetch the reference lists to get the cited papers (~1.2 million reference edges)
4. I save both datasets to CSV files

**Output files:**
- `cochrane_pubmed_abstracts.csv` - Cochrane review abstracts and metadata
- `cochrane_pubmed_references.csv` - Links between reviews and their cited papers

**Requirements:** You need to set up a `.env` file with your NCBI credentials (NCBI_EMAIL and optionally NCBI_API_KEY)

In [1]:
# I install the required packages for accessing PubMed
%pip install -q biopython python-dotenv pandas

Note: you may need to restart the kernel to use updated packages.


In [2]:
# I set up the environment, load credentials, and configure the PubMed query
import os
from pathlib import Path
from dotenv import load_dotenv
from Bio import Entrez
import csv
import time

notebook_dir = Path.cwd()
project_root = notebook_dir if (notebook_dir / ".env").exists() else notebook_dir.parent
env_path = project_root / ".env"
load_dotenv(env_path, override=True)

Entrez.email = os.getenv("NCBI_EMAIL", "")
Entrez.api_key = os.getenv("NCBI_API_KEY", "")

print(f"Loaded .env from: {env_path}")
print(f"NCBI_EMAIL present: {'yes' if Entrez.email else 'no'}")

QUERY = '("Cochrane Database Syst Rev"[Journal]) AND hasabstract[text]'
OUT_CSV = project_root / "Data" / "cochrane_pubmed_abstracts.csv"
OUT_REF_CSV = project_root / "Data" / "cochrane_pubmed_references.csv"

BATCH_SIZE = 50
SLEEP = 0.9
MAX_RECORDS = None

if not Entrez.email or "example.com" in Entrez.email:
    raise ValueError(f"NCBI_EMAIL not set. Create a .env file at {env_path} with your email.")

Loaded .env from: c:\Users\juanx\Documents\LSE-UKHSA Project\.env
NCBI_EMAIL present: yes


In [3]:
# I define all the helper functions to search PubMed, fetch records, and parse the results
from urllib.error import HTTPError
import xml.etree.ElementTree as ET
from io import StringIO
from Bio import Medline

MAX_PUBMED_RETRIEVAL = 9500

def esearch_count(query: str) -> int:
    rec = Entrez.read(Entrez.esearch(db="pubmed", term=query, retmax=0))
    return int(rec["Count"])

def split_query_by_year(query: str, start_year: int, end_year: int) -> list:
    stack = [(start_year, end_year)]
    slices = []
    while stack:
        s, e = stack.pop()
        date_clause = f'("{s}"[PDAT] : "{e}"[PDAT])'
        q = f"({query}) AND {date_clause}"
        cnt = esearch_count(q)
        if cnt <= MAX_PUBMED_RETRIEVAL:
            slices.append((q, s, e))
        elif s >= e:
            # A single year cannot be split further; keep it even if it exceeds the limit
            slices.append((q, s, e))
        else:
            mid = (s + e) // 2
            stack.append((s, mid))
            stack.append((mid + 1, e))
    return slices

def esearch_all_ids_with_slices(query: str, max_records=None, start_year: int = 1900, end_year: int = 2035):
    slices = split_query_by_year(query, start_year, end_year)
    pmids = []
    total = 0
    for q, s, e in slices:
        rec0 = Entrez.read(Entrez.esearch(db="pubmed", term=q, retmax=0))
        count_slice = int(rec0["Count"])
        limit_slice = count_slice if max_records is None else min(count_slice, max_records - total)
        for start in range(0, limit_slice, 1000):
            retmax = min(1000, limit_slice - start)
            rec = Entrez.read(Entrez.esearch(db="pubmed", term=q, retstart=start, retmax=retmax))
            pmids.extend(rec["IdList"])
            time.sleep(SLEEP)
        total += limit_slice
        if max_records is not None and total >= max_records:
            break
    return total, pmids

def efetch_medline_by_ids(id_chunk):
    for attempt in range(3):
        try:
            handle = Entrez.efetch(db="pubmed", id=",".join(id_chunk), rettype="medline", retmode="text")
            return handle.read()
        except HTTPError:
            if attempt == 2:
                raise
            time.sleep(SLEEP * (attempt + 2))

def efetch_xml_by_ids(id_chunk):
    for attempt in range(3):
        try:
            handle = Entrez.efetch(db="pubmed", id=",".join(id_chunk), rettype="xml", retmode="xml")
            return handle.read()
        except HTTPError:
            if attempt == 2:
                raise
            time.sleep(SLEEP * (attempt + 2))

def medline_to_rows(medline_text: str):
    for record in Medline.parse(StringIO(medline_text)):
        yield {
            "pmid": record.get("PMID", ""),
            "title": record.get("TI", ""),
            "abstract": record.get("AB", ""),
            "journal": record.get("JT", ""),
            "year": record.get("DP", "").split(" ")[0],
            "authors": "; ".join(record.get("AU", [])),
        }

def parse_references_from_xml(xml_text: str):
    root = ET.fromstring(xml_text)
    for article in root.findall(".//PubmedArticle"):
        citing_pmid = article.findtext(".//MedlineCitation/PMID") or ""
        for ref in article.findall(".//ReferenceList/Reference"):
            ref_pmid = ref.findtext(".//ArticleIdList/ArticleId[@IdType='pubmed']") or ""
            ref_doi = ref.findtext(".//ArticleIdList/ArticleId[@IdType='doi']") or ""
            ref_title = ref.findtext("Citation") or ""
            if citing_pmid and (ref_pmid or ref_doi or ref_title):
                yield {"citing_pmid": citing_pmid, "ref_pmid": ref_pmid, "ref_doi": ref_doi, "ref_title": ref_title}

def write_references_from_ids(pmids, batch_size: int, out_path: Path):
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["citing_pmid", "ref_pmid", "ref_doi", "ref_title"])
        writer.writeheader()
        for i in range(0, len(pmids), batch_size):
            chunk = pmids[i : i + batch_size]
            xml_chunk = efetch_xml_by_ids(chunk)
            for row in parse_references_from_xml(xml_chunk):
                writer.writerow(row)
            time.sleep(SLEEP)

def write_pubmed_to_csv(query: str, out_path: Path, batch_size: int, max_records=None, refs_out_path: Path = None):
    count, pmids = esearch_all_ids_with_slices(query, max_records=max_records)
    print(f"Found {count} records; fetching {len(pmids)} IDs...")
    
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["pmid", "title", "abstract", "journal", "year", "authors"])
        writer.writeheader()
        for i in range(0, len(pmids), batch_size):
            chunk = pmids[i : i + batch_size]
            medline_chunk = efetch_medline_by_ids(chunk)
            for row in medline_to_rows(medline_chunk):
                writer.writerow(row)
            time.sleep(SLEEP)
    
    if refs_out_path:
        print("Fetching reference lists (XML)...")
        write_references_from_ids(pmids, batch_size, refs_out_path)
    
    print(f"Saved abstracts to {out_path.resolve()}")
    if refs_out_path:
        print(f"Saved references to {refs_out_path.resolve()}")
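
To check the reference parsing without calling PubMed, I can run the same XPath expressions on a small hand-written document. This is a minimal sketch: the XML below is synthetic (the PMIDs, DOI and citation text are made up), it only mimics the elements the parser reads, and `reference_edges` is a hypothetical stand-in for `parse_references_from_xml`.

```python
import xml.etree.ElementTree as ET

# Synthetic PubMed-style XML; PMIDs, DOI and citation text are made up.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation><PMID>111</PMID></MedlineCitation>
    <PubmedData>
      <ReferenceList>
        <Reference>
          <Citation>Example A. A trial. J Ex. 2001.</Citation>
          <ArticleIdList>
            <ArticleId IdType="pubmed">222</ArticleId>
            <ArticleId IdType="doi">10.0000/example</ArticleId>
          </ArticleIdList>
        </Reference>
        <Reference>
          <Citation>Example B. Unindexed report. 1999.</Citation>
        </Reference>
      </ReferenceList>
    </PubmedData>
  </PubmedArticle>
</PubmedArticleSet>
""".strip()


def reference_edges(xml_text):
    """Yield one dict per (citing review, cited reference) pair."""
    root = ET.fromstring(xml_text)
    for article in root.findall(".//PubmedArticle"):
        citing = article.findtext(".//MedlineCitation/PMID") or ""
        for ref in article.findall(".//ReferenceList/Reference"):
            yield {
                "citing_pmid": citing,
                "ref_pmid": ref.findtext(".//ArticleIdList/ArticleId[@IdType='pubmed']") or "",
                "ref_doi": ref.findtext(".//ArticleIdList/ArticleId[@IdType='doi']") or "",
                "ref_title": ref.findtext("Citation") or "",
            }


edges = list(reference_edges(SAMPLE_XML))
for edge in edges:
    print(edge)
```

The second reference has no identifiers, so its edge carries only the free-text citation. Rows like that cannot be joined on PMID or DOI later.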

In [4]:
# I run the download - this fetches all Cochrane reviews and their references (skip if files exist)
import pandas as pd

if OUT_CSV.exists() and OUT_REF_CSV.exists():
    print("Data files already exist - skipping download.")
else:
    write_pubmed_to_csv(
        QUERY,
        OUT_CSV,
        BATCH_SIZE,
        max_records=MAX_RECORDS,
        refs_out_path=OUT_REF_CSV,
    )

print("\nAbstracts preview:")
print(pd.read_csv(OUT_CSV).head())

if OUT_REF_CSV.exists():
    print("\nReferences preview:")
    print(pd.read_csv(OUT_REF_CSV).head())

Data files already exist - skipping download.

Abstracts preview:
       pmid                                              title  \
0  41527994  Surgical interventions for treating vesicovagi...   
1  41524153  Physiology- versus angiography-guided percutan...   
2  41510790     Cladribine for people with multiple sclerosis.   
3  41510785  Oral iron supplements for children in malaria-...   
4  41500513                           Exercise for depression.   

                                            abstract  \
0  This is a protocol for a Cochrane Review (inte...   
1  This is a protocol for a Cochrane Review (inte...   
2  RATIONALE: Multiple sclerosis (MS) is a chroni...   
3  RATIONALE: Iron deficiency anaemia is a common...   
4  RATIONALE: Depression is a common cause of mor...   

                                       journal  year  \
0  The Cochrane database of systematic reviews  2026   
1  The Cochrane database of systematic reviews  2026   
2  The Cochrane database of syst