# Fetching Abstracts for Referenced Papers

**Summary:** In this notebook, I download the abstracts for all papers cited in Cochrane reviews. These cited papers are "included" studies that passed the systematic review screening process. I need their abstracts so I can use them as positive examples when evaluating how well LLMs can screen papers.

**What I do:**
1. I load the reference edges from the previous step (~1.2M edges, ~491K unique papers with PMIDs)
2. I fetch abstracts from PubMed in batches of 200, retrying failed requests with exponential backoff (up to 5 attempts per batch)
3. I save results incrementally so I can resume if interrupted
4. The full download takes about 2.5-3 hours due to PubMed rate limits

**Output:** `referenced_paper_abstracts.csv` with ~491K papers (about 90% have abstracts)
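
As a rough sanity check on the numbers above, the batch count and the per-batch time implied by the runtime estimate can be worked out directly. The PMID count and batch size come from this notebook; the per-batch figure is only what a 2.5-3 hour run would imply, not a measured value.

```python
import math

# Figures taken from this notebook
n_pmids = 491_531   # unique referenced PMIDs
batch_size = 200    # PMIDs per efetch request

n_batches = math.ceil(n_pmids / batch_size)
print(f"Batches needed: {n_batches:,}")

# Average wall-clock time per batch implied by a 2.5-3 hour run
for hours in (2.5, 3.0):
    secs_per_batch = hours * 3600 / n_batches
    print(f"{hours} h total -> about {secs_per_batch:.1f} s per batch")
```

The implied time per batch is several seconds, well above the fixed delay between requests, which suggests that most of the runtime is spent waiting on the requests themselves.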

In [1]:
# I set up the environment, load credentials, and configure paths
import os
import csv
import time
from pathlib import Path
from dotenv import load_dotenv
from Bio import Entrez, Medline
from io import StringIO
from urllib.error import HTTPError, URLError
from http.client import RemoteDisconnected
import pandas as pd
from datetime import datetime

notebook_dir = Path.cwd()
project_root = notebook_dir if (notebook_dir / ".env").exists() else notebook_dir.parent
env_path = project_root / ".env"
load_dotenv(env_path, override=True)

# Fall back to None (not "") so Biopython treats missing credentials as unset
Entrez.email = os.getenv("NCBI_EMAIL") or None
Entrez.api_key = os.getenv("NCBI_API_KEY") or None

print(f"NCBI_EMAIL present: {'yes' if Entrez.email else 'no'}")
print(f"API key present: {'yes' if Entrez.api_key else 'no'}")

DATA_DIR = project_root / "Data"
REFERENCES_CSV = DATA_DIR / "cochrane_pubmed_references.csv"
OUTPUT_CSV = DATA_DIR / "referenced_paper_abstracts.csv"

BATCH_SIZE = 200
SLEEP = 0.35 if Entrez.api_key else 0.9

NCBI_EMAIL present: yes
API key present: yes


In [2]:
# I load the references and extract unique PMIDs that I need to fetch
refs = pd.read_csv(REFERENCES_CSV, dtype={"citing_pmid": str, "ref_pmid": str})
print(f"Total reference edges: {len(refs):,}")

ref_pmids = refs["ref_pmid"].dropna()
ref_pmids = ref_pmids[ref_pmids != ""]
unique_pmids = ref_pmids.unique().tolist()
print(f"Unique referenced PMIDs to fetch: {len(unique_pmids):,}")

Total reference edges: 1,182,678
Unique referenced PMIDs to fetch: 491,531


In [3]:
# I check if I've already fetched some PMIDs (so I can resume if interrupted)
already_fetched = set()
if OUTPUT_CSV.exists():
    existing = pd.read_csv(OUTPUT_CSV, dtype={"pmid": str}, usecols=["pmid"])
    already_fetched = set(existing["pmid"].dropna().unique())
    print(f"Already fetched: {len(already_fetched):,} PMIDs")

pmids_to_fetch = [p for p in unique_pmids if p not in already_fetched]
print(f"Remaining to fetch: {len(pmids_to_fetch):,} PMIDs")

Already fetched: 490,929 PMIDs
Remaining to fetch: 602 PMIDs
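
The resume check above boils down to a set-membership filter. Here is a minimal, self-contained sketch of the same idea using only the standard library; the in-memory CSV and the PMID values are made up for illustration.

```python
import csv
from io import StringIO

# Hypothetical stand-in for an existing output file with two fetched records
existing_csv = StringIO("pmid,title\n111,Paper A\n222,Paper B\n")
already_fetched = {row["pmid"] for row in csv.DictReader(existing_csv) if row["pmid"]}

# Hypothetical list of wanted PMIDs, in their original order
unique_pmids = ["111", "333", "222", "444"]

# Set lookups are O(1) per PMID; the list comprehension preserves the original order
pmids_to_fetch = [p for p in unique_pmids if p not in already_fetched]
print(pmids_to_fetch)  # ['333', '444']
```

Both sides of the comparison must be strings. If one file were read with PMIDs as integers, the membership test would never match and every PMID would be fetched again.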


In [4]:
# I define helper functions to fetch and parse MEDLINE records with retry logic
def efetch_medline_batch(pmids: list[str], max_retries: int = 5) -> str:
    """Fetch MEDLINE records for a batch of PMIDs with exponential backoff.

    Returns an empty string if every attempt fails.
    """
    for attempt in range(max_retries):
        try:
            handle = Entrez.efetch(db="pubmed", id=",".join(pmids), rettype="medline", retmode="text")
            try:
                return handle.read()
            finally:
                handle.close()  # release the connection even if read() fails
        except (HTTPError, URLError, RemoteDisconnected, ConnectionError, TimeoutError) as e:
            if attempt == max_retries - 1:
                print(f"Failed batch after {max_retries} attempts: {e}")
                return ""
            backoff = 2 ** attempt  # 1, 2, 4, 8 seconds between attempts
            print(f"Error: {type(e).__name__}, retrying in {backoff}s...")
            time.sleep(backoff)
    return ""

def parse_medline_records(medline_text: str):
    """Parse MEDLINE text into dictionaries."""
    for record in Medline.parse(StringIO(medline_text)):
        yield {
            "pmid": record.get("PMID", ""),
            "title": record.get("TI", ""),
            "abstract": record.get("AB", ""),
            "journal": record.get("JT", ""),
            "year": record.get("DP", "").split(" ")[0],
            "authors": "; ".join(record.get("AU", [])),
        }
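
The retry pattern in `efetch_medline_batch` can be tried out without touching the network. The sketch below is a generic version under my own assumptions: `retry_with_backoff` and `flaky` are hypothetical names, the sleep function is injectable so the demo runs instantly, and unlike the notebook's helper it re-raises after the last attempt instead of returning an empty string.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry_with_backoff(
    func: Callable[[], T],
    max_retries: int = 5,
    retry_on: tuple[type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), waiting 1, 2, 4, ... seconds between failed attempts."""
    for attempt in range(max_retries):
        try:
            return func()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(2 ** attempt)
    raise ValueError("max_retries must be at least 1")


# Demo with a hypothetical flaky function that fails twice, then succeeds
calls = {"n": 0}
delays: list[float] = []

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"

result = retry_with_backoff(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)  # ok [1, 2]
```

Passing `sleep` as a parameter is a testing convenience only; in real use the default `time.sleep` applies.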

In [6]:
# I fetch all abstracts in batches, saving progress as I go (skip if all PMIDs already fetched)
if len(pmids_to_fetch) == 0:
    print("All PMIDs already fetched - skipping download.")
else:
    total_batches = (len(pmids_to_fetch) + BATCH_SIZE - 1) // BATCH_SIZE
    file_exists = OUTPUT_CSV.exists()
    with OUTPUT_CSV.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["pmid", "title", "abstract", "journal", "year", "authors"])
        if not file_exists:
            writer.writeheader()
        
        fetched_count = 0
        start_time = datetime.now()
        
        for i in range(0, len(pmids_to_fetch), BATCH_SIZE):
            batch = pmids_to_fetch[i:i + BATCH_SIZE]
            batch_num = i // BATCH_SIZE + 1
            
            medline_text = efetch_medline_batch(batch)
            if medline_text:
                for record in parse_medline_records(medline_text):
                    writer.writerow(record)
                    fetched_count += 1
            # Flush after every batch so an interruption loses at most the batch in progress
            f.flush()
            
            if batch_num % 50 == 0 or batch_num == total_batches:
                elapsed = (datetime.now() - start_time).total_seconds()
                rate = fetched_count / elapsed if elapsed > 0 else 0
                remaining = (len(pmids_to_fetch) - fetched_count) / rate / 60 if rate > 0 else 0
                print(f"Batch {batch_num:,}/{total_batches:,} | Fetched: {fetched_count:,} | Rate: {rate:.1f}/sec | ETA: {remaining:.1f} min")
            
            time.sleep(SLEEP)
    
    print("-" * 60)
    print(f"Done! Total fetched this run: {fetched_count:,}")

Batch 4/4 | Fetched: 600 | Rate: 57.9/sec | ETA: 0.0 min
------------------------------------------------------------
Done! Total fetched this run: 600
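
This run requested 602 PMIDs and wrote 600 records, so a couple of PMIDs came back without a record. One way to see which ones is to compare each batch against what was returned. The sketch below assumes a fetch function that yields dictionaries with a `pmid` key; `find_missing` and `fake_fetch` are hypothetical names used only for illustration.

```python
from typing import Callable, Iterable

def find_missing(
    pmids: list[str],
    fetch_batch: Callable[[list[str]], Iterable[dict]],
    batch_size: int = 200,
) -> list[str]:
    """Return requested PMIDs for which fetch_batch yielded no record."""
    missing: list[str] = []
    for i in range(0, len(pmids), batch_size):
        batch = pmids[i:i + batch_size]
        returned = {rec["pmid"] for rec in fetch_batch(batch)}
        missing.extend(p for p in batch if p not in returned)
    return missing


# Hypothetical fetcher that "loses" every PMID ending in 7
def fake_fetch(batch: list[str]) -> list[dict]:
    return [{"pmid": p} for p in batch if not p.endswith("7")]

requested = [str(n) for n in range(1, 21)]
print(find_missing(requested, fake_fetch, batch_size=5))  # ['7', '17']
```

Logging the missing PMIDs would make it possible to tell whether the gap comes from a failed batch or from identifiers that PubMed does not return.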


In [7]:
# I verify the output file and check how many papers have abstracts
if OUTPUT_CSV.exists():
    result = pd.read_csv(OUTPUT_CSV, dtype={"pmid": str})
    print(f"Total records in output file: {len(result):,}")
    print(f"Records with abstracts: {(result['abstract'].notna() & (result['abstract'] != '')).sum():,}")
    print("\nSample:")
    display(result.head())

Total records in output file: 491,529
Records with abstracts: 443,977

Sample:


|   | pmid | title | abstract | journal | year | authors |
|---|---|---|---|---|---|---|
| 0 | 2314794 | The use of modified Martius graft as an adjunc... | The use of the Martius graft, a labial fibro-f... | Obstetrics and gynecology | 1990 | Elkins TE; DeLancey JO; McGuire EJ |
| 1 | 21905761 | Quality of life following successful repair of... | INTRODUCTION: The impact of obstetric vesicova... | Rural and remote health | 2011 | Umoiyoho AJ; Inyang-Etoh EC; Abah GM; Abasiatt... |
| 2 | 32459344 | Association of Low Socioeconomic Status With P... | IMPORTANCE: Individuals with low socioeconomic... | JAMA cardiology | 2020 | Hamad R; Penko J; Kazi DS; Coxson P; Guzman D;... |
| 3 | 10547403 | Long-term benefit of primary angioplasty as co... | BACKGROUND: As compared with thrombolytic ther... | The New England journal of medicine | 1999 | Zijlstra F; Hoorntje JC; de Boer MJ; Reiffers ... |
| 4 | 12241831 | Interventional versus conservative treatment f... | BACKGROUND: Current guidelines suggest that, f... | Lancet (London, England) | 2002 | Fox KA; Poole-Wilson PA; Henderson RA; Clayton... |
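
Because the output file is written in append mode across several runs, one further check that may be worth adding is a scan for duplicate PMIDs alongside the abstract count. This is a sketch on made-up data using only the standard library; it is not part of the notebook's verification, and whether the real file contains any duplicates is not known from the output above.

```python
import csv
from collections import Counter
from io import StringIO

# Hypothetical miniature of the output file: one empty abstract, one duplicate PMID
sample = StringIO(
    "pmid,title,abstract\n"
    "111,Paper A,Some abstract text\n"
    "222,Paper B,\n"
    "333,Paper C,Another abstract\n"
    "111,Paper A,Some abstract text\n"
)
rows = list(csv.DictReader(sample))

# Count rows whose abstract is non-empty after stripping whitespace
with_abstract = sum(1 for r in rows if r["abstract"].strip())

# Any PMID that appears more than once is reported as a duplicate
pmid_counts = Counter(r["pmid"] for r in rows)
duplicates = sorted(p for p, n in pmid_counts.items() if n > 1)

print(f"Records: {len(rows)}, with abstracts: {with_abstract} ({with_abstract / len(rows):.0%})")
print(f"Duplicate PMIDs: {duplicates}")
```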
