# 02: Fetch Cochrane PDFs via Wiley TDM API

## Summary
This notebook downloads Cochrane review PDFs using the Wiley Text and Data Mining (TDM) API. The PDFs contain structured reference sections that properly categorize included vs. excluded studies - information not available in PubMed XML.

**Pipeline Position:** Third notebook - downloads source PDFs for extracting categorized references.

**What this notebook does:**
1. Loads Cochrane review metadata from PubMed (including DOIs)
2. Downloads PDFs for each review via Wiley TDM API
3. Implements smart batching with rate limiting (3/sec and 60/10min)
4. Skips PDFs that already exist on disk
5. Saves PDFs to local storage for text extraction

**Input:** `Data/cochrane_pubmed_abstracts.csv`

**Output:** `Data/cochrane_pdfs/*.pdf` - Downloaded PDF files

**Requirements:**
- Wiley TDM API token in `.env` file (WILEY_TEXT_AND_DATA_MINING_TOKEN)
- Institutional IP access to Cochrane Library content

**Important:** PDFs contain proprietary content and must NOT be uploaded to GitHub. They are excluded via .gitignore.

## Download Strategy
- **Skip existing:** PDFs already on disk are not re-downloaded
- **Rate limiting:** Respects Wiley's 3/sec and 60/10min limits
- **Retry logic:** Automatic exponential backoff on transient failures
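
The two mechanisms above can be illustrated in isolation. The sketch below is not the notebook's implementation: `backoff_delays` and `SlidingWindowLimiter` are hypothetical helpers, and the defaults (60 requests per 600 s, a 0.4 s minimum gap, 30 s initial backoff capped at 600 s) are copied from the configuration used in the download cell. Time is passed in explicitly so the logic can be checked without sleeping.

```python
from collections import deque


def backoff_delays(max_retries=5, initial=30, cap=600):
    """Delay in seconds before each retry: initial * 2**attempt, capped."""
    return [min(initial * 2 ** attempt, cap) for attempt in range(max_retries)]


class SlidingWindowLimiter:
    """Hypothetical helper: at most `limit` requests per `window` seconds,
    with at least `min_gap` seconds between consecutive requests."""

    def __init__(self, limit=60, window=600.0, min_gap=0.4):
        self.limit, self.window, self.min_gap = limit, window, min_gap
        self.sent = deque()  # timestamps of requests still inside the window

    def seconds_to_wait(self, now):
        """How long the caller should sleep before sending at time `now`."""
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()  # forget requests that have left the window
        wait = 0.0
        if len(self.sent) >= self.limit:
            wait = self.window - (now - self.sent[0])  # oldest must expire first
        if self.sent:
            wait = max(wait, self.min_gap - (now - self.sent[-1]))
        return max(wait, 0.0)

    def record(self, now):
        """Register a request sent at time `now`."""
        self.sent.append(now)


limiter = SlidingWindowLimiter()
for i in range(60):                    # 60 requests, 0.4 s apart
    limiter.record(i * 0.4)
print(backoff_delays())                # delays before retries 1..5
print(limiter.seconds_to_wait(24.0))   # window is full, so the wait is long
```

With these defaults the five retry delays are 30, 60, 120, 240 and 480 seconds, so the 600-second cap only takes effect if more retries are allowed.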

In [3]:
# Install required packages for PDF download and processing
%pip install -q wiley-tdm python-dotenv pandas biopython requests tqdm

Note: you may need to restart the kernel to use updated packages.


In [4]:
# Set up environment and load credentials
import os
from pathlib import Path
from dotenv import load_dotenv
import pandas as pd

notebook_dir = Path.cwd()
project_root = notebook_dir if (notebook_dir / ".env").exists() else notebook_dir.parent
env_path = project_root / ".env"
load_dotenv(env_path, override=True)

WILEY_TDM_TOKEN = os.getenv("WILEY_TEXT_AND_DATA_MINING_TOKEN")
os.environ['TDM_API_TOKEN'] = WILEY_TDM_TOKEN or ""

DATA_DIR = project_root / "Data"
PDF_DIR = DATA_DIR / "cochrane_pdfs"
ABSTRACTS_CSV = DATA_DIR / "cochrane_pubmed_abstracts.csv"

print(f"Wiley TDM Token loaded: {'✓' if WILEY_TDM_TOKEN else '✗'}")
print(f"PDF output directory: {PDF_DIR}")

if not WILEY_TDM_TOKEN:
    raise ValueError("WILEY_TEXT_AND_DATA_MINING_TOKEN not set in .env file")

Wiley TDM Token loaded: ✓
PDF output directory: c:\Users\juanx\Documents\LSE-UKHSA Project\Data\cochrane_pdfs


In [5]:
# Load Cochrane reviews and fetch DOIs from PubMed
from Bio import Entrez
import time
import re

Entrez.email = os.getenv("NCBI_EMAIL", "")
Entrez.api_key = os.getenv("NCBI_API_KEY", "")

abstracts = pd.read_csv(ABSTRACTS_CSV, dtype={"pmid": str})
print(f"Loaded {len(abstracts):,} Cochrane reviews")

def get_doi_from_pmid(pmid):
    """Fetch DOI for a PubMed ID."""
    try:
        handle = Entrez.efetch(db="pubmed", id=pmid, rettype="xml", retmode="xml")
        xml = handle.read()
        if isinstance(xml, bytes):
            xml = xml.decode('utf-8')
        handle.close()
        doi_match = re.search(r'<ArticleId IdType="doi">([^<]+)</ArticleId>', xml)
        return doi_match.group(1) if doi_match else None
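    except Exception as e:
        # Network or parsing failure: report it and treat the DOI as unavailable
        print(f"DOI lookup failed for PMID {pmid}: {type(e).__name__}")
        return None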

# Test with a sample
sample_pmid = abstracts['pmid'].iloc[0]
sample_doi = get_doi_from_pmid(sample_pmid)
print(f"Sample: PMID {sample_pmid} -> DOI {sample_doi}")

Loaded 17,328 Cochrane reviews
Sample: PMID 17636697 -> DOI 10.1002/14651858.CD002182
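
The regex in `get_doi_from_pmid` returns the first `ArticleId` of type `doi` anywhere in the response. PubMed records can also carry `ArticleId` elements for cited references, so a path-restricted lookup is one way to make sure the article's own DOI is returned. The sketch below is an alternative, not the notebook's method: `extract_doi` is a hypothetical helper, the XML is a heavily trimmed stand-in for an efetch response, and the cited DOI is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed efetch response: one article with one cited reference.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation><PMID>17636697</PMID></MedlineCitation>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="pubmed">17636697</ArticleId>
        <ArticleId IdType="doi">10.1002/14651858.CD002182</ArticleId>
      </ArticleIdList>
      <ReferenceList>
        <Reference>
          <ArticleIdList>
            <ArticleId IdType="doi">10.1000/cited-paper</ArticleId>
          </ArticleIdList>
        </Reference>
      </ReferenceList>
    </PubmedData>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_doi(xml_text):
    """Return the article's own DOI, ignoring DOIs of cited references."""
    root = ET.fromstring(xml_text)
    # Only match the ArticleIdList that is a direct child of PubmedData.
    node = root.find(
        "./PubmedArticle/PubmedData/ArticleIdList/ArticleId[@IdType='doi']"
    )
    return node.text.strip() if node is not None and node.text else None


print(extract_doi(SAMPLE_XML))
```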


In [6]:
# Initialize Wiley TDM client and test download
from wiley_tdm import TDMClient

PDF_DIR.mkdir(parents=True, exist_ok=True)
tdm = TDMClient(download_dir=str(PDF_DIR))

# Test with a single download
if sample_doi:
    print(f"Testing download for DOI: {sample_doi}")
    result = tdm.download_pdf(sample_doi)
    print(f"Result: {result}")
else:
    print("Could not get DOI for test download")

Testing download for DOI: 10.1002/14651858.CD002182
Result: Existing File


In [7]:
# Keep every review that has a valid DOI (all years, all review types; no other filtering)
valid_dois = abstracts[abstracts['doi'].notna() & (abstracts['doi'] != '')].copy()
valid_dois['year'] = pd.to_numeric(valid_dois['year'], errors='coerce')

# DEDUPLICATE by DOI - same DOI appears multiple times in PubMed data
# Keep the most recent version of each review
valid_dois = valid_dois.sort_values('year', ascending=False).drop_duplicates(subset=['doi'], keep='first')

print(f"Total UNIQUE Cochrane entries with valid DOIs: {len(valid_dois):,}")
print(f"Year range: {valid_dois['year'].min():.0f} - {valid_dois['year'].max():.0f}")

# Helper to convert DOI to safe filename
def doi_to_filename(doi):
    return doi.replace("/", "-").replace(":", "_")

valid_dois['filename'] = valid_dois['doi'].apply(doi_to_filename)

# Check which PDFs already exist on disk - skip these
existing_pdfs = set(f.stem for f in PDF_DIR.glob("*.pdf"))
print(f"PDFs already downloaded: {len(existing_pdfs):,}")

# Determine what needs downloading (not already on disk)
to_download = valid_dois[~valid_dois['filename'].isin(existing_pdfs)].copy()

# Sort by year (newest first - often more available)
to_download = to_download.sort_values('year', ascending=False)
print(f"\nRemaining to download: {len(to_download):,}")
print(f"Year distribution of remaining:")
print(to_download.groupby('year').size().sort_index(ascending=False).head(10))

Total UNIQUE Cochrane entries with valid DOIs: 16,646
Year range: 1996 - 2026
PDFs already downloaded: 16,618

Remaining to download: 28
Year distribution of remaining:
year
2019    1
2011    6
2010    4
2009    1
2003    3
2002    2
2001    3
2000    8
dtype: int64
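
The skip-existing check depends on the DOI-to-filename mapping matching the file stems on disk. The sketch below reproduces that mapping in a temporary directory; `missing_dois` is a hypothetical helper and both DOIs are illustrative. `Path.stem` strips only the final `.pdf` suffix, so dots inside a DOI (such as `.pub2`) are kept.

```python
import tempfile
from pathlib import Path


def doi_to_filename(doi):
    """Same mapping as the notebook: '/' -> '-', ':' -> '_'."""
    return doi.replace("/", "-").replace(":", "_")


def missing_dois(dois, pdf_dir):
    """Return the DOIs whose mapped filename has no matching .pdf in pdf_dir."""
    on_disk = {p.stem for p in Path(pdf_dir).glob("*.pdf")}
    return [d for d in dois if doi_to_filename(d) not in on_disk]


with tempfile.TemporaryDirectory() as tmp:
    # Pretend one review was downloaded earlier (file content is irrelevant here).
    (Path(tmp) / "10.1002-14651858.CD002182.pub2.pdf").write_bytes(b"%PDF-")
    wanted = ["10.1002/14651858.CD002182.pub2", "10.1002/14651858.CD000001"]
    todo = missing_dois(wanted, tmp)
    print(todo)
```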


In [8]:
# Resumable PDF downloader with ROBUST rate limiting and auto-retry
import requests
from tqdm.notebook import tqdm
from datetime import datetime
from collections import deque

# =============================================================================
# WILEY API RATE LIMITS (confirmed by Wiley support):
#   1) Up to 3 articles per second
#   2) Up to 60 requests per 10 minutes
#
# Strategy: Download quickly (respecting 3/sec), then pause when hitting 60/10min
# Expected rate: ~360 PDFs/hour (60 every 10 minutes)
# =============================================================================

# Configuration
REQUESTS_PER_10_MIN = 60  # Wiley's limit
WINDOW_SECONDS = 600  # 10 minutes in seconds
MIN_DELAY_BETWEEN_REQUESTS = 0.4  # Slightly under 3/sec to be safe
MAX_RETRIES = 5  # Retries for connection issues
INITIAL_BACKOFF = 30  # Initial wait on failure (seconds)
MAX_BACKOFF = 600  # Maximum wait between retries (10 minutes)

# Track request timestamps for rate limiting
request_timestamps = deque(maxlen=REQUESTS_PER_10_MIN)

def doi_to_safe_filename(doi):
    """Convert DOI to safe filename - MUST match the check in cell above."""
    return doi.replace("/", "-").replace(":", "_")

def wait_for_rate_limit():
    """Wait if we've hit the 60 requests / 10 minutes limit."""
    now = time.time()
    
    # Clean old timestamps (older than 10 minutes)
    while request_timestamps and (now - request_timestamps[0]) > WINDOW_SECONDS:
        request_timestamps.popleft()
    
    # If we've made 60 requests in the last 10 minutes, wait for oldest to expire
    if len(request_timestamps) >= REQUESTS_PER_10_MIN:
        oldest = request_timestamps[0]
        wait_time = WINDOW_SECONDS - (now - oldest) + 2  # Add 2s buffer
        if wait_time > 0:
            print(f"\n⏳ Hit 60/10min limit - waiting {wait_time:.0f}s for window to slide...")
            time.sleep(wait_time)
    
    # Maintain minimum delay between requests (3/sec limit)
    if request_timestamps:
        time_since_last = time.time() - request_timestamps[-1]
        if time_since_last < MIN_DELAY_BETWEEN_REQUESTS:
            time.sleep(MIN_DELAY_BETWEEN_REQUESTS - time_since_last)

def download_pdf_with_retry(doi, token, output_dir, max_retries=MAX_RETRIES):
    """Download PDF with exponential backoff retry logic."""
    url = f"https://api.wiley.com/onlinelibrary/tdm/v1/articles/{doi}"
    headers = {"Wiley-TDM-Client-Token": token}
    
    for attempt in range(max_retries + 1):
        # Respect rate limit before making request
        wait_for_rate_limit()
        
        try:
            # Record this request timestamp
            request_timestamps.append(time.time())
            
            response = requests.get(url, headers=headers, timeout=60)
            
            if response.status_code == 200:
                content_type = response.headers.get('Content-Type', '')
                if 'application/pdf' in content_type and len(response.content) > 1000:
                    filename = doi_to_safe_filename(doi) + ".pdf"
                    filepath = output_dir / filename
                    with open(filepath, 'wb') as f:
                        f.write(response.content)
                    return "success", len(response.content)
                elif 'application/json' in content_type:
                    try:
                        error_data = response.json()
                        if 'quota' in str(error_data).lower():
                            return "quota_exceeded", 0
                    except ValueError:  # body was not valid JSON
                        pass
                    return "api_error", 0
                else:
                    return "empty_response", 0
                    
            elif response.status_code == 500:
                # Check if it's a quota violation
                try:
                    if 'application/json' in response.headers.get('Content-Type', ''):
                        error_data = response.json()
                        if 'quota' in str(error_data).lower() or 'ratelimit' in str(error_data).lower():
                            return "quota_exceeded", 0
                except ValueError:  # body was not valid JSON
                    pass
                # HTTP 500 might be temporary - retry with backoff
                if attempt < max_retries:
                    backoff = min(INITIAL_BACKOFF * (2 ** attempt), MAX_BACKOFF)
                    print(f"\n⚠️  HTTP 500 for {doi[:30]}... - retrying in {backoff}s (attempt {attempt+1}/{max_retries})")
                    time.sleep(backoff)
                    continue
                return "http_500", 0
                
            elif response.status_code in [403, 404]:
                # Permanent failures - don't retry
                return f"http_{response.status_code}", 0
                
            elif response.status_code in [429, 503]:
                # Rate limited or service unavailable - wait and retry
                if attempt < max_retries:
                    backoff = min(INITIAL_BACKOFF * (2 ** attempt), MAX_BACKOFF)
                    print(f"\n⚠️  HTTP {response.status_code} - server busy, waiting {backoff}s...")
                    time.sleep(backoff)
                    continue
                return f"http_{response.status_code}", 0
            else:
                return f"http_{response.status_code}", 0
                
        except requests.Timeout:
            if attempt < max_retries:
                backoff = min(INITIAL_BACKOFF * (2 ** attempt), MAX_BACKOFF)
                print(f"\n⚠️  Timeout for {doi[:30]}... - retrying in {backoff}s")
                time.sleep(backoff)
                continue
            return "timeout", 0
            
        except requests.ConnectionError:
            if attempt < max_retries:
                backoff = min(INITIAL_BACKOFF * (2 ** attempt), MAX_BACKOFF)
                print(f"\n🔌 Connection error - retrying in {backoff}s (attempt {attempt+1}/{max_retries})")
                time.sleep(backoff)
                continue
            return "connection_error", 0
            
        except Exception as e:
            if attempt < max_retries:
                backoff = min(INITIAL_BACKOFF * (2 ** attempt), MAX_BACKOFF)
                print(f"\n⚠️  Error {type(e).__name__} - retrying in {backoff}s")
                time.sleep(backoff)
                continue
            return f"error_{type(e).__name__}", 0
    
    return "max_retries_exceeded", 0

# =============================================================================
# MAIN DOWNLOAD LOOP - Runs until complete, handles all errors gracefully
# =============================================================================

print(f"🚀 Starting ROBUST download of {len(to_download):,} PDFs")
print(f"   Rate limits: 3/sec AND 60/10min → ~360 PDFs/hour max")
print(f"   Retries: {MAX_RETRIES} attempts with exponential backoff")
print(f"   Estimated time: ~{len(to_download)/360:.1f} hours")
print("-" * 60)

success_count = 0
fail_count = 0
total_bytes = 0
start_time = datetime.now()
rate_limit_waits = 0

i = 0
pbar = tqdm(total=len(to_download), desc="Downloading", unit="pdf")

while i < len(to_download):
    row = to_download.iloc[i]
    doi = row['doi']
    
    status, file_size = download_pdf_with_retry(doi, WILEY_TDM_TOKEN, PDF_DIR)
    
    if status == "success":
        success_count += 1
        total_bytes += file_size
        pbar.update(1)
        i += 1
        
    elif status == "quota_exceeded":
        # Unexpected quota hit - wait 10 minutes and retry
        rate_limit_waits += 1
        print(f"\n⏰ QUOTA EXCEEDED at {datetime.now().strftime('%H:%M:%S')}")
        print(f"   Progress so far: {success_count} downloaded, {fail_count} failed")
        print(f"   Waiting 10 minutes for quota reset...")
        
        # Clear our tracking (server's window may differ)
        request_timestamps.clear()
        
        # Wait with countdown
        for remaining in range(600, 0, -60):
            print(f"   ⏳ {remaining//60} minutes remaining...", end='\r')
            time.sleep(min(60, remaining))
        print(f"\n✓ Resuming downloads...")
        # Don't increment i - retry the same DOI
        continue
        
    else:
        # Non-retryable failure (403, 404, retries exhausted, etc.) - count it and move on
        fail_count += 1
        pbar.update(1)
        i += 1
    
    # Print progress periodically
    if i % 50 == 0 and i > 0:
        elapsed = (datetime.now() - start_time).total_seconds() / 3600
        rate = success_count / elapsed if elapsed > 0 else 0
        remaining_time = (len(to_download) - i) / rate if rate > 0 else 0
        print(f"\n   📊 {success_count} done, {fail_count} failed | {rate:.0f}/hr | ~{remaining_time:.1f}hr remaining")

pbar.close()

# =============================================================================
# SESSION SUMMARY
# =============================================================================
elapsed_hours = (datetime.now() - start_time).total_seconds() / 3600
print("\n" + "=" * 60)
print("✅ DOWNLOAD SESSION COMPLETE")
print("=" * 60)
print(f"Duration: {elapsed_hours:.1f} hours")
print(f"Downloaded: {success_count:,} PDFs ({total_bytes/1024/1024:.1f} MB)")
print(f"Failed (permanent): {fail_count:,}")
print(f"Rate limit pauses: {rate_limit_waits}")
print(f"Actual rate: {success_count/elapsed_hours:.0f} PDFs/hour" if elapsed_hours > 0 else "")
print(f"\n📁 PDFs saved to: {PDF_DIR}")

🚀 Starting ROBUST download of 28 PDFs
   Rate limits: 3/sec AND 60/10min → ~360 PDFs/hour max
   Retries: 5 attempts with exponential backoff
   Estimated time: ~0.1 hours
------------------------------------------------------------


Downloading:   0%|          | 0/28 [00:00<?, ?pdf/s]


✅ DOWNLOAD SESSION COMPLETE
Duration: 0.0 hours
Downloaded: 0 PDFs (0.0 MB)
Failed (permanent): 28
Rate limit pauses: 0
Actual rate: 0 PDFs/hour

📁 PDFs saved to: c:\Users\juanx\Documents\LSE-UKHSA Project\Data\cochrane_pdfs
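
The session summary reports how many downloads failed but not why. A per-status tally and a small log file would make runs like this one easier to diagnose. The sketch below is an assumption about how that could look, not part of the notebook: `summarize_failures` and `write_failure_log` are hypothetical helpers, the status strings mirror those returned by `download_pdf_with_retry`, and the DOIs and outcomes are made up.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path


def summarize_failures(results):
    """Count non-success statuses from (doi, status) pairs."""
    return Counter(status for _, status in results if status != "success")


def write_failure_log(results, path):
    """Write failed DOIs and their status to a CSV for later inspection."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["doi", "status"])
        writer.writerows((d, s) for d, s in results if s != "success")


# Hypothetical results collected inside the download loop.
results = [
    ("10.1002/example.A", "success"),
    ("10.1002/example.B", "http_404"),
    ("10.1002/example.C", "http_404"),
    ("10.1002/example.D", "http_403"),
]
counts = summarize_failures(results)
print(counts.most_common())

log_path = Path(tempfile.gettempdir()) / "failed_downloads_example.csv"
write_failure_log(results, log_path)
```

In the loop itself, this would amount to appending `(doi, status)` after each call to `download_pdf_with_retry` and summarising once the loop ends.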


In [9]:
# Overall download progress summary
pdf_files = list(PDF_DIR.glob("*.pdf"))
total_size_mb = sum(p.stat().st_size for p in pdf_files) / (1024 * 1024)

print("=" * 60)
print("OVERALL DOWNLOAD PROGRESS")
print("=" * 60)
print(f"Total Cochrane reviews with DOIs: {len(valid_dois):,}")
print(f"PDFs on disk: {len(pdf_files):,}")
print(f"Storage used: {total_size_mb:.1f} MB ({total_size_mb/1024:.2f} GB)")
print(f"Average PDF size: {total_size_mb/len(pdf_files)*1024:.0f} KB" if pdf_files else "")
print(f"\nProgress: {len(pdf_files)/len(valid_dois)*100:.1f}% complete")

remaining = len(valid_dois) - len(pdf_files)
print(f"Remaining downloadable: ~{remaining:,}")

print(f"\n📋 Next steps:")
if len(pdf_files) >= 100:
    print(f"   ✓ Enough PDFs to validate pipeline - proceed to notebook 03")
if remaining > 0:
    print(f"   → Re-run download cell after quota reset to continue")
else:
    print(f"   ✓ All available PDFs downloaded!")

OVERALL DOWNLOAD PROGRESS
Total Cochrane reviews with DOIs: 16,646
PDFs on disk: 16,618
Storage used: 12913.5 MB (12.61 GB)
Average PDF size: 796 KB

Progress: 99.8% complete
Remaining downloadable: ~28

📋 Next steps:
   ✓ Enough PDFs to validate pipeline - proceed to notebook 03
   → Re-run download cell after quota reset to continue
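
Counting files on disk does not confirm that each one is a usable PDF. Before moving on to text extraction, a quick header check can flag files that are too small or that do not start with the `%PDF-` signature. The sketch below is a minimal example under that assumption: `looks_like_pdf` and `suspicious_pdfs` are hypothetical helpers, and the 1000-byte threshold simply mirrors the size check in the download cell.

```python
import tempfile
from pathlib import Path


def looks_like_pdf(path, min_bytes=1000):
    """True if the file is larger than min_bytes and starts with the PDF signature."""
    p = Path(path)
    if p.stat().st_size <= min_bytes:
        return False
    with open(p, "rb") as f:
        return f.read(5) == b"%PDF-"


def suspicious_pdfs(pdf_dir):
    """Sorted names of .pdf files in pdf_dir that fail the header/size check."""
    return sorted(p.name for p in Path(pdf_dir).glob("*.pdf") if not looks_like_pdf(p))


with tempfile.TemporaryDirectory() as tmp:
    # Three synthetic files: one plausible PDF, one HTML error page, one truncated file.
    (Path(tmp) / "good.pdf").write_bytes(b"%PDF-1.7\n" + b"0" * 2000)
    (Path(tmp) / "html_error.pdf").write_bytes(b"<html>" + b"x" * 2000)
    (Path(tmp) / "truncated.pdf").write_bytes(b"%PDF-1.7\n")
    flagged = suspicious_pdfs(tmp)
    print(flagged)
```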
