# Data Sources and Compliance Analysis

**Summary:** This notebook documents the data sources used in the LSE-UKHSA Systematic Review Screening Project and analyzes compliance with the relevant terms of use from NCBI/NLM.

---

## Table of Contents

1. [Data Sources Overview](#1-data-sources-overview)
2. [NCBI E-utilities Usage Guidelines](#2-ncbi-e-utilities-usage-guidelines)
3. [NLM Data Download Terms](#3-nlm-data-download-terms)
4. [Copyright Considerations for Abstracts](#4-copyright-considerations-for-abstracts)
5. [Compliance Analysis](#5-compliance-analysis)
6. [Recommendations](#6-recommendations)
7. [Acknowledgment Statement](#7-acknowledgment-statement)

---

## 1. Data Sources Overview

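**All bibliographic data in this project comes from a single source: PubMed via the NCBI Entrez API.** The only other content is the LLM responses, which are generated locally with Ollama.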

| Notebook | Data Fetched | Source | Records |
|----------|--------------|--------|--------|
| `00_obtain_cochrane_abstracts.ipynb` | Cochrane review abstracts | PubMed (NCBI Entrez) | ~17,000 reviews |
| `00_obtain_cochrane_abstracts.ipynb` | Reference lists | PubMed (NCBI Entrez) | ~1.2M reference edges |
| `02_fetch_referenced_abstracts.ipynb` | Cited paper abstracts | PubMed (NCBI Entrez) | ~491,000 papers |
| `03_build_ground_truth.ipynb` | Derived dataset | No new downloads | 1,000 samples |
| `04_llm_evaluation.ipynb` | LLM responses | Local Ollama | — |

### Technology Stack

- **API:** NCBI Entrez E-utilities
- **Python Library:** BioPython (`Bio.Entrez`, `Bio.Medline`)
- **Identification:** Email (required) + API key (optional, raises the rate limit)
- **Data Formats:** MEDLINE text format (abstracts), XML (references)

---

## 2. NCBI E-utilities Usage Guidelines

NCBI publishes its usage guidelines for the E-utilities API at:
- https://www.ncbi.nlm.nih.gov/books/NBK25497/

### Rate Limits

| Condition | Maximum Requests |
|-----------|------------------|
| Without API key | 3 requests per second |
| With API key | 10 requests per second |
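
As a minimal sketch, the limits above can be turned into a delay between sequential requests. The names `min_interval` and `Throttle` are hypothetical helpers written for this illustration; they are not part of the project notebooks or of BioPython. The only values taken from NCBI are the 3 and 10 requests-per-second limits.

```python
import time

# Requests-per-second limits from the table above.
LIMIT_WITHOUT_KEY = 3
LIMIT_WITH_KEY = 10


def min_interval(has_api_key: bool) -> float:
    """Smallest delay (seconds) between sequential requests that stays within the limit."""
    limit = LIMIT_WITH_KEY if has_api_key else LIMIT_WITHOUT_KEY
    return 1.0 / limit


class Throttle:
    """Hypothetical helper: blocks so successive calls are at least `interval` seconds apart."""

    def __init__(self, interval: float) -> None:
        self.interval = interval
        self._last = None  # monotonic time of the previous call, if any

    def wait(self) -> None:
        now = time.monotonic()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


throttle = Throttle(min_interval(has_api_key=False))
for _ in range(3):
    throttle.wait()  # a real script would issue one E-utilities request here
```

A fixed sleep longer than `min_interval(...)` achieves the same effect with less code, at the cost of a slower run.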

### Required Parameters

| Parameter | Purpose | Required? |
|-----------|---------|----------|
| `email` | Contact information for NCBI | **Yes** |
| `api_key` | Authentication for higher rate limits | Recommended |
| `tool` | Identifier for your software | Recommended |
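
To make the role of these parameters concrete, the sketch below builds an EFetch request URL by hand. The project itself relies on BioPython to do this, so `build_efetch_url` is a hypothetical function for illustration only, and the email address and PMIDs are placeholders.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pmids, email, tool, api_key=None):
    """Return an EFetch URL carrying the identification parameters from the table above."""
    params = {
        "db": "pubmed",
        "id": ",".join(str(p) for p in pmids),
        "rettype": "medline",  # MEDLINE text format, as used for abstracts
        "retmode": "text",
        "email": email,
        "tool": tool,
    }
    if api_key:  # only sent when a key is available
        params["api_key"] = api_key
    return f"{EFETCH_URL}?{urlencode(params)}"


url = build_efetch_url(
    [12345678, 23456789],            # placeholder PMIDs
    email="researcher@example.org",  # placeholder address
    tool="lse_ukhsa_sr_screening",
)
print(url)
```

No request is sent here; the function only assembles the query string so the parameters can be inspected.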

### Timing Recommendations

For large jobs, NCBI recommends:
- Run on **weekends**, or
- Run between **9:00 PM and 5:00 AM Eastern Time** on weekdays
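
The timing rule above can be checked programmatically before launching a large job. This is a sketch under the assumption that the caller supplies a time already expressed in US Eastern Time; `is_off_peak` is a hypothetical helper, not something the project notebooks define.

```python
from datetime import datetime


def is_off_peak(eastern_time: datetime) -> bool:
    """True on weekends, or between 9:00 PM and 5:00 AM on weekdays (Eastern Time)."""
    if eastern_time.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return True
    return eastern_time.hour >= 21 or eastern_time.hour < 5


print(is_off_peak(datetime(2026, 1, 5, 12, 0)))   # a Monday at noon
print(is_off_peak(datetime(2026, 1, 5, 22, 30)))  # a Monday at 10:30 PM
print(is_off_peak(datetime(2026, 1, 10, 12, 0)))  # a Saturday at noon
```

On Python 3.9 or later, the current Eastern time can be obtained with `datetime.now(ZoneInfo("America/New_York"))` from the standard `zoneinfo` module, provided time zone data is available on the system.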

### Consequences of Non-Compliance

> "Failure to comply with this policy may result in an IP address being blocked from accessing NCBI."

If blocked, developers must register their `tool` and `email` values with NCBI to restore access.

---

## 3. NLM Data Download Terms

The NLM provides terms and conditions for downloading data at:
- https://www.nlm.nih.gov/databases/download.html

### Key Terms

Users of NLM data agree to:

1. **Acknowledge NLM as the source** of the data in a clear and conspicuous manner

2. **Not indicate or imply** that NLM has endorsed their products/services/applications

3. **Maintain current data** if redistributing, or clearly indicate that data may not reflect the most current version

4. **Hold NLM harmless** from any liability resulting from errors in the data
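
One way to carry terms 1 to 3 alongside a redistributed dataset is a small provenance record. The sketch below is an assumption about how that could look: the function name, the field names, and the retrieval date are all hypothetical and are not taken from the project.

```python
import json
from datetime import date


def build_manifest(retrieved_on: date) -> dict:
    """Hypothetical provenance record covering source, currency, and non-endorsement."""
    return {
        "source": "PubMed, National Library of Medicine (NLM)",
        "retrieved_on": retrieved_on.isoformat(),
        "currency_note": "Data may not reflect the most current version available from NLM.",
        "endorsement_note": "NLM has not endorsed this project or its outputs.",
    }


# Placeholder date, not the project's actual retrieval date.
manifest = build_manifest(date(2026, 1, 15))
print(json.dumps(manifest, indent=2))
```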

### Disclaimer

> "These data are produced with a reasonable standard of care, but NLM makes no warranties express or implied, including no warranty of merchantability or fitness for particular purpose, regarding the accuracy or completeness of the data."

---

## 4. Copyright Considerations for Abstracts

### Critical Point: Abstracts May Be Copyrighted

From the NCBI E-utilities documentation:

> "**NLM does not claim the copyright on the abstracts in PubMed; however, journal publishers or authors may.** NLM provides no legal advice concerning distribution of copyrighted materials."

From the NLM download page:

> "The data...include works of the United States Government that are not protected by U.S. copyright law but **may be protected by non-US copyright law, as well as abstracts originating from publications that may be protected by U.S. or non-US copyright law.**"

### What This Means

| Data Type | Copyright Status |
|-----------|------------------|
| PMIDs, metadata fields | Generally public domain |
| Abstract text | **May be copyrighted** by publishers/authors |
| Reference lists | Generally public domain |

### Fair Use Considerations

Using abstracts for **text mining and research** is often treated as fair use under U.S. law, but:

- **Republishing or redistributing** full abstract text may require publisher permission
- **Commercial use** of abstract text may require licensing agreements
- Fair use is a U.S. doctrine; the exceptions available in other jurisdictions differ in scope

### NCBI's Position

> "NLM does not provide legal advice concerning distribution of copyrighted materials. Please consult your legal counsel."

---

## 5. Compliance Analysis

### How This Project Handles Each Requirement

In [1]:
import pandas as pd

# Create compliance checklist
compliance_data = [
    {
        "Requirement": "Provide email to NCBI",
        "Status": "✅ Compliant",
        "Implementation": "NCBI_EMAIL loaded from .env file and passed to Entrez.email"
    },
    {
        "Requirement": "Respect rate limits (3/sec without key, 10/sec with key)",
        "Status": "✅ Compliant",
        "Implementation": "SLEEP=0.9s without API key, SLEEP=0.35s with API key"
    },
    {
        "Requirement": "Use E-utility URLs (not web interface)",
        "Status": "✅ Compliant",
        "Implementation": "BioPython Entrez module routes to eutils.ncbi.nlm.nih.gov"
    },
    {
        "Requirement": "Acknowledge NLM as data source",
        "Status": "⚠️ Should Add",
        "Implementation": "Add acknowledgment to documentation and any publications"
    },
    {
        "Requirement": "Do not imply NLM endorsement",
        "Status": "✅ Compliant",
        "Implementation": "No endorsement claims made in project materials"
    },
    {
        "Requirement": "Register tool name with NCBI",
        "Status": "⚠️ Recommended",
        "Implementation": "Consider adding Entrez.tool parameter for identification"
    },
    {
        "Requirement": "Respect abstract copyright",
        "Status": "⚠️ Consider for redistribution",
        "Implementation": "Research use likely OK; public redistribution needs review"
    },
]

df = pd.DataFrame(compliance_data)
print("NCBI/NLM Compliance Checklist")
print("=" * 80)
display(df)

NCBI/NLM Compliance Checklist


| | Requirement | Status | Implementation |
|---|-------------|--------|----------------|
| 0 | Provide email to NCBI | ✅ Compliant | NCBI_EMAIL loaded from .env file and passed to... |
| 1 | Respect rate limits (3/sec without key, 10/sec... | ✅ Compliant | SLEEP=0.9s without API key, SLEEP=0.35s with A... |
| 2 | Use E-utility URLs (not web interface) | ✅ Compliant | BioPython Entrez module routes to eutils.ncbi.... |
| 3 | Acknowledge NLM as data source | ⚠️ Should Add | Add acknowledgment to documentation and any pu... |
| 4 | Do not imply NLM endorsement | ✅ Compliant | No endorsement claims made in project materials |
| 5 | Register tool name with NCBI | ⚠️ Recommended | Consider adding Entrez.tool parameter for iden... |
| 6 | Respect abstract copyright | ⚠️ Consider for redistribution | Research use likely OK; public redistribution ... |


### Code Review: Rate Limit Implementation

Both data-fetching notebooks pause for `SLEEP` seconds between requests, which keeps the request rate below the NCBI limits:

In [2]:
# From 00_obtain_cochrane_abstracts.ipynb:
# ----------------------------------------
# BATCH_SIZE = 50
# SLEEP = 0.9  # Complies with 3 requests/sec limit (without API key)

# From 02_fetch_referenced_abstracts.ipynb:
# -----------------------------------------
# BATCH_SIZE = 200
# SLEEP = 0.35 if Entrez.api_key else 0.9  # Adjusts based on API key presence

# Verification:
print("Rate Limit Verification")
print("=" * 40)
print(f"Without API key: 1 request / 0.9s = {1/0.9:.2f} requests/sec (limit: 3/sec) ✅")
print(f"With API key: 1 request / 0.35s = {1/0.35:.2f} requests/sec (limit: 10/sec) ✅")

Rate Limit Verification
Without API key: 1 request / 0.9s = 1.11 requests/sec (limit: 3/sec) ✅
With API key: 1 request / 0.35s = 2.86 requests/sec (limit: 10/sec) ✅
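
The same settings also give a lower bound on how long each fetch takes. The sketch below assumes one request per batch and uses the approximate record counts from Section 1; `min_sleep_seconds` is a hypothetical helper, and the result counts only the time spent sleeping, not network or processing time.

```python
import math


def min_sleep_seconds(n_records: int, batch_size: int, sleep: float) -> float:
    """Total time spent sleeping, assuming one request per batch."""
    n_requests = math.ceil(n_records / batch_size)
    return n_requests * sleep


# Approximate record counts from the Data Sources Overview table.
reviews = min_sleep_seconds(17_000, batch_size=50, sleep=0.9)
cited_with_key = min_sleep_seconds(491_000, batch_size=200, sleep=0.35)
cited_without_key = min_sleep_seconds(491_000, batch_size=200, sleep=0.9)

print(f"Cochrane reviews:            {reviews / 60:.1f} min")
print(f"Cited papers (with key):     {cited_with_key / 60:.1f} min")
print(f"Cited papers (without key):  {cited_without_key / 60:.1f} min")
```

Even without an API key, the sleep overhead for the largest fetch is well under an hour, so there is little pressure to run closer to the limits.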


---

## 6. Recommendations

### 6.1 Add NLM Acknowledgment

Include the following statement in project documentation and any publications:

> **Data Acknowledgment:** Bibliographic data used in this project were obtained from PubMed®, a database of the National Library of Medicine, National Institutes of Health.

### 6.2 Register Tool Name (Optional but Recommended)

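Add a tool identifier to Entrez calls so NCBI can attribute requests to this project (BioPython otherwise sends its default value, `biopython`). Setting `Entrez.tool` only identifies the software; registering the `tool` and `email` values with NCBI is a separate step: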

In [3]:
# Recommended addition to data-fetching notebooks:
from Bio import Entrez

# Add these lines after setting Entrez.email:
Entrez.tool = "lse_ukhsa_sr_screening"  # Identifies this project to NCBI

print(f"Tool identifier: {Entrez.tool}")
print("This helps NCBI contact you if there are issues with your requests.")

Tool identifier: lse_ukhsa_sr_screening
This helps NCBI contact you if there are issues with your requests.


### 6.3 Consider Copyright for Data Sharing

If sharing datasets publicly (e.g., on GitHub, Zenodo), consider:

| Option | Approach | Pros | Cons |
|--------|----------|------|------|
| **A** | Share PMIDs only | No copyright concerns | Users must fetch abstracts themselves |
| **B** | Share with disclaimer | Convenient for users | Potential copyright issues |
| **C** | Request permission | Legally safest | Time-consuming |

**Suggested disclaimer for Option B:**

> This dataset contains abstracts retrieved from PubMed. Abstract text may be copyrighted by the original publishers or authors. This dataset is provided for research purposes only. Users are responsible for ensuring their use complies with applicable copyright laws.
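
Option A can be sketched as a filter that drops everything except the PMID before export. The record layout (`pmid`, `title`, `abstract` keys) and the function name `strip_abstracts` are assumptions made for this illustration; the project's actual data structures may differ.

```python
import csv
import io


def strip_abstracts(records, keep=("pmid",)):
    """Return copies of the records containing only the fields listed in `keep`."""
    return [{k: r[k] for k in keep if k in r} for r in records]


# Placeholder records, not real PubMed entries.
records = [
    {"pmid": "00000001", "title": "Example title A", "abstract": "Example abstract text."},
    {"pmid": "00000002", "title": "Example title B", "abstract": "More example text."},
]

shareable = strip_abstracts(records)

# Write to an in-memory buffer here; a real export would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["pmid"])
writer.writeheader()
writer.writerows(shareable)
print(buffer.getvalue())
```

Recipients can then re-fetch the abstracts from PubMed themselves using the shared PMIDs.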

### 6.4 Document Use Case

The project use case (academic research evaluating LLM screening performance) is:
- ✅ Non-commercial
- ✅ For research/educational purposes
- ✅ Transformative use (evaluating AI models rather than republishing abstracts)
- ✅ Consistent with NCBI's intended purpose for the E-utilities

---

## 7. Acknowledgment Statement

The following acknowledgment should be included in any publications or public releases from this project:

---

### Data Sources and Acknowledgments

Bibliographic data used in this project were retrieved from **PubMed®**, a free resource developed and maintained by the **National Center for Biotechnology Information (NCBI)** at the **National Library of Medicine (NLM)**, National Institutes of Health.

Cochrane systematic reviews were identified using PubMed's search functionality and retrieved via the NCBI Entrez E-utilities API.

**Disclaimer:** The abstract text contained in this project's datasets may be protected by copyright held by the original publishers or authors. NLM does not claim copyright on PubMed abstracts. This project uses these materials for non-commercial research purposes.

**References:**
- NCBI E-utilities: https://www.ncbi.nlm.nih.gov/books/NBK25501/
- NLM Data Download Terms: https://www.nlm.nih.gov/databases/download.html
- NCBI Policies: https://www.ncbi.nlm.nih.gov/home/about/policies/

---

## Summary

| Aspect | Finding |
|--------|--------|
| **Data Source** | All data from PubMed via NCBI Entrez API |
| **Technical Compliance** | ✅ Rate limits, email, and API usage all compliant |
| **Attribution** | ⚠️ Should add NLM acknowledgment to documentation |
| **Copyright** | ⚠️ Abstracts may be copyrighted; research use likely OK |
| **Redistribution** | ⚠️ Consider implications before public sharing of full abstracts |
| **Overall Assessment** | Project follows NCBI guidelines correctly; minor documentation additions recommended |

---

*Notebook created: January 2026*  
*Last reviewed: January 2026*