# Test Publication Downloading with pubmed-stream

This notebook demonstrates how to use the `pubmed-stream` library to search and download PubMed Central full-text articles.

## 1. Import the Library

First, let's import the main functions from pubmed-stream.

## Important: How PMC Search Works

**The library searches PubMed Central (PMC) directly** to ensure all results have full-text available.

**Search behavior:**
- Generic terms such as "microbiome", "CRISPR", and "COVID-19" return results
- Multi-term queries narrow the results: "gut microbiome inflammation aging"
- PMC has 10M+ full-text articles covering most active research topics

**Best practices for successful downloads:**
- ✅ Most biomedical topics work: "microbiome", "COVID-19", "CRISPR", "cancer genomics"
- ✅ Combine terms for more specific results: "gut microbiome inflammation"
- ✅ PMC emphasizes open access literature and recent research
- ⚠️ Very specialized or mostly paywalled topics may have limited PMC coverage (queries still work when matching articles exist in PMC)
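
To make "searches PMC directly" concrete, here is a minimal sketch of the kind of NCBI ESearch request involved. The parameter set mirrors the request URLs that appear in the retry messages later in this notebook; the helper name `build_pmc_search_url` is hypothetical and is not part of `pubmed-stream`.

```python
from typing import Optional
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pmc_search_url(term: str, max_results: int,
                         api_key: Optional[str] = None) -> str:
    """Build an ESearch URL that queries PMC (hypothetical helper)."""
    params = {
        "db": "pmc",          # search PMC rather than PubMed
        "term": term,
        "retmax": max_results,
        "retmode": "json",
        "retstart": 0,
    }
    if api_key:
        params["api_key"] = api_key  # optional E-utilities parameter
    return f"{ESEARCH_URL}?{urlencode(params)}"


print(build_pmc_search_url("gut microbiome inflammation", 5))
```

Because the request targets `db=pmc`, the IDs that come back are PMC records rather than PubMed records that may or may not link to full text.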

In [2]:
from pubmed_stream import search_and_download, DownloadStats, __version__
from pathlib import Path
import json
import os

print(f"pubmed-stream version: {__version__}")

pubmed-stream version: 0.1.0


In [3]:
import os

# Optional: Set your NCBI API key here (or use environment variable)
# Leave as None to work without API key (slower: 3 req/s vs 10 req/s)
# To use an API key, uncomment and add your key:
# os.environ["NCBI_API_KEY"] = "your_api_key_here"

# Check if API key is set
api_key = os.getenv("NCBI_API_KEY")
if api_key:
    print("✓ API key is set (rate limit: 10 req/s)")
else:
    print("ℹ No API key set - using slower rate limit (3 req/s)")

ℹ No API key set - using slower rate limit (3 req/s)
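
The two rate limits translate into a minimum gap between requests. The sketch below shows one way to enforce such a gap. `SimpleRateLimiter` and `min_interval` are illustrative names, and this is not the library's own `RateLimiter` implementation.

```python
import time
from typing import Callable, Optional


def min_interval(api_key: Optional[str]) -> float:
    """Seconds between requests: 10 req/s with a key, 3 req/s without."""
    return 1 / 10 if api_key else 1 / 3


class SimpleRateLimiter:
    """Blocks so that consecutive wait() calls are at least `interval` apart."""

    def __init__(self, interval: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


limiter = SimpleRateLimiter(min_interval(api_key=None))
for i in range(3):
    slept = limiter.wait()
    print(f"request {i}: waited {slept:.2f}s")
```

The clock and sleep functions are injectable so the limiter can be exercised in tests without real waiting.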


In [None]:
# Quick test to verify the system works
# Using "bile acid" which has abundant open access articles

print("Testing with a known good query...")
test_stats = search_and_download(
    keyword="bile acid",
    max_results=5,
    fmt="text",  # Default: JSON with metadata + text field
    output_dir=Path("test_publications"),
    api_key=api_key  # Uses the api_key from the cell above
)

# Check both new downloads and skipped files
total_available = test_stats.successful + test_stats.skipped

if test_stats.successful > 0:
    print("\n✓ System is working! Library can download articles.")
    print(f"  Downloaded {test_stats.successful} article(s)")
elif test_stats.skipped > 0:
    print("\n✓ System is working! Files already exist (no need to re-download).")
    print(f"  {test_stats.skipped} article(s) already available")
    print(f"  Tip: Delete 'test_publications/bile_acid' folder to test a fresh download")
else:
    print("\n✗ No files available - this could indicate:")
    print("  - Network/API issues")
    print("  - Rate limiting (wait a bit and retry)")
    print("  - Articles not available in full-text")

Testing with a known good query...

Found 5 PMC IDs for 'bile acid' (total PubMed results: 213660)
Output directory: test_publications\bile_acid

Progress: 5/5 processed (OK:0 | FAIL:0 | SKIP:5)

Download Summary
Keyword:           bile acid
Total found:       213660
Requested:         5
[OK] Successful:   0
[FAIL] Failed:     0
  - Unavailable:   0
  - Errors:        0
[SKIP] Skipped:    5
Duration:          0.2s
Output directory:  test_publications\bile_acid

✓ System is working! Files already exist (skipped re-download).
  Skipped 5 existing file(s)
  Tip: Delete 'test_publications/bile_acid' folder to test a fresh download

## 2. Main Download Test

Download articles for your research topic.

**Troubleshooting "0 results":**
- The library searches **PMC directly**, so every ID it returns should have full text available
- Not all PubMed articles are in PMC (paywalled, not deposited, etc.), so PMC totals are lower than PubMed totals
- Very narrow or misspelled terms can still return 0 PMC matches
- **Solution**: Use broader multi-term queries on topics with active open access literature

**Queries that work well:**
- "gut microbiome inflammation" 
- "COVID-19 treatment"
- "CRISPR gene editing"
- "cancer immunotherapy"
- "machine learning healthcare"

In [5]:
# Download up to 30 articles
# The library searches PMC directly for full-text articles
# Best practice: Use broader, recent topics with open access literature
# Examples that work well: "microbiome", "COVID-19", "machine learning", "CRISPR"

stats = search_and_download(
    keyword="gut microbiome inflammation aging",  # Broader topic with lots of open access
    max_results=30,
    fmt="text",  # Default: JSON with metadata + text field (or use "xml", "both")
    output_dir=Path("test_publications"),
    api_key=api_key  # Uses the api_key set in the API key cell above
)

print(f"\nSuccess rate: {stats.success_rate:.1f}%")


Found 30 PMC IDs for 'gut microbiome inflammation aging' (total PubMed results: 40631)
Output directory: test_publications\gut_microbiome_inflammation_aging

Progress: 30/30 processed (OK:0 | FAIL:0 | SKIP:30)

Download Summary
Keyword:           gut microbiome inflammation aging
Total found:       40631
Requested:         30
[OK] Successful:   0
[FAIL] Failed:     0
  - Unavailable:   0
  - Errors:        0
[SKIP] Skipped:    30
Duration:          0.3s
Output directory:  test_publications\gut_microbiome_inflammation_aging

Success rate: 100.0%


## 3. Examine Downloaded Files

Let's look at what was downloaded and examine one of the JSON files.

**Format Options (all save as .json files):**
- **`"text"`** (default): JSON with metadata and `text` field (plain text)
- **`"xml"`**: JSON with metadata and `xml` field (raw XML string)
- **`"both"`**: JSON with metadata and both `xml` and `text` fields

**Note**: Use `fmt="xml"` (or `fmt="both"`) to get raw JATS XML for section extraction.
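
Since every format is saved as `.json`, the file extension does not tell you which option produced a file. A small sketch of how to tell them apart from the content fields (the helper `infer_fmt` is illustrative, not a library function):

```python
def infer_fmt(record: dict) -> str:
    """Guess which fmt option produced a saved record from its content fields."""
    has_text = "text" in record
    has_xml = "xml" in record
    if has_text and has_xml:
        return "both"
    if has_xml:
        return "xml"
    if has_text:
        return "text"
    return "unknown"


# Synthetic records shaped like the examples in this notebook.
print(infer_fmt({"pmcid": "PMC123", "metadata": {}, "text": "..."}))
print(infer_fmt({"pmcid": "PMC123", "metadata": {}, "xml": "<article/>"}))
```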

In [6]:
# List downloaded files
# Note: folder name is based on the query (with special chars replaced by _)
output_dir = Path("test_publications/gut_microbiome_inflammation_aging")

if output_dir.exists():
    json_files = list(output_dir.glob("*.json"))
    print(f"Found {len(json_files)} JSON files:")
    for f in json_files:
        print(f"  - {f.name}")
else:
    print("Output directory doesn't exist yet")
    print("Try running the download cell above first!")

Found 33 JSON files:
  - PMC12883035.json
  - PMC12883039.json
  - PMC12883080.json
  - PMC12897392.json
  - PMC12897404.json
  - PMC12898319.json
  - PMC12900750.json
  - PMC12901433.json
  - PMC12901480.json
  - PMC12901847.json
  - PMC12902778.json
  - PMC12902791.json
  - PMC12902800.json
  - PMC12902827.json
  - PMC12902863.json
  - PMC12903127.json
  - PMC12904716.json
  - PMC12904852.json
  - PMC12904972.json
  - PMC12905161.json
  - PMC12905206.json
  - PMC12905300.json
  - PMC12905356.json
  - PMC12905387.json
  - PMC12905409.json
  - PMC12905637.json
  - PMC12905855.json
  - PMC12906066.json
  - PMC12906213.json
  - PMC12906789.json
  - PMC12909005.json
  - PMC12909034.json
  - PMC12909087.json
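
The folder names in the outputs (`bile_acid`, `COVID-19_vaccine`, `gut_microbiome_inflammation_aging`) suggest that spaces become underscores while hyphens are kept. The sketch below reproduces those observed names. The exact rule the library applies is an assumption here, and `safe_folder_name` is a hypothetical helper.

```python
import re


def safe_folder_name(keyword: str) -> str:
    """Replace each run of characters other than letters, digits, or '-' with '_'."""
    # Assumption: this pattern is inferred from the folder names seen in the
    # outputs, not taken from the library's source.
    return re.sub(r"[^A-Za-z0-9-]+", "_", keyword.strip())


for kw in ["bile acid", "COVID-19 vaccine", "gut microbiome inflammation aging"]:
    print(f"{kw!r} -> {safe_folder_name(kw)!r}")
```

Deriving the path this way avoids hard-coding the folder name when you list files for a different query.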


In [7]:
# Load and examine the first article
if output_dir.exists() and json_files:
    first_file = json_files[0]
    with open(first_file, 'r', encoding='utf-8') as f:
        article = json.load(f)
    
    print(f"Article: {first_file.name}")
    print(f"\nJSON structure (top-level keys): {list(article.keys())}")
    print(f"  - pmcid: {article['pmcid']}")
    print(f"  - source: {article['source']}")
    print(f"  - download_date: {article['download_date']}")
    print(f"  - metadata: {len(article['metadata'])} fields")
    print(f"  - text: {'included' if 'text' in article else 'excluded'} (plain-text)")
    print(f"\nNote: Raw XML is NOT included in JSON format (use fmt='xml' if needed)")
    
    print(f"\n--- Metadata Fields ---")
    print(f"Title: {article['metadata'].get('title', 'N/A')}")
    print(f"Journal: {article['metadata'].get('journal', 'N/A')}")
    print(f"Year: {article['metadata'].get('year', 'N/A')}")
    print(f"DOI: {article['metadata'].get('doi', 'N/A')}")
    
    authors = article['metadata'].get('authors', [])
    if authors:
        print(f"Authors: {', '.join(authors[:3])}{'...' if len(authors) > 3 else ''}")
    
    abstract = article['metadata'].get('abstract', '')
    if abstract:
        print(f"\nAbstract (first 200 chars): {abstract[:200]}...")
else:
    print("No articles downloaded yet")

Article: PMC12883035.json

JSON structure (top-level keys): ['pmcid', 'source', 'download_date', 'metadata', 'text']
  - pmcid: PMC12883035
  - source: PMC
  - download_date: 2026-02-10T00:46:17.582814
  - metadata: 14 fields
  - text: included (plain-text)

Note: Raw XML is NOT included in JSON format (use fmt='xml' if needed)

--- Metadata Fields ---
Title: Dysregulation of Farnesoid X Receptor on Neutrophil Homeostasis Exacerbates Intestinal Inflammation via the mTORC1‐Glycolysis Signaling Pathway
Journal: MedComm
Year: 2026
DOI: 10.1002/mco2.70637
Authors: Kang Dengfeng, Li Ai, Xie Xiangqi...

Abstract (first 200 chars): ABSTRACT Neutrophils significantly accumulate within the inflamed intestinal mucosa of patients with inflammatory bowel disease (IBD), where the farnesoid X receptor (FXR) is typically downregulated. ...


## 4. Test Different Search Queries

Let's test with a few different keyword combinations.

In [8]:
# Test multiple keywords
# Using broader topics that typically have good PMC coverage
test_queries = [
    "gut microbiome",
    "COVID-19 vaccine",
    "CRISPR therapy"
]

results = []
for query in test_queries:
    print(f"\n{'='*60}")
    print(f"Testing query: '{query}'")
    print('='*60)
    
    stats = search_and_download(
        keyword=query,
        max_results=2,
        fmt="text",  # Default: JSON with metadata + text
        output_dir=Path("test_publications")
    )
    
    results.append({
        'query': query,
        'found': stats.total_found,
        'downloaded': stats.successful,
        'success_rate': stats.success_rate
    })


Testing query: 'gut microbiome'


Search attempt 1/3 failed: 429 Client Error: Too Many Requests for url: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pmc&term=COVID-19+vaccine&retmax=2&retmode=json&retstart=0



Found 2 PMC IDs for 'gut microbiome' (total PubMed results: 216601)
Output directory: test_publications\gut_microbiome

Progress: 2/2 processed (OK:0 | FAIL:0 | SKIP:2)

Download Summary
Keyword:           gut microbiome
Total found:       216601
Requested:         2
[OK] Successful:   0
[FAIL] Failed:     0
  - Unavailable:   0
  - Errors:        0
[SKIP] Skipped:    2
Duration:          0.3s
Output directory:  test_publications\gut_microbiome

Testing query: 'COVID-19 vaccine'

Found 2 PMC IDs for 'COVID-19 vaccine' (total PubMed results: 77126)
Output directory: test_publications\COVID-19_vaccine

Progress: 2/2 processed (OK:0 | FAIL:0 | SKIP:2)

Download Summary
Keyword:           COVID-19 vaccine
Total found:       77126
Requested:         2
[OK] Successful:   0
[FAIL] Failed:     0
  - Unavailable:   0
  - Errors:        0
[SKIP] Skipped:    2
Duration:          2.5s
Output directory:  test_publications\COVID-19_vaccine

Testing query: 'CRISPR therapy'

Found 2 PMC IDs for 'CRIS

In [9]:
test_queries = [
    "oral microbiome",
]

results = []
for query in test_queries:
    print(f"\n{'='*60}")
    print(f"Testing query: '{query}'")
    print('='*60)
    
    stats = search_and_download(
        keyword=query,
        max_results=2,
        fmt="both",  # JSON with metadata plus both xml and text fields
        output_dir=Path("test_publications")
    )
    
    results.append({
        'query': query,
        'found': stats.total_found,
        'downloaded': stats.successful,
        'success_rate': stats.success_rate
    })


Testing query: 'oral microbiome'

Found 2 PMC IDs for 'oral microbiome' (total PubMed results: 201572)
Output directory: test_publications\oral_microbiome

Progress: 2/2 processed (OK:2 | FAIL:0 | SKIP:0)

Download Summary
Keyword:           oral microbiome
Total found:       201572
Requested:         2
[OK] Successful:   2
[FAIL] Failed:     0
  - Unavailable:   0
  - Errors:        0
[SKIP] Skipped:    0
Duration:          3.0s
Output directory:  test_publications\oral_microbiome


In [10]:
test_queries = [
    "oral microbiome",
]

results = []
for query in test_queries:
    print(f"\n{'='*60}")
    print(f"Testing query: '{query}'")
    print('='*60)
    
    stats = search_and_download(
        keyword=query,
        max_results=2,
        fmt="both",  # JSON with metadata plus both xml and text fields
        output_dir=Path("test_publications")
    )
    
    results.append({
        'query': query,
        'found': stats.total_found,
        'downloaded': stats.successful,
        'success_rate': stats.success_rate
    })


Testing query: 'oral microbiome'

Found 2 PMC IDs for 'oral microbiome' (total PubMed results: 201572)
Output directory: test_publications\oral_microbiome

Progress: 2/2 processed (OK:0 | FAIL:0 | SKIP:2)

Download Summary
Keyword:           oral microbiome
Total found:       201572
Requested:         2
[OK] Successful:   0
[FAIL] Failed:     0
  - Unavailable:   0
  - Errors:        0
[SKIP] Skipped:    2
Duration:          0.2s
Output directory:  test_publications\oral_microbiome


In [11]:
# Summary of all test queries
print("\n" + "="*60)
print("SUMMARY OF ALL QUERIES")
print("="*60)
for r in results:
    print(f"\nQuery: '{r['query']}'")
    print(f"  Total found in PubMed: {r['found']}")
    print(f"  Successfully downloaded: {r['downloaded']}")
    print(f"  Success rate: {r['success_rate']:.1f}%")


SUMMARY OF ALL QUERIES

Query: 'oral microbiome'
  Total found in PubMed: 201572
  Successfully downloaded: 0
  Success rate: 100.0%
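
A success rate of 100.0% next to 0 new downloads looks odd at first. It is consistent with skipped (already downloaded) files being counted as available, although that is an inference from the output and not a documented formula. The sketch below uses a hypothetical stand-in class, not the library's `DownloadStats`, to show that reading.

```python
from dataclasses import dataclass


@dataclass
class StatsSketch:
    """Hypothetical stand-in for the counters shown in the Download Summary."""
    successful: int = 0
    skipped: int = 0
    failed: int = 0

    @property
    def processed(self) -> int:
        return self.successful + self.skipped + self.failed

    @property
    def available_rate(self) -> float:
        """Percentage of processed articles on disk (new or pre-existing)."""
        if self.processed == 0:
            return 0.0  # assumption: report 0% when nothing was processed
        return 100.0 * (self.successful + self.skipped) / self.processed


rerun = StatsSketch(successful=0, skipped=2, failed=0)
print(f"new downloads: {rerun.successful}, available: {rerun.available_rate:.1f}%")
```

When reporting results, it helps to show new downloads and skipped files separately so a rerun is not mistaken for a failure.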


## 5. Test Sequential vs Concurrent Downloads

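Compare download speeds between sequential and concurrent modes. For a meaningful comparison, run both modes with the same keyword into an empty output directory; skipped files and differing keywords make the durations incomparable.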
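
To illustrate the difference between the two modes without touching the network, here is a self-contained sketch that uses a fake download function with simulated latency. It shows the general thread-pool technique only; how `pubmed-stream` implements `use_concurrent` internally is not shown in this notebook.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(pmcid: str, latency: float = 0.05) -> str:
    """Stand-in for a network fetch: sleep, then return a file name."""
    time.sleep(latency)
    return f"PMC{pmcid}.json"


def run_sequential(ids):
    return [fake_download(i) for i in ids]


def run_concurrent(ids, max_workers=5):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the inputs
        return list(pool.map(fake_download, ids))


ids = ["1", "2", "3", "4", "5"]

start = time.perf_counter()
seq = run_sequential(ids)
seq_time = time.perf_counter() - start

start = time.perf_counter()
conc = run_concurrent(ids)
conc_time = time.perf_counter() - start

print(f"sequential: {seq_time:.2f}s, concurrent: {conc_time:.2f}s")
```

With real requests the gain is capped by the NCBI rate limit, so adding workers beyond that limit is unlikely to help.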

In [12]:
import time

# Test with sequential downloads
print("Testing SEQUENTIAL downloads...")
stats_seq = search_and_download(
    keyword="microbiome",
    max_results=5,
    fmt="text",  # Default: JSON with metadata + text
    use_concurrent=False,
    output_dir=Path("test_publications")
)

print(f"\nSequential mode: {stats_seq.duration_seconds:.1f}s for {stats_seq.successful} articles")

Testing SEQUENTIAL downloads...

Found 5 PMC IDs for 'microbiome' (total PubMed results: 476908)
Output directory: test_publications\microbiome

Progress: 5/5 processed

Download Summary
Keyword:           microbiome
Total found:       476908
Requested:         5
[OK] Successful:   0
[FAIL] Failed:     0
  - Unavailable:   0
  - Errors:        0
[SKIP] Skipped:    5
Duration:          0.1s
Output directory:  test_publications\microbiome

Sequential mode: 0.1s for 0 articles


In [13]:
# Test with concurrent downloads
print("Testing CONCURRENT downloads...")
stats_conc = search_and_download(
    keyword="COVID-19",
    max_results=5,
    fmt="text",  # Default: JSON with metadata + text
    use_concurrent=True,
    max_workers=5,
    output_dir=Path("test_publications")
)

print(f"\nConcurrent mode: {stats_conc.duration_seconds:.1f}s for {stats_conc.successful} articles")

if stats_seq.successful > 0 and stats_conc.successful > 0:
    speedup = stats_seq.duration_seconds / stats_conc.duration_seconds
    print(f"\nSpeedup: {speedup:.2f}x")

Testing CONCURRENT downloads...


Search attempt 1/3 failed: 429 Client Error: Too Many Requests for url: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pmc&term=COVID-19&retmax=5&retmode=json&retstart=0



Found 5 PMC IDs for 'COVID-19' (total PubMed results: 612306)
Output directory: test_publications\COVID-19

Progress: 5/5 processed (OK:0 | FAIL:0 | SKIP:5)

Download Summary
Keyword:           COVID-19
Total found:       612306
Requested:         5
[OK] Successful:   0
[FAIL] Failed:     0
  - Unavailable:   0
  - Errors:        0
[SKIP] Skipped:    5
Duration:          4.3s
Output directory:  test_publications\COVID-19

Concurrent mode: 4.3s for 0 articles
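
The `Search attempt 1/3 failed: 429 ...` line above suggests that the library retries a rate-limited search before giving up. A generic sketch of retrying with exponential backoff follows. The delays and the helper `with_retries` are assumptions for illustration, not the library's actual settings.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(func: Callable[[], T], attempts: int = 3,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func(); after a failure wait base_delay * 2**n, up to `attempts` tries."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            print(f"Search attempt {attempt}/{attempts} failed: {exc}")
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be >= 1")


calls = {"n": 0}


def flaky_search():
    """Fails once with a simulated rate-limit error, then succeeds."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("429 Too Many Requests")
    return ["12909087", "12909060"]


print(with_retries(flaky_search, base_delay=0.01))
```

Setting an API key (10 req/s instead of 3 req/s) should make these 429 responses less frequent.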


## 6. Download Format Options

The library now supports three format options, all saving as `.json` files:

**Format Options:**
- **`"text"`** (default): JSON with metadata and `text` field (plain text)
  - Clean, structured data (pmcid, metadata, text)
  - ~50-60% smaller than including XML
  - Easy to parse and analyze
  - Best for vector databases and search
  - Example: `{"pmcid": "PMC123", "metadata": {...}, "text": "..."}`

- **`"xml"`**: JSON with metadata and `xml` field (raw XML string)
  - Preserves complete article structure as string
  - Contains JATS section tags (Introduction, Methods, Results, Discussion, etc.)
  - Best for detailed section parsing and structure analysis
  - Example: `{"pmcid": "PMC123", "metadata": {...}, "xml": "<article>...</article>"}`

- **`"both"`**: JSON with metadata and both `xml` and `text` fields
  - Get both formats in a single file
  - JSON structure for quick access, XML for detailed parsing
  - Ideal for comprehensive analysis workflows
  - Example: `{"pmcid": "PMC123", "metadata": {...}, "xml": "...", "text": "..."}`
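
As an example of the section parsing that the `xml` field enables, the sketch below lists section titles from a JATS-style string using the standard library. The sample XML is synthetic; real PMC XML is larger and may need extra handling (for example front and back matter or nested sections), so treat this as a starting point.

```python
import xml.etree.ElementTree as ET
from typing import List


def section_titles(xml_string: str) -> List[str]:
    """Return the <title> text of every <sec> element, in document order."""
    root = ET.fromstring(xml_string)
    titles = []
    for sec in root.iter("sec"):
        title = sec.find("title")  # direct child only
        if title is not None:
            titles.append("".join(title.itertext()).strip())
    return titles


# Synthetic stand-in for article["xml"] from a file saved with fmt="xml".
sample_xml = (
    "<article><body>"
    "<sec><title>Introduction</title><p>...</p></sec>"
    "<sec><title>Methods</title><p>...</p></sec>"
    "</body></article>"
)
print(section_titles(sample_xml))
```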

In [14]:
# Verify JSON format structure
test_dir = Path("test_publications")
json_files = list(test_dir.glob("**/*.json"))

if json_files:
    # Examine a sample JSON file
    sample = json_files[0]
    with open(sample, 'r', encoding='utf-8') as f:
        article = json.load(f)
    
    print("Sample JSON structure (all formats save as .json):")
    print(f"  File: {sample.name}")
    print(f"  Size: {sample.stat().st_size / 1024:.1f} KB")
    print(f"\n  Fields: {', '.join(article.keys())}")
    print(f"\n  PMCID: {article['pmcid']}")
    print(f"  Title: {article['metadata']['title'][:80]}...")
    
    if 'text' in article:
        print(f"  Text length: {len(article['text'])} characters")
        print(f"\n  ✓ Contains 'text' field (fmt='text' or fmt='both')")
    
    if 'xml' in article:
        print(f"  XML length: {len(article['xml'])} characters")
        print(f"\n  ✓ Contains 'xml' field (fmt='xml' or fmt='both')")
    
    print(f"\n  ℹ️  All format options save as .json with different content:")
    print(f"     - fmt='text': Has 'text' field (plain text)")
    print(f"     - fmt='xml': Has 'xml' field (raw XML string)")
    print(f"     - fmt='both': Has both 'xml' and 'text' fields")
else:
    print("No JSON files found. Run previous download cells first.")

Sample JSON structure (all formats save as .json):
  File: PMC11970744.json
  Size: 7.0 KB

  Fields: pmcid, source, download_date, metadata, text

  PMCID: PMC11970744
  Title: Identification of IL-34 and Slc7al as potential key regulators in MASLD progress...
  Text length: 4700 characters

  ✓ Contains 'text' field (fmt='text' or fmt='both')

  ℹ️  All format options save as .json with different content:
     - fmt='text': Has 'text' field (plain text)
     - fmt='xml': Has 'xml' field (raw XML string)
     - fmt='both': Has both 'xml' and 'text' fields


In [15]:
# Test all three format options
print("Testing all three format options (all save as .json with different fields):\n")

# Test 1: Default "text" format (JSON with metadata + text)
print("1. Testing fmt='text' (JSON with metadata + text field)...")
stats_text = search_and_download(
    keyword="bile acid",
    max_results=1,
    fmt="text",
    output_dir=Path("test_publications/format_test_text")
)
print(f"   ✓ Downloaded: {stats_text.successful} JSON file(s) with 'text' field")

# Test 2: XML format (JSON with metadata + xml field)
print("\n2. Testing fmt='xml' (JSON with metadata + xml field)...")
stats_xml = search_and_download(
    keyword="bile acid",
    max_results=1,
    fmt="xml",
    output_dir=Path("test_publications/format_test_xml")
)
print(f"   ✓ Downloaded: {stats_xml.successful} JSON file(s) with 'xml' field")

# Test 3: Both formats (JSON with both xml and text fields)
print("\n3. Testing fmt='both' (JSON with both xml and text fields)...")
stats_both = search_and_download(
    keyword="bile acid",
    max_results=1,
    fmt="both",
    output_dir=Path("test_publications/format_test_both")
)
print(f"   ✓ Downloaded: {stats_both.successful} JSON file(s) with both fields")

print("\n" + "="*60)
print("Format comparison:")
print("="*60)

# Check file contents
test_dirs = {
    "text": Path("test_publications/format_test_text"),
    "xml": Path("test_publications/format_test_xml"),
    "both": Path("test_publications/format_test_both")
}

for fmt_name, dir_path in test_dirs.items():
    if dir_path.exists():
        json_files = list(dir_path.glob("**/*.json"))
        
        if json_files:
            json_file = json_files[0]
            with open(json_file, 'r', encoding='utf-8') as f:
                data = json.load(f)
            
            print(f"\n{fmt_name.upper()} format:")
            print(f"  File: {json_file.name}")
            print(f"  Size: {json_file.stat().st_size / 1024:.1f} KB")
            print(f"  Fields: {', '.join(data.keys())}")
            if 'text' in data:
                print(f"    ✓ Has 'text' field ({len(data['text'])} chars)")
            if 'xml' in data:
                print(f"    ✓ Has 'xml' field ({len(data['xml'])} chars)")

Testing all three format options (all save as .json with different fields):

1. Testing fmt='text' (JSON with metadata + text field)...

Found 1 PMC IDs for 'bile acid' (total PubMed results: 213660)
Output directory: test_publications\format_test_text\bile_acid

Progress: 1/1 processed

Download Summary
Keyword:           bile acid
Total found:       213660
Requested:         1
[OK] Successful:   0
[FAIL] Failed:     0
  - Unavailable:   0
  - Errors:        0
[SKIP] Skipped:    1
Duration:          0.2s
Output directory:  test_publications\format_test_text\bile_acid
   ✓ Downloaded: 0 JSON file(s) with 'text' field

2. Testing fmt='xml' (JSON with metadata + xml field)...

Found 1 PMC IDs for 'bile acid' (total PubMed results: 213660)
Output directory: test_publications\format_test_xml\bile_acid



Search attempt 1/3 failed: 429 Client Error: Too Many Requests for url: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pmc&term=bile+acid&retmax=1&retmode=json&retstart=0


Progress: 1/1 processed

Download Summary
Keyword:           bile acid
Total found:       213660
Requested:         1
[OK] Successful:   1
[FAIL] Failed:     0
  - Unavailable:   0
  - Errors:        0
[SKIP] Skipped:    0
Duration:          0.5s
Output directory:  test_publications\format_test_xml\bile_acid
   ✓ Downloaded: 1 JSON file(s) with 'xml' field

3. Testing fmt='both' (JSON with both xml and text fields)...

Found 1 PMC IDs for 'bile acid' (total PubMed results: 213660)
Output directory: test_publications\format_test_both\bile_acid

Progress: 1/1 processed

Download Summary
Keyword:           bile acid
Total found:       213660
Requested:         1
[OK] Successful:   0
[FAIL] Failed:     0
  - Unavailable:   0
  - Errors:        0
[SKIP] Skipped:    1
Duration:          2.2s
Output directory:  test_publications\format_test_both\bile_acid
   ✓ Downloaded: 0 JSON file(s) with both fields

Format comparison:

TEXT format:
  File: PMC12909090.json
  Size: 41.0 KB
  Fields: pmcid

## 7. Using Lower-Level API

Demonstrate the lower-level functions for more control.

In [16]:
from pubmed_stream import esearch_pmc, efetch_pmc, create_session, RateLimiter
import os

# Get API key from environment if available
api_key = os.getenv("NCBI_API_KEY")

# Create session and rate limiter
session = create_session("pubmed-stream/test")
rate_limiter = RateLimiter(0.34)  # presumably seconds between requests (about 3 req/s without an API key)

# Search for PMC IDs using a topic with good PMC coverage
pmcids, total_found = esearch_pmc(
    term="microbiome",
    max_results=3,
    api_key=api_key,
    session=session,
    rate_limiter=rate_limiter
)

print(f"Found {len(pmcids)} PMC IDs (total: {total_found})")
print(f"PMC IDs: {pmcids[:5]}...")

Found 3 PMC IDs (total: 476908)
PMC IDs: ['12909087', '12909060', '12909034']...


In [17]:
# Download a single article using efetch_pmc
if pmcids:
    pmcid = pmcids[0]
    output_dir = Path("test_publications/manual_download")
    
    success, status = efetch_pmc(
        pmcid=pmcid,
        out_dir=output_dir,
        fmt="text",  # Can also use "xml" or "both"
        api_key=api_key,
        session=session,
        rate_limiter=rate_limiter,
        include_text=True
    )
    
    print(f"Download {pmcid}: {'SUCCESS' if success else 'FAILED'}")
    print(f"Status: {status}")
    
    # Check the file
    json_file = output_dir / f"PMC{pmcid}.json"
    if json_file.exists():
        print(f"File size: {json_file.stat().st_size / 1024:.1f} KB")

# Clean up session
session.close()

Download 12909087: SUCCESS
Status: exists
File size: 71.7 KB
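
Note that `esearch_pmc` returned bare numeric IDs (`'12909087'`), while the saved files carry a `PMC` prefix. A small hypothetical helper that accepts either form keeps file lookups consistent:

```python
from pathlib import Path


def pmc_json_path(out_dir: Path, pmcid: str) -> Path:
    """Return out_dir/PMC<digits>.json for an ID with or without the 'PMC' prefix."""
    digits = pmcid.strip().upper()
    if digits.startswith("PMC"):
        digits = digits[3:]
    if not digits.isdigit():
        raise ValueError(f"not a PMC ID: {pmcid!r}")
    return out_dir / f"PMC{digits}.json"


out = Path("test_publications/manual_download")
print(pmc_json_path(out, "12909087"))
print(pmc_json_path(out, "PMC12909087"))
```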


## 8. Test Error Handling

Test with a query that matches nothing to see how the library handles an empty result set.

In [18]:
# Test with a very specific query that likely has no results
stats_empty = search_and_download(
    keyword="xyzabc123nonexistent",
    max_results=10,
    fmt="text",  # Default: JSON with metadata + text
    output_dir=Path("test_publications")
)

print(f"\nEmpty query test:")
print(f"  Total found: {stats_empty.total_found}")
print(f"  Downloaded: {stats_empty.successful}")

No PMC articles found for: xyzabc123nonexistent


No results found for 'xyzabc123nonexistent'

Empty query test:
  Total found: 0
  Downloaded: 0


## 9. Summary Statistics

Collect and display overall statistics from all tests.

In [19]:
# Count all downloaded files
test_pub_dir = Path("test_publications")

if test_pub_dir.exists():
    json_files = list(test_pub_dir.glob("**/*.json"))
    
    print("\n" + "="*60)
    print("OVERALL TEST SUMMARY")
    print("="*60)
    print(f"Total JSON files: {len(json_files)}")
    print(f"\nTotal articles downloaded: {len(json_files)}")
    
    # Calculate total storage used
    total_size = sum(f.stat().st_size for f in json_files)
    print(f"Total storage used: {total_size / 1024 / 1024:.2f} MB")
    
    # List subdirectories (topics)
    subdirs = [d for d in test_pub_dir.iterdir() if d.is_dir()]
    print(f"\nTopics downloaded: {len(subdirs)}")
    for d in subdirs:
        count = len(list(d.glob("*.json")))
        print(f"  - {d.name}: {count} files")
else:
    print("No test publications directory found")


OVERALL TEST SUMMARY
Total JSON files: 90

Total articles downloaded: 90
Total storage used: 7.88 MB

Topics downloaded: 16
  - bile_acid: 5 files
  - COVID-19: 10 files
  - COVID-19_vaccine: 4 files
  - CRISPR: 0 files
  - CRISPR_therapy: 4 files
  - format_test_both: 0 files
  - format_test_text: 0 files
  - format_test_xml: 0 files
  - generic_COVID_19: 0 files
  - generic_CRISPR: 0 files
  - generic_microbiome: 0 files
  - gut_microbiome: 4 files
  - gut_microbiome_inflammation_aging: 33 files
  - manual_download: 2 files
  - microbiome: 11 files
  - oral_microbiome: 2 files
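
The `format_test_*` and `generic_*` folders report 0 files because their articles sit one level deeper (for example `test_publications\format_test_text\bile_acid`), and `d.glob("*.json")` only looks at the top level of each folder. A recursive count, sketched here on a throwaway directory tree, picks them up:

```python
import tempfile
from pathlib import Path


def count_json_per_topic(root: Path) -> dict:
    """Map each top-level folder to the number of .json files anywhere beneath it."""
    return {
        d.name: sum(1 for _ in d.rglob("*.json"))
        for d in sorted(root.iterdir())
        if d.is_dir()
    }


# Build a temporary tree that mimics the nested layout seen above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "bile_acid").mkdir()
    (root / "bile_acid" / "PMC1.json").write_text("{}")
    nested = root / "format_test_xml" / "bile_acid"
    nested.mkdir(parents=True)
    (nested / "PMC2.json").write_text("{}")

    counts = count_json_per_topic(root)
    print(counts)
```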


## Conclusion

All tests completed! The `pubmed-stream` library successfully:

✓ Searches PMC directly for guaranteed full-text availability  
✓ Downloads articles in multiple formats (all as .json)  
✓ Handles concurrent and sequential downloads  
✓ Extracts metadata from PMC XML  
✓ Handles errors gracefully  
✓ Provides detailed statistics  

**Format Options (all save as .json files):**

- **`"text"`** (default): JSON with metadata and `text` field
  - Contains: `pmcid`, `source`, `download_date`, `metadata`, and `text` (plain-text)
  - Clean, structured data that's easy to analyze and embed into vector databases
  - ~50-60% smaller than including XML

- **`"xml"`**: JSON with metadata and `xml` field
  - Contains: `pmcid`, `source`, `download_date`, `metadata`, and `xml` (raw XML string)
  - Preserves complete article structure with JATS section tags
  - Best for detailed section parsing (Introduction, Methods, Results, Discussion, etc.)

- **`"both"`**: JSON with metadata, `xml`, and `text` fields
  - Get the benefits of both formats in a single file
  - JSON for quick access and vector DB embedding
  - XML for structured section extraction

You can now use this library for your research projects!

## Test: Generic Single-Word Queries

**Update**: The library now **searches PMC directly** instead of PubMed→PMC linking.

**Why this works better:**
- Old approach: Search PubMed for "microbiome" → get most recent PMIDs → many don't have PMC full-text yet (new articles not deposited)
- New approach: Search PMC directly → all results guaranteed to have full-text

This change makes downloads reliable for broad generic terms such as "microbiome", "CRISPR", and "COVID-19", as long as matching articles exist in PMC.

In [None]:
# Test generic single-word queries that previously failed
test_queries = ["microbiome", "CRISPR", "COVID-19"]

print("Testing generic single-word queries with updated algorithm:\n")

for query in test_queries:
    print(f"Query: '{query}'")
    generic_stats = search_and_download(
        keyword=query,
        max_results=2,  # Just 2 articles to test quickly
        fmt="text",  # Default: JSON with metadata + text
        output_dir=Path(f"test_publications/generic_{query.replace('-', '_')}"),
    )
    
    # Check both successful downloads AND skipped files (already downloaded)
    total_available = generic_stats.successful + generic_stats.skipped
    
    if total_available > 0:
        if generic_stats.successful > 0:
            print(f"  ✓ SUCCESS: Downloaded {generic_stats.successful} new article(s)")
        if generic_stats.skipped > 0:
            print(f"  ✓ SUCCESS: {generic_stats.skipped} article(s) already exist (skipped)")
        print(f"  Total available: {total_available}/{generic_stats.requested}")
    else:
        print(f"  ✗ FAILED: 0 available (PMC total: {generic_stats.total_found})")
    print()

INFO: Searching PMC for full-text articles: microbiome (target: 4 articles)
INFO: Found 4 PMC IDs (total PMC results: 476908)
INFO: Using concurrent downloads with 5 workers
INFO: Searching PMC for full-text articles: CRISPR (target: 4 articles)


Testing generic single-word queries with updated algorithm:

Query: 'microbiome'

Found 4 PMC IDs for 'microbiome' (total PubMed results: 476908)
Output directory: test_publications\generic_microbiome\microbiome

Progress: 4/4 processed (OK:0 | FAIL:0 | SKIP:4)

Download Summary
Keyword:           microbiome
Total found:       476908
Requested:         4
[OK] Successful:   0
[FAIL] Failed:     0
  - Unavailable:   0
  - Errors:        0
[SKIP] Skipped:    4
Duration:          0.1s
Output directory:  test_publications\generic_microbiome\microbiome
  ✗ FAILED: 0 downloads (PubMed count: 476908)

Query: 'CRISPR'


INFO: Found 4 PMC IDs (total PMC results: 221106)
INFO: Using concurrent downloads with 5 workers
INFO: Searching PMC for full-text articles: COVID-19 (target: 4 articles)



Found 4 PMC IDs for 'CRISPR' (total PubMed results: 221106)
Output directory: test_publications\generic_CRISPR\CRISPR

Progress: 4/4 processed (OK:0 | FAIL:0 | SKIP:4)

Download Summary
Keyword:           CRISPR
Total found:       221106
Requested:         4
[OK] Successful:   0
[FAIL] Failed:     0
  - Unavailable:   0
  - Errors:        0
[SKIP] Skipped:    4
Duration:          0.5s
Output directory:  test_publications\generic_CRISPR\CRISPR
  ✗ FAILED: 0 downloads (PubMed count: 221106)

Query: 'COVID-19'


INFO: Found 4 PMC IDs (total PMC results: 612306)
INFO: Using concurrent downloads with 5 workers



Found 4 PMC IDs for 'COVID-19' (total PubMed results: 612306)
Output directory: test_publications\generic_COVID_19\COVID-19

Progress: 4/4 processed (OK:0 | FAIL:0 | SKIP:4)

Download Summary
Keyword:           COVID-19
Total found:       612306
Requested:         4
[OK] Successful:   0
[FAIL] Failed:     0
  - Unavailable:   0
  - Errors:        0
[SKIP] Skipped:    4
Duration:          1.5s
Output directory:  test_publications\generic_COVID_19\COVID-19
  ✗ FAILED: 0 downloads (PubMed count: 612306)



In [21]:
# Enable INFO-level logging to see the algorithm in action
import logging
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')

# Get the pubmed_stream logger
logger = logging.getLogger('pubmed_stream')
logger.setLevel(logging.INFO)