# Summarize downloads of the Zenodo community: nfdi4bioimage

This notebook queries the Zenodo REST API for all records in the `nfdi4bioimage` community, extracts each record's URL and download count, and saves the results to a CSV file. We proceed step by step and save intermediate results where useful.

## Imports and configuration

We import the libraries and define the configuration, including the output file names. Request failures are caught rather than raised: if the API is unreachable, fetching stops and returns whatever was collected so far, which is an empty list when the very first request fails.

In [1]:
import requests
import pandas as pd
import json
import time
from typing import List, Dict, Any, Optional

# Configuration
BASE_URL = "https://zenodo.org/api/records"
COMMUNITY = "nfdi4bioimage"
PAGE_SIZE = 200  # requested page size; the API may enforce a lower maximum
REQUEST_TIMEOUT = 20  # seconds
SLEEP_BETWEEN_PAGES = 0.2  # polite delay to avoid rate limiting
HEADERS = {"User-Agent": "git-bob (github-actions bot) - Zenodo downloads summarizer"}

# Output files
OUTPUT_CSV = "zenodo_nfdi4bioimage_downloads.csv"
RAW_JSON = "zenodo_nfdi4bioimage_records.json"
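
The notebook stops at the first failed request instead of retrying. If transient errors should be retried, one option is a small backoff wrapper around the request call. The sketch below is an illustration only: `call_with_retries` and its parameters are hypothetical names that do not appear in the notebook, and the flaky function stands in for a real request so that the example runs without network access.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying on any exception with exponential backoff.

    Hypothetical helper: waits base_delay, 2 * base_delay, 4 * base_delay, ...
    between attempts and re-raises the last error once attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


# Stand-in for a flaky request: fails twice, then succeeds (no network involved).
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


delays: List[float] = []
result = call_with_retries(flaky, attempts=3, base_delay=0.5, sleep=delays.append)
print(result, delays)  # ok [0.5, 1.0]
```

In the notebook, the `requests.get(...)` call could be passed to such a wrapper as a `lambda`. The existing `try`/`except` would then only see the error after the final attempt.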

## Helper: Fetch all records via pagination

We follow Zenodo's pagination by using the `links.next` link until there are no more pages. We collect all hits for the community and return the list of records as Python dictionaries.

In [2]:
def fetch_all_records(community: str, size: int = PAGE_SIZE) -> List[Dict[str, Any]]:
    """Return all records of a community, following `links.next` page by page."""
    records: List[Dict[str, Any]] = []
    url: Optional[str] = BASE_URL
    params: Optional[Dict[str, Any]] = {
        "communities": community,
        "size": size,
        "page": 1,
    }

    while url is not None:
        try:
            resp = requests.get(url, params=params, headers=HEADERS, timeout=REQUEST_TIMEOUT)
            resp.raise_for_status()
            data = resp.json()
        except (requests.RequestException, ValueError) as e:
            # Network/HTTP error or invalid JSON body: stop and return what we have so far
            print(f"Warning: request failed ({e}). Returning partial results.")
            break

        hits = data.get("hits", {}).get("hits", [])
        records.extend(hits)

        links = data.get("links", {}) or {}
        next_link = links.get("next")
        # If there is a next page, follow it; otherwise, stop
        if next_link:
            url = next_link
            params = None  # next_link already includes query params
            time.sleep(SLEEP_BETWEEN_PAGES)
        else:
            url = None

    return records
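
The `links.next` loop can be checked offline by replacing the HTTP call with a lookup in an in-memory dictionary. This is a minimal sketch: `FAKE_PAGES` holds made-up responses that only mimic the `hits.hits` and `links.next` structure used above, and `collect_hits` is a hypothetical stand-in for the fetching function.

```python
from typing import Any, Callable, Dict, List, Optional

# Made-up two-page response set, keyed by a placeholder URL.
FAKE_PAGES: Dict[str, Dict[str, Any]] = {
    "page-1": {"hits": {"hits": [{"id": 1}, {"id": 2}]}, "links": {"next": "page-2"}},
    "page-2": {"hits": {"hits": [{"id": 3}]}, "links": {}},
}


def collect_hits(
    first_url: str, get_page: Callable[[str], Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Follow links.next starting at first_url and gather every hit."""
    hits: List[Dict[str, Any]] = []
    url: Optional[str] = first_url
    while url is not None:
        data = get_page(url)
        hits.extend(data.get("hits", {}).get("hits", []))
        # A missing or empty "next" link ends the loop.
        url = (data.get("links") or {}).get("next") or None
    return hits


ids = [hit["id"] for hit in collect_hits("page-1", FAKE_PAGES.__getitem__)]
print(ids)  # [1, 2, 3]
```

Passing the page getter as an argument keeps the loop logic separate from the transport, so the same function can be exercised without touching the network.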

## Fetch all records and save raw JSON

We now fetch all records for the `nfdi4bioimage` community and save the raw records to disk for transparency and reproducibility. This also allows debugging if needed.

In [3]:
records = fetch_all_records(COMMUNITY, size=PAGE_SIZE)

# Save raw records to JSON for inspection
with open(RAW_JSON, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

num_records = len(records)
num_records



0
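
Because the raw records are saved to disk, they can be inspected later without another API call. The following sketch shows one way to reload such a file and count which top-level keys the records carry. `load_records` and `top_level_keys` are hypothetical helpers, and the sample records are made up so that the example is self-contained.

```python
import json
import os
import tempfile
from collections import Counter
from typing import Any, Dict, List


def load_records(path: str) -> List[Dict[str, Any]]:
    """Read a list of record dictionaries back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"expected a JSON list in {path}, got {type(data).__name__}")
    return data


def top_level_keys(records: List[Dict[str, Any]]) -> Counter:
    """Count how often each top-level key occurs across the records."""
    return Counter(key for rec in records for key in rec)


# Round trip with made-up sample records in a temporary file.
sample = [{"id": 1, "stats": {"downloads": 5}}, {"id": 2}]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "records.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    loaded = load_records(sample_path)

key_counts = top_level_keys(loaded)
print(len(loaded), dict(key_counts))  # 2 {'id': 2, 'stats': 1}
```

With the notebook's own output, the path would be the value of `RAW_JSON`.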

## Extract Zenodo URL and download counts

From each record, we extract:
- `zenodo_url`: the HTML page link of the record
- `downloads`: the total number of downloads (from the `stats` field)

We also include optional fields (`id`, `conceptdoi`, `title`) to aid interpretation; `conceptdoi` falls back to the record's `doi` when no concept DOI is present. Missing or non-numeric `stats.downloads` values are treated as 0. We then assemble a tidy pandas DataFrame and sort it by downloads in descending order.

In [4]:
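def record_to_row(rec: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one Zenodo record into a row for the summary table."""
    recid = rec.get("id")
    links = rec.get("links", {}) or {}
    # Prefer the HTML link from the API; otherwise build it from the record id
    fallback_url = f"https://zenodo.org/records/{recid}" if recid is not None else None
    url = links.get("html") or fallback_url
    stats = rec.get("stats", {}) or {}
    downloads = stats.get("downloads")
    # Missing or non-numeric download counts are treated as 0
    downloads = int(downloads) if isinstance(downloads, (int, float)) else 0
    title = (rec.get("metadata", {}) or {}).get("title")
    # Fall back to the version DOI when no concept DOI is present
    conceptdoi = rec.get("conceptdoi") or rec.get("doi")
    return {
        "zenodo_url": url,
        "downloads": downloads,
        "id": recid,
        "conceptdoi": conceptdoi,
        "title": title,
    }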

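rows = [record_to_row(r) for r in records]
# Name the columns explicitly so that an empty result still yields a table (and CSV) with headers
df = pd.DataFrame(rows, columns=["zenodo_url", "downloads", "id", "conceptdoi", "title"])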

# Remove duplicates (if any) by URL and sort by downloads
if not df.empty:
    df = df.drop_duplicates(subset=["zenodo_url"]).sort_values(by="downloads", ascending=False).reset_index(drop=True)

df.head(10)
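
The rule for missing download counts can be illustrated in isolation. The sketch below mirrors the check used in `record_to_row` in a hypothetical helper, `downloads_or_zero`, and applies it to made-up records that cover a present count, a float count, a `stats` field that is `None`, a missing `stats` field, and a count stored as a string.

```python
from typing import Any, Dict, List


def downloads_or_zero(rec: Dict[str, Any]) -> int:
    """Return stats.downloads as an int, or 0 when it is missing or not numeric."""
    stats = rec.get("stats") or {}
    value = stats.get("downloads")
    return int(value) if isinstance(value, (int, float)) else 0


# Made-up records covering the cases the rule has to handle.
samples: List[Dict[str, Any]] = [
    {"id": 1, "stats": {"downloads": 42}},
    {"id": 2, "stats": {"downloads": 7.0}},
    {"id": 3, "stats": None},
    {"id": 4},
    {"id": 5, "stats": {"downloads": "12"}},
]
counts = [downloads_or_zero(rec) for rec in samples]
print(counts)  # [42, 7, 0, 0, 0]
```

Note that a count delivered as a string is also mapped to 0 under this rule. Whether that is desirable depends on how the API serializes the field.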

## Save the summary as CSV

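Finally, we save the summary DataFrame to `zenodo_nfdi4bioimage_downloads.csv`. The path is relative, so the file is written to the working directory, which is the repository root when the notebook is run from there. The file contains the Zenodo URL and download count of each record in the community, along with the optional `id`, `conceptdoi`, and `title` columns.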

In [5]:
df.to_csv(OUTPUT_CSV, index=False)
OUTPUT_CSV

'zenodo_nfdi4bioimage_downloads.csv'
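
A quick check of the written file can confirm that the two required columns are present. The following sketch uses only the standard library. `missing_columns` is a hypothetical helper, and the CSV text (including the record URL in it) is made-up sample data rather than the notebook's actual output.

```python
import csv
import io
from typing import List

REQUIRED_COLUMNS = ["zenodo_url", "downloads"]


def missing_columns(csv_text: str, required: List[str]) -> List[str]:
    """Return the required column names that are absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    return [name for name in required if name not in header]


# Made-up CSV contents: one well-formed file and one empty file.
good = "zenodo_url,downloads,id\nhttps://zenodo.org/records/1,5,1\n"
empty = ""

good_missing = missing_columns(good, REQUIRED_COLUMNS)
empty_missing = missing_columns(empty, REQUIRED_COLUMNS)
print(good_missing)   # []
print(empty_missing)  # ['zenodo_url', 'downloads']
```

For the real file, the text could be read with `open(OUTPUT_CSV, encoding="utf-8").read()` before passing it to the helper.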