An R ecosystem for large-scale, on-disk access to OpenAlex — the open catalogue of global scholarly work.
OpenAlex provides free, comprehensive metadata on over 250 million scholarly works, authors, institutions, and concepts. The openalexPro ecosystem is built around a single design principle: process data on disk rather than in memory, so workflows scale to millions of records without hitting RAM limits.
openalexPro: core API client. Queries OpenAlex, pages through results, and stores everything in Parquet files for efficient downstream use.
openalexSnowball: snowball citation searches. Iteratively expands a seed set by following forward and backward citations across the graph, with configurable depth and deduplication.
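To make the snowball idea concrete, here is a minimal base-R sketch of a depth-limited, deduplicated expansion over a toy citation graph. It does not use the openalexSnowball API: the `snowball()` function, the `edges` data frame, and the work IDs are all hypothetical and exist only for illustration.

```r
# Toy citation graph: each row means `from` cites `to`. IDs are made up.
edges <- data.frame(
  from = c("W2", "W3", "W1", "W4", "W5"),
  to   = c("W1", "W1", "W0", "W2", "W4"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: expand `seeds` by following citations in both
# directions for at most `depth` rounds, never revisiting a work.
snowball <- function(seeds, edges, depth = 1L) {
  found    <- unique(seeds)  # everything collected so far
  frontier <- found          # works added in the previous round
  for (i in seq_len(depth)) {
    backward <- edges$to[edges$from %in% frontier]  # works cited by the frontier
    forward  <- edges$from[edges$to %in% frontier]  # works citing the frontier
    frontier <- setdiff(unique(c(backward, forward)), found)  # deduplicate
    if (length(frontier) == 0L) break  # nothing new to follow
    found <- c(found, frontier)
  }
  found
}

depth1 <- snowball("W1", edges, depth = 1L)
depth2 <- snowball("W1", edges, depth = 2L)
```

With `depth = 1` the seed `W1` picks up the work it cites (`W0`) and the works citing it (`W2`, `W3`); a second round adds `W4`, which cites `W2`. `W5` is three steps away and stays out.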
openalexConvert: format conversion. Exports a Parquet corpus to BibTeX, BibLaTeX, CSL JSON, Markdown, LaTeX, HTML, or PDF via Pandoc, bridging the openalexPro ecosystem with reference managers and publishing workflows.
openalexSnapshot: bulk snapshot tools. Converts the full OpenAlex JSON.GZ snapshot to Parquet, builds ID-lookup indexes, and extracts records by ID at scale. Powered by a compiled Rust back-end with a pure-R/DuckDB fallback.
openalexVectorComp: text embedding, cosine-distance scoring, and threshold calibration. Backend-neutral (HuggingFace, OpenAI, TEI); adds semantic similarity search to the ecosystem.
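Cosine-distance scoring against a threshold can be illustrated in a few lines of base R. This is a sketch of the general technique, not the openalexVectorComp API: the `cosine_distance()` helper, the three-dimensional "embeddings", and the threshold value are made up for the example, and a real threshold would come from calibration against labelled data.

```r
# Hypothetical helper: cosine distance = 1 - cosine similarity.
cosine_distance <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Made-up embeddings; real ones would come from an embedding backend.
reference  <- c(1, 0, 0)
candidates <- list(
  same       = c(2, 0, 0),   # same direction as the reference
  orthogonal = c(0, 1, 0),   # unrelated direction
  opposite   = c(-1, 0, 0)   # opposite direction
)

# Score every candidate against the reference vector.
scores <- vapply(candidates, cosine_distance, numeric(1), b = reference)

# Keep candidates at or below an assumed, hand-picked threshold.
threshold <- 0.5
keep <- names(scores)[scores <= threshold]
```

Distances range from 0 (same direction) to 2 (opposite direction), so only `same` passes the 0.5 cut-off here.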
Rust back-end: a compiled Rust CLI and library that powers openalexSnapshot's hot path (JSON→Parquet conversion, indexing, ID extraction). A pre-built static library is downloaded automatically when openalexSnapshot is installed, so most users need no manual Rust setup.
All R packages are available from the openalexPro r-universe:
install.packages(
  c("openalexPro", "openalexSnowball", "openalexConvert",
    "openalexSnapshot", "openalexVectorComp"),
  repos = c("https://openalexpro.r-universe.dev", "https://cloud.r-project.org")
)

- On-disk processing — results are paged and written to Parquet; memory use stays flat regardless of corpus size.
- Arrow / DuckDB throughout — all data manipulation uses columnar formats; SQL queries run in-process.
- Composable — each package has a single responsibility and speaks the same Parquet dialect, so they chain naturally.
- Rust where it matters — the bulk snapshot converter delegates hot-path work to a compiled Rust back-end, with a pure-R/DuckDB fallback.
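The on-disk pattern behind these principles can be sketched with the DBI and duckdb packages alone. The example below is an illustration under stated assumptions, not the ecosystem's own code: it calls no openalexPro functions, the "pages" are small made-up data frames standing in for API responses, and the directory, file, and table names are hypothetical. It assumes DBI and duckdb are installed.

```r
library(DBI)

# Hypothetical output directory for the Parquet pages.
out_dir <- file.path(tempdir(), "works_parquet")
dir.create(out_dir, showWarnings = FALSE)
out_dir <- normalizePath(out_dir, winslash = "/")

con <- dbConnect(duckdb::duckdb())  # in-process DuckDB, no server needed

# Stand-in for API paging: three small made-up pages of records.
# Only one page is held in memory at a time.
for (page in 1:3) {
  records <- data.frame(
    id   = sprintf("W%d", (page - 1) * 2 + 1:2),
    year = c(2020L, 2021L)
  )
  duckdb::duckdb_register(con, "page_records", records)
  path <- file.path(out_dir, sprintf("page_%03d.parquet", page))
  dbExecute(con, sprintf(
    "COPY (SELECT * FROM page_records) TO '%s' (FORMAT PARQUET)", path
  ))
  duckdb::duckdb_unregister(con, "page_records")
}

# Query all pages with SQL; DuckDB scans the files on disk and only the
# small aggregated result is returned to R.
by_year <- dbGetQuery(con, sprintf(
  "SELECT year, COUNT(*) AS n
   FROM read_parquet('%s/*.parquet')
   GROUP BY year ORDER BY year",
  out_dir
))

dbDisconnect(con, shutdown = TRUE)
```

Because each page is written out and released before the next one arrives, peak memory depends on the page size rather than on the size of the corpus, and later steps can read the same Parquet files without reloading them into R.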
Issues and pull requests are welcome on the individual package repositories. Please open an issue before starting large changes.
This ecosystem builds on the excellent openalexR package and the OpenAlex team's commitment to open scholarly infrastructure.
The packages are provided as is. The authors are not affiliated with OpenAlex.