openalexPro

Tools for working with OpenAlex data, through both the API and the snapshot

An R ecosystem for large-scale, on-disk access to OpenAlex — the open catalogue of global scholarly work.

OpenAlex provides free, comprehensive metadata on over 250 million scholarly works, authors, institutions, and concepts. The openalexPro ecosystem is built around a single design principle: process data on disk rather than in memory, so workflows scale to millions of records without hitting RAM limits.


R Packages

openalexPro · Lifecycle: maturing · License: GPL-2+

Core API client: queries OpenAlex, pages through results, and stores them in Parquet files for efficient downstream use. Because results are written to disk as they arrive, workflows scale to millions of records without exhausting RAM.
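
The paging-to-disk idea can be sketched as follows. This is not the openalexPro API: `pages_to_parquet()` and `openalex_fetcher()` are hypothetical names used for illustration, and the sketch assumes the arrow package (plus jsonlite for the network fetcher). It relies only on the public OpenAlex cursor paging convention, where the first request uses `cursor=*` and each response carries `meta.next_cursor`.

```r
library(arrow)

# Page through a result set and write each page to its own Parquet file,
# so only one page is held in memory at a time.
# `fetch_page(cursor)` must return
#   list(results = <data.frame>, next_cursor = <character or NULL>).
pages_to_parquet <- function(fetch_page, out_dir, max_pages = Inf) {
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  cursor <- "*"   # OpenAlex cursor paging starts with "*"
  page <- 0
  while (!is.null(cursor) && page < max_pages) {
    res <- fetch_page(cursor)
    if (NROW(res$results) == 0) break
    page <- page + 1
    write_parquet(
      res$results,
      file.path(out_dir, sprintf("page_%05d.parquet", page))
    )
    cursor <- res$next_cursor
  }
  page   # number of pages written
}

# Hypothetical fetcher for the public OpenAlex API (needs network and jsonlite).
# `select` keeps the columns flat so each page is a plain data frame.
openalex_fetcher <- function(filter) {
  function(cursor) {
    url <- paste0(
      "https://api.openalex.org/works?filter=",
      utils::URLencode(filter, reserved = TRUE),
      "&select=id,title,publication_year,cited_by_count",
      "&per-page=200&cursor=", utils::URLencode(cursor, reserved = TRUE)
    )
    body <- jsonlite::fromJSON(url)
    list(results = body$results, next_cursor = body$meta$next_cursor)
  }
}

# Offline demo: a fake two-page source stands in for the API.
fake_pages <- list(
  "*"  = list(results = data.frame(id = c("W1", "W2")), next_cursor = "c2"),
  "c2" = list(results = data.frame(id = "W3"), next_cursor = NULL)
)
out_dir <- file.path(tempdir(), "toy_corpus")
n_pages <- pages_to_parquet(function(cursor) fake_pages[[cursor]], out_dir)
n_rows  <- nrow(open_dataset(out_dir))
```

Passing the fetcher in as a function keeps the writing loop independent of the network, so it can be exercised offline; with a real connection, `pages_to_parquet(openalex_fetcher("publication_year:2020"), out_dir, max_pages = 2)` would follow the same path.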


openalexSnowball · Lifecycle: maturing · License: GPL-2+

Snowball citation searches — iteratively expand a seed set by following forward and backward citations across the graph. Supports configurable depth and deduplication.
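
As an illustration of the snowball idea, here is a minimal base-R sketch on a toy edge list. It does not use the openalexSnowball API; the function name `snowball()` and the `citing`/`cited` column names are assumptions made for this example.

```r
# Toy citation graph: each row means `citing` cites `cited`.
edges <- data.frame(
  citing = c("W2", "W3", "W4", "W1", "W5"),
  cited  = c("W1", "W1", "W2", "W0", "W4"),
  stringsAsFactors = FALSE
)

# Expand a seed set by following citations in both directions.
# `depth` is the number of expansion rounds; IDs are deduplicated each round.
snowball <- function(seed, edges, depth = 1) {
  found <- unique(seed)
  frontier <- found
  for (i in seq_len(depth)) {
    backward <- edges$cited[edges$citing %in% frontier]  # works the frontier cites
    forward  <- edges$citing[edges$cited %in% frontier]  # works citing the frontier
    new_ids  <- setdiff(unique(c(backward, forward)), found)
    if (length(new_ids) == 0) break                      # nothing new: stop early
    found    <- c(found, new_ids)
    frontier <- new_ids
  }
  found
}

one_hop <- snowball("W1", edges, depth = 1)
two_hop <- snowball("W1", edges, depth = 2)
```

Only the newly found IDs form the next frontier, so each round touches works that have not been expanded before.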


openalexConvert · Lifecycle: experimental · License: GPL-2+

Export a Parquet corpus to BibTeX, BibLaTeX, CSL JSON, Markdown, LaTeX, HTML, or PDF via Pandoc. Bridges the openalexPro ecosystem with reference managers and publishing workflows.


openalexSnapshot · Lifecycle: experimental · License: GPL-2+

Bulk snapshot tools — convert the full OpenAlex JSON.GZ snapshot to Parquet, build ID-lookup indexes, and extract records by ID at scale. Powered by a compiled Rust back-end with a pure-R/DuckDB fallback.


openalexVectorComp · Lifecycle: experimental · License: MIT

Text embedding, cosine-distance scoring, and threshold calibration — backend-neutral (HuggingFace, OpenAI, TEI). Adds semantic similarity search to the ecosystem.
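
Cosine-distance scoring with a threshold can be sketched in base R as below. The package's own functions are not shown here: `cosine_distance()`, the toy vectors, and the cut-off of 0.5 are illustrative assumptions, and real embeddings would come from one of the backends listed above.

```r
# Cosine distance between one query vector and each row of a matrix.
cosine_distance <- function(query, mat) {
  sims <- as.vector(mat %*% query) /
    (sqrt(rowSums(mat^2)) * sqrt(sum(query^2)))
  stats::setNames(1 - sims, rownames(mat))
}

# Toy 3-dimensional "embeddings", one row per document.
docs <- rbind(
  a = c(1, 0, 0),
  b = c(1, 1, 0),
  c = c(0, 0, 1)
)
query <- c(1, 0, 0)

d <- cosine_distance(query, docs)

# Hypothetical cut-off; in practice it would be calibrated on labelled examples.
threshold <- 0.5
keep <- names(d)[d <= threshold]
```

A distance of 0 means the vectors point in the same direction and 1 means they are orthogonal, so lowering the threshold makes the match stricter.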


Rust Core

Compiled Rust CLI and library powering openalexSnapshot's hot path (JSON→Parquet conversion, indexing, ID extraction). Downloaded automatically as a pre-built static library when installing openalexSnapshot — no manual Rust setup required for most users.


Installation

All R packages are available from the openalexPro r-universe:

install.packages(
  c("openalexPro", "openalexSnowball", "openalexConvert",
    "openalexSnapshot", "openalexVectorComp"),
  repos = c("https://openalexpro.r-universe.dev", "https://cloud.r-project.org")
)

Design principles

  • On-disk processing — results are paged and written to Parquet; memory use stays flat regardless of corpus size.
  • Arrow / DuckDB throughout — all data manipulation uses columnar formats; SQL queries run in-process.
  • Composable — each package has a single responsibility and speaks the same Parquet dialect, so they chain naturally.
  • Rust where it matters — the bulk snapshot converter delegates hot-path work to a compiled Rust back-end, with a pure-R/DuckDB fallback.
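
To make the "SQL queries run in-process" point concrete, here is a small sketch that queries a Parquet file with DuckDB from R. It assumes the DBI and duckdb packages; the file name and the columns `id`, `publication_year`, and `cited_by_count` are toy values chosen for this example, not a schema guaranteed by the packages.

```r
library(DBI)

# In-process DuckDB: no server, the engine runs inside the R session.
con <- dbConnect(duckdb::duckdb())

# Write a tiny Parquet file so the example is self-contained.
# In practice the files would come from one of the packages above.
parquet_file <- file.path(tempdir(), "toy_works.parquet")
dbExecute(con, sprintf(
  "COPY (SELECT * FROM (VALUES
           ('W1', 2019, 12),
           ('W2', 2021, 3),
           ('W3', 2021, 40)
         ) AS t(id, publication_year, cited_by_count))
   TO '%s' (FORMAT PARQUET)", parquet_file))

# Aggregate directly over the file; only the small result reaches R.
per_year <- dbGetQuery(con, sprintf(
  "SELECT publication_year,
          count(*)            AS n_works,
          sum(cited_by_count) AS citations
   FROM read_parquet('%s')
   GROUP BY publication_year
   ORDER BY publication_year", parquet_file))

dbDisconnect(con, shutdown = TRUE)
```

Because the aggregation happens inside DuckDB, R only receives one row per year, regardless of how many records the Parquet file holds.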

Contributing

Issues and pull requests are welcome on the individual package repositories. Please open an issue before starting large changes.

Acknowledgements

This ecosystem builds on the excellent openalexR package and the OpenAlex team's commitment to open scholarly infrastructure.

Disclaimer

The packages are provided as is. The authors are not affiliated with OpenAlex.
