opencltools is a developer toolkit for R packages that want to accelerate
embarrassingly parallel computations using OpenCL-capable GPUs. It provides
the runtime plumbing --- device probing, kernel source loading, library
subsetting, program assembly helpers, and fp64 capability management --- so
package authors can focus on writing their GPU kernels rather than
re-implementing the same infrastructure layer each time.
opencltools is library-agnostic. It works with any collection of
OpenCL-ready .cl source files, whether those are a full port of an existing
C library, a partial port of selected routines, or kernels written from
scratch. There is no dependency on any particular ported library.
nmathopencl is the first
first packaged example of such a library: an OpenCL-C port of R's Mathlib
(nmath) packaged for reuse. It is a natural companion when your kernel
needs statistical functions, but it is one instance of a pattern that applies
equally to any other C library you choose to port --- numerical linear
algebra, signal processing, simulation, or domain-specific code. The
opencltools infrastructure is the same regardless of what library is on the
other end.
Many algorithms are embarrassingly parallel: hundreds or thousands of independent evaluations with no data dependency between them. The evaluations are bottlenecks not because the math is hard, but because CPU-sequential code cannot saturate modern hardware when the workload is large. OpenCL lets you dispatch that work to a GPU, evaluating all points simultaneously.
The obstacle is that the computation inside each parallel evaluation often
depends on an existing C library --- R's statistical math (nmath), a
linear algebra routine, a domain-specific simulation function --- that was
written for host-sequential execution. A GPU kernel cannot call a host
library directly.
The general solution is to port the required library to OpenCL C,
distribute the ported sources as a package, and load them at runtime alongside
your own kernel code. opencltools provides the infrastructure for that
runtime step: loading, ordering, subsetting, and assembling .cl source
files from any such ported library into a complete OpenCL program.
The canonical example is Bayesian GLM sampling via likelihood-subgradient
envelopes in glmbayes, where the bottleneck is gradient evaluation across
a large parameter grid. The inner math requires statistical functions from
R's nmath, ported to OpenCL C by nmathopencl. But the same pattern
applies to any package where a C library bottleneck can be parallelized:
port the library once, distribute it, and use opencltools to assemble
programs that consume it.
| Package | Role |
|---|---|
opencltools |
Runtime plumbing: device probing, kernel loading, library subsetting, program build helpers. Library-agnostic — works with any ported .cl library. |
nmathopencl |
Example ported library. OpenCL-C ports of R's nmath (>130 statistical functions), distributed as .cl files with dependency annotations (shipped with the package). The first of potentially many such libraries. |
glmbayes |
Reference downstream package. Bayesian GLM sampling with optional GPU acceleration of envelope gradient evaluation via f2_f3_opencl, using nmathopencl as its ported library and opencltools as the loader layer. |
Any other ported C library can occupy the role that nmathopencl plays here.
The opencltools infrastructure does not know or care what library it is
loading — it reads annotated .cl files, resolves dependency order, and
returns concatenated source strings.
Before assembling and compiling an OpenCL program, confirm the runtime environment:
library(opencltools)
has_opencl() # TRUE if this build was compiled with OpenCL
opencl_fp64_available() # Is double-precision (cl_khr_fp64) working?
opencl_device_info() # Cached device/driver metadata
gpu_names() # NVIDIA GPU names via nvidia-smi (Linux)
get_opencl_core_count() # Total compute units across GPU devices
verify_opencl_runtime() # Broader sanity check (ICD, driver, …)
check_runtime_env() # Workstation-level environment diagnosis
detect_compute_runtimes() # Enumerate CUDA, ROCm, OpenCL runtimesHost-side checks (detect_environment_and_gpus(), gpu_names(),
detect_compute_runtimes(), check_runtime_env()) do not require OpenCL to
be compiled into the package or present on the machine.
Before GPU dispatch, call has_opencl() and opencl_fp64_available() at
session start. has_opencl() is TRUE only when this build of
opencltools was compiled with OpenCL support (USE_OPENCL), not merely when
a GPU is attached. If either returns FALSE, diagnose with
verify_opencl_runtime() or detect_compute_runtimes() before
clBuildProgram. Driver issues caught early are easier to fix than failures
buried inside kernel compilation.
Load individual .cl files or entire annotated subdirectories into strings
ready for clCreateProgramWithSource. This step is file I/O only --- it
does not call the OpenCL driver or GPU. It works on every build of
opencltools, including CPU-only CRAN binaries. The package argument
names whichever installed package ships the .cl tree (opencltools,
nmathopencl, your own package, etc.):
# Single shard (example from opencltools inst/cl)
src <- load_kernel_source("nmath/bd0.cl", package = "opencltools")
# Full annotated library in dependency order
lib_src <- load_kernel_library("nmath", package = "opencltools")If no text could be read, the functions return "" and emit a message().
Missing paths still raise an error. Use has_opencl() when you need to know
whether this binary can run downstream GPU code, not to gate loading.
The C++ equivalents (openclPort::load_kernel_source,
openclPort::load_kernel_library, openclPort::load_library_for_kernel) are
declared in opencltools/openclPort.h via LinkingTo: opencltools. Add
opencltools::opencltoolsLdFlags() to PKG_LIBS so symbols resolve from this
package's shared library. OpenCL C API headers (CL/cl.h, etc.) are not
bundled here; use your own configure / SDK paths when compiling with
USE_OPENCL (see inst/include/README.md).
load_library_for_kernel() reads the dependency annotation on your launcher
.cl file and concatenates only the library shards that kernel actually
needs, rather than the entire library. This reduces first-call just-in-time (JIT)
compilation time and keeps the program source small. It works with any
annotated .cl library --- not just nmath:
lib_dir <- system.file("cl/mylib", package = "my_ported_lib")
kernel_path <- "path/to/my_kernel.cl"
src <- load_library_for_kernel(
kernel_path, lib_dir,
depends_tag = "all_depends_mylib"
)
print(src) # S3: nmathopencl_concatenated_lib (stems, size; not full source)Returns a character vector with class nmathopencl_concatenated_lib and
attributes listing requested/loaded stems. extract_library_subset() yields
nmathopencl_lib_extract_df for shipping a copied subset.
The C++ equivalent openclPort::load_library_for_kernel(...) is provided for
use inside kernel runner code where calling back into R from C++ is
undesirable (see § Kernel runners and wrappers below).
configureOpenCL() (C++ only) compiles tiny test kernels against a live
device to determine whether expm1 and log1p are available as native
device built-ins, and returns a buildOptions string (-DHAVE_EXPM1=1 -DHAVE_LOG1P=1) to pass to clBuildProgram. This is useful for any ported
library whose code uses platform-specific fast-path branches for those
functions.
For downstream developers annotating their own kernels against a pre-annotated library:
| Function | Purpose |
|---|---|
attach_kernel_call_tags() |
Step 1 — scan your kernel source, match calls against a library's @provides list, write @calls_<tag> and @depends_<tag> into the kernel files. No manual tagging needed. |
attach_cross_library_tags() |
Step 2 — read @depends_<tag>, compute the full transitive closure against the library index, write @all_depends_<tag> back into the kernel files |
load_library_for_kernel() |
At runtime (or interactively), read @all_depends_<tag> from a kernel file and concatenate only the library shards it needs, in dependency order. Emits warnings for known-problematic stems. |
extract_library_subset() |
Materialize a kernel-specific subset into a local directory (for packages that want to ship their own copy of the needed shards) |
write_kernel_dependency_index() |
Regenerate kernel_dependency_index.rds / .tsv after updating a library tree |
Exported functions and S3 print methods document return types and meaning in
help pages (?load_kernel_source, ?gpu_diagnostics, ?kernel_lib_subset_printing,
etc.): list structure for diagnostics, plain character vs nmathopencl_* classes
for loaders, and explicit side-effect-only wording for print() methods and
opencl_reset_device_selection().
Package examples are written so CRAN checks exercise real code without requiring a GPU (see also the CRAN cookbook on examples):
| Runs on every check | Stays in \donttest{} (slow full nmath only) |
|---|---|
Host diagnostics (detect_*, check_runtime_env) |
Re-indexing all ~137 shards in cl/nmath |
load_kernel_source() / load_kernel_library() |
Full-library demos in subset/tagging examples |
OpenCL probes in gpu_diagnostics (stubs on CPU-only builds) |
|
Tagging, indexing, subsetting on inst/cl/nmath_small/ |
Loaders and host checks do not use if (has_opencl()) in examples. Optional
OpenCL at compile time is probed with has_opencl() in your package before
GPU dispatch, not before reading .cl files.
Interactive demos of the full nmath library:
example(load_library_for_kernel, run.dontest = TRUE)
devtools::run_examples(run_donttest = TRUE)The general workflow for adding optional GPU acceleration to an R package:
if (!opencltools::has_opencl() || !opencltools::opencl_fp64_available()) {
message("OpenCL not available — using CPU path.")
use_opencl <- FALSE
} else {
use_opencl <- TRUE
}Pre-annotated libraries like nmathopencl already carry full dependency
metadata. You do not need to annotate the library or manually declare which
functions you call. Just write your kernel:
// @library_deps: nmath
__kernel void my_kernel(__global double* x, ...) {
double v = dgamma(x[get_global_id(0)], shape, scale, 0);
...
}The only line you add is // @library_deps: nmath to tell the tooling which
library to scan against. Two calls then handle everything else.
Step 1 — scan source and tag direct calls:
attach_kernel_call_tags reads the library's @provides annotations, scans
your kernel source for matching function calls, and writes
@calls_nmath, @depends_nmath, and @calls_opencl_builtin automatically:
nmath_dir <- system.file("cl/nmath", package = "nmathopencl")
attach_kernel_call_tags(
kernel_paths = list.files("inst/cl/src", "\\.cl$", full.names = TRUE),
library_dir = nmath_dir,
library_tag = "nmath"
)
# writes @calls_nmath and @depends_nmath by scanning your sourceStep 2 — expand to full transitive closure:
attach_cross_library_tags reads the @depends_nmath written in step 1,
walks the pre-built library index, and writes @all_depends_nmath — the
complete ordered list of every library shard the kernel needs:
attach_cross_library_tags(
kernel_paths = list.files("inst/cl/src", "\\.cl$", full.names = TRUE),
library_dir = nmath_dir,
depends_tag = "depends_nmath"
)
# writes @all_depends_nmath — nothing else to doRe-run both steps whenever you edit your kernel and add or remove library
calls. Both functions accept any pre-annotated library: change library_dir,
library_tag, and depends_tag to match the library's conventions.
Before wiring the kernel into production code, verify that the functions it
needs have been ported and are likely to work. opencltools maintains a
curated opencl_known_failures.json and surfaces warnings automatically when
you call load_library_for_kernel:
src <- load_library_for_kernel(
kernel_path, nmath_dir,
depends_tag = "all_depends_nmath"
)
# warnings fire automatically for any stems with known portability issuesOnce the warnings are clean, load_library_for_kernel is ready to use in
your C++ runner (see step 3).
Inside your kernel runner (a .cpp file in your package's src/). The
package argument to each loader call names whichever installed package
ships the .cl files --- substitute your own ported library for
"nmathopencl" below:
#include <opencltools/openclPort.h> // via LinkingTo: opencltools
// One-time program assembly (cache the result across calls)
std::string build_my_program(const std::string& package) {
using namespace openclPort;
// Load layers from your ported library package
// (shown here using nmathopencl as an example)
return
load_kernel_source("OPENCL.cl", "nmathopencl") + "\n" +
load_kernel_library("libR_shims", "nmathopencl") + "\n" +
load_kernel_library("R_ext_types", "nmathopencl") + "\n" +
load_kernel_library("R_shims", "nmathopencl") + "\n" +
load_kernel_library("R_ext_runtime", "nmathopencl") + "\n" +
load_kernel_library("R_ext_internals", "nmathopencl") + "\n" +
load_kernel_library("System", "nmathopencl") + "\n" +
// Subset only the shards this kernel needs
load_library_for_kernel(
"src/my_kernel.cl", "nmath", "nmathopencl", "all_depends_nmath") + "\n" +
// Your own kernel entry point
load_kernel_source("src/my_kernel.cl", package);
}Pass build_my_program(...) to clCreateProgramWithSource and compile once.
The loader calls are entirely symmetrical for any other ported library ---
just change the package names and subdirectory paths.
The kernel runner handles the raw OpenCL API calls: create context,
compile program, set arguments, dispatch, read back results. Keep it in C++.
The kernel wrapper is the Rcpp-exported entry point that R code calls; it
receives standard R objects, flattens them to vectors (using
openclPort::flattenMatrix / openclPort::copyVector), invokes the runner,
and returns results as Rcpp objects.
glmbayes demonstrates this pattern cleanly:
kernel_runners.cpp—f2_f3_kernel_runner(...): raw OpenCL dispatch, no R objectskernel_wrappers.cpp—f2_f3_opencl(...): Rcpp-facing wrapper that calls the runner, exported via[[Rcpp::export]]EnvelopeEval.cpp— callsf2_f3_opencl(family, link, G4, ...)whenuse_opencl = TRUE, andf2_f3_non_opencl(...)otherwise
The use_opencl flag is the key to graceful degradation. Every entry point
that dispatches to a GPU should have a CPU fallback and accept a use_opencl
argument. When OpenCL is unavailable (no ICD, no fp64, driver fault), the
code transparently routes to the CPU path. Machines without a GPU install and
run the package without any changes to user code.
// in EnvelopeEval.cpp (simplified from `glmbayes`)
if (use_opencl) {
prepGrad = f2_f3_opencl(family, link, G4, y, x, mu, P, alpha, wt, progbar);
} else {
prepGrad = f2_f3_non_opencl(family, link, G4, y, x, mu, P, alpha, wt, progbar);
}glmbayes provides the most complete example of opencltools + nmathopencl
in production use. Its GPU acceleration path for Bayesian GLM sampling
illustrates every step described above.
Bayesian posterior sampling via likelihood-subgradient envelopes requires
constructing a piecewise-linear bound on the log-posterior. That construction
involves evaluating the negative log-posterior and its gradient vector at
every point of a grid over parameter space. For a model with p predictors,
the grid can have O(3^p) faces; for p = 14 that is thousands of
independent evaluations per MCMC draw.
Each evaluation is entirely independent of the others --- exactly the
structure that makes GPU dispatch valuable. The inner math requires
lgamma, lbeta, dbinom_raw, dgamma, and pnorm5 depending on the
GLM family and link. glmbayes sources these from nmathopencl via
opencltools loaders.
load_likelihood_subgradient_program(family, link, package) in
glmbayes/src/kernel_loader.cpp assembles the complete OpenCL program source
for a given GLM family and link function in this fixed layer order:
1. OPENCL.cl — fp64 extension, IEEE constants, INLINE macro
2. libR_shims/ — R_pow, R_pow_di, R_CheckUserInterrupt shims
3. R_ext_types/ — SEXP, Rboolean, type aliases
4. R_shims/ — additional R API shims
5. R_ext_runtime/ — memory / error / I/O interface
6. R_ext_internals/ — R internal extension definitions
7. System/ — system-level OpenCL prelude
8. nmath/ (subset only) — only the stems needed by this kernel
9. src/f2_f3_<family>.cl — the __kernel entry point
Steps 1–8 are sourced from nmathopencl via openclPort::load_kernel_source
and openclPort::load_library_for_kernel. Step 9 is glmbayes-specific.
EnvelopeEval() accepts a use_opencl flag. When TRUE, it dispatches via
the GPU runner; when FALSE (or when OpenCL is not present), it calls the
equivalent CPU path f2_f3_non_opencl(). The calling code in EnvelopeBuild
checks has_opencl() and opencl_fp64_available() before setting the flag.
Users on machines without a GPU experience no difference in API surface.
The first call to f2_f3_opencl in a session triggers just-in-time (JIT) compilation of the
assembled source by the OpenCL driver. For a program that includes substantial
nmath content, this can take several seconds. Subsequent calls reuse the
compiled kernel; the marginal overhead drops to context setup, buffer
transfer, and dispatch --- small relative to the computation for large grids.
The speedup over the CPU path grows with model dimension because more grid
points are evaluated simultaneously.
install.packages(
"opencltools",
repos = c("https://knygren.r-universe.dev",
"https://cloud.r-project.org")
)The package installs on CPU-only systems. Reading and assembling .cl
sources works without OpenCL headers at build time. Optional OpenCL at
compile time enables has_opencl() == TRUE, device probes, and fp64 selection
for packages that compile and run kernels on a GPU (NVIDIA, AMD, Intel, etc.).
Downstream code should use a use_opencl flag (or equivalent) and probe
has_opencl() / opencl_fp64_available() before dispatch, not before loading
kernel text.
Vignettes are numbered so they appear in the correct order in the package
index, following the Chapter-NN convention used by glmbayes.
| Vignette file | Status | Title |
|---|---|---|
Chapter-01 |
✓ | Getting started — Setting up OpenCL, verifying the runtime, first kernel load |
Chapter-02 |
✓ | Using a ported library — Annotating kernels, assembling programs, subsetting a .cl library (nmathopencl as the worked example) |
Chapter-03 |
✓ | Kernel runners and wrappers — The runner/wrapper pattern from glmbayes, graceful fallback |
Chapter-00 |
planned | Introduction — What opencltools is and when to use it |
Chapter-04 |
planned | Testing and parity validation — Verifying numerical accuracy of ported device code before production use |
Chapter-A01 |
planned | Appendix: Kernel annotation in depth — attach_kernel_call_tags, attach_cross_library_tags, dependency closure, @depends vs @all_depends, index format |
Chapter-A02 |
planned | Appendix: Shipping a library subset — Using extract_library_subset and write_kernel_dependency_index to ship a minimal shard tree with your package |
Nygren, K. N. (2026). opencltools: OpenCL Tools for R Package Developers.
R package. Use citation("opencltools") for BibTeX and a layered set of
related entries (methodology, OpenCL, glmbayes).
The vignettes and ex_glmbayes_* example kernels evaluate likelihood
subgradient quantities used in envelope-based accept–reject sampling. That
statistical construction is from:
Nygren, K. N., & Nygren, L. M. (2006). Likelihood subgradient densities. Journal of the American Statistical Association, 101(475), 1144–1156. https://doi.org/10.1198/016214506000000357
Cite this paper when your work uses that envelope/subgradient computation, not when you only use the generic OpenCL loading and annotation tools.
Stone, J. E., Gohara, D., & Shi, G. (2010). OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems. IEEE Computing in Science & Engineering, 12(3), 66–72. https://doi.org/10.1109/MCSE.2010.69
Khronos OpenCL Working Group. The OpenCL Specification and The OpenCL C Specification. https://www.khronos.org/opencl/
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.
nmathopencl— example ported statistical library used in vignettesglmbayes— reference downstream application and source of the runner/wrapper pattern; portions of the runtime layer were adapted during the split fromglmbayes