mdbplyr

A native tidy lazy backend for MongoDB in R.

Overview

mdbplyr provides a disciplined dplyr-style interface for read-only analytical MongoDB queries. Queries stay lazy, compile into MongoDB aggregation pipelines, and only execute at collect().

The package is intentionally conservative:

it targets MongoDB aggregation pipelines directly,
it does not emulate SQL or extend dbplyr,
it fails explicitly for unsupported semantics,
it avoids silent client-side fallback.

Install

The package is still in its initial development phase. While it is under testing and not available on CRAN, you can install it with:

install.package("devtools")
devtools::install_github("pbosetti/mdbplyr", build_vignettes=TRUE)

Current implemented subset

Core objects and terminals

mongo_src()
tbl_mongo()
collect()
cursor()
show_query()
schema_fields()
append_stage()

Supported verbs

filter()
select()
rename()
mutate()
transmute()
arrange()
group_by()
summarise()
slice_head()
slice_tail()
head()

Supported expressions

field references, including backticked dot paths such as `user.age`,
scalar literals,
comparison operators, including %in%,
boolean operators,
arithmetic operators, including %% and ^,
abs(), sqrt(), log(), log10(), exp(), floor(), ceiling(), trunc(), round(),
sin(), cos(), tan(), asin(), acos(), atan(), atan2(),
sinh(), cosh(), tanh(), asinh(), acosh(), atanh(),
pmin(), pmax(),
tolower(), toupper(), nchar(), paste(), paste0(), substr(), substring(),
if_else(),
case_when(),
is.na(),
1:n() in mutate() / transmute() for row numbering,
n(), sum(), mean(), min(), max().

Current limits

select() and rename() currently support only explicit bare field names,
mutate() and transmute() require named expressions and otherwise support scalar expressions except for the special 1:n() row-numbering case,
group_by() supports bare field names only,
summarise() supports only the documented aggregate functions,
joins, window functions, across(), reshaping, and write operations are out of scope.

Example

library(mdbplyr)
library(dplyr)

orders <- tbl_mongo(
  collection = mongolite::mongo(collection = "orders", db = "analytics"),
  schema = c("customer", "amount", "status")
)

query <- orders |>
  filter(status == "paid", amount > 0) |>
  mutate(double_amount = amount * 2) |>
  group_by(customer) |>
  summarise(total = sum(double_amount), n = n()) |>
  arrange(desc(total)) |>
  slice_head(n = 10)

show_query(query)
result <- collect(query)

iter <- cursor(query)
first_page <- iter$page(5)

When field metadata is not discoverable from the collection object, pass schema = ... to tbl_mongo() so that projection and rename operations can stay explicit and lazy.

Project Goal

The package should provide:

lazy query composition,
tidy evaluation,
translation of supported verbs into MongoDB aggregation stages,
query inspection,
explicit and predictable failure for unsupported operations.

The package should not aim to deliver full dplyr compatibility over arbitrary MongoDB collections.

Design Position

This project should be framed as:

a native tidy lazy analytical backend for MongoDB.

It should not be framed as:

a complete dbplyr equivalent for MongoDB.

MongoDB documents are not rectangular SQL tables. Nested fields, arrays, missing keys, heterogeneous schemas, and document-oriented semantics require a backend that is native to MongoDB rather than adapted from SQL assumptions.

Support matrix

Capability	Status	Notes
Lazy query state	Supported	Verbs update internal IR only
Pipeline inspection	Supported	`show_query()` renders compiled JSON
Flat-field filters	Supported	Uses `$match` + `$expr`
Projection and rename	Supported with caveats	Explicit bare field names only
Scalar mutation	Supported	Conservative expression subset
Grouped summaries	Supported	`n()`, `sum()`, `mean()`, `min()`, `max()`
Dot-path fields	Supported with caveats	Use backticked names such as `user.age`
Manual pipeline stage append	Supported with caveats	`append_stage()` appends raw JSON after generated stages and does not infer schema changes
Joins/window functions	Not supported	Explicitly out of scope
Client-side fallback	Not supported	Unsupported features error clearly

Core Technical Principles

1. Native MongoDB translation

Do not generate SQL. Do not emulate SQL. Translate directly into MongoDB aggregation pipelines.

2. Internal intermediate representation

Introduce a package-specific internal query representation between the user API and the pipeline compiler.

This is a core design requirement. It allows:

better testing,
better diagnostics,
cleaner compiler logic,
easier future extension.

3. Lazy semantics

All supported verbs should update query state, not execute immediately.

Execution should occur only at terminal steps such as collect().

4. Explicit failure

Unsupported operations should fail with precise diagnostics. The package should not silently pull data locally and continue computation unless that behavior is deliberately introduced later as an opt-in mode.

5. Conservative semantics

Ambiguous cases should be handled conservatively and documented explicitly, especially for:

missing fields,
NULL / NA behavior,
heterogeneous field types,
nested document paths,
ordering assumptions.

Expected Translation Model

Typical verb mappings are:

filter() -> $match
select() -> $project
mutate() -> $addFields or $project
arrange() -> $sort
group_by() + summarise() -> $group
slice_head() / head() -> $limit or array-slicing stages for negative n
slice_tail() -> array-slicing stages

This mapping should be documented, inspectable, and testable.

Author

Paolo Bosetti, University of Trento, Department of Industrial Engineering https://ror.org/05trd4x28

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
.github/workflows		.github/workflows
R		R
build		build
man		man
mdbplyr		mdbplyr
media		media
tests		tests
vignettes		vignettes
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
CRAN-SUBMISSION		CRAN-SUBMISSION
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
NAMESPACE		NAMESPACE
README.md		README.md
cran-comments.md		cran-comments.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

mdbplyr

Overview

Install

Current implemented subset

Core objects and terminals

Supported verbs

Supported expressions

Current limits

Example

Project Goal

Design Position

Support matrix

Core Technical Principles

1. Native MongoDB translation

2. Internal intermediate representation

3. Lazy semantics

4. Explicit failure

5. Conservative semantics

Expected Translation Model

Author

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

mdbplyr

Overview

Install

Current implemented subset

Core objects and terminals

Supported verbs

Supported expressions

Current limits

Example

Project Goal

Design Position

Support matrix

Core Technical Principles

1. Native MongoDB translation

2. Internal intermediate representation

3. Lazy semantics

4. Explicit failure

5. Conservative semantics

Expected Translation Model

Author

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages