A native tidy lazy backend for MongoDB in R.
mdbplyr provides a disciplined dplyr-style interface for read-only analytical MongoDB queries. Queries stay lazy, compile into MongoDB aggregation pipelines, and only execute at collect().
The package is intentionally conservative:
- it targets MongoDB aggregation pipelines directly,
- it does not emulate SQL or extend dbplyr,
- it fails explicitly for unsupported semantics,
- it avoids silent client-side fallback.
The package is still in its initial development phase. While it is under testing and not available on CRAN, you can install it with:
install.packages("devtools")
devtools::install_github("pbosetti/mdbplyr", build_vignettes = TRUE)
The main functions are:
- mongo_src()
- tbl_mongo()
- collect()
- cursor()
- show_query()
- schema_fields()
- append_stage()
The supported verbs are:
- filter()
- select()
- rename()
- mutate()
- transmute()
- arrange()
- group_by()
- summarise()
- slice_head()
- slice_tail()
- head()
The supported expression subset covers:
- field references, including backticked dot paths such as `user.age`,
- scalar literals,
- comparison operators, including %in%,
- boolean operators,
- arithmetic operators, including %% and ^,
- abs(), sqrt(), log(), log10(), exp(), floor(), ceiling(), trunc(), round(),
- sin(), cos(), tan(), asin(), acos(), atan(), atan2(), sinh(), cosh(), tanh(), asinh(), acosh(), atanh(),
- pmin(), pmax(),
- tolower(), toupper(), nchar(), paste(), paste0(), substr(), substring(),
- if_else(), case_when(),
- is.na(),
- 1:n() in mutate()/transmute() for row numbering,
- n(), sum(), mean(), min(), max().
Current limitations:
- select() and rename() currently support only explicit bare field names,
- mutate() and transmute() require named expressions and otherwise support scalar expressions except for the special 1:n() row-numbering case,
- group_by() supports bare field names only,
- summarise() supports only the documented aggregate functions,
- joins, window functions, across(), reshaping, and write operations are out of scope.
library(mdbplyr)
library(dplyr)
orders <- tbl_mongo(
  collection = mongolite::mongo(collection = "orders", db = "analytics"),
  schema = c("customer", "amount", "status")
)
query <- orders |>
  filter(status == "paid", amount > 0) |>
  mutate(double_amount = amount * 2) |>
  group_by(customer) |>
  summarise(total = sum(double_amount), n = n()) |>
  arrange(desc(total)) |>
  slice_head(n = 10)
show_query(query)
result <- collect(query)
iter <- cursor(query)
first_page <- iter$page(5)

When field metadata is not discoverable from the collection object, pass schema = ... to tbl_mongo() so that projection and rename operations can stay explicit and lazy.
The package should provide:
- lazy query composition,
- tidy evaluation,
- translation of supported verbs into MongoDB aggregation stages,
- query inspection,
- explicit and predictable failure for unsupported operations.
The package should not aim to deliver full dplyr compatibility over arbitrary MongoDB collections.
This project should be framed as:
a native tidy lazy analytical backend for MongoDB.
It should not be framed as:
a complete dbplyr equivalent for MongoDB.
MongoDB documents are not rectangular SQL tables. Nested fields, arrays, missing keys, heterogeneous schemas, and document-oriented semantics require a backend that is native to MongoDB rather than adapted from SQL assumptions.
| Capability | Status | Notes |
|---|---|---|
| Lazy query state | Supported | Verbs update internal IR only |
| Pipeline inspection | Supported | show_query() renders compiled JSON |
| Flat-field filters | Supported | Uses $match + $expr |
| Projection and rename | Supported with caveats | Explicit bare field names only |
| Scalar mutation | Supported | Conservative expression subset |
| Grouped summaries | Supported | n(), sum(), mean(), min(), max() |
| Dot-path fields | Supported with caveats | Use backticked names such as `user.age` |
| Manual pipeline stage append | Supported with caveats | append_stage() appends raw JSON after generated stages and does not infer schema changes |
| Joins/window functions | Not supported | Explicitly out of scope |
| Client-side fallback | Not supported | Unsupported features error clearly |
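The grouped-summary row can be illustrated with a small base R sketch. This is not mdbplyr's implementation: `translate_aggregate()` and `group_stage()` are hypothetical helper names, and the sketch assumes that the documented aggregate functions map onto the standard MongoDB accumulators `$sum`, `$avg`, `$min`, and `$max`, with n() expressed as a sum of ones.

```r
# Assumed mapping from the documented aggregate functions to $group accumulators.
accumulators <- c(sum = "$sum", mean = "$avg", min = "$min", max = "$max")

# Translate one aggregate call such as sum(amount) or n() into an accumulator.
translate_aggregate <- function(expr) {
  stopifnot(is.call(expr), is.symbol(expr[[1]]))
  fn <- as.character(expr[[1]])
  if (identical(fn, "n") && length(expr) == 1L) {
    return(list(`$sum` = 1L))  # n() counts documents in the group
  }
  if (fn %in% names(accumulators) && length(expr) == 2L && is.symbol(expr[[2]])) {
    field <- paste0("$", as.character(expr[[2]]))
    return(setNames(list(field), accumulators[[fn]]))
  }
  # Anything else is rejected instead of being computed locally.
  stop(sprintf("Unsupported aggregate: %s", deparse(expr)[1]), call. = FALSE)
}

# Build a $group stage for a single grouping field and named summaries.
group_stage <- function(by, ...) {
  aggs <- as.list(substitute(list(...)))[-1]
  if (is.null(names(aggs)) || any(names(aggs) == "")) {
    stop("All summaries must be named.", call. = FALSE)
  }
  body <- c(list(`_id` = paste0("$", by)), lapply(aggs, translate_aggregate))
  list(`$group` = body)
}

stage <- group_stage("customer", total = sum(double_amount), n = n())
str(stage)
```

The resulting nested list has the shape of a `$group` stage with `_id` set to `"$customer"`. An unsupported call such as median(amount) raises an error rather than falling back to client-side computation.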
Do not generate SQL. Do not emulate SQL. Translate directly into MongoDB aggregation pipelines.
Introduce a package-specific internal query representation between the user API and the pipeline compiler.
This is a core design requirement. It allows:
- better testing,
- better diagnostics,
- cleaner compiler logic,
- easier future extension.
All supported verbs should update query state, not execute immediately.
Execution should occur only at terminal steps such as collect().
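To make the idea of lazy query state concrete, here is a minimal base R sketch. All names (`new_lazy_query()`, `add_op()`, `toy_filter()`, `toy_arrange()`, `toy_collect()`) are hypothetical and do not reflect mdbplyr's actual internal representation. The sketch only shows the pattern: verbs record unevaluated expressions, and work happens at the terminal step.

```r
# Toy lazy query object; hypothetical names, not mdbplyr's internal IR.
new_lazy_query <- function(collection) {
  structure(list(collection = collection, ops = list()), class = "toy_lazy_query")
}

# Append one operation to the query state and return the updated query.
add_op <- function(query, verb, args) {
  query$ops <- c(query$ops, list(list(verb = verb, args = args)))
  query
}

# Verbs capture their arguments unevaluated, so nothing is computed here.
toy_filter <- function(query, ...) {
  add_op(query, "filter", as.list(substitute(list(...)))[-1])
}
toy_arrange <- function(query, ...) {
  add_op(query, "arrange", as.list(substitute(list(...)))[-1])
}

# Terminal step: the only place where an executor is invoked.
# The executor is injected so the sketch runs without a MongoDB server.
toy_collect <- function(query, executor) {
  executor(query)
}

q <- new_lazy_query("orders")
q <- toy_filter(q, status == "paid", amount > 0)  # status and amount are never evaluated
q <- toy_arrange(q, amount)

# A stand-in executor that just reports the recorded verbs.
recorded <- toy_collect(q, function(query) {
  vapply(query$ops, function(op) op$verb, character(1))
})
```

Because the recorded operations are plain R data, they can be inspected and tested without a database connection, which is the benefit the intermediate representation is meant to provide.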
Unsupported operations should fail with precise diagnostics. The package should not silently pull data locally and continue computation unless that behavior is deliberately introduced later as an opt-in mode.
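The following base R sketch shows one way such explicit failure could look when translating filter expressions into a `$match` stage with `$expr`. It is an illustration under stated assumptions, not the package's compiler: `op_table`, `translate_expr()`, and `translate_filter()` are hypothetical names, and only a handful of binary operators are covered.

```r
# Assumed mapping from R binary operators to MongoDB aggregation operators.
op_table <- c(
  "==" = "$eq", "!=" = "$ne", ">" = "$gt", ">=" = "$gte",
  "<" = "$lt", "<=" = "$lte", "&" = "$and", "|" = "$or",
  "+" = "$add", "-" = "$subtract", "*" = "$multiply", "/" = "$divide"
)

# Recursively translate an unevaluated R expression into a nested list.
translate_expr <- function(expr) {
  if (is.symbol(expr)) {
    return(paste0("$", as.character(expr)))  # field reference, e.g. "$amount"
  }
  if (is.atomic(expr) && length(expr) == 1L) {
    return(expr)                             # scalar literal
  }
  if (is.call(expr) && is.symbol(expr[[1]])) {
    fn <- as.character(expr[[1]])
    if (identical(fn, "(")) {
      return(translate_expr(expr[[2]]))      # parentheses only group
    }
    if (fn %in% names(op_table) && length(expr) == 3L) {
      args <- unname(lapply(as.list(expr)[-1], translate_expr))
      return(setNames(list(args), op_table[[fn]]))
    }
    # Explicit failure: no silent client-side fallback.
    stop(sprintf("Unsupported function in query expression: %s()", fn), call. = FALSE)
  }
  stop("Unsupported expression type in query.", call. = FALSE)
}

# Combine several filter conditions into one $match stage using $expr and $and.
translate_filter <- function(...) {
  conds <- unname(lapply(as.list(substitute(list(...)))[-1], translate_expr))
  list(`$match` = list(`$expr` = list(`$and` = conds)))
}

match_stage <- translate_filter(status == "paid", amount > 0)
str(match_stage)
```

Calling `translate_filter(median(amount) > 0)` stops with an error that names the offending function. A real translator would also need to handle cases this sketch ignores, such as string literals that begin with `$`, which MongoDB would otherwise read as field paths.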
Ambiguous cases should be handled conservatively and documented explicitly, especially for:
- missing fields,
- NULL/NA behavior,
- heterogeneous field types,
- nested document paths,
- ordering assumptions.
Typical verb mappings are:
- filter() -> $match
- select() -> $project
- mutate() -> $addFields or $project
- arrange() -> $sort
- group_by() + summarise() -> $group
- slice_head() / head() -> $limit or array-slicing stages for negative n
- slice_tail() -> array-slicing stages
This mapping should be documented, inspectable, and testable.
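A verb-to-stage mapping of this kind can be kept testable by compiling recorded operations into plain nested lists. The base R sketch below covers only three of the simpler verbs. The operation format and the names `compile_stage()` and `compile_pipeline()` are assumptions made for illustration, not the package's API, and descending sorts and negative n are deliberately left out.

```r
# Compile one recorded operation (a hypothetical list format) into a stage.
compile_stage <- function(op) {
  switch(op$verb,
    # select(): include the listed fields
    select = list(`$project` = as.list(setNames(rep(1L, length(op$fields)), op$fields))),
    # arrange(): ascending sort on the listed fields
    arrange = list(`$sort` = as.list(setNames(rep(1L, length(op$fields)), op$fields))),
    # slice_head() with a positive n
    slice_head = list(`$limit` = op$n),
    # Any other verb fails explicitly
    stop(sprintf("No pipeline translation for verb: %s()", op$verb), call. = FALSE)
  )
}

# A pipeline is the ordered list of compiled stages.
compile_pipeline <- function(ops) {
  lapply(ops, compile_stage)
}

ops <- list(
  list(verb = "select", fields = c("customer", "amount")),
  list(verb = "arrange", fields = "amount"),
  list(verb = "slice_head", n = 10L)
)
pipeline <- compile_pipeline(ops)
str(pipeline)
```

Since the pipeline is ordinary R data, each mapping can be checked with direct comparisons in unit tests before any JSON is rendered or sent to MongoDB.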
Paolo Bosetti, University of Trento, Department of Industrial Engineering https://ror.org/05trd4x28
