# 🦆 Ducks all the way down
This is a convenience package for everyone who has a deep and abiding love affair with SQL (and ducks).
Instead of using other incredibly popular and useful packages like data.table or dplyr, one could use duckdf to slice and dice dataframes with SQL. Precedent exists in the sqldf package, but this one is better because it quacks (although it is not nearly as comprehensive, well planned or tested).
It's also meant to be a bit of fun.
To install this package, first install the very experimental duckdb:
```r
install.packages("duckdb", repos=c("http://download.duckdb.org/alias/master/rstats/", "http://cran.rstudio.com"))
```
At least version 0.1.6 is required.
Then install duckdf itself:

```r
# install.packages("devtools")
devtools::install_github("phillc73/duckdf")
library("duckdf")
```
Write "normal" SQL in an R function:
```r
duckdf("SELECT mpg, cyl FROM mtcars WHERE disp >= 200")
```
This registers the well known mtcars dataset as a virtual table in the duckdb database, then selects just the mpg and cyl columns for rows where the disp column is greater than or equal to 200.
In reality, this function is just a simple wrapper around a collection of DBI functions, such as dbConnect(), dbGetQuery() and dbDisconnect().
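To make the wrapper idea concrete, here is a minimal sketch of how such a function might be put together, assuming duckdb's in-memory connection and its dataframe-registration interface. The helper name `run_duckdf_sql()` is made up for illustration and is not part of the duckdf package.

```r
library(DBI)
library(duckdb)

# Hypothetical sketch of a duckdf-style wrapper, not the package's
# actual implementation.
run_duckdf_sql <- function(sql, df = mtcars, name = "mtcars") {
  con <- dbConnect(duckdb::duckdb())           # in-memory duckdb database
  on.exit(dbDisconnect(con, shutdown = TRUE))  # always close the connection
  duckdb::duckdb_register(con, name, df)       # expose the dataframe as a virtual table
  dbGetQuery(con, sql)                         # run the query, return a dataframe
}

run_duckdf_sql("SELECT mpg, cyl FROM mtcars WHERE disp >= 200")
```

Because the dataframe is only registered, not copied, nothing is written to disk and the whole round trip stays in memory.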
```r
duckdf_persist("SELECT mpg, cyl FROM mtcars WHERE disp >= 200")
```
The above is obviously the same SQL statement; however, by using duckdf_persist() an on-disk duckdb database is created in the current working directory. This function currently remains in the package, but is not really useful now that duckdb supports registering virtual tables from dataframes.
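An on-disk duckdb database created this way can also be opened directly with DBI. The sketch below assumes the database file is called "duckdf.duckdb" and contains a table named mtcars; check your working directory for the actual file name duckdf_persist() created.

```r
library(DBI)
library(duckdb)

# The file name "duckdf.duckdb" and the table name "mtcars" are
# assumptions for illustration.
con <- dbConnect(duckdb::duckdb(dbdir = "duckdf.duckdb"))
result <- dbGetQuery(con, "SELECT mpg, cyl FROM mtcars WHERE disp >= 200")
dbDisconnect(con, shutdown = TRUE)
```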
The package also includes a cleanup function, which simply removes all traces of the mtcars database from the current working directory.
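The same effect can be achieved by hand with base R, assuming you know the database file's name (again, "duckdf.duckdb" is an assumption here, not a name the package guarantees):

```r
# Illustrative only: delete an on-disk duckdb database file by hand.
db_file <- "duckdf.duckdb"
if (file.exists(db_file)) {
  unlink(db_file)
}
```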
Is this package any good? If some measure of good is the speed at which results are returned, then this package is reasonably good.
The benchmarks below were run on a quite old, first-generation i7 laptop. If you run them yourself, your numbers will differ, but the general themes should remain the same.
duckdf SELECT functions have been vaguely tested against other popular approaches, including sqldf, dplyr, dbplyr and data.table. In these tests, duckdf() is significantly faster than sqldf, somewhat faster than the current implementation of dbplyr, not quite as fast as dplyr, and much, much slower than data.table.
```r
library(duckdf)
library(dplyr)
library(microbenchmark)
library(sqldf)
library(data.table)
library(ggplot2)

# Make a data.table
mtcars_data_table <- data.table(mtcars)

# dbplyr db prep
dsrc <- duckdb::src_duckdb()
mtcars_db <- copy_to(dsrc, mtcars, "mtcars", temporary = FALSE)

# Run the benchmark as often as you like
duck_bench <- microbenchmark(times = 500,
  sqldf("SELECT mpg, cyl FROM mtcars WHERE disp >= 200"),
  duckdf("SELECT mpg, cyl FROM mtcars WHERE disp >= 200"),
  duckdf_persist("SELECT mpg, cyl FROM mtcars WHERE disp >= 200"),
  mtcars_data_table[disp >= 200, c("mpg", "cyl")],
  mtcars %>% dplyr::filter(disp >= 200) %>% dplyr::select(mpg, cyl),
  tbl(dsrc, "mtcars") %>% dplyr::filter(disp >= 200) %>% dplyr::select(mpg, cyl)
)

autoplot(duck_bench)
```
Of course there are lies, damn lies and benchmarks. Different datasets, of different size or different column types, may produce entirely different results.