🦆 Ducks all the way down

This is a convenience package for everyone who has a deep and abiding love affair with SQL (and ducks).

Instead of using other incredibly popular and useful packages like data.table or dplyr, one could use duckdf to slice and dice dataframes with SQL. Precedent exists in the sqldf package, but this one is better because it quacks (although it is not nearly as comprehensive, well planned, or tested).

It's also meant to be a bit of fun.

Quick Start

To install this package, first install the very experimental duckdb:

install.packages("duckdb", 
                        repos=c("http://download.duckdb.org/alias/master/rstats/", 
                        "http://cran.rstudio.com"))

At least version 0.1.6 is required.
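
If duckdb is already installed, a quick way to confirm the version is a check like the one below. This is just a suggestion, not part of duckdf itself.

# Optional: confirm the installed duckdb is new enough for duckdf
if (packageVersion("duckdb") < "0.1.6") {
  stop("duckdf needs duckdb >= 0.1.6; reinstall it from the repository above")
}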

Then install duckdf with devtools:

# install.packages("devtools")
devtools::install_github("phillc73/duckdf")
library("duckdf")

Usage

Write "normal" SQL in an R function:

duckdf("SELECT mpg, cyl FROM mtcars WHERE disp >= 200")

This registers the well-known mtcars dataset as a virtual table in the duckdb database, then selects just the mpg and cyl columns from rows where disp is greater than or equal to 200.

In reality, this function is just a simple wrapper around a collection of DBI functions such as dbConnect(), dbGetQuery() and dbDisconnect(), plus the duckdb function duckdb_register().
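
For a rough idea of what that wrapper looks like, here is a minimal sketch that strings those calls together. It is illustrative only, not the package's actual implementation, and it passes the data frame explicitly, whereas the real duckdf() takes just the SQL string.

# Illustrative sketch only: register a data frame as a duckdb virtual table,
# query it, then disconnect. Not how duckdf() is actually written.
library(DBI)
library(duckdb)

duckdf_sketch <- function(sql, df, name = deparse(substitute(df))) {
  con <- dbConnect(duckdb::duckdb())             # in-memory duckdb database
  on.exit(dbDisconnect(con, shutdown = TRUE))    # always tidy up the connection
  duckdb::duckdb_register(con, name, df)         # expose the data frame as a virtual table
  dbGetQuery(con, sql)                           # run the query, return a data frame
}

duckdf_sketch("SELECT mpg, cyl FROM mtcars WHERE disp >= 200", mtcars)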

duckdf_persist("SELECT mpg, cyl FROM mtcars WHERE disp >= 200")

The above is obviously the same SQL statement; however, by using duckdf_persist() an on-disk duckdb database is created in the current working directory. This function remains in the package, but it is not really useful now that duckdb supports registering virtual tables from dataframes.

duckdf_cleanup("mtcars")

This simply removes all traces of the duckdb database called mtcars from the current working directory.
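
For context, a persistent workflow amounts to something like the sketch below. The file name and the exact behaviour of duckdf_persist() and duckdf_cleanup() are assumptions here, not the package internals.

# Illustrative sketch of an on-disk duckdb workflow; the file name is an assumption
library(DBI)

con <- dbConnect(duckdb::duckdb(), dbdir = "mtcars.duckdb")   # file in the working directory
dbWriteTable(con, "mtcars", mtcars, overwrite = TRUE)         # persist the data frame
dbGetQuery(con, "SELECT mpg, cyl FROM mtcars WHERE disp >= 200")
dbDisconnect(con, shutdown = TRUE)

# Cleaning up is essentially just deleting the database file again
file.remove("mtcars.duckdb")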

Benchmarks

Is this package any good? If some measure of good is the speed at which results are returned, then this package is reasonably good.

The benchmarks below were run on a fairly old, first-generation i7 laptop. If you run them yourself the numbers will differ, but the general themes should remain the same.

The current duckdf SELECT functions have been informally benchmarked against other popular approaches, including data.table, dplyr and sqldf.

duckdf() is significantly faster than sqldf, somewhat faster than the current implementation of dbplyr, not quite as fast as dplyr and much, much slower than data.table.

library(duckdf)
library(dplyr)
library(microbenchmark)
library(sqldf)
library(data.table)
library(ggplot2)

# Make a data.table
mtcars_data_table <- data.table(mtcars)

# dbplyr db prep
dsrc <- duckdb::src_duckdb()
mtcars_db <-
  copy_to(dsrc, mtcars, "mtcars", temporary = FALSE)

# Run the benchmark as often as you like
duck_bench <- microbenchmark(times = 500,
                             sqldf("SELECT mpg, cyl FROM mtcars WHERE disp >= 200"),
                             duckdf("SELECT mpg, cyl FROM mtcars WHERE disp >= 200"),
                             duckdf_persist("SELECT mpg, cyl FROM mtcars WHERE disp >= 200"),
                             mtcars_data_table[disp >= 200, c("mpg", "cyl")],
                             mtcars %>%
                               dplyr::filter(disp >= 200) %>%
                               dplyr::select(mpg, cyl),
                             tbl(dsrc, "mtcars") %>%
                               dplyr::filter(disp >= 200) %>%
                               dplyr::select(mpg, cyl)
                             )

autoplot(duck_bench)

Of course there are lies, damn lies and benchmarks. Different datasets, of different size or different column types, may produce entirely different results.
