---
title: "Recording SQL queries with dittodb for travelling"
author: "Mauricio Vargas"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Recording SQL queries with dittodb}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r recording, include = FALSE, eval = FALSE}
library(dplyr)
library(dbplyr)

# connect to the PostgreSQL server and load the nycflights13 data into it
con_psql <- RPostgreSQL::dbConnect(
  drv = DBI::dbDriver("PostgreSQL"),
  dbname = "dittodb",
  host = "postgres.server",
  user = "dittodb",
  password = "_dittodb_"
)
nycflights13_sql(con_psql)
DBI::dbDisconnect(con_psql)

# capture the query used in this vignette, saving the fixtures under vignettes/
start_db_capturing(path = "vignettes/")

con_psql <- RPostgreSQL::dbConnect(
  drv = DBI::dbDriver("PostgreSQL"),
  dbname = "dittodb",
  host = "postgres.server",
  user = "dittodb",
  password = "_dittodb_"
)

tbl(con_psql, in_schema("public", "flights")) %>%
  filter(!is.na(tailnum)) %>%
  group_by(tailnum) %>%
  count() %>%
  filter(n >= 100L)

DBI::dbDisconnect(con_psql)
stop_db_capturing()
```
```{r setup, include=FALSE}
library(dittodb)
# set the mockPaths for this vignette
.db_mock_paths("travelling")
knitr::opts_chunk$set(eval = TRUE, message = FALSE, warning = FALSE)
```

# Scope

This vignette shows how to mock the connection to a real PostgreSQL server that
contains a database version of the `nycflights13` dataset (among other
databases). See [the `nycflights13` vignette](nycflights.html) for more
information about this database.
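
For reference, the example database can be populated with the
`nycflights13_sql()` helper (the same call used when this vignette's fixtures
were recorded). A minimal sketch, assuming a reachable PostgreSQL server and
the credentials used throughout this vignette:

```{r, eval = FALSE}
library(dittodb)

# connect to the (office-only) PostgreSQL server
con_psql <- RPostgreSQL::dbConnect(
  drv = DBI::dbDriver("PostgreSQL"),
  dbname = "dittodb",
  host = "postgres.server",
  user = "dittodb",
  password = "_dittodb_"
)

# create and populate the nycflights13 tables in the database
nycflights13_sql(con_psql)

DBI::dbDisconnect(con_psql)
```
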

This example is for you if you have ever wondered how to use, at home or while
travelling, the scripts you normally run at the office, or how to keep
developing them when you don't have an internet connection.

Many of us have to use databases that are only accessible from a local network.
The package `dittodb` provides `with_mock_db()`, which wraps your code and
makes it possible to run it outside the office (or even with no internet access
at all!).

# Recording queries

Suppose we are asked to analyze the flights data and keep only flights flown by
planes that have completed at least 100 flights.

The first step is to find all planes that have flown at least 100 flights. The
only consideration is that we must filter out flights with a missing tail
number; otherwise all of those flights would be treated as belonging to a
single plane.

We could run the following code to get that data with a direct connection to
the database (i.e. at the office):

```{r, error=TRUE, eval=FALSE}
library(dplyr)
library(dbplyr)

con_psql <- RPostgreSQL::dbConnect(
  drv = DBI::dbDriver("PostgreSQL"),
  dbname = "dittodb",
  host = "postgres.server",
  user = "dittodb",
  password = "_dittodb_"
)

tbl(con_psql, in_schema("public", "flights")) %>%
  filter(!is.na(tailnum)) %>%
  group_by(tailnum) %>%
  count() %>%
  filter(n >= 100L)
```

However, this won't work if we can't connect to our database server. And since
`postgres.server` is an alias to an IP that is only accessible from the local
network at our office, we couldn't run this code and get a result anywhere
else. But what if we wanted to continue working on this analysis on the train
home?

*Important:* This was just an example. Please never write your passwords in
scripts; use your `.Rprofile`, an environment variable, or some other more
secure method instead.
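
For example, a minimal sketch of reading the password from an environment
variable. The variable name `DITTODB_PASSWORD` is only an illustration (for
instance, set in your `.Renviron`), not something `dittodb` itself looks for:

```{r, eval = FALSE}
# read the password from an environment variable instead of hard-coding it
con_psql <- RPostgreSQL::dbConnect(
  drv = DBI::dbDriver("PostgreSQL"),
  dbname = "dittodb",
  host = "postgres.server",
  user = "dittodb",
  password = Sys.getenv("DITTODB_PASSWORD")
)
```
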

One option would be saving a CSV or TXT file of the data manually, and then
manually reading it into our R session. But this has a number of drawbacks: we
have to mentally keep track of where each query is from, save it to the right
file, read it into the right place, etc. We also have to maintain a separate
system or code path for reading in the saved files. `dittodb` can take care of
all of this for us in the background, allowing us to record the results of the
necessary queries and play them back when those same queries are run without a
connection to the database.
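
Behind the scenes, each recorded result is written as a small fixture file in
one of `dittodb`'s mock paths. A minimal sketch of inspecting and adding mock
paths with `db_mock_paths()` (the directory name `travelling` is just an
example; this vignette keeps its fixtures in a directory of that name):

```{r, eval = FALSE}
library(dittodb)

# list the directories where dittodb records and looks up fixtures
db_mock_paths()

# add a project-specific directory to the front of the search path
db_mock_paths("travelling")
```
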

While we are able to connect to the database (i.e. when we are at the office),
we can save the results our queries return with code like the following: call
`start_db_capturing()` before opening the connection and running the queries,
and call `stop_db_capturing()` at the end.

```{r, eval=FALSE}
library(dittodb)

# record the response to every query run between these two calls
start_db_capturing()

con_psql <- RPostgreSQL::dbConnect(
  drv = DBI::dbDriver("PostgreSQL"),
  dbname = "dittodb",
  host = "postgres.server",
  user = "dittodb",
  password = "_dittodb_"
)

flights_100 <- tbl(con_psql, in_schema("public", "flights")) %>%
  filter(!is.na(tailnum)) %>%
  group_by(tailnum) %>%
  count() %>%
  filter(n >= 100L) %>%
  collect()

flights_100

DBI::dbDisconnect(con_psql)

stop_db_capturing()
```
```{r cooking show trick, echo=FALSE}
library(dplyr)
library(dbplyr)

# this is the same code that is echoed below, but it is run here to show the
# output that the chunk above would produce if it were able to connect
with_mock_db({
  con_psql <- RPostgreSQL::dbConnect(
    drv = DBI::dbDriver("PostgreSQL"),
    dbname = "dittodb",
    host = "postgres.server",
    user = "dittodb",
    password = "_dittodb_"
  )

  flights_100_from_mock <- tbl(con_psql, in_schema("public", "flights")) %>%
    filter(!is.na(tailnum)) %>%
    group_by(tailnum) %>%
    count() %>%
    filter(n >= 100L) %>%
    collect()

  flights_100_from_mock
})

# `dbDisconnect()` returns TRUE
TRUE
```

# Reproducing query results

If one or more queries were captured successfully, we can replicate their
results while connected to a different network, or even with no internet access
at all:

```{r}
with_mock_db({
  con_psql <- RPostgreSQL::dbConnect(
    drv = DBI::dbDriver("PostgreSQL"),
    dbname = "dittodb",
    host = "postgres.server",
    user = "dittodb",
    password = "_dittodb_"
  )

  flights_100_from_mock <- tbl(con_psql, in_schema("public", "flights")) %>%
    filter(!is.na(tailnum)) %>%
    group_by(tailnum) %>%
    count() %>%
    filter(n >= 100L) %>%
    collect()

  flights_100_from_mock
})
```

One thing to note: when using `dbplyr`, we need to be careful to wrap the
entire interaction with the database objects in `with_mock_db()` if we are
taking advantage of `dbplyr`'s lazy evaluation (which is on by default), and to
use `collect()` to return the results at the point where we want them recorded.
Because `dbplyr` waits until the last possible moment to request the data, if
there is no `collect()` call (or another call that implicitly sends the query),
no query is ever sent, and `dittodb` won't be able to record a response for it.
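
As a hypothetical illustration of that pitfall (this code is not part of the
vignette's recorded fixtures, and `flights_100_lazy` is just an example name):
if the lazy table is built while capturing but only collected afterwards, no
query is sent during the capture, so nothing gets recorded:

```{r, eval = FALSE}
start_db_capturing()

con_psql <- RPostgreSQL::dbConnect(
  drv = DBI::dbDriver("PostgreSQL"),
  dbname = "dittodb",
  host = "postgres.server",
  user = "dittodb",
  password = "_dittodb_"
)

# lazily defined: no SQL has been sent to the server yet
flights_100_lazy <- tbl(con_psql, in_schema("public", "flights")) %>%
  filter(!is.na(tailnum)) %>%
  group_by(tailnum) %>%
  count() %>%
  filter(n >= 100L)

stop_db_capturing()

# the query only runs here, after capturing has stopped, so dittodb never saw
# it and nothing was recorded
collect(flights_100_lazy)
```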