Lazy ops's delay evaluation of variables for SQL backends #2370

jarodmeng · 2017-01-21T01:05:59Z

dplyr 0.5.0 has a new set of internals for SQL database backends. For the most part, the frontend APIs function the same. However, when it comes to using variables in building SQL statements, there's a material difference between 0.5.0 and 0.4.3. The variables are only evaluated when sql_render is finally called, usually within collect.

In the following example, I create a vector of 3 carrier names and loop over it. Each iteration filters the flights_sqlite table down to one carrier, and I store the resulting sub-tables in a list. When I call sql_render on each element outside the loop, they all render the same SQL. This is because all three sub-tables hold a filter op on the variable crr, which is not evaluated inside the loop. By the time sql_render is called outside the loop, crr holds the last value in the vector, namely EV, so all three sub-tables are filtered to EV flights only.

I don't consider this a bug, but this behavior is very different from dplyr 0.4.3 in which the variables would be evaluated immediately when filter is called and thus remembered in the sub-table. It would be great if dplyr 0.5.0 could offer a way to keep this behavior rather than simply replace it with a pass-by-reference style.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(RSQLite)

flights_sqlite <- tbl(nycflights13_sqlite(), "flights")
#> Caching nycflights db at /var/folders/0x/zdvx0wzs3dn8mzmj9dftnqxm00377p/T//Rtmpe1R0r2/nycflights13.sqlite
#> Creating table: airlines
#> Creating table: airports
#> Creating table: flights
#> Creating table: planes
#> Creating table: weather

vec.carriers <- c("UA", "DL", "EV")

list.flights <- list()
for (crr in vec.carriers) {
  list.flights[[crr]] <- flights_sqlite %>%
    filter(carrier == crr)
}

sql_render(list.flights[[1]])
#> <SQL> SELECT *
#> FROM `flights`
#> WHERE (`carrier` = 'EV')
sql_render(list.flights[[2]])
#> <SQL> SELECT *
#> FROM `flights`
#> WHERE (`carrier` = 'EV')
sql_render(list.flights[[3]])
#> <SQL> SELECT *
#> FROM `flights`
#> WHERE (`carrier` = 'EV')
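
The same late-binding effect can be reproduced in base R without a database. This is only a minimal sketch of the mechanism described above, not dplyr's actual internals: `lazy_filter()` and `render()` are hypothetical stand-ins for a lazy op and for sql_render.

```r
# Hypothetical stand-in for a lazy op: store the unevaluated expression
# together with the environment it was written in. Nothing is evaluated yet.
lazy_filter <- function(expr, env = parent.frame()) {
  list(expr = substitute(expr), env = env)
}

# Hypothetical stand-in for sql_render(): variables are looked up only here.
render <- function(op) eval(op$expr, op$env)

ops <- list()
for (crr in c("UA", "DL", "EV")) {
  # Every op refers to the same variable `crr` in the same environment.
  ops[[crr]] <- lazy_filter(paste0("carrier = '", crr, "'"))
}

# By now the loop has finished and crr holds its last value.
rendered <- unname(vapply(ops, render, character(1)))
print(rendered)
```

All three rendered strings come out as the EV condition, because each stored expression still points at the one loop variable rather than at the value it had when the op was created.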


austenhead · 2017-01-21T01:26:08Z

Here is an approach that doesn't address the underlying problem but does give you your desired result. Try using lapply rather than a for loop (for some reason...)

library(dplyr)
library(RSQLite)
library(nycflights13)
flights_sqlite <- tbl(nycflights13_sqlite(), "flights")
vec.carriers <- c("UA", "DL", "EV")
list.flights <- lapply(vec.carriers, function(crr) flights_sqlite %>% filter(carrier == crr))

Then I get your desired result:

sql_render(list.flights[[1]])
#> <SQL> SELECT *
#> FROM `flights`
#> WHERE (`carrier` = 'UA')
sql_render(list.flights[[2]])
#> <SQL> SELECT *
#> FROM `flights`
#> WHERE (`carrier` = 'DL')
sql_render(list.flights[[3]])
#> <SQL> SELECT *
#> FROM `flights`
#> WHERE (`carrier` = 'EV')
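
A likely reason lapply behaves differently from the for loop: each call to the anonymous function gets its own environment with its own `crr`, so a lazily stored expression still finds the right value later. A base R sketch under that assumption follows; `lazy_filter()` and `render()` are hypothetical stand-ins, not dplyr functions.

```r
# Hypothetical stand-ins for a lazy op and for sql_render().
lazy_filter <- function(expr, env = parent.frame()) {
  list(expr = substitute(expr), env = env)
}
render <- function(op) eval(op$expr, op$env)

ops <- lapply(c("UA", "DL", "EV"), function(crr) {
  # Make sure the argument is evaluated now; recent versions of R
  # already do this inside lapply, so this is belt and braces.
  force(crr)
  # The captured environment is this call's own frame, with its own crr.
  lazy_filter(paste0("carrier = '", crr, "'"))
})

rendered <- vapply(ops, render, character(1))
print(rendered)
```

Here each element renders with its own carrier, since the three stored expressions live in three separate environments.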

austenhead · 2017-01-21T01:43:33Z

Or you can use dplyr::collapse(), which renders the SQL at that point and so fixes the current value of crr in each sub-table:

for (crr in vec.carriers) {
    list.flights[[crr]] <- flights_sqlite %>%
        filter(carrier == crr) %>%
        collapse()
}

https://github.com/hadley/dplyr/blob/master/R/tbl-sql.r#L366-L375

hadley · 2017-02-14T16:47:08Z

Here's a simpler reprex:

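library(dplyr)

my_x <- 1
# Build a lazy query that refers to my_x
query <- memdb_frame(x = 1:2) %>% filter(x == my_x)

# Change the variable after the query has been built
my_x <- 2
# The rendered SQL should still compare x against 1
query %>% show_query()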

The interpolation of variables needs to happen at each step, not just before the query is executed.
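
To make the difference between the two timings concrete, here is a base R sketch that contrasts keeping a reference to a variable with inlining its value when the expression is built. It uses `quote()` and `bquote()` purely as an illustration and makes no claim about how the fix in dplyr is implemented.

```r
my_x <- 1

# Late binding: the expression keeps the name my_x.
lazy <- quote(x == my_x)

# Early binding: .() inlines the current value of my_x into the expression.
eager <- bquote(x == .(my_x))

# Change the variable after both expressions have been built.
my_x <- 2
x <- 1:2

lazy_result  <- eval(lazy)   # compares x against the new value of my_x
eager_result <- eval(eager)  # compares x against the value captured earlier

print(deparse(lazy))
print(deparse(eager))
```

The eager expression no longer mentions `my_x` at all, which is the behavior the reprex above asks for at each step of a query.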

hadley added the database and feature labels Jan 31, 2017

hadley closed this as completed in 2da63cb Feb 14, 2017

hadley mentioned this issue Feb 21, 2017

Altering captured reference damages spark results. #2455

Closed

JohnMount mentioned this issue Feb 21, 2017

Altering captured reference damages spark results. sparklyr/sparklyr#503

Closed

lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018
