Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lazy ops's delay evaluation of variables for SQL backends #2370

Closed
jarodmeng opened this issue Jan 21, 2017 · 3 comments
Closed

Lazy ops's delay evaluation of variables for SQL backends #2370

jarodmeng opened this issue Jan 21, 2017 · 3 comments
Labels
feature a feature request or enhancement

Comments

@jarodmeng
Copy link

dplyr 0.5.0 has a new set of internals for SQL database backends. For the most part, the frontend APIs function the same. However, when it comes to using variables in building SQL statements, there's a material difference between 0.5.0 and 0.4.3. The variables are only evaluated when sql_render is finally called, usually within collect.

In the following example, I create a vector of 3 carrier names and a loop to go through the vector. Each step of the loop is to filter a part of the flights_sqlite table according to the carrier. I store the sub-tables in a list. When I called sql_render on each element outside the loop, they all appear to be the same. This is because all three sub-tables have filter ops on the variable crr which is not evaluated in the loop. When sql_render is called outside of the loop, crr has the last value in the vector, namely EV, and all three sub-tables are filtered to have EV airlines only.

I don't consider this a bug, but this behavior is very different from dplyr 0.4.3 in which the variables would be evaluated immediately when filter is called and thus remembered in the sub-table. It would be great if dplyr 0.5.0 could offer a way to keep this behavior rather than simply replace it with a pass-by-reference style.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(RSQLite)

flights_sqlite <- tbl(nycflights13_sqlite(), "flights")
#> Caching nycflights db at /var/folders/0x/zdvx0wzs3dn8mzmj9dftnqxm00377p/T//Rtmpe1R0r2/nycflights13.sqlite
#> Creating table: airlines
#> Creating table: airports
#> Creating table: flights
#> Creating table: planes
#> Creating table: weather

vec.carriers <- c("UA", "DL", "EV")

list.flights <- list()
for (crr in vec.carriers) {
  list.flights[[crr]] <- flights_sqlite %>%
    filter(carrier == crr)
}

sql_render(list.flights[[1]])
#> <SQL> SELECT *
#> FROM `flights`
#> WHERE (`carrier` = 'EV')
sql_render(list.flights[[2]])
#> <SQL> SELECT *
#> FROM `flights`
#> WHERE (`carrier` = 'EV')
sql_render(list.flights[[3]])
#> <SQL> SELECT *
#> FROM `flights`
#> WHERE (`carrier` = 'EV')
@austenhead
Copy link
Contributor

Here is an approach that doesn't address the underlying problem but does give you your desired result. Try using lapply rather than a for loop (for some reason...)

library(dplyr)
library(RSQLite)
library(nycflights13)
flights_sqlite <- tbl(nycflights13_sqlite(), "flights")
vec.carriers <- c("UA", "DL", "EV")
list.flights <- lapply(vec.carriers, function(crr)flights_sqlite %>% filter(carrier == crr))

Then I get your desired result

sql_render(list.flights[[1]])
#> <SQL> SELECT *
#> FROM `flights`
#> WHERE (`carrier` = 'UA')
sql_render(list.flights[[2]])
#> <SQL> SELECT *
#> FROM `flights`
#> WHERE (`carrier` = 'DL')
sql_render(list.flights[[3]])
#> <SQL> SELECT *
#> FROM `flights`
#> WHERE (`carrier` = 'EV')

@austenhead
Copy link
Contributor

or you can use the function dplyr::collapse()

for (crr in vec.carriers) {
    list.flights[[crr]] <- flights_sqlite %>%
        filter(carrier == crr) %>%
        collapse()
}

https://github.com/hadley/dplyr/blob/master/R/tbl-sql.r#L366-L375

@hadley hadley added database feature a feature request or enhancement labels Jan 31, 2017
@hadley
Copy link
Member

hadley commented Feb 14, 2017

Here's a simpler reprex:

my_x <- 1
query <- memdb_frame(x = 1:2) %>% filter(x == my_x)

my_x <- 2
query %>% show_query()

The interpolation of variables needs to happen at each step, not just before the query is executed.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

3 participants