Failing test with next version of dtplyr #17

Closed

hadley opened this issue Feb 4, 2021 · 9 comments

hadley commented Feb 4, 2021

This test now fails:

  expect_equal(
    query(
      "SELECT origin, dest,
          COUNT(flight) AS num_flts,
          round(AVG(distance)) AS dist,
          round(AVG(arr_delay)) AS avg_delay
          FROM flights_dt
        WHERE distance BETWEEN 200 AND 300
          AND air_time IS NOT NULL
        GROUP BY origin, dest
        HAVING num_flts > 3000
        ORDER BY num_flts DESC, avg_delay DESC
        LIMIT 100;"
    ),
    flights_dt %>%
      filter(between(distance, 200, 300) & !is.na(air_time)) %>%
      group_by(origin, dest) %>%
      filter(sum(!is.na(flight)) > 3000) %>%
      summarise(
        num_flts = sum(!is.na(flight)),
        dist = round(mean(distance, na.rm = TRUE)),
        avg_delay = round(mean(arr_delay, na.rm = TRUE))
      ) %>%
      ungroup() %>%
      arrange(desc(num_flts), desc(avg_delay)) %>%
      head(100L)
  )

Unfortunately you don't get a particularly informative error (even with local_edition(3)) because the pipeline is rather deep. However, I think this is the key difference:

actual$parent$parent$parent$parent$parent$i vs expected$parent$parent$parent$parent$parent$i
- `\`_DT3\`[, .I[sum(!is.na(flight)) > 3000], by = .(origin, dest)]$V1`
+ `\`_DT4\`[, .I[sum(!is.na(flight)) > 3000], by = .(origin, dest)]$V1`

i.e. one side of the comparison is generating one additional intermediate data table name compared to the other; this is probably due to the new grouped filter behaviour. Indeed, if I remove filter(sum(!is.na(flight)) > 3000) and HAVING num_flts > 3000, the test passes.

hadley commented Feb 4, 2021

Ooooooh, the problem is that dtplyr always creates a unique name for the intermediate table it now needs for the grouped filter:

`_DT10` <- `_DT2`[between(distance, 200, 300) & !is.na(air_time)]
  head(`_DT10`[`_DT10`[, .I[sum(!is.na(flight)) > 3000], by = .(origin, 
    dest)]$V1, .(num_flts = sum(!is.na(flight)), dist = round(mean(distance, 
    na.rm = TRUE)), avg_delay = round(mean(arr_delay, na.rm = TRUE))), 
    keyby = .(origin, dest)][order(desc(num_flts), desc(avg_delay))], 
    n = 100L)

And since the name is session unique, it's incremented between the two tests.

Maybe I can make the name unique within a pipeline.
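
For illustration, here's a minimal sketch of how that shows up (assuming nycflights13 is installed; the specific _DT numbers are just examples):

library(dtplyr)
library(dplyr)

flights_dt <- lazy_dt(nycflights13::flights)

pipeline <- function() {
  flights_dt %>%
    filter(between(distance, 200, 300)) %>%
    group_by(origin, dest) %>%
    filter(sum(!is.na(flight)) > 3000)   # grouped filter: needs an intermediate table
}

# The intermediate table's name comes from a session-wide counter, so two
# otherwise identical pipelines print slightly different queries:
pipeline() %>% show_query()   # e.g. references `_DT2`
pipeline() %>% show_query()   # e.g. references `_DT3`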

hadley commented Feb 4, 2021

I'm having difficulties figuring out how to do this better because independent pipelines might get combined into one with a left_join(), so the locals have to be globally unique, not just locally unique. Let me know if you have any ideas.
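
Here's a minimal sketch of why locally unique names aren't enough (the data and names are purely illustrative): both sides of the join carry their own grouped-filter intermediate, and both end up inside a single translated expression.

library(dtplyr)
library(dplyr)

lhs <- lazy_dt(data.frame(key = 1:3, x = 1:3)) %>%
  group_by(key) %>%
  filter(sum(x) > 1) %>%   # grouped filter: needs its own intermediate table
  ungroup()

rhs <- lazy_dt(data.frame(key = 1:3, y = 4:6)) %>%
  group_by(key) %>%
  filter(sum(y) > 1) %>%   # and so does this one
  ungroup()

# If each pipeline numbered its intermediates independently (both starting at,
# say, `_DT1`), the two names would collide in the combined query:
left_join(lhs, rhs, by = "key") %>% show_query()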

ianmcook commented Feb 5, 2021

@hadley Thanks for looking into this. In these tests, I suppose I should have used as.data.frame() to test for equality of the result sets, not for equality of the returned dtplyr_step object.
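
A sketch of what that change looks like (paraphrasing the idea against the same setup as the test above, not the committed code): materializing both sides with as.data.frame() means the comparison only sees the result rows, never the generated _DT names inside the dtplyr_step.

  expect_equal(
    as.data.frame(query(
      "SELECT origin, dest, COUNT(flight) AS num_flts
         FROM flights_dt
        GROUP BY origin, dest"
    )),
    flights_dt %>%
      group_by(origin, dest) %>%
      summarise(num_flts = sum(!is.na(flight))) %>%
      ungroup() %>%
      as.data.frame()
  )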

I just committed that change (7aa7dbb), but now I'm getting some strange test errors. I'll try to debug. https://travis-ci.org/github/ianmcook/tidyquery/builds/757640217#L3025

ianmcook commented Feb 5, 2021

Here's a minimal reprex of the type of failure I'm seeing after adding as.data.frame() in 7aa7dbb:

If I set up a testthat file like this:

library(dtplyr)
library(dplyr)

iris_dt <- lazy_dt(iris)

test_that("testthat works with dtplyr", {
  iris_dt %>% select(Sepal.Length) %>% as.data.frame()
  succeed()
})

then I run devtools::test() and I get this error:

Error: could not find function "."

If I just run the code outside of test(), it runs with no error. But after test() runs, something changes in the environment that causes it to fail even when run outside of test().

hadley commented Feb 5, 2021

The traceback is:

9: `[.data.frame`(x, i, j)
8: `[.data.table`(`_DT12`, , .(Sepal.Length))
7: `_DT12`[, .(Sepal.Length)]
6: eval_tidy(quo) at tidyeval.R#11
5: dt_eval(x) at step.R#144
4: as.data.frame(dt_eval(x)) at step.R#144
3: as.data.frame.dtplyr_step(.)
2: as.data.frame(.)
1: iris_dt %>% select(Sepal.Length) %>% as.data.frame()

So [.data.table is dispatching to [.data.frame, which obviously doesn't know how to treat . in this context.

I don't know why this behaviour would have changed, but I suspect it's caused by this commit: tidyverse/dtplyr@5f8d51b.
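
Here's a simplified illustration of that dispatch difference (a standalone example using a plain data.frame rather than going through cedta(), but it produces the same error):

library(data.table)

dt <- as.data.table(iris)
df <- iris

dt[, .(Sepal.Length)]   # `[.data.table` interprets .() as list(): returns the column
df[, .(Sepal.Length)]   # `[.data.frame` evaluates .(Sepal.Length) as an ordinary call:
                        # Error: could not find function "."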

hadley commented Feb 5, 2021

Yeah, with options(datatable.verbose = TRUE), I see:

cedta decided 'tidyquery' wasn't data.table aware. Here is call stack with [[1L]] applied:
[[1]]
`%>%`

[[2]]
as.data.frame

[[3]]
as.data.frame.dtplyr_step

[[4]]
as.data.frame

[[5]]
dt_eval

[[6]]
eval_tidy

[[7]]
`[`

[[8]]
`[.data.table`

[[9]]
cedta

I can fix this in dtplyr.

hadley commented Feb 5, 2021

The problem is that lazy_dt() captures the calling environment, and that's what data.table uses to figure out if it's enabled or not. It works in dtplyr itself because it imports data.table, but tidyquery does not. (And it shouldn't need to; I just need to figure out how to tell data.table that we're good.)

hadley commented Feb 5, 2021

Looking at the code in https://github.com/Rdatatable/data.table/blob/master/R/cedta.R, it's not clear how to force this to be true; any approach is going to be hacky. The easiest fix would be for you to include .datatable.aware <- TRUE somewhere in your package while I think about how to handle this.
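
For reference, a minimal sketch of that workaround (the file name is arbitrary; any .R file in the package works):

# R/zzz.R in tidyquery (or any other .R file in the package)
# Tells data.table's cedta() check that this package knowingly works with
# data.tables even though it doesn't import data.table, so `[` keeps
# data.table semantics when dtplyr evaluates queries on tidyquery's behalf.
.datatable.aware <- TRUE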

ianmcook commented Feb 6, 2021

Fixed in 7aa7dbb and 7cc5c71. Version 0.2.2 on its way to CRAN.
