Failing test with next version of dtplyr #17

Closed

hadley opened this issue Feb 4, 2021 · 9 comments

hadley commented Feb 4, 2021

This test now fails:

  expect_equal(
    query(
      "SELECT origin, dest,
          COUNT(flight) AS num_flts,
          round(AVG(distance)) AS dist,
          round(AVG(arr_delay)) AS avg_delay
          FROM flights_dt
        WHERE distance BETWEEN 200 AND 300
          AND air_time IS NOT NULL
        GROUP BY origin, dest
        HAVING num_flts > 3000
        ORDER BY num_flts DESC, avg_delay DESC
        LIMIT 100;"
    ),
    flights_dt %>%
      filter(between(distance, 200, 300) & !is.na(air_time)) %>%
      group_by(origin, dest) %>%
      filter(sum(!is.na(flight)) > 3000) %>%
      summarise(
        num_flts = sum(!is.na(flight)),
        dist = round(mean(distance, na.rm = TRUE)),
        avg_delay = round(mean(arr_delay, na.rm = TRUE))
      ) %>%
      ungroup() %>%
      arrange(desc(num_flts), desc(avg_delay)) %>%
      head(100L)
  )

Unfortunately you don't get a particularly informative error (even with local_edition(3)) because the pipeline is rather deep. However, I think this is the key difference:

actual$parent$parent$parent$parent$parent$i vs expected$parent$parent$parent$parent$parent$i
- `\`_DT3\`[, .I[sum(!is.na(flight)) > 3000], by = .(origin, dest)]$V1`
+ `\`_DT4\`[, .I[sum(!is.na(flight)) > 3000], by = .(origin, dest)]$V1`

i.e. one side of the comparison is generating one additional intermediate data table name compared to the other; this is probably due to the new grouped filter behaviour. Indeed, if I remove filter(sum(!is.na(flight)) > 3000) and HAVING num_flts > 3000, the test passes.

hadley commented Feb 4, 2021

Ooooooh, the problem is that dtplyr always creates a unique name for the intermediate table it now needs for the grouped filter:

`_DT10` <- `_DT2`[between(distance, 200, 300) & !is.na(air_time)]
  head(`_DT10`[`_DT10`[, .I[sum(!is.na(flight)) > 3000], by = .(origin, 
    dest)]$V1, .(num_flts = sum(!is.na(flight)), dist = round(mean(distance, 
    na.rm = TRUE)), avg_delay = round(mean(arr_delay, na.rm = TRUE))), 
    keyby = .(origin, dest)][order(desc(num_flts), desc(avg_delay))], 
    n = 100L)

And since the name is session unique, it's incremented between the two tests.

Maybe I can make the name unique within a pipeline.
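
For illustration, here's a minimal sketch of how that shows up (assuming nycflights13 is installed; the specific _DT numbers are just examples):

library(dtplyr)
library(dplyr)

flights_dt <- lazy_dt(nycflights13::flights)

pipeline <- function() {
  flights_dt %>%
    filter(between(distance, 200, 300)) %>%
    group_by(origin, dest) %>%
    filter(sum(!is.na(flight)) > 3000)   # grouped filter: needs an intermediate table
}

# The intermediate table's name comes from a session-wide counter, so two
# otherwise identical pipelines print slightly different queries:
pipeline() %>% show_query()   # e.g. references `_DT2`
pipeline() %>% show_query()   # e.g. references `_DT3`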

hadley commented Feb 4, 2021

I'm having difficulties figuring out how to do this better because independent pipelines might get combined into one with a left_join(), so the locals have to be globally unique, not just locally unique. Let me know if you have any ideas.
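
Here's a minimal sketch of why locally unique names aren't enough (the data and names are purely illustrative): both sides of the join carry their own grouped-filter intermediate, and both end up inside a single translated expression.

library(dtplyr)
library(dplyr)

lhs <- lazy_dt(data.frame(key = 1:3, x = 1:3)) %>%
  group_by(key) %>%
  filter(sum(x) > 1) %>%   # grouped filter: needs its own intermediate table
  ungroup()

rhs <- lazy_dt(data.frame(key = 1:3, y = 4:6)) %>%
  group_by(key) %>%
  filter(sum(y) > 1) %>%   # and so does this one
  ungroup()

# If each pipeline numbered its intermediates independently (both starting at,
# say, `_DT1`), the two names would collide in the combined query:
left_join(lhs, rhs, by = "key") %>% show_query()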

ianmcook commented Feb 5, 2021

@hadley Thanks for looking into this. In these tests, I suppose I should have used as.data.frame() to test for equality of the result sets, not for equality of the returned dtplyr_step object.
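
A sketch of what that change looks like (paraphrasing the idea against the same setup as the test above, not the committed code): materializing both sides with as.data.frame() means the comparison only sees the result rows, never the generated _DT names inside the dtplyr_step.

  expect_equal(
    as.data.frame(query(
      "SELECT origin, dest, COUNT(flight) AS num_flts
         FROM flights_dt
        GROUP BY origin, dest"
    )),
    flights_dt %>%
      group_by(origin, dest) %>%
      summarise(num_flts = sum(!is.na(flight))) %>%
      ungroup() %>%
      as.data.frame()
  )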

I just committed that change (7aa7dbb), but now I'm getting some strange test errors. I'll try to debug. https://travis-ci.org/github/ianmcook/tidyquery/builds/757640217#L3025

ianmcook commented Feb 5, 2021

Here's a minimal reprex of the type of failure I'm seeing after adding as.data.frame() in 7aa7dbb:

If I set up a testthat file like this:

library(dtplyr)
library(dplyr)

iris_dt <- lazy_dt(iris)

test_that("testthat works with dtplyr", {
  iris_dt %>% select(Sepal.Length) %>% as.data.frame()
  succeed()
})

then I run devtools::test() and I get this error:

Error: could not find function "."

If I just run the code outside of test(), it runs with no error. But after test() runs, something changes in the environment that causes it to fail even when run outside of test().

hadley commented Feb 5, 2021

The traceback is:

9: `[.data.frame`(x, i, j)
8: `[.data.table`(`_DT12`, , .(Sepal.Length))
7: `_DT12`[, .(Sepal.Length)]
6: eval_tidy(quo) at tidyeval.R#11
5: dt_eval(x) at step.R#144
4: as.data.frame(dt_eval(x)) at step.R#144
3: as.data.frame.dtplyr_step(.)
2: as.data.frame(.)
1: iris_dt %>% select(Sepal.Length) %>% as.data.frame()

So [.data.table is dispatching to [.data.frame, which obviously doesn't know how to treat . in this context.

I don't know why this behaviour would have changed, but I suspect it's caused by this commit: tidyverse/dtplyr@5f8d51b.
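
Here's a simplified illustration of that dispatch difference (a standalone example using a plain data.frame rather than going through cedta(), but it produces the same error):

library(data.table)

dt <- as.data.table(iris)
df <- iris

dt[, .(Sepal.Length)]   # `[.data.table` interprets .() as list(): returns the column
df[, .(Sepal.Length)]   # `[.data.frame` evaluates .(Sepal.Length) as an ordinary call:
                        # Error: could not find function "."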

hadley commented Feb 5, 2021

Yeah, with options(datatable.verbose = TRUE), I see:

cedta decided 'tidyquery' wasn't data.table aware. Here is call stack with [[1L]] applied:
[[1]]
`%>%`

[[2]]
as.data.frame

[[3]]
as.data.frame.dtplyr_step

[[4]]
as.data.frame

[[5]]
dt_eval

[[6]]
eval_tidy

[[7]]
`[`

[[8]]
`[.data.table`

[[9]]
cedta

I can fix this in dtplyr.

hadley commented Feb 5, 2021

The problem is that lazy_dt() captures the calling environment, and that's what data.table uses to figure out if it's enabled or not. It works in dtplyr itself because it imports data.table, but tidyquery does not. (And it shouldn't need to; I just need to figure out how to tell data.table that we're good.)

hadley commented Feb 5, 2021

Looking at the code in https://github.com/Rdatatable/data.table/blob/master/R/cedta.R, it's not clear how to force this to be true; any approach is going to be hacky. The easiest fix would be for you to include .datatable.aware <- TRUE somewhere in your package while I think about how to handle this.
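
For reference, a minimal sketch of that workaround (the file name is arbitrary; any .R file in the package works):

# R/zzz.R in tidyquery (or any other .R file in the package)
# Tells data.table's cedta() check that this package knowingly works with
# data.tables even though it doesn't import data.table, so `[` keeps
# data.table semantics when dtplyr evaluates queries on tidyquery's behalf.
.datatable.aware <- TRUE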

ianmcook commented Feb 6, 2021

Fixed in 7aa7dbb and 7cc5c71. Version 0.2.2 on its way to CRAN.
