slow performance of mutate.() with by grouping for many groups #82
Comments
Can you install the dev version and let me know if you still have these issues? I actually found an issue that was slowing down pretty much every function in
devtools::install_github("markfairbanks/tidytable")
Here were the times I got when I ran your example. One note - I made the dataset a
library(tidytable, warn.conflicts = FALSE)
library(tidyverse, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)

rows <- 1000000
ids <- 50000

# simple data set with many different IDs and 1M rows, 3 cols
# (tibble() has no stringsAsFactors argument; passing it would add an extra column, so it is dropped here)
df <- tibble(
  id = as.character(sample(1:ids, size = rows, replace = TRUE)), # using character variable as ID
  bike = sample(c("mountain", "allround", "road", "bmx"), size = rows, replace = TRUE),
  year = sample(1980:2020, size = rows, replace = TRUE)
)
dt <- as.data.table(df)

results <- bench::mark(
  # first run with tidytable
  tidytable = dt %>%
    # sort by case id, time and item
    tidytable::arrange.(id, year, bike) %>%
    # calculate new item number variable, grouped by case id
    tidytable::mutate.(bike_number = as.integer(tidytable::row_number.()), by = id),
  # second run with dplyr
  dplyr = df %>%
    # sort by case id, time and item
    dplyr::arrange(id, year, bike) %>%
    # calculate new item number variable, grouped by case id
    dplyr::group_by(id) %>%
    dplyr::mutate(bike_number = as.integer(dplyr::row_number())) %>%
    dplyr::ungroup(),
  # third run with data.table
  data.table = data.table::copy(dt) %>%
    # sort by case id, time and item
    .[base::order(nchar(.[, id]), .[, id], .[, year], .[, bike], method = "radix")] %>%
    # calculate new item number variable, grouped by case id
    .[, bike_number := as.integer(seq_len(.N)), by = .[, id]] %>%
    .[],
  iterations = 3, filter_gc = FALSE, check = FALSE
)
ggplot2::autoplot(results)
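A side note on `check = FALSE`: with that setting `bench::mark()` does not compare the results of the expressions, so the benchmark alone does not show that the approaches compute the same numbering. Below is a minimal sketch of one way to check this on a small data set, using data.table only. The data set, the object names (`small`, `a`, `b`, `same`) and the use of `rowid()` as a second way to number rows are assumptions made for this illustration; they are not part of the benchmark above.

```r
library(data.table)

set.seed(1)
# Small made-up data set with the same three columns as the benchmark data
small <- data.table(
  id   = as.character(sample(1:20, size = 200, replace = TRUE)),
  bike = sample(c("mountain", "allround", "road", "bmx"), size = 200, replace = TRUE),
  year = sample(1980:2020, size = 200, replace = TRUE)
)

# Approach A: order as in the benchmark (shorter ids first), then number rows per id
a <- small[order(nchar(id), id, year, bike, method = "radix")]
a[, bike_number := seq_len(.N), by = id]

# Approach B: sort a copy with setorder(), then number rows per id with rowid()
b <- copy(small)
setorder(b, id, year, bike)
b[, bike_number := rowid(id)]

# Bring both results into a common row order before comparing them
setorder(a, id, year, bike, bike_number)
setorder(b, id, year, bike, bike_number)
same <- isTRUE(all.equal(a, b, check.attributes = FALSE))
same
```

The two tables are sorted differently across ids, but the numbering only depends on the order within each id, which is why both are re-sorted before the comparison.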
With the current dev version I can reproduce the much improved results. I added some more scenarios for performance tests. Are you planning to release this improvement soon?
#performance test with various scenarios
library(tidytable, warn.conflicts = FALSE)
library(tidyverse, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)

set.seed(2048)

results_new <- bench::press(
  rows = c(100000, 1000000, 1e7),
  ids = c(1000, 10000, 100000),
  {
    df <- tibble(
      id = as.character(sample(1:ids, size = rows, replace = TRUE)), # using character variable as ID
      bike = sample(c("mountain", "allround", "road", "bmx"), size = rows, replace = TRUE),
      year = sample(1980:2020, size = rows, replace = TRUE)
    )
    dt <- as.data.table(df)
    bench::mark(
      # first run with tidytable
      tidytable = dt %>%
        # sort by case id, time and item
        tidytable::arrange.(id, year, bike) %>%
        # calculate new item number variable, grouped by case id
        tidytable::mutate.(bike_number = as.integer(tidytable::row_number.()), by = id),
      # second run with dplyr
      dplyr = df %>%
        # sort by case id, time and item
        dplyr::arrange(id, year, bike) %>%
        # calculate new item number variable, grouped by case id
        dplyr::group_by(id) %>%
        dplyr::mutate(bike_number = as.integer(dplyr::row_number())) %>%
        dplyr::ungroup(),
      # third run with data.table
      data.table = data.table::copy(dt) %>%
        # sort by case id, time and item
        .[base::order(nchar(.[, id]), .[, id], .[, year], .[, bike], method = "radix")] %>%
        # calculate new item number variable, grouped by case id
        .[, bike_number := as.integer(seq_len(.N)), by = .[, id]] %>%
        .[],
      iterations = 3, filter_gc = FALSE, check = FALSE
    )
  }
)
#> Running with:
#> rows ids
#> 1 100000 1000
#> 2 1000000 1000
#> 3 10000000 1000
#> 4 100000 10000
#> 5 1000000 10000
#> 6 10000000 10000
#> 7 100000 100000
#> 8 1000000 100000
#> 9 10000000 100000
ggplot2::autoplot(results_new)
Created on 2020-06-09 by the reprex package (v0.3.0)
Session info
devtools::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 3.6.3 (2020-02-29)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate German_Germany.1252
#> ctype German_Germany.1252
#> tz Europe/Berlin
#> date 2020-06-09
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.1)
#> backports 1.1.7 2020-05-13 [1] CRAN (R 3.6.3)
#> beeswarm 0.2.3 2016-04-25 [1] CRAN (R 3.6.0)
#> bench 1.1.1 2020-01-13 [1] CRAN (R 3.6.2)
#> blob 1.2.1 2020-01-20 [1] CRAN (R 3.6.3)
#> broom 0.5.6 2020-04-20 [1] CRAN (R 3.6.3)
#> callr 3.4.3 2020-03-28 [1] CRAN (R 3.6.3)
#> cellranger 1.1.0 2016-07-27 [1] CRAN (R 3.6.1)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.3)
#> colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.6.1)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.1)
#> curl 4.3 2019-12-02 [1] CRAN (R 3.6.1)
#> data.table * 1.12.9 2020-03-04 [1] Github (Rdatatable/data.table@b1b1832)
#> DBI 1.1.0 2019-12-15 [1] CRAN (R 3.6.1)
#> dbplyr 1.4.4 2020-05-27 [1] CRAN (R 3.6.3)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.1)
#> devtools 2.3.0 2020-04-10 [1] CRAN (R 3.6.3)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.2)
#> dplyr * 1.0.0 2020-05-29 [1] CRAN (R 3.6.3)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 3.6.3)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.1)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.2)
#> farver 2.0.3 2020-01-16 [1] CRAN (R 3.6.2)
#> forcats * 0.5.0 2020-03-01 [1] CRAN (R 3.6.3)
#> fs 1.4.1 2020-04-04 [1] CRAN (R 3.6.3)
#> generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.1)
#> ggbeeswarm 0.6.0 2017-08-07 [1] CRAN (R 3.6.3)
#> ggplot2 * 3.3.1 2020-05-28 [1] CRAN (R 3.6.3)
#> glue 1.4.1 2020-05-13 [1] CRAN (R 3.6.3)
#> gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.1)
#> haven 2.3.1 2020-06-01 [1] CRAN (R 3.6.3)
#> highr 0.8 2019-03-20 [1] CRAN (R 3.6.1)
#> hms 0.5.3 2020-01-08 [1] CRAN (R 3.6.2)
#> htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.1)
#> httr 1.4.1 2019-08-05 [1] CRAN (R 3.6.1)
#> jsonlite 1.6.1 2020-02-02 [1] CRAN (R 3.6.2)
#> knitr 1.28 2020-02-06 [1] CRAN (R 3.6.2)
#> lattice 0.20-38 2018-11-04 [2] CRAN (R 3.6.3)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.6.3)
#> lubridate 1.7.8 2020-04-06 [1] CRAN (R 3.6.3)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.1)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.1)
#> mime 0.9 2020-02-04 [1] CRAN (R 3.6.2)
#> modelr 0.1.8 2020-05-19 [1] CRAN (R 3.6.3)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.1)
#> nlme 3.1-144 2020-02-06 [2] CRAN (R 3.6.3)
#> pillar 1.4.4 2020-05-05 [1] CRAN (R 3.6.3)
#> pkgbuild 1.0.8 2020-05-07 [1] CRAN (R 3.6.3)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.1)
#> pkgload 1.1.0 2020-05-29 [1] CRAN (R 3.6.3)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.6.2)
#> processx 3.4.2 2020-02-09 [1] CRAN (R 3.6.2)
#> profmem 0.5.0 2018-01-30 [1] CRAN (R 3.6.2)
#> ps 1.3.3 2020-05-08 [1] CRAN (R 3.6.3)
#> purrr * 0.3.4 2020-04-17 [1] CRAN (R 3.6.3)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.1)
#> Rcpp 1.0.4.6 2020-04-09 [1] CRAN (R 3.6.3)
#> readr * 1.3.1 2018-12-21 [1] CRAN (R 3.6.1)
#> readxl 1.3.1 2019-03-13 [1] CRAN (R 3.6.1)
#> remotes 2.1.1 2020-02-15 [1] CRAN (R 3.6.2)
#> reprex 0.3.0 2019-05-16 [1] CRAN (R 3.6.1)
#> rlang 0.4.6 2020-05-02 [1] CRAN (R 3.6.3)
#> rmarkdown 2.2 2020-05-31 [1] CRAN (R 3.6.3)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.1)
#> rvest 0.3.5 2019-11-08 [1] CRAN (R 3.6.1)
#> scales 1.1.1 2020-05-11 [1] CRAN (R 3.6.3)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.1)
#> stringi 1.4.6 2020-02-17 [1] CRAN (R 3.6.2)
#> stringr * 1.4.0 2019-02-10 [1] CRAN (R 3.6.1)
#> testthat 2.3.2 2020-03-02 [1] CRAN (R 3.6.3)
#> tibble * 3.0.1 2020-04-20 [1] CRAN (R 3.6.3)
#> tidyr * 1.1.0 2020-05-20 [1] CRAN (R 3.6.3)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 3.6.3)
#> tidytable * 0.5.1.9 2020-06-09 [1] Github (markfairbanks/tidytable@c133581)
#> tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 3.6.1)
#> usethis 1.6.1 2020-04-29 [1] CRAN (R 3.6.3)
#> utf8 1.1.4 2018-05-24 [1] CRAN (R 3.6.1)
#> vctrs 0.3.1 2020-06-05 [1] CRAN (R 3.6.3)
#> vipor 0.4.5 2017-03-22 [1] CRAN (R 3.6.3)
#> withr 2.2.0 2020-04-20 [1] CRAN (R 3.6.3)
#> xfun 0.14 2020-05-20 [1] CRAN (R 3.6.3)
#> xml2 1.3.2 2020-04-23 [1] CRAN (R 3.6.3)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.2)
#>
#> [1] C:/Users/usr/Documents/R/win-library/3.6
#> [2] C:/Program Files/R/R-3.6.3/library
Awesome, good to hear that the performance issue is fixed.
Yep - my goal is to submit to CRAN this weekend. I'll keep you updated and let you know when it's accepted.
@marianschmidt FYI the CRAN submission of v0.5.2 has been put on hold because CRAN changed their documentation requirements sometime in the past week or two. My initial submission a couple of days ago was rejected because of this change. Once r-lib/roxygen2#1108 is fixed I'll submit to CRAN again. Quite a few packages are having the same problem, but as far as I can tell it will be fixed in the next day or two.
@markfairbanks Thanks a lot for your efforts. Feel free to close this issue whenever convenient for you; I consider it closed.
@marianschmidt Glad the package is working out well! As far as the And if you want to specify that you are using a variable from the global environment, you can just unquote it using
library(tidytable, warn.conflicts = FALSE)
library(tidyverse, warn.conflicts = FALSE)
test_df <- data.table(x = c(1, 1, 1))
x <- 5
# Using tidytable
test_df %>%
mutate.(data_x_plus_global_x = .SD$x + !!x)
#> x data_x_plus_global_x
#> 1: 1 6
#> 2: 1 6
#> 3: 1 6
# Using the tidyverse (version 1)
test_df %>%
mutate(data_x_plus_global_x = .data$x + !!x)
#> x data_x_plus_global_x
#> 1: 1 6
#> 2: 1 6
#> 3: 1 6
# Using the tidyverse (version 2)
test_df %>%
mutate(data_x_plus_global_x = .data$x + .env$x)
#> x data_x_plus_global_x
#> 1: 1 6
#> 2: 1 6
#> 3: 1 6
@marianschmidt FYI there is a small API change - the
library(tidytable, warn.conflicts = FALSE)
test_df <- data.table(x = 1:3, y = c("a", "a", "b"))
# Using `by` causes a warning
test_df %>%
summarize.(avg_x = mean(x), by = y)
#> Warning: The `by` argument of `summarize.()` is deprecated as of tidytable 0.5.2.
#> Please use the `.by` argument instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
#> y avg_x
#> 1: a 1.5
#> 2: b 3.0
# Using `.by` works normally
test_df %>%
summarize.(avg_x = mean(x), .by = y)
#> y avg_x
#> 1: a 1.5
#> 2: b 3.0
As somebody who likes the tidyverse syntax and needs data.table's performance while struggling with its modify-by-reference semantics, I was very happy to find tidytable. Thanks for this great package!
I am working with large datasets (1-10M rows, 50-500 cols) that often require mutating of grouped data.
In this scenario, however, I found tidytable::mutate.() to be much slower than the data.table equivalent, and still considerably slower than the dplyr alternative.
Created on 2020-06-08 by the reprex package (v0.3.0)
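For readers who only want to see the operation being timed, here is a minimal sketch of the grouped mutate in plain data.table: sort, then number the rows within each id. The tiny data set and its values are made up for this illustration and are not the benchmark data.

```r
library(data.table)

# Tiny illustrative data set (values are made up for this sketch)
dt <- data.table(
  id   = c("b", "a", "b", "a", "a"),
  year = c(2001L, 1999L, 1990L, 2005L, 1985L)
)

# Sort by id and year; note that setorder() reorders dt by reference
setorder(dt, id, year)

# Number the rows within each id; := also modifies dt by reference
dt[, bike_number := seq_len(.N), by = id]

dt$bike_number
# → 1 2 3 1 2
```

Both `setorder()` and `:=` change `dt` in place, which is the modify-by-reference behaviour mentioned above; working on `copy(dt)` avoids touching the original object.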