Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Working with data.table #6

Open
carbonmetrics opened this issue Nov 4, 2023 · 11 comments · May be fixed by matloff/regtools#37
Open

Working with data.table #6

carbonmetrics opened this issue Nov 4, 2023 · 11 comments · May be fixed by matloff/regtools#37

Comments

@carbonmetrics
Copy link

Great book!
Working with data.table does not always work:

pacman::p_load(data.table, regtools, qeML)

data(day1)
setDT(day1)

# page 17 , 1
data(mlb)
setDT(mlb)
w = mlb[, .(Height, Weight, Age)]
z = qeKNN(w, "Weight")
# Warning message:
#   In eval(tmp, parent.frame()) : "data" converted to data frame
predict(z, data.table(Height = 72, Age = 24)) # does not work
# Error in `[.data.table`(dfr, , i) : 
#   j (the 2nd argument inside [...]) is a single symbol but column name 'i' is not found. Perhaps you intended DT[, ..i]. This difference to data.frame is deliberate and explained in FAQ 1.1.

# page 17 , 2
y = mlb[, 3:6]
knnout1 = qeKNN(y, "Weight", k = 25)
# Warning message:
#   In eval(tmp, parent.frame()) : "data" converted to data frame
predict(knnout1, data.table(Height = 72, Age = 24, Position = "Catcher")) # works!

In view of the much better performance of data.table on larger datasets I'd rather avoid data.frames and tibbles.

> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 14.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] qeML_1.1          tufte_0.13        rmarkdown_2.25    regtools_1.7.0    gtools_3.9.4      FNN_1.1.3.2       data.table_1.14.8

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.11         compiler_4.2.1      partools_1.1.6      iterators_1.0.14    tools_4.2.1         rpart_4.1.21        digest_0.6.33       evaluate_0.22       lattice_0.22-5      rlang_1.1.1         Matrix_1.5-4.1      foreach_1.5.2      
[13] cli_3.6.1           mlapi_0.1.1         rstudioapi_0.15.0   parallel_4.2.1      RhpcBLASctl_0.23-42 gbm_2.1.8.1         mvtnorm_1.2-3       xfun_0.40           fastmap_1.1.1       knitr_1.45          text2vec_0.6.3      pdist_1.2.1        
[25] glmnet_4.1-8        grid_4.2.1          rje_1.12.1          grf_2.3.1           R6_2.5.1            FOCI_0.1.3          survival_3.5-7      pacman_0.5.1        carData_3.0-5       lgr_0.4.4           car_3.1-2           codetools_0.2-19   
[37] htmltools_0.5.6.1   MASS_7.3-60         splines_4.2.1       rpart.plot_3.1.1    abind_1.4-5         float_0.3-1         shape_1.4.6         rsparse_0.5.1       sandwich_3.0-2      crayon_1.5.2        zoo_1.8-12     
@jangorecki
Copy link

@jangorecki jangorecki linked a pull request Nov 21, 2023 that will close this issue
@jangorecki
Copy link

jangorecki commented Nov 21, 2023

As for the error, fix has been submitted in regtools repo. matloff/regtools#37

As for the data.table support inside qeML/regtools. I think it is good to submit issue asking particularly for that. Code in this report works fine (although with informative warning) after proposed fix, so IMO report could be closed after merging patch.

#...
z = qeKNN(w, "Weight")
#Warning message:
#In eval(tmp, parent.frame()) : "data" converted to data frame
predict(z, data.table(Height = 72, Age = 24))
#       [,1]
#[1,] 184.28

@matloff
Copy link
Owner

matloff commented Nov 21, 2023 via email

@matloff
Copy link
Owner

matloff commented Nov 22, 2023

I'm not sure what fix you are referring to.

The basic problem on the qeML level (regtools is a different story) seems to be that data.table's and tibbles are converted to data frames in qeKNN but not in predict.qeKNN. That is easily fixed.

As I said, I am a big fan of data.table (and of its creator, Matt Dowle). Yes, data.table does have superior performance on big data. However, in something like qeKNN, the main performance issue is the calculation of nearest neighbors, which operates on matrix objects, so it's not clear that a change at the regtools level would help.

Please let me know your thoughts on this.

@jangorecki
Copy link

My change in regtools is about resolving user error, not about using data.table inside which would be a bigger change, many places would have to be edited.

@matloff
Copy link
Owner

matloff commented Nov 22, 2023

My tentative plan is to put in checks in all the predict() functions in qeML (not just predict.qeKNN). Let me know soon if you see any problem with this. Thanks very much for your very valuable feedback.

@jangorecki
Copy link

@matloff yes, that sounds even better. As I don't know codebase in your project I went straight away where error traceback leaded me and fixed only that single error reported here.

@carbonmetrics
Copy link
Author

I'm not sure what fix you are referring to.

The basic problem on the qeML level (regtools is a different story) seems to be that data.table's and tibbles are converted to data frames in qeKNN but not in predict.qeKNN. That is easily fixed.

As I said, I am a big fan of data.table (and of its creator, Matt Dowle). Yes, data.table does have superior performance on big data. However, in something like qeKNN, the main performance issue is the calculation of nearest neighbors, which operates on matrix objects, so it's not clear that a change at the regtools level would help.

Please let me know your thoughts on this.

data.table is much faster than the tidyverse, the code is less verbose, the api is stable, and your environment is not flooded with function names and dependencies. I therefore apply restrictions on everything tidyverse, even while i find many of the functions useful. So, even if the gains of data.table would be small in certain situations, I still would have no data.frames, tibbles or whatnot to work with, simply because I don't use them. Which I think is the natural state of things for data.table users.

@jangorecki
Copy link

Sorry for the offtopic... In case you missed it there is data.table users survey open till 1st December. Feel invited to be heard Rdatatable/data.table#5704

@matloff
Copy link
Owner

matloff commented Nov 29, 2023

Thanks re the data.table survey, just posted to my Twitter/X account and will do so on my R/stat blog (https://matloff.wordpress.com/) as well.

@matloff
Copy link
Owner

matloff commented Nov 29, 2023

Again, I am a huge supporter of data.table and its creator, Matt Dowle, and am a big critic of the Tidyverse (https://github.com/matloff/TidyverseSkeptic). In developing qeML, though, I needed to aim for the "lowest common denominator," i.e. data frames. I also needed this to interface to the packages that qeML makes use of. My thinking is that users of data.table's or tibbles would not be much burdened to convert back when qeML returns a data frame. If you do feel it is a burden, my apologies.

For the future, though, an intriguing idea would be to make available an R environment variable that records whether the user is coming from data.table, Tidy or R. It would play the same role as the current R.version variable recording the ambient OS being used. So, before returning a data.frame, qeML functions would check this environment variable and do the appropriate conversion if needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants