Run 2E9 rows in-ram on EC2 #71

Closed
mattdowle opened this issue Jan 22, 2019 · 8 comments

Comments

@mattdowle
Contributor

mattdowle commented Jan 22, 2019

db-bench runs on a dedicated machine (provided by H2O) which has 125GB of RAM, so 2E9 rows won't fit in RAM: the data itself takes 100GB, which leaves too little working memory. This machine does have a large, fast disk though, and it's much higher priority to test out-of-RAM than it is to test bigger RAM; i.e. adding a 500GB (1E10) test (#39) on the same 125GB RAM db-bench machine, where spark and pydatatable will work but the other products will fail.
However, for completeness, it would still be nice to know if pandas works now on 2E9 on a node with 250GB RAM (it didn't 4 years ago but data.table did).
This issue was moved here from Rdatatable/data.table#823
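
The sizes quoted above can be checked with a little arithmetic. This is only a minimal sketch: it assumes decimal gigabytes and a constant on-disk row width, neither of which the comment states, and the function names are made up for illustration. It says nothing about the working memory that any particular product needs on top of the raw data.

```python
# Rough sizing arithmetic for the figures quoted in the comment above.
# Assumption: 1 GB = 10**9 bytes (the comment does not say GB vs GiB).
GB = 10**9


def bytes_per_row(data_gb, rows):
    """Average row width implied by a dataset size and a row count."""
    return data_gb * GB / rows


def dataset_gb(rows, row_bytes):
    """Dataset size in GB for a given row count and average row width."""
    return rows * row_bytes / GB


def headroom_gb(ram_gb, data_gb):
    """RAM left over once the raw data is loaded (working memory budget)."""
    return ram_gb - data_gb


row_bytes = bytes_per_row(100, 2 * 10**9)  # 2E9 rows stored in 100GB
print(row_bytes)                           # 50.0 bytes per row
print(dataset_gb(10**10, row_bytes))       # 500.0, matching the 1E10 test
print(headroom_gb(125, 100))               # 25 GB left on the db-bench machine
print(headroom_gb(250, 100))               # 150 GB left on a 250GB node
```

With these assumptions, the 125GB machine has only 25GB of headroom for 2E9 rows, while a 250GB node leaves about 150GB for intermediate results.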

@jangorecki
Contributor

Rdatatable/data.table#2956 can also be confirmed as resolved when running the 2E9 benchmark.

@jangorecki
Contributor

blocked by tidyverse/dplyr#4334 as of now

@jangorecki
Contributor

tidyverse/dplyr#4334 has recently been resolved; once it lands on CRAN we should be good to proceed with this issue.

@jangorecki
Contributor

We can wait for dplyr 1.0 to be released, as it seems to be the next major version. pandas recently got its 1.0 release as well.

@jangorecki
Contributor

Need to postpone that to dplyr 1.1.0. Performance polishing was shifted to the 1.1.0 release, and dplyr 1.0 is expected to be slower.

@jangorecki
Contributor

It is now blocked on tidyverse/dplyr#5291

@jangorecki jangorecki added this to the 2.1.0 milestone Nov 21, 2020
@jangorecki
Contributor

The same machine as in 2014 was used, with 244GB of memory, running the most recent stable versions as of today.

  • data.table 1.13.2, R 4.0.3
  • dplyr 1.0.2, R 4.0.3
  • pandas 1.1.4, python 3.6

Minor changes to 2014's script:

  • data.table: added setDTthreads(0L)
  • dplyr: updated the summarise call in q4 and q5; updated group_by in all questions to use .drop=TRUE

Results:

@jangorecki jangorecki removed this from the 2.1.0 milestone Nov 22, 2020
@jangorecki
Contributor

data.table got the regression fixed in Rdatatable/data.table#4297
I retried the tests defined in this issue.
The dplyr version hasn't changed since my last attempt, so I skipped retrying it.
pandas was upgraded to 1.1.5.

Results:

  • data.table now finishes the benchmark script successfully, timings pasted here
  • pandas 1.1.5 is still being killed, same as on 1.1.4

Tmonster pushed a commit to Tmonster/db-benchmark that referenced this issue Jun 6, 2024