Run 2E9 rows in-ram on EC2 #71

mattdowle · 2019-01-22T17:18:11Z

db-bench runs on a dedicated machine (provided by H2O) which has 125GB of RAM, so 2E9 won't fit in-ram there: the data itself takes 100GB, which leaves too little working memory. This machine does have a fast, large disk, though, and it's a much higher priority to test out-of-ram than to test bigger RAM; i.e. adding a 500GB (1E10) test (#39) on the same 125GB RAM db-bench machine, where spark and pydatatable will work but the other products will fail.
However, for completeness, it would still be nice to know whether pandas now works on 2E9 on a node with 250GB RAM (it didn't 4 years ago, but data.table did).
This issue was moved here from Rdatatable/data.table#823
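
The sizing above can be checked with a quick back-of-envelope calculation. The sketch below is illustrative only: the helper names are hypothetical, it assumes decimal gigabytes (the thread does not say GB vs GiB), and it uses only the figures quoted in this comment (100GB for 2E9 rows, 125GB and 250GB of RAM).

```python
# Back-of-envelope sizing from the figures quoted in this comment.
# Helper names are hypothetical; GB is assumed to be decimal (1e9 bytes).
GB = 1e9


def bytes_per_row(data_gb: float, rows: float) -> float:
    """Average in-memory bytes per row implied by a dataset size."""
    return data_gb * GB / rows


def projected_gb(rows: float, row_bytes: float) -> float:
    """Dataset size in GB for a given row count at a fixed row width."""
    return rows * row_bytes / GB


def headroom_gb(ram_gb: float, data_gb: float) -> float:
    """RAM left for working memory once the data itself is loaded."""
    return ram_gb - data_gb


row_bytes = bytes_per_row(100, 2e9)          # 100GB for 2E9 rows
print(row_bytes)                             # 50.0 bytes per row
print(projected_gb(1e10, row_bytes))         # 500.0, matching the 1E10 test size
print(headroom_gb(125, 100))                 # 25 GB left on the db-bench machine
print(headroom_gb(250, 100))                 # 150 GB left on a 250GB node
```

This only counts the data at rest; how much working memory a given query needs on top of that depends on the product and is not estimated here.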

jangorecki · 2019-01-23T06:41:06Z

Rdatatable/data.table#2956 can also be confirmed as resolved when running the 2E9 benchmark.

jangorecki · 2019-08-21T14:59:21Z

Blocked by tidyverse/dplyr#4334 as of now.

jangorecki · 2020-01-12T13:21:26Z

tidyverse/dplyr#4334 has recently been resolved; once the fix lands on CRAN we should be good to proceed with this issue.

jangorecki · 2020-02-26T04:54:38Z

We can wait for dplyr 1.0 to be released, as it seems to be the next major version. Pandas recently got its 1.0 release as well.

jangorecki · 2020-05-13T09:16:08Z

Need to postpone this to dplyr 1.1.0. Performance polishing was shifted to the 1.1.0 release, and dplyr 1.0 is expected to be slower.

jangorecki · 2020-11-15T18:13:34Z

It is now blocked on tidyverse/dplyr#5291

jangorecki · 2020-11-22T11:27:13Z

The same machine as in 2014 was used, with 244GB of memory, and the most recent stable versions as of today.

data.table 1.13.2, R 4.0.3
dplyr 1.0.2, R 4.0.3
pandas 1.1.4, python 3.6

Minor changes to 2014's script:

data.table: added setDTthreads(0L)
dplyr: updated the summarise function in q4 and q5; updated group_by in all questions for .drop=TRUE

Results:

data.table hit an internal error during the first query: regression in big grouping, Rdatatable/data.table#4818
dplyr hit an internal error during the first query: Internal error: Dictionary is full! (tidyverse/dplyr#5291)
The pandas python process got killed while creating the 2e9 dataset, so it couldn't even attempt the first query.

jangorecki · 2020-12-10T17:22:17Z

data.table got the regression fixed in Rdatatable/data.table#4297.
I retried the tests defined in this issue.
The dplyr version hasn't changed since my last attempt, so I skipped retrying it.
pandas got upgraded to 1.1.5.

Results:

data.table now finishes the benchmark script successfully, timings pasted here
pandas 1.1.5 is still being killed, same as on 1.1.4

mattdowle mentioned this issue Jan 22, 2019

Rerun pandas 2E9 benchmark from dev Rdatatable/data.table#823

Closed

jangorecki added data.table dplyr pandas labels Oct 16, 2019

jangorecki added this to the 2.1.0 milestone Nov 21, 2020

jangorecki removed this from the 2.1.0 milestone Nov 22, 2020

jangorecki closed this as completed Dec 10, 2020

Tmonster pushed a commit to Tmonster/db-benchmark that referenced this issue Jun 6, 2024

Use consistent spelling for 'DuckDB Labs' (h2oai#71)

d31c678
