join performance regression #2775

Open
jangorecki opened this issue Nov 26, 2020 · 0 comments
Labels: join, performance (Issues focused on the speed of execution of various datatable functions.)

Comments

jangorecki commented Nov 26, 2020

The regression was introduced between commits a45cc50 and 31fbad7 (a45cc50...31fbad7).

Reproduced with db-benchmark:

cd db-benchmark
source ./pydatatable/py-pydatatable/bin/activate
# baseline: install commit a45cc50 and run the join task at 1e7 and 1e8 rows
pip install --upgrade git+https://github.com/h2oai/datatable.git@a45cc503494571bfbf0feb00ece03ad0bab16dfc
./_launcher/solution.R --solution=pydatatable --task=join --nrow=1e7 --quiet=true --out=dt-time.csv
sleep 5
./_launcher/solution.R --solution=pydatatable --task=join --nrow=1e8 --quiet=true --out=dt-time.csv
sleep 5
# install commit 31fbad7 and repeat the same runs
pip install --upgrade git+https://github.com/h2oai/datatable.git@31fbad7b471113e30f34de8cc2b321c76f926c32
./_launcher/solution.R --solution=pydatatable --task=join --nrow=1e7 --quiet=true --out=dt-time.csv
sleep 5
./_launcher/solution.R --solution=pydatatable --task=join --nrow=1e8 --quiet=true --out=dt-time.csv
deactivate
# compare first-run timings (seconds) of the two commits in R
R
library(data.table)
fctr = function(x) factor(x, levels=unique(x)) ## this retains order for strings
d = fread("dt-time.csv")
## keep the first run only; shorten the git hash to 7 characters
dd = d[run==1L, .(time_sec, in_rows, question=fctr(question), git=fctr(substr(git,1,7)))]
## one row per (in_rows, question), one column per commit
dcast(dd, in_rows+question~git, value.var="time_sec")
      in_rows               question a45cc50 31fbad7
 1:  10000000     small inner on int   7.419  10.852
 2:  10000000    medium inner on int  12.999  13.902
 3:  10000000    medium outer on int   8.128   6.900
 4:  10000000 medium inner on factor  14.145  10.085
 5:  10000000       big inner on int  16.316  27.621
 6: 100000000     small inner on int  39.371 154.445
 7: 100000000    medium inner on int  33.088 208.082
 8: 100000000    medium outer on int  24.478  94.655
 9: 100000000 medium inner on factor  33.831 163.471
10: 100000000       big inner on int 114.547 153.472
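
To make the size of the regression easier to read, the timings above can be turned into slowdown ratios (time at 31fbad7 divided by time at a45cc50). The following is a minimal standalone sketch in plain Python, not part of db-benchmark: the numbers are copied by hand from the table, and the names `timings`, `slowdown`, `ratios` and `worst` are made up for illustration.

```python
# First-run timings in seconds, copied from the table above:
# (in_rows, question) -> (time at a45cc50, time at 31fbad7)
timings = {
    (10_000_000, "small inner on int"): (7.419, 10.852),
    (10_000_000, "medium inner on int"): (12.999, 13.902),
    (10_000_000, "medium outer on int"): (8.128, 6.900),
    (10_000_000, "medium inner on factor"): (14.145, 10.085),
    (10_000_000, "big inner on int"): (16.316, 27.621),
    (100_000_000, "small inner on int"): (39.371, 154.445),
    (100_000_000, "medium inner on int"): (33.088, 208.082),
    (100_000_000, "medium outer on int"): (24.478, 94.655),
    (100_000_000, "medium inner on factor"): (33.831, 163.471),
    (100_000_000, "big inner on int"): (114.547, 153.472),
}


def slowdown(before, after):
    """Return after/before; values above 1 mean the newer commit is slower."""
    return after / before


# Slowdown ratio for every (in_rows, question) pair.
ratios = {key: slowdown(*pair) for key, pair in timings.items()}

for (in_rows, question), ratio in ratios.items():
    print(f"{in_rows:>11,d}  {question:<24s} {ratio:5.2f}x")

# The question with the largest slowdown between the two commits.
worst = max(ratios, key=ratios.get)
print("largest slowdown:", worst)
```

A ratio below 1 means the question became faster at 31fbad7, which the table shows for two of the 1e7 questions; every 1e8 question has a ratio above 1.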
jangorecki added the performance label Nov 26, 2020
jangorecki added the join label Jan 3, 2021