
BENCH: add some cases for join and merge ops from pandas #5021

Merged
merged 3 commits into modin-project:master
Oct 10, 2022

Conversation

jbrockmendel
Collaborator

@jbrockmendel jbrockmendel commented Sep 22, 2022

Broken off from #4988

Closes #5111

@jbrockmendel jbrockmendel requested a review from a team as a code owner September 22, 2022 16:23
Collaborator

@anmyachev anmyachev left a comment

@YarShev keeping the rest of the benchmarks is ok with me.

asv_bench/benchmarks/pandas/join_merge.py (outdated, resolved)
asv_bench/benchmarks/pandas/join_merge.py (outdated, resolved)
@@ -0,0 +1,261 @@
import string
Collaborator

We already have join, merge and concat benchmarks in place. Let's try to add at most 1-2 runs in each case. @YarShev
@anmyachev I think it's your call :)
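A note on why trimming cases matters: ASV runs the full cartesian product of all parameter values, so each extra value multiplies the run count. A toy illustration (hypothetical parameter values, not the PR's code):

```python
from itertools import product

# ASV executes each time_* method once per combination of params.
hows = ["left", "right", "inner", "outer"]
sorts = [True, False]

all_runs = list(product(hows, sorts))              # 4 * 2 = 8 runs per method
trimmed = list(product(["left", "inner"], sorts))  # 2 * 2 = 4 runs per method
```

Dropping two `how` values halves the CI time spent on every method in the class.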

Collaborator

I am in favor of leaving the functions you have named as is, since they are among the most important. A little redundancy can be dealt with later.

Now the main thing is to integrate these benchmarks into the Modin system. We need to do a couple of things for this:

  • add test runs of these benchmarks to CI
  • add runs of these benchmarks to our main benchmarking configuration
  • use the functionality from this pull request so that the code can work in two modes: TEST-#5014: Simplify adding new ASV benchmarks #5015
  • there is no automatic mechanism yet that triggers all computations, so getting correct times will need to be done manually (BenchmarkMode does not work for OmniSci and, besides, it can slow down the setup function)
  • in addition, we need to settle the dataset sizes. At very small sizes, where pandas executes a function in a few tens of milliseconds, Modin is unlikely to overtake pandas. However, if you increase the dataset, you may run out of memory on the machine that runs these validation tests in CI. Because of this, the already added benchmarks include functionality to change the data size for these situations, but it is still manual: we have to pick the sizes for both modes by hand.
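The last point — switching between a tiny CI-friendly dataset and a large one for real measurements — can be sketched roughly like this (the environment variable name and shapes are illustrative, not Modin's actual mechanism):

```python
import os

# Hypothetical size switch: "small" for CI smoke runs, "big" for real
# benchmarking. Names and shapes are illustrative only.
DATASET_SIZE = os.environ.get("ASV_DATASET_SIZE", "small")

JOIN_DATA_SIZE = {
    "small": [(1_000, 10)],     # runs in milliseconds, fits any CI machine
    "big": [(5_000_000, 10)],   # large enough for Modin to amortize overhead
}

def get_benchmark_shapes():
    """Return the list of (rows, cols) shapes for the current size mode."""
    return JOIN_DATA_SIZE[DATASET_SIZE]
```

Benchmark classes would then read their `params` from `get_benchmark_shapes()` instead of hard-coding shapes.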

Collaborator

I am in favor of leaving the functions you have named as is, since they are among the most important. A little redundancy can be dealt with later.

By "leaving as is" do you mean the current asv testing or the contents of this PR? :) I agree that these functions are very important, but it could be too heavy for our CI to run all the cases from this PR.

Collaborator

I mean the contents of this PR.

Collaborator

Okay, let's try and see how it goes, but I'm still not sure we want (at least for now) the outer join of the non-unique case and all cases from Merge (e.g. empty and cross).
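For reference, the cross case mentioned above is pandas' cartesian-product merge; a small sketch of what that benchmark would exercise:

```python
import pandas as pd

# how="cross" (available since pandas 1.2) pairs every left row with
# every right row, with no key columns involved.
left = pd.DataFrame({"a": [1, 2]})
right = pd.DataFrame({"b": ["x", "y", "z"]})
out = pd.merge(left, right, how="cross")
# out has 2 * 3 = 6 rows and columns ["a", "b"]
```

Because the output grows as the product of the input lengths, a cross-merge benchmark needs much smaller inputs than the keyed cases.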

Collaborator

Now the main thing is to integrate these benchmarks into the Modin system

@anmyachev, could you help @jbrockmendel to start doing this?

Collaborator

@YarShev ok I'll take care of it

self.temp = Series(1.0, index)[self.fracofday.index]

def time_join_non_unique_equal(self):
    self.fracofday * self.temp
Collaborator

What is tested here?

Collaborator Author

Alignment and multiplication.
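That is, the benchmark times index alignment plus the elementwise multiplication. A minimal sketch (assuming stock pandas) of what `*` does with equal non-unique indexes:

```python
import pandas as pd

# With identical (even non-unique) indexes, `*` verifies index equality
# and multiplies elementwise; with differing indexes it would first
# align the two Series (an index join) before multiplying.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "x", "y"])
b = pd.Series([10.0, 20.0, 30.0], index=["x", "x", "y"])
out = a * b  # elementwise: [10.0, 40.0, 90.0]
```

The "non_unique_equal" case therefore measures the fast path where alignment reduces to an equality check on duplicate-laden indexes.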

asv_bench/benchmarks/pandas/join_merge.py (outdated, resolved)
asv_bench/benchmarks/pandas/pandas_vb_common.py (outdated, resolved)
@YarShev
Collaborator

YarShev commented Sep 27, 2022

I suggest starting to integrate the test suite into our asv benchmark system (link).

@codecov

codecov bot commented Sep 27, 2022

Codecov Report

Merging #5021 (28518a0) into master (00a0fb9) will decrease coverage by 0.04%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #5021      +/-   ##
==========================================
- Coverage   84.57%   84.52%   -0.05%     
==========================================
  Files         256      257       +1     
  Lines       19347    19629     +282     
==========================================
+ Hits        16362    16592     +230     
- Misses       2985     3037      +52     
Impacted Files Coverage Δ
modin/experimental/sklearn/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
modin/experimental/xgboost/test/test_dmatrix.py 0.00% <0.00%> (-100.00%) ⬇️
modin/experimental/xgboost/test/test_xgboost.py 0.00% <0.00%> (-100.00%) ⬇️
modin/experimental/core/execution/ray/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
...n/experimental/sklearn/model_selection/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
...mental/sklearn/model_selection/train_test_split.py 0.00% <0.00%> (-100.00%) ⬇️
...tal/core/execution/ray/implementations/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
...tion/ray/implementations/pandas_on_ray/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
...n/ray/implementations/pandas_on_ray/io/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
...ecution/ray/implementations/pandas_on_ray/io/io.py 0.00% <0.00%> (-93.34%) ⬇️
... and 61 more


@anmyachev
Collaborator

I don't know how to deal with the license header that we add to each of our files, since this code is taken directly from pandas without modification.

@Garra1980
Collaborator

I don't know how to deal with the license header that we add to each of our files, since this code is taken directly from pandas without modification.

Good question, let’s find out

@anmyachev
Collaborator

I suggest renaming the folder from pandas to ported_from_pandas to make it clearer.

@YarShev
Collaborator

YarShev commented Sep 28, 2022

I don't know how to deal with the license header that we add to each of our files, since this code is taken directly from pandas without modification.

Good question, let’s find out

Do we want to keep both our current benchmarks and the ones ported from pandas? Maybe we should keep only one set?

@jbrockmendel jbrockmendel changed the title BENCH: port subset of benchmarks from pandas BENCH: port join_merge benchmarks from pandas Sep 28, 2022
@@ -0,0 +1,165 @@
import numpy as np
Collaborator

Let's integrate these benchmarks into our system so we can see the difference and the newly added cases.

Collaborator

Same as for concat.

asv_bench/benchmarks/benchmarks.py (resolved)
asv_bench/benchmarks/benchmarks.py (outdated, resolved)
asv_bench/benchmarks/benchmarks.py (outdated, resolved)
@YarShev
Collaborator

YarShev commented Sep 30, 2022

ci / lint (pydocstyle) failed, please take a look.

@anmyachev
Collaborator

@YarShev I started creating separate pull requests for each case added.

@YarShev
Collaborator

YarShev commented Oct 6, 2022

Please ping me when the changes are ready for review.

@anmyachev
Collaborator

Please ping me when the changes are ready for review.

@YarShev ready for review

asv_bench/benchmarks/benchmarks.py (outdated, resolved)
asv_bench/benchmarks/benchmarks.py (outdated, resolved)
asv_bench/benchmarks/benchmarks.py (outdated, resolved)
asv_bench/benchmarks/benchmarks.py (outdated, resolved)
asv_bench/benchmarks/benchmarks.py (outdated, resolved)
asv_bench/benchmarks/utils/common.py (resolved)
@YarShev
Collaborator

YarShev commented Oct 7, 2022

ci / lint (pydocstyle) failed, please take a look.

still failing

@anmyachev
Collaborator

ci / lint (pydocstyle) failed, please take a look.

still failing

@YarShev ready for review

…andas

Signed-off-by: Myachev <anatoly.myachev@intel.com>
@anmyachev anmyachev changed the title BENCH: port join_merge benchmarks from pandas BENCH: add some cases for join and merge ops from pandas Oct 10, 2022
Signed-off-by: Myachev <anatoly.myachev@intel.com>
@anmyachev
Collaborator

@YarShev ready for review

asv_bench/benchmarks/benchmarks.py (outdated, resolved)
asv_bench/benchmarks/benchmarks.py (outdated, resolved)
@@ -127,12 +128,56 @@ def time_join(self, shapes, how, sort):
    execute(self.df1.join(self.df2, how=how, lsuffix="left_", sort=sort))


class TimeJoinStringIndex:
    param_names = ["shapes", "sort"]
Collaborator

We likely want to benchmark the left and inner values for the how parameter, don't we?

Collaborator

We already have benchmarks for that parameter. I don't think we need another benchmark for this, as it seems like duplication to me.
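For context, an ASV benchmark class of this shape pairs `params`/`param_names` with a `setup` and a `time_*` method. A minimal illustrative sketch (shapes, column names, and the `time_*` body are assumptions, not the PR's exact code):

```python
import numpy as np
import pandas as pd

class TimeJoinStringIndex:
    # ASV runs setup + each time_* method once per combination of params.
    params = [[(100_000, 10)], [True, False]]
    param_names = ["shapes", "sort"]

    def setup(self, shapes, sort):
        rows, cols = shapes
        # Object-dtype string index, so the join exercises string key handling.
        index = pd.Index([f"i-{i}" for i in range(rows)], dtype=object)
        self.df = pd.DataFrame(
            np.random.rand(rows, cols),
            index=index,
            columns=[f"col{c}" for c in range(cols)],
        )
        self.ser = pd.Series(np.random.rand(rows), index=index, name="key")

    def time_join_string_index(self, shapes, sort):
        # Join a frame with a named series on the shared string index.
        self.df.join(self.ser, sort=sort)
```

ASV calls `setup` fresh before each timed repetition, so only the join itself is measured, not the data construction.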

Signed-off-by: Myachev <anatoly.myachev@intel.com>
@YarShev YarShev merged commit abcf1e9 into modin-project:master Oct 10, 2022