
BENCH: add some cases for join and merge ops from pandas #5021

Merged
merged 3 commits into modin-project:master
Oct 10, 2022

Conversation

jbrockmendel
Collaborator

@jbrockmendel jbrockmendel commented Sep 22, 2022

Broken off from #4988

Closes #5111

@jbrockmendel jbrockmendel requested a review from a team as a code owner September 22, 2022 16:23
Collaborator

@anmyachev anmyachev left a comment

@YarShev keeping the rest of the benchmarks is ok with me.

asv_bench/benchmarks/pandas/join_merge.py (outdated, resolved)
asv_bench/benchmarks/pandas/join_merge.py (outdated, resolved)
@@ -0,0 +1,261 @@
import string
Collaborator

We already have join, merge and concat benchmarks in place. Let's try to add at most 1-2 runs in each case. @YarShev
@anmyachev I think it's your call :)
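A note on why trimming cases matters: ASV runs the full cartesian product of all parameter values, so each extra value multiplies the run count. A toy illustration (hypothetical parameter values, not the PR's code):

```python
from itertools import product

# ASV executes each time_* method once per combination of params.
hows = ["left", "right", "inner", "outer"]
sorts = [True, False]

all_runs = list(product(hows, sorts))              # 4 * 2 = 8 runs per method
trimmed = list(product(["left", "inner"], sorts))  # 2 * 2 = 4 runs per method
```

Dropping two `how` values halves the CI time spent on every method in the class.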

Collaborator

I am in favor of leaving the functions you have named as is, since they are among the most important. A little redundancy can be dealt with later.

Now the main thing is to integrate these benchmarks into the Modin system. We need to do a couple of things for this:

  • add test runs of these benchmarks to CI
  • add runs of these benchmarks to our main benchmarking configuration
  • use the functionality from this pull request so that the code can work in two modes: TEST-#5014: Simplify adding new ASV benchmarks #5015
  • there is no automatic mechanism yet that triggers all computations, so getting correct times will need to be done manually (BenchmarkMode does not work for OmniSci and, besides, it can slow down the setup function)
  • in addition, we need to settle the dataset sizes. At very small sizes, where pandas executes a function in a few tens of milliseconds, Modin is unlikely to overtake pandas. However, if you increase the dataset, you may run out of memory on the machine that runs these validation tests in CI. Because of this, the already added benchmarks include functionality to change the data size for these situations, but it is still manual: we have to pick the sizes for both modes by hand.
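The last point — switching between a tiny CI-friendly dataset and a large one for real measurements — can be sketched roughly like this (the environment variable name and shapes are illustrative, not Modin's actual mechanism):

```python
import os

# Hypothetical size switch: "small" for CI smoke runs, "big" for real
# benchmarking. Names and shapes are illustrative only.
DATASET_SIZE = os.environ.get("ASV_DATASET_SIZE", "small")

JOIN_DATA_SIZE = {
    "small": [(1_000, 10)],     # runs in milliseconds, fits any CI machine
    "big": [(5_000_000, 10)],   # large enough for Modin to amortize overhead
}

def get_benchmark_shapes():
    """Return the list of (rows, cols) shapes for the current size mode."""
    return JOIN_DATA_SIZE[DATASET_SIZE]
```

Benchmark classes would then read their `params` from `get_benchmark_shapes()` instead of hard-coding shapes.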

Collaborator

I am in favor of leaving the functions you have named as is, since they are among the most important. A little redundancy can be dealt with later.

By "leaving as is" do you mean the current asv testing or the contents of this PR? :) I agree that these functions are very important, but it could be too heavy for our CI to run all the cases from this PR.

Collaborator

I mean the contents of this PR.

Collaborator

Okay, let's try and see how it goes, but I'm still not sure we want (at least for now) the outer join of the non-unique case and all cases from Merge (e.g. empty and cross).
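For reference, the cross case mentioned above is pandas' cartesian-product merge; a small sketch of what that benchmark would exercise:

```python
import pandas as pd

# how="cross" (available since pandas 1.2) pairs every left row with
# every right row, with no key columns involved.
left = pd.DataFrame({"a": [1, 2]})
right = pd.DataFrame({"b": ["x", "y", "z"]})
out = pd.merge(left, right, how="cross")
# out has 2 * 3 = 6 rows and columns ["a", "b"]
```

Because the output grows as the product of the input lengths, a cross-merge benchmark needs much smaller inputs than the keyed cases.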

Collaborator

Now the main thing is to integrate these benchmarks into the Modin system

@anmyachev, could you help @jbrockmendel to start doing this?

Collaborator

@YarShev ok I'll take care of it

self.temp = Series(1.0, index)[self.fracofday.index]

def time_join_non_unique_equal(self):
    self.fracofday * self.temp
Collaborator

What is tested here?

Collaborator Author

Alignment and multiplication.
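That is, the benchmark times index alignment plus the elementwise multiplication. A minimal sketch (assuming stock pandas) of what `*` does with equal non-unique indexes:

```python
import pandas as pd

# With identical (even non-unique) indexes, `*` verifies index equality
# and multiplies elementwise; with differing indexes it would first
# align the two Series (an index join) before multiplying.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "x", "y"])
b = pd.Series([10.0, 20.0, 30.0], index=["x", "x", "y"])
out = a * b  # elementwise: [10.0, 40.0, 90.0]
```

The "non_unique_equal" case therefore measures the fast path where alignment reduces to an equality check on duplicate-laden indexes.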

asv_bench/benchmarks/pandas/join_merge.py (outdated, resolved)
asv_bench/benchmarks/pandas/pandas_vb_common.py (outdated, resolved)
@YarShev
Collaborator

YarShev commented Sep 27, 2022

I suggest starting to integrate the test suite into our asv benchmark system (link).

@codecov

codecov bot commented Sep 27, 2022

Codecov Report

Merging #5021 (28518a0) into master (00a0fb9) will decrease coverage by 0.04%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #5021      +/-   ##
==========================================
- Coverage   84.57%   84.52%   -0.05%     
==========================================
  Files         256      257       +1     
  Lines       19347    19629     +282     
==========================================
+ Hits        16362    16592     +230     
- Misses       2985     3037      +52     
Impacted Files Coverage Δ
modin/experimental/sklearn/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
modin/experimental/xgboost/test/test_dmatrix.py 0.00% <0.00%> (-100.00%) ⬇️
modin/experimental/xgboost/test/test_xgboost.py 0.00% <0.00%> (-100.00%) ⬇️
modin/experimental/core/execution/ray/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
...n/experimental/sklearn/model_selection/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
...mental/sklearn/model_selection/train_test_split.py 0.00% <0.00%> (-100.00%) ⬇️
...tal/core/execution/ray/implementations/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
...tion/ray/implementations/pandas_on_ray/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
...n/ray/implementations/pandas_on_ray/io/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
...ecution/ray/implementations/pandas_on_ray/io/io.py 0.00% <0.00%> (-93.34%) ⬇️
... and 61 more


@anmyachev
Collaborator

I don't know how to deal with the license header that we add to each of our files, since this code is taken directly from pandas without modification.

@Garra1980
Collaborator

I don't know how to deal with the license header that we add to each of our files, since this code is taken directly from pandas without modification.

Good question, let’s find out

@anmyachev
Collaborator

I suggest renaming the folder from pandas to ported_from_pandas to make it clearer.

@YarShev
Collaborator

YarShev commented Sep 28, 2022

I don't know how to deal with the license header that we add to each of our files, since this code is taken directly from pandas without modification.

Good question, let’s find out

Do we want to keep both our current benchmarks and the ones ported from pandas? Maybe we should keep only one set?

@jbrockmendel jbrockmendel changed the title BENCH: port subset of benchmarks from pandas BENCH: port join_merge benchmarks from pandas Sep 28, 2022
@@ -0,0 +1,165 @@
import numpy as np
Collaborator

Let's integrate these benchmarks into our system so we can see the difference and the newly added cases.

Collaborator

Same as for concat.

asv_bench/benchmarks/benchmarks.py (resolved)
asv_bench/benchmarks/benchmarks.py (outdated, resolved)
asv_bench/benchmarks/benchmarks.py (outdated, resolved)
@YarShev
Collaborator

YarShev commented Sep 30, 2022

ci / lint (pydocstyle) failed, please take a look.

@anmyachev
Collaborator

@YarShev I started creating separate pull requests for each case added.

@YarShev
Collaborator

YarShev commented Oct 6, 2022

Please ping me when the changes are ready for review.

@anmyachev
Collaborator

Please ping me when the changes are ready for review.

@YarShev ready for review

asv_bench/benchmarks/benchmarks.py (outdated, resolved)
asv_bench/benchmarks/benchmarks.py (outdated, resolved)
asv_bench/benchmarks/benchmarks.py (outdated, resolved)
asv_bench/benchmarks/benchmarks.py (outdated, resolved)
asv_bench/benchmarks/benchmarks.py (outdated, resolved)
asv_bench/benchmarks/utils/common.py (resolved)
@YarShev
Collaborator

YarShev commented Oct 7, 2022

ci / lint (pydocstyle) failed, please take a look.

still failing

@anmyachev
Collaborator

ci / lint (pydocstyle) failed, please take a look.

still failing

@YarShev ready for review

…andas

Signed-off-by: Myachev <anatoly.myachev@intel.com>
@anmyachev anmyachev changed the title BENCH: port join_merge benchmarks from pandas BENCH: add some cases for join and merge ops from pandas Oct 10, 2022
Signed-off-by: Myachev <anatoly.myachev@intel.com>
@anmyachev
Collaborator

@YarShev ready for review

asv_bench/benchmarks/benchmarks.py (outdated, resolved)
asv_bench/benchmarks/benchmarks.py (outdated, resolved)
@@ -127,12 +128,56 @@ def time_join(self, shapes, how, sort):
    execute(self.df1.join(self.df2, how=how, lsuffix="left_", sort=sort))


class TimeJoinStringIndex:
    param_names = ["shapes", "sort"]
Collaborator

We likely want to benchmark the left and inner values for the how parameter, don't we?

Collaborator

We already have benchmarks for that parameter. I don't think we need another benchmark for this, as it seems like duplication to me.
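For context, an ASV benchmark class of this shape pairs `params`/`param_names` with a `setup` and a `time_*` method. A minimal illustrative sketch (shapes, column names, and the `time_*` body are assumptions, not the PR's exact code):

```python
import numpy as np
import pandas as pd

class TimeJoinStringIndex:
    # ASV runs setup + each time_* method once per combination of params.
    params = [[(100_000, 10)], [True, False]]
    param_names = ["shapes", "sort"]

    def setup(self, shapes, sort):
        rows, cols = shapes
        # Object-dtype string index, so the join exercises string key handling.
        index = pd.Index([f"i-{i}" for i in range(rows)], dtype=object)
        self.df = pd.DataFrame(
            np.random.rand(rows, cols),
            index=index,
            columns=[f"col{c}" for c in range(cols)],
        )
        self.ser = pd.Series(np.random.rand(rows), index=index, name="key")

    def time_join_string_index(self, shapes, sort):
        # Join a frame with a named series on the shared string index.
        self.df.join(self.ser, sort=sort)
```

ASV calls `setup` fresh before each timed repetition, so only the join itself is measured, not the data construction.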

Signed-off-by: Myachev <anatoly.myachev@intel.com>
@YarShev YarShev merged commit abcf1e9 into modin-project:master Oct 10, 2022