FEAT-#6965: Implement '.merge()' using range-partitioning implementation #6966

dchigarev · 2024-02-26T12:47:48Z

What do these changes do?

This PR adds a new implementation of merge() based on the range-partitioning mechanism. The new implementation does the following:

1. Computes range-partition bins for the left dataframe
2. Builds range-partitioning for the left dataframe using the bins computed at the first step
3. Builds range-partitioning for the right dataframe using the bins computed at the first step
4. Applies the merge kernel row-wise to the left dataframe and broadcasts the row partitions of the right dataframe there
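The steps above can be illustrated with a small single-process model in plain pandas. This is only a sketch and not the code added by this PR: the function name `range_partition_merge`, the `num_partitions` parameter, and the use of quantiles to pick the bin boundaries are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def range_partition_merge(left, right, on, num_partitions=4, how="inner"):
    """Toy model of a range-partitioned merge (hypothetical helper, not Modin's code)."""
    # Step 1: compute bin boundaries from the LEFT key column only.
    # Using quantiles is an assumption; any sorted split points would work.
    quantiles = np.linspace(0, 1, num_partitions + 1)[1:-1]
    bins = np.unique(
        left[on].quantile(quantiles, interpolation="nearest").to_numpy()
    )

    # Steps 2 and 3: assign rows of both frames to partitions using the SAME
    # bins, so equal keys always land in the partition with the same number.
    left_ids = np.searchsorted(bins, left[on].to_numpy(), side="right")
    right_ids = np.searchsorted(bins, right[on].to_numpy(), side="right")

    # Step 4: merge each left partition only with the matching right partition.
    pieces = []
    for pid in range(len(bins) + 1):
        left_part = left[left_ids == pid]
        right_part = right[right_ids == pid]
        pieces.append(left_part.merge(right_part, on=on, how=how))
    return pd.concat(pieces, ignore_index=True)


left = pd.DataFrame({"key": [5, 1, 9, 3, 7, 1], "a": range(6)})
right = pd.DataFrame({"key": [1, 3, 3, 8, 9], "b": range(5)})
result = range_partition_merge(left, right, on="key", num_partitions=3)
```

Because both frames are split with the same bins, no partition ever needs to see the whole right frame, which is the property the description relies on.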

The benefit of this implementation is that we never gather the whole right dataframe in a single partition (as we do now). Instead, we repartition it so that it is correct to broadcast only the matching row partitions of the right dataframe. This pays off when the right dataframe is relatively big.

One downside of this implementation is that repartitioning changes the order of rows, so the result always ends up sorted by the keys.
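A minimal pandas-only illustration of what this ordering difference means for callers. The key-sorted frame below is just a stand-in for a range-partitioning result, and `same_rows` is a hypothetical helper for comparing two results while ignoring row order.

```python
import pandas as pd

left = pd.DataFrame({"key": [3, 1, 2], "a": ["c", "a", "b"]})
right = pd.DataFrame({"key": [1, 2, 3], "b": [10, 20, 30]})

# pandas keeps the order of the left keys for an inner merge.
pandas_order = left.merge(right, on="key")

# Stand-in for a range-partitioning result: same rows, but sorted by key.
sorted_by_key = pandas_order.sort_values("key", kind="stable").reset_index(drop=True)


def same_rows(x, y):
    """Compare two frames ignoring row order (hypothetical helper)."""
    cols = list(x.columns)
    x = x.sort_values(cols).reset_index(drop=True)
    y = y.sort_values(cols).reset_index(drop=True)
    return x.equals(y)


print(list(pandas_order["key"]), list(sorted_by_key["key"]))
print(same_rows(pandas_order, sorted_by_key))
```

Code that depends on the left-frame row order would need an explicit sort (or an order column) when this implementation is enabled.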

At the moment, one can only use the range-partitioning implementation of merge by setting a special config variable (cfg.RangePartitioningMerge).

Perf measurements for h2o
Important note: the original h2o dataset has several categorical columns with high cardinality (a large number of unique values). This is a problematic case for modin (see 2nd point), so it works terribly slowly with such dtypes. In the measurements below, all categorical columns were cast to strings.
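For reference, casting every categorical column to strings can be done along these lines. This is a sketch on a tiny made-up frame; the column names `id1` and `v1` are placeholders and the actual benchmark preprocessing script is not shown in this PR.

```python
import pandas as pd

# Tiny stand-in frame; the column names are placeholders.
df = pd.DataFrame(
    {
        "id1": pd.Categorical(["id001", "id002", "id001"]),
        "v1": [1, 2, 3],
    }
)

# Find all categorical columns and cast each of them to str.
cat_cols = df.select_dtypes(include="category").columns
df = df.astype({col: str for col in cat_cols})

print(df.dtypes)
```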

h2o joins, 500mb data

h2o joins, 5gb data

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves Implement merge() using range-partitioning implementation #6965
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev · 2024-02-27T10:32:38Z

modin/core/storage_formats/pandas/merge.py

+        -------
+        PandasQueryCompiler
+        """
+        how = kwargs.get("how", "inner")


This function was copied without changes from PandasQueryCompiler.merge.

dchigarev · 2024-02-27T10:33:33Z

modin/core/storage_formats/pandas/merge.py

+        new_dtypes : ModinDtypes or None
+            Dtypes for the result of merge. ``None`` if not enough metadata to compute.
+        """
+        new_columns = None


This logic was copied without any changes from PandasQueryCompiler.merge and moved into a separate method so that the range-partitioning implementation can reuse it.

dchigarev · 2024-02-27T12:01:05Z

modin/config/envvars.py

@@ -770,6 +770,19 @@ def _sibling(cls) -> type[EnvWithSibilings]:
 )


+class RangePartitioningMerge(EnvironmentVariable, type=bool):


We're planning to implement more methods using range-partitioning and let users switch between implementations on their own. The approach with config variables doesn't seem to scale well enough, as creating a config variable for each method isn't a good idea IMO.

An alternative could be passing some parameter at the pandas API level specifying which implementation to use:

df1.merge(df2, on="key", impl="range-partitioning")
df.groupby(...).apply(..., impl="range-partitioning")
df.nunique(impl="range-partitioning")

The problems with this approach are:

The user's code loses compatibility with pandas (switching back from modin to pandas would require removing those extra arguments)

The parameter will only make sense for executions based on PandasQueryCompiler (snowflake, hdk executions won't be able to support it)

What do others think in this regard?

Why not just use one config variable (like RangePartitioningImpl) for all the functions that support it?

Yep, it seems like the only option. We could then expand this page with a short description of every operation that supports range-partitioning and tips on when to use it, and then probably link the page from our optimization notes. I'm personally voting for this option and will try to implement it in this PR. cc @YarShev

Added a common variable, RangePartitioning. Let's handle the transition from RangePartitioningGroupby, as well as the documentation changes, in a separate PR.

docs/flow/modin/experimental/index.rst

modin/pandas/test/dataframe/test_join_sort.py

modin/core/dataframe/pandas/dataframe/dataframe.py

modin/core/dataframe/pandas/partitioning/partition_manager.py

anmyachev · 2024-02-28T18:22:03Z

modin/core/storage_formats/pandas/merge.py

+                new_columns=new_columns,
+                new_dtypes=new_dtypes,
+            )
+        ).reset_index(drop=True)


added the following in-code comment:

# pandas resets the index of the result unless we were merging on an index level,
# the current implementation only supports merging on column names, so dropping
# the index unconditionally
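The pandas behaviour that this comment refers to can be checked with a short example. This is a sketch in plain pandas; the two-piece split below only imitates a result stitched together from partitions and is not how Modin assembles its output.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "a": [10, 20, 30]}, index=[100, 200, 300])
right = pd.DataFrame({"key": [3, 1], "b": [0.3, 0.1]}, index=["x", "y"])

# Merging on a column ignores both input indexes: the result is labeled 0..n-1.
merged = left.merge(right, on="key")

# A result concatenated from per-partition merges carries repeated labels...
pieces = [
    left.iloc[:1].merge(right, on="key"),
    left.iloc[1:].merge(right, on="key"),
]
stitched = pd.concat(pieces)

# ...so the index is dropped unconditionally to match pandas.
fixed = stitched.reset_index(drop=True)

print(list(merged.index), list(stitched.index), list(fixed.index))
```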

anmyachev

LGTM!

dchigarev added 6 commits February 26, 2024 18:25

new_merge

4484736

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

FEAT-modin-project#6965: Implement '.merge()' using range-partitionin…

a2c496d

…g implementation Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

add tests to ci

18874ce

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

add 'merge.py'

3f1b87c

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

add more docs

e9a54d7

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

fix docs

4b35506

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev force-pushed the mm_exp branch from 5a20d51 to 4b35506 Compare February 26, 2024 17:25

dchigarev commented Feb 27, 2024

dchigarev marked this pull request as ready for review February 27, 2024 11:51

dchigarev requested review from devin-petersohn, mvashishtha, RehanSD, YarShev, vnlitvinov, anmyachev and a team as code owners February 27, 2024 11:51

dchigarev commented Feb 27, 2024

rename 'RangePartitioningMerge' -> 'RangePartitioning'

50c3ab5

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

anmyachev reviewed Feb 28, 2024

apply review suggestions

c115ffa

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

anmyachev approved these changes Feb 29, 2024

anmyachev merged commit a966395 into modin-project:master Mar 1, 2024
37 checks passed

dchigarev mentioned this pull request Mar 1, 2024

Rework documentation regarding range-partitioning implementations #6987

Closed

YarShev mentioned this pull request Mar 20, 2024

merge operation is significantly slower than stock pandas #4293

Closed
