PERF-#5778: Avoid extra materialization at range-based reshuffling #5780

dchigarev · 2023-03-13T12:57:44Z

What do these changes do?

This PR removes unnecessary conversion to row partitions while reshuffling partitions.

Changes in our ASV benchmark results:

MODIN_TEST_DATASET_SIZE="Big" asv continuous origin/master issue_5778 --launch-method=spawn -b TimeSortValues --no-only-changed -a repeat=5

All benchmarks:

       before           after         ratio
     [8d3db2b4]       [b50ee4e2]
     <issue_5778~1>       <issue_5778>
-       7.52±0.1s       1.15±0.04s     0.15  benchmarks.TimeSortValues.time_sort_values([1000000, 10], 1, True)
-       7.74±0.4s       1.17±0.03s     0.15  benchmarks.TimeSortValues.time_sort_values([1000000, 10], 2, True)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
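As a quick sanity check on the table above, the ratio column is the "after" timing divided by the "before" timing. The sketch below recomputes it from the mean timings reported by ASV; the helper names `ratio` and `speedup` are illustrative and not part of ASV.

```python
# Mean timings in seconds, copied from the ASV output above: (before, after).
timings = {
    "time_sort_values([1000000, 10], 1, True)": (7.52, 1.15),
    "time_sort_values([1000000, 10], 2, True)": (7.74, 1.17),
}


def ratio(before: float, after: float) -> float:
    """ASV-style ratio: after / before (lower means faster)."""
    return after / before


def speedup(before: float, after: float) -> float:
    """Illustrative inverse of the ratio: how many times faster the new code is."""
    return before / after


for name, (before, after) in timings.items():
    print(f"{name}: ratio={ratio(before, after):.2f}, speedup={speedup(before, after):.1f}x")
```

Both rows round to a ratio of 0.15, i.e. the sort is roughly 6.5x faster on this dataset size.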

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves #5778 (Do not build row-partitions for a final function during reshuffling)
tests are passing
module layout described at docs/development/architecture.rst is up-to-date

dchigarev · 2023-03-14T15:49:58Z

modin/core/dataframe/pandas/partitioning/partition_manager.py

-        col_partitioning_func = np.vectorize(
-            lambda partition: cls._row_partition_class(partition)
-        )
-        split_row_partitions = col_partitioning_func(split_row_partitions)


Each partition from split_row_partitions is already a row partition at this point since we did the conversion before:

modin/modin/core/dataframe/pandas/partitioning/partition_manager.py

Lines 1584 to 1590 in cdedd71

# Convert our list of block partitions to row partitions. We need to create full-axis
# row partitions since we need to send the whole partition to the split step as otherwise
# we wouldn't know how to split the block partitions that don't contain the shuffling key.
row_partitions = [
    partition.force_materialization().list_of_block_partitions[0]
    for partition in cls.row_partitions(partitions)
]

There's no need to do double-wrapping.
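To illustrate why the second wrapping is redundant, here is a minimal toy sketch. `BlockPartition` and `RowPartition` are hypothetical stand-ins, not Modin's real classes; the sketch only shows that re-wrapping objects that are already row partitions creates extra wrapper objects without changing the data they refer to.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockPartition:
    """Hypothetical stand-in for a single block of data."""
    rows: List[int]


class RowPartition:
    """Hypothetical stand-in for a full-axis row partition (not Modin's class)."""

    created = 0  # counts how many wrappers were constructed

    def __init__(self, blocks):
        RowPartition.created += 1
        # Wrapping an existing row partition adds an object but no new information.
        if isinstance(blocks, RowPartition):
            blocks = blocks.list_of_block_partitions
        self.list_of_block_partitions = list(blocks)


def split_then_rewrap(row_parts):
    """Old flow (assumed shape): wrap every split result a second time."""
    return [RowPartition(part) for part in row_parts]


def split_only(row_parts):
    """New flow (assumed shape): use the already-wrapped split results as-is."""
    return list(row_parts)


row_parts = [
    RowPartition([BlockPartition([1, 2])]),
    RowPartition([BlockPartition([3])]),
]
count_before = RowPartition.created
rewrapped = split_then_rewrap(row_parts)
extra_wrappers = RowPartition.created - count_before
direct = split_only(row_parts)

print("extra wrappers created by re-wrapping:", extra_wrappers)
```

In this toy model both flows expose the same blocks; the re-wrapping flow just pays for one extra object per partition.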

Makes sense!

anmyachev

Looks impressive! @RehanSD please also take a look.

RehanSD

LGTM! Left a couple of quick questions!

RehanSD · 2023-03-15T22:23:38Z

modin/core/dataframe/pandas/partitioning/partition_manager.py

-        col_partitioning_func = np.vectorize(
-            lambda partition: cls._row_partition_class(partition)
-        )
-        split_row_partitions = col_partitioning_func(split_row_partitions)


Makes sense!

RehanSD · 2023-03-15T22:25:10Z

modin/core/dataframe/pandas/partitioning/partition_manager.py

-        col_partitioning_func = np.vectorize(
-            lambda partition: cls._row_partition_class(partition)
-        )
-        split_row_partitions = col_partitioning_func(split_row_partitions)
        new_partitions = [


Do we want to make this lazy? Since split_row_partitions is in effect the properly partitioned dataframe, we could convert it to column partitions, queue the sort via add_to_apply_calls instead of applying it eagerly, and defer metadata materialization until it's needed.

This could be a future PR though!

We can definitely defer metadata materialization until it is really needed; I created issue #5808 for this.

Regarding lazy submission of functions via add_to_apply_calls: I remember we had a performance regression quite a while ago when we switched to lazy execution. We have since reverted the changes in #2471 and never tried to apply them again, so some careful evaluation is probably needed before making this change again. I created issue #5809 for this.
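For readers unfamiliar with the eager-versus-lazy distinction discussed here, the following is a rough sketch of the idea. `LazyPartition` is a hypothetical toy class; only the method name `add_to_apply_calls` is taken from the comment above, and its behavior here is an assumption for illustration rather than Modin's implementation.

```python
from typing import Any, Callable, List


class LazyPartition:
    """Hypothetical partition that can queue functions instead of running them."""

    def __init__(self, data: Any):
        self._data = data
        self._call_queue: List[Callable[[Any], Any]] = []

    def apply(self, func: Callable[[Any], Any]) -> "LazyPartition":
        """Eager path: run any queued calls plus func right away."""
        return LazyPartition(func(self.materialize()))

    def add_to_apply_calls(self, func: Callable[[Any], Any]) -> "LazyPartition":
        """Lazy path: remember func and return immediately without running it."""
        new = LazyPartition(self._data)
        new._call_queue = self._call_queue + [func]
        return new

    def materialize(self) -> Any:
        """Run the queued calls in order and return the resulting data."""
        data = self._data
        for func in self._call_queue:
            data = func(data)
        return data


calls = []


def traced_sort(rows):
    """Sort the rows and record that the sort actually ran."""
    calls.append("sort")
    return sorted(rows)


part = LazyPartition([3, 1, 2])
lazy = part.add_to_apply_calls(traced_sort)
ran_before_materialize = len(calls)  # the sort has not executed yet
result = lazy.materialize()          # the sort executes here
```

The trade-off mentioned in the thread is that deferring work this way changes when the cost is paid, which is why it needs benchmarking before being adopted.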

dchigarev · 2023-03-16T18:28:44Z

So, can we merge this one?

anmyachev · 2023-03-17T15:01:40Z

> So, can we merge this one?

@dchigarev we can, but Rehan has an unanswered comment.

dchigarev · 2023-03-21T12:39:01Z

> > So, can we merge this one?
>
> @dchigarev we can, but Rehan has an unanswered comment.

@anmyachev I've opened the required issues, can we merge the PR now?

PERF-modin-project#5778: Avoid extra materialization at range-based r…

b50ee4e

…eshuffling Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev added the "partitions reshuffling 🔀" label (Issues related to partitions reshuffling) Mar 13, 2023

dchigarev marked this pull request as ready for review March 14, 2023 15:51

dchigarev requested a review from a team as a code owner March 14, 2023 15:51

dchigarev requested a review from RehanSD March 14, 2023 15:51

anmyachev approved these changes Mar 15, 2023

RehanSD approved these changes Mar 15, 2023

anmyachev merged commit ab91d4f into modin-project:master Mar 21, 2023
