PERF-#5778: Avoid extra materialization at range-based reshuffling #5780

Merged
merged 1 commit into from
Mar 21, 2023

@dchigarev dchigarev commented Mar 13, 2023

What do these changes do?

This PR removes unnecessary conversion to row partitions while reshuffling partitions.

Results from our ASV benchmarks:

MODIN_TEST_DATASET_SIZE="Big" asv continuous origin/master issue_5778 --launch-method=spawn -b TimeSortValues --no-only-changed -a repeat=5

All benchmarks:

       before           after         ratio
     [8d3db2b4]       [b50ee4e2]
     <issue_5778~1>       <issue_5778>
-       7.52±0.1s       1.15±0.04s     0.15  benchmarks.TimeSortValues.time_sort_values([1000000, 10], 1, True)
-       7.74±0.4s       1.17±0.03s     0.15  benchmarks.TimeSortValues.time_sort_values([1000000, 10], 2, True)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
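
For readers less familiar with ASV output, the `ratio` column is the "after" time divided by the "before" time, so 0.15 corresponds to roughly a 6.5x speedup. A minimal sketch of that arithmetic, using only the mean timings copied from the table above (the `summarize` helper and the `timings` dictionary are hypothetical names introduced for illustration, and the ± spread is ignored):

```python
# Mean timings in seconds, copied from the ASV output above: (before, after).
# The reported +/- spread is ignored in this sketch.
timings = {
    "time_sort_values([1000000, 10], 1, True)": (7.52, 1.15),
    "time_sort_values([1000000, 10], 2, True)": (7.74, 1.17),
}


def summarize(before, after):
    """Return (ratio, speedup), where ratio = after / before as ASV reports it."""
    return after / before, before / after


summary = {name: summarize(before, after) for name, (before, after) in timings.items()}

for name, (ratio, speedup) in summary.items():
    print(f"{name}: ratio={ratio:.2f}, speedup={speedup:.1f}x")
```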
  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Do not build row-partitions for a final function during reshuffling #5778
  • tests are passing
  • module layout described at docs/development/architecture.rst is up-to-date

…eshuffling

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
@dchigarev dchigarev added the partitions reshuffling 🔀 Issues related to partitions reshuffling label Mar 13, 2023
Comment on lines -1601 to -1604
col_partitioning_func = np.vectorize(
    lambda partition: cls._row_partition_class(partition)
)
split_row_partitions = col_partitioning_func(split_row_partitions)
Collaborator Author

Each partition in split_row_partitions is already a row partition at this point, since we did the conversion earlier:

# Convert our list of block partitions to row partitions. We need to create full-axis
# row partitions since we need to send the whole partition to the split step as otherwise
# we wouldn't know how to split the block partitions that don't contain the shuffling key.
row_partitions = [
    partition.force_materialization().list_of_block_partitions[0]
    for partition in cls.row_partitions(partitions)
]

There's no need to do double-wrapping.
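
To illustrate the double-wrapping point, here is a toy sketch rather than Modin's actual code. `ToyRowPartition` is a hypothetical stand-in for the row partition class, and its `depth` method exists only to make the nesting visible. It shows that mapping a wrapper constructor over objects that are already wrapped adds a second, redundant layer:

```python
import numpy as np


class ToyRowPartition:
    """Hypothetical stand-in for a full-axis row partition (not Modin's class)."""

    def __init__(self, blocks):
        self.blocks = blocks

    def depth(self):
        # Count how many ToyRowPartition layers wrap the underlying blocks.
        inner = self.blocks
        return 1 + (inner.depth() if isinstance(inner, ToyRowPartition) else 0)


# These objects are already row partitions, like split_row_partitions above.
parts = np.empty(3, dtype=object)
for i in range(3):
    parts[i] = ToyRowPartition([f"block-{i}"])

# Same pattern as the removed lines: vectorize the constructor over the array.
wrap = np.vectorize(lambda p: ToyRowPartition(p), otypes=[object])
double_wrapped = wrap(parts)

print([p.depth() for p in parts])
print([p.depth() for p in double_wrapped])
```

In this toy version the extra layer is only wasted nesting. What the second wrap costs inside Modin is not modeled here.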

Collaborator

Makes sense!

@dchigarev dchigarev marked this pull request as ready for review March 14, 2023 15:51
@dchigarev dchigarev requested a review from a team as a code owner March 14, 2023 15:51
@dchigarev dchigarev requested a review from RehanSD March 14, 2023 15:51
@anmyachev anmyachev left a comment

Looks impressive! @RehanSD please also take a look.

@RehanSD RehanSD left a comment

LGTM! Left a couple of quick questions!

col_partitioning_func = np.vectorize(
    lambda partition: cls._row_partition_class(partition)
)
split_row_partitions = col_partitioning_func(split_row_partitions)
new_partitions = [
Collaborator

Do we want to make this lazy? Since split_row_partitions is in effect the properly partitioned dataframe, we could convert it to column partitions, queue the sort via add_to_apply_calls instead, and defer metadata materialization until it's needed.

Collaborator

This could be a future PR though!

Collaborator Author

We can definitely defer metadata materialization until it's really needed; I created issue #5808 for this.

Regarding lazy submission of functions via add_to_apply_calls: I remember we had a performance regression quite a while ago when we switched to lazy execution. We reverted those changes in #2471 and have not tried to reapply them since, so this probably needs careful evaluation before we make the change again. I created issue #5809 for this.
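
For context on the eager versus lazy trade-off discussed in this thread, the sketch below contrasts applying a function immediately with queueing it for later. It is a toy model: `ToyPartition`, its call queue, and its `get` method are hypothetical. Only the method name `add_to_apply_calls` is taken from the discussion, and its behavior here is an assumption, not Modin's implementation.

```python
class ToyPartition:
    """Hypothetical partition with an eager path and a deferred call queue."""

    def __init__(self, data, call_queue=None):
        self._data = data
        self._call_queue = list(call_queue or [])

    def _drain(self):
        # Run every queued function in order on the stored data.
        data = self._data
        for func in self._call_queue:
            data = func(data)
        return data

    def apply(self, func):
        """Eager path: run queued calls and func right away."""
        return ToyPartition(func(self._drain()))

    def add_to_apply_calls(self, func):
        """Lazy path: only record func; nothing is executed yet."""
        return ToyPartition(self._data, self._call_queue + [func])

    def get(self):
        """Materialize the result, executing any deferred calls."""
        return self._drain()


calls = []


def traced_sort(values):
    # Record each invocation so we can see when the work actually happens.
    calls.append("sort")
    return sorted(values)


eager = ToyPartition([3, 1, 2]).apply(traced_sort)
calls_after_eager = len(calls)  # the sort already ran

lazy = ToyPartition([3, 1, 2]).add_to_apply_calls(traced_sort)
calls_after_queueing = len(calls)  # unchanged: the sort is only queued

lazy_result = lazy.get()
calls_after_get = len(calls)  # the deferred sort ran during get()
```

The toy only shows when the work happens. Whether deferring it helps or hurts performance in practice is the open question tracked in #5809.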

@dchigarev
Collaborator Author

So, can we merge this one?

@anmyachev
Collaborator

> So, can we merge this one?

@dchigarev we can, but Rehan has an unanswered comment.

@dchigarev
Collaborator Author

> > So, can we merge this one?
>
> @dchigarev we can, but Rehan has an unanswered comment.

@anmyachev I've opened the required issues, can we merge the PR now?

@anmyachev anmyachev merged commit ab91d4f into modin-project:master Mar 21, 2023