-
Notifications
You must be signed in to change notification settings - Fork 645
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF-#5778: Avoid extra materialization at range-based reshuffling #5780
Conversation
…eshuffling Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
col_partitioning_func = np.vectorize( | ||
lambda partition: cls._row_partition_class(partition) | ||
) | ||
split_row_partitions = col_partitioning_func(split_row_partitions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each partition from split_row_partitions
is already a row partition at this point since we did the conversion before:
modin/modin/core/dataframe/pandas/partitioning/partition_manager.py
Lines 1584 to 1590 in cdedd71
# Convert our list of block partitions to row partitions. We need to create full-axis | |
# row partitions since we need to send the whole partition to the split step as otherwise | |
# we wouldn't know how to split the block partitions that don't contain the shuffling key. | |
row_partitions = [ | |
partition.force_materialization().list_of_block_partitions[0] | |
for partition in cls.row_partitions(partitions) | |
] |
There's no need to do double-wrapping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks impressive! @RehanSD please also take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Left a couple of quick questions!
col_partitioning_func = np.vectorize( | ||
lambda partition: cls._row_partition_class(partition) | ||
) | ||
split_row_partitions = col_partitioning_func(split_row_partitions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense!
col_partitioning_func = np.vectorize( | ||
lambda partition: cls._row_partition_class(partition) | ||
) | ||
split_row_partitions = col_partitioning_func(split_row_partitions) | ||
new_partitions = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to make this lazy? Since split_row_partitions
is in effect the properly partitioned dataframe, we can transform to col partitions, and then add_to_apply_calls
the sort instead, and defer metadata materialization till it's needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be a future PR though!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can definitely do deferred meta-data materialization until really needed, created an issue for this #5808.
Regarding the lazy functions submission using add_to_apply_calls
: I remember we had a performance regression quite ago when we switched to lazy execution. Since then we reverted the changes in #2471 and never tried to apply them again, so there is probably some careful evaluation that has to be done before making this change again, created an issue for this #5809
So, can we merge this one? |
@dchigarev we can, but Rehan has an unanswered comment. |
@anmyachev I've opened the required issues, can we merge the PR now? |
What do these changes do?
This PR removes unnecessary conversion to row partitions while reshuffling partitions.
Changes of our ASV benchmark:
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst
is up-to-date