REFACTOR-#4717: Improve PartitionMgr.get_indices() usage #4718
Codecov Report
@@            Coverage Diff             @@
##           master    #4718      +/-   ##
==========================================
+ Coverage   85.16%   89.87%   +4.71%
==========================================
  Files         259      260       +1
  Lines       19205    19495     +290
==========================================
+ Hits        16355    17521    +1166
+ Misses       2850     1974     -876
Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>
@vnlitvinov thank you, this is excellent! I left one minor cleanup comment.
[idx.apply(func) for idx in partitions[0]] if len(partitions) else []
)
target = partitions.T if axis == 0 else partitions
new_idx = [idx.apply(func) for idx in target[0]] if len(target) else []
new_idx = cls.get_objects_from_partitions(new_idx)
# TODO FIX INFORMATION LEAK!!!!1!!1!!
I don't think there's any information leak here anymore :) I think the PR that added get_objects_from_partitions deleted the access to partitions[i]._data. Could you delete this warning?
I just double checked, and the info leak is still there in the PandasOnRayDataframePartitionManager:
modin/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition_manager.py
Line 110 in 49c0398
return ray.get([partition._data for partition in partitions])
I think it's doing this because calling partition.get on each partition would materialize the partitions serially, so the batched access makes sense. The problem is that if a partition has a call queue, that queue will be ignored. Perhaps this is ok though?
I dug into the history of this comment, and I still don't know what it's talking about. The diff from the original PR is here. I don't see any information leak there; the code uses get() to get the data.
Regarding the _data access in the ray partition manager, I think you're right that it's incorrect because it doesn't drain the call queue. This use in get_indices happens to work because apply drains the call queue. I think we need to take a stance on whether we should allow accessing the _data of block/non-full-axis-virtual partitions. We should probably use get(), which drains the call queue, instead.
So back to this comment, I think the right thing would be to file an issue to drain the call queue before getting _data (or maybe just to get rid of all accesses to _data), and also to move this comment to get_objects_from_partitions until that issue is fixed. But I don't think that's a responsibility for this PR.
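To make the call-queue concern concrete, here is a minimal sketch of the failure mode being discussed. ToyPartition and all of its methods are hypothetical stand-ins written for illustration; they are not Modin's partition classes, and the real behavior may differ in detail. The sketch only assumes what the thread states: a partition can hold deferred calls, get() drains them, and reading _data directly does not.

```python
class ToyPartition:
    """Hypothetical lazily evaluated partition (illustration only, not Modin's class)."""

    def __init__(self, data, call_queue=None):
        self._data = data                        # payload as currently stored
        self.call_queue = list(call_queue or [])  # deferred functions, not yet applied

    def add_to_apply_calls(self, func):
        # Defer the call: return a new partition with the same payload
        # and the function appended to its queue.
        return ToyPartition(self._data, self.call_queue + [func])

    def drain_call_queue(self):
        # Apply every deferred function in order, then clear the queue.
        for func in self.call_queue:
            self._data = func(self._data)
        self.call_queue = []

    def get(self):
        # Safe accessor: always drains the queue before returning the payload.
        self.drain_call_queue()
        return self._data


def times_ten(values):
    return [v * 10 for v in values]


part = ToyPartition([1, 2, 3])
lazy = part.add_to_apply_calls(times_ten)

stale = list(lazy._data)  # bypasses the queue: the deferred call is ignored
fresh = lazy.get()        # drains the queue first, so the call is applied

print(stale)
print(fresh)
```

In this toy model, a manager that collects `partition._data` for a whole batch would return the stale payload for any partition with pending calls, which is why routing access through get() (or draining first) is the safer default.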
Related issue #4530
Sgtm - I'll go ahead and approve and merge then.
Yeah, I think we should get rid of ._data access anywhere but in the partition class itself (ideally).
LGTM - let's merge after @mvashishtha's comments are resolved!
)
query_compiler = PandasQueryCompiler(result_modin_frame)
result_df = pd.DataFrame(query_compiler=query_compiler)
final_results[id] = result_df
Thank you - this refactor is way way neater lol!
LGTM!
LGTM!
…modin-project#4718) Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru> Co-authored-by: Yaroslav Igoshev <Poolliver868@mail.ru> Co-authored-by: Rehan Sohail Durrani <rehan@ponder.io>
Signed-off-by: Vasily Litvinov fam1ly.n4me@yandex.ru
What do these changes do?
Make some sane default for the index_func=None argument of PartitionManager.get_indices(), which improves its usability.
Also return the internal partition indices as well as the total frame index. This allows computing the row_lengths/column_widths information later on without separate remote calls.
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst is up-to-date
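The change described above (a default for index_func plus returning the per-partition indices alongside the full index) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code from this PR: partitions are modeled as a plain 2D list of dicts, everything runs locally with no remote calls, and the default chosen for index_func (row labels for axis 0, column labels for axis 1) is a guess at what "sane default" means.

```python
def get_indices(axis, partitions, index_func=None):
    """Toy get_indices: return (total_index, per_partition_indices).

    `partitions` is a 2D list (rows of blocks); each block is a dict with
    "index" and "columns" label lists. Hypothetical structure for illustration.
    """
    if index_func is None:
        # Assumed default: pick the labels that run along the requested axis.
        key = "index" if axis == 0 else "columns"

        def index_func(block):
            return block[key]

    if not partitions:
        return [], []

    # axis 0: walk down the first column of blocks;
    # axis 1: walk along the first row of blocks.
    target = [row[0] for row in partitions] if axis == 0 else partitions[0]
    internal = [list(index_func(block)) for block in target]
    total = [label for labels in internal for label in labels]
    return total, internal


# A 2x2 grid of blocks: 2 + 3 rows, 2 + 1 columns.
partitions = [
    [{"index": [0, 1], "columns": ["a", "b"]}, {"index": [0, 1], "columns": ["c"]}],
    [{"index": [2, 3, 4], "columns": ["a", "b"]}, {"index": [2, 3, 4], "columns": ["c"]}],
]

total_rows, row_parts = get_indices(0, partitions)
total_cols, col_parts = get_indices(1, partitions)

# The lengths fall out of the per-partition indices with no extra calls.
row_lengths = [len(labels) for labels in row_parts]
column_widths = [len(labels) for labels in col_parts]

print(total_rows, row_lengths)
print(total_cols, column_widths)
```

Because the per-partition label lists come back in the same call as the combined index, row_lengths and column_widths can be derived locally, which is the saving the description refers to.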