Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REFACTOR-#4717: Improve PartitionMgr.get_indices() usage #4718

Merged
merged 3 commits into from
Jul 26, 2022

Conversation

vnlitvinov
Copy link
Collaborator

@vnlitvinov vnlitvinov commented Jul 26, 2022

Signed-off-by: Vasily Litvinov fam1ly.n4me@yandex.ru

What do these changes do?

Make some sane default for index_func=None argument of PartitionManager.get_indices(), which improves its usability.
Also return internal partition indices as well as the total frame index - this allows to compute the row_lengths / column_widths information later on without some separate remote calls.

  • commit message follows format outlined here
  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Improve PartitionManager.get_indices() usage #4717
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date
  • added (Issue Number: PR title (PR Number)) and github username to release notes for next major release

@vnlitvinov vnlitvinov force-pushed the improve-get-indices branch 4 times, most recently from 3ea18ca to e92aeda Compare July 26, 2022 10:52
@codecov
Copy link

codecov bot commented Jul 26, 2022

Codecov Report

Merging #4718 (9e81360) into master (0236358) will increase coverage by 4.71%.
The diff coverage is 94.11%.

@@            Coverage Diff             @@
##           master    #4718      +/-   ##
==========================================
+ Coverage   85.16%   89.87%   +4.71%     
==========================================
  Files         259      260       +1     
  Lines       19205    19495     +290     
==========================================
+ Hits        16355    17521    +1166     
+ Misses       2850     1974     -876     
Impacted Files Coverage Δ
modin/experimental/batch/pipeline.py 100.00% <ø> (+100.00%) ⬆️
...tal/core/storage_formats/pyarrow/query_compiler.py 0.00% <0.00%> (ø)
modin/core/dataframe/pandas/dataframe/dataframe.py 95.56% <100.00%> (+<0.01%) ⬆️
...dataframe/pandas/partitioning/partition_manager.py 90.15% <100.00%> (+3.77%) ⬆️
modin/core/io/pickle/pickle_dispatcher.py 92.59% <100.00%> (ø)
modin/distributed/dataframe/pandas/partitions.py 88.23% <100.00%> (+1.00%) ⬆️
modin/logging/config.py 94.59% <0.00%> (-1.30%) ⬇️
modin/experimental/batch/test/test_pipeline.py 100.00% <0.00%> (ø)
modin/pandas/series.py 94.23% <0.00%> (+0.24%) ⬆️
modin/pandas/series_utils.py 99.43% <0.00%> (+0.56%) ⬆️
... and 40 more

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>
@vnlitvinov vnlitvinov marked this pull request as ready for review July 26, 2022 13:01
@vnlitvinov vnlitvinov requested a review from a team as a code owner July 26, 2022 13:01
Copy link
Collaborator

@mvashishtha mvashishtha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vnlitvinov thank you, this is excellent! I left one minor cleanup comment.

[idx.apply(func) for idx in partitions[0]] if len(partitions) else []
)
target = partitions.T if axis == 0 else partitions
new_idx = [idx.apply(func) for idx in target[0]] if len(target) else []
new_idx = cls.get_objects_from_partitions(new_idx)
# TODO FIX INFORMATION LEAK!!!!1!!1!!
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's any information leak here anymore :) I think the PR that added get_objects_from_partitions deleted it the access to partitions[i]._data. Could you delete this warning?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just double checked, and the info leak is still there in the PandasOnRayDataframePartitionManager:

return ray.get([partition._data for partition in partitions])

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's doing this because partition.get will materialize partitions in serial, so it makes sense for it to do it, but the problem is if the partition has a call queue that will be ignored. Perhaps this is ok though?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dug into the history of this comment, and I still don't know what it's talking about. The diff from the original PR is here. I don't see any information leak there, where the code uses get() to get the data.

Regarding the _data access in the ray partition manager, I think you're correct that it's incorrect because it doesn't drain the call queue. This use in get_indices happens to work because apply drain the call queue. I think we need to take a stance as to whether we should allow accessing the _data of block/non-full-axis-virtual partitions. We should probably use get(), which drains the call queue, instead.

So back to this comment, I think the right thing would be to file an issue to drain the call queue before getting _data (or maybe just to get rid of all accesses to _data), and also to move this comment to get_objects_from_partitions until that issue is fixed. But I don't think that's a responsibility for this PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related issue #4530

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sgtm - I'll go ahead and approve and merge then.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think we should get rid of ._data access anywhere but in the partition class itself (ideally).

RehanSD
RehanSD previously approved these changes Jul 26, 2022
Copy link
Collaborator

@RehanSD RehanSD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - lets merge after @mvashishtha's comments are resolved!

)
query_compiler = PandasQueryCompiler(result_modin_frame)
result_df = pd.DataFrame(query_compiler=query_compiler)
final_results[id] = result_df
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you - this refactor is way way neater lol!

[idx.apply(func) for idx in partitions[0]] if len(partitions) else []
)
target = partitions.T if axis == 0 else partitions
new_idx = [idx.apply(func) for idx in target[0]] if len(target) else []
new_idx = cls.get_objects_from_partitions(new_idx)
# TODO FIX INFORMATION LEAK!!!!1!!1!!
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just double checked, and the info leak is still there in the PandasOnRayDataframePartitionManager:

return ray.get([partition._data for partition in partitions])

[idx.apply(func) for idx in partitions[0]] if len(partitions) else []
)
target = partitions.T if axis == 0 else partitions
new_idx = [idx.apply(func) for idx in target[0]] if len(target) else []
new_idx = cls.get_objects_from_partitions(new_idx)
# TODO FIX INFORMATION LEAK!!!!1!!1!!
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's doing this because partition.get will materialize partitions in serial, so it makes sense for it to do it, but the problem is if the partition has a call queue that will be ignored. Perhaps this is ok though?

YarShev
YarShev previously approved these changes Jul 26, 2022
Copy link
Collaborator

@YarShev YarShev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

mvashishtha
mvashishtha previously approved these changes Jul 26, 2022
RehanSD
RehanSD previously approved these changes Jul 26, 2022
@RehanSD RehanSD dismissed stale reviews from mvashishtha, YarShev, and themself via 9e81360 July 26, 2022 21:23
Copy link
Collaborator

@RehanSD RehanSD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@RehanSD RehanSD merged commit 88c3c33 into modin-project:master Jul 26, 2022
YarShev added a commit to YarShev/modin that referenced this pull request Aug 2, 2022
…modin-project#4718)

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>
Co-authored-by: Yaroslav Igoshev <Poolliver868@mail.ru>
Co-authored-by: Rehan Sohail Durrani <rehan@ponder.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve PartitionManager.get_indices() usage
4 participants