REFACTOR-#4717: Improve PartitionMgr.get_indices() usage #4718

vnlitvinov · 2022-07-26T10:39:01Z

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>

What do these changes do?

Provide a sane default for the index_func=None argument of PartitionManager.get_indices(), which improves its usability.
Also return the internal per-partition indices along with the total frame index; this allows computing the row_lengths / column_widths information later on without separate remote calls.
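
To illustrate the idea, here is a minimal local sketch, not Modin's implementation. `Block` and `get_indices_sketch` are hypothetical names, partitions are a plain nested list instead of a NumPy array of remote partitions, and the default `index_func` shown here is an assumption about what a "sane default" could look like. It only shows how returning the per-partition indices next to the total index makes the lengths available for free.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Block:
    """Hypothetical stand-in for one partition: only its labels are kept."""

    index: List
    columns: List


def get_indices_sketch(
    axis: int,
    partitions: List[List[Block]],
    index_func: Optional[Callable[[Block], List]] = None,
) -> Tuple[List, List[List]]:
    """Return (total index, per-partition indices) along ``axis``."""
    if index_func is None:
        # Assumed default: take the labels along the requested axis.
        index_func = lambda block: block.index if axis == 0 else block.columns
    # Same selection as in the PR diff: for rows, walk the first column of
    # blocks (the transposed grid's first row); for columns, the first row.
    target = [list(col) for col in zip(*partitions)] if axis == 0 else partitions
    if not target:
        return [], []
    internal = [list(index_func(block)) for block in target[0]]
    total = [label for labels in internal for label in labels]
    return total, internal


# A 5x4 frame split into a 2x2 grid of blocks.
grid = [
    [Block([0, 1, 2], ["a", "b"]), Block([0, 1, 2], ["c", "d"])],
    [Block([3, 4], ["a", "b"]), Block([3, 4], ["c", "d"])],
]

total_rows, row_parts = get_indices_sketch(0, grid)
total_cols, col_parts = get_indices_sketch(1, grid)

# The lengths fall out of the per-partition indices, with no extra calls.
row_lengths = [len(part) for part in row_parts]
column_widths = [len(part) for part in col_parts]
```

In the real code the blocks are remote partitions and `index_func` runs on the workers, so getting the lengths from the same result is what saves the separate round trip.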

commit message follows format outlined here
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves Improve PartitionManager.get_indices() usage #4717
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date
added (Issue Number: PR title (PR Number)) and github username to release notes for next major release

codecov · 2022-07-26T10:55:34Z

Codecov Report

Merging #4718 (9e81360) into master (0236358) will increase coverage by 4.71%.
The diff coverage is 94.11%.

@@            Coverage Diff             @@
##           master    #4718      +/-   ##
==========================================
+ Coverage   85.16%   89.87%   +4.71%     
==========================================
  Files         259      260       +1     
  Lines       19205    19495     +290     
==========================================
+ Hits        16355    17521    +1166     
+ Misses       2850     1974     -876

Impacted Files	Coverage Δ
modin/experimental/batch/pipeline.py	`100.00% <ø> (+100.00%)`	⬆️
...tal/core/storage_formats/pyarrow/query_compiler.py	`0.00% <0.00%> (ø)`
modin/core/dataframe/pandas/dataframe/dataframe.py	`95.56% <100.00%> (+<0.01%)`	⬆️
...dataframe/pandas/partitioning/partition_manager.py	`90.15% <100.00%> (+3.77%)`	⬆️
modin/core/io/pickle/pickle_dispatcher.py	`92.59% <100.00%> (ø)`
modin/distributed/dataframe/pandas/partitions.py	`88.23% <100.00%> (+1.00%)`	⬆️
modin/logging/config.py	`94.59% <0.00%> (-1.30%)`	⬇️
modin/experimental/batch/test/test_pipeline.py	`100.00% <0.00%> (ø)`
modin/pandas/series.py	`94.23% <0.00%> (+0.24%)`	⬆️
modin/pandas/series_utils.py	`99.43% <0.00%> (+0.56%)`	⬆️
... and 40 more

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>

mvashishtha

@vnlitvinov thank you, this is excellent! I left one minor cleanup comment.

mvashishtha · 2022-07-26T13:09:09Z

modin/core/dataframe/pandas/partitioning/partition_manager.py

-                [idx.apply(func) for idx in partitions[0]] if len(partitions) else []
-            )
+        target = partitions.T if axis == 0 else partitions
+        new_idx = [idx.apply(func) for idx in target[0]] if len(target) else []
        new_idx = cls.get_objects_from_partitions(new_idx)
        # TODO FIX INFORMATION LEAK!!!!1!!1!!


I don't think there's any information leak here anymore :) I think the PR that added get_objects_from_partitions deleted the access to partitions[i]._data. Could you delete this warning?

I just double checked, and the info leak is still there in the PandasOnRayDataframePartitionManager:

modin/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition_manager.py

Line 110 in 49c0398

return ray.get([partition._data for partition in partitions])

I think it's doing this because partition.get would materialize the partitions serially, so reading _data and passing everything to a single ray.get makes sense. The problem is that if a partition has a call queue, it will be ignored. Perhaps this is OK, though?

I dug into the history of this comment, and I still don't know what it's talking about. The diff from the original PR is here. I don't see any information leak there, where the code uses get() to get the data.

Regarding the _data access in the Ray partition manager, I think you're correct that it's incorrect because it doesn't drain the call queue. This use in get_indices happens to work because apply drains the call queue. I think we need to take a stance as to whether we should allow accessing the _data of block/non-full-axis-virtual partitions. We should probably use get(), which drains the call queue, instead.

So back to this comment, I think the right thing would be to file an issue to drain the call queue before getting _data (or maybe just to get rid of all accesses to _data), and also to move this comment to get_objects_from_partitions until that issue is fixed. But I don't think that's a responsibility for this PR.

Related issue #4530

Sgtm - I'll go ahead and approve and merge then.

Yeah, I think we should get rid of ._data access anywhere but in the partition class itself (ideally).
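
For readers following the thread, the call-queue concern can be reproduced with a toy partition. This is a hedged sketch under stated assumptions: `ToyPartition` and its methods are hypothetical and only loosely modeled on the names used above (`_data`, `get()`, call queue); the real partition classes hold remote object references and are more involved.

```python
class ToyPartition:
    """Hypothetical partition that defers work in a call queue."""

    def __init__(self, data):
        self._data = data
        self._call_queue = []

    def add_to_call_queue(self, func):
        # Defer the function instead of applying it right away.
        self._call_queue.append(func)

    def drain_call_queue(self):
        # Apply every deferred function in order, then clear the queue.
        for func in self._call_queue:
            self._data = func(self._data)
        self._call_queue = []

    def get(self):
        # The safe accessor: always drains before returning the data.
        self.drain_call_queue()
        return self._data


part = ToyPartition([1, 2, 3])
part.add_to_call_queue(lambda rows: [r * 10 for r in rows])

# Reading _data directly skips the queued work and sees stale values.
stale = list(part._data)

# get() drains the queue first, so the deferred call is reflected.
fresh = part.get()
```

Here `stale` still holds the original rows while `fresh` reflects the queued multiplication, which is why reading `_data` from outside the partition class is only safe when something else has already drained the queue.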

RehanSD

LGTM - let's merge after @mvashishtha's comments are resolved!

RehanSD · 2022-07-26T17:28:14Z

modin/experimental/batch/pipeline.py

+            )
+            query_compiler = PandasQueryCompiler(result_modin_frame)
+            result_df = pd.DataFrame(query_compiler=query_compiler)
+            final_results[id] = result_df


Thank you - this refactor is way way neater lol!

YarShev

LGTM!

RehanSD

LGTM!

…modin-project#4718) Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru> Co-authored-by: Yaroslav Igoshev <Poolliver868@mail.ru> Co-authored-by: Rehan Sohail Durrani <rehan@ponder.io>

vnlitvinov force-pushed the improve-get-indices branch 4 times, most recently from 3ea18ca to e92aeda Compare July 26, 2022 10:52

REFACTOR-modin-project#4717: Improve PartitionMgr.get_indices() usage

b3bacfd

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>

vnlitvinov force-pushed the improve-get-indices branch from e92aeda to b3bacfd Compare July 26, 2022 11:03

vnlitvinov marked this pull request as ready for review July 26, 2022 13:01

vnlitvinov requested a review from a team as a code owner July 26, 2022 13:01

mvashishtha reviewed Jul 26, 2022

RehanSD previously approved these changes Jul 26, 2022

Merge branch 'master' into improve-get-indices

e6bb7c5

YarShev dismissed RehanSD’s stale review via e6bb7c5 July 26, 2022 19:37

YarShev previously approved these changes Jul 26, 2022

mvashishtha previously approved these changes Jul 26, 2022

RehanSD previously approved these changes Jul 26, 2022

Merge branch 'master' into improve-get-indices

9e81360

RehanSD dismissed stale reviews from mvashishtha, YarShev, and themself via 9e81360 July 26, 2022 21:23

RehanSD approved these changes Jul 26, 2022

RehanSD merged commit 88c3c33 into modin-project:master Jul 26, 2022

vnlitvinov mentioned this pull request Aug 26, 2022

REFACTOR: remove redefinition of _row_lengths and _column_widths functions in PandasOnDaskDataframe. #3780

Closed
