FEAT-#4909: Properly implement map operator #5118

noloerino · 2022-10-12T00:50:03Z

What do these changes do?

This PR cleans up the interfaces of the various map, apply, and broadcast dataframe and partition manager methods.

Since reduce and treereduce both use these methods, these are also affected by the aforementioned changes. The changes also incidentally address #4912 and (partially) #5094, but those changes can be separated out fairly easily if this PR is too large.

Overall, the following changes have been made to the dataframe API (the partition manager changes are very similar):

| Old method | New method |
| --- | --- |
| map and broadcast_apply | map_partitions |
| apply_full_axis and broadcast_apply_full_axis | map_partition_full_axis |
| apply_select_indices and broadcast_apply_select_indices | map_select_indices |
| apply_func_to_indices_both_axis | map_select_indices_both_axes |
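
To illustrate the general idea of folding a plain map and a "broadcast" variant into one entry point, here is a minimal toy sketch. It is not Modin's implementation: partitions are modelled as nested lists of ints, the `Grid` and `Partition` aliases and the helper functions are hypothetical, and `other` is assumed to be already aligned partition by partition, which is a simplification.

```python
from typing import Callable, List, Optional

Partition = List[List[int]]   # toy stand-in for one partition (a small 2D block)
Grid = List[List[Partition]]  # toy stand-in for a 2D grid of partitions


def map_partitions(
    grid: Grid,
    func: Callable[..., Partition],
    other: Optional[Grid] = None,
) -> Grid:
    """Apply func to every partition, passing the matching partition of other if given."""
    if other is None:
        # Unary case: what a plain map would do.
        return [[func(part) for part in row] for row in grid]
    # Binary case: assumes both grids are partitioned identically (hypothetical simplification).
    if len(other) != len(grid) or any(len(a) != len(b) for a, b in zip(grid, other)):
        raise ValueError("partition grids must have the same shape")
    return [
        [func(part, other_part) for part, other_part in zip(row, other_row)]
        for row, other_row in zip(grid, other)
    ]


def negate(block: Partition) -> Partition:
    return [[-value for value in row] for row in block]


def add_blocks(left: Partition, right: Partition) -> Partition:
    return [[a + b for a, b in zip(lrow, rrow)] for lrow, rrow in zip(left, right)]


grid = [[[[1, 2]], [[3]]]]  # one row of two partitions
negated = map_partitions(grid, negate)
doubled = map_partitions(grid, add_blocks, other=grid)
```

The trade-off raised later in review applies here as well: a single entry point keeps call sites uniform, but every extra optional argument adds a branch inside the method.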

A lot of logic that used to be in separate functions got moved into nested if/else chains with this refactor: suggestions on how to clean up the code would be appreciated.

Microbenchmarks

All tests were run on an EC2 t2.2xlarge instance (8 CPUs, 32 GiB RAM, 128 GB disk, Ubuntu Jammy AMD64) with the Ray backend, with int64 dataframes of size 2^16 x 2^14. Each test was run 5 times and averaged.

These benchmarks seem to indicate no appreciable performance difference on datasets of this size.
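
For reference, a "run 5 times and average" measurement can be sketched as below. The helper name `average_runtime` and the placeholder workload are assumptions for illustration; the actual benchmark script is not part of this thread. With an asynchronous backend, the timed callable would presumably also need to wait for the result to be materialized.

```python
import time
from statistics import mean
from typing import Callable, List


def average_runtime(fn: Callable[[], object], repeats: int = 5) -> float:
    """Call fn `repeats` times and return the mean wall-clock time in seconds."""
    timings: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return mean(timings)


# Placeholder workload; a real run would time e.g. a dataframe operation instead.
elapsed = average_runtime(lambda: sum(range(100_000)))
print(f"{elapsed:.4f}s")
```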

abs

The abs function is changed to map across rows rather than cell-wise.

PR (f5ef6f9e): 0.0354s
master (c070b65): 0.0352s

apply

The test ran df.apply(np.sum, axis=0).

PR: 9.0078s
master: 9.0166s

describe

PR: 32.3454s
master: 32.0462s

commit message follows format outlined here
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves FEAT: Properly implement map operator #4909, BUG: first_valid_index errors on dataframe with only None/NaN values #4912, BUG: Passing string as axis argument leads to incorrect behavior #5094
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date

codecov · 2022-10-12T01:00:59Z

Codecov Report

Merging #5118 (f5ef6f9) into master (abcf1e9) will increase coverage by 4.51%.
The diff coverage is 94.87%.

@@            Coverage Diff             @@
##           master    #5118      +/-   ##
==========================================
+ Coverage   84.56%   89.08%   +4.51%     
==========================================
  Files         256      257       +1     
  Lines       19349    19613     +264     
==========================================
+ Hits        16363    17472    +1109     
+ Misses       2986     2141     -845

Impacted Files	Coverage Δ
...ns/pandas_on_ray/partitioning/partition_manager.py	`71.73% <ø> (+4.34%)`	⬆️
modin/core/dataframe/base/dataframe/dataframe.py	`95.34% <60.00%> (-4.66%)`	⬇️
...dataframe/pandas/partitioning/partition_manager.py	`88.88% <91.50%> (+1.98%)`	⬆️
modin/core/dataframe/pandas/dataframe/dataframe.py	`95.18% <94.49%> (-0.06%)`	⬇️
modin/core/dataframe/algebra/binary.py	`100.00% <100.00%> (ø)`
modin/core/dataframe/base/dataframe/utils.py	`100.00% <100.00%> (ø)`
...me/pandas/interchange/dataframe_protocol/column.py	`93.33% <100.00%> (ø)`
...pandas/interchange/dataframe_protocol/dataframe.py	`96.72% <100.00%> (ø)`
...odin/core/storage_formats/pandas/query_compiler.py	`96.34% <100.00%> (+0.40%)`	⬆️
...ecution/ray/implementations/pandas_on_ray/io/io.py	`93.33% <100.00%> (ø)`
... and 51 more

lgtm-com · 2022-10-12T18:40:04Z

This pull request introduces 1 alert when merging 5c0478cfb1740123b64fd58ccdd8b2a8604dd2ef into 88f7b27 - view on LGTM.com

new alerts:

1 for Wrong name for an argument in a class instantiation

dchigarev

Left some comments.

BTW, do we really need to combine all of the map functions into a single one? IMO some of them became really huge, complicated, and hard to read. Especially PartitionManager.map_select_indices and PandasDataframe._map_axis.

I would suggest either refactoring them somehow to relax the complexity or splitting some of them into separate methods.

dchigarev · 2022-10-14T13:14:01Z

modin/core/dataframe/base/dataframe/utils.py

+AxisInt = Literal[0, 1]
+"""Type for the two possible integer values of an axis argument (0 or 1)."""
+
+


Why is this needed? I mean, why would we extend internal dataframe API to also be able to accept AxisInt when we already have Axis enum?

A lot of the codebase (mostly the query compiler) is written to call dataframe methods with a literal int rather than the Axis enum. I think it would be easier to re-wrap the axis with the enum from within dataframe methods (as is done now) than to go through and fix every instance where relevant dataframe methods are called to use the enum instead.

I don't see why we need this Axis enum then. I really don't like this mixing of Axis, AxisInt, and actual integers for an axis value. I think we should pick only one of the ways of interpreting an axis and then really stick to this, not introducing a variety of axis types in order to cover an existing zoo of value types.
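
One way to reconcile the two positions is to accept loose values only at the boundary and convert them to the enum immediately, so everything downstream sees a single type. The sketch below is an assumption-laden illustration: the `Axis` class is a local stand-in (only `CELL_WISE` appears in this diff; the other member names are assumed), and `normalize_axis` is a hypothetical helper, not an existing Modin function.

```python
from enum import Enum
from typing import Literal, Optional, Union


class Axis(Enum):
    """Local stand-in for the axis enum discussed in the thread."""

    ROW_WISE = 0     # assumed member name
    COL_WISE = 1     # assumed member name
    CELL_WISE = None


AxisInt = Literal[0, 1]


def normalize_axis(axis: Optional[Union[AxisInt, Axis]]) -> Axis:
    """Convert an int/None/Axis argument to Axis, rejecting anything else."""
    if isinstance(axis, Axis):
        return axis
    # bool is a subclass of int, so reject it explicitly; strings are rejected too.
    if isinstance(axis, bool) or axis not in (0, 1, None):
        raise ValueError(f"axis must be 0, 1, None, or an Axis member, got {axis!r}")
    return Axis(axis)


print(normalize_axis(1))     # Axis.COL_WISE
print(normalize_axis(None))  # Axis.CELL_WISE
```

Rejecting strings at this point would also surface mistakes such as passing `"columns"` instead of silently treating them as a valid axis.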

dchigarev · 2022-10-14T13:15:37Z

modin/core/dataframe/base/dataframe/dataframe.py

+        copy_dtypes : bool, default: False
+            If True, the dtypes of the resulting dataframe are copied from the original,
+            and the ``dtypes`` argument is ignored.


why do we offer a copy_dtypes option only for the map operator but not for reduce and tree_reduce?

I'm not really sure, though my guess is that the frequently dimension-reducing nature of reduce/tree-reduce makes the argument less relevant for those cases. Here, I introduced copy_dtypes as a replacement for dtypes="copy", which is a little hacky.
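
The difference between a `dtypes="copy"` sentinel and an explicit `copy_dtypes` flag can be sketched as follows. This is a simplified illustration, not the PR's code: `resolve_dtypes` is a hypothetical helper, plain Python types stand in for real dtypes, and a dict stands in for whatever container an execution would use.

```python
from typing import Dict, Mapping, Optional, Sequence, Union


def resolve_dtypes(
    columns: Sequence[str],
    original: Mapping[str, type],
    dtypes: Optional[Union[Mapping[str, type], type]] = None,
    copy_dtypes: bool = False,
) -> Optional[Dict[str, type]]:
    """Return the dtypes of a map result, or None if they are unknown."""
    if copy_dtypes:
        # The flag wins and `dtypes` is ignored, as the docstring in the diff states.
        return dict(original)
    if dtypes is None:
        return None  # unknown; would have to be computed later
    if isinstance(dtypes, type):
        # A single type applies to every column.
        return {col: dtypes for col in columns}
    return {col: dtypes[col] for col in columns}


original = {"a": int, "b": float}
print(resolve_dtypes(["a", "b"], original, copy_dtypes=True))
print(resolve_dtypes(["a", "b"], original, dtypes=bool))
```

With a boolean flag, the `dtypes` parameter no longer has to accept a magic string alongside real dtype values, which keeps its type annotation honest.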

dchigarev · 2022-10-14T13:25:58Z

modin/core/dataframe/base/dataframe/dataframe.py

        axis: Optional[Union[int, Axis]] = None,
-        dtypes: Optional[str] = None,
+        dtypes: Optional[Union[pandas.Series, type]] = None,


If I remember correctly, there was a discussion regarding limiting the usage of pandas entities in the base classes of Modin internals. Some executions may not require pandas at all and wouldn't like to deal with handling breaking changes introduced by some pandas updates.

May we define the dtypes type as something abstract like collections.abc.Mapping so every execution would use whatever container they like?

Sure, that makes sense. Is there some other generic container that would accept pandas.Series though? It seems like it's not a subclass of Mapping.

dchigarev · 2022-10-14T14:44:11Z

modin/core/storage_formats/pandas/query_compiler.py

            new_columns=new_columns,
        )
        return self.__constructor__(new_modin_frame)

    # Map partitions operations
    # These operations are operations that apply a function to every partition.
-    abs = Map.register(pandas.DataFrame.abs, dtypes="copy")
+    # Though all these operations are element-wise, some can be sped up by mapping across a row/col


Could you elaborate on how the speed-up is achieved?

IMO cell-wise execution should generally be faster than row/col-wise execution.

My quick and dirty micro-benchmarks show no difference between specifying an axis vs. applying cell-wise, so perhaps it's best to revert back to cell-wise operations. The hope is that for certain operators, being able to apply across a whole axis rather than having to examine each cell would provide a speedup. I will see if any other benchmarks would justify this theory.

noloerino · 2022-11-01T23:41:56Z

Updated benchmarks for this PR (02a16191, slightly older version before a rebase) vs. current master (6f0ff79).

df.abs() (2^16 x 2^14)

02a16191 - 0.0352s
6f0ff79 - 0.0354s

df.apply(np.sum, axis=0) (2^16 x 2^14)

02a16191 - 6.531s
6f0ff79 - 8.072s

df1 + df2 (2^15 x 2^14 each)

02a16191 - 0.0218s
6f0ff79 - 0.0212s

df.describe() (2^16 x 2^14)

02a16191 - 33.1s
6f0ff79 - 32.8s

df.isna().any() (2^16 x 2^14)

02a16191 - 0.0782s
6f0ff79 - 0.0858s

I haven't yet checked the sources of the speedup (e.g. whether they come from shorter code paths/less partition overhead, or from changing maps to be axis-wise).
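
To make the comparison easier to read, the ratios can be computed directly from the timings listed above. The snippet only restates those numbers; the dictionary layout and variable names are illustrative.

```python
# (PR 02a16191 seconds, master 6f0ff79 seconds), copied from the list above.
timings = {
    "df.abs()": (0.0352, 0.0354),
    "df.apply(np.sum, axis=0)": (6.531, 8.072),
    "df1 + df2": (0.0218, 0.0212),
    "df.describe()": (33.1, 32.8),
    "df.isna().any()": (0.0782, 0.0858),
}

# Ratio > 1 means the PR is faster than master; < 1 means it is slower.
speedups = {name: master / pr for name, (pr, master) in timings.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.3f}x")
```

By this measure, `df.apply(np.sum, axis=0)` is roughly 1.24x faster on the PR branch and `df.isna().any()` roughly 1.10x faster, while the other three operations are within about 3% of master in either direction.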

Signed-off-by: Jonathan Shi <jhshi@ponder.io>

noloerino · 2022-12-07T02:02:33Z

CI should be passing now (I ran it on my own repository before pushing here).

dchigarev

May I kindly ask, what was the original idea of the PR? It seems that this PR tries to solve too many problems at once. That makes it really hard for me to review and for you to make changes.

I feel that the PR covers the following distinct topics:

Align how we use axis argument in low-level dataframe
Introduce new logic for working with dtypes/copy_dtypes parameters
Combine map and broadcast_apply into map_partitions
Combine apply_full_axis and broadcast_apply_full_axis into map_partition_full_axis
Combine apply_select_indices and broadcast_apply_select_indices into map_select_indices
Rework apply_func_to_indices_both_axis into map_select_indices_both_axes

All of these could be handled in separate small PRs rather than one huge one. They probably can't all be done in parallel, as some of them may block each other; however, I think the changes would make much more sense if introduced in small iterations.

dchigarev · 2022-12-07T11:33:37Z

modin/core/dataframe/base/dataframe/utils.py

+AxisInt = Literal[0, 1]
+"""Type for the two possible integer values of an axis argument (0 or 1)."""
+
+


I don't see why we need this Axis enum then. I really don't like this mixing of Axis, AxisInt, and actual integers for an axis value. I think we should pick only one of the ways of interpreting an axis and then really stick to this, not introducing a variety of axis types in order to cover an existing zoo of value types.

dchigarev · 2022-12-07T11:40:58Z

modin/core/dataframe/pandas/dataframe/dataframe.py

+        join_type : str, default: "left"
+            Type of join to apply.


we have a special enum for this, let's use it

modin/modin/core/dataframe/base/dataframe/utils.py

Line 39 in 2ebc9cf

class JoinType(Enum): # noqa: PR01

dchigarev · 2022-12-07T11:48:04Z

modin/core/dataframe/pandas/dataframe/dataframe.py

+            axis=0,
+            other_partitions=None,
+            full_axis=False,
+            apply_indices=[0],
+            other_apply_indices=None,


do we really want these parameters to be specified? it seems that they just duplicate default values

Suggested change

-            axis=0,
-            other_partitions=None,
-            full_axis=False,
-            apply_indices=[0],
-            other_apply_indices=None,
+            axis=0,
+            apply_indices=[0],

dchigarev · 2022-12-07T11:51:56Z

modin/core/dataframe/pandas/dataframe/dataframe.py

+            the partitions will be concatenated together before the function is called, and then re-split
+            after it returns.
+        join_type : str, default: "left"
+            Type of join to apply.


can you please elaborate? something like this is expected:

Suggested change

-            Type of join to apply.
+            Type of join to apply if the concatenation of `self` and `other` would be required.

dchigarev · 2022-12-07T11:53:54Z

modin/core/dataframe/pandas/dataframe/dataframe.py

-            dtypes=dtypes,
-        )
+        if axis == Axis.CELL_WISE:
+            return self._map_cellwise(func, dtypes)


why does cell-wise map ignore all other parameters?

dchigarev · 2022-12-07T12:05:44Z

modin/core/dataframe/pandas/dataframe/dataframe.py

+            new_partitions = self._partition_mgr_cls.map_partitions(
+                self._partitions, func
+            )


why do we ignore axis here? why is the .map_partitions call inside of _map_axis, which is supposed to call the function axis-wise only?

dchigarev · 2022-12-07T12:06:44Z

modin/core/dataframe/pandas/dataframe/dataframe.py

+        *,
+        axis: Optional[Union[AxisInt, Axis]] = None,
+        other: Optional["PandasDataframe"] = None,
+        full_axis=False,


why do we need this parameter if we have a separate method for this (map_full_axis)?

dchigarev · 2022-12-07T12:13:27Z

modin/core/dataframe/pandas/dataframe/dataframe.py

+        kw = self._make_init_labels_args(new_partitions, new_index, new_columns)
+        if copy_dtypes:
+            kw["dtypes"] = self._dtypes
+        elif isinstance(dtypes, type):


judging by the method's signature we are only supposed to allow pandas.Series to be a dtype parameter, why is this logic here then? Let's either change the signature or adapt the logic somehow

dchigarev · 2022-12-07T12:17:57Z

modin/core/dataframe/pandas/dataframe/dataframe.py

+        apply_indices=None,
+        numeric_indices=None,


do we really want these two parameters to exist at the same time? we can easily end-up in an ambiguous situation with this set of parameters:

md_df.map_select_indices(
    apply_indices=["a", "b"],
    numeric_indices=[1, 2, 3, 4, 5],
    ...
)  # what's the method supposed to do?
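
If both parameters were kept, one possible way to rule out the ambiguous call is to require exactly one of them and fail early otherwise. The sketch below is hypothetical: `pick_positions` and its `labels` argument are illustrative names, not part of the PR.

```python
from typing import List, Optional, Sequence


def pick_positions(
    labels: Sequence[str],
    apply_indices: Optional[Sequence[str]] = None,
    numeric_indices: Optional[Sequence[int]] = None,
) -> List[int]:
    """Resolve the selection to positions, requiring exactly one selector."""
    if (apply_indices is None) == (numeric_indices is None):
        raise ValueError("pass exactly one of apply_indices or numeric_indices")
    if numeric_indices is not None:
        return list(numeric_indices)
    # Translate labels to positions so later code handles only one representation.
    positions = {label: pos for pos, label in enumerate(labels)}
    return [positions[label] for label in apply_indices]


print(pick_positions(["a", "b", "c"], apply_indices=["a", "c"]))
print(pick_positions(["a", "b", "c"], numeric_indices=[1]))
```

The alternative raised by the question is to keep a single parameter and drop the other, which removes the need for this check altogether.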

dchigarev · 2022-12-07T12:41:56Z

modin/core/dataframe/pandas/dataframe/dataframe.py


-    def rename(
+    def window(


why are we adding this here if there's no implementation? Shouldn't it be located in the base class then?

noloerino · 2022-12-07T22:26:01Z

Thanks for taking the time to review @dchigarev. Broadly speaking, the purpose of this PR is to make calling the various partition application methods more uniform, and remove misleading "broadcast" nomenclature from the codebase (my understanding is that when the functions were originally written, the intent was for the functions to broadcast arguments to match dimensions like in some numpy functions).

I'll see if I can split this into several smaller PRs; your suggestions for how to break it down make sense, although this fragmentation might cause some inconsistencies between how different mapping methods are used. I'll double-check with @RehanSD (who assigned me to this task) whether this is a viable approach.

noloerino · 2022-12-13T00:48:33Z

I've decided to split this into smaller parts as you suggested, starting with #5426 and #5427. Thanks again for the advice @dchigarev.

noloerino marked this pull request as ready for review October 12, 2022 20:32

noloerino requested a review from a team as a code owner October 12, 2022 20:32

dchigarev requested changes Oct 14, 2022

noloerino force-pushed the map-operator branch 3 times, most recently from 02a1619 to 363dcfd Compare November 1, 2022 19:42

noloerino force-pushed the map-operator branch from 363dcfd to 429511f Compare November 7, 2022 17:51

noloerino force-pushed the map-operator branch from 429511f to 5c11ab4 Compare November 21, 2022 18:08

noloerino force-pushed the map-operator branch from 5c11ab4 to d1596d8 Compare November 29, 2022 18:01

noloerino added 10 commits December 6, 2022 13:16

FEAT-modin-project#4909: Properly implement map operator

da415ad

Signed-off-by: Jonathan Shi <jhshi@ponder.io>

suppress mypy warning for differing imports

af1926a

Signed-off-by: Jonathan Shi <jhshi@ponder.io>

rename unidist partition method list

be8ff60

Signed-off-by: Jonathan Shi <jhshi@ponder.io>

fix unidist io map

ea1c425

Signed-off-by: Jonathan Shi <jhshi@ponder.io>

as last commit

0cd70f5

Signed-off-by: Jonathan Shi <jhshi@ponder.io>

fix apply_full_axis in unidist io

cb5c99b

Signed-off-by: Jonathan Shi <jhshi@ponder.io>

change remaining wrong functions

f506eed

Signed-off-by: Jonathan Shi <jhshi@ponder.io>

lint

2d987b0

Signed-off-by: Jonathan Shi <jhshi@ponder.io>

edit doc

81afa94

Signed-off-by: Jonathan Shi <jhshi@ponder.io>

fix str copy_dtypes

07aef83

Signed-off-by: Jonathan Shi <jhshi@ponder.io>

noloerino force-pushed the map-operator branch from ab2c1c9 to 07aef83 Compare December 7, 2022 02:01

dchigarev reviewed Dec 7, 2022

noloerino mentioned this pull request Dec 13, 2022

REFACTOR: Replace dtypes="copy" with copy_dtypes flag #5424

Open

noloerino marked this pull request as draft December 13, 2022 00:47
