PERF-#6398: Improved performance of list-like objects insertion into DataFrames #6476

AndreyPavlenko · 2023-08-09T13:18:09Z

Wrap a list-like object into a single-column query compiler before the insertion.

Note: this PR does not cover the HDK backend. For HDK, a separate PR (#6412) has been created, but it is not ready yet because of the HDK issue intel/hdk#588.

What do these changes do?

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves Improve performance of list-like objects insertion into DataFrames #6398
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date

modin/pandas/test/conftest.py

+                    )
+    except ImportError:
+        # No engine
+        ...


anmyachev

I think we should fix #6399 first.

P.S. The name of the PR should start with PERF-.

modin/core/storage_formats/pandas/query_compiler.py

anmyachev · 2023-08-10T13:37:30Z

modin/pandas/dataframe.py

@@ -2511,7 +2511,7 @@ def setitem_unhashable_key(df, value):
                value = value.T.reshape(-1)
                if len(self) > 0:
                    value = value[: len(self)]
-            if not isinstance(value, (Series, Categorical, np.ndarray)):
+            if not isinstance(value, (Series, Categorical, np.ndarray, list, range)):


list(value) makes a copy if value is already a list?

Yes, it makes an unnecessary copy.
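
The copy discussed above can be illustrated with plain Python. This is only a minimal sketch: `as_sequence` is a hypothetical helper, not part of Modin, and it assumes the guarded branch in `setitem_unhashable_key` converts the value with `list(value)`. It shows why adding `list` and `range` to the `isinstance` check avoids the extra copy.

```python
def as_sequence(value):
    """Hypothetical helper: keep lists and ranges as-is, materialize anything else."""
    if isinstance(value, (list, range)):
        # Already an indexable sequence, so no conversion (and no copy) is needed.
        return value
    return list(value)


data = [1, 2, 3]

# list() always builds a new list object, even when given a list.
copied = list(data)
print(copied == data, copied is data)

# With the isinstance guard, the original objects are passed through untouched.
same_list = as_sequence(data)
lazy = range(3)
same_range = as_sequence(lazy)
print(same_list is data, same_range is lazy)

# Other iterables (here a generator) are still materialized into a list.
from_generator = as_sequence(x * 2 for x in data)
print(from_generator)
```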

anmyachev · 2023-08-10T14:11:55Z

modin/pandas/test/conftest.py

+                ):
+                    item.add_marker(
+                        pytest.mark.xfail(
+                            reason="https://github.com/modin-project/modin/issues/6399"


I wouldn't like to merge perf fixes that break current tests.

…sertion into DataFrames Wrap a list-like object into a single-column query compiler before the insertion. Signed-off-by: Andrey Pavlenko <andrey.a.pavlenko@gmail.com>

dchigarev · 2023-08-18T16:08:09Z

modin/core/storage_formats/pandas/query_compiler.py

@@ -2922,6 +2924,7 @@ def _compute_duplicated(df):  # pragma: no cover
    # return a new one from here and let the front end handle the inplace
    # update.
    def insert(self, loc, column, value):
+        value = self._wrap_column_data(value)


Looks fine as a temporary hack.

However, there's a TODO message a few lines below noting that the insertion of non-distributed data can be sped up significantly by replacing the apply_full_axis... call with apply_select_indices(item_to_distribute=value, ...)

modin/modin/core/storage_formats/pandas/query_compiler.py

Lines 2945 to 2946 in 781fed8

# TODO: rework by passing list-like values to `apply_select_indices`

# as an item to distribute

This way, we wouldn't need to create and wrap a new dataframe; instead, the value would be propagated directly to the partitions, which, in theory, should be quite fast.

Could you please quickly check whether the approach from the TODO message makes sense? If it turns out to be too inefficient or too complex to implement, we can stick with the changes introduced by this PR.
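
The idea of propagating the value directly to the partitions can be sketched with a toy model. This is not Modin's `apply_select_indices` API: partitions are modelled here as plain lists of rows, and `split_by_lengths` and `insert_column` are hypothetical names used only for illustration. The value is cut into chunks that match each row partition, and every partition inserts its own chunk locally.

```python
from itertools import accumulate


def split_by_lengths(values, lengths):
    """Cut `values` into consecutive chunks whose sizes match `lengths`."""
    if sum(lengths) != len(values):
        raise ValueError("Length of values does not match the number of rows")
    bounds = [0, *accumulate(lengths)]
    return [values[start:stop] for start, stop in zip(bounds, bounds[1:])]


def insert_column(row_partitions, loc, values):
    """Insert `values` as a new column at position `loc` in every row partition.

    `row_partitions` is a list of partitions; each partition is a list of rows,
    and each row is a list of cell values (a toy stand-in for real partitions).
    """
    lengths = [len(part) for part in row_partitions]
    chunks = split_by_lengths(values, lengths)
    # Each partition only touches its own chunk of the new column.
    return [
        [row[:loc] + [item] + row[loc:] for row, item in zip(part, chunk)]
        for part, chunk in zip(row_partitions, chunks)
    ]


partitions = [[[1, 2], [3, 4]], [[5, 6]]]
result = insert_column(partitions, 1, [10, 20, 30])
print(result)
```

In this model no intermediate single-column frame is built; whether the same holds for the real partition manager is exactly what the review comment asks to verify.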

dchigarev · 2023-08-18T16:14:12Z

modin/pandas/test/conftest.py

+                    "test_dataframe_dt_index[3s-both-DateCol-0]",
+                    "test_dataframe_dt_index[3s-right-DateCol-0]",


It seems that these tests fail any time the execution reaches this if-else branch:

modin/modin/pandas/test/test_rolling.py

Lines 162 to 163 in 781fed8

pandas_df[on] = pandas.date_range("22/06/1941", periods=12, freq="T")

modin_df[on] = pd.date_range("22/06/1941", periods=12, freq="T")

can we call .xfail() directly from there?

We can, but we will definitely forget to remove this call when the problem is fixed.

anmyachev · 2023-08-22T11:04:09Z

This PR will affect (at least) the following cases:

We need to consider whether we should fix these cases first. Perhaps, as a start, it would be enough to default to pandas based on some heuristic that detects access to columns that may be in different partitions.

cc @dchigarev @YarShev

dchigarev · 2023-08-22T14:38:40Z

By the way, I believe that both #2511 and #3435 should work fine if cfg.ExperimentalGroupbyImpl is enabled. However, I don't think the experimental groupby is mature enough to enable it unconditionally for groupby.apply().

As a temporary fix, I think it might be reasonable to fall back to pandas for groupby.apply() when there is more than one column partition. However, this would hurt performance dramatically for all .apply() calls, even if the applied function is simple and doesn't trigger the problem, which is quite sad, so I neither agree nor disagree with this approach.

A perfect solution would be to bring the experimental groupby to a level of quality that is satisfactory enough to route all .apply() calls to it before the release (for that we have to fix #6465 and finish 1 or 2 additional features; I'm not sure it will fit in this release).

Garra1980 · 2023-08-22T17:12:43Z

The fallback-to-pandas option looks bad to me.

I suggest, then, that we fix this one and plan #6465 for this release, hopefully closing #2511 and #3435; #6399 can be fixed later :)

@dchigarev

If #6465 can solve the problem, then it makes more sense for the final decision to be yours, @dchigarev, since you are more aware of how much more needs to be done there.

Garra1980 · 2023-08-24T20:52:18Z

Let's merge then

AndreyPavlenko force-pushed the issue-6398-ray branch from ded13c3 to 3c04a06 Compare August 9, 2023 13:54

github-advanced-security bot found potential problems Aug 9, 2023

modin/pandas/test/conftest.py

                    )
    except ImportError:
        # No engine
        ...

Check notice · Code scanning / CodeQL · Statement has no effect (Note)

This statement has no effect.

AndreyPavlenko marked this pull request as ready for review August 9, 2023 18:18

AndreyPavlenko requested a review from a team as a code owner August 9, 2023 18:18

anmyachev previously requested changes Aug 10, 2023

AndreyPavlenko changed the title ~~FEAT-#6398: Improved performance of list-like objects insertion into DataFrames~~ PERF-#6398: Improved performance of list-like objects insertion into DataFrames Aug 10, 2023

PERF-modin-project#6398: Improved performance of list-like objects in…

eb6ed27

…sertion into DataFrames Wrap a list-like object into a single-column query compiler before the insertion. Signed-off-by: Andrey Pavlenko <andrey.a.pavlenko@gmail.com>

AndreyPavlenko force-pushed the issue-6398-ray branch from 3c04a06 to eb6ed27 Compare August 10, 2023 14:28

dchigarev reviewed Aug 18, 2023

dchigarev approved these changes Aug 24, 2023

dchigarev merged commit da385c9 into modin-project:master Aug 24, 2023
38 checks passed


AndreyPavlenko commented Aug 9, 2023 •

edited by YarShev

anmyachev left a comment

anmyachev Aug 10, 2023

AndreyPavlenko Aug 10, 2023

anmyachev Aug 10, 2023

dchigarev Aug 18, 2023

dchigarev Aug 18, 2023

AndreyPavlenko Aug 22, 2023

anmyachev commented Aug 22, 2023 •

edited

dchigarev commented Aug 22, 2023

Garra1980 commented Aug 22, 2023

Garra1980 commented Aug 24, 2023
