
FEAT-#6929: Implement Series.case_when in a distributed way #6972

Merged: 12 commits from issue-6929 into modin-project:master on Apr 8, 2024

Conversation

AndreyPavlenko
Collaborator

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Implement Series.case_when in a distributed way #6929
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

Comment on lines +4601 to +4757

    how="left",
    sort=False,
    fill_value=fill_value,

Check notice (Code scanning / CodeQL): Cyclic import. Import of module modin.core.dataframe.pandas.interchange.dataframe_protocol.dataframe begins an import cycle.
AndreyPavlenko force-pushed the issue-6929 branch 2 times, most recently from 242d2ff to b7b9587 on March 6, 2024 00:04
@AndreyPavlenko
Collaborator Author

> I still don't like the idea of a method being responsible for caching itself. We should either have a common mechanism that does this automatically or not have this logic at all. Does the caching give any significant improvement? Can we remove it?

I haven't benchmarked it, but it should be more efficient because, if not cached, the same function is serialized for each partition on every call.

@dchigarev
Collaborator

dchigarev commented Mar 20, 2024

> but it should be more efficient

There's no doubt that it's more efficient, but my point is that, in reality, nobody runs .case_when() a thousand times in a row. Even if someone does, the caching would save them only ~100-200 ms. That might be worth it if implemented properly (e.g., automatic caching in RayWrapper._funcs_cache), but I personally can't justify the current approach of a method manually caching itself in a dataframe attribute.
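The "automatic caching at the engine-wrapper level" idea can be sketched as follows. This is illustrative code, not Modin's actual implementation; the class and attribute names are stand-ins:

```python
# Illustrative sketch of engine-level automatic caching: the engine
# wrapper memoizes serialized kernels, so no individual dataframe
# method has to manage caching itself.

class EngineWrapper:
    _funcs_cache = {}  # kernel -> engine handle (illustrative)
    put_calls = 0      # counts real "serialize and store" operations

    @classmethod
    def put(cls, func):
        # Stand-in for e.g. ray.put(func). A real implementation must
        # also think about cache eviction and key lifetime.
        if func not in cls._funcs_cache:
            cls.put_calls += 1
            cls._funcs_cache[func] = func
        return cls._funcs_cache[func]

def kernel(x):
    return x + 1

h1 = EngineWrapper.put(kernel)
h2 = EngineWrapper.put(kernel)  # cache hit: no second serialization
```

The point of the design is that callers always go through `put`, so repeated submissions of the same kernel pay the serialization cost once.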

> It seems that this does not work for Dask. If the function is cached, the tests randomly fail with CancelledError.

This should actually be considered a huge red flag, pointing out that there could be other unexpected side effects of this caching implementation with other engines.

> if not cached, the same function is serialized multiple times for each partition

Let's call PartitionManager.preprocess_func() at the beginning of PandasDataframe.case_when(); this is the common way other dataframe methods 'cache' a kernel so that each partition reuses it.
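The "preprocess once, reuse per partition" pattern suggested above can be sketched like this. The names are illustrative stand-ins, not Modin's real API:

```python
# Illustrative sketch of the pattern behind PartitionManager.preprocess_func:
# serialize/store the kernel once up front, then reuse the resulting
# handle for every partition instead of re-serializing per partition.

class FakePartitionManager:
    serializations = 0  # counts kernel serializations

    @classmethod
    def preprocess_func(cls, func):
        # Stand-in for putting the kernel into the engine's object
        # store once; every partition task then reuses one handle.
        cls.serializations += 1
        return func

def case_when_kernel(part):
    # A hypothetical per-partition kernel.
    return [x * 2 for x in part]

def apply_to_partitions(partitions, kernel):
    prepared = FakePartitionManager.preprocess_func(kernel)  # once, up front
    return [prepared(p) for p in partitions]                 # reused per partition

result = apply_to_partitions([[1, 2], [3, 4]], case_when_kernel)
```

Because preprocessing happens at the start of the frame-level method, no per-method caching state needs to live on the dataframe itself.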

@YarShev
Collaborator

YarShev commented Apr 3, 2024

@AndreyPavlenko, what else needs to be done in this PR? Please also resolve the conflicts.

YarShev
YarShev previously approved these changes Apr 8, 2024
@YarShev
Collaborator

YarShev commented Apr 8, 2024

@anmyachev, @dchigarev, any comments?

anmyachev
anmyachev previously approved these changes Apr 8, 2024
@YarShev YarShev dismissed stale reviews from anmyachev and themself via 0707013 April 8, 2024 14:04
@YarShev YarShev merged commit a6a8399 into modin-project:master Apr 8, 2024
36 checks passed