
FEAT-#6929: Implement Series.case_when in a distributed way #6972

Merged: 12 commits from issue-6929 into modin-project:master on Apr 8, 2024

Conversation

AndreyPavlenko
Collaborator

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Implement Series.case_when in a distributed way #6929
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

Comment on lines +4601 to +4757

    how="left",
    sort=False,
    fill_value=fill_value,

Check notice (Code scanning / CodeQL): Cyclic import. Import of module modin.core.dataframe.pandas.interchange.dataframe_protocol.dataframe begins an import cycle.
AndreyPavlenko force-pushed the issue-6929 branch 2 times, most recently from 242d2ff to b7b9587 on March 6, 2024 00:04
@AndreyPavlenko
Collaborator Author

> I still don't like the idea of a method being responsible for caching itself. We should either have a common mechanism that does this automatically or not have this logic at all. Does the caching give any significant improvement? Can we remove it?

I haven't benchmarked it, but it should be more efficient because, if not cached, the same function is serialized for each partition on every call.

@dchigarev
Collaborator

dchigarev commented Mar 20, 2024

> but it should be more efficient

There's no doubt that it's more efficient, but my point is that, in reality, nobody runs .case_when() a thousand times in a row. Even if someone does, the caching would save them only ~100-200 ms. That might be worth it if implemented properly (e.g., automatic caching in RayWrapper._funcs_cache), but I personally can't justify the current approach of a method manually caching itself in a dataframe attribute.
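The "automatic caching at the engine-wrapper level" idea can be sketched as follows. This is illustrative code, not Modin's actual implementation; the class and attribute names are stand-ins:

```python
# Illustrative sketch of engine-level automatic caching: the engine
# wrapper memoizes serialized kernels, so no individual dataframe
# method has to manage caching itself.

class EngineWrapper:
    _funcs_cache = {}  # kernel -> engine handle (illustrative)
    put_calls = 0      # counts real "serialize and store" operations

    @classmethod
    def put(cls, func):
        # Stand-in for e.g. ray.put(func). A real implementation must
        # also think about cache eviction and key lifetime.
        if func not in cls._funcs_cache:
            cls.put_calls += 1
            cls._funcs_cache[func] = func
        return cls._funcs_cache[func]

def kernel(x):
    return x + 1

h1 = EngineWrapper.put(kernel)
h2 = EngineWrapper.put(kernel)  # cache hit: no second serialization
```

The point of the design is that callers always go through `put`, so repeated submissions of the same kernel pay the serialization cost once.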

> It seems that this does not work for Dask. If the function is cached, the tests randomly fail with CancelledError.

This should actually be considered a huge red flag, pointing out that there could be other unexpected side effects of this caching implementation with other engines.

> if not cached, the same function is serialized multiple times for each partition

Let's call PartitionManager.preprocess_func() at the beginning of PandasDataframe.case_when(); this is the common way other dataframe methods 'cache' a kernel so that each partition reuses it.
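The "preprocess once, reuse per partition" pattern suggested above can be sketched like this. The names are illustrative stand-ins, not Modin's real API:

```python
# Illustrative sketch of the pattern behind PartitionManager.preprocess_func:
# serialize/store the kernel once up front, then reuse the resulting
# handle for every partition instead of re-serializing per partition.

class FakePartitionManager:
    serializations = 0  # counts kernel serializations

    @classmethod
    def preprocess_func(cls, func):
        # Stand-in for putting the kernel into the engine's object
        # store once; every partition task then reuses one handle.
        cls.serializations += 1
        return func

def case_when_kernel(part):
    # A hypothetical per-partition kernel.
    return [x * 2 for x in part]

def apply_to_partitions(partitions, kernel):
    prepared = FakePartitionManager.preprocess_func(kernel)  # once, up front
    return [prepared(p) for p in partitions]                 # reused per partition

result = apply_to_partitions([[1, 2], [3, 4]], case_when_kernel)
```

Because preprocessing happens at the start of the frame-level method, no per-method caching state needs to live on the dataframe itself.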

@YarShev
Collaborator

YarShev commented Apr 3, 2024

@AndreyPavlenko, what else needs to be done in this PR? Please also resolve the conflicts.

YarShev
YarShev previously approved these changes Apr 8, 2024
@YarShev
Collaborator

YarShev commented Apr 8, 2024

@anmyachev, @dchigarev, any comments?

anmyachev
anmyachev previously approved these changes Apr 8, 2024
@YarShev YarShev dismissed stale reviews from anmyachev and themself via 0707013 April 8, 2024 14:04
@YarShev YarShev merged commit a6a8399 into modin-project:master Apr 8, 2024
36 checks passed