FEAT-#7004: use generators when returning from _deploy_ray_func remote function. #7005

arunjose696 · 2024-03-04T17:19:28Z

What do these changes do?

Use yield in _deploy_ray_func so that it returns a generator instead of a list.

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves #7004 (Have to use generators when returning from _deploy_ray_func remote function)
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date

dchigarev · 2024-03-05T13:13:48Z

modin/core/execution/ray/implementations/pandas_on_ray/partitioning/virtual_partition.py

+        for r in result:
+            for item in [r, len(r), len(r.columns), ip]:
+                yield item


I think that the tip to use generators only helps in cases where each iteration of the yielding for-loop actually creates some objects/allocates new memory. In this for loop we already have all the objects computed (dfs in result) and the memory for all the results was already allocated.

Can we instead use generators in split_result_of_axis_func_pandas function that actually creates a list of resulting dataframes?

It seems we should use generators here and in split_result_of_axis_func_pandas too.
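The distinction discussed here can be sketched without pandas. This is a minimal illustration only: plain lists stand in for dataframes, and `split_eager`, `split_lazy` and `yield_precomputed` are hypothetical names, not Modin functions. It shows why a generator helps when each iteration creates a new object, and why wrapping already-computed objects in a generator changes little.

```python
from typing import Iterator, List


def split_eager(data: List[int], num_splits: int) -> List[List[int]]:
    """Build every chunk up front; all chunks are alive at the same time."""
    size = -(-len(data) // num_splits)  # ceiling division
    return [data[i * size : (i + 1) * size] for i in range(num_splits)]


def split_lazy(data: List[int], num_splits: int) -> Iterator[List[int]]:
    """Create one chunk per iteration; a chunk is allocated only when requested."""
    size = -(-len(data) // num_splits)  # ceiling division
    for i in range(num_splits):
        yield data[i * size : (i + 1) * size]


def yield_precomputed(chunks: List[List[int]]) -> Iterator[List[int]]:
    """Yield objects that already exist; the input list still holds all of them."""
    for chunk in chunks:
        yield chunk


data = list(range(10))
eager_chunks = split_eager(data, 3)
lazy_chunks = list(split_lazy(data, 3))
```

Both functions produce the same chunks; the difference is only in when each chunk is created, which is the point raised about where the generator should live.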

YarShev · 2024-03-06T19:04:55Z

modin/core/storage_formats/pandas/utils.py

@@ -80,7 +80,8 @@ def split_result_of_axis_func_pandas(axis, num_splits, result, length_list=None)
        Splitted dataframe represented by list of frames.
    """
    if num_splits == 1:
-        return [result]
+        yield result
+        return


Why return?

To exit the function without executing the rest of the code. Otherwise I would have to put the rest of the code in an else block.

An else branch is fine with me.
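The two options discussed in this thread behave the same way. The sketch below is an illustration under assumptions: lists stand in for dataframes, and `split_with_return` / `split_with_else` are hypothetical names rather than the actual Modin code.

```python
from typing import Iterator, List


def split_with_return(result: List[int], num_splits: int) -> Iterator[List[int]]:
    """Early exit: a bare return ends the generator after the single yield."""
    if num_splits == 1:
        yield result
        return  # stops iteration; nothing below runs
    size = -(-len(result) // num_splits)  # ceiling division
    for i in range(num_splits):
        yield result[i * size : (i + 1) * size]


def split_with_else(result: List[int], num_splits: int) -> Iterator[List[int]]:
    """Same behavior, with the remaining code moved into an else branch."""
    if num_splits == 1:
        yield result
    else:
        size = -(-len(result) // num_splits)  # ceiling division
        for i in range(num_splits):
            yield result[i * size : (i + 1) * size]
```

Choosing between the two is a matter of style; the else form avoids a bare return inside a generator, which some readers find surprising.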

arunjose696 · 2024-03-06T18:33:06Z

modin/core/execution/ray/implementations/pandas_on_ray/partitioning/virtual_partition.py

    ip = get_node_ip_address()
-    if isinstance(result, pandas.DataFrame):
-        return result, len(result), len(result.columns), ip
-    elif all(isinstance(r, pandas.DataFrame) for r in result):


Here we check whether all parts of result are dataframes. In which scenarios would the result be heterogeneous (composed of dataframes and non-dataframes)?
One possibility I can think of is that the results could contain errors. In that scenario I think it would be sufficient to send [r, None, None, ip] for the errors and [r, len(r), len(r.columns), ip] for the results that are dataframes. Would this suffice?

I think this would suffice.
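The proposal can be sketched as follows. This is only an illustration under assumptions: `FakeFrame` is a hypothetical stand-in for pandas.DataFrame (so the sketch runs without pandas), and `yield_with_metadata` is an invented name, not the real `_deploy_ray_func`.

```python
from typing import Any, Iterable, Iterator, List


class FakeFrame:
    """Hypothetical stand-in for pandas.DataFrame with only what the sketch needs."""

    def __init__(self, rows: List[dict]):
        self.rows = rows
        self.columns = list(rows[0].keys()) if rows else []

    def __len__(self) -> int:
        return len(self.rows)


def yield_with_metadata(results: Iterable[Any], ip: str) -> Iterator[Any]:
    """Yield each result followed by its row count, column count and node IP."""
    for r in results:
        if isinstance(r, FakeFrame):
            items = [r, len(r), len(r.columns), ip]
        else:
            # Non-frame results (for example errors) carry no shape metadata.
            items = [r, None, None, ip]
        yield from items
```

With this shape every result always contributes exactly four items, so the caller can unpack the stream in fixed-size groups regardless of whether a given result is a frame.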

YarShev · 2024-03-07T15:35:05Z

@arunjose696, once you fix all CI jobs, please convert this PR to ready for review and put some time and memory measurements in the PR description.

arunjose696 · 2024-03-13T08:04:00Z

@arunjose696, once you fix all CI jobs, please convert this PR to ready for review and put some time and memory measurements in the PR description.

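| Dataset    | Memory (current) | Memory (master) | Time (current) | Time (master) |
|------------|------------------|-----------------|----------------|---------------|
| Physionet  | 20.6             | 20.7            | 40.3 sec       | 41 sec        |
| ny_taxi_ml | 108              | 111             | 12 min         | 12.3 min      |
| ny_taxi    | 55.4             | 53.2            | 25.9 sec       | 26.7 sec      |
| Census     | 25.9             | 26.3            | 1.12 min       | 1.18 min      |
| Exfo       | 81               | 82.2            | 3.77 min       | 3.81 min      |
| Fraud      | 24.8             | 24.7            | 1.56 min       | 1.61 min      |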

modin/core/storage_formats/pandas/utils.py

dchigarev

LGTM

YarShev · 2024-03-13T08:48:52Z

@AndreyPavlenko, since you are also going to use generators in remote functions for virtual partitions in #6991, can you look at the changes in this PR? How do they affect your changes and should we merge them?

YarShev · 2024-03-13T12:40:33Z

modin/core/dataframe/pandas/partitioning/axis_partition.py

@@ -510,7 +526,18 @@ def deploy_func_between_two_axis_partitions(
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", category=FutureWarning)
            result = func(lt_frame, rt_frame, *f_args, **f_kwargs)
-        return split_result_of_axis_func_pandas(axis, num_splits, result)
+        if return_generator:
+            return generate_result_of_axis_func_pandas(


Suggested change

-            return generate_result_of_axis_func_pandas(
+            yield from generate_result_of_axis_func_pandas(

?

This wouldn't work, because using yield anywhere in a function turns it into a generator function.

Some branches of the if need to return lists rather than generators. For backends such as Dask we try to return a list of partitions, but because the function contains a yield statement, a generator would still be returned, and the partitions would be empty when materialized.

https://stackoverflow.com/questions/26595895/return-and-yield-in-the-same-function

I mean not just yield but yield from. Would it work?

I checked with yield from as well. The function still returns a generator when called, so the code fails for Dask.
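The behavior described in this exchange is standard Python and can be reproduced in a few lines. The names below (`make_chunks`, `mixed`, `dispatch`) are hypothetical and lists stand in for dataframes; the sketch only shows why a single function cannot return a list on one branch and yield on another.

```python
import inspect
from typing import Iterator, List


def make_chunks(result: List[int]) -> Iterator[List[int]]:
    """A dedicated generator function that yields two pieces."""
    yield result[:2]
    yield result[2:]


def mixed(result: List[int], return_generator: bool):
    """Contains `yield from`, so calling it ALWAYS produces a generator."""
    if return_generator:
        yield from make_chunks(result)
    # In a generator, this value is attached to StopIteration and never yielded.
    return [result[:2], result[2:]]


def dispatch(result: List[int], return_generator: bool):
    """No yield here: each branch returns exactly the object it names."""
    if return_generator:
        return make_chunks(result)
    return [result[:2], result[2:]]


empty = list(mixed([1, 2, 3], return_generator=False))
as_list = dispatch([1, 2, 3], return_generator=False)
```

Here `empty` is an empty list, which matches the "partitions would be empty when materialized" symptom, while `dispatch` keeps the list branch intact by delegating the yielding to a separate generator function.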

YarShev · 2024-03-13T12:41:00Z

CI is failing

AndreyPavlenko · 2024-03-13T14:17:09Z

@AndreyPavlenko, since you are also going to use generators in remote functions for virtual partitions in #6991, can you look at the changes in this PR? How do they affect your changes and should we merge them?

In #6991 _deploy_ray_func() is not used; a different approach is taken there. The virtual partition's apply() functions do not split the resulting frame but return a list of lazy partitions instead. Each partition has a deferred function that fetches the required piece of the df, i.e. partition0 receives df.iloc[0:10], partition1 receives df.iloc[10:20], and so on. These functions are executed lazily. This avoids splitting the entire frame when only a subset of it is required by subsequent operations.
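The idea of deferred per-partition slicing can be sketched roughly as below. This is not the code from #6991: `LazyPartition` and `lazy_split` are hypothetical names, and a plain list with slicing stands in for a dataframe and `df.iloc`.

```python
from typing import Callable, List, Optional


class LazyPartition:
    """Holds a deferred function and runs it only on first access."""

    def __init__(self, get_piece: Callable[[], list]):
        self._get_piece = get_piece
        self._data: Optional[list] = None
        self.materialized = False

    def get(self) -> list:
        if not self.materialized:
            self._data = self._get_piece()
            self.materialized = True
        return self._data


def lazy_split(frame: list, chunk: int) -> List[LazyPartition]:
    """Return partitions that each know how to slice their own piece later."""
    # `start=start` binds the offset per closure instead of sharing the loop variable.
    return [
        LazyPartition(lambda start=start: frame[start : start + chunk])
        for start in range(0, len(frame), chunk)
    ]


parts = lazy_split(list(range(30)), 10)
second = parts[1].get()  # only this slice is computed
```

After this runs, only `parts[1]` has been materialized; the other partitions have done no slicing work, which mirrors the benefit described when a later operation needs just a subset of the frame.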

anmyachev · 2024-03-27T14:47:50Z

Using generators to reduce heap memory usage in remote functions.

@YarShev @arunjose696 In what minimal version of Ray did this feature appear? It seems we have implicitly increased it.

UPD: Generators are supported starting from ray 2.1.0: https://github.com/ray-project/ray/releases/tag/ray-2.1.0

YarShev · 2024-03-27T16:34:52Z

@anmyachev, oh, good catch! It does seem that generators are supported starting from ray 2.1.0. We have been using generators since we introduced lazy execution for block partitions, so it looks like we would have to change a lot of code. @AndreyPavlenko, how much effort do you think it would take to add a check on the ray version that decides whether to use generators or not? Or should we just raise the minimum supported Ray version?
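One of the two options raised here, a version check, could look roughly like the sketch below. It is an assumption-laden illustration: `parse_version` and `can_use_generators` are hypothetical helpers, the version string would in practice come from the installed ray package, and the 2.1.0 threshold is taken from the release note linked above.

```python
from typing import Tuple

# Threshold taken from the ray-2.1.0 release note cited in this thread.
MIN_GENERATOR_VERSION = (2, 1, 0)


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse 'X.Y.Z' (ignoring suffixes such as 'rc1') into a 3-tuple of ints."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:  # pad short versions such as '2.1'
        parts.append(0)
    return tuple(parts)


def can_use_generators(ray_version: str) -> bool:
    """Hypothetical gate: True when the given version is at least 2.1.0."""
    return parse_version(ray_version) >= MIN_GENERATOR_VERSION
```

A gate like this would have to be consulted at every place where a remote function yields, which is the "a lot of code" concern; raising the minimum supported version avoids that branching entirely.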

arunjose696 requested review from devin-petersohn, mvashishtha, RehanSD, YarShev, vnlitvinov, anmyachev, dchigarev and a team as code owners March 4, 2024 17:19

dchigarev reviewed Mar 5, 2024

arunjose696 marked this pull request as draft March 6, 2024 18:52

YarShev reviewed Mar 6, 2024

arunjose696 marked this pull request as ready for review March 7, 2024 08:11

arunjose696 marked this pull request as draft March 7, 2024 08:13

arunjose696 commented Mar 7, 2024

arunjose696 force-pushed the generator branch from 9c50985 to 74e8a13 Compare March 7, 2024 12:15

arunjose696 added 3 commits March 11, 2024 03:02

FEAT-modin-project#7004: use generators when returning from _deploy_r…

0abe5db

…ay_func remote function. Signed-off-by: arunjose696 <arunjose696@gmail.com>

adding generators to split_result_of_axis_func_pandas

e7c3365

pr comments

5fb4465

Signed-off-by: arunjose696 <arunjose696@gmail.com>

arunjose696 force-pushed the generator branch from 74e8a13 to b2bb5ed Compare March 11, 2024 08:26

calling split_result_of_axis_func_pandas only for ray

004e6e6

Signed-off-by: arunjose696 <arunjose696@gmail.com>

arunjose696 force-pushed the generator branch from b2bb5ed to 004e6e6 Compare March 11, 2024 08:56

arunjose696 marked this pull request as ready for review March 11, 2024 09:42

dchigarev reviewed Mar 13, 2024

dchigarev previously approved these changes Mar 13, 2024

arunjose696 dismissed dchigarev’s stale review via 9cb5dc8 March 13, 2024 12:26

dchigarev previously approved these changes Mar 13, 2024

github-advanced-security bot found potential problems Mar 13, 2024

YarShev reviewed Mar 13, 2024

PR comments

585a403

Signed-off-by: arunjose696 <arunjose696@gmail.com>

arunjose696 dismissed dchigarev’s stale review via 585a403 March 13, 2024 12:51

arunjose696 force-pushed the generator branch from 9cb5dc8 to 585a403 Compare March 13, 2024 12:51

YarShev approved these changes Mar 13, 2024

YarShev merged commit eb740b9 into modin-project:master Mar 13, 2024
45 checks passed

anmyachev mentioned this pull request Mar 27, 2024

Update minimal supported version of ray up to 2.1.0 #7128

Closed
