
PERF-#6609: HDK: to_pandas(): Cache pandas DataFrame #6610

Closed

Conversation

AndreyPavlenko
Collaborator

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves HDK: to_pandas(): Cache pandas DataFrame  #6609
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

Signed-off-by: Andrey Pavlenko <andrey.a.pavlenko@gmail.com>
@AndreyPavlenko AndreyPavlenko marked this pull request as ready for review September 27, 2023 15:58
@AndreyPavlenko AndreyPavlenko requested review from a team as code owners September 27, 2023 15:58
@dchigarev
Collaborator

Adding a cache to this method could significantly improve the performance of the methods that default to pandas.

but it also doubles the memory consumption, doesn't it?

@@ -2670,6 +2673,7 @@ def to_pandas(self):
# restrictions on column names.
df.columns = self.columns

self._pandas_df = df
Collaborator

I'm worried about memory consumption; maybe we should use a weakref here instead?

Collaborator Author

I doubt weakref is a good solution here. Most probably, the weakly referenced frame will never be reused and will always be garbage collected.
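
For illustration, here is a minimal sketch (with hypothetical names, not the actual Modin code) of what a weakref-based cache would look like; once the caller drops its reference, the weak reference dies and the next to_pandas() call has to convert again:

import weakref

import pandas

class HdkFrameSketch:
    def __init__(self):
        self._pandas_ref = None  # weak reference to the last materialized pandas frame

    def to_pandas(self):
        cached = self._pandas_ref() if self._pandas_ref is not None else None
        if cached is not None:
            return cached  # cache hit: the caller still holds the previous frame alive
        df = pandas.DataFrame({"a": range(10)})  # stands in for the real HDK -> pandas conversion
        self._pandas_ref = weakref.ref(df)
        return df

frame = HdkFrameSketch()
pdf = frame.to_pandas()
del pdf  # the only strong reference is gone
assert frame._pandas_ref() is None  # on CPython, refcounting has already collected the cached frame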

@AndreyPavlenko
Collaborator Author

but it also doubles the memory consumption, doesn't it?

I think it depends on the dataset. Some data could be shared between HDK, Arrow and pandas.
On the other hand, if we do not cache the pandas frame, we may end up with multiple copies of identical frames in memory.

Here is a simple test demonstrating the memory usage:

import psutil
import modin.pandas as pd

df = pd.DataFrame({"a": range(100000000)})
df = df.dropna() # Ensure the table is imported to HDK
mem0 = psutil.virtual_memory().used
print(f"{mem0}")
pdf1 = df._to_pandas()
mem1 = psutil.virtual_memory().used
print(f"{mem1}: + {mem1 - mem0}")
pdf2 = df._to_pandas()
mem2 = psutil.virtual_memory().used
print(f"{mem2}: + {mem2 - mem1}")
pdf3 = df._to_pandas()
mem3 = psutil.virtual_memory().used
print(f"{mem3}: + {mem3 - mem2}")

Output on the master branch:

10649378816
14785376256: + 4135997440
16392810496: + 1607434240
17996320768: + 1603510272

Output on this branch:

10598752256
14746906624: + 4148154368
14746906624: + 0
14746906624: + 0

@@ -2670,6 +2673,7 @@ def to_pandas(self):
# restrictions on column names.
df.columns = self.columns

self._pandas_df = df
Collaborator

What about in-place operations executed on the returned dataframe? Wouldn't such operations affect the stored object?

Collaborator Author

Good catch! You are right, the stored object will be affected.
A non-deep (shallow) copy should be returned here. It will share the data with the main frame but will not change it.
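
A rough sketch of that suggestion (with a hypothetical helper name, not the actual patch): the cache keeps the materialized frame, and callers get a shallow copy whose axes and metadata are independent of the cached object:

def to_pandas(self):
    if self._pandas_df is None:
        self._pandas_df = self._materialize_pandas()  # hypothetical helper for the HDK -> pandas conversion
    # copy(deep=False) copies the axes and block metadata but shares the data buffers,
    # so renaming or dropping columns on the result does not touch the cached frame.
    return self._pandas_df.copy(deep=False)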

Collaborator

Even after the copy, in-place operations still mutate unwanted frames:

import modin.pandas as pd

def setitem(df, i, val):
    df.iloc[i, 0] = val
    return df

df = pd.DataFrame({"a": [1, 2, 3, 4, 5]})

res1 = df._default_to_pandas(lambda df: setitem(df, 0, 10))
res2 = df._default_to_pandas(lambda df: setitem(df, 1, 100))

print(df)
#      a
# 0   10
# 1  100
# 2    3
# 3    4
# 4    5

print(res1)
#     a
# 0  10
# 1   2
# 2   3
# 3   4
# 4   5

print(res2)
#      a
# 0   10
# 1  100
# 2    3
# 3    4
# 4    5

Does this mean that we should copy pandas_df any time someone requests that field, or restrict in-place operations? (I'm mostly for the second option.)

Collaborator Author

You are right, we need to either restrict in-place operations or create a deep copy when an in-place operation is performed.
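
A small plain-pandas illustration of the trade-off (assuming copy-on-write is disabled, which is the default in pandas 1.x/2.x): a shallow copy shares the data buffers, so in-place writes leak into the cached frame, while a deep copy does not:

import pandas as pd

cached = pd.DataFrame({"a": [1, 2, 3]})

shallow = cached.copy(deep=False)
shallow.iloc[0, 0] = 10       # writes into the shared buffer
print(cached.loc[0, "a"])     # 10 -> the cached frame is mutated

deep = cached.copy(deep=True)
deep.iloc[1, 0] = 100         # owns its own buffer
print(cached.loc[1, "a"])     # 2 -> the cached frame is untouched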

@dchigarev
Collaborator

but it also doubles the memory consumption, doesn't it?

I think it depends on the dataset. Some data could be shared between HDK, Arrow and pandas. On the other hand, if we do not cache the pandas frame, we may end up with multiple copies of identical frames in memory.

Here is a simple test demonstrating the memory usage:

But note that in real life we don't usually keep references to pandas dfs once the default-to-pandas operation is done, so to make this scenario more realistic we should delete the pdf after each measurement:

import psutil
import modin.pandas as pd

df = pd.DataFrame({"a": range(100000000)})
df = df.dropna() # Ensure the table is imported to HDK
mem0 = psutil.virtual_memory().used
print(f"{mem0}")
pdf1 = df._to_pandas()
mem1 = psutil.virtual_memory().used
print(f"{mem1}: + {mem1 - mem0}")
del pdf1

pdf2 = df._to_pandas()
mem2 = psutil.virtual_memory().used
print(f"{mem2}: + {mem2 - mem1}")
del pdf2

pdf3 = df._to_pandas()
mem3 = psutil.virtual_memory().used
print(f"{mem3}: + {mem3 - mem2}")
del pdf3

Then on master I get:

8689057792
12090052608: + 3400994816
12083097600: + -6955008
11347464192: + -735633408
(the memory consumption decreases over the calls?)

And for your branch it's:

8684437504
12864876544: + 4180439040
12864876544: + 0
12864876544: + 0

@dchigarev
Collaborator

but it also doubles the memory consumption, doesn't it?

I think it depends on the dataset. Some data could be shared between HDK, Arrow and pandas.

Well, if an Arrow table with certain data can be converted to pandas by simply sharing its buffers, then shouldn't such a conversion be almost free? Do you know which column data types can be converted that easily?
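
For illustration (an assumption about how one could check this, not something from the PR), pyarrow can answer that question directly: primitive numeric arrays without nulls can be handed over zero-copy, while strings and nullable data require a copy:

import pyarrow as pa

ints = pa.array(range(10), type=pa.int64())
print(ints.to_numpy(zero_copy_only=True))   # works: the NumPy array wraps the Arrow buffer

strings = pa.array(["a", "b", "c"])
try:
    strings.to_numpy(zero_copy_only=True)
except pa.ArrowInvalid as exc:
    print("copy required:", exc)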

@AndreyPavlenko
Collaborator Author

But note that in real life we don't usually keep references to pandas dfs once the default-to-pandas operation is done

It depends ... For example, in the case of unsupported data, the pandas df will be saved in the partitions of the new HDK frame.
Also, in this implementation, if an HDK frame is created from a pandas frame, the pandas frame is always saved in the partitions. It is converted to Arrow lazily, only when exporting to HDK.
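
A hedged sketch of that lazy-conversion idea (hypothetical class, not the actual Modin code): the partition keeps the original pandas frame and only builds an Arrow table the first time the data is exported to HDK:

import pandas
import pyarrow

class LazyArrowPartition:
    """Keeps the original pandas frame; converts to Arrow only on first export."""

    def __init__(self, pandas_df: pandas.DataFrame):
        self._pandas_df = pandas_df
        self._arrow_table = None

    def to_arrow(self) -> pyarrow.Table:
        if self._arrow_table is None:
            # Conversion happens lazily, only when the data is actually exported to HDK.
            self._arrow_table = pyarrow.Table.from_pandas(self._pandas_df)
        return self._arrow_table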

@dchigarev
Collaborator

For example, in the case of unsupported data, the pandas df will be saved in the partitions of the new HDK frame.

Right, this is done so we wouldn't do unnecessary .to_pandas() conversions, since we know in advance that all operations will default to pandas. However, how is this related to the changes in this PR? How could caching the pandas df in normal HDK frames help in this case?

Also, in the #6412 implementation, if an HDK frame is created from a pandas frame, the pandas frame is always saved in the partitions. It is converted to Arrow lazily, only when exporting to HDK.

This optimization is quite good, but again, how is this related to this PR?

@AndreyPavlenko
Collaborator Author

how is this related to this PR?

Not related. These are just a few examples of when we do keep references to pandas dfs.

@dchigarev
Collaborator

dchigarev commented Sep 28, 2023

how is this related to this PR?

Not related. These are just a few examples of when we do keep references to pandas dfs.

I understand that, but in those examples the pandas dfs originate not from the .to_pandas() call but from the user, right? My original question was about keeping in memory the results of .to_pandas() after a default-to-pandas function is done.

@AndreyPavlenko
Collaborator Author

AndreyPavlenko commented Sep 28, 2023

but in those examples the pandas dfs originate not from the .to_pandas() call but from the user, right?

Not necessarily. The frame returned by to_pandas() is used to build a new Modin frame.
Here is an example:

import psutil
import pandas as pd
# import modin.pandas as pd

df = pd.DataFrame(range(1000000), columns=pd.MultiIndex.from_tuples([(1,2,3)]))
mem0 = psutil.virtual_memory().used
df2 = df.iloc[:-1]
mem1 = psutil.virtual_memory().used
print(f"{mem1}: + {mem1 - mem0}")

The pandas iloc returns a new frame that shares the data with the original one; no new memory for the data is allocated. In Modin HDK, this iloc results in 4 calls to to_pandas(), i.e., we create 4 new pandas frames, 3 of which are garbage collected, while the last one is saved in the partitions of the new HDK frame.

@anmyachev
Collaborator

#7234

@anmyachev anmyachev closed this May 5, 2024