
PERF-#6609: HDK: to_pandas(): Cache pandas DataFrame #6610

Closed

Conversation

AndreyPavlenko
Collaborator

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves HDK: to_pandas(): Cache pandas DataFrame  #6609
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

Signed-off-by: Andrey Pavlenko <andrey.a.pavlenko@gmail.com>
@AndreyPavlenko AndreyPavlenko marked this pull request as ready for review September 27, 2023 15:58
@AndreyPavlenko AndreyPavlenko requested review from a team as code owners September 27, 2023 15:58
@dchigarev
Collaborator

Adding a cache to this method could significantly improve the performance of the methods that default to pandas.

but it also doubles the memory consumption, doesn't it?

@@ -2670,6 +2673,7 @@ def to_pandas(self):
# restrictions on column names.
df.columns = self.columns

self._pandas_df = df
Collaborator

I'm worried about memory consumption; maybe we should use a weakref here instead?

Collaborator Author

I doubt weakref is a good solution here. Most probably, the weakly referenced frame will never be reused and will always be garbage collected.
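
For illustration, here is a minimal sketch (with hypothetical names, not the actual Modin code) of what a weakref-based cache would look like; once the caller drops its reference, the weak reference dies and the next to_pandas() call has to convert again:

import weakref

import pandas

class HdkFrameSketch:
    def __init__(self):
        self._pandas_ref = None  # weak reference to the last materialized pandas frame

    def to_pandas(self):
        cached = self._pandas_ref() if self._pandas_ref is not None else None
        if cached is not None:
            return cached  # cache hit: the caller still holds the previous frame alive
        df = pandas.DataFrame({"a": range(10)})  # stands in for the real HDK -> pandas conversion
        self._pandas_ref = weakref.ref(df)
        return df

frame = HdkFrameSketch()
pdf = frame.to_pandas()
del pdf  # the only strong reference is gone
assert frame._pandas_ref() is None  # on CPython, refcounting has already collected the cached frame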

@AndreyPavlenko
Collaborator Author

but it also doubles the memory consumption, doesn't it?

I think it depends on the dataset. Some data could be shared between HDK, Arrow and pandas.
On the other hand, if we do not cache the pandas frame, we may end up with multiple copies of identical frames in memory.

Here is a simple test demonstrating the memory usage:

import psutil
import modin.pandas as pd

df = pd.DataFrame({"a": range(100000000)})
df = df.dropna() # Ensure the table is imported to HDK
mem0 = psutil.virtual_memory().used
print(f"{mem0}")
pdf1 = df._to_pandas()
mem1 = psutil.virtual_memory().used
print(f"{mem1}: + {mem1 - mem0}")
pdf2 = df._to_pandas()
mem2 = psutil.virtual_memory().used
print(f"{mem2}: + {mem2 - mem1}")
pdf3 = df._to_pandas()
mem3 = psutil.virtual_memory().used
print(f"{mem3}: + {mem3 - mem2}")

Output on the master branch:

10649378816
14785376256: + 4135997440
16392810496: + 1607434240
17996320768: + 1603510272

Output on this branch:

10598752256
14746906624: + 4148154368
14746906624: + 0
14746906624: + 0

@@ -2670,6 +2673,7 @@ def to_pandas(self):
# restrictions on column names.
df.columns = self.columns

self._pandas_df = df
Collaborator

What about in-place operations executed on the returned dataframe? Wouldn't such operations affect the stored object?

Collaborator Author

Good catch! You are right, the stored object will be affected.
A non-deep (shallow) copy should be returned here. It will share the data with the main frame but will not change it.
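
A rough sketch of that suggestion (with a hypothetical helper name, not the actual patch): the cache keeps the materialized frame, and callers get a shallow copy whose axes and metadata are independent of the cached object:

def to_pandas(self):
    if self._pandas_df is None:
        self._pandas_df = self._materialize_pandas()  # hypothetical helper for the HDK -> pandas conversion
    # copy(deep=False) copies the axes and block metadata but shares the data buffers,
    # so renaming or dropping columns on the result does not touch the cached frame.
    return self._pandas_df.copy(deep=False)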

Collaborator

Even after the copy, in-place operations still mutate unwanted frames:

import modin.pandas as pd

def setitem(df, i, val):
    df.iloc[i, 0] = val
    return df

df = pd.DataFrame({"a": [1, 2, 3, 4, 5]})

res1 = df._default_to_pandas(lambda df: setitem(df, 0, 10))
res2 = df._default_to_pandas(lambda df: setitem(df, 1, 100))

print(df)
#      a
# 0   10
# 1  100
# 2    3
# 3    4
# 4    5

print(res1)
#     a
# 0  10
# 1   2
# 2   3
# 3   4
# 4   5

print(res2)
#      a
# 0   10
# 1  100
# 2    3
# 3    4
# 4    5

Does this mean that we should copy pandas_df any time someone requests that field, or restrict in-place operations? (I'm mostly for the second option.)

Collaborator Author

You are right, we need to either restrict in-place operations or create a deep copy when an in-place operation is performed.
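
A small plain-pandas illustration of the trade-off (assuming copy-on-write is disabled, which is the default in pandas 1.x/2.x): a shallow copy shares the data buffers, so in-place writes leak into the cached frame, while a deep copy does not:

import pandas as pd

cached = pd.DataFrame({"a": [1, 2, 3]})

shallow = cached.copy(deep=False)
shallow.iloc[0, 0] = 10       # writes into the shared buffer
print(cached.loc[0, "a"])     # 10 -> the cached frame is mutated

deep = cached.copy(deep=True)
deep.iloc[1, 0] = 100         # owns its own buffer
print(cached.loc[1, "a"])     # 2 -> the cached frame is untouched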

@dchigarev
Collaborator

but it also doubles the memory consumption, doesn't it?

I think it depends on the dataset. Some data could be shared between HDK, Arrow and pandas. On the other hand, if we do not cache the pandas frame, we may end up with multiple copies of identical frames in memory.

Here is a simple test demonstrating the memory usage:

But note that in real life we don't usually keep references to pandas dfs once the default-to-pandas operation is done, so to make this scenario more realistic we should delete the pdf after each measurement:

import psutil
import modin.pandas as pd

df = pd.DataFrame({"a": range(100000000)})
df = df.dropna() # Ensure the table is imported to HDK
mem0 = psutil.virtual_memory().used
print(f"{mem0}")
pdf1 = df._to_pandas()
mem1 = psutil.virtual_memory().used
print(f"{mem1}: + {mem1 - mem0}")
del pdf1

pdf2 = df._to_pandas()
mem2 = psutil.virtual_memory().used
print(f"{mem2}: + {mem2 - mem1}")
del pdf2

pdf3 = df._to_pandas()
mem3 = psutil.virtual_memory().used
print(f"{mem3}: + {mem3 - mem2}")
del pdf3

Then on master I get:

8689057792
12090052608: + 3400994816
12083097600: + -6955008
11347464192: + -735633408
(the memory consumption decreases over the calls?)

And for your branch it's:

8684437504
12864876544: + 4180439040
12864876544: + 0
12864876544: + 0

@dchigarev
Collaborator

but it also doubles the memory consumption, doesn't it?

I think it depends on the dataset. Some data could be shared between HDK, Arrow and pandas.

Well, if an Arrow table with certain data can be converted to pandas by simply sharing its buffers, then shouldn't such a conversion be almost free? Do you know which column data types can be converted that easily?
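
For illustration (an assumption about how one could check this, not something from the PR), pyarrow can answer that question directly: primitive numeric arrays without nulls can be handed over zero-copy, while strings and nullable data require a copy:

import pyarrow as pa

ints = pa.array(range(10), type=pa.int64())
print(ints.to_numpy(zero_copy_only=True))   # works: the NumPy array wraps the Arrow buffer

strings = pa.array(["a", "b", "c"])
try:
    strings.to_numpy(zero_copy_only=True)
except pa.ArrowInvalid as exc:
    print("copy required:", exc)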

@AndreyPavlenko
Collaborator Author

But note that in real life we don't usually keep references to pandas dfs once the default-to-pandas operation is done

It depends ... For example, in the case of unsupported data, the pandas df will be saved in the partitions of the new HDK frame.
Also, in this implementation, if an HDK frame is created from a pandas frame, the pandas frame is always saved in the partitions. It is converted to Arrow lazily, only when exporting to HDK.
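
A hedged sketch of that lazy-conversion idea (hypothetical class, not the actual Modin code): the partition keeps the original pandas frame and only builds an Arrow table the first time the data is exported to HDK:

import pandas
import pyarrow

class LazyArrowPartition:
    """Keeps the original pandas frame; converts to Arrow only on first export."""

    def __init__(self, pandas_df: pandas.DataFrame):
        self._pandas_df = pandas_df
        self._arrow_table = None

    def to_arrow(self) -> pyarrow.Table:
        if self._arrow_table is None:
            # Conversion happens lazily, only when the data is actually exported to HDK.
            self._arrow_table = pyarrow.Table.from_pandas(self._pandas_df)
        return self._arrow_table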

@dchigarev
Collaborator

For example, in the case of unsupported data, the pandas df will be saved in the partitions of the new HDK frame.

Right, this is done so we wouldn't do unnecessary .to_pandas() conversions, since we know in advance that all operations will default to pandas. However, how is this related to the changes in this PR? How could caching the pandas df in normal HDK frames help in this case?

Also, in the #6412 implementation, if an HDK frame is created from a pandas frame, the pandas frame is always saved in the partitions. It is converted to Arrow lazily, only when exporting to HDK.

This optimization is quite good, but again, how is this related to this PR?

@AndreyPavlenko
Collaborator Author

how is this related to this PR?

Not related. These are just a few examples of when we do keep references to pandas dfs.

@dchigarev
Collaborator

dchigarev commented Sep 28, 2023

how is this related to this PR?

Not related. These are just a few examples of when we do keep references to pandas dfs.

I understand that, but in those examples the pandas dfs originate not from the .to_pandas() call but from the user, right? My original question was about keeping in memory the results of .to_pandas() after a default-to-pandas function is done.

@AndreyPavlenko
Collaborator Author

AndreyPavlenko commented Sep 28, 2023

but in those examples the pandas dfs originate not from the .to_pandas() call but from the user, right?

Not necessarily. The frame returned by to_pandas() is used to build a new Modin frame.
Here is an example:

import psutil
import pandas as pd
# import modin.pandas as pd

df = pd.DataFrame(range(1000000), columns=pd.MultiIndex.from_tuples([(1,2,3)]))
mem0 = psutil.virtual_memory().used
df2 = df.iloc[:-1]
mem1 = psutil.virtual_memory().used
print(f"{mem1}: + {mem1 - mem0}")

The pandas iloc returns a new frame that shares the data with the original one; no new memory for the data is allocated. In Modin HDK, this iloc results in 4 calls to to_pandas(), i.e., we create 4 new pandas frames, 3 of which are garbage collected, while the last one is saved in the partitions of the new HDK frame.

@anmyachev
Collaborator

#7234

@anmyachev anmyachev closed this May 5, 2024