PERF-#6609: HDK: to_pandas(): Cache pandas DataFrame #6610
Signed-off-by: Andrey Pavlenko <andrey.a.pavlenko@gmail.com>
But it also doubles the memory consumption, doesn't it?
@@ -2670,6 +2673,7 @@ def to_pandas(self):
# restrictions on column names.
df.columns = self.columns

self._pandas_df = df
I'm worried about memory consumption; maybe we should store a weakref here instead?
I doubt weakref is a good solution here. Most probably, the weakly referenced frame will never be reused and will always be garbage collected.
I think it depends on the dataset. Some data could be shared between HDK, Arrow and pandas. Here is a simple test demonstrating the memory usage:
import psutil
import modin.pandas as pd
df = pd.DataFrame({"a": range(100000000)})
df = df.dropna() # Ensure the table is imported to HDK
mem0 = psutil.virtual_memory().used
print(f"{mem0}")
pdf1 = df._to_pandas()
mem1 = psutil.virtual_memory().used
print(f"{mem1}: + {mem1 - mem0}")
pdf2 = df._to_pandas()
mem2 = psutil.virtual_memory().used
print(f"{mem2}: + {mem2 - mem1}")
pdf3 = df._to_pandas()
mem3 = psutil.virtual_memory().used
print(f"{mem3}: + {mem3 - mem2}")
Output on the master branch:
Output on this branch:
@@ -2670,6 +2673,7 @@ def to_pandas(self):
# restrictions on column names.
df.columns = self.columns

self._pandas_df = df
What about in-place operations executed on the returned dataframe? Wouldn't such operations affect the stored object?
Good catch! You are right, the stored object will be affected.
A non-deep copy should be returned here. It will share the data with the main frame, but will not change it.
Even after the copy, in-place operations still mutate unwanted frames:
import modin.pandas as pd
def setitem(df, i, val):
    df.iloc[i, 0] = val
    return df
df = pd.DataFrame({"a": [1, 2, 3, 4, 5]})
res1 = df._default_to_pandas(lambda df: setitem(df, 0, 10))
res2 = df._default_to_pandas(lambda df: setitem(df, 1, 100))
print(df)
# a
# 0 10
# 1 100
# 2 3
# 3 4
# 4 5
print(res1)
# a
# 0 10
# 1 2
# 2 3
# 3 4
# 4 5
print(res2)
# a
# 0 10
# 1 100
# 2 3
# 3 4
# 4 5
Does it mean that we should copy pandas_df anytime someone requests that field, or restrict in-place operations? (I'm mostly for the second option.)
You are right, we need to restrict the inplace operations or create a deep copy in case of an inplace operation.
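The difference between handing out a shallow copy and a deep copy of a cached object can be sketched without pandas. `FakeFrame` and `CachedSource` below are hypothetical stand-ins used only for illustration; real pandas copy semantics depend on the pandas version and its copy-on-write settings, so this shows the general hazard rather than Modin's behavior.

```python
import copy


class FakeFrame:
    """Hypothetical stand-in for a pandas DataFrame: a dict of column lists."""

    def __init__(self, columns):
        self.columns = columns


class CachedSource:
    """Hypothetical frame that keeps a cached converted object."""

    def __init__(self, columns):
        self._cached = FakeFrame(columns)

    def to_pandas_shallow(self):
        # New wrapper object, but it shares the same underlying column data.
        return copy.copy(self._cached)

    def to_pandas_deep(self):
        # Fully independent copy: safe to mutate, but costs extra memory.
        return copy.deepcopy(self._cached)


source = CachedSource({"a": [1, 2, 3, 4, 5]})

leaky = source.to_pandas_shallow()
leaky.columns["a"][0] = 10  # in-place write through the shallow copy
cached_after_shallow = source._cached.columns["a"][0]

safe = source.to_pandas_deep()
safe.columns["a"][1] = 100  # in-place write through the deep copy
cached_after_deep = source._cached.columns["a"][1]
```

Here the write through the shallow copy reaches the cached data, while the write through the deep copy does not, which is the trade-off between restricting in-place operations and paying for a deep copy.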
2473e28 to 49c6819
But note that in real life we don't usually keep references to pandas dfs once the default-to-pandas operation is done, so to make this scenario more realistic we should delete them after each conversion:
import psutil
import modin.pandas as pd
df = pd.DataFrame({"a": range(100000000)})
df = df.dropna() # Ensure the table is imported to HDK
mem0 = psutil.virtual_memory().used
print(f"{mem0}")
pdf1 = df._to_pandas()
mem1 = psutil.virtual_memory().used
print(f"{mem1}: + {mem1 - mem0}")
del pdf1
pdf2 = df._to_pandas()
mem2 = psutil.virtual_memory().used
print(f"{mem2}: + {mem2 - mem1}")
del pdf2
pdf3 = df._to_pandas()
mem3 = psutil.virtual_memory().used
print(f"{mem3}: + {mem3 - mem2}")
del pdf3
Then on master I get:
And for your branch it's:
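Because `psutil.virtual_memory().used` reports system-wide usage, the deltas in the tests above can be influenced by other processes. As a rough alternative, the following sketch uses the standard-library `tracemalloc` module to compare keeping a converted result with dropping it. The `convert` function and the `allocated_delta` helper are hypothetical stand-ins, and `tracemalloc` only sees allocations made through Python's allocators, so natively allocated buffers may not show up in its numbers.

```python
import tracemalloc


def allocated_delta(fn):
    """Run fn() and return (result, net growth in traced bytes)."""
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    result = fn()
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, after - before


def convert():
    # Hypothetical stand-in for an expensive to-pandas conversion.
    return [float(i) for i in range(100_000)]


def convert_and_drop():
    # Mimics a caller that does not keep the converted object.
    convert()
    return None


# Keeping the result retains its memory after the call returns.
kept, kept_bytes = allocated_delta(convert)

# Dropping the result lets the memory be released before measurement.
_, dropped_bytes = allocated_delta(convert_and_drop)

print(f"kept: +{kept_bytes}, dropped: +{dropped_bytes}")
```

The same pattern could wrap the `_to_pandas()` calls from the tests, with the caveat about native allocations in mind.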
Well, if an Arrow table with certain data can be converted to pandas by simply sharing its buffer, then shouldn't such a conversion be almost free? Do you know which column data types can be converted that easily?
It depends... For example, in the case of unsupported data, the pandas df will be saved in partitions of the new HDK frame.
Right, this is done so we wouldn't do unnecessary
This optimization is quite good, but again, how is this related to this PR?
Not related. These are just a few examples of when we do
I understand that, but in those examples the pandas dfs do not originate from the
Not necessarily. The frame returned by to_pandas() is used to build a new modin frame.
import psutil
import pandas as pd
# import modin.pandas as pd
df = pd.DataFrame(range(1000000), columns=pd.MultiIndex.from_tuples([(1,2,3)]))
mem0 = psutil.virtual_memory().used
df2 = df.iloc[:-1]
mem1 = psutil.virtual_memory().used
print(f"{mem1}: + {mem1 - mem0}")
The pandas