
PERF-#7150: Reduce peak memory consumption #7149

Merged
2 commits merged into modin-project:master on Apr 8, 2024

Conversation

@anmyachev (Collaborator) commented Apr 4, 2024

What do these changes do?

Keeping large objects (in our case, dataframes) alive in local variables after they are no longer needed significantly increases peak memory consumption. I propose to review all such places and free the memory as early as possible.
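A minimal sketch of the pattern (a hypothetical function, not the actual Modin code): by dropping the reference to a large input as soon as it has been consumed, the input and the result derived from it no longer have to coexist in memory.

import pandas

def combine_and_transform(partitions, axis=0):
    # hypothetical helper illustrating the pattern applied in this PR
    dataframe = pandas.concat(list(partitions), axis=axis, copy=False)
    # drop the local reference early; if it was the last reference, the
    # source frames can be freed before the next memory-hungry step runs
    del partitions
    return dataframe.abs()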

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Reduce peak memory consumption #7150
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

@anmyachev anmyachev changed the title PERF-#0000: Reduce peak memory consumption PERF-#7150: Reduce peak memory consumption Apr 5, 2024
@anmyachev anmyachev marked this pull request as ready for review April 5, 2024 08:18
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@anmyachev (Collaborator, Author) commented:
@YarShev @dchigarev ready for review

@@ -377,6 +377,8 @@ def deploy_splitting_func(
A list of pandas DataFrames.
"""
dataframe = pandas.concat(list(partitions), axis=axis, copy=False)
# to reduce peak memory consumption
del partitions
Collaborator:

Since we manually delete objects now, how does this affect performance? Did you try running any benchmark?

Collaborator (Author):

> Since we manually delete objects now, how does this affect performance? Did you try running any benchmark?

The del statement simply removes a reference to the object; it allows the garbage collector to free the memory earlier, but it does not call the collector explicitly, so there is no direct impact on performance.
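A tiny illustration of that (plain CPython behavior, not Modin code): del only unbinds a name, and the memory is released as soon as the last reference disappears, without any explicit garbage-collector call.

import sys

big = bytearray(100 * 1024**2)   # ~100 MB buffer
alias = big                      # a second reference to the same object
del big                          # unbinds one name; 'alias' keeps the object alive
print(sys.getrefcount(alias))    # the object still exists
del alias                        # last reference dropped -> CPython frees the memory
                                 # immediately, no gc.collect() call needed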

@@ -377,6 +377,8 @@ def deploy_splitting_func(
A list of pandas DataFrames.
"""
dataframe = pandas.concat(list(partitions), axis=axis, copy=False)
# to reduce peak memory consumption
Collaborator:

How much does this reduce peak memory consumption?

Collaborator (Author):

It depends on the size of the objects in partitions. Here we no longer keep an extra copy of the entire dataframe (the source partitions) alive until the end of the function; in the synthetic benchmark below that amounts to several gigabytes.

Collaborator:

Do you have any results on a benchmark we run regularly? It would be great to look at.

Collaborator (Author):

No, only for a synthetic benchmark (it roughly repeats what happens when an operation is performed on axis partitions):

import numpy as np
import pandas as pd
import time

df1 = pd.DataFrame(np.random.rand(10**6, 2 * 10**2))  # ~ 1GB
df2 = pd.DataFrame(np.random.rand(10**6, 2 * 10**2))  # ~ 1GB
partitions = (df1, df2)

dataframe = pd.concat(list(partitions), axis=0, copy=False)
del partitions, df1, df2

result = dataframe.abs()

del dataframe

time.sleep(5)

# the variant with del saves around ~4GB of peak memory consumption

I guess we can just look at the results of the built-in (in CI) memory consumption check?

@arunjose696 (Collaborator) commented Apr 8, 2024:

In timedf, peak memory consumption is calculated by polling memory stats (similar to what htop shows) from a child thread every 0.001 seconds and taking the maximum.

The calculated peak is printed as max_system_memory at the end of a benchmark run in CI, so it can be used to compare memory consumption on any of our regular benchmarks.
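For reference, a rough sketch of that kind of measurement (this is not timedf's actual code; psutil and the names below are assumptions): a watcher thread samples memory every millisecond and records the maximum.

import threading
import time

import psutil  # assumed dependency for this sketch

def watch_peak(stop_event, result, interval=0.001):
    # poll system memory usage, htop-style, until asked to stop
    peak = 0
    while not stop_event.is_set():
        peak = max(peak, psutil.virtual_memory().used)
        time.sleep(interval)
    result["max_system_memory"] = peak

stop = threading.Event()
result = {}
watcher = threading.Thread(target=watch_peak, args=(stop, result))
watcher.start()
_ = bytearray(200 * 1024**2)  # stand-in for the benchmarked workload
stop.set()
watcher.join()
print("max_system_memory:", result["max_system_memory"])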

@anmyachev anmyachev merged commit 4c95e16 into modin-project:master Apr 8, 2024
36 checks passed
@anmyachev anmyachev deleted the peak-memory branch April 8, 2024 19:17
Development
Successfully merging this pull request may close issue #7150: Reduce peak memory consumption.
3 participants