
PERF-#7150: Reduce peak memory consumption #7149

Merged
2 commits merged into modin-project:master on Apr 8, 2024

Conversation

@anmyachev (Collaborator) commented Apr 4, 2024

What do these changes do?

Keeping large objects (in our case, dataframes) alive in local variables after they are no longer needed significantly increases peak memory consumption. I propose to review all such places and free the memory as early as possible.
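A minimal sketch of the pattern (a hypothetical function, not the actual Modin code): by dropping the reference to a large input as soon as it has been consumed, the input and the result derived from it no longer have to coexist in memory.

import pandas

def combine_and_transform(partitions, axis=0):
    # hypothetical helper illustrating the pattern applied in this PR
    dataframe = pandas.concat(list(partitions), axis=axis, copy=False)
    # drop the local reference early; if it was the last reference, the
    # source frames can be freed before the next memory-hungry step runs
    del partitions
    return dataframe.abs()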

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Reduce peak memory consumption #7150
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

@anmyachev anmyachev changed the title PERF-#0000: Reduce peak memory consumption PERF-#7150: Reduce peak memory consumption Apr 5, 2024
@anmyachev anmyachev marked this pull request as ready for review April 5, 2024 08:18
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@anmyachev (Collaborator, Author) commented:
@YarShev @dchigarev ready for review

@@ -377,6 +377,8 @@ def deploy_splitting_func(
A list of pandas DataFrames.
"""
dataframe = pandas.concat(list(partitions), axis=axis, copy=False)
# to reduce peak memory consumption
del partitions
Collaborator:

Since we manually delete objects now, how does this affect performance? Did you try running any benchmark?

Collaborator (Author):

> Since we manually delete objects now, how does this affect performance? Did you try running any benchmark?

The del statement simply removes a reference to the object; it allows the garbage collector to free the memory earlier, but it does not call the collector explicitly, so there is no direct impact on performance.
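A tiny illustration of that (plain CPython behavior, not Modin code): del only unbinds a name, and the memory is released as soon as the last reference disappears, without any explicit garbage-collector call.

import sys

big = bytearray(100 * 1024**2)   # ~100 MB buffer
alias = big                      # a second reference to the same object
del big                          # unbinds one name; 'alias' keeps the object alive
print(sys.getrefcount(alias))    # the object still exists
del alias                        # last reference dropped -> CPython frees the memory
                                 # immediately, no gc.collect() call needed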

@@ -377,6 +377,8 @@ def deploy_splitting_func(
A list of pandas DataFrames.
"""
dataframe = pandas.concat(list(partitions), axis=axis, copy=False)
# to reduce peak memory consumption
Collaborator:

How much does this reduce peak memory consumption?

Collaborator (Author):

It depends on the size of the objects in partitions. Here we no longer keep an extra copy of the entire dataframe (the source partitions) alive until the end of the function; in the synthetic benchmark below that amounts to several gigabytes.

Collaborator:

Do you have any results on a benchmark we run regularly? It would be great to look at.

Collaborator (Author):

No, only for a synthetic benchmark (it roughly repeats what happens when an operation is performed on axis partitions):

import numpy as np
import pandas as pd
import time

df1 = pd.DataFrame(np.random.rand(10**6, 2 * 10**2))  # ~ 1GB
df2 = pd.DataFrame(np.random.rand(10**6, 2 * 10**2))  # ~ 1GB
partitions = (df1, df2)

dataframe = pd.concat(list(partitions), axis=0, copy=False)
del partitions, df1, df2

result = dataframe.abs()

del dataframe

time.sleep(5)

# the variant with del saves around ~4GB of peak memory consumption

I guess we can just look at the results of the built-in (in CI) memory consumption check?

@arunjose696 (Collaborator) commented Apr 8, 2024:

In timedf, peak memory consumption is calculated by polling memory stats (similar to what htop shows) from a child thread every 0.001 seconds and taking the maximum.

The calculated peak is printed as max_system_memory at the end of a benchmark run in CI, so it can be used to compare memory consumption on any of our regular benchmarks.
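For reference, a rough sketch of that kind of measurement (this is not timedf's actual code; psutil and the names below are assumptions): a watcher thread samples memory every millisecond and records the maximum.

import threading
import time

import psutil  # assumed dependency for this sketch

def watch_peak(stop_event, result, interval=0.001):
    # poll system memory usage, htop-style, until asked to stop
    peak = 0
    while not stop_event.is_set():
        peak = max(peak, psutil.virtual_memory().used)
        time.sleep(interval)
    result["max_system_memory"] = peak

stop = threading.Event()
result = {}
watcher = threading.Thread(target=watch_peak, args=(stop, result))
watcher.start()
_ = bytearray(200 * 1024**2)  # stand-in for the benchmarked workload
stop.set()
watcher.join()
print("max_system_memory:", result["max_system_memory"])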

@anmyachev anmyachev merged commit 4c95e16 into modin-project:master Apr 8, 2024
36 checks passed
@anmyachev anmyachev deleted the peak-memory branch April 8, 2024 19:17
Development
Successfully merging this pull request may close issue #7150: Reduce peak memory consumption.
3 participants