FEAT-#7118: Add range-partitioning impl for 'df.resample()' #7140

dchigarev · 2024-04-02T12:18:19Z

What do these changes do?

Adds a range-partitioning implementation for df.resample(). The new implementation isn't always faster, so it is enabled only when the RangePartitioning config flag is set.

script to measure

import pandas
import numpy as np
import modin.pandas as pd
import modin.config as cfg

from timeit import default_timer as timer

from modin.utils import execute

cfg.CpuCount.put(16)

nrows = [1_000_000, 5_000_000, 10_000_000]
ncols = [5, 33]
rules = [
    "500ms",  # doubles nrows
    "30s",  # reduces nrows by a factor of 30
    "5min",  # reduces nrows by a factor of 300
]
use_rparts = [True, False]

cols = pandas.MultiIndex.from_product([rules, ncols, use_rparts], names=["rule", "ncols", "USE RANGE PART"])
rres = pandas.DataFrame(index=nrows, columns=cols)

total_nits = len(nrows) * len(ncols) * len(rules) * len(use_rparts)
i = 0

for nrow in nrows:
    for ncol in ncols:
        index = pandas.date_range("31/12/2000", periods=nrow, freq="s")
        data = {f"col{i}": np.arange(nrow) for i in range(ncol)}
        pd_df = pandas.DataFrame(data, index=index)
        for rule in rules:
            for rparts in use_rparts:
                print(f"{round((i / total_nits) * 100, 2)}%")
                i += 1
                cfg.RangePartitioning.put(rparts)

                df = pd.DataFrame(data, index=index)
                execute(df)

                t1 = timer()
                res = df.resample(rule).sum()
                execute(res)
                ts = timer() - t1
                print(nrow, ncol, rule, rparts, ts)

                rres.loc[nrow, (rule, ncol, rparts)] = ts
                rres.to_excel("resample.xlsx")

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves Add range-partitioning implementation for df.resample() #7118
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date

YarShev · 2024-04-04T13:45:54Z

modin/core/storage_formats/pandas/query_compiler.py

@@ -1049,11 +1058,23 @@
        PandasQueryCompiler
            New QueryCompiler containing the result of resample aggregation.
        """
+        from modin.core.dataframe.pandas.dataframe.utils import ShuffleResample


Why is this import not at the top of the file?

The answer is right above in the codeQL warning :)

it triggers a circular import otherwise

modin/core/dataframe/pandas/dataframe/utils.py

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev · 2024-04-03T08:33:18Z

modin/core/dataframe/pandas/dataframe/utils.py

+            df, columns_info, ascending, **kwargs
+        )
+        for i, pivot in enumerate(columns_info[0].pivots):
+            add_attr(result[i], pivot - pandas.Timedelta(1, unit="ns"))


An example of why it's required:

Imagine we have a time series with an Hour resolution:

>>> sh
                       a
2018-01-01 00:00:00  0.0
2018-01-01 01:00:00  1.0
2018-01-01 02:00:00  2.0
2018-01-01 03:00:00  3.0

Resampling this into 30-min intervals gives this:

>>> expected_res = sh.resample("30min").sum()
>>> expected_res
                       a
2018-01-01 00:00:00  0.0
2018-01-01 00:30:00  0.0 <---- interpolated value
2018-01-01 01:00:00  1.0
2018-01-01 01:30:00  0.0 <---- interpolated value
2018-01-01 02:00:00  2.0
2018-01-01 02:30:00  0.0 <---- interpolated value
2018-01-01 03:00:00  3.0

Let's now emulate parallel execution of resample and split sh into two partitions:

>>> pd.concat([sh.iloc[:2].resample("30min").sum(), sh.iloc[2:].resample("30min").sum()])
                       a
2018-01-01 00:00:00  0.0
2018-01-01 00:30:00  0.0 <---- interpolated value
2018-01-01 01:00:00  1.0
*should be an interpolated value here, but it's missing*
2018-01-01 02:00:00  2.0
2018-01-01 02:30:00  0.0 <---- interpolated value
2018-01-01 03:00:00  3.0

The value is missing because the first partition only sees the interval [00:00:00, 01:00:00] and the second only sees [02:00:00, 03:00:00], so neither partition knows that the gap between 01:00 and 02:00 should be filled with 01:30.

One solution is to insert a dummy timestamp with a slight offset into each partition, so that each partition knows its real bounds.
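The idea can be reproduced in plain pandas. This is a minimal sketch, not Modin's actual code: it assumes a `sum` aggregation (so a dummy row holding 0.0 does not change the totals), and the names `pivot`, `dummy` and `first_padded` are made up for the illustration. The dummy row is placed 1 ns before the first timestamp of the next partition, mirroring the `pivot - pandas.Timedelta(1, unit="ns")` offset in the diff above.

```python
import pandas as pd

# Hourly series from the example above.
idx = pd.date_range("2018-01-01", periods=4, freq="60min")
sh = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0]}, index=idx)
expected = sh.resample("30min").sum()

# Naive split: each partition is resampled on its own, so 01:30 is lost.
naive = pd.concat(
    [sh.iloc[:2].resample("30min").sum(), sh.iloc[2:].resample("30min").sum()]
)

# Hypothetical fix: append a dummy row just before the next partition's
# first timestamp so the first partition "sees" its real upper bound.
pivot = sh.index[2]
dummy = pd.DataFrame({"a": [0.0]}, index=[pivot - pd.Timedelta(1, unit="ns")])
first_padded = pd.concat([sh.iloc[:2], dummy])

fixed = pd.concat(
    [first_padded.resample("30min").sum(), sh.iloc[2:].resample("30min").sum()]
)

print(len(naive), len(fixed), len(expected))
```

With the dummy row in place, the first partition produces the 01:30 bin itself and the concatenated result has the same rows as resampling the whole frame. A zero-valued dummy only works for aggregations such as `sum`; other aggregations would need the dummy row handled differently.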


dchigarev · 2024-04-03T18:27:19Z

modin/pandas/test/utils.py

-    "data": {"A": range(12), "B": range(12)},
-    "index": pandas.date_range("31/12/2000", periods=12, freq="h"),
+    "data": {
+        f"col{i}": random_state.randint(RAND_LOW, RAND_HIGH, size=NROWS)


Use more data so that the test actually exercises partitioning.

dchigarev · 2024-04-03T18:27:38Z

modin/core/dataframe/pandas/dataframe/dataframe.py

@@ -2438,6 +2438,7 @@ def combine_and_apply(
            dtypes=new_dtypes,
        )

+    @lazy_metadata_decorator(apply_axis="both")


This decorator was missing before, but it is needed for the method to function properly.

dchigarev · 2024-04-03T18:35:17Z

modin/core/storage_formats/pandas/query_compiler.py

+            resample_kwargs,
+            "transform",
+            arg=arg,
+            allow_range_impl=False,


The range-partitioning approach doesn't work well with transform operations, so it is disabled for all of them.

YarShev · 2024-04-04T13:40:22Z

modin/core/dataframe/pandas/dataframe/utils.py

@@ -122,6 +124,10 @@ class ShuffleSortFunctions(ShuffleFunctions):
        The ideal number of new partitions.
    level : list of strings or ints, or None
        Index level(s) to use as a key. Can't be specified along with `columns`.
+    closed_on_right : bool, default: False


Suggested change

-    closed_on_right : bool, default: False
+    close_to_right : bool, default: False

?

Here I refer to the term "closed interval": closed_on_right means that the right bound is included in the interval.
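To make the "closed on the right" terminology concrete, here is a small NumPy sketch of assigning values to bins delimited by pivots. It only illustrates the interval semantics; the `pivots` and `values` arrays are invented for the example, and this is not necessarily how Modin assigns rows to partitions.

```python
import numpy as np

pivots = np.array([10, 20])
values = np.array([5, 10, 15, 20, 25])

# Closed on the right: bins are (-inf, 10], (10, 20], (20, +inf),
# so a value equal to a pivot stays in the bin to the left of that pivot.
closed_right_bins = np.searchsorted(pivots, values, side="left")

# Closed on the left: bins are (-inf, 10), [10, 20), [20, +inf),
# so a value equal to a pivot moves to the bin to the right of that pivot.
closed_left_bins = np.searchsorted(pivots, values, side="right")

print(closed_right_bins.tolist())  # [0, 0, 1, 1, 2]
print(closed_left_bins.tolist())   # [0, 1, 1, 2, 2]
```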


YarShev · 2024-04-04T13:42:59Z

modin/core/storage_formats/pandas/query_compiler.py

@@ -1039,6 +1046,8 @@ def _resample_func(
            Modin frame. If not specified will be computed automaticly.
        df_op : callable(pandas.DataFrame) -> [pandas.DataFrame, pandas.Series], optional
            Preprocessor function to apply to the passed frame before resampling.
+        allow_range_impl : bool, default: True


Suggested change

-    allow_range_impl : bool, default: True
+    range_impl : bool, default: True

or use_range_impl?

allow_range_impl=True doesn't necessarily mean that range-partitioning will be used; that also depends on the cfg.RangePartitioning value and the axis argument. So this parameter only 'allows' range-partitioning to be used, it doesn't dictate it.
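The distinction between "allow" and "use" can be sketched as a simple predicate. The function name `should_use_range_impl` is hypothetical, and the `axis == 0` condition is an assumption made for the illustration; the comment above only says that the decision depends on the axis argument, not which axis is supported.

```python
def should_use_range_impl(
    allow_range_impl: bool, range_partitioning_enabled: bool, axis: int
) -> bool:
    """Hypothetical helper: the flag permits the range impl but does not force it."""
    # Assumption: only the row axis (axis=0) is eligible for range-partitioning.
    return allow_range_impl and range_partitioning_enabled and axis == 0


# The caller allows it, but the config flag is off, so it is not used.
print(should_use_range_impl(True, False, 0))  # False
# Allowed and enabled along the (assumed) supported axis.
print(should_use_range_impl(True, True, 0))  # True
# A transform-style caller that passes allow_range_impl=False opts out.
print(should_use_range_impl(False, True, 0))  # False
```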


dchigarev added the Blocked ❌ A pull request that is blocked label Apr 2, 2024

dchigarev changed the title ~~FEAT-#7718: Add range-partitioning impl for 'df.resample()'~~ FEAT-#7118: Add range-partitioning impl for 'df.resample()' Apr 2, 2024

github-advanced-security bot found potential problems Apr 2, 2024

FEAT-#7718: Add range-partitioning impl for 'df.resample()'

2d60fa3

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev force-pushed the issue_7718 branch from 7c752d5 to 2d60fa3 Compare April 3, 2024 08:04

dchigarev commented Apr 3, 2024

make prettier

f55eba3

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev commented Apr 3, 2024

dchigarev removed the Blocked ❌ A pull request that is blocked label Apr 3, 2024

only enable range-partitioning by request

bc1fc69

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev commented Apr 3, 2024

dchigarev marked this pull request as ready for review April 3, 2024 19:25

dchigarev requested review from devin-petersohn, mvashishtha, RehanSD, YarShev, vnlitvinov, anmyachev and a team as code owners April 3, 2024 19:25

YarShev reviewed Apr 4, 2024

YarShev approved these changes Apr 4, 2024

YarShev merged commit 8e79122 into modin-project:master Apr 4, 2024
37 checks passed
