Enable range-partitioning implementation for `groupby().rolling()` by default #6942

dchigarev · 2024-02-19T12:58:21Z

While running measurements, I found that the range-partitioning implementation of `groupby().rolling()` was faster than the full-axis implementation in every configuration I measured, so it's worth enabling the range-partitioning implementation by default for this case (the same as we did for `groupby().apply()` in #6804).
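For readers unfamiliar with the idea, here is a rough sketch of what range-partitioning means for a grouped rolling operation. It is written in plain pandas and is not Modin's actual code. The function name `range_partitioned_rolling_mean`, the quantile-based pivot choice, and the toy data are all assumptions made for illustration. Rows are split into key ranges so that every group lands in exactly one partition, and each partition is then processed independently.

```python
import numpy as np
import pandas


def range_partitioned_rolling_mean(df, key, window, npartitions=4):
    """Illustrative only: rolling mean per group, computed one key range at a time."""
    keys = df[key].to_numpy()
    # Pick pivots from key quantiles (an assumed heuristic for balancing partitions).
    quantiles = np.linspace(0, 1, npartitions + 1)[1:-1]
    pivots = np.unique(np.quantile(keys, quantiles))
    # Rows with equal keys always get the same bin, so no group is split.
    bins = np.searchsorted(pivots, keys, side="right")
    parts = [df[bins == b] for b in range(len(pivots) + 1)]
    # Each partition holds whole groups and can be processed independently.
    results = [p.groupby(key).rolling(window).mean() for p in parts if len(p)]
    return pandas.concat(results)


# Toy data (hypothetical), compared against the single-frame computation.
df = pandas.DataFrame({"key": [2, 0, 1, 0, 2, 1, 0, 3], "x": [1.0, 2, 3, 4, 5, 6, 7, 8]})
full_axis = df.groupby("key").rolling(2).mean()
partitioned = range_partitioned_rolling_mean(df, "key", 2, npartitions=3)
print(partitioned.equals(full_axis))
```

Because partitions are ordered by key range and each one is sorted by key internally, concatenating the per-partition results should reproduce the ordering of the single-frame result. In a distributed setting the per-partition step is what would run in parallel.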

16 cores

44 cores

Script used for the measurements:

```python
from timeit import default_timer as timer

import numpy as np
import pandas

import modin.config as cfg
import modin.pandas as pd
from modin.utils import execute

# Benchmark grid
NROWS = [10_000, 30_000, 50_000, 100_000, 500_000, 1_000_000, 5_000_000]
NCOLS = [10, 40]
NGROUPS = [10, 10_000]
USE_RANGE_PART = [True, False]
NITERS = 3  # timed repetitions per configuration
WINDOW = 5  # rolling window size
method = "mean"  # rolling aggregation to benchmark

total_iters = len(NROWS) * len(NCOLS) * len(NGROUPS) * len(USE_RANGE_PART) * NITERS
its = 0

# Median timings: one row per row count,
# one column per (num_cols, num_groups, use range-part) combination
res_s = pandas.DataFrame(
    index=pandas.Index(NROWS, name="num_rows"),
    columns=pandas.MultiIndex.from_product(
        [NCOLS, NGROUPS, USE_RANGE_PART],
        names=["num_cols", "num_groups", "use range-part"],
    ),
)

for nrows in NROWS:
    for ncols in NCOLS:
        for ngroups in NGROUPS:
            # One key column cycling through `ngroups` values plus random integer data columns
            data = {
                "key": np.tile(np.arange(ngroups), nrows // ngroups),
                **{f"data_col{i}": np.random.randint(0, 1_000_000, size=nrows) for i in range(ncols - 1)},
            }

            for use_rpart in USE_RANGE_PART:
                cfg.RangePartitioningGroupby.put(use_rpart)
                md_df = pd.DataFrame(data)
                # Materialize the frame first so that its creation is not timed
                execute(md_df)
                times = []
                for _ in range(NITERS):
                    print(f"{round((its / total_iters) * 100, 2)}%")
                    t1 = timer()
                    res = getattr(md_df.groupby("key").rolling(WINDOW), method)()
                    # Wait for the result to be computed before stopping the timer
                    execute(res)
                    md_time = timer() - t1
                    times.append(md_time)
                    its += 1
                tm = np.median(times)
                res_s.loc[nrows, (ncols, ngroups, use_rpart)] = tm

                # Save intermediate results after every configuration
                res_s.to_excel("rolling.xlsx")
```
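The script writes median timings into a table whose columns are indexed by `(num_cols, num_groups, use range-part)`. A minimal sketch of turning such a table into speedup ratios is shown below. The helper name `range_part_speedup` is hypothetical, and the numbers in the demo are placeholders, not measurements from this issue.

```python
import pandas


def range_part_speedup(timings):
    """Return full-axis time divided by range-partitioning time (> 1 means range-part is faster)."""
    flags = timings.columns.get_level_values("use range-part").to_numpy(dtype=bool)
    # Split the table by the boolean level and drop that level so both halves align
    range_part = timings.loc[:, flags].droplevel("use range-part", axis=1)
    full_axis = timings.loc[:, ~flags].droplevel("use range-part", axis=1)
    # The timing table is filled cell by cell, so it may have object dtype
    return full_axis.astype(float) / range_part.astype(float)


# Placeholder timings (NOT real results) laid out like the script's `res_s`
columns = pandas.MultiIndex.from_product(
    [[10], [10, 10_000], [True, False]],
    names=["num_cols", "num_groups", "use range-part"],
)
demo = pandas.DataFrame(
    [[1.0, 2.0, 0.5, 2.0]],
    index=pandas.Index([10_000], name="num_rows"),
    columns=columns,
)
print(range_part_speedup(demo))
```

Passing the table produced by the script (or read back from `rolling.xlsx` with a matching column index) should give one ratio per `(num_cols, num_groups)` pair and row count.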


dchigarev added the new feature/request 💬 (Requests and pull requests for new features) and P1 (Important tasks that we should complete soon) labels Feb 19, 2024

dchigarev self-assigned this Feb 19, 2024

dchigarev added a commit to dchigarev/modin that referenced this issue Feb 19, 2024

FEAT-modin-project#6942: Enable range-partitioning impl for 'groupby(…

0f58679

…).rolling()' by default Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev mentioned this issue Feb 19, 2024

FEAT-#6942: Enable range-partitioning impl for 'groupby().rolling()' by default #6943

Merged

7 tasks

YarShev closed this as completed in #6943 Feb 19, 2024

YarShev pushed a commit that referenced this issue Feb 19, 2024

FEAT-#6942: Enable range-partitioning impl for 'groupby().rolling()' …

aeb7d0a

…by default (#6943) Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
