
REGR: groupby.transform with a UDF performance #55256

Open
rhshadrach opened this issue Sep 23, 2023 · 12 comments
Labels
Copy / view semantics, Groupby, Performance, Regression, Transformations
Milestone
2.2.3

@rhshadrach
Member

rhshadrach commented Sep 23, 2023

import numpy as np
import pandas as pd

pd.options.mode.copy_on_write = False  # True
size = 10_000
df = pd.DataFrame(
    {
        'a': np.random.randint(0, 100, size),
        'b': np.random.randint(0, 100, size),
        'c': np.random.randint(0, 100, size),
    }
).set_index(['a', 'b']).sort_index()

gb = df.groupby(['a', 'b'])

%timeit gb.transform(lambda x: x == x.shift(-1).fillna(0))

# 2.0.x - CoW=False
# 1.46 s ± 14.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 
# 2.0.x - CoW=True
# 1.47 s ± 6.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 
# main - CoW=False
# 4.35 s ± 50.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 
# main - CoW=True
# 9.11 s ± 76.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Encountered this while trying to update some code to use CoW. The regression exists without CoW, but is even worse with it. I haven't investigated why yet.

cc @phofl @jorisvandenbossche

PS: This code should not have been using transform with a UDF 😄
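
For reference, a sketch of a vectorized equivalent that avoids the Python-level UDF entirely (untested here, but groupby.shift is a built-in per-group transform, so it should match the UDF's result for this data):

# Compare each row to the next row within its group, no UDF needed
result = df.eq(gb.shift(-1).fillna(0))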

@rhshadrach rhshadrach added Groupby, Performance, Regression, Copy / view semantics, Transformations labels Sep 23, 2023
@rhshadrach rhshadrach added this to the 2.1.2 milestone Sep 23, 2023
@rhshadrach
Member Author

Spent a while trying to profile this and had difficulty with various methods. It looks like concat is modifying its argument (the indexes, I'm guessing).

import time

# foo is the UDF from the OP
def foo(x):
    return x == x.shift(-1).fillna(0)

applied = [foo(group) for _, group in gb]

timer = time.time()
pd.concat(applied)
print(time.time() - timer)
# 5.4200828075408936

timer = time.time()
pd.concat(applied)
print(time.time() - timer)
# 0.22937822341918945

This might be completely expected (not sure), but I wanted to mention it since it might be helpful in case anyone else works on profiling.

@rhshadrach
Member Author

@phofl - I think I've tracked it down to an issue with the reference counting in BlockValuesRefs._clear_dead_references. This is called from MultiIndex.levels, and I'm seeing len(self.referenced_blocks) there grow linearly up to 12k. I'm wondering if this is caused by our use of internals in pandas.core.groupby.ops.FrameSplitter._chop, but I'm not sure.
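
A hedged way to observe this growth from the outside (this pokes at pandas 2.x internals, so the attribute names below are implementation details and may change):

# Watch the reference list on the frame's first block grow as groups are chopped
blk = df._mgr.blocks[0]
print(len(blk.refs.referenced_blocks))  # before iterating groups

for _, group in gb:  # each per-group view registers new references
    pass

print(len(blk.refs.referenced_blocks))  # substantially larger afterwards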

@phofl
Member

phofl commented Sep 25, 2023

Is this on current main?

@rhshadrach
Member Author

Yes - I'm behind by 1 commit.

@rhshadrach
Member Author

Also - the more I experiment with these groups, the more I see the reference count go up. If I run a few different operations on them (e.g. len(gb), where gb is as in the OP), it goes up even higher.

@brynpickering

@rhshadrach just weighing in here, as we saw this issue really hit performance in our implementation. It is caused specifically by grouping over multiple columns (/levels). It requires quite a large dataframe to see the difference (for a small df, O(100) rows, performance is similar); with a large df (O(1 million) rows), the time difference is more than an order of magnitude.

The two snippets below produce equivalent results, but the second is much slower in pandas 2.1.1, while the first performs the same in both 2.1.1 and 2.0.3:

# Snippet 1: nested single-column groupbys (same speed in 2.0.3 and 2.1.1)
df_dict = {}
for col1, df_slice_1 in df.groupby("col1"):
    df_dict[col1] = {}
    for col2, df_slice_2 in df_slice_1.groupby("col2"):
        df_dict[col1][col2] = df_slice_2

# Snippet 2: single multi-column groupby (much slower in 2.1.1)
df_dict = {x: {} for x in df["col1"].unique()}
for (col1, col2), df_slice in df.groupby(["col1", "col2"]):
    df_dict[col1][col2] = df_slice
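
For anyone reproducing this, a setup along these lines shows the gap at the O(1 million) row scale (column names match the snippets above; the sizes and the value column are illustrative):

import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame(
    {
        "col1": np.random.randint(0, 100, n),
        "col2": np.random.randint(0, 100, n),
        "val": np.random.rand(n),
    }
)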

@jbrockmendel
Member

Will need more attention to make sure it is safe, but I got a 4x speedup on the OP example (with CoW disabled) by pinning result._cache["levels"] = self.levels in MultiIndex._getitem_slice
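
The same pinning can be illustrated from user code (this reaches into Index._cache, the dict backing pandas' cache_readonly machinery, so it is a sketch of the idea rather than a public API):

mi = df.index                      # the MultiIndex from the OP
sub = mi[0:100]                    # slicing returns a new MultiIndex with an empty cache
sub._cache["levels"] = mi.levels   # reuse the parent's already-built levels

Without the pinned entry, every sliced index rebuilds levels on first access, and per the earlier comments that access is what churns the block reference tracking group after group.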

@phofl
Member

phofl commented Sep 27, 2023

Richard and I discussed the same solution on Slack; that should be safe.

@rhshadrach
Member Author

I also checked: this was picked up by our ASVs: https://asv-runner.github.io/asv-collection/pandas/#groupby.Apply.time_copy_function_multi_col

@rhshadrach
Member Author

I'm going to take this up - if anyone else has started working on this, happy to let you have it.

@rhshadrach rhshadrach self-assigned this Oct 2, 2023
@rhshadrach
Member Author

rhshadrach commented Oct 2, 2023

I'm still seeing a 30% slowdown between 2.0 and 2.1 without CoW, and an additional 100% slowdown in 2.1 with CoW. It seems to be due to #32669 (comment) (cc @topper-123). This is hit whenever we copy the MultiIndex, which occurs in the x == x.shift(-1) comparison in the OP.
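
For anyone bisecting this further, a quick profile can confirm whether the MultiIndex copy path dominates (standard-library cProfile on the OP's setup; nothing else assumed):

import cProfile

# Sort by cumulative time to surface the MultiIndex copy path if it is hot
cProfile.run("gb.transform(lambda x: x == x.shift(-1).fillna(0))", sort="cumtime")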

@lithomas1
Member

We're down to a ~15% slowdown now, I think.

In [5]: size = 10_000
   ...: df = pd.DataFrame(
   ...:     {
   ...:         'a': np.random.randint(0, 100, size),
   ...:         'b': np.random.randint(0, 100, size),
   ...:         'c': np.random.randint(0, 100, size),
   ...:     }
   ...: ).set_index(['a', 'b']).sort_index()
   ...: 
   ...: gb = df.groupby(['a', 'b'])

In [6]: %timeit gb.transform(lambda x: x == x.shift(-1).fillna(0))
2.35 s ± 147 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: pd.options.mode.copy_on_write=True

In [8]: size = 10_000
   ...: df = pd.DataFrame(
   ...:     {
   ...:         'a': np.random.randint(0, 100, size),
   ...:         'b': np.random.randint(0, 100, size),
   ...:         'c': np.random.randint(0, 100, size),
   ...:     }
   ...: ).set_index(['a', 'b']).sort_index()
   ...: 
   ...: gb = df.groupby(['a', 'b'])
   ...: 
   ...: %timeit gb.transform(lambda x: x == x.shift(-1).fillna(0))
2.77 s ± 148 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@lithomas1 lithomas1 modified the milestones: 2.1.4, 2.2 Dec 3, 2023
@lithomas1 lithomas1 modified the milestones: 2.2, 2.2.1 Jan 20, 2024
@lithomas1 lithomas1 modified the milestones: 2.2.1, 2.2.2 Feb 23, 2024
@lithomas1 lithomas1 modified the milestones: 2.2.2, 2.2.3 Apr 10, 2024