
REGR: groupby.transform with a UDF performance #55256

Open
rhshadrach opened this issue Sep 23, 2023 · 12 comments
Labels
Copy / view semantics, Groupby, Performance, Regression, Transformations
Milestone
2.2.3

@rhshadrach
Member

rhshadrach commented Sep 23, 2023

import numpy as np
import pandas as pd

pd.options.mode.copy_on_write = False  # True
size = 10_000
df = pd.DataFrame(
    {
        'a': np.random.randint(0, 100, size),
        'b': np.random.randint(0, 100, size),
        'c': np.random.randint(0, 100, size),
    }
).set_index(['a', 'b']).sort_index()

gb = df.groupby(['a', 'b'])

%timeit gb.transform(lambda x: x == x.shift(-1).fillna(0))

# 2.0.x - CoW=False
# 1.46 s ± 14.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 
# 2.0.x - CoW=True
# 1.47 s ± 6.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 
# main - CoW=False
# 4.35 s ± 50.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 
# main - CoW=True
# 9.11 s ± 76.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Encountered this while trying to update some code to use CoW. The regression exists without CoW, but is even worse with it. I haven't investigated why yet.

cc @phofl @jorisvandenbossche

PS: This code should not have been using transform with a UDF 😄
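
For reference, a sketch of a vectorized equivalent that avoids the Python-level UDF entirely (untested here, but groupby.shift is a built-in per-group transform, so it should match the UDF's result for this data):

# Compare each row to the next row within its group, no UDF needed
result = df.eq(gb.shift(-1).fillna(0))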

@rhshadrach rhshadrach added Groupby, Performance, Regression, Copy / view semantics, Transformations labels Sep 23, 2023
@rhshadrach rhshadrach added this to the 2.1.2 milestone Sep 23, 2023
@rhshadrach
Member Author

Spent a while trying to profile this and had difficulty with various methods. It looks like concat is modifying its argument (the indexes, I'm guessing).

import time

# foo is the UDF from the OP
def foo(x):
    return x == x.shift(-1).fillna(0)

applied = [foo(group) for _, group in gb]

timer = time.time()
pd.concat(applied)
print(time.time() - timer)
# 5.4200828075408936

timer = time.time()
pd.concat(applied)
print(time.time() - timer)
# 0.22937822341918945

This might be completely expected (not sure), but I wanted to mention it since it might be helpful in case anyone else works on profiling.

@rhshadrach
Member Author

@phofl - I think I've tracked it down to an issue with the reference counting in BlockValuesRefs._clear_dead_references. This is called from MultiIndex.levels, and I'm seeing len(self.referenced_blocks) there grow linearly up to 12k. I'm wondering if this is caused by our use of internals in pandas.core.groupby.ops.FrameSplitter._chop, but I'm not sure.
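
A hedged way to observe this growth from the outside (this pokes at pandas 2.x internals, so the attribute names below are implementation details and may change):

# Watch the reference list on the frame's first block grow as groups are chopped
blk = df._mgr.blocks[0]
print(len(blk.refs.referenced_blocks))  # before iterating groups

for _, group in gb:  # each per-group view registers new references
    pass

print(len(blk.refs.referenced_blocks))  # substantially larger afterwards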

@phofl
Member

phofl commented Sep 25, 2023

Is this on current main?

@rhshadrach
Member Author

Yes - I'm behind by 1 commit.

@rhshadrach
Member Author

Also - the more I experiment with these groups, the more I see the reference count go up. If I run a few different operations on them (e.g. len(gb), where gb is as in the OP), it goes up even higher.

@brynpickering

@rhshadrach just weighing in here, as we saw this issue really hit performance in our implementation. It is caused specifically by grouping over multiple columns (/levels). It requires quite a large dataframe to see the difference (for a small df, O(100) rows, performance is similar); with a large df (O(1 million) rows), the time difference is more than an order of magnitude.

The two snippets below produce equivalent results, but the second is much slower in pandas 2.1.1, while the first performs the same in both 2.1.1 and 2.0.3:

# Snippet 1: nested single-column groupbys (same speed in 2.0.3 and 2.1.1)
df_dict = {}
for col1, df_slice_1 in df.groupby("col1"):
    df_dict[col1] = {}
    for col2, df_slice_2 in df_slice_1.groupby("col2"):
        df_dict[col1][col2] = df_slice_2

# Snippet 2: single multi-column groupby (much slower in 2.1.1)
df_dict = {x: {} for x in df["col1"].unique()}
for (col1, col2), df_slice in df.groupby(["col1", "col2"]):
    df_dict[col1][col2] = df_slice
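
For anyone reproducing this, a setup along these lines shows the gap at the O(1 million) row scale (column names match the snippets above; the sizes and the value column are illustrative):

import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame(
    {
        "col1": np.random.randint(0, 100, n),
        "col2": np.random.randint(0, 100, n),
        "val": np.random.rand(n),
    }
)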

@jbrockmendel
Member

Will need more attention to make sure it is safe, but I got a 4x speedup on the OP example (with CoW disabled) by pinning result._cache["levels"] = self.levels in MultiIndex._getitem_slice
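
The same pinning can be illustrated from user code (this reaches into Index._cache, the dict backing pandas' cache_readonly machinery, so it is a sketch of the idea rather than a public API):

mi = df.index                      # the MultiIndex from the OP
sub = mi[0:100]                    # slicing returns a new MultiIndex with an empty cache
sub._cache["levels"] = mi.levels   # reuse the parent's already-built levels

Without the pinned entry, every sliced index rebuilds levels on first access, and per the earlier comments that access is what churns the block reference tracking group after group.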

@phofl
Member

phofl commented Sep 27, 2023

Richard and I discussed the same solution on Slack; that should be safe.

@rhshadrach
Member Author

I also checked: this was picked up by our ASVs: https://asv-runner.github.io/asv-collection/pandas/#groupby.Apply.time_copy_function_multi_col

@rhshadrach
Member Author

I'm going to take this up - if anyone else has started working on this, happy to let you have it.

@rhshadrach rhshadrach self-assigned this Oct 2, 2023
@rhshadrach
Member Author

rhshadrach commented Oct 2, 2023

I'm still seeing a 30% slowdown between 2.0 and 2.1 without CoW, and an additional 100% slowdown in 2.1 with CoW. It seems to be due to #32669 (comment) (cc @topper-123). This is hit whenever we copy the MultiIndex, which occurs in the x == x.shift(-1) comparison in the OP.
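
For anyone bisecting this further, a quick profile can confirm whether the MultiIndex copy path dominates (standard-library cProfile on the OP's setup; nothing else assumed):

import cProfile

# Sort by cumulative time to surface the MultiIndex copy path if it is hot
cProfile.run("gb.transform(lambda x: x == x.shift(-1).fillna(0))", sort="cumtime")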

@lithomas1
Member

We're down to a ~15% slowdown now, I think.

In [5]: size = 10_000
   ...: df = pd.DataFrame(
   ...:     {
   ...:         'a': np.random.randint(0, 100, size),
   ...:         'b': np.random.randint(0, 100, size),
   ...:         'c': np.random.randint(0, 100, size),
   ...:     }
   ...: ).set_index(['a', 'b']).sort_index()
   ...: 
   ...: gb = df.groupby(['a', 'b'])

In [6]: %timeit gb.transform(lambda x: x == x.shift(-1).fillna(0))
2.35 s ± 147 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: pd.options.mode.copy_on_write=True

In [8]: size = 10_000
   ...: df = pd.DataFrame(
   ...:     {
   ...:         'a': np.random.randint(0, 100, size),
   ...:         'b': np.random.randint(0, 100, size),
   ...:         'c': np.random.randint(0, 100, size),
   ...:     }
   ...: ).set_index(['a', 'b']).sort_index()
   ...: 
   ...: gb = df.groupby(['a', 'b'])
   ...: 
   ...: %timeit gb.transform(lambda x: x == x.shift(-1).fillna(0))
2.77 s ± 148 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@lithomas1 lithomas1 modified the milestones: 2.1.4, 2.2 Dec 3, 2023
@lithomas1 lithomas1 modified the milestones: 2.2, 2.2.1 Jan 20, 2024
@lithomas1 lithomas1 modified the milestones: 2.2.1, 2.2.2 Feb 23, 2024
@lithomas1 lithomas1 modified the milestones: 2.2.2, 2.2.3 Apr 10, 2024