Performance regression in reshape.Cut.time_qcut_timedelta #33921
Comments
I wonder if it's computing the set of labels in d04b965#diff-9a7b5c6c5eb0c115f01f8ffc015d129cR428 and d04b965#diff-9a7b5c6c5eb0c115f01f8ffc015d129cR440 (haven't looked closely).
Note there might be an earlier regression in qcut: https://pandas.pydata.org/speed/pandas/index.html#reshape.Cut.time_qcut_datetime?p-bins=10&commits=80d37adcc3d9bfbbe17e8aa626d6b5873465ca98-4f89c261f624305fc7bae6c43ae862663994be34, somewhere in 80d37ad..4f89c26.
Looked at this a bit today. Most (all?) of the slowdown is in the pandas/pandas/core/arrays/categorical.py Lines 2594 to 2618 in 506eb54
I was surprised to see that for the following, we actually convert the IntervalIndex to an ndarray of objects:

In [1]: import pandas as pd
In [2]: idx = pd.interval_range(0, 1, periods=1000)
In [3]: pd.Categorical(idx, categories=idx)

I would think that when categories is an Index already we could perhaps just use its engine to do the coding. Will verify that and push up a PR. It may be worth looking into why
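To illustrate the idea of letting the index do the coding, here is a rough sketch using only public API. It assumes `Index.get_indexer` as a stand-in for "the index's engine" and is not the change that was eventually made in the constructor; it only shows that the codes can be obtained without first materializing an object ndarray.

```python
import pandas as pd

idx = pd.interval_range(0, 1, periods=1000)

# Ask the categories index where each value lives; -1 would mean "not found".
# Using get_indexer here is an assumption for illustration, not the internal path.
codes = idx.get_indexer(idx)

# Build the Categorical directly from the integer codes.
via_indexer = pd.Categorical.from_codes(codes, categories=idx)

# Reference result from the regular constructor.
via_constructor = pd.Categorical(idx, categories=idx)

print((via_indexer.codes == via_constructor.codes).all())
```

Both routes should produce the same codes; the difference under discussion is only how expensive it is to compute them.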
This has two changes to address a performance regression in cut / qcut.

1. Avoid an unnecessary `set` conversion in cut.
2. Avoid a costly conversion to object in the Categorical constructor for dtypes we don't have a hashtable for.

```python
In [2]: idx = pd.interval_range(0, 1, periods=10000)
In [3]: %timeit pd.Categorical(idx, idx)
```

```
10.4 ms ± 351 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
256 ms ± 5.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
53.2 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

And for the qcut ASV

```
58.5 ms ± 3.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
134 ms ± 9.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
29.8 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Closes pandas-dev#33921
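For anyone wanting to time qcut on timedelta data outside of ASV, a minimal sketch follows. The benchmark's actual setup is not shown in this thread, so the series size, the random data, and the bin count of 10 below are all assumptions, and the numbers it prints will not be directly comparable to the timings quoted above.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed stand-in data: N random integers viewed as timedelta64[ns].
N = 10**5
rng = np.random.default_rng(0)
timedelta_series = pd.Series(rng.integers(0, N, size=N), dtype="timedelta64[ns]")

# One call to check the shape of the result: a categorical Series with 10 bins.
result = pd.qcut(timedelta_series, 10)

# Average wall time over a few calls.
runs = 3
elapsed = timeit.timeit(lambda: pd.qcut(timedelta_series, 10), number=runs) / runs
print(f"qcut on {N} timedeltas: {elapsed * 1e3:.1f} ms per call")
```

Running the same script against two pandas versions gives a quick before/after comparison without setting up the full ASV suite.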
Setup
https://pandas.pydata.org/speed/pandas/index.html#rolling.Methods.time_rolling?p-constructor=%27Series%27&p-window=10&p-dtype=%27float%27&p-method=%27sum%27&commits=265b8420121a66ed18329c7a90d5381aeda5454f-ad4ad22c6804e6925c4eb82f51b974c03c3036a8 points to d04b965 (cc @mabelvj)