Performance regression in reshape.Cut.time_qcut_timedelta #33921

Closed
TomAugspurger opened this issue May 1, 2020 · 3 comments · Fixed by #34952
Labels
cut cut, qcut Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version

@TomAugspurger
Contributor

TomAugspurger commented May 1, 2020

Setup

import numpy as np
import pandas as pd

N = 10 ** 5
bins = 1000
timedelta_series = pd.Series(
    np.random.randint(N, size=N), dtype="timedelta64[ns]"
)


%timeit pd.qcut(timedelta_series, bins)
# 1.0.2
57.7 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# master
139 ms ± 2.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
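Outside IPython, where `%timeit` is unavailable, the same measurement can be approximated with the standard-library `timeit` module. This is a rough sketch, not the ASV benchmark itself: the repeat counts are arbitrary choices, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

N = 10 ** 5
bins = 1000
timedelta_series = pd.Series(
    np.random.randint(N, size=N), dtype="timedelta64[ns]"
)

# Run once up front so the result can be inspected.
result = pd.qcut(timedelta_series, bins)

# Time 5 calls per run, 3 runs, and report the best run (assumed settings).
runs = timeit.repeat(lambda: pd.qcut(timedelta_series, bins), number=5, repeat=3)
best_ms = min(runs) / 5 * 1000
print(f"pandas {pd.__version__}: {best_ms:.1f} ms per call (best of 3)")
```

Running this under each pandas version being compared gives numbers that can be set side by side like the ones above.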

https://pandas.pydata.org/speed/pandas/index.html#rolling.Methods.time_rolling?p-constructor=%27Series%27&p-window=10&p-dtype=%27float%27&p-method=%27sum%27&commits=265b8420121a66ed18329c7a90d5381aeda5454f-ad4ad22c6804e6925c4eb82f51b974c03c3036a8 points to d04b965 (cc @mabelvj)

@TomAugspurger TomAugspurger added Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version cut cut, qcut labels May 1, 2020
@TomAugspurger TomAugspurger added this to the 1.1 milestone May 1, 2020
@TomAugspurger
Contributor Author

I wonder if it's computing the set of labels in d04b965#diff-9a7b5c6c5eb0c115f01f8ffc015d129cR428 and d04b965#diff-9a7b5c6c5eb0c115f01f8ffc015d129cR440 (haven't looked closely).

@TomAugspurger
Contributor Author

Looked at this a bit today. Most (all?) of the slowdown is in the Categorical constructor. Mostly in

def _get_codes_for_values(values, categories):
    """
    utility routine to turn values into codes given the specified categories
    """
    dtype_equal = is_dtype_equal(values.dtype, categories.dtype)

    if is_extension_array_dtype(categories.dtype) and is_object_dtype(values):
        # Support inferring the correct extension dtype from an array of
        # scalar objects. e.g.
        # Categorical(array[Period, Period], categories=PeriodIndex(...))
        cls = categories.dtype.construct_array_type()
        values = maybe_cast_to_extension_array(cls, values)
        if not isinstance(values, cls):
            # exception raised in _from_sequence
            values = ensure_object(values)
            categories = ensure_object(categories)
    elif not dtype_equal:
        values = ensure_object(values)
        categories = ensure_object(categories)

    hash_klass, vals = _get_data_algo(values)
    _, cats = _get_data_algo(categories)
    t = hash_klass(len(cats))
    t.map_locations(cats)
    return coerce_indexer_dtype(t.lookup(vals), cats)
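For readers unfamiliar with the hashtable step at the end of that function, here is a rough pure-Python analogue. A plain `dict` stands in for the pandas hashtable class, and the name `codes_for_values_sketch` is hypothetical. It only illustrates the map-then-lookup idea, with `-1` marking values that are not among the categories.

```python
def codes_for_values_sketch(values, categories):
    # Map each category to its position (analogous to t.map_locations(cats)).
    locations = {cat: i for i, cat in enumerate(categories)}
    # Look up every value; -1 marks a value that is not a category
    # (analogous to t.lookup(vals)).
    return [locations.get(v, -1) for v in values]


codes = codes_for_values_sketch(["b", "a", "z", "b"], ["a", "b", "c"])
print(codes)
```

The lookup is cheap once the keys are hashable in their native form. The cost discussed here comes from first converting values and categories to object arrays.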

I was surprised to see that for the following, we actually convert the IntervalIndex to an ndarray of objects

In [1]: import pandas as pd

In [2]: idx = pd.interval_range(0, 1, periods=1000)

In [3]: pd.Categorical(idx, categories=idx)

I would think that when categories is already an Index, we could perhaps just use its engine to do the coding. Will verify that and push up a PR.
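A minimal sketch of that idea, using only public API: ask the Index for the positions via `get_indexer` and build the Categorical from those codes. This is an illustration of the approach, not necessarily what the eventual PR does internally.

```python
import numpy as np
import pandas as pd

idx = pd.interval_range(0, 1, periods=1000)

# Let the Index compute positions directly, rather than converting
# the intervals to an object ndarray and hashing them.
codes = idx.get_indexer(idx)

# Build the Categorical from the precomputed codes.
cat = pd.Categorical.from_codes(codes, categories=idx)
```

Values that are absent from the categories come back from `get_indexer` as `-1`, which matches the code that Categorical uses for missing entries.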

It may be worth looking into why _get_data_algo is apparently slower than in 1.0.4, at least for IntervalIndex.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jun 23, 2020
This has two changes to address a performance regression in
cut / qcut.

1. Avoid an unnecessary `set` conversion in cut.
2. Avoid a costly conversion to object in the Categorical constructor
   for dtypes we don't have a hashtable for.

```python
In [2]: idx = pd.interval_range(0, 1, periods=10000)
In [3]: %timeit pd.Categorical(idx, idx)
```

```
10.4 ms ± 351 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

256 ms ± 5.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

53.2 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

And for the qcut ASV

```
58.5 ms ± 3.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

134 ms ± 9.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

29.8 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Closes pandas-dev#33921