Performance regression in reshape.Cut.time_qcut_timedelta #33921

TomAugspurger · 2020-05-01T15:21:00Z

Setup

import numpy as np
import pandas as pd

N = 10 ** 5
bins = 1000
timedelta_series = pd.Series(
    np.random.randint(N, size=N), dtype="timedelta64[ns]"
)


%timeit pd.qcut(timedelta_series, bins)

# 1.0.2
57.7 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# master
139 ms ± 2.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
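For anyone reproducing this outside IPython, here is a rough sketch of the same measurement using the standard-library `timeit` module in place of `%timeit`. The repeat counts and the `per_call_ms` name are arbitrary choices for illustration, not part of the ASV benchmark.

```python
import timeit

import numpy as np
import pandas as pd

N = 10 ** 5
bins = 1000
timedelta_series = pd.Series(
    np.random.randint(N, size=N), dtype="timedelta64[ns]"
)

# Best of 5 repeats with 5 calls each, reported per call in milliseconds.
# (Assumed settings; %timeit picks its own loop counts.)
number = 5
best = min(
    timeit.repeat(lambda: pd.qcut(timedelta_series, bins), repeat=5, number=number)
)
per_call_ms = best / number * 1e3
print(f"pandas {pd.__version__}: {per_call_ms:.1f} ms per qcut call")
```

Running this under each pandas version of interest should give numbers comparable to the ones above, although absolute timings will differ by machine.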

https://pandas.pydata.org/speed/pandas/index.html#rolling.Methods.time_rolling?p-constructor=%27Series%27&p-window=10&p-dtype=%27float%27&p-method=%27sum%27&commits=265b8420121a66ed18329c7a90d5381aeda5454f-ad4ad22c6804e6925c4eb82f51b974c03c3036a8 points to d04b965 (cc @mabelvj)


TomAugspurger · 2020-05-01T15:22:13Z

I wonder if it's computing the set of labels in d04b965#diff-9a7b5c6c5eb0c115f01f8ffc015d129cR428 and d04b965#diff-9a7b5c6c5eb0c115f01f8ffc015d129cR440 (haven't looked closely).

TomAugspurger · 2020-05-01T15:37:24Z

Note there might be an earlier regression in qcut: https://pandas.pydata.org/speed/pandas/index.html#reshape.Cut.time_qcut_datetime?p-bins=10&commits=80d37adcc3d9bfbbe17e8aa626d6b5873465ca98-4f89c261f624305fc7bae6c43ae862663994be34, somewhere in 80d37ad..4f89c26.

TomAugspurger · 2020-06-22T17:59:30Z

Looked at this a bit today. Most (all?) of the slowdown is in the Categorical constructor, mostly in `_get_codes_for_values` (pandas/pandas/core/arrays/categorical.py, lines 2594 to 2618 in 506eb54):

    def _get_codes_for_values(values, categories):
        """
        utility routine to turn values into codes given the specified categories
        """
        dtype_equal = is_dtype_equal(values.dtype, categories.dtype)

        if is_extension_array_dtype(categories.dtype) and is_object_dtype(values):
            # Support inferring the correct extension dtype from an array of
            # scalar objects. e.g.
            # Categorical(array[Period, Period], categories=PeriodIndex(...))
            cls = categories.dtype.construct_array_type()
            values = maybe_cast_to_extension_array(cls, values)
            if not isinstance(values, cls):
                # exception raised in _from_sequence
                values = ensure_object(values)
                categories = ensure_object(categories)
        elif not dtype_equal:
            values = ensure_object(values)
            categories = ensure_object(categories)

        hash_klass, vals = _get_data_algo(values)
        _, cats = _get_data_algo(categories)
        t = hash_klass(len(cats))
        t.map_locations(cats)
        return coerce_indexer_dtype(t.lookup(vals), cats)
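The last four lines of that function are the actual coding step: build a hashtable from category to position, then look each value up in it. As a toy illustration only, the same idea can be sketched with a plain `dict`. The function name below is hypothetical, and the `-1` for values that are not categories reflects my understanding of how the pandas hashtables report misses.

```python
def get_codes_with_dict(values, categories):
    """Toy stand-in for the map_locations / lookup pair (not pandas code)."""
    # "map_locations": remember the position of each category.
    locations = {cat: i for i, cat in enumerate(categories)}
    # "lookup": position of each value, -1 when it is not a category.
    return [locations.get(val, -1) for val in values]


codes = get_codes_with_dict(["b", "a", "z", "b"], ["a", "b", "c"])
print(codes)  # [1, 0, -1, 1]
```

The real hashtables are typed (int64, float64, object, ...), which is why dtypes without a dedicated table fall back to the object one.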

I was surprised to see that for the following, we actually convert the IntervalIndex to an ndarray of objects:

In [1]: import pandas as pd

In [2]: idx = pd.interval_range(0, 1, periods=1000)

In [3]: pd.Categorical(idx, categories=idx)
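To get a feel for what that conversion means, the sketch below materializes the same IntervalIndex as an object ndarray with `np.asarray`. This is only a stand-in for the internal `ensure_object` call (an assumption on my part that the two are comparable), but it shows that every element becomes a separate Python-level `Interval`.

```python
import numpy as np
import pandas as pd

idx = pd.interval_range(0, 1, periods=1000)

# Stand-in for the internal object conversion: one boxed Interval per element.
as_objects = np.asarray(idx)

print(as_objects.dtype)     # object
print(type(as_objects[0]))  # a pandas Interval scalar
```

Hashing and comparing those boxed scalars one by one is presumably where the time goes.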

I would think that when categories is already an Index we could perhaps just use its engine to do the coding. Will verify that and push up a PR.
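A minimal sketch of that idea, using the public `Index.get_indexer` method rather than touching the engine directly. The helper name `codes_via_index` is hypothetical, and this is not necessarily what the eventual PR does.

```python
import numpy as np
import pandas as pd


def codes_via_index(values, categories):
    """Hypothetical helper: code `values` against an Index of categories."""
    categories = pd.Index(categories)
    # get_indexer returns the position of each value in `categories`.
    return categories.get_indexer(values)


idx = pd.interval_range(0, 1, periods=1000)

codes = codes_via_index(idx, idx)
subset_codes = codes_via_index(idx[[5, 0, 999]], idx)

# Same codes as the Categorical constructor produces for this input.
expected = pd.Categorical(idx, categories=idx).codes
print((codes == expected).all())
```

This relies on the categories being unique (and, for an IntervalIndex, non-overlapping), which holds for the bins that cut and qcut produce here.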

It may be worth looking into why `_get_data_algo` is apparently slower than in 1.0.4, at least for IntervalIndex.

This has two changes to address a performance regression in cut / qcut.

1. Avoid an unnecessary `set` conversion in cut.
2. Avoid a costly conversion to object in the Categorical constructor for dtypes we don't have a hashtable for.

```python
In [2]: idx = pd.interval_range(0, 1, periods=10000)

In [3]: %timeit pd.Categorical(idx, idx)
```

```
10.4 ms ± 351 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
256 ms ± 5.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
53.2 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

And for the qcut ASV

```
58.5 ms ± 3.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
134 ms ± 9.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
29.8 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Closes pandas-dev#33921

TomAugspurger added the Performance (memory or execution speed performance), Regression (functionality that used to work in a prior pandas version), and cut (cut, qcut) labels May 1, 2020

TomAugspurger added this to the 1.1 milestone May 1, 2020

TomAugspurger mentioned this issue Jun 23, 2020

PERF: Fixed cut regression, improve Categorical #34952

Merged

jreback closed this as completed in #34952 Jun 24, 2020
