Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

equality comparison with a scalar is slow for category (performance regression) #23814

Closed
colinfang opened this issue Nov 20, 2018 · 8 comments · Fixed by #23888
Closed

equality comparison with a scalar is slow for category (performance regression) #23814

colinfang opened this issue Nov 20, 2018 · 8 comments · Fixed by #23888
Labels
Categorical Categorical Data Type good first issue Performance Memory or execution speed performance
Milestone

Comments

@colinfang
Copy link

colinfang commented Nov 20, 2018

Are the following 2 ways to compare a series to a scalar equivalent (ignore missing values)? I have to write the hard way in order to take advantage of the category properties.

```python
x = pd.Series(list('abcd') * 1000000).astype('category')
%timeit x == 'a'
# 10 loops, best of 3: 25.2 ms per loop
%timeit x.cat.codes == x.cat.categories.get_loc('a')
# 1000 loops, best of 3: 750 µs per loop
```
@TomAugspurger
Copy link
Contributor

It seems like a lot of time is spent in Categorical.__init__(Series[Categorical]), when that should essentially just be

if isinstance(data, (ABCSeries, ABCIndexClass)):
    data = data._values
if isinstance(data, type(self)):
    dtype = data.dtype
    codes = data.codes

We would need to handle not-default args for categories, ordered, and codes.

Timer unit: 1e-06 s

Total time: 0.02405 s
File: /Users/taugspurger/sandbox/pandas/pandas/core/arrays/categorical.py
Function: __init__ at line 329

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   329                                               def __init__(self, values, categories=None, ordered=None, dtype=None,
   330                                                            fastpath=False):
   331                                           
   332                                                   # Ways of specifying the dtype (prioritized ordered)
   333                                                   # 1. dtype is a CategoricalDtype
   334                                                   #    a.) with known categories, use dtype.categories
   335                                                   #    b.) else with Categorical values, use values.dtype
   336                                                   #    c.) else, infer from values
   337                                                   #    d.) specifying dtype=CategoricalDtype and categories is an error
   338                                                   # 2. dtype is a string 'category'
   339                                                   #    a.) use categories, ordered
   340                                                   #    b.) use values.dtype
   341                                                   #    c.) infer from values
   342                                                   # 3. dtype is None
   343                                                   #    a.) use categories, ordered
   344                                                   #    b.) use values.dtype
   345                                                   #    c.) infer from values
   346         1          2.0      2.0      0.0          if dtype is not None:
   347                                                       # The dtype argument takes precedence over values.dtype (if any)
   348                                                       if isinstance(dtype, compat.string_types):
   349                                                           if dtype == 'category':
   350                                                               dtype = CategoricalDtype(categories, ordered)
   351                                                           else:
   352                                                               msg = "Unknown `dtype` {dtype}"
   353                                                               raise ValueError(msg.format(dtype=dtype))
   354                                                       elif categories is not None or ordered is not None:
   355                                                           raise ValueError("Cannot specify both `dtype` and `categories`"
   356                                                                            " or `ordered`.")
   357                                           
   358                                                       categories = dtype.categories
   359                                           
   360         1         13.0     13.0      0.1          elif is_categorical(values):
   361                                                       # If no "dtype" was passed, use the one from "values", but honor
   362                                                       # the "ordered" and "categories" arguments
   363         1          5.0      5.0      0.0              dtype = values.dtype._from_categorical_dtype(values.dtype,
   364         1          5.0      5.0      0.0                                                           categories, ordered)
   365                                                   else:
   366                                                       # If dtype=None and values is not categorical, create a new dtype
   367                                                       dtype = CategoricalDtype(categories, ordered)
   368                                           
   369                                                   # At this point, dtype is always a CategoricalDtype
   370                                                   # if dtype.categories is None, we are inferring
   371                                           
   372         1          1.0      1.0      0.0          if fastpath:
   373                                                       self._codes = coerce_indexer_dtype(values, categories)
   374                                                       self._dtype = self._dtype.update_dtype(dtype)
   375                                                       return
   376                                           
   377                                                   # null_mask indicates missing values we want to exclude from inference.
   378                                                   # This means: only missing values in list-likes (not arrays/ndframes).
   379         1         10.0     10.0      0.0          null_mask = np.array(False)
   380                                           
   381                                                   # sanitize input
   382         1          8.0      8.0      0.0          if is_categorical_dtype(values):
   383         1          5.0      5.0      0.0              if dtype.categories is None:
   384                                                           dtype = CategoricalDtype(values.categories, dtype.ordered)
   385                                           
   386                                                   elif not isinstance(values, (ABCIndexClass, ABCSeries)):
   387                                                       # _sanitize_array coerces np.nan to a string under certain versions
   388                                                       # of numpy
   389                                                       values = maybe_infer_to_datetimelike(values, convert_dates=True)
   390                                                       if not isinstance(values, np.ndarray):
   391                                                           values = _convert_to_list_like(values)
   392                                                           from pandas.core.series import _sanitize_array
   393                                                           # By convention, empty lists result in object dtype:
   394                                                           if len(values) == 0:
   395                                                               sanitize_dtype = 'object'
   396                                                           else:
   397                                                               sanitize_dtype = None
   398                                                           null_mask = isna(values)
   399                                                           if null_mask.any():
   400                                                               values = [values[idx] for idx in np.where(~null_mask)[0]]
   401                                                           values = _sanitize_array(values, None, dtype=sanitize_dtype)
   402                                           
   403         1          2.0      2.0      0.0          if dtype.categories is None:
   404                                                       try:
   405                                                           codes, categories = factorize(values, sort=True)
   406                                                       except TypeError:
   407                                                           codes, categories = factorize(values, sort=False)
   408                                                           if dtype.ordered:
   409                                                               # raise, as we don't have a sortable data structure and so
   410                                                               # the user should give us one by specifying categories
   411                                                               raise TypeError("'values' is not ordered, please "
   412                                                                               "explicitly specify the categories order "
   413                                                                               "by passing in a categories argument.")
   414                                                       except ValueError:
   415                                           
   416                                                           # FIXME
   417                                                           raise NotImplementedError("> 1 ndim Categorical are not "
   418                                                                                     "supported at this time")
   419                                           
   420                                                       # we're inferring from values
   421                                                       dtype = CategoricalDtype(categories, dtype.ordered)
   422                                           
   423         1          7.0      7.0      0.0          elif is_categorical_dtype(values):
   424         1        394.0    394.0      1.6              old_codes = (values.cat.codes if isinstance(values, ABCSeries)
   425                                                                    else values.codes)
   426         1          5.0      5.0      0.0              codes = _recode_for_categories(old_codes, values.dtype.categories,
   427         1      23530.0  23530.0     97.8                                             dtype.categories)
   428                                           
   429                                                   else:
   430                                                       codes = _get_codes_for_values(values, dtype.categories)
   431                                           
   432         1         27.0     27.0      0.1          if null_mask.any():
   433                                                       # Reinsert -1 placeholders for previously removed missing values
   434                                                       full_codes = - np.ones(null_mask.shape, dtype=codes.dtype)
   435                                                       full_codes[~null_mask] = codes
   436                                                       codes = full_codes
   437                                           
   438         1         27.0     27.0      0.1          self._dtype = self._dtype.update_dtype(dtype)
   439         1          9.0      9.0      0.0          self._codes = coerce_indexer_dtype(codes, dtype.categories)

@colinfang can you post your pandas version?

@TomAugspurger TomAugspurger added Performance Memory or execution speed performance Categorical Categorical Data Type labels Nov 20, 2018
@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone Nov 20, 2018
@colinfang colinfang changed the title equality comparison with a scalar is slow for category equality comparison with a scalar is slow for category (performance regression) Nov 20, 2018
@colinfang
Copy link
Author

I am using v0.23.4. The performance for v0.20.3 is good.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 20, 2018 via email

@eoveson
Copy link
Contributor

eoveson commented Nov 20, 2018

@TomAugspurger / @colinfang , I tried inserting code similar to Tom's suggestion, near the beginning of Categorical.init, and it appears it does indeed fix the perf issue. I also noticed there is a "fastpath" in that method also, not sure if that is applicable in this case.

I'd be interested in working on a fix for this issue, unless you were wanting to @colinfang. I understand there would need to be some more investigation about where exactly to do this short-circuiting, and handling non-default args. Also, are there perf tests that can be used for this type of thing (those are probably tricky since it may be machine dependent, not sure if there is a good solution for that)?

if isinstance(values, (ABCSeries, ABCIndexClass)):
    values = values._values
if isinstance(values, type(self)):
    self._dtype = values.dtype
    self._codes = values.codes
    return
In [1]: import pandas as pd

In [2]: s = pd.Series(list('abcd') * 1000000).astype('category')

In [3]: %timeit s == 'a'
3.13 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit s.cat.codes == s.cat.categories.get_loc('a')
3.25 ms ± 68.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 22, 2018 via email

@eoveson
Copy link
Contributor

eoveson commented Nov 24, 2018

One option I saw may be to short-circuit within _recode_for_categories() (not sure if this would help in any other scenarios also). It may be safer, since all of the other logic for the other parameters runs as it did before? In _recode_for_categories(), I tried adding the following code:

if new_categories.equals(old_categories):
    # Same categories, so no need to actually recode
    return codes.copy()

Which yielded:

In [3]: %timeit s == 'a'
7.29 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit s.cat.codes == s.cat.categories.get_loc('a')
3.14 ms ± 29.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

If I instead add the following code basically at the beginning of init() (the approach first discussed):

if categories is None and ordered is None and dtype is None:
    if isinstance(values, (ABCSeries, ABCIndexClass)):
        values = values._values
    if isinstance(values, type(self)):
        self._dtype = values.dtype
        self._codes = values.codes.copy()
        return

This yielded:

In [3]: %timeit s == 'a'
5.38 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So, not a huge difference between the two approaches. Also, I'm not sure whether I need to copy() the codes as I have in the above snippets. @TomAugspurger any thoughts? I can just send a PR and we can continue to discuss there if you prefer..

@TomAugspurger
Copy link
Contributor

I think the change to Categorical.__init__ is clearly good.

The downside to adding a .equals check to _recode_for_categories is that you always have to pay the cost to check .equals, even if they aren't actually equal. I suspect that in many cases where you'd end up taking the shortcut in _recode_for_categories, you'll already know whether on not the categories are equal or not. Still, there may be some cases where you can't know that, so it may be OK to just check that.

@eoveson
Copy link
Contributor

eoveson commented Nov 24, 2018

Sounds good, I will go with the init change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type good first issue Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants