Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF/BUG/API: factorizing categorical data #19938

Merged
merged 14 commits into from Mar 15, 2018

Conversation

@TomAugspurger
Copy link
Contributor

commented Feb 28, 2018

This changes / fixes how Categorical data are factorized. The return value of a
factorized categorical is now Tuple[ndarray[int], Categorical].

Before

In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b']))

In [3]: l
Out[3]: array([0, 0, 1])

In [4]: u
Out[4]: array([0, 1])

after

In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b']))

In [3]: l
Out[3]: array([0, 0, 1])

In [4]: u
Out[4]:
[a, b]
Categories (2, object): [a, b]

The implementation is similar to .unique.

  1. The algo (pd.factorize, pd.unique) handles unboxing / dtype coercion
  2. The algo dispatches the actual array factorization for extension types
  3. The algo boxes the output if necessary, depending on the input.

I've implemented this as a new public method on Categorical, mainly since
this is what we do for unique, and I think it's a useful method to have.

This fixes a bug in factorizing categoricals with missing values. Previously, we
included -1 in the uniques.

Before

In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b', None]))

In [3]: u
Out[3]: array([ 0,  1, -1])

After

In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b', None]))

In [3]: u
Out[3]:
[a, b]
Categories (2, object): [a, b]

xref #19721. This doesn't close it, as I haven't addressed the issue of what factorize(Series[EA]) should return. Just fixing the categorical bugs.

REF/BUG/API: factorizing categorical data
This changes / fixes how Categorical data are factorized. The return value of a
factorized categorical is now `Tuple[ndarray[int], Categorical]`.

Before

```python
In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b']))

In [3]: l
Out[3]: array([0, 0, 1])

In [4]: u
Out[4]: array([0, 1])
```

after

```python
In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b']))

In [3]: l
Out[3]: array([0, 0, 1])

In [4]: u
Out[4]:
[a, b]
Categories (2, object): [a, b]
```

The implementation is similar to `.unique`.

1. The algo (`pd.factorize`, `pd.unique`) handles unboxing / dtype coercion
2. The algo dispatches the actual array factorization for extension types
3. The algo boxes the output if necessary, depending on the input.

I've implemented this as a new public method on ``Categorical``, mainly since
this is what we do for unique, and I think it's a useful method to have.

This fixes a bug in factorizing categoricals with missing values. Previously, we
included -1 in the uniques.

Before

```python
In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b', None]))

In [3]: u
Out[3]: array([ 0,  1, -1])
```

After

```python
In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b', None]))

In [3]: u
Out[3]:
[a, b]
Categories (2, object): [a, b]
```
@codecov

This comment has been minimized.

Copy link

commented Feb 28, 2018

Codecov Report

Merging #19938 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #19938      +/-   ##
==========================================
+ Coverage   91.77%   91.78%   +<.01%     
==========================================
  Files         152      152              
  Lines       49185    49206      +21     
==========================================
+ Hits        45140    45163      +23     
+ Misses       4045     4043       -2
Flag Coverage Δ
#multiple 90.16% <100%> (ø) ⬆️
#single 41.83% <62.5%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/algorithms.py 94.29% <100%> (+0.12%) ⬆️
pandas/core/arrays/categorical.py 95.11% <100%> (+0.05%) ⬆️
pandas/util/testing.py 83.95% <0%> (+0.2%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cad6dc7...1e006d1. Read the comment docs.

import pandas as pd
import pandas.util.testing as tm


This comment has been minimized.

Copy link
@jreback

jreback Mar 4, 2018

Contributor

you should prob pull the categorical tests out of test_algos.py then

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Mar 5, 2018

Author Contributor

For better or worse, test_algos.py didn't have any tests for Categorical.

This comment has been minimized.

Copy link
@jreback

jreback Mar 7, 2018

Contributor

hmm, yeah i guess just lots of tests for unique.

from pandas.core.algorithms import _factorize_array, take_1d

codes = self.codes.astype('int64')
# We set missing codes, normally -1, to iNaT so that the

This comment has been minimized.

Copy link
@jreback

jreback Mar 4, 2018

Contributor

put the astype after the comment, looks awkward otherwise

why do you think you need to do this? the point of the na_sentinel is to select the missing values

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Mar 5, 2018

Author Contributor

na_sentinel controls the missing marker for the output. We're modifying the input, since the Int64HashTable sees that they're missing, instead of the value -1.

This comment has been minimized.

Copy link
@jreback

jreback Mar 7, 2018

Contributor

this should be fixed generally, e.g. we should have a way to pass in the missing value. can you fix

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Mar 7, 2018

Author Contributor

I'm not familiar with the hashtable code, but at a glance it looks like the null condition is included in the class definition template?

{{py:

# name, dtype, null_condition, float_group
dtypes = [('Float64', 'float64', 'val != val', True),
          ('UInt64', 'uint64', 'False', False),
          ('Int64', 'int64', 'val == iNaT', False)]

I'm not sure how to pass expressions down to cython as a parameter.

Anyway, do we actually need this to be parameterized? Do we have other cases where we've needed to pass the null condition down?

uniques = self._constructor(self.categories.take(uniques),
categories=self.categories,
ordered=self.ordered)
if sort:

This comment has been minimized.

Copy link
@jreback

jreback Mar 4, 2018

Contributor

I would remove the sort from here and just do it in factorize part (after the else), otherwise logic is in multiple places here

if is_categorical_dtype(values):
values = getattr(values, '_values', values)
labels, uniques = values.factorize(sort=sort)
dtype = original.dtype

This comment has been minimized.

Copy link
@jreback

jreback Mar 4, 2018

Contributor

see my comment below, but you might simpy dispatch on categricals and just return, mixing the impl is really confusing here.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Mar 4, 2018

can you show what exactly is the 'bug' here?

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Mar 4, 2018

can you show what exactly is the 'bug' here?

Did you see Tom's top level post with 'before' and 'after' ? (and the discussion in #19721)

@jreback

This comment has been minimized.

Copy link
Contributor

commented Mar 7, 2018

ok I see the rationale behind this, which is ok. still some question regarding the impl. this looks like a bit of a work around.

TomAugspurger added some commits Mar 9, 2018


if sort and len(uniques) > 0:
from pandas.core.sorting import safe_sort
uniques, labels = safe_sort(uniques, labels, na_sentinel=na_sentinel,
assume_unique=True)
try:

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Mar 9, 2018

Author Contributor

Removed all the sorting from Categorical.factorize. All that logic is here.

I don't think we want to just call safe_sort for two reasons

  1. that function does a lot of unnescessary work when we know that uniques is an ndarray or EA.
  2. It coerces categoricals to object ndarrays.
  3. EAs (like Categorical) may have special sorting rules.

On some small bencharks (10,000 elements) this is about 25-40% faster. The only slow case, for which we still need safe_sort, is when the array is mixed. In that case things are about 10% slower.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Mar 12, 2018

Fixed the bug in my logic from the last commit.

ASV looks good. The first time I ran it I had a 1.20 ratio (slowdown) relative to master on algorithms.Factorize.time_factorize_int(True). 0.8 ratio (speedup) the second time. No change 3rd time.

check_nulls = not is_integer_dtype(original)
labels, uniques = _factorize_array(values, check_nulls,
na_sentinel=na_sentinel,
size_hint=size_hint)

if sort and len(uniques) > 0:
from pandas.core.sorting import safe_sort

This comment has been minimized.

Copy link
@jreback

jreback Mar 13, 2018

Contributor

could move to the except (but no big deal)

values = getattr(values, '_values', values)
labels, uniques = values.factorize()
dtype = original.dtype
else:

This comment has been minimized.

Copy link
@jreback

jreback Mar 13, 2018

Contributor

shouldn't this actually be a check on the values if they have a .factorize() method (or check is_extension_array)? instead of specifically checking for categorical? (of course categorical will pass these checks). as this will then make pd.factorize(an_extension_array) work?

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Mar 13, 2018

Author Contributor

This PR is just a bugfix for categorical. But the structure will be very similar (I'll just change is_categorical_dtype to is_extension_array_dtype.)

I'll implement EA.factorize today hopefully, but have to get things like unique and argsort working first.

from pandas.core.algorithms import _factorize_array

codes = self.codes.astype('int64')
codes[codes == -1] = iNaT

This comment has been minimized.

Copy link
@jreback

jreback Mar 13, 2018

Contributor

The interface we have to hashtable.get_labels() is very odd right now, IOW we have a check_null flag which then makes the caller know to substitute values to iNaT (for int64) and know which are the sentinels. This is breaking the abstrastion. Rather would either like to be able to pass in the actual sentinel (not the output sentinel, but that's another confusion). e.g . you would simply pass -1 here.

I think its worth re-factoring this (maybe before this PR), though I suppose could be after.

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Mar 13, 2018

Author Contributor

#20328

Yes, that'd be nicer.

This comment has been minimized.

Copy link
@jreback

jreback Mar 13, 2018

Contributor

do we actually want this to be public?

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Mar 13, 2018

Author Contributor

factorize in general? I don’t see why not. It’s present on series and index.

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Mar 14, 2018

Author Contributor

#19938 (comment) was in reference to the API docs. We whitelist the methods on Categorical that are included in the API docs (just __array__ and from_codes for now).

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Mar 13, 2018

@jorisvandenbossche just confirming, but I assume we don't want to add Categorical.factorize to the API docs?

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Mar 13, 2018

@jorisvandenbossche just confirming, but I assume we don't want to add Categorical.factorize to the API docs?

That maybe depends on the discussion we will have on that for ExtensionArray?

from pandas.core.algorithms import _factorize_array

codes = self.codes.astype('int64')
codes[codes == -1] = iNaT

This comment has been minimized.

Copy link
@jreback

jreback Mar 13, 2018

Contributor

do we actually want this to be public?

@jreback

This comment has been minimized.

Copy link
Contributor

commented Mar 14, 2018

ahh I see we are already exposing factorize on Series. ok, then this should be a part of the EA interface. (can be separate PR).

otherwise this ok? can you rebase

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Mar 14, 2018

Yeah, this should be good to go.

@jreback jreback merged commit 38afa93 into pandas-dev:master Mar 15, 2018

3 checks passed

ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jreback

This comment has been minimized.

Copy link
Contributor

commented Mar 15, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.