REF/BUG/API: factorizing categorical data #19938

Merged 14 commits on Mar 15, 2018
Commits on Feb 28, 2018
1. TomAugspurger committed Feb 23, 2018
This changes / fixes how Categorical data are factorized. The return value of a
factorized categorical is now `Tuple[ndarray[int], Categorical]`.

Before

```python
In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b']))

In [3]: l
Out[3]: array([0, 0, 1])

In [4]: u
Out[4]: array([0, 1])
```

After

```python
In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b']))

In [3]: l
Out[3]: array([0, 0, 1])

In [4]: u
Out[4]:
[a, b]
Categories (2, object): [a, b]
```

The implementation is similar to `.unique`.

1. The algo (`pd.factorize`, `pd.unique`) handles unboxing / dtype coercion.
2. The algo dispatches the actual array factorization for extension types.
3. The algo boxes the output if necessary, depending on the input.
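
The three steps above can be sketched in plain Python. This is only an illustration of the dispatch pattern, not the pandas implementation: `ToyCategorical`, its `to_list` helper, and the top-level `factorize` function are hypothetical names made up for this example, and `None` stands in for a missing value.

```python
from typing import List, Optional, Sequence, Tuple


class ToyCategorical:
    """Hypothetical stand-in for an extension array (not pandas' Categorical)."""

    def __init__(self, values: Sequence, categories: Optional[Sequence] = None):
        if categories is None:
            # First-seen order, skipping missing values (None).
            categories = dict.fromkeys(v for v in values if v is not None)
        self.categories = list(categories)
        # Integer codes into `categories`; -1 marks a missing value.
        self.codes = [
            -1 if v is None else self.categories.index(v) for v in values
        ]

    def to_list(self) -> list:
        return [None if c == -1 else self.categories[c] for c in self.codes]

    def factorize(self) -> Tuple[List[int], "ToyCategorical"]:
        """Array-specific factorization that works on the integer codes."""
        seen = {}  # code -> label, in order of first appearance
        labels = []
        for code in self.codes:
            if code == -1:
                labels.append(-1)  # missing values never enter `seen`
            else:
                labels.append(seen.setdefault(code, len(seen)))
        uniques = ToyCategorical(
            [self.categories[c] for c in seen], self.categories
        )
        return labels, uniques


def factorize(values):
    """Toy top-level algorithm following the unbox / dispatch / box steps."""
    # Dispatch: arrays that know how to factorize themselves do the work.
    if hasattr(values, "factorize"):
        return values.factorize()
    # Unbox / coerce: treat anything else as a plain sequence.
    data = list(values)
    seen = {}
    labels = [-1 if v is None else seen.setdefault(v, len(seen)) for v in data]
    # Box: return uniques in the same container type as the input if possible.
    uniques = type(values)(seen) if isinstance(values, (list, tuple)) else list(seen)
    return labels, uniques


labels, uniques = factorize(ToyCategorical(["a", "a", "b", None]))
assert labels == [0, 0, 1, -1]
assert isinstance(uniques, ToyCategorical)
assert uniques.to_list() == ["a", "b"]
```

Note that the toy checks for dispatch first and only coerces when no array-specific method exists; the ordering in the real algorithm may differ.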

I've implemented this as a new public method on `Categorical`, mainly since
this is what we do for `unique`, and I think it's a useful method to have.
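
As a quick check of the method form, the snippet below calls `Categorical.factorize()` with no arguments and compares it with `pd.factorize`. It assumes a pandas version that includes this change; keyword arguments are left out on purpose because their names have changed between releases.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["a", "a", "b"])

# Top-level function and array method should agree.
func_codes, func_uniques = pd.factorize(cat)
meth_codes, meth_uniques = cat.factorize()

assert np.array_equal(func_codes, meth_codes)
# The uniques come back boxed as a Categorical, not as raw integer codes.
assert isinstance(meth_uniques, pd.Categorical)
assert list(meth_uniques) == list(func_uniques) == ["a", "b"]
```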

This fixes a bug in factorizing categoricals with missing values. Previously, the
uniques included -1, the code used for missing values.

Before

```python
In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b', None]))

In [3]: u
Out[3]: array([ 0,  1, -1])
```

After

```python
In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b', None]))

In [3]: u
Out[3]:
[a, b]
Categories (2, object): [a, b]
```
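
The missing-value behaviour described in the commit message can be verified with a few assertions. This is a sketch that assumes a pandas version containing this fix; the round trip through `take` is added here for illustration and is not part of the original message.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["a", "a", "b", None])
codes, uniques = pd.factorize(cat)

# The missing value is marked with -1 in the codes...
assert codes.tolist() == [0, 0, 1, -1]
# ...but it no longer shows up among the uniques.
assert list(uniques) == ["a", "b"]
assert not pd.isna(uniques).any()

# Round trip: taking from the uniques with allow_fill=True turns
# the -1 entries back into missing values.
restored = uniques.take(codes, allow_fill=True)
assert list(restored.codes) == list(cat.codes)
```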
2. TomAugspurger committed Feb 28, 2018
Commits on Mar 6, 2018
1. TomAugspurger committed Mar 6, 2018
2. TomAugspurger committed Mar 6, 2018
`…rize`
3. TomAugspurger committed Mar 6, 2018
Commits on Mar 9, 2018
1. TomAugspurger committed Mar 9, 2018
2. TomAugspurger committed Mar 9, 2018
Commits on Mar 12, 2018
1. TomAugspurger committed Mar 12, 2018
`…rize`
2. TomAugspurger committed Mar 12, 2018
3. TomAugspurger committed Mar 12, 2018
`…rize`
Commits on Mar 13, 2018
1. TomAugspurger committed Mar 13, 2018
2. TomAugspurger committed Mar 13, 2018
`…rize`
3. TomAugspurger committed Mar 13, 2018
`…rize`
Commits on Mar 14, 2018
1. TomAugspurger committed Mar 14, 2018
`…rize`