REF/BUG/API: factorizing categorical data #19938

Merged
merged 14 commits into from Mar 15, 2018
Commits on Feb 28, 2018
  1. REF/BUG/API: factorizing categorical data

    TomAugspurger committed Feb 23, 2018
    This changes and fixes how Categorical data are factorized. Factorizing a
    categorical now returns `Tuple[ndarray[int], Categorical]`: the integer labels
    plus the unique values as a Categorical.
    
    Before
    
    ```python
    In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b']))
    
    In [3]: l
    Out[3]: array([0, 0, 1])
    
    In [4]: u
    Out[4]: array([0, 1])
    ```
    
    After
    
    ```python
    In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b']))
    
    In [3]: l
    Out[3]: array([0, 0, 1])
    
    In [4]: u
    Out[4]:
    [a, b]
    Categories (2, object): [a, b]
    ```
    
    The implementation is similar to `.unique`.
    
    1. The algo (`pd.factorize`, `pd.unique`) handles unboxing / dtype coercion
    2. The algo dispatches the actual array factorization for extension types
    3. The algo boxes the output if necessary, depending on the input.
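The three steps above can be illustrated with a toy, pure-Python sketch. `ToyCategorical` and this `factorize` function are hypothetical stand-ins written for this example; they are not the pandas implementation and only mimic its shape (integer codes, with -1 as the missing-value code).

```python
class ToyCategorical:
    """Hypothetical stand-in for an extension array: integer codes plus categories."""

    def __init__(self, codes, categories):
        self.codes = list(codes)
        self.categories = list(categories)

    def factorize(self):
        # Array-specific factorization: label codes in order of first
        # appearance; the missing-value code -1 stays -1 and is not a unique.
        seen = {}
        labels = []
        for code in self.codes:
            if code == -1:
                labels.append(-1)
            else:
                labels.append(seen.setdefault(code, len(seen)))
        # Box the uniques in the same array type, keeping the categories.
        return labels, ToyCategorical(list(seen), self.categories)


def factorize(values):
    # Dispatch to the array's own method when it provides one.
    if hasattr(values, "factorize"):
        return values.factorize()
    # Fallback for plain sequences: no unboxing or boxing needed.
    seen = {}
    labels = [seen.setdefault(v, len(seen)) for v in values]
    return labels, list(seen)


labels, uniques = factorize(ToyCategorical([1, 1, 0, -1], ["a", "b"]))
plain_labels, plain_uniques = factorize(["x", "y", "x"])
```

The point of the design is that the top-level function never needs to know how a given array type stores its data; it only needs the array to return labels and uniques of its own type.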
    
    I've implemented this as a new public method on `Categorical`, mainly since
    this is what we do for unique, and I think it's a useful method to have.
    
    This also fixes a bug in factorizing categoricals with missing values.
    Previously, the uniques included -1, the code used for missing values.
    
    Before
    
    ```python
    In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b', None]))
    
    In [3]: u
    Out[3]: array([ 0,  1, -1])
    ```
    
    After
    
    ```python
    In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b', None]))
    
    In [3]: u
    Out[3]:
    [a, b]
    Categories (2, object): [a, b]
    ```
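As a quick check of the fixed behavior, the snippet below assumes a pandas version that includes this change. It confirms that the missing value is labeled -1 and left out of the uniques, and that the remaining labels index back into the uniques. The variable names are only illustrative.

```python
import pandas as pd

cat = pd.Categorical(["a", "a", "b", None])
labels, uniques = pd.factorize(cat)

# The missing value gets the sentinel label -1 and is not among the uniques.
missing = labels == -1

# Every non-missing label is a valid position in the uniques, so taking
# those positions reconstructs the non-missing input values.
restored = uniques.take(labels[~missing])
```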
  2. Explicit dtype for expected

    TomAugspurger committed Feb 28, 2018
Commits on Mar 6, 2018
  1. Restore sort

    TomAugspurger committed Mar 6, 2018
Commits on Mar 9, 2018
  1. REF: remove sort from Categorical.factorize

    TomAugspurger committed Mar 9, 2018
  2. Updated comment

    TomAugspurger committed Mar 9, 2018
Commits on Mar 12, 2018
  1. Fixed new sort algo

    TomAugspurger committed Mar 12, 2018
  2. Merge remote-tracking branch 'upstream/master' into categorical-factorize

    TomAugspurger committed Mar 12, 2018
Commits on Mar 13, 2018
  1. Implement interface

    TomAugspurger committed Mar 13, 2018
  2. Merge remote-tracking branch 'upstream/master' into categorical-factorize

    TomAugspurger committed Mar 13, 2018
Commits on Mar 14, 2018
  1. Merge remote-tracking branch 'upstream/master' into categorical-factorize

    TomAugspurger committed Mar 14, 2018