BUG: cut does not respect order of passed labels #16459

Closed
ProsperousHeart opened this Issue May 23, 2017 · 8 comments

Comments

Projects
None yet
4 participants
Contributor

ProsperousHeart commented May 23, 2017 edited by jorisvandenbossche

This issue was found when working on #16432

    Examples
    --------
    >>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, retbins=True)
    ([(0.19, 3.367], (0.19, 3.367], (0.19, 3.367], (3.367, 6.533], (6.533, 9.7], (0.19, 3.367]]
    Categories (3, interval[float64]): [(0.19, 3.367] < (3.367, 6.533] < 
    ...        (6.533, 9.7]], array([ 0.1905    ,  3.36666667,  6.53333333,
    ...         9.7       ]))

    >>> result = pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]),
    ...                          3, labels=["good","medium","bad"])
    [good, good, good, medium, bad, good]
    Categories (3, object): [good < medium < bad]

Problem description

When running pytest -x --doctest-modules core/reshape/tile.py against the work inside of the "reshape-3439" branch, we see the following:

(pandas_dev) C:\Users\kkeeton\AppData\Local\conda\conda\envs\pandas_dev\Lib\site-packages\pandas>pytest -x --doctest-modules core/reshape/tile.py
============================= test session starts =============================
platform win32 -- Python 3.6.1, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
rootdir: C:\Users\kkeeton\AppData\Local\conda\conda\envs\pandas_dev\Lib\site-packages\pandas, inifile:
plugins: cov-2.3.1
collected 2 items

core\reshape\tile.py F

================================== FAILURES ===================================
___________________ [doctest] pandas.core.reshape.tile.cut ____________________
072     the resulting Categorical object
073
074
075     Examples
076     --------
077     >>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, retbins=True)
078     ([(0.19, 3.367], (0.19, 3.367], (0.19, 3.367], (3.367, 6.533], (6.533, 9.7], (0.19, 3.367]]
079     Categories (3, interval[float64]): [(0.19, 3.367] < (3.367, 6.533] < (6.533, 9.7]], array([ 0.1905    ,  3.36666667,  6.53333333,  9.7       ]))
080
081     >>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]),
Expected:
    [good, good, good, medium, bad, good]
    Categories (3, object): [good < medium < bad]
Got:
    [good, good, good, medium, bad, good]
    Categories (3, object): [bad < good < medium]

C:\Users\kkeeton\AppData\Local\conda\conda\envs\pandas_dev\Lib\site-packages\pandas\core\reshape\tile.py:81: DocTestFailure

This is backwards.

Expected Output

Should clear this - this is not an issue.

Output of pd.show_versions()

Unable to run the pd.show_versions()

jorisvandenbossche added this to the 0.20.2 milestone May 23, 2017

jorisvandenbossche changed the title from DocTest Shows Wrong Order to BUG: cut does not respect order of passed labels May 23, 2017

In [3]: pd.__version__
Out[3]: '0.19.2'

In [4]: pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, labels=["good","medium","bad"])
Out[4]: 
[good, good, good, medium, bad, good]
Categories (3, object): [good < medium < bad]
In [12]: pd.__version__
Out[12]: '0.21.0.dev+68.gcc734ba'

In [13]: pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, labels=["good","medium","bad"])
Out[13]: 
[good, good, good, medium, bad, good]
Categories (3, object): [bad < good < medium]
Contributor

economy commented May 23, 2017 edited

I'm looking into this.

core/reshape/tile.py line 255: looks like pd.Categorical(ordered=True) orders the labels alphabetically regardless of how they're passed in.

In [14]: labels = ['okay', 'good', 'bad']
In [15]: pd.Categorical(labels, ordered=True)

Out [15]: [okay, good, bad]
Categories (3, object): [bad < good < okay]
Contributor

economy commented May 23, 2017

If I have a categorical, and I want it to be ordered, and I pass a list of strings as labels, it doesn't seem like I'd ever want the sorted version of those labels to be my rank ordering. Is there a case where I'd actually want to sort those strings alphabetically?

Contributor

TomAugspurger commented May 23, 2017

@economy pd.Categorical(labels, categories=labels, ordered=True). I thought it respected the observed order with ordered=True when categories=None, but I guess not.

Contributor

economy commented May 23, 2017

So, if I'm reading this correctly, updating the test to explicitly set categories fixes the issue and follows the appropriate way to use pd.Categorical

Contributor

TomAugspurger commented May 23, 2017

I think we need a new test in test_tile.py You can just copy the example #16459 (comment) and assert that it matches the output from 0.19.2 (with the order "good" < "medium" < "bad".

Ideally you would

  1. checkout a new branch
  2. write the test
  3. verify that the test fails
  4. write the fix (by passing categories=labels
  5. document the change
Contributor

economy commented May 23, 2017 edited

@TomAugspurger, I can create a new test for this, but pd.cut() still doesn't act as expected.

If I update core/reshape/tile.py line 255 to call pd.Categorical using the categories attribute correctly, the original test passes and output for pd.cut() is as described (expected) above.

Contributor

TomAugspurger commented May 23, 2017

the original test passes and output for pd.cut() is as described (expected) above.

Do you mean the one reported by @ProsperousHeart? That is a doctest, which aren't setup to run automatically (yet, @ProsperousHeart is getting them fixed up so that we can do that). In the meantime we need a regular unit test to verify that the fix works.

@economy economy added a commit to economy/pandas that referenced this issue May 23, 2017

@economy economy adding whatsnew update for #16459 d8bda87

@TomAugspurger TomAugspurger added a commit to economy/pandas that referenced this issue May 30, 2017

@economy @TomAugspurger economy + TomAugspurger adding whatsnew update for #16459 ac78a64

jreback closed this in d419be4 Jun 1, 2017

@TomAugspurger TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jun 1, 2017

@economy @TomAugspurger economy + TomAugspurger BUG: fixed wrong order of ordered labels in pd.cut()
closes #16459

Author: economy <the.economy@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Jeff Reback <jeff.reback@twosigma.com>

Closes #16466 from economy/fix_cut and squashes the following commits:

29128b3 [economy] comments and whatsnew edits
3898b72 [economy] BUG: fixed wrong order of ordered labels in pd.cut()

(cherry picked from commit d419be4)
4806a11

@TomAugspurger TomAugspurger added a commit that referenced this issue Jun 4, 2017

@economy @TomAugspurger economy + TomAugspurger BUG: fixed wrong order of ordered labels in pd.cut()
closes #16459

Author: economy <the.economy@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Jeff Reback <jeff.reback@twosigma.com>

Closes #16466 from economy/fix_cut and squashes the following commits:

29128b3 [economy] comments and whatsnew edits
3898b72 [economy] BUG: fixed wrong order of ordered labels in pd.cut()

(cherry picked from commit d419be4)
418dcb3

@Kiv Kiv added a commit to Kiv/pandas that referenced this issue Jun 11, 2017

@economy @Kiv economy + Kiv BUG: fixed wrong order of ordered labels in pd.cut()
closes #16459

Author: economy <the.economy@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Jeff Reback <jeff.reback@twosigma.com>

Closes #16466 from economy/fix_cut and squashes the following commits:

29128b3 [economy] comments and whatsnew edits
3898b72 [economy] BUG: fixed wrong order of ordered labels in pd.cut()
785887a

@stangirala stangirala added a commit to stangirala/pandas that referenced this issue Jun 11, 2017

@economy @stangirala economy + stangirala BUG: fixed wrong order of ordered labels in pd.cut()
closes #16459

Author: economy <the.economy@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Jeff Reback <jeff.reback@twosigma.com>

Closes #16466 from economy/fix_cut and squashes the following commits:

29128b3 [economy] comments and whatsnew edits
3898b72 [economy] BUG: fixed wrong order of ordered labels in pd.cut()
2cecd75

@guillemborrell guillemborrell added a commit to guillemborrell/pandas that referenced this issue Jul 7, 2017

@economy @guillemborrell economy + guillemborrell BUG: fixed wrong order of ordered labels in pd.cut()
closes #16459

Author: economy <the.economy@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Jeff Reback <jeff.reback@twosigma.com>

Closes #16466 from economy/fix_cut and squashes the following commits:

29128b3 [economy] comments and whatsnew edits
3898b72 [economy] BUG: fixed wrong order of ordered labels in pd.cut()
98ce9eb
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment