BUG: cut does not respect order of passed labels #16459

Closed
ProsperousHeart opened this Issue May 23, 2017 · 8 comments

Comments

Projects
None yet
4 participants
@ProsperousHeart
Contributor

ProsperousHeart commented May 23, 2017

This issue was found when working on #16432

    Examples
    --------
    >>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, retbins=True)
    ([(0.19, 3.367], (0.19, 3.367], (0.19, 3.367], (3.367, 6.533], (6.533, 9.7], (0.19, 3.367]]
    Categories (3, interval[float64]): [(0.19, 3.367] < (3.367, 6.533] < 
    ...        (6.533, 9.7]], array([ 0.1905    ,  3.36666667,  6.53333333,
    ...         9.7       ]))

    >>> result = pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]),
    ...                          3, labels=["good","medium","bad"])
    [good, good, good, medium, bad, good]
    Categories (3, object): [good < medium < bad]

Problem description

When running pytest -x --doctest-modules core/reshape/tile.py against the work inside of the "reshape-3439" branch, we see the following:

(pandas_dev) C:\Users\kkeeton\AppData\Local\conda\conda\envs\pandas_dev\Lib\site-packages\pandas>pytest -x --doctest-modules core/reshape/tile.py
============================= test session starts =============================
platform win32 -- Python 3.6.1, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
rootdir: C:\Users\kkeeton\AppData\Local\conda\conda\envs\pandas_dev\Lib\site-packages\pandas, inifile:
plugins: cov-2.3.1
collected 2 items

core\reshape\tile.py F

================================== FAILURES ===================================
___________________ [doctest] pandas.core.reshape.tile.cut ____________________
072     the resulting Categorical object
073
074
075     Examples
076     --------
077     >>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, retbins=True)
078     ([(0.19, 3.367], (0.19, 3.367], (0.19, 3.367], (3.367, 6.533], (6.533, 9.7], (0.19, 3.367]]
079     Categories (3, interval[float64]): [(0.19, 3.367] < (3.367, 6.533] < (6.533, 9.7]], array([ 0.1905    ,  3.36666667,  6.53333333,  9.7       ]))
080
081     >>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]),
Expected:
    [good, good, good, medium, bad, good]
    Categories (3, object): [good < medium < bad]
Got:
    [good, good, good, medium, bad, good]
    Categories (3, object): [bad < good < medium]

C:\Users\kkeeton\AppData\Local\conda\conda\envs\pandas_dev\Lib\site-packages\pandas\core\reshape\tile.py:81: DocTestFailure

This is backwards.

Expected Output

Should clear this - this is not an issue.

Output of pd.show_versions()

Unable to run the pd.show_versions()

@jorisvandenbossche jorisvandenbossche added this to the 0.20.2 milestone May 23, 2017

@jorisvandenbossche jorisvandenbossche changed the title from DocTest Shows Wrong Order to BUG: cut does not respect order of passed labels May 23, 2017

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche May 23, 2017

Member
In [3]: pd.__version__
Out[3]: '0.19.2'

In [4]: pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, labels=["good","medium","bad"])
Out[4]: 
[good, good, good, medium, bad, good]
Categories (3, object): [good < medium < bad]
In [12]: pd.__version__
Out[12]: '0.21.0.dev+68.gcc734ba'

In [13]: pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, labels=["good","medium","bad"])
Out[13]: 
[good, good, good, medium, bad, good]
Categories (3, object): [bad < good < medium]
Member

jorisvandenbossche commented May 23, 2017

In [3]: pd.__version__
Out[3]: '0.19.2'

In [4]: pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, labels=["good","medium","bad"])
Out[4]: 
[good, good, good, medium, bad, good]
Categories (3, object): [good < medium < bad]
In [12]: pd.__version__
Out[12]: '0.21.0.dev+68.gcc734ba'

In [13]: pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, labels=["good","medium","bad"])
Out[13]: 
[good, good, good, medium, bad, good]
Categories (3, object): [bad < good < medium]
@economy

This comment has been minimized.

Show comment
Hide comment
@economy

economy May 23, 2017

Contributor

I'm looking into this.

core/reshape/tile.py line 255: looks like pd.Categorical(ordered=True) orders the labels alphabetically regardless of how they're passed in.

In [14]: labels = ['okay', 'good', 'bad']
In [15]: pd.Categorical(labels, ordered=True)

Out [15]: [okay, good, bad]
Categories (3, object): [bad < good < okay]
Contributor

economy commented May 23, 2017

I'm looking into this.

core/reshape/tile.py line 255: looks like pd.Categorical(ordered=True) orders the labels alphabetically regardless of how they're passed in.

In [14]: labels = ['okay', 'good', 'bad']
In [15]: pd.Categorical(labels, ordered=True)

Out [15]: [okay, good, bad]
Categories (3, object): [bad < good < okay]
@economy

This comment has been minimized.

Show comment
Hide comment
@economy

economy May 23, 2017

Contributor

If I have a categorical, and I want it to be ordered, and I pass a list of strings as labels, it doesn't seem like I'd ever want the sorted version of those labels to be my rank ordering. Is there a case where I'd actually want to sort those strings alphabetically?

Contributor

economy commented May 23, 2017

If I have a categorical, and I want it to be ordered, and I pass a list of strings as labels, it doesn't seem like I'd ever want the sorted version of those labels to be my rank ordering. Is there a case where I'd actually want to sort those strings alphabetically?

@TomAugspurger

This comment has been minimized.

Show comment
Hide comment
@TomAugspurger

TomAugspurger May 23, 2017

Contributor

@economy pd.Categorical(labels, categories=labels, ordered=True). I thought it respected the observed order with ordered=True when categories=None, but I guess not.

Contributor

TomAugspurger commented May 23, 2017

@economy pd.Categorical(labels, categories=labels, ordered=True). I thought it respected the observed order with ordered=True when categories=None, but I guess not.

@economy

This comment has been minimized.

Show comment
Hide comment
@economy

economy May 23, 2017

Contributor

So, if I'm reading this correctly, updating the test to explicitly set categories fixes the issue and follows the appropriate way to use pd.Categorical

Contributor

economy commented May 23, 2017

So, if I'm reading this correctly, updating the test to explicitly set categories fixes the issue and follows the appropriate way to use pd.Categorical

@TomAugspurger

This comment has been minimized.

Show comment
Hide comment
@TomAugspurger

TomAugspurger May 23, 2017

Contributor

I think we need a new test in test_tile.py You can just copy the example #16459 (comment) and assert that it matches the output from 0.19.2 (with the order "good" < "medium" < "bad".

Ideally you would

  1. checkout a new branch
  2. write the test
  3. verify that the test fails
  4. write the fix (by passing categories=labels
  5. document the change
Contributor

TomAugspurger commented May 23, 2017

I think we need a new test in test_tile.py You can just copy the example #16459 (comment) and assert that it matches the output from 0.19.2 (with the order "good" < "medium" < "bad".

Ideally you would

  1. checkout a new branch
  2. write the test
  3. verify that the test fails
  4. write the fix (by passing categories=labels
  5. document the change
@economy

This comment has been minimized.

Show comment
Hide comment
@economy

economy May 23, 2017

Contributor

@TomAugspurger, I can create a new test for this, but pd.cut() still doesn't act as expected.

If I update core/reshape/tile.py line 255 to call pd.Categorical using the categories attribute correctly, the original test passes and output for pd.cut() is as described (expected) above.

Contributor

economy commented May 23, 2017

@TomAugspurger, I can create a new test for this, but pd.cut() still doesn't act as expected.

If I update core/reshape/tile.py line 255 to call pd.Categorical using the categories attribute correctly, the original test passes and output for pd.cut() is as described (expected) above.

@TomAugspurger

This comment has been minimized.

Show comment
Hide comment
@TomAugspurger

TomAugspurger May 23, 2017

Contributor

the original test passes and output for pd.cut() is as described (expected) above.

Do you mean the one reported by @ProsperousHeart? That is a doctest, which aren't setup to run automatically (yet, @ProsperousHeart is getting them fixed up so that we can do that). In the meantime we need a regular unit test to verify that the fix works.

Contributor

TomAugspurger commented May 23, 2017

the original test passes and output for pd.cut() is as described (expected) above.

Do you mean the one reported by @ProsperousHeart? That is a doctest, which aren't setup to run automatically (yet, @ProsperousHeart is getting them fixed up so that we can do that). In the meantime we need a regular unit test to verify that the fix works.

economy added a commit to economy/pandas that referenced this issue May 23, 2017

TomAugspurger added a commit to economy/pandas that referenced this issue May 30, 2017

@jreback jreback closed this in d419be4 Jun 1, 2017

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jun 1, 2017

BUG: fixed wrong order of ordered labels in pd.cut()
closes #16459

Author: economy <the.economy@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Jeff Reback <jeff.reback@twosigma.com>

Closes #16466 from economy/fix_cut and squashes the following commits:

29128b3 [economy] comments and whatsnew edits
3898b72 [economy] BUG: fixed wrong order of ordered labels in pd.cut()

(cherry picked from commit d419be4)

TomAugspurger added a commit that referenced this issue Jun 4, 2017

BUG: fixed wrong order of ordered labels in pd.cut()
closes #16459

Author: economy <the.economy@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Jeff Reback <jeff.reback@twosigma.com>

Closes #16466 from economy/fix_cut and squashes the following commits:

29128b3 [economy] comments and whatsnew edits
3898b72 [economy] BUG: fixed wrong order of ordered labels in pd.cut()

(cherry picked from commit d419be4)

Kiv added a commit to Kiv/pandas that referenced this issue Jun 11, 2017

BUG: fixed wrong order of ordered labels in pd.cut()
closes #16459

Author: economy <the.economy@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Jeff Reback <jeff.reback@twosigma.com>

Closes #16466 from economy/fix_cut and squashes the following commits:

29128b3 [economy] comments and whatsnew edits
3898b72 [economy] BUG: fixed wrong order of ordered labels in pd.cut()

stangirala added a commit to stangirala/pandas that referenced this issue Jun 11, 2017

BUG: fixed wrong order of ordered labels in pd.cut()
closes #16459

Author: economy <the.economy@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Jeff Reback <jeff.reback@twosigma.com>

Closes #16466 from economy/fix_cut and squashes the following commits:

29128b3 [economy] comments and whatsnew edits
3898b72 [economy] BUG: fixed wrong order of ordered labels in pd.cut()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment