Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: CategoricalIndex allows reindexing with non-unique CategoricalIndex #23963

Merged
merged 1 commit into from Dec 2, 2018

Conversation

Projects
None yet
3 participants
@qwhelan
Copy link
Contributor

commented Nov 28, 2018

CategoricalIndex.reindex() throws a ValueError if the target index is non-unique and not also a CategoricalIndex. This means a non-unique CategoricalIndex can be used to reindex, resulting in duplicated data:

In [1]: import pandas as pd
        categorical_index = pd.CategoricalIndex(['bar', 'foo', 'bar', 'foo'], ['bar', 'foo'])
        s = pd.Series(range(4), index=categorical_index)

In [2]: s
Out[2]: bar    0
        foo    1
        bar    2
        foo    3
        dtype: int64

In [3]: s.reindex(categorical_index.as_ordered())
Out[3]: bar    0
        bar    2
        foo    1
        foo    3
        bar    0
        bar    2
        foo    1
        foo    3
        dtype: int64

This PR removes the extraneous type check and adds tests checking that an exception is thrown for non-unique target indexes, regardless of type.

  • closes #21809
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry
@pep8speaks

This comment has been minimized.

Copy link

commented Nov 28, 2018

Hello @qwhelan! Thanks for updating the PR.

Comment last updated on November 28, 2018 at 22:26 Hours UTC
@codecov

This comment has been minimized.

Copy link

commented Nov 28, 2018

Codecov Report

❗️ No coverage uploaded for pull request base (master@6dd130a). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #23963   +/-   ##
=========================================
  Coverage          ?    92.3%           
=========================================
  Files             ?      161           
  Lines             ?    51558           
  Branches          ?        0           
=========================================
  Hits              ?    47592           
  Misses            ?     3966           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.7% <100%> (?)
#single 42.43% <57.14%> (?)
Impacted Files Coverage Δ
pandas/core/indexes/category.py 97.88% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6dd130a...9550880. Read the comment docs.

@jreback
Copy link
Contributor

left a comment

can you a whatsnew entry, and replicate the df.round() (that should work now) in the OP

Show resolved Hide resolved pandas/tests/indexes/test_category.py

@qwhelan qwhelan force-pushed the qwhelan:non_unique_categoricalindex branch from 54ab47e to 8366590 Nov 28, 2018

@qwhelan

This comment has been minimized.

Copy link
Contributor Author

commented Nov 28, 2018

@jreback I stumbled across this bug while writing some asv benchmarks and only noticed #21809 after opening this PR, so it ended up being a few more lines to handle the df.round() case.

As we're now raising on non-unique CategoricalIndexes, we need to whitelist non-unique targets that are also the current index to allow df.round() to run. As the construction of the result involves a .reindex() call, we'd otherwise get ValueErrors when calling df.round() on a non-uniquely indexed DataFrame.

This is already done in Index.reindex(), so this whitelisting is bringing CategoricalIndex back in line with the base implementation.

@qwhelan qwhelan force-pushed the qwhelan:non_unique_categoricalindex branch from 8366590 to eec207a Nov 28, 2018

@qwhelan

This comment has been minimized.

Copy link
Contributor Author

commented Nov 28, 2018

Additionally, this is the fixed output from #21809:

print(dfb.round(3))
         a      b      c
low  0.087  0.230  0.411
low  0.311  0.566  0.545
low  0.807  0.918  0.522
hi   0.425  0.072  0.899
hi   0.421  0.582  0.214
hi   0.447  0.468  0.101
Show resolved Hide resolved pandas/tests/indexes/test_category.py Outdated

@qwhelan qwhelan force-pushed the qwhelan:non_unique_categoricalindex branch 2 times, most recently from 6450827 to 0087ccf Nov 28, 2018

Show resolved Hide resolved doc/source/whatsnew/v0.24.0.rst Outdated
idx = pd.CategoricalIndex(['low'] * 3 + ['hi'] * 3)
dfb = pd.DataFrame(np.random.rand(6, 3), columns=list('abc'),
index=idx)
assert dfb.shape == (6, 3)

This comment has been minimized.

Copy link
@jreback

jreback Nov 29, 2018

Contributor

can you construct the result and use tm.assert_frame_equal here

This comment has been minimized.

Copy link
@qwhelan

qwhelan Nov 29, 2018

Author Contributor

Added a tm.assert_frame_equal() check but keeping the explicit dims check as I don't want to inadvertently rely on CategoricalIndex.reindex.

@qwhelan qwhelan force-pushed the qwhelan:non_unique_categoricalindex branch from 0087ccf to 95337f9 Nov 29, 2018

@qwhelan

This comment has been minimized.

Copy link
Contributor Author

commented Nov 29, 2018

@jreback Changes implemented, the one failing test is a flaky hypothesis failure in an unrelated file.

@jreback jreback added this to the 0.24.0 milestone Dec 2, 2018

@jreback
Copy link
Contributor

left a comment

cosmestic change in tests. ping when green.

Show resolved Hide resolved pandas/tests/frame/test_analytics.py Outdated

@qwhelan qwhelan force-pushed the qwhelan:non_unique_categoricalindex branch from 95337f9 to 9550880 Dec 2, 2018

@qwhelan

This comment has been minimized.

Copy link
Contributor Author

commented Dec 2, 2018

@jreback Green other than the unrelated flaky test

@jreback

jreback approved these changes Dec 2, 2018

@jreback jreback merged commit 5259e33 into pandas-dev:master Dec 2, 2018

2 of 3 checks passed

ci/circleci Your tests failed on CircleCI
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
pandas-dev.pandas Build #20181202.49 succeeded
Details
@jreback

This comment has been minimized.

Copy link
Contributor

commented Dec 2, 2018

thanks @qwhelan

Pingviinituutti added a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

Pingviinituutti added a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.