CategoricalIndex reindex duplicates values #21809

Closed
jesrael opened this issue Jul 8, 2018 · 4 comments · Fixed by #23963
Labels
Bug Categorical Categorical Data Type
Comments

@jesrael

jesrael commented Jul 8, 2018

From SO question:

Sample:

import numpy as np
import pandas as pd

np.random.seed(57)

idx = pd.CategoricalIndex(['low'] * 3 + ['hi'] * 3)
dfb = pd.DataFrame(np.random.rand(6, 3), columns=list('abc'), index=idx)
print (dfb)
            a         b         c
low  0.087350  0.230477  0.411061
low  0.310783  0.565956  0.545064
low  0.807099  0.918155  0.522091
hi   0.424687  0.071804  0.898529
hi   0.420514  0.582170  0.214154
hi   0.447486  0.467864  0.100637

`round` incorrectly explodes the rows:

print (dfb.round(3))
         a      b      c
low  0.087  0.230  0.411
low  0.311  0.566  0.545
low  0.807  0.918  0.522
low  0.087  0.230  0.411
low  0.311  0.566  0.545
low  0.807  0.918  0.522
low  0.087  0.230  0.411
low  0.311  0.566  0.545
low  0.807  0.918  0.522
hi   0.425  0.072  0.899
hi   0.421  0.582  0.214
hi   0.447  0.468  0.101
hi   0.425  0.072  0.899
hi   0.421  0.582  0.214
hi   0.447  0.468  0.101
hi   0.425  0.072  0.899
hi   0.421  0.582  0.214
hi   0.447  0.468  0.101

Expected output:

print (dfb.round(3))
         a      b      c
low  0.087  0.230  0.411
low  0.311  0.566  0.545
low  0.807  0.918  0.522
hi   0.425  0.072  0.899
hi   0.421  0.582  0.214
hi   0.447  0.468  0.101
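The expectation above can be written as a small check. This is only a sketch of a regression test, assuming a pandas version in which the bug is fixed; on an affected version such as 0.23.1 both flags would be expected to come out `False`.

```python
import numpy as np
import pandas as pd

np.random.seed(57)
idx = pd.CategoricalIndex(['low'] * 3 + ['hi'] * 3)
dfb = pd.DataFrame(np.random.rand(6, 3), columns=list('abc'), index=idx)

rounded = dfb.round(3)

# Rounding should change values only, never the row count or the labels.
same_shape = rounded.shape == dfb.shape
same_index = rounded.index.equals(dfb.index)
print(same_shape, same_index)
```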

print (pd.show_versions())

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.23.1
pytest: 3.3.2
pip: 9.0.1
setuptools: 39.2.0
Cython: 0.27.3
numpy: 1.14.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
None
@TomAugspurger
Contributor

Seems to be an issue with the DataFrame constructor.

In [11]: idx = pd.CategoricalIndex(['low', 'low', 'hi', 'hi'])

In [12]: pd.DataFrame(pd.DataFrame({"A": [1, 2, 3, 4]}, index=idx), index=idx)
Out[12]:
     A
low  1
low  2
low  1
low  2
hi   3
hi   4
hi   3
hi   4

Would welcome further investigation from others!
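For anyone picking this up, the constructor reproduction can be turned into an assertion. This is a sketch, assuming a pandas version where the problem is fixed; the variable names `inner` and `outer` are illustrative only.

```python
import pandas as pd

idx = pd.CategoricalIndex(['low', 'low', 'hi', 'hi'])
inner = pd.DataFrame({"A": [1, 2, 3, 4]}, index=idx)

# Re-wrapping a frame with the index it already has should be a no-op.
outer = pd.DataFrame(inner, index=idx)

row_count_preserved = len(outer) == len(inner)
values_preserved = outer["A"].tolist() == [1, 2, 3, 4]
print(row_count_preserved, values_preserved)
```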

@TomAugspurger TomAugspurger changed the title from "Round explode rows in DataFrame with duplicated CategoricalIndex" to "DataFrame constructor duplicates values when passed DataFrame with CategoricalIndex" Jul 8, 2018
@TomAugspurger TomAugspurger added Bug Categorical Categorical Data Type labels Jul 8, 2018
@fjdiod
Contributor

fjdiod commented Jul 11, 2018

The problem seems to be with the reindex method:

>>> idx = pd.CategoricalIndex(['low', 'low', 'hi', 'hi'])
>>> idx.reindex(idx)[0].values
[low, low, low, low, hi, hi, hi, hi]
Categories (2, object): [hi, low]
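A possible explanation for the doubling, sketched with a plain `Index` rather than the `CategoricalIndex` internals (it is an assumption that the same semantics apply there): `get_indexer_non_unique` returns every matching position for every element of the target, so a target that repeats a label also repeats all of that label's matches.

```python
import pandas as pd

labels = ['low', 'low', 'hi', 'hi']
index = pd.Index(labels)

# Each target element is matched against ALL positions holding that label.
indexer, missing = index.get_indexer_non_unique(pd.Index(labels))

# 2 'low' targets x 2 matches + 2 'hi' targets x 2 matches = 8 positions.
print(len(indexer), len(missing))
```

With four labels in and eight positions out, taking rows by this indexer would produce the eight-row result shown above.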

Update:
Problem is here:

def get_indexer_non_unique(self, target):
    target = ibase._ensure_index(target)
    if isinstance(target, CategoricalIndex):
        # Indexing on codes is more efficient if categories are the same:
        if target.categories is self.categories:
            target = target.codes
            indexer, missing = self._engine.get_indexer_non_unique(target)
            return _ensure_platform_int(indexer), missing
        target = target.values
    codes = self.categories.get_indexer(target)
    indexer, missing = self._engine.get_indexer_non_unique(codes)
    return _ensure_platform_int(indexer), missing

changing line 601 to:

indexer, missing = self._engine.get_indexer_non_unique(np.unique(codes))

helps and doesn't break tests, but I'm not sure that it's correct
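One reason to be cautious about `np.unique` here: it sorts and de-duplicates the codes, so the resulting indexer follows code order rather than the order of the target. The sketch below uses a plain integer `Index` as a hypothetical stand-in for the engine; whether the real engine behaves identically is an assumption.

```python
import numpy as np
import pandas as pd

# Codes for ['low', 'low', 'hi', 'hi'] with categories [hi, low].
codes = np.array([1, 1, 0, 0])
code_index = pd.Index(codes)  # hypothetical stand-in for self._engine

# np.unique(codes) is [0, 1]: sorted, with duplicates dropped.
indexer, _ = code_index.get_indexer_non_unique(pd.Index(np.unique(codes)))
print(list(indexer))
```

Here the indexer has the right length again, but it lists the `hi` positions before the `low` ones, so the rows would come back reordered.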

@TomAugspurger TomAugspurger changed the title from "DataFrame constructor duplicates values when passed DataFrame with CategoricalIndex" to "CategoricalIndex reindex duplicates values" Jul 20, 2018
@TomAugspurger
Contributor

Thanks for investigating further.

We probably don't want to unique the codes, as that can be expensive... Let me see if there's a simpler way.

@TomAugspurger
Contributor

TomAugspurger commented Jul 20, 2018

Hmm, Index.reindex with duplicates seems strange :/ Will have to come back to this later. Others are welcome to continue investigating of course :)
