CategoricalIndex reindex duplicates values #21809

Closed
jesrael opened this issue Jul 8, 2018 · 4 comments

@jesrael
commented Jul 8, 2018

From SO question:

Sample:

import numpy as np
import pandas as pd

np.random.seed(57)

idx = pd.CategoricalIndex(['low'] * 3 + ['hi'] * 3)
dfb = pd.DataFrame(np.random.rand(6, 3), columns=list('abc'), index=idx)
print(dfb)
            a         b         c
low  0.087350  0.230477  0.411061
low  0.310783  0.565956  0.545064
low  0.807099  0.918155  0.522091
hi   0.424687  0.071804  0.898529
hi   0.420514  0.582170  0.214154
hi   0.447486  0.467864  0.100637

round incorrectly duplicates rows:

print (dfb.round(3))
         a      b      c
low  0.087  0.230  0.411
low  0.311  0.566  0.545
low  0.807  0.918  0.522
low  0.087  0.230  0.411
low  0.311  0.566  0.545
low  0.807  0.918  0.522
low  0.087  0.230  0.411
low  0.311  0.566  0.545
low  0.807  0.918  0.522
hi   0.425  0.072  0.899
hi   0.421  0.582  0.214
hi   0.447  0.468  0.101
hi   0.425  0.072  0.899
hi   0.421  0.582  0.214
hi   0.447  0.468  0.101
hi   0.425  0.072  0.899
hi   0.421  0.582  0.214
hi   0.447  0.468  0.101

Expected output:

print (dfb.round(3))
         a      b      c
low  0.087  0.230  0.411
low  0.311  0.566  0.545
low  0.807  0.918  0.522
hi   0.425  0.072  0.899
hi   0.421  0.582  0.214
hi   0.447  0.468  0.101

print (pd.show_versions())

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.23.1
pytest: 3.3.2
pip: 9.0.1
setuptools: 39.2.0
Cython: 0.27.3
numpy: 1.14.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
None
@TomAugspurger

Contributor

commented Jul 8, 2018

Seems to be an issue with the DataFrame constructor.

In [11]: idx = pd.CategoricalIndex(['low', 'low', 'hi', 'hi'])

In [12]: pd.DataFrame(pd.DataFrame({"A": [1, 2, 3, 4]}, index=idx), index=idx)
Out[12]:
     A
low  1
low  2
low  1
low  2
hi   3
hi   4
hi   3
hi   4

Would welcome further investigation from others!
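
A possible workaround sketch, not proposed in the thread: if the duplication comes from the constructor re-aligning an existing DataFrame against its own index, then rounding the raw values and rebuilding the frame from a plain array should sidestep that path. The name rounded is illustrative, and the sketch assumes all columns are numeric.

```python
import numpy as np
import pandas as pd

np.random.seed(57)
idx = pd.CategoricalIndex(['low'] * 3 + ['hi'] * 3)
dfb = pd.DataFrame(np.random.rand(6, 3), columns=list('abc'), index=idx)

# Round the underlying ndarray, then attach the original labels again.
# Building from an ndarray only assigns the axes; no DataFrame is re-aligned.
rounded = pd.DataFrame(np.round(dfb.values, 3),
                       index=dfb.index,
                       columns=dfb.columns)

print(rounded.shape)
```

The result should keep the six original rows and the same CategoricalIndex.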

@TomAugspurger TomAugspurger changed the title Round explode rows in DataFrame with duplicated CategoricalIndex DataFrame constructor duplicates values when passed DataFrame with CategoricalIndex Jul 8, 2018

@fjdiod

Contributor

commented Jul 11, 2018

The problem seems to be with the reindex method:

>>> idx = pd.CategoricalIndex(['low', 'low', 'hi', 'hi'])
>>> idx.reindex(idx)[0].values
[low, low, low, low, hi, hi, hi, hi]
Categories (2, object): [hi, low]
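
The size of the blow-up is consistent with a non-unique lookup that returns every matching position for every target label. The following is a toy pure-Python model of that behaviour, not pandas' actual implementation; non_unique_indexer is a hypothetical helper.

```python
from collections import Counter

def non_unique_indexer(index_labels, target_labels):
    """Toy model: for each target label, in order, emit every
    position at which that label occurs in index_labels."""
    return [pos
            for target in target_labels
            for pos, label in enumerate(index_labels)
            if label == target]

labels = ['low'] * 3 + ['hi'] * 3

# Reindexing the labels against themselves: each of the 3 'low' targets
# matches all 3 'low' positions, and likewise for 'hi'.
indexer = non_unique_indexer(labels, labels)
reindexed = [labels[i] for i in indexer]

# A label occurring n times contributes n * n rows.
expected_rows = sum(n * n for n in Counter(labels).values())

print(len(reindexed), expected_rows)  # 18 18
```

For the six-row frame in the original report this gives 3² + 3² = 18 rows, matching the 18 rows printed by dfb.round(3).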

Update:
Problem is here:

def get_indexer_non_unique(self, target):
    target = ibase._ensure_index(target)
    if isinstance(target, CategoricalIndex):
        # Indexing on codes is more efficient if categories are the same:
        if target.categories is self.categories:
            target = target.codes
            indexer, missing = self._engine.get_indexer_non_unique(target)
            return _ensure_platform_int(indexer), missing
        target = target.values
    codes = self.categories.get_indexer(target)
    indexer, missing = self._engine.get_indexer_non_unique(codes)
    return _ensure_platform_int(indexer), missing

Changing line 601 (the second self._engine.get_indexer_non_unique call above, the one that takes codes) to:

indexer, missing = self._engine.get_indexer_non_unique(np.unique(codes))

helps and doesn't break tests, but I'm not sure that it's correct
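
To see what de-duplicating the target changes, here is a toy pure-Python model of a non-unique lookup over category codes. It is a sketch rather than the pandas engine: lookup_all is a hypothetical helper, and sorted(set(codes)) stands in for np.unique(codes), which also returns sorted unique values.

```python
def lookup_all(index_codes, target_codes):
    """Toy model: for each target code, in order, emit every
    position at which that code occurs in index_codes."""
    return [pos
            for target in target_codes
            for pos, code in enumerate(index_codes)
            if code == target]

# 'low', 'low', 'hi', 'hi' with categories ['hi', 'low'] -> hi=0, low=1
codes = [1, 1, 0, 0]

as_is = lookup_all(codes, codes)
deduped = lookup_all(codes, sorted(set(codes)))  # stand-in for np.unique(codes)

print(as_is)    # [0, 1, 0, 1, 2, 3, 2, 3]
print(deduped)  # [2, 3, 0, 1]
```

In this model the de-duplicated target yields each position exactly once, but in sorted-code order (the hi rows first), and a target that legitimately repeats a label would lose its repeats. That may be one reason to treat the change with caution.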

@TomAugspurger TomAugspurger changed the title DataFrame constructor duplicates values when passed DataFrame with CategoricalIndex CategoricalIndex reindex duplicates values Jul 20, 2018

@TomAugspurger

Contributor

commented Jul 20, 2018

Thanks for investigating further.

We probably don't want to unique the codes, as that can be expensive... Let me see if there's a simpler way.

@TomAugspurger

Contributor

commented Jul 20, 2018

Hmm, Index.reindex with duplicates seems strange :/ Will have to come back to this later. Others are welcome to continue investigating, of course :)
