CategoricalIndex reindex duplicates values #21809

jesrael · 2018-07-08T05:18:31Z

Sample:

import numpy as np
import pandas as pd

np.random.seed(57)

idx = pd.CategoricalIndex(['low'] * 3 + ['hi'] * 3)
dfb = pd.DataFrame(np.random.rand(6, 3), columns=list('abc'), index=idx)
print (dfb)
            a         b         c
low  0.087350  0.230477  0.411061
low  0.310783  0.565956  0.545064
low  0.807099  0.918155  0.522091
hi   0.424687  0.071804  0.898529
hi   0.420514  0.582170  0.214154
hi   0.447486  0.467864  0.100637

round incorrectly explodes the rows:

print (dfb.round(3))
         a      b      c
low  0.087  0.230  0.411
low  0.311  0.566  0.545
low  0.807  0.918  0.522
low  0.087  0.230  0.411
low  0.311  0.566  0.545
low  0.807  0.918  0.522
low  0.087  0.230  0.411
low  0.311  0.566  0.545
low  0.807  0.918  0.522
hi   0.425  0.072  0.899
hi   0.421  0.582  0.214
hi   0.447  0.468  0.101
hi   0.425  0.072  0.899
hi   0.421  0.582  0.214
hi   0.447  0.468  0.101
hi   0.425  0.072  0.899
hi   0.421  0.582  0.214
hi   0.447  0.468  0.101

Expected output:

print (dfb.round(3))
         a      b      c
low  0.087  0.230  0.411
low  0.311  0.566  0.545
low  0.807  0.918  0.522
hi   0.425  0.072  0.899
hi   0.421  0.582  0.214
hi   0.447  0.468  0.101

print (pd.show_versions())

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.23.1
pytest: 3.3.2
pip: 9.0.1
setuptools: 39.2.0
Cython: 0.27.3
numpy: 1.14.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
None

TomAugspurger · 2018-07-08T12:56:04Z

Seems to be an issue with the DataFrame constructor.

In [11]: idx = pd.CategoricalIndex(['low', 'low', 'hi', 'hi'])

In [12]: pd.DataFrame(pd.DataFrame({"A": [1, 2, 3, 4]}, index=idx), index=idx)
Out[12]:
     A
low  1
low  2
low  1
low  2
hi   3
hi   4
hi   3
hi   4

Would welcome further investigation from others!

fjdiod · 2018-07-11T16:44:20Z

The problem seems to be with the reindex method:

>>> idx = pd.CategoricalIndex(['low', 'low', 'hi', 'hi'])
>>> idx.reindex(idx)[0].values
[low, low, low, low, hi, hi, hi, hi]
Categories (2, object): [hi, low]
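
A minimal sketch of why four labels can turn into eight. It uses a plain pd.Index instead of a CategoricalIndex (an assumption made only for illustration, so the example does not depend on the affected code path): get_indexer_non_unique returns every matching position for each target label, so a duplicated target looked up in a duplicated index multiplies the rows.

```python
import pandas as pd

labels = ["low", "low", "hi", "hi"]
index = pd.Index(labels)  # plain object Index, a stand-in for the CategoricalIndex

# For each label in the target, get_indexer_non_unique returns *all*
# positions in `index` that hold that label. `missing` lists the target
# positions that had no match at all.
indexer, missing = index.get_indexer_non_unique(pd.Index(labels))

print(len(labels), "target labels ->", len(indexer), "positions")
print(index.take(indexer).tolist())
```

Each of the four target labels matches two rows, which gives the eight positions. Reindexing an index against itself should be a no-op, so the non-unique lookup is the wrong tool when the target equals the index.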

Update:
Problem is here:

pandas/pandas/core/indexes/category.py

Lines 589 to 602 in bdb6168

    def get_indexer_non_unique(self, target):
        target = ibase._ensure_index(target)
        if isinstance(target, CategoricalIndex):
            # Indexing on codes is more efficient if categories are the same:
            if target.categories is self.categories:
                target = target.codes
                indexer, missing = self._engine.get_indexer_non_unique(target)
                return _ensure_platform_int(indexer), missing
            target = target.values
        codes = self.categories.get_indexer(target)
        indexer, missing = self._engine.get_indexer_non_unique(codes)
        return _ensure_platform_int(indexer), missing

changing line 601 to:

indexer, missing = self._engine.get_indexer_non_unique(np.unique(codes))

helps and doesn't break tests, but I'm not sure that it's correct
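
One way to see why the np.unique change may not be correct in general is sketched below. A plain integer pd.Index over the codes stands in for self._engine (an assumption made for illustration, not the pandas internals themselves). np.unique both sorts and de-duplicates the codes, so the lookup returns the right number of positions here, but in sorted-code order rather than in the order of the target.

```python
import numpy as np
import pandas as pd

target = pd.Categorical(["low", "low", "hi", "hi"])
# Categories sort to ['hi', 'low'], so 'hi' has code 0 and 'low' has code 1.
codes = target.codes.astype("int64")

unique_codes = np.unique(codes)  # sorted and de-duplicated

# Stand-in for the engine: a plain Index holding the codes (assumption).
code_index = pd.Index(codes)
indexer, missing = code_index.get_indexer_non_unique(unique_codes)

# Apply the positions to the original labels to see the resulting order.
reordered = np.asarray(target)[indexer]

print(codes.tolist(), "->", unique_codes.tolist())
print(reordered.tolist())
```

In this stand-in the row count is back to four, but the 'hi' rows come before the 'low' rows. A target that legitimately repeats a label would also lose its repeats.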

TomAugspurger · 2018-07-20T18:54:44Z

Thanks for investigating further.

We probably don't want to unique the codes, as that can be expensive... Let me see if there's a simpler way.

TomAugspurger · 2018-07-20T19:11:50Z

Hmm, Index.reindex with duplicates seems strange :/ Will have to come back to this later. Others are welcome to continue investigating, of course :)

TomAugspurger mentioned this issue Jul 8, 2018

Why does pandas Round method explodes my data frame? #21810

Closed

TomAugspurger changed the title ~~Round explode rows in DataFrame with duplicated CategoricalIndex~~ DataFrame constructor duplicates values when passed DataFrame with CategoricalIndex Jul 8, 2018

TomAugspurger added Bug Categorical Categorical Data Type labels Jul 8, 2018

fjdiod mentioned this issue Jul 20, 2018

CategoricalIndex reidex duplicates values #21999

Closed

TomAugspurger changed the title ~~DataFrame constructor duplicates values when passed DataFrame with CategoricalIndex~~ CategoricalIndex reindex duplicates values Jul 20, 2018

qwhelan mentioned this issue Nov 28, 2018

BUG: CategoricalIndex allows reindexing with non-unique CategoricalIndex #23963

Merged

4 tasks

jreback added this to the 0.24.0 milestone Nov 28, 2018

jreback closed this as completed in #23963 Dec 2, 2018

batterseapower mentioned this issue Aug 22, 2019

CategoricalIndex.reindex raises on duplicate indexer #25459

Closed
