Categorical equality check raises ValueError in DataFrame #12564

pganssle · 2016-03-08T22:29:11Z

Apparently there's an issue when comparing the equality of a scalar value against a categorical column as part of a DataFrame. In the example below, I'm checking against -np.inf, but comparing to a string or integer gives the same results.

This raises ValueError: Wrong number of dimensions.

Code Sample

from sys import version
import pandas as pd     # Version 0.17.1 on Linux and Windows
import numpy as np
print(version)
print(pd.__version__)

# Arbitrary data set
columns = ['Name', 'Type', 'Age', 'Weight (kg)', 'Cuteness']
dataset = [['Snuggles', 'Cat', 5.2, 4.2, 9.7],
           ['Rex', 'Dog', 2.1, 12, 2.1],
           ['Mrs. Quiggleworth', 'Cat', 7.4, 3, 7],
           ['Squirmy', 'Snake', 1.1, 0.2, 0.1],
           ['Tarantula', 'Legs', 0.2, 0.01, -np.inf],
           ['Groucho', 'Dog', 6.9, 8, 5.1]]

df = pd.DataFrame(dataset, columns=columns).set_index(['Name'])

# Works fine
print("String type - are any of the columns negative infinity?")
neg_inf = (df == -np.inf)
print(neg_inf.any(axis=1))

# Convert 'Type' to a categorical
df['Type'] = df['Type'].astype('category')

print("Categorical type - is the Type column negative infinity?")
print(df['Type'] == -np.inf)    # Works fine

print("Categorical type in dataframe - are any of them negative infinity?")
print(df[['Type']] == -np.inf)  # Danger, Will Robinson!

Expected Output

3.5.1 |Anaconda 2.5.0 (64-bit)| (default, Dec  7 2015, 11:16:01) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
0.17.1
String type - are any of the columns negative infinity?
Name
Snuggles             False
Rex                  False
Mrs. Quiggleworth    False
Squirmy              False
Tarantula             True
Groucho              False
dtype: bool
Categorical type - is the Type column negative infinity?
Name
Snuggles             False
Rex                  False
Mrs. Quiggleworth    False
Squirmy              False
Tarantula            False
Groucho              False
Name: Type, dtype: bool
Categorical type in dataframe - are any of them negative infinity?
Traceback (most recent call last):
  File "pandas_demo.py", line 30, in <module>
    print(df[['Type']] == -np.inf)  # Danger, Will Robinson!
  File "~/anaconda3/lib/python3.5/site-packages/pandas/core/ops.py", line 1115, in f
    res = self._combine_const(other, func, raise_on_error=False)
  File "~/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 3482, in _combine_const
    new_data = self._data.eval(func=func, other=other, raise_on_error=raise_on_error)
  File "~/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 2840, in eval
    return self.apply('eval', **kwargs)
  File "~/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 2823, in apply
    applied = getattr(b, f)(**kwargs)
  File "~/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 1155, in eval
    fastpath=True,)]
  File "~/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 169, in make_block
    return make_block(values, placement=placement, ndim=ndim, **kwargs)
  File "~/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 2454, in make_block
    placement=placement)
  File "~/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 78, in __init__
    raise ValueError('Wrong number of dimensions')
ValueError: Wrong number of dimensions

output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-51-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.1
nose: 1.3.7
pip: 8.0.3
setuptools: 20.2.2
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
IPython: 4.0.3
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.0
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
Jinja2: None

jreback · 2016-03-08T23:31:42Z

Here's a simpler example. Yeah, I suppose this should work. Note that comparing a scalar against a DataFrame is not typically useful. You almost always compare against a Series (and then index).

pull-requests are welcome.

In [23]: df = DataFrame({'A' : ['foo','bar','baz']})

In [24]: df['B'] = df['A'].astype('category')

In [25]: df['A'] == 'foo'
Out[25]: 
0     True
1    False
2    False
Name: A, dtype: bool

In [26]: df['B'] == 'foo'
Out[26]: 
0     True
1    False
2    False
Name: B, dtype: bool

In [27]: df[['A']] == 'foo'
Out[27]: 
       A
0   True
1  False
2  False

In [28]: df[['B']] == 'foo'
ValueError: Wrong number of dimensions
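
For reference, the "compare against a Series, then index" pattern mentioned in this comment can be sketched as below. This is only an illustration built on the same toy frame; the names `mask` and `selected` are made up for the example and do not come from the thread.

```python
import pandas as pd

# Same toy frame as in the session above
df = pd.DataFrame({'A': ['foo', 'bar', 'baz']})
df['B'] = df['A'].astype('category')

# Compare a single (categorical) column, which yields a boolean Series ...
mask = df['B'] == 'foo'

# ... and then use that Series to index the frame
selected = df[mask]
print(selected)
```

Because the comparison happens on a Series, it goes through the code path that already works in In [26] and avoids the DataFrame-level comparison that raises.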

pganssle · 2016-03-09T00:32:51Z

My use case, if you want to know, was that I just wanted to discard rows with np.inf in any column (though you could imagine the same thing with 0 or something). I find it easier and more readable to just broadcast the comparison across the entire DataFrame, even though logically speaking I know that the comparison really only needs to be applied to the float columns.
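
A possible way to get the same result without comparing the categorical column at all is to restrict the check to the numeric columns, as described above. The following is a minimal sketch under that assumption; the reduced data set and the names `numeric`, `has_inf`, and `cleaned` are illustrative only, and `np.isinf` flags both positive and negative infinity.

```python
import numpy as np
import pandas as pd

# Reduced version of the data set from the report (illustrative values)
df = pd.DataFrame(
    {
        'Type': pd.Categorical(['Cat', 'Dog', 'Legs']),
        'Age': [5.2, 2.1, 0.2],
        'Cuteness': [9.7, 2.1, -np.inf],
    },
    index=['Snuggles', 'Rex', 'Tarantula'],
)

# Only the numeric columns can hold +/-inf, so check just those
numeric = df.select_dtypes(include=[np.number])
has_inf = np.isinf(numeric).any(axis=1)

# Keep the rows in which no numeric column is infinite
cleaned = df[~has_inf]
print(cleaned)
```

This trades the readability of a single broadcast comparison for an explicit column selection, so it is a workaround rather than a replacement for fixing the comparison itself.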

facaiy · 2016-03-22T12:08:32Z

>>> df[['A']].ndim
2
>>> df[['A']]._data.blocks[0].values
array([['foo', 'bar', 'baz']], dtype=object)
>>> df[['A']]._data.blocks[0].values.ndim
2
>>> df[['B']].ndim
2
>>> df[['B']]._data.blocks[0].values
[foo, bar, baz]
Categories (3, object): [bar, baz, foo]
>>> df[['B']]._data.blocks[0].values.ndim
1

Interesting. I think that's why pandas raises ValueError: Wrong number of dimensions, right?

jreback · 2016-03-22T13:04:12Z

@ningchi that is only a manifestation of the issue, not the cause. CategoricalBlocks only hold a 1-d structure. The comparison path goes thru core/internals/Block/eval.

you can prob get away with changing this:

transf = (lambda x: x.T) if is_transposed else (lambda x: x)

to something like

def transf(x):
    # Transpose if needed, then reshape the result to the block's ndim
    transposer = (lambda x: x.T) if is_transposed else (lambda x: x)
    return _block_shape(transposer(x), ndim=self.ndim)
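
To illustrate the idea behind this suggestion (promoting a 1-d comparison result to the 2-d shape a DataFrame block expects), here is a standalone sketch. `block_shape_sketch` is a hypothetical stand-in written for this example; it is not pandas' internal `_block_shape`, whose exact behavior is not shown in this thread.

```python
import numpy as np


def block_shape_sketch(values, ndim):
    """Hypothetical helper: prepend length-1 axes until `values` has `ndim` dims."""
    values = np.asarray(values)
    while values.ndim < ndim:
        # Add a leading axis, e.g. shape (3,) becomes (1, 3)
        values = values[np.newaxis, ...]
    return values


# A 1-d boolean result, like the one a categorical comparison produces
result_1d = np.array([True, False, False])

# Promote it to the 2-d layout used for a single-column frame
result_2d = block_shape_sketch(result_1d, ndim=2)
print(result_2d.shape)  # (1, 3)
```

The leading axis matches the layout seen earlier for `df[['A']]._data.blocks[0].values`, where one column of three rows is stored with shape (1, 3).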

facaiy · 2016-03-22T14:52:56Z

Thanks, @jreback

Because CategoricalBlocks only hold a 1-d structure, I have no idea how to extend its ndim, except by using to_dense() as NonConsolidatableMixIn.get_values does.

On the other hand, is it strange to convert self.dim to self.value.dim? We expect a DataFrame rather than a Series.

jreback · 2016-03-22T15:05:15Z

did you try the code above?

facaiy · 2016-03-23T01:51:40Z

@jreback Sorry, I didn't test your suggestion completely. Many thanks.

jreback added the Bug, Indexing, Difficulty Novice, and Categorical labels Mar 8, 2016

jreback added this to the 0.18.1 milestone Mar 8, 2016

facaiy mentioned this issue Mar 23, 2016

Fix #12564 for Categorical: consistent result if comparing as DataFrame #12698

Closed

4 tasks

facaiy added a commit to facaiy/pandas that referenced this issue Apr 3, 2016

Fix pandas-dev#12564 for Categorical: consistent result if comparing as DataFrame

6cab622

jreback closed this as completed in ad8ade8 Apr 3, 2016

jreback mentioned this issue Apr 11, 2016

BUG: indexing with boolean array and categoricals #12861

Open
