Categorical equality check raises ValueError in DataFrame #12564

Closed
pganssle opened this issue Mar 8, 2016 · 7 comments
Labels: Bug, Categorical, Indexing

@pganssle
Contributor

pganssle commented Mar 8, 2016

There appears to be a bug when comparing a scalar value for equality against a DataFrame that contains a categorical column. In the example below, I'm checking against -np.inf, but comparing to a string or an integer gives the same result.

The comparison raises ValueError: Wrong number of dimensions.

Code Sample

from sys import version
import pandas as pd     # Version 0.17.1 on Linux and Windows
import numpy as np
print(version)
print(pd.__version__)

# Arbitrary data set
columns = ['Name', 'Type', 'Age', 'Weight (kg)', 'Cuteness']
dataset = [['Snuggles', 'Cat', 5.2, 4.2, 9.7],
           ['Rex', 'Dog', 2.1, 12, 2.1],
           ['Mrs. Quiggleworth', 'Cat', 7.4, 3, 7],
           ['Squirmy', 'Snake', 1.1, 0.2, 0.1],
           ['Tarantula', 'Legs', 0.2, 0.01, -np.inf],
           ['Groucho', 'Dog', 6.9, 8, 5.1]]

df = pd.DataFrame(dataset, columns=columns).set_index(['Name'])

# Works fine
print("String type - are any of the columns negative infinity?")
neg_inf = (df == -np.inf)
print(neg_inf.any(axis=1))

# Convert 'type' to a categorical
df['Type'] = df['Type'].astype('category')

print("Categorical type - is the Type column negative infinity?")
print(df['Type'] == -np.inf)    # Works fine

print("Categorical type in dataframe - are any of them negative infinity?")
print(df[['Type']] == -np.inf)  # Danger, Will Robinson!

Expected Output

3.5.1 |Anaconda 2.5.0 (64-bit)| (default, Dec  7 2015, 11:16:01) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
0.17.1
String type - are any of the columns negative infinity?
Name
Snuggles             False
Rex                  False
Mrs. Quiggleworth    False
Squirmy              False
Tarantula             True
Groucho              False
dtype: bool
Categorical type - is the Type column negative infinity?
Name
Snuggles             False
Rex                  False
Mrs. Quiggleworth    False
Squirmy              False
Tarantula            False
Groucho              False
Name: Type, dtype: bool
Categorical type in dataframe - are any of them negative infinity?
Traceback (most recent call last):
  File "pandas_demo.py", line 30, in <module>
    print(df[['Type']] == -np.inf)  # Danger, Will Robinson!
  File "~/anaconda3/lib/python3.5/site-packages/pandas/core/ops.py", line 1115, in f
    res = self._combine_const(other, func, raise_on_error=False)
  File "~/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 3482, in _combine_const
    new_data = self._data.eval(func=func, other=other, raise_on_error=raise_on_error)
  File "~/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 2840, in eval
    return self.apply('eval', **kwargs)
  File "~/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 2823, in apply
    applied = getattr(b, f)(**kwargs)
  File "~/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 1155, in eval
    fastpath=True,)]
  File "~/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 169, in make_block
    return make_block(values, placement=placement, ndim=ndim, **kwargs)
  File "~/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 2454, in make_block
    placement=placement)
  File "~/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 78, in __init__
    raise ValueError('Wrong number of dimensions')
ValueError: Wrong number of dimensions

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-51-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.1
nose: 1.3.7
pip: 8.0.3
setuptools: 20.2.2
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
IPython: 4.0.3
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.0
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
Jinja2: None
@jreback
Contributor

jreback commented Mar 8, 2016

Here's a simpler example. Yeah, I suppose this should work. Note that comparing a scalar against a DataFrame is not typically useful. You almost always compare against a Series (and then index).

pull-requests are welcome.

In [23]: df = DataFrame({'A' : ['foo','bar','baz']})

In [24]: df['B'] = df['A'].astype('category')

In [25]: df['A'] == 'foo'
Out[25]: 
0     True
1    False
2    False
Name: A, dtype: bool

In [26]: df['B'] == 'foo'
Out[26]: 
0     True
1    False
2    False
Name: B, dtype: bool

In [27]: df[['A']] == 'foo'
Out[27]: 
       A
0   True
1  False
2  False

In [28]: df[['B']] == 'foo'
ValueError: Wrong number of dimensions
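
For reference, the behavior the report asks for can be written as a small check. This is a minimal sketch, assuming a pandas version in which the bug is fixed; on the affected 0.17.1 release the frame-level comparison raises instead. The names `expected` and `result` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'baz']})
df['B'] = df['A'].astype('category')

# Series-level comparison, which already works on the affected version.
expected = (df['B'] == 'foo').to_frame()

# Frame-level comparison; this is the call that raised ValueError.
result = df[['B']] == 'foo'

# The two should agree column by column once the bug is fixed.
print(result.equals(expected))
```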

@jreback added the Bug, Indexing, Difficulty Novice, and Categorical labels Mar 8, 2016
@jreback added this to the 0.18.1 milestone Mar 8, 2016
@pganssle
Contributor Author

pganssle commented Mar 9, 2016

My use case, if you want to know, was that I just wanted to discard rows with np.inf in any column (though you could imagine the same thing with 0 or something). I find it easier and more readable to just broadcast the comparison across the entire DataFrame, even though logically speaking I know that the comparison really only needs to be applied to the float columns.
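
Until the frame-level comparison works, one possible workaround for this use case is to restrict the check to the numeric columns, so the categorical column never takes part in it. The following is a sketch under that assumption, using a trimmed-down version of the data set above; note that `np.isinf` flags both positive and negative infinity.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'Type': pd.Categorical(['Cat', 'Dog', 'Legs']),
        'Age': [5.2, 2.1, 0.2],
        'Cuteness': [9.7, 2.1, -np.inf],
    },
    index=['Snuggles', 'Rex', 'Tarantula'],
)

# Keep only numeric columns so the categorical column is left out
# of the element-wise check.
numeric = df.select_dtypes(include=[np.number])

# True for every row that holds +inf or -inf in any numeric column.
has_inf = np.isinf(numeric).any(axis=1)

# Drop the flagged rows from the original frame, categorical column included.
cleaned = df[~has_inf]
print(cleaned)
```

The trade-off is that the comparison is no longer a single broadcast over the whole frame, which is the readability the comment above was after.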

@facaiy
Contributor

facaiy commented Mar 22, 2016

>>> df[['A']].ndim
2
>>> df[['A']]._data.blocks[0].values
array([['foo', 'bar', 'baz']], dtype=object)
>>> df[['A']]._data.blocks[0].values.ndim
2
>>> df[['B']].ndim
2
>>> df[['B']]._data.blocks[0].values
[foo, bar, baz]
Categories (3, object): [bar, baz, foo]
>>> df[['B']]._data.blocks[0].values.ndim
1

interesting,
I think that's why pandas raises ValueError: Wrong number of dimensions.
right?
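
The same dimensionality difference can be seen without touching the private `_data` attribute. This is only a sketch that mirrors the two storage layouts shown above with public objects; it does not reproduce pandas' internal block handling.

```python
import numpy as np
import pandas as pd

# A Categorical is always a 1-d container.
cat = pd.Categorical(['foo', 'bar', 'baz'])

# An object array shaped like the 2-d block values printed above.
obj = np.array([['foo', 'bar', 'baz']], dtype=object)

print(cat.ndim)
print(obj.ndim)
```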

@jreback
Contributor

jreback commented Mar 22, 2016

@ningchi that is only a manifestation of the issue, not the cause. CategoricalBlocks only hold a 1-d structure. The comparison path goes thru core/internals/Block/eval.

you can prob get away with changing this:

transf = (lambda x: x.T) if is_transposed else (lambda x: x)

to something like

def transf(x):
    # undo the transpose if needed, then pad the result up to the block's ndim
    x = x.T if is_transposed else x
    return _block_shape(x, ndim=self.ndim)
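
To make the idea concrete outside of pandas, here is a standalone sketch of the reshaping step. `block_shape_sketch` and `make_transf` are hypothetical stand-ins written for illustration, not pandas' private `_block_shape` or the actual patch, and whether this change resolves the bug inside `Block.eval` is not verified here.

```python
import numpy as np


def block_shape_sketch(values, ndim=1):
    """Hypothetical helper: pad values with a leading axis up to ndim."""
    if values.ndim < ndim:
        # Prepend a length-1 axis so 1-d values fit a 2-d block layout.
        values = values.reshape((1,) + values.shape)
    return values


def make_transf(is_transposed, ndim):
    """Build a transform that undoes a transpose, then fixes the shape."""
    def transf(x):
        x = x.T if is_transposed else x
        return block_shape_sketch(x, ndim=ndim)
    return transf


# A 1-d comparison result, like the one a categorical column produces.
result_1d = np.array([True, False, False])

transf = make_transf(is_transposed=False, ndim=2)
print(transf(result_1d).shape)
```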

@facaiy
Contributor

facaiy commented Mar 22, 2016

Thanks, @jreback

Because CategoricalBlocks only hold a 1-d structure, I have no idea how to extend its ndim, except by using to_dense() as NonConsolidatableMixIn.get_values does.

On the other hand, isn't it strange to convert self.dim to self.value.dim, given that we expect a DataFrame rather than a Series?

@jreback
Contributor

jreback commented Mar 22, 2016

did you try the code above?

@facaiy
Contributor

facaiy commented Mar 23, 2016

@jreback Sorry, I didn't test your suggestion completely. Many thanks.
