astype errors with Categorical #16697

Closed
jbrockmendel opened this issue Jun 14, 2017 · 5 comments
Labels
Categorical (Categorical Data Type), Error Reporting (Incorrect or improved errors from pandas), good first issue

Comments

@jbrockmendel
Member

    import pandas as pd
    cat = pd.Categorical(['CA', 'AL'])
    df = pd.DataFrame([['CA', 'CA'], ['AL', 'CA']], index=['foo', 'bar'], columns=range(2))
    df[1].astype(cat)

With pandas 0.20.1, this raises:

    if dtype == CategoricalDtype():
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

This doesn't appear to be quite the intended usage. .astype("category", categories=cat) also fails, though .astype("category", categories=cat.categories) is OK.

I suspect this is related to similar errors in trying to identify which columns of a DataFrame are categorical (possible repeat of #16659):

df.dtypes[colname] == 'category' evaluates as True for categorical columns and raises TypeError: data type "category" not understood for np.float64 columns.

df.dtypes == pd.Categorical raises TypeError: Could not compare <type 'type'> type with Series

Also related: #15078

@chris-b1
Contributor

The correct API is the one that works: .astype("category", categories=cat.categories)

cat is a categorical array, not a type, so what you are trying is in effect something like this:

    In [15]: pd.Series([1,2,3]).astype(np.array([1., 2.,  3.]))
    TypeError: data type not understood

So we could give a better message, but I think this is basically working as it should.
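
To make the array-versus-type distinction concrete, here is a minimal sketch. It assumes pandas 0.21 or later, where `pd.api.types.CategoricalDtype` is public; that is newer than the 0.20.1 release discussed in this issue, so treat it as an illustration rather than what this thread ran.

```python
import pandas as pd

cat = pd.Categorical(["CA", "AL"])

# `cat` holds values, so it is an array, not a dtype.
is_array = isinstance(cat, pd.Categorical)

# The dtype is a separate object hanging off the array.
is_dtype = isinstance(cat.dtype, pd.api.types.CategoricalDtype)

# The categories live on an Index; inferred categories are sorted.
categories = list(cat.categories)
```

Passing `cat` itself to `astype` hands over the values, whereas `cat.categories` (or `cat.dtype`) describes the type.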

@jreback jreback added Categorical Categorical Data Type Error Reporting Incorrect or improved errors from pandas labels Jun 14, 2017
@jreback
Contributor

jreback commented Jun 14, 2017

This is purely a user error. You are passing the wrong kind of object, as shown by @chris-b1.

diff --git a/pandas/core/internals.py b/pandas/core/internals.py
index f2a7ac7..b4c3e16 100644
--- a/pandas/core/internals.py
+++ b/pandas/core/internals.py
@@ -138,8 +138,11 @@ class Block(PandasObject):
         returns a boolean if we are a categorical
         """
         if is_categorical_dtype(dtype):
-            if dtype == CategoricalDtype():
-                return True
+            try:
+                if dtype == CategoricalDtype():
+                    return True
+            except:  # no pragma
+                pass
 
             # this is a pd.Categorical, but is not
             # a valid type for astypeing

fixes this. It is a bit tricky, as categorical is currently a singleton type.

@jreback jreback added this to the Next Major Release milestone Jun 14, 2017
@jbrockmendel
Member Author

Makes sense, thanks. Two follow-up questions.

Is there a canonical way to pick out categorical columns within a DataFrame?

Is there a way to pin a pre-existing Category to a Series? Here I'm thinking of cases where multiple columns of a DataFrame should be explicitly tied to the same dtype.

@jorisvandenbossche
Member

Is there a canonical way to pick out categorical columns within a DataFrame?

You can use select_dtypes
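
A minimal sketch of the `select_dtypes` approach; the frame and its column names (`state`, `value`) are made up for illustration:

```python
import pandas as pd

# Hypothetical frame with one categorical and one float column.
df = pd.DataFrame({
    "state": pd.Series(["CA", "AL"], dtype="category"),
    "value": [1.0, 2.0],
})

# select_dtypes returns the sub-frame of matching columns;
# .columns gives their labels.
categorical_columns = df.select_dtypes(include=["category"]).columns.tolist()
```

This avoids comparing each dtype against the string 'category' directly.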

The problem with df.dtypes == 'category' is that it fails (as you have noted) for other dtypes (NumPy datatypes). pandas cannot fix this, because it is a consequence of how NumPy dtype comparison is specified.

Is there a way to pin a pre-existing Category to a Series?

Do you mean a pre-existing set of categories? Then you can use .astype("category", categories=cat.categories)
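
For tying several columns to the same categories, a sketch along these lines may help. It assumes pandas 0.21 or later, where `astype` accepts a `CategoricalDtype` instance such as `cat.dtype`; as far as I know, the `categories=` keyword form shown in this thread was dropped in later releases, so this is an alternative spelling and not what was run here.

```python
import pandas as pd

cat = pd.Categorical(["CA", "AL"])
df = pd.DataFrame(
    [["CA", "CA"], ["AL", "CA"]], index=["foo", "bar"], columns=range(2)
)

# cat.dtype is a CategoricalDtype carrying cat's categories.
shared_dtype = cat.dtype

# Cast every column to the same dtype so they share one category set.
for col in df.columns:
    df[col] = df[col].astype(shared_dtype)
```

Column 1 contains only 'CA', yet after the cast it still carries both categories, which is the point of pinning the dtype.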

@rafiqhasan

rafiqhasan commented Jun 30, 2018

A simple solution that always works for me:

    str(df.dtypes[colname]) == 'category'
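
The string comparison can be wrapped in a small helper. The name `is_categorical_column` and the sample frame are hypothetical, introduced only for this sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "state": pd.Series(["CA", "AL"], dtype="category"),
    "value": [1.0, 2.0],
})

def is_categorical_column(frame, colname):
    # Hypothetical helper: converting the dtype to a string first avoids
    # comparing a NumPy dtype with 'category', the comparison that raised
    # TypeError earlier in this thread.
    return str(frame.dtypes[colname]) == "category"

flags = {name: is_categorical_column(df, name) for name in df.columns}
```

Because both sides are plain strings, the check is safe for NumPy-backed columns such as float64.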
