BUG: astype(CategoricalDtype) has no effect #15078

Closed
arita37 opened this Issue Jan 7, 2017 · 9 comments

arita37 commented Jan 7, 2017

The following fails silently to convert the dtype:

In [14]: s = pd.Series(['a', 'b', 'c'])

In [15]: s
Out[15]: 
0    a
1    b
2    c
dtype: object

In [16]: s.astype(pd.types.dtypes.CategoricalDtype)
Out[16]: 
0    a
1    b
2    c
dtype: object

I would expect this either to work or to raise an error saying that the dtype is not understood.


Code Sample

When a dictionary of dtypes is passed to astype, the DataFrame is not converted.
The types are not modified, whereas df['col'] = df['col'].astype(type1) works.

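import numpy as np
import pandas

# df is an existing DataFrame with these columns (not shown in the report)
dtype0 = {
    'brand': np.dtype('int64'),
    'category': np.dtype('int64'),
    'chain': np.dtype('int64'),
    'company': np.dtype('int64'),
    'date': np.dtype('O'),
    'dept': pandas.types.dtypes.CategoricalDtype,  # the class itself, not an instance
    'id': np.dtype('int64'),
}
df = df.astype(dtype0)
df.dtypes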

Problem description

When a dictionary of dtypes is passed to astype, the DataFrame is not converted.
The types are not modified, whereas df['col'] = df['col'].astype(type1) works.

Expected Output

Columns are converted to the desired types.

Output of pd.show_versions()

Pandas 0.19.2

Member

jorisvandenbossche commented Jan 7, 2017

Can you show a reproducible example that shows the problem?

For me this works with a simple example:

In [1]: pd.__version__
Out[1]: '0.19.2'

In [2]: df = pd.DataFrame({'a':[1,2,3], 'b':['a','b', 'c']})

In [3]: df.dtypes
Out[3]: 
a     int64
b    object
dtype: object

In [4]:  df.astype({'a': 'float64', 'b': 'category'}).dtypes
Out[4]: 
a     float64
b    category
dtype: object

arita37 commented Jan 7, 2017

Your solution works.
It seems pandas.types.dtypes.CategoricalDtype is not recognized when casting,
so it is better to use 'category'.
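
For reference, a minimal sketch of the dictionary form with the 'category' string alias in place of the class. The DataFrame here is a hypothetical stand-in (only some of the column names come from the report), and it assumes a reasonably recent pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the reporter's actual DataFrame is not shown in the thread.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "dept": [10, 20, 10],
    "date": ["2017-01-01", "2017-01-02", "2017-01-03"],
})

dtype0 = {
    "id": np.dtype("int64"),
    "date": np.dtype("O"),
    "dept": "category",  # string alias instead of the CategoricalDtype class
}

converted = df.astype(dtype0)
print(converted.dtypes)
```

Columns left out of the dictionary keep their existing dtype, so only the columns that need converting have to be listed.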

@arita37 arita37 closed this Jan 7, 2017

@jorisvandenbossche jorisvandenbossche changed the title from Bug: dtypes conversion issue to BUG: astype(CategoricalDtype) has no effect Jan 7, 2017

Member

jorisvandenbossche commented Jan 7, 2017

Hmm, I think that should actually work, or otherwise raise an error. I reopened the issue and updated the top post with an example.

Contributor

jreback commented Jan 7, 2017

Maybe, though this is an internal type (and not directly exposed to the user).

Member

jorisvandenbossche commented Jan 7, 2017

Yeah, and it should not necessarily work for me, but if we don't accept it as a dtype, then it should raise an error IMO.
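
Until the library itself raises, a user-side guard along these lines could catch the mistake early. This is only a sketch: strict_astype is a hypothetical helper, not part of pandas, and it assumes a pandas version in which CategoricalDtype is public and subclasses ExtensionDtype.

```python
import pandas as pd
from pandas.api.extensions import ExtensionDtype


def strict_astype(obj, dtype):
    """Hypothetical wrapper around astype that rejects an uninstantiated extension dtype class."""
    if isinstance(dtype, type) and issubclass(dtype, ExtensionDtype):
        raise TypeError(
            f"got the class {dtype.__name__} itself; "
            f"pass an instance such as {dtype.__name__}() instead"
        )
    return obj.astype(dtype)


s = pd.Series(["a", "b", "c"])

# An instance is passed through to astype unchanged.
ok = strict_astype(s, pd.CategoricalDtype())

# The bare class is rejected instead of being silently ignored.
try:
    strict_astype(s, pd.CategoricalDtype)
    rejected = False
except TypeError:
    rejected = True
```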

Contributor

jreback commented Jan 7, 2017

Is this with our new default, errors='raise'?

Member

jorisvandenbossche commented Jan 7, 2017

It does not depend on that (and the default was already to raise, the keyword name changed but not the default value).

The reason is that the provided dtype is converted to a numpy dtype:

In [24]: np.dtype(list)
Out[24]: dtype('O')

In [25]: np.dtype(pd.types.dtypes.CategoricalDtype)
Out[25]: dtype('O')

For the same reason, something like s.astype(list) also runs without error but does nothing (for an object Series).

It can just be a bit confusing to users, I think, as an instantiated CategoricalDtype actually works:

In [26]: s.astype(pd.types.dtypes.CategoricalDtype())
Out[26]: 
0    a
1    b
2    c
dtype: category
Categories (3, object): [a, b, c]

In [27]: s.astype(pd.types.dtypes.CategoricalDtype)
Out[27]: 
0    a
1    b
2    c
dtype: object

which comes down to this difference:

In [28]: pd.types.common.is_categorical_dtype(pd.types.dtypes.CategoricalDtype)
Out[28]: False

In [29]: pd.types.common.is_categorical_dtype(pd.types.dtypes.CategoricalDtype())
Out[29]: True
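
The class-versus-instance difference can be sketched as follows. This assumes a later pandas in which the dtype is exposed as pd.CategoricalDtype (on 0.19.2 it lived under pd.types.dtypes); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c"])

# NumPy maps an arbitrary Python class to the generic object dtype,
# which is why passing a class can end up as a no-op on an object Series.
list_as_dtype = np.dtype(list)

# Both the string alias and an *instance* of the dtype request a real conversion.
as_alias = s.astype("category")
as_instance = s.astype(pd.CategoricalDtype())

print(list_as_dtype)
print(as_alias.dtype)
print(as_instance.dtype)
```

An isinstance check against pd.CategoricalDtype on the resulting dtype is one way to confirm that the conversion actually happened.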

arita37 commented Jan 8, 2017

Contributor

TomAugspurger commented Jan 9, 2017

We spent some time trying to find the real object behind 'category', as
this is not really mentioned in the documentation.

At the moment, the CategoricalDtype isn't part of the public API. There are plans to refactor it a bit before exposing it.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Aug 25, 2017

ENH: Parametrized CategoricalDtype
We extended the CategoricalDtype to accept optional categories and ordered
arguments.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True)
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are known ahead of time in other places, e.g. .astype, .read_csv, and the
Series constructor.

Closes #14711
Closes #15078
Closes #14676
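
As a rough illustration of what the commit message describes, the sketch below assumes a pandas version that includes this change (the issue was assigned to the 0.21.0 milestone). The sample values are made up.

```python
import pandas as pd

# Categories and orderedness are fixed ahead of time.
dtype = pd.CategoricalDtype(categories=["a", "b"], ordered=True)

# Values outside the declared categories become missing on conversion.
s = pd.Series(["a", "b", "c"]).astype(dtype)

# The same dtype object can be reused in the Series constructor.
from_ctor = pd.Series(["b", "a"], dtype=dtype)

print(s)
print(from_ctor.cat.categories)
```

With categories=None (the default), the categories are inferred from the data instead, which matches the behavior of the 'category' string alias.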

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Aug 30, 2017

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Aug 31, 2017

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 6, 2017

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 10, 2017

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 13, 2017

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 13, 2017

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 13, 2017

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 15, 2017

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 15, 2017

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 17, 2017

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 20, 2017

@jreback jreback added this to the 0.21.0 milestone Sep 23, 2017

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 23, 2017

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 23, 2017

@jreback jreback closed this in #16015 Sep 23, 2017

jreback added a commit that referenced this issue Sep 23, 2017

Categorical type (#16015)
Closes #14711
Closes #15078
Closes #14676

alanbato added a commit to alanbato/pandas that referenced this issue Nov 10, 2017

No-Stream added a commit to No-Stream/pandas that referenced this issue Nov 28, 2017
