Inconsistent treatment of categories in multi-level unstack #15239

Kevin-McIsaac · 2017-01-27T05:06:29Z

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.DataFrame({'Date': pd.date_range(start='2016/1/1', periods=3, freq='W'), 
                   'Station':['Kings Cross', 'Sydney', 'Newtown'], 
                   'Hours': ["1PM", "2PM", "3PM"], 
                   'Exit':[10, 30, 50], 'Entry':[0, 60, 20]})
df.Station = df.Station.astype('category')
df.Hours = df.Hours.astype('category')

df1= (df.set_index(['Date','Hours', 'Station'])[['Entry', 'Exit']].
                 unstack(['Hours', 'Station']).
                   stack(['Hours', 'Station'], dropna=False).
           reset_index())
print(df1.info())

assert df.Hours.dtype == df1.Hours.dtype, "Hours dtype has changed" 
assert df.Station.dtype == df1.Station.dtype, "Station dtype has changed"

Problem description

Hours and Station are both categorical. After the unstack/stack round trip, Hours remains categorical but Station becomes object.

If I switch the order of Station and Hours in the unstack call, Station remains categorical and Hours becomes object.

Expected Output

Hours and Station both remain categorical (preferred), or both become the underlying type.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.41-36.55.amzn1.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: None


jorisvandenbossche · 2017-01-28T22:01:30Z

@Kevin-McIsaac Thanks for the report, that sounds like a bug.

From a quick look, the categorical is lost in the stack call.

jorisvandenbossche · 2017-01-28T22:15:22Z

@Kevin-McIsaac BTW, another way to do this instead of using the sequential unstacking/stacking is to reindex with the desired index (which you can construct with MultiIndex.from_product):

In [45]: temp = df.set_index(['Date', 'Hours', 'Station'])

In [46]: temp.reindex(pd.MultiIndex.from_product(temp.index.levels, names=['Date', 'Hour', 'Stations'])).reset_index()
Out[46]: 
         Date Hour     Stations  Entry  Exit
0  2016-01-03  1PM  Kings Cross    0.0  10.0
1  2016-01-03  1PM      Newtown    NaN   NaN
2  2016-01-03  1PM       Sydney    NaN   NaN
3  2016-01-03  2PM  Kings Cross    NaN   NaN
4  2016-01-03  2PM      Newtown    NaN   NaN
5  2016-01-03  2PM       Sydney    NaN   NaN
6  2016-01-03  3PM  Kings Cross    NaN   NaN
7  2016-01-03  3PM      Newtown    NaN   NaN
8  2016-01-03  3PM       Sydney    NaN   NaN
9  2016-01-10  1PM  Kings Cross    NaN   NaN
10 2016-01-10  1PM      Newtown    NaN   NaN
11 2016-01-10  1PM       Sydney    NaN   NaN
12 2016-01-10  2PM  Kings Cross    NaN   NaN
13 2016-01-10  2PM      Newtown    NaN   NaN
14 2016-01-10  2PM       Sydney   60.0  30.0
15 2016-01-10  3PM  Kings Cross    NaN   NaN
16 2016-01-10  3PM      Newtown    NaN   NaN
17 2016-01-10  3PM       Sydney    NaN   NaN
18 2016-01-17  1PM  Kings Cross    NaN   NaN
19 2016-01-17  1PM      Newtown    NaN   NaN
20 2016-01-17  1PM       Sydney    NaN   NaN
21 2016-01-17  2PM  Kings Cross    NaN   NaN
22 2016-01-17  2PM      Newtown    NaN   NaN
23 2016-01-17  2PM       Sydney    NaN   NaN
24 2016-01-17  3PM  Kings Cross    NaN   NaN
25 2016-01-17  3PM      Newtown   20.0  50.0
26 2016-01-17  3PM       Sydney    NaN   NaN

In [47]: temp.reindex(pd.MultiIndex.from_product(temp.index.levels, names=['Date', 'Hour', 'Stations'])).reset_index().dtypes
Out[47]: 
Date        datetime64[ns]
Hour              category
Stations          category
Entry              float64
Exit               float64
dtype: object
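
For anyone who wants to run the reindex approach end to end, here is a minimal self-contained sketch. It rebuilds the frame from the original report and reuses the existing index names instead of renaming the levels (that choice, and the variable names `full_index` and `result`, are only for illustration). It assumes a pandas version where `MultiIndex.from_product` keeps categorical levels, as in the output above.

```python
import pandas as pd

# Same data as in the original report, with the two key columns categorical.
df = pd.DataFrame({
    "Date": pd.date_range(start="2016-01-01", periods=3, freq="W"),
    "Station": pd.Categorical(["Kings Cross", "Sydney", "Newtown"]),
    "Hours": pd.Categorical(["1PM", "2PM", "3PM"]),
    "Exit": [10, 30, 50],
    "Entry": [0, 60, 20],
})

temp = df.set_index(["Date", "Hours", "Station"])

# Cartesian product of the existing index levels; the categorical levels
# are passed through as they are, so no stack/unstack round trip is needed.
full_index = pd.MultiIndex.from_product(temp.index.levels, names=temp.index.names)

# Rows missing from the original frame are filled with NaN.
result = temp.reindex(full_index).reset_index()
print(result.dtypes)
```

The result has one row per Date/Hours/Station combination, and only the three combinations present in the original frame carry non-missing values.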

jreback · 2017-01-28T22:18:10Z

this is a dupe of #14018

jorisvandenbossche · 2017-01-28T22:18:11Z

A smaller reproducible example:

Stacking one categorical level keeps the categorical intact:

In [48]: midx = pd.MultiIndex.from_arrays([['A']*2 + ['B']*2, 
                                           pd.Categorical(list('abab'))])

In [49]: df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=midx)

In [50]: df.stack()
Out[50]: 
     A  B
0 a  0  2
  b  1  3
1 a  4  6
  b  5  7

In [51]: df.stack().index.get_level_values(1)
Out[51]: CategoricalIndex(['a', 'b', 'a', 'b'], categories=['a', 'b'], ordered=False, dtype='category')

Stacking two categorical levels: the second becomes a plain index:

In [52]: midx = pd.MultiIndex.from_arrays([['A']*2 + ['B']*2, 
                                           pd.Categorical(list('abab')), 
                                           pd.Categorical(list('ccdd'))])

In [53]: df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=midx)

In [54]: df.stack([1,2])
Out[54]: 
         A    B
0 a c  0.0  NaN
    d  NaN  2.0
  b c  1.0  NaN
    d  NaN  3.0
1 a c  4.0  NaN
    d  NaN  6.0
  b c  5.0  NaN
    d  NaN  7.0

In [55]: df.stack([1,2]).index.get_level_values(1)
Out[55]: CategoricalIndex(['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b'], categories=['a', 'b'], ordered=False, dtype='category')

In [56]: df.stack([1,2]).index.get_level_values(2)
Out[56]: Index(['c', 'd', 'c', 'd', 'c', 'd', 'c', 'd'], dtype='object')

jreback · 2017-01-28T22:18:43Z

@jorisvandenbossche it's the .unstack

jorisvandenbossche · 2017-01-28T22:19:13Z

@jreback the other issue is about the values; here it is the index itself that loses the categorical, see my example in the comment above

Kevin-McIsaac · 2017-01-29T00:19:46Z

@jorisvandenbossche thanks for sharing a much simpler way to get the same result. I learnt something useful.

mroeschke · 2021-05-08T00:41:41Z

This looks to be correct on master now. Could use a test

In [30]: In [52]: midx = pd.MultiIndex.from_arrays([['A']*2 + ['B']*2,
    ...:                                            pd.Categorical(list('abab')),
    ...:                                            pd.Categorical(list('ccdd'))])
    ...:
    ...: In [53]: df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=midx)
    ...:
    ...: In [54]: df.stack([1,2])
Out[30]:
         A    B
0 a c  0.0  NaN
    d  NaN  2.0
  b c  1.0  NaN
    d  NaN  3.0
1 a c  4.0  NaN
    d  NaN  6.0
  b c  5.0  NaN
    d  NaN  7.0

In [31]: df.stack([1,2]).index.get_level_values(2)
Out[31]: CategoricalIndex(['c', 'd', 'c', 'd', 'c', 'd', 'c', 'd'], categories=['c', 'd'], ordered=False, dtype='category')
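
A regression test along these lines might look like the sketch below. This is only an illustration of the check, not necessarily the test that was eventually added; the helper name `stacked_level_dtypes` and the test name are hypothetical. It assumes a pandas version where the behaviour shown above is already fixed, and it silences the `FutureWarning` that some pandas 2.x releases emit for `stack`.

```python
import warnings

import numpy as np
import pandas as pd


def stacked_level_dtypes():
    """Stack two categorical column levels and return their index dtypes."""
    midx = pd.MultiIndex.from_arrays([
        ["A"] * 2 + ["B"] * 2,
        pd.Categorical(list("abab")),
        pd.Categorical(list("ccdd")),
    ])
    df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=midx)
    with warnings.catch_warnings():
        # Some pandas 2.x releases warn about the upcoming stack implementation.
        warnings.simplefilter("ignore", FutureWarning)
        stacked = df.stack([1, 2])
    # Levels 1 and 2 of the result index are the two stacked column levels.
    return [stacked.index.get_level_values(i).dtype for i in (1, 2)]


def test_multi_level_stack_keeps_categorical():
    # Both stacked levels should stay categorical, not only the first one.
    for dtype in stacked_level_dtypes():
        assert isinstance(dtype, pd.CategoricalDtype)
```

The check looks only at the dtypes of the stacked levels, so it does not depend on how missing combinations are filled or ordered in the result.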

jorisvandenbossche added the Bug, Categorical, and Reshaping labels Jan 28, 2017

jreback closed this as completed Jan 28, 2017

jorisvandenbossche reopened this Jan 28, 2017

jorisvandenbossche added this to the Next Major Release milestone Jan 28, 2017

mroeschke added the good first issue and Needs Tests labels and removed the Bug, Categorical, and Reshaping labels May 8, 2021

mroeschke mentioned this issue May 15, 2021

TST: Add tests for old issues 2 #41493

Merged

9 tasks

mroeschke modified the milestones: Contributions Welcome, 1.3 May 16, 2021

jreback closed this as completed in #41493 May 19, 2021
