Inconsistent treatment of categories in multi-level unstack #15239

Closed
Kevin-McIsaac opened this issue Jan 27, 2017 · 8 comments · Fixed by #41493
Labels: good first issue, Needs Tests (unit test(s) needed to prevent regressions)

Comments

@Kevin-McIsaac

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.DataFrame({'Date': pd.date_range(start='2016/1/1', periods=3, freq='W'), 
                   'Station':['Kings Cross', 'Sydney', 'Newtown'], 
                   'Hours': ["1PM", "2PM", "3PM"], 
                   'Exit':[10, 30, 50], 'Entry':[0, 60, 20]})
df.Station = df.Station.astype('category')
df.Hours = df.Hours.astype('category')

df1 = (
    df.set_index(['Date', 'Hours', 'Station'])[['Entry', 'Exit']]
    .unstack(['Hours', 'Station'])
    .stack(['Hours', 'Station'], dropna=False)
    .reset_index()
)
print(df1.info())

assert df.Hours.dtype == df1.Hours.dtype, "Hours dtype has changed" 
assert df.Station.dtype == df1.Station.dtype, "Station dtype has changed" 

Problem description

Hours and Station are both categorical. After the unstack/stack round trip, Hours remains categorical but Station becomes object dtype.

If I switch the order of Station and Hours in the unstack call, Station remains categorical and Hours becomes object dtype.

Expected Output

Hours and Station both remain categorical (preferred), or both become their underlying type.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.41-36.55.amzn1.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: None

@jorisvandenbossche
Member

@Kevin-McIsaac Thanks for the report, that sounds like a bug.

From a quick look, the categorical is lost in the stack call.

@jorisvandenbossche jorisvandenbossche added the Bug, Categorical (Categorical Data Type), and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels Jan 28, 2017
@jorisvandenbossche
Member

@Kevin-McIsaac BTW, another way to do this instead of using the sequential unstacking/stacking is to reindex with the desired index (which you can construct with MultiIndex.from_product):

In [45]: temp = df.set_index(['Date', 'Hours', 'Station'])

In [46]: temp.reindex(pd.MultiIndex.from_product(temp.index.levels, names=['Date', 'Hour', 'Stations'])).reset_index()
Out[46]: 
         Date Hour     Stations  Entry  Exit
0  2016-01-03  1PM  Kings Cross    0.0  10.0
1  2016-01-03  1PM      Newtown    NaN   NaN
2  2016-01-03  1PM       Sydney    NaN   NaN
3  2016-01-03  2PM  Kings Cross    NaN   NaN
4  2016-01-03  2PM      Newtown    NaN   NaN
5  2016-01-03  2PM       Sydney    NaN   NaN
6  2016-01-03  3PM  Kings Cross    NaN   NaN
7  2016-01-03  3PM      Newtown    NaN   NaN
8  2016-01-03  3PM       Sydney    NaN   NaN
9  2016-01-10  1PM  Kings Cross    NaN   NaN
10 2016-01-10  1PM      Newtown    NaN   NaN
11 2016-01-10  1PM       Sydney    NaN   NaN
12 2016-01-10  2PM  Kings Cross    NaN   NaN
13 2016-01-10  2PM      Newtown    NaN   NaN
14 2016-01-10  2PM       Sydney   60.0  30.0
15 2016-01-10  3PM  Kings Cross    NaN   NaN
16 2016-01-10  3PM      Newtown    NaN   NaN
17 2016-01-10  3PM       Sydney    NaN   NaN
18 2016-01-17  1PM  Kings Cross    NaN   NaN
19 2016-01-17  1PM      Newtown    NaN   NaN
20 2016-01-17  1PM       Sydney    NaN   NaN
21 2016-01-17  2PM  Kings Cross    NaN   NaN
22 2016-01-17  2PM      Newtown    NaN   NaN
23 2016-01-17  2PM       Sydney    NaN   NaN
24 2016-01-17  3PM  Kings Cross    NaN   NaN
25 2016-01-17  3PM      Newtown   20.0  50.0
26 2016-01-17  3PM       Sydney    NaN   NaN

In [47]: temp.reindex(pd.MultiIndex.from_product(temp.index.levels, names=['Date', 'Hour', 'Stations'])).reset_index().dtypes
Out[47]: 
Date        datetime64[ns]
Hour              category
Stations          category
Entry              float64
Exit               float64
dtype: object
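
The reindex approach above can be wrapped in a small helper. This is only a sketch: `expand_to_full_grid` is a hypothetical name, not a pandas function, and unlike the session above it keeps the original level names (`Hours`, `Station`) so the dtypes can be compared directly against the input frame.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.date_range(start="2016-01-01", periods=3, freq="W"),
    "Station": ["Kings Cross", "Sydney", "Newtown"],
    "Hours": ["1PM", "2PM", "3PM"],
    "Exit": [10, 30, 50],
    "Entry": [0, 60, 20],
})
df["Station"] = df["Station"].astype("category")
df["Hours"] = df["Hours"].astype("category")


def expand_to_full_grid(frame, keys):
    """Hypothetical helper: reindex `frame` onto every combination of its key levels."""
    indexed = frame.set_index(keys)
    # Build the Cartesian product of the existing index levels, keeping their names.
    full_index = pd.MultiIndex.from_product(indexed.index.levels, names=list(keys))
    # Combinations missing from the data are filled with NaN.
    return indexed.reindex(full_index).reset_index()


result = expand_to_full_grid(df, ["Date", "Hours", "Station"])
print(result.dtypes)
```

The value columns become float64 because the missing combinations are filled with NaN, as in the output above.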

@jreback
Contributor

jreback commented Jan 28, 2017

this is a dupe of #14018

@jreback jreback closed this as completed Jan 28, 2017
@jorisvandenbossche
Member

jorisvandenbossche commented Jan 28, 2017

A smaller reproducible example:

Stacking one categorical level keeps the categorical intact:

In [48]: midx = pd.MultiIndex.from_arrays([['A']*2 + ['B']*2, 
                                           pd.Categorical(list('abab'))])

In [49]: df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=midx)

In [50]: df.stack()
Out[50]: 
     A  B
0 a  0  2
  b  1  3
1 a  4  6
  b  5  7

In [51]: df.stack().index.get_level_values(1)
Out[51]: CategoricalIndex(['a', 'b', 'a', 'b'], categories=['a', 'b'], ordered=False, dtype='category')

Stacking two categorical levels: the second becomes a plain index:

In [52]: midx = pd.MultiIndex.from_arrays([['A']*2 + ['B']*2, 
                                           pd.Categorical(list('abab')), 
                                           pd.Categorical(list('ccdd'))])

In [53]: df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=midx)

In [54]: df.stack([1,2])
Out[54]: 
         A    B
0 a c  0.0  NaN
    d  NaN  2.0
  b c  1.0  NaN
    d  NaN  3.0
1 a c  4.0  NaN
    d  NaN  6.0
  b c  5.0  NaN
    d  NaN  7.0

In [55]: df.stack([1,2]).index.get_level_values(1)
Out[55]: CategoricalIndex(['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b'], categories=['a', 'b'], ordered=False, dtype='category')

In [56]: df.stack([1,2]).index.get_level_values(2)
Out[56]: Index(['c', 'd', 'c', 'd', 'c', 'd', 'c', 'd'], dtype='object')

@jreback
Contributor

jreback commented Jan 28, 2017

@jorisvandenbossche it's the .unstack

@jorisvandenbossche
Member

jorisvandenbossche commented Jan 28, 2017

@jreback the other issue is about the values; here it is the index itself. See my example in the comment above.

@jorisvandenbossche jorisvandenbossche added this to the Next Major Release milestone Jan 28, 2017
@Kevin-McIsaac
Author

@jorisvandenbossche thanks for sharing a much simpler way to get the same result. I learnt something useful.

@mroeschke
Member

This looks to be correct on master now. Could use a test

In [30]: In [52]: midx = pd.MultiIndex.from_arrays([['A']*2 + ['B']*2,
    ...:                                            pd.Categorical(list('abab')),
    ...:                                            pd.Categorical(list('ccdd'))])
    ...:
    ...: In [53]: df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=midx)
    ...:
    ...: In [54]: df.stack([1,2])
Out[30]:
         A    B
0 a c  0.0  NaN
    d  NaN  2.0
  b c  1.0  NaN
    d  NaN  3.0
1 a c  4.0  NaN
    d  NaN  6.0
  b c  5.0  NaN
    d  NaN  7.0

In [31]: df.stack([1,2]).index.get_level_values(2)
Out[31]: CategoricalIndex(['c', 'd', 'c', 'd', 'c', 'd', 'c', 'd'], categories=['c', 'd'], ordered=False, dtype='category')
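
A regression test along these lines might look like the sketch below. It is not necessarily the test that was merged for this issue; the function name is hypothetical, and it assumes a pandas version that includes the fix (the milestone on this thread is 1.3). It checks only the index level types and categories, not the row order or the values.

```python
import numpy as np
import pandas as pd


def test_stack_multiple_categorical_levels():
    # Hypothetical test name; the columns mirror the example in this thread.
    midx = pd.MultiIndex.from_arrays(
        [
            ["A"] * 2 + ["B"] * 2,
            pd.Categorical(list("abab")),
            pd.Categorical(list("ccdd")),
        ]
    )
    df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=midx)

    result = df.stack([1, 2])

    # Both stacked levels should come back as categorical index levels.
    for level, categories in [(1, ["a", "b"]), (2, ["c", "d"])]:
        values = result.index.get_level_values(level)
        assert isinstance(values, pd.CategoricalIndex)
        assert list(values.categories) == categories
```

Run under pytest, this fails on versions where the second stacked level degrades to a plain object Index, as in the earlier output.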

@mroeschke mroeschke added the good first issue and Needs Tests (unit test(s) needed to prevent regressions) labels and removed the Bug, Categorical (Categorical Data Type), and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels May 8, 2021
@mroeschke mroeschke modified the milestones: Contributions Welcome, 1.3 May 16, 2021