Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pivot_table over Categorical columns #15193

Closed
Kevin-McIsaac opened this issue Jan 22, 2017 · 4 comments

Comments

Projects
None yet
3 participants
@Kevin-McIsaac
Copy link

commented Jan 22, 2017

Code Sample, a copy-pastable example if possible

stations = ['Kings Cross Station', 'Newtown Station', 'Parramatta Station',
                 'Town Hall Station', 'Central Station', 'Circular Quay Station', 
                 'Martin Place Station', 'Museum Station', 'St James Station', 
                 'Bondi Junction Station', 'North Sydney Station']

df = pd.DataFrame({'Station': ['Kings Cross Station', 'Newtown Station', 'Parramatta Station',
                               'Kings Cross Station', 'Newtown Station', 'Parramatta Station',
                               'Kings Cross Station', 'Newtown Station', 'Parramatta Station'],
                   'Date': pd.DatetimeIndex(['1/1/2017', '1/1/2017', '1/1/2017',
                                             '2/1/2017', '2/1/2017', '2/1/2017',
                                             '3/1/2017', '3/1/2017', '3/1/2017',]),
                   'Exit': range(0, 9)})

df.Station = df.Station.astype('category', ordered=True, categories=stations)
df.pivot_table(index = 'Date', columns= 'Station', values = 'Exit', dropna=True)

Problem description

When the column is a Categorical the output of pivot_table

  1. Includes columns that are all NaN which should not be the case as dropna=True
  2. Includes columns for categories that aren't in the input Data Frame, which is strange.
  3. Is not the same as when the output before the column is converted to a categorical

This was not the behaviour in earlier versions (18?)

Expected Output

Should be the same as the output when the column is not a categorical

Station Kings Cross Station Newtown Station Parramatta Station
Date
2017-01-01 0 1 2
2017-02-01 3 4 5
2017-03-01 6 7 8

Output of pd.show_versions()

# Paste the output here pd.show_versions() here INSTALLED VERSIONS ------------------ commit: None python: 3.5.2.final.0 python-bits: 64 OS: Linux OS-release: 4.4.41-36.55.amzn1.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: None

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 23, 2017

we had this exact discussion here: #12298

with a categorical

In[1]: df.pivot_table(index = 'Date', columns= 'Station', values = 'Exit', dropna=True)
Out[1]: 
Station     Kings Cross Station  Newtown Station  Parramatta Station  Town Hall Station  Central Station  Circular Quay Station  Martin Place Station  Museum Station  St James Station  \
Date                                                                                                                                                                                      
2017-01-01                  0.0              1.0                 2.0                NaN              NaN                    NaN                   NaN             NaN               NaN   
2017-02-01                  3.0              4.0                 5.0                NaN              NaN                    NaN                   NaN             NaN               NaN   
2017-03-01                  6.0              7.0                 8.0                NaN              NaN                    NaN                   NaN             NaN               NaN   

Station     Bondi Junction Station  North Sydney Station  
Date                                                      
2017-01-01                     NaN                   NaN  
2017-02-01                     NaN                   NaN  
2017-03-01                     NaN                   NaN  

as object

In [4]: df.assign(Station = df.Station.astype(object)).pivot_table(index = 'Date', columns= 'Station', values = 'Exit', dropna=True)
Out[4]: 
Station     Kings Cross Station  Newtown Station  Parramatta Station
Date                                                                
2017-01-01                    0                1                   2
2017-02-01                    3                4                   5
2017-03-01                    6                7                   8

so responding to your points

  1. Includes columns that are all NaN which should not be the case as dropna=True

I suppose this is a bug as this is the point of dropna , so this appears not to be working, e.g.
is the expected result

In [14]: df.pivot_table(index = 'Date', columns= 'Station', values = 'Exit').dropna(how='all', axis=1)
Out[14]: 
Station     Kings Cross Station  Newtown Station  Parramatta Station
Date                                                                
2017-01-01                  0.0              1.0                 2.0
2017-02-01                  3.0              4.0                 5.0
2017-03-01                  6.0              7.0                 8.0
  1. Includes columns for categories that aren't in the input Data Frame, which is strange.

this is by definition and the point of a categorical, you get all of the categories. furthermore these are specifically ordered.

  1. Is not the same as when the output before the column is converted to a categorical

not sure what you mean here

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 23, 2017

So I think that we can mark this as a bug w.r.t. point 1 above, IOW, dropna is not doing anything here as its applied to drop the all-na columns.

@jorisvandenbossche

@verhalenn

This comment has been minimized.

Copy link

commented Feb 23, 2017

Would you be opposed to a simple

if isinstance(table, DataFrame) and dropna: 
    table = table.dropna(how='all', axis=1)

added to the end of the function?

It passes all current tests.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Feb 23, 2017

@verhalenn that might work. of course need new tests as well.

@verhalenn verhalenn referenced this issue Feb 26, 2017

Closed

BUG: pivot_table over Categorical Columns #15511

3 of 3 tasks complete

@jreback jreback closed this in ed2a2e4 Mar 4, 2017

AnkurDedania added a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017

BUG: pivot_table over Categorical Columns
closes pandas-dev#15193

Author: Nicholas Ver Halen <verhalenn@yahoo.com>

Closes pandas-dev#15511 from verhalenn/issue15193 and squashes the following commits:

bf0fdeb [Nicholas Ver Halen] Added description to code change.
adf8616 [Nicholas Ver Halen] Added whatsnew for issue 15193
a643267 [Nicholas Ver Halen] Added test for issue 15193
d605251 [Nicholas Ver Halen] Made sure pivot_table propped na columns
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.