New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test/fix case where use of "All" causes problems in pivot_table with margins=True #3335

Closed
wesm opened this Issue Apr 13, 2013 · 7 comments

Comments

Projects
None yet
4 participants
@wesm
Member

wesm commented Apr 13, 2013

No description provided.

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Mar 11, 2014

@jreback jreback modified the milestone: 0.16.0 Jan 26, 2015

@TheInan

This comment has been minimized.

Show comment
Hide comment
@TheInan

TheInan Mar 30, 2015

I'd like to take a stab at this if no one minds

TheInan commented Mar 30, 2015

I'd like to take a stab at this if no one minds

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Mar 30, 2015

Contributor

go for it!

Contributor

jreback commented Mar 30, 2015

go for it!

@TheInan

This comment has been minimized.

Show comment
Hide comment
@TheInan

TheInan Apr 27, 2015

Would it be an appropriate solution to have any instances of "All" used as an index or column changed to "All_" in the original dataframe and the resulting pivot table so that having a user use "All" as a column name will not cause issues anymore?

TheInan commented Apr 27, 2015

Would it be an appropriate solution to have any instances of "All" used as an index or column changed to "All_" in the original dataframe and the resulting pivot table so that having a user use "All" as a column name will not cause issues anymore?

@shoyer

This comment has been minimized.

Show comment
Hide comment
@shoyer

shoyer Apr 27, 2015

Member

I'm not sure exactly what "problems" are caused by this issue, but generally speaking it's a better idea to raise an exception and force users to take some fallback action themselves rather than silently proceeding with some guess about user intent.

Member

shoyer commented Apr 27, 2015

I'm not sure exactly what "problems" are caused by this issue, but generally speaking it's a better idea to raise an exception and force users to take some fallback action themselves rather than silently proceeding with some guess about user intent.

@TheInan

This comment has been minimized.

Show comment
Hide comment
@TheInan

TheInan Apr 27, 2015

Alright then, I think I'll just scan the DataFrame for any instances of All being used as a value or column title, and if there is one, I'll raise an exception. The problem was that when margins was passed in as true, columns with the title "All" were added to the dataframe, and then add a later point, instances of "All" columns had aggregates data filled into them, so if the original data used "All" it was being replaced I think

TheInan commented Apr 27, 2015

Alright then, I think I'll just scan the DataFrame for any instances of All being used as a value or column title, and if there is one, I'll raise an exception. The problem was that when margins was passed in as true, columns with the title "All" were added to the dataframe, and then add a later point, instances of "All" columns had aggregates data filled into them, so if the original data used "All" it was being replaced I think

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Apr 28, 2015

Contributor

need to define what is the incorrect behavior here first.

Contributor

jreback commented Apr 28, 2015

need to define what is the incorrect behavior here first.

@jreback jreback added this to the Next Major Release milestone Apr 28, 2015

@TheInan

This comment has been minimized.

Show comment
Hide comment
@TheInan

TheInan May 2, 2015

So this is what I have so far showing the issue. What happens is that when aggregating the data, as the script is going through the columns, it looks for All columns to aggregate data under. When "foo" was changed to "All" in the next DataFrame, the output is clearly nothing like that of the first pivot table. As a result, its probably most appropriate to raise an error when "All" is being used as a data point since it would be more trouble than its worth to rewrite the code around just this corner case.

import pandas as pd
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                    'B' : ['A', 'B', 'C'] * 4,
                    'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                    'D' : [1,2,3,4,5,6,7,8,9,10,11,12],
                    'E' : [1,2,3,4,5,6,7,8,9,10,11,12]})
df
A B C D E
0 one A foo 1 1
1 one B foo 2 2
2 two C foo 3 3
3 three A bar 4 4
4 one B bar 5 5
5 one C bar 6 6
6 two A foo 7 7
7 three B foo 8 8
8 one C foo 9 9
9 one A bar 10 10
10 two B bar 11 11
11 three C bar 12 12
pd.pivot_table(df, values= 'D', index=['A','B'], columns=['C'], margins=True)
C bar foo All
A B
one A 10 1 5.5
B 5 2 3.5
C 6 9 7.5
three A 4 NaN 4.0
B NaN 8 8.0
C 12 NaN 12.0
two A NaN 7 7.0
B 11 NaN 11.0
C NaN 3 3.0
All 8 5 6.5
df2 = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                    'B' : ['A', 'B', 'C'] * 4,
                    'C' : ['All', 'All', 'All', 'bar', 'bar', 'bar'] * 2,
                    'D' : [1,2,3,4,5,6,7,8,9,10,11,12],
                    'E' : [1,2,3,4,5,6,7,8,9,10,11,12]})


df2
A B C D E
0 one A All 1 1
1 one B All 2 2
2 two C All 3 3
3 three A bar 4 4
4 one B bar 5 5
5 one C bar 6 6
6 two A All 7 7
7 three B All 8 8
8 one C All 9 9
9 one A bar 10 10
10 two B bar 11 11
11 three C bar 12 12
pd.pivot_table(df2, values= 'D', index=['A','B'], columns=['C'], margins=True)
C All bar
A B
one A 5.5 10
B 3.5 5
C 7.5 6
three A 4.0 4
B 8.0 NaN
C 12.0 12
two A 7.0 NaN
B 11.0 11
C 3.0 NaN
All 6.0 8

TheInan commented May 2, 2015

So this is what I have so far showing the issue. What happens is that when aggregating the data, as the script is going through the columns, it looks for All columns to aggregate data under. When "foo" was changed to "All" in the next DataFrame, the output is clearly nothing like that of the first pivot table. As a result, its probably most appropriate to raise an error when "All" is being used as a data point since it would be more trouble than its worth to rewrite the code around just this corner case.

import pandas as pd
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                    'B' : ['A', 'B', 'C'] * 4,
                    'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                    'D' : [1,2,3,4,5,6,7,8,9,10,11,12],
                    'E' : [1,2,3,4,5,6,7,8,9,10,11,12]})
df
A B C D E
0 one A foo 1 1
1 one B foo 2 2
2 two C foo 3 3
3 three A bar 4 4
4 one B bar 5 5
5 one C bar 6 6
6 two A foo 7 7
7 three B foo 8 8
8 one C foo 9 9
9 one A bar 10 10
10 two B bar 11 11
11 three C bar 12 12
pd.pivot_table(df, values= 'D', index=['A','B'], columns=['C'], margins=True)
C bar foo All
A B
one A 10 1 5.5
B 5 2 3.5
C 6 9 7.5
three A 4 NaN 4.0
B NaN 8 8.0
C 12 NaN 12.0
two A NaN 7 7.0
B 11 NaN 11.0
C NaN 3 3.0
All 8 5 6.5
df2 = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                    'B' : ['A', 'B', 'C'] * 4,
                    'C' : ['All', 'All', 'All', 'bar', 'bar', 'bar'] * 2,
                    'D' : [1,2,3,4,5,6,7,8,9,10,11,12],
                    'E' : [1,2,3,4,5,6,7,8,9,10,11,12]})


df2
A B C D E
0 one A All 1 1
1 one B All 2 2
2 two C All 3 3
3 three A bar 4 4
4 one B bar 5 5
5 one C bar 6 6
6 two A All 7 7
7 three B All 8 8
8 one C All 9 9
9 one A bar 10 10
10 two B bar 11 11
11 three C bar 12 12
pd.pivot_table(df2, values= 'D', index=['A','B'], columns=['C'], margins=True)
C All bar
A B
one A 5.5 10
B 3.5 5
C 7.5 6
three A 4.0 4
B 8.0 NaN
C 12.0 12
two A 7.0 NaN
B 11.0 11
C 3.0 NaN
All 6.0 8

lexual added a commit to lexual/pandas that referenced this issue Jun 6, 2015

ENH: pandas-dev#3335 Pivot table support for setting name of margins …
…column.

ref pandas-dev#3335.

Adds margin_column parameter to pivot_table so that user can set it to
something other than 'All'.
Raises ValueError exception if there is a conflict between the value of
margin_column and one of the other values appearing in the indices of
the pivot table.

lexual added a commit to lexual/pandas that referenced this issue Nov 15, 2015

ENH: pandas-dev#3335 Pivot table support for setting name of margins …
…column.

ref pandas-dev#3335.

Adds margin_name parameter to pivot_table so that user can set it to
something other than 'All'.
Raises ValueError exception if there is a conflict between the value of
margin_column and one of the other values appearing in the indices of
the pivot table.

@jreback jreback modified the milestones: 0.17.1, Next Major Release Nov 15, 2015

jreback added a commit that referenced this issue Nov 15, 2015

Merge pull request #11581 from lexual/issue_3335_pivot_handle_all_for…
…_margins

ENH: #3335 Pivot table support for setting name of margins column.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment