'groupby' multiple columns and 'sum' multiple columns with different types #13821

pmckelvy1 · 2016-07-27T15:45:39Z

Code Sample, a copy-pastable example if possible

from decimal import Decimal
import pandas as pd
df = pd.DataFrame(
                  {'name': ['foo', 'bar', 'foo', 'bar'], 
                   'title': ['boo', 'far', 'boo', 'far'], 
                   'id': [123, 456, 123, 456], 
                   'int_column': [1, 2, 3, 4], 
                   'dec_column1': [Decimal('0.50'), Decimal('0.15'), Decimal('0.25'), Decimal('0.40')], 
                   'dec_column2': [Decimal('0.20'), Decimal('0.30'), Decimal('0.55'), Decimal('0.60')]
                  },
                  columns=['name','title','id','int_column','dec_column1','dec_column2']
                 )
df.groupby(['name', 'title', 'id'], as_index=False).sum()

Expected Output

I have a DataFrame like the one in the code sample above, with multiple rows that share the same name, title, and id but have different values for the three number columns (int_column, dec_column1, dec_column2).

int_column == column of integers
dec_column1 == column of decimals
dec_column2 == column of decimals
I would like to group by the first three columns and sum the last three. I would expect to be able to do the following:

df = df.groupby(['name', 'title', 'id'], as_index=False).sum()

However, the only column that gets summed and ends up in the final DataFrame is int_column.

If I explicitly name the columns, I can get the statement to target the decimal columns, either on their own or together:

df = df.groupby(['name', 'title', 'id'], as_index=False)['dec_column1'].sum()
returns...
| name | title | id | dec_column1 |
and...
df = df.groupby(['name', 'title', 'id'], as_index=False)['dec_column1', 'dec_column2'].sum()
returns...
| name | title | id | dec_column1 | dec_column2 |
However, as soon as the integer column is included, in any position...
df = df.groupby(['name', 'title', 'id'], as_index=False)['dec_column1', 'dec_column2', 'int_column'].sum()
or...
df = df.groupby(['name', 'title', 'id'], as_index=False)['dec_column1', 'int_column', 'dec_column2'].sum()
or...
df = df.groupby(['name', 'title', 'id'], as_index=False)['int_column', 'dec_column1', 'dec_column2'].sum()
returns...
| name | title | id | int_column |

output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.15.2
nose: 1.3.7
Cython: 0.22.1
numpy: 1.11.1
scipy: None
statsmodels: None
IPython: 5.0.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.6.1
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: 2.3.5
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: 0.7.5.None
psycopg2: 2.5.5 (dt dec pq3 ext)


JoaoAparicio · 2016-07-27T18:45:12Z

Cleaning up code sample a bit:

from decimal import Decimal
import pandas as pd
df = pd.DataFrame(
                  {'name': ['foo', 'bar', 'foo', 'bar'], 
                   'title': ['boo', 'far', 'boo', 'far'], 
                   'id': [123, 456, 123, 456], 
                   'int_column': [1, 2, 3, 4], 
                   'dec_column1': [Decimal('0.50'), Decimal('0.15'), Decimal('0.25'), Decimal('0.40')], 
                   'dec_column2': [Decimal('0.20'), Decimal('0.30'), Decimal('0.55'), Decimal('0.60')]
                  },
                  columns=['name','title','id','int_column','dec_column1','dec_column2']
                 )
df.groupby(['name', 'title', 'id'], as_index=False).sum()

TomAugspurger · 2016-07-27T19:31:28Z

@JoaoAparicio thanks, I'll edit that into the original.

Slightly related to #13157, since it's a Decimal issue. In general, support for Decimal types is hit or miss. I'm assuming the Decimal columns get excluded as non-numeric columns before any aggregation occurs. You can see this because operating on just that column seems to work:

In [21]: df.groupby(['name', 'title', 'id']).dec_column1.sum()
Out[21]:
name  title  id
bar   far    456    0.55
foo   boo    123    0.75

I'm -0 on whether this is worth fixing at the moment.
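
A quick way to probe the assumption above, as a sketch on a trimmed copy of the sample frame: a column holding Decimal objects is stored with dtype object, so pandas does not classify it as numeric. `select_dtypes` is used here only for illustration; it is not necessarily the code path groupby itself takes.

```python
from decimal import Decimal
import pandas as pd

# Trimmed version of the sample frame from this issue.
df = pd.DataFrame({
    "name": ["foo", "bar", "foo", "bar"],
    "int_column": [1, 2, 3, 4],
    "dec_column1": [Decimal("0.50"), Decimal("0.15"), Decimal("0.25"), Decimal("0.40")],
})

# Decimal values are stored as generic Python objects, not as a numeric dtype.
dec_dtype = df["dec_column1"].dtype
print(dec_dtype)

# Only the columns pandas considers numeric are selected here.
numeric_cols = list(df.select_dtypes(include="number").columns)
print(numeric_cols)
```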

JoaoAparicio · 2016-07-27T19:44:13Z

Correct, it's the decimals. If you were to replace them with floats:

import pandas as pd
df = pd.DataFrame(
                  {'name': ['foo', 'bar', 'foo', 'bar'],
                   'title': ['boo', 'far', 'boo', 'far'],
                   'id': [123, 456, 123, 456],
                   'int_column': [1, 2, 3, 4],
                   'dec_column1': [0.5,0.15,0.25,0.4],
                   'dec_column2': [0.2,0.3,0.55,0.6]
                  },
                  columns=['name','title','id','int_column','dec_column1','dec_column2']
                 )
df.groupby(['name', 'title', 'id'], as_index=False).sum()

then everything works as it should

  name title   id  int_column  dec_column1  dec_column2
0  bar   far  456           6         0.55         0.90
1  foo   boo  123           4         0.75         0.75
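
If exact decimal arithmetic is not required, one possible workaround along these lines is to convert the Decimal columns to floats just before aggregating. This is a sketch rather than a recommended fix: the `dec_cols` list and the use of `assign` are illustrative choices, and the conversion gives up the exactness that Decimal provides.

```python
from decimal import Decimal
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["foo", "bar", "foo", "bar"],
        "title": ["boo", "far", "boo", "far"],
        "id": [123, 456, 123, 456],
        "int_column": [1, 2, 3, 4],
        "dec_column1": [Decimal("0.50"), Decimal("0.15"), Decimal("0.25"), Decimal("0.40")],
        "dec_column2": [Decimal("0.20"), Decimal("0.30"), Decimal("0.55"), Decimal("0.60")],
    }
)

# Hypothetical helper list: the columns that hold Decimal objects.
dec_cols = ["dec_column1", "dec_column2"]

# Build a copy whose Decimal columns are float64, so every value column is numeric.
as_float = df.assign(**{col: df[col].astype(float) for col in dec_cols})

summed = as_float.groupby(["name", "title", "id"], as_index=False).sum()
print(summed)
```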

TomAugspurger · 2016-07-27T19:50:13Z

Actually, I think fixing this is a no-go since not all agg operations work on Decimal. We can't have this start causing Exceptions because gr.dec_column1.mean() doesn't work.

How about this: we officially document Decimal columns as "nuisance" columns (columns that .agg automatically excludes) in groupby. The documentation should note that if you do wish to aggregate them, you must do so explicitly:

gr.agg({"dec_column1": "sum", "dec_column2": "sum"})
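
Spelled out against the full sample frame, the explicit form suggested above might look like the sketch below. It assumes `gr` is the groupby object built from the three key columns, and it adds `int_column` to the mapping so that all three value columns appear in the result.

```python
from decimal import Decimal
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["foo", "bar", "foo", "bar"],
        "title": ["boo", "far", "boo", "far"],
        "id": [123, 456, 123, 456],
        "int_column": [1, 2, 3, 4],
        "dec_column1": [Decimal("0.50"), Decimal("0.15"), Decimal("0.25"), Decimal("0.40")],
        "dec_column2": [Decimal("0.20"), Decimal("0.30"), Decimal("0.55"), Decimal("0.60")],
    }
)

gr = df.groupby(["name", "title", "id"], as_index=False)

# Name every column to aggregate, so the Decimal columns are requested explicitly
# instead of relying on the default column selection of .sum().
result = gr.agg({"int_column": "sum", "dec_column1": "sum", "dec_column2": "sum"})
print(result)
```

Because each column is named explicitly, the outcome does not depend on whether a given pandas version drops Decimal columns by default.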

pdpark · 2017-12-23T01:47:46Z

I use Pandas, but I'm still new to contributing, so apologies if this isn't the right approach, but I'm thinking of adding a sentence or two to the "Note" section here: https://pandas.pydata.org/pandas-docs/stable/groupby.html?highlight=groupby#aggregation.

If that sounds good I can take this one.

TomAugspurger · 2017-12-24T20:50:00Z

Yes, that sounds good.

pdpark · 2017-12-27T08:32:13Z

Groupby documentation updated with additional note and example code; pull requested.

TomAugspurger added the Groupby and Dtype Conversions labels Jul 27, 2016

TomAugspurger added the Docs and Difficulty Novice labels Jul 27, 2016

TomAugspurger added this to the 0.19.0 milestone Jul 27, 2016

jorisvandenbossche modified the milestones: Next Major Release, 0.19.0 Aug 21, 2016

TomAugspurger added the good first issue label Oct 11, 2017

jreback removed the Difficulty Novice label Dec 15, 2017

pdpark mentioned this issue Dec 27, 2017

DOC: Added note about groupby excluding Decimal columns by default #18953

Merged

jreback modified the milestones: Next Major Release, 0.23.0 Dec 27, 2017

jreback modified the milestones: 0.23.0, Next Major Release Apr 14, 2018

jorisvandenbossche closed this as completed in #18953 Nov 8, 2018

jorisvandenbossche modified the milestones: Contributions Welcome, 0.24.0 Nov 8, 2018
