DataFrame.groupby().sum() treating Nan as 0.0 #20824

Open
amanhanda opened this issue Apr 25, 2018 · 15 comments
@amanhanda

Code Sample, a copy-pastable example if possible

In [62]: import pandas as pd

In [63]: import numpy as np

In [64]: df = pd.DataFrame(data=[['data1', 2, np.nan], ['data2', 3, 4], ['data3', 4, 4]], index=[1, 2, 3], columns=['a', 'b', 'c'])

In [68]: df
Out[68]:
       a  b    c
1  data1  2  NaN
2  data2  3  4.0
3  data3  4  4.0

In [65]: df.groupby(by=['a','b']).sum(skipna=False)
Out[65]:
           c
a     b
data1 2  0.0
data2 3  4.0
data3 4  4.0

Problem description

The NaN value is being treated as 0.0. Is there an option to treat NaN as NaN, so that sum() returns NaN?

Expected Output

           c
a     b
data1 2  NaN
data2 3  4.0
data3 4  4.0

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.36.3.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: 3.5.0
pip: 9.0.3
setuptools: 39.0.1
Cython: 0.28.2
numpy: 1.14.2
scipy: 1.0.1
pyarrow: 0.9.0
xarray: 0.10.2
IPython: 5.6.0
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.2
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.2.1
bs4: 4.3.2
html5lib: 0.999
sqlalchemy: 1.2.6
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

I think you want min_count:

In [20]: df.groupby(['a', 'b']).c.sum()
Out[20]:
a      b
data1  2    0.0
data2  3    4.0
data3  4    4.0
Name: c, dtype: float64

In [21]: df.groupby(['a', 'b']).c.sum(min_count=1)
Out[21]:
a      b
data1  2    NaN
data2  3    4.0
data3  4    4.0
Name: c, dtype: float64
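
For context, min_count is the minimum number of non-NA values required for the result to be non-NA; with the default of 0, an all-NaN group sums to 0.0. A minimal illustration on a plain Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan])
print(s.sum())             # 0.0 -- no valid values, but min_count defaults to 0
print(s.sum(min_count=1))  # nan -- fewer than 1 non-NA value, so the result is NaN
```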

@TomAugspurger
Contributor

This is a bit surprising

In [23]: df.groupby(['a', 'b']).c.sum(min_count=1, skipna=False)
Out[23]:
a      b
data1  2    0.0
data2  3    4.0
data3  4    4.0
Name: c, dtype: float64

Something strange is going on with the skipna keyword there.

@amanhanda
Author

Thanks! I did not think of removing skipna=False. The skipna behavior should be consistent.

@TomAugspurger
Contributor

I think there are two intertwined issues:

  1. DataFrameGroupBy.sum doesn't accept skipna
  2. DataFrameGroupBy.sum doesn't validate its kwargs, and falls back to a secondary method

In [27]: df.groupby(['a', 'b']).c.sum(min_count=1, foo=1)
Out[27]:
a      b
data1  2    0.0
data2  3    4.0
data3  4    4.0
Name: c, dtype: float64

so passing skipna forces the fallback, which apparently ignores the kwargs.

@TomAugspurger
Contributor

#15675 for the skipna part.

@gsganden
Contributor

gsganden commented Jun 6, 2018

String values trigger the fallback too:

In [16]: pd.DataFrame([[1, np.nan]]).groupby(lambda x: x, axis='columns').sum(min_count=1)
Out[16]:
     0   1
0  1.0 NaN

In [17]: pd.DataFrame([['a', np.nan]]).groupby(lambda x: x, axis='columns').sum(min_count=1)
Out[17]:
   0    1
0  a  0.0

I think the fallback is happening here (https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/groupby.py):

try:
    return self._cython_agg_general(
        alias, alt=npfunc, **kwargs)
except AssertionError as e:
    raise SpecificationError(str(e))
except Exception:
    result = self.aggregate(
        lambda x: npfunc(x, axis=self.axis))

That except Exception is dangerous.
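
One safer shape (a hypothetical sketch, not pandas' actual code) is to reject unknown keyword arguments before dispatching, so an unsupported kwarg like skipna fails loudly instead of silently selecting a fallback that ignores it:

```python
import math

def grouped_sum(values, allowed_kwargs=("min_count",), **kwargs):
    # Hypothetical sketch: validate keyword arguments up front instead of
    # letting an unexpected kwarg trigger a silent fallback path.
    unexpected = set(kwargs) - set(allowed_kwargs)
    if unexpected:
        # Fail loudly instead of silently ignoring e.g. skipna=False
        raise TypeError(f"sum() got unexpected keyword argument(s): {sorted(unexpected)}")
    min_count = kwargs.get("min_count", 0)
    total, count = 0.0, 0
    for v in values:
        if v == v:  # NaN != NaN, so this skips missing values
            total += v
            count += 1
    return total if count >= min_count else math.nan
```

With this shape, grouped_sum([1.0, float("nan")], min_count=1) returns 1.0, an all-NaN input with min_count=1 returns NaN, and passing skipna=False raises TypeError rather than being dropped on the floor.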

@TomAugspurger TomAugspurger added the Numeric Operations Arithmetic, Comparison, and Logical operations label Jun 7, 2018
@TomAugspurger TomAugspurger modified the milestones: 0.24.0, Next Major Release Jun 7, 2018
@mukundm19

Does this issue still need to be resolved? If so, I'd like to look into it.

@amanhanda
Author

amanhanda commented Apr 8, 2019 via email

@mukundm19

As mentioned above, the fallback was occurring when df.groupby().sum() was called with the skipna flag, because the _cython_agg_general function was not accepting the argument; this has now been fixed by PR #26179. The fallback still occurs with strings in the DataFrame, however, which seems to be a deeper issue stemming from the _aggregate() call in groupby/ops.py (line 572), which is what converts the NaN to a zero.

@mukundm19

(Screenshot of a debugging session omitted.)

Showing some of my debugging to help anyone who might be able to take this on. I was able to find that the issue comes from the _aggregate() function.

@kitschen

kitschen commented Feb 10, 2020

I'm using the latest v1.0.1 but still see this issue. The min_count=1 argument also seems not to work (for timedeltas, at least). Any suggestions on how to keep the NaN in a groupby().sum()?

import pandas as pd
from datetime import date, timedelta

data = [[date(2020, 2, 1), timedelta(hours=1, minutes=10), timedelta(hours=2, minutes=10)],
        [date(2020, 2, 2), None,                           timedelta(hours=2, minutes=10)],
        [date(2020, 2, 3), timedelta(hours=1, minutes=10), timedelta(hours=2, minutes=10)],
        [date(2020, 2, 3), timedelta(hours=1, minutes=10), timedelta(hours=2, minutes=10)]]

df = pd.DataFrame(data, columns=['date', 'duration', 'total'])
df.set_index(pd.DatetimeIndex(df['date']), inplace=True)

res = df.groupby(level=0).sum(min_count=1)
display(res)



Expected:

date       | duration | total
2020-02-01 | 01:10:00 | 02:10:00
2020-02-02 | NaN      | 02:10:00
2020-02-03 | 02:20:00 | 04:20:00

But I get:

date       | duration | total
2020-02-01 | 01:10:00 | 02:10:00
2020-02-02 | 00:00:00 | 02:10:00
2020-02-03 | 02:20:00 | 04:20:00

------
Found a workaround, namely to use

`res=df.groupby(level=0).apply(lambda x: x.sum(min_count=1))`

instead of

`res=df.groupby(level=0).sum(min_count=1)`
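
The two paths can be compared on a minimal timedelta Series (a sketch; the index values are assumptions for illustration, and the 0-days output on the direct path is the behavior reported above for pandas 1.0.x, so current versions may already return NaT there):

```python
import pandas as pd

s = pd.Series([pd.Timedelta(hours=1), pd.NaT],
              index=pd.to_datetime(['2020-02-01', '2020-02-02']))

# Direct groupby sum: on pandas 1.0.x this coerced the all-NaT group to 0 days.
direct = s.groupby(level=0).sum(min_count=1)

# Routing through apply uses Series.sum, which honors min_count and keeps NaT.
via_apply = s.groupby(level=0).apply(lambda x: x.sum(min_count=1))
print(via_apply)
```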

@edn01

edn01 commented May 8, 2020

Still an issue in v1.0.3.
There is inconsistent behavior between pandas and numpy: NaN should not be treated as zero.
This is also a problem for mean and std, not just sum. For example:

>>> df_1 = pd.DataFrame({'col1': ('a', 'a', 'b', 'c'), 'col2': (np.NaN, 2, np.NaN, 3)})
>>> df_1
  col1  col2
0    a   NaN
1    a   2.0
2    b   NaN
3    c   3.0
>>> df_2 = df_1.groupby('col1').agg(sum_col2=('col2', 'sum'), mean_col2=('col2', 'mean'))
>>> df_2
      sum_col2  mean_col2
col1
a          2.0        2.0
b          0.0        NaN
c          3.0        3.0
>>> np.mean([np.NaN])
nan
>>> np.sum([np.NaN])
nan
>>> np.mean([np.NaN, 2])
nan
>>> np.sum([np.NaN, 2])
nan

Therefore, I would expect df_2 to be

      sum_col2  mean_col2
col1                     
a          NaN        NaN
b          NaN        NaN
c          3.0        3.0

Same unexpected result with

df_3 = df_1.groupby('col1').agg(sum_col2=('col2', np.sum), mean_col2=('col2', np.mean))

Also the min_count=1 suggestion does not solve the problem, for example

>>> df_4 = pd.DataFrame({
...     'col1': ('a', 'a', 'b', 'c', 'd', 'd', 'd', 'e', 'e', 'e'),
...     'col2': (np.NaN, 2, np.NaN, 3, 4, 5, np.NaN, 6, np.NaN, np.NaN)
... })
>>> df_5 = df_4.groupby('col1').sum(min_count=1)
>>> df_5
      col2
col1
a      2.0
b      NaN
c      3.0
d      9.0
e      6.0

whereas I would expect df_5 to be

      col2
col1      
a      NaN
b      NaN
c      3.0
d      NaN
e      NaN

There are also problems with std, but those seem more confusing.
Should this be split out into separate issues, or is it the same underlying problem and can be kept as one?

>>> pd.__version__
'1.0.3'
>>> np.__version__
'1.18.4'
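
Until the groupby reductions accept skipna, one workaround (a sketch, using the df_1 above) is to route the Series-level reductions, which do accept skipna, through agg:

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'col1': ('a', 'a', 'b', 'c'),
                     'col2': (np.nan, 2, np.nan, 3)})

# Named aggregation with lambdas so Series.sum/Series.mean see skipna=False.
res = df_1.groupby('col1')['col2'].agg(
    sum_col2=lambda s: s.sum(skipna=False),
    mean_col2=lambda s: s.mean(skipna=False),
)
print(res)  # groups a and b come out as NaN; c keeps 3.0
```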

@lucasthim

I might try a pull request to solve this.
But for now, I was able to bypass the NaN-as-0 problem:

I assign numpy.inf to the NaN values in my columns and then execute whatever function (prod, mean, sum) with groupby.

Then I assign numpy.nan to everything that resulted in numpy.inf.

There's an example I posted in this Stack Overflow discussion:

https://stackoverflow.com/questions/62069979/pandas-merge-with-conditionnal-aggregation/62071652#62071652
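
A sketch of that inf substitution (the frame and column names here are assumptions for illustration): any group containing inf sums to inf, which is then mapped back to NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'],
                   'col2': [np.nan, 2, np.nan, 3]})

# Step 1: stand in np.inf for the missing values.
filled = df.fillna({'col2': np.inf})

# Step 2: aggregate; any group that contained a NaN now sums to inf.
res = filled.groupby('col1')['col2'].sum()

# Step 3: map the inf sentinels back to NaN.
res = res.replace(np.inf, np.nan)
print(res)  # a and b become NaN; c stays 3.0
```

Note this only works for reductions where inf propagates (sum, prod, mean); it would not distinguish a genuine inf already present in the data.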

@attack68
Contributor

attack68 commented Oct 8, 2020

If anyone else comes across this issue, FWIW I employ the following solution pending a Pandas bug fix:


def np_sum(g):
    return np.sum(g.values)

df = pd.DataFrame(data={'name': ['a', 'a', 'b', 'b', 'c'], 'data': [1, np.nan, 1, 1, np.nan]})
g = df.groupby(['name'])
g.agg(col1=('data', 'sum'), col2=('data', np_sum), col3=('data', np.sum), col4=('data', np.nansum))
      col1  col2  col3  col4
name
a      1.0   NaN   1.0   1.0
b      2.0   2.0   2.0   2.0
c      0.0   NaN   0.0   0.0

@jorisvandenbossche
Member

The actual improvement to add skipna support to groupby reductions is covered in #15675, so I think this can be closed or repurposed to a documentation issue, to ensure we have clear documentation about this common tricky case.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@jbrockmendel jbrockmendel added Reduction Operations sum, mean, min, max, etc. and removed Numeric Operations Arithmetic, Comparison, and Logical operations labels Mar 30, 2023