DataFrame.groupby().sum() treating Nan as 0.0 #20824

Open
amanhanda opened this issue Apr 25, 2018 · 15 comments
@amanhanda

Code Sample, a copy-pastable example if possible

In [62]: import pandas as pd

In [63]: import numpy as np

In [64]: df = pd.DataFrame(data=[['data1', 2, np.nan], ['data2', 3, 4], ['data3', 4, 4]], index=[1, 2, 3], columns=['a', 'b', 'c'])

In [68]: df
Out[68]:
       a  b    c
1  data1  2  NaN
2  data2  3  4.0
3  data3  4  4.0

In [65]: df.groupby(by=['a','b']).sum(skipna=False)
Out[65]:
           c
a     b
data1 2  0.0
data2 3  4.0
data3 4  4.0

Problem description

The NaN value is being treated as 0.0. Is there an option to treat NaN as NaN, so that sum() returns NaN?

Expected Output

           c
a     b
data1 2  NaN
data2 3  4.0
data3 4  4.0

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.36.3.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: 3.5.0
pip: 9.0.3
setuptools: 39.0.1
Cython: 0.28.2
numpy: 1.14.2
scipy: 1.0.1
pyarrow: 0.9.0
xarray: 0.10.2
IPython: 5.6.0
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.2
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.2.1
bs4: 4.3.2
html5lib: 0.999
sqlalchemy: 1.2.6
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

I think you want min_count:

In [20]: df.groupby(['a', 'b']).c.sum()
Out[20]:
a      b
data1  2    0.0
data2  3    4.0
data3  4    4.0
Name: c, dtype: float64

In [21]: df.groupby(['a', 'b']).c.sum(min_count=1)
Out[21]:
a      b
data1  2    NaN
data2  3    4.0
data3  4    4.0
Name: c, dtype: float64
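
For context, min_count is the minimum number of non-NA values required for the result to be non-NA; with the default of 0, an all-NaN group sums to 0.0. A minimal illustration on a plain Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan])
print(s.sum())             # 0.0 -- no valid values, but min_count defaults to 0
print(s.sum(min_count=1))  # nan -- fewer than 1 non-NA value, so the result is NaN
```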

@TomAugspurger
Contributor

This is a bit surprising

In [23]: df.groupby(['a', 'b']).c.sum(min_count=1, skipna=False)
Out[23]:
a      b
data1  2    0.0
data2  3    4.0
data3  4    4.0
Name: c, dtype: float64

Something strange is going on with the skipna keyword there.

@amanhanda
Author

Thanks! I did not think of removing skipna=False. The skipna behavior should be consistent.

@TomAugspurger
Contributor

I think there are two intertwined issues:

  1. DataFrameGroupBy.sum doesn't accept skipna
  2. DataFrameGroupBy.sum doesn't validate its kwargs, and falls back to a secondary method

In [27]: df.groupby(['a', 'b']).c.sum(min_count=1, foo=1)
Out[27]:
a      b
data1  2    0.0
data2  3    4.0
data3  4    4.0
Name: c, dtype: float64

so passing skipna forces the fallback, which apparently ignores the kwargs.

@TomAugspurger
Contributor

#15675 for the skipna part.

@gsganden
Contributor

gsganden commented Jun 6, 2018

String values trigger the fallback too:

In [16]: pd.DataFrame([[1, np.nan]]).groupby(lambda x: x, axis='columns').sum(min_count=1)
Out[16]:
     0   1
0  1.0 NaN

In [17]: pd.DataFrame([['a', np.nan]]).groupby(lambda x: x, axis='columns').sum(min_count=1)
Out[17]:
   0    1
0  a  0.0

I think the fallback is happening here (https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/groupby.py):

try:
    return self._cython_agg_general(
        alias, alt=npfunc, **kwargs)
except AssertionError as e:
    raise SpecificationError(str(e))
except Exception:
    result = self.aggregate(
        lambda x: npfunc(x, axis=self.axis))

That except Exception is dangerous.
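
One safer shape (a hypothetical sketch, not pandas' actual code) is to reject unknown keyword arguments before dispatching, so an unsupported kwarg like skipna fails loudly instead of silently selecting a fallback that ignores it:

```python
import math

def grouped_sum(values, allowed_kwargs=("min_count",), **kwargs):
    # Hypothetical sketch: validate keyword arguments up front instead of
    # letting an unexpected kwarg trigger a silent fallback path.
    unexpected = set(kwargs) - set(allowed_kwargs)
    if unexpected:
        # Fail loudly instead of silently ignoring e.g. skipna=False
        raise TypeError(f"sum() got unexpected keyword argument(s): {sorted(unexpected)}")
    min_count = kwargs.get("min_count", 0)
    total, count = 0.0, 0
    for v in values:
        if v == v:  # NaN != NaN, so this skips missing values
            total += v
            count += 1
    return total if count >= min_count else math.nan
```

With this shape, grouped_sum([1.0, float("nan")], min_count=1) returns 1.0, an all-NaN input with min_count=1 returns NaN, and passing skipna=False raises TypeError rather than being dropped on the floor.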

@TomAugspurger TomAugspurger added the Numeric Operations Arithmetic, Comparison, and Logical operations label Jun 7, 2018
@TomAugspurger TomAugspurger modified the milestones: 0.24.0, Next Major Release Jun 7, 2018
@mukundm19

Does this issue still need to be resolved? If so, I'd like to look into it.

@amanhanda
Author

amanhanda commented Apr 8, 2019 via email

@mukundm19

As mentioned above, the fallback was occurring when df.groupby().sum() was called with the skipna flag, because the _cython_agg_general function was not accepting the argument; this has now been fixed by PR #26179. The fallback still occurs with strings in the DataFrame, however, which seems to be a deeper issue stemming from the _aggregate() call in groupby/ops.py (line 572), which is what converts the NaN to a zero.

@mukundm19

(Screenshot of a debugging session omitted.)

Showing some of my debugging to help anyone who might be able to take this on. I was able to find that the issue comes from the _aggregate() function.

@kitschen

kitschen commented Feb 10, 2020

I'm using the latest v1.0.1 but still see this issue. The min_count=1 argument also seems not to work (for timedeltas, at least). Any suggestions on how to keep the NaN in a groupby().sum()?

import pandas as pd
from datetime import date, timedelta

data = [[date(2020, 2, 1), timedelta(hours=1, minutes=10), timedelta(hours=2, minutes=10)],
        [date(2020, 2, 2), None,                           timedelta(hours=2, minutes=10)],
        [date(2020, 2, 3), timedelta(hours=1, minutes=10), timedelta(hours=2, minutes=10)],
        [date(2020, 2, 3), timedelta(hours=1, minutes=10), timedelta(hours=2, minutes=10)]]

df = pd.DataFrame(data, columns=['date', 'duration', 'total'])
df.set_index(pd.DatetimeIndex(df['date']), inplace=True)

res = df.groupby(level=0).sum(min_count=1)
display(res)



Expected:

date       | duration | total
2020-02-01 | 01:10:00 | 02:10:00
2020-02-02 | NaN      | 02:10:00
2020-02-03 | 02:20:00 | 04:20:00

But I get:

date       | duration | total
2020-02-01 | 01:10:00 | 02:10:00
2020-02-02 | 00:00:00 | 02:10:00
2020-02-03 | 02:20:00 | 04:20:00

------
Found a workaround, namely to use

`res=df.groupby(level=0).apply(lambda x: x.sum(min_count=1))`

instead of

`res=df.groupby(level=0).sum(min_count=1)`
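
The two paths can be compared on a minimal timedelta Series (a sketch; the index values are assumptions for illustration, and the 0-days output on the direct path is the behavior reported above for pandas 1.0.x, so current versions may already return NaT there):

```python
import pandas as pd

s = pd.Series([pd.Timedelta(hours=1), pd.NaT],
              index=pd.to_datetime(['2020-02-01', '2020-02-02']))

# Direct groupby sum: on pandas 1.0.x this coerced the all-NaT group to 0 days.
direct = s.groupby(level=0).sum(min_count=1)

# Routing through apply uses Series.sum, which honors min_count and keeps NaT.
via_apply = s.groupby(level=0).apply(lambda x: x.sum(min_count=1))
print(via_apply)
```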

@edn01

edn01 commented May 8, 2020

Still an issue in v1.0.3.
There is inconsistent behavior between pandas and numpy: NaN should not be treated as zero.
This is also a problem for mean and std, not just sum. For example:

>>> df_1 = pd.DataFrame({'col1': ('a', 'a', 'b', 'c'), 'col2': (np.NaN, 2, np.NaN, 3)})
>>> df_1
  col1  col2
0    a   NaN
1    a   2.0
2    b   NaN
3    c   3.0
>>> df_2 = df_1.groupby('col1').agg(sum_col2=('col2', 'sum'), mean_col2=('col2', 'mean'))
>>> df_2
      sum_col2  mean_col2
col1
a          2.0        2.0
b          0.0        NaN
c          3.0        3.0
>>> np.mean([np.NaN])
nan
>>> np.sum([np.NaN])
nan
>>> np.mean([np.NaN, 2])
nan
>>> np.sum([np.NaN, 2])
nan

Therefore, I would expect df_2 to be

      sum_col2  mean_col2
col1                     
a          NaN        NaN
b          NaN        NaN
c          3.0        3.0

Same unexpected result with

df_3 = df_1.groupby('col1').agg(sum_col2=('col2', np.sum), mean_col2=('col2', np.mean))

Also the min_count=1 suggestion does not solve the problem, for example

>>> df_4 = pd.DataFrame({
...     'col1': ('a', 'a', 'b', 'c', 'd', 'd', 'd', 'e', 'e', 'e'),
...     'col2': (np.NaN, 2, np.NaN, 3, 4, 5, np.NaN, 6, np.NaN, np.NaN)
... })
>>> df_5 = df_4.groupby('col1').sum(min_count=1)
>>> df_5
      col2
col1
a      2.0
b      NaN
c      3.0
d      9.0
e      6.0

whereas I would expect df_5 to be

      col2
col1      
a      NaN
b      NaN
c      3.0
d      NaN
e      NaN

There are also problems with std, but those seem more confusing.
Should this be split out into separate issues, or is it the same underlying problem and can be kept as one?

>>> pd.__version__
'1.0.3'
>>> np.__version__
'1.18.4'
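
Until the groupby reductions accept skipna, one workaround (a sketch, using the df_1 above) is to route the Series-level reductions, which do accept skipna, through agg:

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'col1': ('a', 'a', 'b', 'c'),
                     'col2': (np.nan, 2, np.nan, 3)})

# Named aggregation with lambdas so Series.sum/Series.mean see skipna=False.
res = df_1.groupby('col1')['col2'].agg(
    sum_col2=lambda s: s.sum(skipna=False),
    mean_col2=lambda s: s.mean(skipna=False),
)
print(res)  # groups a and b come out as NaN; c keeps 3.0
```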

@lucasthim

I might try a pull request to solve this.
But for now, I was able to bypass the NaN-as-0 problem:

I assign numpy.inf to the NaN values in my columns and then execute whatever function (prod, mean, sum) with groupby.

Then I assign numpy.nan to everything that resulted in numpy.inf.

There's an example I posted in this Stack Overflow discussion:

https://stackoverflow.com/questions/62069979/pandas-merge-with-conditionnal-aggregation/62071652#62071652
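
A sketch of that inf substitution (the frame and column names here are assumptions for illustration): any group containing inf sums to inf, which is then mapped back to NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'],
                   'col2': [np.nan, 2, np.nan, 3]})

# Step 1: stand in np.inf for the missing values.
filled = df.fillna({'col2': np.inf})

# Step 2: aggregate; any group that contained a NaN now sums to inf.
res = filled.groupby('col1')['col2'].sum()

# Step 3: map the inf sentinels back to NaN.
res = res.replace(np.inf, np.nan)
print(res)  # a and b become NaN; c stays 3.0
```

Note this only works for reductions where inf propagates (sum, prod, mean); it would not distinguish a genuine inf already present in the data.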

@attack68
Contributor

attack68 commented Oct 8, 2020

If anyone else comes across this issue, FWIW I employ the following solution pending a Pandas bug fix:


def np_sum(g):
    return np.sum(g.values)

df = pd.DataFrame(data={'name': ['a', 'a', 'b', 'b', 'c'], 'data': [1, np.nan, 1, 1, np.nan]})
g = df.groupby(['name'])
g.agg(col1=('data', 'sum'), col2=('data', np_sum), col3=('data', np.sum), col4=('data', np.nansum))
      col1  col2  col3  col4
name
a      1.0   NaN   1.0   1.0
b      2.0   2.0   2.0   2.0
c      0.0   NaN   0.0   0.0

@jorisvandenbossche
Member

The actual improvement to add skipna support to groupby reductions is covered in #15675, so I think this can be closed or repurposed to a documentation issue, to ensure we have clear documentation about this common tricky case.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@jbrockmendel jbrockmendel added Reduction Operations sum, mean, min, max, etc. and removed Numeric Operations Arithmetic, Comparison, and Logical operations labels Mar 30, 2023