
ENH: pivot/groupby index with nan #3729

Closed
jreback opened this issue May 31, 2013 · 54 comments · Fixed by #30584
Assignees
Labels
Enhancement Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@jreback
Contributor

jreback commented May 31, 2013

ENH: maybe for now just provide a warning when dropping the NaN rows while pivoting...

From the mailing list:

http://stackoverflow.com/questions/16860172/python-pandas-pivot-table-silently-drops-indices-with-nans

This is effectively trying to group by a NaN, which is currently not allowed:

In [13]: a = [['a', 'b', 12, 12, 12], ['a', nan, 12.3, 233., 12], ['b', 'a', 123.23, 123, 1], ['a', 'b', 1, 1, 1.]]

In [14]: df = DataFrame(a, columns=['a', 'b', 'c', 'd', 'e'])

In [15]: df.groupby(['a','b']).sum()
Out[15]: 
          c    d   e
a b                 
a b   13.00   13  13
b a  123.23  123   1
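As of pandas 1.1 this behaviour is controllable: #30584 added a dropna parameter to groupby, and dropna=False keeps the NaN key as its own group. A minimal sketch of the same example with the modern API:

```python
import numpy as np
import pandas as pd

a = [['a', 'b', 12, 12, 12],
     ['a', np.nan, 12.3, 233., 12],
     ['b', 'a', 123.23, 123, 1],
     ['a', 'b', 1, 1, 1.]]
df = pd.DataFrame(a, columns=['a', 'b', 'c', 'd', 'e'])

# dropna=False (pandas >= 1.1) keeps the ('a', NaN) row as its own group
# instead of silently dropping it.
result = df.groupby(['a', 'b'], dropna=False).sum()
print(result)
```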

Workaround to fill the index with a dummy, pivot, and replace


    In [31]: df2 = df.copy()

    In [32]: df2['dummy'] = np.nan

    In [33]: df2['b'] = df2['b'].fillna('dummy')

    In [34]: df2
    Out[34]: 
       a      b       c    d   e  dummy
    0  a      b   12.00   12  12    NaN
    1  a  dummy   12.30  233  12    NaN
    2  b      a  123.23  123   1    NaN
    3  a      b    1.00    1   1    NaN

    In [35]: df2.pivot_table(rows=['a', 'b'], values=['c', 'd', 'e'], aggfunc=sum)
    Out[35]: 
       a      b       c    d   e
    0  a      b   13.00   13  13
    1  a  dummy   12.30  233  12
    2  b      a  123.23  123   1

    In [36]: df2.pivot_table(rows=['a', 'b'], values=['c', 'd', 'e'], aggfunc=sum).replace('dummy',np.nan)
    Out[36]: 
       a    b       c    d   e
    0  a    b   13.00   13  13
    1  a  NaN   12.30  233  12
    2  b    a  123.23  123   1
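For readers on a current pandas, here is the same dummy-fill workaround in modern syntax (pivot_table's old rows= keyword was later renamed to index=); the sentinel string 'dummy' is, as above, just an arbitrary placeholder:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['a', 'b', 12, 12, 12],
     ['a', np.nan, 12.3, 233., 12],
     ['b', 'a', 123.23, 123, 1],
     ['a', 'b', 1, 1, 1.]],
    columns=['a', 'b', 'c', 'd', 'e'])

# Fill the NaN key with a sentinel, pivot, then restore NaN in the index.
filled = df.assign(b=df['b'].fillna('dummy'))
out = filled.pivot_table(index=['a', 'b'], values=['c', 'd', 'e'], aggfunc='sum')
out = out.rename(index={'dummy': np.nan}, level='b')
print(out)
```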

@jreback
Contributor Author

jreback commented May 31, 2013

@wesm so this brings up the issue of groupby with NaN in the index. I suppose we could include the NaN group via a groupby option?

@hayd
Contributor

hayd commented Aug 25, 2013

adding dropna=True to groupby seems reasonable.

The behaviour is currently documented as "this is how R works", but the docs don't really say why...

@mattvivier

@jreback - I apologize, I'm a bit new to GitHub, but is your last action indicating that this (adding dropna=True) should've been implemented in 0.16.0? If so, should I file another bug, as I'm experiencing similar behavior in 0.16.2?

@hayd
Contributor

hayd commented Jun 23, 2015

modified the milestone: 0.16.0, Next Major Release on Mar 5

GitHub's a bit unclear, but implementing this was moved from 0.16.0 to the next major release (0.17).

@mattvivier

@hayd thanks for the clarification

@Zenadix

Zenadix commented Oct 29, 2015

+1 for adding dropna=True to groupby

@peterpanmj
Contributor

@hayd I am using 0.17, but it seems there is still no dropna=True option for groupby

@jreback
Contributor Author

jreback commented Dec 14, 2015

this is still an open issue
pull requests to implement are welcome

@multiloc
Contributor

+1 for optional dropna in groupby. The automatic drops can lead to strange behavior:

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.DataFrame({'foo': [1, 2, 3], 'bar': [4, 5, np.nan], 'qux': [None, None, None]})

In [3]:
df

Out[3]:
bar foo qux
0   4   1   None
1   5   2   None
2   NaN 3   None

In [4]:
grouped = df.groupby(by=['foo', 'bar'])

In [5]:
keys = grouped.groups.keys()
keys

Out[5]:
[(3, nan), (2, 5.0), (1, 4.0)]

In [6]:
grouped.get_group(keys[0])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-6-8c0d3ec7c89a> in <module>()
----> 1 grouped.get_group(keys[0])

.../env/local/lib/python2.7/site-packages/pandas/core/groupby.py in get_group(self, name, obj)
    646         inds = self._get_index(name)
    647         if not len(inds):
--> 648             raise KeyError(name)
    649 
    650         return obj.take(inds, axis=self.axis, convert=False)

KeyError: (3, nan)
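For reference, with the dropna parameter that eventually landed (#30584, pandas 1.1), the advertised keys are consistent with what groupby actually keeps; a sketch (note that .groups.keys() is a view in Python 3, hence the list()):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': [1, 2, 3], 'bar': [4, 5, np.nan], 'qux': [None, None, None]})
grouped = df.groupby(by=['foo', 'bar'], dropna=False)

# All three keys, including the NaN one, now show up in .groups.
keys = list(grouped.groups.keys())
print(keys)
```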

@hmgaudecker

+1 for the dropna option. Just got bitten by this again...

@stefanw

stefanw commented Mar 9, 2016

+1 for the dropna option as well, got bitten by it with some NaT values. This is most unexpected, especially with multiple column groupbys. Explicitly excluding NaN rows before grouping seems more pythonic to me, but a backwards-compatible option with dropna=False would work as well.

@jreback jreback modified the milestones: 0.18.1, Next Major Release Mar 9, 2016
@nbonnotte
Contributor

I'm working on a solution.

Problem: there is already a dropna parameter in DataFrame.pivot_table, which is used for something slightly different:

dropna : boolean, default True
    Do not include columns whose entries are all NaN

I'll gladly accept any input. The best solution would probably be to give that option another name, both for .groupby and .pivot_table. Is there any other similar parameter somewhere in pandas whose name could be reused?

@madeleinemf

I would also really like a dropna=False option that allows groups keyed by NaN values.

@TomAugspurger
Contributor

In #29716, @sorenwolfers noted that this affects .sum(level=...) as well:

import pandas as pd
import numpy as np
df = pd.DataFrame({'ind':[np.nan,np.nan,'a','a'],'col':[0,1,2,3]}).set_index('ind')
df.sum(level='ind')
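DataFrame.sum(level=...) was essentially a shortcut for a groupby on that index level (and was later deprecated in favour of it), so spelling it as a groupby makes the fix available, since dropna=False keeps the NaN label:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ind': [np.nan, np.nan, 'a', 'a'],
                   'col': [0, 1, 2, 3]}).set_index('ind')

# Grouping on the index level with dropna=False keeps the NaN label
# as its own group instead of dropping those rows.
result = df.groupby(level='ind', dropna=False).sum()
print(result)
```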

@franz101

franz101 commented Dec 5, 2019

Any update?
Getting this error especially when using groupby and transform


@jreback
Contributor Author

jreback commented Dec 5, 2019

@franz101 happy to have a PR for this
there are several attempts so lots of source material

@charlesdong1991
Member

take

@json2d

json2d commented Apr 20, 2020

seems really easy to get caught by this one

@OFFICE365USER

Got bitten by this. Again. Can't believe this has been open since 2013. Consistency with R is not a reason to do something!

@stevenlis

stevenlis commented Apr 29, 2020

Could you guys pls consider adding a warning for the users at least?

Btw, I'm using 1.0.3 and setting dropna=False makes pivot_table extremely slow.

@andrewdalecramer

People are aware that it's an issue. It's not a difficult issue to solve, I've done it before and it worked fine, the main impediment there is just getting the option through all the layers of indirection down to the actual work code. The reason it hasn't been solved by me is all the testing and work that goes around it, which I was unable to complete. If someone is familiar with the testing apparatus I'd be glad to talk them through my old PR (which would need to be updated as it is quite old).

On my part, I've rolled off the project using pandas a long time ago and am primarily using R now (which handles this just fine) so I don't have a strong reason to spend the days required to learn how testing works in pandas.

@jreback
Contributor Author

jreback commented Apr 29, 2020

if you look at #30584, this issue is completed for groupby; it wasn't hard actually. It will be in 1.1 (soon).

but just like anything else in pandas, it's all volunteer; if you want an issue fixed, pushing a PR is the best way.

@OFFICE365USER

> People are aware that it's an issue. It's not a difficult issue to solve, I've done it before and it worked fine, the main impediment there is just getting the option through all the layers of indirection down to the actual work code. The reason it hasn't been solved by me is all the testing and work that goes around it, which I was unable to complete. If someone is familiar with the testing apparatus I'd be glad to talk them through my old PR (which would need to be updated as it is quite old).
>
> On my part, I've rolled off the project using pandas a long time ago and am primarily using R now (which handles this just fine) so I don't have a strong reason to spend the days required to learn how testing works in pandas.

People are not aware of this issue. An interim solution would be for groupby() to throw an error for the unsupported case, with a flag to re-enable the broken exclude-NaN "feature".

@andrewdalecramer

Sorry, contributors are aware it's an issue. You're welcome to become one and submit a PR.

@jreback jreback modified the milestones: Contributions Welcome, 1.1 May 6, 2020
@zxymath

zxymath commented Aug 24, 2020

There are some problems when grouping by a categorical column:

categories = ['x%s' % str(i) for i in range(2)]
d = pd.Categorical(np.random.choice(categories, size=100), categories=categories)
d[80:] = np.nan
data = pd.DataFrame(np.random.rand(len(d), 1))
print(data.groupby(d, dropna=False).count())
Out:
     0
x0  35
x1  45
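If the Categorical grouper misbehaves like this on a given version, one workaround is to group on the plain values instead of the Categorical, which reliably surfaces the NaN group. A sketch of the same setup (a seeded generator is used here, an addition for reproducibility):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
categories = ['x0', 'x1']
d = pd.Categorical(rng.choice(categories, size=100), categories=categories)
d[80:] = np.nan
data = pd.DataFrame(rng.random((len(d), 1)))

# Grouping on the underlying values (an object array) rather than the
# Categorical makes dropna=False include the 20 NaN rows as a group.
counts = data.groupby(np.asarray(d), dropna=False).count()
print(counts)
```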

@LaveryRM

As a business analyst I was trained to apply two QC rules:

1. Never let the number of rows, or the variable-of-interest total, change without knowing why.
2. Check every path through nested IFs and through conditional transforms.

This is basic business analyst / data scientist QC and should happen on EVERY merge/filter step in a project. Because of the need to do this frequently, I suggest EVERY summation should default to preserving the count and the total.

An option to remove NaNs in an index is appropriate, but not as the default.

@colomb8

colomb8 commented Feb 14, 2022

Why is this issue closed? The problem persists on pandas 1.4.0.

@soerenwolfers

soerenwolfers commented Dec 7, 2022

On 1.5.2 (and possibly lower) you can pass dropna=False, and it will work with both regular indices and multi-indices.
