
ENH: pivot/groupby index with nan #3729

Closed
jreback opened this issue May 31, 2013 · 54 comments · Fixed by #30584
Assignees
Labels
Enhancement Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@jreback
Contributor

jreback commented May 31, 2013

ENH: maybe for now just provide a warning when dropping the NaN rows while pivoting...

From the mailing list:

http://stackoverflow.com/questions/16860172/python-pandas-pivot-table-silently-drops-indices-with-nans

This is effectively trying to group by a NaN, which is currently not allowed:

In [13]: a = [['a', 'b', 12, 12, 12], ['a', nan, 12.3, 233., 12], ['b', 'a', 123.23, 123, 1], ['a', 'b', 1, 1, 1.]]

In [14]: df = DataFrame(a, columns=['a', 'b', 'c', 'd', 'e'])

In [15]: df.groupby(['a','b']).sum()
Out[15]: 
          c    d   e
a b                 
a b   13.00   13  13
b a  123.23  123   1
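As of pandas 1.1 this behaviour is controllable: #30584 added a dropna parameter to groupby, and dropna=False keeps the NaN key as its own group. A minimal sketch of the same example with the modern API:

```python
import numpy as np
import pandas as pd

a = [['a', 'b', 12, 12, 12],
     ['a', np.nan, 12.3, 233., 12],
     ['b', 'a', 123.23, 123, 1],
     ['a', 'b', 1, 1, 1.]]
df = pd.DataFrame(a, columns=['a', 'b', 'c', 'd', 'e'])

# dropna=False (pandas >= 1.1) keeps the ('a', NaN) row as its own group
# instead of silently dropping it.
result = df.groupby(['a', 'b'], dropna=False).sum()
print(result)
```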

Workaround to fill the index with a dummy, pivot, and replace


    In [31]: df2 = df.copy()

    In [32]: df2['dummy'] = np.nan

    In [33]: df2['b'] = df2['b'].fillna('dummy')

    In [34]: df2
    Out[34]: 
       a      b       c    d   e  dummy
    0  a      b   12.00   12  12    NaN
    1  a  dummy   12.30  233  12    NaN
    2  b      a  123.23  123   1    NaN
    3  a      b    1.00    1   1    NaN

    In [35]: df2.pivot_table(rows=['a', 'b'], values=['c', 'd', 'e'], aggfunc=sum)
    Out[35]: 
       a      b       c    d   e
    0  a      b   13.00   13  13
    1  a  dummy   12.30  233  12
    2  b      a  123.23  123   1

    In [36]: df2.pivot_table(rows=['a', 'b'], values=['c', 'd', 'e'], aggfunc=sum).replace('dummy',np.nan)
    Out[36]: 
       a    b       c    d   e
    0  a    b   13.00   13  13
    1  a  NaN   12.30  233  12
    2  b    a  123.23  123   1
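For readers on a current pandas, here is the same dummy-fill workaround in modern syntax (pivot_table's old rows= keyword was later renamed to index=); the sentinel string 'dummy' is, as above, just an arbitrary placeholder:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['a', 'b', 12, 12, 12],
     ['a', np.nan, 12.3, 233., 12],
     ['b', 'a', 123.23, 123, 1],
     ['a', 'b', 1, 1, 1.]],
    columns=['a', 'b', 'c', 'd', 'e'])

# Fill the NaN key with a sentinel, pivot, then restore NaN in the index.
filled = df.assign(b=df['b'].fillna('dummy'))
out = filled.pivot_table(index=['a', 'b'], values=['c', 'd', 'e'], aggfunc='sum')
out = out.rename(index={'dummy': np.nan}, level='b')
print(out)
```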

@jreback
Contributor Author

jreback commented May 31, 2013

@wesm so this brings up the issue of groupby with NaN in the index. I suppose we could include the NaN group via a groupby option?

@hayd
Contributor

hayd commented Aug 25, 2013

adding dropna=True to groupby seems reasonable.

The behaviour is currently documented as "this is how R works", but the docs don't really say why...

@mattvivier

@jreback - I apologize, I'm a bit new to GitHub, but is your last action indicating that this (adding dropna=True) should've been implemented in 0.16.0? If so, should I file another bug, as I'm experiencing similar behavior in 0.16.2?

@hayd
Contributor

hayd commented Jun 23, 2015

modified the milestone: 0.16.0, Next Major Release on Mar 5

GitHub's a bit unclear, but implementing this was moved from 0.16.0 to the next major release (0.17).

@mattvivier

@hayd thanks for the clarification

@Zenadix

Zenadix commented Oct 29, 2015

+1 for adding dropna=True to groupby

@peterpanmj
Contributor

@hayd I am using 0.17, but it seems there is still no dropna=True option for groupby

@jreback
Contributor Author

jreback commented Dec 14, 2015

this is still an open issue
pull requests to implement are welcome

@multiloc
Contributor

+1 for optional dropna in groupby. The automatic drops can lead to strange behavior:

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.DataFrame({'foo': [1, 2, 3], 'bar': [4, 5, np.nan], 'qux': [None, None, None]})

In [3]:
df

Out[3]:
bar foo qux
0   4   1   None
1   5   2   None
2   NaN 3   None

In [4]:
grouped = df.groupby(by=['foo', 'bar'])

In [5]:
keys = grouped.groups.keys()
keys

Out[5]:
[(3, nan), (2, 5.0), (1, 4.0)]

In [6]:
grouped.get_group(keys[0])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-6-8c0d3ec7c89a> in <module>()
----> 1 grouped.get_group(keys[0])

.../env/local/lib/python2.7/site-packages/pandas/core/groupby.py in get_group(self, name, obj)
    646         inds = self._get_index(name)
    647         if not len(inds):
--> 648             raise KeyError(name)
    649 
    650         return obj.take(inds, axis=self.axis, convert=False)

KeyError: (3, nan)
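For reference, with the dropna parameter that eventually landed (#30584, pandas 1.1), the advertised keys are consistent with what groupby actually keeps; a sketch (note that .groups.keys() is a view in Python 3, hence the list()):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': [1, 2, 3], 'bar': [4, 5, np.nan], 'qux': [None, None, None]})
grouped = df.groupby(by=['foo', 'bar'], dropna=False)

# All three keys, including the NaN one, now show up in .groups.
keys = list(grouped.groups.keys())
print(keys)
```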

@hmgaudecker

+1 for the dropna option. Just got bitten by this again...

@stefanw

stefanw commented Mar 9, 2016

+1 for the dropna option as well, got bitten by it with some NaT values. This is most unexpected, especially with multiple column groupbys. Explicitly excluding NaN rows before grouping seems more pythonic to me, but a backwards-compatible option with dropna=False would work as well.

@jreback jreback modified the milestones: 0.18.1, Next Major Release Mar 9, 2016
@nbonnotte
Contributor

I'm working on a solution.

Problem: there is already a dropna parameter in DataFrame.pivot_table, which is used for something slightly different:

dropna : boolean, default True
    Do not include columns whose entries are all NaN

I'll gladly accept any input. The best solution would probably be to give that option another name, both for .groupby and .pivot_table. Is there any other similar parameter somewhere in pandas whose name could be reused?

@madeleinemf

I would also really like a dropna=False option that allows groups keyed by NaN values.

@TomAugspurger
Contributor

In #29716, @sorenwolfers noted that this affects .sum(level=...) as well:

import pandas as pd
import numpy as np
df = pd.DataFrame({'ind':[np.nan,np.nan,'a','a'],'col':[0,1,2,3]}).set_index('ind')
df.sum(level='ind')
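DataFrame.sum(level=...) was essentially a shortcut for a groupby on that index level (and was later deprecated in favour of it), so spelling it as a groupby makes the fix available, since dropna=False keeps the NaN label:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ind': [np.nan, np.nan, 'a', 'a'],
                   'col': [0, 1, 2, 3]}).set_index('ind')

# Grouping on the index level with dropna=False keeps the NaN label
# as its own group instead of dropping those rows.
result = df.groupby(level='ind', dropna=False).sum()
print(result)
```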

@franz101

franz101 commented Dec 5, 2019

Any update?
Getting this error especially when using groupby and transform


@jreback
Contributor Author

jreback commented Dec 5, 2019

@franz101 happy to have a PR for this
there are several attempts so lots of source material

@charlesdong1991
Member

take

@json2d

json2d commented Apr 20, 2020

seems really easy to get caught by this one

@OFFICE365USER

Got bitten by this. Again. Can't believe this has been open since 2013. Consistency with R is not a reason to do something!

@stevenlis

stevenlis commented Apr 29, 2020

Could you guys pls consider adding a warning for the users at least?

Btw, I'm using 1.0.3 and setting dropna=False makes pivot_table extremely slow.

@andrewdalecramer

People are aware that it's an issue. It's not a difficult issue to solve, I've done it before and it worked fine, the main impediment there is just getting the option through all the layers of indirection down to the actual work code. The reason it hasn't been solved by me is all the testing and work that goes around it, which I was unable to complete. If someone is familiar with the testing apparatus I'd be glad to talk them through my old PR (which would need to be updated as it is quite old).

On my part, I've rolled off the project using pandas a long time ago and am primarily using R now (which handles this just fine) so I don't have a strong reason to spend the days required to learn how testing works in pandas.

@jreback
Contributor Author

jreback commented Apr 29, 2020

if you look at #30584, this issue is completed for groupby; it wasn't hard actually. It will be in 1.1 (soon).

but just like anything else in pandas, it's all volunteer; if you want an issue fixed, pushing a PR is the best way.

@OFFICE365USER

> People are aware that it's an issue. It's not a difficult issue to solve, I've done it before and it worked fine, the main impediment there is just getting the option through all the layers of indirection down to the actual work code. The reason it hasn't been solved by me is all the testing and work that goes around it, which I was unable to complete. If someone is familiar with the testing apparatus I'd be glad to talk them through my old PR (which would need to be updated as it is quite old).
>
> On my part, I've rolled off the project using pandas a long time ago and am primarily using R now (which handles this just fine) so I don't have a strong reason to spend the days required to learn how testing works in pandas.

People are not aware of this issue. An interim solution would be for groupby() to throw an error for the unsupported case, with a flag to re-enable the broken exclude-NaN "feature".

@andrewdalecramer

Sorry, contributors are aware it's an issue. You're welcome to become one and submit a PR.

@jreback jreback modified the milestones: Contributions Welcome, 1.1 May 6, 2020
@zxymath

zxymath commented Aug 24, 2020

There are some problems when grouping by a categorical column:

categories = ['x%s' % str(i) for i in range(2)]
d = pd.Categorical(np.random.choice(categories, size=100), categories=categories)
d[80:] = np.nan
data = pd.DataFrame(np.random.rand(len(d), 1))
print(data.groupby(d, dropna=False).count())
Out:
     0
x0  35
x1  45
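If the Categorical grouper misbehaves like this on a given version, one workaround is to group on the plain values instead of the Categorical, which reliably surfaces the NaN group. A sketch of the same setup (a seeded generator is used here, an addition for reproducibility):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
categories = ['x0', 'x1']
d = pd.Categorical(rng.choice(categories, size=100), categories=categories)
d[80:] = np.nan
data = pd.DataFrame(rng.random((len(d), 1)))

# Grouping on the underlying values (an object array) rather than the
# Categorical makes dropna=False include the 20 NaN rows as a group.
counts = data.groupby(np.asarray(d), dropna=False).count()
print(counts)
```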

@LaveryRM

As a business analyst I was trained to apply two QC rules:

1. Never let the number of rows, or the variable-of-interest total, change without knowing why.
2. Check every path through nested IFs and through conditional transforms.

This is basic business analyst / data scientist QC and should happen on EVERY merge/filter step in a project. Because of the need to do this frequently, I suggest EVERY summation should default to preserving the count and the total.

An option to remove NaNs in an index is appropriate, but not as the default.

@colomb8

colomb8 commented Feb 14, 2022

Why is this issue closed? The problem persists on pandas 1.4.0.

@soerenwolfers

soerenwolfers commented Dec 7, 2022

On 1.5.2 (and possibly lower) you can pass dropna=False, and it will work with both regular indices and multi-indices.
