Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cumsum sums the groupby column #5614

Closed
hayd opened this issue Nov 28, 2013 · 16 comments · Fixed by #7019
Closed

cumsum sums the groupby column #5614

hayd opened this issue Nov 28, 2013 · 16 comments · Fixed by #7019
Assignees
Milestone

Comments

@hayd
Copy link
Contributor

hayd commented Nov 28, 2013

It shouldn't sum the groupby'd col (in fact index col should be the index, if groupby as_index).

In [13]: df = pd.DataFrame([[1, 2, np.nan], [1, np.nan, 9], [3, 4, 9]], columns=['A', 'B', 'C'])

In [14]: g = df.groupby('A')

In [16]: g.cumsum()
Out[16]: 
   A   B   C
0  1   2 NaN
1  2 NaN   9
2  3   4   9

[3 rows x 3 columns]

Nature of it being dispatch. Should fix up for 0.14 possibly along with some other whitelisted groupby functions.

@jorisvandenbossche
Copy link
Member

What would be the expected output? Something like this?:

In [29]: g.cumsum
Out[29]:
       B   C
  A
0 1   2 NaN
1   NaN   9
2 3   4   9

And should it then also be the case for cumcount if as_index=True?

@hayd
Copy link
Contributor Author

hayd commented Nov 29, 2013

@jorisvandenbossche RE the index of cumcount, possibly yes it should respect as_index... I think it's debatable if this would ever be desired though... the main problem however is it's slow (I don't think efficient way to append index to index to make MI) and this is the default. I had thought I had posted about this somewhere but can't find issue...

I think so, though like I say I think we need to have a discussion about as_index (there are at least three different ways used in groupby atm)... I had a partially filled in issue about it from a week or so ago... :s will look at it again after the weekend and try to post it. It's kinda a mess and some conventions are of dubious value (e.g. that of head)

@jorisvandenbossche
Copy link
Member

Yes, you did :-) Here: #4646 (comment)
And it is indeed, for cumcount, maybe in some way more consistent to also return a MI, but I also think you mostly wouldn't want it.

@hayd
Copy link
Contributor Author

hayd commented Jan 27, 2014

I think should add some UserWarnings in 0.14 about this kind of behaviour, link to #5755.

@jreback
Copy link
Contributor

jreback commented Apr 6, 2014

@hayd you have anything in the works about this? push to 0.15 otherwise

@hayd
Copy link
Contributor Author

hayd commented Apr 6, 2014

I think I do, hope to get in the week.

@jreback
Copy link
Contributor

jreback commented May 1, 2014

ping!

@jreback
Copy link
Contributor

jreback commented May 1, 2014

I think this is closed by #7000, maybe just add a test?

@hayd
Copy link
Contributor Author

hayd commented May 1, 2014

Weirdly with the above example we don't have A as the index!

In [4]: df = pd.DataFrame([[1, 2, np.nan], [1, np.nan, 9], [3, 4, 9]], columns=['A', 'B', 'C'])

In [5]: g = df.groupby('A')

In [6]: g.cumsum()  # should have A as index
Out[6]:
    B   C
0   2 NaN
1 NaN   9
2   4   9

In [7]: g = df.groupby('A', as_index=False)  # this is correct

In [8]: g.cumsum()
Out[8]:
   A   B   C
0  1   2 NaN
1  2 NaN   9
2  3   4   9

@hayd
Copy link
Contributor Author

hayd commented May 1, 2014

Ah wait, this is a feature! Coool!

@jreback
Copy link
Contributor

jreback commented May 1, 2014

hmm...the index should have a named index (as A)...let me fix

@hayd
Copy link
Contributor Author

hayd commented May 1, 2014

@jreback I'm not so sure, what are you changing? I think this is good as is!

@jreback
Copy link
Contributor

jreback commented May 1, 2014

I think this should be this (happens to be the same as sum in this case)

DataFrame([[2, 9], [4, 9]], columns=['B', 'C'], index=Index([1,3],name='A'))
   B  C
A      
1  2  9
3  4  9

[2 rows x 2 columns]

@jreback
Copy link
Contributor

jreback commented May 1, 2014

Here's a more realistic example

db) df = DataFrame([[1, 2, np.nan], [1, np.nan, 9], [1, 1, 2 ], [3, 4, 9]], columns=['A', 'B', 'C'])
(Pdb) df
   A   B   C
0  1   2 NaN
1  1 NaN   9
2  1   1   2
3  3   4   9

[4 rows x 3 columns]

(Pdb) results = concat([df.iloc[0:3].cumsum(),df.iloc[3:4].cumsum()])
(Pdb) p results
   A   B   C
0  1   2 NaN
1  2 NaN   9
2  3   3  11
3  3   4   9

[4 rows x 3 columns]

(Pdb) results.index = MultiIndex.from_tuples([(1,0),(1,1),(1,2),(3,0)],names=['A',None])
(Pdb) results
      B   C
A             
1 0    2 NaN
  1  NaN   9
  2    3  11
3 0    4   9

[4 rows x 3 columns]

Here's the current result

(Pdb) df.groupby('A').cumsum()
    B   C
0   2 NaN
1 NaN   9
2   3  11
3   4   9

[4 rows x 2 columns]

@jorisvandenbossche
Copy link
Member

Following our rules, cumsum is not a reducer/aggregator, so it should ignore as_index?

And I would say it is a transformer, and then it is 'correct' to drop the grouper column. At least this is also what transform does:

In [10]: df.groupby('A', as_index=False).transform(lambda x: x.cumsum())
Out[10]:
    B   C
0   2 NaN
1 NaN   9
2   3  11
3   4   9

@jreback
Copy link
Contributor

jreback commented May 1, 2014

@jorisvandenbossche you are right....ok..marked it as some additional tests needed in any event (simply to validate this expectation)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants