
PERF: improves performance in GroupBy.cumcount #11039

Closed
behzadnouri wants to merge 1 commit from the grby-cumcount branch

Conversation

@behzadnouri (Contributor)

closes #12839

    -------------------------------------------------------------------------------
    Test name                                    | head[ms] | base[ms] |  ratio   |
    -------------------------------------------------------------------------------
    groupby_ngroups_10000_cumcount               |   3.9790 |  74.9559 |   0.0531 |
    groupby_ngroups_100_cumcount                 |   0.6043 |   0.9940 |   0.6079 |
    -------------------------------------------------------------------------------
    Test name                                    | head[ms] | base[ms] |  ratio   |
    -------------------------------------------------------------------------------

    Ratio < 1.0 means the target commit is faster than the baseline.
    Seed used: 1234

    Target [2a1c935] : PERF: improves performance in GroupBy.cumcount
    Base   [5b1f3b6] : reverts 'from .pandas_vb_common import *'

@jreback (Contributor) commented Sep 10, 2015

this would probably close #7569
and partially address #5755

please add tests and/or advise.

ty

@jreback added the Groupby and Performance (Memory or execution speed) labels Sep 10, 2015
@jreback jreback added this to the 0.17.0 milestone Sep 10, 2015
    indices.append(v)
    run = np.r_[True, ids[:-1] != ids[1:]]
    rep = np.diff(np.r_[np.nonzero(run)[0], count])
    out = (~run).cumsum()
Contributor:

actually np.intp is ok; it's this:

In [1]: np.array([True,False,True]).cumsum()
Out[1]: array([1, 1, 2])

In [2]: np.array([True,False,True]).cumsum().dtype
Out[2]: dtype('int32')

Must be some weird casting on Windows. Providing an accumulator dtype seems to work.

In [3]: np.array([True,False,True]).cumsum(dtype='int64').dtype
Out[3]: dtype('int64')
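The run-length trick in the diff lines quoted above can be sketched as a standalone function. This is a reconstruction, not the PR's exact code: the quoted hunk shows only part of the routine, the final rebasing subtraction is inferred, and it assumes the group ids are contiguous (equal ids adjacent), as they are after the groupby sort step.

```python
import numpy as np

def cumcount_sorted(ids):
    """Within-group counter for contiguous (sorted) group ids.

    run  flags the first element of each run of equal ids,
    rep  holds the length of each run,
    out  is a global counter rebased to zero at every run start.
    """
    ids = np.asarray(ids)
    count = len(ids)
    run = np.r_[True, ids[:-1] != ids[1:]]           # True at each group start
    rep = np.diff(np.r_[np.nonzero(run)[0], count])  # length of each run
    out = (~run).cumsum()                            # global running counter
    out -= np.repeat(out[run], rep)                  # rebase to 0 per group
    return out

print(cumcount_sorted([0, 0, 0, 1, 1]))  # [0 1 2 0 1]
```

Because every step is a whole-array NumPy operation, the cost no longer scales with the number of groups, which matches the groupby_ngroups_10000_cumcount speedup in the benchmark table.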

@behzadnouri (Contributor, Author)

well, I am not adding a new feature or closing a bug, so the tests already there should be fine

    assert_series_equal(g.B.nth(0), df.B.iloc[[0, 2]])
    assert_series_equal(g.B.nth(1), df.B.iloc[[1]])
    assert_frame_equal(g.nth(-3), df.loc[[]].set_index('A'))
    assert_series_equal(g.B.nth(0), df.set_index('A').B.iloc[[0, 2]])
Contributor:

you need to document this. It was incorrect before (the index was not set correctly), so it's a bug-fix. Please show an example in the whatsnew.

Member:

This is even an API change, I would say (or at least mention it there). nth on a SeriesGroupBy has always been like that. The method on DataFrameGroupBy was made a reducing method instead of a filtering method on purpose in 0.14 (#7044), but I think we forgot SeriesGroupBy then?

Also the docstring of nth is still incorrect I see (but for frame, so that is a separate issue)

Contributor (Author):

@jreback I am not sure what you want me to put in the whatsnew; such things would be much more efficient if you take care of them yourself. Even if I do it, it may not be what you had in mind, and then we have to keep going back and forth.

the other thing is that this whole index-and-dropna handling is broken; in addition to #11038 there is this:

>>> df
  1st  2nd
0   a    0
1   a    1
2   a    2
3   b  NaN
4   b    4
>>> df.groupby('1st')['2nd'].nth(0, dropna=False)
0     0
3   NaN
Name: 2nd, dtype: float64
>>> df.groupby('1st')['2nd'].nth(0, dropna=True)
1st
a   NaN
b   NaN
Name: 2nd, dtype: float64

so not sure if there is a point in documenting it when it is not working.

@hayd (Contributor) commented Sep 10, 2015

I think this crept in via #7910 / @mortada adding support for passing lists to nth.

@jreback (Contributor) commented Sep 11, 2015

@behzadnouri can you do a whatsnew note

@jorisvandenbossche (Member)

Are we sure we want to change this? It has always been like this, I think (but indeed, inconsistent between SeriesGroupBy and DataFrameGroupBy ..)

@hayd this behaviour is already in 0.14.1 (while the list enhancement was only added in 0.15.0)

@jreback (Contributor) commented Sep 11, 2015

I think this should be changed as @behzadnouri suggests. The code is MUCH simpler, and everything would be consistent. I think what nth does now is a little odd (and IIRC it behaves differently if you .apply it rather than use .nth directly).

@jreback (Contributor) commented Sep 12, 2015

This is really a bug-fix, but since it is a fundamental change (a position-based index vs. the original index being returned) it ought to be highlighted to the user. This just needs a simple Previous Behavior / New Behavior section with a code-block from the previous and actual output from the new.

In [1]: df = DataFrame([[1, np.nan], [1, 4], [5, 6]], columns=['A', 'B'])
In [2]: g = df.groupby('A')

v0.16.2

In [6]: g.B.nth(0)
Out[6]: 
0   NaN
2     6
Name: B, dtype: float64

In [7]: g.B.nth(1)
Out[7]: 
1    4
Name: B, dtype: float64

This PR

In [4]: g.B.nth(0)
Out[4]: 
A
1   NaN
5     6
Name: B, dtype: float64

In [5]:  g.B.nth(1)
Out[5]: 
A
1    4
Name: B, dtype: float64

@jreback jreback added the Bug label Sep 12, 2015
@hayd (Contributor) commented Sep 12, 2015

nth was supposed to be of the "filtering" type rather than aggregating type. Hence the v0.16.2 behaviour.

@jreback (Contributor) commented Sep 12, 2015

@hayd right! I forgot about that. So what should we do with all of this? The prior code had quite a number of 'hacks' to support that exact behavior.

@jreback jreback modified the milestones: 0.17.1, 0.17.0 Sep 13, 2015
@jreback (Contributor) commented Sep 13, 2015

moving to next version. This needs to not change the API. The result index can be set differently at the end.

@behzadnouri (Contributor, Author)

The API is the function signature, which does not change.

If you mean the behaviour, it is inconsistent between series and data-frames, and even within series when there are nulls; it seems to be a bug, not by design. So at some point you need to make series behave like frames, or the other way around.

>>> df
   A  B
0  a  0
1  b  1
>>> df.groupby('A').nth(0)
   B
A   
a  0
b  1
>>> df.groupby('A')['B'].nth(0)
0    0
1    1
Name: B, dtype: int64

>>> ts
0     0
1   NaN
dtype: float64
>>> ts.groupby(Series(['a', 'b'])).nth(0, dropna=True)
a   NaN
b   NaN
dtype: float64

@hayd (Contributor) commented Sep 13, 2015

Series should behave like DataFrame (and filter rather than agg). I was sure these worked the same on Series, or did behind a flag (which could not be made the default due to the old behaviour).

@jorisvandenbossche (Member)

@hayd I think we are not using the same terminology :-)

Currently, nth DataFrame behaviour is aggregating/reducing (so setting the grouping keys as the index), while Series behaviour is filtering (keeping the original index labels) (so the other way around than I understand from your comment)

Originally nth on a DataFrame was filtering when you added it (#6569), but it was changed to reducing even before it was included in a release by @jreback (#7044). Back then I asked specifically about this (#7044 (comment)), but @jreback gave some reasons to make it reducing. Only, in that PR, I think it was an oversight of us that it only made DataFrame reducing, and left Series behaviour as filtering.
Also in the docs, nth is explicitly mentioned among the reducing methods: the note box at the end of this section: http://pandas.pydata.org/pandas-docs/stable/groupby.html#aggregation

The introduction of the possibility to pass lists to get multiple values complicated this a bit, as a real aggregating method will only return one value per group. Further, with the introduction of the dropna kwarg, you cannot really call it filtering anymore when using that, since it will not always return original rows.

@jreback (Contributor) commented Sep 13, 2015

IIRC (and it has been a while), we want to make nth(0, dropna=True) == .first() and nth(-1, dropna=True) == .last()
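The equivalences described here (nth(0, dropna=...) matching .first(), and symmetrically nth(-1, ...) matching .last()) can be sketched on a made-up frame. The dropna keyword spelling and the result's index have varied across pandas versions, so this compares values only:

```python
import numpy as np
import pandas as pd

# Made-up frame: group A=1 has a leading NaN plus two valid values,
# so first() and last() pick different rows within that group.
df = pd.DataFrame({'A': [1, 1, 1, 5], 'B': [np.nan, 4, 7, 6]})
g = df.groupby('A')['B']

print(g.first().tolist())                # [4.0, 6.0]
print(g.last().tolist())                 # [7.0, 6.0]
print(g.nth(0, dropna='any').tolist())   # [4.0, 6.0]
print(g.nth(-1, dropna='any').tolist())  # [7.0, 6.0]
```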

@jorisvandenbossche (Member)

@jreback and that is true for a DataFrame, but not Series (so this PR makes that analogy more consistent)

@jorisvandenbossche (Member)

Now, for a solution. I think we just have to make a choice ..

  1. Leave it as is (DataFrame as reducing, Series as filtering, it has been like this since 0.14)
  2. Have all as reducing (so change Series)
  3. Have all as filtering (and change DataFrame behaviour)

If we change something, I would go for 2), as the filtering behaviour is easy to get by using as_index=False (while for the other way around you have to do set_index()), and then it is more consistent with first/last.
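The shape conversion this refers to can be sketched with first(), whose behaviour has been stable (unlike nth itself, which is the subject of this PR). The frame below is made up, with a non-default index so the two shapes differ visibly:

```python
import pandas as pd

# Made-up frame with a non-default index.
df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': [0, 1, 2]}, index=[10, 11, 12])

reduced = df.groupby('A').first()               # group keys become the index
flat = df.groupby('A', as_index=False).first()  # keys stay as a regular column

# set_index() turns the flat shape back into the reduced shape,
# which is the "other way around" conversion mentioned above.
print(flat.set_index('A').equals(reduced))  # True
```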

@jreback (Contributor) commented Sep 13, 2015

ahh, ok, this was partially a consistency-fix then. I would agree with 2).

@jorisvandenbossche (Member)

Some other inconsistencies:

  • as_index=False is not working when using dropna:

    In [38]: df = pd.DataFrame({'a':list('ABABAB'), 'b':[1,2,3,4,5,6]})
    
    In [39]: df2 = df.copy()
    
    In [40]: df2.iloc[1,1] = np.nan
    
    In [41]: df2.groupby('a').nth(0, dropna='any')
    Out[41]:
    b
    a
    A  1
    B  4
    
    In [42]: df2.groupby('a', as_index=False).nth(0, dropna='any')
    Out[42]:
    b
    a
    A  1
    B  4
    
    In [43]: df2.groupby('a', as_index=False).nth(0)
    Out[43]:
    a   b
    0  A   1
    1  B NaN
    
  • the resulting index when using as_index=False is different between nth and first:

    In [45]: df.index = df.index + 10
    
    In [46]: df.groupby('a').nth(0)
    Out[46]:
    b
    a
    A  1
    B  2
    
    In [47]: df.groupby('a').first()
    Out[47]:
    b
    a
    A  1
    B  2
    
    In [48]: df.groupby('a', as_index=False).first()
    Out[48]:
    a  b
    0  A  1
    1  B  2
    
    In [50]: df.groupby('a', as_index=False).nth(0)
    Out[50]:
    a  b
    10  A  1
    11  B  2
    

@jreback (Contributor) commented Sep 13, 2015

these are already noted in #5755 (though not so explicitly)

@jorisvandenbossche (Member)

But here it is possibly on purpose, as for nth this is a way to get filtering behaviour (keep the original index), while reducing methods give you just a default 0, 1, .., n index when using as_index=False, as you can never have the original index there. That is the problem with nth being some mixture of a reducing and a filtering method.

@jreback jreback modified the milestones: Next Major Release, 0.17.1 Nov 13, 2015
@jreback (Contributor) commented Mar 12, 2016

can you rebase/update

@behzadnouri behzadnouri force-pushed the grby-cumcount branch 4 times, most recently from e69ded7 to b0c973c Compare March 13, 2016 14:16
@behzadnouri (Contributor, Author)

@jreback rebased

@jreback jreback modified the milestones: 0.18.1, Next Major Release Mar 13, 2016
@jreback (Contributor) commented Mar 13, 2016

@behzadnouri ok, thanks. Can you put in a whatsnew sub-section that shows the changes as a user would care about them, IOW an example (you can take from tests) showing what used to happen and the (correct) new way.

@behzadnouri (Contributor, Author)

@jreback added an example to whatsnew showing the changes


    New Behavior:

    .. code-block:: ipython
Contributor:

use an ipython block

Contributor:

use ipython blocks in the new

Contributor (Author):

@jreback from what I see from similar cases, this is already in the correct form; v0.18.1.txt#L159 as an example

Contributor:

yes for the Previous Behavior, not the New. Please make the change.

@behzadnouri (Contributor, Author)

@jreback moved to api changes section

"""
arr is where cumcount gets its values from
Contributor:

can you add a Parameters section

@behzadnouri (Contributor, Author)

@jreback made the changes

    Out[5]:
    0 1
    1 2
    Name: B, dtype: int64
Contributor:

show the same as_index=False here as well (as you are showing below)

Contributor (Author):

the point here is that as_index=True is ignored in the old behaviour; as_index=False is not relevant or informative (and has not changed).

this has been going back and forth for too many times. if you like to add/modify anything please do so on your end.

Contributor:

If you are showing it in the new, then please show it in the original.

> this has been going back and forth for too many times.

Well, that's just how it is. The docs have to be in the proper format and be consistent.

Contributor (Author):

@jreback please go ahead and make any changes you find necessary

Contributor:

@behzadnouri of course, but that's not the point, is it. OK, thank you for the PR.

@jreback (Contributor) commented Apr 18, 2016

@jorisvandenbossche @TomAugspurger comments?

@TomAugspurger (Contributor)

Looks good on a quick skim.

@behzadnouri you seem to have also fixed a bug where groupby.nth was ignoring the sort keyword

# master
In [5]: df.groupby('c', sort=True).nth(1)
Out[5]:
          a         b
c
0 -0.029029  0.565333
1  0.186213  1.110464
2  0.982333 -0.544459
3 -0.626740 -0.541241

In [6]: df.groupby('c', sort=False).nth(1)
Out[6]:
          a         b
c
0 -0.029029  0.565333
1  0.186213  1.110464
2  0.982333 -0.544459
3 -0.626740 -0.541241

vs.

# your branch
In [1]: df = pd.DataFrame(np.random.randn(100, 2), columns=['a', 'b'])

In [2]: df['c'] = np.random.randint(0, 4, 100)

In [3]: df.groupby('c', sort=True).nth(1)
Out[3]:
          a         b
c
0 -1.168300 -2.224763
1 -0.562298 -1.262734
2 -0.439613  0.236592
3 -0.499235 -0.808404

In [4]: df.groupby('c', sort=False).nth(1)
Out[4]:
          a         b
c
1 -0.562298 -1.262734
2 -0.439613  0.236592
3 -0.499235 -0.808404
0 -1.168300 -2.224763

Would be nice to have a release note for that too (could maybe do on merge?).

@jreback jreback closed this in 445d1c6 Apr 25, 2016
@jreback (Contributor) commented Apr 25, 2016

thanks @behzadnouri

@TomAugspurger your suggestions incorporated as well!

@behzadnouri behzadnouri deleted the grby-cumcount branch April 25, 2016 14:53
@sinhrks mentioned this pull request Jul 27, 2016
Labels: Bug, Groupby, Performance (Memory or execution speed)
Successfully merging this pull request may close these issues.

GroupBy.nth includes group key inconsistently
5 participants