
PERF: improves performance in GroupBy.cumcount #11039

Closed · wants to merge 1 commit

5 participants
@behzadnouri (Contributor) commented Sep 9, 2015

closes #12839

    -------------------------------------------------------------------------------
    Test name                                    | head[ms] | base[ms] |  ratio   |
    -------------------------------------------------------------------------------
    groupby_ngroups_10000_cumcount               |   3.9790 |  74.9559 |   0.0531 |
    groupby_ngroups_100_cumcount                 |   0.6043 |   0.9940 |   0.6079 |
    -------------------------------------------------------------------------------
    Test name                                    | head[ms] | base[ms] |  ratio   |
    -------------------------------------------------------------------------------

    Ratio < 1.0 means the target commit is faster than the baseline.
    Seed used: 1234

    Target [2a1c935] : PERF: improves performance in GroupBy.cumcount
    Base   [5b1f3b6] : reverts 'from .pandas_vb_common import *'
@jreback (Contributor) commented Sep 10, 2015

This would probably close #7569, and is a partial fix for #5755.

Please add tests and/or advise.

Thanks.

@jreback jreback added this to the 0.17.0 milestone Sep 10, 2015

@jreback reviewed pandas/core/groupby.py (outdated):
indices.append(v)
run = np.r_[True, ids[:-1] != ids[1:]]
rep = np.diff(np.r_[np.nonzero(run)[0], count])
out = (~run).cumsum()
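The run-length idea in this diff can be sketched standalone. Here `ids` is a hypothetical sorted array of group codes and `count = len(ids)` (the real code gets these from the grouper); the final subtraction step is an assumption about how the snippet is completed downstream, shown only to make the sketch self-contained:

```python
import numpy as np

# Sketch of a vectorized cumcount over sorted group ids: mark where each run of
# equal ids starts, build a global counter, then rebase each run to start at 0.
ids = np.array([0, 0, 0, 1, 1, 2])          # sorted group codes (assumed input)
count = len(ids)

run = np.r_[True, ids[:-1] != ids[1:]]      # True at the first element of each run
rep = np.diff(np.r_[np.nonzero(run)[0], count])  # length of each run
out = (~run).cumsum()                       # global counter that skips run starts
out -= np.repeat(out[run], rep)             # subtract each run's starting offset

print(out)                                  # [0 1 2 0 1 0]
```

This avoids any per-group Python loop, which is where the ~19x speedup for many small groups comes from.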

@jreback (Contributor), Sep 10, 2015
Actually np.intp is OK. It's this:

In [1]: np.array([True,False,True]).cumsum()
Out[1]: array([1, 1, 2])

In [2]: np.array([True,False,True]).cumsum().dtype
Out[2]: dtype('int32')

Must be some weird casting on Windows. Providing an accumulator dtype seems to work:

In [3]: np.array([True,False,True]).cumsum(dtype='int64').dtype
Out[3]: dtype('int64')
@behzadnouri (Contributor, Author) commented Sep 10, 2015

Well, I am not adding a new feature or closing a bug, so the tests already there should be fine.

@jreback reviewed pandas/tests/test_groupby.py (outdated):
assert_series_equal(g.B.nth(0), df.B.iloc[[0, 2]])
assert_series_equal(g.B.nth(1), df.B.iloc[[1]])
assert_frame_equal(g.nth(-3), df.loc[[]].set_index('A'))
assert_series_equal(g.B.nth(0), df.set_index('A').B.iloc[[0, 2]])

@jreback (Contributor), Sep 10, 2015

You need to document this. It was incorrect before (the index was not set correctly), so it's a bug fix. Please show an example in the whatsnew.

@jorisvandenbossche (Member), Sep 10, 2015

This is even an API change, I would say (or at least mention it there). nth on a SeriesGroupBy has always been like that. The method on DataFrameGroupBy was made a reducing method instead of a filtering method on purpose in 0.14 (#7044), but I think we forgot SeriesGroupBy then?

Also, I see the docstring of nth is still incorrect (but for frames, so that is a separate issue).

@behzadnouri (Author, Contributor), Sep 12, 2015

@jreback I am not sure what you want me to put in the whatsnew; such things would be much more efficient if you take care of them yourself. Even if I do it, it may not be what you had in mind, and then we have to keep going back and forth.

The other thing is that this whole index-and-dropna handling is broken; in addition to #11038 there is this:

>>> df
  1st  2nd
0   a    0
1   a    1
2   a    2
3   b  NaN
4   b    4
>>> df.groupby('1st')['2nd'].nth(0, dropna=False)
0     0
3   NaN
Name: 2nd, dtype: float64
>>> df.groupby('1st')['2nd'].nth(0, dropna=True)
1st
a   NaN
b   NaN
Name: 2nd, dtype: float64

So I'm not sure there is a point in documenting it when it is not working.

@hayd (Contributor) commented Sep 10, 2015

I think this crept in via #7910 / @mortada adding support for passing lists to nth.

@jreback (Contributor) commented Sep 11, 2015

@behzadnouri can you add a whatsnew note?

@jorisvandenbossche (Member) commented Sep 11, 2015

Are we sure we want to change this? It has always been like that, I think (but indeed, inconsistent between SeriesGroupBy and DataFrameGroupBy).

@hayd this behaviour is already in 0.14.1 (while the list enhancement was only added in 0.15.0).

@jreback (Contributor) commented Sep 11, 2015

I think this should be changed as @behzadnouri suggests. The code is MUCH simpler, and everything would be consistent. I think what nth does now is a little odd (and IIRC it behaves differently if you .apply it rather than use .nth directly).

@jreback (Contributor) commented Sep 12, 2015

This is really a bug fix, but since a fundamental change is being returned (a position-based index vs. the original index), it ought to be highlighted to the user. This just needs a simple Previous Behavior / New Behavior section, with a code-block for the previous and the actual output for the new.

In [1]: df = DataFrame([[1, np.nan], [1, 4], [5, 6]], columns=['A', 'B'])
In [2]: g = df.groupby('A')

v0.16.2

In [6]: g.B.nth(0)
Out[6]: 
0   NaN
2     6
Name: B, dtype: float64

In [7]: g.B.nth(1)
Out[7]: 
1    4
Name: B, dtype: float64

This PR

In [4]: g.B.nth(0)
Out[4]: 
A
1   NaN
5     6
Name: B, dtype: float64

In [5]:  g.B.nth(1)
Out[5]: 
A
1    4
Name: B, dtype: float64

@jreback jreback added the Bug label Sep 12, 2015

@hayd (Contributor) commented Sep 12, 2015

nth was supposed to be of the "filtering" type rather than the aggregating type. Hence the v0.16.2 behaviour.

@jreback (Contributor) commented Sep 12, 2015

@hayd right! I forgot about that. So what should we do with all of this? The prior code had quite a number of 'hacks' to support that exact behavior.

@jreback jreback modified the milestones: 0.17.1, 0.17.0 Sep 13, 2015

@jreback (Contributor) commented Sep 13, 2015

Moving to the next version. This needs to not change the API; the result index can be set differently at the end.

@behzadnouri (Contributor, Author) commented Sep 13, 2015

The API is the function signature, which does not change.

If you mean the behaviour, it is inconsistent between Series and DataFrame, and even within Series when there are nulls; it seems to be a bug, not by design. So at some point you need to make Series behave like frames, or the other way around.

>>> df
   A  B
0  a  0
1  b  1
>>> df.groupby('A').nth(0)
   B
A   
a  0
b  1
>>> df.groupby('A')['B'].nth(0)
0    0
1    1
Name: B, dtype: int64

>>> ts
0     0
1   NaN
dtype: float64
>>> ts.groupby(Series(['a', 'b'])).nth(0, dropna=True)
a   NaN
b   NaN
dtype: float64
@hayd (Contributor) commented Sep 13, 2015

Series should behave like DataFrame (and filter rather than aggregate). I was sure these worked the same on Series, or did behind a flag (which could not be made the default due to old behaviour).

@jorisvandenbossche (Member) commented Sep 13, 2015

@hayd I think we are not using the same terminology :-)

Currently, nth's DataFrame behaviour is aggregating/reducing (setting the grouping keys as the index), while the Series behaviour is filtering (keeping the original index labels), so the other way around from what I understand from your comment.

Originally nth on a DataFrame was filtering when you added it (#6569), but it was changed to reducing even before it was included in a release by @jreback (#7044). Back then I asked specifically about this (#7044 (comment)), but @jreback gave some reasons to make it reducing. Only, in that PR, I think it was an oversight of ours that we only made DataFrame reducing and left the Series behaviour as filtering.
Also, in the docs nth is explicitly mentioned among the reducing methods: see the note box at the end of this section: http://pandas.pydata.org/pandas-docs/stable/groupby.html#aggregation

The introduction of the possibility to pass lists to get multiple values complicated this a bit, as a real aggregating method will only return one value per group. Further, with the introduction of the dropna kwarg, you cannot really call it filtering anymore when using that, since it will not always return original rows.
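The reducing-vs-filtering distinction in this discussion can be made concrete with a small sketch (hypothetical data; `cumcount` is used here only to emulate "first row of each group" as a filter, not as the actual nth implementation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 5], 'B': [np.nan, 4, 6]})
g = df.groupby('A')

# Reducing: one row per group, indexed by the group keys
# (DataFrame.nth-style at the time of this discussion).
reduced = g.first()                # index is A (1, 5); the NaN in B is skipped
print(reduced)

# Filtering: a subset of the original rows, original index labels kept
# (SeriesGroupBy.nth-style at the time of this discussion).
filtered = df[g.cumcount() == 0]   # rows 0 and 2; the NaN is kept
print(filtered)
```

The two forms differ both in which rows survive (first() skips the NaN, the positional filter does not) and in what the result is indexed by.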

@jreback (Contributor) commented Sep 13, 2015

IIRC (and it has been a while), we want to make nth(0, dropna=True) == .first() and nth(-1, dropna=True) == .last()

@jorisvandenbossche (Member) commented Sep 13, 2015

@jreback and that is true for a DataFrame, but not Series (so this PR makes that analogy more consistent)

@jorisvandenbossche (Member) commented Sep 13, 2015

Now, for a solution. I think we just have to make a choice ..

  1. Leave it as is (DataFrame as reducing, Series as filtering, it has been like this since 0.14)
  2. Have all as reducing (so change Series)
  3. Have all as filtering (and change DataFrame behaviour)

If we change something, I would go for 2), as the filtering behaviour is easy to get using as_index=False (while for the other way around you have to do set_index()), and then it is more consistent with first/last.
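A sketch of that argument for option 2, with hypothetical data: the flat form falls out of as_index=False, while going from a flat result back to a group-keyed one requires an explicit set_index:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b'], 'B': [0, 1, 2]})

reduced = df.groupby('A').first()               # group keys become the index
flat = df.groupby('A', as_index=False).first()  # flat 0..n-1 index, A as a column

# flat -> reduced needs a set_index call; reduced -> flat is just the kwarg.
print(flat.set_index('A').equals(reduced))      # True
```

So if all forms of nth were reducing, users who want the flat form would get it the same way they do for first/last.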

@jreback (Contributor) commented Sep 13, 2015

Ahh, OK, this was partially a consistency fix then. I would agree with 2).

@jorisvandenbossche (Member) commented Sep 13, 2015

Some other inconsistencies:

  • as_index=False is not working when using dropna:

    In [38]: df = pd.DataFrame({'a':list('ABABAB'), 'b':[1,2,3,4,5,6]})
    
    In [39]: df2 = df.copy()
    
    In [40]: df2.iloc[1,1] = np.nan
    
    In [41]: df2.groupby('a').nth(0, dropna='any')
    Out[41]:
       b
    a
    A  1
    B  4

    In [42]: df2.groupby('a', as_index=False).nth(0, dropna='any')
    Out[42]:
       b
    a
    A  1
    B  4

    In [43]: df2.groupby('a', as_index=False).nth(0)
    Out[43]:
       a    b
    0  A    1
    1  B  NaN
    
  • the resulting index when using as_index=False is different between nth and first:

    In [45]: df.index = df.index + 10
    
    In [46]: df.groupby('a').nth(0)
    Out[46]:
       b
    a
    A  1
    B  2

    In [47]: df.groupby('a').first()
    Out[47]:
       b
    a
    A  1
    B  2

    In [48]: df.groupby('a', as_index=False).first()
    Out[48]:
       a  b
    0  A  1
    1  B  2

    In [50]: df.groupby('a', as_index=False).nth(0)
    Out[50]:
        a  b
    10  A  1
    11  B  2
    
@jreback (Contributor) commented Sep 13, 2015

These are already noted in #5755 (though not so explicitly).

@jorisvandenbossche (Member) commented Sep 13, 2015

But here it is possibly on purpose, as for nth this is a way to get filtering behaviour (keeping the original index), while reducing methods give you just a default 0, 1, ..., n index when using as_index=False, since you can never have the original index there. That is the problem with nth being some mixture of a reducing and a filtering method.

@jreback jreback modified the milestones: Next Major Release, 0.17.1 Nov 13, 2015

@behzadnouri behzadnouri force-pushed the behzadnouri:grby-cumcount branch 4 times, most recently Mar 12, 2016

@behzadnouri (Contributor, Author) commented Mar 13, 2016

@jreback rebased

@jreback jreback modified the milestones: 0.18.1, Next Major Release Mar 13, 2016

@jreback (Contributor) commented Mar 13, 2016

@behzadnouri OK, thanks. Can you put in a whatsnew sub-section that shows the changes as a user would care about them, IOW an example (you can take it from the tests) showing what used to happen and the (correct) new way?

@behzadnouri behzadnouri force-pushed the behzadnouri:grby-cumcount branch 2 times, most recently Apr 12, 2016

@behzadnouri (Contributor, Author) commented Apr 12, 2016

@jreback added an example to whatsnew showing the changes


New Behavior:

.. code-block:: ipython

@jreback (Contributor), Apr 12, 2016

Use an ipython block.

@jreback (Contributor), Apr 18, 2016

Use ipython blocks in the New Behavior section.

@behzadnouri (Author, Contributor), Apr 18, 2016

@jreback from what I see in similar cases, this is already in the correct form; v0.18.1.txt#L159 as an example.

@jreback (Contributor), Apr 18, 2016

Yes, for the Previous Behavior, not the New. Please make the change.

@jreback reviewed doc/source/whatsnew/v0.18.1.txt (outdated):
@@ -21,6 +21,47 @@ Highlights include:
New features
~~~~~~~~~~~~

.. _whatsnew_0181.enhancements.groubynth:

@jreback (Contributor), Apr 12, 2016

Put this in the API changes section (move it all, keeping the sub-section).

@behzadnouri behzadnouri force-pushed the behzadnouri:grby-cumcount branch Apr 15, 2016

@behzadnouri (Contributor, Author) commented Apr 15, 2016

@jreback moved it to the API changes section

"""
arr is where cumcount gets its values from

@jreback (Contributor), Apr 17, 2016

Can you add a Parameters section?

@jreback reviewed pandas/core/groupby.py (outdated):
for v in self.indices.values():
indices.append(v)
if count == 0:
return np.empty(0, dtype=np.intp)

@jreback (Contributor), Apr 17, 2016

We are always returning i8 for counts (ATM; that may not be fully correct, but it's the current behavior).

@behzadnouri behzadnouri force-pushed the behzadnouri:grby-cumcount branch to 1ffab15 Apr 18, 2016

@behzadnouri (Contributor, Author) commented Apr 18, 2016

@jreback made the changes

Out[5]:
0    1
1    2
Name: B, dtype: int64

@jreback (Contributor), Apr 18, 2016
Show the same as_index=False here as well (as you are showing below).

@behzadnouri (Author, Contributor), Apr 18, 2016

The point here is that as_index=True is ignored in the old behaviour; as_index=False is not relevant or informative (and has not changed).

This has been going back and forth too many times. If you would like to add or modify anything, please do so on your end.

@jreback (Contributor), Apr 18, 2016
If you are showing it in the new, then please show it in the original.

> this has been going back and forth too many times.

Well, that's just how it is. The docs have to be in the proper format and consistent.

@behzadnouri (Author, Contributor), Apr 18, 2016

@jreback please go ahead and make any changes you find necessary

@jreback (Contributor), Apr 18, 2016
@behzadnouri of course, but that's not the point, is it. OK, thank you for the PR.

@jreback (Contributor) commented Apr 18, 2016
@TomAugspurger

@TomAugspurger (Contributor) commented Apr 19, 2016

Looks good on a quick skim.

@behzadnouri you seem to have also fixed a bug where groupby.nth was ignoring the sort keyword:

# master
In [5]: df.groupby('c', sort=True).nth(1)
Out[5]:
          a         b
c
0 -0.029029  0.565333
1  0.186213  1.110464
2  0.982333 -0.544459
3 -0.626740 -0.541241

In [6]: df.groupby('c', sort=False).nth(1)
Out[6]:
          a         b
c
0 -0.029029  0.565333
1  0.186213  1.110464
2  0.982333 -0.544459
3 -0.626740 -0.541241

vs.

# your branch
In [1]: df = pd.DataFrame(np.random.randn(100, 2), columns=['a', 'b'])

In [2]: df['c'] = np.random.randint(0, 4, 100)

In [3]: df.groupby('c', sort=True).nth(1)
Out[3]:
          a         b
c
0 -1.168300 -2.224763
1 -0.562298 -1.262734
2 -0.439613  0.236592
3 -0.499235 -0.808404

In [4]: df.groupby('c', sort=False).nth(1)
Out[4]:
          a         b
c
1 -0.562298 -1.262734
2 -0.439613  0.236592
3 -0.499235 -0.808404
0 -1.168300 -2.224763

Would be nice to have a release note for that too (could maybe add it on merge?).

@jreback jreback closed this in 445d1c6 Apr 25, 2016

@jreback

This comment has been minimized.

Copy link
Contributor

commented Apr 25, 2016

thanks @behzadnouri

@TomAugspurger your suggestions incorporated as well!

@behzadnouri behzadnouri deleted the behzadnouri:grby-cumcount branch Apr 25, 2016

nps added a commit to nps/pandas that referenced this pull request May 17, 2016

@sinhrks referenced this pull request Jul 27, 2016: DOC: Fix groupby nth #13810 (merged)