
PERF: improves performance in GroupBy.cumcount #11039

Closed
behzadnouri wants to merge 1 commit from the grby-cumcount branch

Conversation

@behzadnouri (Contributor)

closes #12839

    -------------------------------------------------------------------------------
    Test name                                    | head[ms] | base[ms] |  ratio   |
    -------------------------------------------------------------------------------
    groupby_ngroups_10000_cumcount               |   3.9790 |  74.9559 |   0.0531 |
    groupby_ngroups_100_cumcount                 |   0.6043 |   0.9940 |   0.6079 |
    -------------------------------------------------------------------------------
    Test name                                    | head[ms] | base[ms] |  ratio   |
    -------------------------------------------------------------------------------

    Ratio < 1.0 means the target commit is faster than the baseline.
    Seed used: 1234

    Target [2a1c935] : PERF: improves performance in GroupBy.cumcount
    Base   [5b1f3b6] : reverts 'from .pandas_vb_common import *'

@jreback (Contributor) commented Sep 10, 2015

this would probably close #7569
and partially address #5755

please add tests and/or advise.

ty

@jreback added the Groupby and Performance (Memory or execution speed) labels Sep 10, 2015
@jreback jreback added this to the 0.17.0 milestone Sep 10, 2015
    indices.append(v)
    run = np.r_[True, ids[:-1] != ids[1:]]
    rep = np.diff(np.r_[np.nonzero(run)[0], count])
    out = (~run).cumsum()
Contributor:

actually np.intp is ok; it's this:

In [1]: np.array([True,False,True]).cumsum()
Out[1]: array([1, 1, 2])

In [2]: np.array([True,False,True]).cumsum().dtype
Out[2]: dtype('int32')

Must be some weird casting on Windows. Providing an accumulator dtype seems to work.

In [3]: np.array([True,False,True]).cumsum(dtype='int64').dtype
Out[3]: dtype('int64')
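The run-length trick in the diff lines quoted above can be sketched as a standalone function. This is a reconstruction, not the PR's exact code: the quoted hunk shows only part of the routine, the final rebasing subtraction is inferred, and it assumes the group ids are contiguous (equal ids adjacent), as they are after the groupby sort step.

```python
import numpy as np

def cumcount_sorted(ids):
    """Within-group counter for contiguous (sorted) group ids.

    run  flags the first element of each run of equal ids,
    rep  holds the length of each run,
    out  is a global counter rebased to zero at every run start.
    """
    ids = np.asarray(ids)
    count = len(ids)
    run = np.r_[True, ids[:-1] != ids[1:]]           # True at each group start
    rep = np.diff(np.r_[np.nonzero(run)[0], count])  # length of each run
    out = (~run).cumsum()                            # global running counter
    out -= np.repeat(out[run], rep)                  # rebase to 0 per group
    return out

print(cumcount_sorted([0, 0, 0, 1, 1]))  # [0 1 2 0 1]
```

Because every step is a whole-array NumPy operation, the cost no longer scales with the number of groups, which matches the groupby_ngroups_10000_cumcount speedup in the benchmark table.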

@behzadnouri (Contributor, Author)

well, I am not adding a new feature or closing a bug, so the tests already there should be fine

    assert_series_equal(g.B.nth(0), df.B.iloc[[0, 2]])
    assert_series_equal(g.B.nth(1), df.B.iloc[[1]])
    assert_frame_equal(g.nth(-3), df.loc[[]].set_index('A'))
    assert_series_equal(g.B.nth(0), df.set_index('A').B.iloc[[0, 2]])
Contributor:

you need to document this. It was incorrect before (the index was not set correctly), so it's a bug-fix. Please show an example in the whatsnew.

Member:

This is even an API change, I would say (or at least mention it there). nth on a SeriesGroupBy has always been like that. The method on DataFrameGroupBy was made a reducing method instead of a filtering method on purpose in 0.14 (#7044), but I think we forgot SeriesGroupBy then?

Also the docstring of nth is still incorrect I see (but for frame, so that is a separate issue)

Contributor (Author):

@jreback I am not sure what you want me to put in the whatsnew; such things would be much more efficient if you take care of them yourself. Even if I do it, it may not be what you had in mind, and then we have to keep going back and forth.

the other thing is that this whole index-and-dropna handling is broken; in addition to #11038 there is this:

>>> df
  1st  2nd
0   a    0
1   a    1
2   a    2
3   b  NaN
4   b    4
>>> df.groupby('1st')['2nd'].nth(0, dropna=False)
0     0
3   NaN
Name: 2nd, dtype: float64
>>> df.groupby('1st')['2nd'].nth(0, dropna=True)
1st
a   NaN
b   NaN
Name: 2nd, dtype: float64

so not sure if there is a point in documenting it when it is not working.

@hayd (Contributor) commented Sep 10, 2015

I think this crept in via #7910 / @mortada adding support for passing lists to nth.

@jreback (Contributor) commented Sep 11, 2015

@behzadnouri can you do a whatsnew note

@jorisvandenbossche (Member)

Are we sure we want to change this? It has always been like this, I think (but indeed, inconsistent between SeriesGroupBy and DataFrameGroupBy ..)

@hayd this behaviour is already in 0.14.1 (while the list enhancement was only added in 0.15.0)

@jreback (Contributor) commented Sep 11, 2015

I think this should be changed as @behzadnouri suggests. The code is MUCH simpler, and everything would be consistent. I think what nth does now is a little odd (and IIRC it behaves differently if you .apply it rather than use .nth directly).

@jreback (Contributor) commented Sep 12, 2015

This is really a bug-fix, but since it is a fundamental change (a position-based index vs. the original index being returned) it ought to be highlighted to the user. This just needs a simple Previous Behavior / New Behavior section with a code-block from the previous and actual output from the new.

In [1]: df = DataFrame([[1, np.nan], [1, 4], [5, 6]], columns=['A', 'B'])
In [2]: g = df.groupby('A')

v0.16.2

In [6]: g.B.nth(0)
Out[6]: 
0   NaN
2     6
Name: B, dtype: float64

In [7]: g.B.nth(1)
Out[7]: 
1    4
Name: B, dtype: float64

This PR

In [4]: g.B.nth(0)
Out[4]: 
A
1   NaN
5     6
Name: B, dtype: float64

In [5]:  g.B.nth(1)
Out[5]: 
A
1    4
Name: B, dtype: float64

@jreback jreback added the Bug label Sep 12, 2015
@hayd (Contributor) commented Sep 12, 2015

nth was supposed to be of the "filtering" type rather than aggregating type. Hence the v0.16.2 behaviour.

@jreback (Contributor) commented Sep 12, 2015

@hayd right! I forgot about that. So what should we do with all of this? The prior code had quite a number of 'hacks' to support that exact behavior.

@jreback jreback modified the milestones: 0.17.1, 0.17.0 Sep 13, 2015
@jreback (Contributor) commented Sep 13, 2015

moving to next version. This needs to not change the API. The result index can be set differently at the end.

@behzadnouri (Contributor, Author)

The API is the function signature, which does not change.

If you mean the behaviour, it is inconsistent between series and data-frames, and even within series when there are nulls; it seems to be a bug, not by design. So at some point you need to make series behave like frames, or the other way around.

>>> df
   A  B
0  a  0
1  b  1
>>> df.groupby('A').nth(0)
   B
A   
a  0
b  1
>>> df.groupby('A')['B'].nth(0)
0    0
1    1
Name: B, dtype: int64

>>> ts
0     0
1   NaN
dtype: float64
>>> ts.groupby(Series(['a', 'b'])).nth(0, dropna=True)
a   NaN
b   NaN
dtype: float64

@hayd (Contributor) commented Sep 13, 2015

Series should behave like DataFrame (and filter rather than agg). I was sure these worked the same on Series, or did behind a flag (which could not be made the default due to the old behaviour).

@jorisvandenbossche (Member)

@hayd I think we are not using the same terminology :-)

Currently, nth DataFrame behaviour is aggregating/reducing (so setting the grouping keys as the index), while Series behaviour is filtering (keeping the original index labels) (so the other way around than I understand from your comment)

Originally nth on a DataFrame was filtering when you added it (#6569), but it was changed to reducing even before it was included in a release by @jreback (#7044). Back then I asked specifically about this (#7044 (comment)), but @jreback gave some reasons to make it reducing. Only, in that PR, I think it was an oversight of us that it only made DataFrame reducing, and left Series behaviour as filtering.
Also in the docs, nth is explicitly mentioned among the reducing methods: the note box at the end of this section: http://pandas.pydata.org/pandas-docs/stable/groupby.html#aggregation

The introduction of the possibility to pass lists to get multiple values complicated this a bit, as a real aggregating method will only return one value per group. Further, with the introduction of the dropna kwarg, you cannot really call it filtering anymore when using that, since it will not always return original rows.

@jreback (Contributor) commented Sep 13, 2015

IIRC (and it has been a while), we want to make nth(0, dropna=True) == .first() and nth(-1, dropna=True) == .last()
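The equivalences described here (nth(0, dropna=...) matching .first(), and symmetrically nth(-1, ...) matching .last()) can be sketched on a made-up frame. The dropna keyword spelling and the result's index have varied across pandas versions, so this compares values only:

```python
import numpy as np
import pandas as pd

# Made-up frame: group A=1 has a leading NaN plus two valid values,
# so first() and last() pick different rows within that group.
df = pd.DataFrame({'A': [1, 1, 1, 5], 'B': [np.nan, 4, 7, 6]})
g = df.groupby('A')['B']

print(g.first().tolist())                # [4.0, 6.0]
print(g.last().tolist())                 # [7.0, 6.0]
print(g.nth(0, dropna='any').tolist())   # [4.0, 6.0]
print(g.nth(-1, dropna='any').tolist())  # [7.0, 6.0]
```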

@jorisvandenbossche (Member)

@jreback and that is true for a DataFrame, but not Series (so this PR makes that analogy more consistent)

@jorisvandenbossche (Member)

Now, for a solution. I think we just have to make a choice ..

  1. Leave it as is (DataFrame as reducing, Series as filtering, it has been like this since 0.14)
  2. Have all as reducing (so change Series)
  3. Have all as filtering (and change DataFrame behaviour)

If we change something, I would go for 2), as the filtering behaviour is easy to get by using as_index=False (while for the other way around you have to do set_index()), and then it is more consistent with first/last.
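The shape conversion this refers to can be sketched with first(), whose behaviour has been stable (unlike nth itself, which is the subject of this PR). The frame below is made up, with a non-default index so the two shapes differ visibly:

```python
import pandas as pd

# Made-up frame with a non-default index.
df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': [0, 1, 2]}, index=[10, 11, 12])

reduced = df.groupby('A').first()               # group keys become the index
flat = df.groupby('A', as_index=False).first()  # keys stay as a regular column

# set_index() turns the flat shape back into the reduced shape,
# which is the "other way around" conversion mentioned above.
print(flat.set_index('A').equals(reduced))  # True
```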

@jreback (Contributor) commented Sep 13, 2015

ahh, ok, this was partially a consistency-fix then. I would agree with 2).

@jorisvandenbossche (Member)

Some other inconsistencies:

  • as_index=False is not working when using dropna:

    In [38]: df = pd.DataFrame({'a':list('ABABAB'), 'b':[1,2,3,4,5,6]})
    
    In [39]: df2 = df.copy()
    
    In [40]: df2.iloc[1,1] = np.nan
    
    In [41]: df2.groupby('a').nth(0, dropna='any')
    Out[41]:
    b
    a
    A  1
    B  4
    
    In [42]: df2.groupby('a', as_index=False).nth(0, dropna='any')
    Out[42]:
    b
    a
    A  1
    B  4
    
    In [43]: df2.groupby('a', as_index=False).nth(0)
    Out[43]:
    a   b
    0  A   1
    1  B NaN
    
  • the resulting index when using as_index=False is different between nth and first:

    In [45]: df.index = df.index + 10
    
    In [46]: df.groupby('a').nth(0)
    Out[46]:
    b
    a
    A  1
    B  2
    
    In [47]: df.groupby('a').first()
    Out[47]:
    b
    a
    A  1
    B  2
    
    In [48]: df.groupby('a', as_index=False).first()
    Out[48]:
    a  b
    0  A  1
    1  B  2
    
    In [50]: df.groupby('a', as_index=False).nth(0)
    Out[50]:
    a  b
    10  A  1
    11  B  2
    

@jreback (Contributor) commented Sep 13, 2015

these are already noted in #5755 (though not so explicitly)

@jorisvandenbossche (Member)

But here it is possibly on purpose, as for nth this is a way to get filtering behaviour (keep the original index), while reducing methods give you just a default 0, 1, .., n index when using as_index=False, as you can never have the original index there. That is the problem with nth being some mixture of a reducing and a filtering method.

@jreback jreback modified the milestones: Next Major Release, 0.17.1 Nov 13, 2015
@jreback (Contributor) commented Mar 12, 2016

can you rebase/update

@behzadnouri behzadnouri force-pushed the grby-cumcount branch 4 times, most recently from e69ded7 to b0c973c Compare March 13, 2016 14:16
@behzadnouri (Contributor, Author)

@jreback rebased

@jreback jreback modified the milestones: 0.18.1, Next Major Release Mar 13, 2016
@jreback (Contributor) commented Mar 13, 2016

@behzadnouri ok, thanks. Can you put in a whatsnew sub-section that shows the changes as a user would care about them, IOW an example (you can take from tests) showing what used to happen and the (correct) new way.

@behzadnouri (Contributor, Author)

@jreback added an example to whatsnew showing the changes


    New Behavior:

    .. code-block:: ipython
Contributor:

use an ipython block

Contributor:

use ipython blocks in the new

Contributor (Author):

@jreback from what I see from similar cases, this is already in the correct form; v0.18.1.txt#L159 as an example

Contributor:

yes for the Previous Behavior, not the New. Please make the change.

@behzadnouri (Contributor, Author)

@jreback moved to api changes section

"""
arr is where cumcount gets its values from
Contributor:

can you add a Parameters section

@behzadnouri (Contributor, Author)

@jreback made the changes

    Out[5]:
    0 1
    1 2
    Name: B, dtype: int64
Contributor:

show the same as_index=False here as well (as you are showing below)

Contributor (Author):

the point here is that as_index=True is ignored in the old behaviour; as_index=False is not relevant or informative (and has not changed).

this has been going back and forth for too many times. if you like to add/modify anything please do so on your end.

Contributor:

If you are showing it in the new, then please show it in the original.

> this has been going back and forth for too many times.

Well, that's just how it is. The docs have to be in the proper format and be consistent.

Contributor (Author):

@jreback please go ahead and make any changes you find necessary

Contributor:

@behzadnouri of course, but that's not the point, is it. OK, thank you for the PR.

@jreback (Contributor) commented Apr 18, 2016

@jorisvandenbossche @TomAugspurger comments?

@TomAugspurger (Contributor)

Looks good on a quick skim.

@behzadnouri you seem to have also fixed a bug where groupby.nth was ignoring the sort keyword

# master
In [5]: df.groupby('c', sort=True).nth(1)
Out[5]:
          a         b
c
0 -0.029029  0.565333
1  0.186213  1.110464
2  0.982333 -0.544459
3 -0.626740 -0.541241

In [6]: df.groupby('c', sort=False).nth(1)
Out[6]:
          a         b
c
0 -0.029029  0.565333
1  0.186213  1.110464
2  0.982333 -0.544459
3 -0.626740 -0.541241

vs.

# your branch
In [1]: df = pd.DataFrame(np.random.randn(100, 2), columns=['a', 'b'])

In [2]: df['c'] = np.random.randint(0, 4, 100)

In [3]: df.groupby('c', sort=True).nth(1)
Out[3]:
          a         b
c
0 -1.168300 -2.224763
1 -0.562298 -1.262734
2 -0.439613  0.236592
3 -0.499235 -0.808404

In [4]: df.groupby('c', sort=False).nth(1)
Out[4]:
          a         b
c
1 -0.562298 -1.262734
2 -0.439613  0.236592
3 -0.499235 -0.808404
0 -1.168300 -2.224763

Would be nice to have a release note for that too (could maybe do on merge?).

@jreback jreback closed this in 445d1c6 Apr 25, 2016
@jreback (Contributor) commented Apr 25, 2016

thanks @behzadnouri

@TomAugspurger your suggestions incorporated as well!

@behzadnouri behzadnouri deleted the grby-cumcount branch April 25, 2016 14:53
@sinhrks mentioned this pull request Jul 27, 2016
Labels: Bug, Groupby, Performance (Memory or execution speed)
Successfully merging this pull request may close these issues.

GroupBy.nth includes group key inconsistently
5 participants