New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: in _nsorted for frame with duplicated values index #13412

Closed
Tux1 opened this Issue Jun 9, 2016 · 5 comments

Comments

Projects
None yet
4 participants
@Tux1
Contributor

Tux1 commented Jun 9, 2016

The function below has been incorrectly implemented. If the frame has an index with duplicated values, you will get a result with more than n rows and not properly sorted. So nsmallest and nlargest for DataFrame doesn't return a correct frame in this particular case.

def _nsorted(self, columns, n, method, keep):
    if not com.is_list_like(columns):
        columns = [columns]
    columns = list(columns)
    ser = getattr(self[columns[0]], method)(n, keep=keep)
    ascending = dict(nlargest=False, nsmallest=True)[method]
    return self.loc[ser.index].sort_values(columns, ascending=ascending,
                                           kind='mergesort')

@Tux1 Tux1 changed the title from BUG: in _nsorted for frame to BUG: in _nsorted for frame with duplicated values index Jun 9, 2016

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Jun 9, 2016

Member

Indeed:

In [71]: df = pd.DataFrame({'a':[1,2,3,4], 'b':[4,3,2,1]}, index=[0,0,1,1])

In [72]: df.nlargest(1, 'a')
Out[72]:
   a  b
1  4  1
1  3  2

In [73]: df.nlargest(2, 'a')
Out[73]:
   a  b
1  4  1
1  4  1
1  3  2
1  3  2

(@Tux1 side note for future reference, it is always nice to provide a small reproducible example when opening an issue)
Interested in doing a PR to fix this?

Member

jorisvandenbossche commented Jun 9, 2016

Indeed:

In [71]: df = pd.DataFrame({'a':[1,2,3,4], 'b':[4,3,2,1]}, index=[0,0,1,1])

In [72]: df.nlargest(1, 'a')
Out[72]:
   a  b
1  4  1
1  3  2

In [73]: df.nlargest(2, 'a')
Out[73]:
   a  b
1  4  1
1  4  1
1  3  2
1  3  2

(@Tux1 side note for future reference, it is always nice to provide a small reproducible example when opening an issue)
Interested in doing a PR to fix this?

@jreback jreback added this to the Next Major Release milestone Jun 9, 2016

@Tux1

This comment has been minimized.

Show comment
Hide comment
@Tux1

Tux1 Jun 10, 2016

Contributor

Yes I will fix that soon
Sorry about example

Le 9 juin 2016 à 23:30, Joris Van den Bossche notifications@github.com a écrit :

Indeed:

In [71]: df = pd.DataFrame({'a':[1,2,3,4], 'b':[4,3,2,1]}, index=[0,0,1,1])

In [72]: df.nlargest(1, 'a')
Out[72]:
a b
1 4 1
1 3 2

In [73]: df.nlargest(2, 'a')
Out[73]:
a b
1 4 1
1 4 1
1 3 2
1 3 2
(@Tux1 side note for future reference, it is always nice to provide a small reproducible example when opening an issue)
Interested in doing a PR to fix this?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

Contributor

Tux1 commented Jun 10, 2016

Yes I will fix that soon
Sorry about example

Le 9 juin 2016 à 23:30, Joris Van den Bossche notifications@github.com a écrit :

Indeed:

In [71]: df = pd.DataFrame({'a':[1,2,3,4], 'b':[4,3,2,1]}, index=[0,0,1,1])

In [72]: df.nlargest(1, 'a')
Out[72]:
a b
1 4 1
1 3 2

In [73]: df.nlargest(2, 'a')
Out[73]:
a b
1 4 1
1 4 1
1 3 2
1 3 2
(@Tux1 side note for future reference, it is always nice to provide a small reproducible example when opening an issue)
Interested in doing a PR to fix this?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

@Tux1

This comment has been minimized.

Show comment
Hide comment
@Tux1

Tux1 Jun 12, 2016

Contributor

my fix is not very elegant but I don't see any other solution to deal with MultiIndex and duplicated value index

Contributor

Tux1 commented Jun 12, 2016

my fix is not very elegant but I don't see any other solution to deal with MultiIndex and duplicated value index

mroeschke added a commit to mroeschke/pandas that referenced this issue Nov 21, 2016

mroeschke added a commit to mroeschke/pandas that referenced this issue Nov 22, 2016

mroeschke added a commit to mroeschke/pandas that referenced this issue Nov 23, 2016

BUG: _nsorted incorrect with duplicated values in index (#13412)
Add note to whatsnew

Add nlargest benchmark

mroeschke added a commit to mroeschke/pandas that referenced this issue Nov 24, 2016

BUG: _nsorted incorrect with duplicated values in index (#13412)
Add note to whatsnew

Add nlargest benchmark

Add tests for Series

organize nsorted methods

pep 8 fixes

passed test and pep8

@jreback jreback modified the milestones: 0.19.2, Next Major Release Nov 25, 2016

mroeschke added a commit to mroeschke/pandas that referenced this issue Nov 25, 2016

BUG: _nsorted incorrect with duplicated values in index (#13412)
Add note to whatsnew

Add nlargest benchmark

Add tests for Series

organize nsorted methods

pep 8 fixes

passed test and pep8

add docstrings

mroeschke added a commit to mroeschke/pandas that referenced this issue Dec 4, 2016

BUG: _nsorted incorrect with duplicated values in index (#13412)
Add note to whatsnew

Add nlargest benchmark

Add tests for Series

organize nsorted methods

pep 8 fixes

passed test and pep8

add docstrings

add github issue

@jreback jreback closed this in 6e514da Dec 6, 2016

jorisvandenbossche added a commit that referenced this issue Dec 15, 2016

@ghost

This comment has been minimized.

Show comment
Hide comment
@ghost

ghost Mar 13, 2017

Sum seems to work fine in .19.2 But with count, it doesn't seem to make sense. The df gets repeated as many times as the "n". Is that a bug or am i doing something wrong ?

df.groupby(['a']).agg({'b':'count'}).nlargest(2, 'b')

	b
a	
1	1
2	1
3	1
4	1
1	1
2	1
3	1
4	1

ghost commented Mar 13, 2017

Sum seems to work fine in .19.2 But with count, it doesn't seem to make sense. The df gets repeated as many times as the "n". Is that a bug or am i doing something wrong ?

df.groupby(['a']).agg({'b':'count'}).nlargest(2, 'b')

	b
a	
1	1
2	1
3	1
4	1
1	1
2	1
3	1
4	1
@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback
Contributor

jreback commented Mar 13, 2017

@shankararul see: #15297

jerome-white added a commit to jerome-white/pyzrt that referenced this issue May 22, 2017

Pandas nlargest has a bug in older versions
(pandas-dev/pandas#13412) using sort_values
instead.

As a consequence, the normalization hack is no longer required: use
raw float values and change the precision when combine'ing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment