BUG: in _nsorted for frame with duplicated values index #13412

Closed
Tux1 opened this Issue Jun 9, 2016 · 5 comments

Comments

Projects
None yet
4 participants
Contributor

Tux1 commented Jun 9, 2016 edited

The function below has been incorrectly implemented. If the frame has an index with duplicated values, you will get a result with more than n rows and not properly sorted. So nsmallest and nlargest for DataFrame doesn't return a correct frame in this particular case.

def _nsorted(self, columns, n, method, keep):
    if not com.is_list_like(columns):
        columns = [columns]
    columns = list(columns)
    ser = getattr(self[columns[0]], method)(n, keep=keep)
    ascending = dict(nlargest=False, nsmallest=True)[method]
    return self.loc[ser.index].sort_values(columns, ascending=ascending,
                                           kind='mergesort')

Tux1 changed the title from BUG: in _nsorted for frame to BUG: in _nsorted for frame with duplicated values index Jun 9, 2016

Indeed:

In [71]: df = pd.DataFrame({'a':[1,2,3,4], 'b':[4,3,2,1]}, index=[0,0,1,1])

In [72]: df.nlargest(1, 'a')
Out[72]:
   a  b
1  4  1
1  3  2

In [73]: df.nlargest(2, 'a')
Out[73]:
   a  b
1  4  1
1  4  1
1  3  2
1  3  2

(@Tux1 side note for future reference, it is always nice to provide a small reproducible example when opening an issue)
Interested in doing a PR to fix this?

jreback added this to the Next Major Release milestone Jun 9, 2016

Contributor

Tux1 commented Jun 10, 2016

Yes I will fix that soon
Sorry about example

Le 9 juin 2016 à 23:30, Joris Van den Bossche notifications@github.com a écrit :

Indeed:

In [71]: df = pd.DataFrame({'a':[1,2,3,4], 'b':[4,3,2,1]}, index=[0,0,1,1])

In [72]: df.nlargest(1, 'a')
Out[72]:
a b
1 4 1
1 3 2

In [73]: df.nlargest(2, 'a')
Out[73]:
a b
1 4 1
1 4 1
1 3 2
1 3 2
(@Tux1 side note for future reference, it is always nice to provide a small reproducible example when opening an issue)
Interested in doing a PR to fix this?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

Contributor

Tux1 commented Jun 12, 2016

my fix is not very elegant but I don't see any other solution to deal with MultiIndex and duplicated value index

@mroeschke mroeschke added a commit to mroeschke/pandas that referenced this issue Nov 22, 2016

@mroeschke mroeschke BUG: _nsorted incorrect with duplicated values in index (#13412)
Add note to whatsnew
9b1ca18

@mroeschke mroeschke added a commit to mroeschke/pandas that referenced this issue Nov 23, 2016

@mroeschke mroeschke BUG: _nsorted incorrect with duplicated values in index (#13412)
Add note to whatsnew

Add nlargest benchmark
fdfaa97

@mroeschke mroeschke added a commit to mroeschke/pandas that referenced this issue Nov 24, 2016

@mroeschke mroeschke BUG: _nsorted incorrect with duplicated values in index (#13412)
Add note to whatsnew

Add nlargest benchmark

Add tests for Series

organize nsorted methods

pep 8 fixes

passed test and pep8
b414a3e

@jreback jreback modified the milestone: 0.19.2, Next Major Release Nov 25, 2016

@mroeschke mroeschke added a commit to mroeschke/pandas that referenced this issue Nov 25, 2016

@mroeschke mroeschke BUG: _nsorted incorrect with duplicated values in index (#13412)
Add note to whatsnew

Add nlargest benchmark

Add tests for Series

organize nsorted methods

pep 8 fixes

passed test and pep8

add docstrings
0e6ea02

@mroeschke mroeschke added a commit to mroeschke/pandas that referenced this issue Dec 4, 2016

@mroeschke mroeschke BUG: _nsorted incorrect with duplicated values in index (#13412)
Add note to whatsnew

Add nlargest benchmark

Add tests for Series

organize nsorted methods

pep 8 fixes

passed test and pep8

add docstrings

add github issue
d03be72

jreback closed this in 6e514da Dec 6, 2016

@jorisvandenbossche jorisvandenbossche added a commit that referenced this issue Dec 15, 2016

@mroeschke @jorisvandenbossche mroeschke + jorisvandenbossche BUG: _nsorted incorrect with duplicated values in index
closes #13412
closes #14707

(cherry picked from commit 6e514da)
11eb8ab

Sum seems to work fine in .19.2 But with count, it doesn't seem to make sense. The df gets repeated as many times as the "n". Is that a bug or am i doing something wrong ?

df.groupby(['a']).agg({'b':'count'}).nlargest(2, 'b')

	b
a	
1	1
2	1
3	1
4	1
1	1
2	1
3	1
4	1
Contributor

jreback commented Mar 13, 2017

@shankararul see: #15297

@jerome-white jerome-white added a commit to jerome-white/pyzrt that referenced this issue May 22, 2017

@jerome-white jerome-white Pandas nlargest has a bug in older versions
(pandas-dev/pandas#13412) using sort_values
instead.

As a consequence, the normalization hack is no longer required: use
raw float values and change the precision when combine'ing.
5179363
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment