Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix nsmallest/nlargest With Identical Values #15299

Conversation

RogerThomas
Copy link
Contributor

@RogerThomas RogerThomas commented Feb 3, 2017

@jreback jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Feb 3, 2017
@RogerThomas
Copy link
Contributor Author

RogerThomas commented Feb 3, 2017

@jreback What do you think of this approach? The issue before was on this line
https://github.com/pandas-dev/pandas/blob/master/pandas/core/algorithms.py#L917
when merging series with the frame when values are identical you end up duplicating the df potentially n times

(obviously PR isn't ready to go, just wanted your opinion on the approach first)

@@ -1323,6 +1323,12 @@ def test_nlargest_multiple_columns(self):
expected = df.sort_values(['a', 'b'], ascending=False).head(5)
tm.assert_frame_equal(result, expected)

def test_nlargest_identical_values(self):
df = pd.DataFrame({'a': [1] * 5, 'b': [1, 2, 3, 4, 5]})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue number here as a comment

test on nsmallest as well.

result = df.nlargest(3, 'a')
expected = pd.DataFrame({'a': [1] * 3, 'b': [1, 2, 3]})
tm.assert_frame_equal(result, expected)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add similar tests for series

@@ -911,10 +911,15 @@ def select_n_frame(frame, columns, n, method, keep):
if not is_list_like(columns):
columns = [columns]
columns = list(columns)
tmp = Series(frame.index)
frame.reset_index(inplace=True, drop=True)
ser = getattr(frame[columns[0]], method)(n, keep=keep)
if isinstance(ser, Series):
ser = ser.to_frame()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the right way to fix this is to make the internal routines return an indexer instead (e.g. integers), select_n_series already has this (just needs to return it).

then this is a simple .take()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback, i'm a little confused here. I don't see how the current implementation produces the desired result when using multiple columns, for example if we have:

df = pd.DataFrame(dict(
    a=[1, 1, 1, 2, 2],
    b=[4, 3, 8, 9, 5]
))

and we call df.nsmallest(2, ['a', 'b'])

There's no code in the current implementation that would try to order by both columns a and b in that order to give us a resulting frame of

pd.DataFrame(dict(
    a=[1, 1],
    b=[3, 4]
))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right I don't think it IS there. what you can easily do it simply use the routine to select the nsmallest/largest (its according to a single column), then sort accordting to all of the columns provided (this is still quite efficient as the selection uses partition sort which).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, not sure we can do that @jreback, take this for example

df = pd.DataFrame(dict(
    a=[2, 2, 1, 1, 1],
    b=[4, 3, 8, 9, 5],
))
result = df.nsmallest(2, ['a', 'b'])

If we do nsmallest on col a and then sort on cols a and b
we'll get

   a  c
2  1  8
3  1  9

when the desired result is

   a  c
4  1  5
2  1  8

Copy link
Contributor Author

@RogerThomas RogerThomas Feb 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure there's much more we can do than sort_values(columns).head/tail(n)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RogerThomas that defeats the entire purpose of partition sort.

Copy link
Contributor

@jreback jreback Feb 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a way to do this might be to (for say nsmallest when multiple columns)
run nsmallest with the first column, take the last value (or values if they are duplicates), select from the frame and argsort them. Then your result are the top values from the first run + the (totalrequested - top values of the 2nd run).

e.g in your example above since nsmallest(2, 'a') yields 1 (two rows of them), you could then select the 1 from the frame and just argsort at that point.

This would allow the single column to be quite efficient and avoid the next-n problem.

You would only need to do this if you have multiple columns.

I would look around and see if there is a better solution out there (IOW google and/or look at how some other stats packages do this).

@codecov-io
Copy link

codecov-io commented Feb 3, 2017

Codecov Report

Merging #15299 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #15299      +/-   ##
==========================================
+ Coverage   90.96%   90.96%   +<.01%     
==========================================
  Files         145      145              
  Lines       49534    49560      +26     
==========================================
+ Hits        45057    45082      +25     
- Misses       4477     4478       +1
Flag Coverage Δ
#multiple 88.72% <100%> (ø) ⬆️
#single 40.61% <3.12%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/algorithms.py 93.92% <100%> (+0.23%) ⬆️
pandas/core/common.py 90.68% <0%> (-0.35%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1fbdc23...d3964f8. Read the comment docs.

@jreback
Copy link
Contributor

jreback commented Feb 16, 2017

any progress?

@RogerThomas
Copy link
Contributor Author

Ah, sorry @jreback been super busy, haven't gotten around to this. Will try and get to it over the weekend, is that ok?

@jreback
Copy link
Contributor

jreback commented Feb 16, 2017

@RogerThomas np. just pushing things along. you can ping when you are ready for a review.

@jreback
Copy link
Contributor

jreback commented Feb 21, 2017

here is another example: #14846 (comment)

@jreback
Copy link
Contributor

jreback commented Mar 28, 2017

@RogerThomas can you rebase / update.

@RogerThomas RogerThomas force-pushed the fix_nsmallest_nlargest_with_n_identical_values branch 5 times, most recently from d9db74c to 739704e Compare March 30, 2017 10:24
@RogerThomas
Copy link
Contributor Author

@jreback sorry took so long on this, been super busy.
This should be good to go now

@RogerThomas RogerThomas force-pushed the fix_nsmallest_nlargest_with_n_identical_values branch from f070082 to 4955d67 Compare March 30, 2017 10:48
reverse = method == 'nlargest'
for i, column in enumerate(columns):
series = frame[column]
if reverse:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't this defeat the purpose of .nlargest()? e.g. you are sorting the entire series.

why not just use series.nlargest(n) here? (or .nsmallest). And defer to the duplicate handling there (I think that works).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if there are duplicates I need to do another filter using isin this returns an unsorted view of the series, I then have no way of aligning the series according to the sorting requirement, i'd still need to do an argsort after the nsmallest/nlargest call

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this sorta what you mean @jreback ?

RogerThomas#4

Note I had to copy the index incase there is duplicate index in incoming frame

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

roughly.

else:
inds = series.argsort()[:n]
values = series.iloc[inds]
if i != len(columns) - 1 and values.duplicated().any():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

values.is_unique is more idiomatic

@@ -1455,6 +1455,16 @@ def test_nsmallest_nlargest(self):
expected = s.sort_values().head(3)
assert_series_equal(result, expected)

# GH 15297
s = Series([1] * 5, index=[1, 2, 3, 4, 5])
expected = Series([1] * 3, index=[1, 2, 3])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add to the doc-string for both Series/DataFrame what the policy for dealing with duplicates is. IOW, we are going to always return n (or less values), so the duplicated portion would be in order of the index (see if you can reword!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

Copy link
Contributor Author

@RogerThomas RogerThomas Mar 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback are you sure we need to add to the docstring, I think it's obvious enough here. keep=first in this example, and the series values are all duplicated, so you would expect it to return [1, 2, 3] as the index regardless of using nsmallest or nlargest

cur_frame = cur_frame[series.isin(values)].sort_values(
column, ascending=ascending
)
frame.index = tmp.ix[frame.index]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use .ix


ascending = method == 'nsmallest'
tmp = Series(frame.index)
frame.reset_index(inplace=True, drop=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are you trying to do here?

inplace is not idiomatic at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to reorder the incoming frame based on the specified sorting method. I can't do that using the original index as it might contain duplicates which gives rise to the original issue. Therefore I save the original index to temp and reset the index of frame inplace so we're not needlessly copying the entire frame.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, try to make as simple as possible then. Generally you want to use indexers (meaning locations/positions) for anything that could possibly be duplicated, then do a take at the end.

also you don't want to modify the input AT ALL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @jreback, i understand the need to not change the input, but as you see on L968 I change the index back.
I don't think this solution is ideal, but I don't see how we can support duplicate indexes and not temporarily change the index on frame.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can copy it if you need to change it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy is quite cheap (compared to everything else). we cannot modify the input at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK fair enough, just in my personal experience I deal with large dataframes (frames that just about fit in memory) and I have to be very careful about copying.

I have to sign off for the day, buy I'll clean this up tomorrow with a copy (also a positive is we only have to copy of the index is duplicated)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*if the index is duplicated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RogerThomas yes, this procedure only needs to be done on duplicated anyhow (which to be honest is not very common :)

thanks!

ping when you need a look

@RogerThomas RogerThomas force-pushed the fix_nsmallest_nlargest_with_n_identical_values branch 6 times, most recently from 1664eb5 to 1a8043f Compare March 31, 2017 10:42
@RogerThomas
Copy link
Contributor Author

OK @jreback pretty sure this is ready to go now.

Saving and restoring duplicated index turned out to be challenging when considering multi-indexes, but managed to get it.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

acutally I don't see any reason you need to actually copy the original frame, you simply need to keep a reference to it original_index. Then you drop the index. I will reverse what I said earlier, it seems you can just do this for both duplicated and unique indexes.

@@ -1057,3 +1057,4 @@ Bug Fixes
- Bug in ``pd.melt()`` where passing a tuple value for ``value_vars`` caused a ``TypeError`` (:issue:`15348`)
- Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`)
- Bug in ``pd.read_msgpack()`` which did not allow to load dataframe with an index of type ``CategoricalIndex`` (:issue:`15487`)
- Bug in ``DataFrame.nsmallest`` and ``DataFrame.nlargest`` where identical values resulted in duplicated rows (:issue:`15297`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls move this to the correct section (reshaping) in Bug FIxes. be sure to rebase on master.

# We must save frame's index to tmp
tmp = Series(np.arange(len(frame)), index=frame.index)
frame = frame.reset_index(drop=True)
for i, column in enumerate(columns):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment on what this algo is doing.

if not index_is_unique:
# If index not unique we must reset index to allow re-indexing below
# We must save frame's index to tmp
tmp = Series(np.arange(len(frame)), index=frame.index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just do

original_index = frame.index

@RogerThomas RogerThomas force-pushed the fix_nsmallest_nlargest_with_n_identical_values branch from 2b47439 to 5f772db Compare March 31, 2017 13:37
['c'],
['a', 'b'],
['a', 'c'],
['b', 'a'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RogerThomas you will see this test failing for the below. The issues is that if one of the columns you are n* on is object this will fail. So I think that should raise a nice error message (you can check if any are object upfront).

___________________________________________________________________________________________ TestNLargestNSmallest.test_n[10-order119] ___________________________________________________________________________________________
pandas/tests/frame/test_analytics.py:1945: in test_n
    result = df.nsmallest(n, order)
pandas/core/frame.py:3478: in nsmallest
    return algorithms.select_n_frame(self, columns, n, 'nsmallest', keep)
pandas/core/algorithms.py:966: in select_n_frame
    values = getattr(series, method)(n, keep=keep)
pandas/core/series.py:1907: in nsmallest
    method='nsmallest')
pandas/core/algorithms.py:915: in select_n_series
    raise TypeError("Cannot use method %r with dtype %s" % (method, dtype))
E   TypeError: Cannot use method 'nsmallest' with dtype object

@RogerThomas RogerThomas force-pushed the fix_nsmallest_nlargest_with_n_identical_values branch from d08a60e to 7f8cd04 Compare April 3, 2017 13:54
@RogerThomas
Copy link
Contributor Author

@jreback updated tests for object dtypes

Also had to update algorithm as your tests exposed a flaw in the previous algorithm

return ser.merge(frame, on=columns[0], left_index=True)[frame.columns]
for column in columns:
dtype = frame[column].dtype
if not issubclass(dtype.type, (np.integer, np.floating, np.datetime64,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use is_integer_dtype, is_float_dtype, is_datetime64_any_dtype, is_timedelta64_dtype; this would break on others (e.g. categorical).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its possible we need additional tests for this (just checking)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

though, I think you call EACH column (as a Series) right? so this check is already there (well not 100% that is actually checks things).

Or you may skip a column if no dupes?

What I am trying to say is that I want to defer the dtype checks to the other algos if possible (and not do it here).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, ok, sorry I thought by your "check upfront" comment you wanted to do it for each column in this method.
But you're right the line getattr(series, method)(cur_n, keep=keep) should defer the check to n* on the series

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, if we ARE checking all columns with this method (whether they are duplicated or not). Then this is fine. Is that always true?

series = cur_frame[column]
values = getattr(series, method)(cur_n, keep=keep)
is_last_column = len(columns) - 1 == i
if is_last_column or len(values.unique()) == sum(series.isin(values)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use values.nunique() == series.isin(values).sum() (you don't want to use python operators)

duplicated_filter = series.duplicated(keep=False)
non_duplicated = values[~duplicated_filter]
duplicated = values[duplicated_filter]
if method == 'nsmallest':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be a tad more clear to do

def get_indexer(x, y):
     if method == ....

then (eg.)

indexer = get_indexer(indexer, values.index)

Copy link
Contributor Author

@RogerThomas RogerThomas Apr 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You know what, I was SO close to doing that, but wasn't sure how you guys felt about inline functions. But, I agree, i'll do it now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I am big fan of DRY, even at the expense of some readability.

expected = df.sort_values(order, ascending=False).head(n)
tm.assert_frame_equal(result, expected)

def test_n_error(self, df_strings):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prob should expand this with a different df that has pretty much all dtypes, then can test ones which have an error

heres a sample

        df = pd.DataFrame(
            {'group': [1, 1, 2],
             'int': [1, 2, 3],
             'float': [4., 5., 6.],
             'string': list('abc'),
             'category_string': pd.Series(list('abc')).astype('category'),
             'category_int': [7, 8, 9],
             'datetime': pd.date_range('20130101', periods=3),
             'datetimetz': pd.date_range('20130101',
                                         periods=3,
                                         tz='US/Eastern'),
             'timedelta': pd.timedelta_range('1 s', periods=3, freq='s')},
            columns=['group', 'int', 'float', 'string',
                     'category_string', 'category_int',
                     'datetime', 'datetimetz',
                     'timedelta'])

@pytest.mark.parametrize(
"r", [Series([3., 2, 1, 2, '5'], dtype='object'),
Series([3., 2, 1, 2, 5], dtype='object'),
# not supported on some archs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a mini-example from test_error in frame

@RogerThomas
Copy link
Contributor Author

Hey @jreback
How about this?

RogerThomas#5

def test_n(self, df_strings, method, n, order):
# GH10393
df = df_strings
if order[0] == 'b':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this error if 'b' is in order (IOW, it is by-definition an unorderable column)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, no because the only time the algorithm considers 'b' is if it's the first field it encounters, otherwise, because 'a' and 'c' are unique, 'b' will not be considered. See comment on line 1967

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I think that is a bit odd then. Why shouldn't it raise if a non-valid column is in here?

# Top / bottom
@pytest.mark.parametrize(
'method, n, order',
product(['nsmallest', 'nlargest'], range(1, 11),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding method doesn't add much here (and instead add both as comparision tests below)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but after seeing your other methods, maybe ok. if its easier to debug then ok.

# Only expect error when 'b' is first in order, as 'a' and 'c' are
# unique
error_msg = (
"'b' has dtype: object, cannot use method 'nsmallest' "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use tm.assertRaisesRegexp when you want to compare the message as well (also a context manager)

@jreback
Copy link
Contributor

jreback commented Apr 5, 2017

this is prob because Series is not tested as well. Let me see if I can fix this (and you can rebase on top).

=================================== FAILURES ===================================
___________________ TestNLargestNSmallest.test_n_all_dtypes ____________________
[gw0] linux2 -- Python 2.7.13 /home/travis/miniconda3/envs/pandas/bin/python
self = <pandas.tests.frame.test_analytics.TestNLargestNSmallest object at 0x7ff6da2795d0>
df_main_dtypes =    group  int  float string category_string  category_int   datetime  \
0     ...1 2013-01-02 00:00:00-05:00  00:00:02  
2 2013-01-03 00:00:00-05:00  00:00:03  
    def test_n_all_dtypes(self, df_main_dtypes):
        df = df_main_dtypes
>       df.nsmallest(2, list(set(df) - {'category_string', 'string'}))
/home/travis/miniconda3/envs/pandas/lib/python2.7/site-packages/pandas/tests/frame/test_analytics.py:1998: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/travis/miniconda3/envs/pandas/lib/python2.7/site-packages/pandas/core/frame.py:3478: in nsmallest
    return algorithms.select_n_frame(self, columns, n, 'nsmallest', keep)
/home/travis/miniconda3/envs/pandas/lib/python2.7/site-packages/pandas/core/algorithms.py:976: in select_n_frame
    values = getattr(series, method)(cur_n, keep=keep)
/home/travis/miniconda3/envs/pandas/lib/python2.7/site-packages/pandas/core/series.py:1907: in nsmallest
    method='nsmallest')
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
series = 0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
Name: datetimetz, dtype: datetime64[ns, US/Eastern]
n = 2, keep = 'first', method = 'nsmallest'
    def select_n_series(series, n, keep, method):
        """Implement n largest/smallest for pandas Series
    
        Parameters
        ----------
        series : pandas.Series object
        n : int
        keep : {'first', 'last'}, default 'first'
        method : str, {'nlargest', 'nsmallest'}
    
        Returns
        -------
        nordered : Series
        """
        dtype = series.dtype
        if not issubclass(dtype.type, (np.integer, np.floating, np.datetime64,
                                       np.timedelta64)):
>           raise TypeError("Cannot use method %r with dtype %s" % (method, dtype))
E           TypeError: Cannot use method 'nsmallest' with dtype datetime64[ns, US/Eastern]
/home/travis/miniconda3/envs/pandas/lib/python2.7/site-packages/pandas/core/algorithms.py:915: TypeError
============ 1 failed, 8537 passed, 1991 skipped in 546.94 seconds =============

@RogerThomas RogerThomas force-pushed the fix_nsmallest_nlargest_with_n_identical_values branch from 75ac0b7 to 901ab4f Compare April 5, 2017 13:35
@@ -944,14 +944,60 @@ def select_n_frame(frame, columns, n, method, keep):
-------
nordered : DataFrame
"""
from pandas.core.series import Series
from pandas import Int64Index
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I only put this import here as the previous algorithm imported Series here, why do you do inline imports?
I would always have all my imports at the top

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because some of the routines from algorithms are imported at the very top level (e.g. pd.factorize and you would get circular imports otherwise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AH ok, thanks!

@jreback
Copy link
Contributor

jreback commented Apr 5, 2017

@RogerThomas ok #15902 fixes the tz problem in series nsmallest. after I merge you should be able to easilly rebase on top, just takepandas/tests/series/test_analytics.py from master and you should be good to go.

jreback added a commit that referenced this pull request Apr 5, 2017
xref #15299

Author: Jeff Reback <jeff@reback.net>

Closes #15902 from jreback/series_n and squashes the following commits:

657eac8 [Jeff Reback] TST: better testing of Series.nlargest/nsmallest
@jreback
Copy link
Contributor

jreback commented Apr 5, 2017

@RogerThomas ok merged that PR in. pls rebase and should work!

@RogerThomas RogerThomas force-pushed the fix_nsmallest_nlargest_with_n_identical_values branch 2 times, most recently from 1d4be5e to 333e126 Compare April 5, 2017 21:08
@RogerThomas
Copy link
Contributor Author

OK @jreback tests passing for me

@@ -981,14 +981,60 @@ def select_n_frame(frame, columns, n, method, keep):
-------
nordered : DataFrame
"""
from pandas.core.series import Series
from pandas import Int64Index
if not is_list_like(columns):
columns = [columns]
columns = list(columns)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to do this check on all the columns upfront (its cheap).

you can put this in a separate function on core/algorithms (called by there and here): https://github.com/pandas-dev/pandas/blob/master/pandas/core/algorithms.py#L949

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do u want to reference the column in the exception? IOW

def _ensure_is_valid_dtype_n_method(method, dtype):
    if not ((is_numeric_dtype(dtype) and not is_complex_dtype(dtype)) or
                needs_i8_conversion(dtype)):
            raise TypeError("Cannot use method '{method}' with "
                            "dtype {dtype}".format(method=method, dtype=dtype))

or something like

def _is_valid_dtype_n_method(dtype):
    return ((is_numeric_dtype(dtype) and not is_complex_dtype(dtype)) or
                needs_i8_conversion(dtype)):

and then in select_n_frame function something like

if not _is_valid_dtype_n_method(dtype):
    raise TypeError("Column `column` has dtype `dtype` can't use n with column")

and then in select_n_series function something like

if not _is_valid_dtype_n_method(dtype):
    raise TypeError("Cannot use  method `method` with dtype")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def _is_valid_dtype_n_method(dtype):
    return ((is_numeric_dtype(dtype) and not is_complex_dtype(dtype)) or
                needs_i8_conversion(dtype))

then you can have separate messages for a series calling this (alone) and a dataframe with many columns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

though keep the message for Series pretty much the same (actually it IS tested).

@RogerThomas RogerThomas force-pushed the fix_nsmallest_nlargest_with_n_identical_values branch from 4c1720d to d3964f8 Compare April 6, 2017 09:18
@jreback jreback closed this in c112252 Apr 6, 2017
@jreback
Copy link
Contributor

jreback commented Apr 6, 2017

thanks @RogerThomas very nice PR! and excellent responsiveness on changes.

keem em coming!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Weird behavior using nlargest/nsmallest when there are the n smallest/largest values are identical
3 participants