TypeError: rank() got an unexpected keyword argument 'numeric_only' #11759

Open
nbonnotte opened this Issue Dec 4, 2015 · 9 comments

Contributor

nbonnotte commented Dec 4, 2015

In [19]: df = DataFrame({'a':['A1', 'A1', 'A1'], 'b':['B1','B1','B2'], 'c':1})

In [20]: df.set_index('a').groupby('b').rank(method='first')
Out[20]: 
    c
a    
A1  1
A1  2
A1  1

In [21]: df.set_index('a').groupby('c').rank(method='first')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-21-6b8d4cae9d91> in <module>()
----> 1 df.set_index('a').groupby('c').rank(method='first')

/home/nicolas/Git/pandas/pandas/core/groupby.pyc in rank(self, axis, numeric_only, method, na_option, ascending, pct)

/home/nicolas/Git/pandas/pandas/core/groupby.pyc in wrapper(*args, **kwargs)
    618                     # mark this column as an error
    619                     try:
--> 620                         return self._aggregate_item_by_item(name, *args, **kwargs)
    621                     except (AttributeError):
    622                         raise ValueError

/home/nicolas/Git/pandas/pandas/core/groupby.pyc in _aggregate_item_by_item(self, func, *args, **kwargs)
   3076             # GH6337
   3077             if not len(result_columns) and errors is not None:
-> 3078                 raise errors
   3079 
   3080         return DataFrame(result, columns=result_columns)

TypeError: rank() got an unexpected keyword argument 'numeric_only'

I'm trying to obtain what I would get with a row_number() in SQL...

Notice that if I replace the value in the 'c' column with the string '1', then even df.set_index('a').groupby('b').rank(method='first') fails.

Am I doing something wrong?
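
For the row_number() use case specifically, one possible workaround is GroupBy.cumcount(), which does not go through rank at all and so does not care about column dtypes. A minimal sketch, assuming the rows are already in the desired order (SQL's row_number() takes an explicit ORDER BY; here the existing row order plays that role):

```python
import pandas as pd

df = pd.DataFrame({'a': ['A1', 'A1', 'A1'],
                   'b': ['B1', 'B1', 'B2'],
                   'c': 1})

# cumcount() numbers the rows of each group from 0, in order of appearance;
# adding 1 gives a 1-based row number within each group.
row_number_by_c = df.groupby('c').cumcount() + 1
row_number_by_b = df.groupby('b').cumcount() + 1
```

If a specific ordering is needed, sorting the frame before grouping (for example with sort_values) would stand in for the ORDER BY clause.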

Contributor

jreback commented Dec 4, 2015

You are trying to rank on a string column, which is not supported.

But it should give a better error message, I would think.

In [20]: df.set_index('a').groupby('c').first()
Out[20]: 
    b
c    
1  B1

jreback added this to the 0.18.0 milestone Dec 4, 2015

Contributor

nbonnotte commented Dec 27, 2015

That's weird, because .rank() works with method='average' (the default value) but not with method='first'.

In [2]: df = DataFrame({'a':['A1', 'A1', 'A1'], 'b':['B1','B1','B2'], 'c':1})

In [3]: df.set_index('a').groupby('c').rank()
Out[3]: 
      b
a      
A1  1.5
A1  1.5
A1  3.0

In [4]: df.set_index('a').groupby('c').rank(method='first')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-6b8d4cae9d91> in <module>()
----> 1 df.set_index('a').groupby('c').rank(method='first')

/home/nicolas/Git/pandas/pandas/core/groupby.pyc in rank(self, axis, numeric_only, method, na_option, ascending, pct)

/home/nicolas/Git/pandas/pandas/core/groupby.pyc in wrapper(*args, **kwargs)
    582                     # mark this column as an error
    583                     try:
--> 584                         return self._aggregate_item_by_item(name, *args, **kwargs)
    585                     except (AttributeError):
    586                         raise ValueError

/home/nicolas/Git/pandas/pandas/core/groupby.pyc in _aggregate_item_by_item(self, func, *args, **kwargs)
   3017             # GH6337
   3018             if not len(result_columns) and errors is not None:
-> 3019                 raise errors
   3020 
   3021         return DataFrame(result, columns=result_columns)

TypeError: rank() got an unexpected keyword argument 'numeric_only'

I'm looking into it.

Contributor

nbonnotte commented Dec 28, 2015

I think I understand what is going on.

DataFrameGroupBy.rank is created as part of a whitelist of operators, and its signature is taken from DataFrame.rank, which uses a wrapper obtained with DataFrame._make_wrapper. There, different things are tried to produce the result.

With method='average', the first try succeeds.

With method='first', the first two tries raise an exception with the message "first not supported for non-numeric data", which is good, but then at the last try the method NDFrameGroupBy._aggregate_item_by_item is called. Things go wrong here, as it uses SeriesGroupBy.rank, the signature of which is taken from Series.rank. The parameter numeric_only does not exist there, hence the error.

There is a design flaw here:

  • either the DataFrame and Series (and Panel, I guess) versions of rank (and the like) should always have the same signature
  • or DataFrameGroupBy.rank should not use SeriesGroupBy.rank

I'll think about a solution that is as minimalist as possible, solves the initial issue, and if possible addresses this flaw. If I can't, I'll just add a hack somewhere to solve the initial issue.
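
The mismatch described above can be reproduced without pandas. The following is only a sketch: frame_rank, series_rank and item_by_item are hypothetical stand-ins, not the actual pandas functions. It shows how keyword arguments collected under one signature fail when forwarded to a function with a narrower one:

```python
def frame_rank(values, method='average', numeric_only=None):
    """Stand-in for a frame-level rank that accepts numeric_only."""
    return ('frame', method, numeric_only)


def series_rank(values, method='average'):
    """Stand-in for a series-level rank that has no numeric_only parameter."""
    return ('series', method)


def item_by_item(func, columns, **kwargs):
    """Forward the same keyword arguments to func, one column at a time."""
    return {name: func(col, **kwargs) for name, col in columns.items()}


columns = {'b': ['B1', 'B1', 'B2']}
# Keyword arguments gathered under the frame-level signature.
kwargs = {'method': 'first', 'numeric_only': None}

frame_result = item_by_item(frame_rank, columns, **kwargs)  # works

try:
    item_by_item(series_rank, columns, **kwargs)
    message = None
except TypeError as exc:
    # The narrower signature rejects the extra keyword.
    message = str(exc)
```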

Contributor

jreback commented Dec 28, 2015

The right way to fix this is to move Series.rank and DataFrame.rank into generic.py and make the signature uniform.

You then accept numeric_only=None in Series.rank (and raise NotImplementedError if it's not None).

You further need to add axis as a parameter (FYI, _get_axis_name handles the case where the axis is greater than ndim).

You can raise if ndim > 2 as well.
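
A rough sketch of what such a uniform signature could look like. All class names here are hypothetical and this is not the pandas implementation; axis validation (which the comment above delegates to _get_axis_name) is left out:

```python
class NDFrameSketch:
    """Hypothetical shared base class holding the single rank signature."""

    ndim = None

    def rank(self, axis=0, method='average', numeric_only=None):
        if self.ndim > 2:
            raise NotImplementedError("rank is not implemented for ndim > 2")
        if self.ndim == 1 and numeric_only is not None:
            raise NotImplementedError(
                "numeric_only is not supported for 1-dimensional objects")
        return self._rank(axis=axis, method=method, numeric_only=numeric_only)

    def _rank(self, axis, method, numeric_only):
        # Placeholder for the real ranking work.
        return (type(self).__name__, axis, method, numeric_only)


class SeriesSketch(NDFrameSketch):
    ndim = 1


class FrameSketch(NDFrameSketch):
    ndim = 2


class PanelSketch(NDFrameSketch):
    ndim = 3
```

With one shared signature, a caller that forwards numeric_only=None to the 1-dimensional version no longer hits a TypeError, while an explicit numeric_only=True fails loudly.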

Contributor

nbonnotte commented Dec 28, 2015

So now, SeriesGroupBy.rank has the right signature, and Series._make_wrapper is used, so again there is a call to ._aggregate_item_by_item()... except that this method comes from NDFrameGroupBy, and SeriesGroupBy does not inherit from NDFrameGroupBy, so now an AttributeError is raised. This is caught and transformed into a simple ValueError, with the following comment:

related to : GH #3688
try item-by-item
this can be called recursively, so need to raise ValueError if
we don't have this method to indicated to aggregate to
mark this column as an error

Indeed, the first call to _aggregate_item_by_item (the one that called SeriesGroupBy.rank... still following?) uses this ValueError to simply discard the column, and we end up with an empty dataframe with the example I gave in the beginning.

I'm going to prevent the call to SeriesGroupBy._aggregate_item_by_item (instead of asking for forgiveness), so that the exceptions can be sorted and a meaningful error message can be given to the user.
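
To illustrate the difference between the two approaches, here is a small sketch with a hypothetical SeriesGroupBySketch class (again, not the pandas code). The first helper asks for forgiveness and loses the original message; the second checks for the fallback method first and lets the informative error through:

```python
class SeriesGroupBySketch:
    """Hypothetical group-by object with no item-by-item fallback."""

    def rank(self, method='average'):
        if method == 'first':
            raise ValueError("first not supported for non-numeric data")
        return 'ranked'


def call_asking_forgiveness(obj, **kwargs):
    try:
        return obj.rank(**kwargs)
    except ValueError:
        try:
            return obj._aggregate_item_by_item('rank', **kwargs)
        except AttributeError:
            # The fallback does not exist; the original message is lost.
            raise ValueError


def call_checking_first(obj, **kwargs):
    try:
        return obj.rank(**kwargs)
    except ValueError:
        if not hasattr(obj, '_aggregate_item_by_item'):
            # No fallback available: re-raise the original, informative error.
            raise
        return obj._aggregate_item_by_item('rank', **kwargs)


obj = SeriesGroupBySketch()

try:
    call_asking_forgiveness(obj, method='first')
    forgiveness_message = None
except ValueError as exc:
    forgiveness_message = str(exc)

try:
    call_checking_first(obj, method='first')
    checking_message = None
except ValueError as exc:
    checking_message = str(exc)
```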

kuanche commented Dec 30, 2015

Hi guys!
Dealing with the exact same issue. Any tips on what to try instead?

Contributor

nbonnotte commented Dec 30, 2015

What are you trying to do, exactly?

jreback modified the milestone: Next Major Release, 0.18.0 Feb 8, 2016

Contributor

nbonnotte commented Mar 13, 2016

Following pull request #11924, we now get an empty dataframe:

In [2]: df = DataFrame({'a': ['A1', 'A1', 'A1'],
   ...:     'b': ['B1', 'B1', 'B2'],
   ...:     'c': 1})

In [3]: dg = df.set_index('a').groupby('c')

In [5]: dg.rank(method='first')
Out[5]:
Empty DataFrame
Columns: []
Index: []

nbonnotte added a commit to nbonnotte/pandas that referenced this issue Jul 24, 2016

nbonnotte BUG in DataFrameGroupBy.rank returning empty frame #11759
fixes #11759
3b7831a

Contributor

jreback commented Nov 22, 2016

These seem to be working on current master:

In [3]: df = DataFrame({'a':['A1', 'A1', 'A1'], 'b':['B1','B1','B2'], 'c':1})

In [4]: df
Out[4]: 
    a   b  c
0  A1  B1  1
1  A1  B1  1
2  A1  B2  1

In [5]: df.set_index('a').groupby('c').rank(method='first')
Out[5]: 
Empty DataFrame
Columns: []
Index: []

In [6]: df.set_index('a').groupby('b').rank(method='first')
Out[6]: 
      c
a      
A1  1.0
A1  2.0
A1  1.0

In [7]: df.set_index('a').groupby('c').rank()
Out[7]: 
      b
a      
A1  1.5
A1  1.5
A1  3.0

In [8]: df.set_index('a').groupby('b').rank()
Out[8]: 
      c
a      
A1  1.5
A1  1.5
A1  1.0

jreback modified the milestone: 0.21.0, Next Major Release Jul 19, 2017
