Pandas (0.18) Rank: unexpected behavior for method = 'dense' and pct = True #15630

Closed
FXLab91 opened this Issue Mar 9, 2017 · 8 comments


FXLab91 commented Mar 9, 2017

I find the behavior of the rank function with method = 'dense' and pct = True unexpected: to calculate percentile ranks, it appears to divide by the total number of observations instead of the number of distinct observations.

Code Sample, a copy-pastable example if possible

import pandas as pd
n_rep = 2
ts = pd.Series([1,2,3,4] * n_rep )
output = ts.rank(method = 'dense', pct = True)

Problem description

ts.rank(method = 'dense', pct = True)
Out[116]: 
0    0.125
1    0.250
2    0.375
3    0.500
4    0.125
5    0.250
6    0.375
7    0.500

Expected Output

Something similar to:

pd.Series([1,2,3,4] * 2).rank(method = 'dense', pct = True) * n_rep 
Out[118]: 
0    0.25
1    0.50
2    0.75
3    1.00
4    0.25
5    0.50
6    0.75
7    1.00

Also, I would expect the result above to be invariant to n_rep.
That is, I would expect a "mapping" {value -> pct_rank} that does not depend on how many times each value is repeated, which is not the case here.
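
To make the dependence on n_rep concrete, here is a minimal sketch that reproduces the reported arithmetic by dividing the dense rank by the total number of observations explicitly, so it does not rely on what pct=True does in any particular pandas version. The helper name `dense_pct_by_nobs` is made up for illustration.

```python
import pandas as pd


def dense_pct_by_nobs(values):
    """Return {value: dense rank / total number of observations}.

    Hypothetical helper: it mimics the behaviour reported in this issue
    by doing the division explicitly instead of passing pct=True.
    """
    s = pd.Series(values)
    pct = s.rank(method="dense") / len(s)
    return dict(zip(s.tolist(), pct.tolist()))


mapping_2 = dense_pct_by_nobs([1, 2, 3, 4] * 2)
mapping_3 = dense_pct_by_nobs([1, 2, 3, 4] * 3)

print(mapping_2)  # {1: 0.125, 2: 0.25, 3: 0.375, 4: 0.5}
print(mapping_3)  # a different mapping: the denominator is now 12, not 8
```

The same four values get different percentile ranks once they are repeated three times instead of twice, which is the lack of invariance described above.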

@jreback

Contributor

jreback commented Mar 9, 2017

So all pct=True does is divide by the number of observations (nobs), which seems correct for all of the other methods.

In [3]: ts.rank(method='dense')
Out[3]: 
0    1.0
1    2.0
2    3.0
3    4.0
4    1.0
5    2.0
6    3.0
7    4.0
dtype: float64

# this is the result
In [4]: ts.rank(method='dense')/8
Out[4]: 
0    0.125
1    0.250
2    0.375
3    0.500
4    0.125
5    0.250
6    0.375
7    0.500
dtype: float64

You want something like this, I suppose. Note that the original definitions are from https://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.stats.rankdata.html (though scipy doesn't do pct, so I guess this doesn't matter).

In [14]: ts.rank(method='dense')/len(ts.drop_duplicates())
Out[14]: 
0    0.25
1    0.50
2    0.75
3    1.00
4    0.25
5    0.50
6    0.75
7    1.00
dtype: float64
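
The workaround above could be wrapped in a small helper, sketched below. The name `dense_pct_rank` is hypothetical, and `Series.nunique()` is used here in place of `len(ts.drop_duplicates())` on the assumption that missing values should not count toward the denominator; this is user-level code, not the change to the cython routine.

```python
import pandas as pd


def dense_pct_rank(s: pd.Series) -> pd.Series:
    """Dense rank scaled by the number of distinct non-null values.

    Hypothetical helper, not part of pandas. NaN entries stay NaN because
    rank() keeps them by default and nunique() ignores them.
    """
    return s.rank(method="dense") / s.nunique()


# The {value -> pct_rank} mapping should be the same for every n_rep.
for n_rep in (1, 2, 5):
    ts = pd.Series([1, 2, 3, 4] * n_rep)
    mapping = dict(zip(ts.tolist(), dense_pct_rank(ts).tolist()))
    print(n_rep, mapping)
```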

code is here:
https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/algos_rank_helper.pxi.in#L201

if you'd like to see what (if anything) this change would break. (Note that you cannot directly use .drop_duplicates; you would have to call the cython routine. Or, maybe better, we push the pct calcs higher up in the stack so we could call that routine. I don't think perf is an issue; it's more about clarity.)

@jreback

Contributor

jreback commented Mar 9, 2017

@FXLab91 another option is to not allow pct=True with dense and let the user decide.

@jreback

Contributor

jreback commented Mar 9, 2017

@shoyer any thoughts?

@shoyer

Member

shoyer commented Mar 9, 2017

I agree with @FXLab91 that this is very strange behavior, and I can't see why anyone would want it. So I would be inclined to treat it as a bug and fix it for the next release.

@dsm054

Contributor

dsm054 commented Mar 9, 2017

Does this suggest we should rethink the pct behaviour of some of the others as well? Something like [1,2,2] will give the same pct results under both min and dense (1/3, 2/3, 2/3).

@jreback

Contributor

jreback commented Mar 10, 2017

@dsm054 surely!

yep these are prob not tested at all.

@jreback jreback added this to the Next Major Release milestone Mar 10, 2017

rouzazari added a commit to rouzazari/pandas that referenced this issue Mar 10, 2017

BUG: Dense ranking with percent now uses 100% basis
- `DataFrame.rank()` and `Series.rank()` when `method='dense'` and
  `pct=True` now scales to 100%.

See pandas-dev#15630
@rouzazari

Contributor

rouzazari commented Mar 10, 2017

May be a bit premature but I just worked through a possible solution that only touches method=dense and does not require .drop_duplicates. Comments and recommendations appreciated.

rouzazari added a commit to rouzazari/pandas that referenced this issue Apr 5, 2017

BUG: Dense ranking with percent now uses 100% basis
- `DataFrame.rank()` and `Series.rank()` when `method='dense'` and
  `pct=True` now scales to 100%.

See pandas-dev#15630
@rouzazari

Contributor

rouzazari commented Apr 6, 2017

Restating @dsm054's question (and asking a few of my own): should all other methods return a "dense percentage" on a 100% basis when pct=True?

As @dsm054 noted, Series([1,2,2]).rank(method='min', pct=True) will return [1/3, 2/3, 2/3]. Should this return [1/2, 2/2, 2/2]?

Now if method='max', Series([1,2,2]).rank(method='max', pct=True) will return [1/3, 3/3, 3/3]. Is that the desired output, or should it again be [1/2, 2/2, 2/2]?

#15639 will fix the method='dense' case, but we need to address other methods as well.
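
For reference, the two candidate outputs in the question above can be laid side by side. This is only an illustrative sketch: both variants are computed by explicit division rather than with pct=True, and it takes no position on which one pandas should return.

```python
import pandas as pd

s = pd.Series([1, 2, 2])
nobs = s.count()  # 3 observations, 2 distinct values

# Variant A: each method's rank divided by the number of observations.
by_nobs = pd.DataFrame(
    {m: s.rank(method=m) / nobs for m in ("min", "max", "dense")}
)

# Variant B: a "dense percentage" on a 100% basis, whatever the method.
dense_basis = s.rank(method="dense") / s.nunique()

print(by_nobs)
print(dense_basis.tolist())  # [0.5, 1.0, 1.0]
```

Under variant A, min and dense are indistinguishable for this input and only max reaches 1.0; under variant B, the largest value always maps to 1.0.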

@jreback jreback modified the milestones: 0.21.0, Next Major Release May 7, 2017

rouzazari added a commit to rouzazari/pandas that referenced this issue May 22, 2017

BUG: Dense ranking with percent now uses 100% basis
- `DataFrame.rank()` and `Series.rank()` when `method='dense'` and
  `pct=True` now scales to 100%.

See pandas-dev#15630

@jreback jreback modified the milestones: 0.21.0, Next Major Release Sep 23, 2017

gfyoung added a commit to rouzazari/pandas that referenced this issue Mar 2, 2018

BUG: Dense ranking with percent now uses 100% basis
- `DataFrame.rank()` and `Series.rank()` when `method='dense'` and
  `pct=True` now scales to 100%.

See pandas-dev#15630

@jreback jreback modified the milestones: Next Major Release, 0.23.0 Mar 8, 2018

gfyoung added a commit to rouzazari/pandas that referenced this issue Mar 8, 2018

BUG: Dense ranking with percent now uses 100% basis
- `DataFrame.rank()` and `Series.rank()` when `method='dense'` and
  `pct=True` now scales to 100%.

See pandas-dev#15630
