# Pandas (0.18) Rank: unexpected behavior for method = 'dense' and pct = True #15630

Closed
opened this Issue Mar 9, 2017 · 8 comments


### FXLab91 commented Mar 9, 2017 • edited by jorisvandenbossche

I find the behavior of the rank function with method = 'dense' and pct = True unexpected: to calculate percentile ranks, the function appears to divide by the total number of observations instead of the number of distinct observations.

#### Code Sample, a copy-pastable example if possible

```
import pandas as pd
n_rep = 2
ts = pd.Series([1,2,3,4] * n_rep )
output = ts.rank(method = 'dense', pct = True)
```

#### Problem description

```
ts.rank(method = 'dense', pct = True)
Out[116]:
0    0.125
1    0.250
2    0.375
3    0.500
4    0.125
5    0.250
6    0.375
7    0.500
```

#### Expected Output

Something similar to:

```
pd.Series([1,2,3,4] * 2).rank(method = 'dense', pct = True) * n_rep
Out[118]:
0    0.25
1    0.50
2    0.75
3    1.00
4    0.25
5    0.50
6    0.75
7    1.00
```

Also, I would expect the result above to be invariant to n_rep.
That is, I would expect a "mapping" {value -> pct_rank} that does not depend on how many times each value is repeated, which is not the case here.
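The invariance described above can be checked with a short sketch. The helper `dense_pct_rank` is hypothetical (it is not part of pandas); it assumes the desired normalization is the dense rank divided by the number of distinct non-null values.

```python
import pandas as pd


def dense_pct_rank(s: pd.Series) -> pd.Series:
    """Hypothetical helper: dense rank divided by the number of distinct values."""
    return s.rank(method="dense") / s.nunique()


mappings = []
for n_rep in (1, 2, 5):
    ts = pd.Series([1, 2, 3, 4] * n_rep)
    pct = dense_pct_rank(ts)
    # Build the {value -> pct_rank} mapping for this repetition count.
    mapping = dict(zip(ts.tolist(), pct.tolist()))
    mappings.append(mapping)
    print(n_rep, mapping)
```

Under this assumption, every repetition count yields the same mapping, with the largest value always mapped to 1.0.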

Contributor

### jreback commented Mar 9, 2017

So all `pct=True` does is divide by the nobs, which seems correct for all of the other methods.

```
In [3]: ts.rank(method='dense')
Out[3]:
0    1.0
1    2.0
2    3.0
3    4.0
4    1.0
5    2.0
6    3.0
7    4.0
dtype: float64

# this is the result
In [4]: ts.rank(method='dense')/8
Out[4]:
0    0.125
1    0.250
2    0.375
3    0.500
4    0.125
5    0.250
6    0.375
7    0.500
dtype: float64
```

You want something like this, I suppose. Note that the original definitions are from https://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.stats.rankdata.html (though scipy doesn't do pct, so I guess this doesn't matter).

```
In [14]: ts.rank(method='dense')/len(ts.drop_duplicates())
Out[14]:
0    0.25
1    0.50
2    0.75
3    1.00
4    0.25
5    0.50
6    0.75
7    1.00
dtype: float64
```

Feel free to try this if you'd like to see what (if anything) this change would break. (Note: you cannot directly use `.drop_duplicates`; you would have to call the cython routine, or maybe better, we push the `pct` calcs higher up in the stack so we could call that routine. I don't think perf is an issue; it's more about clarity.)
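As a rough check of the "divide by nobs" description, the sketch below compares `pct=True` against a manual division for the non-dense methods. It assumes a series without missing values, and it computes the dense case by hand rather than through `pct=True`, since the dense result depends on the pandas version installed.

```python
import numpy as np
import pandas as pd

ts = pd.Series([1, 2, 3, 4] * 2)
nobs = ts.count()  # number of non-null observations (8 here)

# For the non-dense methods, pct=True should match rank / nobs.
matches = {}
for method in ("average", "min", "max", "first"):
    with_pct = ts.rank(method=method, pct=True)
    manual = ts.rank(method=method) / nobs
    matches[method] = bool(np.allclose(with_pct, manual))
print(matches)

# Dense ranks only go up to the number of distinct values,
# so dividing by nobs can never reach 1.0 when values repeat.
dense_raw = ts.rank(method="dense")
max_by_nobs = (dense_raw / nobs).max()
max_by_distinct = (dense_raw / ts.nunique()).max()
print(max_by_nobs, max_by_distinct)
```

Here `Series.nunique()` stands in for the `len(ts.drop_duplicates())` divisor used in the comment above; the two agree when there are no missing values.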

Contributor

### jreback commented Mar 9, 2017

 @FXLab91 another option is to not allow `pct=True` with dense and let the user decide.

Contributor

### jreback commented Mar 9, 2017

@shoyer any thoughts?
Member

### shoyer commented Mar 9, 2017

 I agree with @FXLab91 that this is very strange behavior, and I can't see why anyone would want it. So I would be inclined to treat it as a bug and fix it for the next release.
Contributor

### dsm054 commented Mar 9, 2017

 Does this suggest we should rethink the pct behaviour of some of the others as well? Something like [1,2,2] will give the same pct results under both min and dense (1/3, 2/3, 2/3).
Contributor

### jreback commented Mar 10, 2017

@dsm054 surely! Yep, these are probably not tested at all.

### rouzazari added a commit to rouzazari/pandas that referenced this issue Mar 10, 2017

```
BUG: Dense ranking with percent now uses 100% basis

- `DataFrame.rank()` and `Series.rank()` when `method='dense'` and
  `pct=True` now scales to 100%.

See pandas-dev#15630
```
``` 55827c8 ```

Merged

Contributor

### rouzazari commented Mar 10, 2017

 May be a bit premature but I just worked through a possible solution that only touches `method=dense` and does not require `.drop_duplicates`. Comments and recommendations appreciated.

### rouzazari added a commit to rouzazari/pandas that referenced this issue Apr 5, 2017

``` ea077d3 ```
Contributor

### rouzazari commented Apr 6, 2017

Restating @dsm054's question (and asking a few of my own): should all other `method`s return a "dense percentage" on a 100% basis when `pct=True`? As @dsm054 noted, `Series([1,2,2]).rank(method='min', pct=True)` will return [1/3, 2/3, 2/3]. Should this return [1/2, 2/2, 2/2]? Now if method='max', `Series([1,2,2]).rank(method='max', pct=True)` will return [1/3, 3/3, 3/3]. Is that the desired output, or should it again be [1/2, 2/2, 2/2]? #15639 will fix the `method='dense'` case, but we need to address the other `method`s as well.
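To make the two candidate normalizations concrete, here is a small sketch for `[1, 2, 2]`. It divides the raw ranks by hand instead of relying on `pct=True`, so it does not depend on which pandas version is installed; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([1, 2, 2])
nobs = s.count()          # 3 observations
n_distinct = s.nunique()  # 2 distinct values

# Basis 1: raw rank divided by the number of observations.
by_nobs = {}
for method in ("min", "max", "dense"):
    raw = s.rank(method=method)
    by_nobs[method] = (raw / nobs).round(3).tolist()
    print(method, raw.tolist(), by_nobs[method])

# Basis 2: the "dense percentage" on a 100% basis asked about above.
dense_basis = (s.rank(method="dense") / n_distinct).tolist()
print("dense basis", dense_basis)
```

With the nobs divisor, `min` and `dense` coincide and only `max` reaches 1.0; the second basis gives [1/2, 2/2, 2/2] regardless of how ties are ranked.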

### rouzazari added a commit to rouzazari/pandas that referenced this issue May 22, 2017

``` ba3da79 ```

### gfyoung added a commit to rouzazari/pandas that referenced this issue Mar 2, 2018

``` 0421dc5 ```

### gfyoung added a commit to rouzazari/pandas that referenced this issue Mar 2, 2018

``` 0f9bea3 ```

### gfyoung added a commit to rouzazari/pandas that referenced this issue Mar 8, 2018

``` edc8f85 ```

### gfyoung added a commit to rouzazari/pandas that referenced this issue Mar 8, 2018

``` 6299790 ```