Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

percent_rank vs cume_dist #28975

Open
xmnlab opened this issue Oct 14, 2019 · 6 comments
Open

percent_rank vs cume_dist #28975

xmnlab opened this issue Oct 14, 2019 · 6 comments
Labels

Comments

@xmnlab
Copy link

xmnlab commented Oct 14, 2019

It seems pandas percent_rank works like the cume_dist of SQL databases.

As ibis-framework tries to use the same pandas API as much as possible ... ibis-framework percent_rank is equal to pandas.

As in SQL databases there are these 2 operations, I need a way to implement the SQL percent_rank

I don't know very well which path I should take. Is there an initial thoughts here: ibis-project/ibis#1975

I wonder if there was any discussion about this topic before.

Any comment, recommendation or guidance would be very appreciated.

@datapythonista datapythonista added API Design Needs Discussion Requires discussion from core team before further action Needs Info Clarification about behavior needed to assess issue labels May 29, 2020
@datapythonista
Copy link
Member

Thanks @xmnlab.

Can you clarify what is the goal of this issue? It's unclear to me what do you expect.

Also, can you provide links to the functions you're talking about, in pandas, Ibis and SQL. Can't find them.

@xmnlab
Copy link
Author

xmnlab commented May 31, 2020

@datapythonista thanks for checking this issue

Some SQL databases, that have windows function support, implement percent_rank and cume_dist functions.

you can check something here: https://www.postgresql.org/docs/10/functions-window.html :

Function Return Type Description
percent_rank() double precision relative rank of the current row: (rank - 1) / (total partition rows - 1)
cume_dist() double precision cumulative distribution: (number of partition rows preceding or peer with current row) / total partition rows

Pandas percent_rank works int the same way as the SQL cume_dist .

I am creating a notebook with some examples and when it is done I will post the link here.

@xmnlab
Copy link
Author

xmnlab commented May 31, 2020

@datapythonista about the goal of this issue, it was to check if there was any discussion about this topic before.

I created this gist to show the difference between sql databases percent_rank and cume_dist: https://gist.github.com/xmnlab/d676ff1b0ff474c634d62010ebca8b07

note: related to Ibis, I will propose to change the current percent_rank to use SQL Database percent_rank approach and I will also propose to create one operation for cume_dist that will use pandas percent_rank.

@datapythonista
Copy link
Member

percent_rank seems to be a postgres function. I guess you're talking about Series.rank.

I don't see pandas implementing window functions for percent_rank or cume_dist.

I'm still unsure what outcome do you expect from this issue. Not sure if pandas Series.rank inspired anything in Ibis. Can you please do some more research, and clarify the exact problem, with links to the methods you're talking about? The description of the issue talks about percent_rank in pandas inspiring percent_rank in Ibis. I can't find any of those methods in the documentation of the projects.

@xmnlab
Copy link
Author

xmnlab commented Jun 1, 2020

you're right, when I said pandas percent_rank I meant Series.rank(method='min', pct=True)

for sql users it is confusing because Series.rank(method='min', pct=True) is not the same as SQL percent_rank function .. indeed, it is the same as SQL cume_dist function.

this gist shows some examples about that.

It seems the current ibis documentation doesn't have the full list of the operations supported there (I opened a issue for this problem)

some links about percent_rank on ibis (also in the gist above link) are:

As I commented before, my expectation with this issue was checking if there is a previous discussion about this topic before to understand the reasons about the current rank pct implementation.

@mroeschke mroeschke added Docs and removed API Design Needs Discussion Requires discussion from core team before further action Needs Info Clarification about behavior needed to assess issue labels Jul 21, 2021
@shivam7898
Copy link

I had the same issue and Google led me here. I am adding the python code (& R equivalent) for future reference.

import pandas as pd
pp = pd.Series([12, 15, 11, 13, None, 12])
qq = (pp.rank(method = 'min') - 1) / (pp.count() - 1)      #Percent Rank
print(*qq)
## 0.25 1.0 0.0 0.75 nan 0.25

print(*pp.rank(method = 'max', pct = True))                #Cumulative Distance
## 0.6 1.0 0.2 0.8 nan 0.6
library(dplyr)
aa <- c(12, 15, 11, 13, NA, 12)
percent_rank(aa)                                          #Percent Rank
## [1] 0.25 1.00 0.00 0.75   NA 0.25

cume_dist(aa)                                             #Cumulative Distance
## [1] 0.6 1.0 0.2 0.8  NA 0.6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants