percent_rank vs cume_dist #28975
Comments
Thanks @xmnlab. Can you clarify the goal of this issue? It's unclear to me what you expect. Also, can you provide links to the functions you're talking about in pandas, Ibis, and SQL? I can't find them.
@datapythonista thanks for checking this issue. Some SQL databases that support window functions implement both percent_rank and cume_dist; you can check them here: https://www.postgresql.org/docs/10/functions-window.html
For pandas, I am creating a notebook with some examples, and I will post the link here when it is done.
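To make the SQL side concrete, here is a minimal sketch that runs both window functions from Python. It assumes SQLite (3.25 or newer) as a stand-in for the databases mentioned above, since it ships with Python; the table name `t`, the column `x`, and the sample values are made up for illustration.

```python
import sqlite3

# In-memory database; window functions require SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x REAL)")  # hypothetical table for illustration
conn.executemany("INSERT INTO t (x) VALUES (?)", [(12,), (15,), (11,), (13,), (12,)])

rows = conn.execute(
    """
    SELECT x,
           percent_rank() OVER (ORDER BY x) AS pr,
           cume_dist()    OVER (ORDER BY x) AS cd
    FROM t
    ORDER BY x
    """
).fetchall()
conn.close()

# Each row is (x, percent_rank, cume_dist); tied values share the same result.
for x, pr, cd in rows:
    print(x, pr, cd)
```

The two columns differ on every row except the largest value, which is the discrepancy this issue is about.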
@datapythonista the goal of this issue was to check whether this topic had been discussed before. I created this gist to show the difference between percent_rank and cume_dist in SQL databases: https://gist.github.com/xmnlab/d676ff1b0ff474c634d62010ebca8b07
Note: for Ibis, I will propose changing the current percent_rank to follow the SQL percent_rank approach, and I will also propose creating a cume_dist operation that uses the pandas percent_rank behavior.
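For reference, the two SQL definitions can be written side by side. This is a sketch following the PostgreSQL window-function documentation, for one partition of $N$ rows ordered ascending by $x$; the symbols $N$, $x_i$, and $\operatorname{rank}$ are notation introduced here, not taken from the thread.

```latex
\[
\text{percent\_rank}(x_i) = \frac{\operatorname{rank}(x_i) - 1}{N - 1},
\qquad
\text{cume\_dist}(x_i) = \frac{\left|\{\, j : x_j \le x_i \,\}\right|}{N}
\]
```

Here $\operatorname{rank}(x_i)$ is the SQL `rank()`, where ties share the lowest rank. As a result, percent_rank starts at 0 for the smallest value, while cume_dist is always strictly greater than 0.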
I don't see pandas implementing window functions for ... I'm still unsure what outcome you expect from this issue. Not sure if pandas ...
You're right, when I said pandas for SQL users it was confusing, because this gist shows some examples about that. It seems the current Ibis documentation doesn't have the full list of the operations supported there (I opened an issue for this problem). Some links about percent_rank in Ibis (also in the gist linked above) are:
As I commented before, my expectation for this issue was to check whether there was a previous discussion about this topic, in order to understand the reasons behind the current rank pct implementation.
I had the same issue and Google led me here. I am adding the Python code (and the R equivalent) for future reference.

```python
import pandas as pd

pp = pd.Series([12, 15, 11, 13, None, 12])

# Percent rank: (min rank - 1) / (number of non-null values - 1)
qq = (pp.rank(method='min') - 1) / (pp.count() - 1)
print(*qq)
## 0.25 1.0 0.0 0.75 nan 0.25

# Cumulative distribution: max rank / number of non-null values
print(*pp.rank(method='max', pct=True))
## 0.6 1.0 0.2 0.8 nan 0.6
```

```r
library(dplyr)

aa <- c(12, 15, 11, 13, NA, 12)
percent_rank(aa)  # Percent rank
## [1] 0.25 1.00 0.00 0.75   NA 0.25
cume_dist(aa)     # Cumulative distribution
## [1] 0.6 1.0 0.2 0.8  NA 0.6
```
It seems pandas `percent_rank` works like the `cume_dist` of SQL databases.

As `ibis-framework` tries to use the same pandas API as much as possible, the `ibis-framework` `percent_rank` is equal to the `pandas` one.

As SQL databases have these 2 operations, I need a way to implement the SQL `percent_rank`.

I don't know very well which path I should take. There are some initial thoughts here: ibis-project/ibis#1975
I wonder if there was any discussion about this topic before.
Any comment, recommendation or guidance would be very appreciated.
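One possible path, sketched under assumptions: both SQL operations can be expressed with the existing `Series.rank`, and applied per partition with `groupby(...).transform`. The helper names `sql_percent_rank` and `sql_cume_dist`, and the columns `g` and `x`, are hypothetical and not part of pandas or Ibis. Note that NaN stays NaN here, whereas SQL databases typically sort and rank NULL values.

```python
import pandas as pd


def sql_percent_rank(s: pd.Series) -> pd.Series:
    """SQL-style percent_rank: (rank - 1) / (n - 1), ties share the minimum rank."""
    n = s.count()  # number of non-null values
    ranks = s.rank(method="min")
    if n <= 1:
        # A single non-null row gets 0.0; null rows stay NaN.
        return ranks - 1
    return (ranks - 1) / (n - 1)


def sql_cume_dist(s: pd.Series) -> pd.Series:
    """SQL-style cume_dist: share of non-null rows less than or equal to each value."""
    return s.rank(method="max", pct=True)


# Hypothetical data: column g plays the role of PARTITION BY, x of ORDER BY.
df = pd.DataFrame({"g": ["a", "a", "a", "b", "b"], "x": [12, 15, 11, 3, 3]})
df["pr"] = df.groupby("g")["x"].transform(sql_percent_rank)
df["cd"] = df.groupby("g")["x"].transform(sql_cume_dist)
print(df)
```

Keeping the two helpers separate mirrors the SQL split between the operations, so neither has to change what `rank(pct=True)` already does.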