percent_rank vs cume_dist #28975

xmnlab · 2019-10-14T15:30:20Z

It seems pandas percent_rank works like the cume_dist of SQL databases.

As ibis-framework tries to use the same pandas API as much as possible ... ibis-framework percent_rank is equal to pandas.

As in SQL databases there are these 2 operations, I need a way to implement the SQL percent_rank

I don't know very well which path I should take. Is there an initial thoughts here: ibis-project/ibis#1975

I wonder if there was any discussion about this topic before.

Any comment, recommendation or guidance would be very appreciated.

The text was updated successfully, but these errors were encountered:

datapythonista · 2020-05-29T08:50:02Z

Thanks @xmnlab.

Can you clarify what is the goal of this issue? It's unclear to me what do you expect.

Also, can you provide links to the functions you're talking about, in pandas, Ibis and SQL. Can't find them.

xmnlab · 2020-05-31T00:27:23Z

@datapythonista thanks for checking this issue

Some SQL databases, that have windows function support, implement percent_rank and cume_dist functions.

you can check something here: https://www.postgresql.org/docs/10/functions-window.html :

Function	Return Type	Description
percent_rank()	double precision	relative rank of the current row: (rank - 1) / (total partition rows - 1)
cume_dist()	double precision	cumulative distribution: (number of partition rows preceding or peer with current row) / total partition rows

Pandas percent_rank works int the same way as the SQL cume_dist .

I am creating a notebook with some examples and when it is done I will post the link here.

xmnlab · 2020-05-31T22:27:22Z

@datapythonista about the goal of this issue, it was to check if there was any discussion about this topic before.

I created this gist to show the difference between sql databases percent_rank and cume_dist: https://gist.github.com/xmnlab/d676ff1b0ff474c634d62010ebca8b07

note: related to Ibis, I will propose to change the current percent_rank to use SQL Database percent_rank approach and I will also propose to create one operation for cume_dist that will use pandas percent_rank.

datapythonista · 2020-06-01T10:55:48Z

percent_rank seems to be a postgres function. I guess you're talking about Series.rank.

I don't see pandas implementing window functions for percent_rank or cume_dist.

I'm still unsure what outcome do you expect from this issue. Not sure if pandas Series.rank inspired anything in Ibis. Can you please do some more research, and clarify the exact problem, with links to the methods you're talking about? The description of the issue talks about percent_rank in pandas inspiring percent_rank in Ibis. I can't find any of those methods in the documentation of the projects.

xmnlab · 2020-06-01T15:11:11Z

you're right, when I said pandas percent_rank I meant Series.rank(method='min', pct=True)

for sql users it is confusing because Series.rank(method='min', pct=True) is not the same as SQL percent_rank function .. indeed, it is the same as SQL cume_dist function.

this gist shows some examples about that.

It seems the current ibis documentation doesn't have the full list of the operations supported there (I opened a issue for this problem)

some links about percent_rank on ibis (also in the gist above link) are:

As I commented before, my expectation with this issue was checking if there is a previous discussion about this topic before to understand the reasons about the current rank pct implementation.

shivam7898 · 2022-07-29T10:54:42Z

I had the same issue and Google led me here. I am adding the python code (& R equivalent) for future reference.

import pandas as pd
pp = pd.Series([12, 15, 11, 13, None, 12])
qq = (pp.rank(method = 'min') - 1) / (pp.count() - 1)      #Percent Rank
print(*qq)
## 0.25 1.0 0.0 0.75 nan 0.25

print(*pp.rank(method = 'max', pct = True))                #Cumulative Distance
## 0.6 1.0 0.2 0.8 nan 0.6

library(dplyr)
aa <- c(12, 15, 11, 13, NA, 12)
percent_rank(aa)                                          #Percent Rank
## [1] 0.25 1.00 0.00 0.75   NA 0.25

cume_dist(aa)                                             #Cumulative Distance
## [1] 0.6 1.0 0.2 0.8  NA 0.6

xmnlab mentioned this issue Nov 22, 2019

question: percent_rank vs cume_dist ibis-project/ibis#1975

Closed

xmnlab mentioned this issue Jan 29, 2020

[IBIS] Implement percent_rank and cume_dist properly to ibis.omniscidb Quansight/omnisci#89

Open

datapythonista added API Design Needs Discussion Requires discussion from core team before further action Needs Info Clarification about behavior needed to assess issue labels May 29, 2020

xmnlab mentioned this issue Dec 21, 2020

Implement percent_rank and cume_dist properly to ibis.omniscidb heavyai/ibis-heavyai#5

Open

mroeschke added Docs and removed API Design Needs Discussion Requires discussion from core team before further action Needs Info Clarification about behavior needed to assess issue labels Jul 21, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

percent_rank vs cume_dist #28975

percent_rank vs cume_dist #28975

xmnlab commented Oct 14, 2019

datapythonista commented May 29, 2020

xmnlab commented May 31, 2020 •

edited

Loading

xmnlab commented May 31, 2020

datapythonista commented Jun 1, 2020

xmnlab commented Jun 1, 2020 •

edited

Loading

shivam7898 commented Jul 29, 2022

percent_rank vs cume_dist #28975

percent_rank vs cume_dist #28975

Comments

xmnlab commented Oct 14, 2019

datapythonista commented May 29, 2020

xmnlab commented May 31, 2020 • edited Loading

xmnlab commented May 31, 2020

datapythonista commented Jun 1, 2020

xmnlab commented Jun 1, 2020 • edited Loading

shivam7898 commented Jul 29, 2022

xmnlab commented May 31, 2020 •

edited

Loading

xmnlab commented Jun 1, 2020 •

edited

Loading