question: percent_rank vs cume_dist #1975

xmnlab · 2019-09-21T23:55:46Z

It seems that the Ibis percent_rank operation is in fact the SQL cume_dist operation, as explained at [1].

So how could we implement the SQL percent_rank operation?

Suggestion 1:

Maybe Ibis percent_rank could take an optional argument such as cume_dist (default: True). If it is False, maybe we can test that using [2], but that means we would have to add scipy as a dependency.

Any thoughts about this issue?

refs:
[1] ibis/ibis/tests/all/test_window.py, line 43 in ae71b3a:

# these can't be equivalent, because pandas doesn't have a way to

[2] https://stackoverflow.com/questions/39823470/getting-postgresql-percent-rank-and-scipy-stats-percentileofscore-results-to-mat

extra ref: https://riptutorial.com/sql/example/27456/percent-rank-and-cume-dist
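
For reference, here is a minimal sketch of the two SQL definitions in plain Python, just to make the difference concrete. It assumes a single partition ordered ascending with no NULLs; the function names are only illustrative and are not part of Ibis or pandas.

```python
from bisect import bisect_left, bisect_right


def cume_dist(values):
    """SQL CUME_DIST: fraction of rows whose value is <= the current row's value."""
    ordered = sorted(values)
    n = len(ordered)
    return [bisect_right(ordered, v) / n for v in values]


def percent_rank(values):
    """SQL PERCENT_RANK: (rank - 1) / (n - 1), where tied rows share the lowest rank."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 1:
        # A single-row partition always has percent_rank 0.
        return [0.0]
    # bisect_left counts the rows strictly below v, which equals RANK() - 1.
    return [bisect_left(ordered, v) / (n - 1) for v in values]


values = [10, 20, 20, 30]
print(cume_dist(values))     # never 0: the smallest row still counts itself
print(percent_rank(values))  # starts at 0 for the smallest row
```

The two only coincide on the last row, so a test that expects one and gets the other fails on almost every row.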


xmnlab · 2019-09-22T21:44:57Z

Suggestion 2:

Propose a change on the pandas GitHub to handle these two cases: percent_rank and cume_dist.

Maybe related to this file: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/algos_rank_helper.pxi.in

I really don't know if this topic has been discussed before; maybe @jreback could give us some information/feedback.

xmnlab · 2019-11-22T15:20:45Z

@jreback do you know someone who could help with this discussion?

xmnlab · 2019-11-22T15:21:22Z

I also opened an issue on pandas github: pandas-dev/pandas#28975

scottcode · 2019-11-23T04:51:00Z

I don’t know much about the specific feature in question, but I wanted to chime in with a thought about functionalities in various backends.

It seems like ibis is meant to provide a unifying abstraction over many backends. The different backends may have varying functionality, but is it really ibis’s scope to fill in capabilities that are missing in a particular backend? Some functionality might just not be supported for a particular backend, or if possible add it to the backend directly.

xmnlab · 2019-11-23T17:35:36Z

Hey @scottcode, regarding your questions, IMHO:

> Some functionality might just not be supported for a particular backend, or if possible add it to the backend directly.

In general, I think that creating an operation directly inside one particular backend would be dangerous, because another person could add the same operation in the future under a different name or with small, unnecessary differences, so it would be inconsistent.

> is it really ibis’s scope to fill in capabilities that are missing in a particular backend?

I think users should have a way to use the functions they need through Ibis expressions (as much as possible).

In my own experience, I have tried to understand how pandas implements a given operation, and if pandas has it, I have tried to port it to Ibis as closely as possible.

Regarding the current issue, it seems that percent_rank and cume_dist are both used by backends such as omniscidb, postgresql, mysql, mssql ...

xmnlab · 2020-05-28T20:34:35Z

Maybe we can use an approach similar to the one used for ntile (#2146). In this case, the tests can compare Ibis cume_dist with (pure) pandas percent_rank, and we can create an Ibis pandas percent_rank operation and use it to check the percent_rank operation of the other backends.

cpcloud · 2021-12-17T21:29:59Z

@xmnlab Can you clarify what might be actionable here? Should we rename percent_rank to cume_dist?

xmnlab · 2021-12-18T13:59:15Z

Hi @cpcloud,
I created this PR a long time ago: #2224

So basically the percent_rank tests use pandas df.rank(pct=True), but this behaves like SQL CumeDist.

So the easiest way would be to rename the operation to CumeDist. For PercentRank, the test should be implemented manually, as described in that old PR. Also, some backends should change their translation from percent_rank to cume_dist.

Let me know if you want more information about that.
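
To illustrate what "implemented manually" could look like on the pandas side, here is a rough sketch. It is not the code from #2224; the frame, the column names `g` and `x`, and the variable names are made up for the example. One caveat, as far as I understand pandas: `rank(pct=True)` defaults to `method="average"`, so it only matches SQL cume_dist when there are no ties. Passing `method="max"` makes it match with ties too.

```python
import pandas as pd

# Hypothetical data: "g" is the partition key, "x" is the ordering column.
df = pd.DataFrame({"g": ["a", "a", "a", "a", "b"], "x": [10, 20, 20, 30, 7]})
grouped = df.groupby("g")["x"]

# Default pct rank (method="average"): differs from cume_dist on tied rows.
default_pct = grouped.rank(pct=True)

# cume_dist: rows with value <= current, divided by the partition size.
cume_dist = grouped.rank(method="max", pct=True)

# percent_rank: (rank - 1) / (n - 1), with tied rows sharing the minimum rank.
rank_min = grouped.rank(method="min")
size = grouped.transform("count")
# A single-row partition gives 0 / 0 = NaN; SQL returns 0 there.
# Note: fillna would also mask NaN coming from null inputs, which this sketch ignores.
percent_rank = ((rank_min - 1) / (size - 1)).fillna(0.0)

print(pd.DataFrame({"default_pct": default_pct,
                    "cume_dist": cume_dist,
                    "percent_rank": percent_rank}))
```

An expected-value column built this way could then be compared against each backend's percent_rank result in the tests.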

cpcloud · 2022-04-19T07:58:50Z

We're now correctly implementing percent_rank everywhere, and we have #3590 for adding cume_dist, closing this out.

xmnlab added the question Questions about the library label Sep 21, 2019

xmnlab mentioned this issue Sep 22, 2019

Ibis: Current Window Function Support Quansight/omnisci#33

Closed

13 tasks

xmnlab mentioned this issue Oct 14, 2019

percent_rank vs cume_dist pandas-dev/pandas#28975

Open

xmnlab mentioned this issue May 31, 2020

ENH: Add properly cume_dist and percent_rank ops #2224

Closed

cpcloud changed the title ~~percent_rank vs cume_dist~~ question: percent_rank vs cume_dist Dec 29, 2021

cpcloud mentioned this issue Mar 11, 2022

feat(api): add cume_dist function #3590

Closed

cpcloud closed this as completed Apr 19, 2022

cpcloud added this to the 3.0.0 milestone Apr 19, 2022
