Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

question: percent_rank vs cume_dist #1975

Closed
xmnlab opened this issue Sep 21, 2019 · 9 comments
Closed

question: percent_rank vs cume_dist #1975

xmnlab opened this issue Sep 21, 2019 · 9 comments
Labels
question Questions about the library
Milestone

Comments

@xmnlab
Copy link
Contributor

xmnlab commented Sep 21, 2019

It seems the Ibis percent_rank operation is in fact the cume_dist SQL operation as explained at [1].

So how could we implement percent_rank SQL operation?

Suggestion 1:

  • maybe ibis percent_rank could have one optional argument like cume_dist (default: True)
  • and if it is false maybe we can test that using [2], but it means that we should add scipy as a dependence.

any thoughts about this issue?

refs:
[1]

# these can't be equivalent, because pandas doesn't have a way to

[2] https://stackoverflow.com/questions/39823470/getting-postgresql-percent-rank-and-scipy-stats-percentileofscore-results-to-mat

extra ref: https://riptutorial.com/sql/example/27456/percent-rank-and-cume-dist

@xmnlab xmnlab added the question Questions about the library label Sep 21, 2019
@xmnlab
Copy link
Contributor Author

xmnlab commented Sep 22, 2019

suggestion 2:

  • propose a change on pandas github to handle these 2 cases: percent rank cume_dist

maybe related to this file: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/algos_rank_helper.pxi.in

I really don't know if this topic was discussed before, maybe @jreback could give us some information/feedback

@xmnlab
Copy link
Contributor Author

xmnlab commented Nov 22, 2019

@jreback do you know someone who could help in this discussion?

@xmnlab
Copy link
Contributor Author

xmnlab commented Nov 22, 2019

I also opened an issue on pandas github: pandas-dev/pandas#28975

@scottcode
Copy link
Contributor

I don’t know much about the specific feature in question, but I wanted to chime in with a thought about functionalities in various backends.

It seems like ibis is meant to provide a unifying abstraction over many backends. The different backends may have varying functionality, but is it really ibis’s scope to fill in capabilities that are missing in a particular backend? Some functionality might just not be supported for a particular backend, or if possible add it to the backend directly.

@xmnlab
Copy link
Contributor Author

xmnlab commented Nov 23, 2019

hey @scottcode related to your question ..

IMHO

Some functionality might just not be supported for a particular backend, or if possible add it to the backend directly.

In general I think that creates an operation just inside a particular backend directly would be dangerous because another person could add the same operation in a future with different name or any small unnecessary differences ... so it would be inconsistent ..

is it really ibis’s scope to fill in capabilities that are missing in a particular backend?
I think the users should have a way to use the functions they need using ibis expressions (as much as possible).

In my own experience, I have tried to understand how pandas implement that operation .. and if pandas has this operation I have tried to port that to Ibis as similar as possible.

related to the current issue ... it seems that percent_rank and cume_dist are both used by backends such as omniscidb, postgresql, mysql, mssql ...

@xmnlab
Copy link
Contributor Author

xmnlab commented May 28, 2020

maybe we can use a similar approach used for ntile (#2146). In this case, the tests can compare ibis cume_dist with (pure) pandas percent_rank and we can create an ibis pandas percent_rank operation and use it to check other backends percent_rank operation.

@cpcloud
Copy link
Member

cpcloud commented Dec 17, 2021

@xmnlab Can you clarify what might be actionable here? Should we rename percent_rank to cume_dist?

@xmnlab
Copy link
Contributor Author

xmnlab commented Dec 18, 2021

hi @cpcloud
I created this PR long time ago: #2224

so basically the percent_rank tests uses pandas df.rank(pct=True) .. but this works as SQL CumeDist.

So, the easiest way would be to rename the operation to CumeDist. and for PercentRank the test should be implemented manually as described in that old PR. also some backend should change the translation to percent_rank to cume_dist.

let me know if you want more information about that.

@cpcloud cpcloud changed the title percent_rank vs cume_dist question: percent_rank vs cume_dist Dec 29, 2021
@cpcloud
Copy link
Member

cpcloud commented Apr 19, 2022

We're now correctly implementing percent_rank everywhere, and we have #3590 for adding cume_dist, closing this out.

@cpcloud cpcloud closed this as completed Apr 19, 2022
@cpcloud cpcloud added this to the 3.0.0 milestone Apr 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Questions about the library
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants