Implement Clip in the Pyspark backend #2779
Conversation
ibis/backends/pyspark/compiler.py (outdated):

    upper = t.translate(op.upper, scope, timecontext)
    lower = t.translate(op.lower, scope, timecontext)
    expr = F.when(col >= upper, F.lit(upper)).otherwise(
        F.when(col <= lower, F.lit(lower)).otherwise(col)
    )
Happy with this implementation, but in case you haven't considered it: you can implement an upper/lower bound in a (probably) simpler way using min and max. Like:

    def clip_lower(value, lower):
        return max(value, lower)

And the equivalent for upper with min. I think your implementation here would be much simpler using this approach (and I'm not sure whether the translated expression could be faster too).
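The suggestion above can be sketched end-to-end with plain-Python scalars standing in for pyspark columns; this is only an illustration of the intended clipping semantics, not the backend implementation:

```python
def clip_lower(value, lower):
    # Raise anything below `lower` up to `lower`.
    return max(value, lower)

def clip_upper(value, upper):
    # Lower anything above `upper` down to `upper`.
    return min(value, upper)

def clip(value, lower, upper):
    # Apply both bounds; assumes lower <= upper.
    return clip_lower(clip_upper(value, upper), lower)
```

On scalars this is exactly the min/max composition being proposed; the open question in the thread is how to express the same composition on pyspark column expressions.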
@datapythonista Maybe I misunderstood your suggestion, but the objects we work with here are pyspark columns, and something like

    max(spark_column, 0.0)

won't really work here.
Or do you mean to implement it like this?

    def column_max(col1, col2):
        return F.when(col1 < col2, col2).otherwise(col1)

    def column_min(col1, col2):
        return F.when(col1 < col2, col1).otherwise(col2)

    def clip(col, lower, upper):
        return column_max(column_min(col, F.lit(upper)), F.lit(lower))
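The branch logic of that when/otherwise composition can be checked with a plain-Python stand-in (hypothetical scalar versions of the column helpers above, for illustration only):

```python
def column_max(a, b):
    # Mirrors F.when(col1 < col2, col2).otherwise(col1):
    # pick b when a < b, else keep a.
    return b if a < b else a

def column_min(a, b):
    # Mirrors F.when(col1 < col2, col1).otherwise(col2):
    # pick a when a < b, else keep b.
    return a if a < b else b

def clip(value, lower, upper):
    # Clamp to the upper bound first, then the lower bound.
    return column_max(column_min(value, upper), lower)
```

If the scalar version clamps correctly, the column version should produce the same per-row results, since when/otherwise evaluates the same comparison row by row.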
I haven't used pyspark for a while and I don't remember, but isn't there a pyspark function to get the min or max between a column and a literal? And if there isn't, we should have one in Ibis; we could write this using them.
In any case, it's not important; if this is simpler, let's just go with this approach.
I don't believe there is any native pyspark function that will directly get us this min/max comparison functionality. Probably easiest to leave it with the when/otherwise paradigm for now.
You can now merge master, and the CI should be fixed.
Updated: fa67c1c → 6cfc7a5
@emilyreff7, can you also add a release note?
Updated: 36d331c → 7ae5f2d
Done!
LGTM

thanks @emilyreff7
Proposed Change
Implement support for executing clip in the Pyspark backend.

Tests
Moved the existing pandas and dask clip tests into test_numeric.py so that they run against pyspark as well.