Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Clip in the Pyspark backend #2779

Merged
merged 4 commits into from
May 18, 2021

Conversation

emilyreff7
Copy link
Contributor

@emilyreff7 emilyreff7 commented May 17, 2021

Proposed Change

Implement support to be able to execute clip in the Pyspark backend. For example:

t.col.clip(lower=0, upper=10).execute()

Tests

Moved existing pandas and dask clip tests into test_numeric.py to call on pyspark as well.

@jreback jreback added pyspark The Apache PySpark backend expressions Issues or PRs related to the expression API labels May 17, 2021
upper = t.translate(op.upper, scope, timecontext)
lower = t.translate(op.lower, scope, timecontext)
expr = F.when(col >= upper, F.lit(upper)).otherwise(
F.when(col <= lower, F.lit(lower)).otherwise(col)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy with this implementation, but in case you haven't considered, you can implement an upper/lower bound in a (probably) simpler way using min and max. Like:

def clip_lower(value, lower):
    return max(value, lower)

And equivalent for upper with min. I think your implementation here would be way simpler using this approach (and not sure if the translated expression could be faster too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@datapythonista Maybe I misunderstood your suggestion, but the object that we work with here are pyspark columns and sth like

max(spark_column, 0.0)

won't really work here.

Copy link
Contributor

@icexelloss icexelloss May 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or do you mean to implement like this?

def column_max(col1, col2):
    return when(col1 < col2, col2).otherwise(col1)
    
def column_min(col1, col2):
    return when(col1 < col2, col1).otherwise(col2)
    
def clip(col, lower, upper):
    return column_max(column_min(col, F.lit(upper)), F.lit(lower))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't use pyspark for a while and I don't remember, but isn't it a pyspark function to get the min or max between a column and a literal? Or if it's not, we should have it in Ibis, we could write this using them.

In any case, not important, if this is simpler, let's just go with this approach.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe there is any native pyspark function that will directly get us this min/max comparison functionality. Probably easiest to leave it with the when/otherwise paradigm for now.

Copy link
Contributor

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can now merge master, and the CI should be fixed.

@jreback jreback added this to the Next release milestone May 17, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emilyreff7 can you also add a release note

@emilyreff7
Copy link
Contributor Author

@emilyreff7 can you also add a release note

Done!

@emilyreff7 emilyreff7 closed this May 18, 2021
@emilyreff7 emilyreff7 reopened this May 18, 2021
Copy link
Contributor

@icexelloss icexelloss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jreback jreback merged commit b32ec96 into ibis-project:master May 18, 2021
@jreback
Copy link
Contributor

jreback commented May 18, 2021

thanks @emilyreff7

@emilyreff7 emilyreff7 deleted the implement-clip-pyspark branch June 3, 2021 18:51
@cpcloud cpcloud removed this from the Next release milestone Jan 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
expressions Issues or PRs related to the expression API pyspark The Apache PySpark backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants