FEAT: Validate that the output type of a UDF is a single element #2198
Conversation
@@ -10,9 +10,22 @@ def add_one(s):
    return s + 1


@elementwise(input_type=[dt.double], output_type=[dt.double])
def add_one_with_output_type_list(s):
I think you can inline this function in the tests; we don't need to use it in other tests.
@pytest.mark.only_on_backends([Pandas, PySpark])
@pytest.mark.xfail_unsupported
def test_elementwise_udf_with_output_type_list(backend, alltypes, df):
We should probably comment that this does NOT test the output of the UDF being a list. We only test that specifying output=[dt] is the same as output=dt in the udf decorator.
ibis/udf/vectorized.py
Outdated
self.input_type = list(map(dt.dtype, input_type))

if type(output_type) is list:
It's more common to use isinstance(output_type, list)
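A quick illustration of why isinstance is the more robust spelling: type() only matches the exact class, so a list subclass (MyList below is hypothetical) would slip past the check.

```python
class MyList(list):
    """Hypothetical list subclass, e.g. produced by a third-party helper."""

output_type = MyList(["double"])

# type() misses subclasses; isinstance() catches them:
print(type(output_type) is list)      # False
print(isinstance(output_type, list))  # True
```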
ibis/udf/vectorized.py
Outdated
self.input_type = list(map(dt.dtype, input_type))

if type(output_type) is list:
    [output_type] = output_type
It's more common to use
output_type, = output_type
We should probably also catch the exception when the list has more than one element and give a reasonable error message.
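A minimal sketch of that suggestion (illustrative only; this is not the PR's actual implementation, and the PR ultimately decided to raise on any list, per the discussion further down):

```python
def validate_output_type(output_type):
    """Unpack a single-element list, raising a clear error otherwise."""
    if isinstance(output_type, list):
        if len(output_type) != 1:
            # Replace the bare ValueError from unpacking with a readable message
            raise ValueError(
                "output_type must contain exactly one datatype; "
                "got a list of {} elements".format(len(output_type))
            )
        (output_type,) = output_type
    return output_type

# Single values pass through; single-element lists are unwrapped:
validate_output_type("double")      # "double"
validate_output_type(["double"])    # "double"
```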
It's more common to use
output_type, = output_type
blacken was changing this to (output_type,) = output_type, so [output_type] = output_type seemed cleaner. However, if output_type, = output_type is more common, then maybe (output_type,) = output_type is more readable.
should probably also catch the exception when the list has more than one element and give a reasonable error message
Will do
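For reference, the three unpacking spellings discussed here are semantically identical, including the ValueError raised on a length mismatch:

```python
src = [3.14]

[a] = src    # list-target form
(b,) = src   # tuple-target form (what black produces)
c, = src     # bare trailing-comma form

assert a == b == c == 3.14

# All three raise the same ValueError on a multi-element list:
try:
    [x] = [1, 2]
except ValueError as e:
    print(e)  # too many values to unpack (expected 1)
```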
left some comments
Force-pushed e19ceb3 to 24b9bb6
The Spark backend is not running, see #2201; we need to re-enable that first.
Also, please add a release note.
If you can, rebase this.
24b9bb6
to
594ae41
Compare
Done! CI is green. The rebase did involve resolving some merge conflicts, but the changes are exactly the same besides the name of the test and comments/docstrings. Because of #2210, the changes now also apply to
Hmm, why are we allowing this kind of output type as a list? I would rather just raise if this is specified (on all backends), as it's better to be strict.
@jreback I am fine with raising, but I prefer allowing it. I know currently the output type can only be a single column, but it seems natural to type
This syntax goes against all other call syntaxes (numpy ufunc, numba); I don't see why we would allow this. OK with raising.
Ah yes, that makes sense. We don't want to go against existing conventions. Let's raise.
^ Sounds good to me! I refactored the validation function. This is starting to sound more like a feature (adding validation) than a bug fix to me, so I'm going to rename this PR.
ibis/udf/vectorized.py
Outdated
@@ -102,6 +52,10 @@ def func(*args):


def _udf_decorator(node_type, input_type, output_type):
    if isinstance(output_type, list):
Please update https://github.com/ibis-project/ibis/blob/master/ibis/expr/operations.py#L3462 to do proper rlz checking instead of checking it here.
See https://github.com/ibis-project/ibis/blob/master/ibis/expr/rules.py and other examples in https://github.com/ibis-project/ibis/blob/master/ibis/expr/operations.py
Oops, this shouldn't be here (the checking is already done in UserDefinedFunction __init__)
I'll look into doing the checking using rlz.
I was trying rlz checking, but now I think it actually might be more appropriate to do the checking before we create the node (in other words, not using rlz)
Using rlz seems to be geared towards describing/validating each argument of a node, when the argument is in its "final form" in the node
E.g.
We want to make sure output_type=[dt.double] is caught and an error is raised. (Note that output_type "starts out" as a Python list)
So we can detect this syntax by (at some point) checking if the output_type is a Python list
But by the time we're in the node doing rlz checking, output_type would have been converted to an Ibis datatype here. The rule would see output_type as an Ibis Array, which I think is valid (user could have intentionally specified output_type=dt.Array(dt.double)) so we can't raise an error
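A toy sketch of that ambiguity (Array and normalize below are hypothetical stand-ins for the ibis datatype and the conversion step, not the real API):

```python
class Array:
    """Toy stand-in for an ibis Array datatype."""
    def __init__(self, value_type):
        self.value_type = value_type

    def __eq__(self, other):
        return isinstance(other, Array) and self.value_type == other.value_type


def normalize(output_type):
    """Toy version of the conversion: a Python list becomes an Array type."""
    if isinstance(output_type, list):
        (value_type,) = output_type
        return Array(value_type)
    return output_type


# After normalization, the accidental list and the intentional Array are
# indistinguishable, so a rule running on the node can't tell which one
# the user actually wrote:
assert normalize(["double"]) == normalize(Array("double"))
```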
There are workarounds that would let us still use an rlz check (e.g. do the type conversion in the node after validation has been done), but after exploring this I feel it makes sense to conceptually treat the validation as checking the user's input to the UDF API, as opposed to validating the node arguments.
(I still plan to remove this particular check in _udf_decorator; it's redundant, since the checking is already done elsewhere.)
Done!
ibis/spark/udf.py
Outdated
valid_function_signature(self.input_type, func)
# Validate that the input_type argument and the function signature
# match and that the output_type is valid
v.validate_input_type_count(self.input_type, func)
We don't need to touch the Spark udf either. Only the PySpark and Pandas backends support vectorized UDFs.
…type, raise error for all backends
Force-pushed 2181500 to 4b149dc
Rebased and addressed comments, and CI is green!
LGTM
ibis/udf/validate.py
Outdated
import ibis.common.exceptions as com


def _parameter_count(funcsig):
Can you add a type annotation to this?
ibis/udf/validate.py
Outdated
)


def validate_input_type(input_type, func):
Can you add a type annotation to this?
ibis/udf/validate.py
Outdated
return funcsig


def validate_output_type(output_type):
Can you add a type annotation to this?
Hello @timothydijamco! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
LGTM; ping on green.
@jreback CI is green. Also, I'm not sure what the PEP warning above is; the line that it links to is short.
Thanks @timothydijamco. The PEP warning above is from the configured pep8speaks, but I don't think it uses the same settings as the project. Can you open an issue about this? (I think we just need to add a .pep8speaks or something.)
Thank you both |
The normal syntax for defining an elementwise UDF is passing a single value to output_type. If the output type is placed into a list (similar to input_type), then executing an expression with the UDF succeeds on Pandas but fails on Spark with TypeError: 'NoneType' object is not callable. Currently, output_type=[dt.float64] is interpreted as the output type being array<float64>.

This PR unpacks the single value in output_type if it's a list and uses that as the output type, so that e.g. output_type=[dt.float64] is treated as just a float64.

Edit: This PR was originally "BUG: Allow output type of vectorized UDFs to be specified as a single-element list" but was renamed (see the discussion above).
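A self-contained sketch of the behavior the thread converged on: raising when output_type is a list. Here elementwise is a toy stand-in for the real ibis decorator, and the error message is illustrative, not the PR's actual text.

```python
def elementwise(input_type, output_type):
    """Toy stand-in for the ibis decorator, showing only the validation."""
    if isinstance(output_type, list):
        raise TypeError(
            "output_type must be a single datatype, not a list; "
            "e.g. output_type=dt.double, not output_type=[dt.double]"
        )

    def decorator(func):
        func.input_type = input_type
        func.output_type = output_type
        return func

    return decorator


# Normal syntax: a single value for output_type works.
@elementwise(input_type=["double"], output_type="double")
def add_one(s):
    return s + 1


# The list form now raises instead of silently meaning array<double>:
try:
    elementwise(input_type=["double"], output_type=["double"])
except TypeError as e:
    print(e)
```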