
FEAT: Validate that the output type of a UDF is a single element #2198

Merged

Conversation


@timothydijamco timothydijamco commented May 5, 2020

The normal syntax for defining an elementwise UDF is passing a single value to output_type:

@elementwise(input_type=[dt.float64, dt.float64], output_type=dt.float64)
def sum_udf(a, b):
    return a + b

If the output type is instead placed in a list (similar to input_type), as below, then executing an expression using the UDF succeeds on Pandas but fails on Spark with TypeError: 'NoneType' object is not callable

@elementwise(input_type=[dt.float64, dt.float64], output_type=[dt.float64])
def sum_udf(a, b):
    return a + b

Currently, output_type=[dt.float64] is interpreted as the output type being array&lt;float64&gt;.

This PR unpacks the single value in output_type if it's a list and uses it as the output type, so that e.g. output_type=[dt.float64] is treated as just float64.

edit: This PR was originally "BUG: Allow output type of vectorized UDFs to be specified as a single-element list" but was renamed (see below)
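The originally proposed fix (later replaced by strict validation, per the edit note above) can be sketched roughly as follows; the helper name and error message are illustrative, not the actual ibis internals:

```python
# Hypothetical sketch of the original proposal: unpack a single-element
# list passed as output_type, so output_type=[dt.float64] behaves like
# output_type=dt.float64. Not the real ibis code.
def normalize_output_type(output_type):
    if isinstance(output_type, list):
        if len(output_type) != 1:
            raise ValueError(
                "output_type must contain exactly one type when given as a list"
            )
        [output_type] = output_type  # unpack the single element
    return output_type
```

With this in place, `normalize_output_type(["float64"])` and `normalize_output_type("float64")` both yield `"float64"`.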

@timothydijamco

cc @icexelloss @jreback

@@ -10,9 +10,22 @@ def add_one(s):
    return s + 1


@elementwise(input_type=[dt.double], output_type=[dt.double])
def add_one_with_output_type_list(s):
Contributor

I think you can inline this function in the tests; I don't think we need to use it in other tests.


@pytest.mark.only_on_backends([Pandas, PySpark])
@pytest.mark.xfail_unsupported
def test_elementwise_udf_with_output_type_list(backend, alltypes, df):
Contributor

We should probably add a comment that this does NOT test the output of the UDF being a list. We only test that specifying output_type=[dt] is the same as output_type=dt in the udf decorator.


self.input_type = list(map(dt.dtype, input_type))

if type(output_type) is list:
Contributor

It's more common to use isinstance(output_type, list)

self.input_type = list(map(dt.dtype, input_type))

if type(output_type) is list:
    [output_type] = output_type
@icexelloss icexelloss May 6, 2020

It's more common to use

output_type, = output_type

Contributor

should probably also catch the exception when the list has more than one element and give a reasonable error message

Contributor Author

> It's more common to use
>
>     output_type, = output_type

blacken was changing this to (output_type,) = output_type, so [output_type] = output_type seemed cleaner. However, if output_type, = output_type is more common, then maybe (output_type,) is more readable.

> should probably also catch the exception when the list has more than one element and give a reasonable error message

Will do
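For reference, the three unpacking spellings discussed in this thread are equivalent single-element unpackings:

```python
# All three spellings bind the single element of a one-element list;
# each raises ValueError if the list doesn't contain exactly one element.
output_type = ["float64"]

a, = output_type    # bare trailing comma (the reviewer's suggestion)
(b,) = output_type  # parenthesized tuple target (what black/blacken produces)
[c] = output_type   # list target (as written in this PR)

assert a == b == c == "float64"
```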

@icexelloss icexelloss left a comment

left some comments

@timothydijamco timothydijamco force-pushed the fix-elementwise-udf-output-type branch from e19ceb3 to 24b9bb6 Compare May 15, 2020 21:02
@jreback jreback added bug Incorrect behavior inside of ibis pandas The pandas backend pyspark The Apache PySpark backend labels May 18, 2020
@jreback jreback left a comment

the Spark backend is not running, see #2201; need to re-enable that first.

also, please add a release note


jreback commented May 28, 2020

if you can rebase this

@timothydijamco timothydijamco force-pushed the fix-elementwise-udf-output-type branch from 24b9bb6 to 594ae41 Compare May 29, 2020 02:58

timothydijamco commented May 29, 2020

Done! CI is green

The rebase did involve resolving some merge conflicts, but the changes are exactly the same besides the name of the test and comments / docstrings. Because of #2210 the changes now apply also to analytic and reduction UDFs, so I'll rename the PR

@timothydijamco timothydijamco changed the title BUG: Allow output type of elementwise UDF to be specified as a single-element list BUG: Allow output type of vectorized UDFs to be specified as a single-element list May 29, 2020
@jreback jreback added this to the Next Bugfix Release milestone May 29, 2020

jreback commented May 29, 2020

hmm, why are we allowing this kind of output type as a list? I would rather just raise if this is specified (on all backends); better to be strict.

@icexelloss

@jreback I am fine with raising. But I prefer allowing output_type=[dt.double] a little more because it is more symmetrical to the input_type parameter, e.g.

@elementwise(
    input_type=[dt.double, dt.double],
    output_type=[dt.double],
)
def foo(...)

I know currently the output type can only be a single column, but it seems more natural to type output_type=[dt.double] than output_type=dt.double after typing input_type=[...]


jreback commented May 29, 2020


this syntax goes against all other call syntaxes (numpy ufunc, numba); I don't see why we would allow it. ok with raising.

@icexelloss

Ah yes, that makes sense. We don't want to go against existing conventions. Let's raise.

@timothydijamco

^ Sounds good to me!

I refactored the validation function in ibis/udf/vectorized.py out into a new module, ibis/udf/validate.py, and added a function to that module to validate the output_type. That function is now used throughout the backends whose UDFs take output type arguments.

This is starting to sound more like a feature (adding validation) than a bug fix to me, so I'm going to rename this PR
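A minimal sketch of what such an output_type validator might look like; the actual implementation lives in ibis/udf/validate.py, and the exception type and message used here are assumptions (a plain TypeError stands in for ibis's own error type):

```python
# Hedged sketch of a shared output_type validator (illustrative only;
# the exact exception type and message are assumptions).
def validate_output_type(output_type):
    """Raise if output_type is given as a list rather than a single type."""
    if isinstance(output_type, list):
        raise TypeError(
            "output_type must be a single datatype, not a list; "
            "to return an array, specify an explicit array datatype instead"
        )
```

A decorator such as elementwise would call this once on its output_type argument before constructing the UDF node.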

@timothydijamco timothydijamco changed the title BUG: Allow output type of vectorized UDFs to be specified as a single-element list FEAT: Validate that the output type of a UDF is a single element May 29, 2020
@@ -102,6 +52,10 @@ def func(*args):


def _udf_decorator(node_type, input_type, output_type):
    if isinstance(output_type, list):
Contributor Author

Oops, this shouldn't be here (the checking is already done in UserDefinedFunction __init__)

Contributor Author

I'll look into doing checking using rlz

Contributor Author

I was trying rlz checking, but now I think it actually might be more appropriate to do the checking before we create the node (in other words, not using rlz)

Using rlz seems to be geared towards describing/validating each argument of a node, when the argument is in its "final form" in the node

E.g.
We want to make sure output_type=[dt.double] is caught and an error is raised. (Note that output_type "starts out" as a Python list)

So we can detect this syntax by (at some point) checking if the output_type is a Python list

But by the time we're in the node doing rlz checking, output_type would already have been converted to an Ibis datatype. The rule would see output_type as an Ibis Array, which I think is valid (the user could have intentionally specified output_type=dt.Array(dt.double)), so we can't raise an error

Contributor Author

There are workarounds that would let us still use a rlz check (e.g. do the type conversion in the node after validation has been done) but after exploring this I feel like it makes sense to conceptually treat the validation as checking the user's input to the UDF API as opposed to validating the node arguments
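The timing issue described above can be demonstrated with a toy model (this is not the real ibis API; the class and function names are illustrative):

```python
# Toy model showing why the list check must happen before type conversion:
# after conversion, output_type=[dt.double] and an intentional
# output_type=dt.Array(dt.double) look identical, so a node-level rule
# can no longer tell the accident from the intent.
class Array:
    def __init__(self, value_type):
        self.value_type = value_type

    def __eq__(self, other):
        return isinstance(other, Array) and self.value_type == other.value_type


def to_dtype(spec):
    # mimics a dtype constructor that maps a Python list to an array type
    if isinstance(spec, list):
        return Array(to_dtype(spec[0]))
    return spec


accidental = to_dtype(["double"])  # user probably meant a scalar double
intentional = Array("double")      # user explicitly asked for an array
assert accidental == intentional   # indistinguishable after conversion
```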

Contributor Author

(I still plan to remove this particular check in _udf_decorator, it's redundant (the checking is already done elsewhere))

Contributor Author

Done!

ibis/udf/validate.py (outdated review thread; resolved)
valid_function_signature(self.input_type, func)
# Validate that the input_type argument and the function signature
# match and that the output_type is valid
v.validate_input_type_count(self.input_type, func)
Contributor

We don't need to touch the Spark UDFs either; only the PySpark and Pandas backends support vectorized UDFs.

@timothydijamco timothydijamco force-pushed the fix-elementwise-udf-output-type branch from 2181500 to 4b149dc Compare June 2, 2020 23:07
ibis/udf/validate.py (outdated review thread; resolved)
@timothydijamco

@jreback @icexelloss

Rebased and addressed comments

and CI is green!

@icexelloss icexelloss self-requested a review June 3, 2020 18:41
@icexelloss icexelloss left a comment

LGTM

import ibis.common.exceptions as com


def _parameter_count(funcsig):
Contributor

can you type this

)


def validate_input_type(input_type, func):
Contributor

can you type this

return funcsig


def validate_output_type(output_type):
Contributor

can you type this

@pep8speaks

Hello @timothydijamco! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 95:80: E501 line too long (107 > 79 characters)

@jreback jreback left a comment

lgtm ping on green

@timothydijamco

@jreback CI is green

Also, I'm not sure about the PEP 8 warning above; the line it links to is short

@jreback jreback merged commit 6f18f07 into ibis-project:master Jun 8, 2020

jreback commented Jun 8, 2020

thanks @timothydijamco

so the PEP 8 warning above is from the configured pep8speaks, but I don't think it uses the same settings as the project. Can you open an issue about this? (I think we just need to add a .pep8speaks config file or something)

@timothydijamco

Thank you both
