BUG: PySpark compiler cannot compile elementwise UDF for some output types #2223

Conversation

timothydijamco (Contributor)

The PySpark compiler will raise a

TypeError: 'NoneType' object is not callable

when trying to compile an ElementWiseVectorizedUDF node whose output_type argument is not in this list (e.g. Decimal and TimestampType are not in the list)
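
For illustration, a minimal sketch of the kind of expression that hits this, assuming ibis's vectorized UDF API (ibis.udf.vectorized.elementwise) and a hypothetical PySpark-backed table `table` with a double column `col`:

import ibis.expr.datatypes as dt
from ibis.udf.vectorized import elementwise

# Illustrative UDF whose output_type (Decimal) is not in the "trivial" list,
# so compiling it against the PySpark backend raises
#   TypeError: 'NoneType' object is not callable
@elementwise(input_type=[dt.double], output_type=dt.Decimal(38, 3))
def to_decimal(series):
    return series  # pandas Series in, pandas Series out

expr = to_decimal(table.col)  # `table` is a hypothetical PySpark-backed ibis table
expr.compile()                # fails before this fix for Decimal/Timestamp output types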

This PR:

  • Has the PySpark compiler use a more general conversion function (spark_dtype instead of ibis_dtype_to_spark_dtype)
  • Allows spark_dtype to convert an Ibis TimestampType to a Spark TimestampType (adds another dispatch function to spark_dtype)

@jreback added the pyspark (The Apache PySpark backend) and spark labels on Jun 1, 2020
@jreback added this to the Next Bugfix Release milestone on Jun 1, 2020
@jreback (Contributor) commented Jun 1, 2020

can you add a test which hits this (and test for the resulting error message / type). also pls add a release note.

@@ -1476,7 +1477,7 @@ def compile_not_null(t, expr, scope, **kwargs):
 @compiles(ops.ElementWiseVectorizedUDF)
 def compile_elementwise_udf(t, expr, scope):
     op = expr.op()
-    spark_output_type = ibis_dtype_to_spark_dtype(op._output_type)
+    spark_output_type = spark_dtype(op._output_type)
Contributor

Why this change? What's the difference between ibis_dtype_to_spark_dtype and spark_dtype? It is confusing

Contributor Author

Basically, ibis_dtype_to_spark_dtype doesn't handle all Ibis DataTypes

spark_dtype is the "main"/generic function that converts an Ibis type to a Spark type

ibis_dtype_to_spark_dtype is one of the many functions that overload it; this one is used specifically when the argument is an Ibis DataType in this list, but it won't handle anything else


More info (getting into the details):

"this list" = All the Ibis DataTypes and Spark types that can be "trivially" converted in either direction

There are some Ibis DataType subclasses that are not in the list. E.g. Ibis Decimal, because converting a Spark DecimalType to an Ibis Decimal is not "trivial": we can't just return dt.Decimal() and call it a day; we need to construct an Ibis Decimal with the correct precision and scale from the original Spark DecimalType. Essentially, ibis_dtype_to_spark_dtype covers most Ibis DataType subclasses but has exceptions, while using spark_dtype covers those exceptions too.
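
As a rough illustration only (the names and the registration mechanics here are simplified assumptions, not a copy of ibis/spark/datatypes.py), the split between the two functions looks something like this:

import pyspark.sql.types as pt
from multipledispatch import Dispatcher

import ibis.expr.datatypes as dt

spark_dtype = Dispatcher('spark_dtype')  # the "main"/generic entry point

# The "trivial" mapping: types that convert one-to-one in either direction.
# Decimal is deliberately absent because dt.Decimal() alone would lose
# precision and scale.
_IBIS_DTYPE_TO_SPARK_DTYPE = {
    dt.Boolean: pt.BooleanType,
    dt.Double: pt.DoubleType,
    dt.String: pt.StringType,
    # ... other one-to-one pairs ...
}

@spark_dtype.register(dt.DataType)
def ibis_dtype_to_spark_dtype(ibis_dtype_obj):
    # Only handles types in the trivial mapping; for anything else .get()
    # returns None and None() raises
    #   TypeError: 'NoneType' object is not callable
    return _IBIS_DTYPE_TO_SPARK_DTYPE.get(type(ibis_dtype_obj))()

@spark_dtype.register(dt.Decimal)
def ibis_decimal_dtype_to_spark_decimal_dtype(ibis_dtype_obj):
    # More specific overload: carries precision and scale across explicitly.
    return pt.DecimalType(ibis_dtype_obj.precision, ibis_dtype_obj.scale)

# spark_dtype(dt.Decimal(38, 3))               -> DecimalType(38, 3)  (works)
# ibis_dtype_to_spark_dtype(dt.Decimal(38, 3)) -> TypeError           (the bug)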

Contributor Author

It's a little confusing; I'll see if I can come up with function names that are clearer. It is tricky though

Also, I just realized that Timestamp actually belongs in the list of types that ibis_dtype_to_spark_dtype handles, because the conversion is trivial in both directions. I'll refactor that
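
In terms of the simplified sketch above (again, illustrative names only, not the actual module contents), that refactor would just mean adding the pair to the trivial mapping so both conversion directions pick it up:

# Hypothetical addition to the trivial one-to-one mapping:
_IBIS_DTYPE_TO_SPARK_DTYPE[dt.Timestamp] = pt.TimestampType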

Contributor Author

I won't make any changes to the function names in this PR because I'd have to slightly refactor code that's unrelated to this bugfix if I do

I think the best course of action, though (for a separate PR), is to rename ibis_dtype_to_spark_dtype to _ibis_dtype_to_spark_dtype (basically, it shouldn't be used outside of that module because it isn't clear exactly which types it can handle) and to replace usages of ibis_dtype_to_spark_dtype with spark_dtype

Timothy Dijamco added 2 commits June 1, 2020 18:55
…park Timestamp, let existing functions spark_dtype_to_ibis_dtype and ibis_dtype_to_spark_dtype handle conversion
@timothydijamco (Contributor Author) commented Jun 1, 2020

@jreback

can you add a test which hits this (and test for the resulting error message / type). also pls add a release note.

I was taking a stab at this, but I think this isn't big enough to add a test for. Elementwise UDFs in PySpark already worked for most output_types, and this PR basically just adds support for two more output_types (Timestamp and Decimal)

I was thinking of writing tests for ibis/spark/datatypes.py, which contains all the Ibis type <-> Spark type conversion functions, but the conversion functions are so simple that the tests would be almost the same as the implementation

I could also write a set of tests where each test verifies that an elementwise UDF with XXXX output_type works, but that would mean the test module for UDFs becomes a lot more verbose, and I don't think the additional tests would test much
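
For reference, a rough sketch of what that parametrized version could look like (the fixture name `t` and the column name `double_col` are hypothetical, and the decorator is assumed to be ibis.udf.vectorized.elementwise):

import pytest

import ibis.expr.datatypes as dt
from ibis.udf.vectorized import elementwise


@pytest.mark.parametrize(
    'output_type',
    [dt.double, dt.Decimal(38, 3), dt.timestamp],
)
def test_elementwise_udf_output_type(t, output_type):
    # `t` would be a PySpark-backed ibis table fixture; only output_type varies.
    @elementwise(input_type=[dt.double], output_type=output_type)
    def passthrough(series):
        return series

    expr = passthrough(t.double_col)
    expr.compile()  # should not raise for any of the parametrized output types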

Let me know what you think

@jreback merged commit 569f49e into ibis-project:master on Jun 2, 2020
@jreback (Contributor) commented Jun 2, 2020

thanks @timothydijamco
