ENH: Pandas UD(A)Fs #1277
    expected = t.a.execute().str.len().mul(2).sum()
    assert result == expected
Could you add some tests for invalid udf decorator construction, e.g. passing invalid types?
Are there other ways these could fail?
There are definitely not enough tests here, generally speaking.
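A test for invalid decorator construction, as requested above, might look roughly like the sketch below. It is only an illustration: `toy_udf` is a hypothetical stand-in that validates its arguments eagerly, not the decorator added in this PR, and the `raises` helper is used instead of a test framework so the snippet stays self-contained.

```python
def toy_udf(input_type, output_type):
    """Hypothetical stand-in for a UDF decorator; NOT the ibis API."""
    if not isinstance(input_type, (list, tuple)):
        raise TypeError("input_type must be a list or tuple of types")
    bad = [t for t in input_type if not isinstance(t, type)]
    if bad or not isinstance(output_type, type):
        raise TypeError("input and output types must be Python types")

    def decorator(func):
        # Record the declared types on the function for later inspection.
        func.input_type = tuple(input_type)
        func.output_type = output_type
        return func

    return decorator


def raises(exc_type, construct):
    """Return True if calling construct() raises exc_type."""
    try:
        construct()
    except exc_type:
        return True
    return False


def test_invalid_construction():
    # A bare type instead of a list of types.
    assert raises(TypeError, lambda: toy_udf(str, int))
    # A non-type element in the input list.
    assert raises(TypeError, lambda: toy_udf(["string"], int))
    # A non-type output.
    assert raises(TypeError, lambda: toy_udf([str], "int64"))
    # A valid construction should not raise.
    assert not raises(TypeError, lambda: toy_udf([str], int))
```

The point of the sketch is that each failure mode gets its own assertion, which also answers the "other ways these could fail" question one case at a time.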
ibis/pandas/udf.py
Outdated
    ) if grouper is not None
    ]

    assert all(groupers[0] == grouper for grouper in groupers[1:])
when would this fail?
It should never fail; that's why it's an assertion and not a user-facing error message.
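The distinction between an internal invariant and a user-facing error can be sketched as follows. This is a minimal illustration, not code from ibis/pandas/udf.py: `common_grouper` and `check_argument_count` are hypothetical names, and the groupers are plain values rather than pandas objects.

```python
def common_grouper(groupers):
    """Return the grouper shared by every grouped input.

    Assumption: all inputs of one aggregation are grouped the same way by
    construction, so a mismatch would be a backend bug, not a user mistake.
    """
    assert groupers, "expected at least one grouper"
    assert all(groupers[0] == g for g in groupers[1:]), "mismatched groupers"
    return groupers[0]


def check_argument_count(func_name, expected, args):
    """User-facing check: bad input raises a descriptive error."""
    if len(args) != expected:
        raise TypeError(
            "{}() takes {} argument(s) but {} were given".format(
                func_name, expected, len(args)
            )
        )
```

Because assertions are stripped when Python runs with `-O`, they are only suitable for conditions that cannot be triggered by user input; anything a user can get wrong should raise an explicit exception instead.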
Nice stuff, making the backend easier to develop incrementally will be really nice =)
ibis/pandas/tests/test_udf.py
Outdated
    t = con.table('df')
    expr = my_string_length_sum(t.a)

    assert isinstance(expr, ir.Expr)
Could also check for ScalarExpr
ibis/pandas/udf.py
Outdated
    def signature(input_type, klass):
        return (
            (klass,) + rule_to_python_type(r) + nullable(r) for r in input_type
I had to squint a bit at nullable(r) to see what it's doing; not sure if there's a way to make it clearer / improve readability.
ibis/pandas/udf.py
Outdated
        UDAFNode, *signature(input_type, klass=SeriesGroupBy)
    )
    def execute_udaf_node_groupby(op, *args, context=None, **kwargs):
        iters = (
This function is a serious piece of kit (e.g. the lambda being passed in to context.agg), I'm not sure how much more verbose / less elegant this would be if written in a more readable fashion
Yeah this is definitely something I want to do and is the main reason for the WIP. It's concise but pretty hard to read without knowledge of how pandas works.
ibis/pandas/udf.py
Outdated
    UDFNode = type(
        func.__name__ + '_udf_node',
        (ops.ValueOp,),
        dict(input_type=input_type, output_type=output_type.array_type)
Shouldn't a UDF work on mixed scalar and array inputs/outputs too?
If so, rules.shape_like_arg[s] might be required.
a9efde0 to c55da29
bef12cd to 0d55580
This PR introduces a way for users to hook their own functions into the ibis
expression tree for the pandas backend.
Here are some things to be aware of:
- Inputs must be either columns (`Series` during execution) or scalars (Python + numpy scalar types). Tables (`DataFrame`s) are not allowed as inputs. Whether or not we should allow tables as inputs is an open discussion.
- Something like computing the correlation coefficient is possible in a group by using this machinery.
- User defined functions must accept `**kwargs`. Functions can opt in to specific keyword arguments or all of them with `**kwargs`. The UDF mechanism decides what to pass in based on the signature of the UDF.
- [...] `@udf`/`@udaf` decorators.

After this is merged, I'd like to rewrite all of the pandas `execute_node` definitions using the UDF/UDAF machinery, which would reduce the amount of code by about a factor of 2 to 3 for that backend.
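
The keyword-argument behavior described in the list above (a function opts in to specific keyword arguments, or to all of them with `**kwargs`, and the mechanism inspects the signature to decide what to pass) can be sketched with the standard library's `inspect` module. This is a minimal illustration under assumptions, not the implementation in this PR: `call_with_supported_kwargs` and the keyword names `factor` and `scope` are hypothetical.

```python
import inspect


def call_with_supported_kwargs(func, *args, **available):
    """Call func, passing only the keyword arguments its signature accepts.

    Assumption: names in `available` do not collide with parameters that
    are already filled positionally through `args`.
    """
    params = list(inspect.signature(func).parameters.values())
    # A **kwargs parameter means the function opts in to everything.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params):
        return func(*args, **available)
    # Otherwise pass only the keyword arguments named in the signature.
    accepted = {
        p.name
        for p in params
        if p.kind in (
            inspect.Parameter.KEYWORD_ONLY,
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
        )
    }
    chosen = {k: v for k, v in available.items() if k in accepted}
    return func(*args, **chosen)


def scaled(x, *, factor=1):
    # Opts in to a single keyword argument.
    return x * factor


def everything(x, **kwargs):
    # Opts in to all available keyword arguments.
    return x, sorted(kwargs)


# `scaled` receives only `factor`; `everything` receives both names.
result_scaled = call_with_supported_kwargs(scaled, 2, factor=3, scope="s")
result_all = call_with_supported_kwargs(everything, 1, factor=3, scope="s")
```

Filtering by signature keeps simple functions free of parameters they never use, while functions that need more context can still ask for it by name.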