ENH: Pandas UD(A)Fs #1277

Closed. cpcloud wants to merge 7 commits into master from cpcloud:pandas-udf

@cpcloud
Member

cpcloud commented Jan 9, 2018

This PR introduces a way for users to hook their own functions into the ibis
expression tree for the pandas backend.

Here are some things to be aware of:

  1. Inputs to the function can be any number of columns (Series during
    execution) or scalars (Python and numpy scalar types). Tables (DataFrames)
    are not allowed as inputs; whether we should allow them is an open
    question.
  2. Reductions also take any number of the above types of inputs, so something
    like computing the correlation coefficient in a group by is possible using
    this machinery.
  3. User-defined functions must accept **kwargs. Functions can opt in to
    specific keyword arguments, or to all of them with **kwargs. The UDF
    mechanism decides what to pass in based on the signature of the UDF.
  4. Users must specify the input and output types of the function in the
    @udf/@udaf decorators.
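
To make item 3 concrete, here is a minimal sketch of signature-based keyword passing using only the standard library. It is not the code in this PR: the helper name `call_with_accepted_kwargs` and the keyword names `scope` and `context` are assumptions chosen for illustration.

```python
import inspect


def call_with_accepted_kwargs(func, *args, **available):
    """Call func, passing only the keyword arguments its signature accepts.

    Hypothetical helper for illustration; not the helper used in this PR.
    """
    params = list(inspect.signature(func).parameters.values())

    # A **kwargs parameter means the function opts in to every keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params):
        return func(*args, **available)

    # Otherwise pass only the keywords the function names explicitly.
    accepted = {
        p.name
        for p in params
        if p.kind in (
            inspect.Parameter.KEYWORD_ONLY,
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
        )
    }
    chosen = {k: v for k, v in available.items() if k in accepted}
    return func(*args, **chosen)


def takes_all(x, **kwargs):
    # Opts in to all keyword arguments.
    return x, sorted(kwargs)


def takes_scope(x, scope=None):
    # Opts in to a single keyword argument.
    return x, scope


all_result = call_with_accepted_kwargs(takes_all, 1, scope="s", context="c")
scope_result = call_with_accepted_kwargs(takes_scope, 1, scope="s", context="c")
```

In this sketch `takes_all` receives both keywords, while `takes_scope` receives only `scope`. The sketch does not guard against a keyword whose name collides with a positionally bound parameter.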

After this is merged, I'd like to rewrite all of the pandas execute_node
definitions using the UDF/UDAF machinery, which would reduce the amount of code
for that backend by about a factor of 2 to 3.

@cpcloud cpcloud self-assigned this Jan 9, 2018

@cpcloud cpcloud added this to To do in UDFs via automation Jan 9, 2018

@cpcloud cpcloud added this to the 0.13 milestone Jan 9, 2018

@cpcloud cpcloud requested a review from wesm Jan 9, 2018

expected = t.a.execute().str.len().mul(2).sum()
assert result == expected

@jreback

jreback Jan 9, 2018

Contributor

Could you add some tests for invalid udf decorator construction, e.g. passing invalid types?
Are there other ways these could fail?

@cpcloud

cpcloud Jan 9, 2018

Member

There are definitely not enough tests here, generally speaking.

) if grouper is not None
]
assert all(groupers[0] == grouper for grouper in groupers[1:])

@jreback

jreback Jan 9, 2018

Contributor

when would this fail?

@cpcloud

cpcloud Jan 9, 2018

Member

It should never fail; that's why it's an assertion and not a user-facing error message.

@cpcloud cpcloud force-pushed the cpcloud:pandas-udf branch from e48df13 to c482dc5 Jan 10, 2018

@wesm

Nice stuff, making the backend easier to develop incrementally will be really nice =)

t = con.table('df')
expr = my_string_length_sum(t.a)
assert isinstance(expr, ir.Expr)

@wesm

wesm Jan 11, 2018

Member

Could also check for ScalarExpr

def signature(input_type, klass):
    return (
        (klass,) + rule_to_python_type(r) + nullable(r) for r in input_type

@wesm

wesm Jan 11, 2018

Member

I had to squint a bit at nullable(r) to see what it's doing; not sure if there's a way to make it clearer or improve readability.

UDAFNode, *signature(input_type, klass=SeriesGroupBy)
)
def execute_udaf_node_groupby(op, *args, context=None, **kwargs):
    iters = (

@wesm

wesm Jan 11, 2018

Member

This function is a serious piece of kit (e.g. the lambda being passed in to context.agg), I'm not sure how much more verbose / less elegant this would be if written in a more readable fashion

@cpcloud

cpcloud Jan 11, 2018

Member

Yeah this is definitely something I want to do and is the main reason for the WIP. It's concise but pretty hard to read without knowledge of how pandas works.
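
For readers without that pandas background, the kind of multi-column group-by reduction under discussion can be spelled out explicitly. This is only a sketch in plain pandas, assuming a toy DataFrame and a hypothetical reduction `corr`; it is not the implementation in this PR and does not use `context.agg`.

```python
import pandas as pd

# Toy data: the column names and values are assumptions for illustration.
df = pd.DataFrame({
    "key": ["a", "a", "a", "b", "b", "b"],
    "x": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "y": [2.0, 4.0, 6.0, 3.0, 2.0, 1.0],
})


def corr(x, y):
    # A reduction over two columns: two Series in, one scalar out.
    return x.corr(y)


# Iterate over the groups explicitly and apply the reduction to the
# matching slices of both columns, one group at a time.
result = pd.Series({
    key: corr(group["x"], group["y"])
    for key, group in df.groupby("key")
})
```

The explicit loop trades speed for readability: each group's slices of `x` and `y` are visibly paired before the reduction is called.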

@cpcloud cpcloud force-pushed the cpcloud:pandas-udf branch from c482dc5 to e5c13fe Jan 12, 2018

UDFNode = type(
    func.__name__ + '_udf_node',
    (ops.ValueOp,),
    dict(input_type=input_type, output_type=output_type.array_type)

@kszucs

kszucs Jan 22, 2018

Member

Shouldn't a UDF also work on mixed scalar and array inputs/outputs?
If so, rules.shape_like_arg[s] might be required.

@cpcloud cpcloud force-pushed the cpcloud:pandas-udf branch from f507631 to 320b821 Jan 24, 2018

@cpcloud cpcloud force-pushed the cpcloud:pandas-udf branch 2 times, most recently from a9efde0 to c55da29 Jan 31, 2018

@cpcloud cpcloud force-pushed the cpcloud:pandas-udf branch from c55da29 to 66c7996 Feb 8, 2018

@cpcloud cpcloud force-pushed the cpcloud:pandas-udf branch from 66c7996 to e9fab04 Mar 9, 2018

@cpcloud cpcloud referenced this pull request Mar 11, 2018

Closed

Rules refactor #1366

@cpcloud cpcloud force-pushed the cpcloud:pandas-udf branch 2 times, most recently from bef12cd to 0d55580 Mar 11, 2018

@cpcloud cpcloud changed the title from WIP/ENH: Pandas UD(A)Fs to ENH: Pandas UD(A)Fs Mar 13, 2018

@cpcloud cpcloud force-pushed the cpcloud:pandas-udf branch from 0d55580 to 3d62765 Mar 13, 2018

@cpcloud cpcloud closed this in d1b1f7d Mar 19, 2018

UDFs automation moved this from To do to Done Mar 19, 2018

@cpcloud cpcloud deleted the cpcloud:pandas-udf branch Mar 19, 2018
