Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Pandas UD(A)Fs #1277

Closed
wants to merge 7 commits into from
Closed

ENH: Pandas UD(A)Fs #1277

wants to merge 7 commits into from

Conversation

cpcloud
Copy link
Member

@cpcloud cpcloud commented Jan 9, 2018

This PR introduces a way for users to hook their own functions into the ibis
expression tree for the pandas backend.

Here are some things to be aware of:

  1. Inputs to the function can be any number of columns (Series during
    execution) or scalars (Python + numpy scalar types), Tables (DataFrames)
    are not allowed as inputs. Whether or not we should allow tables as inputs
    is an open discussion.
  2. Reductions also take any number of the above types of inputs, so something
    like computing the correlation coefficient is possible in a group by using
    this machinery
  3. User defined functions must accept **kwargs. Functions can opt-in to specific keyword arguments or all of them with **kwargs. The UDF mechanism decides what to pass in based on the signature of the UDF.
  4. Users must specify the input and output types of the function in the
    @udf/@udaf decorators.

After this is merged, I'd like to rewrite all of the pandas execute_node
definitions using the UDF/UDAF machinery which would reduce the amount of code
by about a factor of 2 to 3 for that backend.

@cpcloud cpcloud self-assigned this Jan 9, 2018
@cpcloud cpcloud added feature Features or general enhancements expressions Issues or PRs related to the expression API pandas The pandas backend labels Jan 9, 2018
@cpcloud cpcloud added this to the 0.13 milestone Jan 9, 2018
@cpcloud cpcloud requested a review from wesm January 9, 2018 20:28
expected = t.a.execute().str.len().mul(2).sum()
assert result == expected


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could add some test for invalid udf decorator construction ? eg for passing invalid types
are there other ways these could fail?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are definitely not enough tests here, generally speaking.

) if grouper is not None
]

assert all(groupers[0] == grouper for grouper in groupers[1:])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when would this fail?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should never fail, that's why it's an assertion and not a user facing error message.

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice stuff, making the backend easier to develop incrementally will be really nice =)

t = con.table('df')
expr = my_string_length_sum(t.a)

assert isinstance(expr, ir.Expr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also check for ScalarExpr


def signature(input_type, klass):
return (
(klass,) + rule_to_python_type(r) + nullable(r) for r in input_type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to squint a bit at nullable(r) to see what it's doing, not sure if there's a way to make clearer / improve readability

UDAFNode, *signature(input_type, klass=SeriesGroupBy)
)
def execute_udaf_node_groupby(op, *args, context=None, **kwargs):
iters = (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is a serious piece of kit (e.g. the lambda being passed in to context.agg), I'm not sure how much more verbose / less elegant this would be if written in a more readable fashion

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is definitely something I want to do and is the main reason for the WIP. It's concise but pretty hard to read without knowledge of how pandas works.

UDFNode = type(
func.__name__ + '_udf_node',
(ops.ValueOp,),
dict(input_type=input_type, output_type=output_type.array_type)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should not an udf work on mixed scalar and array inputs/outputs too?
If so the rules.shape_like_arg[s] might be required.

@cpcloud cpcloud closed this in d1b1f7d Mar 19, 2018
@cpcloud cpcloud deleted the pandas-udf branch March 19, 2018 21:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
expressions Issues or PRs related to the expression API feature Features or general enhancements pandas The pandas backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants