FEAT: Introduce vectorized UDF api #2047

Merged
merged 7 commits into from
Jan 23, 2020

@icexelloss icexelloss commented Dec 5, 2019

One of Ibis's design goals is to have one common expression language that works across multiple backend execution engines. However, the current UDF API contradicts this goal: UDFs are defined under per-backend namespaces (e.g., ibis.pandas.udf), so an Ibis program that contains a UDF cannot be moved to another backend.

I propose introducing a top-level Ibis UDF API, ibis.udf.vectorized, to address this issue. Vectorized UDFs take vectorized data structures as input (currently numpy.ndarray, pd.Series, and pd.DataFrame) and return a Python scalar or a vectorized data structure as output.

This is proposed as a very experimental API.

In this PR, I have made the following changes in particular:

  • Introduced ibis.udf.vectorized namespace
  • Implemented elementwise UDF under the new namespace
    • Modified ibis.pandas.udf to use ibis.udf.vectorized directly
    • Implemented execution rule in PySpark backend for ibis.udf.vectorized
  • Added a new common test ibis/tests/all/test_vectorized_udf.py
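
The idea behind the changes above can be sketched in a few lines. This is a simplified, dependency-free illustration, not the code in this PR: the names `ElementWiseUDF`, `elementwise`, `execution_rule`, and `execute` are hypothetical, type names are plain strings, and Python lists stand in for pd.Series. It only shows the shape of the design: the UDF is declared once in a backend-neutral way, and each backend supplies its own execution rule for it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class ElementWiseUDF:
    """Backend-neutral description of an element-wise UDF (hypothetical)."""

    func: Callable[..., Sequence]
    input_type: tuple
    output_type: str


def elementwise(input_type, output_type):
    """Hypothetical decorator: record the function and its declared types."""

    def wrapper(func):
        return ElementWiseUDF(func, tuple(input_type), output_type)

    return wrapper


# Each backend registers its own execution rule for the same UDF object.
EXECUTION_RULES: Dict[str, Callable] = {}


def execution_rule(backend):
    def register(rule):
        EXECUTION_RULES[backend] = rule
        return rule

    return register


@execution_rule("in_memory")
def _run_whole_column(udf, *columns):
    # Hand the whole columns to the function in a single call.
    return list(udf.func(*columns))


@execution_rule("batched")
def _run_in_batches(udf, *columns, batch_size=2):
    # Feed the function fixed-size slices, as a distributed engine might.
    out = []
    for start in range(0, len(columns[0]), batch_size):
        chunk = [col[start:start + batch_size] for col in columns]
        out.extend(udf.func(*chunk))
    return out


def execute(udf, backend, *columns):
    """Look up the backend's rule and run the UDF with it."""
    return EXECUTION_RULES[backend](udf, *columns)


@elementwise(input_type=["double", "double"], output_type="double")
def add(xs, ys):
    # Plain lists stand in for pd.Series to keep the sketch self-contained.
    return [x + y for x, y in zip(xs, ys)]
```

Because `add` is defined without reference to any backend, the same object can be passed to `execute(add, "in_memory", ...)` or `execute(add, "batched", ...)` unchanged, which is the portability the proposal is after.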

The following are not included in this PR and are left as future work:

  • Refine ibis.udf.vectorized API
  • Implement analytics and aggregation UDFs (currently defined under ibis.pandas.udf)

This PR addresses issue #2048

@icexelloss icexelloss changed the title ENH: Introduce top level UDF api ENH: Introduce top level vectorized UDF api Jan 6, 2020
@icexelloss icexelloss changed the title ENH: Introduce top level vectorized UDF api ENH: Introduce vectorized UDF api Jan 6, 2020
@icexelloss icexelloss changed the title ENH: Introduce vectorized UDF api FEAT: Introduce vectorized UDF api Jan 6, 2020
@jreback jreback added the feature Features or general enhancements label Jan 9, 2020
"""Node for element wise UDF.
"""

func = Arg(rlz.noop)
Contributor

can you type these any stronger?

),
id='int_float',
),
pytest.param(
dt.int64,
True,
id='int_bool',
marks=pytest.mark.xfail(
raises=com.IbisTypeError,
marks=pytest.mark.skip(
Contributor

why are you changing to skips? xfails are more appropriate here

Contributor Author

I am actually not sure what to do with these tests. They cover a small, unimportant use case of the UDF: calling it directly on a scalar, e.g., my_udf(1).execute(), which is the same as my_udf.func(1) plus type checking. They are also hard to fix (I spent half a day trying and failed).

Given how small this use case is, I was thinking of not spending too much time on it. Type checking and promotion are not well supported for UDFs right now anyway.

That being said, I could give it another attempt if you feel this is important.

Contributor

ok with not fixing, but then can you segregate the failing ones and, instead of skipping, explicitly check that they produce the correct error / type mismatch, and add a comment? skipped tests basically get ignored forever :-<
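
A rough sketch of what that suggestion could look like, using `pytest.raises` so the known-bad cases stay active instead of being skipped. Everything here is a stand-in: `TypeMismatchError` plays the role of com.IbisTypeError, `call_on_scalar` is a hypothetical helper that mimics "type check, then call the function", and the parametrized cases are illustrative rather than the ones in this PR.

```python
import pytest


class TypeMismatchError(TypeError):
    """Hypothetical stand-in for com.IbisTypeError."""


def call_on_scalar(func, declared_type, value):
    """Hypothetical helper: check the argument type, then call func."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, declared_type):
        raise TypeMismatchError(
            f"expected {declared_type.__name__}, got {type(value).__name__}"
        )
    return func(value)


def add_one(x):
    return x + 1


@pytest.mark.parametrize(
    "declared_type, value",
    [
        pytest.param(int, True, id="int_bool"),
        pytest.param(int, "1", id="int_str"),
    ],
)
def test_scalar_type_mismatch(declared_type, value):
    # Known limitation: calling a UDF on a mismatched scalar is not
    # supported. Assert the error instead of skipping, so the test starts
    # failing (and gets revisited) if the behaviour ever changes.
    with pytest.raises(TypeMismatchError):
        call_on_scalar(add_one, declared_type, value)


def test_scalar_matching_type():
    assert call_on_scalar(add_one, int, 1) == 2
```

Unlike a skipped test, each of these cases still runs on every CI build and documents the expected failure mode in the assertion itself.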

@jreback jreback added this to the Next Feature Release milestone Jan 9, 2020
@jreback jreback left a comment

looks good. a couple of test comments. can you add a note in the whatsnew? a follow-on PR for docs would be great (or in this PR is ok too!). ping on green.

@jreback jreback left a comment

thanks @icexelloss

@jreback jreback merged commit 259a2b1 into ibis-project:master Jan 23, 2020
jreback commented Jan 23, 2020

let's follow up with an example in the docs proper (please create an issue for this).
