FEAT: Implement support for elementwise UDF that returns multiple col… #2473

Merged
merged 16 commits into from
Oct 21, 2020

icexelloss
Contributor

@icexelloss icexelloss commented Oct 15, 2020

What is this change

This PR adds support for adding multiple columns with a single elementwise UDF.

Example:

@elementwise(
    input_type=[dt.double],
    output_type=dt.Struct(['col1', 'col2'], [dt.double, dt.double]),
)
def foo1(v):
    return pd.DataFrame({'col1': v + 1, 'col2': v + 2})

result = alltypes.mutate(foo1(alltypes['double_col']).destructure()).execute()

expected = alltypes.mutate(
    col1=alltypes['double_col'] + 1, col2=alltypes['double_col'] + 2,
).execute()

This PR adds support in both the pandas and pyspark backends.
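
To make the intended semantics of the example concrete, here is a minimal plain-Python model of it. It does not use ibis or pandas: a table is modeled as a dict of column lists, and `foo1_model` and `mutate_destructured` are hypothetical names introduced only for this sketch, not functions from the PR.

```python
# Plain-Python model of a multi-column elementwise UDF (NOT the ibis API).
# A "table" is a dict mapping column name -> list of values.


def foo1_model(values):
    """Hypothetical stand-in for foo1: one output list per struct field."""
    return {
        "col1": [v + 1 for v in values],
        "col2": [v + 2 for v in values],
    }


def mutate_destructured(table, struct_result):
    """Append every field of struct_result to the table as its own column."""
    n_rows = len(next(iter(table.values())))
    for name, column in struct_result.items():
        if len(column) != n_rows:
            raise ValueError(
                f"field {name!r} has {len(column)} rows, expected {n_rows}"
            )
    # Existing columns come first, then one new column per struct field.
    return {**table, **struct_result}


table = {"double_col": [1.0, 2.5, 4.0]}
result = mutate_destructured(table, foo1_model(table["double_col"]))

# Same outcome as adding the two columns one at a time.
expected = {
    "double_col": table["double_col"],
    "col1": [v + 1 for v in table["double_col"]],
    "col2": [v + 2 for v in table["double_col"]],
}
assert result == expected
```

The point of the model is only that one UDF call produces several equally long columns, which are then attached to the table side by side.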

Tests

Added test_elementwise_udf_struct and test_elementwise_udf_destruct

@icexelloss added the pandas (The pandas backend), pyspark (The Apache PySpark backend), and udf (Issues related to user-defined functions) labels on Oct 15, 2020
setup.cfg Outdated
@@ -16,7 +16,7 @@ inherit = false
convention = numpy

[isort]
known_third_party = asv,click,clickhouse_driver,dateutil,google,graphviz,impala,kudu,mock,multipledispatch,numpy,pandas,pkg_resources,plumbum,psycopg2,pyarrow,pydata_google_auth,pygit2,pymapd,pymysql,pyspark,pytest,pytz,regex,requests,setuptools,sphinx_rtd_theme,sqlalchemy,thrift,toolz
Contributor Author

For some reason seed-isort-config is changing this. Investigating.

Contributor Author

This can be reverted after #2474 is merged.

@@ -1,6 +1,6 @@
 repos:
 - repo: https://github.com/asottile/seed-isort-config
-  rev: v1.9.2
+  rev: v2.2.0
Contributor Author

Will revert after #2474 is merged.

Contributor

@jreback jreback left a comment

some questions & can you add a release note

if np.isscalar(result):
    return pd.Series(
        np.repeat(result, len(data.index)),
        index=data.index,
        name=result_name,
    )
return result.rename(result_name)

if expr.has_name():
Contributor

could be an elif

Contributor Author

Updated

ibis/backends/pandas/execution/selection.py (resolved)
ibis/expr/api.py (outdated, resolved)
if isinstance(projection, ir.ValueExpr):
    if (
        isinstance(projection, ir.StructValue)
        and not projection.has_name()
Contributor

why does the name matter?

Contributor Author

@icexelloss icexelloss Oct 19, 2020

If this is a named struct, we want to create a struct column instead of "flattening" the struct, e.g.:

This creates a struct column "new_struct_col" with col1, col2

table = table.mutate(new_struct_col=struct_udf(dt['v']))

This creates two columns, col1 and col2

table = table.mutate(struct_udf(dt['v']))
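
The naming rule described above can be sketched in plain Python. This is a model of the behavior only, assuming a table is a dict of column lists; `struct_udf_model` and `mutate_model` are hypothetical helpers for illustration and are not part of ibis.

```python
# Plain-Python model of the "named struct vs. unnamed struct" rule (NOT ibis code).


def struct_udf_model(values):
    """Hypothetical UDF returning a struct with fields col1 and col2."""
    return {
        "col1": [v + 1 for v in values],
        "col2": [v + 2 for v in values],
    }


def mutate_model(table, *unnamed_structs, **named_structs):
    """Unnamed structs are flattened; named structs become one struct column."""
    out = dict(table)
    for struct in unnamed_structs:
        # No name given: each field becomes its own top-level column.
        out.update(struct)
    for name, struct in named_structs.items():
        # Name given: keep a single column whose values are per-row structs.
        fields = list(struct)
        n_rows = len(struct[fields[0]])
        out[name] = [{f: struct[f][i] for f in fields} for i in range(n_rows)]
    return out


table = {"v": [1, 2]}

# One struct column "new_struct_col" holding col1 and col2 for each row.
named = mutate_model(table, new_struct_col=struct_udf_model(table["v"]))

# Two flat columns, col1 and col2.
flat = mutate_model(table, struct_udf_model(table["v"]))
```

In this model the only switch between the two outcomes is whether the caller supplies a name, which is the implicit behavior the review discussion is about.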

Contributor Author

@icexelloss icexelloss Oct 19, 2020

Also added test_elementwise_udf_struct to test adding a named struct column

Contributor

hmm, is this the API we really want, i.e. this implicit naming behavior? It seems sensible, but there is no easy way to turn it on/off or error-check it.

ibis/file/csv.py Outdated
None,
[
getattr(s.op(), 'name', None)
or (s.get_name() if s.has_name() else None)
Contributor

hmm see above

Contributor Author

Updated!

@icexelloss
Contributor Author

Per offline discussion with @jreback, changing get_name to return None instead of throwing an exception has a wide impact (it fails many tests), so I will address that change in a separate PR.

Opened #2484 to track
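
For readers unfamiliar with the two styles being weighed here, the following sketch contrasts a `get_name` that raises with a guarded lookup that yields `None`. `NamedExprModel` and `name_or_none` are hypothetical stand-ins, and the exception type is an assumption; the real ibis expression classes are not reproduced.

```python
# Hypothetical model of the has_name()/get_name() pattern (NOT the ibis classes).


class NamedExprModel:
    """Stand-in for an expression that may or may not carry a name."""

    def __init__(self, name=None):
        self._name = name

    def has_name(self):
        return self._name is not None

    def get_name(self):
        # Assumption: the real method raises some error when no name is set;
        # ValueError is used here purely for illustration.
        if self._name is None:
            raise ValueError("expression has no name")
        return self._name


def name_or_none(expr):
    """Guarded lookup: callers get None instead of an exception."""
    return expr.get_name() if expr.has_name() else None


print(name_or_none(NamedExprModel("col1")))  # a named expression
print(name_or_none(NamedExprModel()))        # an unnamed expression
```

Keeping the raising `get_name` and guarding at the call site leaves existing callers untouched, which is consistent with deferring the wider change to a separate PR.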

... [date.str.slice(0, 4), date.str.slice(4, 8)],
... axis=1
... )
... result.columns = ['year', 'monthday']
Contributor

what happens if the result is NOT named, do I get an exception?

Contributor Author

Actually, results are not required to be named. I updated the test to reflect this. Column labels will be assigned based on output_type.
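
A minimal sketch of that labeling rule, assuming the UDF returns its output columns positionally and the declared output type supplies the field names in the same order. `label_udf_result` and `split_date` are hypothetical helpers written for this illustration, not code from the PR.

```python
# Plain-Python model: label an unnamed UDF result from declared field names.


def label_udf_result(result, output_names):
    """Pair positional output columns with the declared field names."""
    columns = list(result)
    if len(columns) != len(output_names):
        raise ValueError(
            f"UDF returned {len(columns)} columns, "
            f"but the output type declares {len(output_names)} fields"
        )
    return dict(zip(output_names, columns))


def split_date(dates):
    """Hypothetical UDF body: returns two unnamed columns."""
    return [d[0:4] for d in dates], [d[4:8] for d in dates]


# Field names as they would be declared in the output type.
declared_names = ["year", "monthday"]
labeled = label_udf_result(split_date(["20201015", "20201021"]), declared_names)
```

A mismatch between the number of returned columns and declared fields is treated as an error in this sketch; whether the backends in the PR check this is not stated in the discussion.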

@jreback jreback added this to the Next Feature Release milestone Oct 20, 2020
@jreback jreback merged commit 41cfc4d into ibis-project:master Oct 21, 2020
@jreback
Contributor

jreback commented Oct 21, 2020

thanks @icexelloss
