-
Notifications
You must be signed in to change notification settings - Fork 590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEAT: Implement support for elementwise UDF that returns multiple col… #2473
FEAT: Implement support for elementwise UDF that returns multiple col… #2473
Conversation
e00e277
to
74aa88c
Compare
setup.cfg
Outdated
| @@ -16,7 +16,7 @@ inherit = false | |||
| convention = numpy | |||
|
|
|||
| [isort] | |||
| known_third_party = asv,click,clickhouse_driver,dateutil,google,graphviz,impala,kudu,mock,multipledispatch,numpy,pandas,pkg_resources,plumbum,psycopg2,pyarrow,pydata_google_auth,pygit2,pymapd,pymysql,pyspark,pytest,pytz,regex,requests,setuptools,sphinx_rtd_theme,sqlalchemy,thrift,toolz | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason seed-isort-config is changing this. Investigating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be reverted after #2474 merged
.pre-commit-config.yaml
Outdated
| @@ -1,6 +1,6 @@ | |||
| repos: | |||
| - repo: https://github.com/asottile/seed-isort-config | |||
| rev: v1.9.2 | |||
| rev: v2.2.0 | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will revert after #2474 merged
f4ad892
to
1d42727
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some questions & can you add a release note
| if np.isscalar(result): | ||
| return pd.Series( | ||
| np.repeat(result, len(data.index)), | ||
| index=data.index, | ||
| name=result_name, | ||
| ) | ||
| return result.rename(result_name) | ||
|
|
||
| if expr.has_name(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could e an elif
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
ibis/expr/operations.py
Outdated
| if isinstance(projection, ir.ValueExpr): | ||
| if ( | ||
| isinstance(projection, ir.StructValue) | ||
| and not projection.has_name() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does the name matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is a named struct, we want to create a struct column instead of "flatten" the struct. e.g
This creates a struct column "new_struct_col" with col1, col2
table = table.mutate(new_struct_col=struct_udf(dt['v']))
This creates two columns, col1 and col2
table = table.mutate(struct_udf(dt['v']))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also added test_elementwise_udf_struct to test adding a named struct column
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, is this the api we really want, e.g. this implicit naming behavior? it seems sensible but no easy way to turn this on/off or error check it.
ibis/file/csv.py
Outdated
| None, | ||
| [ | ||
| getattr(s.op(), 'name', None) | ||
| or (s.get_name() if s.has_name() else None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm see above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated!
5de0b8c
to
62173a2
Compare
ibis/expr/operations.py
Outdated
| if isinstance(projection, ir.ValueExpr): | ||
| if ( | ||
| isinstance(projection, ir.StructValue) | ||
| and not projection.has_name() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, is this the api we really want, e.g. this implicit naming behavior? it seems sensible but no easy way to turn this on/off or error check it.
ibis/udf/vectorized.py
Outdated
| ... [date.str.slice(0, 4), date.str.slice(4, 8)], | ||
| ... axis=1 | ||
| ... ) | ||
| ... result.columns = ['year', 'monthday'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if the result is NOT named, do I get an exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, results are not required to be named. I updated the test to reflect this. Column labels will be assign based on output_type.
|
thanks @icexelloss |
What is this change
This PR adds a supports for add multiple columns with a single elementwise UDF:
Example:
This PR adds support in both
pandasandpysparkbackenndTests
Added
test_elementwise_udf_structandtest_elementwise_udf_destruct