FEAT: API for creating array column from other columns #2615
Conversation
```python
@execute_node.register(ops.ArrayColumn, list)
def execute_array_column(op, cols, **kwargs):
    df = dd.concat(cols, axis=1)
    return df.apply(lambda row: list(row), axis=1, meta=(None, 'object'))
```
We should probably return an `np.array` as the element instead of a Python list.
You should have to be explicit in setting an object array (which is fine); the other option is a tuple, which I find OK too.
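The distinction being discussed can be sketched in isolation with plain pandas (no dask): `rows_to_arrays` is a hypothetical helper, not ibis code, showing each row collapsed into an explicit object-dtype `np.array` rather than a Python list:

```python
import numpy as np
import pandas as pd

def rows_to_arrays(df: pd.DataFrame) -> pd.Series:
    """Collapse each row of df into a single object-dtype np.array element."""
    return pd.Series(
        # Explicitly request an object array (the reviewers' suggestion);
        # tuple(row) would be the other option mentioned above.
        [np.array(list(row), dtype=object) for row in df.itertuples(index=False)],
        index=df.index,
    )

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
result = rows_to_arrays(df)
# result.iloc[0] is np.array([1, 3], dtype=object)
```

A list comprehension over `itertuples` is used here instead of `df.apply` to avoid pandas inferring a different result shape when the function returns an ndarray.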
```diff
@@ -4,6 +4,19 @@
 import ibis

+@pytest.mark.xfail_unsupported
+@pytest.mark.skip_missing_feature(
```
How does this decorator work?
The default feature test flags are defined here:
ibis/ibis/backends/tests/base.py
Lines 63 to 73 in b78ea0c
```python
class BackendTest(abc.ABC):
    check_dtype = True
    check_names = True
    supports_arrays = True
    supports_arrays_outside_of_select = supports_arrays
    supports_window_operations = True
    additional_skipped_operations = frozenset()
    supports_divide_by_zero = False
    returned_timestamp_unit = 'us'
    supported_to_timestamp_units = {'s', 'ms', 'us'}
    supports_floating_modulus = True
```
Each backend's tests/conftest.py can override which feature test flags should be enabled for that backend:
ibis/ibis/backends/sqlite/tests/conftest.py
Lines 11 to 16 in b78ea0c
```python
class TestConf(BackendTest, RoundAwayFromZero):
    supports_arrays = False
    supports_arrays_outside_of_select = supports_arrays
    supports_window_operations = True
    check_dtype = False
    returned_timestamp_unit = 's'
```
(And here is where the behavior of the decorator is actually implemented)
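As a rough standalone model of the mechanism described above (a sketch under assumption, not ibis's actual conftest code): the backend's `TestConf` class carries boolean feature flags, and the skip decorator just checks the named flag on that class:

```python
class BackendTest:
    # Default feature flags, mirroring the base.py snippet above.
    supports_arrays = True
    supports_window_operations = True
    check_dtype = True

class SqliteConf(BackendTest):
    # A backend's conftest overrides the flags it does not support.
    supports_arrays = False
    check_dtype = False

def should_skip(backend_cls, *flag_names):
    """Model of skip_missing_feature: skip when any required flag is disabled."""
    return not all(getattr(backend_cls, name, False) for name in flag_names)

print(should_skip(SqliteConf, "supports_arrays"))             # skip
print(should_skip(SqliteConf, "supports_window_operations"))  # run
```

In the real implementation the check is wired into pytest via a marker and a collection hook, but the flag lookup is the essential part.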
ibis/expr/types.py
```diff
@@ -1267,6 +1267,36 @@ def as_value_expr(val):
     return val


+def array_column(*cols):
```
Why not use `ibis.array`?
Reviewed one round. Looks pretty good, but two pending questions.
Left some comments
@icexelloss I think naming it ...
This would fail trying to create an array scalar right now?
> I think naming it ...

Here are the changes I've made:
As I write this, I think I do want to add one more layer of functionality to fill this API out:
This ended up being more complicated than I expected for the Dask and PySpark backends. I'd like to leave this out of scope for this PR.
It's not obvious to me why we allow ...
This was because I feel the name is slightly ambiguous; this may end up boiling down to whether we think this is true:
My thought is that if I were trying to create an array scalar, it would feel unusual to me if ...
Just to keep the state of this PR a little clearer, here is the current API as of this comment.

Current proposed API

- Input: list of column expressions → Output: array column
- Input: list of Python literals → Output: array scalar
  - Note: in this case (inputs are Python literals), ...
I agree here with @timothydijamco. I think it's OK to allow ...
Makes sense. I am OK with this as well.
Thanks @timothydijamco
New operation

This PR adds a new operation called `ArrayColumn`.

- Input: list of n `ColumnExpr`s (of the same datatype)
- Output: a single `ArrayColumn`, where each element is an array of length n. Each array is populated from the values in the input columns for that row. (See the New API section below for an example.)

This operation is similar to PySpark's `array(*cols)` (link) and Postgres's `array[...]` (link). I have implemented the operation on the following backends so far:
New API

The operation is exposed through a new top-level function: `ibis.array_column(*cols)`. (Other examples of top-level functions are `ibis.literal(value)`, `ibis.sequence(values)`, and `ibis.null()`.)

Here is an example of the new API: ...
Variation 1: Name the function `ibis.array(*cols)` instead

This would be more parallel to how PySpark and Postgres name this kind of operation.

I've initially chosen to name it `array_column` to avoid implying that this function could be used to create an `ArrayScalar` (it can only create an `ArrayColumn`). If the user wants to create an `ArrayScalar`, they must use `ibis.literal`.

As an extension to this, we could have `ibis.array(*exprs)` produce either an `ArrayColumn` or an `ArrayScalar`, depending on its inputs. My concern is that this would overlap with `ibis.literal`.

Variation 2: Allow a mix of column and scalar expressions as input
Currently, the inputs must be column expressions only. We could accept a mix of column expressions and scalar expressions as well (but there must be at least one column expression); any scalars in the list of inputs would be broadcast to all the rows. PySpark's `array` and Postgres's `array` both allow this.

I've started this PR off with only column expressions accepted, since it's more restrictive, but I think there's definitely merit to allowing scalar expressions.