FEAT: API for creating array column from other columns #2615

Merged on Feb 10, 2021 (5 commits)

Conversation

@timothydijamco timothydijamco commented Feb 5, 2021

New operation

This PR adds a new operation called ArrayColumn.

Input: list of n ColumnExprs (of the same datatype)
Output: single ArrayColumn, where each element is an array of length n. Each array is populated from the values in the input columns for that row.
(See New API section below for an example)

This operation is similar to PySpark's array(*cols) and Postgres's array[...].

I have implemented the operation on the following backends so far:

  • PySpark
  • Pandas
  • Dask

New API

The operation is exposed through a new top-level function: ibis.array_column(*cols). Other examples of top-level functions are ibis.literal(value), ibis.sequence(values), and ibis.null().

Here is an example of the new API:

>>> t.execute()
   foo  bar
0    1    4
1    2    5
2    3    6
>>> expr = ibis.array_column(t.foo, t.bar)
>>> type(expr)
<class 'ibis.expr.types.ArrayColumn'>
>>> expr.execute()
0    [1, 4]
1    [2, 5]
2    [3, 6]
dtype: object
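
To make the row-wise semantics concrete, here is a minimal pure-Python sketch. It is not the Ibis implementation: plain lists stand in for columns, and `array_column_rows` is a hypothetical helper name used only for illustration.

```python
# Pure-Python sketch of the row-wise semantics only; not Ibis code.
# `array_column_rows` is a hypothetical helper name.

def array_column_rows(*cols):
    """Combine n equal-length columns into one column of length-n lists."""
    if not cols:
        raise ValueError("at least one column is required")
    length = len(cols[0])
    if any(len(col) != length for col in cols):
        raise ValueError("all columns must have the same length")
    # zip(*cols) walks the columns in lockstep, yielding one tuple per row.
    return [list(row) for row in zip(*cols)]


foo = [1, 2, 3]
bar = [4, 5, 6]
result = array_column_rows(foo, bar)
print(result)  # [[1, 4], [2, 5], [3, 6]]
```

Each output row has exactly as many elements as there are input columns, which matches the "array of length n" description above.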

Variation 1: Name the function ibis.array(*cols) instead

This would be more parallel to how PySpark and Postgres name this kind of operation.

I've initially chosen to name it array_column to avoid implying that this function could be used to create an ArrayScalar (it can only create an ArrayColumn). If the user wants to create an ArrayScalar, they must use ibis.literal:

>>> expr = ibis.literal([1, 2, 3])
>>> type(expr)
<class 'ibis.expr.types.ArrayScalar'>

As an extension to this, we could have ibis.array(*exprs) produce either an ArrayColumn or ArrayScalar, depending on its inputs. My concern is that this would overlap with ibis.literal.

Variation 2: Allow a mix of column and scalar expressions as input

Currently, the inputs must be column expressions only. We could accept a mix of column expressions + scalar expressions as well (but there must be at least one column expression). Any scalars in the list of inputs would be broadcast to all the rows. PySpark array and Postgres array both allow this.

I've started this PR off with only column expressions accepted, since it's more restrictive, but I think there's definitely merit to allowing scalar expressions.
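
The broadcasting rule described in Variation 2 could be sketched in plain Python as follows. This is only an illustration of the proposed behavior, not Ibis code; `Scalar` and `array_with_broadcast` are hypothetical names, and plain lists stand in for columns.

```python
# Sketch of the proposed broadcasting rule in plain Python; not Ibis code.
# `Scalar` and `array_with_broadcast` are hypothetical names.

class Scalar:
    """Marks a value that should be repeated on every row."""

    def __init__(self, value):
        self.value = value


def array_with_broadcast(*inputs):
    columns = [x for x in inputs if not isinstance(x, Scalar)]
    if not columns:
        # Mirrors the rule that at least one column expression is required.
        raise ValueError("at least one column input is required")
    n_rows = len(columns[0])
    if any(len(col) != n_rows for col in columns):
        raise ValueError("all columns must have the same length")
    # Expand each scalar to a full-length column, then zip row-wise.
    expanded = [
        [x.value] * n_rows if isinstance(x, Scalar) else list(x)
        for x in inputs
    ]
    return [list(row) for row in zip(*expanded)]


foo = [1, 2, 3]
broadcast_result = array_with_broadcast(foo, Scalar(0))
print(broadcast_result)  # [[1, 0], [2, 0], [3, 0]]
```

Requiring at least one column gives the scalars a row count to broadcast to; without it the output length would be undefined.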

@timothydijamco timothydijamco changed the title from "FEAT: API for creating array column from column expressions" to "FEAT: API for creating array column from other columns" Feb 5, 2021
@jreback jreback added this to the Next release milestone Feb 8, 2021
@jreback jreback added the expressions Issues or PRs related to the expression API label Feb 8, 2021
@execute_node.register(ops.ArrayColumn, list)
def execute_array_column(op, cols, **kwargs):
    df = dd.concat(cols, axis=1)
    return df.apply(lambda row: list(row), axis=1, meta=(None, 'object'))
Contributor

We should probably return an np.array as each element instead of a Python list.

Contributor

You would have to be explicit in setting an object array (which is fine); the other option is a tuple, which I find OK too.

@@ -4,6 +4,19 @@
import ibis


@pytest.mark.xfail_unsupported
@pytest.mark.skip_missing_feature(
Contributor

How does this decorator work?

Contributor Author

The default feature test flags are defined here:

class BackendTest(abc.ABC):
    check_dtype = True
    check_names = True
    supports_arrays = True
    supports_arrays_outside_of_select = supports_arrays
    supports_window_operations = True
    additional_skipped_operations = frozenset()
    supports_divide_by_zero = False
    returned_timestamp_unit = 'us'
    supported_to_timestamp_units = {'s', 'ms', 'us'}
    supports_floating_modulus = True

Each backend's tests/conftest.py can override which feature test flags should be enabled for that backend:

class TestConf(BackendTest, RoundAwayFromZero):
    supports_arrays = False
    supports_arrays_outside_of_select = supports_arrays
    supports_window_operations = True
    check_dtype = False
    returned_timestamp_unit = 's'

(And here is where the behavior of the decorator is actually implemented)
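
As a rough illustration of how flag-based skipping could work, here is a simplified sketch. It assumes the decorator looks up the named flags on the backend's test class, which is one plausible reading of the description above and not the actual Ibis test code. `BackendTestSketch`, `NoArraysBackend`, `ForgetfulBackend`, and `should_skip` are hypothetical names; only the flag names come from the classes quoted above.

```python
# Simplified sketch; not the actual Ibis test code. All class and function
# names here are hypothetical, only the flag names come from the PR.

class BackendTestSketch:
    supports_arrays = True
    # Evaluated once, when this class body runs: it copies the value True.
    supports_arrays_outside_of_select = supports_arrays


class NoArraysBackend(BackendTestSketch):
    supports_arrays = False
    # Must be repeated here: the inherited attribute would still be True.
    supports_arrays_outside_of_select = supports_arrays


class ForgetfulBackend(BackendTestSketch):
    supports_arrays = False  # dependent flag is not re-assigned


def should_skip(backend_cls, *required_flags):
    """Return True if the backend lacks any of the required feature flags."""
    return not all(getattr(backend_cls, flag) for flag in required_flags)


print(should_skip(BackendTestSketch, "supports_arrays"))   # False
print(should_skip(NoArraysBackend, "supports_arrays"))     # True
print(ForgetfulBackend.supports_arrays_outside_of_select)  # True (stale)
```

The last line shows why the override above repeats `supports_arrays_outside_of_select = supports_arrays`: a class attribute assigned from another attribute is a one-time copy, so a subclass that changes the first flag has to re-derive the second.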

@@ -1267,6 +1267,36 @@ def as_value_expr(val):
    return val


def array_column(*cols):
Contributor

Why not use ibis.array?

@icexelloss
Contributor

Reviewed one round. Looks pretty good, but there are two pending questions:

  • What should the value of an array element be (np array or Python list)?
  • What should this function be called?

@icexelloss icexelloss self-requested a review February 8, 2021 14:48
Contributor

@icexelloss icexelloss left a comment

Left some comments

@timothydijamco
Contributor Author

Why not use ibis.array?

@icexelloss I think naming it ibis.array would be more "canonical", but I am concerned that users would use it to try to create an array scalar (they should use ibis.literal([1, 2, 3]) instead in that case). Please see the "Variation 1" section in the overview post! (have some more thoughts on this there)

@jreback
Contributor

jreback commented Feb 8, 2021

Why not use ibis.array?

@icexelloss I think naming it ibis.array would be more "canonical", but I am concerned that users would use it to try to create an array scalar (they should use ibis.literal([1, 2, 3]) instead in that case). Please see the "Variation 1" section in the overview post! (have some more thoughts on this there)

Would this fail when trying to create an array scalar right now?

  • could provide a helpful error message (e.g. looks like you are trying to create an array scalar, use .....)
  • actually allow it to create an array scalar

@timothydijamco
Contributor Author

I think naming it ibis.array would be nice for consistency with the API of the backend engines themselves, and after some thought, I don't think overlapping with ibis.literal is actually much of an issue.

Here are the changes I've made:

  • Rename ibis.array_column to ibis.array
  • Allow ibis.array to take in Python literals, in which case it will create an ArrayScalar
  • Make ibis.array take in a list as input (instead of being variadic); otherwise ibis.array([1, 2, 3]) would create a 2-d array, which could be confusing
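
The reason for the list-based signature is easiest to see with plain Python lists. The sketch below is not Ibis code, and both function names are hypothetical; it only contrasts what a single list argument means under each signature.

```python
# Plain-Python illustration of the signature choice; not Ibis code.
# Both function names are hypothetical.

def array_variadic(*values):
    # Each positional argument becomes one element of the array.
    return list(values)


def array_from_list(values):
    # The single argument is the sequence of elements itself.
    return list(values)


nested = array_variadic([1, 2, 3])
flat = array_from_list([1, 2, 3])
print(nested)  # [[1, 2, 3]]
print(flat)    # [1, 2, 3]
```

With the variadic form, passing one list yields a one-element array whose element is itself an array, which is the accidental 2-d result mentioned above.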

As I write this I think I do want to add one more layer of functionality to fill this API out: array to take in Ibis scalar expressions and return an ArrayScalar.

@timothydijamco
Contributor Author

As I write this I think I do want to add one more layer of functionality to fill this API out: array to take in Ibis scalar expressions and return an ArrayScalar.

This ended up being more complicated than I expected for the Dask and PySpark backends. I'd like to leave this out of scope for this PR.

@icexelloss
Contributor

It's not obvious to me why we allow ibis.array to create literal array columns, and I'd propose we raise:

ibis.array([1, 2, 3])

# raise: Please use ibis.literal to create a literal array column, e.g., ibis.literal([1, 2, 3]) or ibis.literal(np.array([1, 2, 3]))
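
For comparison with the behavior that was eventually merged, the "raise with a hint" option could look roughly like the sketch below. It is not Ibis code: `Column` and `array_strict` are hypothetical stand-ins, and the exception type and message wording are assumptions.

```python
# Sketch of the "raise with a hint" option; not Ibis code.
# `Column` and `array_strict` are hypothetical stand-ins, and the
# exception type and message wording are assumptions.

class Column:
    def __init__(self, values):
        self.values = list(values)


def array_strict(values):
    literals = [v for v in values if not isinstance(v, Column)]
    if literals:
        raise TypeError(
            "array only accepts column expressions; "
            "use ibis.literal to create a literal array, "
            "e.g. ibis.literal([1, 2, 3])"
        )
    # All inputs are columns: build one list per row.
    return [list(row) for row in zip(*(c.values for c in values))]


message = None
try:
    array_strict([1, 2, 3])
except TypeError as exc:
    message = str(exc)
print(message)
```

Column inputs still work under this option; only Python literals are rejected, with a message that points the user at ibis.literal.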

@timothydijamco
Contributor Author

It's not obvious to me why we allow ibis.array to create literal array columns

This was because I feel the name is slightly ambiguous; ibis.array sounds like it could be used to create an array scalar (if scalars were provided as inputs). Could you expand on why you would prefer to raise?

This may end up boiling down to whether we think this is true:

ibis.array sounds like it could be used to create an array scalar

My thought is that if I were trying to create an array scalar, it would feel unusual to me if ibis.array didn't allow me to do that

@timothydijamco
Contributor Author

timothydijamco commented Feb 9, 2021

Just to keep the state of this PR a little clearer, here is the current API as of this comment.

Current proposed API

ibis.array

Input: list of column expressions | Output: array column

>>> t.execute()
   foo  bar
0    1    4
1    2    5
2    3    6
>>> expr = ibis.array([t.foo, t.bar])
>>> type(expr)
<class 'ibis.expr.types.ArrayColumn'>
>>> expr.execute()
0    [1, 4]
1    [2, 5]
2    [3, 6]
dtype: object

Input: list of Python literals | Output: array scalar

>>> expr = ibis.array([1, 2, 3])
>>> type(expr)
<class 'ibis.expr.types.ArrayScalar'>
>>> expr.execute()
[1, 2, 3]

Note: in this case (inputs are Python literals), ibis.array is practically just an alias for ibis.literal.

>>> expr = ibis.literal([1, 2, 3])
>>> type(expr)
<class 'ibis.expr.types.ArrayScalar'>
>>> expr.execute()
[1, 2, 3]
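
The two cases above amount to a dispatch on the kind of inputs. A minimal sketch of that dispatch, using plain Python stand-ins, might look like this. It is not the Ibis implementation: `Column`, `ArrayColumnSketch`, `ArrayScalarSketch`, and `array_sketch` are hypothetical names, and rejecting mixed inputs is an assumption made to keep the sketch small.

```python
# Sketch of the input-based dispatch; not the Ibis implementation.
# All names here are hypothetical stand-ins.
from dataclasses import dataclass


@dataclass
class Column:
    values: list


@dataclass
class ArrayColumnSketch:
    rows: list


@dataclass
class ArrayScalarSketch:
    value: list


def array_sketch(values):
    values = list(values)
    if values and all(isinstance(v, Column) for v in values):
        # List of columns -> array column (one list per row).
        rows = [list(r) for r in zip(*(c.values for c in values))]
        return ArrayColumnSketch(rows)
    if not any(isinstance(v, Column) for v in values):
        # List of Python literals -> array scalar.
        return ArrayScalarSketch(values)
    # Assumption: mixed inputs are rejected in this sketch.
    raise TypeError("mixing columns and Python literals is not supported here")


col_expr = array_sketch([Column([1, 2, 3]), Column([4, 5, 6])])
scalar_expr = array_sketch([1, 2, 3])
print(type(col_expr).__name__, col_expr.rows)
print(type(scalar_expr).__name__, scalar_expr.value)
```

In the literal case the function does nothing beyond wrapping its input, which is consistent with the note that ibis.array behaves like an alias for ibis.literal there.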

@jreback
Contributor

jreback commented Feb 9, 2021

It's not obvious to me why we allow ibis.array to create literal array columns

This was because I feel the name is slightly ambiguous; ibis.array sounds like it could be used to create an array scalar (if scalars were provided as inputs). Could you expand on why you would prefer to raise?

This may end up boiling down to whether we think this is true:

ibis.array sounds like it could be used to create an array scalar

My thought is that if I were trying to create an array scalar, it would feel unusual to me if ibis.array didn't allow me to do that

I agree here with @timothydijamco. I think it's OK to allow ibis.array to create an ArrayScalar, as it's unambiguous.

@icexelloss
Contributor

I agree here with @timothydijamco. I think it's OK to allow ibis.array to create an ArrayScalar, as it's unambiguous.

Makes sense. I am OK with this as well.

@jreback jreback merged commit b9f344b into ibis-project:master Feb 10, 2021
@jreback
Contributor

jreback commented Feb 10, 2021

thanks @timothydijamco

Labels
expressions Issues or PRs related to the expression API
4 participants