
ENH: Use ndarray as array representation in Pandas backend #2753

Merged

Conversation

timothydijamco (Contributor) commented May 3, 2021:

This PR was split off from #2743 (but the description below is complete—no info/discussion from #2743 is needed to understand this PR)

Background

Arrays are currently represented using list

The Pandas backend mainly uses lists as the underlying representation for Ibis arrays (a short sketch follows this list):

  • Operations that create arrays (i.e. output ArrayScalar or ArrayColumn) create lists during execution. (Examples: ArraySlice, ArrayCollect).
    This means that UDFs that accept arrays as input will receive lists as input, and end-result DataFrames that contain array columns will contain lists.
  • Operations that operate on arrays (i.e. take ArrayScalars and/or ArrayColumns as args) expect lists during execution. (Examples: ArrayRepeat, ArrayConcat)
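For concreteness, here is a minimal sketch of the current behavior for an array-creating operation; the table and column names are made up, and the API spelling assumes the Ibis 1.x/2.x pandas backend:

    import pandas as pd
    import ibis

    df = pd.DataFrame({'key': ['a', 'a', 'b'], 'int_col': [1, 2, 3]})
    con = ibis.pandas.connect({'t': df})
    t = con.table('t')

    # ArrayCollect: before this PR, each collected array materializes as a
    # Python list in the resulting DataFrame; after this PR, as an ndarray.
    expr = t.group_by('key').aggregate([t.int_col.collect().name('arr')])
    print(expr.execute())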

Inconsistencies

There are some exceptions, which can lead to problems during execution and confusion:

  • ibis.literal(np.array([1, 2, 3])) creates an ndarray (instead of a list) during execution
  • ibis.array([t.int_col, t.other_int_col]) creates ndarrays (instead of lists) during execution
  • Users may write UDFs that return either list or ndarray

If any of these APIs are used, problems may surface during execution depending on which other operations are applied on top of them. For example, if a UDF returns an ndarray and ArrayConcat is then applied to the result, an error occurs because ArrayConcat expects a list; had no further operation been applied to the UDF's result, the expression would have executed fine.
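The failure mode can be illustrated without Ibis at all. This is a minimal sketch assuming the ArrayConcat execution rule concatenates Python lists with the + operator (the operator choice is an assumption; the list expectation comes from the description above):

    import numpy as np

    # Lists concatenate with `+` ...
    [1, 2, 3] + [4, 5]            # -> [1, 2, 3, 4, 5]

    # ... but an ndarray on the left makes `+` mean elementwise addition,
    # which fails here because the shapes don't broadcast.
    np.array([1, 2, 3]) + [4, 5]  # -> ValueError: operands could not be broadcast together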

This PR

Goals

In the Pandas backend,

  1. Use one consistent representation for arrays
  2. Favor ndarray as the representation for arrays.
    This would be a more useful array representation for users to have in their UDFs and resulting DataFrames.
    This would also be consistent with the PySpark backend, which uses ndarray when Arrow is enabled in PySpark. There is no guarantee that a user runs PySpark in that configuration, but if Ibis UDFs are to be used with the PySpark backend, Arrow must be enabled (see the configuration snippet after this list).
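For reference, Arrow-based conversion in PySpark is toggled with a Spark SQL config option; a hedged example (the key shown is the Spark 3.x spelling):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # Spark 3.x key; Spark 2.3/2.4 used spark.sql.execution.arrow.enabled
        .config("spark.sql.execution.arrow.pyspark.enabled", "true")
        .getOrCreate()
    )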

Changes

  • All operations that create arrays should create ndarrays during execution
  • All operations that operate on arrays should work when given ndarrays as input during execution
  • UDFs that output arrays should be allowed to return either list or ndarray, but the backend will coerce the result to ndarray during execution to ensure compatibility with the rest of the array operations (a sketch of such a coercion follows the overview link below)

Organized overview of which APIs/operations this PR affects:
https://gist.github.com/timothydijamco/fea0a79b6ed9a0367c58e51f9973f4af
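As a rough illustration of the third change, a coercion step for UDF outputs could look like the following; this is a hedged sketch, not the helper actually added in the PR, and the function name is made up:

    import numpy as np
    import pandas as pd

    def _coerce_array_output(value):
        # Normalize a UDF's array-typed output so downstream array
        # operations always see a single representation (np.ndarray).
        if isinstance(value, np.ndarray):
            return value
        if isinstance(value, (list, tuple, pd.Series)):
            return np.asarray(value)
        return value

    # e.g. applied elementwise to a UDF result column:
    # result_column = result_column.map(_coerce_array_output)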

timothydijamco (Contributor, Author) commented:

The Dask backend imports some array-related execution functions from the Pandas backend. Since these execution functions now use ndarrays, they are no longer very compatible with the rest of the Dask backend.

For now, I am removing these imports from the Dask backend and also xfailing some array-related Dask backend tests. (If this PR gets merged, I'll update the Dask xfailed test tracker #2553.)

icexelloss (Contributor) left a comment:

Overall looks good. Some comments.

Resolved review threads: ibis/backends/pandas/execution/arrays.py, ibis/backends/pandas/execution/generic.py, ibis/backends/pandas/execution/maps.py (two threads).
@@ -134,6 +134,11 @@ def infer_pandas_timestamp(value):
@dt.infer.register(np.ndarray)
def infer_array(value):
# TODO(kszucs): infer series
Contributor:
Remove this TODO?

@jreback jreback added the expressions Issues or PRs related to the expression API label May 4, 2021
timothydijamco (Contributor, Author) commented:

  • UDFs that output arrays should be allowed to return either list or ndarray, but the backend will coerce the result to ndarray during execution to ensure compatibility with the rest of the array operations

@icexelloss (+ anyone else who has thoughts)
Now that the changes so far have been reviewed a few times, I wanted to discuss this additional change (quoted above) that I was thinking of adding to this PR.

On one hand, it would be good for backward compatibility. Ibis expressions that contain:

  1. A UDF that returns list
  2. Array operations applied on the result of that UDF

would stop working on the Pandas backend after this PR without this additional change.

But aside from backward compatibility, I think it would be preferable not to convert things for the user, and instead strictly require UDFs to return ndarrays if they are meant to create arrays in Ibis (in other words, don't make this extra change). That would be simpler, with less behind-the-scenes magic.

Let me know what you think!

icexelloss (Contributor) commented:

(Quoting timothydijamco's comment above.)

I don't think anyone other than us is using array UDFs in Ibis. This is a fairly new feature and is not exposed to the other SQL-based backends. I think it's OK to break backward compatibility here.

jreback (Contributor) commented May 4, 2021:

@timothydijamco can you rebase this?

@@ -123,6 +123,41 @@ def infer_numpy_scalar(value):
return dt.dtype(value.dtype)


def _infer_pandas_series_contents(s):
Contributor:

Can you annotate the input and output types? Also add a Parameters / Returns section.
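For illustration, a hedged sketch of what the annotated signature and docstring might look like; the return type and docstring text are assumptions, not the PR's actual code:

    import pandas as pd
    import ibis.expr.datatypes as dt

    def _infer_pandas_series_contents(s: pd.Series) -> dt.DataType:
        """Infer the Ibis dtype of the values held in an object Series.

        Parameters
        ----------
        s : pd.Series
            Series whose element type should be inferred.

        Returns
        -------
        dt.DataType
            The inferred Ibis type of the series contents.
        """
        ...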

if inferred_dtype in {'mixed', 'decimal'}:
# We need to inspect an element to determine the Ibis dtype
value = s.iloc[0]
if isinstance(value, (np.ndarray, list, pd.core.series.Series)):
Contributor:
use pd.Series

@@ -133,7 +168,19 @@ def infer_pandas_timestamp(value):

@dt.infer.register(np.ndarray)
def infer_array(value):
# TODO(kszucs): infer series
np_dtype_name = value.dtype.name
Contributor:
Don't use the name; check against np.object as you do in the series inferer.
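A hedged sketch of the suggested style (the decorator and the dt.dtype call come from the surrounding diff; the object-dtype branch is an assumption and is left unimplemented):

    import numpy as np
    import ibis.expr.datatypes as dt

    @dt.infer.register(np.ndarray)
    def infer_array(value):
        # Compare the dtype object directly instead of matching on its name.
        if value.dtype == np.object_:
            # object-dtype arrays would need element-wise inference,
            # which is omitted in this sketch
            raise NotImplementedError
        return dt.Array(dt.dtype(value.dtype))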

Resolved (outdated) review thread: ibis/backends/pandas/client.py
'array_of_int64': [[1, 2], [], [3]],
'array_of_strings': [['a', 'b'], [], ['c']],
'array_of_float64': [
np.array([1.0, 2.0]),
Contributor:
Are we allowed to have a None / np.nan (e.g. a missing value for the entire row)?

Contributor Author:
Yes, this should not lead to any issues, except possibly when inferring the Ibis dtype of the array column.

If it contains np.nans, that's fine (the inferred dtype would be dt.Array(dt.float64)).

If it contains Nones, the inferred dtype would be more general, e.g. dt.Array(dt.binary).
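To make the distinction concrete, a hedged illustration based on this discussion (the commented results are expectations, not verified test output):

    import numpy as np
    import ibis.expr.datatypes as dt

    dt.infer(np.array([1.0, np.nan]))  # np.nan keeps float64 -> dt.Array(dt.float64)
    dt.infer(np.array([1.0, None]))    # None forces object dtype -> a more general
                                       # element type, e.g. dt.Array(dt.binary)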

Contributor:
No, what I mean is the case where the array itself is None, e.g. [np.array(...), np.array(...), None] (this may also be restricted / not allowed). Can you follow up with tests for this?

Contributor Author:
This technically works, but I wouldn't consider it well-supported.

Type inference can be imperfect in this case (it is best-effort): to infer the type of an object Series (which includes Series holding ndarrays, Nones, or a mix), I check the type of the first element of the Series. So in your example, the result depends on whether a None or an np.array(...) happens to sit in the first element slot.

On that note, I'm a bit on the fence about the check-first-element-only approach, because it can seem unpredictable. An alternative would be to not try to resolve the ambiguity and just raise an error, similar to what we did before this PR. But that would require users to always specify an Ibis schema manually for columns that contain arrays, even when their column data is clean enough for the type to be inferred accurately.

Contributor Author:
Let me know if you have any strong opinions (otherwise I can leave as-is, noting that this is something that is open to be reconsidered in the future)

Contributor:
Sure. Note, however, that we should not be checking the first element at all! We should simply call infer_dtype on each element (if it's not a scalar); pandas is designed to make this very performant, and it will exit immediately for ndarrays that carry a dtype.

e.g. something like

from pandas import isna
from pandas.api.types import infer_dtype, is_scalar

for e in s:
    if is_scalar(e):
        if isna(e):
            continue  # a missing value; nothing to infer from it
        raise TypeError("scalars mixed with arrays are not allowed")
    inferred_type = infer_dtype(e, skipna=True)
    # ... combine inferred_type across elements

Contributor Author:
Oh, I think this is promising. We could check all elements of the Series, and it shouldn't be heavy, because the only things we really need to do with each element are: 1) check whether it's an array, and 2) call dt.infer on it, which will either directly return the ndarray's dtype or rely on pandas' infer_dtype (also not heavy).

Contributor Author:
Hmm, the performance of checking all elements could end up being an issue: it looks like this would take ~5s for a Series of 1,000,000 arrays. It's hard for me to say whether that is reasonable. With many Series like this it would take a while, although maybe that wouldn't be common.

What do you think about leaving this as-is (checking only the first element is not perfect, but it's an OK heuristic) and revisiting in a follow-up if necessary?
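For context, a rough and entirely hypothetical way to measure the cost being discussed (not code from the PR; timings will vary by machine):

    import numpy as np
    import pandas as pd
    from pandas.api.types import infer_dtype

    s = pd.Series([np.array([1, 2, 3]) for _ in range(1_000_000)], dtype=object)

    def infer_all_elements(series):
        # Inspect every element, as suggested above.
        return {infer_dtype(e, skipna=True) for e in series}

    def infer_first_element(series):
        # The heuristic this PR currently uses: look at the first element only.
        return infer_dtype(series.iloc[0], skipna=True)

    # %timeit infer_all_elements(s)    # on the order of seconds
    # %timeit infer_first_element(s)   # roughly microseconds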

'array_of_int64': [np.array([1, 2]), np.array([]), np.array([3])],
'array_of_strings': [
np.array(['a', 'b']),
np.array([]),
Contributor:
This empty array has a different dtype; is that on purpose? Is this correctly inferred?

Contributor Author:
Type inference isn't tested for these arrays in particular (the Ibis types for this test DataFrame are explicitly defined later in this module).

However, in general the inferred Ibis type of a pd.Series that contains ndarrays depends on the first element of the pd.Series (see this code in this PR)*. This array in particular would be correctly inferred as string dtype, since the first element is an np.array containing strings.

*Not a foolproof way to infer the type of the column, but it avoids having to check every element of the pd.Series.
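The reviewer's observation can be reproduced directly in NumPy, where an empty array defaults to float64 regardless of its neighbors in the column:

    import numpy as np

    np.array(['a', 'b']).dtype  # dtype('<U1')     -- unicode strings
    np.array([]).dtype          # dtype('float64') -- NumPy's default for an empty array
    np.array([3]).dtype         # dtype('int64') on most platforms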

@timothydijamco timothydijamco changed the title WIP: Use ndarray as array representation in Pandas backend ENH: Use ndarray as array representation in Pandas backend May 5, 2021
icexelloss (Contributor) left a comment:
LGTM

@jreback jreback added this to the Next release milestone May 5, 2021
Resolved review threads: ibis/backends/pandas/client.py, ibis/backends/pandas/tests/execution/test_arrays.py (outdated), ibis/expr/operations.py.
jreback (Contributor) left a comment:

Great, thanks. Can you add a whatsnew note describing what changed? Ping on green.
