ENH: Use ndarray as array representation in Pandas backend #2753
Conversation
The Dask backend imports some array-related execution functions from the Pandas backend. Since these execution functions now use ndarrays, for now I am removing these imports from the Dask backend and also xfailing some array-related Dask backend tests. (If this PR gets merged, I'll update the Dask xfailed test tracker #2553.)
Overall looks good. Some comments.
ibis/backends/pandas/client.py
Outdated
@@ -134,6 +134,11 @@ def infer_pandas_timestamp(value):
@dt.infer.register(np.ndarray)
def infer_array(value):
    # TODO(kszucs): infer series
Remove this TODO?
@icexelloss (+ anyone else who has thoughts) On one hand, this would be good for backward-compatibility. Users who have Ibis expressions that contain UDFs returning lists would find those expressions stop working on the Pandas backend after this PR, without this additional change. But aside from backwards-compatibility, I think it would be preferable not to convert things for the user, and instead just strictly require them to return ndarrays. Let me know what you think!
I don't think anyone else is using array UDFs in Ibis other than us. This is a fairly new feature and not exposed to other SQL-based backends. I think it's OK to break backwards-compatibility here.
@timothydijamco can you rebase this?
ibis/backends/pandas/client.py
Outdated
@@ -123,6 +123,41 @@ def infer_numpy_scalar(value):
    return dt.dtype(value.dtype)


def _infer_pandas_series_contents(s):
Can you type the input and output? Also add a Parameters / Returns section.
ibis/backends/pandas/client.py
Outdated
if inferred_dtype in {'mixed', 'decimal'}:
    # We need to inspect an element to determine the Ibis dtype
    value = s.iloc[0]
    if isinstance(value, (np.ndarray, list, pd.core.series.Series)):
use pd.Series
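A minimal sketch of the suggested spelling (hypothetical helper name; pd.Series is the public alias for pd.core.series.Series):

import numpy as np
import pandas as pd

def _is_array_like(value) -> bool:
    # same check as in the quoted diff, but using the public pd.Series alias
    return isinstance(value, (np.ndarray, list, pd.Series))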
ibis/backends/pandas/client.py
Outdated
@@ -133,7 +168,19 @@ def infer_pandas_timestamp(value):


@dt.infer.register(np.ndarray)
def infer_array(value):
    # TODO(kszucs): infer series
    np_dtype_name = value.dtype.name
Don't use the name; check against np.object as you do in the series inferrer.
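A minimal sketch of what that check might look like (hypothetical helper, not the diff in this PR; np.object_ is the non-deprecated spelling of np.object):

import numpy as np

def _is_object_ndarray(value: np.ndarray) -> bool:
    # True for arrays holding arbitrary Python objects (strings, None, nested arrays, ...)
    return value.dtype == np.object_

_is_object_ndarray(np.array([1, 2, 3]))                   # False: dtype is int64
_is_object_ndarray(np.array(['a', None], dtype=object))   # True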
'array_of_int64': [[1, 2], [], [3]],
'array_of_strings': [['a', 'b'], [], ['c']],
'array_of_float64': [
    np.array([1.0, 2.0]),
are we allowed to have a None / np.nan? (e.g. a missing value for the entire row?)
Yes, this should not lead to any issues, except possibly when we're trying to infer the Ibis dtype of the array.
If it contains np.nans then it would be OK (the inferred dtype would be dt.Array(dt.float64)).
If it contains Nones then the inferred dtype would be more general, e.g. dt.Array(dt.binary).
No, what I mean is the case where the array itself is None, e.g. [np.array(...), np.array(...), None] (this may also be restricted / not allowed). Can you follow up with tests for this?
This technically will work, but I wouldn't consider it well-supported.
Type inference can be imperfect in this case (sort of "best-effort"): to infer the type of an object Series (which includes Series that hold np.arrays, Nones, or a mix, etc.), I check the type of the first element of the Series. So in your example, it depends on whether a None or a np.array(...) is in the first element slot.
On that note, I'm a bit on the fence about the check-first-element-only approach, because it can seem unpredictable. An alternative would be to not try to resolve the ambiguity and just raise an error, similar to what we have been doing before this PR. But that requires users to always specify an Ibis schema manually for columns that contain arrays, even when their column data is clean enough for the type to be inferred accurately.
Let me know if you have any strong opinions (otherwise I can leave as-is, noting that this is something that is open to be reconsidered in the future)
Sure, note however that we should not be checking the first element at all! We should simply call infer_dtype on each element (if it's not a scalar); pandas is designed to make this very performant and it will exit immediately for ndarrays with a dtype.
e.g. something like

if is_scalar(e):
    if isna(e):
        continue  # not sure what you need to track here
    raise  # ? (e.g. we don't allow scalars mixed in, I think)
else:
    inferred_type = infer_dtype(e)
    if ...
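For reference, a runnable version of that sketch might look roughly like the following (the helper name and the scalar-rejection behavior are assumptions for illustration, not the code that ended up in this PR):

import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype, is_scalar

def _infer_element_dtypes(s: pd.Series) -> set:
    """Collect pandas-inferred dtypes for the array elements of an object Series."""
    inferred = set()
    for e in s:
        if is_scalar(e):
            if pd.isna(e):
                continue  # a missing row (None / np.nan); nothing to infer
            raise TypeError(f"scalar {e!r} mixed into an array column")
        # per the comment above, infer_dtype is cheap for ndarrays that carry a dtype
        inferred.add(infer_dtype(e, skipna=True))
    return inferred

_infer_element_dtypes(pd.Series([np.array([1, 2]), None, np.array([3])]))  # {'integer'}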
Oh, I think this is promising. We could check all elements of the Series, and it shouldn't be heavy, because the only things we need to do with each element are: 1) check whether it's an array, and 2) call dt.infer on the array, which will either directly return the np.array's dtype or rely on pandas' infer_dtype (also not heavy).
Hmm, the performance of checking all elements could end up being an issue: it looks like this would take ~5s for a Series with 1,000,000 arrays. It's hard for me to say whether this is reasonable or not. With many Series like this it would take a while, although maybe that wouldn't be common.
What do you think about leaving this as-is (I'm thinking that checking only the first element is not perfect but is an OK heuristic) and revisiting in a follow-up if necessary?
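For context, a rough way to reproduce the measurement mentioned above (timing numbers will obviously vary by machine; the ~5s figure is the one reported in the comment):

import time

import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype

# one million small arrays, mimicking the scenario described above
s = pd.Series([np.array([1.0, 2.0])] * 1_000_000)

start = time.perf_counter()
for e in s:
    infer_dtype(e)
print(time.perf_counter() - start)  # the comment above reports ~5s at this size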
'array_of_int64': [np.array([1, 2]), np.array([]), np.array([3])],
'array_of_strings': [
    np.array(['a', 'b']),
    np.array([]),
This empty array has a different dtype; is that on purpose? Is this correctly inferred?
Type inference isn't tested for these arrays in particular (the Ibis types for this test DataFrame are explicitly defined later in this module).
However, in general, the inferred Ibis type of a pd.Series that contains np.arrays depends on the first element of the pd.Series (see this code in this PR)*. This column in particular would be correctly inferred to be string dtype, since the first element is an np.array containing strings.
*Not a foolproof way to infer the type of the column, but it avoids having to check every element in the pd.Series.
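As a small illustration of the dtype point above (plain numpy, separate from the PR's inference code):

import numpy as np

np.array([]).dtype          # dtype('float64'): an empty array defaults to float64
np.array(['a', 'b']).dtype  # dtype('<U1'): a unicode dtype, so first-element
                            # inference sees a string array for this column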
LGTM
Great, thanks. Can you add a whatsnew note describing what changed? Ping on green.
This PR was split off from #2743 (but the description below is complete—no info/discussion from #2743 is needed to understand this PR)
Background
Arrays are currently represented using list

The Pandas backend mainly uses lists as the underlying representation for Ibis arrays:
- Execution functions for operations that create arrays (i.e. that produce an ArrayScalar or ArrayColumn) create lists during execution (examples: ArraySlice, ArrayCollect). This means that UDFs that accept arrays as input will receive lists as input, and end-result DataFrames that contain array columns will contain lists.
- Execution functions for operations that take arrays as input (i.e. that have ArrayScalars and/or ArrayColumns as args) expect lists during execution (examples: ArrayRepeat, ArrayConcat).

Inconsistencies
There are some exceptions, which can lead to problems during execution and confusion:
- ibis.literal(np.array([1, 2, 3])) creates an ndarray (instead of a list) during execution
- ibis.array([t.int_col, t.other_int_col]) creates ndarrays (instead of lists) during execution
- array UDFs may return either list or ndarray

If any of these APIs are used, then during execution problems might come up depending on what other operations the user has applied on top of these operations (for example, if the user uses a UDF that returns an ndarray and then applies ArrayConcat on the result, an error will occur because ArrayConcat expects a list; if they had applied no operation on the result of their UDF, their expression would execute OK).
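To make the inconsistency concrete, here is a hedged sketch of the kind of expression that could break before this PR (assuming `+` on array values maps to ArrayConcat; treat it as any operation that consumes both literals):

import numpy as np
import ibis

# An array literal built from an ndarray executes to an ndarray...
ndarray_lit = ibis.literal(np.array([1, 2, 3]))
# ...while one built from a plain list executes to a list.
list_lit = ibis.literal([4, 5, 6])

# Combining them could then fail at execution time on the Pandas backend,
# because ArrayConcat expected lists.
expr = ndarray_lit + list_lit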
This PR

Goals
In the Pandas backend, use ndarray as the representation for arrays.

This would be a more useful array representation for users to have in their UDFs and resulting DataFrames.

This would also be consistent with the PySpark backend, which uses ndarray (if Arrow is enabled in PySpark; there is no guarantee that the user is using PySpark in this configuration, but if Ibis UDFs are to be used with the PySpark backend, Arrow must be enabled).

Changes
- Execution functions for operations that create arrays now create ndarrays during execution
- Execution functions for operations that take arrays as input now expect ndarrays as input during execution
- UDFs that output arrays should be allowed to return either list or ndarray, but will be coerced into ndarray by the backend during execution to ensure compatibility with the rest of the array operations (a minimal sketch of this coercion follows after the link below)

Organized overview of which APIs/operations this PR affects:
https://gist.github.com/timothydijamco/fea0a79b6ed9a0367c58e51f9973f4af
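To make the third Changes bullet concrete, a minimal sketch of the kind of coercion described there (hypothetical helper name; the actual implementation in the PR may differ):

import numpy as np

def _coerce_to_ndarray(value):
    """Normalize a UDF's array output so downstream array ops see ndarrays."""
    if isinstance(value, np.ndarray):
        return value
    if isinstance(value, list):
        return np.array(value)
    return value

_coerce_to_ndarray([1, 2, 3])             # array([1, 2, 3])
_coerce_to_ndarray(np.array([1.0, 2.0]))  # returned unchanged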