
API: generalized check_array_indexer for validating array-like getitem indexers #31150

Merged
merged 20 commits on Jan 29, 2020

Conversation

jorisvandenbossche
Member

Closes #30738. Also fixes the performance issue for other arrays from #30744, and related to #30308 (comment)

This generalizes the check_bool_array_indexer helper that we added for 1.0.0 so that it is no longer specific to boolean arrays but accepts any array-like input, and ensures that the output is a proper numpy array that can be used to index into numpy arrays.

I think such a more general "check" is useful: it avoids every (external and internal) EA needing to do what the test EAs already do (checking for integer arrays as well) at https://github.com/pandas-dev/pandas/blob/master/pandas/tests/extension/decimal/array.py#L118-L126, and it also fixes the performance issue in a general way (as was previously done only for Categorical in https://github.com/pandas-dev/pandas/pull/30747/files)

If we agree on the general idea, I still need to clean up this PR (eg remove the existing check_bool_array_indexer, update the extending docs, etc)

cc @TomAugspurger

@jorisvandenbossche jorisvandenbossche added Indexing Related to indexing on series/frames, not to indexes themselves ExtensionArray Extending pandas with custom dtypes or arrays. labels Jan 20, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.0.0 milestone Jan 20, 2020
@jorisvandenbossche
Member Author

In general, the __getitem__ of an EA now needs to do something like:

    ...
    if is_list_like(key):
        key = check_array_indexer(self, key)
    ...

In principle, we could also move the is_list_like check into check_array_indexer to make the usage even simpler. That only means the function then needs to pass through integers and slices/None/Ellipsis.
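To illustrate the idea, here is a minimal sketch (the function name and the exact set of passed-through types are assumptions for illustration, not the merged implementation):

```python
import numpy as np

def check_array_indexer_sketch(array, indexer):
    # Hypothetical: pass scalars, slices, None and Ellipsis through
    # untouched, so __getitem__ needs no separate is_list_like check.
    if indexer is None or indexer is Ellipsis:
        return indexer
    if isinstance(indexer, (int, np.integer, slice)):
        return indexer
    indexer = np.asarray(indexer)
    if indexer.dtype == bool:
        # boolean masks must match the length of the indexed array
        if len(indexer) != len(array):
            raise IndexError(
                f"Boolean index has wrong length: "
                f"{len(indexer)} instead of {len(array)}"
            )
        return indexer
    # integer indexers become platform ints, as numpy expects
    return indexer.astype(np.intp)

arr = np.array([10, 20, 30])
assert check_array_indexer_sketch(arr, 1) == 1
assert arr[check_array_indexer_sketch(arr, [True, False, True])].tolist() == [10, 30]
assert arr[check_array_indexer_sketch(arr, [0, 2])].tolist() == [10, 30]
```

With pass-through in place, an EA's __getitem__ could call the checker unconditionally instead of branching on the indexer type first.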

----------
array : array
The array that's being indexed (only used for the length).
indexer : array-like
Contributor

In a few places above, you've done is_list_like, but here we require an array (with a dtype).

Thoughts on what we want? Requiring an array is certainly easier, so that we don't have to infer the types. But users may be passing arbitrary objects to __getitem__.

Member Author

We actually don't require an array with a dtype. The first thing that this function does is:

    if not is_array_like(indexer):
        indexer = pd.array(indexer)

to deal with eg lists.

So I should probably update "array" in the docstring to "list-like" instead of "array-like"
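For illustration (a usage demo, not code from the PR), pd.array infers an appropriate dtype from a plain Python list, which is what makes the list-to-array fallback above work:

```python
import pandas as pd

# pd.array infers a dtype from a plain list, so a list indexer
# gets a proper (nullable) array dtype before any dtype checks run.
bool_idx = pd.array([True, False, True])
int_idx = pd.array([0, 2])

print(bool_idx.dtype)  # boolean
print(int_idx.dtype)   # Int64
```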

pandas/api/indexers/__init__.py
Contributor

@jreback jreback left a comment

i really hate adding extension array only code that is not used in pandas proper at all. this seems like a good candidate to use internally.

pandas/core/arrays/categorical.py
@@ -307,3 +312,62 @@ def check_bool_array_indexer(array: AnyArrayLike, mask: AnyArrayLike) -> np.ndar
if len(result) != len(array):
raise IndexError(f"Item wrong length {len(result)} instead of {len(array)}.")
return result


def check_array_indexer(array, indexer) -> np.ndarray:
Contributor

can you type these at all, shouldn't indexer -> key and be Label (or maybe something more sophisticated); not looking to solve this in this PR necessarily

pandas/core/indexers.py
checked if there are missing values present, and it is converted to
the appropriate numpy array.

.. versionadded:: 1.0.0
Member

1.0 or 1.1?

Contributor

1.0 if we're planning to subsume check_bool_array_indexer.

Member Author

1.0 if we're planning to subsume check_bool_array_indexer.

Yes, this is replacing check_bool_array_indexer which is already in 1.0.0, so we should do the replacement also for 1.0.0


Parameters
----------
array : array
Member

can this be made more specific, e.g. "np.ndarray or EA"?

Member Author

It's only used to get the length, so I made it "array-like" (it can in principle also be a Series)


elif is_integer_dtype(dtype):
try:
indexer = np.asarray(indexer, dtype=int)
Member

does int vs np.int64 vs np.intp matter here? are there failure modes other than the presence of NAs?

Contributor

this does matter; indexers are intp

Member Author

Yes, that was on my todo list to fix up. I need to figure out the easiest way to convert to a numpy array preserving the bit-ness of the dtype (or can we always convert to intp?)

Will update tomorrow

Member Author

OK, went with np.intp. From a quick test, when you pass non-intp integers to index with numpy, it's not slower to do the conversion to intp yourself beforehand (although while writing this, what happens if you try to index with a too large int64 that doesn't fit into int32 on a 32-bit platform?)
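A quick demonstration of the conversion being discussed (illustration only; the edge-case question about oversized int64 values on 32-bit platforms remains open above):

```python
import numpy as np

# Converting indexers to the platform integer (np.intp) up front
# matches what numpy does internally when indexing with other
# integer widths, so the indexing result is identical.
indexer = np.array([0, 2], dtype=np.int64)
converted = np.asarray(indexer, dtype=np.intp)

arr = np.array([10, 20, 30])
assert arr[converted].tolist() == arr[indexer].tolist() == [10, 30]

# On a 32-bit platform, an int64 value that does not fit np.intp
# would fail during this conversion rather than during indexing.
```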

Contributor

ensure_platform_int is a well established pattern

Member Author

Do you prefer that I update ensure_platform_int to handle extension arrays so I can use it here? (It's basically the same as np.asarray(.., dtype=np.intp); not really sure why the code in ensure_platform_int jumps through more hoops, performance I suppose)

Contributor

either way - but should be consistent and use only 1 pattern; ensure_platform_int is used extensively already

@TomAugspurger TomAugspurger mentioned this pull request Jan 22, 2020
Contributor

@jreback jreback left a comment

this needs some thorough checking

there is a lot added here that seems duplicative or needs more test coverage


result = self._codes[key]
if result.ndim > 1:
from pandas.core.indexes.base import deprecate_ndim_indexing
Contributor

can this be imported at the top?

Member Author

It seems this is possible, yes. But I don't really like array code importing from the Index classes (the dependency should be the other way around). I could also move the deprecate_ndim_indexing helper function to pd.core.indexers (instead of pd.core.indexes) to make this separation cleaner.

Member

+1

Contributor

hmm yeah I agree with this, as it's not specific to Index. In fact I would move it to pandas.compat.numpy_

pandas/core/arrays/datetimelike.py
@@ -768,6 +770,9 @@ def __getitem__(self, key):
else:
key = np.asarray(key)

if is_list_like(key):
Contributor

is this repeated on purpose?

Member Author

repeated from where?

Contributor

the next check is_bool_indexer is duplicative

Member Author

It's not fully duplicative, see my long explanation at #31150 (comment). It's mainly for dealing with object dtype.

pandas/core/indexers.py
@@ -2232,7 +2232,7 @@ def check_bool_indexer(index: Index, key) -> np.ndarray:
else:
if is_sparse(result):
result = result.to_dense()
result = check_bool_array_indexer(index, result)
result = np.asarray(check_array_indexer(index, result), dtype=bool)
Contributor

is this not already guaranteed in the output?

Contributor

can you comment here

@jorisvandenbossche
Member Author

So there are two remaining discussion items I think (also relating to some of the inline comments):

1) Do we want to allow object dtype masks/indexers?

Currently, some indexing routines allow object-dtyped masks, and some do not:

In [1]: mask = np.array([True, False, True], dtype=object) 

In [2]: s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])       

In [3]: s[mask]     
Out[3]: 
a    1
c    3
dtype: int64

In [4]: s.index[mask]  
Out[4]: Index(['a', 'c'], dtype='object')

In [5]: pd.Categorical(["a", "b", "c"])[mask]  
...
IndexError: arrays used as indices must be of integer (or boolean) type

In [6]: pd.array([1, 2, 3], dtype="Int64")[mask]       
...
IndexError: arrays used as indices must be of integer (or boolean) type

I assume a good reason we allowed this is that, as Index objects, booleans are always object dtype (we don't have a BooleanIndex).
But in general, I don't really like this support (at least for new functionality like this PR), since I don't see a need to support object-dtyped booleans. Supporting it also means that for object dtype you need to infer whether it is boolean or not.

So the question is here: are we fine with the new function being more strict on the dtype?
I ensured that every case that was converted in pandas 0.25 still is (eg in DatetimeArray, there still is an is_bool_indexer check that does such inference and then ensures the object dtype is converted to a bool dtype at https://github.com/pandas-dev/pandas/pull/31150/files#diff-34bddc5b2c962f58d24f594d80a8b520R523-L521. For Categorical, this didn't work in 0.25.0 / before #30308, so I didn't bother supporting object dtype there).

BTW, integers as object dtype also seem somewhat inconsistent (here getitem vs iloc):

In [9]: s = pd.Series([1, 2, 3], dtype="Int64", index=['a', 'b', 'c']) 

In [10]: s[np.array([0, 2], dtype=object)]   
...
IndexError: arrays used as indices must be of integer (or boolean) type

In [11]: s.iloc[np.array([0, 2], dtype=object)] 
Out[11]: 
a    1
c    3
dtype: Int64

2) What arguments to accept to check_array_indexer (and potentially pass through)?

Currently, the check_array_indexer in this PR accepts a list-like, ensures it is converted to an array-like (ndarray or EA), and then validates integer and boolean arrays. Other array dtypes are passed through, on the assumption that numpy will error on them anyway (but we could also raise the error already in check_array_indexer if that is preferred)

This means that, for correct use, you need to do something like (if you have an EA that has a _data with a numpy array):

    def __getitem__(self, item):
        if is_integer(item):
            return self._data[item]

        elif is_list_like(item) and not isinstance(item, tuple):
            item = check_array_indexer(self, item)

        return type(self)(self._data[item])

The is_list_like(item) is needed because check_array_indexer currently does not handle eg slices, and the and not isinstance(item, tuple) is needed because otherwise check_array_indexer would convert the tuple into an array, which has a different meaning (a tuple means indexing multiple dimensions, which will typically error for 1D numpy arrays).
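The tuple distinction matters because numpy treats a tuple as a multi-dimensional index rather than as a sequence of positions:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

# A list selects rows along the first axis ...
assert a[[0, 1]].shape == (2, 3)

# ... while a tuple supplies one index per dimension: a[(0, 1)] is a[0, 1].
assert a[(0, 1)] == 1

# For a 1-D array, a multi-element tuple means "too many dimensions":
b = np.arange(3)
try:
    b[(0, 1)]
except IndexError:
    pass  # too many indices for array
```

Converting a tuple to an array would silently change this semantics, which is why tuples must be excluded from the list-like conversion.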

So it was already mentioned somewhere above that we could move this is_list_like(item) and not isinstance(item, tuple) check into the check_array_indexer function. But that also means we don't just accept a list-like: the function then needs to pass other things through untouched (integer scalars, None, slices, Ellipsis, which can all be valid indexers).
Are we OK with that? It makes the typing of the function less strict (otherwise we could guarantee the output is an ndarray) and thus the code less explicit, but it makes the usage a bit easier (no need to repeat the list and tuple check everywhere this is used).

@TomAugspurger
Contributor

So the question is here: are we fine with the new function being more strict on the dtype?

Yes, I'm happy with that. We should at least have some functionality that can completely avoid dtype inference.

So it was already mentioned somewhere above that we could move this is_list_like(item) and not isinstance(item, tuple): check into the check_array_indexer function.

I think my preference is to not handle that inside check_array_indexer. I'd like to keep that focused specifically on arrays (lists too I suppose, though that's debatable I think)

If we want to make this even easier for our own / 3rd party EAs, we can add another helper like

def check_indexer(item):
    """
    Check the kind of indexer provided.

    Returns
    -------
    indexer : Any
        Integers will be passed through as is; list-likes will be converted to arrays.
    kind : str
        One of 'integer-scalar', 'integer-array', 'boolean', 'slice'.
    """

That would collect all the inference and type checking in a single method. But I think that's less important than getting this array version finished up.

@jorisvandenbossche
Member Author

jorisvandenbossche commented Jan 23, 2020

If we want to make this even easier for our own / 3rd party EAs, we can add another helper like

But wouldn't such a check_indexer helper function basically look like:

def check_indexer(array, indexer):
    if is_list_like(indexer) and not isinstance(indexer, tuple):
        indexer = check_array_indexer(array, indexer)
    return indexer

We could also decide to replace check_bool_array_indexer with such a check_indexer instead of the check_array_indexer of this PR.

I am also not sure the "kind" is very important. In practice (at least when your underlying data is a numpy array), you only care about the distinction between a scalar and a non-scalar (integer/boolean array, slice, etc.), as for the second group you need to wrap the result again in your array class (and for the first case potentially wrap it in a scalar class).


@TomAugspurger
Contributor

But wouldn't such a check_indexer helper function basically look like:

Mmm, fair point. Happy to defer to your judgement here :)

@jorisvandenbossche
Member Author

We could also have both check_indexer and the more specialized check_array_indexer ?

# GH#30588 multi-dim indexing is deprecated, but raising is also acceptable
idx = self.create_index()
with pytest.raises(ValueError, match="cannot mask with array containing NA"):
idx[:, None]
Member Author

This was removed because the base class version (which checks for the deprecation) now passes (since I added the deprecation warning)

@TomAugspurger
Contributor

Either is fine by me.

If this is exclusively or primarily used in __getitem__, then check_array_indexer should handle that check for the user, since the type accepted by __getitem__ is so broad.

If we're using it elsewhere where we know we already have an array, then skipping that check is nice.

In this PR, it looks like __getitem__ is the primary user, so we can go ahead and include the check, I think.

@jorisvandenbossche
Member Author

OK, added a commit moving the check inside. Now it only validates array-likes (list-likes which are not tuples) and passes through everything else.

Maybe the remaining question is whether we want to rename this to check_indexer instead of check_array_indexer.
I am fine with keeping the name: the "array" points to the fact that it's checking (array) input used to index arrays (eg input to index a Series or DataFrame would have other checks)
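As merged, the helper is public under pandas.api.indexers. A quick usage sketch matching the behavior described in this thread (boolean and integer list-likes validated and converted, everything else passed through):

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import check_array_indexer

arr = pd.array([1, 2, 3], dtype="Int64")

# Boolean list-likes are validated (length check, no NAs allowed)
# and returned as a plain bool ndarray ...
mask = check_array_indexer(arr, [True, False, True])
assert isinstance(mask, np.ndarray) and mask.dtype == np.bool_

# ... integer list-likes become platform-integer ndarrays ...
locs = check_array_indexer(arr, [0, 2])
assert locs.dtype == np.intp

# ... and non-list-like indexers (scalars, slices) pass through unchanged.
assert check_array_indexer(arr, 1) == 1
assert check_array_indexer(arr, slice(None)) == slice(None)
```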

@jbrockmendel
Member

I'm a little late weighing in, but I'd prefer to have the function require as tight an argument type as possible (arraylike maybe?). In trying to optimize our lookups, avoidable type checks are some of the lowest-hanging fruit

@jorisvandenbossche
Member Author

It's as simple as undoing the last commit, but you will need to find an agreement with @jreback

In the end, I think how it is now (the list-like + tuple check inside the function) makes it easier to use this (otherwise you need that check every time the function is called).
I would also be fine with two functions: check_array_indexer (which is strict) and check_indexer (which then has the additional type check and otherwise calls check_array_indexer), if that keeps things cleaner. See proposal above #31150 (comment)

In trying to optimize our lookups, avoidable type checks are some of the lowest-hanging fruit

Not sure I understand this point?

@jreback
Contributor

jreback commented Jan 28, 2020

@jorisvandenbossche

actually I was ok with the original

It's as simple as undoing the last commit, but you will need to find an agreement with @jreback

I agree that this should have a tight type check. So it should require an array-like (rather than coercing).

@jorisvandenbossche
Member Author

jorisvandenbossche commented Jan 28, 2020

actually I was ok with the original

I interpreted "personally I would actually relax this and avoid the need to have is_list_like checks before calling this." differently :)
Now, if you check the last commit (1ca35d1), I think it actually made things simpler in the __getitem__ functions.

I agree that this should have a tight type check. So it should require an array-like (rather than coercing).

No, that's not possible (and that's not what the last commit did). It needs to accept things like lists, as those are valid array-like indexers.
The "tighter typing" I mentioned above is about the return value always being an ndarray; the input has always been list-like (and with the latest commit that changed to Any)

@jreback
Contributor

jreback commented Jan 28, 2020

actually I was ok with the original

I interpreted "personally I would actually relax this and avoid the need to have is_list_like checks before calling this." differently :)
Now, if you check the last commit (1ca35d1), I think it actually made things simpler in the __getitem__ functions.

I agree that this should have a tight type check. So it should require an array-like (rather than coercing).

No, that's not possible (and that's not what the last commit did). It needs to accept things like lists, as those are valid array-like indexers.
The "tighter typing" I mentioned above is about the return value always being an ndarray; the input has always been list-like (and with the latest commit that changed to Any)

ok i see it. ok then.

@jreback
Contributor

jreback commented Jan 28, 2020

what about the issue of object array? e.g. does this eliminate the need for is_bool_indexer? or deferring that?

@jorisvandenbossche
Member Author

jorisvandenbossche commented Jan 28, 2020

what about the issue of object array? e.g. does this eliminate the need for is_bool_indexer? or deferring that?

So the new check_array_indexer doesn't allow object dtype as a boolean indexer. Therefore, everywhere we supported this before, is/check_bool_indexer is still used for backwards compatibility in addition to check_array_indexer.
We should deprecate that some day, but right now, with boolean Index still being object dtype, it is too early I think.

@jreback
Contributor

jreback commented Jan 28, 2020

what about the issue of object array? e.g. does this eliminate the need for is_bool_indexer? or deferring that?

So the new check_array_indexer doesn't allow object dtype as a boolean indexer. Therefore, everywhere we supported this before, is/check_bool_indexer is still used for backwards compatibility in addition to check_array_indexer.
We should deprecate that some day, but right now, with boolean Index still being object dtype, it is too early I think.

ok fair enough, maybe just rename is_bool_indexer then (doesn't have to be in this PR), to is_bool_indexer_for_object_array (too long, but that's the idea)

@TomAugspurger
Contributor

Fixed the merge conflict.

@TomAugspurger
Contributor

@jbrockmendel I think your comments in #31150 (comment) can be resolved in some places, but for a general __getitem__ we have to handle the fact that key can have any type / dtype.

Thanks @jorisvandenbossche!

@TomAugspurger TomAugspurger merged commit b1214af into pandas-dev:master Jan 29, 2020
@TomAugspurger
Contributor

@meeseeksdev backport to 1.0.x

@lumberbot-app

lumberbot-app bot commented Jan 29, 2020

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

  1. Checkout the backport branch and update it:
$ git checkout 1.0.x
$ git pull
  2. Cherry-pick the first parent branch of this PR on top of the older branch:
$ git cherry-pick -m1 b1214af3528319e31334ff098021c5a1a6d9ffcd
  3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
$ git commit -am 'Backport PR #31150: API: generalized check_array_indexer for validating array-like getitem indexers'
  4. Push to a named branch:
$ git push YOURFORK 1.0.x:auto-backport-of-pr-31150-on-1.0.x
  5. Create a PR against branch 1.0.x; I would have named this PR:

"Backport PR #31150 on branch 1.0.x"

And apply the correct labels and milestones.

Congratulations, you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

If these instructions are inaccurate, feel free to suggest an improvement.

Successfully merging this pull request may close these issues:

API: positional indexing with IntegerArray