Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1.20.0 RC: change in array coercion for scalar objects with __array_interface__ #17965

Closed
jorisvandenbossche opened this issue Dec 9, 2020 · 14 comments · Fixed by #18197
Closed

Comments

@jorisvandenbossche
Copy link
Contributor

jorisvandenbossche commented Dec 9, 2020

The Shapely package defines a set of scalar geometry objects, but some of those objects also expose its underlying data using the array interface.

Up to now, np.array([..], dtype=object) coercion seemingly didn't check for an __array_interface__ on the elements in the list, so that code like the following to create an object array of such elements worked:

In [1]: from shapely.geometry import LineString

In [2]: line = LineString([(0,0), (1,1), (2,1)])

In [3]: np.array([line], dtype=object)
Out[3]: 
array([<shapely.geometry.linestring.LineString object at 0x7f0e4e41caf0>],
      dtype=object)

In [4]: np.__version__
Out[4]: '1.19.2'

but with the 1.20.0 RC, we now get the following:

In [3]: np.array([line], dtype=object)
Out[3]: 
array([[[0.0, 0.0],
        [1.0, 1.0],
        [2.0, 1.0]]], dtype=object)

In [4]: np.__version__
Out[4]: '1.20.0rc1'

I suppose this might be an intentional change (https://numpy.org/devdocs/release/1.20.0-notes.html#array-coercion-restructure, #16200). But still wanted to raise this, so you are at least aware of it (the release notes say "We are not aware of any such case").

This can be reproduced without shapely with the following example:

class Line:
    def __init__(self, coords):
        self._coords = np.asarray(coords)
    @property
    def __array_interface__(self):
        return self._coords.__array_interface__

line = Line([(0,0), (1,1), (2,1)])
np.array([line], dtype=object)

Although for Shapely itself there isn't any direct broken behaviour (no failing tests with 1.20.0 RC), everybody who is putting Shapely geometries in arrays will get troubles. And specifcially, the GeoPandas package is doing this a lot.

Some notes:

  • I know that having an __array_interface__ on an object that doesn't pretend to be an array, is probably bad practice (it's still a way to expose its data to other libraries using numpy). This is long-standing behaviour of Shapely, but something we are actually planning to deprecate and remove in the near future (among other things, exactly because of playing better together with numpy).
    We are actively working on a "Shapely 2.0" which will remove this array interface (see this Shapely 2.0 RFC section about this aspect). However, a final release for this is at a minimum a few months away.

  • In addition to having an array_interface, a subset of the Shapely geometries are also iterable. For this reason, in GeoPandas, we have already been quite consistently using this pattern to create numpy arrays:

    arr = np.empty(length, dtype=object)
    arr[:] = [... list of geometries ...]
    

    With this pattern, the issue described here can be avoided. However, we didn't yet do this everywhere in GeoPandas (eg in places we were sure we only had simple geometries like points, which are not iterable), so we still get some failures.

@seberg
Copy link
Member

seberg commented Dec 9, 2020

Thanks for the heads up. Yeah, I didn't really expect objects like this, that are partial array-likes.

We could probably add a tiny hack to restore a most of this for now:

def coerce_recursive(object):
    # handle scalars

    arr = None
    if isinstance(object, np.ndarray):
         arr = object

    if is_array_like(object):
         arr = np.asarray(object)

    if arr is not None:
         return coerce_from_array(arr)

    if is_sequence(object):
         return coerce_from_sequence(object)

    # use `object` scalar fallback.

I could probably inject an if not is_sequence(object) in front of the is_array_like check to retain the pre 1.20 behaviour. (The only annoyance is that I am currently not sure if __array__ and __array_interface__ behave differently here, also with respect to 1.19.x changing what 1.18.x did).

Do you think the behaviour should be retained indefinitely, or should we look into trying for a FutureWarning here?

@seberg
Copy link
Member

seberg commented Dec 9, 2020

It is too bad that I don't have public API for dtypes, yet... Otherwise we could probably even create a shaply_as_object abstract DType to write:

np.asarray(nested_sequence, dtype=shapely_as_object)

@jorisvandenbossche
Copy link
Contributor Author

jorisvandenbossche commented Dec 9, 2020

I think eventually removing that capability is certainly reasonable, but if it could be done with a deprecation cycle instead of directly changing it, that would be great (and give us a bit more time to get Shapely 2.0 ready).

@seberg
Copy link
Member

seberg commented Dec 9, 2020

I made PR at gh-17973 which can be used to see how things should end up, still needs tests though.

The one issue is that I can't think of good way for opting into the future behaviour (aside from the type pretending to be a Python sequence, which may not always be correct?). Some rare users may just have to ignore the warning :(.

This does not change the fact that if the object is a sequence (or pretends to be one), the shape is always taken from the array-interface and not by unpacking the sequence. (Although I think that was partially changed in 1.19.x already.)

seberg added a commit to seberg/numpy that referenced this issue Dec 10, 2020
This fixes issue numpygh-17965.  The slightly annoying thing is that
there is no simple way to opt-in to the new behaviour and the old
behaviour is a bit quirky to begin with (honoring the dtype, but
not the shape).
charris pushed a commit to charris/numpy that referenced this issue Dec 18, 2020
This fixes issue numpygh-17965.  The slightly annoying thing is that
there is no simple way to opt-in to the new behaviour and the old
behaviour is a bit quirky to begin with (honoring the dtype, but
not the shape).
@charris
Copy link
Member

charris commented Dec 21, 2020

I'm going to close this. If the issue persists after 1.20.k0rc2 is released, please reopen.

@charris charris closed this as completed Dec 21, 2020
@jorisvandenbossche
Copy link
Contributor Author

jorisvandenbossche commented Jan 14, 2021

Late reply, but thanks a lot for the quick follow-up and nice deprecation!

Justed tested this, and this should work for us. There is still a difference compared to < 1.20: now __array_interface__ is accessed (and so a return value constructed), while before it was never even accessed at all. But that shouldn't be a problem for us (I only noticed it because for a mock in a test the property did raise NotImplementedError).

@jorisvandenbossche
Copy link
Contributor Author

jorisvandenbossche commented Jan 14, 2021

Actually, I spoke too soon. The latest numpy gives a lot of failures in the geopandas test suite ...
The reason is that Shapely actually does raise NotImplementedError in some of the __array_interface__ properties.

A small example without relying on Shapely, adapted from above:

class Polygon:
    def __init__(self, coords):
        self._coords = np.asarray(coords)
    @property
    def __array_interface__(self):
        raise NotImplementedError("Polygon does not expose array interface")

poly = Polygon([(0,0), (1,1), (2,1)])
arr = np.empty((1,), dtype=object)
arr[:] = [poly]

So it's not anymore about the direct array creation with np.array([poly]) (which now raises a deprecation warning), but actually about the "create empty and assign"-workaround we have been using in geopandas to overcome the issues in the np.array(..) constructor (and which gets mentioned in the deprecation warning).

So the above raises with rc2:

In [2]: poly = Polygon([(0,0), (1,1), (2,1)])
   ...: arr = np.empty((1,), dtype=object)
   ...: arr[:] = [poly]
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-2-99f5a9597e5d> in <module>
      1 poly = Polygon([(0,0), (1,1), (2,1)])
      2 arr = np.empty((1,), dtype=object)
----> 3 arr[:] = [poly]

<ipython-input-1-1ca53974a957> in __array_interface__(self)
      4     @property
      5     def __array_interface__(self):
----> 6         raise NotImplementedError("Polygon does not expose array interface")

NotImplementedError: Polygon does not expose array interface

while this worked in rc1.
(and this is actually a more problematic regression for geopandas as the original one I reported, as we are using the above idiom a lot and cannot be easily fixed (apart from getting Shapely 2.0 out of the door sooner))

@jorisvandenbossche
Copy link
Contributor Author

In one of the tests added in #17973, there is a note:

    # NOTE: We actually do still call __array__, etc. but ignore the result
    #       in the end. For `dtype=object` we could optimize that away.

So this doesn't seem to be true in practice? If the result would actually be ignored (or the error catched and ignored), then the issue would be fixed I think?

@charris charris reopened this Jan 14, 2021
@seberg
Copy link
Member

seberg commented Jan 14, 2021

Heh, there was another issue with shapely, which lead us to ignore less errors when calling __array_interface__ (gh-17785), etc. That probably is the reason for this change (i.e. gh-17817).

I am wondering now if I should go the 1.19.x route and only catch clearly "bad" errors (recursion and memory error), or keep it as is here and catch all errors but we add catch NotImplementedError explicitly. I can see that NotImplementedError reads nice. For NumPy an AttributeError would be clearer if you want to say: I don't actually implement this. (for backcompat reasons we should allow it anyway I guess, but in that case, maybe we need to be very lenient for now and maybe even deprecate)

@seberg
Copy link
Member

seberg commented Jan 14, 2021

Oh, I didn't fully parse your last comment. So I guess raising the error may be fine for shapely/you if we ensure there is a special case for object + last dimension reached (which is not relevant except for the assignment case). That special case is probably missing in the dtype discovery code right now. (Note: that would not help with np.array(..., dtype=object), however)

@seberg
Copy link
Member

seberg commented Jan 14, 2021

@jorisvandenbossche to decide how to best fix this:

  • np.array(Polygon(), dtype=object) → Should not error, or is it allowed to error?
  • np.array([Polygon(), dtype=object]) → Same question

Maybe @eric-wieser will also have a quick opinion. The question is whether this NotImplementedError is expected to never show up, or only show up in certain conditions. I take it as granted that:

arr = np.empty(1)
arr = [Polygon()]

must not error, but if we ignore the error in the first two examples, that will fall out without any special handling (although technically, it might be better to special case it anyway, due to the NOTE comment you found).

@jorisvandenbossche
Copy link
Contributor Author

For the first two cases:

np.array(Polygon(), dtype=object) → Should not error, or is it allowed to error?
np.array([Polygon(), dtype=object]) → Same question

I think it would be nice if they don't start erroring now (since it works in released numpy). But if numpy would prefer to bubble up the error, maybe something similar could be done as for the case that __array_interface__ doesn't error: raise a warning that in the future this object will be coerced as np.array(obj)(and thus in this case will effectively raise the error), but for now keep the existing behaviour of not raising.

Now, on the preferred long-term behaviour: never show such a NotImplementedError or not, I am not fully sure. On the one hand, it seems consistent with the current deprecation warning that objects with one of such protocols will "be coerced as if it was first converted using np.array(obj)" regardless of wether the protocol raises or not. On the other hand, numpy could treat such an error as an indication of "I don't want to be converted to an array" (but the library author could in such a case also simply leave out __array_interface__ entirely?).

seberg added a commit to seberg/numpy that referenced this issue Jan 20, 2021
Closes (the later point) in numpygh-17965 and reverts parts of
numpygh-17817.
Shapely did rely on being able to raise a NotImplementedError
which then got ignored in the attribute lookup.
Arguably, this should probably just raise an AttributeError to
achieve that behaviour, but it means we can't just rip the band-aid
off here.

Since 1.20 is practically released, just reverting most of the change
(leaving only recursion and memory error which are both arguably
pretty fatal).
Ignoring most errors should be deprecated (and I am happy to do so),
but it is not important enough for 1.20 or very important in itself.

Closes numpygh-17965
@seberg
Copy link
Member

seberg commented Jan 20, 2021

OK, just opened a small PR to just revert it. I am very happy to add the deprecation (or even the more complex logic), but it just isn't very important, so doing anything for 1.20 already seems not even worth back porting release note additions :).

charris pushed a commit to charris/numpy that referenced this issue Jan 21, 2021
Closes (the later point) in numpygh-17965 and reverts parts of
numpygh-17817.
Shapely did rely on being able to raise a NotImplementedError
which then got ignored in the attribute lookup.
Arguably, this should probably just raise an AttributeError to
achieve that behaviour, but it means we can't just rip the band-aid
off here.

Since 1.20 is practically released, just reverting most of the change
(leaving only recursion and memory error which are both arguably
pretty fatal).
Ignoring most errors should be deprecated (and I am happy to do so),
but it is not important enough for 1.20 or very important in itself.

Closes numpygh-17965
@jorisvandenbossche
Copy link
Contributor Author

I checked that the latest master, and geopandas tests are passing again now (didn't check with the 1.20.x maint branch, but assuming that that will be OK as well since the fix is backported). Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants