Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Make it possible to call .view on object arrays #8514

Closed
wants to merge 6 commits into from

Conversation

eric-wieser
Copy link
Member

@eric-wieser eric-wieser commented Jan 21, 2017

Right now you can do this with primitive types:

>>> np.zeros((2, 3), dtype=np.int).view([('a', np.int,3)])
array([[([0, 0, 0],)],
       [([0, 0, 0],)]], 
      dtype=[('a', '<i4', (3,))])

This adds

>>> np.zeros((2, 3), dtype=object).view([('a', object,3)])
array([[([0, 0, 0],)],
       [([0, 0, 0],)]], 
      dtype=[('a', 'O', (3,))])

Which would previously error

This makes it possible to use np.unique on 2d object arrays

@seberg
Copy link
Member

seberg commented Jan 21, 2017

Not sure what exactly is in it, but you might want to check #5508 before continuing.

@seberg
Copy link
Member

seberg commented Jan 21, 2017

Ah, sorry, nvm. I remembered there was sometihng about relaxing view, but that one is not about objects.

@eric-wieser
Copy link
Member Author

This is simply aiming to change _view_is_safe, which is coming in a future commit

@@ -355,15 +355,37 @@ def _view_is_safe(oldtype, newtype):
If the new type is incompatible with the old type.

"""
from numpy.lib.type_check import find_dtype_offsets
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This import feels kinda hacky, but it can't go globally. Should I put this function somewhere else?

oldoffsets = find_dtype_offsets((oldtype, oldrepeats), object_)

# everything matches - we're good
if len(newoffsets) == len(oldoffsets) and all(newoffsets == oldoffsets):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better way of doing this check? Converting to list?

@eric-wieser eric-wieser force-pushed the view-compound-object branch 2 times, most recently from 7bc42b9 to a558e1a Compare January 21, 2017 15:17
Useful for finding the location of custom dtype objects, or np.object_ members
to improve the behaviour of .view
@eric-wieser eric-wieser changed the title Make it possible to call .view on object arrays ENH: Make it possible to call .view on object arrays Jan 21, 2017
…n objects

This makes unique work on object arrays
@ahaldane
Copy link
Member

ahaldane commented Jan 22, 2017

You can also look at #5548, which was an attempt to do something similar. However it was largely reverted in #6562 because it turned out to be too big a performance hit to look up all the object positions, the way I implemented it. I welcome an attempt to fix this, btw.

In #5548, the methods _get_all_field_offsets is probably similar to your find_dtype_offsets, and I also implemented _view_is_safe quite similarly to how you do here.

There are also tests there I wrote which may be useful here.

@eric-wieser
Copy link
Member Author

eric-wieser commented Jan 23, 2017

However it was largely reverted in #6562 because it turned out to be too big a performance hit to look up all the object positions

What even is a "performance hit" when the result previous to that patch was an exception? Is it simply the lack of the hasobject short circuit present here that reverted it?

@ahaldane
Copy link
Member

ahaldane commented Jan 23, 2017

The problem was due to "hidden" objects, as I discussed with an example in the first comment in #5548. The possibility of hidden objects means I needed to do a computationally expensive overlap check even if the HASOBJECTS flag is false. In #6208 I tried to limit the cases in which the overlap needs to be checked, but it wasn't enough, see the function _may_have_objects there.

As I vaguely recall, when the performance problem became a big enough issue I realized that the other changes I had made over the course of those PRs actually fixed most of the issues I had originally indended to fix, without needing object views. Therefore I was perfectly happy to simply disable object views and remove the safety checks.

I think it would be great to re-enable object views, though. I think it would fix a number of open issues. It's very likely there are fixes to the hidden object problems I encountered which I didn't explore properly, and some fresh eyes on the problem are very welcome.

Maybe a solution is to make the HASOBJECTS flag better reflect whether the array contains objects....? I'm not sure, because that would probably also mess with the reference counting.

@eric-wieser
Copy link
Member Author

eric-wieser commented Jan 23, 2017

So right now, this is a somewhat more restrictive version of your PR, as it doesn't allow object members to be hidden in a view, whereas yours just stopped members being reinserted - but as a result, presumably comes at less of a performance cost.

Your issue was the extra time in executing _getfield_is_safe, I assume? Am I right in thinking I am OK to leave that untouched if aiming to simply fix views?

@ahaldane
Copy link
Member

ahaldane commented Jan 23, 2017

That's true, a more restrictive version could avoid the problems.

And yes, the performance cost was in _getfield_is_safe, but particularly in the bye-by-byte checks I did which you might not need in the restricted version.

Also, I think a perfomance hit is acceptable as long as it only affects object arrays. My problem was it affected non-object structured arrays too.

@eric-wieser
Copy link
Member Author

@ahaldane I'll go ahead and copy the tests from that pull request then.

Would you prefer me to make find_dtype_offsets not part of the public API, as you did? Or change it back to your variant that returns unfiltered fields? There's definitely a performance boost for not wasting time finding the offsets of (np.float, (64, 64, 64)) when all you care about is object arrays

Such as find_dtype_offsets( (float, (1024, 1024, 1024), object )
Adds back tests for object array views. Marks the ones that deliberately fail with knownfailif, for now
@eric-wieser
Copy link
Member Author

@ahaldane: Ok, your tests are copied across

@ahaldane
Copy link
Member

I don't have a strong opinion about where find_dtype_offsets should go, if you think it should be public that's fine with me.

I needed the non-object fields to do byte-overlap checks; if you don't need them it sounds good to optimize them out.

@eric-wieser
Copy link
Member Author

@ahaldane: So do you think this needs further changes?

@ahaldane
Copy link
Member

Oh I didn't realize it was finished.. I'm taking a look now.

return
# no object members is fine
if not newtype.hasobject and not oldtype.hasobject:
return
Copy link
Member

@ahaldane ahaldane Feb 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the part I am still thinking about, becuase of the problem of "hidden objects".

It's true that currently, I don't think there is a way to create hidden objects because most operations involving objects are disallowed or make copies, but we want to relax those restrictions.

We will have a problem when trying to solve #5994, as I attempt in my PR #6053. That is, once we allow multi-field views like:

>>> a = np.zeros(5, dtype='O,O,O')
>>> b = a[['f0','f2']]
>>> b.dtype
dtype({'names':['f0','f2'], 'formats':['O','O'], 'offsets':[0,16], 'itemsize':24})

then b will have a hidden object at byte 8. That will be true of any implementation of #5994, besides #6053. (Currently the second line returns a copy instead of a view, thus avoiding hidden objects, but that is what we want to change).

I am trying to figure out the easiest way to get both this PR and multi-field views to interact safely. (this may involve setting limitations on the plans in #5994).

Note how this PR would allow this view, using b from my example above:

>>> b.view('O,p,O')
>>> b['f1']  = 0   # segfault?

Perhaps the solution is to disallow multi-field views if it would create hidden objects? That also seems a little wonky. I need to think a bit more about it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you meant to comment on this line, because it matches the old behaviour here.

I think I see the problem now. It doesn't exist yet, right, but would be a consequence of allowing partial views.

But as it stands, this pr doesn't allow the creation of those partial views in the first place. I believe this pr allows a strict subset of the things a complete partial-view implementation would allow

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I guess the result would be that all this gets thrown out when partial views do make it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps the best solution is just to store a reference to the original dtype for partial views, or simply a list of hidden object offsets?

Copy link
Member

@ahaldane ahaldane Feb 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about this:

First, I think we only need some relatively minor updates (in a future PR, when multi-field views are enabled) so that the hasobject field is guaranteed to be True if the dtype has objects including hidden objects. Thus, whenever taking a view of an array with objects, the view dtype will have hasobject=true even if the objects are hidden.

This is actually often the case as implemented in #6053 - with c = np.zeros(4, dtype="p,O,p")[['f0','f2']], I see that c.dtype.hasobject is True. However, I'm pretty sure I saw other cases where I can get hasobject to be false if there are hidden objects. Those would need to be fixed.

Then, this PR also needs a modification: The view should only be allowed if all the fields (not just object fields) are present at the right position in the view. This way b.view('O,p,O') is disallowed since the p is missing in a. (And if hasobject=False for both inputs, the view is allowed).

I think that change to this PR still allows most of the use-cases you've pointed out.

@charris
Copy link
Member

charris commented Sep 25, 2018

@eric-wieser Needs rebase. What do you want to do with this?

return (base_offsets[:,None] + sub_offsets).ravel()

# record type - combine the fields
if dtype.fields:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for when i get back to this: Needs is not None, and a test for np.dtype([])

Base automatically changed from master to main March 4, 2021 02:03
@charris charris added the 52 - Inactive Pending author response label Apr 6, 2022
@charris charris closed this Apr 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants