Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Make no unshare mask future warnings less noisy #7363

Merged
merged 1 commit into from
Mar 19, 2016

Conversation

seberg
Copy link
Member

@seberg seberg commented Feb 28, 2016

Mostly a rebase of the other one, mostly in case anyone still cares (yeah, I can't stop myself from just not liking ):

  • I don't care about the refcount logic, but it reduces noise. Actually "fixes" the spurious warnings that those extra _sharedmask = False calls also fix in extras.py. Those warnings are caused by intermediate arrays without any user control.
  • I don't care about a MaskedArrayFutureWarning as its own class, I believe at the point where we feel that we have to make it easier to suppress a warning, there is probably something not quite right (the warning should be avoided not silenced). But it left it because it does not hurt.
  • Removed the "oldmask" logic. I think one warning per array should be enough in all cases and I doubt it will help with debugging possible issues much (or that anyone would bother anyway).
  • Left the pointer to setting _sharedmask = False it is messing with internals, so it is ugly and I don't mind removing it.
  • Current version is hit a lot by things that never call unshare_mask, i.e. field-indexing, hardmasks assign "masked". Putting the warning where the call is just does not break my brain and gets down the warning hits in the test suit to something like 6 (~28 without the refcount logic), from 65 hits (counting test failures when turning to error).

the mask of ``child`` is a view into the mask of ``ma_array``.
Also when creating ``ma_array = MaskedArray(data, mask)`` will
make ``ma_array`` view the data in ``mask`` and setting an item
will currently copy ``mask`` but in the future modify the original.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few issues with this text:

  • it's in the wrong place --it's currently under "things that will happen in 1.12", but there's no way we're going to flip the switch on this in 1.12, it's way too early.
  • the description here is hard to understand i think. "Setting items of the array using the square bracket notation" to me sounds like __setitem__. I'm not 100% sure what the part about MaskedArray is saying either. Maybe: "Currently, taking a view of a masked array produces a confusing result. For example, if we write masked_view = masked_original[:], then masked_view's data array will be a view of masked_original's data array, so modifications to one array's data will also affect the other: masked_view[0] = 123; assert masked_original[0] == 123. But currently, the mask array is copied, rather than taking a view, so changes to one array's mask will not affect the other: masked_view[0] = np.ma.masked; assert masked_original[0] is not np.ma.masked. A similar situation happens when explicitly constructing a masked array using MaskedArray(data, mask) -- the returned array will have a view of data but a copy of mask. In the future, these cases will be normalized so that the data and mask arrays are treated the same way, and modifications to either will propagate between views. In 1.11, numpy will issue a whatever class it is warning whenever user code modifies the mask of a view. To silence these warnings, and make your code robust against the upcoming changes, you have two options: if you want to keep the current behavior, call masked_view.unshare_mask() before modifying the mask. If you want to get the future behavior early, do masked_view._sharedmask = False."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: do unshare_mask and/or _sharedmask work in older versions of numpy? If so then we should say that, because downstream will be working about how to handle both old numpy and 1.11 in the same code and that will make their lives easier.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, there is a twist. The mask is a view until you call setitem (or unshare_mask explicitly). I am trying to adapt your text a bit to be clear on this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, but frankly, I am not sure it can be understood. It is tricky business....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that logic was there since a very long time (probably forever). but setting _sharedmask to false, might break explicit calls to unshare_mask, hmmpf, not hat I assume there are many, but still.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could just document that unshare_mask is unreliable after this, I can't
imagine people will want to intentionally be using both. Could also make at
least the new version of unshare_mask copy unconditionally (or maybe it
already does this) in preparation for removing _sharedmask logic entirely.
On Feb 28, 2016 11:53, "seberg" notifications@github.com wrote:

In doc/release/1.11.0-notes.rst
#7363 (comment):

@@ -41,8 +41,21 @@ Future Changes
The following changes are scheduled for Numpy 1.12.0.

  • Support for Python 2.6, 3.2, and 3.3 will be dropped.
    -* Slicing a MaskedArray will return views of both data and mask.
    • Currently the mask is returned as a copy.
      +* When setting items of a MaskedArray using the square bracket
    • notation NumPy will in the future never copy the mask. Some
    • examples of this are child = ma_array[::2]; view[3] = 3, where
    • the mask of child is a view into the mask of ma_array.
    • Also when creating ma_array = MaskedArray(data, mask) will
    • make ma_array view the data in mask and setting an item
    • will currently copy mask but in the future modify the original.

Yeah, that logic was there since a very long time (probably forever). but
setting _sharedmask to false, might break explicit calls to unshare_mask,
hmmpf, not hat I assume there are many, but still.


Reply to this email directly or view it on GitHub
https://github.com/numpy/numpy/pull/7363/files#r54355225.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, added a mention. No, it only copies when the _sharedmask is True. This is the case far more often then necessary, but I think it is reliable to always copy the mask if it may be necessary (to not propagate to parent). As you said before, there might be cases where unshare mask may be wanted, though I guess in those cases an unconditional copy is likely to be perfectly fine, so making it unconditional may be an option, hmmmm (we would just not call it unless also giving the warning otherwise).

@njsmith
Copy link
Member

njsmith commented Feb 28, 2016

Overall:

  • the release notes text in this PR needs some fixing, but so does the text that's currently in 1.11, I think.
  • the warning logic here seems clearer than that in master -- I'm not sure what oldsharedmask is about, and with this PR I don't have to figure it out :-)
  • agreed that the refcnt check is klugey and optional. But I guess if it's a temporary measure that reliably increases the specificity of the warning messages then I'm +0.5.
  • the changes to extras.py seem obviously beneficial (not sure how we were getting away without these before? Are these functions not tested?)

So modulo comments above my impression is that something like this should probably be merged.

Cc @jakirkham - thoughts?

@seberg
Copy link
Member Author

seberg commented Feb 28, 2016

@njsmith we are "getting away" with it, because our test suit is broken when it comes to warning testing, especially FutureWarnings, see my suppressed warnings PR ;).

Incorporates Nathaniels suggestions for a longer explanation
in the release notes.
@seberg
Copy link
Member Author

seberg commented Mar 1, 2016

Bah, one other annoying thing, though maybe we can ignore it. The mask is of course never a view if it is nomask.

@jakirkham
Copy link
Contributor

So, I haven't had lots of time to take a close look. Though my impression after looking through is this looks very similar to the previous version ( #7187 ). My criticisms are pretty much the same.

Not a fan of reference counting as this could result in weird behavior in terms of performance and validity (of warning). Concerned as this will not raise warnings when there are views of views (and so on). This is what _oldsharedmask was for @njsmith. There seem to be some functions that are tweaking internal state _sharedmask, which seems at best risky. Though those could probably be easily fixed by calling unshare_mask instead. Finally, it looks like a change is being added to the covariance/correlation computation, which seems orthogonal to the warning (feel free to correct me). Maybe that is a separate desirable change, but it should probably be discussed first and independently of this change.

Though the big question for me right now is whether this change is needed. I haven't heard any complaints about the warning level on the initial issue. Though that doesn't mean that there are no complaints. So, my question is, are there still problems downstream? Are there problems here? If so, what are they?

@jakirkham
Copy link
Contributor

Bah, one other annoying thing, though maybe we can ignore it. The mask is of course never a view if it is nomask.

This appears to already be addressed here thanks to the fact that the first branch rules that out ( https://github.com/numpy/numpy/pull/7363/files#diff-e85f023d9c07df014a190b485db72c3dR3229 ).

@seberg
Copy link
Member Author

seberg commented Mar 1, 2016

@jakirkham, it annoys me because it is inconsistent in the final behaviour. Your desired end result is, that mask changes tend to propagate back to the original array, but they won't in the case of nomask.

As to the rest, as I said, the reference counting is only a further filter, and yes, it is safe even for views of views, because that is what it checks. That only "self" holds a reference to the mask and the mask is not a view. If someone had a view into that same mask, they would also -- indirectly -- have a reference to it. Feel free to proof me wrong, but I predict unless you use wekrefs you won't find a loophole. You are right about bad code smell, but I will just claim my nose is better trained.

If you check carefully, all the tweaks in _sharedmask are for temporary arrays, and they are necessary (though probably not with the refcount in place) since they cause warnings that users cannot avoid, e.g. [1]. As for the removal of the unshare_mask calls, they are obviously not needed, since the mask is replaced in the following line, and all unshare mask does is replace the mask and set _sharedmask to False.

About _oldsharedmask, if you really believe it is necessary, could you please explain if it can possibly do more then keep warning for the same masked array? That part I can see, but I don't really see it as necessary (and I am not even sure I think it is not worse).

[1]:

# have import warnings; warnings.simplefilter("always") in place.
In [43]: m = np.ma.masked_greater(np.random.random((10, 10)), 0.5)
In [44]: np.ma.median(m, axis=0)
/home/sebastian/.local/lib/python2.7/site-packages/numpy/ma/core.py:3195: MaskedArrayFutureWarning: Currently, slicing will try to return a view of the data, but will return a copy of the mask. In the future, it will try to  return both as views. This means that using `__setitem__` will propagate values back through all masks that are present.
  MaskedArrayFutureWarning
Out[44]: 
masked_array(...)

In [45]: np.ma.corrcoef(m)
/home/sebastian/.local/lib/python2.7/site-packages/numpy/ma/core.py:3195: MaskedArrayFutureWarning: Currently, slicing will try to return a view of the data, but will return a copy of the mask. In the future, it will try to  return both as views. This means that using `__setitem__` will propagate values back through all masks that are present.
  MaskedArrayFutureWarning
+ extra warnings.
Out[45]: 
masked_array(...)

In [46]:  np.nanmedian([[np.nan], [np.nan]], axis=1)
/home/sebastian/.local/lib/python2.7/site-packages/numpy/ma/core.py:3195: MaskedArrayFutureWarning: Currently, slicing will try to return a view of the data, but will return a copy of the mask. In the future, it will try to  return both as views. This means that using `__setitem__` will propagate values back through all masks that are present.
  MaskedArrayFutureWarning
/home/sebastian/.local/lib/python2.7/site-packages/numpy/lib/nanfunctions.py:769: RuntimeWarning: All-NaN slice encountered
  warnings.warn("All-NaN slice encountered", RuntimeWarning)
/home/sebastian/.local/lib/python2.7/site-packages/numpy/lib/nanfunctions.py:769: RuntimeWarning: All-NaN slice encountered
  warnings.warn("All-NaN slice encountered", RuntimeWarning)
Out[46]: array([ nan,  nan])

@jakirkham
Copy link
Contributor

it annoys me because it is inconsistent in the final behaviour. Your desired end result is, that mask changes tend to propagate back to the original array, but they won't in the case of nomask.

I thought you were musing over quieting the warning by a check for nomask. You are right. @mhvk raised this as well and IIRC suggested that we look into getting rid of that next.

As to the rest, as I said, the reference counting is only a further filter...

Yeah, still feel like ref counting is a bit sketchy. Not sure how much we can trust this given how masked arrays are constructed.

If you check carefully, all the tweaks in _sharedmask are for temporary arrays, and they are necessary (though probably not with the refcount in place) since they cause warnings that users cannot avoid...

Again haven't had lots of time to take a close look. It seems like a call to unshare_mask would have the same effect here and be less sketchy though. This is just my gut reaction.

About _oldsharedmask, if you really believe it is necessary, could you please explain if it can possibly do more then keep warning for the same masked array

When I have more time I'll try to give a better explanation. To say it simply and roughly, it tries to mimic what the future behavior will be. I suppose it could be tweaked more...

Though when these improvements were contemplated the question was raised about how accurate we really want to make the warning and whether more accurate might not be worse by lulling people into reliance on a nearly (not entirely) accurate warning. This brings me to my original question, which I am still not seeing the answer to. What problems are we still solving? If it is to silence a few false alarms in covariance, median, etc., maybe we should just silence those by following the recommend procedure from before namely calling unshared_mask in those places.

@charris
Copy link
Member

charris commented Mar 16, 2016

What to do about this? Looking at Pauli's tests I see two occurances of MaskedArrayFutureWarning, one each in MatPlotLib and StatsModels, which doesn't seem much. Whether they are valid warnings, I don't know, but they vanish in comparison to the other warnings. @njsmith The _oldsharedmask thing is pretty straight forward once you figure out what is going on, which admittedly took me a while. A bit more explanation in the PR would probably have helped there. All in all, I'm inclined to leave things as they are, although I have some niggling doubts as to whether we really want to make the change.

There is mention of other independent parts of this PR that look desireable. Could someone expand on that?

@jakirkham
Copy link
Contributor

Sorry, I was trying to be clever using _oldsharedmask, but it does make things a bit difficult to follow at first pass.

@jakirkham
Copy link
Contributor

As for as the proposal presented here, below are my thoughts. Others may of course think differently.

Seem like good ideas:

  • The release notes write up seems nice.
  • The update warning text could be included.
  • The stacklevel flag for the warning could be added.

Subject of discussion:

  • Movement of the warning into the branch.

Seem like not good ideas:

  • Reference counting for warning determination.
  • Workarounds using "private" member variables.
  • Suggested use of "private" member variables.

@njsmith
Copy link
Member

njsmith commented Mar 16, 2016

Movement of the warning into the branch.

My impression is that the main effect of this is to put the warning directly onto the same code path that is actually going to be changed in the future, which is usually the simplest and most accurate way to do things. Is that wrong? Can you elaborate on what the actual trade-offs are here?

Reference counting for warning determination.

It seems safe and accurate to me, but meh, whatever, I'm not so excited that I want to argue about it :-)

Workarounds using "private" member variables.
Suggested use of "private" member variables.

IIUC, what you're saying here is that you don't like the recommendation that users do marr._sharedmask = False if they want to silence the warning by opting-in to the future behavior early. (a) do I in fact understand correctly? (b) if I do, then do you have some other suggestion for how to do this? It's indeed unfortunate that it's an underscore member variable, but sometimes these things happen, and AFAIK it is safe and works correctly for all past and future versions of numpy, which is an extremely valuable property...

@saimn
Copy link
Contributor

saimn commented Mar 16, 2016

Hi,
I and a colleague were scratching our heads to try to get a mask shared with a masked array, so this upcoming change is a great news ! I just have one remark about this sentence in the changelog:

If you want to get the future behavior early, use masked_view._sharedmask = False

At first when reading this PR I thought we would have to wait for 1.12, and this sentence was not clear because of the meaning of "_sharedmask": setting it to False seems like it prevent the mask to be shared, though it is actually the contrary. So now that I read the code I understand, but maybe the sentence should be more explicit ?
Also it seems that this workaround can be compatible with at least 1.10 ?

@seberg
Copy link
Member Author

seberg commented Mar 16, 2016

If anyone wants me to do explicit fixup stuff here (i.e. remove the
refcounting check or whatever), I will do it.
My last argument: I have gh-7099, and posted the numbers here. I feel
it is easier to add this here and putting in <10 (ok, wihtout
refcounting check it was quite a bit worse) suppressions into our tests
than putting in almost 70. Could try to figure out module wide
suppressions, but I am not sure it works reliably. Plus it helps
others.

The general point whether we want it. Dunno, in general I think I
prefer the behaviour, but there are also inconsistencies with MAs that
do not have anything masked, so I would prefer if actual users like
@ahaldane spoke up.

@ahaldane
Copy link
Member

I've been meaning to review this but just haven't had time! I'll take a look once I get a chance.

(by the way I wouldn't consider myself a MaskedArray user really... I got interested in MaskedArrays due to changes I wanted to make elsewhere in numpy, which snowballed)

@charris
Copy link
Member

charris commented Mar 18, 2016

I'm going to stick with what we have for 1.11.0. If there are internal uses that raise warnings, we should deal with that at some point, but the number of warnings currently generated downstream don't strike me as excessive. I will try to update the release notes, but with some editing as I find the version here confusing, but otherwise an improvement as it offers a more extended explanation.

@charris
Copy link
Member

charris commented Mar 18, 2016

Updated release notes in #7430, comments welcome.

@charris
Copy link
Member

charris commented Mar 19, 2016

@seberg @jakirkham I'm thinking we might solve the nomask problem by making the _mask attribute a property and using an object that may, or may not, contain an actual mask. If no mask is contained the property getter will return nomask, otherwise an actual mask. Since a reference to that object can be retained in a view, the setter can set the mask in the view and it will show up in the original. As the maxim goes, "All problems in computer science can be solved by another level of indirection".

EDIT, We might want to control access with a lock, or maybe not, we don't bother to do that with ordinary views and I don't know if array assignments or property getters/setters are atomic.

@seberg
Copy link
Member Author

seberg commented Mar 19, 2016

Hehe, yeah but another level of indirection would make it even harder
to maintain ;). I am sure if you fixed up the confusing stuff in the
current text, it is good.

Ceterum censeo, that moving the place to warn has only advantages,
and those silencing _sharedmask = False things are "obviously correct"
(i.e. on temporaries) and really should be there.
But I really don't want to delay the release over this, though I always
hoped that you guys would rethink and simply agree....
But "obvious" in the eye of the beholder I guess ;).

charris added a commit that referenced this pull request Mar 19, 2016
ENH: Make no unshare mask future warnings less noisy
@charris charris merged commit a7c8600 into numpy:master Mar 19, 2016
@charris
Copy link
Member

charris commented Mar 19, 2016

I went and checked both Matplotlib and StatsModels for both versions and this PR does a better job. So merging. Thanks Sebastian.

@homu homu mentioned this pull request Mar 19, 2016
@jakirkham
Copy link
Contributor

I and a colleague were scratching our heads to try to get a mask shared with a masked array, so this upcoming change is a great news !

@saimn, that warms my heart. 😄

@jakirkham
Copy link
Contributor

@charris, ok, if you think it is better, I trust your judgement.

@charris
Copy link
Member

charris commented Mar 19, 2016

In particular, the StatsModel warning went away, which was good because it didn't seem associated with any test. For Matplotlib, the warning was in a different place and looked valid whereas the warning without this PR did not look correct. Not sure what the difference was there. Setting the stacklevel also helped.

@charris
Copy link
Member

charris commented Mar 19, 2016

@jakirkham I figured the only valid way to settle the question was to gather some actual data and it does look like this PR performs better.

EDIT: I was hoping they would both work the same, but it didn't turn out that way.

@jakirkham
Copy link
Contributor

No need to keep selling it. 😄 I have learned my lesson after the disaster that was my first attempt at this warning. 😉

@ahaldane
Copy link
Member

A little late here since you just merged, but I was just looking through the history of changes here and this PR looks ok to me.

I was at first a little worried about the refcounting, but @seberg's comment that this is merely a further filter on the warning satisfied me: There are no cases where we should emit a warning but don't (it can't happen that the mask is really shared but appears unshared by refounting), although there are still some rare cases where we probably don't need to warn but do (eg if the users manually makes a reference to arr._mask, but doesn't write to it).

@charris
Copy link
Member

charris commented Mar 19, 2016

@seberg Just to be sure, assignment to a masked data value can also change the mask, does this cover that case?

@seberg seberg deleted the masked-silence-2 branch March 19, 2016 19:11
@seberg
Copy link
Member Author

seberg commented Mar 19, 2016

@charris sorry, a bit daft here, can you give an example? My logic here always was: Put the warning where the future change is and trust that it must be right.

@charris
Copy link
Member

charris commented Mar 19, 2016

@SeberT Actually, looks like a miss...

In [1]: import warnings

In [2]: warnings.simplefilter('always')

In [3]: x = ma.array(eye(2), mask=1)

In [4]: x
Out[4]: 
masked_array(data =
 [[-- --]
 [-- --]],
             mask =
 [[ True  True]
 [ True  True]],
       fill_value = 1e+20)

In [5]: x[0,0] = 1

In [6]: x
Out[6]: 
masked_array(data =
 [[1.0 --]
 [-- --]],
             mask =
 [[False  True]
 [ True  True]],
       fill_value = 1e+20)

No warning issued.

@charris
Copy link
Member

charris commented Mar 19, 2016

NVM, bad example...

EDIT: Looks OK.

@seberg
Copy link
Member Author

seberg commented Mar 19, 2016

Heh, almost difficult to get a good example:

In [1]: mask = np.ones((2, 2), dtype=bool)

In [2]: import warnings; warnings.simplefilter("always")

In [3]: x = np.ma.MaskedArray(np.eye(2), mask)

In [4]: x[0, 0] = 1
runtests.py:1: MaskedArrayFutureWarning: setting an item on a masked array which has a shared mask will not copy the mask and also change the original mask array in the future.
Check the NumPy 1.11 release notes for more information.
  #!/usr/bin/env python

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants