Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

plot line not shown in some cases involving masked arrays #5016

Closed
knarrff opened this issue Sep 1, 2015 · 16 comments
Closed

plot line not shown in some cases involving masked arrays #5016

knarrff opened this issue Sep 1, 2015 · 16 comments

Comments

@knarrff
Copy link

knarrff commented Sep 1, 2015

The following code is a short test of the much longer problem in my code:

import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)

a=np.arange(2428, dtype=np.float64)
a[0:1553] = 1.96808407167e+243
a=np.ma.masked_greater(a, 1.e100)

ax.plot(a[678:], a[678:], linestyle=':', label='works')
ax.plot(a, a, label='does not work')
ax.plot(a[677:], a[677:], label='also does not work')

ax.legend()
plt.show()

I create an array of dtype float64, and mask part of it (to mimic the data I found this with). I would then expect all plots to show something (and usually just use the second of the three here). Leaving out part of the masked values seems to help (first plot), but that part cannot be too small (third plot). Interestingly, even if plot 1 is commented out, the limits of the axes are correct: plot() itself seems to understand what it is supposed to be doing. It is just the 'line' that does not show up in the final image (neither with show() now savefig()).

At this point I am out of ideas, this suggesting somehow strongly to be a bug outside of this script.

This happens on a Debian Jessi machine, with both combinations of:

python 2.7.9 + numpy 1.8.2 + matplotlib 1.4.2
python 3.4.2 + numpy 1.8.2 + matplotlib 1.4.2

@WeatherGod
Copy link
Member

Confirmed in recent master, too. Even more disconcerting:

>>> p1 = l1.get_path()
>>> p2 = l2.get_path()
>>> p3 = l3.get_path()
>>> p1
Path(array([[   nan,    nan],
       [   nan,    nan],
       [   nan,    nan],
       ..., 
       [ 2425.,  2425.],
       [ 2426.,  2426.],
       [ 2427.,  2427.]]), None)
>>> p2
Path(array([[   nan,    nan],
       [   nan,    nan],
       [   nan,    nan],
       ..., 
       [ 2425.,  2425.],
       [ 2426.,  2426.],
       [ 2427.,  2427.]]), None)
>>> p3
Path(array([[   nan,    nan],
       [   nan,    nan],
       [   nan,    nan],
       ..., 
       [ 2425.,  2425.],
       [ 2426.,  2426.],
       [ 2427.,  2427.]]), None)
>>> list(p1.iter_segments())
[(array([ 1553.,  1553.]), 1), (array([ 2427.,  2427.]), 2), (array([ 2427.,  2427.]), 2)]
>>> list(p2.iter_segments())
[(array([ 1553.,  1553.]), 1), (array([ 2427.,  2427.]), 2), (array([ 2427.,  2427.]), 2)]
>>> list(p3.iter_segments())
[(array([ 1553.,  1553.]), 1), (array([ 2427.,  2427.]), 2), (array([ 2427.,  2427.]), 2)]

@tacaswell tacaswell added this to the next point release milestone Sep 1, 2015
@tacaswell
Copy link
Member

#4525 similar problem, probably not related.

@tacaswell
Copy link
Member

And this is something specific to the handling of masked arrays, if you fill with np.nan it works correctly.

import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)

a=np.arange(2428, dtype=np.float64)
a[0:1553] = 1.96808407167e+243
a=np.ma.masked_greater(a, 1.e100)
a = a.filled(np.nan)
ln1, = ax.plot(a[678:], a[678:], linestyle=':', label='works')
ln2, = ax.plot(a + 1, a, label='does not work')
ln3, = ax.plot(a[677:] + 2, a[677:], label='also does not work')

ax.legend()
plt.show()

My guess is that the bug is someplace in https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/lines.py#L597 but I don't have time to track it down right now.

@mdboom
Copy link
Member

mdboom commented Sep 1, 2015

Data point: This goes back at least as far as 1.3.1.

@mdboom
Copy link
Member

mdboom commented Sep 1, 2015

This turned out to be a fun one. It's related to the "subslicing" optimization here:

https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/lines.py#L635
https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/lines.py#L699

The purpose of the subslicing optimization is to clip the path to the bounds of the axes before passing it along to the renderer. It requires that the path is sorted in the x dimension (and the test for determining whether a path is sorted gives the wrong result on masked arrays), and then it does a binary search (numpy.searchsorted) to find the end points of the path that are within the axes. That binary search fails when the path contains missing values (which are NaNs, and therefore have weird ordering semantics I think searchsorted is is searching on the unmasked array values, which obviously can give the wrong answer). You "get lucky" if the number of missing values at either end is even, but when it's odd the whole path gets clipped. Since writing a searchsorted to ignore missing values would be prohibitively slow without a C extension, I think it's best to just turn off subslicing for all masked arrays. A PR will come soon...

@mdboom
Copy link
Member

mdboom commented Sep 1, 2015

Cc: @efiring: I think you are the author of the subslicing feature. Any other thoughts or workarounds?

@efiring
Copy link
Member

efiring commented Sep 1, 2015

Yes, I wrote that a long time ago. In the interim, quite a few things have changed, including the handling of nans by searchsorted, and a gradual shift toward using nans internally in mpl. It looks to me like the thing to do here is modify recache to use .filled(np.nan) on masked arrays, and modify _is_sorted to use `np.nanmin(np.diff(x))'.

@mdboom
Copy link
Member

mdboom commented Sep 1, 2015

I don't think the fact that searchsorted handles NaNs now actually buys us much in this case, however. From the Numpy docs:

Previous to numpy 1.4.0 sorting real and complex arrays containing nan values led to undefined behaviour. In numpy versions >= 1.4.0 nan values are sorted to the end.

So the missing values would all have to be on the right-hand-side in order to be usable for subslicing. I think perhaps it's better to just disable subslicing on anything with missing values (which is what the current is_sorted test of np.amin(x[1:] - x[0:-1]) >= 0 effectively does, but np.nanmin(np.diff(x, axis=0)) does not.)

@efiring
Copy link
Member

efiring commented Sep 1, 2015

Good point; I hadn't paid attention to what np.searchsorted actually does in the presence of nans, which turns out to be useless.
We could support subslicing with nans and with masked arrays by generating self._x_filled in recache, using interpolation to fill the gaps. Then searchsorted would be happy operating on the filled array. The performance cost would be for the interpolation when recache is run, in cases where there is at least a potential for subslicing (x is large and sorted, apart from nans). It would require very few LOC.
Reminder for any readers who may wonder why subslicing is there at all: it is to speed up panning with a small segment of a very long time series.

@tacaswell
Copy link
Member

I don't think you need interpolation per say, just drop the masked points.

On Tue, Sep 1, 2015, 17:23 Eric Firing notifications@github.com wrote:

Good point; I hadn't paid attention to what np.searchsorted actually does
in the presence of nans, which turns out to be useless.

We could support subslicing with nans and with masked arrays by generating
self._x_filled in recache, using interpolation to fill the gaps. Then
searchsorted would be happy operating on the filled array. The performance
cost would be for the interpolation when recache is run, in cases where
there is at least a potential for subslicing (x is large and sorted, apart
from nans). It would require very few LOC.

Reminder for any readers who may wonder why subslicing is there at all: it
is to speed up panning with a small segment of a very long time series.


Reply to this email directly or view it on GitHub
#5016 (comment)
.

@efiring
Copy link
Member

efiring commented Sep 1, 2015

If you completely drop the masked points, the line won't be broken where it should be. So what you need to do is use an index array to go from the searchsorted in the reduced array back to the corresponding indices in the original array. My guess is that this requires more time and memory (and LOC) than using an interpolated array.

@efiring
Copy link
Member

efiring commented Sep 2, 2015

I have a simple change implementing the suggestion above. I will submit a PR within a day or two. I need to do some more testing.

@efiring
Copy link
Member

efiring commented Sep 3, 2015

@mdboom, @tacaswell, something very strange is going on here, independent of the use of nans or masked arrays. Consider the following test script, longline.py:

import sys

import numpy as np
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
plt.rcParams['agg.path.chunksize'] = 20000

n = 10 ** int(sys.argv[1])
xx = np.random.randn(n)
ind = np.arange(n)

if len(sys.argv) > 3 and sys.argv[3] == 'reversed':
    ind = ind[::-1]
    print('descending x sequence')
else:
    print('ascending x sequence')

# Uncomment the following to test invalid and/or masked array
#ind[::10] = np.nan
#ind = np.ma.masked_invalid(ind)

fig, ax = plt.subplots()
ax.plot(ind, xx)
if len(sys.argv) > 2:
    nsmall = int(sys.argv[2])
    ax.set_xlim(n/2, n/2 + nsmall)
fig.savefig("test.png")

Invoke it with

time python longline.py 7 100
time python longline.py 7 100 reversed

We are plotting a sequence of 100 points in the middle of a series of 10,000,000.
With 1.4.3, I get about 1.4 seconds forwards and 1.7 seconds reversed, so the sublicing is not helping much. I tried modifying the 1.4.3 Line2D.draw() to disable the subslicing unconditionally; it didn't make much difference.
But with master, the reversed case takes about 50 seconds! It is as if there is something in agg in 1.4.3 that substitutes for subslicing, and that works backwards and forwards, but that in master works only with an ascending x array. I'm baffled. If this is correct--and not an artifact of scrambled versions or some other foul-up on my side--then it is a major regression. (I used git clean -dfx for both compilations, but I did not clean out all matplotlibrc files, and so I got some warnings about keys--but I don't think that would have affected this test.)
Can you reproduce this?

@tacaswell
Copy link
Member

I can reproduce this (sorry for the funny file name 😉 )

(dd_3k) ✘-127 /tmp 
17:41 $ time python eric_oh_bad.py 7 100
ascending x sequence

real    0m1.048s
user    0m0.943s
sys     0m0.103s
(dd_3k) ✔ /tmp 
17:41 $ time python eric_oh_bad.py 7 100 reversed
descending x sequence

real    0m30.382s
user    0m30.293s
sys     0m0.133s
(dd_3k) ✔ /tmp 

@mdboom
Copy link
Member

mdboom commented Sep 3, 2015

It appears that path clipping is broken -- delving further now.

mdboom added a commit to mdboom/matplotlib that referenced this issue Sep 4, 2015
Addresses the slowness mentioned in matplotlib#5016
@mdboom
Copy link
Member

mdboom commented Sep 4, 2015

@efiring: See #5023 for a solution to the (embarrassing) slowness problem.

All of the backends clip the path to the edges of the figure before passing along to lower layers like stroking etc. That tends to dramatically reduce the runtime for large paths that are largely "off figure". It's not quite as algorithmically optimal as the binary search that subslicing does -- it still ends up running through the entire set of vertices, removing nans and affine transforming them -- but by being in C, it's obviously pretty close in run time.

With 1.4.3, I get about 1.4 seconds forwards and 1.7 seconds reversed, so the sublicing is not helping much. I tried modifying the 1.4.3 Line2D.draw() to disable the subslicing unconditionally; it didn't make much difference.

1.4 seconds vs. 1.7 seconds isn't nothing (~25%), but it's not enormous. I think it proves that the approach (binary searching) is sound, even if the implementation in Numpy isn't optimal.

One could probably do way better in C -- you could "bail early" on the ordered test, and the binary search could be written to handle NaNs the way we want. But all that's for future work...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants