
[DOC]: clarify that markevery[float] considers path length along drawn line #27842

Open
ma-sadeghi opened this issue Mar 1, 2024 · 16 comments
Labels
Documentation, Documentation: API (files in lib/ and doc/api)

Comments

@ma-sadeghi

Bug summary

markevery[float] is supposed to result in evenly distributed markers, but when y-data is noisy, its effect seems to "gradually" fade away.

Code for reproduction

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(1e3, 1e7, 1000)

fig, ax = plt.subplots(nrows=3, ncols=3, figsize=(9, 9), layout="constrained")

for i, ax_i in enumerate(ax.flatten()):
    noise = i / 2
    ax_i.set_title(f"noise level = {noise:.1f}")
    y = np.exp(x**0.1) + np.random.rand(len(x)) * noise
    ax_i.plot(x, y, 'b.-', markevery=0.1, ms=10, markerfacecolor="r")
    ax_i.set_xscale("log")
    ax_i.set_yscale("log")

Actual outcome

[figure: 3×3 grid of subplots from the reproduction code; marker spacing degrades as the noise level increases]

Expected outcome

All subplots should look like the one with noise level 0.

Additional information

No response

Operating system

Ubuntu

Matplotlib Version

3.8.3

Matplotlib Backend

matplotlib_inline.backend_inline

Python version

3.10.13

Jupyter version

No response

Installation

pip

@dstansby
Member

dstansby commented Mar 1, 2024

From the docs:

e.g., if every=5, every 5-th marker will be plotted.

So I think you should be using a value > 1 (instead of 0.1). If you set e.g. markevery=4 does it behave as expected?

Regardless, it would be nice to either warn or error if a value < 1 is passed here.

@timhoffm
Member

timhoffm commented Mar 1, 2024

@dstansby um, no. Float has separate semantics:

every=0.1, (i.e. a float): markers will be spaced at approximately equal visual distances along the line; the distance along the line between markers is determined by multiplying the display-coordinate distance of the axes bounding-box diagonal by the value of every.

But that said, I still believe the plot is correct. Markevery still always chooses actual data points, and the noise just shows through in them.
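For reference, a minimal sketch contrasting the two semantics (the data here are made up purely for illustration): an integer marks every N-th data point, while a float spaces markers approximately evenly along the drawn line.

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 1, 200)
y = np.sin(4 * np.pi * x) * np.exp(-2 * x)

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(8, 3), layout="constrained")
# Integer: mark every 20th data point, regardless of where the points land visually.
ax1.plot(x, y, '-o', markevery=20)
ax1.set_title("markevery=20 (every N-th point)")
# Float: mark data points spaced ~10% of the Axes bounding-box diagonal apart
# along the drawn line.
ax2.plot(x, y, '-o', markevery=0.1)
ax2.set_title("markevery=0.1 (spacing along the line)")
plt.show()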

@jklymak
Member

jklymak commented Mar 1, 2024

Well, 0.1 means about 10 marked data points, but as the noise increases there are clearly far more than 10.

That said, I consider this a pretty bad visualization technique. I'm not shocked it doesn't work well, particularly for logarithmic axes. So I agree this is a bug, but I'm not sure how fixable it is.

@ma-sadeghi
Author

There's a real use case: the loss while training an ML model is usually smooth at the beginning and progressively begins to fluctuate, which makes it a bit noisy. Plus, the loss tends to slow down very quickly, so it must be plotted on a log-scale x-axis, which means you get exponentially more points towards the end. On top of that, when plotting different training scenarios in a single figure, the jiggles make reading and interpreting the plots very difficult. It's very useful in such cases to only see a finite number of markers for each plot.

@jklymak
Member

jklymak commented Mar 1, 2024

I can appreciate signals with different signal to noise ratios, but I don't think counting on Matplotlib's heuristic here is the right approach. I would plot the raw data, and then decimate manually based on what works properly for the data interpretation.

the jiggles make reading and interpreting the plots very difficult. It's very useful in such cases to only see a finite number of markers for each plot.

If I had this problem, I'd smooth out the jiggles rather than subsampling them, as you are just going to alias the jiggles to the subsampled data.
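A minimal sketch of that approach, using data similar to the reproduction above (the moving-average window size is an arbitrary choice for illustration):

import matplotlib.pyplot as plt
import numpy as np

# Noisy, exponentially growing signal, similar to the reproduction above.
x = np.linspace(1e3, 1e7, 1000)
y = np.exp(x**0.1) + np.random.rand(len(x)) * 2

# Smooth with a simple moving average before plotting; the window size would
# need tuning for real data.
window = 25
y_smooth = np.convolve(y, np.ones(window) / window, mode="same")

fig, ax = plt.subplots()
ax.plot(x, y, color="0.8", label="raw")
ax.plot(x, y_smooth, 'b.-', markevery=0.1, ms=10, markerfacecolor="r",
        label="smoothed")
ax.set_xscale("log")
ax.set_yscale("log")
ax.legend()
plt.show()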

@ma-sadeghi
Author

Fair enough, but even ignoring the jiggles: for data that are linearly sampled but plotted on a log scale, markevery is pretty helpful, yet it seems to be sensitive to even slight noise, and hence not reliable.

@jklymak
Member

jklymak commented Mar 1, 2024

Agreed that it's not reliable! Someone could try to fix it, but....

@timhoffm
Member

timhoffm commented Mar 2, 2024

Without having had time to dig into this, I suspect that we are sampling with a fixed distance along the data curve. While a smooth curve has a total length of approximately the width of the Axes, a noisy curve is much longer and thus gets more markers.
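A rough way to see this (measuring the path length of the drawn line in display coordinates, which is close to what the float handling in _mark_every_path is based on):

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(1e3, 1e7, 1000)
fig, axs = plt.subplots(ncols=2, figsize=(8, 4), layout="constrained")

for noise, ax in zip([0.0, 2.0], axs):
    y = np.exp(x**0.1) + np.random.rand(len(x)) * noise
    ax.plot(x, y, 'b.-', markevery=0.1)
    ax.set_xscale("log")
    ax.set_yscale("log")

fig.canvas.draw()  # make sure the transforms reflect the final view limits

for noise, ax in zip([0.0, 2.0], axs):
    xd, yd = ax.lines[0].get_data()
    # Total path length of the drawn line in display (pixel) coordinates.
    disp = ax.transData.transform(np.column_stack([xd, yd]))
    length = np.hypot(*np.diff(disp, axis=0).T).sum()
    ax.set_title(f"noise = {noise:.0f}, path length ~ {length:.0f} px")

plt.show()

The noisy line is longer in display space, so at a fixed spacing along the path it collects more markers.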

@jklymak
Member

jklymak commented Mar 2, 2024

That sounds likely. To my point that this should be done manually by the user: the current behaviour is one heuristic that works for data where spacing along the line is reasonable. Here the OP wants spacing in x, which is a different algorithm.

@dstansby
Member

dstansby commented Mar 4, 2024

The above comments are correct. If the distance is measured parallel to the x-axis instead of along the curve (it's a one-line diff, see the bottom of this comment), the output is what I think is desired:

[screenshot: the reproduction output with the patch below applied; markers are spaced evenly along the x-axis]

I guess we could add another input type to markevery to do this, perhaps allow dict[Literal["x", "y"], float] so something like {"x": 0.1} would do spacing evenly parallel to the x-axis?

diff --git a/lib/matplotlib/lines.py b/lib/matplotlib/lines.py
index 5e5c19c9f6..752b805e4f 100644
--- a/lib/matplotlib/lines.py
+++ b/lib/matplotlib/lines.py
@@ -179,7 +179,8 @@ def _mark_every_path(markevery, tpath, affine, ax):
             delta = np.empty((len(disp_coords), 2))
             delta[0, :] = 0
             delta[1:, :] = disp_coords[1:, :] - disp_coords[:-1, :]
-            delta = np.hypot(*delta.T).cumsum()
+            # Calculate distance parallel to x-axis
+            delta = delta[:, 0].cumsum()
             # calc distance between markers along path based on the Axes
             # bounding box diagonal being a distance of unity:
             (x0, y0), (x1, y1) = ax.transAxes.transform([[0, 0], [1, 1]])

@dstansby
Member

dstansby commented Mar 4, 2024

I'll label this as a feature request, as the existing options are working as intended.

@tacaswell
Member

Changing to distance along the x-axis will do weird things for spirals and other cases where the y values are not a function (in the math sense) of x.

@tacaswell tacaswell changed the title from "[Bug]: markevery[float] doesn't work as expected when y-data is noisy" to "[FEAT]: markevery[float] considers path length not distance along x-axis" on Mar 5, 2024
@timhoffm
Copy link
Member

timhoffm commented Mar 5, 2024

One action here is to better document that [float] subsamples along the path.

I’m -0.5 on adding subsampling along x. First, subsampling is generally a questionable operation and needs to be employed with great care - there are often better aggregation techniques. Second, there’s likely only a very small subset of cases where subsampling is reasonable and subsampling along x is better than the existing subsampling along the path. Third, it’s not too hard to create a numpy mask for that yourself. And finally, I don’t see how we can fit these additional semantics into the existing markevery API (and IMHO additional keywords would be overboard here).
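For example, a rough do-it-yourself sketch for the log-x case in the original report: markevery also accepts an explicit list of indices (or a boolean mask), so the subsampling can be computed up front. The target of roughly 10 markers and the spacing in log10(x) are arbitrary choices here.

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(1e3, 1e7, 1000)
y = np.exp(x**0.1) + np.random.rand(len(x)) * 2

# Pick ~10 data points roughly evenly spaced in log10(x) and hand their
# indices to markevery.
targets = np.linspace(np.log10(x[0]), np.log10(x[-1]), 10)
idx = np.unique(np.searchsorted(np.log10(x), targets).clip(0, len(x) - 1))

fig, ax = plt.subplots()
ax.plot(x, y, 'b.-', markevery=list(idx), ms=10, markerfacecolor="r")
ax.set_xscale("log")
ax.set_yscale("log")
plt.show()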

@jklymak
Copy link
Member

jklymak commented Mar 5, 2024

I agree - I think expanding this functionality is the wrong direction, and if anything I think we should discourage it, and rather encourage folks to figure out their subsampling on their own. But I could be convinced if there is prior art that shows this being done in a robust way....

@tacaswell
Member

I agree with @timhoffm that the right path here is better documentation:

distances along the line; the distance along the line between markers is determined by multiplying the display-coordinate distance of the axes bounding-box diagonal by the value of every.

I'm reading this as: the markers are spaced by approximately every*axes_diagonal_in_screen_space pixels of path length along the line. This is not the clearest, but I also would not say it is wrong.


I think we do have to have this functionality internally, because getting it right (leaving aside that small noise on top of a big value on a log scale makes it go funny) requires knowing the details of our transform stack and the current view limits. With markevery=0.1, as you pan/zoom around, which points we mark will dynamically update.

import matplotlib.pyplot as plt
import numpy as np

r = np.linspace(1, 10, 256)
theta = np.linspace(0, 6*np.pi, 256)
plt.plot(r*np.cos(theta), r*np.sin(theta), markevery=.1, marker='o')
plt.show()

@tacaswell tacaswell changed the title from "[FEAT]: markevery[float] considers path length not distance along x-axis" to "[DOC]: clarify that markevery[float] considers path length along drawn line" on Mar 5, 2024
@tacaswell tacaswell added the "Documentation" and "Documentation: API (files in lib/ and doc/api)" labels and removed the "New feature" label on Mar 5, 2024
@ma-sadeghi
Author

You all make valid points. As an outsider (user vs. MPL dev) though, I see markevery simply as a way to declutter a multi-line figure with many markers. From this perspective, the implementation shouldn't get in the way of the goal.
