Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Large file size when using fill_between() #22803

Open
tgrohens opened this issue Apr 8, 2022 · 26 comments
Open

[Bug]: Large file size when using fill_between() #22803

tgrohens opened this issue Apr 8, 2022 · 26 comments
Labels
Difficulty: Medium https://matplotlib.org/devdocs/devel/contribute.html#good-first-issues Good first issue Open a pull request against these issues if there are no active ones!

Comments

@tgrohens
Copy link

tgrohens commented Apr 8, 2022

Bug summary

Hi,

I am trying to save a plot to PDF (or SVG) that uses the fill_between function between two time series, and the resulting file is much larger than when plotting each series individually.

Code for reproduction

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=424242)

t_max = int(1e5)
t = np.arange(1, t_max+1, dtype=int)

data = rng.normal(size=(10, t_max))

# Lightweight figure (266 kb)
plt.figure(dpi=200)

plt.plot(t, np.min(data, axis=0), color='tab:blue', alpha=0.75)
plt.plot(t, np.max(data, axis=0), color='tab:blue', alpha=0.75)

plt.xlim(1, t_max)
plt.savefig(f"plot_minmax_{t_max}.svg")

# Large figure (1.1 Mb)
plt.figure(dpi=200)
plt.fill_between(t, np.min(data, axis=0), np.max(data, axis=0), color='tab:blue', alpha=0.5)
plt.xlim(1, t_max)
plt.savefig(f"plot_fill_{t_max}.svg")

Actual outcome

plot_minmax_2e+04
plot_fill_2e+04

Expected outcome

The file size of the file drawn with fill_between() should stay roughly the same as the file drawing the min and the max only.

Additional information

I have plotted the resulting file sizes for different numbers of data points.
The file size of the file generated with fill_between() grows linearly with the number of data points (as could be expected), but not the file size of the file plotting the min and the max directly: maybe there's an optimization in plot() that's not being used in fill_between() ?
download

Operating system

macOS

Matplotlib Version

3.4.3

Matplotlib Backend

module://matplotlib_inline.backend_inline

Python version

Python 3.9.7

Jupyter version

No response

Installation

pip

@jklymak
Copy link
Member

jklymak commented Apr 8, 2022

Yes, Line2D starts subsampling un-resolved data. https://matplotlib.org/stable/users/explain/performance.html#line-segment-simplification. OTOH, I don't see this rcParam mentioned at all in the docstring for Line2D, nor in the docstring for plot, which is a pretty big oversight in my opinion. Labelling as a documentation issue and good first issue, as it just requires mentioning the rcParams and linking to the explanation.

@jklymak jklymak added Good first issue Open a pull request against these issues if there are no active ones! Documentation labels Apr 8, 2022
@tgrohens
Copy link
Author

tgrohens commented Apr 8, 2022

Thanks for the additional information!

From looking at the performance page and the code of fill_between, the function generates a polygon, so I understand the line segments it comprises should be simplified as well?

@jklymak
Copy link
Member

jklymak commented Apr 8, 2022

I don't think so - Polygons derive from Patch and do not have the Path logic in them.

@tgrohens
Copy link
Author

tgrohens commented Apr 8, 2022

Ah, I see; in this case, maybe the documentation should be updated to reflect that?

It currently states (emphasis mine):

For plots that have line segments (e.g. typical line plots, outlines of polygons, etc.),

@jklymak
Copy link
Member

jklymak commented Apr 8, 2022

Maybe? I could be wrong, but I don't think it applies to polygons.

@tgrohens
Copy link
Author

tgrohens commented Apr 8, 2022

Do you mean the outline of a polygon is different than a polygon?
In this case, my bad!

@jklymak
Copy link
Member

jklymak commented Apr 8, 2022

Well there are "polygons" and then there are instances of the Polygon class. I am pretty sure instances of the Polygon class cannot simplify. I agree that the docs shouldn't make it sound like Polygons will simplify, and that may have just been a mistake?

@tacaswell
Copy link
Member

I agree with Jody, I think in that prose we are using "polygon" in the colloquial sense (in that we all look at the screen and agree it is "the outline of a ploygon") rather than a claim "that is a Polygon instance that is rendering the (thing we look at and call a polygon in english) on the screen".

It is probably worth looking into if we do not use the path simplifiaction logic in drawing patches because we can not (at least with clipping there are good reasons we can not clip filled shapes) or because no one has tried to write the code to use the simplification logic.

I think the steps here are:

  1. sort out if we are already trying to use Path Simplification when drawing patches
  2. if we are, sort out why it is not being used in this case and if there is a good reason for that
  3. if we are not, implement its usage (and see what breaks ;) ) This should follow the same configuration knobs as Line2D
  4. write tests !

I'm going to label this a good first issue as it should involve no new API design (public API has to stay the same, we already have the path simplification code used else where in the code base, use the same rcparams), but medium difficulty because it will require reading and understanding where we do the path simplification in Line2D and than adapting it for use in Patch (or its subclasses)

@tacaswell tacaswell added the Difficulty: Medium https://matplotlib.org/devdocs/devel/contribute.html#good-first-issues label Apr 9, 2022
@Nishantppanchal
Copy link

Hey @tacaswell, could I be assigned to this issue? I am a first-year computer science student looking to take on a project to gain so experience and for fun of course. Just note, It may take me a bit of time however as I have to balance with uni.

@jklymak
Copy link
Member

jklymak commented Apr 10, 2022

@Nishantppanchal We do not assign or reserve issues. Thanks for your understanding.

@Nishantppanchal
Copy link

@jklymak No problem.

@wannieman98
Copy link

Hi, I wanted to give this issue a try but wanted to clarify the tasks. My understanding with the comment @tacaswell has written. Is this simply a documentation issue or should the implementation of the simplification logic completed as well?

@tacaswell
Copy link
Member

@wannieman98 The first step is to sort out if we need to do any implementation or just document how to opt-into the optimizations. If we already have the functionality, then it is just documentation. If not, then it needs to be both implemented and documented.

@tgrohens
Copy link
Author

tgrohens commented Apr 29, 2022

Hi,

I've tried to investigate why the Line2D path gets simplified but not the Patch one, and I believe there's two different knobs that need to be turned in order to get the Patch path simplified.

First, the _should_simplify attribute of the Path object representing the patch is set to False during its construction, because the path ends with a Path.CLOSEPOLY command, which is greater than Path.LINETO:

self._should_simplify = (
self._simplify_threshold > 0 and
mpl.rcParams['path.simplify'] and
len(self._vertices) >= 128 and
(self._codes is None or np.all(self._codes <= Path.LINETO))
)

I'm not too familiar with the matplotlib internals, but I believe that tweaking this snippet to allow paths that end with a CLOSEPOLY command should make sense, as it could essentially be replaced by a LINETO back to the first vertex.

Second, when a backend is drawing the path, it decides whether it should really simplify it, by looking at the _should_simplify attribute of the path but also by deciding whether to clip it (?), using its rgbFace attribute.
For the SVG backend:

clip = (rgbFace is None and gc.get_hatch_path() is None)
simplify = path.should_simplify and clip

And the PDF backend:
def writePath(self, path, transform, clip=False, sketch=None):
if clip:
clip = (0.0, 0.0, self.width * 72, self.height * 72)
simplify = path.should_simplify
else:
clip = None
simplify = False

self.file.writePath(
path, transform,
rgbFace is None and gc.get_hatch_path() is None,
gc.get_sketch_params())

I'm not too sure why a path should not be simplified if it has an rgbFace?

In any case, by both making _should_simplify True for paths that have CLOSEPOLY codes and removing the rgbFace dependency in deciding whether to actually simplify paths the backends, I've been able to obtain the desired behavior and have a file size for the file generated with fill_between() similar to the one obtained with plot(), but at the cost of breaking some tests in the test_axes.py test suite.

Let me know if I should try to turn this into a proper patch!

@tacaswell
Copy link
Member

I'm not too familiar with the matplotlib internals, but I believe that tweaking this snippet to allow paths that end with a CLOSEPOLY command should make sense, as it could essentially be replaced by a LINETO back to the first vertex.

Not entirely because at the next layer down a LINETO back to the origin strokes it as a line (which end caps) that just happen to be at the same point. This means if you look closely you can which junction in in the path is the first one. Where as if you use CLOSEPOLY it will be rendered as a contentious line like any other junction in the path. I suspect if we were to relax this condition here, we would need to also adjust the simplification logic (which is in c++) to make sure it does not mess up the last point.

It looks like we actually already did the c++ side work: #21387 and just missed enabling it everywhere?

I'm not too sure why a path should not be simplified if it has an rgbFace?

At least with clipping, being filled makes everything harder (because you have to care about the area "inside" the path, not just the points in the path). It is possible that there is a matching issues with simplification or we may have just been being cautious. This will probably require some investigation, but conceptually seems less problematic than clipping....

If you look through git-blame on those lines is there anything suggestive?


The next step is to make sure the test suite passes and open a PR with your changes!

@tacaswell tacaswell added this to the v3.7.0 milestone Apr 30, 2022
@tgrohens
Copy link
Author

tgrohens commented May 2, 2022

Following your suggestion, I've tried to split the work by first enabling paths that finish with a CLOSEPOLY to be simplified, which could make sense as a first-step PR.

When doing only this, the imshow_clip, contour_colorbar and arc_ellipse tests break.
For the first two tests, this the result of a Path being simplified that was not simplified before, and so I believe that the expected image could be updated in order to make the test pass (I'm attaching the result images of contour_colorbar as an example).

Original image:
contour_colorbar-expected_pdf
With simplification:
contour_colorbar_pdf
Diff:
contour_colorbar_pdf-failed-diff

However, for the arc_ellipse test, I get this result, when running the test from the test suite:
Original:
arc_ellipse_pdf
Expected:
arc_ellipse-expected_pdf
Diff (more obvious this time):
arc_ellipse_pdf-failed-diff

But when running the corresponding function directly from a script, I get a resulting image where the ellipse is correctly positioned and not cropped, so I'm not sure what is causing the problem in this case?

Finally, when also changing the backends as in my previous message, I get the following image for the contour_colorbar test, which indeed looks related to the problem mentioned in #21387.
contour_colorbar_pdf

@tacaswell
Copy link
Member

But when running the corresponding function directly from a script, I get a resulting image where the ellipse is correctly positioned and not cropped, so I'm not sure what is causing the problem in this case?

Are you running with the right style? We set a bunch of rcparams to "old" values so we do not need to change the images! That looks like auto-scaling going funny....

@tgrohens
Copy link
Author

tgrohens commented May 4, 2022

I've sorted out the problem for the arc_ellipse test: without simplification, the topmost point on the ellipse path was at y=0.65+ε, leading to a tick at 0.7 at the top of the figure (ticks here have a step size of 0.05).
With simplification, this point is removed and the new topmost point is at y=0.65-ε, leading to a top tick at 0.65 that doesn't allow for the whole width of the line to be drawn.
I have fixed it by changing the ellipse coordinates so that they aren't as close to the border and it works fine; for those examples, path simplification does reduce the size of the generated PDF and SVG files.

Do you think it's worth doing a first PR with this bit of work before sorting out how to work with the rgbFace in the backends?

@jklymak
Copy link
Member

jklymak commented May 4, 2022

Do we have enough dev feedback on whether we want path simplification here? I tend to think path simplification is a bad idea, leading to aliasing and weird effects. If the user cannot handle all the data in the plot then they should either simplify or rasterize.

@tacaswell
Copy link
Member

We do simplification by default on Line2D, independent if that is a good idea or not, if we do it in one place it makes sense we should do it in the other.

The path simplification we are talking about here is done at a very low level in the backends to drop points that "have no effect" (with a knob for what "no effect" means). We only know how to evaluate this at the last possible moment (the "right" simplification for a given data set depends on the number actual output pixel density which we are working on hiding the even from the user!) and I do not think we should (or really can) push that back up to the user. At the bottom this is fundamentally a performance feature (it can massively speed up Agg rendering / reduce file size) and should be an available option everywhere we can technically implement it.


With simplification, this point is removed and the new topmost point is at y=0.65-ε, leading to a top tick at 0.65 that doesn't allow for the whole width of the line to be drawn.

This is worrying that the tick selection is that sensitive. It may be the case that we are applying this simplification too high in the stack (the renderer should see the simplified path, the auto limit code should not).

@jklymak
Copy link
Member

jklymak commented May 4, 2022

OK< I guess that is right for zoom-ability. I remove my objection!

@tgrohens
Copy link
Author

tgrohens commented May 5, 2022

This is worrying that the tick selection is that sensitive.

In this particular case, I think that the settings used in this specific test do not help: axes.autolimit_mode is set to round_numbers but the axes.xmargin and axes.ymargin are 0, so the ticks are very sensitive; with axes.autolimit_mode set to data, I believe the issue vanishes.

It may be the case that we are applying this simplification too high in the stack (the renderer should see the simplified path, the auto limit code should not).

The path is being simplified through a call to Path.cleaned(), via the original ax.fill() call:

  File "/Users/tgrohens/Work/Code/mpl-test/test.py", line 110, in <module>
    test_arc_ellipse()
  File "/Users/tgrohens/Work/Code/mpl-test/test.py", line 88, in test_arc_ellipse
    ax.fill(x, y, alpha=0.2, facecolor='yellow', edgecolor='yellow',
  File "/Users/tgrohens/Work/Code/matplotlib/lib/matplotlib/axes/_axes.py", line 5057, in fill
    self.add_patch(poly)
  File "/Users/tgrohens/Work/Code/matplotlib/lib/matplotlib/axes/_base.py", line 2376, in add_patch
    self._update_patch_limits(p)
  File "/Users/tgrohens/Work/Code/matplotlib/lib/matplotlib/axes/_base.py", line 2398, in _update_patch_limits
    for curve, code in p.iter_bezier():
  File "/Users/tgrohens/Work/Code/matplotlib/lib/matplotlib/path.py", line 456, in iter_bezier
    for verts, code in self.iter_segments(**kwargs):
  File "/Users/tgrohens/Work/Code/matplotlib/lib/matplotlib/path.py", line 410, in iter_segments
    cleaned = self.cleaned(transform=transform,
                           remove_nans=remove_nans, clip=clip,
                           snap=snap, stroke_width=stroke_width,
                           simplify=simplify, curves=curves,
                           sketch=sketch)

In this case, a workaround could be to set the simplify parameter to False here, to prevent the path from being simplified too high in the stack, if that makes sense.
Moreover, the resulting simplified path created by this specific call is very small (the original path has 361 vertices, and the simplified path only 5), so I'm wondering if the C++ simplification code is called with too aggressive parameters at that stage?

@tacaswell
Copy link
Member

In this case, a workaround could be to set the simplify parameter to False here, to prevent the path from being simplified too high in the stack, if that makes sense.

Knee-jerk that makes sense to me, but that may not survive contact with code...

so I'm wondering if the C++ simplification code is called with too aggressive parameters at that stage?

This may come back to what DPI it things is using. If it is doing this at 72 then it very likely is being too aggressively simplified.


@tgrohens Thank you for your work so far on this :)

@tgrohens
Copy link
Author

tgrohens commented May 6, 2022

This may come back to what DPI it things is using. If it is doing this at 72 then it very likely is being too aggressively simplified.

OK, I think I've understood the issue here.
The path simplification logic in PathSimplifier works with points expressed in display coordinates (ie in pixels).

When computing the bounding box for a patch (in _update_patch_limits()), the Path.cleaned() method is being called, and simplifies the path of the patch.
However, at this stage, the path is expressed in data coordinates (the transform parameter is None), which have much smaller absolute values: hence, most of the points are flushed out.

However, when calling draw_path() inside one of the backends, cleaned() is called again but this time the transform is not None, and the points are correctly expressed in display coordinates, resulting in the expected level of simplification.

Interestingly, when calling ax.plot() on a Line2D containing the same points, cleaned() is not being called in the computation of the bounding box (in _update_line_limits()), and so there is no clipping due to aggressive path simplification.

I think the best course of action is therefore to switch off path simplification in _update_patch_limits() by passing simplify=False here:

for curve, code in p.iter_bezier():

@tacaswell
Copy link
Member

I think the best course of action is therefore to switch off path simplification in _update_patch_limits() by passing simplify=False here:

👍🏻

@tgrohens
Copy link
Author

tgrohens commented May 9, 2022

I think the best course of action is therefore to switch off path simplification in _update_patch_limits() by passing simplify=False here:

So, this bit sorted out the problem with tick selection.
However, in order to be actually able to simplify Patch paths, the issue remains that they are not currently simplified because they have a non-None rgbFace (as I mentioned in the second part of that message #22803 (comment)).

I've tried removing the rgbFace from deciding whether the path should be simplified, but this results in some really failing tests, such as this one (the W at the bottom right-hand corner is misdrawn):
patheffect3_pdf
Expected:
patheffect3-expected_pdf

I'm not sure that not considering the rgbFace is something that would make sense anyway, as it is code that has been in every backend for quite some time.
This makes me wonder whether it's really possible to simplify those paths in the end?

@QuLogic QuLogic modified the milestones: v3.7.0, future releases Jan 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Difficulty: Medium https://matplotlib.org/devdocs/devel/contribute.html#good-first-issues Good first issue Open a pull request against these issues if there are no active ones!
Projects
None yet
Development

No branches or pull requests

6 participants