Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: plt.savefig leaks the stack #25406

Open
CorentinJ opened this issue Mar 7, 2023 · 24 comments
Open

[Bug]: plt.savefig leaks the stack #25406

CorentinJ opened this issue Mar 7, 2023 · 24 comments

Comments

@CorentinJ
Copy link

CorentinJ commented Mar 7, 2023

Bug summary

Calling plt.savefig leaks local variables from parent calling functions. These can be found in gc.get_objects() and are cleared by gc.collect()

Code for reproduction

import gc
import matplotlib.pyplot as plt


class SomeObject:
    pass

def save_plots():
    plt.plot([1, 2, 3])
    plt.savefig("test.png")

def leaky():
    some_object = SomeObject()
    save_plots()

leaky()

for obj in gc.get_objects():
    try:
        if isinstance(obj, SomeObject):
            print("Object is leaking")
    except:
        pass

Actual outcome

Object is leaking

Expected outcome

Nothing should be printed.

Additional information

  • I initially found this bug because I realized my deep learning training loop was leaking VRAM. Torch tensors were being kept in memory when Figure.savefig was called
  • I am under the impression that ExitStack within savefig is meant to hack around this issue but is somehow not working in my case?
  • I have been through issue [Bug]: plt.figure(), plt.close() leaks memory #23701, but here I am able to provide a short and reproducible example
  • I have been through Memory leaks on matplotlib 3.4.2 (and 3.4.0) #20490 as well, but the issue persists with the agg backend
  • It is not acceptable to have to call gc.collect(). Most users won't be aware of the issue. Given the prevalence of deep learning these days, it's likely objects leaking into GPU memory will be a common problem due to this bug.

Operating system

Win10 x64

Matplotlib Version

3.7.1

Matplotlib Backend

TkAgg

Python version

3.10.0

Jupyter version

No response

Installation

pip

@rcomer
Copy link
Member

rcomer commented Mar 7, 2023

Thank you for the clear minimal example @CorentinJ. I just tried this with the main branch using both QtAgg and TkAgg, and both python 3.10 and 3.11. Of these four combinations, I only get the "leaking" output with python 3.10 and TkAgg.

@ksunden
Copy link
Member

ksunden commented Mar 7, 2023

I don't know that something which is cleaned up by gc.collect is really considered a "leak"... yeah, ideally the reference counter would free the objects more proactively, but just because cycles exist doesn't mean it is actually leaked.

If you are in a memory constrained environment (such as GP-GPU compute), calling gc.collect more may be warranted. (At some point "users doing memory heavy tasks should know about gc.collect" becomes very reasonable.).

Note that in both of the prior issues you link to, they actively call gc.collect in order to prove that there is a leak.

If it is easy to find/fix, I'd be all for explicitly breaking the object cycle/garbage collecting it to avoid the problem. (e.g. if ExitStack was to blame and some explicit del statements in its __exit__ would fix it)

I have been unable to reproduce on my machine (which is linux, so does differ) on py 3.9, 3.10, or 3.11. This makes investigating where cycles exist basically impossible on my machine. I may have seen it once, but I was trying to get a second instance to print, adding some disambiguation to it made it disappear, so not sure which was printed. I don't even reliably see it with garbage collection disabled entirely, so not sure where the cycle is being introduced... (edit: my disambiguation had an exception raised that was silenced by the try/except... fixing that showed indeed it was the second instance that correctly was alive still that was printed)

The first pass test I was going to do was to see what type of objects are referring to SomeObject (my guess is we'd have to go up the referrer stack a level or two to actually find what we want):

import gc
import matplotlib
matplotlib.use("TkAgg")
import matplotlib.pyplot as plt


class SomeObject:
    pass

def save_plots():
    plt.plot([1, 2, 3])
    plt.savefig("test.png")

def leaky():
    some_object = SomeObject()
    save_plots()

leaky()

for obj in gc.get_objects():
    try:
        if isinstance(obj, SomeObject):
            print("Object is leaking")
            print([type(i) for i in gc.get_referrers(obj)])
    except:
        pass

@tacaswell
Copy link
Member

Please see #23712, the discussion in the linked issues, and https://matplotlib.org/stable/users/prev_whats_new/whats_new_3.6.0.html#garbage-collection-is-no-longer-run-on-figure-close.

In this example you are creating and holding open a Figure instance, so if that does indeed hold a reference to the current stack + locals, it is being kept alive because you have asked for the referrers to be kept alive. If you explicitly close the figure, does that help?

The other thing that may be happening to make this seem non-deterministic is that the algorithm that Python uses to decide when to do the full cyclic sweep depends on how many objects are around and various water marks being crossed. It is possible that in some cases a cyclic sweep is triggered and in others it is not. If you throw in something like

a = 0
for j in range(1000):
    a += j

after you call lazy() does the issue go away?


I am under the impression that ExitStack within savefig is meant to hack around this issue but is somehow not working in my case?

That is not at all what ExitStack does (see https://docs.python.org/3/library/contextlib.html#contextlib.ExitStack), it is for building up a (variable) number of nested context managers via method calls rather than via nested with blocks.

@rcomer
Copy link
Member

rcomer commented Mar 7, 2023

I have been unable to reproduce on my machine (which is linux, so does differ) on py 3.9, 3.10, or 3.11.

I did reproduce on Linux (Ubuntu), but I just realised my 3.10 environment had been hanging around for a few months. So I created a fresh one and could not reproduce with that. I’m looking at 3.10.9 vs 3.10.6.

@CorentinJ are you able to try a fresh python install?

@anntzer
Copy link
Contributor

anntzer commented Mar 7, 2023

I can repro on Py3.11+macOS+{agg,qtagg,tkagg,mplcairo.{base,qt,tk}}.
I think the bug report is a valid one (or at least a reasonable one) because some_object doesn't seem to belong to a reference cycle and it is thus somewhat unexpected that one needs to call gc.collect() to gc it.

@tacaswell
Copy link
Member

@anntzer does ensuring the figure is actually closed or burning a bunch of bytecode before checking affect anything?

That you can get it with just agg says we can not blame the gui toolkits.

My other guesses:

  • we are catching and holding on to an exception someplace?
  • the "first line not in Matplotlib" warning logic?

@tacaswell
Copy link
Member

This will reliably leak for me:

import gc
import matplotlib.pyplot as plt
import inspect
import sys

class SomeObject:
    pass

def save_plots():
    plt.plot([1, 2, 3])
    plt.savefig("test.png")

def leaky():
    cf = inspect.currentframe()
    print(f'pre: {sys.getrefcount(cf)}')
    some_object = SomeObject()
    print(f'pre: {sys.getrefcount(cf)}')
    save_plots()
    print(f'post: {sys.getrefcount(cf)}')


leaky()

for obj in gc.get_objects():
    if isinstance(obj, SomeObject):
        print("Object is leaking")

and adding del cf fixes it which suggest that someplace we are getting a frame in a local namespace?

Does

diff --git a/lib/matplotlib/_api/__init__.py b/lib/matplotlib/_api/__init__.py
index f4c390a198..5e0a7c1701 100644
--- a/lib/matplotlib/_api/__init__.py
+++ b/lib/matplotlib/_api/__init__.py
@@ -385,4 +385,5 @@ def warn_external(message, category=None):
                         frame.f_globals.get("__name__", "")):
             break
         frame = frame.f_back
+    del frame
     warnings.warn(message, category, stacklevel)

help?

@anntzer
Copy link
Contributor

anntzer commented Mar 8, 2023

Explicitly plt.close()ing the figure doesn't fix the problem.
I think burning bytecode is a red herring, I believe this should not leak some_object even if one called gc.disable() from the very beginning as some_object shouldn't be participating in a ref cycle, it should be collected by refcounting. (Perhaps starting with gc.disable() can make this issue more reproducible?)
My unsubstantiated guess is that the ref cycle is something along the lines of https://cosmicpercolator.com/2016/01/13/exception-leaks-in-python-2-and-3/

Edit: Good idea looking at warn_external, but that didn't help either.

@jklymak
Copy link
Member

jklymak commented Mar 8, 2023

I don't really understand all this and was just trying it to see if I could better understand. However, not importing Matplotlib and doing

def save_plots():
    return

also reports "Object is leaking" in the test #25406 (comment)

@ksunden
Copy link
Member

ksunden commented Mar 8, 2023

@jklymak I was about to say something very similar (I just commented out the call rather than making it return)

That said, I took that as an example of how the behavior may have arisen internally (e.g. via warn_external, though unfortunately that seems to not be the culprit)

@ksunden
Copy link
Member

ksunden commented Mar 8, 2023

Okay, behavior that involves stack frames and only Tk (having been imported already, not necessarily using tkagg, I suspect @anntzer's test/environment imported tkinter somewhere):

tkinter = sys.modules.get("tkinter")
if tkinter:
codes = {tkinter.mainloop.__code__, tkinter.Misc.mainloop.__code__}
for frame in sys._current_frames().values():
while frame:
if frame.f_code in codes:
return "tk"
frame = frame.f_back

@ksunden
Copy link
Member

ksunden commented Mar 8, 2023

diff --git a/lib/matplotlib/cbook.py b/lib/matplotlib/cbook.py
index c9699b2e21..46a17d4d8f 100644
--- a/lib/matplotlib/cbook.py
+++ b/lib/matplotlib/cbook.py
@@ -67,6 +67,7 @@ def _get_running_interactive_framework():
                 if frame.f_code in codes:
                     return "tk"
                 frame = frame.f_back
+        del frame
     macosx = sys.modules.get("matplotlib.backends._macosx")
     if macosx and macosx.event_loop_is_running():
         return "macosx"

While I'm still unable to reproduce the original bug, I suspect this diff may resolve it (same idea as warn_external, but different location

@QuLogic
Copy link
Member

QuLogic commented Mar 8, 2023

I can't reproduce this on TkAgg or QtAgg, but I can in Gtk[34]. gc.get_referrers(obj) gives me a list, a dict, and a frame. The list appears to be the one generated by the gc (as it seems to contain every possible object.) The dict seems to be something about a module or class information.

So I skipped the dict and list, and using the following walked up the frames:

        if isinstance(obj, SomeObject):
            print("Object is leaking")
            while (val := input('continue?').strip().lower()) not in ('n', 'no', 'q', 'quit'):
                if val == 'break':
                    breakpoint()
                referrers = [r for r in gc.get_referrers(obj) if not isinstance(r, (dict, list))]
                if not referrers:
                    break
                print(referrers)
                obj = referrers[0]

which gives me:

Object is leaking
continue?
[<frame at 0x7fbfa466a660, file '/home/elliott/code/matplotlib/../mpl-tests/issue25406.py', line 14, code leaky>]
continue?
[<frame at 0x7fbfa9e3a4b0, file '/home/elliott/code/matplotlib/../mpl-tests/issue25406.py', line 10, code save_plots>]
continue?
[<frame at 0x7fbfa4631220, file '/usr/lib64/python3.11/site-packages/matplotlib/pyplot.py', line 961, code savefig>]
continue?
[<frame at 0x7fbfa9dfec40, file '/usr/lib64/python3.11/site-packages/matplotlib/figure.py', line 3277, code savefig>]
continue?
[<frame at 0x7fbfa46241e0, file '/usr/lib64/python3.11/site-packages/matplotlib/backend_bases.py', line 2285, code print_figure>]
continue?
[<frame at 0x7fbfa45f6ce0, file '/usr/lib64/python3.11/site-packages/matplotlib/backend_bases.py', line 2204, code <lambda>>]
continue?
[<frame at 0x7fbfaa466ac0, file '/usr/lib64/python3.11/site-packages/matplotlib/_api/deprecation.py', line 410, code wrapper>]
continue?
[<frame at 0x7fbfa9eb1460, file '/usr/lib64/python3.11/site-packages/matplotlib/backends/backend_agg.py', line 517, code print_png>]
continue?
[<frame at 0x7fbfa9dffb40, file '/usr/lib64/python3.11/site-packages/matplotlib/backends/backend_agg.py', line 464, code _print_pil>]
continue?
[<frame at 0x7fbfa9e10ac0, file '/usr/lib64/python3.11/site-packages/matplotlib/image.py', line 1667, code imsave>]
continue?
[<frame at 0x7fbfa4640480, file '/usr/lib64/python3.11/site-packages/PIL/Image.py', line 2331, code save>]
continue?
[<frame at 0x7fbfaaeded90, file '/usr/lib64/python3.11/site-packages/PIL/PngImagePlugin.py', line 1388, code _save>]
continue?
[<frame at 0x7fbfa9e26770, file '/usr/lib64/python3.11/site-packages/PIL/ImageFile.py', line 529, code _save>]
continue?
[<traceback object at 0x7fbfaa506540>]
continue?
[AttributeError("'_idat' object has no attribute 'fileno'")]
continue?
[<frame at 0x7fbfa9e26770, file '/usr/lib64/python3.11/site-packages/PIL/ImageFile.py', line 529, code _save>]
continue?q

which is a traceback from an exception in Pillow. And it loops from that point on, so it appears to be a reference cycle. I'm not really sure why we think this is a Matplotlib problem.

@anntzer
Copy link
Contributor

anntzer commented Mar 8, 2023

Actually I may have figured out why the leak is tricky to reproduce: this may depend on mathtext, because it only appears for me when using my default matplotlibrc, which has axes.formatter.use_mathtext: True. Also, one can also just disable that, and repro with

import gc
import matplotlib.pyplot as plt; plt.rcdefaults()


class SomeObject:
    pass

def leaky():
    some_object = SomeObject()
    fig = plt.figure()
    fig.text(.5, .5, "$0.1$")  # no leak if using "0.1" i.e. no mathtext.
    fig.canvas.draw()

leaky()

for obj in gc.get_objects():
    try:
        if isinstance(obj, SomeObject):
            print("Object is leaking")
    except:
        pass

@shervinsahba
Copy link

shervinsahba commented Mar 8, 2023

This may be related to @anntzer 's example. First of all, I reproduce the same behavior that they posted above in Python 3.10.8 and MPL 3.6.2. Except if I prepend plt.rcParams.update({"text.usetex": True}) , the mathtex works fine with no leak reported.

The comment led me to find the cause of a leak on a personal, non-minimal example. In my scenario, explicitly setting the font family to Palatino (provided by my local tex package) seems to prevent plt.close() and gc.collect() from doing their job during a loop. Disabling the font allowed for garbage collection.

@ksunden
Copy link
Member

ksunden commented Mar 8, 2023

Combining the investigation from @QuLogic and @anntzer shows that the exception that is causing the cycle is still in an upstream package for that more reliable test case (that I can reproduce). This time it is an exception in pyparsing, it appears, and much deeper in the call stack than the version that was caused by a caught exception in Pillow.

...
#### Last matplotlib line referenced
[<frame at 0x56027c6625f0, file '/home/kyle/.pyenv/versions/py310/lib/python3.10/site-packages/matplotlib/_mathtext.py', line 2019, code math_string>]
continue?
[<frame at 0x56027c65f940, file '/home/kyle/.pyenv/versions/py310/lib/python3.10/site-packages/pyparsing/core.py', line 1143, code parse_string>]
continue?
...
[<frame at 0x56027c677530, file '/home/kyle/.pyenv/versions/py310/lib/python3.10/site-packages/pyparsing/core.py', line 4136, code parseImpl>]
continue?
[<traceback object at 0x7fb93f2ef4c0>, <traceback object at 0x7fb93cfde240>, <frame at 0x56027c677750, file '/home/kyle/.pyenv/versions/py310/lib/pyth
on3.10/site-packages/pyparsing/core.py', line 925, code _parseCache>]
continue?
[<traceback object at 0x7fb93f2ef780>]
continue?
[<traceback object at 0x7fb93f2eed40>]
continue?
[<traceback object at 0x7fb93f2ef040>]
continue?
[<traceback object at 0x7fb93f2efa80>]
continue?
[<traceback object at 0x7fb93cfd7280>]
continue?
[<traceback object at 0x7fb93cfd70c0>]
continue?
[<traceback object at 0x7fb93cfd4340>]
continue?
[Expected {accent | symbol | function | operatorname | group | frac | dfrac | binom | genfrac | overset | underset | sqrt | overline}, found end of te
xt  (at char 3), (line:1, col:4)]
continue?
[<frame at 0x56027c677530, file '/home/kyle/.pyenv/versions/py310/lib/python3.10/site-packages/pyparsing/core.py', line 4136, code parseImpl>]

@CorentinJ
Copy link
Author

@anntzer How can one enable/disable mathtext programmatically?

@CorentinJ
Copy link
Author

CorentinJ commented Mar 15, 2023

I found a similar leak with plt.subplots (I suppose other plt functions would also trigger the leak) that is NOT cleared by the garbage collector. This one occurs when setting the consolas font.

It does not occur on my windows machine this time, it's an Ubuntu one.

Linux 5.4.0-135-generic #152~18.04.2-Ubuntu SMP Tue Nov 29 08:23:49 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
matplotlib.__version__ = 3.6.2
matplotlib.get_backend() = "agg"
python = Python 3.9.15 (miniconda)
import gc

import matplotlib.pyplot as plt

plt.rc("font", family="consolas")

class SomeObject:
    pass

def make_subplots():
    plt.subplots()

def leaky_fn():
    x = SomeObject()
    make_subplots()

leaky_fn()


def detect_leak():
    for obj in gc.get_objects():
        try:
            if isinstance(obj, SomeObject):
                print("Object is leaking")
        except:
            pass


# First without gc.collect()
detect_leak()

# Still leaks after gc.collect()!
gc.collect()
detect_leak()

@ksunden
Copy link
Member

ksunden commented Mar 15, 2023

In this instance the keeping of the underlying exception alive is actually intentional.

It comes from the memoisation of font lookup returns as "consolas" is not actually found.

Ref:

ret = self._findfont_cached(
prop, fontext, directory, fallback_to_default, rebuild_if_missing,
rc_params)
if isinstance(ret, Exception):
raise ret
return ret

@lru_cache(1024)
def _findfont_cached(self, prop, fontext, directory, fallback_to_default,
rebuild_if_missing, rc_params):

# This return instead of raise is intentional, as we wish to
# cache the resulting exception, which will not occur if it was
# actually raised.
return ValueError(f"Failed to find font {prop}, and fallback "
f"to the default font was disabled")

Now... perhaps in this instance it is possible to cut out the full stack reference from the returned exception?

It would make some sense to me that a memoised exception should perhaps have a limited traceback... hard to say though, because the direct place it is used it is raised, and don't necessarily want to rob users of that context when it is accurate, but when it is reraised, that could be outright misleading (plus the stack leaking, which is not ideal)?

@tacaswell
Copy link
Member

I'm working on a fix for this (I put this particular bug in attempting to fix a performance regression).

tacaswell added a commit to tacaswell/matplotlib that referenced this issue Mar 15, 2023
This leads to caching the tracebacks which can keep user's objects in local
namespaces alive indefinitely.  This can lead to very surprising memory issues
for users and will result in incorrect tracebacks.

Responsive to matplotlib#25406
@tacaswell
Copy link
Member

@CorentinJ Are you able to check if the current main branch works in your actual application?

@CorentinJ
Copy link
Author

I can confirm that the stack leak caused by the exception in setting the font is resolved on the current main (matplotlib-3.8.0.dev726+ge24008033), while the leak in the OP remains.

@tacaswell
Copy link
Member

so at this point we think there are frame loops in:

and we have fixed 1 definite problem (memoizing the error is finding fonts) and a 3 maybe problems (manually deleting frames just to be safe) in the mpl code.

Has anyone created issues upstream with pillow and pyparsing (and if so can we have xrefs)?

@CorentinJ do you have mathtext turned on by default? From reading back through this chain I am not confident we understand why not everyone can reproduce the issue with the OP code.

@ksunden
Copy link
Member

ksunden commented Mar 30, 2023

From #25569, I noticed that Pillow 9.2 presents the problem with the test included by #25470 but at least that specific instance was resolved in Pillow 9.4.

(That said, I was one of the ones having trouble reproducing originally, so slightly confused as to how I had the version of Pillow that was failing on my system, but may have been looking into other things at some point and installed a different version)

@CorentinJ can you check the version of Pillow you have installed and see if that form of this bug remains if you update to latest Pillow?

If fixed in latest, I don't think there is any action to be taken on the Pillow front (I don't see a need to e.g. block versions of Pillow over this as they still broadly work, just missing some optimization)

Pyparsing is another story, should perhaps try to come up with a minimal (non-matplotlib) example of that form and report upstream if it has not been done already.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants