New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Paramiko has huge memory leaks (Solution Found) #949

Open
agronick opened this Issue Apr 28, 2017 · 12 comments

Comments

Projects
None yet
4 participants
@agronick
Contributor

agronick commented Apr 28, 2017

Edit: I found a solution to this. There is a bug in Paramiko. Skip to the last post for the important details.

I wrap all of my ssh connections inside the classes shown below. The classes are invoked like this:

        s = SshTask()
        s.timeout = s.run_timeout = 14
        s.to_host = form.get_store('vmh01')
        s.key = settings.KEY
        s.command = " "
        packages = s.run()

This should leave nothing behind. I scripted a page to repeatedly make the same request to the same server over and over again. I installed Dozer on my Django app. Debug mode is turned off. After a few hours of running my memory went form 100mb to 272mb. Paramiko seems to be the only thing that isn't cleaning up after itself.

Now that the app is idle this is what it is showing for counts:
paramiko.agent.Agent Min: 345 Cur: 345 Max: 345
paramiko.auth_handler.AuthHandler Min: 345 Cur: 345 Max: 345
paramiko.buffered_pipe.BufferedPipe Min: 690 Cur: 690 Max: 690
paramiko.channel.Channel Min: 345 Cur: 345 Max: 345
paramiko.client.AutoAddPolicy Min: 351 Cur: 351 Max: 351
paramiko.client.SSHClient Min: 351 Cur: 351 Max: 351
paramiko.ecdsakey._ECDSACurve Min: 3 Cur: 3 Max: 3
paramiko.ecdsakey._ECDSACurveSet Min: 1 Cur: 1 Max: 1
paramiko.hostkeys.HostKeyEntry Min: 345 Cur: 345 Max: 345
paramiko.hostkeys.HostKeys Min: 702 Cur: 702 Max: 702
paramiko.message.Message Min: 345 Cur: 345 Max: 345
paramiko.packet.Packetizer Min: 351 Cur: 351 Max: 351
paramiko.py3compat.long Min: 3 Cur: 3 Max: 3
paramiko.resource.ResourceManager Min: 1 Cur: 1 Max: 1
paramiko.rsakey.RSAKey Min: 690 Cur: 690 Max: 690
paramiko.ssh_exception.SSHException Min: 6 Cur: 6 Max: 6
paramiko.transport.ChannelMap Min: 351 Cur: 351 Max: 351
paramiko.transport.Transport Min: 351 Cur: 351 Max: 351
paramiko.util.PFilter Min: 1 Cur: 1 Max: 1

Edit: Narrowed this down more, see below post. Removed my implementation. Not relevant anymore.

@agronick

This comment has been minimized.

Contributor

agronick commented May 1, 2017

Here is a screenshot of all related objects from a transport object that isn't being cleaned up.

screenshot_20170502_143110

Edit: Seems like this might only happen when the following exception is thrown:

paramiko.ssh_exception.SSHException: Error reading SSH protocol banner[Errno 104] Connection reset by peer

Clean exiting connections seem to be cleaned up fine.

Edit: More research, I seem to have gotten to the bottom of this. I'm seeing that sometimes things will loop in read_all in the packetizer until a rekey exception is thrown. Everything seems to exit normally. For some reason the weak references in the resource manager don't seem to work as intended. You can see that the only thing holding these transport objects in memory is a weak reference. Even calling gc.collect() doesn't make them go away.

screenshot_20170502_150735

All these transports sit in memory, as well as the client connected to them, as well as their last message, and an EOF exception from the last_exception key, and a stack trace on that exception. - A bunch of crap that doesn't seem to go away after any amount of time.

The whole resource manager object seems to be needlessly complex, all to get around the common idiom that you shouldn't rely on the __del__ method. Well the ResourceManager does exactly what the __del__ method would do, except the __del__ method actually works.

If you get rid of the ResourceManager and just throw this on the bottom of Transport:

    def __del__(self):
        self.close()

this whole problem goes away and everything is cleaned up as soon as the rekey exception is thrown. This is on vanilla Python 3.6.

These monkey patches resolve the problem for me:

paramiko.transport.Transport.__del__ = lambda self: self.close()
paramiko.resource.ResourceManager.register = lambda *args: None
@agronick

This comment has been minimized.

Contributor

agronick commented May 3, 2017

I found out why this is happening. It all has to do with the ResourceManager. The resource manager says when Client is deleted close the transport. The Client has a reference to transport. Transport has a reference to Client - thats ok. They cancel out.

But when an the transport is registered this method is created which references the Transport as resource:

(Transport == resource)

        def callback(ref):
            try:
                resource.close()
            except:
                pass
            del self._table[id(resource)]

It assigns this as a callback to be called when the Client is deleted. Now here is the issue. This function references Transport bumping Transport's reference counter up by 1.

As long as this callback exists is memory the Transport's reference counter will never be 0. This callback will exist in memory as long as the Client exists in memory. The Client will exist as long as self._sshclient is set to the Client on the transport.

The only place that self._sshclient is set to None on the Transport is in the close method. So if close() is never called on the transport (because of an exception) you have a cyclical reference that never goes away.

I can think of a few ways to fix this. I still think the best way is to get rid of the ResourceManager entirely. Simple is better than complex. A __del__ method on the Transport does the same thing as this.

It would be nice to hear back from someone. Multiple posts on this with no replies, I feel like I'm talking to myself.

@agronick agronick changed the title from Paramiko has huge memory leaks to Paramiko has huge memory leaks (Solution Found) May 3, 2017

@ploxiln

This comment has been minimized.

Contributor

ploxiln commented May 4, 2017

This change was recently merged (cherry-picked): #891
That fixed some problems, but I think might have inadvertently caused this leak.

@ploxiln

This comment has been minimized.

Contributor

ploxiln commented May 4, 2017

Thanks for the thorough analysis of the problem!

So this ResourceManager callback system is what caused the problem that #891 tried to fix: if a reference to the client was not being saved by the program using paramiko, but it was still using the transport, the client would be garbage collected, and then it would close the transport while the program was still using it. But the fix made the ResourceManager useless for this relationship. (And the ResourceManager isn't used anywhere else.)

It seems like the fix is to get rid of the ResourceManager, and figure out the GC situation anew. It's prudent to read why it was introduced in the first place: 029b898

EDIT: fixed link

@ploxiln

This comment has been minimized.

Contributor

ploxiln commented May 4, 2017

... I now notice your simple "monkey patch" fix, to just get rid of the ResourceManager and add a simple Transport destructor. That seems to make sense.

The ResourceManager does seem pointless. It seems to be nothing more than a SSHClient destructor, but with a complication which "tricks" the GC cycle detector. I hope you don't mind if I describe what happens here in my own way:

  1. module has reference to singleton ResourceManager
  2. singleton ResourceManager has reference to _table member
  3. _table has (normal/strong) reference to weakref object
    1. weakref has a "weak" reference to SSHClient, does not contribute to SSHClient's reference count
  4. weakref has a (normal/strong) reference to the callback function!
  5. callback has a reference to the Transport

So the GC cycle detector doesn't see that the ResourceManager reference to the Transport is really due to the SSHClient object, it looks like an unrelated reference from the ResourceManager singleton, it looks independent.

Back to what makes sense as the fix. Assuming we remove the ResourceManager: If the program using the SSHClient calls .close() on it, it will also .close() the transport - so correct programs will work correctly, right? We could leave the back-reference from Transport to SSHClient, either .close() or the cycle detector will clean up, but is the back-reference still needed for anything, can we actually remove that too? I think we can...

@agronick

This comment has been minimized.

Contributor

agronick commented May 5, 2017

Yeah, that's exactly whats happening. It took me a day just thinking about how I got it to work right. When I was walking back to my car the second day it finally clicked why it was all happening. It was a real eureka moment.

So issue #44 was that originally the Client getting garbage collected caused close() to be called on the transport. If we just let transport close itself and don't do anything when the Client is garbage collected, we can get remove the ResourceManager and let the transport clean up on gc? Then everything is fine including #44.

@ploxiln

This comment has been minimized.

Contributor

ploxiln commented May 5, 2017

There's another aspect to this:

The back-reference Transport._sshclient was cleared in the body of Transport.close(), but that would not run if the "SSHException: Error reading SSH protocol banner" was thrown, because that causes Transport.run() to clean up the Transport as it ends, and it clears Transport.active to prevent close() from having effect. But the .run() cleanup neglected to clear ._sshclient.

So if that exception was thrown, even if you called .close(), the Transport / SSHClient would still leak due to the ResourceManager. But if you called .close() before .run() ends, it un-sets ._sshclient and breaks the cycle.

Anyway, ResourceManager and ._sshclient should still be removed.

ploxiln added a commit to ploxiln/paramiko that referenced this issue May 5, 2017

@agronick

This comment has been minimized.

Contributor

agronick commented May 5, 2017

Yeah, thats what probably made it a hard bug to track down since normal execution wouldn't cause a leak.

@bitprophet

This comment has been minimized.

Member

bitprophet commented May 22, 2017

Putting thoughts here since it's the issue and not the PRs, discussion going across PRs can be hard to track 🙃

First, I haven't crawled the discussion and the patchsets in detail yet, apologies. Enough to get the high level gist of "reorganize classes/references so that GC can actually function as intended". Seems reasonable.

Second, @ploxiln's earlier PR mentioned porting the work back to 1.17+, which I agree with conceptually (memory leaks suck, such a bugfix should be made widely available). Unfortunately, these PRs seem to involve very large diffs - as per the first point - which is at odds with merging to stable, bugfix oriented branches. (Implicit: because the more code you move around the more instability may result.)

So I'm very of two minds on that point. Not sure how to square it...perhaps apply to a 2.2 release, wait for any unintended side effects to fall out, then backport? Or suck it up and say no, you don't get this on old versions, please modernize and roll forwards. (Which is increasingly how we need to do things re: cryptography support, possibly dropping python 2.6, etc.)

Opinions welcome on that.

Third, thanks to @ploxiln and @agronick for doing all of this heavy lifting, I super appreciate it! I would definitely like to get this merged in the shorter term even if it means making unpopular decisions re: which branches get it 😁

@ploxiln

This comment has been minimized.

Contributor

ploxiln commented May 22, 2017

I think #952 should be merged first, and backported. It's the smaller fix, and rather simple. Without it, it's possible to have these leaks even if .close() is called on the Transport or SSHClient. With it, if you make sure .close() is always eventually called, you can avoid leaks.

Then, I think #964 should be validated, and then merged to master. That is the fix that makes GC work to close Transport automatically when it is truly no longer used. We could wait for people who need to use 1.18 to report problems before backporting this one, IMHO.

@bitprophet

This comment has been minimized.

Member

bitprophet commented May 23, 2017

Right on, I'd forgotten about #952 and that is definitely much easier to swallow as a backport. Thanks!

ploxiln added a commit to ploxiln/paramiko that referenced this issue Jun 2, 2017

ploxiln added a commit to ploxiln/paramiko that referenced this issue Jun 2, 2017

ploxiln added a commit to ploxiln/paramiko that referenced this issue Jun 6, 2017

ploxiln added a commit to ploxiln/paramiko that referenced this issue Jun 6, 2017

@rustyscottweber

This comment has been minimized.

rustyscottweber commented May 10, 2018

Looking at this issues and many others like it.. I'm wondering if there is a test in that can be run to asses if there are still memory leaks and where the memory leaks are at.
Is there a test that already exists? Or should I potentially look at writing one?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment