Paramiko has huge memory leaks (Solution Found) #949
Edit: I found a solution to this. There is a bug in Paramiko. Skip to the last post for the important details.
I wrap all of my ssh connections inside the classes shown below. The classes are invoked like this:
This should leave nothing behind. I scripted a page to make the same request to the same server over and over again. I installed Dozer on my Django app. Debug mode is turned off. After a few hours of running, my memory went from 100 MB to 272 MB. Paramiko seems to be the only thing that isn't cleaning up after itself.
Now that the app is idle this is what it is showing for counts:
Edit: Narrowed this down more, see the post below. Removed my implementation. Not relevant anymore.
Here is a screenshot of all related objects from a transport object that isn't being cleaned up.
Edit: Seems like this might only happen when the following exception is thrown:
Clean exiting connections seem to be cleaned up fine.
Edit: More research, I seem to have gotten to the bottom of this. I'm seeing that sometimes things will loop in read_all in the packetizer until a rekey exception is thrown. Everything seems to exit normally. For some reason the weak references in the resource manager don't seem to work as intended. You can see that the only thing holding these transport objects in memory is a weak reference. Even calling gc.collect() doesn't make them go away.
All these transports sit in memory, as well as the client connected to them, their last message, an EOF exception under the last_exception key, and a stack trace on that exception: a bunch of crap that doesn't go away after any amount of time.
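Counts like these can also be checked without extra tooling by asking the garbage collector directly. A minimal sketch, assuming only the standard library; `count_live` is a hypothetical helper, not part of Dozer or Paramiko:

```python
import gc


def count_live(cls):
    """Return how many instances of exactly `cls` the GC still tracks."""
    # Force a full collection first so that ordinary cyclic garbage
    # is not mistaken for a leak.
    gc.collect()
    return sum(1 for obj in gc.get_objects() if type(obj) is cls)


if __name__ == "__main__":
    class Probe:
        pass

    keep = [Probe(), Probe()]
    print(count_live(Probe))  # instances still referenced by `keep`
    keep.clear()
    print(count_live(Probe))  # nothing references them any more
```

If the count for a class keeps growing across requests even after a forced collection, something is still holding strong references to those instances.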
The whole resource manager object seems to be needlessly complex, all to get around the common idiom that you shouldn't rely on the
If you get rid of the ResourceManager and just throw this on the bottom of Transport:
this whole problem goes away and everything is cleaned up as soon as the rekey exception is thrown. This is on vanilla Python 3.6.
These monkey patches resolve the problem for me:
I found out why this is happening. It all has to do with the ResourceManager. The resource manager says: when the Client is deleted, close the transport. The Client has a reference to the Transport. The Transport has a reference to the Client. That's OK, they cancel out.
But when the transport is registered, this method is created, which references the Transport as resource:
(Transport == resource)
It assigns this as a callback to be called when the Client is deleted. Now here is the issue. This function references the Transport, bumping the Transport's reference count up by 1.
As long as this callback exists in memory, the Transport's reference count will never be 0. This callback will exist in memory as long as the Client exists in memory. The Client will exist as long as self._sshclient is set to the Client on the transport.
The only place that self._sshclient is set to None on the Transport is in the close method. So if close() is never called on the transport (because of an exception) you have a cyclical reference that never goes away.
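The reference chain described above can be reproduced with toy classes. This is a simplified model of the pattern, not Paramiko's code: `ToyClient`, `ToyTransport`, `register` and `_table` are hypothetical names, and the only assumption is that a long-lived table keeps a weak reference to the client whose callback closes over the transport.

```python
import gc
import weakref

# Stands in for the long-lived manager's table of weak references.
_table = {}


def register(obj, resource):
    # The closure holds a STRONG reference to `resource`.
    def callback(ref):
        resource.close()
        del _table[id(resource)]

    # Weak reference to obj; the callback fires when obj is destroyed.
    _table[id(resource)] = weakref.ref(obj, callback)


class ToyTransport:
    def __init__(self):
        self._sshclient = None

    def close(self):
        # As in the description above, close() is the only place
        # where the back-reference to the client is dropped.
        self._sshclient = None


class ToyClient:
    def __init__(self):
        self._transport = ToyTransport()
        self._transport._sshclient = self
        register(self, self._transport)


def transport_survives(call_close):
    """Return True if the transport is still alive after gc.collect()."""
    client = ToyClient()
    probe = weakref.ref(client._transport)
    if call_close:
        client._transport.close()
    del client
    gc.collect()
    return probe() is not None
```

In this model, `transport_survives(call_close=False)` returns True: the table reaches the callback, the callback reaches the transport, and the transport reaches the client, so the client is never destroyed and the callback never fires. With `call_close=True` the back-reference is cleared first, the client is freed, the callback runs, and everything goes away.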
I can think of a few ways to fix this. I still think the best way is to get rid of the ResourceManager entirely. Simple is better than complex. A
It would be nice to hear back from someone. Multiple posts on this with no replies, I feel like I'm talking to myself.
Thanks for the thorough analysis of the problem!
So this ResourceManager callback system is what caused the problem that #891 tried to fix: if a reference to the client was not being saved by the program using paramiko, but it was still using the transport, the client would be garbage collected, and then it would close the transport while the program was still using it. But the fix made the ResourceManager useless for this relationship. (And the ResourceManager isn't used anywhere else.)
It seems like the fix is to get rid of the ResourceManager, and figure out the GC situation anew. It's prudent to read why it was introduced in the first place: 029b898
EDIT: fixed link
... I now notice your simple "monkey patch" fix, to just get rid of the ResourceManager and add a simple Transport destructor. That seems to make sense.
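A rough sketch of that idea with toy classes (hypothetical names; this is neither the actual patch nor Paramiko's classes). It assumes Python 3.4 or later, where the cycle collector can run `__del__` on objects that are part of a reference cycle. It models only the reference structure between client and transport, nothing else about the real classes.

```python
import gc

closed_log = []  # records which toy transports were closed


class ToyTransport:
    def __init__(self, name):
        self.name = name
        self._sshclient = None
        self.active = True

    def close(self):
        if self.active:
            self.active = False
            self._sshclient = None
            closed_log.append(self.name)

    def __del__(self):
        # Destructor: close when the object is about to be collected.
        self.close()


class ToyClient:
    def __init__(self, name):
        # Client and transport reference each other, forming a cycle,
        # but no outside registry holds either of them.
        self._transport = ToyTransport(name)
        self._transport._sshclient = self


def run():
    client = ToyClient("t1")
    del client      # close() was never called explicitly
    gc.collect()    # the now unreachable cycle is finalized here
    return list(closed_log)
```

With no table keeping an outside reference, the client/transport pair is ordinary cyclic garbage, so `run()` reports the transport as closed even though `close()` was never called by the program.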
The ResourceManager does seem pointless. It seems to be nothing more than a SSHClient destructor, but with a complication which "tricks" the GC cycle detector. I hope you don't mind if I describe what happens here in my own way:
So the GC cycle detector doesn't see that the ResourceManager reference to the Transport is really due to the SSHClient object, it looks like an unrelated reference from the ResourceManager singleton, it looks independent.
Back to what makes sense as the fix. Assuming we remove the ResourceManager: If the program using the SSHClient calls
referenced this issue
May 4, 2017
Yeah, that's exactly what's happening. It took me a day just thinking about how I got it to work right. When I was walking back to my car the second day, it finally clicked why it was all happening. It was a real eureka moment.
So issue #44 was that originally the Client getting garbage collected caused close() to be called on the transport. If we just let the transport close itself and don't do anything when the Client is garbage collected, we can remove the ResourceManager and let the transport clean up on gc? Then everything is fine, including #44.
There's another aspect to this:
So if that exception was thrown, even if you called
referenced this issue
May 19, 2017
Putting thoughts here since it's the issue and not the PRs; discussion going across PRs can be hard to track.
First, I haven't crawled the discussion and the patchsets in detail yet, apologies. Enough to get the high level gist of "reorganize classes/references so that GC can actually function as intended". Seems reasonable.
Second, @ploxiln's earlier PR mentioned porting the work back to 1.17+, which I agree with conceptually (memory leaks suck, such a bugfix should be made widely available). Unfortunately, these PRs seem to involve very large diffs - as per the first point - which is at odds with merging to stable, bugfix oriented branches. (Implicit: because the more code you move around the more instability may result.)
So I'm very much of two minds on that point. Not sure how to square it... perhaps apply to a 2.2 release, wait for any unintended side effects to fall out, then backport? Or suck it up and say no, you don't get this on old versions, please modernize and roll forwards. (Which is increasingly how we need to do things re: cryptography support, possibly dropping Python 2.6, etc.)
Opinions welcome on that.
Third, thanks to @ploxiln and @agronick for doing all of this heavy lifting, I super appreciate it! I would definitely like to get this merged in the shorter term even if it means making unpopular decisions re: which branches get it.
I think #952 should be merged first, and backported. It's the smaller fix, and rather simple. Without it, it's possible to have these leaks even if
Then, I think #964 should be validated, and then merged to master. That is the fix that makes GC work to close Transport automatically when it is truly no longer used. We could wait for people who need to use 1.18 to report problems before backporting this one, IMHO.