New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

fix SSHClient/Transport leak, remove unneeded references #952

Merged

bitprophet merged 4 commits into paramiko:2.0 from ploxiln:sshclient_transport_refs

Jun 9, 2017

Contributor

ploxiln commented May 4, 2017 •

edited

Fix to #949 - thanks @agronick for discovering and debugging the problem.

Caused by a fix in #891. But, it's a matter of choice and policy.

This should also be back-ported to 1.17.z

(This issue description has been updated.)

The back-reference from Transport to SSHClient was added because
the SSHClient had a destructor that would close the Transport,
and some users did not want the Transport closed when the SSHClient
was garbage collected.

The SSHClient destructor was not a normal destructor, it was
implemented with the ResourceManager singleton. This sometimes
prevented the GC cycle detector from freeing the SSHClient and
Transport even after the Transport Thread stopped running.

We can simplify these problems by just getting rid of the
ResourceManager, and the back-reference. Transports cannot be
garbage-collected while their Thread is running, .close() must
be called (on the SSHClient or the Transport).

ploxiln force-pushed the sshclient_transport_refs branch from 94a7061 to b39ab50 Compare

May 4, 2017 05:32

coveralls commented May 4, 2017

Coverage decreased (-0.1%) to 74.08% when pulling 94a7061 on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

4 similar comments

coveralls commented May 4, 2017

Coverage decreased (-0.1%) to 74.08% when pulling 94a7061 on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

coveralls commented May 4, 2017

Coverage decreased (-0.1%) to 74.08% when pulling 94a7061 on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

coveralls commented May 4, 2017

Coverage decreased (-0.1%) to 74.08% when pulling 94a7061 on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

coveralls commented May 4, 2017

Coverage decreased (-0.1%) to 74.08% when pulling 94a7061 on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

coveralls commented May 4, 2017 •

edited

Coverage decreased (-0.09%) to 74.121% when pulling b39ab50 on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

Contributor Author

ploxiln commented May 4, 2017

side note: is every test variant (with different python version) resulting in a separate coverage report ...

ploxiln force-pushed the sshclient_transport_refs branch from b39ab50 to f18a2cf Compare

May 4, 2017 06:02

coveralls commented May 4, 2017 •

edited

Coverage decreased (-0.1%) to 74.101% when pulling f18a2cf on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

ploxiln force-pushed the sshclient_transport_refs branch from f18a2cf to d7dc296 Compare

May 4, 2017 14:10

Contributor Author

ploxiln commented May 4, 2017

I've twiddled the commits a bit, and force-pushed over the previous state.
@aronick I made you "author" of the simple changes you proposed, I hope that's OK.

coveralls commented May 4, 2017

Coverage increased (+1.4%) to 75.575% when pulling 0fcc709 on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

2 similar comments

coveralls commented May 4, 2017

Coverage increased (+1.4%) to 75.575% when pulling 0fcc709 on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

coveralls commented May 4, 2017

Coverage increased (+1.4%) to 75.575% when pulling 0fcc709 on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

coveralls commented May 4, 2017 •

edited

Coverage increased (+1.4%) to 75.602% when pulling 0fcc709 on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

coveralls commented May 4, 2017 •

edited

Coverage decreased (-0.07%) to 74.142% when pulling e74121d on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

coveralls commented May 4, 2017

Coverage decreased (-0.1%) to 74.101% when pulling e74121d on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

coveralls commented May 4, 2017 •

edited

Coverage decreased (-0.1%) to 74.115% when pulling e74121d on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

Member

bitprophet commented May 4, 2017 •

edited

Just turned off coveralls comments because yes, this is completely ridiculous. (CodeCov seems better anyhow and I mean to switch to that sometime...)

Don't have time to specifically review & confirm/test this at the moment but I like how much red is in the PR and at a high level (and re: the related ticket) I'm +1 on the idea that the resource manager concept is cruft that needs axing. Thanks for this!

Contributor Author

ploxiln commented May 4, 2017

@agronick could you run your test case on this branch, to confirm the leak is fixed for you?

@omeranson could you run your use case on this branch, and confirm that the connection streams keep working for you? (since I removed the back-reference which is theoretically no longer needed)

Contributor

agronick commented May 5, 2017

Sure, I will when I get to work tomorrow. Thanks for getting this merged so fast.

Contributor

agronick commented May 5, 2017

I'm happy to be the author. I would have made a PR but there were a number of ways to fix it. I didn't know which one would be the most beneficial or if the ResourceManager did anything special I wasn't aware of.

Contributor Author

ploxiln commented May 5, 2017 •

edited

It's not merged yet ;)

I've tested this branch a bit, and functionality-wise stuff still works, for my use cases (mostly fabric). But I've also run some minimal-case experiments, and Transport objects do not seem to be GCed unless they are explicitly close()ed.

I've tested both python 2.7.13 and 3.6.1 on macOS. SSHClient objects are easily GCed and cause no problems once gone (now). But even if no channels are created on them, Transport objects remain, even after a sleep and a gc.collect(). Unless I call .close() on the transport (before losing my reference to it). So that's why the strange SSHClient destructor was probably done in the first place ...

Contributor Author

ploxiln commented May 5, 2017

It's because of threads.

From transport.py top level (just after imports):

# for thread cleanup
_active_threads = []

def _join_lingering_threads():
    for thr in _active_threads:
        thr.stop_thread()

import atexit
atexit.register(_join_lingering_threads)

from transport.py Transport class:

    def run(self):
        # (use the exposed "run" method, because if we specify a thread target
        # of a private method, threading.Thread will keep a reference to it
        # indefinitely, creating a GC cycle and not letting Transport ever be
        # GC'd. it's a bug in Thread.)

        # Hold reference to 'sys' so we can test sys.modules to detect
        # interpreter shutdown.
        self.sys = sys

        # active=True occurs before the thread is launched, to avoid a race
        _active_threads.append(self)
        if self.server_mode:
            self._log(DEBUG, 'starting thread (server mode): %s' % hex(long(id(self)) & xffffffff))
        else:
            self._log(DEBUG, 'starting thread (client mode): %s' % hex(long(id(self)) & xffffffff))

        ... very long run() method ...

            except Exception as e:
                self._log(ERROR, 'Unknown exception: ' + str(e))
                self._log(ERROR, util.tb_strings())
                self.saved_exception = e
            _active_threads.remove(self)

So this _active_threads reference is to clean up remaining Transport threads when exiting the process - to prevent hangs I guess, I'm not completely sure. Anyway, if I comment out all _active_threads code, the Transport still leaks (obviously because of the Thread it created).

Long story short, it looks like Transport.close() needs to be called, one way or the other, before a Transport can be GCed (and it can't be the other way around).

So that's why this ResourceManager was created to tie the lifetime of the Transport to the SSHClient (but could still probably be replaced with a __del__() method on the SSHClient).

So it seems the choice is between:

go back to the way things were, you need to keep a reference to the SSHClient for the Transport and Channel objects to keep working
you need to explicitly close the Transport (or the SSHClient)

Contributor Author

ploxiln commented May 5, 2017

Transport.__init__() does self.setDaemon(True)

Python documentation says:

Daemon threads are abruptly stopped at shutdown. Their resources (such as open files, database transactions, etc.) may not be released properly.

So I'm not sure how much the "atexit thread cleanup" is needed. But Transport.stop_thread() is non-trivial ... and even if _active_threads was removed, Transport would still leak. So I dunno...

Contributor Author

ploxiln commented May 5, 2017

I can imagine a possible fix, but it's a really huge refactor:

Make the Transport not a Thread itself, but instead have a reference to a TransportThread. Make Channel a wrapper for a RawChannel (or whatever). The point is that while Channel and Transport refer to each other, and to the inner TransportThread and RawChannel, the inner TransportThread and RawChannel do not have references back out. So if the Transport and all Channel objs are no longer referred to by user code, they can be detected as a cycle and garbage collected, and the Transport can stop the TransportThread on its way out (doing what the SSHClient destructor did, but not while the Transport or a Channel or a ChannelFile are still being used by the user).

I don't think any of us currently have the time for a change of that magnitude though ...

ploxiln force-pushed the sshclient_transport_refs branch from e74121d to 374cd37 Compare

May 5, 2017 05:47

Contributor Author

ploxiln commented May 5, 2017 •

edited

New conclusion: This branch is an improvement over the current state.

#949 was really about the fact that it was possible for the SSHClient and Transport to leak, even if Transport.close() was called.

There's a bunch of factors that contributed, but the smallest and most recent one is: the Transport._sshclient was cleared in Transport.close(), but not at the end of Transport.run() where it repeats almost all the logic in .close() (and prevents .close() from running). So if the loop in .run() was ended or skipped due to an exception (e.g. protocol error), instead of by the user calling .close(), the leak occurred.

I think the best fix is (still) to get rid of both the back-reference and the ResourceManager. I'm going to remove the addition of the destructor because I think it's useless (Transport will never be destructed until after .close() logic happens, whether in .close() or at the end of .run()).

I've experimented and tested enough now - I think this change is good.
(If your transports stay connected/running, they'll still leak. That's just how it'll have to be though.)

Contributor

omeranson commented May 5, 2017

@ploxiln My scenario works on with this change. Thanks.

Contributor Author

ploxiln commented May 6, 2017

That's not sufficient because of the Channels.

Consider calling SSHClient.invoke_shell() and then just using the Channel it returns. (That's roughly what #891 was about.) That Channel has a reference to the real Transport, TransportThread in your case, not the wrapper Transport. So the wrapper transport can be garbage collected while the Channel is still being used, and break the connection.

So, a wrapper transport needs to override all the methods which return channels, and return wrapper channels (which have references to the wrapper transport).

ploxiln force-pushed the sshclient_transport_refs branch from 965b9ea to a6e6fe7 Compare

May 6, 2017 21:32

Contributor Author

ploxiln commented May 6, 2017

I have removed the last commit I added, the one adding a destructor to SSHClient.

Now this pull request contains strictly only small improvement for each preference / use-case.

This is not the GC-able Transport you wanted, but it means you need to monkey-patch only one thing (SSHClient destructor) to get that. This is a bit of progress. I encourage you to add the full wrapper Transport solution (or propose re-adding the simple SSHClient destructor) in a separate follow-up pull request.

Contributor

agronick commented May 7, 2017

I'll try it. We should iron out specifics first and whats sufficient. Is it enough to implement what I posted above with changes for the channel to reference the transport?

Contributor Author

ploxiln commented May 7, 2017 •

edited

I think that wrapper classes for Transport, Channel, and ChannelFile, should be sufficient. Wrapper ChannelFile should have a reference to wrapper Channel should have a reference to wrapper Transport. Only when the user no longer references any of the wrappers, the wrapper transport can be garbage collected and close all the real stuff.

EDIT: ChannelFile doesn't need two separate wrapper/real classes, it just needs to reference the wrapper Channel.

nikolaik mentioned this pull request

Replace coveralls with codecov, comments disabled #956

Closed

Contributor

agronick commented May 8, 2017

I don't get why you need to wrap channel. As long as everything is accessing Transport and only Transport is accessing TransportThread it should be fine. Why would anything need direct access for TransportThread? If channel is GCed, and Transport has no references, the transport thread gets stopped.

Now what is sufficient for making a pass-throughs. It is just a class like the one I posted above or do the methods need to be explicitly there? This would be the easiest way.

2nd easiest is to prefix every method on TransportThread with an underscore and have the public methods on transport pass though to TransportThread. That shows up in debuggers.

3rd Easiest is to start by moving the run method to TransportThread and the functions it calls. Try to split up each function between the two in a way that makes logical sense, while keeping all the public methods on Transport and adding pass-throughs where appropriate.

Contributor Author

ploxiln commented May 8, 2017 •

edited

TransportThread needs references to (real) Channels. So (real) Channels will never be GCed. (But, the TransportThread does not need references to ChannelFiles.)

The longer this thread gets, the smaller the chance of any change being merged ;)

EDIT: just clarifying that I am not one who can merge, so someone else will need to digest this thread. Thus the recommendation to break out the refactor for GC-enablement to a new pull-request, for easier digestion.

Contributor

agronick commented May 13, 2017

Give me two weeks and I'll have something. I'd really like some clarification on how this should be done though. Is it important that the methods are explicitly declared on the helper classes?

agronick pushed a commit to agronick/paramiko that referenced this pull request


          Fix for transport memory leak paramiko#952

f03911a

agronick mentioned this pull request

Memory leaks with Paramiko / Transport wrapper class (Bug #949) #964

Open

ploxiln mentioned this pull request

Paramiko has huge memory leaks (Solution Found) #949

Open

agronick and others added 2 commits

June 2, 2017 03:15


          remove ResourceManager to fix leak of Transport/SSHClient

a811ed3

The back-reference from Transport to SSHClient was added because
the SSHClient had a destructor that would close the Transport,
and some users did not want the Transport closed when the SSHClient
was garbage collected.

The SSHClient destructor was not a normal destructor, it was
implemented with the ResourceManager singleton. Together with
the back-reference, this prevented the GC cycle detector from
freeing the SSHClient and Transport.


          remove back-reference from Transport to SSHClient

60f40bb

It was only recently added, and it's not really needed after the
ResourceManager removal. Removing it allows the SSHClient to be
garbage-collected if only the Transport (and Channels) are still
in use.

ploxiln force-pushed the sshclient_transport_refs branch 3 times, most recently from dc7d0f6 to 6f057de Compare

June 2, 2017 07:36


          update changelog with fix for paramiko#949

91d644a

ploxiln force-pushed the sshclient_transport_refs branch from 6f057de to 91d644a Compare

June 2, 2017 07:37


          re-enable client cleanup test for python3

978b652

Contributor Author

ploxiln commented Jun 2, 2017

rebased to resolve conflicts. also tweaked a related test I happened to notice so it could be re-enabled for python3

ploxiln mentioned this pull request

sshclient / transport leak partial fix (backport) #987

Merged

bitprophet merged commit 978b652 into paramiko:2.0

This was referenced Aug 30, 2017

SFTPClient does not keep a reference to the Client #344

Open

Cannot return channel or file-like object from a function, will close connection!! #440

Open

ploxiln mentioned this pull request

No module named resource #1054

Open

ploxiln deleted the sshclient_transport_refs branch

September 13, 2017 18:41

dwalleck mentioned this pull request

SSH plugin breaks due to bug fixes in Paramiko CafeHub/opencafe#67

Open

borellim mentioned this pull request

Paramiko occasionally fails to open transport -- 'SSHException: Error reading SSH protocol banner' aiidateam/aiida-core#1853

Closed

bskinn mentioned this pull request

Close socket on failure #1822

Closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment