Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix SSHClient/Transport leak, remove unneeded references #952

Merged
merged 4 commits into from Jun 9, 2017

Conversation

ploxiln
Copy link
Contributor

@ploxiln ploxiln commented May 4, 2017

Fix to #949 - thanks @agronick for discovering and debugging the problem.

Caused by a fix in #891. But, it's a matter of choice and policy.

This should also be back-ported to 1.17.z

(This issue description has been updated.)


The back-reference from Transport to SSHClient was added because
the SSHClient had a destructor that would close the Transport,
and some users did not want the Transport closed when the SSHClient
was garbage collected.

The SSHClient destructor was not a normal destructor, it was
implemented with the ResourceManager singleton. This sometimes
prevented the GC cycle detector from freeing the SSHClient and
Transport even after the Transport Thread stopped running.

We can simplify these problems by just getting rid of the
ResourceManager, and the back-reference. Transports cannot be
garbage-collected while their Thread is running, .close() must
be called (on the SSHClient or the Transport).

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 74.08% when pulling 94a7061 on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

4 similar comments
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 74.08% when pulling 94a7061 on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 74.08% when pulling 94a7061 on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 74.08% when pulling 94a7061 on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 74.08% when pulling 94a7061 on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

@coveralls
Copy link

coveralls commented May 4, 2017

Coverage Status

Coverage decreased (-0.09%) to 74.121% when pulling b39ab50 on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

@ploxiln
Copy link
Contributor Author

ploxiln commented May 4, 2017

side note: is every test variant (with different python version) resulting in a separate coverage report ...

@coveralls
Copy link

coveralls commented May 4, 2017

Coverage Status

Coverage decreased (-0.1%) to 74.101% when pulling f18a2cf on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

@ploxiln
Copy link
Contributor Author

ploxiln commented May 4, 2017

I've twiddled the commits a bit, and force-pushed over the previous state.
@aronick I made you "author" of the simple changes you proposed, I hope that's OK.

@coveralls
Copy link

Coverage Status

Coverage increased (+1.4%) to 75.575% when pulling 0fcc709 on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

2 similar comments
@coveralls
Copy link

Coverage Status

Coverage increased (+1.4%) to 75.575% when pulling 0fcc709 on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

@coveralls
Copy link

Coverage Status

Coverage increased (+1.4%) to 75.575% when pulling 0fcc709 on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

@coveralls
Copy link

coveralls commented May 4, 2017

Coverage Status

Coverage increased (+1.4%) to 75.602% when pulling 0fcc709 on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

@coveralls
Copy link

coveralls commented May 4, 2017

Coverage Status

Coverage decreased (-0.07%) to 74.142% when pulling e74121d on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 74.101% when pulling e74121d on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

@coveralls
Copy link

coveralls commented May 4, 2017

Coverage Status

Coverage decreased (-0.1%) to 74.115% when pulling e74121d on ploxiln:sshclient_transport_refs into bc86eee on paramiko:2.0.

@bitprophet
Copy link
Member

bitprophet commented May 4, 2017

Just turned off coveralls comments because yes, this is completely ridiculous. (CodeCov seems better anyhow and I mean to switch to that sometime...)

Don't have time to specifically review & confirm/test this at the moment but I like how much red is in the PR and at a high level (and re: the related ticket) I'm +1 on the idea that the resource manager concept is cruft that needs axing. Thanks for this!

@ploxiln
Copy link
Contributor Author

ploxiln commented May 4, 2017

@agronick could you run your test case on this branch, to confirm the leak is fixed for you?

@omeranson could you run your use case on this branch, and confirm that the connection streams keep working for you? (since I removed the back-reference which is theoretically no longer needed)

@agronick
Copy link
Contributor

agronick commented May 5, 2017

Sure, I will when I get to work tomorrow. Thanks for getting this merged so fast.

@agronick
Copy link
Contributor

agronick commented May 5, 2017

I'm happy to be the author. I would have made a PR but there were a number of ways to fix it. I didn't know which one would be the most beneficial or if the ResourceManager did anything special I wasn't aware of.

@ploxiln
Copy link
Contributor Author

ploxiln commented May 5, 2017

It's not merged yet ;)

I've tested this branch a bit, and functionality-wise stuff still works, for my use cases (mostly fabric). But I've also run some minimal-case experiments, and Transport objects do not seem to be GCed unless they are explicitly close()ed.

I've tested both python 2.7.13 and 3.6.1 on macOS. SSHClient objects are easily GCed and cause no problems once gone (now). But even if no channels are created on them, Transport objects remain, even after a sleep and a gc.collect(). Unless I call .close() on the transport (before losing my reference to it). So that's why the strange SSHClient destructor was probably done in the first place ...

@ploxiln
Copy link
Contributor Author

ploxiln commented May 5, 2017

It's because of threads.

From transport.py top level (just after imports):

# for thread cleanup
_active_threads = []

def _join_lingering_threads():
    for thr in _active_threads:
        thr.stop_thread()

import atexit
atexit.register(_join_lingering_threads)

from transport.py Transport class:

    def run(self):
        # (use the exposed "run" method, because if we specify a thread target
        # of a private method, threading.Thread will keep a reference to it
        # indefinitely, creating a GC cycle and not letting Transport ever be
        # GC'd. it's a bug in Thread.)

        # Hold reference to 'sys' so we can test sys.modules to detect
        # interpreter shutdown.
        self.sys = sys

        # active=True occurs before the thread is launched, to avoid a race
        _active_threads.append(self)
        if self.server_mode:
            self._log(DEBUG, 'starting thread (server mode): %s' % hex(long(id(self)) & xffffffff))
        else:
            self._log(DEBUG, 'starting thread (client mode): %s' % hex(long(id(self)) & xffffffff))

        ... very long run() method ...

            except Exception as e:
                self._log(ERROR, 'Unknown exception: ' + str(e))
                self._log(ERROR, util.tb_strings())
                self.saved_exception = e
            _active_threads.remove(self)

So this _active_threads reference is to clean up remaining Transport threads when exiting the process - to prevent hangs I guess, I'm not completely sure. Anyway, if I comment out all _active_threads code, the Transport still leaks (obviously because of the Thread it created).

Long story short, it looks like Transport.close() needs to be called, one way or the other, before a Transport can be GCed (and it can't be the other way around).

So that's why this ResourceManager was created to tie the lifetime of the Transport to the SSHClient (but could still probably be replaced with a __del__() method on the SSHClient).

So it seems the choice is between:

  • go back to the way things were, you need to keep a reference to the SSHClient for the Transport and Channel objects to keep working
  • you need to explicitly close the Transport (or the SSHClient)

@ploxiln
Copy link
Contributor Author

ploxiln commented May 5, 2017

Transport.__init__() does self.setDaemon(True)

Python documentation says:

Daemon threads are abruptly stopped at shutdown. Their resources (such as open files, database transactions, etc.) may not be released properly.

So I'm not sure how much the "atexit thread cleanup" is needed. But Transport.stop_thread() is non-trivial ... and even if _active_threads was removed, Transport would still leak. So I dunno...

@ploxiln
Copy link
Contributor Author

ploxiln commented May 5, 2017

I can imagine a possible fix, but it's a really huge refactor:

Make the Transport not a Thread itself, but instead have a reference to a TransportThread. Make Channel a wrapper for a RawChannel (or whatever). The point is that while Channel and Transport refer to each other, and to the inner TransportThread and RawChannel, the inner TransportThread and RawChannel do not have references back out. So if the Transport and all Channel objs are no longer referred to by user code, they can be detected as a cycle and garbage collected, and the Transport can stop the TransportThread on its way out (doing what the SSHClient destructor did, but not while the Transport or a Channel or a ChannelFile are still being used by the user).

I don't think any of us currently have the time for a change of that magnitude though ...

@ploxiln
Copy link
Contributor Author

ploxiln commented May 5, 2017

New conclusion: This branch is an improvement over the current state.

#949 was really about the fact that it was possible for the SSHClient and Transport to leak, even if Transport.close() was called.

There's a bunch of factors that contributed, but the smallest and most recent one is: the Transport._sshclient was cleared in Transport.close(), but not at the end of Transport.run() where it repeats almost all the logic in .close() (and prevents .close() from running). So if the loop in .run() was ended or skipped due to an exception (e.g. protocol error), instead of by the user calling .close(), the leak occurred.

I think the best fix is (still) to get rid of both the back-reference and the ResourceManager. I'm going to remove the addition of the destructor because I think it's useless (Transport will never be destructed until after .close() logic happens, whether in .close() or at the end of .run()).

I've experimented and tested enough now - I think this change is good.
(If your transports stay connected/running, they'll still leak. That's just how it'll have to be though.)

@omeranson
Copy link
Contributor

@ploxiln My scenario works on with this change. Thanks.

@ploxiln
Copy link
Contributor Author

ploxiln commented May 6, 2017

That's not sufficient because of the Channels.

Consider calling SSHClient.invoke_shell() and then just using the Channel it returns. (That's roughly what #891 was about.) That Channel has a reference to the real Transport, TransportThread in your case, not the wrapper Transport. So the wrapper transport can be garbage collected while the Channel is still being used, and break the connection.

So, a wrapper transport needs to override all the methods which return channels, and return wrapper channels (which have references to the wrapper transport).

@ploxiln
Copy link
Contributor Author

ploxiln commented May 6, 2017

I have removed the last commit I added, the one adding a destructor to SSHClient.

Now this pull request contains strictly only small improvement for each preference / use-case.

This is not the GC-able Transport you wanted, but it means you need to monkey-patch only one thing (SSHClient destructor) to get that. This is a bit of progress. I encourage you to add the full wrapper Transport solution (or propose re-adding the simple SSHClient destructor) in a separate follow-up pull request.

@agronick
Copy link
Contributor

agronick commented May 7, 2017

I'll try it. We should iron out specifics first and whats sufficient. Is it enough to implement what I posted above with changes for the channel to reference the transport?

@ploxiln
Copy link
Contributor Author

ploxiln commented May 7, 2017

I think that wrapper classes for Transport, Channel, and ChannelFile, should be sufficient. Wrapper ChannelFile should have a reference to wrapper Channel should have a reference to wrapper Transport. Only when the user no longer references any of the wrappers, the wrapper transport can be garbage collected and close all the real stuff.

EDIT: ChannelFile doesn't need two separate wrapper/real classes, it just needs to reference the wrapper Channel.

@agronick
Copy link
Contributor

agronick commented May 8, 2017

I don't get why you need to wrap channel. As long as everything is accessing Transport and only Transport is accessing TransportThread it should be fine. Why would anything need direct access for TransportThread? If channel is GCed, and Transport has no references, the transport thread gets stopped.

Now what is sufficient for making a pass-throughs. It is just a class like the one I posted above or do the methods need to be explicitly there? This would be the easiest way.

2nd easiest is to prefix every method on TransportThread with an underscore and have the public methods on transport pass though to TransportThread. That shows up in debuggers.

3rd Easiest is to start by moving the run method to TransportThread and the functions it calls. Try to split up each function between the two in a way that makes logical sense, while keeping all the public methods on Transport and adding pass-throughs where appropriate.

@ploxiln
Copy link
Contributor Author

ploxiln commented May 8, 2017

TransportThread needs references to (real) Channels. So (real) Channels will never be GCed. (But, the TransportThread does not need references to ChannelFiles.)

The longer this thread gets, the smaller the chance of any change being merged ;)

EDIT: just clarifying that I am not one who can merge, so someone else will need to digest this thread. Thus the recommendation to break out the refactor for GC-enablement to a new pull-request, for easier digestion.

@agronick
Copy link
Contributor

Give me two weeks and I'll have something. I'd really like some clarification on how this should be done though. Is it important that the methods are explicitly declared on the helper classes?

agronick and others added 2 commits June 2, 2017 03:15
The back-reference from Transport to SSHClient was added because
the SSHClient had a destructor that would close the Transport,
and some users did not want the Transport closed when the SSHClient
was garbage collected.

The SSHClient destructor was not a normal destructor, it was
implemented with the ResourceManager singleton. Together with
the back-reference, this prevented the GC cycle detector from
freeing the SSHClient and Transport.
It was only recently added, and it's not really needed after the
ResourceManager removal. Removing it allows the SSHClient to be
garbage-collected if only the Transport (and Channels) are still
in use.
@ploxiln ploxiln force-pushed the sshclient_transport_refs branch 3 times, most recently from dc7d0f6 to 6f057de Compare June 2, 2017 07:36
@ploxiln
Copy link
Contributor Author

ploxiln commented Jun 2, 2017

rebased to resolve conflicts. also tweaked a related test I happened to notice so it could be re-enabled for python3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants