New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

One 'bad' Transport thread hangs indefinitely at shutdown when multiple Transports are active #520

Open
colinmcintosh opened this Issue Apr 30, 2015 · 53 comments

Comments

Projects
None yet
@colinmcintosh

colinmcintosh commented Apr 30, 2015

[MAINTAINER NOTE: a variant of this issue under Python 3 has been worked around as per this comment but it's presumably still at large for Python 2 in some scenarios.]

When running SSH connections to multiple devices the script will sometimes hang once the sys.exit(0) is reached. Once this is reached the script will hang indefinitely. It doesn't happen everytime but the more devices the script runs against the more likely it is to happen. It seems like it's related to the amount of time the script takes to run.

The last log message paramiko outputs is DEBUG:EOF in transport thread

Using the faulthandler module I dumped a stacktrace for when it hangs:

Fatal Python error: Aborted

Thread 0x00007f47faff5700 (most recent call first):
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 204 in read_all
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 341 in read_message
  File "build/bdist.linux-x86_64/egg/paramiko/transport.py", line 1590 in run
  File "/usr/local/lib/python2.7/threading.py", line 810 in __bootstrap_inner
  File "/usr/local/lib/python2.7/threading.py", line 783 in __bootstrap

Thread 0x00007f47fbff7700 (most recent call first):
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 204 in read_all
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 341 in read_message
  File "build/bdist.linux-x86_64/egg/paramiko/transport.py", line 1590 in run
  File "/usr/local/lib/python2.7/threading.py", line 810 in __bootstrap_inner
  File "/usr/local/lib/python2.7/threading.py", line 783 in __bootstrap

Thread 0x00007f47fcff9700 (most recent call first):
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 204 in read_all
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 341 in read_message
  File "build/bdist.linux-x86_64/egg/paramiko/transport.py", line 1590 in run
  File "/usr/local/lib/python2.7/threading.py", line 810 in __bootstrap_inner
  File "/usr/local/lib/python2.7/threading.py", line 783 in __bootstrap

Thread 0x00007f47fdffb700 (most recent call first):
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 204 in read_all
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 341 in read_message
  File "build/bdist.linux-x86_64/egg/paramiko/transport.py", line 1590 in run
  File "/usr/local/lib/python2.7/threading.py", line 810 in __bootstrap_inner
  File "/usr/local/lib/python2.7/threading.py", line 783 in __bootstrap

Thread 0x00007f47feffd700 (most recent call first):
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 204 in read_all
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 341 in read_message
  File "build/bdist.linux-x86_64/egg/paramiko/transport.py", line 1590 in run
  File "/usr/local/lib/python2.7/threading.py", line 810 in __bootstrap_inner
  File "/usr/local/lib/python2.7/threading.py", line 783 in __bootstrap

Thread 0x00007f480590f700 (most recent call first):
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 204 in read_all
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 341 in read_message
  File "build/bdist.linux-x86_64/egg/paramiko/transport.py", line 1590 in run
  File "/usr/local/lib/python2.7/threading.py", line 810 in __bootstrap_inner
  File "/usr/local/lib/python2.7/threading.py", line 783 in __bootstrap

Thread 0x00007f48050ce700 (most recent call first):
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 204 in read_all
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 341 in read_message
  File "build/bdist.linux-x86_64/egg/paramiko/transport.py", line 1590 in run
  File "/usr/local/lib/python2.7/threading.py", line 810 in __bootstrap_inner
  File "/usr/local/lib/python2.7/threading.py", line 783 in __bootstrap

Thread 0x00007f4807368700 (most recent call first):
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 204 in read_all
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 341 in read_message
  File "build/bdist.linux-x86_64/egg/paramiko/transport.py", line 1590 in run
  File "/usr/local/lib/python2.7/threading.py", line 810 in __bootstrap_inner
  File "/usr/local/lib/python2.7/threading.py", line 783 in __bootstrap

Current thread 0x00007f480f7c4700 (most recent call first):
  File "/usr/local/lib/python2.7/threading.py", line 358 in wait
  File "/usr/local/lib/python2.7/threading.py", line 960 in join
  File "build/bdist.linux-x86_64/egg/paramiko/transport.py", line 1419 in stop_thread
  File "build/bdist.linux-x86_64/egg/paramiko/transport.py", line 558 in close
  File "build/bdist.linux-x86_64/egg/paramiko/resource.py", line 60 in callback
Aborted

It looks like all but LAST ONE of the resources closes correctly but the LAST ONE doesn't close correctly. The same thing happens if .close() is explicitly called on the Transport object. It will hang at that .close() method indefinitely, sometimes.

There is not one specific device it happens for either. I have tried many hosts and OS's with no specific one standing out a problem.

@colinmcintosh

This comment has been minimized.

colinmcintosh commented May 1, 2015

After further investigation it looks like it's hanging on line 204 under read_all in packet.py which is

x = self.__socket.recv(n)

It looks like it never makes it past this line after the close() method on the Transport object is called.

If I explicitly call

del transport_object

before the end of the script it properly closes the Transport object and cleans up as expected.

@rustyscottweber

This comment has been minimized.

rustyscottweber commented May 1, 2015

I have this same issue and believe it has to do with calling the close method during a del method. Have you tried not including any calls to close paramiko from a delete method and let python's garbage collection process take care of the thread?

@colinmcintosh

This comment has been minimized.

colinmcintosh commented May 5, 2015

I don't have have any del methods in my classes

@colinmcintosh

This comment has been minimized.

colinmcintosh commented May 5, 2015

I've altered the library for debugging purposes by changing the following lines in transport.py@1420:

    def stop_thread(self):
        self.active = False
        self.packetizer.close()
        while self.is_alive() and (self is not threading.current_thread()):
            print("Trying to kill thread.")
            self.join(10)
            print("It's alive: {}".format(self.is_alive()))

The result of this is the script outputs the following once it reaches the sys.exit(0):

Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
[truncated, it goes on forever]

It seems to be that the Thread is deadlocked somehow and can't close. I'm not sure what the point of this loop is in transport.py as nothing changes through each loop. Whether you do

while thread.is_alive():
    thread.join(10)

does the same thing, AFAIK, as

thread.join()

I could see value in the loop if there was a debug log there to let you know it's locked up or if there was a loop counter that eventually skips locked threads.

Ideally I would propose that it be changed to

    def stop_thread(self):
        self.active = False
        self.packetizer.close()
        if self is not threading.current_thread():
            self.join(10)
            if self.is_alive():
                raise Exception("Timed out while trying to kill thread")

This doesn't solve whatever is causing the overall locked up Thread issue but this is still a good solution to catch and kill any leftover threads.

@rustyscottweber

This comment has been minimized.

rustyscottweber commented May 13, 2015

What version of paramiko are you using as I think that is a change that has already been made more or less?

@colinmcintosh

This comment has been minimized.

colinmcintosh commented May 13, 2015

What do you mean by more or less? I'm almost positive I'm on the master but if not it's 1.15.1. I took a look at the master and it looks like it will still hang the same way if a thread is hung.

@colinmcintosh

This comment has been minimized.

colinmcintosh commented May 13, 2015

I checked, it's running on version 1.15.2

@Offlein

This comment has been minimized.

Offlein commented Jul 24, 2015

I'm experiencing what might be the same issue on Paramiko 1.15.2, although I don't think it's because the Transport object closed. At least, I've set a breakpoint in the Transport object's close() and it is not being hit. (So... maybe it's not?)

Still, though, I'm otherwise getting stuck in the x = self.__socket.recv(n) area as well. This line keeps triggering the socket.timeout exception, I guess, and hence the if got_timeout: section will run, looping back to the beginning of the "while" statement eternally.

Strangely, this script runs just fine when I test it using most servers. This only occurs when I'm connecting to the servers I definitely 100% need to connect to in my office. As well, I can connect to these servers properly from other SFTP software like FileZilla. It's something in conjunction with paramiko and this server...

@stas

This comment has been minimized.

stas commented Nov 13, 2015

Please try changing the paramiko.sftp_file.SFTPFile.MAX_REQUEST_SIZE value.
Might be just one of those servers.

@dmitrytokarev

This comment has been minimized.

dmitrytokarev commented Dec 3, 2015

this possibly related to #109

@sanseihappa

This comment has been minimized.

Contributor

sanseihappa commented May 26, 2016

#698 Looks like the same issue (Hang in packet.py Packetizer.read_all())

@sanseihappa

This comment has been minimized.

Contributor

sanseihappa commented May 26, 2016

@colinmcintosh thanks for your initial post using faulthandler, I'm duplicating this issue on a regular basis and dumping the stack traces to confirm I'm hitting the same thing using 2.0.0.

@botanicvelious mentions a workaround put in place further up the call stack in transport.py Transport.run(), I'll try this locally, but...

  • Is this workaround appropriate? (And should I generate a PR to get this in?)
  • Is there something else that should be done instead, that might be considered the actual fix?

Thanks!

@botanicvelious

This comment has been minimized.

botanicvelious commented May 29, 2016

I wouldn't use my work-around as it doesn't work in all cases, it just works for our specific use case.

@bitprophet bitprophet added the Hangs label Jun 11, 2016

@bitprophet

This comment has been minimized.

Member

bitprophet commented Jun 12, 2016

This ticket accurately describes what I'm getting with fabric v2's integration suite and which made me generate fabric/fabric#1473.

Doesn't matter how many actual hosts (it can be a bunch of connections to the same host) but the more independent Client objects generated, the more likely it is that one of them will hang.

Explicitly calling .close() (which percolates down into Transport.close) on all such objects reliably seems to avoid the problem. This is (probably) why client libs like Fabric have historically always had to explicitly close client/transport objects before interpreter shutdown.

Having that be a hard requirement of using that API or Paramiko itself, feels super crummy to me, so I'd like this to get solved eventually. Going to poke at this somewhat this weekend myself to see if I can get to the root of it.

Difficulty is, threading is often fraught with peril, changes that seem to fix one issue can easily spawn others; and in code this old there's always fun landmines. But this has been an intermittent issue forever so I'd like to at least try fixing it.

@bitprophet

This comment has been minimized.

Member

bitprophet commented Jun 12, 2016

  • Firstly, what is calling stop_thread? Either Transport.close() or the module-level transport._join_lingering_threads.
    • which itself is pretty eyebrow-raising and ancient - dates from 2003/4. Is called via atexit. commit is c565d66 - not very confidence inspiring.
    • In testing, which of those two avenues generates the stop_thread call for a given Transport thread seems pretty randomized.
    • However, the stuck thread always seems to end up there due to a call to Transport.close, never _join_lingering_threads.
  • The stuck thread last executes/gets stuck at self.packetizer.read_message() within run(), with the next activity about that thread being the stop_thread call.
  • The EOF in transport thread stuff seems like it might be orthogonal, as it frequently-to-always is logged by the other two Transports that aren't hanging, and not by the hanging thread.
  • This kinda jives with the core problem: the hung thread never receives any sort of "final" message from the network, not even anything that would make it except, which is why it hangs out forever on packetizer.read_message
    • The "welp, dunno how we got here, but let's just exit-via-excepting on join() timeout" approach taken by @colinmcintosh seems like an okay backstop in general - I assume there are multiple ways to end up in this situation. May add it regardless of whether I find out why these Transports end up in this state.
  • Digging deeper re: connection lifecycles, the stuck threads appear to do:
    • channel-data
    • channel-request (of type exit-status, so this would be the higher level recv_exit_status result most likely)
    • channel-eof
    • channel-close
  • That looks in line with normal behavior going by https://tools.ietf.org/html/rfc4254#section-5.3 and it is also the same order of operations seen in the non-hung threads.
  • So how are the other threads successfully exiting their run while the hung one is not? The answer seems to be the EOFErrors, actually; an exception raising is one of the few ways to exit the while loop, given that the threads tend to fall into blocking on read_message.
    • Which raises the question of why this code even bothers with while self.active, but meh.
    • Anyway, my suspicion is that the EOFErrors are generated by the call to self.packetizer.close() in stop_thread(), but I still need to verify this (& then, again, figure out what's different in the hang situation.) (If that is true, then this seems an awful roundabout way to cause the loop to exit...guessing not intended?)
  • The EOFError raised is always on line 272 of Packetizer.read_all, which is within if got_timeout and triggered by if self.__closed (again, yup, seems like...)
  • Instrumenting further, Packetizer.read_all is encountering socket.timeout when it sets got_timeout, so in a "normal" situation, the order of operations is:
    • Packetizer.close is called by Transport.stop_thread, which sets __closed on the packetizer.
    • The read_all loop encounters socket.timeout (presumably, constantly, anytime recv takes longer than the configured timeout - this is probably key) and sets got_timeout
    • Then hits line 272, sees that self.__closed is True, raises EOFError, breaks the transport out of its run loop, happy days.
  • The key difference is that the hanging thread hits socket.timeout same as the others, but when it gets to the self.__closed check, the answer is False, so it loops again - and then sits on the self.__socket.recv forever.
    • So, first, this does seem to be a race condition - if we assume that all threads would end up blocking on that final recv, it comes down to whether self.__closed gets set fast enough or not.
    • Second, why is socket.timeout not being re-raised? I need to doublecheck exactly how it is supposed to work (and then whether there are bugs in it) but if the loop was a "proper" recv -> timeout -> recv -> timeout semi-hot loop, this problem would not exist, because the "hung" thread's packetizer would simply timeout another 0.1s later, see __closed set, and EOFError.
    • And as mentioned earlier, the fact that this is how the whole construct terminates, seems pretty ridiculous anyways - these errors are always ugly looking in the logs and frequently mislead people. (and at least in my testing, they always appear during normal operation - I don't know if this was always the case, but...)
  • The docs for the socket module certainly back up my memory re: how this is all supposed to behave. So I'm now digging around to see what could cause the timeout to not fire, e.g. the socket being closed elsewhere, or whatnot.

Also of note is that this is all exactly the same stuff #380 describes, though they never got resolution. The only extra wrinkle I see is the assertion that the close call comes from "a different thread"; I don't know that this is the case in my setup, but it does come back to "it only pops up when the closure occurs via Transport.close and not via _join_lingering_threads".

Still unclear if that's germane, tho as noted I need to doublecheck the different treatment of the various objects involved in the two scenarios. I'm 99% sure I've seen non-hanging threads also terminate via Transport.close() (especially if the top level client code does it explicitly.)

@bitprophet

This comment has been minimized.

Member

bitprophet commented Jun 12, 2016

MOAR:

  • If I insert a sleep(1) in the "was not closed on timeout" part of the code, then check __closed after, it is indeed then closed, confirming the guess that this is part of the race condition in question.
  • A slight modification to the previous comment, sometimes the non-hung threads also fulfil part of the race, and get the "isn't closed yet" status; however, they do receive the expected subsequent packet.timeout, and then on that second time through the timeout+closed test, they pass & EOFError.
  • So it still comes down to, why isn't this hung thread's socket raising the timeout?
  • Upon the first "post-hang" loop of the stop_thread join loop, the bad thread's socket object is in a closed state, for whatever good that does us. (This could be another modification on the "backstop" idea, if somehow the straight up "got past 10 seconds without successful join()" condition were not stringent enough.)
  • This specific StackOverflow comment is the first mention I can find of socket.recv not always honoring socket.settimeout, due to a race condition in how Python implements the timeout.
    • The poster seems to walk this back, but I think they're walking back the specific explanation of the race condition, not the fact that it exists?
    • I read the relevant parts https://github.com/python/cpython/blob/8707d182ea722da19ace7fe994f3785cb77a679d/Modules/socketmodule.c and it seems to jive with other explanations of how this works (presence of timeout sets the socket to nonblocking mode; select() used with the given timeout as the select interval; then actual [again, non-blocking] socket call executed; loop).
    • What I don't see there (but granted, I barely know C or network programming at this level) is where a race condition would pop up, because the socket should be in nonblocking mode, so sock_func() should return and then the loop should continue.
  • Both that thread and this one occasionally mention implementing the "nonblocking socket + select" approach at user-level, though none of them say this as a workaround for bugs in settimeout.

Took a different tack and looked at how exactly these bad threads are having .close() called on them; it's via ResourceManager (from resource.py).

  • If I comment out the bit in Client that does that manager registration -- no hangs!
    • But also no third EOFError or other normal shutdown; the last debug message out of the would-have-been-closed-via-ResourceManager thread is the "about to recv()" one it normally hangs on.
    • No obvious ill effects, but with this sort of thing, I'd still worry it would cause issues in some situations. (Whether they are worse than an infinite hang...arguably not?)
  • In some of the success/no-race-condition scenarios, the ResourceManager is still in play - i.e. it's not the sole cause (or symptom) of the issue.
  • The 'bad' thread is almost always the "middle" thread in my "execute 3 commands in sequence" test script. Not sure what that means exactly.
  • Checked timing to figure out what exactly is causing some of these to end up in the lingering-threads list and the 'bad' one to be resource-managered, in case that is a clue...
    • Bad thread: �1465705574.463436
    • Good thread 1: 1465705574.41101
    • Good thread 2: 1465705574.342657
    • So the bad thread's stop_thread is being called 0.12s after the later "good" thread.
    • In subsequent runs, the difference is smaller - 0.04s and 0.05s - but it is still always last by a decent margin. Not that this tells me much.

Other random notes:

  • Slapping a time.sleep at the end of my test script has the expected effect - all 3 threads' Packetizer.read_all are in a hot loop with the recv timeout working great, until the script truly exits, at which time we're at the mercy of the race condition again (though it seems a little less likely to pop up in this case.)
  • Triple checked that the timeout was the one I expected (0.1s) - it is, according to gettimeout. Including just prior to the "bad" recv.
  • Recreated the issue identically on Debian 8, so it's not an OS X wrinkle. (But, as on OS X, the issue only appears present on Python 3.)
  • Python 2 has the same "two threads get linger-killed, one gets resource-managed" behavior - it's just that the thread resource-managed doesn't encounter the race condition / incorrectly-blocking recv.
  • Crawled all over the Python bug tracker, nothing directly relevant seemed to pop up.
    • One bug (https://bugs.python.org/issue23863) did tip me off that Python 3.5 changed a lot of socket timeout related things, so I wonder if, somehow, that is related.
      • Made a Python 3.4.4 venv (thanks, pyenv!) but...nope! same issue there. So unlikely to be PEP-0475's fault.
    • https://bugs.python.org/issue22043 was another possibly related change, but that one was also targeted for 3.5, so.
  • In interests of narrowing things further, I also tested the issue under Python 3.3.6; it's present there too. So, is a general Python 3 issue apparently.
    • I scanned some of the Python 3.x changelogs but nothing useful turned up there either; at this point this is diminishing returns.
@bitprophet

This comment has been minimized.

Member

bitprophet commented Jun 12, 2016

Ways forward brainstorm:

  • Still perplexed by the discrepancy between the atexit hook and the ResourceManager closing; while clearly the latter isn't the direct cause, I still wonder if it's a clue. May poke that further; is the ResourceManager firing while the atexit hook loop is running? Or is the 'bad' thread somehow being evicted from _active_threads? If so, why?
  • I'm not super happy leaving things at "use the join timeout as an ultimatum" because it still means the timeout - whatever we set it to - will be an annoyingly long, unexpected wait time at the end of the interpreter session for anybody on Python 3. Including Fabric 2's integration test suite.
    • But, it's still an option, and waiting, say, 1-5s is better than waiting forever. Especially if accompanied by "erm...had to timeout joining the transport thread. sorry! you may have hit bug #520...if you were NOT at the end of your session, please file a bug!".
  • Given that an explicit user-driven close() always seems to work (still unclear why - perhaps the race condition with the socket object has to do with end-of-interpreter shenanigans?) and is arguably good practice, we could list this as "documentation only"...

Another idea occurred to me which I think I like better: keep the ultimatum-style join timeout, but set it to a much shorter value if - by the time we're calling it - the transport's socket and packetizer both appear to have entered their closed states.

That detects the symptoms of this problem, lets the interpreter exit in a reasonable-to-humans amount of time, but limits the possibility of accidentally terminating "too early" in scenarios unlike the one I am testing under.

A couple more wrinkles on this could also be:

  • A while-loop checking those closed states, whose body is join(0.1) (or similar) ; that would ostensibly work even better for "taking a while to shut down" scenarios as they would take as much time as needed;
  • Use threadsafe signaling within Packetizer, around the recv call, so we can truly know whether or not we're in this particular scenario (socket closed, packetizer closed, packetizer "I was reading!" flag still set). I doubt this is necessary but it could be a (IMO pretty stupid) sanity check, if edge cases pop up.
@bitprophet

This comment has been minimized.

Member

bitprophet commented Jun 12, 2016

Yea, mutating the loop to be "not current thread + not socket.closed + not packetizer.closed" and turning the join timeout down to 0.1s seems to do the trick pretty well. I don't have a great way of testing unusually-slow server endpoints right now (something I'd like to get sometime...) but I'm probably going to at least trial this change while I continue hacking on other things.

Minor side note, socket objects have no public "is closed?" flag that I can see, but _closed is a) what's used by __repr__ and b) available in Python 2+ 3, so it'll have to do for now. I'm not crazy worried about that changing anytime soon.

bitprophet added a commit that referenced this issue Jun 12, 2016

@bitprophet

This comment has been minimized.

Member

bitprophet commented Jun 12, 2016

FWIW problem + fix both seem present/workable on 1.16 too (so, this is in no way related to the switch to pyca/crypto - not that I thought it would've been). I committed a cleaned-up version of what I was testing with and forward-ported (1.16, 1.17, 2.0, master) - it's live now.

If anyone watching this can give one of those branches a shot and give a thumbs-up (both re: fixing the issue, if they have it; or at least proving it's not making things worse for them) that would be cool.

@bitprophet

This comment has been minimized.

Member

bitprophet commented Jun 12, 2016

Was reminded by tef on twitter that I never chased down the assertions made in #380 about the issue potentially being how the socket in question is closed from a different thread during the recv call (re this SO thread. So if this needs more love that's probably the next place to look.

Offhand (recalling that threading is not my expertise) if that's the true race condition, it would mean we do want "I'm recving here!" locking in Packetizer.read_all, which is honored by Packetizer.close() when it calls self.__sock.close. And/or in the other 2-3 places where that same socket, since it is also Transport.sock, is closed by Transport methods such as close or the end of run...sigh. But pretty sure in this case the race would be between the two Packetizer methods.

@bitprophet

This comment has been minimized.

Member

bitprophet commented Jun 12, 2016

Sadly doesn't seem like a workable avenue:

  • (With my above 'fix' reverted temporarily,) I added locking around recv and self.__socket.close in Packetizer, but we still end up in the blocking recv unfortunately. Implies that Packetizer.__socket.close (from .close) is not actually the trigger.
  • Confirmed this by throwing a 0.5s sleep before the socket close in the original code; changes nothing about the behavior of the race condition, the recv fires and blocks.
    • Also confirmed that this is the right place and the socket isn't closing via some other avenue: if I comment out the socket close entirely, the recv still blocks, and printing the socket object in the join loop shows it staying open indefinitely.
  • So I don't think that's the root cause here (seems more likely it's something to do with this all happening at interpreter shutdown perhaps?), but at least I took a crack at it.

openstack-gerrit pushed a commit to openstack/fuel-qa that referenced this issue Jul 6, 2016

Run octane fuel-restore in silent mode
Since 9.0 there is no docker containers and now we should apply all
changes to the Fuel node by applying puppet manifests. Unfortunately,
it's generating tons of lines in stdout and paramiko can hangs on
reading this lines. Disable writting output to stdout/err but keeping
logs in /var/log/octane.log can fix this issue.
For details about paramiko issues see
paramiko/paramiko#520

Change-Id: If3fd0e6f3490d37e486ce70c97de92f83cd6741d
@daboshh

This comment has been minimized.

daboshh commented Jul 7, 2016

It looks like I'm having the same problem and fix didn't work for me.
I'm new to python and programming so I ask for your patience.
Code below works on some devices, but sadly not on Alcatel Lucent.
I hope this helps.

Win 10
Python 3.5.1
Paramiko 2.0.1

import paramiko
paramiko.common.logging.basicConfig(level=paramiko.common.DEBUG)

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect('x.x.x.x', username='x', password='x')
stdin, stdout, stderr = ssh.exec_command('show uptime')
stdout = stdout.read()
print(stdout)


DEBUG:paramiko.transport:starting thread (client mode): 0x2b714518
DEBUG:paramiko.transport:Local version/idstring: SSH-2.0-paramiko_2.0.1
DEBUG:paramiko.transport:Remote version/idstring: SSH-2.0-OpenSSH_3.5p1
INFO:paramiko.transport:Connected (version 2.0, client OpenSSH_3.5p1)
DEBUG:paramiko.transport:kex algos: TOO LONG TO COPY PASTE
DEBUG:paramiko.transport:Kex agreed: diffie-hellman-group1-sha1
DEBUG:paramiko.transport:Cipher agreed: aes128-cbc
DEBUG:paramiko.transport:MAC agreed: hmac-md5
DEBUG:paramiko.transport:Compression agreed: none
DEBUG:paramiko.transport:kex engine KexGroup1 specified hash_algo <built-in function openssl_sha1>
DEBUG:paramiko.transport:Switch to new keys ...
DEBUG:paramiko.transport:Adding ssh-rsa host key for X.X.X.X: b'ad920529adabf592990dcc7d9236881c'
DEBUG:paramiko.transport:userauth is OK
INFO:paramiko.transport:Authentication (password) successful!
DEBUG:paramiko.transport:[chan 0] Max packet in: 32768 bytes
DEBUG:paramiko.transport:[chan 0] Max packet out: 32768 bytes
DEBUG:paramiko.transport:Secsh channel 0 opened.
DEBUG:paramiko.transport:EOF in transport thread
Traceback (most recent call last):
  File "C:/Users/od/PycharmProjects/ssh/ssh.py", line 7, in <module>
    stdin, stdout, stderr = ssh.exec_command('show uptime')
  File "C:\Program Files\Python 3.5.1\lib\site-packages\paramiko\client.py", line 424, in exec_command
    chan.exec_command(command)
  File "C:\Program Files\Python 3.5.1\lib\site-packages\paramiko\channel.py", line 60, in _check
    return func(self, *args, **kwds)
  File "C:\Program Files\Python 3.5.1\lib\site-packages\paramiko\channel.py", line 234, in exec_command
    self._wait_for_event()
  File "C:\Program Files\Python 3.5.1\lib\site-packages\paramiko\channel.py", line 1103, in _wait_for_event
    raise e
  File "C:\Program Files\Python 3.5.1\lib\site-packages\paramiko\transport.py", line 1754, in run
    ptype, m = self.packetizer.read_message()
  File "C:\Program Files\Python 3.5.1\lib\site-packages\paramiko\packet.py", line 391, in read_message
    header = self.read_all(self.__block_size_in, check_rekey=True)
  File "C:\Program Files\Python 3.5.1\lib\site-packages\paramiko\packet.py", line 256, in read_all
    raise EOFError()
EOFError

And line 256 corresponds to previously reported code

               x = self.__socket.recv(n)
                if len(x) == 0:
                    raise EOFError()
@sanseihappa

This comment has been minimized.

Contributor

sanseihappa commented Jul 11, 2016

@daboshh I believe your issue is different than the hang being discussed here. In the cases here, no exception is ever being thrown, much less the EOFError you report. You might look at #687 and see if perhaps you're having a similar issue with the other end of the SSH connection?

@sanseihappa

This comment has been minimized.

Contributor

sanseihappa commented Jul 18, 2016

@bitprophet Should I file a separate issue for the hang that can occur in Transport.start_client()?

@bitprophet

This comment has been minimized.

Member

bitprophet commented Jul 20, 2016

@sanseihappa Yea, that sounds like a good idea, seems orthogonal to me offhand & anything we can do to empower users to get exceptions instead of hangs, would be useful. Please file a PR - thanks!

EDIT: if it wasn't obvious, please drop a ref to "#520" in the ticket body somewhere (not the title, GH doesn't scan those for some reason).

@daboshh

This comment has been minimized.

daboshh commented Jul 22, 2016

@sanseihappa I tried switching exchange algorithms, but with no luck. Both sides agree on both exchange algorithms, but return same errors.

@cool-RR

This comment has been minimized.

cool-RR commented Dec 7, 2016

I'm using 2.0.2 and I have hangs which I think are caused by this problem. Is there a workaround until a solution is released?

@bitprophet

This comment has been minimized.

Member

bitprophet commented Dec 9, 2016

I'm planning to pop out 2.0.3 today, which has a couple related fixes in it.

@cool-RR

This comment has been minimized.

cool-RR commented Dec 9, 2016

@bitprophet

This comment has been minimized.

Member

bitprophet commented Dec 9, 2016

FTR this issue isn't marked as solved yet because it feels like one of those "many causes, similar symptoms" things. We'll see how 2.0.3 and friends do re: fixing it for involved users :)

@daboshh

This comment has been minimized.

daboshh commented Dec 12, 2016

It didn't work for me. Looks like I'm having the same errors as previously.

import paramiko
paramiko.common.logging.basicConfig(level=paramiko.common.DEBUG)

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect('x.x.x.x', username='x', password='x')
stdin, stdout, stderr = ssh.exec_command('show uptime')
stdout = stdout.read()
print(stdout)

DEBUG:paramiko.transport:starting thread (client mode): 0x4225c668
DEBUG:paramiko.transport:Local version/idstring: SSH-2.0-paramiko_2.1.0
DEBUG:paramiko.transport:Remote version/idstring: SSH-2.0-OpenSSH_3.5p1
INFO:paramiko.transport:Connected (version 2.0, client OpenSSH_3.5p1)
DEBUG:paramiko.transport:kex algos:TOO LONG TO COPY PASTE
DEBUG:paramiko.transport:Kex agreed: diffie-hellman-group1-sha1
DEBUG:paramiko.transport:Cipher agreed: aes128-cbc
DEBUG:paramiko.transport:MAC agreed: hmac-md5
DEBUG:paramiko.transport:Compression agreed: none
DEBUG:paramiko.transport:kex engine KexGroup1 specified hash_algo <built-in function openssl_sha1>
DEBUG:paramiko.transport:Switch to new keys ...
DEBUG:paramiko.transport:Adding ssh-rsa host key for X.X.X.X: b'e412c1b06a2f4c5606c4252390064731'
DEBUG:paramiko.transport:userauth is OK
INFO:paramiko.transport:Authentication (password) successful!
DEBUG:paramiko.transport:[chan 0] Max packet in: 32768 bytes
DEBUG:paramiko.transport:[chan 0] Max packet out: 32768 bytes
DEBUG:paramiko.transport:Secsh channel 0 opened.
DEBUG:paramiko.transport:EOF in transport thread
Traceback (most recent call last):
  File "C:X\X\X\X", line 8, in <module>
    stdin, stdout, stderr = ssh.exec_command('show uptime')
  File "C:\Program Files\Python 3.5.1\lib\site-packages\paramiko\client.py", line 441, in exec_command
    chan.exec_command(command)
  File "C:\Program Files\Python 3.5.1\lib\site-packages\paramiko\channel.py", line 60, in _check
    return func(self, *args, **kwds)
  File "C:\Program Files\Python 3.5.1\lib\site-packages\paramiko\channel.py", line 234, in exec_command
    self._wait_for_event()
  File "C:\Program Files\Python 3.5.1\lib\site-packages\paramiko\channel.py", line 1161, in _wait_for_event
    raise e
  File "C:\Program Files\Python 3.5.1\lib\site-packages\paramiko\transport.py", line 1760, in run
    ptype, m = self.packetizer.read_message()
  File "C:\Program Files\Python 3.5.1\lib\site-packages\paramiko\packet.py", line 391, in read_message
    header = self.read_all(self.__block_size_in, check_rekey=True)
  File "C:\Program Files\Python 3.5.1\lib\site-packages\paramiko\packet.py", line 256, in read_all
    raise EOFError()
EOFError
@rustyscottweber

This comment has been minimized.

rustyscottweber commented Dec 14, 2016

I ran into a similar problem with this that was related to the Linux kernel I was working on refused to close the socket which would hang the transport thread. Maybe also check the OS you are running on and the kernel version.

@daboshh

This comment has been minimized.

daboshh commented Dec 14, 2016

I ran into a similar problem with this that was related to the Linux kernel I was working on refused to close the socket which would hang the transport thread. Maybe also check the OS you are running on and the kernel version.

I'm using updated win 10. :(

@andreycizov

This comment has been minimized.

andreycizov commented Dec 17, 2016

It seems I am running into the same issue when paramiko.SFTPClient.open(). paramiko.sftp_file.SFTPFile.close() would hang indefinitely when called. I have solved this by calling _close(async=True). Not sure if it solves the issue of properly closing the file, but it definitely solves the issue of hangups.

@urban-1

This comment has been minimized.

urban-1 commented Dec 23, 2016

Hi all,

I am not sure that I have the same exactly problem but paramiko hangs in packet.py, line 276, in read_all. This is happening only in python 3, details:

$ python3 --version
Python 3.4.3
$ pip3 freeze | grep paramiko
paramiko==1.17.0
$ lsb_release -a
LSB Version:    :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 6.7 (Santiago)
Release:        6.7
Codename:       Santiago

Now, in order to investigate this I made a small function that walks all the threads in sys._current_frames(). Funny thing is that when I run that function the problem disappears!! The code that works is:

            # Close client or transport here ****

            import threading
            import traceback
            for thread_id, frame in sys._current_frames().items():
                for thread in threading.enumerate():
                    if thread.ident == thread_id:
                        name = thread.name
                  ''.join(traceback.format_stack(frame))

EDIT: explicit .close() is required

The most interesting thing is that if you remove the last line you get the symptoms back... I have absolutely no clue or explanation why traceback.format_stack(frame) fixes the problem... I am posting this in case someone has the same issue.

Let me know if I should create a new Issue

Cheers,

Andreas

@tyler-8

This comment has been minimized.

tyler-8 commented Dec 28, 2016

Found this thread through Google, I'm running Python 2.7.13 and running into this issue with a multiprocessed script that uses paramiko 2.1.1. Here's my log output where it hangs:

INFO:paramiko.transport:Authentication (password) successful!
DEBUG:paramiko.transport:[chan 0] Max packet in: 32768 bytes
DEBUG:paramiko.transport:[chan 0] Max packet out: 32768 bytes
DEBUG:paramiko.transport:Secsh channel 0 opened.
DEBUG:paramiko.transport:[chan 0] Sesch channel 0 request ok
DEBUG:paramiko.transport:[chan 0] Sesch channel 0 request ok
DEBUG:paramiko.transport:EOF in transport thread
DEBUG:paramiko.transport:EOF in transport thread
DEBUG:paramiko.transport:EOF in transport thread
@paltsevpavel

This comment has been minimized.

paltsevpavel commented Mar 31, 2017

Hi, i've the same problem. May be somebody help.

Python 2.7.12.
DEBUG:paramiko.transport:starting thread (client mode): 0x3555f90L
DEBUG:paramiko.transport:Local version/idstring: SSH-2.0-paramiko_2.1.2
DEBUG:paramiko.transport:Remote version/idstring: SSH-1.99-OpenSSH_7.3p1+sftpfilecontrol-v1.3-hpn14v11
INFO:paramiko.transport:Connected (version 1.99, client OpenSSH_7.3p1+sftpfilecontrol-v1.3-hpn14v11)
DEBUG:paramiko.transport:kex algos:[u'curve25519-sha256@libssh.org', u'ecdh-sha2-nistp256', u'ecdh-sha2-nistp384', u'ecdh-sha2-nistp521', u'diffie-hellman-group-exchange-sha256', u'diffie-hellman-group16-sha512', u'diffie-hellman-group18-sha512', u'diffie-hellman-group14-sha256', u'diffie-hellman-group14-sha1'] server key:[u'ssh-rsa', u'rsa-sha2-512', u'rsa-sha2-256', u'ecdsa-sha2-nistp256', u'ssh-ed25519'] client encrypt:[u'aes128-ctr', u'aes192-ctr', u'aes256-ctr', u'arcfour256', u'arcfour128', u'aes128-cbc', u'3des-cbc'] server encrypt:[u'aes128-ctr', u'aes192-ctr', u'aes256-ctr', u'arcfour256', u'arcfour128', u'aes128-cbc', u'3des-cbc'] client mac:[u'umac-64-etm@openssh.com', u'umac-128-etm@openssh.com', u'hmac-sha2-256-etm@openssh.com', u'hmac-sha2-512-etm@openssh.com', u'hmac-sha1-etm@openssh.com', u'umac-64@openssh.com', u'umac-128@openssh.com', u'hmac-sha2-256', u'hmac-sha2-512', u'hmac-sha1'] server mac:[u'umac-64-etm@openssh.com', u'umac-128-etm@openssh.com', u'hmac-sha2-256-etm@openssh.com', u'hmac-sha2-512-etm@openssh.com', u'hmac-sha1-etm@openssh.com', u'umac-64@openssh.com', u'umac-128@openssh.com', u'hmac-sha2-256', u'hmac-sha2-512', u'hmac-sha1'] client compress:[u'none', u'zlib@openssh.com'] server compress:[u'none', u'zlib@openssh.com'] client lang:[u''] server lang:[u''] kex follows?False
DEBUG:paramiko.transport:Kex agreed: diffie-hellman-group14-sha1
DEBUG:paramiko.transport:Cipher agreed: aes128-ctr
DEBUG:paramiko.transport:MAC agreed: hmac-sha2-256
DEBUG:paramiko.transport:Compression agreed: none
DEBUG:paramiko.transport:kex engine KexGroup14 specified hash_algo
DEBUG:paramiko.transport:Switch to new keys ...
DEBUG:paramiko.transport:Adding ssh-rsa host key for 10.67.43.189: d75da3a54f79a2d0de07c4f98164530b
DEBUG:paramiko.transport:userauth is OK
DEBUG:paramiko.transport:EOF in transport thread

Server's parameters:
OpenSSH_7.3p1+sftpfilecontrol-v1.3-hpn14v11, OpenSSL 1.0.2h 3 May 2016
HP-UX Secure Shell-A.07.30.003, HP-UX Secure Shell version

@dmitrytokarev

This comment has been minimized.

dmitrytokarev commented Mar 31, 2017

@tyler-8 @paltsevpavel Try this: #109 (comment)
(it helped me with very similar hang issue, which could have same root cause)

@vdloo

This comment has been minimized.

vdloo commented Jul 18, 2017

Encountered something similar in Python 3.5.2. Blocks forever if an exception occurs in the auth_handler:

For example:

  File "/usr/lib/python3/dist-packages/kamikaze3/importer/ssh.py", line 212, in get_connection
    sock=proxy
  File "/usr/lib/python3/dist-packages/paramiko/client.py", line 367, in connect
    look_for_keys, gss_auth, gss_kex, gss_deleg_creds, gss_host)
  File "/usr/lib/python3/dist-packages/paramiko/client.py", line 584, in _auth
    raise saved_exception
  File "/usr/lib/python3/dist-packages/paramiko/client.py", line 571, in _auth
    self._transport.auth_password(username, password)
  File "/usr/lib/python3/dist-packages/paramiko/transport.py", line 1262, in auth_password
    return self.auth_handler.wait_for_response(my_event)
  File "/usr/lib/python3/dist-packages/paramiko/auth_handler.py", line 208, in wait_for_response
    raise e
paramiko.ssh_exception.AuthenticationException: Authentication failed.

and

  File "/usr/lib/python3/dist-packages/kamikaze3/importer/ssh.py", line 212, in get_connection
    sock=proxy
  File "/usr/lib/python3/dist-packages/paramiko/client.py", line 367, in connect
    look_for_keys, gss_auth, gss_kex, gss_deleg_creds, gss_host)
  File "/usr/lib/python3/dist-packages/paramiko/client.py", line 584, in _auth
    raise saved_exception
  File "/usr/lib/python3/dist-packages/paramiko/client.py", line 571, in _auth
    self._transport.auth_password(username, password)
  File "/usr/lib/python3/dist-packages/paramiko/transport.py", line 1262, in auth_password
    return self.auth_handler.wait_for_response(my_event)
  File "/usr/lib/python3/dist-packages/paramiko/auth_handler.py", line 208, in wait_for_response
    raise e
paramiko.ssh_exception.BadAuthenticationType: ('Bad authentication type', ['publickey']) (allowed_types=['publickey'])

I'm not running the latest paramiko (1.16.0-1) so I don't know if this has already been fixed, but if someone stumbles on this via google, you can band aid the issue by setting a timer in a thread to close the connection after a certain amount of time so the deadlock won't block your program indefinitely:

from threading import Timer
from contextlib import contextmanager

@contextmanager
def close_conn_on_timeout(paramiko_ssh_client):
    timer = Timer(10, paramiko_ssh_client.close)
    timer.start()
    yield
    timer.cancel()

with close_conn_on_timeout(paramiko_ssh_client):
    paramiko_ssh_client.connect(host)

dkhapun pushed a commit to cyberx-labs/paramiko that referenced this issue Jun 7, 2018

dkhapun pushed a commit to cyberx-labs/paramiko that referenced this issue Jun 7, 2018

dkhapun pushed a commit to cyberx-labs/paramiko that referenced this issue Jun 7, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment