One 'bad' Transport thread hangs indefinitely at shutdown when multiple Transports are active #520

colinmcintosh · 2015-04-30T21:54:01Z

[MAINTAINER NOTE: a variant of this issue under Python 3 has been worked around as per this comment but it's presumably still at large for Python 2 in some scenarios.]

When running SSH connections to multiple devices the script will sometimes hang once the sys.exit(0) is reached. Once this is reached the script will hang indefinitely. It doesn't happen everytime but the more devices the script runs against the more likely it is to happen. It seems like it's related to the amount of time the script takes to run.

The last log message paramiko outputs is DEBUG:EOF in transport thread

Using the faulthandler module I dumped a stacktrace for when it hangs:

Fatal Python error: Aborted

Thread 0x00007f47faff5700 (most recent call first):
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 204 in read_all
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 341 in read_message
  File "build/bdist.linux-x86_64/egg/paramiko/transport.py", line 1590 in run
  File "/usr/local/lib/python2.7/threading.py", line 810 in __bootstrap_inner
  File "/usr/local/lib/python2.7/threading.py", line 783 in __bootstrap

Thread 0x00007f47fbff7700 (most recent call first):
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 204 in read_all
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 341 in read_message
  File "build/bdist.linux-x86_64/egg/paramiko/transport.py", line 1590 in run
  File "/usr/local/lib/python2.7/threading.py", line 810 in __bootstrap_inner
  File "/usr/local/lib/python2.7/threading.py", line 783 in __bootstrap

Thread 0x00007f47fcff9700 (most recent call first):
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 204 in read_all
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 341 in read_message
  File "build/bdist.linux-x86_64/egg/paramiko/transport.py", line 1590 in run
  File "/usr/local/lib/python2.7/threading.py", line 810 in __bootstrap_inner
  File "/usr/local/lib/python2.7/threading.py", line 783 in __bootstrap

Thread 0x00007f47fdffb700 (most recent call first):
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 204 in read_all
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 341 in read_message
  File "build/bdist.linux-x86_64/egg/paramiko/transport.py", line 1590 in run
  File "/usr/local/lib/python2.7/threading.py", line 810 in __bootstrap_inner
  File "/usr/local/lib/python2.7/threading.py", line 783 in __bootstrap

Thread 0x00007f47feffd700 (most recent call first):
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 204 in read_all
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 341 in read_message
  File "build/bdist.linux-x86_64/egg/paramiko/transport.py", line 1590 in run
  File "/usr/local/lib/python2.7/threading.py", line 810 in __bootstrap_inner
  File "/usr/local/lib/python2.7/threading.py", line 783 in __bootstrap

Thread 0x00007f480590f700 (most recent call first):
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 204 in read_all
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 341 in read_message
  File "build/bdist.linux-x86_64/egg/paramiko/transport.py", line 1590 in run
  File "/usr/local/lib/python2.7/threading.py", line 810 in __bootstrap_inner
  File "/usr/local/lib/python2.7/threading.py", line 783 in __bootstrap

Thread 0x00007f48050ce700 (most recent call first):
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 204 in read_all
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 341 in read_message
  File "build/bdist.linux-x86_64/egg/paramiko/transport.py", line 1590 in run
  File "/usr/local/lib/python2.7/threading.py", line 810 in __bootstrap_inner
  File "/usr/local/lib/python2.7/threading.py", line 783 in __bootstrap

Thread 0x00007f4807368700 (most recent call first):
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 204 in read_all
  File "build/bdist.linux-x86_64/egg/paramiko/packet.py", line 341 in read_message
  File "build/bdist.linux-x86_64/egg/paramiko/transport.py", line 1590 in run
  File "/usr/local/lib/python2.7/threading.py", line 810 in __bootstrap_inner
  File "/usr/local/lib/python2.7/threading.py", line 783 in __bootstrap

Current thread 0x00007f480f7c4700 (most recent call first):
  File "/usr/local/lib/python2.7/threading.py", line 358 in wait
  File "/usr/local/lib/python2.7/threading.py", line 960 in join
  File "build/bdist.linux-x86_64/egg/paramiko/transport.py", line 1419 in stop_thread
  File "build/bdist.linux-x86_64/egg/paramiko/transport.py", line 558 in close
  File "build/bdist.linux-x86_64/egg/paramiko/resource.py", line 60 in callback
Aborted

It looks like all but LAST ONE of the resources closes correctly but the LAST ONE doesn't close correctly. The same thing happens if .close() is explicitly called on the Transport object. It will hang at that .close() method indefinitely, sometimes.

There is not one specific device it happens for either. I have tried many hosts and OS's with no specific one standing out a problem.

The text was updated successfully, but these errors were encountered:

colinmcintosh · 2015-05-01T00:05:55Z

After further investigation it looks like it's hanging on line 204 under read_all in packet.py which is

x = self.__socket.recv(n)

It looks like it never makes it past this line after the close() method on the Transport object is called.

If I explicitly call

del transport_object

before the end of the script it properly closes the Transport object and cleans up as expected.

rustyscottweber · 2015-05-01T17:59:23Z

I have this same issue and believe it has to do with calling the close method during a del method. Have you tried not including any calls to close paramiko from a delete method and let python's garbage collection process take care of the thread?

colinmcintosh · 2015-05-05T14:17:28Z

I don't have have any del methods in my classes

colinmcintosh · 2015-05-05T19:40:50Z

I've altered the library for debugging purposes by changing the following lines in transport.py@1420:

    def stop_thread(self):
        self.active = False
        self.packetizer.close()
        while self.is_alive() and (self is not threading.current_thread()):
            print("Trying to kill thread.")
            self.join(10)
            print("It's alive: {}".format(self.is_alive()))

The result of this is the script outputs the following once it reaches the sys.exit(0):

Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
Trying to kill thread.
It's alive: True
[truncated, it goes on forever]

It seems to be that the Thread is deadlocked somehow and can't close. I'm not sure what the point of this loop is in transport.py as nothing changes through each loop. Whether you do

while thread.is_alive():
    thread.join(10)

does the same thing, AFAIK, as

thread.join()

I could see value in the loop if there was a debug log there to let you know it's locked up or if there was a loop counter that eventually skips locked threads.

Ideally I would propose that it be changed to

    def stop_thread(self):
        self.active = False
        self.packetizer.close()
        if self is not threading.current_thread():
            self.join(10)
            if self.is_alive():
                raise Exception("Timed out while trying to kill thread")

This doesn't solve whatever is causing the overall locked up Thread issue but this is still a good solution to catch and kill any leftover threads.

rustyscottweber · 2015-05-13T02:00:51Z

What version of paramiko are you using as I think that is a change that has already been made more or less?

colinmcintosh · 2015-05-13T18:06:02Z

What do you mean by more or less? I'm almost positive I'm on the master but if not it's 1.15.1. I took a look at the master and it looks like it will still hang the same way if a thread is hung.

colinmcintosh · 2015-05-13T18:09:59Z

I checked, it's running on version 1.15.2

Offlein · 2015-07-24T20:25:13Z

I'm experiencing what might be the same issue on Paramiko 1.15.2, although I don't think it's because the Transport object closed. At least, I've set a breakpoint in the Transport object's close() and it is not being hit. (So... maybe it's not?)

Still, though, I'm otherwise getting stuck in the x = self.__socket.recv(n) area as well. This line keeps triggering the socket.timeout exception, I guess, and hence the if got_timeout: section will run, looping back to the beginning of the "while" statement eternally.

Strangely, this script runs just fine when I test it using most servers. This only occurs when I'm connecting to the servers I definitely 100% need to connect to in my office. As well, I can connect to these servers properly from other SFTP software like FileZilla. It's something in conjunction with paramiko and this server...

stas · 2015-11-13T01:36:18Z

Please try changing the paramiko.sftp_file.SFTPFile.MAX_REQUEST_SIZE value.
Might be just one of those servers.

dmitrytokarev · 2015-12-03T23:48:01Z

this possibly related to #109

sanseihappa · 2016-05-26T18:12:59Z

#698 Looks like the same issue (Hang in packet.py Packetizer.read_all())

sanseihappa · 2016-05-26T18:20:00Z

@colinmcintosh thanks for your initial post using faulthandler, I'm duplicating this issue on a regular basis and dumping the stack traces to confirm I'm hitting the same thing using 2.0.0.

@botanicvelious mentions a workaround put in place further up the call stack in transport.py Transport.run(), I'll try this locally, but...

Is this workaround appropriate? (And should I generate a PR to get this in?)
Is there something else that should be done instead, that might be considered the actual fix?

Thanks!

botanicvelious · 2016-05-29T03:11:52Z

I wouldn't use my work-around as it doesn't work in all cases, it just works for our specific use case.

bitprophet · 2016-06-12T00:05:57Z

This ticket accurately describes what I'm getting with fabric v2's integration suite and which made me generate fabric/fabric#1473.

Doesn't matter how many actual hosts (it can be a bunch of connections to the same host) but the more independent Client objects generated, the more likely it is that one of them will hang.

Explicitly calling .close() (which percolates down into Transport.close) on all such objects reliably seems to avoid the problem. This is (probably) why client libs like Fabric have historically always had to explicitly close client/transport objects before interpreter shutdown.

Having that be a hard requirement of using that API or Paramiko itself, feels super crummy to me, so I'd like this to get solved eventually. Going to poke at this somewhat this weekend myself to see if I can get to the root of it.

Difficulty is, threading is often fraught with peril, changes that seem to fix one issue can easily spawn others; and in code this old there's always fun landmines. But this has been an intermittent issue forever so I'd like to at least try fixing it.

bitprophet · 2016-06-12T02:27:06Z

Firstly, what is calling stop_thread? Either Transport.close() or the module-level transport._join_lingering_threads.
- which itself is pretty eyebrow-raising and ancient - dates from 2003/4. Is called via atexit. commit is c565d66 - not very confidence inspiring.
- In testing, which of those two avenues generates the stop_thread call for a given Transport thread seems pretty randomized.
- However, the stuck thread always seems to end up there due to a call to Transport.close, never _join_lingering_threads.
The stuck thread last executes/gets stuck at self.packetizer.read_message() within run(), with the next activity about that thread being the stop_thread call.
The EOF in transport thread stuff seems like it might be orthogonal, as it frequently-to-always is logged by the other two Transports that aren't hanging, and not by the hanging thread.
This kinda jives with the core problem: the hung thread never receives any sort of "final" message from the network, not even anything that would make it except, which is why it hangs out forever on packetizer.read_message
- The "welp, dunno how we got here, but let's just exit-via-excepting on join() timeout" approach taken by @colinmcintosh seems like an okay backstop in general - I assume there are multiple ways to end up in this situation. May add it regardless of whether I find out why these Transports end up in this state.
Digging deeper re: connection lifecycles, the stuck threads appear to do:
- channel-data
- channel-request (of type exit-status, so this would be the higher level recv_exit_status result most likely)
- channel-eof
- channel-close
That looks in line with normal behavior going by https://tools.ietf.org/html/rfc4254#section-5.3 and it is also the same order of operations seen in the non-hung threads.
So how are the other threads successfully exiting their run while the hung one is not? The answer seems to be the EOFErrors, actually; an exception raising is one of the few ways to exit the while loop, given that the threads tend to fall into blocking on read_message.
- Which raises the question of why this code even bothers with while self.active, but meh.
- Anyway, my suspicion is that the EOFErrors are generated by the call to self.packetizer.close() in stop_thread(), but I still need to verify this (& then, again, figure out what's different in the hang situation.) (If that is true, then this seems an awful roundabout way to cause the loop to exit...guessing not intended?)
The EOFError raised is always on line 272 of Packetizer.read_all, which is within if got_timeout and triggered by if self.__closed (again, yup, seems like...)
Instrumenting further, Packetizer.read_all is encountering socket.timeout when it sets got_timeout, so in a "normal" situation, the order of operations is:
- Packetizer.close is called by Transport.stop_thread, which sets __closed on the packetizer.
- The read_all loop encounters socket.timeout (presumably, constantly, anytime recv takes longer than the configured timeout - this is probably key) and sets got_timeout
- Then hits line 272, sees that self.__closed is True, raises EOFError, breaks the transport out of its run loop, happy days.
The key difference is that the hanging thread hits socket.timeout same as the others, but when it gets to the self.__closed check, the answer is False, so it loops again - and then sits on the self.__socket.recv forever.
- So, first, this does seem to be a race condition - if we assume that all threads would end up blocking on that final recv, it comes down to whether self.__closed gets set fast enough or not.
- Second, why is socket.timeout not being re-raised? I need to doublecheck exactly how it is supposed to work (and then whether there are bugs in it) but if the loop was a "proper" recv -> timeout -> recv -> timeout semi-hot loop, this problem would not exist, because the "hung" thread's packetizer would simply timeout another 0.1s later, see __closed set, and EOFError.
- And as mentioned earlier, the fact that this is how the whole construct terminates, seems pretty ridiculous anyways - these errors are always ugly looking in the logs and frequently mislead people. (and at least in my testing, they always appear during normal operation - I don't know if this was always the case, but...)
The docs for the socket module certainly back up my memory re: how this is all supposed to behave. So I'm now digging around to see what could cause the timeout to not fire, e.g. the socket being closed elsewhere, or whatnot.

Also of note is that this is all exactly the same stuff #380 describes, though they never got resolution. The only extra wrinkle I see is the assertion that the close call comes from "a different thread"; I don't know that this is the case in my setup, but it does come back to "it only pops up when the closure occurs via Transport.close and not via _join_lingering_threads".

Still unclear if that's germane, tho as noted I need to doublecheck the different treatment of the various objects involved in the two scenarios. I'm 99% sure I've seen non-hanging threads also terminate via Transport.close() (especially if the top level client code does it explicitly.)

bitprophet · 2016-06-12T05:23:36Z

MOAR:

If I insert a sleep(1) in the "was not closed on timeout" part of the code, then check __closed after, it is indeed then closed, confirming the guess that this is part of the race condition in question.
A slight modification to the previous comment, sometimes the non-hung threads also fulfil part of the race, and get the "isn't closed yet" status; however, they do receive the expected subsequent packet.timeout, and then on that second time through the timeout+closed test, they pass & EOFError.
So it still comes down to, why isn't this hung thread's socket raising the timeout?
Upon the first "post-hang" loop of the stop_thread join loop, the bad thread's socket object is in a closed state, for whatever good that does us. (This could be another modification on the "backstop" idea, if somehow the straight up "got past 10 seconds without successful join()" condition were not stringent enough.)
This specific StackOverflow comment is the first mention I can find of socket.recv not always honoring socket.settimeout, due to a race condition in how Python implements the timeout.
- The poster seems to walk this back, but I think they're walking back the specific explanation of the race condition, not the fact that it exists?
- I read the relevant parts https://github.com/python/cpython/blob/8707d182ea722da19ace7fe994f3785cb77a679d/Modules/socketmodule.c and it seems to jive with other explanations of how this works (presence of timeout sets the socket to nonblocking mode; select() used with the given timeout as the select interval; then actual [again, non-blocking] socket call executed; loop).
- What I don't see there (but granted, I barely know C or network programming at this level) is where a race condition would pop up, because the socket should be in nonblocking mode, so sock_func() should return and then the loop should continue.
Both that thread and this one occasionally mention implementing the "nonblocking socket + select" approach at user-level, though none of them say this as a workaround for bugs in settimeout.

Took a different tack and looked at how exactly these bad threads are having .close() called on them; it's via ResourceManager (from resource.py).

If I comment out the bit in Client that does that manager registration -- no hangs!
- But also no third EOFError or other normal shutdown; the last debug message out of the would-have-been-closed-via-ResourceManager thread is the "about to recv()" one it normally hangs on.
- No obvious ill effects, but with this sort of thing, I'd still worry it would cause issues in some situations. (Whether they are worse than an infinite hang...arguably not?)
In some of the success/no-race-condition scenarios, the ResourceManager is still in play - i.e. it's not the sole cause (or symptom) of the issue.
The 'bad' thread is almost always the "middle" thread in my "execute 3 commands in sequence" test script. Not sure what that means exactly.
Checked timing to figure out what exactly is causing some of these to end up in the lingering-threads list and the 'bad' one to be resource-managered, in case that is a clue...
- Bad thread: �1465705574.463436
- Good thread 1: 1465705574.41101
- Good thread 2: 1465705574.342657
- So the bad thread's stop_thread is being called 0.12s after the later "good" thread.
- In subsequent runs, the difference is smaller - 0.04s and 0.05s - but it is still always last by a decent margin. Not that this tells me much.

Other random notes:

Slapping a time.sleep at the end of my test script has the expected effect - all 3 threads' Packetizer.read_all are in a hot loop with the recv timeout working great, until the script truly exits, at which time we're at the mercy of the race condition again (though it seems a little less likely to pop up in this case.)
Triple checked that the timeout was the one I expected (0.1s) - it is, according to gettimeout. Including just prior to the "bad" recv.
Recreated the issue identically on Debian 8, so it's not an OS X wrinkle. (But, as on OS X, the issue only appears present on Python 3.)
Python 2 has the same "two threads get linger-killed, one gets resource-managed" behavior - it's just that the thread resource-managed doesn't encounter the race condition / incorrectly-blocking recv.
Crawled all over the Python bug tracker, nothing directly relevant seemed to pop up.
- One bug (https://bugs.python.org/issue23863) did tip me off that Python 3.5 changed a lot of socket timeout related things, so I wonder if, somehow, that is related.
  - Made a Python 3.4.4 venv (thanks, pyenv!) but...nope! same issue there. So unlikely to be PEP-0475's fault.
- https://bugs.python.org/issue22043 was another possibly related change, but that one was also targeted for 3.5, so.
In interests of narrowing things further, I also tested the issue under Python 3.3.6; it's present there too. So, is a general Python 3 issue apparently.
- I scanned some of the Python 3.x changelogs but nothing useful turned up there either; at this point this is diminishing returns.

bitprophet · 2016-06-12T05:51:56Z

Ways forward brainstorm:

Still perplexed by the discrepancy between the atexit hook and the ResourceManager closing; while clearly the latter isn't the direct cause, I still wonder if it's a clue. May poke that further; is the ResourceManager firing while the atexit hook loop is running? Or is the 'bad' thread somehow being evicted from _active_threads? If so, why?
I'm not super happy leaving things at "use the join timeout as an ultimatum" because it still means the timeout - whatever we set it to - will be an annoyingly long, unexpected wait time at the end of the interpreter session for anybody on Python 3. Including Fabric 2's integration test suite.
- But, it's still an option, and waiting, say, 1-5s is better than waiting forever. Especially if accompanied by "erm...had to timeout joining the transport thread. sorry! you may have hit bug One 'bad' Transport thread hangs indefinitely at shutdown when multiple Transports are active #520...if you were NOT at the end of your session, please file a bug!".
Given that an explicit user-driven close() always seems to work (still unclear why - perhaps the race condition with the socket object has to do with end-of-interpreter shenanigans?) and is arguably good practice, we could list this as "documentation only"...

Another idea occurred to me which I think I like better: keep the ultimatum-style join timeout, but set it to a much shorter value if - by the time we're calling it - the transport's socket and packetizer both appear to have entered their closed states.

That detects the symptoms of this problem, lets the interpreter exit in a reasonable-to-humans amount of time, but limits the possibility of accidentally terminating "too early" in scenarios unlike the one I am testing under.

A couple more wrinkles on this could also be:

A while-loop checking those closed states, whose body is join(0.1) (or similar) ; that would ostensibly work even better for "taking a while to shut down" scenarios as they would take as much time as needed;
Use threadsafe signaling within Packetizer, around the recv call, so we can truly know whether or not we're in this particular scenario (socket closed, packetizer closed, packetizer "I was reading!" flag still set). I doubt this is necessary but it could be a (IMO pretty stupid) sanity check, if edge cases pop up.

bitprophet · 2016-06-12T06:00:34Z

Yea, mutating the loop to be "not current thread + not socket.closed + not packetizer.closed" and turning the join timeout down to 0.1s seems to do the trick pretty well. I don't have a great way of testing unusually-slow server endpoints right now (something I'd like to get sometime...) but I'm probably going to at least trial this change while I continue hacking on other things.

Minor side note, socket objects have no public "is closed?" flag that I can see, but _closed is a) what's used by __repr__ and b) available in Python 2+ 3, so it'll have to do for now. I'm not crazy worried about that changing anytime soon.

bitprophet · 2016-06-12T06:30:21Z

FWIW problem + fix both seem present/workable on 1.16 too (so, this is in no way related to the switch to pyca/crypto - not that I thought it would've been). I committed a cleaned-up version of what I was testing with and forward-ported (1.16, 1.17, 2.0, master) - it's live now.

If anyone watching this can give one of those branches a shot and give a thumbs-up (both re: fixing the issue, if they have it; or at least proving it's not making things worse for them) that would be cool.

bitprophet · 2016-06-12T16:40:22Z

Was reminded by tef on twitter that I never chased down the assertions made in #380 about the issue potentially being how the socket in question is closed from a different thread during the recv call (re this SO thread. So if this needs more love that's probably the next place to look.

Offhand (recalling that threading is not my expertise) if that's the true race condition, it would mean we do want "I'm recving here!" locking in Packetizer.read_all, which is honored by Packetizer.close() when it calls self.__sock.close. And/or in the other 2-3 places where that same socket, since it is also Transport.sock, is closed by Transport methods such as close or the end of run...sigh. But pretty sure in this case the race would be between the two Packetizer methods.

bitprophet · 2016-06-12T17:26:17Z

Sadly doesn't seem like a workable avenue:

(With my above 'fix' reverted temporarily,) I added locking around recv and self.__socket.close in Packetizer, but we still end up in the blocking recv unfortunately. Implies that Packetizer.__socket.close (from .close) is not actually the trigger.
Confirmed this by throwing a 0.5s sleep before the socket close in the original code; changes nothing about the behavior of the race condition, the recv fires and blocks.
- Also confirmed that this is the right place and the socket isn't closing via some other avenue: if I comment out the socket close entirely, the recv still blocks, and printing the socket object in the join loop shows it staying open indefinitely.
So I don't think that's the root cause here (seems more likely it's something to do with this all happening at interpreter shutdown perhaps?), but at least I took a crack at it.

sanseihappa · 2016-07-18T20:02:22Z

@bitprophet Should I file a separate issue for the hang that can occur in Transport.start_client()?

bitprophet · 2016-07-20T00:49:34Z

@sanseihappa Yea, that sounds like a good idea, seems orthogonal to me offhand & anything we can do to empower users to get exceptions instead of hangs, would be useful. Please file a PR - thanks!

EDIT: if it wasn't obvious, please drop a ref to "#520" in the ticket body somewhere (not the title, GH doesn't scan those for some reason).

daboshh · 2016-07-22T13:38:29Z

@sanseihappa I tried switching exchange algorithms, but with no luck. Both sides agree on both exchange algorithms, but return same errors.

cool-RR · 2016-12-07T17:01:48Z

I'm using 2.0.2 and I have hangs which I think are caused by this problem. Is there a workaround until a solution is released?

bitprophet · 2016-12-09T18:15:36Z

I'm planning to pop out 2.0.3 today, which has a couple related fixes in it.

cool-RR · 2016-12-09T18:28:14Z

Awesome! Thanks!

…

On Fri, Dec 9, 2016 at 8:15 PM, Jeff Forcier ***@***.***> wrote: I'm planning to pop out 2.0.3 today, which has a couple related fixes in it. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#520 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AADdyuQ8R8HFaRNTRIaUrLMbOC5Aj1tSks5rGZrMgaJpZM4EM_WF> .

bitprophet · 2016-12-09T18:29:26Z

FTR this issue isn't marked as solved yet because it feels like one of those "many causes, similar symptoms" things. We'll see how 2.0.3 and friends do re: fixing it for involved users :)

daboshh · 2016-12-12T10:30:24Z

It didn't work for me. Looks like I'm having the same errors as previously.

import paramiko
paramiko.common.logging.basicConfig(level=paramiko.common.DEBUG)

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect('x.x.x.x', username='x', password='x')
stdin, stdout, stderr = ssh.exec_command('show uptime')
stdout = stdout.read()
print(stdout)

DEBUG:paramiko.transport:starting thread (client mode): 0x4225c668
DEBUG:paramiko.transport:Local version/idstring: SSH-2.0-paramiko_2.1.0
DEBUG:paramiko.transport:Remote version/idstring: SSH-2.0-OpenSSH_3.5p1
INFO:paramiko.transport:Connected (version 2.0, client OpenSSH_3.5p1)
DEBUG:paramiko.transport:kex algos:TOO LONG TO COPY PASTE
DEBUG:paramiko.transport:Kex agreed: diffie-hellman-group1-sha1
DEBUG:paramiko.transport:Cipher agreed: aes128-cbc
DEBUG:paramiko.transport:MAC agreed: hmac-md5
DEBUG:paramiko.transport:Compression agreed: none
DEBUG:paramiko.transport:kex engine KexGroup1 specified hash_algo <built-in function openssl_sha1>
DEBUG:paramiko.transport:Switch to new keys ...
DEBUG:paramiko.transport:Adding ssh-rsa host key for X.X.X.X: b'e412c1b06a2f4c5606c4252390064731'
DEBUG:paramiko.transport:userauth is OK
INFO:paramiko.transport:Authentication (password) successful!
DEBUG:paramiko.transport:[chan 0] Max packet in: 32768 bytes
DEBUG:paramiko.transport:[chan 0] Max packet out: 32768 bytes
DEBUG:paramiko.transport:Secsh channel 0 opened.
DEBUG:paramiko.transport:EOF in transport thread
Traceback (most recent call last):
  File "C:X\X\X\X", line 8, in <module>
    stdin, stdout, stderr = ssh.exec_command('show uptime')
  File "C:\Program Files\Python 3.5.1\lib\site-packages\paramiko\client.py", line 441, in exec_command
    chan.exec_command(command)
  File "C:\Program Files\Python 3.5.1\lib\site-packages\paramiko\channel.py", line 60, in _check
    return func(self, *args, **kwds)
  File "C:\Program Files\Python 3.5.1\lib\site-packages\paramiko\channel.py", line 234, in exec_command
    self._wait_for_event()
  File "C:\Program Files\Python 3.5.1\lib\site-packages\paramiko\channel.py", line 1161, in _wait_for_event
    raise e
  File "C:\Program Files\Python 3.5.1\lib\site-packages\paramiko\transport.py", line 1760, in run
    ptype, m = self.packetizer.read_message()
  File "C:\Program Files\Python 3.5.1\lib\site-packages\paramiko\packet.py", line 391, in read_message
    header = self.read_all(self.__block_size_in, check_rekey=True)
  File "C:\Program Files\Python 3.5.1\lib\site-packages\paramiko\packet.py", line 256, in read_all
    raise EOFError()
EOFError

rustyscottweber · 2016-12-14T00:40:18Z

I ran into a similar problem with this that was related to the Linux kernel I was working on refused to close the socket which would hang the transport thread. Maybe also check the OS you are running on and the kernel version.

daboshh · 2016-12-14T08:26:34Z

I ran into a similar problem with this that was related to the Linux kernel I was working on refused to close the socket which would hang the transport thread. Maybe also check the OS you are running on and the kernel version.

I'm using updated win 10. :(

andreycizov · 2016-12-17T01:38:32Z

It seems I am running into the same issue when paramiko.SFTPClient.open(). paramiko.sftp_file.SFTPFile.close() would hang indefinitely when called. I have solved this by calling _close(async=True). Not sure if it solves the issue of properly closing the file, but it definitely solves the issue of hangups.

urban-1 · 2016-12-23T11:59:46Z

Hi all,

I am not sure that I have the same exactly problem but paramiko hangs in packet.py, line 276, in read_all. This is happening only in python 3, details:

$ python3 --version
Python 3.4.3
$ pip3 freeze | grep paramiko
paramiko==1.17.0
$ lsb_release -a
LSB Version:    :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 6.7 (Santiago)
Release:        6.7
Codename:       Santiago

Now, in order to investigate this I made a small function that walks all the threads in sys._current_frames(). Funny thing is that when I run that function the problem disappears!! The code that works is:

            # Close client or transport here ****

            import threading
            import traceback
            for thread_id, frame in sys._current_frames().items():
                for thread in threading.enumerate():
                    if thread.ident == thread_id:
                        name = thread.name
                  ''.join(traceback.format_stack(frame))

EDIT: explicit .close() is required

The most interesting thing is that if you remove the last line you get the symptoms back... I have absolutely no clue or explanation why traceback.format_stack(frame) fixes the problem... I am posting this in case someone has the same issue.

Let me know if I should create a new Issue

Cheers,

Andreas

tyler-8 · 2016-12-28T21:47:10Z

Found this thread through Google, I'm running Python 2.7.13 and running into this issue with a multiprocessed script that uses paramiko 2.1.1. Here's my log output where it hangs:

INFO:paramiko.transport:Authentication (password) successful!
DEBUG:paramiko.transport:[chan 0] Max packet in: 32768 bytes
DEBUG:paramiko.transport:[chan 0] Max packet out: 32768 bytes
DEBUG:paramiko.transport:Secsh channel 0 opened.
DEBUG:paramiko.transport:[chan 0] Sesch channel 0 request ok
DEBUG:paramiko.transport:[chan 0] Sesch channel 0 request ok
DEBUG:paramiko.transport:EOF in transport thread
DEBUG:paramiko.transport:EOF in transport thread
DEBUG:paramiko.transport:EOF in transport thread

paltsevpavel · 2017-03-31T08:23:35Z

Hi, i've the same problem. May be somebody help.

Python 2.7.12.
DEBUG:paramiko.transport:starting thread (client mode): 0x3555f90L
DEBUG:paramiko.transport:Local version/idstring: SSH-2.0-paramiko_2.1.2
DEBUG:paramiko.transport:Remote version/idstring: SSH-1.99-OpenSSH_7.3p1+sftpfilecontrol-v1.3-hpn14v11
INFO:paramiko.transport:Connected (version 1.99, client OpenSSH_7.3p1+sftpfilecontrol-v1.3-hpn14v11)
DEBUG:paramiko.transport:kex algos:[u'curve25519-sha256@libssh.org', u'ecdh-sha2-nistp256', u'ecdh-sha2-nistp384', u'ecdh-sha2-nistp521', u'diffie-hellman-group-exchange-sha256', u'diffie-hellman-group16-sha512', u'diffie-hellman-group18-sha512', u'diffie-hellman-group14-sha256', u'diffie-hellman-group14-sha1'] server key:[u'ssh-rsa', u'rsa-sha2-512', u'rsa-sha2-256', u'ecdsa-sha2-nistp256', u'ssh-ed25519'] client encrypt:[u'aes128-ctr', u'aes192-ctr', u'aes256-ctr', u'arcfour256', u'arcfour128', u'aes128-cbc', u'3des-cbc'] server encrypt:[u'aes128-ctr', u'aes192-ctr', u'aes256-ctr', u'arcfour256', u'arcfour128', u'aes128-cbc', u'3des-cbc'] client mac:[u'umac-64-etm@openssh.com', u'umac-128-etm@openssh.com', u'hmac-sha2-256-etm@openssh.com', u'hmac-sha2-512-etm@openssh.com', u'hmac-sha1-etm@openssh.com', u'umac-64@openssh.com', u'umac-128@openssh.com', u'hmac-sha2-256', u'hmac-sha2-512', u'hmac-sha1'] server mac:[u'umac-64-etm@openssh.com', u'umac-128-etm@openssh.com', u'hmac-sha2-256-etm@openssh.com', u'hmac-sha2-512-etm@openssh.com', u'hmac-sha1-etm@openssh.com', u'umac-64@openssh.com', u'umac-128@openssh.com', u'hmac-sha2-256', u'hmac-sha2-512', u'hmac-sha1'] client compress:[u'none', u'zlib@openssh.com'] server compress:[u'none', u'zlib@openssh.com'] client lang:[u''] server lang:[u''] kex follows?False
DEBUG:paramiko.transport:Kex agreed: diffie-hellman-group14-sha1
DEBUG:paramiko.transport:Cipher agreed: aes128-ctr
DEBUG:paramiko.transport:MAC agreed: hmac-sha2-256
DEBUG:paramiko.transport:Compression agreed: none
DEBUG:paramiko.transport:kex engine KexGroup14 specified hash_algo
DEBUG:paramiko.transport:Switch to new keys ...
DEBUG:paramiko.transport:Adding ssh-rsa host key for 10.67.43.189: d75da3a54f79a2d0de07c4f98164530b
DEBUG:paramiko.transport:userauth is OK
DEBUG:paramiko.transport:EOF in transport thread

Server's parameters:
OpenSSH_7.3p1+sftpfilecontrol-v1.3-hpn14v11, OpenSSL 1.0.2h 3 May 2016
HP-UX Secure Shell-A.07.30.003, HP-UX Secure Shell version

dmitrytokarev · 2017-03-31T08:32:57Z

@tyler-8 @paltsevpavel Try this: #109 (comment)
(it helped me with very similar hang issue, which could have same root cause)

vdloo · 2017-07-18T11:50:29Z

Encountered something similar in Python 3.5.2. Blocks forever if an exception occurs in the auth_handler:

For example:

  File "/usr/lib/python3/dist-packages/kamikaze3/importer/ssh.py", line 212, in get_connection
    sock=proxy
  File "/usr/lib/python3/dist-packages/paramiko/client.py", line 367, in connect
    look_for_keys, gss_auth, gss_kex, gss_deleg_creds, gss_host)
  File "/usr/lib/python3/dist-packages/paramiko/client.py", line 584, in _auth
    raise saved_exception
  File "/usr/lib/python3/dist-packages/paramiko/client.py", line 571, in _auth
    self._transport.auth_password(username, password)
  File "/usr/lib/python3/dist-packages/paramiko/transport.py", line 1262, in auth_password
    return self.auth_handler.wait_for_response(my_event)
  File "/usr/lib/python3/dist-packages/paramiko/auth_handler.py", line 208, in wait_for_response
    raise e
paramiko.ssh_exception.AuthenticationException: Authentication failed.

and

  File "/usr/lib/python3/dist-packages/kamikaze3/importer/ssh.py", line 212, in get_connection
    sock=proxy
  File "/usr/lib/python3/dist-packages/paramiko/client.py", line 367, in connect
    look_for_keys, gss_auth, gss_kex, gss_deleg_creds, gss_host)
  File "/usr/lib/python3/dist-packages/paramiko/client.py", line 584, in _auth
    raise saved_exception
  File "/usr/lib/python3/dist-packages/paramiko/client.py", line 571, in _auth
    self._transport.auth_password(username, password)
  File "/usr/lib/python3/dist-packages/paramiko/transport.py", line 1262, in auth_password
    return self.auth_handler.wait_for_response(my_event)
  File "/usr/lib/python3/dist-packages/paramiko/auth_handler.py", line 208, in wait_for_response
    raise e
paramiko.ssh_exception.BadAuthenticationType: ('Bad authentication type', ['publickey']) (allowed_types=['publickey'])

I'm not running the latest paramiko (1.16.0-1) so I don't know if this has already been fixed, but if someone stumbles on this via google, you can band aid the issue by setting a timer in a thread to close the connection after a certain amount of time so the deadlock won't block your program indefinitely:

from threading import Timer
from contextlib import contextmanager

@contextmanager
def close_conn_on_timeout(paramiko_ssh_client):
    timer = Timer(10, paramiko_ssh_client.close)
    timer.start()
    yield
    timer.cancel()

with close_conn_on_timeout(paramiko_ssh_client):
    paramiko_ssh_client.connect(host)

baid86 · 2019-01-21T04:42:33Z

I am also facing the similar issue and I am running paramiko 2.4.1,
Do we have any workaround for this?

Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
  File "/usr/local/lib/python2.7/dist-packages/paramiko/transport.py", line 1967, in run
<type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'error'
Exception in thread Thread-1465 (most likely raised during interpreter shutdown):Exception in thread Thread-1657 (most likely raised during interpreter shutdown):Exception in thread Thread-1521 (most likely raised during interpreter shutdown):Exception in thread Thread-401 (most likely raised during interpreter shutdown):Exception in thread Thread-1857 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
  File "/usr/local/lib/python2.7/dist-packages/paramiko/transport.py", line 1967, in run
<type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'error'

bitprophet added Needs investigation Bug labels Jul 30, 2015

sanseihappa mentioned this issue May 26, 2016

Occasional hang #698

Closed

This was referenced May 26, 2016

Potential to hang during authentication? #719

Closed

Channel._wait_for_event() has no timeout #739

Closed

bitprophet added the Hangs label Jun 11, 2016

This was referenced Jun 12, 2016

Possible Python 3-only session shutdown hangs fabric/fabric#1473

Open

Packet.read_all() hangs on socket.recv() when transport is closed by different thread #380

Closed

bitprophet added a commit that referenced this issue Jun 12, 2016

Experimental fix re #520

0e54d0f

sanseihappa mentioned this issue Jul 26, 2016

Add timeout to Transport.start_client() #780

Closed

bitprophet mentioned this issue Jul 30, 2016

ProxyCommand objects need the socket-like-obj _closed fix too #789

Closed

bitprophet mentioned this issue Aug 10, 2016

recv_ready reports true when false? #796

Closed

bitprophet mentioned this issue Aug 18, 2016

Transport object thread remains active after close() #799

Closed

bitprophet mentioned this issue Dec 7, 2016

found a hang bug in poor network condition #793

Closed

ploxiln mentioned this issue May 6, 2017

fix SSHClient/Transport leak, remove unneeded references #952

Merged

dkhapun pushed a commit to cyberx-labs/paramiko that referenced this issue Jun 7, 2018

Experimental fix re paramiko#520

e5e88b8

dkhapun pushed a commit to cyberx-labs/paramiko that referenced this issue Jun 7, 2018

Revert Python 2 to original join logic re paramiko#520

3808af6

dkhapun pushed a commit to cyberx-labs/paramiko that referenced this issue Jun 7, 2018

Enhance docs a bit re paramiko#520 and related hang issues.

5421b23

borellim mentioned this issue Oct 3, 2018

Paramiko occasionally fails to open transport -- 'SSHException: Error reading SSH protocol banner' aiidateam/aiida-core#1853

Closed

christopherdoyle mentioned this issue May 24, 2021

Tests hanging forever ulope/pytest-sftpserver#10

Open

wpbonelli mentioned this issue Dec 19, 2021

Paramiko SFTP failures Computational-Plant-Science/plantit#239

Closed

ni-balexand mentioned this issue Feb 10, 2023

Intermittent Python crash when gc traverses the Packetizer.read_all frame #2184

Closed

One 'bad' Transport thread hangs indefinitely at shutdown when multiple Transports are active #520

One 'bad' Transport thread hangs indefinitely at shutdown when multiple Transports are active #520

Comments

colinmcintosh commented Apr 30, 2015 • edited by bitprophet Loading

colinmcintosh commented May 1, 2015

rustyscottweber commented May 1, 2015

colinmcintosh commented May 5, 2015

colinmcintosh commented May 5, 2015

rustyscottweber commented May 13, 2015

colinmcintosh commented May 13, 2015

colinmcintosh commented May 13, 2015

Offlein commented Jul 24, 2015

stas commented Nov 13, 2015

dmitrytokarev commented Dec 3, 2015

sanseihappa commented May 26, 2016 • edited Loading

sanseihappa commented May 26, 2016

botanicvelious commented May 29, 2016

bitprophet commented Jun 12, 2016

bitprophet commented Jun 12, 2016

bitprophet commented Jun 12, 2016

bitprophet commented Jun 12, 2016 • edited Loading

bitprophet commented Jun 12, 2016

bitprophet commented Jun 12, 2016 • edited Loading

bitprophet commented Jun 12, 2016

bitprophet commented Jun 12, 2016 • edited Loading

sanseihappa commented Jul 18, 2016

bitprophet commented Jul 20, 2016 • edited Loading

daboshh commented Jul 22, 2016 • edited Loading

cool-RR commented Dec 7, 2016

bitprophet commented Dec 9, 2016

cool-RR commented Dec 9, 2016 via email

bitprophet commented Dec 9, 2016

daboshh commented Dec 12, 2016

rustyscottweber commented Dec 14, 2016

daboshh commented Dec 14, 2016

andreycizov commented Dec 17, 2016

urban-1 commented Dec 23, 2016 • edited Loading

tyler-8 commented Dec 28, 2016

paltsevpavel commented Mar 31, 2017

dmitrytokarev commented Mar 31, 2017 • edited Loading

vdloo commented Jul 18, 2017

baid86 commented Jan 21, 2019

colinmcintosh commented Apr 30, 2015 •

edited by bitprophet

Loading

sanseihappa commented May 26, 2016 •

edited

Loading

bitprophet commented Jun 12, 2016 •

edited

Loading

bitprophet commented Jun 12, 2016 •

edited

Loading

bitprophet commented Jun 12, 2016 •

edited

Loading

bitprophet commented Jul 20, 2016 •

edited

Loading

daboshh commented Jul 22, 2016 •

edited

Loading

urban-1 commented Dec 23, 2016 •

edited

Loading

dmitrytokarev commented Mar 31, 2017 •

edited

Loading