fix hang if using proxycommand #252

Closed
wants to merge 1 commit into
from

Projects

None yet

4 participants

@gza
gza commented Jan 19, 2014

[Maintainer note: this patch can't be used as-is but see comments for eventual diagnosis & fixes.]

problem was introduced by commit 068bf63
when using a proxycommand, self.is_alive() loop in stop_thread never
ends

Don't know if it is the right way to do it, but, works for me

@gza gza fix hang if using proxycommand
problem was introduced by commit 068bf63
when using a proxycommand, self.is_alive() loop in stop_thread never
ends

Signed-off-by: gza <github.guillaume@zitta.fr>
ecea10b
@mgedmin
Contributor
mgedmin commented Feb 11, 2014

Ping? There are quite a few Fabric users suffering from this.

@michaelballantyne

@mgedmin I just tried with this fix and it didn't seem to solve the problem I was having with fabric. Does it work for you?

@mgedmin
Contributor
mgedmin commented Feb 12, 2014

This PR fixes my problem. (Which is a hang during disconnect if I use Fabric to talk to a server using ProxyCommand.)

If I ^C the fab process after it hangs (while using paramiko master without this fix), I get the following traceback, which is the same as described in the original comment:

Done.
^CDisconnecting from muskatas... Traceback (most recent call last):
  File "/home/mg/.venv/bin/fab", line 9, in <module>
    load_entry_point('Fabric==1.8.0', 'console_scripts', 'fab')()
  File "/home/mg/.venv/local/lib/python2.7/site-packages/fabric/main.py", line 760, in main
    disconnect_all()
  File "/home/mg/.venv/local/lib/python2.7/site-packages/fabric/network.py", line 597, in disconnect_all
    connections[key].close()
  File "/home/mg/src/paramiko/paramiko/client.py", line 350, in close
    self._transport.close()
  File "/home/mg/src/paramiko/paramiko/transport.py", line 618, in close
    self.stop_thread()
  File "/home/mg/src/paramiko/paramiko/transport.py", line 1407, in stop_thread
    self.join(10)
  File "/usr/lib/python2.7/threading.py", line 958, in join
    self.__block.wait(delay)
  File "/usr/lib/python2.7/threading.py", line 358, in wait
    _sleep(delay)
KeyboardInterrupt
@bitprophet
Member

Thanks for suggesting this!

I feel like I've either seen this change before, or we already merged something that moved the close call in the opposite direction. Will poke at history to see if I'm crazy or not, and also try to recreate on my end so I can verify that this fixes it.

@bitprophet
Member

Yea, I was right, this was modified in #156 (which came out in 1.11.2, which is when users are reporting the bug started to appear as per Fabric #1020 and Fabric #1042. Digging into the implications still, as this PR would not completely reverse that patch but might still cause issues (and they may be speaking to slightly different use cases as well.)

@mgedmin
Contributor
mgedmin commented Feb 12, 2014

I checked out the ref for this pull request, and pip install -e'd it into a virtualenv (to see if it fixes Fabric #1020, which it does). Now I tried the test.py from #156 and, unfortunately, you're right: the test script hangs. If I switch back to master, the test script exits normally.

@michaelballantyne

I'd done something wrong in my earlier testing; my previous comment is incorrect - this does indeed fix Fabric #1020 for me.

@bitprophet
Member

Thanks @mgedmin - that's exactly what I was expecting and my (ongoing) investigation definitely implies that shuffling the socket close is not the right approach here.

So I won't merge this actual PR (sorry @gza!) but I will keep it open as the Paramiko-level discussion of the issue. Will comment again in a sec w/ condensed version of what I have so far. Think I'm close.

@bitprophet
Member
  • The code hanging is within Transport.stop_thread, specifically a while self.isAlive(): self.join(10) loop, which is all threading.Thread material. thread.isAlive() is defined as True until the thread body (run()) runs to completion
  • Transport.run has its own while loop testing self.active, and Transport.stop_thread sets that var to False prior to hitting the while self.isAlive() loop
  • However Transport.run is not hitting the bottom (or top) of the while loop - something is blocking here that isn't normally blocking, and the "moving the socket close" semi-fix implies it's a network read.
  • Sure enough, it IS a blocking call in the while loop, line 1566, self.packetizer.read_message().
    • This bubbles down to a recv call on Packetizer.__socket, which here is a ProxyCommand. When the Ctrl-C happens, it seems like this call unblocks & returns 0, which triggers the EOF, and so on.
    • Why is this different from (& if so, why) a regular network socket? Is it because the subprocess is not shutting down? Why does a normal, direct connection socket behave differently here?
  • OK: w/o using ProxyCommand, this code still EOFs, but earlier, upon a socket.timeout exception - probably triggered by other parts of the code tearing things down re: the real socket.
    • IOW, ProxyCommand is a partial socket implementation and clearly isn't behaving correctly in whatever situation that is. So I need to dig more & figure out how to make that class respond in the same way that the real socket does, if possible.
@bitprophet
Member

More digging, still no solution:

  • The key here is that with a real/regular socket/connection, completion of the remote command (I presume) closes the connection; this triggers the socket.timeout, cascading to the EOFError, which breaks Transport.run out of its loop, and then things close up from there.
  • With ProxyCommand, the "socket connection" is a subprocess to ssh -W which is not invoking any commands itself - it stays open effectively forever and proxies traffic. Thus the socket.recv() call is actually an os.read() call on that subprocess' pipe. This call sits around forever.
  • An explicit ProxyCommand.close() (e.g. what this PR would make happen before the main thread falls into the "wait for the Transport thread" loop) kills the subprocess, causing the os.read() to unblock & return (an empty string.)
  • However, as in earlier comments, this is unacceptable as it causes deadlock in other situations (such as using the native gateway feature.)
@bitprophet
Member

The crux then seems to be that in non ProxyCommand situations, we need to rely on that socket timeout to tell when the remote end is truly done sending us data. Something else needs to happen in the ProxyCommand scenario to the same effect:

  • os.read() has no timeout functionality so that won't help;
  • can't rely on normal subprocess "are you done?" tricks because the subprocess itself doesn't actually exit until we stop it.
  • May need to do something really horrible like use threading.Timer close to the os.read() to fake a timeout?
  • Or see whether things move fast enough in practice that we can get away with calling ProxyCommand.close() in Transport.stop_thread, if and only if the socket is actually a ProxyCommand.
    • Feels like this is likely to engender more fun-to-debug race conditions when there is data coming down the gateway pipe that hasn't been recv()'d yet?
@bitprophet
Member

threading.Timer won't work because it's its own subthread so things like raise socket.timeout only raise in that thread - not the main thread or the transport thread. Derp.

Could try faking a timeout by using select.select around the os.read() & tracking time?

@bitprophet
Member

The 'select + manually timeout' approach works but is prone to banner errors unless the timeout is inflated (implying math's off or something re: manual timing out?) which then probably makes things a bit slow.

But I'm way past my time/stress threshold for such a specific feature, especially one which has a workaround (the gateway/use-nested-channels approach) so I'm tempted to push out a branch w/ this code in it & merge if I get sufficient +1s.

@bitprophet
Member

https://github.com/paramiko/paramiko/tree/fix-proxycommand-infinite-loop-252 - this is that branch. Works for me in the trivial base case.

@mgedmin
Contributor
mgedmin commented Feb 13, 2014

Fails here (Fabric 1.8.0, paramiko commit 58489c8) :

mg@platonas: ~/src/paramiko [git:fix-proxycommand-infinite-loop-252=] $ pov-fab muskatas id
[muskatas] Executing task 'id'
[muskatas] run: id
No handlers could be found for logger "paramiko.transport"
Traceback (most recent call last):
  File "/home/mg/.venv/local/lib/python2.7/site-packages/fabric/main.py", line 743, in main
    *args, **kwargs
  File "/home/mg/.venv/local/lib/python2.7/site-packages/fabric/tasks.py", line 368, in execute
    multiprocessing
  File "/home/mg/.venv/local/lib/python2.7/site-packages/fabric/tasks.py", line 264, in _execute
    return task.run(*args, **kwargs)
  File "/home/mg/.venv/local/lib/python2.7/site-packages/fabric/tasks.py", line 171, in run
    return self.wrapped(*args, **kwargs)
  File "/home/mg/src/pov-admin/fabfile.py", line 152, in id
    run("id")
  File "/home/mg/.venv/local/lib/python2.7/site-packages/fabric/network.py", line 578, in host_prompting_wrapper
    return func(*args, **kwargs)
  File "/home/mg/.venv/local/lib/python2.7/site-packages/fabric/operations.py", line 1042, in run
    shell_escape=shell_escape)
  File "/home/mg/.venv/local/lib/python2.7/site-packages/fabric/operations.py", line 909, in _run_command
    channel=default_channel(), command=wrapped_command, pty=pty,
  File "/home/mg/.venv/local/lib/python2.7/site-packages/fabric/state.py", line 388, in default_channel
    chan = _open_session()
  File "/home/mg/.venv/local/lib/python2.7/site-packages/fabric/state.py", line 380, in _open_session
    return connections[env.host_string].get_transport().open_session()
  File "/home/mg/src/paramiko/paramiko/transport.py", line 660, in open_session
    return self.open_channel('session')
  File "/home/mg/src/paramiko/paramiko/transport.py", line 762, in open_channel
    raise e
NameError: global name 'BadProxyCommand' is not defined
Disconnecting from muskatas... done.
@mgedmin
Contributor
mgedmin commented Feb 13, 2014

Filed #268 for that NameError (it also affects master, not just this branch).

I'm guessing that code shouldn't be trying to convert a socket.timeout into a BadProxyCommand, so with this diff I get a bit further:

diff --git a/paramiko/proxy.py b/paramiko/proxy.py
index abdd157..6262b39 100644
--- a/paramiko/proxy.py
+++ b/paramiko/proxy.py
@@ -101,6 +101,8 @@ class ProxyCommand(object):
                     read.append(b)
             result = ''.join(read)
             return result
+        except socket.timeout:
+            raise
         except IOError, e:
             raise BadProxyCommand(' '.join(self.cmd), e.strerror)

Unfortunately "further" is this:

mg@platonas: ~/src/paramiko [git:fix-proxycommand-infinite-loop-252 *=] $ pov-fab muskatas id
[muskatas] Executing task 'id'
[muskatas] run: id
No handlers could be found for logger "paramiko.transport"
[muskatas] Login password for 'mg': 

So key authentication stops working.

For the record, the ProxyCommand I use for muskatas is

ProxyCommand ssh -q -a -x fridge nc %h %p
@bitprophet
Member

I'm using key auth in my base case and it works fine there, so I wonder what else in your situation could be killing it. I also tried your invocation (though AFAIK ssh -W %h:%p is identical in effect & a bit shorter :) which is what I developed this against) and same result - a simple ls / works fine.

Feels to me like your slightly uncommon use case (gateway == target) may be complicating things? Are you able to try a more vanilla use case (gateway != target) temporarily so we can compare behaviors on your end?

EDIT: also, enabling debug-level stdlib logging in your fabfile would help us see what Paramiko is trying to log.

@bitprophet bitprophet added a commit that referenced this pull request Feb 13, 2014
@bitprophet bitprophet Preliminary changelog entry re #252 c3eb903
@bitprophet
Member

@mgedmin oh, I see - socket.error is a subclass of IOError (something I wasn't cognizant of). I'll add that bit in and see if it helps - I am still getting banner errors occasionally and it feels really grody to keep pumping up the timeout. EDIT: didn't help, but I think the banner specific stuff has its own timeout which may not be getting transmitted correctly. Digging.

@bitprophet
Member

This banner error is pretty strange but I'm gonna operate as if it's not a coincidence.

When things work, the data gotten back from the initial read is:

SSH-2.0-OpenSSH_5.5p1 Debian6+squeeze2\r\n\x00\x00\x03\x0c\n\x14&\xaf1\x86\x8a\x1bB\x15\xe5\x0f\xa8\xc2\xb0\xa2\x8eg\x00\x00\x00~diffie-hellman-group-exchange-sha256,diffie-hellman-group-exc

When it's broken:

\x00\x00\x03\x0c\n\x14\xfc\x80\xa8t\xaa\xc201\x9b\x90T\x97-\xa5\x9c\xfc\x00\x00\x00~diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,di

When broken, it keeps sucking down more, I think trying to find the banner, but it looks like the front (the part being tested for - this code just wants to see SSH-<stuff>) is being dropped or something, leaving us with the raw bytes followed by cypherlists. (So when broken, we see more of the cyphers and such, but the SSH- part never appears.)

This data always comes from Packetizer.readline -> Packetizer._read_timeout which performs its own timeout (much like ours, tho they use time.time but whatever) around calls to __socket.recv(128) - specifically it just loops on recv, skipping socket.timeout until recv actually returns something or the timeout passes.

So suspicion is that our inner timeout is potentially reading that header, but then timing out & raising, then gets called again by the outer loop - thus "eating" the data. Unsure why the Packetizer code is acting like a socket.timeout doesn't change state - perhaps our shitty fake socket is still (surprise) not matching OS sockets in terms of behavior (e.g. a timeout means both ends of the connection understand no data was truly received, and to 'start over'? which could be expected in TCP for example.)

@bitprophet
Member

Yup, keeping a buffer around persistently (again very similar to Packetizer's own timeout stuff, though they're different enough that I doubt they'd be factorable) seems to work, e.g. I can see it containing the "missing" banner prefix in situations where it would've otherwise taken a few seconds + then errored on me.

This also likely means other situations that would goof up, will work now. Who knows if that's the extent of this class' non-real-socket-ness though.

@mgedmin, I am pushing this update momentarily, please give it a shot.

EDIT: Also realized this explains the oddness where I had to bump up the inner timeout to avoid errors. So that can be removed now. Sweet.

@bitprophet
Member

Tested latest code out with a less trivial fabfile (does some runs, some with cd + sudos, then some more sudos, then a put, then more runs.) Works fine. Added in some interactivity (run("python")) - also works fine.

Hopefully means this is stable, potentially excepting more specific-to-proxying-itself things like @mgedmin's situation.

@mgedmin
Contributor
mgedmin commented Feb 14, 2014

I've pulled the latest changes in the fix-proxycommand-infinite-loop-252 branch and things work fine now. Perhaps that timeout-drops-data bug you fixed was randomly breaking pubkey auth for me.

๐ŸŽ‰ ๐Ÿฐ ๐ŸŽ†

BTW note that no proxying itself was involved in #252 (comment). The can't-proxy-itself case was with the native gateway mode, and it came up only because there's no way to specify different gateways for different hosts in Fabric, AFAIK. (Being able to specify different gateways would be more useful than being able to gateway to itself, IMHO.)

(I've been using netcat in my ProxyCommand since before ssh -W existed. Thanks for bringing it to my attention!)

@bitprophet
Member

Noted re: gateways, and yea, I have plans for Fab 2 to be far more object oriented which will enable per-host settings for everything - gateway, etc.

Glad this fixes it for you (also see comments in the Fab ticket). Will release today hopefully. Thanks again for the assist!

@bitprophet bitprophet closed this Feb 14, 2014
@bitprophet
Member

Merged my branch for this into 1.11 -> 1.12 -> master.

@gza gza deleted the unknown repository branch Feb 14, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment