race condition between channel.fileno() and closing channel when using select.select() with paramiko channels. #537

Closed
pstrzelczak opened this Issue Jun 8, 2015 · 9 comments

Projects

None yet

3 participants

@pstrzelczak

The following code

channel.exec_command(command)
select.select([channel], [], [])

may lead to deadlock because a pipe created internally in channel.fileno() can be created after channel is closed().
python 2.7.6
paramiko 1.10.1
Linux Mint 17

@pstrzelczak

full working code exposing this problem (assumes passwordless ssh to localhost):

#!/usr/bin/env python

import logging
import os
import paramiko
import paramiko.rsakey
import socket
import sys
import select
import time
import getpass

def communicateChannel(channel):
    stdout, stderr = [], []
    bufferSize = 64 * 1024
    gotSomeData = True
    while gotSomeData:
        logging.info('communicateChannel(), channel closed: %s', channel.closed)
        ready, _, _ = select.select([channel], [], [])
        logging.info('communicateChannel(), ready: %s, recv_ready: %s, recv_err_ready: %s', ready, channel.recv_ready(), channel.recv_stderr_ready())

        if ready:
            gotSomeData = False
            if channel.recv_ready():
                data = channel.recv(bufferSize)
                if data:
                    stdout.append(data)
                    gotSomeData = True
            if channel.recv_stderr_ready():
                data = channel.recv_stderr(bufferSize)
                if data:
                    stderr.append(data)
                    gotSomeData = True
    return ''.join(stdout), ''.join(stderr)

def showBug():
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect(("localhost", 22))
    conn = paramiko.Transport(sock)

    conn.start_client()
    assert conn.is_active()

    user = getpass.getuser()
    conn.auth_publickey(user, paramiko.rsakey.RSAKey.from_private_key_file("/home/%s/.ssh/id_rsa" % user))
    assert conn.is_authenticated()

    channel = conn.open_session()
    command = '/bin/true'
    logging.info('showBug() #1, launching %s' % command)
    channel.exec_command('/bin/true')
    logging.info('showBug() #2, launched %s' % command)
    time.sleep(1)
    logging.info('showBug() #3, channel closed: %s' % channel.closed)
    time.sleep(0.1)
    logging.info('showBug() #4')
    stdout, stderr = communicateChannel(channel)
    logging.info('showBug() #5')
    retcode = channel.recv_exit_status()
    logging.info('showBug() #6')

logging.basicConfig(level=logging.DEBUG)
showBug()
@pstrzelczak

Output of above program (interrupted after hang)

DEBUG:paramiko.transport:starting thread (client mode): 0x88926110L
INFO:paramiko.transport:Connected (version 2.0, client OpenSSH_6.6.1p1)
DEBUG:paramiko.transport:kex algos:['curve25519-sha256@libssh.org', 'ecdh-sha2-nistp256', 'ecdh-sha2-nistp384', 'ecdh-sha2-nistp521', 'diffie-hellman-group-exchange-sha256', 'diffie-hellman-group-exchange-sha1', 'diffie-hellman-group14-sha1', 'diffie-hellman-group1-sha1'] server key:['ssh-rsa', 'ssh-dss', 'ecdsa-sha2-nistp256', 'ssh-ed25519'] client encrypt:['aes128-ctr', 'aes192-ctr', 'aes256-ctr', 'arcfour256', 'arcfour128', 'aes128-gcm@openssh.com', 'aes256-gcm@openssh.com', 'chacha20-poly1305@openssh.com', 'aes128-cbc', '3des-cbc', 'blowfish-cbc', 'cast128-cbc', 'aes192-cbc', 'aes256-cbc', 'arcfour', 'rijndael-cbc@lysator.liu.se'] server encrypt:['aes128-ctr', 'aes192-ctr', 'aes256-ctr', 'arcfour256', 'arcfour128', 'aes128-gcm@openssh.com', 'aes256-gcm@openssh.com', 'chacha20-poly1305@openssh.com', 'aes128-cbc', '3des-cbc', 'blowfish-cbc', 'cast128-cbc', 'aes192-cbc', 'aes256-cbc', 'arcfour', 'rijndael-cbc@lysator.liu.se'] client mac:['hmac-md5-etm@openssh.com', 'hmac-sha1-etm@openssh.com', 'umac-64-etm@openssh.com', 'umac-128-etm@openssh.com', 'hmac-sha2-256-etm@openssh.com', 'hmac-sha2-512-etm@openssh.com', 'hmac-ripemd160-etm@openssh.com', 'hmac-sha1-96-etm@openssh.com', 'hmac-md5-96-etm@openssh.com', 'hmac-md5', 'hmac-sha1', 'umac-64@openssh.com', 'umac-128@openssh.com', 'hmac-sha2-256', 'hmac-sha2-512', 'hmac-ripemd160', 'hmac-ripemd160@openssh.com', 'hmac-sha1-96', 'hmac-md5-96'] server mac:['hmac-md5-etm@openssh.com', 'hmac-sha1-etm@openssh.com', 'umac-64-etm@openssh.com', 'umac-128-etm@openssh.com', 'hmac-sha2-256-etm@openssh.com', 'hmac-sha2-512-etm@openssh.com', 'hmac-ripemd160-etm@openssh.com', 'hmac-sha1-96-etm@openssh.com', 'hmac-md5-96-etm@openssh.com', 'hmac-md5', 'hmac-sha1', 'umac-64@openssh.com', 'umac-128@openssh.com', 'hmac-sha2-256', 'hmac-sha2-512', 'hmac-ripemd160', 'hmac-ripemd160@openssh.com', 'hmac-sha1-96', 'hmac-md5-96'] client compress:['none', 'zlib@openssh.com'] server compress:['none', 'zlib@openssh.com'] client lang:[''] server lang:[''] kex follows?False
DEBUG:paramiko.transport:Ciphers agreed: local=aes128-ctr, remote=aes128-ctr
DEBUG:paramiko.transport:using kex diffie-hellman-group1-sha1; server key type ssh-rsa; cipher: local aes128-ctr, remote aes128-ctr; mac: local hmac-sha1, remote hmac-sha1; compression: local none, remote none
DEBUG:paramiko.transport:Switch to new keys ...
DEBUG:paramiko.transport:userauth is OK
INFO:paramiko.transport:Authentication (publickey) successful!
DEBUG:paramiko.transport:[chan 1] Max packet in: 34816 bytes
DEBUG:paramiko.transport:[chan 1] Max packet out: 32768 bytes
INFO:paramiko.transport:Secsh channel 1 opened.
INFO:root:showBug() #1, launching /bin/true
DEBUG:paramiko.transport:[chan 1] Sesch channel 1 request ok
INFO:root:showBug() #2, launched /bin/true
DEBUG:paramiko.transport:[chan 1] EOF received (1)
DEBUG:paramiko.transport:[chan 1] EOF sent (1)
INFO:root:showBug() #3, channel closed: True
INFO:root:showBug() #4
INFO:root:communicateChannel(), channel closed: True
^CTraceback (most recent call last):
  File "./select_support_bug.py", line 63, in <module>
    showBug()
  File "./select_support_bug.py", line 57, in showBug
    stdout, stderr = communicateChannel(channel)
  File "./select_support_bug.py", line 19, in communicateChannel
    ready, _, _ = select.select([channel], [], [])
KeyboardInterrupt
@bitprophet
Member

Can you please try recreating this on Paramiko 1.15.x? A lot has changed since 1.10.x.

@pstrzelczak

The bug is easily reproducible with above program on paramiko 1.5.12 (under docker in order not to mess my environment, ubuntu 14.04).

root@128f25c5a980:/# python
Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import paramiko
>>> print paramiko__version__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'paramiko__version__' is not defined
>>> print paramiko.__version__
1.15.2
>>>
root@128f25c5a980:/# python /repo/select_support_bug.py 
DEBUG:paramiko.transport:starting thread (client mode): 0x73b33050L
INFO:paramiko.transport:Connected (version 2.0, client OpenSSH_6.6.1p1)
DEBUG:paramiko.transport:kex algos:[u'curve25519-sha256@libssh.org', u'ecdh-sha2-nistp256', u'ecdh-sha2-nistp384', u'ecdh-sha2-nistp521', u'diffie-hellman-group-exchange-sha256', u'diffie-hellman-group-exchange-sha1', u'diffie-hellman-group14-sha1', u'diffie-hellman-group1-sha1'] server key:[u'ssh-rsa', u'ssh-dss', u'ecdsa-sha2-nistp256', u'ssh-ed25519'] client encrypt:[u'aes128-ctr', u'aes192-ctr', u'aes256-ctr', u'arcfour256', u'arcfour128', u'aes128-gcm@openssh.com', u'aes256-gcm@openssh.com', u'chacha20-poly1305@openssh.com', u'aes128-cbc', u'3des-cbc', u'blowfish-cbc', u'cast128-cbc', u'aes192-cbc', u'aes256-cbc', u'arcfour', u'rijndael-cbc@lysator.liu.se'] server encrypt:[u'aes128-ctr', u'aes192-ctr', u'aes256-ctr', u'arcfour256', u'arcfour128', u'aes128-gcm@openssh.com', u'aes256-gcm@openssh.com', u'chacha20-poly1305@openssh.com', u'aes128-cbc', u'3des-cbc', u'blowfish-cbc', u'cast128-cbc', u'aes192-cbc', u'aes256-cbc', u'arcfour', u'rijndael-cbc@lysator.liu.se'] client mac:[u'hmac-md5-etm@openssh.com', u'hmac-sha1-etm@openssh.com', u'umac-64-etm@openssh.com', u'umac-128-etm@openssh.com', u'hmac-sha2-256-etm@openssh.com', u'hmac-sha2-512-etm@openssh.com', u'hmac-ripemd160-etm@openssh.com', u'hmac-sha1-96-etm@openssh.com', u'hmac-md5-96-etm@openssh.com', u'hmac-md5', u'hmac-sha1', u'umac-64@openssh.com', u'umac-128@openssh.com', u'hmac-sha2-256', u'hmac-sha2-512', u'hmac-ripemd160', u'hmac-ripemd160@openssh.com', u'hmac-sha1-96', u'hmac-md5-96'] server mac:[u'hmac-md5-etm@openssh.com', u'hmac-sha1-etm@openssh.com', u'umac-64-etm@openssh.com', u'umac-128-etm@openssh.com', u'hmac-sha2-256-etm@openssh.com', u'hmac-sha2-512-etm@openssh.com', u'hmac-ripemd160-etm@openssh.com', u'hmac-sha1-96-etm@openssh.com', u'hmac-md5-96-etm@openssh.com', u'hmac-md5', u'hmac-sha1', u'umac-64@openssh.com', u'umac-128@openssh.com', u'hmac-sha2-256', u'hmac-sha2-512', u'hmac-ripemd160', u'hmac-ripemd160@openssh.com', u'hmac-sha1-96', u'hmac-md5-96'] client compress:[u'none', u'zlib@openssh.com'] server compress:[u'none', u'zlib@openssh.com'] client lang:[u''] server lang:[u''] kex follows?False
DEBUG:paramiko.transport:Ciphers agreed: local=aes128-ctr, remote=aes128-ctr
DEBUG:paramiko.transport:using kex diffie-hellman-group14-sha1; server key type ssh-rsa; cipher: local aes128-ctr, remote aes128-ctr; mac: local hmac-sha1, remote hmac-sha1; compression: local none, remote none
DEBUG:paramiko.transport:Switch to new keys ...
DEBUG:paramiko.transport:userauth is OK
INFO:paramiko.transport:Authentication (publickey) successful!
DEBUG:paramiko.transport:[chan 0] Max packet in: 32768 bytes
DEBUG:paramiko.transport:[chan 0] Max packet out: 32768 bytes
DEBUG:paramiko.transport:Secsh channel 0 opened.
INFO:root:showBug() #1, launching /bin/true
DEBUG:paramiko.transport:[chan 0] Sesch channel 0 request ok
INFO:root:showBug() #2, launched /bin/true
DEBUG:paramiko.transport:[chan 0] EOF received (0)
DEBUG:paramiko.transport:[chan 0] EOF sent (0)
INFO:root:showBug() #3, channel closed: True
INFO:root:showBug() #4
INFO:root:communicateChannel(), channel closed: True
^CTraceback (most recent call last):
  File "/repo/select_support_bug.py", line 63, in <module>
    showBug()
  File "/repo/select_support_bug.py", line 57, in showBug
    stdout, stderr = communicateChannel(channel)
  File "/repo/select_support_bug.py", line 19, in communicateChannel
    ready, _, _ = select.select([channel], [], [])
KeyboardInterrupt
DEBUG:paramiko.transport:EOF in transport thread
@rusek rusek added a commit to rusek/paramiko that referenced this issue Feb 11, 2016
@rusek rusek Issue #537 reproduction test and fix 306b24d
@rusek
Contributor
rusek commented Feb 12, 2016

I prepared a simple reproduction test for this issue, along with a proposed fix.

@bitprophet
Member
bitprophet commented Jun 9, 2016 edited

With the code in this ticket, I cannot recreate the hang on OS X 10.10; but I can recreate it on Debian 8.

@rusek's commit's test, executed without his fix in place, fails on both platforms. And his fix makes the test pass on both platforms, as well as causing the OP's example to pass on Debian. So that's good.

Re: why the fix works, threading is not my forte but I think staring at things long enough made sense...would appreciate a sanity check of the below before I merge:

  • How the hang happens, at a basic level:
    • BufferedPipe's self._event.set/clear bubbles all the way down to an OS pipe, via paramiko.pipe.OrPipe and then the platform-appropriate pipe wrapper class.
    • That pipe wrapper class toggles the real pipe's ready/not-ready status, which is what select.select is watching.
    • When the pipe is "set", select.select returns; when the pipe is "clear", it blocks/waits/etc, depending on the local implementation.
    • Thus, if BufferedPipe enters a situation where its event has .clear() called, and .set() is never called after, select will wait forever.
  • The lifecycle of BufferedPipe._event:
    • History starts, Channel created, is handed to select, select calls Channel.fileno for the first time
    • fileno notices _pipe is not set, creates the pipe-wrapping setup noted previously, calls BufferedPipe.set_event
    • set_event assigns _event, and may either set or clear the event depending on whether data is already being buffered.
    • Things happen, buffers get written to/read from, _event gets set and cleared by the rest of BufferedPipe. Everything's great.
    • Channel.close happens; this deletes the memoized Channel-level pipe wrapper, and eventually calls BufferedPipe.close, which does _event.set and also _closed = True.
      • Additionally, if the data in the pipes is fully read and no more is written, the pipe will enter the "always set" state for the rest of the session, regardless of closure.
    • Normally (or at least, as was apparently intended), from this point on, the real OS pipe that select is polling, is "set" and thus does not block. Everything cleans up and we're good.
      • Even in the current state of things, we're also happy if the rest of the system cleans up & terminates before select gets around to calling Channel.fileno again.
  • The error case & why it causes the "pipe never gets set()" situation:
    • Channel.fileno is called again by select before things fully shut down
    • It creates a new OS-level pipe, pipe wrapper, OrPipes, etc (because Channel._pipe is unset again) and hands them to the BufferedPipes
    • Unlike the above lifecycle, these pipes will never be written to (because everything that would write to or read from them has already been shut down).
    • Thus, because BufferedPipe.set_event ends up with the pipe in a clear() state (due to the buffer not being > 0, and the check for "...are we closed?" not existing despite the docstring), we enter the deadlock - set will never be called and select will wait forever.
  • The fix then solves this by adding the if self._closed check, causing this "terminal" set_event to result in a set state for the pipe instead of clear.
    • It's not 100% clear why the locking was also added; but I haven't dissected the rest of the class' lifecycle yet so I might be missing something.
@bitprophet bitprophet added this to the 1.16.2 / 1.17.1 / 2.0.1 milestone Jun 9, 2016
@rusek
Contributor
rusek commented Jun 9, 2016

The error case is actually quite simple, and does not involve recreating OS pipes. It is enough that client receives either MSG_CHANNEL_EOF or MSG_CHANNEL_CLOSE from remote site before calling select for the first time (where pipe is created).

The scenario from the ticket works roughly as follows:

  1. client main thread: Transport.open_session()
  2. client main thread: Channel.exec_command('/bin/true')
  3. client main thread: time.sleep(1) starts
  4. server: accept new channel, execute /bin/true command, send back MSG_CHANNEL_EOF (optionally followed by MSG_CHANNEL_CLOSE) once /bin/true ends
  5. client transport thread: receive MSG_CHANNEL_EOF, pass it to Channel._handle_eof(), where Channel.in_buffer and Channel.in_stderr_buffer are closed by invoking BufferedPipe.close(); Channel._pipe is None at this time
  6. client main thread: time.sleep(1) ends
  7. client main thread: call select() with a Channel object, which in turn invokes Channel.fileno(); because Channel._pipe is None, a new pipe is created, wrapped in OrPipes, and passed to BufferedPipe.set_event() for both Channel.in_buffer and Channel.in_stderr_buffer
  8. client main thread: select() hangs, because .set_forever() is never called on Channel._pipe (normally, .set_forever() would be called by Channel._handle_eof, but this method has been executed when pipe was not created yet)

After giving it some thought, I now see that Channel.fileno() should also call self._pipe.set_forever() right after creating the pipe if either Channel.eof_received or Channel.closed is set to True. But the use of .set_forever() seems actually redundant: when EOF is received / channel is closed, stdout/stderr BufferedPipes are closed, and this alone guarantees that event pipe will always be set.

As for the locking added to BufferedPipe.set_event(), I think it may not be strictly necessary, but it's easier to understand how synchronization works in BufferedPipe if BufferedPipe._buffer and BufferedPipe._close fields are only accessed with BufferedPipe._lock held. BufferedPipe.set_event() was the only exception to this rule.

@bitprophet
Member

Thanks! That makes sense; when I reread my notes, it does jive - the creation of the new OS level pipe is incidental, it's having any pipe - original or new - enter the "always clear" state that triggers the issue. (And yea, it also makes sense that we can enter this state after any closure of the BufferedPipe objects, since once closed and then set_event, nothing else will ever trigger set().)

You're probably right about set_forever, both re: it being arguably most-correct but also redundant. I'll probably just slap a comment in there (was planning to anyways, to save me or others a bit of time in the future).

👍 on the locking, I suspected it was more of a cleanup-while-there change, but I'm fine with that & agree that consistency is a virtue here.

@bitprophet
Member

FTR, I cherry-picked @rusek's change back to the 1.16 branch and above, and will close this once I push.

@bitprophet bitprophet added a commit that referenced this issue Jun 10, 2016
@bitprophet bitprophet Changelog re #537 e0a4588
@bitprophet bitprophet closed this Jun 10, 2016
@bitprophet bitprophet added the Hangs label Jun 11, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment