SFTP get stalls #577

Closed
stephencpope opened this Issue Aug 24, 2015 · 2 comments

Projects

None yet

2 participants

@stephencpope
Contributor

I believe I have identified and fixed the problem of SFTP gets hanging (probably the cause of at least issues #560 and #515 ). I was experiencing this problem when retrieving large files (15G +) over long haul WAN links (low bandwidth and large latency) most reliably.

I diagnosed this using python-2.7.9 on CentOS 5.8 using paramiko 1.15.2 although the fix should more or less apply for any recent version; until recently I had been using paramiko 1.9.0 with python-2.5.1 without encountering this error.

Here is a simple program to demonstrate the problem:

#!/usr/bin/env python

from __future__ import with_statement

import sys
import os
import time
import threading
import paramiko

lock = threading.Lock()
progress = None

def callback(sofar, total):
    global lock, progress
    with lock:
        progress = (sofar, total)

if len(sys.argv) != 5:
    print >>sys.stderr, 'usage: %s host username remote local' % sys.argv[0]
    sys.exit(1)

host = sys.argv[1]
username = sys.argv[2]
remote = sys.argv[3]
local = sys.argv[4]

transport = paramiko.Transport((host,22))
path = os.path.join(os.environ['HOME'], '.ssh', 'id_dsa')
key = paramiko.DSSKey.from_private_key_file(path)
transport.connect(username=username,pkey=key)
sftp = paramiko.SFTPClient.from_transport(transport)

last = 0

t = threading.Thread(target=sftp.get, args=(remote,local, callback))
t.setDaemon(True)
t.start()

while True:
    time.sleep(5)
    with lock:
        if progress is not None:
            if progress[0] == progress[1]:
                break
            if progress[0] == last:
                print >>sys.stderr, 'stalled %d of %d' % progress
                sys.exit(1)
            print '%d of %d' % progress
            last = progress[0]

With 1.15 HEAD an example run (you'll need to use your own appropriate remote host, user, and file as mine are all internal to a corporate WAN) looks like this:

% ./test-sftp-stall remote-host remote-user remote-file local-file 
20348928 of 15472925029
23199744 of 15472925029
23855104 of 15472925029
stalled 23855104 of 15472925029

I will submit a pull request for the fix shortly, but the synopsis is thus: when the file to get is large (i.e. very many async prefetch requests) and the network slow (slow to receive async results), it is very likely that the window will become full causing the prefetch thread to block on a send waiting for the window to open up. As written, the SFTPFile._prefetch_lock is held by the prefetch thread while attempting to write. But this then blocks the reader thread (typically some client application thread, e.g. main) from reading result packets, such that the window can never open up. A violation of the principal that one should always attempt to read before attempting to write.

The proposed fix avoids holding the SFTPFile._prefetch lock (or any other lock) in SFTPFile._prefetch_thread() while attempting to send the prefetch request; it requires some extra spinning when receiving a response to ensure that the SFTP._prefetch_extents has been updated with the request to avoid a possible race condition. Additionally, SFTPClient._lock must similarly not be held in SFTPClient._async_request() when sending a request (there is no requirement that the send operation itself be part of the protected region). Finally, SFTPClient._read_response() was incorrectly accessing and modifying SFTPClient._expecting without holding SFTPClient._lock; while this in itself was not contributing to the problem, it is clearly incorrect in terms of thread safety.

With the proposed fix, I am able to reliably get very large files over the long, slow WAN without hanging.

@stephencpope
Contributor

my apologies - being a github nube I didn't realize that the pull request would create a whole new issue. I thought I had to create a new issue first. Oh well ...

@bitprophet
Member

Yea, Github still needs some feature love there IMHO, don't worry about it :) thanks!

@bitprophet bitprophet closed this Sep 2, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment