Fix issue 34 (SFTPFile prefetch assumes response order matches requests) #35

Merged

merged 1 commit into paramiko:master on Feb 11, 2014


@ewxrjk
Contributor
ewxrjk commented Aug 29, 2011

SFTPFile._async_response gets a new 'num' parameter giving the request
number. This can be matched up with the return value of
SFTPClient._async_request to retrieve data specific to that request.

The prefetch queue SFTPFile._prefetch_reads is replaced with the dict
_prefetch_extents, which maps request numbers to (offset, length)
tuples.

A lock is used to exclude the case where a response arrives in
_async_response before _prefetch_thread has updated _prefetch_extents.

[Maintainer note: See #34 for initial bug report.]
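
Roughly, the change has this shape (a simplified sketch for orientation, not the actual patch; apart from _prefetch_extents, _async_response, and _async_request, the names, the surrounding class, and the _async_request call signature are made up):

import threading

class PrefetchSketch(object):
    # Illustrates matching read responses to requests by request number
    # instead of assuming responses come back in the order requested.

    def __init__(self, client):
        self.client = client
        self._prefetch_extents = {}      # request number -> (offset, length)
        self._prefetch_data = {}         # offset -> data received
        self._prefetch_lock = threading.Lock()

    def _issue_prefetch(self, offset, length):
        # Hold the lock across issuing the request and recording its extent,
        # so a fast response cannot be handled before the dict entry exists.
        with self._prefetch_lock:
            num = self.client._async_request(offset, length)
            self._prefetch_extents[num] = (offset, length)

    def _async_response(self, num, data):
        # The new 'num' parameter identifies exactly which request this
        # response answers, regardless of arrival order.
        with self._prefetch_lock:
            offset, length = self._prefetch_extents.pop(num)
        self._prefetch_data[offset] = data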

@ewxrjk ewxrjk Fix issue 34 (SFTPFile prefetch assumes response order matches requests) 974294a
@lndbrg
Contributor
lndbrg commented Jan 21, 2014

This looks sane. +1

@bitprophet
Member

Thanks! Gonna merge this & make those couple of line-note tweaks myself soon [EDIT: done in my branch]; I want to wrap up some changelog-related work first because ugh, merging NEWS is super annoying.

@bitprophet
Member

This actually fails the test suite! Ohnoes.

======================================================================
ERROR: test_7_prefetch_readv (test_sftp_big.BigSFTPTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/test_sftp_big.py", line 303, in test_7_prefetch_readv
    for data in f.readv(chunks):
  File "/Users/jforcier/Code/oss/paramiko/paramiko/sftp_file.py", line 422, in readv
    if self._data_in_prefetch_buffers(offset) or self._data_in_prefetch_requests(offset, size):
  File "/Users/jforcier/Code/oss/paramiko/paramiko/sftp_file.py", line 95, in _data_in_prefetch_requests
    k = [self._prefetch_extents[i] for i in self._prefetch_extents if self._prefetch_extents[i][0] <= offset]
RuntimeError: dictionary changed size during iteration

Gonna poke.
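
For reference, the failure class is easy to reproduce in isolation (standalone snippet, not paramiko code): any live dict iterator notices a size change, no matter where the change comes from.

d = {i: i for i in range(5)}
it = iter(d)        # live iterator over the dict's keys
next(it)
d[99] = 99          # size change while the iterator is outstanding
next(it)            # raises RuntimeError: dictionary changed size during iteration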

@bitprophet
Member

Feels like race condition due to threading?

  • Made some trivialish changes, ran the test suite for the first time, got the above result.
  • Made another trivial change, this time to the line that was breaking (tightening up the list comprehension without changing its actual meaning/behavior), ran the test suite again, passed.
  • Reverted that change; still passed. Oh dear.
  • Reverted to original merge of @ewxrjk's branch; tests still pass.
  • Went back to my latest HEAD (with the trivial changes); still passing.

Now running in a loop to try and detect this, using the below in tasks.py:

# tasks.py: run the SFTP test file in a loop, printing a dot per run and
# the run count whenever a failure turns up.
import sys

from invoke import task

@task
def ugh(ctx):
    runs = 0
    while True:
        runs += 1
        sys.stderr.write('.')
        sys.stderr.flush()
        # warn=True so a failed test run is returned instead of raised
        result = ctx.run("python test.py", hide=True, warn=True)
        if result.failed:
            sys.stderr.write(str(runs))
            sys.stderr.flush()
            runs = 0
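
(Assuming the invoke task runner, this runs as invoke ugh; the warn=True flag is what keeps a failing python test.py from raising and aborting the loop.)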

Even with all test suites besides the failing one commented out, the test run still takes ~18s (this is why Robey put in a flag for skipping that suite, apparently. Hah!), so this may take a while to pin down.


If it really is due to one of the two mutations running in a thread "behind" the list comprehension, then we'll have to figure out whether we can avoid the core problem while retaining the same fix behavior. (It's probably obvious, but I'm at EOD and kinda frazzled, and also haven't really looked at it hard yet.)

@bitprophet
Member

OK:

  • The code fired off by regular use & the test always ends up triggering self._start_prefetch, which kicks off a thread running self._prefetch_thread; that's where the positive mutation takes effect (adding entries to the dict).
  • That spot and the other (deletion) spot are both protected with the self._prefetch_lock block.
  • However, the problem spot - the iteration over self._prefetch_extents - is not using a thread lock, which is likely the issue.
  • The question is then whether it's feasible to wrap that listcomp in another lock-obtaining block, or if that would somehow defeat the purpose of prefetching or otherwise get us in a sticky situation.
  • Comments in self._prefetch_thread imply that the point of threading is to shove blocking network calls into the background so they don't hang the rest of the execution. Feels like then requiring a handle on that lock in the "foreground"/main thread could indeed expose us to blocking & thus slowness.
  • I don't see an obvious other way to perform this sort of prefetch request/response order tracking that wouldn't have the same problem (foreground thread needing access to "how much prefetched data do I have", background thread needing ability to update that data structure in a potentially blocking manner.)
  • Unless we can forcibly snapshot a copy of the data struct w/o actually iterating over the "live" copy? (A possible shape for that is sketched just after this list.)
  • Additionally - I've been unable to recreate this over ~100 runs of the test suite, so either the change to using .values() is effectively doing what I just said, or the race condition is so infrequent it may need to be something we release & gather info on how many people it affects, before we spend more time poking it.
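
One possible shape for that snapshot idea (a sketch only; it reuses the _prefetch_extents / _prefetch_lock names from the patch, but the coverage check itself is made up):

import threading

_prefetch_lock = threading.Lock()
_prefetch_extents = {}   # request number -> (offset, length); mutated by the prefetch thread

def data_in_prefetch_requests(offset, size):
    # Hold the lock only long enough to copy the extents, then iterate the
    # detached copy; the prefetch thread can keep adding/removing entries
    # without tripping "dictionary changed size during iteration".
    with _prefetch_lock:
        extents = list(_prefetch_extents.values())
    # Hypothetical coverage check: does any outstanding request start at or
    # before this offset and reach past it?
    return any(o <= offset < o + l for (o, l) in extents)

The alternative is simply taking the lock around the whole list comprehension, at the cost of the foreground thread occasionally waiting on the prefetch thread.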
@bitprophet
Member

Preliminary results from running the script again w/ the "cleaner" listcomp reverted: it seems to be turning up the race condition more reliably - ~20 runs, then 8 runs, and that was just before I fixed it up some to not actually bail out. Now it should report each failure & keep trucking/recounting.

This implies that there's a difference between for x in dict and for x in dict.values(), which seems wrong to me (need to verify their behaviors), but it's too early to tell if I was just unlucky in reproducing the problem earlier.

EDIT: ooh, right. values() != itervalues() - values() returns a list, not an iterator; however, for x in dict implicitly applies the iteration protocol to the dict, which does yield a live iterator. So that feels like it's almost definitely the issue at hand.
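
Concretely, under Python 2, where values() returns a plain list rather than a view, the two spellings behave like this (standalone snippet, not paramiko code):

d = {1: 'a', 2: 'b'}

# values() copies the values into a plain list - a detached snapshot - so
# mutating the dict afterwards does not disturb iteration over it.
snapshot = d.values()
d[3] = 'c'
for v in snapshot:
    print(v)            # both values print; no RuntimeError

# By contrast, "for x in d" iterates the live dict, and a size change made
# mid-loop (e.g. from the prefetch thread) raises RuntimeError.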

@bitprophet
Member

The script running against the iterator-using, non-cleaned-up version of the listcomp is definitely yielding the race condition; here, a dot is printed for every run, and when an error occurs, the number of runs since the last error is printed:

...............15....4...................19..2...............................31...
.......10.....5.^C

Going to run overnight against the "cleaned up", values()-using version of the code and see if it ever seems to have the problem.

Will be hilarious if one of my normal style nitpicks (and one that is "wrong" too, since normally one does want itervalues/iteritems/etc) was actually a fix to this problem.

@bitprophet
Member

Ran overnight, hundreds of times, no error. Calling that done & merging.

@bitprophet bitprophet added a commit that referenced this pull request Feb 11, 2014
@bitprophet bitprophet Changelog re #35 0a2887a
@bitprophet bitprophet merged commit 974294a into paramiko:master Feb 11, 2014