Race condition(s) around handshake timeout functionality #612

Closed
bitprophet opened this Issue Nov 6, 2015 · 40 comments

@bitprophet
Member
bitprophet commented Nov 6, 2015 edited

Edited to add: this ticket began life under the assumption that the issues were Travis-specific, but it seems more likely that Travis and/or the test suite are just exacerbating underlying, real problems - specifically, a race condition exposed by test_L_handshake_timeout.

The other issue, centering on test_3_multiple_key_files, seems unrelated & received at least some workarounds/fixes mid-ticket, and should be considered closed for now.

Original description follows.


These were most often seen under Python 3.2, which has been nixed, but they pop up on other interpreters as well (for example https://travis-ci.org/paramiko/paramiko/builds/89238099 hit them 3 times in one build!) and it seems to be getting worse.

The problems appear to be most easily replicated under Python 3, but we've had at least a few confirmed reports of them occurring on Python 2 as well (though, per the comments below, I've been unable to reproduce them locally - only on Travis).

The specific examples that appear to occur are:

@bitprophet
Member

Travis' docs mention slapping travis_wait in front of a command to stretch the 10-minute no-output timeout to more like 20 minutes. Unlikely to be a permanent solution, but maybe worth testing to see if the issue is "slow" vs "literally hung".


On a Debian VM, tests take a while to run (inv test 267.63s user 687.69s system 95% cpu 16:37.56 total !!!) (EDIT: except I'm a horrible idiot who left some debug-related things enabled when I ran this early test; see later comments for more realistic figures) but they still wrote output often, so I don't think they went >=10min w/o any output. Not like I watched it for 16 minutes straight though.

This might be the time to port the kinda-awful test harness to my spec nose plugin, which recently grew per-test timing info that could really help here.


Also noticed that there is a difference between my setup and Travis - I don't habitually run --coverage and coverage could definitely be affecting things. Maybe worth disabling for now to see if it makes a clear difference.

@bitprophet
Member

Another option is for me to test against an Ubuntu 12.04 (Precise) VM, since http://docs.travis-ci.com/user/ci-environment/ notes that's what they currently use for both the standard and container-based runs (we should be containerized now since our .travis.yml says sudo: false). Just in case there's some random dumb difference between "Linux in general" (as exemplified by my Debian 8 VM) and Travis' env.

That's assuming their docs are up to date, of course, but the same doc notes their experimental Ubuntu 14.04 based setup, so it's probably relatively current.

@bitprophet
Member

Ran a few more cycles of tests on both my workstation proper & the Debian VM, because single data points are bad science. Started with Python 2 as a baseline, will do Python 3 after.

Because computers are terrible and I'm slow, the Debian VM tests started out on a non-master branch with packet dumps enabled and only 512 MB of RAM; once I upped it to 2 GB and stock master, things got a lot better - in fact faster than OS X.

OS X (Yosemite) w/ all the awesome power of my dual-core 11" MBA 1.7GHz and 8 GB RAM at its disposal:

  • inv test 25.08s user 5.17s system 61% cpu 49.371 total
  • inv test 25.86s user 5.10s system 63% cpu 49.085 total
  • inv test 26.51s user 5.11s system 61% cpu 51.038 total

Debian 8, virtualized, with access to 2 GB RAM (seems unhappy allocating more than that, which is bizarre but I've no time to dig) & "2 cores" (hyperthreading, so VMware thinks I've 4 total):

  • inv test 12.03s user 8.72s system 51% cpu 40.135 total
  • inv test 10.97s user 7.58s system 50% cpu 36.499 total
  • inv test 10.46s user 8.72s system 51% cpu 36.922 total

So, preliminarily, the issue definitely isn't "Linux is slow" - it's actually faster with less RAM than my native system. However, since it took VERY long with much less RAM (512 MB), it's possible that something is RAM-hungry, and if Travis is having RAM resource issues (their page only says "up to 4 GB" for the instance type we should be using) that might be a factor.

@bitprophet
Member

Yea, looks like the issue was the packet dumps (note to self: leverage snapshots even for once-in-a-blue-moon OSS test VMs...), not the RAM. 512 MB and still getting these figures:

  • inv test 10.15s user 9.58s system 51% cpu 38.134 total
  • inv test 12.40s user 8.14s system 50% cpu 40.469 total
  • inv test 10.73s user 8.69s system 49% cpu 39.111 total

Starting here and working back up...Python 3 at 512 MB:

  • inv test 5.87s user 9.35s system 63% cpu 23.904 total
  • inv test 6.31s user 9.30s system 64% cpu 24.122 total
  • inv test 6.49s user 9.17s system 63% cpu 24.699 total

OK...clearly the issue is not "Linux" or even, probably, "Linux + Python 3" - Python 3's nearly 2x as fast as Python 2!

I am now running the Python 3 version of the test in a loop for 100 iterations just for sanity's sake but doubt I'll see anything.

So next is to try with --coverage, then variations of this on Ubuntu 12.04.

@bitprophet
Member

Ah, got ahead of myself. On the 12th iteration of the loop, the test run got stuck for hours. Will need to try this with Python 2 probably, but...yea.

EDIT: did it again (w/ Python 3) for science reasons, got stuck on 9th iteration this time.

Ran the same loop under Python 2, it got all the way to 100 iterations no problem. Good argument for it being Python 3 specific.

EDIT 2: Python 3 again and stuck at iteration 19. To confirm, this is without using --coverage. I wasn't tracking process output earlier but I am now; it got stuck at test_3_multiple_key_files, surprise! Time to take a closer look at that test.

@edk0
edk0 commented Nov 6, 2015

I believe this is due to the garbage collector choosing to fire while (a) an SSHClient is unreachable, and (b) a log message is being emitted.

The garbage collection pass runs "on top of" the thread writing the log message (which retains ownership of the logging lock), and promptly deletes the SSHClient, triggering its Transport's close via the ResourceManager (I've deleted some irrelevant frames):

  File "/usr/local/lib/python3.5/threading.py", line 1067, in join
    self._wait_for_tstate_lock(timeout=max(timeout, 0))
  File "/home/edk/src/paramiko/paramiko/transport.py", line 1536, in stop_thread
    self.join(10)
  File "/home/edk/src/paramiko/paramiko/transport.py", line 644, in close
    self.stop_thread()
  File "/home/edk/src/paramiko/paramiko/resource.py", line 60, in callback
    resource.close()
  Garbage-collecting
  <built-in method flush of _io.BufferedWriter object at remote 0x7f2daeacb888>
  File "/usr/local/lib/python3.5/logging/__init__.py", line 964, in flush
    self.stream.flush()
  File "/usr/local/lib/python3.5/logging/__init__.py", line 984, in emit
    self.flush()
  File "/usr/local/lib/python3.5/logging/__init__.py", line 855, in handle
    self.emit(record)
…

However the transport in question wants to emit a log message before it closes:

  <built-in method acquire of _thread.RLock object at remote 0x7f2db3a23660>
  File "/usr/local/lib/python3.5/logging/__init__.py", line 804, in acquire
    self.lock.acquire()
  File "/usr/local/lib/python3.5/logging/__init__.py", line 853, in handle
    self.acquire()
  File "/usr/local/lib/python3.5/logging/__init__.py", line 1487, in callHandlers
    hdlr.handle(record)
  File "/usr/local/lib/python3.5/logging/__init__.py", line 1425, in handle
    self.callHandlers(record)
  File "/usr/local/lib/python3.5/logging/__init__.py", line 1415, in _log
    self.handle(record)
  File "/usr/local/lib/python3.5/logging/__init__.py", line 1345, in log
    self._log(level, msg, args, **kwargs)
  File "/home/edk/src/paramiko/paramiko/transport.py", line 1545, in _log
    self.logger.log(level, msg, *args)
  File "/home/edk/src/paramiko/paramiko/transport.py", line 1774, in run
    self._log(DEBUG, 'EOF in transport thread')
  File "/usr/local/lib/python3.5/threading.py", line 923, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.5/threading.py", line 891, in _bootstrap
    self._bootstrap_inner()

This is a deadlock. The latter thread (the transport thread) is waiting for an I/O lock in logging, which is owned by the former thread (the one garbage-collecting mid-emit), which in turn is waiting for the transport thread to finish.
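
For illustration, here's a minimal standalone sketch of that lock dependency (simplified, and not paramiko's actual code): one thread holds the logging handler's lock, as the garbage-collecting thread does mid-emit above, and then joins a second thread that is itself blocked trying to log through that same handler.

    import logging
    import threading
    import time

    logging.basicConfig(level=logging.DEBUG)
    handler = logging.getLogger().handlers[0]

    def transport_thread():
        time.sleep(0.1)
        # Stands in for Transport.run() logging 'EOF in transport thread'.
        logging.getLogger("demo").debug("EOF in transport thread")

    worker = threading.Thread(target=transport_thread)
    worker.start()

    handler.acquire()              # this thread now owns the logging I/O lock
    try:
        worker.join(2)             # stands in for stop_thread()'s self.join(10)
        print("worker still alive after join timeout:", worker.is_alive())
    finally:
        handler.release()          # only now can the worker log and exit
    worker.join()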

I'm inclined to think that the only way to really avoid bugs like this is to require SSHClients to be explicitly closed; shutting one down potentially has a lot of effects, and hunting down every possible lock they might acquire seems impossibly difficult.

I think at least one other, apparently rarer, hang might be related—still investigating.

@bitprophet
Member

Yup, a deadlock would explain the other issue too, since it's simply a differently-located hang.

Wonder why this is only happening on 3, but I'm guessing 3 changed nontrivial amounts of things from 2, including threading and gc.

Thanks a bunch for poking at this while I slept. Open source distributed teams in action!

@bitprophet
Member

Also, re: solution, I feel like we'd still run into problems if an SSHClient got gc'd (which can happen, there was a ticket just the other day in fact) and didn't auto-close? But GC and object references aren't my strong suit so this is mostly paranoia/confusion speaking.

Maybe something dumb like having the code that currently auto-closes instead log (or print to stderr, which presumably would never be prone to locks, heh) an error message to the tune of "Hey, you allowed your SSHClient to be garbage collected, that's a bad idea and the results won't be pretty, please close explicitly".
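
Something like this hypothetical sketch of that idea (invented names, and using __del__ rather than the actual ResourceManager weakref hook, purely to keep it self-contained):

    import sys

    class AutoWarnClient:
        """Hypothetical sketch of the idea above; not paramiko's actual SSHClient."""

        def __init__(self):
            self._closed = False

        def close(self):
            self._closed = True
            # Real teardown (closing the Transport, etc.) would happen here.

        def __del__(self):
            # Instead of auto-closing (which may join threads and grab locks
            # while the GC'ing thread holds the logging lock), just complain on
            # stderr - plain stderr writes bypass the logging machinery entirely.
            if not self._closed:
                sys.stderr.write(
                    "SSHClient was garbage collected without being closed; "
                    "please call close() explicitly.\n"
                )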

@bitprophet
Member

Looking at test_3_multiple_key_files now to see if I can follow your work...the main difference between it and its sibling tests seems to be that it calls self._test_connection multiple times in a loop, whereas the rest all only call it once.

Presumably the problem is that self.tc (within _test_connection) gets reassigned to a new SSHClient on each iteration of that loop, leaving up to 3 distinct orphaned clients lying around just waiting to get GC'd and trigger the bug?

Notably, the overall test suite's self.tearDown explicitly calls self.tc.close(), which is why this never pops up on the "only calls _test_connection once" tests.

tl;dr this is a crummy test that should have been using something like nose's or pytest's "emit multiple tests from a single function" features instead of a raw loop. (So, more fuel for my desire to rework the test suite into something using nose/spec...)

@edk0
edk0 commented Nov 6, 2015

Presumably the problem is that self.tc (within _test_connection) gets reassigned to a new SSHClient on each iteration of that loop, leaving up to 3 distinct orphaned clients lying around just waiting to get GC'd and trigger the bug?

Yes, I think so.

@bitprophet
Member

[11:41:21] Would be nice to solve the underlying problem but I'm not even sure there is a foolproof method, and either way, if I can at least cut down on test failures that lets me proceed to more critical things :x
[11:46:40] the closest i can think of to a foolproof method is to remove the autoclose and instead have Transports emit a warning if they are finalized before being closed
[11:51:38] I think that's roughly what I was proposing in one of my ticket comments, yea
[11:51:44] yeah
[11:52:30] I'm on the fence, it's tempting and would probably present a lot of the "buh why my shit hang" tickets, otoh in single-threaded or well-managed-threading situations, auto-close is pretty darn useful and a lot of ppl would have to update their code to add explicit closing
[11:52:35] s/present/prevent/
[11:53:32] Even in 2.0 where such a change would be technically acceptable, I'd still feel like it was introducing a lot of burden on intermediate+ users for the sake of removing some confusion for newer ones
[11:54:00] I wonder if instead we could make the "trying to log" part of the deadlock smarter?
[11:55:06] a timeout that then causes a non-logging print/exception, or (depending on what it's logging, exactly) removing the log line, or maaaybe trying to log without caring about the lock (tho that's pretty meh, I'd rather just pollute stderr instead)
[11:55:30] I should capture most of this into the ticket :3
[11:58:13] i think it's probably slightly easier to deal with manually closing things than having to keep track of which things have living references to them; people (including me) kind of expect gc to happen behind their backs
[11:58:51] hmm, i'm not sure if that's what i mean. like they expect it to just sort of silently work
[11:59:20] well right, which is probably why the "close self on gc" code was added however-long ago
[11:59:25] yeah
[11:59:45] the problem is that IO things and gc never seem to just work when mixed :(
[11:59:52] I also presume that having multiple concurrently active SSHClient instances is an unusual case
[12:00:01] Most code I have seen in the wild has one per process
[12:00:19] (And where that isn't true, they're still typically well-managed in terms of scope and references)
[12:01:14] Would be nice to scan all our tickets to see how many issues could reasonably be chalked up to this
[12:01:25] i wrote something with potentially multiple SSHClients; solved the reference problem by doing something like transport.client = client
[12:01:38] not sure I have the time though, as above, I'd rather at least remove this particular instance in the suite and move on, while making sure this is all recorded
[12:01:53] that seems fair enough
[12:01:54] Yea. I think that other recent ticket had the same thing mentioned (maybe that was you replying though heh)
[12:02:17] BTW, thanks again for poking at this, I probably wouldn't have gotten to the bottom of it as fast myself :)
[12:02:32] worth noting, though, this could happen to any program that emits log messages and has an SSHClient die at some point
[12:04:02] and no problem, it was interesting

@bitprophet
Member

While confirming I could still reproduce the issue, I got this different output, though I didn't check the output file prior to Ctrl-C, so it's unclear when exactly it was logged (but probably at kill time):

Ran 153 tests in 20.705s

OK
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/jforcier/tmp/paramiko/paramiko/transport.py", line 79, in _join_lingering_threads
    thr.stop_thread()
  File "/home/jforcier/tmp/paramiko/paramiko/transport.py", line 1536, in stop_thread
    self.join(10)
  File "/usr/lib/python3.4/threading.py", line 1064, in join
    self._wait_for_tstate_lock(timeout=max(timeout, 0))
  File "/usr/lib/python3.4/threading.py", line 1076, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt

Possibly this is related to the occasional Travis runs where the tests complete (fail or pass) but things hang anyway.

@bitprophet
Member

Got to 100 iterations on Python 3 with some light application of explicit setup/teardown. (Simpler than tearing the test apart, and it lets me defer the spec port until later.)

Gonna commit that and see if the other problem cases also keep popping up; I think this was the majority case though.
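
Roughly the shape of that workaround, as a hypothetical self-contained sketch (invented names and a fake client class, not the actual committed test code):

    import unittest

    class _FakeClient:
        """Stand-in for paramiko.SSHClient - just enough to show the shape of the fix."""
        def close(self):
            pass

    class MultiConnectionSketch(unittest.TestCase):
        def setUp(self):
            self.tc = None

        def tearDown(self):
            # Always close whatever client is left, rather than leaving it to GC.
            if self.tc is not None:
                self.tc.close()

        def _test_connection(self, **kwargs):
            # The real helper rebinds self.tc to a fresh SSHClient each call;
            # closing the previous one first means no orphaned client is ever
            # left waiting for the garbage collector.
            if self.tc is not None:
                self.tc.close()
            self.tc = _FakeClient()

        def test_multiple_key_files(self):
            for _ in range(3):  # mirrors looping over the different key files
                self._test_connection()

    if __name__ == "__main__":
        unittest.main()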

@bitprophet
Member

Backported to the 1.15 branch too; that's as far back as the crap test goes. (It was my own dumb fault, apparently - no blaming 10-year-old code this time!)

Naturally, one of the next Travis builds is still hitting the timeout test fail bug under 3.5: https://travis-ci.org/paramiko/paramiko/jobs/89750927#L347

In fact, they're all hitting it now, wtf. And one of them is seeing it under Python 2.7. Sweet. https://travis-ci.org/paramiko/paramiko/builds/89750929

@bitprophet
Member

What's interesting is my most recent "test everything 100 times" run didn't turn that failure up at all, meaning it may actually be more Travis-specific. Trying another run now just in case.

@bitprophet
Member

Poked at the code while the tests ran; this one is "set an obscenely short timeout, then expect an EOFError, but no EOFError ever arrives". It was added recently-ish, in d1f7285. Given that the timeout is set to something smaller than a billionth of a second, I'm assuming one of the following:

  • Math doesn't actually work that way & the effective timeout is much larger - still small, but large enough that an occasional very-fast execution is falling within it somehow (seems super unlikely, but...);
  • Some other error is occurring instead but not being raised correctly;
  • Something else funky but not error-bearing is going on with the threading.Timer implementation of the handshake timeout feature (which is mostly implemented in Packetizer and partly in Transport).
@bitprophet
Member

Assuming this current run yields no instances, gonna try on an Ubuntu 12.04 VM since that's an easy next step.

Then --coverage.

Then will instrument the test in question and then just rerun the build for 3.5 (since that's where it happens most of the time) on Travis. remote debugging!

@edk0
edk0 commented Nov 7, 2015

I can get the error fairly readily (Ubuntu 15.04, Python 3.5.0 debug build). Replacing the assertRaises with a try/except plus a pdb.set_trace() if nothing is caught shows that no exception is thrown and the transport connects properly.

Generally speaking, thread-switching intervals are much larger than 0.000000000001 seconds; I think this test is really just checking which threads your OS decides to run before the packetizer checks whether it has timed out.
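
A quick standalone demonstration of that point (not paramiko code): even a threading.Timer asked to fire after 1e-12 seconds can't beat the scheduler, so its real latency is many orders of magnitude larger than the requested delay.

    import threading
    import time

    fired_at = []
    start = time.perf_counter()

    # Request an absurdly small delay, just like the test's handshake timeout.
    timer = threading.Timer(0.000000000001, lambda: fired_at.append(time.perf_counter()))
    timer.start()
    timer.join()

    print("requested delay: 1e-12s")
    print("actual latency:  %.6fs" % (fired_at[0] - start))  # typically milliseconds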

@bitprophet
Member

After some fun sysadmin tasks, got my 12.04 environment working and I got the fail 3 times in a row, so...yea. (On Python 3.5 as installed via a PPA.)

The Python 3 on my Debian VM is 3.4, so I'm wondering if simply "Python 3.5 on Linux" is the hot trigger (esp. given @edk0 is also reproducing it...unless the commonality is "lol Ubuntu", which is always possible).

Either way, as @edk0 notes, it's likely one of those threading race conditions that can pop up anywhere and are just more likely to appear in certain environments.


Re: cause/effect/solution, I'm crap at threading and also not in the sharpest mental state right now. It sounds like the deal is: the test assumes the thread performing the timer/timeout-check is going to actually get woken up & executed (& then fail the vanishingly small timeout) prior to the test's thread performing the assertRaises. But in some environments, the test thread just keeps choochin' & the timer thread is never given the opportunity to raise its EOFError in time.

Wondering if we can explicitly tell the test thread to defer its scheduling, thus ensuring whichever thread is performing Packetizer.handshake_timed_out() (the thing that actually raises EOFError - and arguably the right way to test this may be to mock it instead of setting the really small timeout) actually gets a chance to raise its exception?

EDIT: Though my memory may be betraying me, cuz I don't see any methods on Thread in the Python 3 docs that map to this sort of thing. Maybe something gross involving a thread-specific (perhaps again injected via mocking) Lock or Event (that the test thread can wait on & the waiting-for-timeout thread can release/set)? I don't think I like that idea though.
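
For the record, the mocking variant mentioned above would look something like this sketch (it assumes paramiko is importable and that the caller raises EOFError when Packetizer.handshake_timed_out() reports True; it is not what ended up being committed):

    from unittest import mock
    from paramiko.packet import Packetizer

    # Force the timeout check to trip, so the EOFError path no longer races the
    # thread scheduler via a billionth-of-a-second timeout.
    with mock.patch.object(Packetizer, "handshake_timed_out", return_value=True):
        pass  # in the real test: tc.connect(...) wrapped in assertRaises(EOFError, ...)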

@bitprophet
Member

Also /cc @lndbrg who wrote the current version of this test, in case he has bright ideas and/or opinions. (No pressure tho :))

@edk0
edk0 commented Nov 7, 2015

I think it'd be a lot easier to make the server really slow—hack in a time.sleep(5) before _send_kex_init or so. Any computer that takes 5 seconds to schedule the timeout-ing thread probably has bigger things to worry about than unit tests.

@bitprophet
Member

Yea I guess that'd have the same effect, wouldn't it :) normally I hate the idea of adding sleeps but in this case the timeout ought to make the sleep actually only last $THREAD_SWITCH_PERIOD...

@bitprophet bitprophet added this to the 1.16.1 milestone Dec 6, 2015
@lndbrg
Contributor
lndbrg commented Dec 9, 2015

Ah, damn, I had my fears that it might introduce some deadlocks. :/ but yeah, making the server insanely slow would fix that. @edk0 @bitprophet

@techtonik

🐮 Shaving yaks right now - I can't try pyinfra on a Windows machine, because it uses paramiko, and paramiko requires pycrypto, which fails to install without Microsoft Visual C++ 9.0. So I came to check whether paramiko can use cryptography already, and it appears that nobody knows, because the Travis CI tests are failing. So..

What is the progress on this issue? =)

@bitprophet
Member

@coreywright reports he can replicate this under Python 2 in #572, FWIW.

@bitprophet
Member

@techtonik next time I get back to Paramiko I'll be testing out a fix for sure :) hopefully soon...

@coreywright

specifically, the test_L_handshake_timeout test failure with python2.7 is captured in https://travis-ci.org/paramiko/paramiko/jobs/106268057

@techtonik

It might be an inv bug - if, for example, the stdout buffer from the child test runner is discarded or not processed (or a deadlock occurs). The Travis log doesn't show timing for every line of output, so it's hard to say whether something was killed by a timeout due to a deadlock.

@coreywright

to add to the data points: i encountered this today locally the first time i went to test a re-opened PR. it was the first test run in a new repo (as i couldn't find my previous local repo), but i haven't been able to recreate it since, so maybe that has something to do with it (or maybe not).

@techtonik

Can we remove one intermediate layer - inv - from the chain of runners and see if that helps?

@warsaw
warsaw commented Apr 11, 2016

I'm seeing test_L_handshake_timeout failures fairly consistently in Ubuntu and Debian chroot builds for 1.16.0 (and earlier). It only seems to fail for Python 3.5, and sometimes it'll work by making handshake_timeout even smaller, but that's not guaranteed.

@techtonik

Hi guys. This should help - http://fitzgeraldnick.com/weblog/64/

@bitprophet bitprophet changed the title from Irritating and increasingly frequent Travis-only failures to Race condition(s) around handshake timeout functionality Apr 22, 2016
@bitprophet
Member
bitprophet commented Apr 22, 2016 edited

Thanks to everyone who poked this further while I was busy elsewhere. Updated description a bit re: current status and will be trying to solve it today (even if it ends up being a test-oriented workaround for the time being).

Going to try:

  • Sanity check to make sure I have a recipe for local reproduction
  • See whether removing Invoke makes a difference
    • If so, file a ticket over there so it's tracked
  • Earlier "make the test server process slower to ensure no race condition in the test" idea
  • The rr debugger @techtonik mentioned, if I can get it running swiftly enough
@bitprophet
Member
  • Successfully recreated the error locally (Yosemite) under Python 3.5; it's intermittent, but running it 10-20 times typically has it pop up at least once.
  • Temporarily cut out everything except the one test (incl. sftp server spinup, loading of other test modules, & logging to file), bringing execution time down from 22s to about 2-3s
  • Set up a "run 100 times, measure EOFErrors" (Z)shell construct:
    for x in `seq 100`; do inv test; done |& grep -c EOFError
  • Interestingly, I ran 3 of those concurrently and got 3, 2, and 1 EOFErrors respectively (again out of 100 cycles), a much lower rate than I was (non-scientifically) getting when running earlier loops of the full suite.
    • Whether this reflects on timing issues related to other stuff being loaded, the logging to file, etc, I'm unsure.
    • Ran it again just once, got 10 errors that time. Meh. Statistics?
  • Ran 3 more concurrent tests, but with python test.py --verbose instead of inv test: 1, 1, 2 instances of the error. So that indicates Invoke isn't a major factor.
@bitprophet
Member
bitprophet commented Apr 23, 2016 edited

Ok, well, semi-good news: with a kinda-hardcoded 5s sleep as recommended, I ran a while loop for many minutes and got up to 426 iterations with no failed tests. So yay for that! EDIT: I also experimented with dropping it down to 1s; it got up to 210 iterations before I, again, manually told it "ok, enough already". So that's probably also good enough.

Unfortunately, there are a couple of problems with this particular workaround:

  • The server Transport thread will still hang out until this time.sleep finishes, even if the main/test thread sees the EOFError nearly immediately. So that means the test always takes just over 5s (or 1s) to run.
    • Granted, we have other crummy sleeps in the test suite, so it's not the end of the world...
    • However, every time we make the suite longer we make iteration more of a pain (including automated iteration to identify intermittent issues like this one!)
    • For now, given that 1s seems to work as well as 5s, and given this particular functionality really wants the actual timing-out to be tested...probably good enough for now.
  • More importantly, there's no clean way to "inject" the sleep just for the test; the test-specific NullServer object isn't actually referenced much inside Transport. (This is an artifact of how old & non-test-driven the codebase is unfortunately.)
    • I effected it temporarily with an if self.server_mode: <sleep> guard, but that's not going to fly in the real world for obvious reasons.

However, I think I see an alternate solution, which is still not amazing but sucks less than e.g. copy/paste/modifying the (large) body of Transport.run to create a test-specific subclass.

Specifically, tweak the server Transport's Packetizer instance so its read_message method does the sleep. This method is called early on in Transport.run, and delaying it should prevent the server from finishing the handshake (indeed, it's the same call on the client side that likely sees most real-world "taking too long because network" problems; in fact we could just put the sleep in the client instead...hrm).

This lets us reach into the appropriate object & trigger a sleep on the server, without having to muck with Transport.run directly. Trying it now.
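
Something along these lines, as a hypothetical sketch (invented helper name; it assumes Transport exposes its Packetizer as .packetizer, and it is not the committed fix):

    import time

    def slow_down_packetizer(transport, delay=1.0):
        packetizer = transport.packetizer
        real_read_message = packetizer.read_message  # bound method

        def slow_read_message():
            # Delay every read so the handshake can't finish before the
            # handshake-timeout timer has a chance to fire.
            time.sleep(delay)
            return real_read_message()

        # Shadow the method on this one instance only; other Transports
        # (e.g. the client side) are left untouched.
        packetizer.read_message = slow_read_message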

@bitprophet
Member

While letting that tweak run in another while loop in the bg, I checked Travis to see how often this is still coming up and yup, every PR-related test build in the last 2+ months (https://travis-ci.org/paramiko/paramiko/pull_requests) is red, and spot checks are all showing this as the culprit (and all only under Python 3.5).

@bitprophet
Member
bitprophet commented Apr 23, 2016 edited

Grump, that's not working, unfortunately, so my hope that this initial call into the more easily manipulated Packetizer was on the critical path seems wrong. The sleep is pretty clearly only firing some of the time (going by test runtime reporting), and whether or not it occurs seems orthogonal to whether the test fails or passes - but it is failing about as often as on master. This strongly implies that the initial read_message on the server isn't actually necessary to complete handshaking.

Of note: a reduction of this tweak - slapping an unqualified sleep into the actual Packetizer.read_message - does appear to work fine (214 runs w/o failure). That makes me think the right thing here may be to make the client sleep, not the server (since the unqualified change affects both of those threads).

Trying another loop with this theory applied.

@bitprophet
Member
bitprophet commented Apr 23, 2016 edited

30 minutes / 877 loop iterations w/o test failures. I think that wraps it up. Time to undo all my commented-out related junk & get this up on GitHub so people can update their PRs 😖 (that is, unless Travis puts the lie to all of tonight's flailing. sob.)

@bitprophet
Member

FTR I merged this into #394 and its tests now pass on Travis, a couple of times in a row.
