Race condition(s) around handshake timeout functionality #612
Comments
Travis' docs mention slapping... On a Debian VM, tests take a while to run. This might be the time to port the kinda-awful test harness to my... Also noticed that there is a difference between my setup and Travis: I don't habitually run...
Another option is for me to test against Ubuntu 12.04 (Precise), since http://docs.travis-ci.com/user/ci-environment/ notes that's what they currently use for both the standard and container-based runs (we should be containerized now, since our...). That's assuming their docs are up to date, of course, but the same doc notes their experimental Ubuntu 14.04 based setup, so it's probably relatively current.
Ran a few more cycles of tests on both my workstation proper & the Debian VM, because single data points are bad science. Started with Python 2 as a baseline, will do Python 3 after. Because computers are terrible and I'm slow, the Debian VM tests started out with non-master & packet dumps enabled, plus only 512 MB RAM; once I upped to 2 GB and stock master, things got a lot better & in fact faster than OS X. OS X (Yosemite) w/ all the awesome power of my dual-core 11" MBA 1.7 GHz and 8 GB RAM at its disposal:
Debian 8, virtualized, with access to 2 GB RAM (seems unhappy allocating more than that, which is bizarre but I've no time to dig) & "2 cores" (hyperthreading, so VMware thinks I've 4 total):
So, preliminarily, the issue definitely isn't "Linux is slow", it's actually faster with less RAM than my native system. However, since I had it take VERY long with much less RAM (512 MB) it's possible that something is RAM hungry and if Travis is having RAM resource issues (their page only says "up to 4 GB" for the instance type we should be using) that might be a factor.
Yea, looks like the issue was the packet dumps (note to self: leverage snapshots even for once-in-a-blue-moon OSS test VMs...), not the RAM. 512 MB and still getting these figures:
Starting here and working back up...Python 3 at 512 MB:
OK... clearly the issue is not "Linux" or even, probably, "Linux + Python 3" - Python 3's nearly 2x as fast as Python 2! I am now running the Python 3 version of the test in a loop for 100 iterations just for sanity's sake, but doubt I'll see anything. So next is to try with...
Ah, got ahead of myself. On the 12th iteration of the loop, the test run got stuck for hours. Will need to try this with Python 2 probably, but... yea. EDIT: did it again (w/ Python 3) for science reasons, got stuck on the 9th iteration this time. Ran the same loop under Python 2, it got all the way to 100 iterations no problem. Good argument for it being Python 3-specific. EDIT 2: Python 3 again and stuck at iteration 19. To confirm, this is without using...
I believe this is due to the garbage collector choosing to fire while (a) an SSHClient is unreachable, and (b) a log message is being emitted. The garbage collection runs "on top" of the thread writing the log message (which retains lock ownership), and promptly deletes the SSHClient, triggering its Transport's...
However the transport in question wants to emit a log message before it closes:
This is a deadlock: the latter thread is waiting for an I/O lock in... I'm inclined to think that the only way to really avoid bugs like this is to require SSHClients to be explicitly closed; shutting one down potentially has a lot of effects, and hunting down every possible lock they might acquire seems impossibly difficult. I think at least one other, apparently rarer, hang might be related - still investigating.
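For illustration, here is a toy model of the shape of that deadlock, with a plain lock standing in for logging's handler lock and a worker thread standing in for the Transport's thread. This is not paramiko's code, and running it will simply hang - which is the point:

```python
import threading

log_lock = threading.Lock()  # stands in for logging's handler/stream lock

def worker(stop):
    stop.wait()
    with log_lock:  # like the Transport thread, it wants to log before exiting
        print("worker: final log message")

class Client(object):
    def __init__(self):
        self.stop = threading.Event()
        self.thread = threading.Thread(target=worker, args=(self.stop,))
        self.thread.start()

    def close(self):
        self.stop.set()
        self.thread.join()  # wait for the worker thread to finish

    def __del__(self):
        self.close()  # "auto-close" when garbage collected

client = Client()
with log_lock:   # this thread is mid-log-write, holding the lock...
    del client   # ...when the finalizer fires: close() joins the worker,
                 # which is itself blocked waiting for log_lock. Deadlock.
```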
Yup, a deadlock would explain the other issue too since it's simply a differently-located hang. Wonder why this is only happening in 3, but I'm guessing 3 changed nontrivial amounts of things from 2, including threads and gc. Thanks a bunch for poking at this while I slept. Open source distributed teams in action!
Also, re: solution, I feel like we'd still run into problems if an SSHClient got gc'd (which can happen, there was a ticket just the other day in fact) and didn't auto-close? But GC and object references aren't my strong suit so this is mostly paranoia/confusion speaking. Maybe something dumb like having the code that currently autocloses instead log (or print to stderr, which presumably would never be prone to locks, heh) an error message to the tune of "Hey, you allowed your SSHClient to be garbage collected, that's a bad idea and the results won't be pretty, please explicitly close".
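A minimal sketch of that suggestion (purely illustrative, not paramiko's actual code): have the finalizer complain on stderr rather than attempt a shutdown that might need locks the interrupted thread already holds.

```python
import sys

class Client(object):
    def close(self):
        pass  # real shutdown logic would live here

    def __del__(self):
        # Don't attempt cleanup from a finalizer; just warn loudly instead.
        try:
            sys.stderr.write(
                "Client was garbage collected without close(); "
                "please close it explicitly.\n"
            )
        except Exception:
            pass  # never let a finalizer raise
```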
Looking at... Presumably the problem is because... Notably, the overall test suite's... tl;dr this is a crummy test that should instead have been using something like nose or pytest's "emit multiple tests from a single function" features, instead of a raw loop. (So, more fuel for my desire to rework the test suite into something using nose/spec...)
Yes, I think so.
While confirming I could still reproduce the issue, got this different output, though I didn't check the output file prior to Ctrl-C so unclear when exactly it logged (but, probably at kill time):
Possible this is related to the occasional Travis instances of tests completing (fail or pass) but then things hanging anyway.
Backported to the 1.15 branch too; that's as far as the crap test goes (it was my own dumb fault, apparently - no blaming 10-year-old code this time!). Naturally, one of the next Travis builds is still hitting the timeout test fail bug under 3.5: https://travis-ci.org/paramiko/paramiko/jobs/89750927#L347 In fact, they're all hitting it now, wtf. And one of them is seeing it under Python 2.7. Sweet. https://travis-ci.org/paramiko/paramiko/builds/89750929
What's interesting is my most recent "test everything 100 times" run didn't turn that failure up at all, meaning it may actually be more Travis-specific. Trying another run now just in case.
Poked code while the tests ran; this one is "set an obscenely short timeout, then expect an EOFError, but there's no EOFError after all". Was added recentlyish, in d1f7285. Given that the timeout is set to something smaller than a billionth of a second, I'm assuming one of: ...
Assuming this current run yields no instances, gonna try on an Ubuntu 12.04 VM since that's an easy next step. Then... Then I'll instrument the test in question and just rerun the build for 3.5 (since that's where it happens most of the time) on Travis. Remote debugging!
I can get the error fairly readily (Ubuntu 15.04, Python 3.5.0 debug). Replacing the assertRaises with a try/catch and... Generally speaking, thread-switching intervals are much larger than 0.000000000001 seconds; I think this test is really just checking which threads your OS decides to run before the packetizer checks if it's timed out.
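A toy model of that scheduling race, using plain threads and a deadline check rather than paramiko's packetizer: even with an absurdly small timeout, whether the timeout is ever observed depends entirely on which thread runs first.

```python
import threading
import time

class TimedOut(Exception):
    pass

def handshake_with_timeout(timeout, work_seconds):
    # A watchdog thread polls a deadline while the main thread does the
    # "handshake"; whichever side gets scheduled first decides the outcome.
    deadline = time.time() + timeout
    timed_out = threading.Event()
    done = threading.Event()

    def watchdog():
        while not done.is_set():
            if time.time() > deadline:
                timed_out.set()
                return
            time.sleep(0)  # yield to other threads

    t = threading.Thread(target=watchdog)
    t.start()
    time.sleep(work_seconds)  # stand-in for the actual handshake work
    finished_first = not timed_out.is_set()
    done.set()
    t.join()
    if not finished_first:
        raise TimedOut("watchdog noticed the deadline first")
    return "handshake finished before the watchdog ran its check"

# Which outcome you get is down to thread scheduling, not the timeout value;
# over many runs you may well see both.
results = set()
for _ in range(500):
    try:
        handshake_with_timeout(1e-12, work_seconds=0)
        results.add("no timeout raised")
    except TimedOut:
        results.add("timeout raised")
print(results)
```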
After some fun sysadmin tasks, got my 12.04 environment working and I got the fail 3 times in a row, so... yea. (On Python 3.5 as installed via a PPA.) The Python 3 on my Debian VM is 3.4, so wondering if simply "Python 3.5 on Linux" is the hot trigger (esp given @edk0 is also reproducing it... unless the commonality is "lol Ubuntu", which is always possible). Either way, as @edk0 notes, it's likely one of those threading race conditions that can pop up anywhere and are just more likely to appear in certain environments. Re: cause/effect/solution, I'm crap at threading and also not in the sharpest mental state right now. It sounds like the deal is: the test assumes the thread performing the timer/timeout check is going to actually get woken up & executed (& then fail the vanishingly small timeout) prior to the test's thread performing the... Wondering if we can explicitly tell the test thread to defer its scheduling, thus ensuring whichever thread is performing... EDIT: Though my memory may be betraying me, cuz I don't see any methods on...
Also /cc @lndbrg who wrote the current version of this test, in case he has bright ideas and/or opinions. (No pressure tho :))
I think it'd be a lot easier to make the server really slow - hack in a time.sleep(5) before...
Yea, I guess that'd have the same effect, wouldn't it :) Normally I hate the idea of adding sleeps, but in this case the timeout ought to make the sleep actually only last...
Ah, damn, I had my fears that it might introduce some deadlocks. :/ but yeah, making the server insanely slow would fix that. @edk0 @bitprophet
What is the progress on this issue? =)
@coreywright reports he can replicate this under Python 2 in #572, FWIW.
@techtonik next time I get back to Paramiko I'll be testing out a fix for sure :) hopefully soon...
Specifically, the...
This test is the last one in the file: paramiko/tests/test_transport.py, line 799 (at e8142be).
It might be that...
To add to the data points: I encountered this today locally, the first time I went to test a re-opened PR. It was the first test run in a new repo (as I couldn't find my previous local repo), but I haven't been able to recreate it since, so maybe that has something to do with it (or maybe not).
Can we remove one intermediate layer - ...?
I'm seeing...
Hi guys. This should help - http://fitzgeraldnick.com/weblog/64/
Thanks to everyone who poked this further while I was busy elsewhere. Updated description a bit re: current status and will be trying to solve it today (even if it ends up being a test-oriented workaround for the time being). Going to try:
Ok, well, semi good news: with a kinda-hardcoded 5s sleep as recommended, I ran a while loop for many minutes and got up to 426 iterations with no failed tests. So yay for that! EDIT: I also experimented with dropping it down to 1s; got up to 210 before I, again, manually told it ok enough already. So that's probably also good enough. Unfortunately, there are a couple of problems with this particular workaround: ...
However, I think I see an alternate solution, which is still not amazing but sucks less than e.g. copy/paste/modifying the (large) body of... Specifically, tweak the server... This lets us reach into the appropriate object & trigger a sleep on the server, without having to muck with...
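The kind of tweak being described might look roughly like the sketch below - a Packetizer subclass that dawdles on reads so the handshake can't win the race against the timeout check. This is an illustration only; the idea of assigning it via a Transport's packetizer/sock attributes is an assumption about Transport internals rather than a documented API.

```python
import time

from paramiko.packet import Packetizer

class SlowPacketizer(Packetizer):
    """Packetizer that sleeps before each read, giving the handshake-timeout
    check a chance to be scheduled before the handshake completes."""

    def read_message(self):
        time.sleep(0.5)  # long enough for the timeout watchdog to run
        return super(SlowPacketizer, self).read_message()

# In a test, one might then swap it onto a transport before connecting, e.g.:
#     transport.packetizer = SlowPacketizer(transport.sock)
# (again, the attribute names here are assumptions about paramiko internals)
```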
While letting that tweak run in another while loop in the bg, I checked Travis to see how often this is still coming up and yup, every PR-related test build in the last 2+ months (https://travis-ci.org/paramiko/paramiko/pull_requests) is red, and spot checks are all showing this as the culprit (and all only under Python 3.5).
Grump, that's not working, unfortunately, so my hope that this initial use of the more-easily-manipulated Packetizer was in the critical path seems wrong. The sleep is pretty clearly only firing some of the time (going by test runtime reporting), and whether or not it occurs seems orthogonal to whether the test fails or passes - but it is failing about as often as on master. Strongly implies that the initial... Of note is that a reduction of this tweak, namely slapping an unqualified sleep in the actual... Trying another loop with this theory applied.
30 minutes / 877 loop iterations w/o test failures. I think that wraps it up. Time to undo all my commented-out related junk & get this up on GitHub so people can update their PRs.
FTR I merged this into #394 and its tests now pass on Travis, a couple of times in a row.
Edited to add: this ticket began life under the assumption that the issues were Travis-specific, but it seems more likely that Travis and/or the test suite are just exacerbating underlying, real problems - specifically, a race condition shown in test_L_handshake_timeout. The other issue, centering on test_3_multiple_key_files, seems unrelated, received at least some workarounds/fixes mid-ticket, and should be considered closed for now.

Original description follows.
These were most often seen under Python 3.2, which has been nixed, but they pop up on other interpreters as well (for example https://travis-ci.org/paramiko/paramiko/builds/89238099 hit them 3 times in one build!) and it seems to be getting worse.
The problems appear to be most easily replicated under Python 3 but we've had at least a few confirmed reports of it occurring on Python 2 as well (though as per below comments I've been unable to reproduce it locally - only on Travis).
The specific examples that appear to occur are:

- test_L_handshake_timeout fails with AssertionError: EOFError not raised by connect: https://travis-ci.org/paramiko/paramiko/jobs/89548222#L505
- "No output has been received in the last 10 minutes" hangs/kills, often (always? needs lots of scanning) while running test_3_multiple_key_files (test_client.SSHClientTest): https://travis-ci.org/paramiko/paramiko/jobs/89548214#L464