Fix extended networking test that could potentially wait forever #14167
Conversation
[test]
lgtm assuming it passes.
Evaluated for origin testextended up to c20142e
continuous-integration/openshift-jenkins/testextended FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended/381/) (Base Commit: f24a57f) (Extended Tests: core(networking))
"extended:core(networking)" doesn't work; it tries to run the networking tests (which assume openshift-sdn) under the standard extended test environment (which uses kubenet). You have to say "extended:networking". But just the extended-networking-minimal that gets run as part of "test" should cover this code anyway.
ok, deleted my test comment.
Running the test locally, it does fix the hanging, although all of the tests just fail now with
I guess that's consistent with the previous failure, though I don't know why it would be failing, since Endpoints seem to work fine outside of the extended tests.
ugh, actually, WaitForEndpoint might be broken |
Force-pushed from a56474f to 5cbf61e
(Temporarily borrowing a kubernetes method with a bugfix that isn't upstream yet.)
Force-pushed from 5cbf61e to 07125e1
Evaluated for origin test up to 07125e1
continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/1406/) (Base Commit: f24a57f)
test_pull_request_origin_extended_networking_minimal passed, though it looks like test_pull_request_origin_extended_conformance_install failed for infrastructure reasons. Maybe merge by hand? (Note that the commit has changed from the original version; I had to pull in the definition of WaitForEndpoint because it wasn't dealing with Endpoints().Get() returning 404 if it polled before the Endpoint had been created.)
Manually merging something that's got failing tests would be a bad idea; it could block the entire merge queue if the failures are not flakes. We really only resort to that if the merge queue is already broken and the PR is the fix for the issue. I'll put a merge tag on it, though. I've also opened a flake for the issue you hit: #14176 [merge]
continuous-integration/openshift-jenkins/merge Waiting: You are in the build queue at position: 14
Evaluated for origin merge up to 07125e1
Force merging |
Specifically because this is breaking the merge queue and we have 14 items in it. Confirmed that the test is safe in the run that passed. |
A minute may not be enough time; use 3 minutes (consistent with other waits).
Flaking is expected; we didn't fix the underlying "endpoints not being created" bug that was causing the hang before, we just made it time out properly so the test infrastructure could grab logs afterward and we could debug it.
Though the logs don't show anything useful, and I still can't reproduce the problem locally... |
Are we missing logs that could help you here? Or are they all there but just devoid of the content you need? |
Pod start:
May 12 16:32:54.275: INFO: At 2017-05-12 16:31:52.881533922 -0400 EDT - event for service-8nc7r: {kubelet nettest-node-1} Started: Started container with id d68a1e0faefce90efad9beb93a5051133d202bada077250a310f37186266a011
Endpoint wait ended at:
May 12 16:32:49.251: INFO: Endpoint e2e-tests-net-services1-jnk1d/service-8nc7r is not ready yet
As I said, you have to wait at least 3 minutes
Maybe 3 minutes would be a theoretically better wait time than 1 minute, but that's not going to make the test stop flaking; we didn't fix the actual bug, we just made the test time out rather than hanging forever when it happened. (In the 7 service tests in that log that passed, the endpoint always existed by the second time it polled, after 5 seconds. Note that the wait-for-endpoint doesn't start until after the pod is already Running; the "16:32:54" on the first message you quoted is not the time the pod was created, it's the time the test infrastructure re-logged the event while gathering logs after the test failed.)
Can we get the full logs from the master on one of the failed test runs? Particularly any messages from endpoints_controller.go. Is https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended_networking_minimal/1955/ from #13910 the first test run that hit the bug? Presumably whatever broke things was merged not too long before "May 11, 2017 8:45:44 AM" then. (Though I don't see any obvious candidates, unless there's something subtly wrong with one of @deads2k's controller/roles PRs that sometimes breaks the endpoints controller?) I can start running the test repeatedly locally tomorrow to see if I can reproduce.
Is this missing due to a test infra problem, or would we be solving this in the networking test scripts?
I'm saying you can *never* wait for endpoints in our test runs for less than 3 minutes, because we have existing known places where the kubelet misses a sync loop and retries one or two minutes later. If you have a test *anywhere* for endpoints with a timeout less than 3 minutes, it will cause flakes independent of whatever bug you have here.
The networking test scripts copy the logs off the master/node containers over to the machine running the test; I wasn't sure if that would still be available anywhere for a while or if the test VM just got immediately destroyed after the test run. I don't think we'd want to dump the full journal to stdout on test failure, though maybe if we just grepped for warnings and errors it would be small enough? I'm not having any luck reproducing the failure locally...
If they're going under
We are grabbing the PID1 journal but I can add in the full Docker journal as well.
Scratch that, we have the full Docker journal -- which other one do you need?
Ah, ok, I didn't realize all that stuff was available from Jenkins. Found it now. The flake bug is #14197; further discussion can happen there.
k8s's Framework.WaitForAnEndpoint seems broken in that it's possibly the only "wait" method in test/e2e/framework/ that doesn't have a built-in or caller-provided timeout. I'll submit a patch upstream to do something with it, but for now, as it happens, there's already another nearly identical function that does have a timeout (of 1 minute), so we can use that instead.