Fix extended networking test that could potentially wait forever #14167

Merged
merged 1 commit into openshift:master from network-test-waiting on May 12, 2017

Conversation

danwinship
Contributor

k8s's Framework.WaitForAnEndpoint seems broken in that it's possibly the only "wait" method in test/e2e/framework/ that doesn't have a built-in or caller-provided timeout. I'll submit a patch upstream to fix it, but for now, as it happens, there's already another nearly identical function that does have a timeout (of 1 minute), so we can use that instead.
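For context, a bounded wait of that sort looks roughly like the sketch below. It's a minimal illustration rather than the actual test/e2e/framework code: the function and constant names are invented, the import paths and pre-context Get() signature assume the client-go vendored at the time, and the 1-minute timeout mirrors the existing helper mentioned above.

```go
package networking

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	kclientset "k8s.io/client-go/kubernetes"
)

const (
	endpointPollInterval = 5 * time.Second
	endpointPollTimeout  = 1 * time.Minute // bounded, unlike WaitForAnEndpoint
)

// waitForEndpoint polls until the named Endpoints object reports at least one
// address, or gives up after endpointPollTimeout instead of waiting forever.
// (Hedged sketch; not the real framework helper.)
func waitForEndpoint(c kclientset.Interface, ns, name string) error {
	return wait.Poll(endpointPollInterval, endpointPollTimeout, func() (bool, error) {
		ep, err := c.CoreV1().Endpoints(ns).Get(name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		for _, subset := range ep.Subsets {
			if len(subset.Addresses) > 0 {
				return true, nil
			}
		}
		return false, nil
	})
}
```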

@danwinship
Contributor Author

[test]

@bparees
Contributor

bparees commented May 12, 2017

lgtm assuming it passes.

@openshift-bot
Contributor

Evaluated for origin testextended up to c20142e

@danwinship
Contributor Author

FYI kubernetes/kubernetes#45733

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/testextended FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended/381/) (Base Commit: f24a57f) (Extended Tests: core(networking))

@danwinship
Contributor Author

"extended:core(networking)" doesn't work; it tries to run the networking tests (which assume openshift-sdn) under the standard extended test environment (which uses kubenet). You have to say "extended:networking". But just the extended-networking-minimal that gets run as part of "test" should cover this code anyway.

@bparees
Contributor

bparees commented May 12, 2017

ok. deleted my test comment.

@danwinship
Contributor Author

Running the test locally, it does fix the hanging, although all of the tests just fail now with:

Expected error:
    <*errors.StatusError | 0xc420c54b00>: {
        ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {SelfLink: "", ResourceVersion: ""},
            Status: "Failure",
            Message: "endpoints \"service-g1mgx\" not found",
            Reason: "NotFound",
            Details: {
                Name: "service-g1mgx",
                Group: "",
                Kind: "endpoints",
                Causes: nil,
                RetryAfterSeconds: 0,
            },
            Code: 404,
        },
    }
    endpoints "service-g1mgx" not found
not to have occurred

I guess that's consistent with the previous failure, though I don't know why it would be failing since Endpoints seem to work fine outside of the extended tests.

@danwinship
Contributor Author

ugh, actually, WaitForEndpoint might be broken

@danwinship danwinship force-pushed the network-test-waiting branch 2 times, most recently from a56474f to 5cbf61e on May 12, 2017 15:51
(Temporarily borrowing a kubernetes method with a bugfix that isn't
upstream yet.)
@openshift-bot
Contributor

Evaluated for origin test up to 07125e1

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/1406/) (Base Commit: f24a57f)

@danwinship
Contributor Author

test_pull_request_origin_extended_networking_minimal passed, though it looks like test_pull_request_origin_extended_conformance_install failed for infrastructure reasons. Maybe merge by hand? (Note that the commit has changed from the original version; I had to pull in the definition of WaitForEndpoint because it wasn't dealing with Endpoints().Get() returning 404 if it polled before the Endpoint had been created.)
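In other words, the poll condition has to treat a 404 from Endpoints().Get() as "not created yet, keep polling" rather than as a fatal error. Below is a hedged sketch of that condition, reusing the invented names from the earlier sketch (apierrs here is k8s.io/apimachinery/pkg/api/errors); it is not the exact code pulled into the commit.

```go
// endpointExists returns a poll condition that tolerates the Endpoints object
// not existing yet: NotFound means the endpoints controller simply hasn't
// created it, so keep polling; any other error aborts the wait.
func endpointExists(c kclientset.Interface, ns, name string) wait.ConditionFunc {
	return func() (bool, error) {
		ep, err := c.CoreV1().Endpoints(ns).Get(name, metav1.GetOptions{})
		if apierrs.IsNotFound(err) {
			return false, nil // not created yet; poll again
		}
		if err != nil {
			return false, err
		}
		for _, subset := range ep.Subsets {
			if len(subset.Addresses) > 0 {
				return true, nil
			}
		}
		return false, nil
	}
}
```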

@bparees
Contributor

bparees commented May 12, 2017

Manually merging something that's got failing tests would be a bad idea; it could block the entire merge queue if the failures are not flakes. We really only resort to that if the merge queue is already broken and the PR is the fix for the issue.

I'll put a merge tag on it, though. I've also opened a flake for the issue you hit: #14176

[merge]

@openshift-bot
Contributor

openshift-bot commented May 12, 2017

continuous-integration/openshift-jenkins/merge Waiting: You are in the build queue at position: 14

@openshift-bot
Contributor

Evaluated for origin merge up to 07125e1

@smarterclayton smarterclayton merged commit 2a17360 into openshift:master May 12, 2017
@smarterclayton
Contributor

Force merging

@smarterclayton
Contributor

Specifically because this is breaking the merge queue and we have 14 items in it. Confirmed that the test is safe in the run that passed.

@danwinship danwinship deleted the network-test-waiting branch May 12, 2017 17:50
@smarterclayton
Contributor

A minute may not be enough time; use 3 minutes (consistent with other waits).

@danwinship
Contributor Author

Flaking is expected; we didn't fix the underlying "endpoints not being created" bug that was causing the hang before. We just made it time out properly so the test infrastructure could grab logs afterward and we could debug it.

@danwinship
Contributor Author

Though the logs don't show anything useful, and I still can't reproduce the problem locally...

@stevekuznetsov
Contributor

Are we missing logs that could help you here? Or are they all there but just devoid of the content you need?

@smarterclayton
Contributor

smarterclayton commented May 13, 2017 via email

@danwinship
Contributor Author

Maybe 3 minutes would be a theoretically better wait time than 1 minute, but that's not going to make the test stop flaking; we didn't fix the actual bug, we just made the test time out rather than hanging forever when it happens. (In the 7 service tests in that log that passed, the endpoint always existed by the second poll, after 5 seconds. Note that the wait-for-endpoint doesn't start until the pod is already Running; the "16:32:54" on the first message you quoted is not the time the pod was created, but the time the test infrastructure re-logged the event while gathering logs after the test failed.)

> Are we missing logs that could help you here? Or are they all there but just devoid of the content you need?

Can we get the full logs from the master on one of the failed test runs? Particularly any messages from endpoints_controller.go.

Is https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended_networking_minimal/1955/ from #13910 the first test run that hit the bug? Presumably whatever broke things was merged not too long before "May 11, 2017 8:45:44 AM" then. (Though I don't see any obvious candidates, unless there's something subtly wrong with one of @deads2k's controller/roles PRs that sometimes breaks the endpoints controller?)

I can start running the test repeatedly locally tomorrow to see if I can reproduce.

@stevekuznetsov
Contributor

> Can we get the full logs from the master on one of the failed test runs? Particularly any messages from endpoints_controller.go.

Is this missing due to a test infra problem, or would we be solving this in the networking test scripts?

@smarterclayton
Contributor

smarterclayton commented May 14, 2017 via email

@danwinship
Contributor Author

> Is this missing due to a test infra problem, or would we be solving this in the networking test scripts?

The networking test scripts copy the logs off the master/node containers to the machine running the test; I wasn't sure whether those would still be available anywhere for a while or whether the test VM just gets destroyed immediately after the test run.

I don't think we'd want to dump the full journal to stdout on test failure, though maybe if we just grepped for warnings and errors it would be small enough?

I'm not having any luck reproducing the failure locally...

@stevekuznetsov
Contributor

> The networking test scripts copy the logs off the master/node containers to the machine running the test; I wasn't sure whether those would still be available anywhere for a while or whether the test VM just gets destroyed immediately after the test run.

If they're going under $LOG_DIR (which they should!) the files will be uploaded to S3 and be available in the job.

> I don't think we'd want to dump the full journal to stdout on test failure, though maybe if we just grepped for warnings and errors it would be small enough?

We are grabbing the PID1 journal but I can add in the full Docker journal as well.

@stevekuznetsov
Contributor

> We are grabbing the PID1 journal but I can add in the full Docker journal as well.

Scratch that, we have the full Docker journal -- which other one do you need?

@danwinship
Contributor Author

Ah, ok, I didn't realize all that stuff was available from jenkins. Found it now.

Flake bug is #14197; further discussion can happen there.
