Fix extended networking test that could potentially wait forever #14167

Merged
merged 1 commit into openshift:master from network-test-waiting on May 12, 2017

Conversation

danwinship
Contributor

k8s's Framework.WaitForAnEndpoint seems broken in that it's possibly the only "wait" method in test/e2e/framework/ that doesn't have a built-in or caller-provided timeout. I'll submit a patch upstream to fix it, but for now, as it happens, there's already another nearly identical function that does have a timeout (of 1 minute), so we can use that instead.
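For context, a bounded wait of that sort looks roughly like the sketch below. It's a minimal illustration rather than the actual test/e2e/framework code: the function and constant names are invented, the import paths and pre-context Get() signature assume the client-go vendored at the time, and the 1-minute timeout mirrors the existing helper mentioned above.

```go
package networking

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	kclientset "k8s.io/client-go/kubernetes"
)

const (
	endpointPollInterval = 5 * time.Second
	endpointPollTimeout  = 1 * time.Minute // bounded, unlike WaitForAnEndpoint
)

// waitForEndpoint polls until the named Endpoints object reports at least one
// address, or gives up after endpointPollTimeout instead of waiting forever.
// (Hedged sketch; not the real framework helper.)
func waitForEndpoint(c kclientset.Interface, ns, name string) error {
	return wait.Poll(endpointPollInterval, endpointPollTimeout, func() (bool, error) {
		ep, err := c.CoreV1().Endpoints(ns).Get(name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		for _, subset := range ep.Subsets {
			if len(subset.Addresses) > 0 {
				return true, nil
			}
		}
		return false, nil
	})
}
```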

@danwinship
Contributor Author

[test]

@bparees
Contributor

bparees commented May 12, 2017

lgtm assuming it passes.

@openshift-bot
Contributor

Evaluated for origin testextended up to c20142e

@danwinship
Contributor Author

FYI kubernetes/kubernetes#45733

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/testextended FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended/381/) (Base Commit: f24a57f) (Extended Tests: core(networking))

@danwinship
Contributor Author

"extended:core(networking)" doesn't work; it tries to run the networking tests (which assume openshift-sdn) under the standard extended test environment (which uses kubenet). You have to say "extended:networking". But just the extended-networking-minimal that gets run as part of "test" should cover this code anyway.

@bparees
Contributor

bparees commented May 12, 2017

ok. deleted my test comment.

@danwinship
Contributor Author

Running the test locally, it does fix the hanging, although all of the tests just fail now with:

Expected error:
    <*errors.StatusError | 0xc420c54b00>: {
        ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {SelfLink: "", ResourceVersion: ""},
            Status: "Failure",
            Message: "endpoints \"service-g1mgx\" not found",
            Reason: "NotFound",
            Details: {
                Name: "service-g1mgx",
                Group: "",
                Kind: "endpoints",
                Causes: nil,
                RetryAfterSeconds: 0,
            },
            Code: 404,
        },
    }
    endpoints "service-g1mgx" not found
not to have occurred

I guess that's consistent with the previous failure, though I don't know why it would be failing since Endpoints seem to work fine outside of the extended tests.

@danwinship
Contributor Author

ugh, actually, WaitForEndpoint might be broken

@danwinship danwinship force-pushed the network-test-waiting branch 2 times, most recently from a56474f to 5cbf61e on May 12, 2017 15:51
(Temporarily borrowing a kubernetes method with a bugfix that isn't
upstream yet.)
@openshift-bot
Contributor

Evaluated for origin test up to 07125e1

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/1406/) (Base Commit: f24a57f)

@danwinship
Contributor Author

test_pull_request_origin_extended_networking_minimal passed, though it looks like test_pull_request_origin_extended_conformance_install failed for infrastructure reasons. Maybe merge by hand? (Note that the commit has changed from the original version; I had to pull in the definition of WaitForEndpoint because it wasn't dealing with Endpoints().Get() returning 404 if it polled before the Endpoint had been created.)
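In other words, the poll condition has to treat a 404 from Endpoints().Get() as "not created yet, keep polling" rather than as a fatal error. Below is a hedged sketch of that condition, reusing the invented names from the earlier sketch (apierrs here is k8s.io/apimachinery/pkg/api/errors); it is not the exact code pulled into the commit.

```go
// endpointExists returns a poll condition that tolerates the Endpoints object
// not existing yet: NotFound means the endpoints controller simply hasn't
// created it, so keep polling; any other error aborts the wait.
func endpointExists(c kclientset.Interface, ns, name string) wait.ConditionFunc {
	return func() (bool, error) {
		ep, err := c.CoreV1().Endpoints(ns).Get(name, metav1.GetOptions{})
		if apierrs.IsNotFound(err) {
			return false, nil // not created yet; poll again
		}
		if err != nil {
			return false, err
		}
		for _, subset := range ep.Subsets {
			if len(subset.Addresses) > 0 {
				return true, nil
			}
		}
		return false, nil
	}
}
```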

@bparees
Contributor

bparees commented May 12, 2017

Manually merging something that's got failing tests would be a bad idea; it could block the entire merge queue if the failures are not flakes. We really only resort to that if the merge queue is already broken and the PR is the fix for the issue.

I'll put a merge tag on it, though. I've also opened a flake for the issue you hit: #14176

[merge]

@openshift-bot
Contributor

openshift-bot commented May 12, 2017

continuous-integration/openshift-jenkins/merge Waiting: You are in the build queue at position: 14

@openshift-bot
Contributor

Evaluated for origin merge up to 07125e1

@smarterclayton smarterclayton merged commit 2a17360 into openshift:master May 12, 2017
@smarterclayton
Contributor

Force merging

@smarterclayton
Contributor

Specifically because this is breaking the merge queue and we have 14 items in it. Confirmed that the test is safe in the run that passed.

@danwinship danwinship deleted the network-test-waiting branch May 12, 2017 17:50
@smarterclayton
Contributor

A minute may not be enough time; use 3 minutes (consistent with other waits).

@danwinship
Contributor Author

Flaking is expected; we didn't fix the underlying "endpoints not being created" bug that was causing the hang before. We just made it time out properly so the test infrastructure could grab logs afterward and we could debug it.

@danwinship
Contributor Author

Though the logs don't show anything useful, and I still can't reproduce the problem locally...

@stevekuznetsov
Contributor

Are we missing logs that could help you here? Or are they all there but just devoid of the content you need?

@smarterclayton
Contributor

smarterclayton commented May 13, 2017 via email

@danwinship
Contributor Author

Maybe 3 minutes would be a theoretically better wait time than 1 minute, but that's not going to make the test stop flaking; we didn't fix the actual bug, we just made the test time out rather than hanging forever when it happens. (In the 7 service tests in that log that passed, the endpoint always existed by the second poll, after 5 seconds. Note that the wait-for-endpoint doesn't start until the pod is already Running; the "16:32:54" on the first message you quoted is not the time the pod was created, but the time the test infrastructure re-logged the event while gathering logs after the test failed.)

> Are we missing logs that could help you here? Or are they all there but just devoid of the content you need?

Can we get the full logs from the master on one of the failed test runs? Particularly any messages from endpoints_controller.go.

Is https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended_networking_minimal/1955/ from #13910 the first test run that hit the bug? Presumably whatever broke things was merged not too long before "May 11, 2017 8:45:44 AM" then. (Though I don't see any obvious candidates, unless there's something subtly wrong with one of @deads2k's controller/roles PRs that sometimes breaks the endpoints controller?)

I can start running the test repeatedly locally tomorrow to see if I can reproduce.

@stevekuznetsov
Contributor

> Can we get the full logs from the master on one of the failed test runs? Particularly any messages from endpoints_controller.go.

Is this missing due to a test infra problem, or would we be solving this in the networking test scripts?

@smarterclayton
Contributor

smarterclayton commented May 14, 2017 via email

@danwinship
Contributor Author

> Is this missing due to a test infra problem, or would we be solving this in the networking test scripts?

The networking test scripts copy the logs off the master/node containers to the machine running the test; I wasn't sure whether those would still be available anywhere for a while or whether the test VM just gets destroyed immediately after the test run.

I don't think we'd want to dump the full journal to stdout on test failure, though maybe if we just grepped for warnings and errors it would be small enough?

I'm not having any luck reproducing the failure locally...

@stevekuznetsov
Contributor

> The networking test scripts copy the logs off the master/node containers to the machine running the test; I wasn't sure whether those would still be available anywhere for a while or whether the test VM just gets destroyed immediately after the test run.

If they're going under $LOG_DIR (which they should!) the files will be uploaded to S3 and be available in the job.

> I don't think we'd want to dump the full journal to stdout on test failure, though maybe if we just grepped for warnings and errors it would be small enough?

We are grabbing the PID1 journal but I can add in the full Docker journal as well.

@stevekuznetsov
Contributor

> We are grabbing the PID1 journal but I can add in the full Docker journal as well.

Scratch that, we have the full Docker journal -- which other one do you need?

@danwinship
Contributor Author

Ah, ok, I didn't realize all that stuff was available from jenkins. Found it now.

Flake bug is #14197; further discussion can happen there.
