Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace exec, pause pod with platform agnostic agnhost pause pod #79012

Merged
merged 1 commit into from Jun 28, 2019

Conversation

@mgdevstack
Copy link
Member

commented Jun 14, 2019

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change
/kind bug

/kind cleanup

/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:
E2E shouldn't be verifying failure reasons and messages which are optional and might change in future release of k8s. (Ref: #75324 (comment))

This PR removes "error message" checking codeline and replaces underlying pause and exec pod images with agnhost image. (Image consolidation issue ref: #76342)

These changes would make e2e eligible as a candidate for Conformance.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

None

/area conformance
/sig testing
@kubernetes/sig-node-pr-reviews
@kubernetes/sig-architecture-pr-reviews
@kubernetes/cncf-conformance-wg

Image: BusyBoxImage,
Command: []string{"sh", "-c", "trap exit TERM; while true; do sleep 5; done"},
Name: "agnhost-pause",
Image: imageutils.GetE2EImage(imageutils.Agnhost),

This comment has been minimized.

Copy link
@danwinship

danwinship Jun 14, 2019

Contributor

This won't work; newExecPodSpec is being used to create pods for kubectl exec-ing into, so it needs to have an image that has stuff in it

This comment has been minimized.

Copy link
@danwinship

danwinship Jun 14, 2019

Contributor

oh, i see agnhost has some binaries in it... but I'm assuming that the reason why a whole bunch of tests failed on this PR is that it doesn't have everything needed...

This comment has been minimized.

Copy link
@mgdevstack

mgdevstack Jun 15, 2019

Author Member

I'm troubleshooting failing E2Es. But effort is to use agnhost. agnhost:2.0 or further release holds various cli based app and system binaries. (#78389)

This comment has been minimized.

Copy link
@mgdevstack

mgdevstack Jun 17, 2019

Author Member

All tests are passing now with the introduction of agnhost:2.0 from #78389.

This comment has been minimized.

Copy link
@spiffxp

spiffxp Jun 27, 2019

Member

This change looks valid. Can we keep this and drop the change to service.go?

This comment has been minimized.

Copy link
@mgdevstack

mgdevstack Jun 28, 2019

Author Member

Done

@@ -1836,21 +1836,17 @@ var _ = SIGDescribe("Services", func() {
cmd := fmt.Sprintf(`wget -T 3 -qO- %v`, serviceAddress)

ginkgo.By(fmt.Sprintf("hitting service %v from pod %v on node %v", serviceAddress, podName, nodeName))
expectedErr := "connection refused"

This comment has been minimized.

Copy link
@danwinship

danwinship Jun 14, 2019

Contributor

This test specifically needs to distinguish different kinds of failure ("connection refused" vs "timed out") and the underlying binaries don't provide any way to do that other than by looking at the error messages.

This comment has been minimized.

Copy link
@mgdevstack

mgdevstack Jun 15, 2019

Author Member

Good explanation is documented here in #75324 (comment)

Events and contents of condition reasons and messages are not covered by API compatibility guarantees. They are passed through for human consumption, not machine consumption.

This comment has been minimized.

Copy link
@danwinship

danwinship Jun 15, 2019

Contributor

The error message it's comparing against is from "curl", not from kubernetes, so kubernetes's API compatibility guarantees aren't relevant.

Alternatively, we could build our own test binary into agnhost and have it exit with different error codes in the two cases. Would that be possible?

This comment has been minimized.

Copy link
@mgdevstack

mgdevstack Jun 17, 2019

Author Member

error messages itself are "breaking changes" and should be avoided in such situation.

build our own test binary into agnhost and have it exit with different error codes in the two cases

What could be the benefit? ExitCode "zero" and "non-zero" are sufficient enough to verify SUCCESS and FAILURE behaviour or state. Desired state must meet regardless of reasons of success and failure of E2E.

This comment has been minimized.

Copy link
@danwinship

danwinship Jun 17, 2019

Contributor

We aren't testing "is the service reachable", we are testing whether the iptables rules are set up correctly. The way we test that currently is by seeing whether a connection attempt is rejected ("connection refused") or dropped ("timeout"). Your rewrite is not correct. It will always report success when the service is not reachable, regardless of whether the iptables rules are correct or not.

At the moment, the only way to test this correctly is by looking at the error string. If you are going to remove the error string check, then you need to remove the entire test and file an issue indicating that you have done so and that a new test needs to be written.

This comment has been minimized.

Copy link
@danwinship

danwinship Jun 25, 2019

Contributor

There is no HTTP server running. If we get an HTTP status code then something has gone terribly terribly wrong with the test case.

Clarifying this test a bit more: in the past, if you tried to connect to a service that had no endpoints, the connection attempt would hang for a while, and then eventually time out and fail, after you hit some kernel/client timeout. (eg, like what happens if you try "curl http://2.2.2.2/") This was a little bit annoying, so we tweaked the iptables rules so that now if you try to connect to a service with no endpoints, it will fail immediately. (eg, like if you try "curl http://127.0.0.1:1"). And there was much rejoicing. (🎉)

But then, someone changed something in our iptables rules in a way that made these rules not work. And nobody noticed for a while, and we shipped a release where trying to connect to inactive services would hang rather than immediately failing and it was annoying and we were sad. So after it got fixed, we added a test to make sure it stayed fixed. And that's this test.

So, the goal of the test is to make sure that if you try to connect to a service when that service is unavailable, the connection attempt should fail "immediately" rather than "eventually". We can't do that just by looking at the iptables rules, because we want to test that they're working, not just that they're there. In theory maybe we could test it by timing the connection attempt and seeing if it fails "immediately" or "eventually", but that's tricky because sometimes the test machine might be hosed and so it would take it a moment to return the "immediate" error, and other times the kernel might have already cached the fact that the destination is unreachable and so will quickly return the "eventually" error. So it's hard to pick a timeout that we can be certain won't cause flakes. It's more reliable to test if we got "connection refused" (pass) or "no route to host"/"timeout" (fail).

As I said a few comments back, if we could add a tiny test program (like, 10 lines of go) to agnhost then we could use that instead, and have it exit with different status codes depending on the errno value returned by the net.Dial() call, and then there'd be no need for potentially-flaky string comparisons.

But if we're sticking with curl, then I believe we have to stick with string comparisons.

This comment has been minimized.

Copy link
@johnSchnake

johnSchnake Jun 25, 2019

Contributor

So I was thinking/hoping that curl/wget would be setting some sort of httpstatus code based on the failure it encountered (timeout vs refused) but that doesnt seem to be the case.

The exit code of curl is, in both cases, 7 and wget returns 4 in both cases.

Meanwhile, as you proposed the go solution does have a non-string based solution since the net errors have a Timeout() method.

https://play.golang.org/p/Fdhb72pm7CV is a copy of what I did locally. [Note: it doesnt work on the playground, saying not a timeout in both cases, but thats just the playground env).

This seems like the only solution proposed that checks everyone's boxes.

[BTW you've mentioned curl but it looks like its actually using wget, but all the same points apply].

This comment has been minimized.

Copy link
@danwinship

danwinship Jun 26, 2019

Contributor

OK, I've filed #79423 with a new agnhost connect command, and updating this test to use it. Feel free to just absorb those commits into this PR though if you'd prefer.

This comment has been minimized.

Copy link
@spiffxp

spiffxp Jun 27, 2019

Member

I agree that #79243 looks like the way to go, I think I'd rather see this PR drop changes to this file and merge

This comment has been minimized.

Copy link
@mgdevstack

mgdevstack Jun 28, 2019

Author Member

Dropping this change in favour of #79423 and once this is merged, respective e2e can be picked up for Conformance promotion.

})

if err != wait.ErrWaitTimeout {
framework.ExpectNoError(err)

This comment has been minimized.

Copy link
@neolit123

neolit123 Jun 14, 2019

Member

the error should be annotated.

This comment has been minimized.

Copy link
@mgdevstack

mgdevstack Jun 17, 2019

Author Member

Done.

@spiffxp

This comment has been minimized.

Copy link
Member

commented Jun 15, 2019

/priority important-soon
/milestone v1.16

@mgdevstack

This comment has been minimized.

Copy link
Member Author

commented Jun 15, 2019

/work-in-progress
Troubleshooting failing e2e.

@mgdevstack

This comment has been minimized.

Copy link
Member Author

commented Jun 15, 2019

/hold

@mgdevstack

This comment has been minimized.

Copy link
Member Author

commented Jun 17, 2019

/test pull-kubernetes-e2e-gce

@mgdevstack mgdevstack force-pushed the mgdevstack:master-service-errmsg branch from a2d34df to 0d4799e Jun 17, 2019

@mgdevstack

This comment has been minimized.

Copy link
Member Author

commented Jun 17, 2019

/hold cancel
Annotated error and tests are passing.

@spiffxp
Copy link
Member

left a comment

Can we re-scope this so it can merge?

@@ -1836,21 +1836,17 @@ var _ = SIGDescribe("Services", func() {
cmd := fmt.Sprintf(`wget -T 3 -qO- %v`, serviceAddress)

ginkgo.By(fmt.Sprintf("hitting service %v from pod %v on node %v", serviceAddress, podName, nodeName))
expectedErr := "connection refused"

This comment has been minimized.

Copy link
@spiffxp

spiffxp Jun 27, 2019

Member

I agree that #79243 looks like the way to go, I think I'd rather see this PR drop changes to this file and merge

Image: BusyBoxImage,
Command: []string{"sh", "-c", "trap exit TERM; while true; do sleep 5; done"},
Name: "agnhost-pause",
Image: imageutils.GetE2EImage(imageutils.Agnhost),

This comment has been minimized.

Copy link
@spiffxp

spiffxp Jun 27, 2019

Member

This change looks valid. Can we keep this and drop the change to service.go?

@spiffxp spiffxp moved this from In Review to In Progress in cncf-k8s-conformance-wg Jun 27, 2019

@mgdevstack mgdevstack force-pushed the mgdevstack:master-service-errmsg branch from 0d4799e to 49cc8ef Jun 28, 2019

@k8s-ci-robot k8s-ci-robot added size/XS and removed size/S labels Jun 28, 2019

@mgdevstack mgdevstack changed the title Remove error message check lines from e2e and replace exec, pause pod with platform agnostic agnhost pause pod Replace exec, pause pod with platform agnostic agnhost pause pod Jun 28, 2019

@spiffxp

This comment has been minimized.

Copy link
Member

commented Jun 28, 2019

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Jun 28, 2019

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Jun 28, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mgdevstack, spiffxp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fejta-bot

This comment has been minimized.

Copy link

commented Jun 28, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

1 similar comment
@fejta-bot

This comment has been minimized.

Copy link

commented Jun 28, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit 3bd11af into kubernetes:master Jun 28, 2019

23 checks passed

cla/linuxfoundation mgdevstack authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
Details
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-csi-serial Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-gce-iscsi Skipped.
pull-kubernetes-e2e-gce-iscsi-serial Skipped.
pull-kubernetes-e2e-gce-storage-slow Skipped.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
Details
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-node-e2e-containerd Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
pull-publishing-bot-validate Skipped.
tide In merge pool.
Details

cncf-k8s-conformance-wg automation moved this from In Progress to Done Jun 28, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.