Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace exec, pause pod with platform agnostic agnhost pause pod #79012

Merged
merged 1 commit into from
Jun 28, 2019

Conversation

mgdevstack
Copy link
Contributor

@mgdevstack mgdevstack commented Jun 14, 2019

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change
/kind bug

/kind cleanup

/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:
E2E shouldn't be verifying failure reasons and messages which are optional and might change in future release of k8s. (Ref: #75324 (comment))

This PR removes "error message" checking codeline and replaces underlying pause and exec pod images with agnhost image. (Image consolidation issue ref: #76342)

These changes would make e2e eligible as a candidate for Conformance.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

None

/area conformance
/sig testing
@kubernetes/sig-node-pr-reviews
@kubernetes/sig-architecture-pr-reviews
@kubernetes/cncf-conformance-wg

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. sig/node Categorizes an issue or PR as relevant to SIG Node. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. area/conformance Issues or PRs related to kubernetes conformance tests sig/testing Categorizes an issue or PR as relevant to SIG Testing. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 14, 2019
@k8s-ci-robot k8s-ci-robot added area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test labels Jun 14, 2019
Image: BusyBoxImage,
Command: []string{"sh", "-c", "trap exit TERM; while true; do sleep 5; done"},
Name: "agnhost-pause",
Image: imageutils.GetE2EImage(imageutils.Agnhost),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work; newExecPodSpec is being used to create pods for kubectl exec-ing into, so it needs to have an image that has stuff in it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, i see agnhost has some binaries in it... but I'm assuming that the reason why a whole bunch of tests failed on this PR is that it doesn't have everything needed...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm troubleshooting failing E2Es. But effort is to use agnhost. agnhost:2.0 or further release holds various cli based app and system binaries. (#78389)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All tests are passing now with the introduction of agnhost:2.0 from #78389.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks valid. Can we keep this and drop the change to service.go?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -1836,21 +1836,17 @@ var _ = SIGDescribe("Services", func() {
cmd := fmt.Sprintf(`wget -T 3 -qO- %v`, serviceAddress)

ginkgo.By(fmt.Sprintf("hitting service %v from pod %v on node %v", serviceAddress, podName, nodeName))
expectedErr := "connection refused"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test specifically needs to distinguish different kinds of failure ("connection refused" vs "timed out") and the underlying binaries don't provide any way to do that other than by looking at the error messages.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good explanation is documented here in #75324 (comment)

Events and contents of condition reasons and messages are not covered by API compatibility guarantees. They are passed through for human consumption, not machine consumption.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message it's comparing against is from "curl", not from kubernetes, so kubernetes's API compatibility guarantees aren't relevant.

Alternatively, we could build our own test binary into agnhost and have it exit with different error codes in the two cases. Would that be possible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error messages itself are "breaking changes" and should be avoided in such situation.

build our own test binary into agnhost and have it exit with different error codes in the two cases

What could be the benefit? ExitCode "zero" and "non-zero" are sufficient enough to verify SUCCESS and FAILURE behaviour or state. Desired state must meet regardless of reasons of success and failure of E2E.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We aren't testing "is the service reachable", we are testing whether the iptables rules are set up correctly. The way we test that currently is by seeing whether a connection attempt is rejected ("connection refused") or dropped ("timeout"). Your rewrite is not correct. It will always report success when the service is not reachable, regardless of whether the iptables rules are correct or not.

At the moment, the only way to test this correctly is by looking at the error string. If you are going to remove the error string check, then you need to remove the entire test and file an issue indicating that you have done so and that a new test needs to be written.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no HTTP server running. If we get an HTTP status code then something has gone terribly terribly wrong with the test case.

Clarifying this test a bit more: in the past, if you tried to connect to a service that had no endpoints, the connection attempt would hang for a while, and then eventually time out and fail, after you hit some kernel/client timeout. (eg, like what happens if you try "curl http://2.2.2.2/") This was a little bit annoying, so we tweaked the iptables rules so that now if you try to connect to a service with no endpoints, it will fail immediately. (eg, like if you try "curl http://127.0.0.1:1"). And there was much rejoicing. (:tada:)

But then, someone changed something in our iptables rules in a way that made these rules not work. And nobody noticed for a while, and we shipped a release where trying to connect to inactive services would hang rather than immediately failing and it was annoying and we were sad. So after it got fixed, we added a test to make sure it stayed fixed. And that's this test.

So, the goal of the test is to make sure that if you try to connect to a service when that service is unavailable, the connection attempt should fail "immediately" rather than "eventually". We can't do that just by looking at the iptables rules, because we want to test that they're working, not just that they're there. In theory maybe we could test it by timing the connection attempt and seeing if it fails "immediately" or "eventually", but that's tricky because sometimes the test machine might be hosed and so it would take it a moment to return the "immediate" error, and other times the kernel might have already cached the fact that the destination is unreachable and so will quickly return the "eventually" error. So it's hard to pick a timeout that we can be certain won't cause flakes. It's more reliable to test if we got "connection refused" (pass) or "no route to host"/"timeout" (fail).

As I said a few comments back, if we could add a tiny test program (like, 10 lines of go) to agnhost then we could use that instead, and have it exit with different status codes depending on the errno value returned by the net.Dial() call, and then there'd be no need for potentially-flaky string comparisons.

But if we're sticking with curl, then I believe we have to stick with string comparisons.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I was thinking/hoping that curl/wget would be setting some sort of httpstatus code based on the failure it encountered (timeout vs refused) but that doesnt seem to be the case.

The exit code of curl is, in both cases, 7 and wget returns 4 in both cases.

Meanwhile, as you proposed the go solution does have a non-string based solution since the net errors have a Timeout() method.

https://play.golang.org/p/Fdhb72pm7CV is a copy of what I did locally. [Note: it doesnt work on the playground, saying not a timeout in both cases, but thats just the playground env).

This seems like the only solution proposed that checks everyone's boxes.

[BTW you've mentioned curl but it looks like its actually using wget, but all the same points apply].

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I've filed #79423 with a new agnhost connect command, and updating this test to use it. Feel free to just absorb those commits into this PR though if you'd prefer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that #79243 looks like the way to go, I think I'd rather see this PR drop changes to this file and merge

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropping this change in favour of #79423 and once this is merged, respective e2e can be picked up for Conformance promotion.

})

if err != wait.ErrWaitTimeout {
framework.ExpectNoError(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the error should be annotated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@spiffxp
Copy link
Member

spiffxp commented Jun 15, 2019

/priority important-soon
/milestone v1.16

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jun 15, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.16 milestone Jun 15, 2019
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jun 15, 2019
@mgdevstack
Copy link
Contributor Author

/work-in-progress
Troubleshooting failing e2e.

@mgdevstack
Copy link
Contributor Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 15, 2019
@mgdevstack
Copy link
Contributor Author

/test pull-kubernetes-e2e-gce

@mgdevstack
Copy link
Contributor Author

/hold cancel
Annotated error and tests are passing.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 17, 2019
@spiffxp spiffxp added this to To Triage in conformance-definition Jun 20, 2019
@timothysc timothysc removed their assignment Jun 20, 2019
Copy link
Member

@spiffxp spiffxp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we re-scope this so it can merge?

@@ -1836,21 +1836,17 @@ var _ = SIGDescribe("Services", func() {
cmd := fmt.Sprintf(`wget -T 3 -qO- %v`, serviceAddress)

ginkgo.By(fmt.Sprintf("hitting service %v from pod %v on node %v", serviceAddress, podName, nodeName))
expectedErr := "connection refused"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that #79243 looks like the way to go, I think I'd rather see this PR drop changes to this file and merge

Image: BusyBoxImage,
Command: []string{"sh", "-c", "trap exit TERM; while true; do sleep 5; done"},
Name: "agnhost-pause",
Image: imageutils.GetE2EImage(imageutils.Agnhost),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks valid. Can we keep this and drop the change to service.go?

@spiffxp spiffxp moved this from In Review to In Progress in conformance-definition Jun 27, 2019
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 28, 2019
@mgdevstack mgdevstack changed the title Remove error message check lines from e2e and replace exec, pause pod with platform agnostic agnhost pause pod Replace exec, pause pod with platform agnostic agnhost pause pod Jun 28, 2019
@spiffxp
Copy link
Member

spiffxp commented Jun 28, 2019

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 28, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mgdevstack, spiffxp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 28, 2019
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

1 similar comment
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit 3bd11af into kubernetes:master Jun 28, 2019
conformance-definition automation moved this from In Progress to Done Jun 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/conformance Issues or PRs related to kubernetes conformance tests area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
Development

Successfully merging this pull request may close these issues.

None yet

9 participants