-
Notifications
You must be signed in to change notification settings - Fork 38.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace exec, pause pod with platform agnostic agnhost pause pod #79012
Replace exec, pause pod with platform agnostic agnhost pause pod #79012
Conversation
Image: BusyBoxImage, | ||
Command: []string{"sh", "-c", "trap exit TERM; while true; do sleep 5; done"}, | ||
Name: "agnhost-pause", | ||
Image: imageutils.GetE2EImage(imageutils.Agnhost), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't work; newExecPodSpec
is being used to create pods for kubectl exec
-ing into, so it needs to have an image that has stuff in it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, i see agnhost has some binaries in it... but I'm assuming that the reason why a whole bunch of tests failed on this PR is that it doesn't have everything needed...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm troubleshooting failing E2Es. But effort is to use agnhost
. agnhost:2.0 or further release holds various cli based app and system binaries. (#78389)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All tests are passing now with the introduction of agnhost:2.0
from #78389.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change looks valid. Can we keep this and drop the change to service.go
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
test/e2e/network/service.go
Outdated
@@ -1836,21 +1836,17 @@ var _ = SIGDescribe("Services", func() { | |||
cmd := fmt.Sprintf(`wget -T 3 -qO- %v`, serviceAddress) | |||
|
|||
ginkgo.By(fmt.Sprintf("hitting service %v from pod %v on node %v", serviceAddress, podName, nodeName)) | |||
expectedErr := "connection refused" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test specifically needs to distinguish different kinds of failure ("connection refused" vs "timed out") and the underlying binaries don't provide any way to do that other than by looking at the error messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good explanation is documented here in #75324 (comment)
Events and contents of condition reasons and messages are not covered by API compatibility guarantees. They are passed through for human consumption, not machine consumption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message it's comparing against is from "curl", not from kubernetes, so kubernetes's API compatibility guarantees aren't relevant.
Alternatively, we could build our own test binary into agnhost and have it exit with different error codes in the two cases. Would that be possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
error messages itself are "breaking changes" and should be avoided in such situation.
build our own test binary into agnhost and have it exit with different error codes in the two cases
What could be the benefit? ExitCode "zero" and "non-zero" are sufficient enough to verify SUCCESS and FAILURE behaviour or state. Desired state must meet regardless of reasons of success and failure of E2E.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We aren't testing "is the service reachable", we are testing whether the iptables rules are set up correctly. The way we test that currently is by seeing whether a connection attempt is rejected ("connection refused") or dropped ("timeout"). Your rewrite is not correct. It will always report success when the service is not reachable, regardless of whether the iptables rules are correct or not.
At the moment, the only way to test this correctly is by looking at the error string. If you are going to remove the error string check, then you need to remove the entire test and file an issue indicating that you have done so and that a new test needs to be written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no HTTP server running. If we get an HTTP status code then something has gone terribly terribly wrong with the test case.
Clarifying this test a bit more: in the past, if you tried to connect to a service that had no endpoints, the connection attempt would hang for a while, and then eventually time out and fail, after you hit some kernel/client timeout. (eg, like what happens if you try "curl http://2.2.2.2/
") This was a little bit annoying, so we tweaked the iptables rules so that now if you try to connect to a service with no endpoints, it will fail immediately. (eg, like if you try "curl http://127.0.0.1:1
"). And there was much rejoicing. (:tada:)
But then, someone changed something in our iptables rules in a way that made these rules not work. And nobody noticed for a while, and we shipped a release where trying to connect to inactive services would hang rather than immediately failing and it was annoying and we were sad. So after it got fixed, we added a test to make sure it stayed fixed. And that's this test.
So, the goal of the test is to make sure that if you try to connect to a service when that service is unavailable, the connection attempt should fail "immediately" rather than "eventually". We can't do that just by looking at the iptables rules, because we want to test that they're working, not just that they're there. In theory maybe we could test it by timing the connection attempt and seeing if it fails "immediately" or "eventually", but that's tricky because sometimes the test machine might be hosed and so it would take it a moment to return the "immediate" error, and other times the kernel might have already cached the fact that the destination is unreachable and so will quickly return the "eventually" error. So it's hard to pick a timeout that we can be certain won't cause flakes. It's more reliable to test if we got "connection refused" (pass) or "no route to host"/"timeout" (fail).
As I said a few comments back, if we could add a tiny test program (like, 10 lines of go) to agnhost then we could use that instead, and have it exit with different status codes depending on the errno value returned by the net.Dial()
call, and then there'd be no need for potentially-flaky string comparisons.
But if we're sticking with curl, then I believe we have to stick with string comparisons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I was thinking/hoping that curl/wget would be setting some sort of httpstatus code based on the failure it encountered (timeout vs refused) but that doesnt seem to be the case.
The exit code of curl is, in both cases, 7 and wget returns 4 in both cases.
Meanwhile, as you proposed the go solution does have a non-string based solution since the net errors have a Timeout()
method.
https://play.golang.org/p/Fdhb72pm7CV is a copy of what I did locally. [Note: it doesnt work on the playground, saying not a timeout in both cases, but thats just the playground env).
This seems like the only solution proposed that checks everyone's boxes.
[BTW you've mentioned curl
but it looks like its actually using wget, but all the same points apply].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I've filed #79423 with a new agnhost connect
command, and updating this test to use it. Feel free to just absorb those commits into this PR though if you'd prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that #79243 looks like the way to go, I think I'd rather see this PR drop changes to this file and merge
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dropping this change in favour of #79423 and once this is merged, respective e2e can be picked up for Conformance promotion.
test/e2e/network/service.go
Outdated
}) | ||
|
||
if err != wait.ErrWaitTimeout { | ||
framework.ExpectNoError(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the error should be annotated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
/priority important-soon |
/work-in-progress |
/hold |
/test pull-kubernetes-e2e-gce |
a2d34df
to
0d4799e
Compare
/hold cancel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we re-scope this so it can merge?
test/e2e/network/service.go
Outdated
@@ -1836,21 +1836,17 @@ var _ = SIGDescribe("Services", func() { | |||
cmd := fmt.Sprintf(`wget -T 3 -qO- %v`, serviceAddress) | |||
|
|||
ginkgo.By(fmt.Sprintf("hitting service %v from pod %v on node %v", serviceAddress, podName, nodeName)) | |||
expectedErr := "connection refused" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that #79243 looks like the way to go, I think I'd rather see this PR drop changes to this file and merge
Image: BusyBoxImage, | ||
Command: []string{"sh", "-c", "trap exit TERM; while true; do sleep 5; done"}, | ||
Name: "agnhost-pause", | ||
Image: imageutils.GetE2EImage(imageutils.Agnhost), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change looks valid. Can we keep this and drop the change to service.go
?
0d4799e
to
49cc8ef
Compare
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mgdevstack, spiffxp The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest Review the full test history for this PR. Silence the bot with an |
1 similar comment
/retest Review the full test history for this PR. Silence the bot with an |
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
E2E shouldn't be verifying failure reasons and messages which are optional and might change in future release of k8s. (Ref: #75324 (comment))
This PR removes "error message" checking codeline andreplaces underlyingpause
andexec
pod images withagnhost
image. (Image consolidation issue ref: #76342)These changes would make e2e eligible as a candidate for Conformance.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
/area conformance
/sig testing
@kubernetes/sig-node-pr-reviews
@kubernetes/sig-architecture-pr-reviews
@kubernetes/cncf-conformance-wg