New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add test confirming service network functions from openshift-apiserver pod #25291
Conversation
|
||
failures := []string{} | ||
for _, check := range connectivityChecks.Items { | ||
if !strings.Contains(check.Name, "kube-service") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if !strings.Contains(check.Name, "kube-service") { | |
if !strings.Contains(check.Name, "kubernetes-service") { |
} | ||
} | ||
if len(failures) > 0 { | ||
g.Fail(fmt.Sprintf("the KUBERNETES_SERVICE_HOST:KUBERNETES_SERVICE_PORT was inaccessible via the service network IP (compare against kube-apiserver direct endpoint access):\n%v", strings.Join(failures, "\n"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KUBERNETES_SERVICE_HOST
is the service network IP
. (ie, this is basically saying "172.30.0.1 was inaccessible via 172.30.0.1").
Also, a random developer looking at test failures will have no idea what "(compare against kube-apiserver direct endpoint access)" is supposed to mean. Is that an instruction to the reader? If so, where is the reader supposed to get that other information, and why can't the test case make the comparison itself?
Also, you can use g.Errorf(...)
instead of g.Fail(fmt.Sprintf(...))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, a random developer looking at test failures will have no idea what "(compare against kube-apiserver direct endpoint access)" is supposed to mean. Is that an instruction to the reader? If so, where is the reader supposed to get that other information, and why can't the test case make the comparison itself?
I'll see about adding more information. Basically we are seeing cases today where direct access to a kube-apiserver via the node IP works fine. But access via 172.30.0.1 failed. So we know the kube-apiservers (all of them) are accepting connections, but 172.30.0.1 shows repeated "connection refused"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so I want to be sure no one bounces these as "well the kube-apiserver is returning connection refused". In every case we've seen so far, the kube-apiserver is provably functioning and handling connections from the exact same pods, but cannot be accessed via the service network on a node that is reporting the network as ready.
} | ||
} | ||
if len(failures) > 0 { | ||
g.Fail(fmt.Sprintf("the `oc -n openshift-kube-apiserver get services/apiserver` was inaccessible via the service network IP (compare against kube-apiserver direct endpoint access):\n%v", strings.Join(failures, "\n"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would not be clear to me from the output of these two tests what the difference is between them is (what does -n openshift-kube-apiserver services/apiserver
point to, if not the kube apiserver?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One is maintained by the kube-apiserver directly. The kube-apiserver directly writes into the endpoints resource. This is 172.30.0.1.
The other is used by the service monitor. It is a real service maintained by the service/endpoints controller. This has a different IP.
/test all |
887b17c
to
cf0dcdb
Compare
7384056
to
d563b59
Compare
now has checks on the actual kube-apiserver endpoints to avoid false positives. /hold cancel |
The test ran. /test all |
/retest |
/test all |
1 similar comment
/test all |
d563b59
to
ce1bc0d
Compare
/test all |
/retest |
/test all now that Luis fixed the names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
Won't run cleanly until openshift/cluster-kube-apiserver-operator#928 is merged.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: deads2k, sanchezl The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest Please review the full test history for this PR and help us cut down flakes. |
ce1bc0d
to
e8b9115
Compare
New changes are detected. LGTM label has been removed. |
/retest |
e8b9115
to
3cdf0de
Compare
3cdf0de
to
d228c92
Compare
@deads2k: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
@deads2k: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale |
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle rotten |
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /close |
@openshift-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
This adds checks to be sure we don't experience outages on the service network. We noticed that there were "connection refused" errors from the openshift-apiserver to the kube-apiserver only through the service network. The direct access via the node IP (kube-apiserver are host-network) continued without any failures.
/hold
Need to hold until the openshift-apiserver network checks land. The choice of Fail versus Flake will be determined based on failure rates.