Add Event intervals for Startup Probe failures #27612

@DennisPeriquet DennisPeriquet commented Dec 14, 2022

TRT-724

The kubelet logs contain startup probe errors. We extract them (in the same way the readiness probe events are extracted) so that they appear in the event intervals and the interval charts.

Sample output:

  • Startup probes occur just before a container goes to Ready state. You can see them in this chart for pods (e.g., redhat-operators-5nk4h) that are in the openshift-marketplace namespace.
  • For the node logs of this job, we see 4 Startup Probe events which show up in the chart. Note that both log line cases (output=" and output=<) are represented:
$ grep 5nk4h ip-10-0-149-44.us-west-2.compute.internal-journal.log|grep '"Probe failed" probeType="Startup"'
Dec 15 14:35:33.232637 ip-10-0-149-44 kubenswrapper[2416]: I1215 14:35:33.209265    2416 prober.go:114] "Probe failed" probeType="Startup" pod="openshift-marketplace/redhat-operators-5nk4h" podUID=b2e47f45-5d77-47c8-94c9-dfbdab5c1834 containerName="registry-server" probeResult=failure output="command timed out"
Dec 15 14:35:43.223691 ip-10-0-149-44 kubenswrapper[2416]: I1215 14:35:43.222242    2416 prober.go:114] "Probe failed" probeType="Startup" pod="openshift-marketplace/redhat-operators-5nk4h" podUID=b2e47f45-5d77-47c8-94c9-dfbdab5c1834 containerName="registry-server" probeResult=failure output="command timed out"
Dec 15 14:35:53.047639 ip-10-0-149-44 kubenswrapper[2416]: I1215 14:35:53.042936    2416 prober.go:114] "Probe failed" probeType="Startup" pod="openshift-marketplace/redhat-operators-5nk4h" podUID=b2e47f45-5d77-47c8-94c9-dfbdab5c1834 containerName="registry-server" probeResult=failure output=<
Dec 15 14:36:03.233780 ip-10-0-149-44 kubenswrapper[2416]: I1215 14:36:03.233339    2416 prober.go:114] "Probe failed" probeType="Startup" pod="openshift-marketplace/redhat-operators-5nk4h" podUID=b2e47f45-5d77-47c8-94c9-dfbdab5c1834 containerName="registry-server" probeResult=failure output="command timed out"

[Image: Screen Shot 2022-12-15 at 10 31 23 AM]

NOTE: a log line that ends with output=< indicates multi-line output. I purposely did not add logic to capture the text of the multi-line output. If this information proves valuable in the future, we can add it.
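To make the two log line cases concrete, here is a minimal sketch of how such a line could be matched. It is written in JavaScript only to match the chart code in this PR and is not the extraction code that was merged. The regular expression and the function name `parseStartupProbeFailure` are assumptions for illustration. As in the note above, the continuation lines of a multi-line output are not collected.

```javascript
// Hypothetical sketch, not the code merged in this PR.
// Matches kubelet journal lines that report a failed startup probe.
// The output is either a quoted string (output="...") or an opening
// angle bracket (output=<) that introduces multi-line output.
const STARTUP_PROBE_FAILURE =
  /"Probe failed" probeType="Startup" pod="([^"\/]+)\/([^"]+)" podUID=(\S+) containerName="([^"]+)" probeResult=failure output=(?:"((?:[^"\\]|\\.)*)"|<)/;

function parseStartupProbeFailure(line) {
  const m = STARTUP_PROBE_FAILURE.exec(line);
  if (m === null) {
    return null; // not a startup probe failure line
  }
  const [, namespace, pod, podUID, container, output] = m;
  const multiline = output === undefined; // the output=< branch matched
  return {
    namespace,
    pod,
    podUID,
    container,
    // The text of a multi-line output is deliberately not captured.
    output: multiline ? null : output,
    multiline,
  };
}

// Usage with one of the log lines shown above.
const sampleLine =
  'Dec 15 14:35:53.047639 ip-10-0-149-44 kubenswrapper[2416]: I1215 14:35:53.042936    2416 prober.go:114] "Probe failed" probeType="Startup" pod="openshift-marketplace/redhat-operators-5nk4h" podUID=b2e47f45-5d77-47c8-94c9-dfbdab5c1834 containerName="registry-server" probeResult=failure output=<';
console.log(parseStartupProbeFailure(sampleLine));
```

Both cases produce a result, so each of the four lines above would yield one interval event; only the `output` text differs.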

@@ -223,6 +230,9 @@ <h5 class="modal-title">Resource</h5>
return [item.locator, ` (kubelet container readiness)`, "ContainerReadinessErrored"];
}
}
if (m && isKubeletStartupProbeFailure(item)){
return [item.locator, ` (kubelet container readiness)`, "StartupProbeFailed"];
Contributor:

Should the message be named startup instead of readiness since it is a startup probe?

Contributor Author:

I left it as kubelet container readiness because the StartupProbeFailed is still part of container readiness and the event came from reading the kubelet log.

Contributor:

The kubelet has three different kinds of probes: liveness, readiness and startup. I assume you want a separate startup label instead of readiness to indicate which kind of probe this is about.

Contributor Author:

You are right, thanks!
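As an illustration of the naming point settled in this thread, the sketch below gives each probe kind its own label. It is hypothetical: the thread does not show the final label text, so ` (kubelet container startup)` is an assumption, as are the `probeType` field and the function name `probeCategory`. The chart code in the diff uses predicates such as `isKubeletStartupProbeFailure(item)` instead.

```javascript
// Hypothetical sketch, not the merged chart code.
// Assumes each interval item carries a `locator` string and a
// `probeType` field; both the field and the startup label text
// are illustrative assumptions.
function probeCategory(item) {
  switch (item.probeType) {
    case "readiness":
      return [item.locator, " (kubelet container readiness)", "ContainerReadinessErrored"];
    case "startup":
      // Separate label so the chart row names the probe kind it shows.
      return [item.locator, " (kubelet container startup)", "StartupProbeFailed"];
    default:
      return null; // other probe kinds are not handled in this sketch
  }
}

console.log(probeCategory({ locator: "ns/openshift-marketplace pod/redhat-operators-5nk4h", probeType: "startup" }));
```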

@xueqzhan (Contributor):

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 19, 2022

openshift-ci bot commented Dec 20, 2022

@DennisPeriquet: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-vsphere-ovn-etcd-scaling | 73e1933 | link | false | /test e2e-vsphere-ovn-etcd-scaling |
| ci/prow/e2e-metal-ipi-sdn | 73e1933 | link | false | /test e2e-metal-ipi-sdn |
| ci/prow/e2e-azure-ovn-etcd-scaling | 73e1933 | link | false | /test e2e-azure-ovn-etcd-scaling |
| ci/prow/e2e-aws-ovn-upgrade | 73e1933 | link | false | /test e2e-aws-ovn-upgrade |



dgoodwin commented Jan 5, 2023

/approve


openshift-ci bot commented Jan 5, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DennisPeriquet, dgoodwin, xueqzhan


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 5, 2023
@openshift-merge-robot openshift-merge-robot merged commit f4dde64 into openshift:master Jan 5, 2023