USHIFT-1389: Enhance greenboot check script to print cluster debugging information on failures #2150
@ggiguash: This pull request references USHIFT-1389 which is a valid jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/assign @pmtk @dhellmann
for srv in microshift.service microshift-etcd.service ; do
    log_failure_cmd "${srv}" "journalctl -xu ${srv} -n 1000 --no-pager"
done
# Always log the list of pods
Could we log this sort of information closer to the place where it is being checked? For example, "Expected N pods in $namespace, found M: $(oc get pods -n $namespace)"
That is difficult because the checks run in loops or in the background, and there is a wait for status reports.
I think, however, that producing a pod list (and events?) before exit helps to identify the problem on the spot.
So there will be a message that the pod restart check failed, followed by a pod list snapshot taken 1s after that error.
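As a rough illustration of the "snapshot before exit" idea, here is a minimal sketch that runs a check in a subshell and prints a pod list from an EXIT trap when the check fails. The names `run_check_with_snapshot`, `print_snapshot_on_failure`, and `SNAPSHOT_CMD` are hypothetical and not taken from this PR, and `oc get pods -A` is only an assumed default for the snapshot command.

```shell
#!/bin/bash
# Hypothetical helper names; this is a sketch, not the script from the PR.

# Command used for the snapshot. `oc get pods -A` (all namespaces) is an
# assumed default; it can be overridden to try the sketch without a cluster.
SNAPSHOT_CMD=${SNAPSHOT_CMD:-"oc get pods -A"}

# Print the pod list only when the given exit code signals a failure.
print_snapshot_on_failure() {
    local rc=$1
    if [ "${rc}" -ne 0 ]; then
        echo "Health check failed (exit code ${rc}), pod list snapshot follows:"
        ${SNAPSHOT_CMD} || echo "Could not collect pod list"
    fi
}

# Run the given check command in a subshell whose EXIT trap takes the
# snapshot, so the snapshot is printed right after the failing check.
run_check_with_snapshot() {
    (
        trap 'print_snapshot_on_failure $?' EXIT
        "$@"
    )
}
```

A call such as `run_check_with_snapshot some_check_function` keeps the exit code of the check, so the caller can still decide whether to fail the boot.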
How is a user supposed to know what pods should be present so they can compare that to the list of pods that are present?
I guess that requires some internal knowledge of MicroShift. However, it's "easy" to see from the pod count whether something has not started.
I'm not saying it's ideal, so I'm open to suggestions on how to improve this.
In the loops where the script looks for pods in specific namespaces, it knows when it does not find them. It should report whatever details it can at that point. Messages like "Expected 5 ready pods in openshift-foo but found 4" point directly to issues within the namespace. "Deployment embedded-component has no ready pods" is even better because it points directly to the thing that is broken.
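For example, a per-namespace check could look roughly like the sketch below. The function name `check_ready_pod_count` is hypothetical, and the counts are passed in as arguments here; in the real script they would come from whatever query the loop already runs.

```shell
#!/bin/bash
# Hypothetical helper, not part of the PR: report a shortfall of ready pods
# at the point where the check detects it.
check_ready_pod_count() {
    local namespace="$1"   # namespace being checked
    local expected="$2"    # number of ready pods the script expects
    local found="$3"       # number of ready pods actually observed

    if [ "${found}" -lt "${expected}" ]; then
        # Name the namespace and both counts so the log points at the problem.
        echo "Expected ${expected} ready pods in ${namespace} but found ${found}" >&2
        return 1
    fi
    return 0
}
```

Calling `check_ready_pod_count openshift-foo 5 4` writes the message to stderr and returns a non-zero status that the surrounding loop can act on.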
I added error messages in all the loops. Is this any better?
df7d3ef to a61c625
/lgtm
a61c625 to f6fe532
I reverted the "optimization" of service status check here.
f6fe532 to a1e7213
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dhellmann, ggiguash, pmtk The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
The e2e-openshift-conformance-reduced test is not related to this change.
@ggiguash: Overrode contexts on behalf of ggiguash: ci/prow/e2e-openshift-conformance-reduced In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/override ci/prow/e2e-openshift-conformance-reduced
@ggiguash: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Closes USHIFT-1389