Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

USHIFT-1389: Enhance greenboot check script to print cluster debugging information on failures #2150

Merged

Conversation

ggiguash
Copy link
Contributor

@ggiguash ggiguash commented Jul 31, 2023

Closes USHIFT-1389

@openshift-ci-robot
Copy link

openshift-ci-robot commented Jul 31, 2023

@ggiguash: This pull request references USHIFT-1389 which is a valid jira issue.

In response to this:

Closes USHIFT-1389

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jul 31, 2023
@openshift-ci-robot
Copy link

openshift-ci-robot commented Jul 31, 2023

@ggiguash: This pull request references USHIFT-1389 which is a valid jira issue.

In response to this:

Closes USHIFT-1389

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested review from jogeo and pacevedom July 31, 2023 16:05
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 31, 2023
@ggiguash
Copy link
Contributor Author

/assign @pmtk @dhellmann

packaging/greenboot/microshift-running-check.sh Outdated Show resolved Hide resolved
for srv in microshift.service microshift-etcd.service ; do
log_failure_cmd "${srv}" "journalctl -xu ${srv} -n 1000 --no-pager"
done
# Always log the list of pods
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we log this sort of information closer to the place where it is being checked? For example, "Expected N pods in $namespace, found M: $(oc get pods -n $namespace)"

Copy link
Contributor Author

@ggiguash ggiguash Aug 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that it's difficult because the checks are running in loops/background and there's a wait for status reports.
I think, however, that producing pod list (and events?) before exit helps to identify the problem on the spot.

So, there will be a message that pod restart check failed and pod-list-snapshot taken 1s after that error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is a user supposed to know what pods should be present so they can compare that to the list of pods that are present?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess, that requires some internal knowledge of MicroShift. However, it's "easy" to see if something not started by the pod count.
I'm not saying it's ideal, so I'm open to suggestions on how to improve this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the loops where the script looks for pods in specific namespaces, it knows when it does not find them. It should report whatever details it can at that point. Messages like "Expected 5 ready pods in openshift-foo but found 4" point directly to issues within the namespace. "Deployment embedded-component has no ready pods" is even better because it points directly to the thing that is broken.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added error messages in all the loops. Is this any better?

Copy link
Contributor

@dhellmann dhellmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 2, 2023
@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD d0a6112 and 2 for PR HEAD a61c625 in total

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 54c213a and 1 for PR HEAD a61c625 in total

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD c96eb62 and 0 for PR HEAD a61c625 in total

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Aug 3, 2023
@ggiguash
Copy link
Contributor Author

ggiguash commented Aug 3, 2023

I reverted the "optimization" of service status check here.
Looks like it was introducing a race condition.

@pmtk
Copy link
Member

pmtk commented Aug 3, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 3, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 3, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhellmann, ggiguash, pmtk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [dhellmann,ggiguash,pmtk]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD c96eb62 and 2 for PR HEAD a1e7213 in total

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD f7cb029 and 1 for PR HEAD a1e7213 in total

@ggiguash
Copy link
Contributor Author

ggiguash commented Aug 3, 2023

The e2e-openshift-conformance-reduced test is not related to this change.
/override ci/prow/e2e-openshift-conformance-reduced

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 3, 2023

@ggiguash: Overrode contexts on behalf of ggiguash: ci/prow/e2e-openshift-conformance-reduced

In response to this:

The e2e-openshift-conformance-reduced test is not related to this change.
/override ci/prow/e2e-openshift-conformance-reduced

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 32f9818 and 0 for PR HEAD a1e7213 in total

@ggiguash
Copy link
Contributor Author

ggiguash commented Aug 3, 2023

/override ci/prow/e2e-openshift-conformance-reduced

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 3, 2023

@ggiguash: Overrode contexts on behalf of ggiguash: ci/prow/e2e-openshift-conformance-reduced

In response to this:

/override ci/prow/e2e-openshift-conformance-reduced

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot merged commit d0c1a88 into openshift:main Aug 3, 2023
8 of 10 checks passed
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 3, 2023

@ggiguash: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/microshift-e2e-arm a1e7213 link false /test microshift-e2e-arm
ci/prow/e2e-openshift-conformance-reduced-arm a1e7213 link false /test e2e-openshift-conformance-reduced-arm

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ggiguash ggiguash deleted the greenboot_error_logging branch August 4, 2023 08:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants