WRKLDS-593: pkg/cli/admin/mustgather: Include must-gather namespace in inspect fallback #1333

Closed

Conversation

wking
Member

@wking wking commented Feb 1, 2023

Carrying and adjusting #1273.

In a CI run, the must-gather failed:

  [must-gather      ] OUT namespace/openshift-must-gather-9dzzn created
  [must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-9kdhs created
  [must-gather      ] OUT pod for plug-in image registry.build01.ci.openshift.org/ci-op-4ryqvzt5/stable@sha256:c68fcf3161719f600a803627bca6d0ebf7894bf75282937aebe908a9ef54dedb created
  [must-gather-lxz2x] POD 2022-10-21T10:17:43.213934752Z Gathering data for ns/openshift-cluster-version...
  ...
  [must-gather-lxz2x] POD 2022-10-21T10:18:07.577813555Z Gathering data for ns/openshift-monitoring...
  [must-gather-lxz2x] OUT gather logs unavailable: read tcp 10.131.188.56:50182->3.12.171.103:6443: read: connection reset by peer
  [must-gather-lxz2x] OUT waiting for gather to complete
  [must-gather-lxz2x] OUT gather never finished: pods "must-gather-lxz2x" not found
  [must-gather      ] OUT namespace/openshift-must-gather-9dzzn deleted
  [must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-9kdhs deleted
  Gathering data for ns/openshift-config...

And from the build log:

  Running must-gather...

  Error running must-gather collection:
      gather never finished for pod must-gather-lxz2x: pods "must-gather-lxz2x" not found

  Falling back to `oc adm inspect clusteroperators.v1.config.openshift.io` to collect basic cluster information.

This commit shifts the inspection fallback from 971dcc6 (#749) to come before that temporary namespace is deleted, and to include the temporary namespace, so we capture events and such to help debug why the pod failed to launch or if/when something else deleted it.
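
For illustration, the adjusted fallback amounts to running something like the command below before the temporary namespace is torn down. The namespace name is the one from the CI run above, and the exact arguments (including --dest-dir) are an assumption for this sketch, not a quote of what oc constructs:

  oc adm inspect clusteroperators.v1.config.openshift.io ns/openshift-must-gather-9dzzn --dest-dir=must-gather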

@openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) on Feb 1, 2023
@openshift-ci-robot

openshift-ci-robot commented Feb 1, 2023

@wking: This pull request references WRKLDS-593 which is a valid jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

…llback

In a CI run [1], the must-gather failed [2]:

  [must-gather      ] OUT namespace/openshift-must-gather-9dzzn created
  [must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-9kdhs created
  [must-gather      ] OUT pod for plug-in image registry.build01.ci.openshift.org/ci-op-4ryqvzt5/stable@sha256:c68fcf3161719f600a803627bca6d0ebf7894bf75282937aebe908a9ef54dedb created
  [must-gather-lxz2x] POD 2022-10-21T10:17:43.213934752Z Gathering data for ns/openshift-cluster-version...
  ...
  [must-gather-lxz2x] POD 2022-10-21T10:18:07.577813555Z Gathering data for ns/openshift-monitoring...
  [must-gather-lxz2x] OUT gather logs unavailable: read tcp 10.131.188.56:50182->3.12.171.103:6443: read: connection reset by peer
  [must-gather-lxz2x] OUT waiting for gather to complete
  [must-gather-lxz2x] OUT gather never finished: pods "must-gather-lxz2x" not found
  [must-gather      ] OUT namespace/openshift-must-gather-9dzzn deleted
  [must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-9kdhs deleted
  Gathering data for ns/openshift-config...

And from the build log [3]:

  Running must-gather...

  Error running must-gather collection:
      gather never finished for pod must-gather-lxz2x: pods "must-gather-lxz2x" not found

  Falling back to `oc adm inspect clusteroperators.v1.config.openshift.io` to collect basic cluster information.

This commit shifts the inspection fallback from 971dcc6 (inspect
clusteroperators as a backup to must-gather if it fails, 2021-02-23, openshift#749)
to come before that temporary namespace is deleted, and to include the
temporary namespace, so we capture events and such to help debug why
the pod failed to launch or if/when something else deleted it.

I'm also adding ClusterVersion to the fallback inspection, because
it's often useful to know what the cluster-version operator is up to.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/836/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/1583384716713660416
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/836/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/1583384716713660416/artifacts/e2e-aws-operator/gather-must-gather/artifacts/must-gather/must-gather.log
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/836/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/1583384716713660416/artifacts/e2e-aws-operator/gather-must-gather/build-log.txt
@openshift-ci
Contributor

openshift-ci bot commented Feb 1, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wking
Once this PR has been reviewed and has the lgtm label, please assign mfojtik for approval by writing /assign @mfojtik in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking force-pushed the inspect-must-gather-namespace-too branch from 949031b to 288b52d on February 1, 2023 at 16:53
  -func (o *MustGatherOptions) BackupGathering(ctx context.Context, errs []error) {
  +func (o *MustGatherOptions) BackupGathering(ctx context.Context, namespace string, errs []error) {
      targets := []string{
  +       "clusterversion.v1.config.openshift.io",
Member Author

I'm also adding ClusterVersion to the fallback inspection, because it's often useful to know what the cluster-version operator is up to. But I'm happy to spin that out into a separate pull request if folks would prefer distinct pivots.

Member

It would be better to do that in a separate PR.

@ardaguclu
Member

Thanks for this patch, @wking. Does your change have any impact when --run-namespace is used?

@openshift-ci
Contributor

openshift-ci bot commented Feb 1, 2023

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                        Commit    Details  Required  Rerun command
ci/prow/e2e-aws-ovn              288b52d   link     true      /test e2e-aws-ovn
ci/prow/e2e-metal-ipi-ovn-ipv6   288b52d   link     false     /test e2e-metal-ipi-ovn-ipv6
ci/prow/e2e-aws-ovn-serial       288b52d   link     true      /test e2e-aws-ovn-serial
ci/prow/e2e-agnostic-ovn-cmd     288b52d   link     true      /test e2e-agnostic-ovn-cmd
ci/prow/e2e-aws-ovn-upgrade      288b52d   link     true      /test e2e-aws-ovn-upgrade
ci/prow/rpm-build                288b52d   link     true      /test rpm-build
ci/prow/images                   288b52d   link     true      /test images
ci/prow/build-rpms-from-tar      288b52d   link     true      /test build-rpms-from-tar
ci/prow/unit                     288b52d   link     true      /test unit
ci/prow/verify                   288b52d   link     true      /test verify

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci bot added the lifecycle/stale label (denotes an issue or PR that has remained open with no activity and has become stale) on May 3, 2023
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci bot added the lifecycle/rotten label (denotes an issue or PR that has aged beyond stale and will be auto-closed) and removed the lifecycle/stale label on Jun 2, 2023
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci bot closed this on Jul 3, 2023
@openshift-ci
Contributor

openshift-ci bot commented Jul 3, 2023

@openshift-bot: Closed this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.