[release-4.14] OCPBUGS-18584: Check libovsdbclient.ErrNotFound on wrapped errors #1863

Merged

flavio-fernandes
Contributor

Instead of comparing errors directly against libovsdbclient.ErrNotFound, the checking logic should account for cases where the error has been wrapped.

In particular, this change addresses the logic in: func DeleteNATsOps()
https://github.com/ovn-org/ovn-kubernetes/blob/247483c8d1167072e04cf63e1c6e45264a25310e/go-controller/pkg/libovsdb/ops/router.go#L1078

when the error began to be wrapped as follows:
ovn-org/ovn-kubernetes@25d892c#r1317615944

@openshift-ci-robot added labels Sep 7, 2023:
  • jira/severity-moderate: Referenced Jira bug's severity is moderate for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
@openshift-ci-robot
Contributor

@flavio-fernandes: This pull request references Jira Issue OCPBUGS-18584, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @anuragthehatter

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tssurya
Contributor

tssurya commented Sep 11, 2023

Wait, these are not critical bugs, so why are we cherry-picking them? The more we cherry-pick, the greater the chance of conflicts. Why don't we wait for the d/s merge?

@flavio-fernandes
Contributor Author

/retest-required

@flavio-fernandes changed the base branch from master to release-4.14 September 12, 2023 18:17
@openshift-ci-robot added the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) and removed the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) Sep 12, 2023
@openshift-ci-robot
Contributor

@flavio-fernandes: This pull request references Jira Issue OCPBUGS-18584, which is invalid:

  • expected Jira Issue OCPBUGS-18584 to depend on a bug targeting a version in 4.15.0 and in one of the following states: MODIFIED, ON_QA, VERIFIED, but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

@flavio-fernandes
Contributor Author

/retest-required

@flavio-fernandes
Contributor Author

/jira refresh

@openshift-ci-robot added the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) and removed the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) Sep 18, 2023
@openshift-ci-robot
Contributor

@flavio-fernandes: This pull request references Jira Issue OCPBUGS-18584, which is valid.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
  • dependent bug Jira Issue OCPBUGS-18895 is in the state ON_QA, which is one of the valid states (MODIFIED, ON_QA, VERIFIED)
  • dependent Jira Issue OCPBUGS-18895 targets the "4.15.0" version, which is one of the valid target versions: 4.15.0
  • bug has dependents

Requesting review from QA contact:
/cc @anuragthehatter

@martinkennelly
Contributor

/lgtm

@openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) Sep 19, 2023
Instead of looking explicitly for libovsdbclient.ErrNotFound,
checking logic should account for cases when error has been wrapped.

In particular, this change addresses the logic in:
func DeleteNATsOps()
https://github.com/ovn-org/ovn-kubernetes/blob/247483c8d1167072e04cf63e1c6e45264a25310e/go-controller/pkg/libovsdb/ops/router.go#L1078

when the error began to be wrapped as follows:
ovn-org/ovn-kubernetes@25d892c#r1317615944

Reported-at: https://issues.redhat.com/browse/OCPBUGS-18584
Signed-off-by: Flavio Fernandes <flaviof@redhat.com>
(cherry picked from commit 4980714)
@openshift-ci bot removed the lgtm label (Indicates that a PR is ready to be merged.) Sep 19, 2023
@martinkennelly
Contributor

/lgtm

@openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) Sep 19, 2023
@flavio-fernandes
Contributor Author

/retest-required

@martinkennelly
Contributor

/retest-required
Infra failed to set up; unrelated to this PR.

@openshift-ci
Contributor

openshift-ci bot commented Sep 20, 2023

@flavio-fernandes: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                     Commit   Required  Rerun command
ci/prow/e2e-aws-ovn-kubevirt  f278f15  false     /test e2e-aws-ovn-kubevirt
ci/prow/e2e-openstack-ovn     f278f15  false     /test e2e-openstack-ovn

@martinkennelly
Contributor

martinkennelly commented Sep 20, 2023

Upgrade is failing again. We have now had 5 failures in a row; before that it failed 50% of the time, but it's a small sample size.

$ omg get co
NAME                                      VERSION                                                    AVAILABLE  PROGRESSING  DEGRADED  SINCE
authentication                                                                                       Unknown    Unknown      Unknown   Unknown
baremetal                                                                                            Unknown    Unknown      Unknown   Unknown
cloud-controller-manager                  4.14.0-0.ci.test-2023-09-20-094240-ci-op-6dz138ct-initial  True       False        False     5m20s
cloud-credential                                                                                     True       False        False     54m
cluster-autoscaler                                                                                   Unknown    Unknown      Unknown   Unknown
config-operator                                                                                      Unknown    Unknown      Unknown   Unknown
console                                                                                              Unknown    Unknown      Unknown   Unknown
control-plane-machine-set                                                                            Unknown    Unknown      Unknown   Unknown
csi-snapshot-controller                                                                              Unknown    Unknown      Unknown   Unknown
dns                                                                                                  Unknown    Unknown      Unknown   Unknown
etcd                                      4.14.0-0.ci.test-2023-09-20-094240-ci-op-6dz138ct-initial  True       True         False     36s
image-registry                                                                                       Unknown    Unknown      Unknown   Unknown
ingress                                                                                              Unknown    Unknown      Unknown   Unknown
insights                                                                                             Unknown    Unknown      Unknown   Unknown
kube-apiserver                                                                                       False      True         True      2m37s
kube-controller-manager                                                                              False      True         True      2m35s
kube-scheduler                            4.14.0-0.ci.test-2023-09-20-094240-ci-op-6dz138ct-initial  False      True         True      2m38s
kube-storage-version-migrator                                                                        Unknown    Unknown      Unknown   Unknown
machine-api                                                                                          Unknown    Unknown      Unknown   Unknown
machine-approver                                                                                     Unknown    Unknown      Unknown   Unknown
machine-config                                                                                       Unknown    Unknown      Unknown   Unknown
marketplace                                                                                          Unknown    Unknown      Unknown   Unknown
monitoring                                                                                           Unknown    Unknown      Unknown   Unknown
network                                   4.14.0-0.ci.test-2023-09-20-094240-ci-op-6dz138ct-initial  True       True         False     3m38s
node-tuning                                                                                          Unknown    Unknown      Unknown   Unknown
openshift-apiserver                                                                                  Unknown    Unknown      Unknown   Unknown
openshift-controller-manager                                                                         Unknown    Unknown      Unknown   Unknown
openshift-samples                                                                                    Unknown    Unknown      Unknown   Unknown
operator-lifecycle-manager                                                                           Unknown    Unknown      Unknown   Unknown
operator-lifecycle-manager-catalog                                                                   Unknown    Unknown      Unknown   Unknown
operator-lifecycle-manager-packageserver                                                             Unknown    Unknown      Unknown   Unknown
service-ca                                                                                           Unknown    Unknown      Unknown   Unknown
storage                                                                                              Unknown    Unknown      Unknown   Unknown

API server, controller manager and scheduler are degraded.

I see Network is rolling out, with no sign of being degraded:

status:
  conditions:
  - lastTransitionTime: '2023-09-20T10:59:57Z'
    status: 'False'
    type: Degraded
  - lastTransitionTime: '2023-09-20T10:59:04Z'
    status: 'False'
    type: ManagementStateDegraded
  - lastTransitionTime: '2023-09-20T10:59:04Z'
    status: 'True'
    type: Upgradeable
  - lastTransitionTime: '2023-09-20T10:59:21Z'
    message: 'DaemonSet "/openshift-multus/network-metrics-daemon" is waiting for
      other operators to become ready

      Deployment "/openshift-cloud-network-config-controller/cloud-network-config-controller"
      is waiting for other operators to become ready

      Deployment "/openshift-multus/multus-admission-controller" is waiting for other
      operators to become ready

      Deployment "/openshift-network-diagnostics/network-check-source" is waiting
      for other operators to become ready'
    reason: Deploying
    status: 'True'
    type: Progressing
  - lastTransitionTime: '2023-09-20T10:59:21Z'
    status: 'True'
    type: Available

I see the ovnkube control-plane and node pods are Running:

$ omg get pods -n openshift-ovn-kubernetes
NAME                                    READY  STATUS   RESTARTS  AGE
ovnkube-control-plane-746765cfd7-4tvsx  2/2    Running  0         3m28s
ovnkube-control-plane-746765cfd7-65bds  2/2    Running  0         3m28s
ovnkube-control-plane-746765cfd7-kck79  2/2    Running  0         3m28s
ovnkube-node-p57ng                      8/8    Running  0         2m41s
ovnkube-node-qfk4f                      8/8    Running  0         2m42s
ovnkube-node-qvqp7                      8/8    Running  0         2m42s

As you can see, the rollout of these network components just occurred.

Looking at the API server:

$ omg get co kube-apiserver -o yaml
....
  - lastTransitionTime: '2023-09-20T11:02:22Z'
    message: 'GuardControllerDegraded: [Missing operand on node ip-10-0-124-138.ec2.internal,
      Missing operand on node ip-10-0-70-9.ec2.internal, Missing operand on node ip-10-0-9-247.ec2.internal]

      InstallerControllerDegraded: missing required resources: [configmaps: bound-sa-token-signing-certs-1,config-1,etcd-serving-ca-1,kube-apiserver-audit-policies-1,kube-apiserver-cert-syncer-kubeconfig-1,kube-apiserver-pod-1,kubelet-serving-ca-1,sa-token-signing-certs-1,
      secrets: etcd-client-1,localhost-recovery-client-token-1,localhost-recovery-serving-certkey-1]'
    reason: GuardController_SyncError::InstallerController_Error
    status: 'True'
    type: Degraded

Checking the kube-apiserver-operator (KAO) pod logs:

$ omg -n openshift-kube-apiserver-operator logs kube-apiserver-operator-f7d976dbc-sl425 | grep "E0"
2023-09-20T11:02:27.752507369Z E0920 11:02:27.752486       1 base_controller.go:268] InstallerController reconciliation failed: missing required resources: [configmaps: bound-sa-token-signing-certs-1,config-1,etcd-serving-ca-1,kube-apiserver-audit-policies-1,kube-apiserver-cert-syncer-kubeconfig-1,kube-apiserver-pod-1,kubelet-serving-ca-1,sa-token-signing-certs-1, secrets: etcd-client-1,localhost-recovery-client-token-1,localhost-recovery-serving-certkey-1]
...
2023-09-20T11:02:25.949853499Z E0920 11:02:25.949835       1 base_controller.go:268] GuardController reconciliation failed: [Missing operand on node ip-10-0-124-138.ec2.internal, Missing operand on node ip-10-0-70-9.ec2.internal, Missing operand on node ip-10-0-9-247.ec2.internal]
...
$ omg get pods -n openshift-kube-apiserver
No resources found

@flavio-fernandes
Contributor Author

/test e2e-aws-ovn-upgrade

@jcaamano
Contributor

/approve
/label backport-risk-assessed

@openshift-ci bot added the backport-risk-assessed label (Indicates a PR to a release branch has been evaluated and considered safe to accept.) Sep 21, 2023
@openshift-ci
Contributor

openshift-ci bot commented Sep 21, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: flavio-fernandes, jcaamano, martinkennelly

@openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Sep 21, 2023
@jechen0648

/label cherry-pick-approved

@openshift-ci bot added the cherry-pick-approved label (Indicates a cherry-pick PR into a release branch has been approved by the release branch manager.) Sep 21, 2023
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD e563872 and 2 for PR HEAD f278f15 in total

@openshift-merge-robot merged commit 2bc46b2 into openshift:release-4.14 Sep 21, 2023
21 of 23 checks passed
@openshift-ci-robot
Contributor

@flavio-fernandes: Jira Issue OCPBUGS-18584: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull requests must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-18584 has not been moved to the MODIFIED state.

@jechen0648

/ocpbugs cc-qa

@jechen0648

/label qe-approved

@openshift-ci bot added the qe-approved label (Signifies that QE has signed off on this PR) Sep 21, 2023
@flavio-fernandes deleted the err_wrap.ds branch September 21, 2023 22:39
Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • backport-risk-assessed: Indicates a PR to a release branch has been evaluated and considered safe to accept.
  • cherry-pick-approved: Indicates a cherry-pick PR into a release branch has been approved by the release branch manager.
  • jira/severity-moderate: Referenced Jira bug's severity is moderate for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • qe-approved: Signifies that QE has signed off on this PR