[release-4.10] Bug 2052017: restart pod on non-retriable failures when deleting stale objects #945

flavio-fernandes · 2022-02-08T14:24:53Z

When removal of stale objects fails and we currently do not retry, it is
better to restart the pod than to simply log an error and let the pod
come up. This change applies that behavior to functions that run early
during pod startup.

Signed-off-by: Flavio Fernandes <flaviof@redhat.com>

Signed-off-by: Flavio Fernandes <flaviof@redhat.com> (cherry picked from commit 44d06f5)

Signed-off-by: Flavio Fernandes <flaviof@redhat.com> (cherry picked from commit 4e9e424)

findSwitch only sets the UUID in the provided parameter, so rename it to findSwitchUUID.

Signed-off-by: Flavio Fernandes <flaviof@redhat.com> (cherry picked from commit d92eab2)

On startup, failures when syncing the OVN DB with Kubernetes should be considered fatal. Still, this change introduces retry logic to minimize pod restarts.

Conflicts: go-controller/pkg/ovn/pods.go

Signed-off-by: Flavio Fernandes <flaviof@redhat.com> (cherry picked from commit af27b80)
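For readers following along, here is a minimal sketch of the "retry first, then treat as fatal" idea described above. It is not the code from these commits: `syncWithRetry`, `errTransient`, the attempt count and the interval are all hypothetical names and values chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

// errTransient is a hypothetical error used only to drive the example.
var errTransient = errors.New("transient OVN DB error")

// syncWithRetry runs syncFn up to attempts times, sleeping interval between
// failed tries. It returns nil on the first success, or the last error wrapped
// with the attempt count once every try has failed.
func syncWithRetry(name string, attempts int, interval time.Duration, syncFn func() error) error {
	if attempts < 1 {
		return fmt.Errorf("%s: attempts must be >= 1, got %d", name, attempts)
	}
	var lastErr error
	for i := 1; i <= attempts; i++ {
		if lastErr = syncFn(); lastErr == nil {
			return nil
		}
		if i < attempts {
			time.Sleep(interval)
		}
	}
	return fmt.Errorf("%s failed after %d attempts: %w", name, attempts, lastErr)
}

func main() {
	calls := 0
	// Hypothetical sync step that fails twice before succeeding.
	flaky := func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	}
	if err := syncWithRetry("syncStaleObjects", 5, 10*time.Millisecond, flaky); err != nil {
		// Assumption: exiting non-zero is what lets the pod be restarted,
		// instead of logging the error and continuing with stale state.
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("sync succeeded after %d calls\n", calls)
}
```

With the values above the flaky step succeeds on its third call, so the program prints `sync succeeded after 3 calls` and never reaches the exit path.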

openshift-ci · 2022-02-08T14:24:58Z

@flavio-fernandes: This pull request references Bugzilla bug 2052017, which is invalid:

expected the bug to target the "4.10.0" release, but it targets "4.10.z" instead
expected dependent Bugzilla bug 2027874 to be in one of the following states: MODIFIED, ON_QA, VERIFIED, but it is CLOSED (ERRATA) instead
expected dependent Bugzilla bug 2027874 to target a release in 4.11.0, but it targets "4.7.z" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

[release-4.10] Bug 2052017: restart pod on non-retriable failures when deleting stale objects

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci · 2022-02-08T14:25:26Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: flavio-fernandes
To complete the pull request process, please assign squeed after the PR has been reviewed.
You can assign the PR to them by writing /assign @squeed in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

flavio-fernandes · 2022-02-08T14:27:38Z

/bugzilla refresh

openshift-ci · 2022-02-08T14:27:41Z

@flavio-fernandes: This pull request references Bugzilla bug 2052017, which is invalid:

expected the bug to target the "4.10.0" release, but it targets "4.10.z" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/bugzilla refresh


flavio-fernandes · 2022-02-08T14:29:50Z

/bugzilla refresh

openshift-ci · 2022-02-08T14:29:57Z

@flavio-fernandes: This pull request references Bugzilla bug 2052017, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

6 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.10.0) matches configured target release for branch (4.10.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
dependent bug Bugzilla bug 2042999 is in the state MODIFIED, which is one of the valid states (MODIFIED, ON_QA, VERIFIED)
dependent Bugzilla bug 2042999 targets the "4.11.0" release, which is one of the valid target releases: 4.11.0
bug has dependents

Requesting review from QA contact:
/cc @anuragthehatter

In response to this:

/bugzilla refresh


flavio-fernandes · 2022-02-10T16:19:35Z

/hold waiting for additional changes to retry (cc @tssurya)

flavio-fernandes · 2022-02-22T19:46:16Z

Holding this PR to cherry-pick the commits from @tssurya: ovn-org/ovn-kubernetes#2787

flavio-fernandes · 2022-02-22T19:49:56Z

/retest-required

openshift-ci · 2022-02-22T22:53:36Z

@flavio-fernandes: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-e2e-gcp-ovn | `e0ee1e1` | link | false | `/test okd-e2e-gcp-ovn` |
| ci/prow/e2e-openstack-ovn | `e0ee1e1` | link | false | `/test e2e-openstack-ovn` |
| ci/prow/e2e-vsphere-ovn | `e0ee1e1` | link | false | `/test e2e-vsphere-ovn` |
| ci/prow/e2e-metal-ipi-ovn-dualstack | `e0ee1e1` | link | true | `/test e2e-metal-ipi-ovn-dualstack` |
| ci/prow/e2e-aws-ovn | `e0ee1e1` | link | true | `/test e2e-aws-ovn` |
| ci/prow/4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade | `e0ee1e1` | link | true | `/test 4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

flavio-fernandes · 2022-03-29T13:50:56Z

This is now folded into #994
Closing this PR.

openshift-ci · 2022-03-29T13:51:02Z

@flavio-fernandes: This pull request references Bugzilla bug 2052017. The bug has been updated to no longer refer to the pull request using the external bug tracker.

In response to this:

[release-4.10] Bug 2052017: restart pod on non-retriable failures when deleting stale objects


flavio-fernandes added 4 commits February 8, 2022 13:54

Unit tests: add logical switch as expected by syncNodes

f79f62d

Signed-off-by: Flavio Fernandes <flaviof@redhat.com> (cherry picked from commit 44d06f5)

Change address_set AddressSetIterFunc so it can return error

dd86f3e

Signed-off-by: Flavio Fernandes <flaviof@redhat.com> (cherry picked from commit 4e9e424)
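To illustrate why letting the iterator callback return an error matters, here is a small sketch. The types are simplified and hypothetical: `addressSetIterFunc` and `forEachAddressSet` take a single name argument here, and the real signatures in the repository may differ. The point is only that a failure inside the callback can now reach the caller instead of being logged and dropped.

```go
package main

import (
	"errors"
	"fmt"
)

// addressSetIterFunc is a simplified, hypothetical stand-in for the callback
// type named in the commit title; the real signature may differ.
type addressSetIterFunc func(name string) error

// errDeleteFailed is a hypothetical error used only to drive the example.
var errDeleteFailed = errors.New("delete failed")

// forEachAddressSet calls fn for every name and stops at the first error,
// returning it wrapped with the offending name.
func forEachAddressSet(names []string, fn addressSetIterFunc) error {
	for _, name := range names {
		if err := fn(name); err != nil {
			return fmt.Errorf("address set %q: %w", name, err)
		}
	}
	return nil
}

func main() {
	visited := 0
	err := forEachAddressSet([]string{"a", "stale", "c"}, func(name string) error {
		visited++
		if name == "stale" {
			return errDeleteFailed
		}
		return nil
	})
	// Iteration stops at "stale", so only two names are visited and the
	// wrapped error still matches errDeleteFailed.
	fmt.Println(visited, errors.Is(err, errDeleteFailed))
}
```

Running this prints `2 true`. A caller at startup can then decide whether to retry or to treat the error as fatal.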

libovsdbops: add FindSwitchByName rename findSwitch

c3f701d

findSwitch only sets the UUID in the provided parameter, so rename it to findSwitchUUID.

Signed-off-by: Flavio Fernandes <flaviof@redhat.com> (cherry picked from commit d92eab2)
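A rough sketch of the distinction this commit message draws, using an in-memory slice instead of libovsdb. Everything here is an assumption made for illustration: the `logicalSwitch` struct, the `rows` parameter and both function bodies are hypothetical, and only the names `FindSwitchByName` and `findSwitchUUID` come from the commit.

```go
package main

import (
	"errors"
	"fmt"
)

// logicalSwitch is a minimal, hypothetical stand-in for an OVN logical switch row.
type logicalSwitch struct {
	UUID string
	Name string
}

// errNotFound is a hypothetical sentinel error for a missing switch.
var errNotFound = errors.New("switch not found")

// FindSwitchByName returns the full row whose Name matches.
func FindSwitchByName(rows []logicalSwitch, name string) (*logicalSwitch, error) {
	for i := range rows {
		if rows[i].Name == name {
			return &rows[i], nil
		}
	}
	return nil, fmt.Errorf("%w: %s", errNotFound, name)
}

// findSwitchUUID only fills in sw.UUID from the row matching sw.Name.
// It does not copy any other field, which is what the new name signals.
func findSwitchUUID(rows []logicalSwitch, sw *logicalSwitch) error {
	if sw.UUID != "" {
		return nil // already resolved
	}
	found, err := FindSwitchByName(rows, sw.Name)
	if err != nil {
		return err
	}
	sw.UUID = found.UUID
	return nil
}

func main() {
	rows := []logicalSwitch{{UUID: "uuid-1", Name: "node1"}}
	sw := &logicalSwitch{Name: "node1"}
	if err := findSwitchUUID(rows, sw); err != nil {
		panic(err)
	}
	fmt.Println(sw.UUID)
}
```

This prints `uuid-1`. Callers that need the whole row would use the lookup by name, while callers that only need a reference for a later operation can use the UUID helper.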

openshift-ci bot added the bugzilla/severity-high label (Referenced Bugzilla bug's severity is high for the branch this PR is targeting.) Feb 8, 2022

openshift-ci bot added the bugzilla/invalid-bug label (Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting.) Feb 8, 2022

openshift-ci bot requested review from dcbw and trozet February 8, 2022 14:25

openshift-ci bot added the bugzilla/valid-bug label (Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.) Feb 8, 2022

openshift-ci bot removed the bugzilla/invalid-bug label (Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting.) Feb 8, 2022

openshift-ci bot requested a review from anuragthehatter February 8, 2022 14:29

openshift-ci bot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) Feb 10, 2022

flavio-fernandes closed this Mar 29, 2022

flavio-fernandes deleted the fatal_on_rm_stale_4.10 branch March 29, 2022 13:54
