
Bug 2082599: add upper bound to number of retries #2970

Merged
merged 1 commit into ovn-org:master from ricky-rav:maxretry on Jun 14, 2022

Conversation


@ricky-rav ricky-rav commented May 6, 2022

The retry logic should not attempt to add or delete an object indefinitely. This adds an upper bound on the number of times we can attempt to add or delete a given object, as we already do in level-driven controllers.

fixes #2082599

Signed-off-by: Riccardo Ravaioli rravaiol@redhat.com
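
For context, the shape of the change is roughly the following (a minimal sketch, not the actual diff: the failedAttempts field matches the hunks quoted in the review below, while the constant name, its value, and the surrounding types are illustrative):

package main

import (
	"fmt"
	"time"
)

// maxFailedAttempts is an illustrative upper bound; the real value is
// discussed in the review below.
const maxFailedAttempts = 15

// retryObjEntry is a simplified stand-in for the per-object retry state.
type retryObjEntry struct {
	timeStamp      time.Time
	failedAttempts uint8
}

func main() {
	entry := &retryObjEntry{timeStamp: time.Now()}
	for entry.failedAttempts < maxFailedAttempts {
		// ... attempt the add/delete here; on failure:
		entry.failedAttempts++
	}
	fmt.Printf("dropping object from retry cache after %d failed attempts\n",
		entry.failedAttempts)
}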

@ricky-rav ricky-rav changed the title add upper bound to number of retries Bug 2082599: add upper bound to number of retries May 6, 2022

coveralls commented May 6, 2022

Coverage Status

Coverage increased (+0.09%) to 52.043% when pulling 1d5486c on ricky-rav:maxretry into fc66d17 on ovn-org:master.

@ricky-rav (Contributor, Author)

PTAL @trozet , @msherif1234 . Thanks!


@msherif1234 msherif1234 left a comment


Overall looks good. The threshold is a bit too high IMHO: the timer fires every 30 sec, so that means we keep retrying for ~7 mins?
Left a few comments.

Thanks!!
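
For concreteness, the arithmetic behind the "~7 mins" above (the 30-second timer period is from the comment; the attempt limit of 15 is an assumed value for illustration):

package main

import (
	"fmt"
	"time"
)

func main() {
	retryInterval := 30 * time.Second // retry timer period mentioned above
	maxAttempts := 15                 // assumed threshold under discussion
	fmt.Println(time.Duration(maxAttempts) * retryInterval) // 7m30s
}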

(Three review threads on go-controller/pkg/ovn/obj_retry.go: outdated, resolved.)
if !isResourceScheduled(r.oType, entry.oldObj) {
klog.V(5).Infof("Retry: %s %s not scheduled", r.oType, objKey)
entry.failedRetries++
@msherif1234 (Contributor)

Why increment here? We aren't scheduled for retry yet, right?

@ricky-rav (Contributor, Author)

Yeah, I wasn't 100% sure here. Can a pod never be scheduled? Should we keep retrying forever in that case?
I lack hands-on experience with what happens to these pods that fail to be added/deleted...

@trozet any input? :)

@trozet (Contributor)

Sure, a pod can never be scheduled, e.g. if it has a node selector that isn't applicable (either the selector doesn't apply to any nodes, or the nodes it does apply to are unready). In either case, once the pod is scheduled we would get a pod update event. I think we can either try a number of times and give up, or just ignore pods that are not scheduled and not add them to retry. Either is fine.
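
For pods, the "scheduled" check boils down to whether the scheduler has assigned a node. A minimal sketch of what an isResourceScheduled-style helper checks in the pod case (assuming scheduling is signalled via Spec.NodeName, which the scheduler sets when it binds the pod):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// isPodScheduled sketches the pod branch of an isResourceScheduled-style
// helper: Spec.NodeName stays empty until the scheduler binds the pod.
func isPodScheduled(pod *corev1.Pod) bool {
	return pod.Spec.NodeName != ""
}

func main() {
	fmt.Println(isPodScheduled(&corev1.Pod{})) // false: not yet bound
}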

if !isResourceScheduled(r.oType, entry.newObj) {
klog.V(5).Infof("Retry: %s %s not scheduled", r.oType, objKey)
entry.failedRetries++
@msherif1234 (Contributor)

same here

(Further review threads on go-controller/pkg/ovn/obj_retry.go and go-controller/pkg/ovn/pods_test.go: outdated, resolved.)
@ricky-rav (Contributor, Author)

Let's also discuss the value for the maximum number of retries, as @msherif1234 suggested. For how long do we care about an object?


@trozet trozet left a comment


overall lgtm

@ricky-rav let me know if you want to update anything else before merge

(Two review threads on go-controller/pkg/ovn/obj_retry.go, resolved; the second outdated.)
@ricky-rav (Contributor, Author)

@trozet thanks! I've updated the two comments you pointed at; I think we're good now.

@ricky-rav ricky-rav force-pushed the maxretry branch 2 times, most recently from 15f4c04 to b331efc on May 12, 2022 at 19:03
@ricky-rav (Contributor, Author)

I've just rebased. The PR should be ready to merge once CI is green.

@msherif1234 (Contributor)

/lgtm


ricky-rav commented May 23, 2022

I reworked the failed retry counter a bit, so that we now take into account all failed attempts to add/update/delete an object (not just failed retries), and we initialize the counter to 0 every time a new add/update/delete event comes in, which was missing in my initial commit.

@msherif1234 @trozet PTAL
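
In other words, the reworked behavior looks roughly like this (a sketch: the retryObjs/retryMutex names come from the snippet quoted later in this thread, while initRetryObjWithAdd and the map layout are illustrative):

package main

import (
	"sync"
	"time"
)

type retryObjEntry struct {
	newObj         interface{}
	timeStamp      time.Time
	failedAttempts uint8
}

type retryObjs struct {
	retryMutex sync.Mutex
	entries    map[string]*retryObjEntry
}

// When a fresh add/update/delete event arrives for a key, the entry is
// (re)created with a zeroed counter; every failed attempt, whether the
// initial one or a retry, then increments failedAttempts.
func (r *retryObjs) initRetryObjWithAdd(obj interface{}, key string) {
	r.retryMutex.Lock()
	defer r.retryMutex.Unlock()
	r.entries[key] = &retryObjEntry{newObj: obj, timeStamp: time.Now()}
}

func main() {
	r := &retryObjs{entries: map[string]*retryObjEntry{}}
	r.initRetryObjWithAdd(struct{}{}, "default/mypod")
}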

entry.timeStamp = time.Now()
entry.failedAttempts++
@msherif1234 (Contributor)

Here you increment the counter directly, while below you use the new method, which uses locking. Do we really need to lock in the new method? And if we do, then we'll probably also need a locked read method to use in iterate when comparing against the max limit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It happens at the very beginning of iterateRetryResources:

func (oc *Controller) iterateRetryResources(r *retryObjs, updateAll bool) {
	r.retryMutex.Lock()
	defer r.retryMutex.Unlock()
	now := time.Now()

To be honest, I don't have a very strong opinion about needing to acquire a lock when incrementing the counter...
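
For reference, the two patterns being compared look like this (a sketch: retryObjs/retryMutex come from the snippet above, while increaseFailedAttempts is an illustrative name for the locking helper):

package main

import "sync"

type retryObjEntry struct{ failedAttempts uint8 }

type retryObjs struct {
	retryMutex sync.Mutex
	entries    map[string]*retryObjEntry
}

// Pattern 1: a helper that takes the lock itself; safe to call from code
// paths that do not already hold retryMutex.
func (r *retryObjs) increaseFailedAttempts(key string) {
	r.retryMutex.Lock()
	defer r.retryMutex.Unlock()
	if entry, ok := r.entries[key]; ok {
		entry.failedAttempts++
	}
}

// Pattern 2: inside iterateRetryResources the mutex is already held for
// the whole iteration (see the snippet above), so a bare
// entry.failedAttempts++ is safe there; calling the locking helper from
// that path would self-deadlock, since sync.Mutex is not reentrant.

func main() {
	r := &retryObjs{entries: map[string]*retryObjEntry{"k": {}}}
	r.increaseFailedAttempts("k")
}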

The retry logic should not attempt to add/update/delete an object indefinitely. This adds an upper bound on the number of times we can attempt to add/update/delete a given object, as we already do in level-driven controllers.

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
@ricky-rav (Contributor, Author)

/retest-failed

@trozet trozet merged commit afe4007 into ovn-org:master Jun 14, 2022