Bug 2082599: add upper bound to number of retries #2970
Conversation
PTAL @trozet, @msherif1234. Thanks!
Overall looks good. The threshold is a bit too high IMHO: the timer fires every 30 seconds, so that means we keep retrying for ~7 minutes?
Left a few comments. Thanks!!
go-controller/pkg/ovn/obj_retry.go
Outdated
```go
if !isResourceScheduled(r.oType, entry.oldObj) {
	klog.V(5).Infof("Retry: %s %s not scheduled", r.oType, objKey)
	entry.failedRetries++
```
Why increment here? We aren't scheduled for retry yet, right?
Yeah, I wasn't 100% sure here. Can a pod stay unscheduled forever? Should we keep retrying forever in that case?
I'm lacking hands-on experience with what happens to these pods that fail to be added/deleted...
@trozet any input? :)
Sure, a pod can never be scheduled, e.g. if it has a node selector that isn't applicable (either the selector doesn't apply to any nodes, or the nodes it does apply to are unready). In either case, once the pod is scheduled we would get a pod update event. I think we can either try a number of times and give up, or just ignore pods that are not scheduled and not add them to retry. Either is fine.
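For illustration, here is a minimal, self-contained sketch of the first option (try a bounded number of times, then give up). `maxFailedAttempts` and the `retryEntry` type are assumed names for this sketch, not the exact ones in obj_retry.go:

```go
package main

import "fmt"

// maxFailedAttempts is a hypothetical cap; the right value is exactly
// what this review discusses (e.g. 15 retries x the 30 s timer ~= 7.5 min).
const maxFailedAttempts = 15

// retryEntry is a simplified stand-in for the retry-cache entry.
type retryEntry struct {
	failedAttempts uint8
}

// shouldRetry reports whether another attempt is allowed; once the cap
// is reached the entry is dropped, and a later add/update event would
// re-enqueue the object with a fresh counter.
func shouldRetry(e *retryEntry) bool {
	return e.failedAttempts < maxFailedAttempts
}

func main() {
	e := &retryEntry{}
	for shouldRetry(e) {
		// Pretend the add/update failed again.
		e.failedAttempts++
	}
	fmt.Printf("gave up after %d failed attempts\n", e.failedAttempts)
}
```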
go-controller/pkg/ovn/obj_retry.go
Outdated
```go
if !isResourceScheduled(r.oType, entry.newObj) {
	klog.V(5).Infof("Retry: %s %s not scheduled", r.oType, objKey)
	entry.failedRetries++
```
same here
Let's also discuss the value for the maximum number of retries, as @msherif1234 suggested. For how long do we care about an object?
overall lgtm
@ricky-rav let me know if you want to update anything else before merge
@trozet thanks! I've updated the two comments you pointed out; I think we're good now.
Force-pushed from 15f4c04 to b331efc
I've just rebased. PR should be ready for merging once CI is green.
/lgtm
I reworked the failed-retry counter a bit: we now take into account all failed attempts to add/update/delete an object (not just failed retries), and we initialize the counter to 0 every time a new add/update/delete event comes in, which was missing in my initial commit. @msherif1234 @trozet PTAL
```go
entry.timeStamp = time.Now()
entry.failedAttempts++
```
Here you increment the counter directly, while below you use the new method, which takes a lock. Do we really need to lock in the new method? And if we do, then we'll probably also need a locked read method to use in iterate when comparing against the max limit?
It happens at the very beginning of iterateRetryResources:

```go
func (oc *Controller) iterateRetryResources(r *retryObjs, updateAll bool) {
	r.retryMutex.Lock()
	defer r.retryMutex.Unlock()
	now := time.Now()
```

To be honest, I don't have a very strong opinion about needing to acquire a lock when incrementing the counter...
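To make the locking question concrete, here is a compact sketch (with simplified, assumed names, not the exact obj_retry.go API) of the pattern being described: the iteration takes retryMutex once for the whole pass, so entries can be mutated directly inside it, while a separate locked helper covers call sites that don't already hold the mutex:

```go
package retry

import (
	"sync"
	"time"
)

type retryObjEntry struct {
	timeStamp      time.Time
	failedAttempts uint8
}

type retryObjs struct {
	retryMutex sync.Mutex
	entries    map[string]*retryObjEntry
}

// increaseFailedAttempts is the locked helper for call sites that do
// not already hold retryMutex.
func (r *retryObjs) increaseFailedAttempts(key string) {
	r.retryMutex.Lock()
	defer r.retryMutex.Unlock()
	if e, ok := r.entries[key]; ok {
		e.failedAttempts++
	}
}

// iterateRetryResources holds retryMutex for the entire pass, so the
// direct increment inside the loop needs no extra locking.
func (r *retryObjs) iterateRetryResources() {
	r.retryMutex.Lock()
	defer r.retryMutex.Unlock()
	for _, e := range r.entries {
		// ...retry the add/update/delete here; on failure:
		e.timeStamp = time.Now()
		e.failedAttempts++
	}
}
```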
The retry logic should not attempt to add/update/delete an object indefinitely. Add an upper bound to the number of times we can attempt to add/update/delete a given object, as we already do in level-driven controllers.

Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
/retest-failed
The retry logic should not attempt to add or delete an object indefinitely. Add an upper bound to the number of times we can attempt to add or delete a given object, as we already do in level-driven controllers.
Fixes #2082599
Signed-off-by: Riccardo Ravaioli rravaiol@redhat.com