[Issue #325] Added retry count to abort workitem after a few failed retries #334
Conversation
Hi @nguyenkndinh. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
}

const (
	maxRequeuingCount = 10
Do we know what this translates into in terms of retry duration?
We are using the DefaultControllerRateLimiter, which uses the worst case of the 2 limiters:
- ItemExponentialFailureRateLimiter, with a base delay of 5 milliseconds and max delay of 1000 seconds
- BucketRateLimiter, with a limit of 10, and 100 burst.
Setting aside the BucketRateLimiter, since that's a different beast, the work items would have to wait anywhere from 50 milliseconds to 10000 seconds in total retry delay (excluding work time and queue wait time, which can differ). The real average is more complicated, though: it depends on the number of items in the queue, how many times an item has been requeued, and which rate limiter was picked, since the delay for re-adding an item is whichever of BucketRateLimiter or ItemExponentialFailureRateLimiter gives the worst case. That said, I wouldn't expect the worst case of 10000 seconds to ever happen, though we'll need metrics data to fine-tune this as we go.
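To get a concrete feel for those numbers, here is a small standalone sketch (not code from this PR) that builds the same combination of limiters as DefaultControllerRateLimiter and prints the per-attempt backoff for a single item:

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Same composition as workqueue.DefaultControllerRateLimiter(): the worst
	// case of a per-item exponential limiter (5ms base, 1000s cap) and a
	// token-bucket limiter (10 qps, 100 burst).
	limiter := workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)

	item := "example-work-item" // hypothetical item key
	var total time.Duration
	for attempt := 1; attempt <= 10; attempt++ {
		// The exponential part doubles on each failure: 5ms, 10ms, ..., ~2.56s.
		delay := limiter.When(item)
		total += delay
		fmt.Printf("attempt %d: backoff %v\n", attempt, delay)
	}
	fmt.Printf("cumulative backoff over 10 attempts: %v\n", total)
}
```

For a fresh item this prints a cumulative backoff of a few seconds; the 10000-second figure is the ceiling reached only once an item's per-attempt delay has grown to the 1000-second cap.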
Can you squash your commits?
/ok-to-test
/hold
Force-pushed from 6d70a64 to 030a5b6
pkg/controllers/tagging/metrics.go (outdated diff)
func recordWorkitemMetrics(actionName string, timeTaken float64, err error) {
	if err != nil {
		workItemError.With(metrics.Labels{"workqueue": actionName}).Inc()
	}

	workItemLatency.With(metrics.Labels{"workqueue": actionName}).Observe(timeTaken)
}
Should we break this down into separate functions for count and latency metrics? It seems like, at least for error counts, we're going to have zero latency all the time, which doesn't seem desirable?
done
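For context, the split that "done" refers to might look roughly like the sketch below; the error metric name and the shared workqueue label are assumptions here (only the latency metric name appears in these diffs), and the real definitions are refined further down this thread.

```go
package tagging

import "k8s.io/component-base/metrics"

// Abbreviated metric definitions so the helpers below compile on their own;
// "tagging_workitem_error" is an assumed name.
var (
	workItemLatency = metrics.NewHistogramVec(
		&metrics.HistogramOpts{Name: "tagging_workitem_latency"},
		[]string{"workqueue"})

	workItemError = metrics.NewCounterVec(
		&metrics.CounterOpts{Name: "tagging_workitem_error"},
		[]string{"workqueue"})
)

// One helper per metric: errors only bump the counter and successes only record
// latency, so failed items no longer produce zero-latency observations.
func recordWorkItemErrorMetrics(actionName string) {
	workItemError.With(metrics.Labels{"workqueue": actionName}).Inc()
}

func recordWorkItemLatencyMetrics(actionName string, timeTaken float64) {
	workItemLatency.With(metrics.Labels{"workqueue": actionName}).Observe(timeTaken)
}
```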
pkg/controllers/tagging/metrics.go (outdated diff)
func recordWorkitemMetrics(actionName string, timeTaken float64, err error) {
	if err != nil {
		workItemError.With(metrics.Labels{"workqueue": actionName}).Inc()
Have a question about what the dimensions for this metric are going to look like - are we able to query this metric using actionName as well? We seem to be passing strings with node names wherever we're calling this function, so we need to understand whether that makes sense or whether we need actionName to be somewhat static.
That's a great callout. Since I based this change on aws_metrics.go, I assumed the dimensions for this would be similar. I'll check in CloudWatch and see how they are being produced. However, I think you're right, as the usage in aws.go is also somewhat static.
pkg/controllers/tagging/metrics.go (outdated diff)
var (
	workItemLatency = metrics.NewHistogramVec(
		&metrics.HistogramOpts{
			Name: "tagging_workitem_latency",
Have been looking at https://pkg.go.dev/github.com/prometheus/client_golang/prometheus for more info on how Prometheus handles these metrics, and it looks like this is the metric name, while the labels applied in the functions below are additional dimensions. Keeping that in mind, can we rename these to cloudprovider_tagging_controller_work_item_latency and cloudprovider_tagging_controller_work_item_error instead, to be more specific and to follow a similar pattern as the other metrics in this package?
}

klog.Errorf("error processing work item '%v': %s, requeuing count exceeded", workItem, err.Error())
recordWorkItemErrorMetrics("process_work_item")
I think we may want to track both total errors and errors after all retries are exhausted. Can you please emit two labels for this: total_errors and errors_after_retries_exhausted? You can emit total_errors inside the if above and errors_after_retries_exhausted outside it.
do you mean to count the number of total errors inside the if with a counter and the number of errors outside the if?
I think I understand it now, but please correct me if I'm wrong: it is simply to add recordWorkItemErrorMetrics(TotalErrors) in the if and recordWorkItemErrorMetrics(ErrorsAfterRetriesExhausted) outside the if, so that we can aggregate them later.
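A sketch of that arrangement, with the retry check, constant values, and the workItem type being illustrative rather than the exact code in this PR:

```go
package tagging

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
	"k8s.io/klog/v2"
)

const (
	maxRequeuingCount           = 9
	totalErrors                 = "total_errors"
	errorsAfterRetriesExhausted = "errors_after_retries_exhausted"
)

// workItem stands in for the controller's real work item type.
type workItem struct {
	name string
}

// recordWorkItemErrorMetrics is the helper from metrics.go, stubbed here so the
// sketch compiles on its own.
func recordWorkItemErrorMetrics(errorType string) {}

// handleProcessingError requeues a failed item while the retry budget lasts,
// emitting total_errors inside the retry branch and
// errors_after_retries_exhausted once the item is dropped.
func handleProcessingError(queue workqueue.RateLimitingInterface, item workItem, err error) error {
	if err == nil {
		queue.Forget(item)
		return nil
	}

	if queue.NumRequeues(item) < maxRequeuingCount {
		// Inside the if: the failure will be retried, count it as a total error.
		recordWorkItemErrorMetrics(totalErrors)
		queue.AddRateLimited(item)
		return fmt.Errorf("error processing work item %q, requeuing: %w", item.name, err)
	}

	// Outside the if: retries are exhausted, count it separately and drop the item.
	recordWorkItemErrorMetrics(errorsAfterRetriesExhausted)
	klog.Errorf("error processing work item %q: %s, requeuing count exceeded", item.name, err.Error())
	queue.Forget(item)
	return nil
}
```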
pkg/controllers/tagging/metrics.go (outdated diff)
}

func recordWorkItemErrorMetrics(actionName string) {
	workItemError.With(metrics.Labels{"workqueue": actionName}).Inc()
Can we change the label name to error_type instead and change the param name accordingly?
pkg/controllers/tagging/metrics.go (outdated diff)
}

func recordWorkItemLatencyMetrics(actionName string, timeTaken float64) {
	workItemLatency.With(metrics.Labels{"workqueue": actionName}).Observe(timeTaken)
Same for this. Can we change this to latency_type instead and change the param name accordingly?
} else {
	klog.Infof("Finished processing %v", workItem)
	timeTaken = time.Since(workItem.enqueueTime).Seconds()
	recordWorkItemLatencyMetrics("process_work_item", timeTaken)
Can we change this to work_item_processing_time?
	return nil
}

timeTaken := time.Since(workItem.enqueueTime).Seconds()
recordWorkItemLatencyMetrics("dequeue_work_item", timeTaken)
Can we change this to work_item_dequeuing_time?
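Taken together with the work_item_processing_time rename above, the two latency call sites could end up looking like this sketch (the wrapper functions and the stubbed helper are illustrative, not code from the PR):

```go
package tagging

import "time"

// Illustrative values for the latency_type label discussed above.
const (
	workItemProcessingTime = "work_item_processing_time"
	workItemDequeuingTime  = "work_item_dequeuing_time"
)

// recordWorkItemLatencyMetrics is the helper from metrics.go, stubbed here so
// the sketch compiles on its own.
func recordWorkItemLatencyMetrics(latencyType string, timeTaken float64) {}

// Measured from enqueue until processing finished.
func observeProcessed(enqueueTime time.Time) {
	recordWorkItemLatencyMetrics(workItemProcessingTime, time.Since(enqueueTime).Seconds())
}

// Measured from enqueue until the item is picked off the queue.
func observeDequeued(enqueueTime time.Time) {
	recordWorkItemLatencyMetrics(workItemDequeuingTime, time.Since(enqueueTime).Seconds())
}
```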
-	utilruntime.HandleError(fmt.Errorf("expected workItem in workqueue but got %#v", obj))
+	err := fmt.Errorf("expected workItem in workqueue but got %#v", obj)
+	utilruntime.HandleError(err)
+	recordWorkItemErrorMetrics("dequeue_work_item")
I don't think we need to emit a latency metric here since this means that the workitem is invalid, right?
pkg/controllers/tagging/metrics.go (outdated diff)
	workItemLatency.With(metrics.Labels{"workqueue": actionName}).Observe(timeTaken)
}

func recordWorkItemErrorMetrics(actionName string) {
Can we add a label for instance_id in this function and take in the appropriate params? Might help to see if there are certain instances running into errors.
Looks good to me.
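To illustrate the instance_id suggestion above, a sketch of the error counter and helper carrying both dimensions; the Help text and the exact signature are assumptions rather than the final code:

```go
package tagging

import "k8s.io/component-base/metrics"

// Error counter with both an error_type and an instance_id dimension, so errors
// can be broken down per instance when querying.
var workItemError = metrics.NewCounterVec(
	&metrics.CounterOpts{
		Name: "cloudprovider_aws_tagging_controller_work_item_errors_total",
		Help: "any error while dequeuing or processing tagging work items",
	},
	[]string{"error_type", "instance_id"})

// recordWorkItemErrorMetrics increments the counter for one error type and the
// instance the work item refers to.
func recordWorkItemErrorMetrics(errorType string, instanceID string) {
	workItemError.With(metrics.Labels{
		"error_type":  errorType,
		"instance_id": instanceID,
	}).Inc()
}
```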
pkg/controllers/tagging/metrics.go (outdated diff)
var (
	workItemDuration = metrics.NewHistogramVec(
		&metrics.HistogramOpts{
			Name: "cloudprovider_tagging_controller_work_item_duration_seconds",
One more thing for this metric name: to match the other metrics already present in aws_metrics.go we should start with the prefix cloudprovider_aws, so the metric should be cloudprovider_aws_tagging_controller_work_item_duration_seconds. Then other controllers can follow the same pattern of cloudprovider_aws_controller_name_metric_name.
pkg/controllers/tagging/metrics.go (outdated diff)
	workItemError = metrics.NewCounterVec(
		&metrics.CounterOpts{
			Name: "cloudprovider_tagging_controller_work_item_errors_total",
Name: "cloudprovider_tagging_controller_work_item_errors_total", | |
Name: "cloudprovider_aws_tagging_controller_work_item_errors_total", |
}

const (
	maxRequeuingCount = 9
	totalErrors       = "total_errors"
Since some of these are metric labels, I think it makes more sense to define them alongside the metrics they are used with. And when you do that, you can add a comment describing what each label can be used for when querying.
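A sketch of how that could look in metrics.go, with each set of label values documented next to the metric that uses it; the Help strings and the registration helper are illustrative assumptions:

```go
package tagging

import (
	"sync"

	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

const (
	// Values for the error_type label on workItemError; useful for querying
	// retried failures separately from items that were dropped.
	totalErrors                 = "total_errors"
	errorsAfterRetriesExhausted = "errors_after_retries_exhausted"

	// Values for the latency_type label on workItemDuration.
	workItemProcessingTime = "work_item_processing_time"
	workItemDequeuingTime  = "work_item_dequeuing_time"
)

var (
	workItemDuration = metrics.NewHistogramVec(
		&metrics.HistogramOpts{
			Name: "cloudprovider_aws_tagging_controller_work_item_duration_seconds",
			Help: "time spent dequeuing and processing tagging work items",
		},
		[]string{"latency_type"})

	workItemError = metrics.NewCounterVec(
		&metrics.CounterOpts{
			Name: "cloudprovider_aws_tagging_controller_work_item_errors_total",
			Help: "errors while dequeuing or processing tagging work items",
		},
		[]string{"error_type", "instance_id"})
)

var registerOnce sync.Once

// registerMetrics registers both collectors with the legacy registry exactly once.
func registerMetrics() {
	registerOnce.Do(func() {
		legacyregistry.MustRegister(workItemDuration)
		legacyregistry.MustRegister(workItemError)
	})
}
```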
/lgtm
/unhold
/kind feature
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: nckturner, nguyenkndinh, saurav-agarwalla. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
What type of PR is this?
/kind feature
What this PR does / why we need it:
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
NONE