
[Issue #325] Added retry count to abort workitem after a few failed retries #334

Merged: 1 commit into kubernetes:master on May 6, 2022

Conversation

@nguyenkndinh nguyenkndinh (Contributor) commented Apr 5, 2022

What type of PR is this?
/kind feature

What this PR does / why we need it:

  • Adds logic to abort a work item after a certain number of retries when EC2 responds with non-retryable errors (see the sketch below).
  • Adds metrics collection to the controller.
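
The core of the change is a bounded-retry check in the work item handler. Below is a minimal sketch of the pattern, assuming a standard client-go rate-limited workqueue; the handleWorkItem function, its signature, and the maxRequeuingCount value are illustrative, not the exact controller code:

package tagging

import (
	"fmt"

	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/client-go/util/workqueue"
)

// maxRequeuingCount is a hypothetical bound; the PR introduces a constant like this.
const maxRequeuingCount = 10

// handleWorkItem sketches the retry-abort pattern: requeue with backoff while
// the item is under its retry budget, otherwise drop it instead of retrying forever.
func handleWorkItem(queue workqueue.RateLimitingInterface, item interface{}, process func(interface{}) error) error {
	err := process(item)
	if err == nil {
		// Success: clear the rate-limiting history for this item.
		queue.Forget(item)
		return nil
	}
	if queue.NumRequeues(item) < maxRequeuingCount {
		// Still within the retry budget: requeue with rate-limited backoff.
		queue.AddRateLimited(item)
		return fmt.Errorf("error processing work item %v, requeuing: %w", item, err)
	}
	// Retry budget exhausted (for example, a non-retryable EC2 error): give up on the item.
	queue.Forget(item)
	utilruntime.HandleError(fmt.Errorf("giving up on work item %v after %d retries: %w", item, maxRequeuingCount, err))
	return nil
}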

Special notes for your reviewer:

Does this PR introduce a user-facing change?:
NONE

@k8s-ci-robot added the do-not-merge/release-note-label-needed, cncf-cla: yes, needs-kind, and needs-triage labels on Apr 5, 2022
@k8s-ci-robot

Hi @nguyenkndinh. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-ok-to-test and size/M labels on Apr 5, 2022
@k8s-ci-robot added the release-note-none label and removed the do-not-merge/release-note-label-needed label on Apr 5, 2022
}

const (
	maxRequeuingCount = 10

Contributor

Do we know what this translates into in terms of retry duration?

Contributor Author
@nguyenkndinh nguyenkndinh Apr 7, 2022

We are using the DefaultControllerRateLimiter, which takes the worst case of its two limiters:

Setting aside the BucketRateLimiter (that's a different beast), a work item could wait anywhere from roughly 50 milliseconds to 10000 seconds in cumulative retry time (excluding work time and queue wait time, which vary). The real average is more complicated: it depends on how many items are in the queue, how many times a given item has been requeued, and which rate limiter ends up dominating. When an item is re-added to the queue, the delay is whichever of BucketRateLimiter or ItemExponentialFailureRateLimiter is the worst case at that moment. I wouldn't expect the 10000-second worst case to ever happen in practice, though we'll need metrics data to fine-tune this as we go.
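
For reference, client-go's default rate limiter combines a per-item exponential backoff with an overall token bucket and takes the worse of the two. Paraphrased from k8s.io/client-go/util/workqueue (exact values may differ between versions):

package workqueue

import (
	"time"

	"golang.org/x/time/rate"
)

// DefaultControllerRateLimiter (paraphrased) returns the max of two limiters.
func DefaultControllerRateLimiter() RateLimiter {
	return NewMaxOfRateLimiter(
		// Per-item exponential backoff: 5ms base delay, capped at 1000s per retry.
		NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		// Overall limiter: 10 qps with a burst of 100, shared across all items.
		&BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
}

With maxRequeuingCount = 10 and the 1000-second per-retry cap, ten back-to-back worst-case delays give the roughly 10000-second upper bound mentioned above.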

pkg/controllers/tagging/tagging_controller.go (resolved)
@k8s-ci-robot added the size/L label and removed the size/M label on Apr 7, 2022
@nckturner
Contributor

Can you squash your commits?

@nckturner
Contributor

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label on Apr 9, 2022
@nguyenkndinh
Contributor Author

/hold

@k8s-ci-robot added the do-not-merge/hold label on Apr 9, 2022
@nguyenkndinh force-pushed the master branch 2 times, most recently from 6d70a64 to 030a5b6 on April 9, 2022 at 00:27

Comment on lines 50 to 56:

func recordWorkitemMetrics(actionName string, timeTaken float64, err error) {
	if err != nil {
		workItemError.With(metrics.Labels{"workqueue": actionName}).Inc()
	}

	workItemLatency.With(metrics.Labels{"workqueue": actionName}).Observe(timeTaken)
}

Contributor

Should we break this down into separate functions for count and latency metrics? Seems like at least for error counts, we're going to have zero latency all the time which doesn't seem desirable?

Contributor Author

done


func recordWorkitemMetrics(actionName string, timeTaken float64, err error) {
	if err != nil {
		workItemError.With(metrics.Labels{"workqueue": actionName}).Inc()

Contributor

I have a question about what the dimensions for this metric are going to look like: are we able to query this metric using actionName as well? We seem to be passing strings with node names wherever we call this function, so I need to understand whether that makes sense or whether actionName needs to be somewhat static.

Contributor Author

That's a great call out. Since I based this change on aws_metrics.go, I assumed the dimensions would be similar. I'll check in CloudWatch to see how they are being produced. That said, I think you're right: the usage in aws.go is also somewhat static.

var (
	workItemLatency = metrics.NewHistogramVec(
		&metrics.HistogramOpts{
			Name: "tagging_workitem_latency",

Contributor

I've been looking at https://pkg.go.dev/github.com/prometheus/client_golang/prometheus for more info on how Prometheus handles these metrics, and it looks like this is the metric name, while the labels applied in the functions below are additional dimensions.

Keeping that in mind, can we rename these to cloudprovider_tagging_controller_work_item_latency and cloudprovider_tagging_controller_work_item_error instead, to be more specific and to follow a similar pattern to the other metrics in this package?
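
A sketch of what that renaming could look like, keeping the same style as the diff above; the Help strings are illustrative, and the import path is assumed to be k8s.io/component-base/metrics (which the controller appears to alias as metrics):

package tagging

import "k8s.io/component-base/metrics"

var (
	// Name is the Prometheus metric; the "workqueue" label below becomes an
	// additional dimension attached when the metric is recorded.
	workItemLatency = metrics.NewHistogramVec(
		&metrics.HistogramOpts{
			Name: "cloudprovider_tagging_controller_work_item_latency",
			Help: "Latency of tagging controller work items.",
		},
		[]string{"workqueue"})

	workItemError = metrics.NewCounterVec(
		&metrics.CounterOpts{
			Name: "cloudprovider_tagging_controller_work_item_error",
			Help: "Errors seen while processing tagging controller work items.",
		},
		[]string{"workqueue"})
)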

}

klog.Errorf("error processing work item '%v': %s, requeuing count exceeded", workItem, err.Error())
recordWorkItemErrorMetrics("process_work_item")

Contributor

I think we may want to track both total errors and errors after all retries are exhausted. Can you please emit two labels for this: total_errors and errors_after_retries_exhausted? You can emit total_errors inside the if above and errors_after_retries_exhausted outside it.

Contributor Author

Do you mean to count the number of total errors inside the if with a counter, and the number of errors outside the if?

Contributor Author
@nguyenkndinh nguyenkndinh May 4, 2022

I think I understand it now, but please correct me if I'm wrong: it's simply a matter of adding recordWorkItemErrorMetrics(TotalErrors) inside the if and recordWorkItemErrorMetrics(ErrorsAfterRetriesExhausted) outside it, so that we can aggregate them later.
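
A sketch of that flow, assuming the maxRequeuingCount constant and recordWorkItemErrorMetrics helper from the surrounding diff hunks, and klog for logging; handleProcessError and the exact wiring are illustrative, not the controller's actual function:

package tagging

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
	"k8s.io/klog/v2"
)

const (
	totalErrors                 = "total_errors"
	errorsAfterRetriesExhausted = "errors_after_retries_exhausted"
)

func handleProcessError(queue workqueue.RateLimitingInterface, workItem interface{}, err error) error {
	if queue.NumRequeues(workItem) < maxRequeuingCount {
		// Every failed attempt counts toward total_errors.
		recordWorkItemErrorMetrics(totalErrors)
		queue.AddRateLimited(workItem)
		return fmt.Errorf("error processing work item '%v': %v, requeuing", workItem, err)
	}
	// Reaching this point means the retry budget is exhausted.
	recordWorkItemErrorMetrics(errorsAfterRetriesExhausted)
	klog.Errorf("error processing work item '%v': %s, requeuing count exceeded", workItem, err.Error())
	queue.Forget(workItem)
	return nil
}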

}

func recordWorkItemErrorMetrics(actionName string) {
	workItemError.With(metrics.Labels{"workqueue": actionName}).Inc()

Contributor

Can we change the label name to error_type instead and change the param name accordingly?

}

func recordWorkItemLatencyMetrics(actionName string, timeTaken float64) {
	workItemLatency.With(metrics.Labels{"workqueue": actionName}).Observe(timeTaken)

Contributor

Same for this. Can we change this to latency_type instead and change the param name accordingly?

} else {
	klog.Infof("Finished processing %v", workItem)
	timeTaken = time.Since(workItem.enqueueTime).Seconds()
	recordWorkItemLatencyMetrics("process_work_item", timeTaken)

Contributor

Can we change this to work_item_processing_time?

	return nil
}

timeTaken := time.Since(workItem.enqueueTime).Seconds()
recordWorkItemLatencyMetrics("dequeue_work_item", timeTaken)

Contributor

Can we change this to work_item_dequeuing_time?

- utilruntime.HandleError(fmt.Errorf("expected workItem in workqueue but got %#v", obj))
+ err := fmt.Errorf("expected workItem in workqueue but got %#v", obj)
+ utilruntime.HandleError(err)
+ recordWorkItemErrorMetrics("dequeue_work_item")

Contributor

I don't think we need to emit a latency metric here since this means that the workitem is invalid, right?

	workItemLatency.With(metrics.Labels{"workqueue": actionName}).Observe(timeTaken)
}

func recordWorkItemErrorMetrics(actionName string) {

Contributor

Can we add a label for instance_id in this function and take in the appropriate params? Might help to see if there are certain instances running into errors.
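
A sketch of that suggestion, assuming workItemError is declared with both error_type and instance_id label names; the parameter names are illustrative and this revises the helper shown in the hunk above:

// recordWorkItemErrorMetrics with an additional instance-level dimension, so
// errors can be sliced per instance when querying.
func recordWorkItemErrorMetrics(errorType string, instanceID string) {
	workItemError.With(metrics.Labels{
		"error_type":  errorType,
		"instance_id": instanceID,
	}).Inc()
}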

Contributor
@saurav-agarwalla saurav-agarwalla left a comment

Looks good to me.

pkg/controllers/tagging/metrics.go (outdated, resolved)
pkg/controllers/tagging/tagging_controller.go (resolved)
pkg/controllers/tagging/tagging_controller.go (outdated, resolved)

var (
	workItemDuration = metrics.NewHistogramVec(
		&metrics.HistogramOpts{
			Name: "cloudprovider_tagging_controller_work_item_duration_seconds",

Contributor

One more thing for this metric name: to match the other metrics already present in aws_metrics.go, we should start with the prefix cloudprovider_aws, so this metric should be cloudprovider_aws_tagging_controller_work_item_duration_seconds. Other controllers can then follow the same pattern of cloudprovider_aws_<controller_name>_<metric_name>.


	workItemError = metrics.NewCounterVec(
		&metrics.CounterOpts{
			Name: "cloudprovider_tagging_controller_work_item_errors_total",

Contributor

Suggested change:
- Name: "cloudprovider_tagging_controller_work_item_errors_total",
+ Name: "cloudprovider_aws_tagging_controller_work_item_errors_total",

}

const (
	maxRequeuingCount = 9
	totalErrors       = "total_errors"

Contributor

Since some of these are metric labels, I think it makes more sense to define them alongside the metrics they will be used with. When you do that, you can also add a comment explaining what each one can be used for when querying.
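
Pulling together the naming and label suggestions above, the metrics file could end up looking roughly like this. This is a sketch only: the Help strings, the sync.Once registration, and the exact label sets are assumptions, and pkg/controllers/tagging/metrics.go in the merged PR is the authoritative version.

package tagging

import (
	"sync"

	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

const (
	// Label values for the error_type dimension, defined next to the metric
	// that uses them: total errors across all attempts vs. errors that remain
	// after the retry budget has been spent.
	totalErrors                 = "total_errors"
	errorsAfterRetriesExhausted = "errors_after_retries_exhausted"
)

var (
	workItemDuration = metrics.NewHistogramVec(
		&metrics.HistogramOpts{
			Name: "cloudprovider_aws_tagging_controller_work_item_duration_seconds",
			Help: "Time taken by the tagging controller to dequeue and process a work item.",
		},
		[]string{"latency_type"})

	workItemError = metrics.NewCounterVec(
		&metrics.CounterOpts{
			Name: "cloudprovider_aws_tagging_controller_work_item_errors_total",
			Help: "Errors seen while dequeuing and processing tagging controller work items.",
		},
		[]string{"error_type", "instance_id"})

	registerOnce sync.Once
)

// registerMetrics registers the controller metrics with the component-base
// legacy registry exactly once.
func registerMetrics() {
	registerOnce.Do(func() {
		legacyregistry.MustRegister(workItemDuration)
		legacyregistry.MustRegister(workItemError)
	})
}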

pkg/controllers/tagging/tagging_controller.go (outdated, resolved)
pkg/controllers/tagging/tagging_controller.go (outdated, resolved)
@nckturner
Contributor

/lgtm
/approve

@k8s-ci-robot added the lgtm and approved labels on May 6, 2022
@k8s-ci-robot removed the lgtm label on May 6, 2022
@nguyenkndinh
Contributor Author

nguyenkndinh commented May 6, 2022

/unhold

@nckturner
Contributor

/kind feature
/triage accepted

@k8s-ci-robot added the kind/feature and triage/accepted labels and removed the needs-kind and needs-triage labels on May 6, 2022
@nckturner
Contributor

/lgtm
/unhold

@k8s-ci-robot added the lgtm label and removed the do-not-merge/hold label on May 6, 2022
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: nckturner, nguyenkndinh, saurav-agarwalla

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot merged commit 0780829 into kubernetes:master on May 6, 2022
k8s-ci-robot added a commit that referenced this pull request May 31, 2022
…f-#308-#334-#387-upstream-release-1.23

Automated cherry pick of #308: Add tagging controller configuration
#334: Stop retrying failed workitem after a certain amount of
#387: Fix issues in tagging controller
Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lgtm: "Looks good to me", indicates that a PR is ready to be merged.
  • ok-to-test: Indicates a non-member PR verified by an org member that is safe to test.
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.