Skip to content

[Issue #325] Added retry count to abort workitem after a few failed retries#334

Merged
k8s-ci-robot merged 1 commit intokubernetes:masterfrom
nguyenkndinh:master
May 6, 2022
Merged

[Issue #325] Added retry count to abort workitem after a few failed retries#334
k8s-ci-robot merged 1 commit intokubernetes:masterfrom
nguyenkndinh:master

Conversation

@nguyenkndinh
Copy link
Copy Markdown
Contributor

@nguyenkndinh nguyenkndinh commented Apr 5, 2022

What type of PR is this?
/kind feature

What this PR does / why we need it:

  • Added a logic to abort work item after a certain number of retries in case EC2 responds with non-retryable errors.
  • Added metrics collection to the controller.

Special notes for your reviewer:

Does this PR introduce a user-facing change?:
NONE

NONE

@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 5, 2022
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Hi @nguyenkndinh. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 5, 2022
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 5, 2022
}

const (
maxRequeuingCount = 10
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know what this translates into in terms of retry duration?

Copy link
Copy Markdown
Contributor Author

@nguyenkndinh nguyenkndinh Apr 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are using the DefaultControllerRateLimiter, which uses the worst case of the 2 limiters:

If not counting BucketRateLimiter cause that's a different beast, the work items would have to wait anywhere from 50 milliseconds to 10000 seconds in retry time (excluding work time and queue wait time as they can differ). However, the real avg number is a bit more complicated depending on the number of items in the queue, if the item has been enqueued many times, which rate limiter was used. So, When to add an item back in the queue depends on either BucketRateLimiter or ItemExponentialFailureRateLimiter is selected as worst case. However, I wouldn't expect the worst case of 10000 to ever happens, though we need metrics data to fine tune this as it goes I assume.

Comment thread pkg/controllers/tagging/tagging_controller.go
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 7, 2022
@nckturner
Copy link
Copy Markdown
Contributor

Can you squash your commits

@nckturner
Copy link
Copy Markdown
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 9, 2022
@nguyenkndinh
Copy link
Copy Markdown
Contributor Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 9, 2022
@nguyenkndinh nguyenkndinh force-pushed the master branch 2 times, most recently from 6d70a64 to 030a5b6 Compare April 9, 2022 00:27
Comment thread pkg/controllers/tagging/metrics.go Outdated
Comment on lines +50 to +56
func recordWorkitemMetrics(actionName string, timeTaken float64, err error) {
if err != nil {
workItemError.With(metrics.Labels{"workqueue": actionName}).Inc()
}

workItemLatency.With(metrics.Labels{"workqueue": actionName}).Observe(timeTaken)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we break this down into separate functions for count and latency metrics? Seems like at least for error counts, we're going to have zero latency all the time which doesn't seem desirable?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread pkg/controllers/tagging/metrics.go Outdated

func recordWorkitemMetrics(actionName string, timeTaken float64, err error) {
if err != nil {
workItemError.With(metrics.Labels{"workqueue": actionName}).Inc()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have a question around what the dimensions for this metric is going to look like - are we able to query this metric using actionName as well? Because we seem to be passing strings with node names wherever we're calling this function so need to understand if that makes sense or if we need actionName to be somewhat static.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great call out, as I'm based this change off of aws_metrics.go, I assume the dimensions for this are similar to this. I'll check in CW and see how they are being produced. However, I think you're right as the usage in aws.go is also somewhat static.

Comment thread pkg/controllers/tagging/tagging_controller_test.go
Comment thread pkg/controllers/tagging/metrics.go Outdated
var (
workItemLatency = metrics.NewHistogramVec(
&metrics.HistogramOpts{
Name: "tagging_workitem_latency",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have been looking at https://pkg.go.dev/github.com/prometheus/client_golang/prometheus for more info around how Prometheus handles these metrics and it looks like this is the metric name and the labels applied in the below functions are additional dimensions.

Keeping that in mind, can we rename these to cloudprovider_tagging_controller_work_item_latency and cloudprovider_tagging_controller_work_item_error instead to be more specific and to follow a similar pattern as the other metrics in this package?

}

klog.Errorf("error processing work item '%v': %s, requeuing count exceeded", workItem, err.Error())
recordWorkItemErrorMetrics("process_work_item")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we may want to track both total errors and errors after all retries are exhausted. Can you please emit two labels for this: total_errors and errors_after_retries_exhausted? You can emit total_errors inside the if above and errors_after_retries_exhausted outside it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean to count the number of total errors inside the if with a counter and the number of errors outside the if?

Copy link
Copy Markdown
Contributor Author

@nguyenkndinh nguyenkndinh May 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand it now but pls correct me if I was wrong, it is simply to add recordWorkItemErrorMetrics(TotalErrors) in the if and recordWorkItemErrorMetrics(ErrorsAfterRetriesExhausted) outside the if so that we can aggregate them later.

Comment thread pkg/controllers/tagging/metrics.go Outdated
}

func recordWorkItemErrorMetrics(actionName string) {
workItemError.With(metrics.Labels{"workqueue": actionName}).Inc()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change the label name to error_type instead and change the param name accordingly?

Comment thread pkg/controllers/tagging/metrics.go Outdated
}

func recordWorkItemLatencyMetrics(actionName string, timeTaken float64) {
workItemLatency.With(metrics.Labels{"workqueue": actionName}).Observe(timeTaken)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for this. Can we change this to latency_type instead and change the param name accordingly?

} else {
klog.Infof("Finished processing %v", workItem)
timeTaken = time.Since(workItem.enqueueTime).Seconds()
recordWorkItemLatencyMetrics("process_work_item", timeTaken)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change this to work_item_processing_time?

}

timeTaken := time.Since(workItem.enqueueTime).Seconds()
recordWorkItemLatencyMetrics("dequeue_work_item", timeTaken)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change this to work_item_dequeuing_time?

utilruntime.HandleError(fmt.Errorf("expected workItem in workqueue but got %#v", obj))
err := fmt.Errorf("expected workItem in workqueue but got %#v", obj)
utilruntime.HandleError(err)
recordWorkItemErrorMetrics("dequeue_work_item")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to emit a latency metric here since this means that the workitem is invalid, right?

Comment thread pkg/controllers/tagging/metrics.go Outdated
workItemLatency.With(metrics.Labels{"workqueue": actionName}).Observe(timeTaken)
}

func recordWorkItemErrorMetrics(actionName string) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a label for instance_id in this function and take in the appropriate params? Might help to see if there are certain instances running into errors.

Copy link
Copy Markdown
Contributor

@saurav-agarwalla saurav-agarwalla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

Comment thread pkg/controllers/tagging/metrics.go
Comment thread pkg/controllers/tagging/metrics.go Outdated
Comment thread pkg/controllers/tagging/tagging_controller.go
Comment thread pkg/controllers/tagging/tagging_controller.go Outdated
Comment thread pkg/controllers/tagging/metrics.go Outdated
var (
workItemDuration = metrics.NewHistogramVec(
&metrics.HistogramOpts{
Name: "cloudprovider_tagging_controller_work_item_duration_seconds",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more thing for this metric name, to match the other metrics already present in aws_metrics.go we should start with the prefix: cloudprovider_aws so the metric should be cloudprovider_aws_tagging_controller_work_item_duration_seconds. Then other controllers can follow the same pattern of cloudprovider_aws_controller_name_metric_name.

Comment thread pkg/controllers/tagging/metrics.go Outdated

workItemError = metrics.NewCounterVec(
&metrics.CounterOpts{
Name: "cloudprovider_tagging_controller_work_item_errors_total",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Name: "cloudprovider_tagging_controller_work_item_errors_total",
Name: "cloudprovider_aws_tagging_controller_work_item_errors_total",


const (
maxRequeuingCount = 9
totalErrors = "total_errors"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since some of these are metrics labels, I think it makes more sense to define those with the metrics they will be used with. And when you do that you can write a comment as to what it can be used for when querying.

Comment thread pkg/controllers/tagging/tagging_controller.go Outdated
Comment thread pkg/controllers/tagging/tagging_controller.go Outdated
@nckturner
Copy link
Copy Markdown
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 6, 2022
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 6, 2022
@nguyenkndinh
Copy link
Copy Markdown
Contributor Author

nguyenkndinh commented May 6, 2022

/unhold

@nckturner
Copy link
Copy Markdown
Contributor

/kind feature
/triage accepted

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 6, 2022
@nckturner
Copy link
Copy Markdown
Contributor

/lgtm
/unhold

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels May 6, 2022
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nckturner, nguyenkndinh, saurav-agarwalla

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 0780829 into kubernetes:master May 6, 2022
k8s-ci-robot added a commit that referenced this pull request May 31, 2022
…f-#308-#334-#387-upstream-release-1.23

Automated cherry pick of #308: Add tagging controller configuration
#334: Stop retrying failed workitem after a certain amount of
#387: Fix issues in tagging controller
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants