Skip to content
This repository has been archived by the owner on Oct 21, 2020. It is now read-only.

Use rate limited work queues and up resync from 15s to 15m #825

Merged
merged 1 commit into from
Jun 26, 2018

Conversation

wongma7
Copy link
Contributor

@wongma7 wongma7 commented Jun 20, 2018

  • Up resync period to 15 minutes
  • Ditch goroutinemap and instead use the exponential backoff that comes with rate limited work queues
  • Ditch failedProvisionStats and failedDeleteStats, and instead use NumRequeues that comes with rate limited work queues

@wongma7 wongma7 self-assigned this Jun 20, 2018
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 20, 2018
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 21, 2018
@wongma7
Copy link
Contributor Author

wongma7 commented Jun 21, 2018

@cofyc if you would like to review. @childsb if you would like to also/find somebody else willing to :)

The PR looks big but the code for claim queue and volume queue is pretty much identical. I considered refactoring it so there is less duplicate code but I will let y'all decide if it is okay as-is.

@wongma7 wongma7 changed the title [WIP] Use rate limited work queues Use rate limited work queues Jun 21, 2018
@wongma7 wongma7 changed the title Use rate limited work queues Use rate limited work queues and up resync from 15s to 15m Jun 21, 2018
@@ -473,8 +511,8 @@ func NewProvisionController(

volumeHandler := cache.ResourceEventHandlerFuncs{
AddFunc: nil,
Copy link
Contributor

@cofyc cofyc Jun 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why omitting AddFunc here? If omitted, volumes will not be queued until first resync happens.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we only care about volume Updates and not Adds because we are checking if a volume has entered 'Released' state and needs to be deleted

Copy link
Contributor

@cofyc cofyc Jun 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but volumes entered 'Released' state during provisioner is not running (e.g. restarting or migrating to another node) will be delayed 15 minutes (by default) to delete. I'm not sure if this is acceptable. Old controller implementation did not handle Add event, but its resync period is only 15 seconds.

shouldDelete only checks volume itself, queuing all volumes should not affect performance much IMO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, you're correct, I've added AddFunc.


if err := ctrl.syncClaimHandler(key); err != nil {
if ctrl.claimQueue.NumRequeues(obj) < ctrl.failedProvisionThreshold {
glog.Errorf("retrying syncing claim '%s' because failures %v < threshold %v", key, ctrl.claimQueue.NumRequeues(obj), ctrl.failedProvisionThreshold)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about using warning here to distinguish with giving up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

glog does not give us a Warnf :(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

autocomplete failed me... done!

also sidenote, logging verbosity is kinda all over the place in general, I will fix it later.

// Done but do not Forget: it will not be in the queue but NumRequeues
// will be saved until the obj is deleted from kubernetes
}
return fmt.Errorf("error syncing claim '%s': %s", key, err.Error())
Copy link
Contributor

@cofyc cofyc Jun 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use %q instead of '%s'? using %q to print quoted string seems more idiomatic in golang

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, replaced everywhere.

failedProvisionStats: make(map[types.UID]int),
failedDeleteStats: make(map[types.UID]int),
failedProvisionStatsMutex: &sync.Mutex{},
failedDeleteStatsMutex: &sync.Mutex{},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be removed from ProvisionController struct too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if key, ok = obj.(string); !ok {
ctrl.claimQueue.Forget(obj)
utilruntime.HandleError(fmt.Errorf("expected string in workqueue but got %#v", obj))
return nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably we can simply return this error here, because outside function will call utilruntime.HandleError if err != nil.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@cofyc
Copy link
Contributor

cofyc commented Jun 26, 2018

LGTM

@wongma7 wongma7 added the lgtm Indicates that a PR is ready to be merged. label Jun 26, 2018
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Jun 26, 2018
@wongma7
Copy link
Contributor Author

wongma7 commented Jun 26, 2018

thank you very much for your review, I'll merge this shortly and continue working on controller improvements.

@wongma7 wongma7 added the lgtm Indicates that a PR is ready to be merged. label Jun 26, 2018
@wongma7 wongma7 merged commit 7a455d3 into kubernetes-retired:master Jun 26, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area/lib cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm Indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants