Retry kube lock acquire until --schedule-window expires #22

Merged: SpComb merged 6 commits into master from feature/lock-acquire-retry on May 31, 2018

Conversation

SpComb commented May 31, 2018

Run each scheduled task with a context.Context whose deadline is set from the --schedule-window option (for example --schedule-window=1h), so that task execution is bounded by the schedule window.

Fixes #1: kube/Lock.Acquire takes a context.Context and times out if the lock is not freed before the context expires.

This required re-implementing RetryOnConflict from k8s.io/client-go/util/retry because of a bug in how it handles watch timeout errors: if the Acquire => retry => wait sequence timed out, the upgrades ran without the lock held. See kubernetes/client-go#427.

Fixes #20: the top-level Kube.AcquireLock() retries kube/Lock.Acquire until it either succeeds or the context deadline expires. The retry uses a crude exponential backoff, which also covers the case where the master is down.

SpComb added the enhancement label (New feature or request) on May 31, 2018

SpComb commented May 31, 2018

Testing that the new kube/Lock.modify update conflict retry works:

2018/05/31 11:40:00 Acquiring kube lock...
2018/05/31 11:40:00 kube/lock kube-system/daemonsets/host-upgrades: get
2018/05/31 11:40:00 kube/lock kube-system/daemonsets/host-upgrades: wait
2018/05/31 11:40:00 kube/lock kube-system/daemonsets/host-upgrades: test pharos-host-upgrades.kontena.io/lock=: free
2018/05/31 11:40:00 kube/lock kube-system/daemonsets/host-upgrades: acquire
2018/05/31 11:40:00 kube/lock kube-system/daemonsets/host-upgrades: set pharos-host-upgrades.kontena.io/lock=centos-7
2018/05/31 11:40:00 kube/lock kube-system/daemonsets/host-upgrades: update
2018/05/31 11:40:00 kube/lock kube-system/daemonsets/host-upgrades: retry modify conflict: Operation cannot be fulfilled on daemonsets.apps "host-upgrades": the object has been modified; please apply your changes to the latest version and try again
2018/05/31 11:40:00 kube/lock kube-system/daemonsets/host-upgrades: get
2018/05/31 11:40:00 kube/lock kube-system/daemonsets/host-upgrades: wait
2018/05/31 11:40:00 kube/lock kube-system/daemonsets/host-upgrades: test pharos-host-upgrades.kontena.io/lock=ubuntu-xenial: locked
2018/05/31 11:40:00 kube/lock kube-system/daemonsets/host-upgrades: watch v1.ListOptions{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, LabelSelector:"", FieldSelector:"metadata.name=host-upgrades", IncludeUninitialized:false, Watch:false, ResourceVersion:"55656", TimeoutSeconds:(*int64)(nil), Limit:0, Continue:""}
2018/05/31 11:40:05 kube/lock kube-system/daemonsets/host-upgrades: test pharos-host-upgrades.kontena.io/lock=: free
2018/05/31 11:40:05 kube/lock kube-system/daemonsets/host-upgrades: wait ok
2018/05/31 11:40:05 kube/lock kube-system/daemonsets/host-upgrades: acquire
2018/05/31 11:40:05 kube/lock kube-system/daemonsets/host-upgrades: set pharos-host-upgrades.kontena.io/lock=centos-7
2018/05/31 11:40:05 kube/lock kube-system/daemonsets/host-upgrades: update


SpComb commented May 31, 2018

Testing that the kube/Lock.Acquire timeout works:

2018/05/31 11:39:00 Schedule run started, deadline at 2018-05-31 11:39:10.000963736 +0000 UTC m=+89.629437537
2018/05/31 11:39:00 Acquiring kube lock...
2018/05/31 11:39:00 kube/lock kube-system/daemonsets/host-upgrades: get
2018/05/31 11:39:00 kube/lock kube-system/daemonsets/host-upgrades: wait
2018/05/31 11:39:00 kube/lock kube-system/daemonsets/host-upgrades: test pharos-host-upgrades.kontena.io/lock=: free
2018/05/31 11:39:00 kube/lock kube-system/daemonsets/host-upgrades: acquire
2018/05/31 11:39:00 kube/lock kube-system/daemonsets/host-upgrades: set pharos-host-upgrades.kontena.io/lock=ubuntu-xenial
2018/05/31 11:39:00 kube/lock kube-system/daemonsets/host-upgrades: update
2018/05/31 11:39:00 kube/lock kube-system/daemonsets/host-upgrades: retry modify conflict: Operation cannot be fulfilled on daemonsets.apps "host-upgrades": the object has been modified; please apply your changes to the latest version and try again
2018/05/31 11:39:00 kube/lock kube-system/daemonsets/host-upgrades: get
2018/05/31 11:39:00 kube/lock kube-system/daemonsets/host-upgrades: wait
2018/05/31 11:39:00 kube/lock kube-system/daemonsets/host-upgrades: test pharos-host-upgrades.kontena.io/lock=centos-7: locked
2018/05/31 11:39:00 kube/lock kube-system/daemonsets/host-upgrades: watch v1.ListOptions{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, LabelSelector:"", FieldSelector:"metadata.name=host-upgrades", IncludeUninitialized:false, Watch:false, ResourceVersion:"55538", TimeoutSeconds:(*int64)(nil), Limit:0, Continue:""}
2018/05/31 11:39:10 kube/lock kube-system/daemonsets/host-upgrades: wait err: timed out waiting for the condition
2018/05/31 11:39:10 Acquiring kube lock failed, retrying: timed out waiting for the condition
2018/05/31 11:39:11 Failed to acquire kube lock: context deadline exceeded


SpComb commented May 31, 2018

Testing that the top-level Kube.AcquireLock retry works in the case of the kube API being down:

2018/05/31 11:41:00 Schedule run started, deadline at 2018-05-31 11:41:10.000908342 +0000 UTC m=+86.407412249
2018/05/31 11:41:00 Acquiring kube lock...
2018/05/31 11:41:00 kube/lock kube-system/daemonsets/host-upgrades: get
2018/05/31 11:41:00 Acquiring kube lock failed, retrying: Get: Get https://10.96.0.1:443/apis/apps/v1/namespaces/kube-system/daemonsets/host-upgrades: dial tcp 10.96.0.1:443: connect: connection refused
2018/05/31 11:41:01 kube/lock kube-system/daemonsets/host-upgrades: get
2018/05/31 11:41:01 Acquiring kube lock failed, retrying: Get: Get https://10.96.0.1:443/apis/apps/v1/namespaces/kube-system/daemonsets/host-upgrades: dial tcp 10.96.0.1:443: connect: connection refused
2018/05/31 11:41:03 kube/lock kube-system/daemonsets/host-upgrades: get
2018/05/31 11:41:03 Acquiring kube lock failed, retrying: Get: Get https://10.96.0.1:443/apis/apps/v1/namespaces/kube-system/daemonsets/host-upgrades: dial tcp 10.96.0.1:443: connect: connection refused
2018/05/31 11:41:07 kube/lock kube-system/daemonsets/host-upgrades: get
2018/05/31 11:41:07 Acquiring kube lock failed, retrying: Get: Get https://10.96.0.1:443/apis/apps/v1/namespaces/kube-system/daemonsets/host-upgrades: dial tcp 10.96.0.1:443: connect: connection refused
2018/05/31 11:41:15 Failed to acquire kube lock: context deadline exceeded

SpComb merged commit 3ade33b into master on May 31, 2018
SpComb deleted the feature/lock-acquire-retry branch on May 31, 2018 11:53