
Retries for kubelet plugin registration should be at a lower layer #73371

Closed
taragu opened this issue Jan 27, 2019 · 11 comments · Fixed by #73891
Labels
kind/feature Categorizes issue or PR as related to a new feature. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

@taragu
Contributor

taragu commented Jan 27, 2019

What would you like to be added:
The kubelet registration mechanism should be more robust (e.g. have a full controller that maintains desired/actual state and reconciles it periodically). The CSI layer is not the right layer to implement retries; retries should not be implemented by every consumer.

Why is this needed: For a more robust kubelet registration mechanism

Related issue: #71487

/cc @saad-ali
/sig storage
/assign

@taragu taragu added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 27, 2019
@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Jan 27, 2019
@saad-ali
Member

Thanks for creating this.

Background

The Kubelet Plugin Registration Mechanism watches a directory on the kubelet host for new socket files. When a new socket file is discovered, the plugin mechanism will attempt to connect to the socket and probe it using the protocol defined here. If this registration step fails, the kubelet never attempts to register the socket again, unless the socket for the driver is deleted & recreated to trigger a new registration.
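For illustration, a minimal, self-contained sketch of that watch-and-probe behavior, assuming the fsnotify library and a hypothetical probeSocket helper standing in for the registration RPC exchange (this is not the in-tree watcher code). Note how a failed probe is logged once and never retried, which is exactly the gap described below:

```go
package main

import (
	"context"
	"log"
	"path/filepath"
	"time"

	"github.com/fsnotify/fsnotify"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// watchPluginDir watches a plugin directory for newly created socket
// files and attempts a one-shot probe of each.
func watchPluginDir(dir string) error {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer watcher.Close()

	if err := watcher.Add(dir); err != nil {
		return err
	}

	for event := range watcher.Events {
		if event.Op&fsnotify.Create == 0 {
			continue // only newly created files matter here
		}
		if err := probeSocket(event.Name); err != nil {
			// Without a retry layer, this failure is terminal for the
			// socket until it is deleted and recreated.
			log.Printf("registration of %s failed: %v", filepath.Base(event.Name), err)
		}
	}
	return nil
}

// probeSocket dials the unix socket; the real watcher would then drive
// the registration protocol over this connection (omitted here).
func probeSocket(path string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, "unix://"+path,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock())
	if err != nil {
		return err
	}
	return conn.Close()
}
```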

Problem

If the registration error was transient, the registration is not retried and the plugin is simply not registered.

Requirements

  • Ensure terminal registration errors during kubelet device plugin registration are not retried.
  • Ensure non-terminal registration errors during kubelet device plugin registration are retried.
  • Ensure registration retries do not occur if the socket no longer exists.
  • Ensure unregistration is only called if registration for that socket succeeded.
  • Ensure terminal unregistration errors during kubelet device plugin unregistration are not retried.
  • Ensure non-terminal unregistration errors during kubelet device plugin unregistration are retried.
  • Ensure all retries implement exponential backoff (see the sketch after this list).
  • Ensure long-running registration and unregistration operations do not block other operations.
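To make the terminal/non-terminal split concrete, here is a minimal sketch in Go. The names (terminalError, retryWithBackoff, shouldRetry) are illustrative assumptions, not in-tree code; in the suggested solution below, the backoff itself is delegated to the nestedpendingoperations library:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// terminalError marks failures that must not be retried; anything else
// is treated as transient and retried with backoff.
type terminalError struct{ err error }

func (t terminalError) Error() string { return t.err.Error() }

// retryWithBackoff retries op with exponential backoff, stopping early on
// a terminal error or when shouldRetry (e.g. "does the socket still
// exist?") reports false.
func retryWithBackoff(op func() error, shouldRetry func() bool) error {
	delay := 500 * time.Millisecond
	const maxDelay = 2 * time.Minute

	for {
		err := op()
		if err == nil {
			return nil
		}
		var term terminalError
		if errors.As(err, &term) {
			return fmt.Errorf("terminal error, not retrying: %w", err)
		}
		if !shouldRetry() {
			return fmt.Errorf("giving up: %w", err)
		}
		time.Sleep(delay)
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
}
```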

Suggested Solution

Implement a controller that manages registration/unregistration, similar to the volumemanager.

  • Implement an "actual" and "desired" state in-memory cache.
  • Wire the existing "plugin watcher" to act as a "populator" which adds sockets to the "desired" state whenever a new socket is created and removes sockets from the "desired" state whenever an existing socket is deleted.
  • Implement a "reconciler" that acts as a dumb loop that periodically diffs "desired" and "actual" state.
  • Any socket that exists in "desired" but not "actual" state should trigger a "registration" operation.
  • Any socket that exists in "actual" but not "desired" state should trigger a "unregistration" operation.
  • "reconciler" should use the nestedpendingoperations library to execute registration and unregistration operations. This library automatically takes care of exponential backoff and ensures no more then one operation (registration or unregistration) can happen at a given time for a given socket.

CC @msau42 FYI.

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Jan 28, 2019
@msau42
Member

msau42 commented Jan 28, 2019

cc @RenaudWasTaken

@saad-ali
Member

cc @vikaschoudhary16

@vikaschoudhary16
Contributor

@saad-ali I can take this up if no one has started already.

@taragu
Contributor Author

taragu commented Jan 30, 2019

@vikaschoudhary16 I was planning to work on this issue this weekend. @saad-ali, what's the priority for this issue? If it's urgent, @vikaschoudhary16, you should probably work on it instead.

@vikaschoudhary16
Contributor

@taragu go ahead please. I can help with reviews :)

@saad-ali
Member

Not urgent. Getting it in for v1.14 would be really great and would make the whole system more robust. If we don't, the workaround @taragu added for CSI (#72873) should keep us stable enough on the CSI side that we can push this to 1.15 if needed.

@taragu
Contributor Author

taragu commented Jan 30, 2019

@vikaschoudhary16 sounds good. Thanks!! 😄

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 1, 2019
@taragu
Contributor Author

taragu commented May 1, 2019

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 1, 2019
@saad-ali
Member

/milestone v1.15

@k8s-ci-robot k8s-ci-robot added this to the v1.15 milestone May 30, 2019