Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add kubelet plugin manager #73891

Merged
merged 1 commit into from May 31, 2019

Conversation

@taragu
Copy link
Member

commented Feb 10, 2019

What type of PR is this?
/kind feature

What this PR does / why we need it:
Implement a controller that manages plugin registration/unregistration

Which issue(s) this PR fixes:
Fixes #73371

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Kubelet plugin registration now has retry and exponential backoff logic for when registration of plugins (like CSI or device plugin) fail.
@taragu

This comment has been minimized.

Copy link
Member Author

commented Feb 10, 2019

* "reconciler" should use the [`nestedpendingoperations` library]
to execute registration and unregistration operations. 
This library automatically takes care of exponential backoff and ensures no more
then one operation (registration or unregistration) can happen at a given time for 
a given socket.

@saad-ali I was looking at nestedpendingoperations but wasn't sure how to use it to register/unregister plugin. Could you please point me to a similar example?

@taragu

This comment has been minimized.

Copy link
Member Author

commented Feb 10, 2019

@taragu taragu force-pushed the taragu:plugin-manager branch from c472ca1 to ec659e5 Feb 18, 2019

@k8s-ci-robot k8s-ci-robot added the size/L label Feb 18, 2019

@saad-ali

This comment has been minimized.

Copy link
Member

commented Feb 20, 2019

@taragu there is a generic version of the nestedpendingoperation that is simpler and not volume specific called goroutinemap. See https://github.com/kubernetes/kubernetes/blob/master/pkg/util/goroutinemap/goroutinemap.go (edited)

The operation_executor is just a wrapper around nestedpendingoperation (or goroutinemap). I would define a new operation_executor for the plugin manager that defines two operations: RegisterPlugin() and UnregisterPlugin(). This operation_executor would use goroutinemap to ensure that no more then one operation starts for the same socket. So the key that you use in the goroutinemap (I think) would be the socket path.

Once you have an operation_executor defined, you can create two caches: desired and actual that keep track of the sockets that we want to register and the sockets that we actually have successfully registered.

Then you can create the dumb diff loop, the reconciler, it will (maybe every 100 ms) diff desired and actual. If a socket is in desired but not in actual it will call operation_executor.Register(). If a socket is in actual but not desired it will call operation_executor.Unregister(). And because operation_executor uses goroutinemap it will refuse to start a new register or unregister operation while one is pending (and automatically handle exponential backoff).

Once you have that, you need to write a populator -- in this case that will basically just be the existing code that currently calls the old Register method -- and instead it would add to the desired state cache on register, and remove from desired state cache on unregister (populator in this case doesn't need to be a new package).

@taragu

This comment has been minimized.

Copy link
Member Author

commented Feb 20, 2019

@saad-ali thank you so much!! This is very comprehensive. I'll try to digest this later this evening.

@taragu taragu force-pushed the taragu:plugin-manager branch from ec659e5 to 5f07de8 Feb 25, 2019

@k8s-ci-robot k8s-ci-robot added size/XXL and removed size/L labels Feb 25, 2019

@taragu taragu force-pushed the taragu:plugin-manager branch 7 times, most recently from 17aa31b to a6f39d6 Feb 25, 2019

@taragu taragu force-pushed the taragu:plugin-manager branch 3 times, most recently from 1853dcf to dc863d8 Mar 7, 2019

@saad-ali
Copy link
Member

left a comment

@taragu, one more thing I forgot, we should undo the temporary fix from #72873 as well. We could do that in a follow up PR if you want.

@taragu

This comment has been minimized.

Copy link
Member Author

commented May 10, 2019

@saad-ali sounds good. I will undo #72873 in a followup PR

@msau42

This comment has been minimized.

Copy link
Member

commented May 13, 2019

/hold
Spoke with @vishh. There are still some discussions happening with @saad-ali around if we can just have the drivers restart

@taragu taragu force-pushed the taragu:plugin-manager branch from 63aaa8f to e3e10f1 May 30, 2019

@vishh

This comment has been minimized.

Copy link
Member

commented May 30, 2019

/approve
(after offline discussion with @saad-ali who patiently explained that k'let can silently fail re-registration when the plugins assume normal operation)

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented May 30, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: saad-ali, taragu, vishh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@taragu taragu force-pushed the taragu:plugin-manager branch from e3e10f1 to 9cab985 May 30, 2019

@saad-ali

This comment has been minimized.

Copy link
Member

commented May 30, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label May 30, 2019

@saad-ali

This comment has been minimized.

Copy link
Member

commented May 30, 2019

/milestone v1.15

@k8s-ci-robot k8s-ci-robot added this to the v1.15 milestone May 30, 2019

@taragu taragu force-pushed the taragu:plugin-manager branch from 9cab985 to 5e18554 May 30, 2019

@k8s-ci-robot k8s-ci-robot removed the lgtm label May 30, 2019

@saad-ali

This comment has been minimized.

Copy link
Member

commented May 30, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label May 30, 2019

@taragu

This comment has been minimized.

Copy link
Member Author

commented May 31, 2019

/hold cancel

@taragu

This comment has been minimized.

Copy link
Member Author

commented May 31, 2019

/test pull-kubernetes-integration

1 similar comment
@taragu

This comment has been minimized.

Copy link
Member Author

commented May 31, 2019

/test pull-kubernetes-integration

@k8s-ci-robot k8s-ci-robot merged commit fe37733 into kubernetes:master May 31, 2019

21 checks passed

cla/linuxfoundation taragu authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
Details
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-csi-serial Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-gce-storage-slow Skipped.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
Details
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-node-e2e-containerd Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
pull-publishing-bot-validate Skipped.
tide In merge pool.
Details
asw.Lock()
defer asw.Unlock()

if _, ok := asw.socketFileToInfo[socketPath]; ok {

This comment has been minimized.

Copy link
@tedyu

tedyu May 31, 2019

Contributor

Not sure why the check is needed - we can delete the entry directly.
It is fine if the key doesn't exist.

dsw.Lock()
defer dsw.Unlock()

if _, ok := dsw.socketFileToInfo[socketPath]; ok {

This comment has been minimized.

Copy link
@tedyu

tedyu May 31, 2019

Contributor

No need to check prior to deletion

return nil
}

// Dial establishes the gRPC communication with the picked up plugin socket. https://godoc.org/google.golang.org/grpc#Dial

This comment has been minimized.

Copy link
@tedyu

tedyu May 31, 2019

Contributor

Should be lowercase: dial

}

func (rc *reconciler) getHandlers() map[string]cache.PluginHandler {
rc.Lock()

This comment has been minimized.

Copy link
@tedyu

tedyu May 31, 2019

Contributor

Getting read lock should suffice

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.