Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use probe based plugin watcher mechanism in Device Manager #58755

Merged
merged 1 commit into from Jul 27, 2018

Conversation

@vikaschoudhary16
Copy link
Member

vikaschoudhary16 commented Jan 24, 2018

What this PR does / why we need it:
Uses this probe based utility in the device plugin manager.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #56944

Notes For Reviewers:
Changes are backward compatible and existing device plugins will continue to work. At the same time, any new plugins that has required support for probing model (Identity service implementation), will also work.

Release note

Add support kubelet plugin watcher in device manager.

/sig node
/area hw-accelerators
/cc /cc @jiayingz @RenaudWasTaken @vishh @ScorpioCPH @sjenning @derekwaynecarr @jeremyeder @lichuqiang @tengqm @saad-ali @chakri-nelluri @ConnorDoyle

// Schedulable resource name. As of now it's expected to be a DNS Label
string resource_name = 3;
string resource_name = 2;

This comment has been minimized.

@RenaudWasTaken

RenaudWasTaken Jan 24, 2018

Member

The kubelet shouldn't be telling which resource_name the device plugin has.
You need a RegisterResponse

This comment has been minimized.

@vikaschoudhary16

vikaschoudhary16 Jan 24, 2018

Author Member

This is for authentication. kubelet will probe a canonical path and will treat part of socket name as resource name. Device plugin can further authenticate that.

@vikaschoudhary16 vikaschoudhary16 force-pushed the vikaschoudhary16:probing-mode branch from 95b158e to 74358df Jan 24, 2018

sock := strings.Split(r.SockPath, "/")
sock = strings.Split(sock[len(sock)-1], ".")
if len(sock) != 3 {
// NOTE: socket path must be in the form of /var/lib/kubelet/device-plugins/<domain-name>.<resource-name>.sock

This comment has been minimized.

@RenaudWasTaken

RenaudWasTaken Jan 24, 2018

Member

Let's have the name be explicit in the gRPC protocol rather than try to fit that into the FS naming convention and then having to manipulate the string

}

func WatchPlugins(path string, outChan chan PluginInfo) error {
watcher, err := fsnotify.NewWatcher()

This comment has been minimized.

@jiayingz

jiayingz Jan 24, 2018

Member

Thanks a lot for implementing this change very quickly!

Haven't looked through the whole PR, but looking at this part, I am curious whether kubelet can re-discover all of plugins after it restarts. Do we need to scan through the directory initially?

This comment has been minimized.

@vikaschoudhary16

vikaschoudhary16 Jan 25, 2018

Author Member

Thanks, done. PTAL!

@vikaschoudhary16 vikaschoudhary16 force-pushed the vikaschoudhary16:probing-mode branch 2 times, most recently from 6f8b6f4 to cf8d6ba Jan 25, 2018

// GetResourceNameFrom returns base resource name from the plugin socket absolute path
func GetResourceNameFrom(sockPath string) (string, error) {
if !((strings.Contains(sockPath, ".sock")) && (strings.Contains(sockPath, "/"))) {
return "", fmt.Errorf("invalid sock name: %s. Expected format: <sock-dir>/<domain>.<resource>.sock", sockPath)

This comment has been minimized.

@lichuqiang

lichuqiang Jan 25, 2018

Member

I thought the probe func is meant to be built as a common mechanism for both device-plugin and other potential components of kubelet.
So, it’s specially for device-plugin indeed?

This comment has been minimized.

@RenaudWasTaken

RenaudWasTaken Jan 25, 2018

Member

As I was mentioning higher, we shouldn't impose a name format for the sockets. We're trying to fit a DNS name into an FS naming convention, that's not a good model (as shown by the further transformation we have to do on the string) and is prone to implementation errors and user errors.

I also don't think it matches the simple API format we're trying to introduce here.

This comment has been minimized.

@jiayingz

jiayingz Jan 29, 2018

Member

This pkg is to be shared by other components, like CSI, as well. +1 to avoid specific naming conventions.

@saad-ali @verult could you comment on any socket naming requirements/preference from CSI or flexvolume side?

This comment has been minimized.

@jiayingz

jiayingz Jan 29, 2018

Member

FYI, according to CSI proposal, the expected naming convention is /var/lib/kubelet/plugins/[SanitizedCSIDriverName]/csi.sock

For simplicity, wonder whether we can have csi plugins go to /var/lib/kubelet/plugins/csi/ directory and device plugins go to /var/lib/kubelet/plugins/device_plugins/? The parent directory can be a parameter passed in to NewPluginWatcher. Perhaps we can just pass the absolute path in the channel, and the PluginWatcher caller can do its own naming convention verification?

This comment has been minimized.

@RenaudWasTaken

RenaudWasTaken Jan 30, 2018

Member

Can you point me to that convention ? I can't seem to find any directive in the spec.

This comment has been minimized.

@verult

verult Jan 30, 2018

Contributor

Flexvolume doesn't use sockets to communicate with kubelet; instead, kubelet executes Flexvolume drivers directly on host. I don't think this model works well for Flex, and IMO that's OK.

This comment has been minimized.

@RenaudWasTaken

RenaudWasTaken Jan 31, 2018

Member

For simplicity, wonder whether we can have csi plugins go to /var/lib/kubelet/plugins/csi/ directory and device plugins go to /var/lib/kubelet/plugins/device_plugins/?

Hmm, we mwntioned moving the checkpointing data out of this directory, I think that's a good idea. In that case where would we put it?
Wouldn't it make more sense to have:

  • var/lib/kubelet/csi/
  • var/lib/kubelet/device-plugin/
  • var/lib/kubelet/device-plugin/data.checkpoint
  • var/lib/kubelet/device-plugin/plugins/

This comment has been minimized.

@chakri-nelluri

chakri-nelluri Jan 31, 2018

Contributor

I would use csi/plugins to keep the plugin hierarchy same and just in case CSI would need any checkpointing it has a place to do it.


err = watcher.Add(path)
if err != nil {
watcher.Close()

This comment has been minimized.

@lichuqiang

lichuqiang Jan 25, 2018

Member

what if err == nil?
Maybe things like defer watcher.Close() should be addressed outside to ensure its collection

return nil
}

// Dial establishes the gRPC communication with the picked up plugin socket. https://godoc.org/google.golang.org/grpc#Dial

This comment has been minimized.

@ScorpioCPH

ScorpioCPH Jan 28, 2018

Member

nit: put reference link in a newline.

if err != nil {
return err
}
watcher, err := fsnotify.NewWatcher()

This comment has been minimized.

@ScorpioCPH

ScorpioCPH Jan 28, 2018

Member

How about use package filesystem .FSWatcher here:

// FSWatcher is a callback-based filesystem watcher abstraction for fsnotify.
type FSWatcher interface {
	// Initializes the watcher with the given watch handlers.
	// Called before all other methods.
	Init(FSEventHandler, FSErrorHandler) error

	// Starts listening for events and errors.
	// When an event or error occurs, the corresponding handler is called.
	Run()

	// Add a filesystem path to watch
	AddWatch(path string) error
}

This comment has been minimized.

@jiayingz

jiayingz Jan 29, 2018

Member

+1. Maybe take a look at the existing pkg/volume/flexvolume/probe.go and see whether we can follow a similar flow. Some example functionalities that the current pkg/volume/flexvolume/probe.go has and we may also consider:

  • create the plugin directory if it doesn't exist
  • recursively add watcher for sub directories
  • watchEventInterval and watchEventLimit
  • watch remove events

This comment has been minimized.

@verult

verult Jan 30, 2018

Contributor

I would recommend taking advantage of this interface, and consider extending it if it's insufficient.

watchEventInterval and watchEventLimit are there for rate limiting: it prevents a malicious actor from overwhelming kubelet by triggering too many inotify events.

In Flexvolume the requirement is that if ANYTHING inside a driver directory changes, the driver should update. For this reason, the Flexvolume prober recursively watches all subdirectories. But here AFAIK each device plugin is a single socket, so recursive watch won't be necessary.

My understanding is it's not necessary to watch remove events because gRPC will terminate when socket is deleted, triggering the endpoint deletion in managerImpl.

This comment has been minimized.

@RenaudWasTaken

RenaudWasTaken Jan 30, 2018

Member

It looks like that code lacks a stop function

This comment has been minimized.

@jiayingz

jiayingz Jan 30, 2018

Member

@verult Thanks a lot for your helpful inputs! Agree we should try to reuse FSWatcher interface. For now, I think we may leave remove events and recursively watch out of this PR. For possible future extension, I would suggest that PluginWatcher passes the absolute path in the channel, and the PluginWatcher caller can do its own naming convention verification.

For CSI, is it possible that we use /var/lib/kubelet/plugins/csi/[SanitizedCSIDriverName].sock naming convention to work with a flat directory? If we have to use the proposed /var/lib/kubelet/plugins/[SanitizedCSIDriverName]/csi.sock format, we can add recursive watch in PluginWatcher in a later PR.

For device plugin, I think the reason why we can't just pass resource name in Registration response is because the advertised resource name may fail Kubelet's validation, if e.g., it doesn't follow extended resource name convention, or it collides with an existing name. I am fine with the naming format currently used in this PR. Something we may consider is whether we want to allow socket name with some timestamp suffix to allow a new instance to come up without tearing down the old plugin instance to ensure zero downtime. But I think we can leave that part out for now, given that we expect most plugins to be deployed as DaemonSet containers whose upgrades require container restarts.

This comment has been minimized.

@RenaudWasTaken

RenaudWasTaken Jan 30, 2018

Member

So I thought about it more and I think a good solution could be to have to following model:

  • The socket has the same format as the resource name (e.g: /var/lib/kubelet/plugins/device-plugins/nvidia.com/gpu.sock for the socket)
  • The gRPC server advertises the same name as it's socket (e.g: "nvidia.com/gpu")
  • Kubelet validates after calling Register that the socket name matches the advertised name (and applies resourceName validation)

or it collides with an existing name

That would be the same behavior as re-registration.

This comment has been minimized.

@jiayingz

jiayingz Jan 30, 2018

Member

I am also fine with this model. It is just that ProbeWatcher has to support recursive watching then, and I don't know whether that would create some performance concerns as inotify man page mentions that "Inotify monitoring of directories is not recursive: to monitor subdirectories under a directory, additional watches must be created. This can take a significant amount time for large directory trees."

This comment has been minimized.

@RenaudWasTaken

RenaudWasTaken Jan 30, 2018

Member

Inotify monitoring of directories is not recursive: to monitor subdirectories under a directory, additional watches must be created. This can take a significant amount time for large directory trees.

Pretty sure that's meant for large amount of directories (>10000) rather than our 3 device plugins (or the not so many CSI sockets).

This comment has been minimized.

@chakri-nelluri

chakri-nelluri Jan 31, 2018

Contributor

CSI drivers should in almost all the cases be < 10.

This comment has been minimized.

@RenaudWasTaken

RenaudWasTaken Feb 5, 2018

Member

Something we may consider is whether we want to allow socket name with some timestamp suffix to allow a new instance to come up without tearing down the old plugin instance to ensure zero downtime

We could just require the directory to match the vendor name instead of exact match of the resource name. e.g:

  • /var/lib/kubelet/plugins/device-plugins/nvidia.com/gpu-v1.8.sock
  • /var/lib/kubelet/plugins/device-plugins/nvidia.com/gpu-v1.9.sock

Could both advertise resources under nvidia.com/*
Naming convention after nvidia.com/ would then be left to the vendor.

}

// GetResourceNameFrom returns base resource name from the plugin socket absolute path
func GetResourceNameFrom(sockPath string) (string, error) {

This comment has been minimized.

@ScorpioCPH

ScorpioCPH Jan 28, 2018

Member

Is this GetResourceNameFrom() only used for device plugin? Or other plugin too?


go func() {
for _, f := range files {
outChan <- PluginInfo{path + f.Name()}

This comment has been minimized.

@ScorpioCPH

ScorpioCPH Jan 28, 2018

Member

Will this return checkpoint files? as we use the same dir:

utilstore.NewFileStore(socketDir, utilfs.DefaultFs{})
// Tests that the device plugin manager correctly handles registration and re-registration by
// making sure that after registration, devices are correctly updated and if a re-registration
// happens, we will NOT delete devices; and no orphaned devices left.
func TestDevicePluginReRegistration(t *testing.T) {

This comment has been minimized.

@RenaudWasTaken

RenaudWasTaken Jan 29, 2018

Member

ReRegistration is still possible with this behavior and needs to be tested

@jiayingz

This comment has been minimized.

Copy link
Member

jiayingz commented Jan 29, 2018

@verult with your experience on pkg/volume/flexvolume/probe.go, it would be great if you can help take a look at pkg/kubelet/util/pluginwatcher/ part of this PR. Thanks a lot!

}

_, err := e.client.Register(context.Background(), reqt)
if err != nil {

This comment has been minimized.

@jiayingz

jiayingz Jan 29, 2018

Member

Just return err?

service Registration {
rpc Register(RegisterRequest) returns (Empty) {}
}

This comment has been minimized.

@jiayingz

jiayingz Jan 29, 2018

Member

This is an incompatible API change. I think we will need a new API directory. Is it too early to create a deviceplugin/v1beta1 directory? I know we haven't flipped the feature gate flag yet, but it seems generally api version is a pre-requirement for feature gate version. Perhaps we should discuss this on this week's RW meeting?

This comment has been minimized.

@vikaschoudhary16

vikaschoudhary16 Feb 1, 2018

Author Member

Thanks, this is an important question. Unfortunately missed the opportunity to discuss it at last meeting. Lets target for next meeting and meanwhile will try to discuss and get opinion on slack and list.

}
return nil
}

const (
// kubeletDevicePluginCheckpoint is the file name of device plugin checkpoint
kubeletDevicePluginCheckpoint = "kubelet_internal_checkpoint"

This comment has been minimized.

@jiayingz

jiayingz Jan 29, 2018

Member

Should we consider to move checkpoint file to a separate location, like /var/run/kubelet/device-plugins/?

This comment has been minimized.

@vikaschoudhary16

vikaschoudhary16 Feb 1, 2018

Author Member

may be in next PR?

// GetResourceNameFrom returns base resource name from the plugin socket absolute path
func GetResourceNameFrom(sockPath string) (string, error) {
if !((strings.Contains(sockPath, ".sock")) && (strings.Contains(sockPath, "/"))) {
return "", fmt.Errorf("invalid sock name: %s. Expected format: <sock-dir>/<domain>.<resource>.sock", sockPath)

This comment has been minimized.

@jiayingz

jiayingz Jan 29, 2018

Member

FYI, according to CSI proposal, the expected naming convention is /var/lib/kubelet/plugins/[SanitizedCSIDriverName]/csi.sock

For simplicity, wonder whether we can have csi plugins go to /var/lib/kubelet/plugins/csi/ directory and device plugins go to /var/lib/kubelet/plugins/device_plugins/? The parent directory can be a parameter passed in to NewPluginWatcher. Perhaps we can just pass the absolute path in the channel, and the PluginWatcher caller can do its own naming convention verification?

name = "all-srcs",
srcs = [":package-srcs"],
tags = ["automanaged"],
)

This comment has been minimized.

@jiayingz

jiayingz Jan 29, 2018

Member

Can we add a plugin_watcher_test.go with some unit test coverage?

This comment has been minimized.

@vikaschoudhary16

vikaschoudhary16 Feb 1, 2018

Author Member

sure

@vikaschoudhary16

This comment has been minimized.

Copy link
Member Author

vikaschoudhary16 commented Jan 31, 2018

Thanks guys for the inputs. Will soon update the PR.

if err != nil {
return err
}
watcher, err := fsnotify.NewWatcher()

This comment has been minimized.

@chakri-nelluri

chakri-nelluri Jan 31, 2018

Contributor

CSI drivers should in almost all the cases be < 10.

// GetResourceNameFrom returns base resource name from the plugin socket absolute path
func GetResourceNameFrom(sockPath string) (string, error) {
if !((strings.Contains(sockPath, ".sock")) && (strings.Contains(sockPath, "/"))) {
return "", fmt.Errorf("invalid sock name: %s. Expected format: <sock-dir>/<domain>.<resource>.sock", sockPath)

This comment has been minimized.

@chakri-nelluri

chakri-nelluri Jan 31, 2018

Contributor

I would use csi/plugins to keep the plugin hierarchy same and just in case CSI would need any checkpointing it has a place to do it.


// GetResourceNameFrom returns base resource name from the plugin socket absolute path
func GetResourceNameFrom(sockPath string) (string, error) {
if !((strings.Contains(sockPath, ".sock")) && (strings.Contains(sockPath, "/"))) {

This comment has been minimized.

@chakri-nelluri

chakri-nelluri Jan 31, 2018

Contributor

Do we need the check for "Contains (sockPath, "/")"? May be add a stricter prefix check?

This comment has been minimized.

@vikaschoudhary16

vikaschoudhary16 Feb 1, 2018

Author Member

This serves for extended resources naming format. Do you have something in mind related to a stricter prefix match?

"fmt"
"io/ioutil"
"os"
"reflect"
"sync/atomic"
// "sync/atomic"

This comment has been minimized.

@chakri-nelluri

chakri-nelluri Jan 31, 2018

Contributor

*Nit - remove

return &pluginapi.Empty{}, nil
existingDevs := make(map[string]pluginapi.Device)
e := newEndpointImpl(r.SockPath, name, conn, existingDevs, m.callback)
err = e.register(name)

This comment has been minimized.

@chakri-nelluri

chakri-nelluri Jan 31, 2018

Contributor

Do we expect the plugins to support register call? CSI does not support any register call today.

This comment has been minimized.

@RenaudWasTaken

RenaudWasTaken Jan 31, 2018

Member

We probably should re-use the service Identity of the CSI spec.
The Register call doesn't need to send the name to the plugin.

@chakri-nelluri

This comment has been minimized.

Copy link
Contributor

chakri-nelluri commented Jan 31, 2018

@vikaschoudhary16 Added comments.
CSI today doesn't support extra register call today. Other than that, it looks ok to me. Let me circle with the other folks and see what we can do to support the register call.

@chakri-nelluri

This comment has been minimized.

Copy link
Contributor

chakri-nelluri commented Jan 31, 2018

@vikaschoudhary16 @jingxu97 I am looking for more information on how different components of Kubelet like VolumeManager(CSI) & Network(CNI) can consume this? Do we have a write-up on how different components can consume this?

@chakri-nelluri

This comment has been minimized.

Copy link
Contributor

chakri-nelluri commented Feb 5, 2018

@vikaschoudhary16 @jingxu97 Any update on the consumption model from different Kubelet components?

@vikaschoudhary16 vikaschoudhary16 force-pushed the vikaschoudhary16:probing-mode branch from cf8d6ba to 5599bd7 Feb 12, 2018

@k8s-ci-robot k8s-ci-robot added size/XXL and removed size/XL labels Feb 12, 2018

@vikaschoudhary16 vikaschoudhary16 force-pushed the vikaschoudhary16:probing-mode branch from dd23c20 to 51af7dd Jul 16, 2018

@jiayingz
Copy link
Member

jiayingz left a comment

Only a couple nit comments. Overall lgtm.

func (m *Stub) Register(kubeletEndpoint, resourceName string, pluginSockDir string) error {
if pluginSockDir != "" {
if _, err := os.Stat(pluginSockDir + "DEPRECATION"); err == nil {
log.Println("Deprecation file found. Registration from plugin side.")

This comment has been minimized.

@jiayingz

jiayingz Jul 16, 2018

Member

Should it be the other way around, i.e., if Deprecation file found, skip Register gRPC call with Kubelet; if not, proceed?

}

return func(name string, endpoint string, versions []string, sockPath string) (chan bool, error) {
//chanForAckOfNotification := make(chan bool)

This comment has been minimized.

@jiayingz

jiayingz Jul 16, 2018

Member

remove this commented line?

initialConfig.FeatureGates[string(features.KubeletPluginsWatcher)] = true
})
devicePluginSockPaths := []string{pluginapi.DevicePluginPath, "/var/lib/kubelet/plugins/"}
//devicePluginSockPaths := []string{pluginapi.DevicePluginPath}

This comment has been minimized.

@jiayingz

jiayingz Jul 16, 2018

Member

remove this?

)

// Serial because the test restarts Kubelet
var _ = framework.KubeDescribe("Device Plugin [Feature:DevicePlugin][NodeFeature:DevicePlugin][Serial]", func() {
f := framework.NewDefaultFramework("device-plugin-errors")

Context("DevicePlugin", func() {
By("Enabling support for Kubelet Plugins Watcher")

This comment has been minimized.

@jiayingz

jiayingz Jul 16, 2018

Member

Do you think if it is possible to wrap lines 52-205 in a testDevicePlugin(f *framework.Framework, enableProbeWatcher bool) function? Then we can get test coverage on both settings with the following test config:

var _ = framework.KubeDescribe("NVIDIA GPU Device Plugin [Feature:GPUDevicePlugin][NodeFeature:GPUDevicePlugin][Serial] [Disruptive]", func() {
f := framework.NewDefaultFramework("device-plugin-gpus-errors")
testDevicePlugin(f, false)
}

var _ = framework.KubeDescribe("NVIDIA GPU Device Plugin [Feature:GPUDevicePluginProbe][NodeFeature:GPUDevicePluginProbe][Serial] [Disruptive]", func() {
f := framework.NewDefaultFramework("device-plugin-gpus-errors-probe-mode")
testDevicePlugin(f, true)
}

@vikaschoudhary16 vikaschoudhary16 changed the title Use probe based plugin watcher mechanism in Device Manager [WIP]Use probe based plugin watcher mechanism in Device Manager Jul 17, 2018

@vikaschoudhary16 vikaschoudhary16 force-pushed the vikaschoudhary16:probing-mode branch from 51af7dd to 854f30b Jul 17, 2018

@k8s-ci-robot k8s-ci-robot added size/L and removed size/XL labels Jul 17, 2018

@vikaschoudhary16 vikaschoudhary16 changed the title [WIP]Use probe based plugin watcher mechanism in Device Manager Use probe based plugin watcher mechanism in Device Manager Jul 17, 2018

@vikaschoudhary16 vikaschoudhary16 force-pushed the vikaschoudhary16:probing-mode branch from 854f30b to a584250 Jul 17, 2018

@jiayingz

This comment has been minimized.

Copy link
Member

jiayingz commented Jul 17, 2018

/lgtm
/approve
Thanks a lot for the work!

@vikaschoudhary16

This comment has been minimized.

Copy link
Member Author

vikaschoudhary16 commented Jul 17, 2018

/assign @derekwaynecarr for higher level approval

@vikaschoudhary16

This comment has been minimized.

Copy link
Member Author

vikaschoudhary16 commented Jul 17, 2018

/test pull-kubernetes-e2e-kops-aws
/test pull-kubernetes-local-e2e-containerized

@vikaschoudhary16

This comment has been minimized.

Copy link
Member Author

vikaschoudhary16 commented Jul 18, 2018

/test pull-kubernetes-local-e2e-containerized

2 similar comments
@vikaschoudhary16

This comment has been minimized.

Copy link
Member Author

vikaschoudhary16 commented Jul 25, 2018

/test pull-kubernetes-local-e2e-containerized

@vikaschoudhary16

This comment has been minimized.

Copy link
Member Author

vikaschoudhary16 commented Jul 27, 2018

/test pull-kubernetes-local-e2e-containerized

@vikaschoudhary16

This comment has been minimized.

Copy link
Member Author

vikaschoudhary16 commented Jul 27, 2018

@Random-Liu

This comment has been minimized.

Copy link
Member

Random-Liu commented Jul 27, 2018

/approve

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented Jul 27, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jiayingz, Random-Liu, vikaschoudhary16

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-github-robot

This comment has been minimized.

Copy link
Contributor

k8s-github-robot commented Jul 27, 2018

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

This comment has been minimized.

Copy link
Contributor

k8s-github-robot commented Jul 27, 2018

Automatic merge from submit-queue (batch tested with PRs 58755, 66414). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 32e38b6 into kubernetes:master Jul 27, 2018

13 of 18 checks passed

pull-kubernetes-local-e2e-containerized Job failed.
Details
Submit Queue Required Github CI test is not green: pull-kubernetes-e2e-gce
Details
pull-kubernetes-e2e-gce Job triggered.
Details
pull-kubernetes-e2e-gce-100-performance Job triggered.
Details
pull-kubernetes-kubemark-e2e-gce-big Job triggered.
Details
cla/linuxfoundation vikaschoudhary16 authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-cross Skipped
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-gke Skipped
pull-kubernetes-e2e-kops-aws Job succeeded.
Details
pull-kubernetes-e2e-kubeadm-gce Skipped
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-local-e2e Skipped
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented Jul 27, 2018

@vikaschoudhary16: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-local-e2e-containerized a584250 link /test pull-kubernetes-local-e2e-containerized
pull-kubernetes-e2e-gce a584250 link /test pull-kubernetes-e2e-gce

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-github-robot pushed a commit that referenced this pull request Jul 30, 2018

Kubernetes Submit Queue
Merge pull request #65948 from figo/stateless
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Simplify device manager: make endpoint stateless

While reviewing devicemanager code, found the caching layer on endpoint is redundant.
Here are the 3 related objects in picture:

**devicemanager  <->  endpoint <-> plugin**

plugin is the source of truth for devices and device health status.
devicemanager maintain healthyDevices, unhealthyDevices, allocatedDevices based on updates
from plugin.

So there is no point for endpoint to cache devices, this patch is removing the cache layer,
endpoint becomes stateless, which i believe should be the case (but i do welcome review
if i missed something here).

also removing the Manager.Devices() since i didn't find any caller of this other than test.

if we need to get all devices from manager in future, it just need to return healthyDevices + unhealthyDevices, so don't have to call endpoint after all.

This patch makes code more readable, data model been simplified.



**What this PR does / why we need it**:
this patch simplify the device manager code, make it more maintainable.

**Which issue(s) this PR fixes** *:
this is a refactor of device manager code

**Special notes for your reviewer**:
will need to rebase the code if #58755 get checked-in first.

**Release note**:

```release-note
None
```

/sig node
/cc @jiayingz @RenaudWasTaken @vishh @saad-ali @vikaschoudhary16 @vladimirvivien @anfernee
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.