-
Notifications
You must be signed in to change notification settings - Fork 39.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use probe based plugin watcher mechanism in Device Manager #58755
Use probe based plugin watcher mechanism in Device Manager #58755
Conversation
// Schedulable resource name. As of now it's expected to be a DNS Label | ||
string resource_name = 3; | ||
string resource_name = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The kubelet shouldn't be telling which resource_name the device plugin has.
You need a RegisterResponse
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for authentication. kubelet will probe a canonical path and will treat part of socket name as resource name. Device plugin can further authenticate that.
95b158e
to
74358df
Compare
sock := strings.Split(r.SockPath, "/") | ||
sock = strings.Split(sock[len(sock)-1], ".") | ||
if len(sock) != 3 { | ||
// NOTE: socket path must be in the form of /var/lib/kubelet/device-plugins/<domain-name>.<resource-name>.sock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's have the name be explicit in the gRPC protocol rather than try to fit that into the FS naming convention and then having to manipulate the string
} | ||
|
||
func WatchPlugins(path string, outChan chan PluginInfo) error { | ||
watcher, err := fsnotify.NewWatcher() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for implementing this change very quickly!
Haven't looked through the whole PR, but looking at this part, I am curious whether kubelet can re-discover all of plugins after it restarts. Do we need to scan through the directory initially?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, done. PTAL!
6f8b6f4
to
cf8d6ba
Compare
// GetResourceNameFrom returns base resource name from the plugin socket absolute path | ||
func GetResourceNameFrom(sockPath string) (string, error) { | ||
if !((strings.Contains(sockPath, ".sock")) && (strings.Contains(sockPath, "/"))) { | ||
return "", fmt.Errorf("invalid sock name: %s. Expected format: <sock-dir>/<domain>.<resource>.sock", sockPath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the probe func is meant to be built as a common mechanism for both device-plugin and other potential components of kubelet.
So, it’s specially for device-plugin indeed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I was mentioning higher, we shouldn't impose a name format for the sockets. We're trying to fit a DNS name into an FS naming convention, that's not a good model (as shown by the further transformation we have to do on the string) and is prone to implementation errors and user errors.
I also don't think it matches the simple API format we're trying to introduce here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, according to CSI proposal, the expected naming convention is /var/lib/kubelet/plugins/[SanitizedCSIDriverName]/csi.sock
For simplicity, wonder whether we can have csi plugins go to /var/lib/kubelet/plugins/csi/ directory and device plugins go to /var/lib/kubelet/plugins/device_plugins/? The parent directory can be a parameter passed in to NewPluginWatcher. Perhaps we can just pass the absolute path in the channel, and the PluginWatcher caller can do its own naming convention verification?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you point me to that convention ? I can't seem to find any directive in the spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flexvolume doesn't use sockets to communicate with kubelet; instead, kubelet executes Flexvolume drivers directly on host. I don't think this model works well for Flex, and IMO that's OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For simplicity, wonder whether we can have csi plugins go to /var/lib/kubelet/plugins/csi/ directory and device plugins go to /var/lib/kubelet/plugins/device_plugins/?
Hmm, we mwntioned moving the checkpointing data out of this directory, I think that's a good idea. In that case where would we put it?
Wouldn't it make more sense to have:
- var/lib/kubelet/csi/
- var/lib/kubelet/device-plugin/
- var/lib/kubelet/device-plugin/data.checkpoint
- var/lib/kubelet/device-plugin/plugins/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use csi/plugins to keep the plugin hierarchy same and just in case CSI would need any checkpointing it has a place to do it.
|
||
err = watcher.Add(path) | ||
if err != nil { | ||
watcher.Close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if err == nil
?
Maybe things like defer watcher.Close()
should be addressed outside to ensure its collection
return nil | ||
} | ||
|
||
// Dial establishes the gRPC communication with the picked up plugin socket. https://godoc.org/google.golang.org/grpc#Dial |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: put reference link in a newline.
if err != nil { | ||
return err | ||
} | ||
watcher, err := fsnotify.NewWatcher() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about use package filesystem .FSWatcher
here:
// FSWatcher is a callback-based filesystem watcher abstraction for fsnotify.
type FSWatcher interface {
// Initializes the watcher with the given watch handlers.
// Called before all other methods.
Init(FSEventHandler, FSErrorHandler) error
// Starts listening for events and errors.
// When an event or error occurs, the corresponding handler is called.
Run()
// Add a filesystem path to watch
AddWatch(path string) error
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Maybe take a look at the existing pkg/volume/flexvolume/probe.go and see whether we can follow a similar flow. Some example functionalities that the current pkg/volume/flexvolume/probe.go has and we may also consider:
- create the plugin directory if it doesn't exist
- recursively add watcher for sub directories
- watchEventInterval and watchEventLimit
- watch remove events
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend taking advantage of this interface, and consider extending it if it's insufficient.
watchEventInterval and watchEventLimit are there for rate limiting: it prevents a malicious actor from overwhelming kubelet by triggering too many inotify events.
In Flexvolume the requirement is that if ANYTHING inside a driver directory changes, the driver should update. For this reason, the Flexvolume prober recursively watches all subdirectories. But here AFAIK each device plugin is a single socket, so recursive watch won't be necessary.
My understanding is it's not necessary to watch remove events because gRPC will terminate when socket is deleted, triggering the endpoint deletion in managerImpl.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like that code lacks a stop function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@verult Thanks a lot for your helpful inputs! Agree we should try to reuse FSWatcher interface. For now, I think we may leave remove events and recursively watch out of this PR. For possible future extension, I would suggest that PluginWatcher passes the absolute path in the channel, and the PluginWatcher caller can do its own naming convention verification.
For CSI, is it possible that we use /var/lib/kubelet/plugins/csi/[SanitizedCSIDriverName].sock naming convention to work with a flat directory? If we have to use the proposed /var/lib/kubelet/plugins/[SanitizedCSIDriverName]/csi.sock format, we can add recursive watch in PluginWatcher in a later PR.
For device plugin, I think the reason why we can't just pass resource name in Registration response is because the advertised resource name may fail Kubelet's validation, if e.g., it doesn't follow extended resource name convention, or it collides with an existing name. I am fine with the naming format currently used in this PR. Something we may consider is whether we want to allow socket name with some timestamp suffix to allow a new instance to come up without tearing down the old plugin instance to ensure zero downtime. But I think we can leave that part out for now, given that we expect most plugins to be deployed as DaemonSet containers whose upgrades require container restarts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I thought about it more and I think a good solution could be to have to following model:
- The socket has the same format as the resource name (e.g: /var/lib/kubelet/plugins/device-plugins/nvidia.com/gpu.sock for the socket)
- The gRPC server advertises the same name as it's socket (e.g: "nvidia.com/gpu")
- Kubelet validates after calling
Register
that the socket name matches the advertised name (and applies resourceName validation)
or it collides with an existing name
That would be the same behavior as re-registration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also fine with this model. It is just that ProbeWatcher has to support recursive watching then, and I don't know whether that would create some performance concerns as inotify man page mentions that "Inotify monitoring of directories is not recursive: to monitor subdirectories under a directory, additional watches must be created. This can take a significant amount time for large directory trees."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inotify monitoring of directories is not recursive: to monitor subdirectories under a directory, additional watches must be created. This can take a significant amount time for large directory trees.
Pretty sure that's meant for large amount of directories (>10000) rather than our 3 device plugins (or the not so many CSI sockets).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CSI drivers should in almost all the cases be < 10.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something we may consider is whether we want to allow socket name with some timestamp suffix to allow a new instance to come up without tearing down the old plugin instance to ensure zero downtime
We could just require the directory to match the vendor name instead of exact match of the resource name. e.g:
- /var/lib/kubelet/plugins/device-plugins/nvidia.com/gpu-v1.8.sock
- /var/lib/kubelet/plugins/device-plugins/nvidia.com/gpu-v1.9.sock
Could both advertise resources under nvidia.com/*
Naming convention after nvidia.com/
would then be left to the vendor.
} | ||
|
||
// GetResourceNameFrom returns base resource name from the plugin socket absolute path | ||
func GetResourceNameFrom(sockPath string) (string, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this GetResourceNameFrom()
only used for device plugin? Or other plugin too?
|
||
go func() { | ||
for _, f := range files { | ||
outChan <- PluginInfo{path + f.Name()} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this return checkpoint files? as we use the same dir:
utilstore.NewFileStore(socketDir, utilfs.DefaultFs{})
// Tests that the device plugin manager correctly handles registration and re-registration by | ||
// making sure that after registration, devices are correctly updated and if a re-registration | ||
// happens, we will NOT delete devices; and no orphaned devices left. | ||
func TestDevicePluginReRegistration(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ReRegistration is still possible with this behavior and needs to be tested
@verult with your experience on pkg/volume/flexvolume/probe.go, it would be great if you can help take a look at pkg/kubelet/util/pluginwatcher/ part of this PR. Thanks a lot! |
} | ||
|
||
_, err := e.client.Register(context.Background(), reqt) | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just return err?
service Registration { | ||
rpc Register(RegisterRequest) returns (Empty) {} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an incompatible API change. I think we will need a new API directory. Is it too early to create a deviceplugin/v1beta1 directory? I know we haven't flipped the feature gate flag yet, but it seems generally api version is a pre-requirement for feature gate version. Perhaps we should discuss this on this week's RW meeting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this is an important question. Unfortunately missed the opportunity to discuss it at last meeting. Lets target for next meeting and meanwhile will try to discuss and get opinion on slack and list.
} | ||
return nil | ||
} | ||
|
||
const ( | ||
// kubeletDevicePluginCheckpoint is the file name of device plugin checkpoint | ||
kubeletDevicePluginCheckpoint = "kubelet_internal_checkpoint" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider to move checkpoint file to a separate location, like /var/run/kubelet/device-plugins/?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may be in next PR?
// GetResourceNameFrom returns base resource name from the plugin socket absolute path | ||
func GetResourceNameFrom(sockPath string) (string, error) { | ||
if !((strings.Contains(sockPath, ".sock")) && (strings.Contains(sockPath, "/"))) { | ||
return "", fmt.Errorf("invalid sock name: %s. Expected format: <sock-dir>/<domain>.<resource>.sock", sockPath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, according to CSI proposal, the expected naming convention is /var/lib/kubelet/plugins/[SanitizedCSIDriverName]/csi.sock
For simplicity, wonder whether we can have csi plugins go to /var/lib/kubelet/plugins/csi/ directory and device plugins go to /var/lib/kubelet/plugins/device_plugins/? The parent directory can be a parameter passed in to NewPluginWatcher. Perhaps we can just pass the absolute path in the channel, and the PluginWatcher caller can do its own naming convention verification?
pkg/kubelet/util/pluginwatcher/BUILD
Outdated
name = "all-srcs", | ||
srcs = [":package-srcs"], | ||
tags = ["automanaged"], | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a plugin_watcher_test.go with some unit test coverage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
Thanks guys for the inputs. Will soon update the PR. |
if err != nil { | ||
return err | ||
} | ||
watcher, err := fsnotify.NewWatcher() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CSI drivers should in almost all the cases be < 10.
// GetResourceNameFrom returns base resource name from the plugin socket absolute path | ||
func GetResourceNameFrom(sockPath string) (string, error) { | ||
if !((strings.Contains(sockPath, ".sock")) && (strings.Contains(sockPath, "/"))) { | ||
return "", fmt.Errorf("invalid sock name: %s. Expected format: <sock-dir>/<domain>.<resource>.sock", sockPath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use csi/plugins to keep the plugin hierarchy same and just in case CSI would need any checkpointing it has a place to do it.
|
||
// GetResourceNameFrom returns base resource name from the plugin socket absolute path | ||
func GetResourceNameFrom(sockPath string) (string, error) { | ||
if !((strings.Contains(sockPath, ".sock")) && (strings.Contains(sockPath, "/"))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the check for "Contains (sockPath, "/")"? May be add a stricter prefix check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This serves for extended resources naming format. Do you have something in mind related to a stricter prefix match?
"fmt" | ||
"io/ioutil" | ||
"os" | ||
"reflect" | ||
"sync/atomic" | ||
// "sync/atomic" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*Nit - remove
return &pluginapi.Empty{}, nil | ||
existingDevs := make(map[string]pluginapi.Device) | ||
e := newEndpointImpl(r.SockPath, name, conn, existingDevs, m.callback) | ||
err = e.register(name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we expect the plugins to support register call? CSI does not support any register call today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably should re-use the service Identity of the CSI spec.
The Register call doesn't need to send the name to the plugin.
@vikaschoudhary16 Added comments. |
@vikaschoudhary16 @jingxu97 I am looking for more information on how different components of Kubelet like VolumeManager(CSI) & Network(CNI) can consume this? Do we have a write-up on how different components can consume this? |
@vikaschoudhary16 @jingxu97 Any update on the consumption model from different Kubelet components? |
cf8d6ba
to
5599bd7
Compare
@@ -76,6 +77,12 @@ func (cm *containerManagerStub) GetCapacity() v1.ResourceList { | |||
return c | |||
} | |||
|
|||
func (cm *containerManagerStub) GetPluginRegistrationHandlerCbkFunc() pluginwatcher.RegisterCallbackFn { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
@@ -28,12 +28,16 @@ import ( | |||
"google.golang.org/grpc" | |||
|
|||
pluginapi "k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1" | |||
watcherapi "k8s.io/kubernetes/pkg/kubelet/apis/pluginregistration/v1alpha1" | |||
//"k8s.io/kubernetes/pkg/kubelet/util/pluginwatcher" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove the commented line?
} | ||
} | ||
|
||
func (m *ManagerImpl) isVersionCompatibleWithPlugin(versions []string) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently this is fine as we only have a single supported version. When we do need to support multiple versions in the future, we may need to extend this function to return a supported version. E.g., say kubelet supports v1beta1 and v1beta2, and we get v1alpha1 and v1beta1 from a device plugin, this function should return v1beta1 and we probably should pass that version to addEndpointProbeMode so that we can use correct version for communication. Maybe we can add some TODO here to make sure we don't forget in the future?
} | ||
m, p1, w := setupInProbeMode(t, devs, callback, socketName, pluginSocketName) | ||
atomic.StoreInt32(&expCallbackCount, 1) | ||
//p1.Register(socketName, testResourceName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove the commented line?
err = p2.Start() | ||
require.NoError(t, err) | ||
atomic.StoreInt32(&expCallbackCount, 2) | ||
//p2.Register(socketName, testResourceName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this line?
@@ -239,6 +240,33 @@ func (m *ManagerImpl) Start(activePods ActivePodsFunc, sourcesReady config.Sourc | |||
return nil | |||
} | |||
|
|||
// GetWatcherCallback returns callback function to be registered with plugin watcher | |||
func (m *ManagerImpl) GetWatcherCallback() watcher.RegisterCallbackFn { | |||
return func(name string, endpoint string, versions []string, sockPath string) (chan bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think whether we should create DEPRECATION file under m.socketDir before returning the callback function as we discussed in #58755 (comment)? This way, people can start testing the migration with pluginwatcher enabled and we can extend the device plugin e2e_node test and e2e test to start getting test coverage.
test/e2e_node/device_plugin.go
Outdated
|
||
//By("Register resources") | ||
//err = dp1.Register(pluginapi.KubeletSocket, resourceName) | ||
//framework.ExpectNoError(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove the commented lines?
test/e2e_node/device_plugin.go
Outdated
@@ -160,6 +162,112 @@ var _ = framework.KubeDescribe("Device Plugin [Feature:DevicePlugin] [Serial] [D | |||
f.PodClient().DeleteSync(pod1.Name, &metav1.DeleteOptions{}, framework.DefaultPodDeletionTimeout) | |||
f.PodClient().DeleteSync(pod2.Name, &metav1.DeleteOptions{}, framework.DefaultPodDeletionTimeout) | |||
}) | |||
It("Verifies the Kubelet device plugin functionality with Probe-Mode plugin discovery mechanism", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we do need to test both cases, but instead of having a separate similar test case, what do you think that we create a DEPRECATION file under /var/lib/kubelet/device_plugins/ directory in device manager GetWatcherCallback(), and add a helper function here that calls dp.Register only if that file doesn't exist? Perhaps even better is to do this in stub device plugin, and then we may add some test on upgrade scenario.
dd23c20
to
51af7dd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only a couple nit comments. Overall lgtm.
func (m *Stub) Register(kubeletEndpoint, resourceName string, pluginSockDir string) error { | ||
if pluginSockDir != "" { | ||
if _, err := os.Stat(pluginSockDir + "DEPRECATION"); err == nil { | ||
log.Println("Deprecation file found. Registration from plugin side.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be the other way around, i.e., if Deprecation file found, skip Register gRPC call with Kubelet; if not, proceed?
} | ||
|
||
return func(name string, endpoint string, versions []string, sockPath string) (chan bool, error) { | ||
//chanForAckOfNotification := make(chan bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this commented line?
test/e2e_node/device_plugin.go
Outdated
initialConfig.FeatureGates[string(features.KubeletPluginsWatcher)] = true | ||
}) | ||
devicePluginSockPaths := []string{pluginapi.DevicePluginPath, "/var/lib/kubelet/plugins/"} | ||
//devicePluginSockPaths := []string{pluginapi.DevicePluginPath} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this?
test/e2e_node/device_plugin.go
Outdated
) | ||
|
||
// Serial because the test restarts Kubelet | ||
var _ = framework.KubeDescribe("Device Plugin [Feature:DevicePlugin][NodeFeature:DevicePlugin][Serial]", func() { | ||
f := framework.NewDefaultFramework("device-plugin-errors") | ||
|
||
Context("DevicePlugin", func() { | ||
By("Enabling support for Kubelet Plugins Watcher") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think if it is possible to wrap lines 52-205 in a testDevicePlugin(f *framework.Framework, enableProbeWatcher bool) function? Then we can get test coverage on both settings with the following test config:
var _ = framework.KubeDescribe("NVIDIA GPU Device Plugin [Feature:GPUDevicePlugin][NodeFeature:GPUDevicePlugin][Serial] [Disruptive]", func() {
f := framework.NewDefaultFramework("device-plugin-gpus-errors")
testDevicePlugin(f, false)
}
var _ = framework.KubeDescribe("NVIDIA GPU Device Plugin [Feature:GPUDevicePluginProbe][NodeFeature:GPUDevicePluginProbe][Serial] [Disruptive]", func() {
f := framework.NewDefaultFramework("device-plugin-gpus-errors-probe-mode")
testDevicePlugin(f, true)
}
51af7dd
to
854f30b
Compare
854f30b
to
a584250
Compare
/lgtm |
/assign @derekwaynecarr for higher level approval |
/test pull-kubernetes-e2e-kops-aws |
/test pull-kubernetes-local-e2e-containerized |
2 similar comments
/test pull-kubernetes-local-e2e-containerized |
/test pull-kubernetes-local-e2e-containerized |
ping @derekwaynecarr |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jiayingz, Random-Liu, vikaschoudhary16 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/test all [submit-queue is verifying that this PR is safe to merge] |
Automatic merge from submit-queue (batch tested with PRs 58755, 66414). If you want to cherry-pick this change to another branch, please follow the instructions here. |
@vikaschoudhary16: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Simplify device manager: make endpoint stateless While reviewing devicemanager code, found the caching layer on endpoint is redundant. Here are the 3 related objects in picture: **devicemanager <-> endpoint <-> plugin** plugin is the source of truth for devices and device health status. devicemanager maintain healthyDevices, unhealthyDevices, allocatedDevices based on updates from plugin. So there is no point for endpoint to cache devices, this patch is removing the cache layer, endpoint becomes stateless, which i believe should be the case (but i do welcome review if i missed something here). also removing the Manager.Devices() since i didn't find any caller of this other than test. if we need to get all devices from manager in future, it just need to return healthyDevices + unhealthyDevices, so don't have to call endpoint after all. This patch makes code more readable, data model been simplified. **What this PR does / why we need it**: this patch simplify the device manager code, make it more maintainable. **Which issue(s) this PR fixes** *: this is a refactor of device manager code **Special notes for your reviewer**: will need to rebase the code if #58755 get checked-in first. **Release note**: ```release-note None ``` /sig node /cc @jiayingz @RenaudWasTaken @vishh @saad-ali @vikaschoudhary16 @vladimirvivien @anfernee
What this PR does / why we need it:
Uses this probe based utility in the device plugin manager.
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #56944
Notes For Reviewers:
Changes are backward compatible and existing device plugins will continue to work. At the same time, any new plugins that has required support for probing model (Identity service implementation), will also work.
Release note
/sig node
/area hw-accelerators
/cc /cc @jiayingz @RenaudWasTaken @vishh @ScorpioCPH @sjenning @derekwaynecarr @jeremyeder @lichuqiang @tengqm @saad-ali @chakri-nelluri @ConnorDoyle