
Orderly kernel module version upgrade #263

Closed
yevgeny-shnaidman opened this issue Jan 23, 2023 · 4 comments
yevgeny-shnaidman commented Jan 23, 2023

Issue summary

Kernel module version upgrade must allow the user to control the order in which pods (nodes) are upgraded, and the timing of the upgrade on a specific pod (node).

Current upgrade process

The current upgrade process is as follows:

  1. User updates the Module CR with the new container image
  2. KMM updates the ModuleLoader DaemonSet with the new configuration
  3. The Kubernetes DaemonSet controller initiates a rolling update of the DaemonSet, which means the DaemonSet's pods are destroyed and recreated one by one, in an order defined by the controller.
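
To make step 3 concrete, here is a minimal Python sketch (node names and the shuffle are illustrative assumptions) showing that the per-node replacement order is chosen by the controller, not the user:

```python
import random

def rolling_update(nodes, new_image):
    """Simulate a DaemonSet rolling update: the old pod on each node is
    deleted and a pod with the new image is created, one node at a time,
    in an order the controller picks (modeled here as a shuffle)."""
    order = list(nodes)
    random.shuffle(order)  # the user cannot rely on any particular order
    return [(node, new_image) for node in order]

pods = rolling_update(["node-a", "node-b", "node-c"], "driver:v2")
```

Every node ends up on the new image, but nothing in the API tells the user which node goes first.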

This flow causes the following difficulties:

  1. The user needs to terminate the workload on a node before the ModuleLoader pod is destroyed (otherwise modprobe -r will fail)
  2. The user does not want to delete the workloads on all nodes at once (downtime), but rather one node at a time
  3. Since the order of the ModuleLoader rolling update is unknown, the user cannot schedule the per-node workload downtime correctly
  4. If part of the workload runs in the DevicePlugin, the modprobe -r command will fail during pod rollout, since the DevicePlugin pods are not rolled out

Proposed Solution

  1. Add a "Module Version" field to the Module CRD. This field is set by the user and updated whenever the ModuleLoader image is updated. Its value is used by the customer to set ModuleVersionLabel on the nodes (i.e. if the customer wants to run version 2 of the Module on some node, they must label the node with ModuleVersionLabel=<module version>)
  2. Two new labels, ModuleLoaderVersionLabel and DevicePluginVersionLabel, are derived from the "Module Version" field. These labels are used by KMMO exclusively for adding/removing ModuleLoader/DevicePlugin pods on nodes.
  3. When the Module CR is updated with new "ModuleVersion" and "Image" fields, KMMO does not update the existing ModuleLoader; instead it creates a new ModuleLoader with the new image and the new ModuleLoaderVersionLabel as part of its NodeSelector. The same goes for the DevicePlugin: a new DaemonSet is created with the new DevicePluginVersionLabel as part of its NodeSelector
  4. The garbage collector needs to be updated to collect ModuleLoader DaemonSets with existing kernels but irrelevant module versions, and to collect DevicePlugin DaemonSets with irrelevant module versions
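
A sketch of how the derived labels in points 1–2 might be computed; the label key format here is a hypothetical placeholder, not the final API:

```python
def derived_labels(namespace, module_name, module_version):
    """Derive the KMMO-managed version label key/value pairs from the
    user-set "Module Version" field (key format is an assumption)."""
    prefix = f"{namespace}-{module_name}"
    return {
        # set by the user on each node that should run this version
        "module_version": (f"{prefix}-version", module_version),
        # managed exclusively by KMMO, used in DaemonSet node selectors
        "module_loader": (f"{prefix}-moduleloader-version", module_version),
        "device_plugin": (f"{prefix}-deviceplugin-version", module_version),
    }

labels = derived_labels("kmm", "my-driver", "2")
```

The key point is that the user touches only the first label; the other two exist so KMMO can add and remove the ModuleLoader and DevicePlugin pods independently of each other.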

Upgrade Flow example
Initiating Module

  1. User creates a Module CR with image, module version and selector fields
  2. User labels the appropriate nodes with 2 labels: <module_namespace>-<module-name>-version=<module version> and the node selector field from the CR
  3. The ModuleLoader is created with its node selector set to the CR's Selector field plus the derived Module version label, and starts running on the labeled nodes
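
The selection in step 3 follows standard Kubernetes node-selector semantics, which can be sketched as follows (label keys and values are illustrative):

```python
def selected(node_labels, node_selector):
    """A DaemonSet schedules its pod on a node only if every key/value
    pair in the node selector is present among the node's labels."""
    return all(node_labels.get(k) == v for k, v in node_selector.items())

# CR selector plus the derived version label
selector = {"gpu": "true", "kmm-my-driver-version": "1"}
labeled = {"gpu": "true", "kmm-my-driver-version": "1"}
unlabeled = {"gpu": "true"}  # matches the CR selector but lacks the version label
```

A node matching only the CR's selector is not enough; without the version label the ModuleLoader pod never lands there.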

Upgrading Module Version

  1. User updates the Module CR with a new image and a new module version
  2. KMMO creates a new ModuleLoader with its node selector set to the CR's Selector field plus the new derived ModuleLoader Version label. At this stage the new ModuleLoader is not running on any node, since no node carries the new ModuleLoader Version label
  3. KMMO creates a new DevicePlugin with its node selector set to the CR's Selector field plus the new derived DevicePlugin Version label. At this stage the new DevicePlugin is not running on any node, since no node carries the new DevicePlugin Version label
  4. User picks one node and removes the workload on that node
  5. Once the workload is removed, the user removes the ModuleVersionLabel from the node
  6. Once KMMO notices the ModuleVersionLabel removal, it removes the DevicePluginVersionLabel, which causes the DevicePlugin workload to stop running on the node
  7. Once the DevicePlugin workload is no longer running, KMMO removes the ModuleLoaderVersionLabel
  8. User adds the new ModuleVersionLabel to the node
  9. KMMO adds the new ModuleLoaderVersionLabel to the node, which causes the new version of the kernel module to be deployed on the node
  10. KMMO adds the new DevicePluginVersionLabel to the node, which causes a deployment of the device plugin onto the node
  11. User restores the workload on the node
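
The KMMO side of steps 6–10 can be sketched as a small reconcile function over one node's labels; the label keys and the boolean flag for device-plugin state are simplifying assumptions for illustration:

```python
MODULE = "kmm-my-driver-version"               # set/removed by the user
LOADER = "kmm-my-driver-moduleloader-version"  # managed by KMMO
PLUGIN = "kmm-my-driver-deviceplugin-version"  # managed by KMMO

def reconcile(labels, device_plugin_running):
    """One reconcile pass: align the KMMO-managed labels with the
    user-managed module version label on a single node."""
    desired = labels.get(MODULE)
    if desired is None:
        # steps 6-7: unwind the device plugin first, then the loader
        if PLUGIN in labels:
            del labels[PLUGIN]
        elif not device_plugin_running and LOADER in labels:
            del labels[LOADER]
    else:
        # steps 9-10: load the module first, then the device plugin
        if labels.get(LOADER) != desired:
            labels[LOADER] = desired
        elif labels.get(PLUGIN) != desired:
            labels[PLUGIN] = desired
    return labels

node = {MODULE: "1", LOADER: "1", PLUGIN: "1"}
del node[MODULE]                              # step 5: user removes the version label
reconcile(node, device_plugin_running=True)   # step 6: plugin label removed
reconcile(node, device_plugin_running=False)  # step 7: loader label removed
node[MODULE] = "2"                            # step 8: user labels for the new version
reconcile(node, device_plugin_running=False)  # step 9: loader label added
reconcile(node, device_plugin_running=False)  # step 10: plugin label added
```

Each pass changes at most one KMMO-managed label, which enforces the ordering: the device plugin always stops before the old module is unloaded, and the new module is always loaded before the device plugin starts.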
@yevgeny-shnaidman

/assign @yevgeny-shnaidman

hershpa commented Feb 21, 2023

As mentioned above, one of the main challenges of supporting seamless kernel module upgrade is if a workload is using the kernel module. In that case, the command modprobe -r to unload the kernel module will fail. As a result, there needs to be a way to remove all workloads that are using that particular module so that we can unload the module successfully.

Potential Solution
One way to potentially address this gap is to look at the resource exported by the device plugin associated with that particular kernel module. We can find all workloads on a node that are using the resource and delete only those workloads. If there is an API that can delete workloads on a node for a particular resource, then KMM can simply call that API prior to unloading the kernel module. Two APIs that may fit this use case are kubectl drain and kubectl delete pod; however, more investigation is needed to confirm whether they can accomplish what we need.

@uMartinXu

A device plugin exports specific resources to the cluster; to use any of these resources, a pod has to claim them in the YAML used to create the pod. So it might be possible to drain only the pods claiming any resources related to the driver module we want to upgrade. After that it is safe to rmmod the kernel module and then insmod the new module to upgrade the driver. We can start from the kubectl drain command and figure out whether we can add the resource claim as a parameter for draining the pods.
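
The filtering both comments describe (evict only the pods that claim the device-plugin resource, instead of draining the whole node) could be sketched roughly like this; the pod dict shape mirrors a simplified Kubernetes pod spec and the resource name is an assumption, not an existing API:

```python
def pods_claiming_resource(pods, node_name, resource_name):
    """Return the names of pods scheduled on `node_name` that request or
    limit the given extended resource (e.g. one exported by a device
    plugin)."""
    matched = []
    for pod in pods:
        if pod["nodeName"] != node_name:
            continue
        for container in pod["containers"]:
            resources = container.get("resources", {})
            if (resource_name in resources.get("requests", {})
                    or resource_name in resources.get("limits", {})):
                matched.append(pod["name"])
                break
    return matched

pods = [
    {"name": "gpu-job", "nodeName": "node-a",
     "containers": [{"resources": {"limits": {"vendor.com/gpu": "1"}}}]},
    {"name": "web", "nodeName": "node-a", "containers": [{}]},
    {"name": "gpu-job-2", "nodeName": "node-b",
     "containers": [{"resources": {"requests": {"vendor.com/gpu": "2"}}}]},
]
victims = pods_claiming_resource(pods, "node-a", "vendor.com/gpu")
```

Only the pod that actually claims the resource on that node is selected for eviction; unrelated pods on the same node are left alone.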

@yevgeny-shnaidman

@hershpa @uMartinXu KMM's responsibility is to deal only with kernel modules and device plugins. It does not deal with any workloads that run after the kernel modules are loaded. It is up to the customer's other operators to manage the workload. So it seems to me that those operators, and not KMM, should also decide when and how the workload should be removed from a node. In addition, allowing KMM to actually drain nodes is very problematic: KMM is not a core operator, and it does not know what workloads are running on the nodes. Draining a node removes ALL workloads from it, including those that have no dependency on the kernel module. In most cases, upgrading a kernel module does not require removing all the workloads from a node.
