Create configuration level device plugin #565

johnsonshih · 2023-03-03T08:57:10Z

What this PR does / why we need it:
This PR implements the proposal to expose resources at the Configuration level: https://github.com/project-akri/akri-docs/blob/main/proposals/configuration-level-resources.md#configuration-level-resources
With Configuration-level resources, users can select a resource to use at the Configuration level without knowing the instance id beforehand.

Special notes for your reviewer:
Summary of changes in this PR
- Configuration-level (CL) and Instance-level (IL) resources share the capacity pool, i.e. (# of allocated CL virtual devices + # of allocated IL virtual devices) <= capacity.
- The name of the CL device plugin is the Akri Configuration name and follows the same convention as the IL device plugin, i.e., replace ['.', '/'] with "-".
- The CL device plugin uses the same naming scheme as the IL device plugin for virtual devices. The virtual device id reported by the CL device plugin looks like _.
- ConfigurationDevicePlugin represents the behavior of the CL device plugin.
- DevicePluginService contains a list_and_watch_message_sender that signals list_and_watch to refresh. The DevicePluginService uses it internally; a copy is stored in the associated InstanceInfo so that external entities can refresh the DPS's list_and_watch.
- The Configuration DevicePluginService contains a list_and_watch_message_sender that signals list_and_watch to refresh. The Configuration DevicePluginService uses it internally; a copy is stored in the InstanceConfig so that external entities can refresh the Configuration DPS's list_and_watch.
- When the IL DPS allocates a virtual device, it notifies the CL DPS to refresh list_and_watch, and vice versa: the CL DPS notifies the IL DPS to refresh list_and_watch when it allocates a virtual device.
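
The naming rule and the shared capacity pool from the list above can be sketched roughly as follows. This is a minimal illustration rather than the code in this PR: `device_plugin_name` and `can_allocate` are hypothetical names, and the allocation counts are passed in directly instead of being read from an Instance.

```rust
/// Hypothetical helper (not from the PR): derives a device plugin name by
/// replacing '.' and '/' with '-', as described in the summary.
fn device_plugin_name(name: &str) -> String {
    name.replace(|c: char| c == '.' || c == '/', "-")
}

/// Hypothetical check of the shared pool: CL and IL allocations draw from the
/// same capacity, so one more device fits only while their sum is below it.
fn can_allocate(cl_allocated: usize, il_allocated: usize, capacity: usize) -> bool {
    cl_allocated + il_allocated < capacity
}
```

With a capacity of 2, `can_allocate(1, 0, 2)` allows one more allocation, while `can_allocate(1, 1, 2)` refuses it regardless of whether the request comes through the CL or the IL device plugin.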

If applicable:

this PR has an associated PR with documentation in akri-docs
this PR contains unit tests
added code adheres to standard Rust formatting (cargo fmt)
code builds properly (cargo build)
code is free of common mistakes (cargo clippy)
all Akri tests succeed (cargo test)
inline documentation builds (cargo doc)
all commits pass the DCO bot check by being signed off -- see the failing DCO check for instructions on how to retroactively sign commits

rpieczon · 2023-04-04T08:27:10Z

Is there any chance of having this as part of the upcoming release?


agent/src/util/crictl_containers.rs

kate-goldenring · 2023-04-07T17:11:12Z

@johnsonshih I haven't given this a thorough look yet, but I spec'ed out this work last April to a working POC, so I wanted to share the diff with you: main...kate-goldenring:akri:config-level-resources-poc.

Rather than creating a separate ConfigurationDevicePluginService, I added an internal_allocate_for_config call that is invoked from allocate when instance_name == config_name. This may be a simpler approach and result in less code duplication:

        if self.instance_name == self.config_name {
            match self
                .internal_allocate_for_config(requests, kube_interface)
                .await
            {
                Ok(resp) => Ok(resp),
                Err(e) => Err(e),
            }
        } else {
            match self.internal_allocate(requests, kube_interface).await {
                Ok(resp) => Ok(resp),
                Err(e) => Err(e),
            }
        }
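
One way to picture the single-service approach is a plugin-kind enum that `allocate` matches on. The sketch below is illustrative only: `DevicePluginKind`, the simplified `DevicePluginService`, and the string return values are assumptions and do not mirror the POC or this PR.

```rust
/// Hypothetical marker for which kind of device plugin a service backs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DevicePluginKind {
    Instance,
    Configuration,
}

/// Simplified stand-in for a device plugin service; a real service holds far
/// more state than these two names.
struct DevicePluginService {
    instance_name: String,
    config_name: String,
}

impl DevicePluginService {
    /// Mirrors the comparison above: a service whose instance name equals the
    /// Configuration name is treated as the Configuration-level plugin.
    fn kind(&self) -> DevicePluginKind {
        if self.instance_name == self.config_name {
            DevicePluginKind::Configuration
        } else {
            DevicePluginKind::Instance
        }
    }

    /// Dispatches to the matching allocation path and returns its result
    /// directly, without re-wrapping Ok and Err.
    fn allocate(&self, device_id: &str) -> Result<String, String> {
        match self.kind() {
            DevicePluginKind::Configuration => self.allocate_for_config(device_id),
            DevicePluginKind::Instance => self.allocate_for_instance(device_id),
        }
    }

    /// Placeholder for the Configuration-level allocation path.
    fn allocate_for_config(&self, device_id: &str) -> Result<String, String> {
        Ok(format!("config:{}:{}", self.config_name, device_id))
    }

    /// Placeholder for the Instance-level allocation path.
    fn allocate_for_instance(&self, device_id: &str) -> Result<String, String> {
        Ok(format!("instance:{}:{}", self.instance_name, device_id))
    }
}
```

The trade-off discussed in this thread still applies: a single struct keeps the diff small, but any field that only makes sense for one kind has to be tolerated by the other.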

kate-goldenring

Thanks for diving into this @johnsonshih! I think we can simplify this and still use one DevicePluginService per #565 (comment). You are also welcome to pull in anything from the POC I put together last year: main...kate-goldenring:akri:config-level-resources-poc. Sorry for not sharing it earlier 😵‍💫


johnsonshih · 2023-04-07T20:53:15Z

Thanks for diving into this @johnsonshih! I think we can simplify this and still use one DevicePluginService per #565 (comment). You are also welcome to pull in anything from the POC I put together last year: main...kate-goldenring:akri:config-level-resources-poc. Sorry for not sharing it earlier 😵‍💫

Thanks for the information. I purposely separated the instance plugin service and the configuration plugin service into two different structs because most of the fields used in DevicePluginService are specific to an instance (e.g., instance_name), and the behaviors of the instance device plugin and the configuration device plugin differ as well. Putting both behaviors in the same DevicePluginService struct makes it difficult to maintain.

I do have some follow-up refactoring planned for the CL resource after this PR is committed. I'll definitely check the POC branch to see what we can leverage to improve the design and eliminate code duplication. Given the schedule, is it OK if we focus on committing this PR? As long as the behavior is correct, we can improve our code base with follow-up changes. Thanks!


kate-goldenring

Thanks for iterating on this @johnsonshih. I still do not see the need for the ConfigurationDevicePluginService and the duplication of code. Can we aim to reduce the size of the changes in this PR and maintain one DevicePluginService with conditional extra functionality if it is a Configuration-level DP? I am concerned about maintainability, debuggability, and reviewability. One approach would be to start with one DevicePluginService to reduce the size of this PR, and then add separate services in a subsequent PR if we feel we want them. It is hard here to see what has changed versus what was relocated. Alternatively, we could move the shared code of list_and_watch into shared methods, though it could be argued that that consolidation belongs in a separate PR, so I am talking in circles.

agent/src/util/device_plugin_service.rs

johnsonshih · 2023-05-27T01:27:22Z

In this PR, each Configuration virtual device is considered distinct, which is why you see the same instance allocated to a pod multiple times. This is based on the proposal https://github.com/project-akri/akri-docs/blob/main/proposals/configuration-level-resources.md
The device plugin created for each Configuration will contain capacity * number of instances slots. Each slot will map to a "real" slot of an Instance device plugin. For example, after deploying an onvif Configuration with a capacity of 2, the NodeSpec of a node that could see both cameras would be:
Capacity:
akri.sh/akri-onvif: 4
akri.sh/akri-onvif-8120fe: 2
akri.sh/akri-onvif-a19705: 2

I'm going to create a switch in Configuration to support different behaviors for the Configuration virtual devices, i.e., introduce a switch in Configuration (switch name "configurationResourceFrom", possible values are "instance" and "deviceUsage". Default value of configurationResourceFrom : "instance")
configurationResourceFrom: instance => on a node, only one instance allocated.
configurationResourceFrom: deviceUsage => on a node, slots from an instance are considered different, so it's possible to have multiple slots (from the same instance) allocated as Configuration device plugin.
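
The difference between the two proposed values can be sketched as a count of Configuration virtual devices reported on a node. This is a rough illustration under stated assumptions: the enum and function names are hypothetical, and `instance` is read here as "one CL virtual device per instance", while `deviceUsage` is read as "one CL virtual device per usage slot".

```rust
/// Hypothetical representation of the proposed `configurationResourceFrom` switch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ConfigurationResourceFrom {
    Instance,
    DeviceUsage,
}

impl Default for ConfigurationResourceFrom {
    /// The comment above proposes "instance" as the default value.
    fn default() -> Self {
        ConfigurationResourceFrom::Instance
    }
}

/// Hypothetical helper: number of Configuration virtual devices a node would
/// report for `instances` visible instances, each with `capacity` usage slots.
fn reported_configuration_devices(
    mode: ConfigurationResourceFrom,
    instances: usize,
    capacity: usize,
) -> usize {
    match mode {
        // Each instance appears once as a Configuration virtual device.
        ConfigurationResourceFrom::Instance => instances,
        // Every usage slot of every instance appears as its own virtual device.
        ConfigurationResourceFrom::DeviceUsage => instances * capacity,
    }
}
```

For the onvif example above (2 cameras, capacity 2), the `DeviceUsage` reading gives the 4 shown for akri.sh/akri-onvif, whereas the `Instance` reading would give 2.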

kate-goldenring · 2023-06-05T18:16:48Z

I'm going to create a switch in Configuration to support different behaviors for the Configuration virtual devices, i.e., introduce a switch in Configuration (switch name "configurationResourceFrom", possible values are "instance" and "deviceUsage". Default value of configurationResourceFrom : "instance")

@johnsonshih I like the idea of making this configurable but i'd rather not expose users to the nuance of "deviceUsage" and IMO only contributors are familiar with that term. Maybe the switch can be uniqueDevices: true? I don't love that either but maybe we can brainstorm naming more

kate-goldenring · 2023-06-05T18:17:03Z

@johnsonshih would you be able to demo this at the community meeting tomorrow?

johnsonshih · 2023-06-07T20:01:15Z

I'm going to create a switch in Configuration to support different behaviors for the Configuration virtual devices, i.e., introduce a switch in Configuration (switch name "configurationResourceFrom", possible values are "instance" and "deviceUsage". Default value of configurationResourceFrom : "instance")

@johnsonshih I like the idea of making this configurable but i'd rather not expose users to the nuance of "deviceUsage" and IMO only contributors are familiar with that term. Maybe the switch can be uniqueDevices: true? I don't love that either but maybe we can brainstorm naming more

I'll change to use uniqueDevices.


kate-goldenring

Thanks for continuing to iterate on this @johnsonshih. I have concerns about potential race conditions and about how the Configuration Device Plugin appears to be reporting devices.

kate-goldenring · 2023-06-16T20:55:04Z

agent/src/util/device_plugin_service.rs

+#[derive(Clone, Debug)]
+pub struct InstanceConfig {
+    pub usage_update_message_sender: Option<broadcast::Sender<ListAndWatchMessageKind>>,
+    pub instances: HashMap<String, InstanceInfo>,
+}


Please add documentation around all public structs and fields (see InstanceInfo and InstanceConnectivityStatus)

add document comments

kate-goldenring · 2023-06-16T21:00:35Z

agent/src/util/discovery_operator.rs

+        // Create a device plugin for the Configuration
+        let config_dp_name = get_device_configuration_name(&config_name);
+        trace!(
+            "start_discovery - create configuration device plugin {}",


Suggested change

"start_discovery - create configuration device plugin {}",

"internal_start_discovery - create configuration device plugin {}",

update message

kate-goldenring · 2023-06-16T21:00:42Z

agent/src/util/discovery_operator.rs

+            }
+            Err(e) => {
+                error!(
+                    "start_discovery - error {} building configuration device plugin",


Suggested change

"start_discovery - error {} building configuration device plugin",

"internal_start_discovery - error {} building configuration device plugin",

update message

kate-goldenring · 2023-06-16T21:08:24Z

agent/src/util/device_plugin_service.rs

+impl InstanceConfig {
+    pub fn new() -> Self {
+        InstanceConfig {
+            usage_update_message_sender: None,
+            instances: HashMap::new(),
+        }
+    }
 }


This is equivalent to the default values of InstanceConfig's fields. If you derive Default on the struct, you don't need this constructor and can change all instantiations to InstanceConfig::default()

Suggested change

impl InstanceConfig {
    pub fn new() -> Self {
        InstanceConfig {
            usage_update_message_sender: None,
            instances: HashMap::new(),
        }
    }
}

use Default trait
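
A minimal sketch of what deriving `Default` buys here, under simplifying assumptions: `InstanceInfo` is reduced to an empty placeholder, and `std::sync::mpsc::Sender<String>` stands in for the `broadcast::Sender<ListAndWatchMessageKind>` in the diff so the example needs no external crates.

```rust
use std::collections::HashMap;
use std::sync::mpsc::Sender;

/// Empty placeholder; the real InstanceInfo carries per-instance state.
#[derive(Clone, Debug, Default)]
struct InstanceInfo;

/// Simplified InstanceConfig. `Option<T>` defaults to `None` for any `T`, so
/// the sender type itself does not need to implement `Default`.
#[derive(Clone, Debug, Default)]
struct InstanceConfig {
    /// Stand-in for the usage update sender in the diff.
    usage_update_message_sender: Option<Sender<String>>,
    /// Instances discovered for this Configuration, keyed by instance name.
    instances: HashMap<String, InstanceInfo>,
}
```

`InstanceConfig::default()` then yields the same value the hand-written `new()` produced: no sender and an empty map.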

kate-goldenring · 2023-06-16T21:09:07Z

agent/src/util/device_plugin_service.rs

+    pub configuration_usage_slots: HashSet<String>,
+}
+
+#[derive(Clone, Debug)]


Just tag with default instead of having constructor behave as default would (per comment below)

Suggested change

#[derive(Clone, Debug)]

#[derive(Clone, Debug, Default)]

add Default trait

kate-goldenring · 2023-06-16T22:00:33Z

agent/src/util/device_plugin_service.rs

+                .unwrap();
+                discovered_devices.insert(instance_name, virtual_devices);
+            }
+            // construct virtual device info list


nit: consistency with comments (Capitalize the first letter of the comment)

update comments

kate-goldenring · 2023-06-16T22:01:22Z

agent/src/util/device_plugin_service.rs

-        );
-        Err(Status::new(
+fn build_virtual_devices_list_for_instance(
+    device_map: HashMap<String, Vec<v1beta1::Device>>,


This should be able to take in a reference to avoid unnecessary cloning

Suggested change

device_map: HashMap<String, Vec<v1beta1::Device>>,

device_map: &HashMap<String, Vec<v1beta1::Device>>,

pass reference instead of value
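
As an illustration of the borrowing suggestion, the sketch below takes the map by reference so the caller keeps ownership and can reuse it without cloning. `Device` is a hypothetical stand-in for `v1beta1::Device`, and `healthy_device_ids` is an invented helper, not a function from this PR.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `v1beta1::Device`.
#[derive(Clone, Debug)]
struct Device {
    id: String,
    health: String,
}

/// Borrows the map instead of consuming it and returns the sorted ids of all
/// devices whose health is "Healthy".
fn healthy_device_ids(device_map: &HashMap<String, Vec<Device>>) -> Vec<String> {
    let mut ids: Vec<String> = device_map
        .values()
        .flatten()
        .filter(|d| d.health == "Healthy")
        .map(|d| d.id.clone())
        .collect();
    ids.sort();
    ids
}
```

Because only a shared reference is passed, the same `device_map` can be handed to this function repeatedly, or to other readers, with no copy of the underlying vectors.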

kate-goldenring · 2023-06-16T22:01:57Z

agent/src/util/device_plugin_service.rs

+                    },
+                )
+                .await
+                .unwrap();


Please propagate errors instead of unwrapping

remove unwrap()
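
The general pattern being asked for can be sketched with the `?` operator. The names below (`UpdateError`, `update_slot`, `update_all`) are hypothetical and synchronous for brevity; the call in the diff is awaited, but `?` works the same way after `.await`.

```rust
/// Hypothetical error type for the example.
#[derive(Debug, PartialEq)]
struct UpdateError(String);

/// Hypothetical fallible step standing in for the call that was unwrapped.
fn update_slot(slot: &str) -> Result<String, UpdateError> {
    if slot.is_empty() {
        Err(UpdateError("empty slot id".to_string()))
    } else {
        Ok(format!("updated {}", slot))
    }
}

/// Uses `?` so the first failure is returned to the caller instead of
/// panicking the way `.unwrap()` would.
fn update_all(slots: &[&str]) -> Result<Vec<String>, UpdateError> {
    let mut results = Vec::new();
    for slot in slots {
        results.push(update_slot(slot)?);
    }
    Ok(results)
}
```

The caller can then decide whether a failed update should be logged, retried, or surfaced as an error, rather than the task aborting.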

kate-goldenring · 2023-06-16T22:05:18Z

agent/src/util/device_plugin_service.rs

-            device_usage_id
-        );
-        Err(Status::new(
+fn build_virtual_devices_list_for_instance(


I think this function could use a different name. it isn't building virtual devices rather if i am reading it correctly, checking if there is a healthy slot. How about instances_with_healthy_devices and maybe a comment explaining the function

Change function name to build_virtual_device_health_state_for_instance and add comments.

kate-goldenring · 2023-06-16T22:08:30Z

agent/src/util/device_plugin_service.rs

-        Err(Status::new(
+fn build_virtual_devices_list_for_instance(
+    device_map: HashMap<String, Vec<v1beta1::Device>>,
+) -> HashMap<String, String> {


This function is behaving differently than I would expect. If I have a Configuration with a capacity of 4 and 3 instances are discovered, I would expect 12 virtual devices to be reported. Why only 3 here? Kubelet keeps track of its allocate calls. It will not call allocate again on a device it has already allocated, therefore the other 3 capacity slots will never be used

For a capacity of 4 with 3 discovered instances, there are 12 Instance.deviceUsage slots in total. These deviceUsage slots are shared by all IL and CL virtual devices from all nodes.
For CL resources on a node, every instance can only be allocated as a CL virtual device once, hence the CL DP reports 3 devices: Instance-0, Instance-1, Instance-2. The CL virtual device x is Healthy as long as Instance-x has a free slot available or is already allocated as a CL virtual device.

If the CL DP reported 12 virtual devices, then on allocating one CL virtual device it would need to refresh list_and_watch to flip the other virtual devices of the allocated instance to Unhealthy. Otherwise Kubernetes would call allocate up to 12 times, since it would consider all virtual devices healthy and allocatable.

IMO, exposing CL virtual devices at the usage-slot level is more difficult to track and more likely to get out of sync, since list_and_watch runs in a different thread. From the user's perspective, it is more intuitive to see the IL/CL resource availability if CL virtual devices are reported per Instance.
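
The per-instance health rule described in this reply can be sketched as below. This is a simplified model, not the PR's function: `InstanceSlots` and `cl_device_health` are hypothetical names, a slot is modelled as a string that is empty when free, and the "already allocated as a CL virtual device on this node" state is reduced to a boolean.

```rust
use std::collections::HashMap;

/// Hypothetical per-instance view used only for this sketch.
struct InstanceSlots {
    /// One entry per usage slot: the name of its current user, or an empty
    /// string when the slot is free.
    slots: Vec<String>,
    /// True when this node already holds the instance as a CL virtual device.
    allocated_as_cl: bool,
}

/// Reports one CL virtual device per instance. It is "Healthy" when the
/// instance still has a free slot or is already allocated as a CL virtual
/// device, and "Unhealthy" otherwise.
fn cl_device_health(instances: &HashMap<String, InstanceSlots>) -> HashMap<String, String> {
    instances
        .iter()
        .map(|(name, inst)| {
            let has_free_slot = inst.slots.iter().any(|s| s.is_empty());
            let health = if has_free_slot || inst.allocated_as_cl {
                "Healthy"
            } else {
                "Unhealthy"
            };
            (name.clone(), health.to_string())
        })
        .collect()
}
```

Under this model the number of reported CL devices always equals the number of instances, which is the behavior questioned in the replies that follow.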

Have you successfully allocated N+1 CL devices when N instances are discovered with a capacity of 2? I do not believe it is possible to do this and for kubelet to be healthy.

I ran this branch locally to test this out. You cannot use the full capacity of devices with the current implementation. For example, when deploying this helm chart (with the agent and controller running locally):

helm install akri akri-helm-charts/akri $AKRI_HELM_CRICTL_CONFIGURATION --set agent.enabled=false --set controller.enabled=false --set debugEcho.configuration.enabled=true --set debugEcho.configuration.shared=false --set debugEcho.configuration.capacity=2

These devices are discovered

kagold@kagold-ThinkPad-X1-Carbon-6th:~$ kubectl get akrii
NAME                     CONFIG            SHARED   NODES        AGE
akri-debug-echo-a19705   akri-debug-echo   true     ["myNode"]   10m
akri-debug-echo-8120fe   akri-debug-echo   true     ["myNode"]   10m

The node is labeled with the following resources. Note how only 2 akri.sh/akri-debug-echo devices are "allocatable" despite each instance having a capacity of 2, so 4 should be available to pods:

Capacity:
  akri.sh/akri-debug-echo: 2
  akri.sh/akri-debug-echo-8120fe: 2
  akri.sh/akri-debug-echo-a19705: 2
...
Allocatable:
  akri.sh/akri-debug-echo: 2
  akri.sh/akri-debug-echo-8120fe: 2
  akri.sh/akri-debug-echo-a19705: 2

To illustrate the point, create a Deployment with 3 replicas:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: debug-echo-deployment
  labels:
    app: debug-echo-broker
spec:
  replicas: 3
  selector:
    matchLabels:
      app: debug-echo-broker
  template:
    metadata:
      labels:
        app: debug-echo-broker
    spec:
      containers:
      - name: debug-echo-broker
        image: nginx
        resources:
          limits:
            akri.sh/akri-debug-echo: "1"
          requests:
            akri.sh/akri-debug-echo: "1"

Only two pods run with the third left pending due to no resource being available:

NAME                                     READY   STATUS    RESTARTS   AGE
debug-echo-deployment-57ccdc95d9-24726   0/1     Pending   0          3m59s
debug-echo-deployment-57ccdc95d9-2wl6b   1/1     Running   0          3m59s
debug-echo-deployment-57ccdc95d9-5zkvb   1/1     Running   0          3m59s

$ kubectl describe pod debug-echo-deployment-57ccdc95d9-24726
...
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  4m18s  default-scheduler  0/1 nodes are available: 1 Insufficient akri.sh/akri-debug-echo. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..

This was a key reason why i did a direct mapping of CL slots to IL slots in the original implementation main...kate-goldenring:akri:config-level-resources-poc


johnsonshih requested review from bfjelds, kate-goldenring, jiria, Britel, romoh and adithyaj as code owners March 3, 2023 08:57

johnsonshih added 14 commits April 4, 2023 09:20

Set pre_start_required to false in get_device_plugin_options

7f569ee

Signed-off-by: Johnson Shih <jshih@microsoft.com>

Suffix usage slot to annotation key name

c18de5d

Signed-off-by: Johnson Shih <jshih@microsoft.com>

append hash of device usage id to device property key name

590e570

Signed-off-by: Johnson Shih <jshih@microsoft.com>

Refine trait DevicePluginBuilderInterface

113ac77

Signed-off-by: Johnson Shih <jshih@microsoft.com>

Change build_container_allocate_response to accept a list of devices

7813770

Signed-off-by: Johnson Shih <jshih@microsoft.com>

decouple build_list_and_watch_response from DevicePluginService

71b7ac6

Signed-off-by: Johnson Shih <jshih@microsoft.com>

extract function allocate_for_instance

e701738

Signed-off-by: Johnson Shih <jshih@microsoft.com>

save discovery device in InstanceInfo

c039107

Signed-off-by: Johnson Shih <jshih@microsoft.com>

Add usage_update_message_sender to DiscoveryOperator

2ffc366

Signed-off-by: Johnson Shih <jshih@microsoft.com>

Add support for Configuration DevicePlugin

23965ac

Signed-off-by: Johnson Shih <jshih@microsoft.com>

Create configuration device plugin

0582c50

Signed-off-by: Johnson Shih <jshih@microsoft.com>

CHECK: do we need this semi colon?

1aebc5b

Signed-off-by: Johnson Shih <jshih@microsoft.com>

DevicePluginService notify ConfigurationDevicePluginService about usage change

5fc980e

Signed-off-by: Johnson Shih <jshih@microsoft.com>

address clippy warnings

bcc848e

Signed-off-by: Johnson Shih <jshih@microsoft.com>

johnsonshih force-pushed the user/jshih/configuration-device-plugin branch from 5e3ed62 to bcc848e Compare April 4, 2023 16:21

johnsonshih added 2 commits April 4, 2023 18:02

address clippy warning

0b61fac

Signed-off-by: Johnson Shih <jshih@microsoft.com>

Merge branch 'main' into user/jshih/configuration-device-plugin

0588b07

kate-goldenring reviewed Apr 7, 2023


kate-goldenring requested changes Apr 7, 2023


Merge branch 'main' into user/jshih/configuration-device-plugin

ee2e846

Signed-off-by: Johnson Shih <jshih@microsoft.com>

johnsonshih requested a review from kate-goldenring April 7, 2023 23:42

johnsonshih added 3 commits May 16, 2023 18:49

refactor code, notify between Configuration and Instance device plugin

7c55f75

Signed-off-by: Johnson Shih <jshih@microsoft.com>

Update version

5f476ab

Signed-off-by: Johnson Shih <jshih@microsoft.com>

move definition of DevicePluginService struct to a better location

e7b9353

Signed-off-by: Johnson Shih <jshih@microsoft.com>

kate-goldenring requested changes May 19, 2023


johnsonshih closed this Jun 7, 2023

johnsonshih reopened this Jun 7, 2023

johnsonshih added 3 commits June 7, 2023 21:01

Merge branch 'main' into user/jshih/configuration-device-plugin

b45e95b

Signed-off-by: Johnson Shih <jshih@microsoft.com>

use enum for device plugin type

26d5307

Signed-off-by: Johnson Shih <jshih@microsoft.com>

cargo fmt

d9da6a5

Signed-off-by: Johnson Shih <jshih@microsoft.com>

johnsonshih requested review from diconico07 and kate-goldenring June 8, 2023 04:27

johnsonshih added 3 commits June 10, 2023 11:48

remove uniqueDevices

b4badba

Signed-off-by: Johnson Shih <jshih@microsoft.com>

Merge branch 'main' into user/jshih/configuration-device-plugin

f73f342

Signed-off-by: Johnson Shih <jshih@microsoft.com>

move function to a different location

9f3236e

Signed-off-by: Johnson Shih <jshih@microsoft.com>

johnsonshih marked this pull request as draft June 15, 2023 17:07

johnsonshih added 3 commits June 15, 2023 14:46

Merge branch 'main' into user/jshih/configuration-device-plugin

f52cbed

Signed-off-by: Johnson Shih <jshih@microsoft.com>

clippy lint

43c3c0a

Signed-off-by: Johnson Shih <jshih@microsoft.com>

Merge branch 'user/jshih/configuration-device-plugin' of https://github.com/johnsonshih/akri into user/jshih/configuration-device-plugin

61f500a

johnsonshih marked this pull request as ready for review June 15, 2023 22:08

kate-goldenring requested changes Jun 16, 2023


kate-goldenring removed the version/patch label (Patch version change is needed) Jun 16, 2023

johnsonshih added 2 commits June 16, 2023 16:27

Use Default for InstanceConfig

29bfb02

Signed-off-by: Johnson Shih <jshih@microsoft.com>

Check conflict and bail out when updating instance

0aa9783

Signed-off-by: Johnson Shih <jshih@microsoft.com>

johnsonshih marked this pull request as draft June 25, 2023 18:40

johnsonshih closed this Jul 11, 2023

kate-goldenring mentioned this pull request Jul 12, 2023

Add Configuration device plugin #627

Merged

8 tasks
