
feat: dynamically update node resource capacity and taints from plugin ping responses #516

Draft
Copilot wants to merge 3 commits into main from
copilot/dynamic-update-node-resources

Contributor

Copilot AI commented Apr 3, 2026

Node resource capacity (CPU, memory, pods, GPUs, FPGAs) and taints were fixed at initialization and never updated based on plugin state. This adds support for plugins to report live resource availability and node taints via the existing /pinglink response body.

Changes

New types (pkg/interlink/types.go)

  • PingResponse — structured ping response envelope with optional resources and taints fields
  • ResourcesResponse — resource capacities with JSON lowercase keys (cpu, memory, pods, accelerators)
  • AcceleratorResponse — per-accelerator entry (resourceType + available)
  • TaintResponse — per-taint entry (key, value, effect)
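The four types above could be laid out roughly as follows. This is a sketch reconstructed from the field names in this description, not a copy of pkg/interlink/types.go: the Go field names, the pointer on Resources, the omitempty tags, and the decodePing helper are all assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PingResponse is the optional structured body a plugin may return on ping.
// Layout is an assumption based on the JSON keys named in the PR description.
type PingResponse struct {
	Status    string             `json:"status,omitempty"`
	Resources *ResourcesResponse `json:"resources,omitempty"`
	Taints    []TaintResponse    `json:"taints,omitempty"`
}

// ResourcesResponse carries capacities as Kubernetes quantity strings.
type ResourcesResponse struct {
	CPU          string                `json:"cpu,omitempty"`
	Memory       string                `json:"memory,omitempty"`
	Pods         string                `json:"pods,omitempty"`
	Accelerators []AcceleratorResponse `json:"accelerators,omitempty"`
}

// AcceleratorResponse is one accelerator entry, e.g. nvidia.com/gpu.
type AcceleratorResponse struct {
	ResourceType string `json:"resourceType"`
	Available    string `json:"available"`
}

// TaintResponse is one taint entry reported by the plugin.
type TaintResponse struct {
	Key    string `json:"key"`
	Value  string `json:"value,omitempty"`
	Effect string `json:"effect"`
}

// decodePing is a hypothetical helper that decodes a raw ping body.
func decodePing(body string) (PingResponse, error) {
	var pr PingResponse
	err := json.Unmarshal([]byte(body), &pr)
	return pr, err
}

func main() {
	body := `{"status":"ok","resources":{"cpu":"128","memory":"512Gi",` +
		`"accelerators":[{"resourceType":"nvidia.com/gpu","available":"8"}]}}`
	pr, err := decodePing(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(pr.Resources.CPU, pr.Resources.Memory, pr.Resources.Accelerators[0].ResourceType)
}
```

Keeping quantities as strings defers validation to the virtual kubelet, which can then reject a single bad value without discarding the rest of the response.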

Dynamic update logic (pkg/virtualkubelet/virtualkubelet.go)

  • New updateNodeResources() method validates quantities via resource.ParseQuantity and updates both node.Status.Capacity and node.Status.Allocatable; invalid values emit a warning and leave the field unchanged
  • New updateNodeTaints() method replaces all non-system taints with the plugin-supplied list; the built-in virtual-node.interlink/no-schedule taint is always preserved; unknown effects default to NoSchedule with a warning
  • nodeUpdate() now attempts to unmarshal successful ping responses as PingResponse; if a resources or taints field is present, applies the respective update before calling onNodeChangeCallback
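The taint replacement rule described above can be sketched with local stand-in types. Taint stands in for the Kubernetes v1.Taint type, and mergeTaints and normalizeEffect are hypothetical names. The sketch assumes that only the built-in key counts as a system taint and that a plugin-supplied entry with that key is skipped; neither detail is stated in this PR.

```go
package main

import "fmt"

// Taint is a local stand-in for the Kubernetes v1.Taint type.
type Taint struct {
	Key, Value, Effect string
}

// systemTaintKey is the built-in taint named in the PR description.
const systemTaintKey = "virtual-node.interlink/no-schedule"

// normalizeEffect keeps the three standard Kubernetes taint effects and
// reports false when it had to fall back to NoSchedule.
func normalizeEffect(effect string) (string, bool) {
	switch effect {
	case "NoSchedule", "PreferNoSchedule", "NoExecute":
		return effect, true
	default:
		return "NoSchedule", false
	}
}

// mergeTaints (hypothetical) preserves the system taint from current and
// replaces every other taint with the plugin-reported list.
func mergeTaints(current, reported []Taint) []Taint {
	merged := make([]Taint, 0, len(current)+len(reported))
	for _, t := range current {
		if t.Key == systemTaintKey {
			merged = append(merged, t)
		}
	}
	for _, t := range reported {
		if t.Key == systemTaintKey {
			continue // assumption: plugins cannot override the built-in taint
		}
		effect, ok := normalizeEffect(t.Effect)
		if !ok {
			fmt.Printf("warning: unknown taint effect %q for %s, using NoSchedule\n", t.Effect, t.Key)
		}
		merged = append(merged, Taint{Key: t.Key, Value: t.Value, Effect: effect})
	}
	return merged
}

func main() {
	current := []Taint{
		{Key: systemTaintKey, Value: "true", Effect: "NoSchedule"},
		{Key: "vendor.io/old", Value: "true", Effect: "NoExecute"},
	}
	reported := []Taint{{Key: "vendor.io/maintenance", Value: "true", Effect: "NoSchedule"}}
	for _, t := range mergeTaints(current, reported) {
		fmt.Printf("%s=%s:%s\n", t.Key, t.Value, t.Effect)
	}
}
```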

Backward compatibility

Responses that are not JSON, or that carry neither a resources nor a taints field (the existing plugin behavior), are silently ignored, so current plugins see no behavior change.
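A minimal sketch of that fall-through, using only encoding/json. The pingEnvelope type and the hasUpdates helper are hypothetical names for illustration; the actual check in nodeUpdate() may be structured differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pingEnvelope captures only whether the two optional fields are present.
type pingEnvelope struct {
	Resources json.RawMessage `json:"resources"`
	Taints    json.RawMessage `json:"taints"`
}

// present treats a missing field and an explicit JSON null the same way.
func present(raw json.RawMessage) bool {
	return len(raw) > 0 && string(raw) != "null"
}

// hasUpdates (hypothetical) reports which optional fields a ping body carries.
// Any body that does not decode as a JSON object yields (false, false).
func hasUpdates(body string) (resources, taints bool) {
	var env pingEnvelope
	if err := json.Unmarshal([]byte(body), &env); err != nil {
		return false, false // legacy plain-text reply: nothing to apply
	}
	return present(env.Resources), present(env.Taints)
}

func main() {
	for _, body := range []string{
		"ok",
		"0",
		`{"status":"ok"}`,
		`{"resources":{"cpu":"128"}}`,
		`{"taints":[]}`,
	} {
		r, t := hasUpdates(body)
		fmt.Printf("%-30s resources=%v taints=%v\n", body, r, t)
	}
}
```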

Example plugin response

{
  "status": "ok",
  "resources": {
    "cpu": "128",
    "memory": "512Gi",
    "pods": "1000",
    "accelerators": [
      { "resourceType": "nvidia.com/gpu", "available": "8" }
    ]
  },
  "taints": [
    { "key": "vendor.io/maintenance", "value": "true", "effect": "NoSchedule" }
  ]
}

Omitted fields retain their current configured values, so partial updates are supported. When taints is present as an empty array ([]), all plugin-managed taints are cleared.
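The difference between an omitted taints field and an empty array relies on how encoding/json decodes slices: an absent key leaves the slice nil, while [] produces a non-nil slice of length zero. The sketch below illustrates that together with the partial resource overlay. The applyPartial helper and the plain string map standing in for the node capacity are assumptions for illustration only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type resources struct {
	CPU    string `json:"cpu"`
	Memory string `json:"memory"`
	Pods   string `json:"pods"`
}

type update struct {
	Resources *resources        `json:"resources"`
	Taints    []json.RawMessage `json:"taints"`
}

// applyPartial (hypothetical) overlays only the reported resource fields on
// capacity and reports whether the taint list should be replaced.
func applyPartial(capacity map[string]string, body string) (replaceTaints bool, err error) {
	var u update
	if err := json.Unmarshal([]byte(body), &u); err != nil {
		return false, err
	}
	if u.Resources != nil {
		reported := map[string]string{
			"cpu":    u.Resources.CPU,
			"memory": u.Resources.Memory,
			"pods":   u.Resources.Pods,
		}
		for name, value := range reported {
			if value != "" { // omitted field: keep the configured value
				capacity[name] = value
			}
		}
	}
	// nil means the key was absent; a non-nil empty slice means "clear".
	return u.Taints != nil, nil
}

func main() {
	capacity := map[string]string{"cpu": "8", "memory": "16Gi", "pods": "100"}

	replace, _ := applyPartial(capacity, `{"resources":{"cpu":"128"}}`)
	fmt.Println(capacity["cpu"], capacity["memory"], replace)

	replace, _ = applyPartial(capacity, `{"taints":[]}`)
	fmt.Println(replace)
}
```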

Original prompt

Overview

Currently, the Virtual Kubelet node's resource capacity (CPU, Memory, GPU, etc.) is set only once during initialization in NewProviderConfig() and never updated dynamically based on plugin feedback.

Problem Statement

The InterLink system periodically pings the plugin to check its status (every 30 seconds in the nodeUpdate() function), but only updates the node's connectivity condition. There is no mechanism to dynamically update node resource capacity based on information returned by the plugin.

Proposed Solution

Add the capability to dynamically update node resources based on plugin responses:

  1. Extend the ping response format to optionally include resource information (CPU, Memory, Pod count, GPUs, FPGAs, etc.)

  2. Parse plugin responses to extract resource data when available

  3. Update node resources dynamically by modifying node.Status.Capacity and node.Status.Allocatable when resource information is received

  4. Maintain backward compatibility - if the plugin doesn't return resource data, continue using the statically configured values

Implementation Details

Changes Required:

  1. Plugin Response Structure - Define an optional resource update format in the plugin response that includes:

    {
      "status": "ok",
      "resources": {
        "cpu": "100",
        "memory": "256Gi",
        "pods": "1000",
        "accelerators": [
          {
            "resourceType": "nvidia.com/gpu",
            "available": "8"
          }
        ]
      }
    }
  2. Update nodeUpdate() function to:

    • Parse the response body for resource information
    • Call a new updateNodeResources() function with the extracted data
    • Update p.node.Status.Capacity and p.node.Status.Allocatable
    • Trigger the node change callback
  3. Add helper function updateNodeResources() to safely update node resource spec based on parsed response

  4. Add validation to ensure resource values are valid Kubernetes quantities

  5. Add logging for resource updates to help with debugging
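Steps 3 to 5 amount to a validate, log, then apply pattern. The sketch below shows that pattern using only the standard library. The regular expression is a deliberately simplified stand-in for Kubernetes quantity parsing (the PR itself uses resource.ParseQuantity), and setIfValid is a hypothetical helper, so treat it as an illustration of the control flow rather than of the real validation rules.

```go
package main

import (
	"fmt"
	"regexp"
)

// quantityPattern is a simplified approximation of the Kubernetes quantity
// syntax (plain or decimal number plus an optional SI or binary suffix).
// It is NOT equivalent to resource.ParseQuantity.
var quantityPattern = regexp.MustCompile(`^[0-9]+(\.[0-9]+)?(m|k|M|G|T|P|E|Ki|Mi|Gi|Ti|Pi|Ei)?$`)

// setIfValid (hypothetical) updates capacity[name] only when value looks like
// a quantity; otherwise it logs a warning and leaves the old value in place.
func setIfValid(capacity map[string]string, name, value string) bool {
	if value == "" {
		return false // field not reported
	}
	if !quantityPattern.MatchString(value) {
		fmt.Printf("warning: ignoring invalid quantity %q for %s\n", value, name)
		return false
	}
	fmt.Printf("updating %s: %s -> %s\n", name, capacity[name], value)
	capacity[name] = value
	return true
}

func main() {
	capacity := map[string]string{"cpu": "8", "memory": "16Gi"}
	setIfValid(capacity, "cpu", "100")
	setIfValid(capacity, "memory", "lots")
	fmt.Println(capacity["cpu"], capacity["memory"])
}
```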

Backward Compatibility:

  • If no resource data is in the response, node resources remain unchanged
  • Existing plugins continue to work without modification
  • Configuration-based resource definition remains the default behavior

Benefits

  • Allows remote plugins to report actual available resources dynamically
  • Enables load-based resource scaling
  • Allows plugins to adjust available resources based on backend state
  • Improves Kubernetes scheduler accuracy by providing real-time resource information

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: is there a point where the vk can update the node resource spec based on interlinkplugin reply?

Assistant: I'll search the repository to understand how the virtual kubelet (vk) handles updates to the node resource spec based on InterLink plugin replies.

Assistant: Based on the code search results, I can see there is indeed a mechanism for the Virtual Kubelet (VK) to update the node resource spec based on InterLink plugin replies. Let me show you the specific points:

Node Update Mechanism

The primary point where the VK updates the node resource spec based on plugin replies is in the nodeUpdate() function in pkg/virtualkubelet/virtualkubelet.go:

Assistant: Yes, there is a specific point where the VK (Virtual Kubelet) can update the node resource spec based on the InterLink plugin reply. Let me explain the key mechanism:

Node Update Point

The main mechanism for updating node resources based on plugin responses is in the nodeUpdate() function (lines 599-660), which runs continuously and performs a periodic health check by pinging the InterLink plugin.

func (p *Provider) nodeUpdate(ctx context.Context) {
    // ... timer setup ...
    
    for {
        // Periodic ping to InterLink plugin (every 30 seconds)
        _, code, respBody, err := PingInterLink(ctx, p.config)
        
        if err != nil || code != 200 {
            // Update node conditions based on ping failure
            p.node.Status.Conditions = NodeConditionWithInterlink(
                false, v1.ConditionFalse, "InterlinkPingFailed", errorMsg)
            
            if p.node.Annotations == nil {
                p.node.Annotations = make(map[string]string)
            }
            p.node.Annotations["interlink.virtual-kubelet.io/ping-response"] = ""
            p.onNodeChangeCallback(p.node)
        } else {
            // Update node conditions based on successful ping
            p.node.Status.Conditions = NodeConditionWithInterlink(
                true, v1.ConditionTrue, "InterlinkPingSuccessful", successMsg)
            
            // Store response body in annotation
            p.node.Annotations["interlink.virtual-kubelet.io/ping-response"] = respBody
            p.onNodeChangeCallback(p.node)
        }
    }
}

How It Works

  1. **The `Notify...

This pull request was created from Copilot chat.


netlify bot commented Apr 3, 2026

Deploy Preview for interlink-dev canceled.

🔨 Latest commit: c714361
🔍 Latest deploy log: https://app.netlify.com/projects/interlink-dev/deploys/69d20678d1ecff0008ae78e5

Copilot AI changed the title from "[WIP] Add capability to dynamically update node resources" to "feat: dynamically update node resource capacity from plugin ping responses" Apr 3, 2026
Copilot AI requested a review from dciangot April 3, 2026 08:20
Member

dciangot commented Apr 3, 2026

@copilot is the new api retrocompatible with the previous spec?

Contributor Author

Copilot AI commented Apr 3, 2026

@copilot is the new api retrocompatible with the previous spec?

Yes, fully backward compatible. The VK-side change (lines 658–662) only attempts json.Unmarshal on the raw ping response body and only updates resources when the parse succeeds and a resources field is present. Existing plugins that return plain text (e.g. "0", "ok") or any JSON without a resources key are unaffected: the unmarshal either fails or leaves pingResp.Resources nil, and we fall through with no resource changes.

The interLink API's Ping handler (pkg/interlink/api/ping.go) is unchanged — it still forwards to the plugin's /status endpoint and passes the response straight back. No changes to the plugin-facing API at all.

Copilot AI added a commit to interlink-hq/interlink-slurm-plugin that referenced this pull request Apr 3, 2026
…e schema

Replace the custom NodeResources type with PingResponse / ResourcesResponse /
AcceleratorResponse types that mirror the schema introduced in interLink#516.

The VK's nodeUpdate() in that PR tries to JSON-unmarshal the ping response body
as PingResponse and, if Resources is non-nil, calls updateNodeResources() to
update node.Status.Capacity and node.Status.Allocatable.  By returning the same
structure from the /status ping path, the SLURM plugin feeds live cluster
resource data directly into the virtual node's advertised capacity.

Key changes:
- types.go: PingResponse {status, resources}, ResourcesResponse {cpu, memory,
  pods, accelerators} and AcceleratorResponse {resourceType, available} all with
  Kubernetes-quantity string values; TODO comment to replace with upstream types
  once interLink#516 is merged
- Status.go: getClusterResources() now returns PingResponse; JSON path reports
  available resources (total - allocated) as Kubernetes quantity strings; text
  fallback reports total CPUs + free memory; clampToZero() helper extracted to
  avoid duplicate negative-guard logic
- resources_test.go: tests updated to assert PingResponse.Resources.CPU /
  .Memory using Kubernetes quantity strings ("24", "96000Mi", etc.)

Agent-Logs-Url: https://github.com/interlink-hq/interlink-slurm-plugin/sessions/4badfd6e-a4b2-4ee2-b9e2-96b592f06b37

Co-authored-by: dciangot <4144326+dciangot@users.noreply.github.com>
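The available-resource computation described in the commit message above (total minus allocated, clamped at zero, rendered as a Kubernetes quantity string) might look like the following. Only the clampToZero name comes from the commit message; its signature and the availableQuantity helper are assumptions, not the plugin's actual code.

```go
package main

import (
	"fmt"
	"strconv"
)

// clampToZero guards against negative results when the scheduler reports
// more allocated than total. Signature is an assumption.
func clampToZero(n int64) int64 {
	if n < 0 {
		return 0
	}
	return n
}

// availableQuantity (hypothetical) renders total-allocated as a quantity
// string, appending a unit suffix such as "Mi" when one is given.
func availableQuantity(total, allocated int64, suffix string) string {
	return strconv.FormatInt(clampToZero(total-allocated), 10) + suffix
}

func main() {
	fmt.Println(availableQuantity(32, 8, ""))           // prints "24"
	fmt.Println(availableQuantity(128000, 32000, "Mi")) // prints "96000Mi"
	fmt.Println(availableQuantity(4, 6, ""))            // prints "0"
}
```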
Member

dciangot commented Apr 5, 2026

@copilot make also possible to pass taints in response to the ping call

Contributor Author

Copilot AI commented Apr 5, 2026

@copilot make also possible to pass taints in response to the ping call

Done in c714361. Plugins can now include a taints field in the ping response alongside resources:

{
  "status": "ok",
  "resources": { "cpu": "128", "memory": "512Gi" },
  "taints": [
    { "key": "vendor.io/maintenance", "value": "true", "effect": "NoSchedule" }
  ]
}

When taints is present (even as []), all non-system taints on the node are replaced with the provided list — the built-in virtual-node.interlink/no-schedule taint is always preserved. When the field is absent, existing taints are left unchanged (backward compatible).

Copilot AI changed the title from "feat: dynamically update node resource capacity from plugin ping responses" to "feat: dynamically update node resource capacity and taints from plugin ping responses" Apr 5, 2026