Skip to content

feat(autoscaler): deploy node-agent on nodes missing the grouping label#389

Merged
matthyx merged 4 commits into
mainfrom
feat/autoscaler-default-node-group
Jul 1, 2026
Merged

feat(autoscaler): deploy node-agent on nodes missing the grouping label#389
matthyx merged 4 commits into
mainfrom
feat/autoscaler-default-node-group

Conversation

@matthyx

@matthyx matthyx commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Problem

The node-agent autoscaler groups nodes by NodeGroupLabel (default node.kubernetes.io/instance-type) and skipped any node missing that label. On on-prem / custom clusters where the label isn't populated, no node-agent DaemonSet was created — the autoscaler effectively did nothing.

Change

  • Add configurable nodeAgentAutoscaler.defaultNodeGroup (default "default"; empty string preserves the strict skip behavior).
  • Label-less nodes now fall into this default group, flagged via NodeGroup.IsDefault and exposed to the template as IsDefaultGroup.
  • A nodeSelector cannot match an absent label, so the default group's DaemonSet is targeted with a DoesNotExist node affinity instead. The matching template branch lives in the Helm chart (see companion PR).
  • Resource sizing is unchanged — first node's allocatable per group.

Companion PR

Requires the Helm chart template change: kubescape/helm-charts (DaemonSet template IsDefaultGroup branch).

Tests

  • Unit: default-group grouping, renderer asserts DoesNotExist affinity + no instance-type selector.
  • Integration: end-to-end against the actual Helm-generated template for both default and non-default groups.
  • Docs updated (docs/node-agent-autoscaler.md).

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added a configurable “default group” to target nodes missing the grouping label and render a DaemonSet for them.
    • Enhanced Helm/template rendering with a dedicated scheduling path for the default group.
  • Bug Fixes

    • Improved handling of unlabeled nodes: when configured, they’re no longer skipped.
    • Refined managed DaemonSet reconciliation to reliably create/update per rendered group and delete orphans.
  • Documentation

    • Documented default-group behavior, representative sizing for heterogeneous unlabeled nodes, and what happens when the setting is empty.
  • Tests

    • Expanded unit and integration coverage for default-group targeting and template rendering.

The node-agent autoscaler grouped nodes by NodeGroupLabel
(default node.kubernetes.io/instance-type) and skipped any node that did
not carry it, so on on-prem/custom clusters no node-agent DaemonSet was
created and the autoscaler effectively did nothing.

Add a configurable DefaultNodeGroup (default "default", empty = strict
skip). Label-less nodes now fall into this default group, flagged via
NodeGroup.IsDefault and exposed to the template as IsDefaultGroup. Since a
nodeSelector cannot match an absent label, the default group's DaemonSet
is targeted with a "DoesNotExist" node affinity instead (the Helm chart
provides the template branch).

Sizing is unchanged (first node's allocatable per group).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
@matthyx matthyx added the ai-assisted Created through Armosec AI tooling (armosec-shared-rules plugin) label Jun 29, 2026
@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dd63bbf5-3b98-45bf-860b-8b078b7f2f21

📥 Commits

Reviewing files that changed from the base of the PR and between 75ea4f5 and 99a27e9.

📒 Files selected for processing (4)
  • .github/workflows/pr-created.yaml
  • docs/node-agent-autoscaler.md
  • nodeagentautoscaler/autoscaler_test.go
  • nodeagentautoscaler/nodegrouper.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • .github/workflows/pr-created.yaml
  • nodeagentautoscaler/autoscaler_test.go
  • nodeagentautoscaler/nodegrouper.go
  • docs/node-agent-autoscaler.md

📝 Walkthrough

Walkthrough

Adds a configurable default node group for unlabeled nodes, threads that state into DaemonSet rendering and reconciliation, and updates tests, docs, and CI checks for the new scheduling behavior.

Changes

DefaultNodeGroup feature

Layer / File(s) Summary
Config and data contracts
config/config.go, nodeagentautoscaler/nodegrouper.go, nodeagentautoscaler/templaterenderer.go, nodeagentautoscaler/autoscaler.go
Adds DefaultNodeGroup to autoscaler config, IsDefault on node groups, and IsDefaultGroup plus the node-group label key in renderer data and construction.
Default-group assignment
nodeagentautoscaler/nodegrouper.go
Assigns unlabeled nodes to the configured default group, keeps the skip path when unset, and salts default-group name collision resolution.
Rendering and reconcile flow
nodeagentautoscaler/autoscaler.go, nodeagentautoscaler/templaterenderer.go, .github/workflows/pr-created.yaml, nodeagentautoscaler/integration_test.go, nodeagentautoscaler/templaterenderer_test.go, nodeagentautoscaler/autoscaler_test.go
Passes the label key into rendering, sets default-group template data, reconciles DaemonSets by generated name, and updates integration/CI coverage and reconciliation tests for the new behavior.
Validation and docs
config/config_test.go, nodeagentautoscaler/autoscaler_test.go, docs/node-agent-autoscaler.md
Extends config and node-group validation plus documentation for default-group behavior, sizing, scheduling, and error handling.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • kubescape/operator#370: Also changes nodeagentautoscaler/templaterenderer.go and NewTemplateRenderer call sites in the same area of the codebase.

Poem

🐇 I hop by labels, soft and bright,
Unlabeled nodes now find their site.
With DoesNotExist, I bound and zoom,
And default groups still get a room.
I twitch my nose at tests anew,
Then nibble carrots, happy through.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: deploying the node-agent on nodes missing the grouping label.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/autoscaler-default-node-group

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 OpenGrep (1.23.0)
nodeagentautoscaler/autoscaler_test.go

┌──────────────┐
│ Opengrep CLI │
└──────────────┘

�[32m✔�[39m �[1mOpengrep OSS�[0m
�[32m✔�[39m Basic security coverage for first-party code vulnerabilities.

[00.15][ERROR]: unable to find a config; path .coderabbit-opengrep-fallback.yml does not exist

nodeagentautoscaler/nodegrouper.go

┌──────────────┐
│ Opengrep CLI │
└──────────────┘

�[32m✔�[39m �[1mOpengrep OSS�[0m
�[32m✔�[39m Basic security coverage for first-party code vulnerabilities.

[00.16][ERROR]: unable to find a config; path .coderabbit-opengrep-fallback.yml does not exist

🔧 markdownlint-cli2 (0.22.1)
docs/node-agent-autoscaler.md

markdownlint-cli2 wrapper config was not available before execution


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
nodeagentautoscaler/integration_test.go (1)

110-111: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Assert the Helm default branch keeps kubernetes.io/os: linux.

The in-memory renderer test already checks this, but this integration test does not. If the real Helm template drops the OS selector in the default-group branch, this test still passes while the DaemonSet can target unsupported nodes.

Proposed assertion
 	assert.Equal(t, "node-agent-default", ds.Name)

 	// Must not pin to an instance-type value (the nodes lack the label).
+	assert.Equal(t, "linux", ds.Spec.Template.Spec.NodeSelector["kubernetes.io/os"])
 	assert.NotContains(t, ds.Spec.Template.Spec.NodeSelector, "node.kubernetes.io/instance-type")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nodeagentautoscaler/integration_test.go` around lines 110 - 111, The
integration test for the DaemonSet template currently only verifies it does not
pin to an instance-type label, but it also needs to assert the default Helm
branch preserves the OS selector. Update the checks in nodeagentautoscaler
integration_test around the DaemonSet spec so it explicitly confirms
ds.Spec.Template.Spec.NodeSelector still contains kubernetes.io/os with linux
when using the default-group path, alongside the existing instance-type
assertion.
nodeagentautoscaler/autoscaler_test.go (1)

358-365: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Assert IsDefault on the fallback group.

Right now this only proves the unlabeled nodes were counted under "default". If GetNodeGroups stops setting NodeGroup.IsDefault, the default-affinity path can still stay green because the renderer tests construct NodeGroup{IsDefault: true} manually. Please assert the flag here as part of the grouper contract.

Proposed test assertion
 	require.Contains(t, byLabel, "default")
 	assert.Equal(t, 2, byLabel["default"].NodeCount)
+	assert.True(t, byLabel["default"].IsDefault)
 	require.Contains(t, byLabel, "m5.large")
 	assert.Equal(t, 1, byLabel["m5.large"].NodeCount)
+	assert.False(t, byLabel["m5.large"].IsDefault)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nodeagentautoscaler/autoscaler_test.go` around lines 358 - 365, The fallback
group check in `TestGetNodeGroups` only verifies the `"default"` label and node
count, so it can miss regressions where `GetNodeGroups` stops marking the group
as default. Update the assertions around the `byLabel["default"]` entry to also
verify its `IsDefault` field is true, using the existing `GetNodeGroups` result
and `NodeGroup` struct contract as the locator.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nodeagentautoscaler/autoscaler_test.go`:
- Line 343: The test setup in the autoscaler test is using the deprecated fake
client constructor, so update the client initialization in the relevant test
helper to use fake.NewClientset(nodes...) instead of
fake.NewSimpleClientset(nodes...). Keep the rest of the test logic unchanged and
make sure the autoscaler test continues to build against the client-go version
in use.

In `@nodeagentautoscaler/nodegrouper.go`:
- Around line 82-114: The fallback handling in nodegrouper.go’s grouping logic
is merging unlabeled nodes with real nodes whose NodeGroupLabel matches
DefaultNodeGroup, which collapses distinct populations into one NodeGroup.
Update the grouping in the node-group aggregation path so default-fallback nodes
are tracked separately from actual label values, and only set IsDefault for the
fallback group created from missing labels. Use the existing symbols NodeGroup,
groupMap, labelValue, isDefault, and RenderDaemonSet to ensure labeled nodes
keep their normal affinity path while fallback nodes continue using the
DoesNotExist branch.

---

Nitpick comments:
In `@nodeagentautoscaler/autoscaler_test.go`:
- Around line 358-365: The fallback group check in `TestGetNodeGroups` only
verifies the `"default"` label and node count, so it can miss regressions where
`GetNodeGroups` stops marking the group as default. Update the assertions around
the `byLabel["default"]` entry to also verify its `IsDefault` field is true,
using the existing `GetNodeGroups` result and `NodeGroup` struct contract as the
locator.

In `@nodeagentautoscaler/integration_test.go`:
- Around line 110-111: The integration test for the DaemonSet template currently
only verifies it does not pin to an instance-type label, but it also needs to
assert the default Helm branch preserves the OS selector. Update the checks in
nodeagentautoscaler integration_test around the DaemonSet spec so it explicitly
confirms ds.Spec.Template.Spec.NodeSelector still contains kubernetes.io/os with
linux when using the default-group path, alongside the existing instance-type
assertion.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a0c66bf3-643d-4849-918b-6a67eb1bb12c

📥 Commits

Reviewing files that changed from the base of the PR and between d425e08 and 8d6d32e.

📒 Files selected for processing (8)
  • config/config.go
  • config/config_test.go
  • docs/node-agent-autoscaler.md
  • nodeagentautoscaler/autoscaler_test.go
  • nodeagentautoscaler/integration_test.go
  • nodeagentautoscaler/nodegrouper.go
  • nodeagentautoscaler/templaterenderer.go
  • nodeagentautoscaler/templaterenderer_test.go

Comment thread nodeagentautoscaler/autoscaler_test.go Outdated
Comment thread nodeagentautoscaler/nodegrouper.go

@matthyx matthyx left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — blockers

Nice feature and the test coverage is solid. Two issues below look like real blockers before merge, plus one CI-coverage concern.

🔴 Blocker 1 — the default-group key collides with a real label value

nodeagentautoscaler/nodegrouper.go (~L81–115). Label-less nodes are bucketed under groupMap[ng.config.DefaultNodeGroup] ("default" by default). But that is the same keyspace as real label values, so a node that legitimately carries node.kubernetes.io/instance-type=default lands in the same group, and:

group.IsDefault = group.IsDefault || isDefault

flips the merged group onto the affinity path. The rendered DaemonSet then uses the DoesNotExist affinity, which by definition does not match the legitimately-labeled default node — so that node silently loses its node-agent. The two populations also share resource sizing (first node's allocatable per group), which can mis-size the DaemonSet.

The inline comment // A label value can be shared by labelled and label-less nodes shows the collision was noticed, but OR-ing IsDefault is the wrong resolution — it sacrifices the labeled node. The fallback population should be tracked in a separate bucket (e.g. a sentinel key that can't be a real label value) so a real default label and the synthetic fallback never merge. This overlaps with CodeRabbit's actionable comment on the same lines.

🔴 Blocker 2 — the DoesNotExist affinity needs the label key, but TemplateData only carries the value

nodeagentautoscaler/templaterenderer.go L24–29, L220. TemplateData.NodeGroupLabel is set to group.LabelValue (the value, e.g. m5.large / default), and no field carries the configured nodeGroupLabel key. So the companion Helm template can only hardcode node.kubernetes.io/instance-type in the DoesNotExist expression — exactly what the unit/integration tests do.

nodeGroupLabel is configurable (config/config.go), so if an operator overrides it, the default-group affinity still keys off node.kubernetes.io/instance-type. It then targets the wrong/absent label and the default group matches nothing (or the wrong nodes) — defeating the whole purpose of the feature for the exact custom/on-prem clusters it's meant to serve. Suggest passing the label key through TemplateData so the affinity is built from the configured key rather than a hardcoded constant.

🟡 Concern — the integration test silently skips, so the guarantee isn't enforced in CI

nodeagentautoscaler/integration_test.go (~L82–89). Both TestIntegration_HelmGeneratedTemplate* t.Skip when /tmp/test-daemonset-template.yaml is absent. In CI that file won't exist, so the "end-to-end against the actual Helm-generated template" guarantee in the PR description isn't actually verified — the default-group affinity (and the hardcoded-key issue above) could regress with everything green. Consider checking in a representative rendered template, or generating it in the CI job, so this test really runs.

Minor (non-blocking)

  • fake.NewSimpleClientset(...) is deprecated; fake.NewClientset(...) is the current constructor (per CodeRabbit).
  • CodeRabbit's nitpicks to assert IsDefault in TestNodeGrouper_GetNodeGroups_DefaultNodeGroup and to assert kubernetes.io/os: linux in the integration default-group path are worth taking — they harden the grouper/affinity contract.

@github-actions

Copy link
Copy Markdown

Summary:

  • License scan: success
  • Credentials scan: failure
  • Vulnerabilities scan: failure
  • Unit test: success
  • Go linting: failure

@matthyx

matthyx commented Jun 29, 2026

Copy link
Copy Markdown
Contributor Author

🔗 Companion PR: kubescape/helm-charts#868 — adds the DaemonSet template branch (DoesNotExist affinity for the default group) that this change relies on. Land helm-charts#868 at or before this PR.

Resolve the blockers and nitpicks from #389 review:

- Track label-less nodes under an internal sentinel bucket so a node
  legitimately labelled "<groupingLabel>=default" forms its own normal
  (nodeSelector-targeted) group and is never merged onto the affinity path.
  Salt the fallback group's collision hash so it cannot collapse onto a real
  same-named group.
- Key DaemonSet reconciliation (existence + orphan deletion) off the unique
  DaemonSet name instead of the node-group label value, which is not unique
  once the default group shares a value with a real group.
- Pass the configured grouping label key through TemplateData
  (NodeGroupLabelKey) so the nodeSelector and the default group's DoesNotExist
  affinity key off the operator's actual grouping label rather than a
  hardcoded constant; the Helm template consumes it as the single source.
- Run the integration tests in CI: extract the real DaemonSet template from
  the helm-charts chart so the tests run instead of skipping.
- Tests: assert IsDefault on the grouper contract, assert os=linux survives in
  the default branch, add the =default collision test and a Reconcile
  create/orphan test; switch to fake.NewClientset.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
@matthyx

matthyx commented Jun 29, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the review @matthyx — addressed in 75ea4f5.

🔴 Blocker 1 — default-group key collides with a real label value
Label-less nodes are now bucketed under an internal sentinel key ("\x00default-node-group", impossible as a real label value), so a node legitimately labelled …=default forms its own normal, nodeSelector-targeted group and is never merged onto the affinity path or flagged IsDefault. The collision-resolution hash is salted for the fallback group so it can't collapse onto a real same-named group. Also fixed the downstream consequence: reconcile (existence check + orphan deletion) now keys off the unique DaemonSet name rather than the non-unique node-group label value. New tests: TestNodeGrouper_GetNodeGroups_DefaultValueCollision and TestAutoscaler_Reconcile_CreatesPerGroupAndDeletesOrphans.

🔴 Blocker 2 — affinity needs the label key, but TemplateData only had the value
Added TemplateData.NodeGroupLabelKey, populated from the operator's configured NodeGroupLabel. Both the nodeSelector key and the default group's DoesNotExist affinity now key off it (companion PR updated to emit {{ .NodeGroupLabelKey }} instead of substituting the Helm value at chart-render time). The operator config is now the single source of truth, so a custom nodeGroupLabel is honoured end-to-end.

🟡 Concern — integration tests silently skip in CI
integration-build now checks out the chart, extracts the real DaemonSet template via helm template, and runs the integration tests for real (no more skip).
⚠️ Merge-order note: that job renders helm-charts/main, so the default-group integration test will stay red on this PR until kubescape/helm-charts#868 merges. Please merge #868 first, then re-run CI here.

Minor: switched to fake.NewClientset; added IsDefault assertions to the grouper test and kubernetes.io/os: linux to the integration default-group path.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nodeagentautoscaler/autoscaler.go (1)

193-236: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Enforce the rendered-name contract before using name-based bookkeeping.

Reconcile() now treats GenerateDaemonSetName(group) as the resource identity, but the DaemonSet name still comes from a configurable template. If a custom template renders any other metadata.name, this code will reconcile one name and orphan-delete another on the next loop. Please either validate desiredDS.Name == dsName and fail fast, or switch the existence/orphan tracking to the rendered name instead of the precomputed one.

Suggested guard
 		desiredDS, err := a.templateRenderer.RenderDaemonSet(group, resources)
 		if err != nil {
 			logger.L().Error("failed to render DaemonSet",
 				helpers.String("nodeGroup", group.LabelValue),
 				helpers.Error(err))
 			continue
 		}
+		if desiredDS.Name != dsName {
+			return fmt.Errorf("rendered DaemonSet name %q does not match expected %q", desiredDS.Name, dsName)
+		}

Also applies to: 254-255

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nodeagentautoscaler/autoscaler.go` around lines 193 - 236, Reconcile()
currently uses GenerateDaemonSetName(group) for processed/existing bookkeeping
while RenderDaemonSet can produce a different metadata.name, so name-based
tracking may target the wrong DaemonSet. In the Reconcile loop, either validate
that desiredDS.Name matches dsName and stop with an error if it does not, or
change the create/update/orphan cleanup logic to consistently use the rendered
name from desiredDS.Name throughout the DaemonSet handling flow. Ensure the same
identity is used wherever existingByName and processedNames are consulted.
🧹 Nitpick comments (1)
nodeagentautoscaler/autoscaler_test.go (1)

629-694: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add a reconcile test for the real-default collision path.

TestNodeGrouper_GetNodeGroups_DefaultValueCollision proves the groups stay distinct, but the regression fixed here was in Reconcile()'s name-based existence/orphan bookkeeping. A test with one node labeled ...=default plus unlabeled nodes would catch a future change that collapses one of the two DaemonSets during update/delete.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nodeagentautoscaler/autoscaler_test.go` around lines 629 - 694, Add a
reconcile coverage for the real “default” collision path by extending the
autoscaler reconcile test around NewAutoscaler and Reconcile. Use one node
explicitly labeled with the node-group label set to default plus at least one
unlabeled node so both the real default group and the fallback default group
exist at once, then assert Reconcile creates and preserves two distinct
DaemonSets instead of collapsing one during name-based existence/orphan
handling. Keep the check focused on the unique DaemonSet names and the managed
orphan cleanup logic in Reconcile.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/pr-created.yaml:
- Around line 41-55: The workflow job is checking out kubescape/helm-charts at
main, which can break this validation when the required Helm template change
only exists in the companion branch. Update the checkout in the pr-created
workflow to use the companion branch or pinned SHA for the stack, or add gating
so the integration step in the Extract node-agent daemonset template / Build and
run integration tests flow only runs when that ref is available.
- Around line 38-43: Disable credential persistence on the extra checkout step
in the Helm charts fetch. Update the actions/checkout usage for the Checkout
helm-charts step to turn off credential persistence, since this repository clone
is read-only. Keep the existing checkout configuration for kubescape/helm-charts
in place, but add the checkout setting that prevents the workflow token from
being written into helm-charts/.git/config.

In `@nodeagentautoscaler/templaterenderer.go`:
- Around line 226-229: The DaemonSet labels in the node agent template currently
use the shared node group label, which can collide between the fallback default
group and a real default group. Update the DaemonSet selector and pod labels in
templaterenderer.go, especially where the template uses the group-derived name
fields, so each DaemonSet is uniquely identified by a discriminator like Name
instead of kubescape.io/node-group. Also add a test that covers the
duplicated-default scenario and verifies the selectors no longer collide.

---

Outside diff comments:
In `@nodeagentautoscaler/autoscaler.go`:
- Around line 193-236: Reconcile() currently uses GenerateDaemonSetName(group)
for processed/existing bookkeeping while RenderDaemonSet can produce a different
metadata.name, so name-based tracking may target the wrong DaemonSet. In the
Reconcile loop, either validate that desiredDS.Name matches dsName and stop with
an error if it does not, or change the create/update/orphan cleanup logic to
consistently use the rendered name from desiredDS.Name throughout the DaemonSet
handling flow. Ensure the same identity is used wherever existingByName and
processedNames are consulted.

---

Nitpick comments:
In `@nodeagentautoscaler/autoscaler_test.go`:
- Around line 629-694: Add a reconcile coverage for the real “default” collision
path by extending the autoscaler reconcile test around NewAutoscaler and
Reconcile. Use one node explicitly labeled with the node-group label set to
default plus at least one unlabeled node so both the real default group and the
fallback default group exist at once, then assert Reconcile creates and
preserves two distinct DaemonSets instead of collapsing one during name-based
existence/orphan handling. Keep the check focused on the unique DaemonSet names
and the managed orphan cleanup logic in Reconcile.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b616c6a2-25a9-4892-9304-8f87e11cd13f

📥 Commits

Reviewing files that changed from the base of the PR and between 8d6d32e and 75ea4f5.

📒 Files selected for processing (8)
  • .github/workflows/pr-created.yaml
  • docs/node-agent-autoscaler.md
  • nodeagentautoscaler/autoscaler.go
  • nodeagentautoscaler/autoscaler_test.go
  • nodeagentautoscaler/integration_test.go
  • nodeagentautoscaler/nodegrouper.go
  • nodeagentautoscaler/templaterenderer.go
  • nodeagentautoscaler/templaterenderer_test.go
✅ Files skipped from review due to trivial changes (1)
  • docs/node-agent-autoscaler.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • nodeagentautoscaler/integration_test.go

Comment on lines +38 to +43
- name: Checkout helm-charts
uses: actions/checkout@v4
with:
repository: kubescape/helm-charts
ref: main
path: helm-charts

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Disable credential persistence on the extra checkout.

This checkout is read-only, so persisting the workflow token into helm-charts/.git/config is unnecessary and creates an avoidable leak path if that workspace is ever uploaded or reused.

Suggested hardening
       - name: Checkout helm-charts
         uses: actions/checkout@v4
         with:
           repository: kubescape/helm-charts
           ref: main
           path: helm-charts
+          persist-credentials: false
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- name: Checkout helm-charts
uses: actions/checkout@v4
with:
repository: kubescape/helm-charts
ref: main
path: helm-charts
- name: Checkout helm-charts
uses: actions/checkout@v4
with:
repository: kubescape/helm-charts
ref: main
path: helm-charts
persist-credentials: false
🧰 Tools
🪛 zizmor (1.26.1)

[warning] 38-43: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/pr-created.yaml around lines 38 - 43, Disable credential
persistence on the extra checkout step in the Helm charts fetch. Update the
actions/checkout usage for the Checkout helm-charts step to turn off credential
persistence, since this repository clone is read-only. Keep the existing
checkout configuration for kubescape/helm-charts in place, but add the checkout
setting that prevents the workflow token from being written into
helm-charts/.git/config.

Source: Linters/SAST tools

Comment on lines +41 to +55
repository: kubescape/helm-charts
ref: main
path: helm-charts
- uses: azure/setup-helm@v4
- name: Extract node-agent daemonset template
run: |
helm template test helm-charts/charts/kubescape-operator \
--kube-version 1.29.0 \
--set nodeAgent.autoscaler.enabled=true --set clusterName=test \
| grep -A 400 "daemonset-template.yaml:" | tail -n +2 | sed 's/^ //' \
| awk '/^---/{exit} {print}' > /tmp/test-daemonset-template.yaml
test -s /tmp/test-daemonset-template.yaml || { echo "template extraction produced an empty file"; exit 1; }
- name: Build and run integration tests
working-directory: operator
run: go test -tags=integration ./...

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Pin the chart checkout to the companion change instead of main.

This job now validates behavior that depends on the companion Helm template branch, so checking out kubescape/helm-charts@main makes the PR fail until that other PR merges. Point this at the companion branch/SHA for the stack, or gate the integration job until that ref is available.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/pr-created.yaml around lines 41 - 55, The workflow job is
checking out kubescape/helm-charts at main, which can break this validation when
the required Helm template change only exists in the companion branch. Update
the checkout in the pr-created workflow to use the companion branch or pinned
SHA for the stack, or add gating so the integration step in the Extract
node-agent daemonset template / Build and run integration tests flow only runs
when that ref is available.

Comment on lines +226 to +229
Name: fmt.Sprintf("node-agent-%s", group.SanitizedName),
NodeGroupLabel: group.LabelValue,
NodeGroupLabelKey: tr.nodeGroupLabelKey,
IsDefaultGroup: group.IsDefault,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect rendered/template selector usage around NodeGroupLabel.
rg -n -C5 '(\.NodeGroupLabel|kubescape\.io/node-group|matchLabels:|selector:)' \
  --glob '!**/vendor/**' \
  --glob '!**/node_modules/**' \
  .

Repository: kubescape/operator

Length of output: 20311


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Map the renderer and the nearby tests to inspect selector/name usage.
ast-grep outline nodeagentautoscaler/templaterenderer.go --view expanded
printf '\n--- templaterenderer.go (relevant slice) ---\n'
sed -n '1,320p' nodeagentautoscaler/templaterenderer.go | nl -ba | sed -n '180,260p'

printf '\n--- templaterenderer_test.go (selector slices) ---\n'
sed -n '1,460p' nodeagentautoscaler/templaterenderer_test.go | nl -ba | sed -n '60,230p'

Repository: kubescape/operator

Length of output: 1671


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n--- nodeagentautoscaler/templaterenderer.go: RenderDaemonSet and GenerateDaemonSetName ---\n'
sed -n '221,320p' nodeagentautoscaler/templaterenderer.go

printf '\n--- nodeagentautoscaler/nodegrouper.go: grouping/default logic ---\n'
sed -n '80,150p' nodeagentautoscaler/nodegrouper.go

printf '\n--- nodeagentautoscaler/templaterenderer_test.go: relevant template assertions ---\n'
sed -n '60,240p' nodeagentautoscaler/templaterenderer_test.go

Repository: kubescape/operator

Length of output: 10634


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find the configured template path and any real DaemonSet template files.
rg -n -C3 'TemplatePath|node-agent.*template|kubescape.io/node-group|app.kubernetes.io/name: node-agent|selector:' \
  nodeagentautoscaler docs .github . \
  --glob '!**/vendor/**' \
  --glob '!**/node_modules/**'

printf '\n--- candidate files ---\n'
fd -a '.*template.*\.ya?ml$|.*daemonset.*\.ya?ml$|.*node-agent.*\.ya?ml$' .

Repository: kubescape/operator

Length of output: 23281


🏁 Script executed:

#!/bin/bash
set -euo pipefail
sed -n '40,80p' .github/workflows/pr-created.yaml | cat -n

Repository: kubescape/operator

Length of output: 1061


🌐 Web query:

kubescape helm-charts node-agent autoscaler daemonset-template.yaml NodeGroupLabel selector Name kubescape.io/node-group

💡 Result:

In the Kubescape Helm charts, particularly within the kubescape-operator chart, the daemonset-template.yaml file serves as a shared template used for rendering Node Agent DaemonSets, especially when features like the operator-based autoscaler or multiple node-agent configurations are enabled [1][2]. Regarding the specific node label selector mentioned: The label kubescape.io/node-group is used as a dynamic placeholder within the daemonset-template.yaml [2]. Specifically, when the system renders the DaemonSet metadata, it includes: labels: kubescape.io/node-group: "{{{{ .NodeGroupLabel }}}}" This allows the Kubescape Operator to associate specific DaemonSet instances with defined node groups or node pools [2][3]. This is particularly relevant when using the nodeAgent.multipleDaemonSets feature, which enables administrators to deploy tailored Node Agent configurations (with specific resource requests/limits and scheduling constraints) to different node pools [3][4]. Key takeaways: - Shared Template: daemonset-template.yaml provides a single source of truth for the DaemonSet specification, used by multiple components within the operator to maintain consistency [2]. - Autoscaler Integration: The template includes logic to ensure that static DaemonSets and the automatic node-agent autoscaler do not conflict, with the template sometimes rendering in autoscaler-managed modes [1][2]. - Node Affinity: While the template handles the labeling for the object itself (using kubescape.io/node-group), the actual scheduling onto specific nodes is typically managed via nodeSelectors configured in the Helm values (nodeAgent.multipleDaemonSets.configurations[].nodeSelector) [3].

Citations:


🌐 Web query:

site:github.com/kubescape/helm-charts daemonset-template.yaml kubescape.io/node-group selector matchLabels app.kubernetes.io/name node-agent

💡 Result:

The daemonset-template.yaml file in the Kubescape Operator Helm chart is a template used specifically when the Kubescape Node Agent autoscaler mode is enabled [1][2]. This template allows the operator to dynamically manage Node Agent DaemonSets based on specific node groups [2]. The selector matchLabels within this context includes both standard identification labels and a specialized label for the autoscaler: 1. app.kubernetes.io/name: node-agent (typically included via the kubescape-operator.selectorLabels template helper) [2]. 2. kubescape.io/node-group: A dynamic label defined by the {{ .NodeGroupLabel }} placeholder, which maps to the configuration specified in the operator's values (e.g., nodeAgent.autoscaler.nodeGroupLabel) [2][3]. This configuration ensures that when the autoscaler is active, the generated DaemonSets are correctly identified and associated with the specific node groups they are intended to monitor [1][2]. In standard (non-autoscaler) deployments, the Node Agent is typically deployed as a static DaemonSet defined in node-agent/daemonset.yaml [1].

Citations:


Use a unique selector label for each DaemonSet
kubescape.io/node-group can be shared by both the fallback default group and a real default group, so spec.selector.matchLabels/pod labels should use a unique discriminator like .Name instead. Add a collision test for the duplicated-default case.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nodeagentautoscaler/templaterenderer.go` around lines 226 - 229, The
DaemonSet labels in the node agent template currently use the shared node group
label, which can collide between the fallback default group and a real default
group. Update the DaemonSet selector and pod labels in templaterenderer.go,
especially where the template uses the group-derived name fields, so each
DaemonSet is uniquely identified by a discriminator like Name instead of
kubescape.io/node-group. Also add a test that covers the duplicated-default
scenario and verifies the selectors no longer collide.

@matthyx matthyx left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review (with companion helm-charts #868 in mind)

Looked at #868 to resolve the cross-repo questions from my first pass. One correction and one sharpened blocker.

✅ Withdrawn — integration test "silently skips" concern

That's by design: the CI check renders the real chart template from helm-charts and the test consumes it, so it only goes green once #868 is available. Not a gap. #868 also keeps kubernetes.io/os: linux on the default group and keys the affinity off the configured label (.NodeGroupLabelKey), so the OS-selector and key-vs-value points from my first review are handled on the helm side — provided the operator supplies that field, which is the next item.

🔴 Blocker (sharpened) — this PR doesn't provide .NodeGroupLabelKey, so renders will fail once #868 merges

#868 renders {{ .NodeGroupLabelKey }} in both branches of the operator-consumed template:

{{- if not .IsDefaultGroup }}
  {{ .NodeGroupLabelKey }}: "{{ .NodeGroupLabel }}"   # normal nodeSelector
{{- end }}
...
{{- if .IsDefaultGroup }}
  - key: {{ .NodeGroupLabelKey }}                      # DoesNotExist affinity
    operator: DoesNotExist
{{- end }}

But TemplateData (nodeagentautoscaler/templaterenderer.go L20–33) only defines Name, NodeGroupLabel, IsDefaultGroup, Resources, GoMemLimit — there is no NodeGroupLabelKey. RenderDaemonSet builds the struct (L214–221) without it, and the renderer runs tr.template.Execute(&buf, data) (L237) with no missingkey leniency (and missingkey wouldn't help anyway — it only applies to maps, not struct field access).

So as soon as #868 merges and the operator renders the real template, Execute errors with can't evaluate field NodeGroupLabelKey in type templaterenderer.TemplateData — and this fails for every group (the normal nodeSelector branch references it too), not just the default group. No autoscaler-managed DaemonSet would render. This is exactly why the CI check waits on #868; with the PR as-is it would go red.

Fix: add NodeGroupLabelKey string to TemplateData and set it from the configured key in RenderDaemonSet, e.g. NodeGroupLabelKey: tr.config.NodeGroupLabel (or thread the key in via the constructor / NodeGrouper). The two PRs need to merge together with this field present.

🔴 Blocker (still stands) — default-group key collides with a real label value

Independent of helm. In nodeagentautoscaler/nodegrouper.go (~L81–115) label-less nodes bucket under groupMap[DefaultNodeGroup] ("default"), and group.IsDefault ||= isDefault flips a group to the default path if any member lacks the label. A node legitimately labeled node.kubernetes.io/instance-type=default lands in that same bucket — and #868's default branch drops the instance-type nodeSelector in favor of DoesNotExist, so that genuinely-labeled node is no longer matched and loses its node-agent (and resource sizing mixes the two populations). Track the synthetic fallback in a separate bucket that can't collide with a real label value.

@github-actions

Copy link
Copy Markdown

Summary:

  • License scan: success
  • Credentials scan: failure
  • Vulnerabilities scan: failure
  • Unit test: success
  • Go linting: failure

The integration-build job rendered the chart from a fresh helm-charts
checkout, but the chart has local file:// subchart dependencies
(kubescape-alert-crd, kubescape-alert-crd-ns) that must be vendored first.
Run `helm dependency build` before `helm template` so the extraction
produces a non-empty template instead of failing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

@matthyx matthyx left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review — both blockers resolved ✅

Verified the two new commits (75ea4f5, a96a01c) against the blockers from the earlier review. Built clean and go test ./nodeagentautoscaler/... ./config/... passes locally.

🔴→✅ Default-group key collision — label-less nodes now bucket under an internal sentinel (defaultGroupKey = "\x00default-node-group", nodegrouper.go) that can't equal a real label value, so a node legitimately labelled …=default forms its own normal nodeSelector group and is never merged onto the affinity path. The collision-resolution hash is salted for the fallback group so the two same-LabelValue groups get distinct DaemonSet names. Good catch also keying Reconcile off the unique DaemonSet name instead of the (now non-unique) kubescape.io/node-group label value — that would otherwise have collapsed the default and real default groups during create/orphan handling.

🔴→✅ Missing NodeGroupLabelKeyTemplateData.NodeGroupLabelKey is now plumbed through NewTemplateRenderer(..., cfg.NodeGroupLabel) and set in RenderDaemonSet, so the template's {{ .NodeGroupLabelKey }} (both the nodeSelector and the default group's DoesNotExist affinity, per helm-charts #868) resolves and keys off the operator's actual grouping label even when overridden.

Withdrawn concern — the integration test now runs in CI (pr-created.yaml extracts the real chart template), with helm dependency build added in a96a01c to vendor the file:// subcharts.

One remaining gate (merge ordering, not a code issue)

The CI job checks out kubescape/helm-charts@main, but the matching template branch (.NodeGroupLabelKey + IsDefaultGroup) lives in unmerged helm-charts #868. Until #868 is on main, the extraction pulls the old template and TestIntegration_HelmGeneratedTemplate_DefaultGroup will fail (old template renders an instance-type nodeSelector and no DoesNotExist affinity). So #868 must merge first (or together) for this PR's CI to go green — exactly the dependency already noted. No further code changes needed on the operator side.

@github-actions

Copy link
Copy Markdown

Summary:

  • License scan: success
  • Credentials scan: failure
  • Vulnerabilities scan: failure
  • Unit test: success
  • Go linting: failure

@slashben

Copy link
Copy Markdown
Contributor

Pod selector collision for the =default value case

The PR carefully keeps the synthetic default group distinct from a real group labelled node.kubernetes.io/instance-type=default (sentinel map key, salted SanitizedName, distinct DaemonSet name) — but the rendered DaemonSet still sets kubescape.io/node-group: "{{ .NodeGroupLabel }}" in both spec.selector.matchLabels and the pod template labels, and NodeGroupLabel == LabelValue == "default" for both groups. So the two managed DaemonSets end up with identical pod selectors.

Node targeting is mutually exclusive (label-absent vs. =default) and ownerReferences prevent outright pod-stealing, so it won't usually thrash pods — but two same-namespace DaemonSets with overlapping selectors is exactly the fragility the collision logic was added to avoid: orphan adoption if a controllerRef is ever lost, ambiguous kubectl get pods -l kubescape.io/node-group=default, and confused status accounting.

Suggestion: derive the kubescape.io/node-group selector/label value from the collision-resolved SanitizedName rather than the raw LabelValue. TestNodeGrouper_GetNodeGroups_DefaultValueCollision asserts the names diverge but never checks selector uniqueness.

@slashben

Copy link
Copy Markdown
Contributor

Default group resource sizing is pinned to the first-listed node — riskier for this group than for instance-type groups

GetNodeGroups sets a group's AllocatableCPU/AllocatableMemory from the first node it sees and only increments NodeCount for the rest; CalculateResources then derives requests/limits from that single node's allocatable (2%/5%, clamped to 100m2000m / 180Mi4Gi by default).

For instance-type groups this is exact (homogeneous by definition). The default group is the opposite case: nodes land there because they're unlabeled, which on-prem/bare-metal is precisely the mixed-hardware scenario (e.g. a 2-core utility box and a 64-core worker in one bucket). Whichever node Nodes().List() returns first dictates requests/limits for the node-agent on all of them, and list order isn't guaranteed stable across reconciles, so the sizing can flip.

Consequence:

  • first = small node → node-agent under-requested and throttled/OOM-killed against a tiny limit on the big nodes;
  • first = big node → unschedulable requests or oversized limits on the small nodes.

Since the silent-skip bug this PR fixes is itself an on-prem/heterogeneous-cluster scenario, consider sizing the default group off the minimum allocatable across its nodes (a safe lower bound) instead of the arbitrary first one.

Address slashben's review on #389: the default (label-less) group can be
heterogeneous (mixed on-prem/bare-metal hardware), so sizing it off the
first-listed node is unsafe and order-dependent. Size it off the minimum
allocatable CPU/memory across its nodes instead, so the node-agent's requests
fit every node in the group regardless of node list order. Instance-type
groups are homogeneous and keep first-node sizing.

Also document the rare shared-selector edge case (a node legitimately labelled
<groupingLabel>=<defaultNodeGroup>): targeting is mutually exclusive and
ownerReferences prevent pod adoption, so it is safe; noted for operators.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
@matthyx

matthyx commented Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

Thanks @slashben — both points addressed in 99a27e9.

Default-group sizing → minimum allocatable. Agreed; the default group is exactly the heterogeneous case. GetNodeGroups now sizes the default group off the minimum allocatable CPU/memory across its nodes (instance-type groups stay first-node — homogeneous by definition). This makes sizing a safe lower bound and independent of Nodes().List() order. Added TestNodeGrouper_GetNodeGroups_DefaultGroupSizesOffMinimum (big node listed first, asserts the group sizes to the small one).

Pod-selector collision (=default edge) → accept + document. Good catch that TestNodeGrouper_GetNodeGroups_DefaultValueCollision checks name divergence but not selector uniqueness. Deriving the kubescape.io/node-group selector value from SanitizedName would fix it but spec.selector is immutable, so it would force delete+recreate of every existing autoscaler DaemonSet on upgrade. Given targeting is mutually exclusive and ownerReferences prevent pod adoption (as you note), we opted to keep selectors as-is and document the edge case in docs/node-agent-autoscaler.md. Happy to revisit toward SanitizedName if you'd prefer correctness over the upgrade churn.

@github-actions

Copy link
Copy Markdown

Summary:

  • License scan: success
  • Credentials scan: failure
  • Vulnerabilities scan: failure
  • Unit test: success
  • Go linting: failure

@matthyx matthyx requested a review from slashben June 30, 2026 08:11
@matthyx matthyx merged commit 8a9aa2e into main Jul 1, 2026
14 of 15 checks passed
@matthyx matthyx deleted the feat/autoscaler-default-node-group branch July 1, 2026 08:39
@matthyx matthyx moved this to To Archive in KS PRs tracking Jul 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ai-assisted Created through Armosec AI tooling (armosec-shared-rules plugin)

Projects

Status: To Archive

Development

Successfully merging this pull request may close these issues.

2 participants