Conversation
📝 Walkthrough
This change adds a learning period tracking capability to the node-agent. It bumps a dependency version, computes a sniffing duration and stores it in shared container state, exposes this duration as a metadata label, and refactors label validation logic to apply overrides before final sanitization.
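To make the walkthrough concrete, here is a minimal Go sketch of the described flow: compute the sniffing duration once, keep it on the shared container state, and expose it as a label value. The label key, field names, and the string-compaction trick are illustrative assumptions, not the PR's actual identifiers.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// formatDuration mirrors the compaction approach the review below describes:
// shorten time.Duration.String() output, e.g. "1h30m0s" -> "1h30m".
func formatDuration(d time.Duration) string {
	s := d.String()
	s = strings.Replace(s, "m0s", "m", 1)
	s = strings.Replace(s, "h0m", "h", 1)
	return s
}

// sharedContainerData stands in for the shared container state mentioned in
// the walkthrough; the field name is hypothetical.
type sharedContainerData struct {
	SniffingDuration time.Duration
}

func main() {
	shared := sharedContainerData{SniffingDuration: 90 * time.Minute}
	labels := map[string]string{
		// assumed label key, purely for illustration
		"kubescape.io/learning-period": formatDuration(shared.SniffingDuration),
	}
	fmt.Println(labels) // map[kubescape.io/learning-period:1h30m]
}
```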
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning); check details not captured.
Force-pushed from 4943be5 to a715f42
Performance Benchmark Results: Node-Agent Resource Usage (table not captured); Dedup Effectiveness: no data available.
Performance Benchmark Results: Node-Agent Resource Usage, Dedup Effectiveness (AFTER only), and Event Counters (tables not captured).
Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
Force-pushed from a715f42 to 93a5542
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/objectcache/shared_container_data.go (1)
94-99: `formatDuration` works but is brittle to representation changes.
The two `strings.Replace` calls correctly compact `time.Duration.String()` output for the cases covered by `Test_formatDuration` (verified for `5m`, `1h30m`, `45s`, `1h30s`, `1h`, and `0`). However, it relies on the exact format of `time.Duration.String()`; a fractional-second string like `"1h0m0.5s"`, for example, would only be partially compacted. For the duration ranges this codebase uses (sniffing time in minutes/hours), this is acceptable; flagging only as a heads-up for future maintenance.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/objectcache/shared_container_data.go` around lines 94 - 99, formatDuration currently massages time.Duration.String() output with brittle string replacements (m0s/h0m) which can break for fractional seconds or unexpected representations; update formatDuration to build the output from duration components instead of post-processing the string: compute hours, minutes and seconds (using d.Hours()/Minutes()/Seconds() or integer division/modulo on d.Seconds()/time.Second) and format them conditionally (e.g., print hours if >0, minutes if >0, seconds otherwise) so outputs like "1h", "5m", "1h30m", "45s", "1h30s", and "0s" are produced deterministically without relying on time.Duration.String(). Ensure the logic lives in formatDuration and preserves existing test cases.
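A minimal sketch of the component-based rewrite the prompt describes, building the string from hours, minutes, and seconds instead of post-processing `time.Duration.String()`; this illustrates the suggestion and is not the repository's code:

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// formatDuration renders a duration from its components, so the output no
// longer depends on the exact shape of time.Duration.String().
func formatDuration(d time.Duration) string {
	d = d.Round(time.Second)
	h := d / time.Hour
	m := (d % time.Hour) / time.Minute
	s := (d % time.Minute) / time.Second
	var b strings.Builder
	if h > 0 {
		fmt.Fprintf(&b, "%dh", h)
	}
	if m > 0 {
		fmt.Fprintf(&b, "%dm", m)
	}
	if s > 0 || b.Len() == 0 { // emit "0s" for a zero duration
		fmt.Fprintf(&b, "%ds", s)
	}
	return b.String()
}

func main() {
	// The cases from Test_formatDuration cited above.
	for _, d := range []time.Duration{
		5 * time.Minute, time.Hour + 30*time.Minute, 45 * time.Second,
		time.Hour + 30*time.Second, time.Hour, 0,
	} {
		fmt.Println(formatDuration(d)) // 5m, 1h30m, 45s, 1h30s, 1h, 0s
	}
}
```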
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 989fc585-c048-499a-94f8-387a3c256e64
⛔ Files ignored due to path filters (1)
`go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (4)
go.mod
pkg/containerprofilemanager/v1/lifecycle.go
pkg/objectcache/shared_container_data.go
pkg/objectcache/shared_container_data_test.go
```diff
  // Apply label overrides
  for k, v := range watchedContainer.LabelOverrides {
-     if v == "" {
-         delete(labels, k)
-     } else if errs := content.IsLabelValue(v); len(errs) != 0 {
-         logger.L().Warning("GetLabels - label override value is not valid, skipping", helpers.String("key", k), helpers.String("value", v))
-         delete(labels, k)
-     } else {
-         labels[k] = v
-     }
+     labels[k] = v
  }
```
Behavior change: invalid LabelOverrides now drop the base label too.
Previously, overrides were validated per-entry and invalid overrides were discarded so the underlying base label remained. With the single-pass sanitization at Lines 127–139, an invalid override value now first overwrites the base value at Line 106 and is then removed entirely at Line 137 — so the resulting label map is missing the key in cases where it would have kept the (valid) base value before.
If overrides are expected to be trusted/pre-validated this is fine; otherwise consider validating overrides before applying:
♻️ Suggested fix
```diff
  // Apply label overrides
  for k, v := range watchedContainer.LabelOverrides {
+     if v == "" || len(content.IsLabelValue(v)) != 0 {
+         continue
+     }
      labels[k] = v
  }
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
- // Apply label overrides
- for k, v := range watchedContainer.LabelOverrides {
-     if v == "" {
-         delete(labels, k)
-     } else if errs := content.IsLabelValue(v); len(errs) != 0 {
-         logger.L().Warning("GetLabels - label override value is not valid, skipping", helpers.String("key", k), helpers.String("value", v))
-         delete(labels, k)
-     } else {
-         labels[k] = v
-     }
-     labels[k] = v
- }
+ // Apply label overrides
+ for k, v := range watchedContainer.LabelOverrides {
+     if v == "" || len(content.IsLabelValue(v)) != 0 {
+         continue
+     }
+     labels[k] = v
+ }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/objectcache/shared_container_data.go` around lines 104 - 107, The current
loop applies watchedContainer.LabelOverrides directly to labels which allows an
invalid override to overwrite the base label and later be dropped by
sanitization; instead, validate each override entry before applying it: iterate
over watchedContainer.LabelOverrides, run the same per-entry
validation/sanitization logic used later (or a dedicated isValidLabelValue
check) and only set labels[k] = v when the override value is valid; this
preserves the original base label for keys whose overrides are invalid and
prevents losing keys during the later sanitization pass.
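For reference, one way to act on this prompt while keeping the original warning log is sketched below. It assumes the `content`, `logger`, and `helpers` imports already present in this file, and the empty-value-keeps-base choice mirrors the suggested fix above rather than confirmed project behavior:

```go
// Apply label overrides, validating each entry first so an invalid
// override never clobbers (and later loses) the sanitized base label.
for k, v := range watchedContainer.LabelOverrides {
	if v == "" {
		continue // nothing valid to apply; keep the base label
	}
	if errs := content.IsLabelValue(v); len(errs) != 0 {
		logger.L().Warning("GetLabels - label override value is not valid, skipping",
			helpers.String("key", k), helpers.String("value", v))
		continue // keep the base value instead of dropping the key
	}
	labels[k] = v
}
```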
Performance Benchmark Results: Node-Agent Resource Usage, Dedup Effectiveness (AFTER only), and Event Counters (tables not captured).
…eCache (kubescape#788) (#36)

* Replace AP and NN cache with CP (kubescape#788)

* feat: foundation for ContainerProfileCache unification (steps 1, 2, 5-early)

Additive-only scaffolding for the upcoming migration from the two workload-keyed caches (applicationprofilecache + networkneighborhoodcache) to a single container-keyed ContainerProfileCache. No consumers are rewired yet; all new code is unused.
- Storage client: GetContainerProfile(namespace, name) on ProfileClient interface + *Storage impl + mock.
- ContainerProfileCache interface + stub impl (methods return zero values; filled in by step 3/4).
- Prometheus metrics: nodeagent_user_profile_legacy_loads_total{kind,completeness} deprecation counter + reconciler SLO metrics (entries gauge, hit/miss counter, tick duration histogram, eviction counter) registered up front so later steps emit cleanly.

Plan artifacts in .omc/plans/; approved by ralplan Planner/Architect/Critic consensus (v2, iteration 2).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: ContainerProfileCacheImpl + projection + shared-pointer fast-path (steps 3, 3.5, 4)

- CachedContainerProfile entry with Shared/RV/UserAP/UserNNRV fields
- Option A+ fast-path: shared storage pointer when no user overlay
- projection.go ports mergeContainers/mergeNetworkNeighbors from legacy caches
- partial-profile detection with dedup'd WARN log + completeness metric label
- Event-path delete with WithLock+ReleaseLock (Critic #2 lock-gap fix)
- Unit tests T4 (projection) + T6 (callstack parity) + fast-path identity

Step 5 (reconciler) and legacy deletion land in follow-ups.
Plan: .omc/plans/containerprofile-cache-unification-consensus.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: ContainerProfileCache reconciler with evict + refresh (step 5)

- tickLoop drives evict + refresh on one goroutine, refresh gated by atomic
- reconcileOnce evicts entries whose pod is gone or container stopped
- refreshAllEntries snapshots IDs then refreshes outside Range to avoid a SafeMap RLock/WLock deadlock (rebuildEntry calls Set)
- isContainerRunning(pod, entry, id): containerID primary, (Name, PodUID) fallback for pre-running init containers with empty ContainerID
- ctx.Err() honored inside Range callbacks for graceful shutdown
- T8 end-to-end test: user-AP mutation -> cached projection reflects change

Plan: .omc/plans/containerprofile-cache-unification-consensus.md
Consensus deltas applied: #1 (isContainerRunning signature), #3 (ctx.Err), #4 (extend fast-skip to overlay RVs), #5 (T8 test), #7 (RPC-cost comment).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: profilehelper CP->legacy-shape shims + ContainerProfileCache aggregator wiring (step 6a)

Adds the ContainerProfileCache reader to the ObjectCache aggregator interface so profilehelper can read CP and synthesize the legacy ApplicationProfileContainer / NetworkNeighborhoodContainer shapes for callers that haven't migrated yet.
- pkg/objectcache/objectcache_interface.go: add ContainerProfileCache() to aggregator interface + mock (both AP/NN stay for 6a-6c transit)
- pkg/objectcache/v1/objectcache.go: add cp field, 5-arg NewObjectCache, ContainerProfileCache() accessor
- pkg/objectcache/v1/mock.go: RuleObjectCacheMock implements CP surface + Get/SetContainerProfile test helpers, Start stub
- pkg/rulemanager/profilehelper/profilehelper.go:
  - GetContainerProfile(objectCache, id) returns (*CP, syncChecksum, error) — the forward API
  - GetContainerApplicationProfile + GetContainerNetworkNeighborhood rewritten as ~30-LOC CP->legacy-shape shims (consensus delta #2). Marked deprecated; step 6c deletes them after CEL callers migrate.
- cmd/main.go: construct ContainerProfileCache alongside APC+NNC, pass to NewObjectCache; mock-path uses ContainerProfileCacheMock
- test call sites updated for 5-arg NewObjectCache

Plan: .omc/plans/containerprofile-cache-unification-consensus.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: migrate 20 CEL call sites to GetContainerProfile (step 6b)

- applicationprofile/{capability,exec,http,open,syscall}.go: read fields directly off cp.Spec instead of the per-container AP shape
- networkneighborhood/network.go: read Ingress/Egress/LabelSelector off cp.Spec directly
- pkg/objectcache/v1/mock.go: extend RuleObjectCacheMock so SetApplicationProfile / SetNetworkNeighborhood also project into the unified ContainerProfile, and GetContainerProfile honours the shared container-ID registry (preserves "invalid container ID -> no profile" semantics for existing tests)
- profilehelper CP->legacy shims remain in place; step 6c removes them

Plan: .omc/plans/containerprofile-cache-unification-consensus.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: delete profilehelper shims + migrate rule_manager + creator (step 6c)

- pkg/rulemanager/profilehelper/profilehelper.go: delete GetContainerApplicationProfile, GetContainerNetworkNeighborhood, GetApplicationProfile, GetNetworkNeighborhood, GetContainerFromApplicationProfile, GetContainerFromNetworkNeighborhood — CP-direct API is the only surface now
- pkg/rulemanager/rule_manager.go:
  - :202, :399 call profilehelper.GetContainerProfile instead of the shim
  - HasFinalApplicationProfile reads cp via ContainerProfileCache().GetContainerProfile; method name preserved (external API on RuleManagerInterface per plan v2 §2.4)
- pkg/rulemanager/rulepolicy.go: Validate takes *v1beta1.ContainerProfile and reads cp.Spec.PolicyByRuleId
- pkg/rulemanager/ruleadapters/creator.go: both AP + NN branches use ContainerProfileCache().GetContainerProfileState (unified state source)

Plan: .omc/plans/containerprofile-cache-unification-consensus.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: ObjectCache aggregator CP-only + collapse 2 callbacks to 1 (step 6d)

- pkg/objectcache/objectcache_interface.go: drop ApplicationProfileCache() and NetworkNeighborhoodCache() methods — the aggregator is now {K8s, ContainerProfile, Dns}
- pkg/objectcache/v1/objectcache.go: 3-arg NewObjectCache(k, cp, dc)
- pkg/containerwatcher/v2/container_watcher_collection.go:63-64: two ContainerCallback subscriptions (APC + NNC) collapse to one (CPC)
- cmd/main.go: both branches (runtime-detection + mock) construct only ContainerProfileCache + Dns; legacy APC/NNC wiring removed with startup log: "ContainerProfileCache active; legacy AP/NN caches removed"
- test call sites updated for 3-arg NewObjectCache

Legacy packages still physically present (imports retained where still referenced, e.g. callstackcache); step 8 deletes them entirely.
Plan: .omc/plans/containerprofile-cache-unification-consensus.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: delete legacy AP/NN cache packages + move callstackcache (step 8)

- git rm -r pkg/objectcache/applicationprofilecache/ (766 LOC)
- git rm -r pkg/objectcache/networkneighborhoodcache/ (758 LOC)
- git rm pkg/objectcache/applicationprofilecache_interface.go
- git rm pkg/objectcache/networkneighborhoodcache_interface.go
- mv pkg/objectcache/applicationprofilecache/callstackcache/ -> pkg/objectcache/callstackcache/ (domain-agnostic, shared)
- Update 4 importers: containerprofilecache_interface.go, v1/mock.go, containerprofilecache.go, reconciler.go
- RuleObjectCacheMock drops ApplicationProfileCache()/NetworkNeighborhoodCache() accessor methods; SetApplicationProfile/SetNetworkNeighborhood remain as test helpers that project into the unified CP
- projection.go comments kept as historical source pointers — git history preserves the originals

Plan: .omc/plans/containerprofile-cache-unification-consensus.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add T2 init-eviction, T5 packages-deleted, T7 lock-stress (step 9 partial)

- tests/containerprofilecache/packages_deleted_test.go: go/packages dep-graph assertion that legacy AP/NN paths are absent
- tests/containerprofilecache/lock_stress_test.go: 100-goroutine interleaved seed/read for same container IDs, 5s budget, race-safe
- tests/containerprofilecache/init_eviction_test.go: T2a (event-path evict) + T2b (reconciler-path evict for missed RemoveContainer)
- tests/containerprofilecache/helpers_test.go: shared test builders
- pkg/objectcache/containerprofilecache: export ReconcileOnce and SeedEntryForTest as out-of-package test hooks
- Makefile: check-legacy-packages target

T1 (golden-alert parity) and T3 (memory benchmark) are release-checklist items per plan v2 §2.7 — the pre-migration baselines those tests require can no longer be captured from this branch.
Plan: .omc/plans/containerprofile-cache-unification-consensus.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address Phase 4 review P1 findings

1. Drop ReleaseLock on delete paths (containerprofilecache.go deleteContainer, reconciler.go reconcileOnce). Security review flagged a race where the deleted mutex could be orphaned while a concurrent GetLock creates a new one, breaking mutual exclusion for the same container ID. Trade-off: bounded memory growth of stale lock entries, proportional to container churn — acceptable for a node-agent lifetime.
2. Extract emitOverlayMetrics helper (metrics.go) to de-duplicate the ~20-line overlay metric/deprecation-warn block between buildEntry (addContainer path) and rebuildEntry (refresh path). Keeps the two in lockstep — code review flagged silent drift risk.

Not addressed in this commit (plan-accepted tradeoffs, follow-up work):
- Shared-pointer read-only invariant is convention-enforced, not type-enforced (plan v2 §2.3 step 7, ADR consequences). Retaining as-is; downstream consumers must not mutate.
- Storage RPC context propagation (requires storage.ProfileClient interface change, out of scope for this migration).

Plan: .omc/plans/containerprofile-cache-unification-consensus.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: retry pending ContainerProfile GETs when CP appears after container-start

Component tests on PR kubescape#788 regressed with "All alerts: []" and 54+ "container X not found in container-profile cache" log entries. Root cause: addContainer did a one-shot GetContainerProfile at EventTypeAddContainer time and bailed on 404. But the CP is created asynchronously by containerprofilemanager ~60s AFTER container-start, so the one-shot GET almost always missed; the cache entry was never created; rule evaluation short-circuited as "no profile". Legacy caches hid this via a periodic ListProfiles scan that picked up late-arriving profiles on the next tick. The point-lookup model dropped that scan. This commit adds an equivalent: a pending-container retry path in the reconciler.

Changes:
- CachedContainerProfile unchanged; new pendingContainer struct captures (container, sharedData, cpName) needed to retry the initial GET.
- ContainerProfileCacheImpl.pending SafeMap records containerIDs waiting for their CP to land in storage.
- addContainer extracts the populate/GET into tryPopulateEntry. On miss (err or nil CP) it records a pending entry; the per-container goroutine exits. No more waiting 10 min inside addContainerWithTimeout.
- reconciler.retryPendingEntries iterates pending under per-container locks, re-issues the GET, and promotes via tryPopulateEntry on success.
- reconcileOnce gains a pending GC pass: containers whose pod is gone or whose status is not Running get dropped from pending so we don't retry forever on terminated containers.
- deleteContainer also clears from pending on EventTypeRemoveContainer.
- metrics: cache_entries gauge gains a "pending" kind; reconciler eviction counter gets a "pending_pod_stopped" reason.

Tests:
- TestRetryPendingEntries_CPCreatedAfterAdd: 404 on add -> pending; CP arrives in storage -> one tick promotes; exactly 2 GetCP calls.
- TestRetryPendingEntries_PodGoneIsGCed: pending entry dropped when its pod is no longer present in k8s cache.

Full findings and resume doc at .omc/plans/containerprofile-cache-component-test-findings.md
Follow-up plan updated at .omc/plans/containerprofile-cache-followups.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: cache correctness — right CP slug, partial-on-restart, overlay refs, resurrection guard

PR kubescape#788 component tests continued failing after the pending-retry fix. Deep investigation uncovered a fundamental slug misuse and three reviewer-reported correctness gaps. All fixed here.

### Primary bug: wrong slug function

plan v2 §2.3 asserted that GetOneTimeSlug(false) was deterministic. It is NOT — implementation at k8s-interface v0.0.206:

    func (id *InstanceID) GetOneTimeSlug(noContainer bool) (string, error) {
        u := uuid.New()
        hexSuffix := hex.EncodeToString(u[:])
        ...
    }

So containerprofilemanager.saveContainerProfile writes a *time-series* CP per tick with a fresh UUID suffix, and the storage-side ContainerProfileProcessor.consolidateKeyTimeSeries writes the consolidated profile at the STABLE slug (GetSlug(false), no UUID). The cache was querying for CPs at GetOneTimeSlug(false), so every GET 404'd forever — even with the pending-retry in place. 13 component tests failed with "All alerts: []" and 38+ "container X not found in container-profile cache" log entries. Switched addContainer to GetSlug(false). The refresh path inherits the corrected name via entry.CPName.

### Reviewer #1: resurrection during refresh

refreshAllEntries snapshots entries without a lock. Between snapshot and per-entry lock acquisition, deleteContainer or reconcile-evict may have removed the entry. Previously, rebuildEntry's c.entries.Set(id, newEntry) would resurrect the dead container. Added a load-under-lock guard at the top of refreshOneEntry.

### Reviewer #2: overlay handling regressions (two parts)

(a) tryPopulateEntry returned "pending" on base-CP 404 BEFORE trying user-AP/NN. Containers with only a user-defined profile (no base CP yet) got no entry. Restructured: fetch base CP and user-AP/NN independently; populate if ANY source is available; synthesize an empty base CP when only the overlay exists so projection has something to merge into.

(b) UserAPRef / UserNNRef were only recorded on successful fetch. A transient 404 on add would permanently drop the overlay intent — the refresh path had nothing to re-fetch. Now, when the label is set, the refs are always recorded, using the label's name and the container's namespace. Refresh retries the fetch each tick.

### Reviewer #3: partial profiles reused across container restart

tryPopulateEntry blindly used whatever CP existed at the stable slug, including Partial completions from the previous container incarnation. Legacy caches explicitly deleted Partial profiles on non-PreRunning restart so rule evaluation fell through to "no profile" until Full arrived. Now: if CP.completion == Partial && !sharedData.PreRunningContainer, we treat the CP as absent → stay pending → retry each tick. When the CP becomes Full (or the container stops), the pending state resolves. The inverse is preserved: PreRunningContainer (agent-restart scenario) accepts the Partial CP as-is so Test_19's "alert on partial profile" semantics still work.

### Tests

Five new unit tests, all race-clean:
- TestPartialCP_NonPreRunning_StaysPending
- TestPartialCP_PreRunning_Accepted
- TestOverlayLabel_TransientFetchFailure_RefsRetained
- TestRefreshDoesNotResurrectDeletedEntry
- TestUserDefinedProfileOnly_NoBaseCP

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: read workload-level AP/NN as primary data source

The storage server's consolidated ContainerProfile is not exposed via the public k8s API — ContainerProfiles().Get(stableName) returns 404 even after consolidation runs. Only time-series CPs (named <stable>-<UUID>) and the server-aggregated ApplicationProfile / NetworkNeighborhood CRs at the workload-name are queryable. The component tests' WaitForApplicationProfileCompletion waits for the workload-level AP/NN completion — that's what actually exists. The legacy caches read these directly; we do the same now while the server-side consolidated-CP plumbing is completed.

Changes:
- addContainer computes both cpName (per-container, forward-compat) and workloadName (per-workload, where AP/NN live) via GetSlug(false) and GetSlug(true) respectively.
- tryPopulateEntry fetches consolidated CP (kept for forward-compat), workload AP, and workload NN. Treats the workload AP/NN as the primary data source when the consolidated CP isn't available.
- projection pre-merges workloadAP + workloadNN onto the base (synthesized when CP is 404), then buildEntry applies user-overlay AP/NN on top.
- Partial-on-restart gate extended to cover workload AP/NN too — non-PreRunning containers ignore partial workload profiles until they become Full, mirroring legacy deletion-on-restart semantics.
- pendingContainer gains workloadName so retries re-fetch the right CRs.
- fakeProfileClient gains overlayOnly field so tests can scope AP/NN returns to the overlay name; existing TestOverlayPath_DeepCopies updated accordingly.

Forward-compat: once storage publishes a queryable consolidated CP at cpName, its fetch becomes primary and the workload AP/NN path becomes a fallback. No API changes are required to make that transition — just drop the workload-level fetches.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* debug: add tick-loop start log + change-detection log in reconciler

* fix: remove overly-aggressive pending GC that dropped entries before retry

CI run 24781030436 (commit ce32919) proved the reconciler IS ticking with retryPendingEntries running, but the pending-GC pass in reconcileOnce was dropping every pending entry on the first tick (pending_before=4 → pending_after=0 at the FIRST tick, before retryPendingEntries could run).

Root cause: the GC pass asked k8sObjectCache.GetPod(ns, pod) and also checked isContainerRunning. On a busy node, the k8s pod cache and ContainerStatuses lag the containerwatcher Add event by tens of seconds. So "pod not found" or "container not yet Running" routinely returned true for a container that had just been registered, causing GC to remove the pending entry immediately. Retries then ran against an empty pending map → no promotions → alerts fired without profile → test failure.

Change: remove the pending GC entirely. Cleanup for terminated containers flows through deleteContainer (EventTypeRemoveContainer) which clears both entries and pending under the per-container lock. Memory growth is bounded by the node's container churn (containers that never got a profile during their lifetime).

Test updated: TestRetryPendingEntries_PodGoneIsGCed replaced by TestPendingEntriesAreNotGCedBeforeRetry which asserts the new semantics.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: merge user-managed AP/NN and refresh workload-level sources

Two component-test regressions in PR kubescape#788:

Fix A (Test_12 / Test_13): the cache now reads the user-managed ApplicationProfile and NetworkNeighborhood published at "ug-<workloadName>" and projects them onto the base profile as a dedicated ladder pass. Legacy caches did this via the `kubescape.io/managed-by: User` annotation in handleUserManagedProfile; we read them directly by their well-known name.

Fix B (Test_17 / Test_19): the reconciler refresh path re-fetches the workload-level AP/NN (and user-managed / label-referenced overlays) on every tick, not just the consolidated CP. This propagates the Status "ready" -> "completed" transition into the cached ProfileState, which flips fail_on_profile from false to true at rule-eval time.

CachedContainerProfile gained WorkloadName plus WorkloadAPRV / WorkloadNNRV / UserManagedAPRV / UserManagedNNRV fields so the refresh can fast-skip when every source's RV matches what's cached. refreshOneEntry's rebuild now runs the same projection ladder as tryPopulateEntry: base CP (or synthesized) → workload AP+NN → user-managed (ug-) AP+NN → label-referenced user AP+NN.

Also:
- Tick-loop log only fires when entries OR pending count actually moved (previously fired whenever pending_before > 0, producing per-tick noise while a stuck-pending entry waited for profile data).
- fakeProfileClient in tests returns userManagedAP/userManagedNN when the requested name starts with "ug-".
- New tests: TestWorkloadAPMerged_AndRefreshUpdatesStatus (Fix B happy-path) and TestUserManagedProfileMerged (Fix A happy-path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: reconcileOnce no longer evicts on pod-cache lag, only on Terminated

CI run 24783250693 (commit 32a76c0) showed reconcileOnce evicting live entries every tick: "entries_before:10, entries_after:0" within 5 seconds of the agent starting. Same class of bug as the pending-GC fix (c45803f): the k8s pod cache lags ContainerCallback Add events by tens of seconds, and evicting on "GetPod returns nil OR !isContainerRunning" churned every entry before any rules could evaluate.

Change reconcileOnce eviction gate:
- If pod is missing from k8s cache: DO NOT evict. Cache lag is transient; deleteContainer handles real-world cleanup via EventTypeRemoveContainer.
- If pod present and container has clearly Terminated: evict (preserves init-container eviction for Test_02 and T2 acceptance).
- If pod present and container in Waiting state: retain (new container creation, init-container pre-run both legitimately pass through Waiting).

New helper isContainerTerminated mirrors isContainerRunning but gates on State.Terminated only; "not found in any status" treated as terminated.

Tests:
- TestReconcilerEvictsWhenPodMissing → TestReconcilerKeepsEntryWhenPodMissing
- New TestReconcilerEvictsTerminatedContainer
- New TestReconcilerKeepsWaitingContainer

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: drop workload-level AP/NN fetch; CP-direct reading is authoritative

The workload-level AP/NN fetch added in d27be01 was a workaround for the eviction/GC bugs (fixed in c45803f and d9ae0ac), not an architectural need. The consolidated ContainerProfile IS queryable at the GetSlug(false) name once storage aggregation runs; the cache simply needs to wait on the pending-retry path.

This reverts the workload-AP/NN read while keeping:
- consolidated CP as the single base-profile source
- user-managed AP/NN at "ug-<workloadName>" (merged on top) — still needed because user-managed profiles are authored independently and are not consolidated into the CP server-side
- user-defined overlay via pod UserDefinedProfileMetadataKey label
- eviction fix (d9ae0ac), GC fix (c45803f), resurrection guard

Removed:
- workload-AP/NN fetch in tryPopulateEntry and refreshOneEntry
- WorkloadAPRV / WorkloadNNRV fields on CachedContainerProfile and the corresponding rebuildEntryFromSources ladder pass
- Partial-on-restart gate for workload AP/NN (only applies to CP now)
- Synth-CP annotation fallback chain (simplified to Completed/Full)

Tests:
- TestWorkloadAPMerged_AndRefreshUpdatesStatus → TestRefreshUpdatesCPStatus (CP now the source; RV transition propagates Status)
- TestUserManagedProfileMerged rewired to use a real base CP + ug- overlay instead of workloadAP + ug- overlay

This matches the migration plan's original intent: CP-direct, AP/NN only as user overlays.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: synthetic entry CPName override, PodUID backfill, phase-labeled reconciler histogram

Three review findings from the post-green audit.

### 1 (High) — synthetic entry stored the wrong CPName

When tryPopulateEntry synthesized a CP (consolidated CP still 404), the synthetic name was workloadName or overlayName, and buildEntry persisted entry.CPName = cp.Name (i.e. the synthetic name). refreshOneEntry then queried the synthetic name instead of the real GetSlug(false) name; with the stored RV also empty, the fast-skip's "absent matches empty" branch kept the synthetic entry forever and the real consolidated CP could never replace it.

Fix: after buildEntry, override entry.CPName = cpName (the real GetSlug(false) result passed into tryPopulateEntry).

### 2 (Medium) — PodUID never backfilled

buildEntry only sets PodUID when the pod is already in k8sObjectCache at add time. On busy nodes the pod cache lags, so addContainer often runs before the pod lands and PodUID stays "". isContainerTerminated's empty-ContainerID fallback matches against (ContainerName, PodUID); when PodUID == "" and the status also has empty UID, the loop falls through and returns true (treat as terminated) — evicting a still-live init container. rebuildEntryFromSources copied prev.PodUID unchanged, so the error never healed.

Fix: in rebuildEntryFromSources, if prev.PodUID is empty AND the pod is now in the k8s cache, use the fresh UID.

### 3 (Low) — reconciler duration histogram mixed two phases

tickLoop (evict) and refreshAllEntries (refresh) both emitted ReportContainerProfileReconcilerDuration into the same plain Histogram, so nodeagent_containerprofile_reconciler_duration_seconds was a blend of two very different workloads. Plan v2 §2.9 had specified a HistogramVec with a "phase" label from the start.

Fix: MetricsManager.ReportContainerProfileReconcilerDuration(phase, d). Prometheus implementation becomes a HistogramVec with phase label. tickLoop emits phase="evict", refreshAllEntries emits phase="refresh". MetricsMock/MetricsNoop signatures updated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address all CodeRabbit review comments on PR kubescape#788

- ContainerProfileCacheMock.GetContainerProfileState returns synthetic error state instead of nil, matching the real impl's contract
- Remove IgnoreContainer check on EventTypeRemoveContainer to prevent stale entries when pod labels change after Add
- Deep-copy userAP/userNN in mergeApplicationProfile and mergeNetworkNeighborhood to eliminate aliasing of nested slices (Execs[i].Args, Opens[i].Flags, MatchExpressions[i].Values, etc.) into the cached ContainerProfile
- Fix Shared=true bug: buildEntry now takes userManagedApplied bool; fast-path only sets Shared=true when no overlay was applied at all, matching rebuildEntryFromSources logic in reconciler.go
- isContainerTerminated returns false when all status slices are empty (kubelet lag guard for brand-new pods)
- Fix misplaced doc comment above GetContainerProfile in storage layer
- Remove unused (*stubStorage).setCP test helper
- Lock stress test evict path now uses ContainerCallback(Remove) to exercise deleteContainer and per-container locking
- RuleObjectCacheMock stores per-container profiles in cpByContainerName; GetContainerProfile resolves via InstanceID.GetContainerName(); GetContainerProfileState returns synthetic error state

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: thread context.Context through ProfileClient and add per-call RPC budget

All five ProfileClient methods now accept ctx as their first argument so callers can enforce cancellation and deadline propagation. Each storage RPC in the reconciler is wrapped via refreshRPC(ctx, ...) which applies a configurable per-call timeout (config.StorageRPCBudget, default 5 s) on top of the parent context, preventing a slow API server from stalling an entire reconciler burst. Tests cover the fast-skip, rebuild, and context-cancellation mid-RPC paths.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test: shared-pointer race-fuzz test + WarmContainerLocksForTest helper

Add TestSharedPointerReadersDoNotCorruptCache: 50 concurrent readers traverse the returned *ContainerProfile slices while a writer goroutine alternately calls RefreshAllEntriesForTest + SeedEntryForTest to keep entry rebuilds active. Runs for 500ms under -race, proving the shared-pointer fast-path never produces a concurrent read/write pair.

Also add TestSharedPointerFastPathPreservesPointerIdentity: after a refresh against a storage object with a newer RV, the new entry's Profile pointer IS the storage object (Shared=true, no DeepCopy), which keeps the T3 memory budget intact.

Fix the pre-existing goradd/maps SafeMap initialisation race in TestLockStressAddEvictInterleaved by pre-warming containerLocks via the new WarmContainerLocksForTest helper (the previous pre-warm via SeedEntryForTest only covered the entries SafeMap, not containerLocks).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: document SetApplicationProfile / SetNetworkNeighborhood field partition in mock

Add a block comment above RuleObjectCacheMock spelling out the non-overlapping cp.Spec field partition between the two setters and the first-container-wins rule for r.cp. Without this, future callers risk aliasing NN fields into an AP-only profile or vice-versa.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: T8 integration mirror, mock setter contract doc, SeedEntryWithOverlayForTest

Add SeedEntryWithOverlayForTest helper so out-of-package integration tests can set UserAPRef / UserNNRef (which use the internal namespacedName type) without requiring the type to be exported.

Mirror TestT8_EndToEndRefreshUpdatesProjection at tests/containerprofilecache/ using only the public + test-helper API: seeds an entry with a stale UserAPRV, mutates storage to apV2 (RV=51), asserts RefreshAllEntriesForTest rebuilds the projection with the new execs and drops the stale ones.

Add top-of-file block comment to RuleObjectCacheMock documenting the non-overlapping AP-fields / NN-fields partition between SetApplicationProfile and SetNetworkNeighborhood and the first-container-wins rule for r.cp.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address Phase 4 code-review findings

- reconciler.go: simplify dead-code cpErr/rpcErr guard (refreshRPC returns exactly cpErr; the rpcErr != nil && cpErr == nil branch could never fire)
- reconciler_test.go: make blockingProfileClient.blocked a buffered chan(1) with a blocking send so the signal is never silently dropped; bump rpcBudget to 100ms and timeout to 2s to reduce flakiness on loaded CI
- containerprofilecache.go: extract defaultStorageRPCBudget const alongside defaultReconcileInterval for discoverability
- shared_pointer_race_test.go: fix gofmt const-block alignment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: preserve cached entry when overlay AP/NN fetch fails transiently

Before this fix, a refreshRPC timeout on any overlay GET (user-managed ug-<workload> AP/NN or user-defined label-referenced AP/NN) left the overlay variable nil with the error silently discarded. The RV comparison then saw rvOf(nil)="" != cached RV (e.g. "50"), treated it as a removal, and rebuilt the entry without the overlay — temporarily stripping user-managed/user-defined profile data from the cache and altering alerting until the next successful tick.

Fix: capture each overlay's fetch error and, when it is non-nil and the entry already has a non-empty cached RV for that overlay, return early and keep the existing entry unchanged. Legitimate deletions (nil with err==nil) still propagate correctly. Mirrors the existing CP error-preservation logic at refreshOneEntry:272-288.

Add TestRefreshPreservesEntryOnTransientOverlayError covering all four overlay fetch paths (user-managed AP, user-managed NN, user-defined AP, user-defined NN) via a new overlayErrorClient stub.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address CodeRabbit review issues on PR kubescape#788

- Rename 5 CP cache metrics from nodeagent_* to node_agent_* to match the existing metric namespace convention used across node-agent.
- Route all 5 storage GETs in tryPopulateEntry through refreshRPC so they respect the per-call SLO (default 5s); prevents a hung GET from blocking the entire reconciler tick loop when called from retryPendingEntries.
- Add WarmPendingForTest helper to pre-initialise the pending SafeMap before concurrent test phases, preventing the goradd/maps nil-check-before-lock initialisation race.
- Pre-warm pending SafeMap in TestLockStressAddEvictInterleaved and poll for async deleteContainer goroutines to drain before asserting goroutine count.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: distinct RNG seed per stress-test worker

Pass worker index into each goroutine closure and mix it into the rand.NewSource seed (time.Now().UnixNano() + int64(worker)), so that 100 concurrently-launched goroutines don't all receive the same nanosecond timestamp and end up with identical add/evict sequences.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: move test helpers out of production source into testing.go

The six *ForTest / ReconcileOnce helpers were previously mixed into containerprofilecache.go alongside production logic. Move them to a dedicated testing.go file in the same package. export_test.go is the idiomatic alternative but is compiled only when running tests in the same directory; test packages in other directories (tests/containerprofilecache/) import the non-test version of the package and never see _test.go contents. A plain testing.go is the correct pattern here — it signals "test support" by name and groups all scaffolding in one place, while remaining importable by any test binary.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: move integration tests into package dir; use export_test.go

export_test.go (package containerprofilecache) is only compiled during `go test` so test helpers never enter the production binary. This only works when callers are in the same directory; the prior layout put tests in tests/containerprofilecache/ (a separate package), forcing helpers into a plain testing.go that shipped in the binary. Moving the six test files into pkg/objectcache/containerprofilecache/ as package containerprofilecache_test fixes this correctly:
- export_test.go replaces testing.go (test-binary-only)
- package declaration: containerprofilecache_integration → containerprofilecache_test
- packages_deleted_test.go Dir path: ../.. → ../../.. (module root)
- tests/containerprofilecache/ directory removed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: nil out overlay pointers when k8s client returns zero-value on 404

The Kubernetes generated client (gentype.Client.Get) pre-allocates a zero-value struct before the HTTP call and returns it as the result even on error (e.g. 404 not-found). In refreshOneEntry, the four overlay fetch paths (userManagedAP, userManagedNN, userAP, userNN) guarded only the "transient error with cached RV → keep old entry" branch; the "first-time 404, no cached RV" branch fell through with a non-nil empty-ObjectMeta struct still in the pointer, which reached rebuildEntryFromSources → emitOverlayMetrics and logged spurious "user-authored legacy profile merged" warnings with empty namespace/name/resourceVersion fields. Add an explicit nil-out after each non-returning error branch, mirroring the pattern already used in tryPopulateEntry.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* feat: extract client CA file from kubelet config YAML and enhance service file handling (kubescape#791)

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* add learning period label to TS CPs (kubescape#797)

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* perf: switch to kubescape/syft v1.32.0-ks.2 + disable file catalogers (kubescape#798)

* perf: disable file-digest/metadata/executable catalogers

These three catalogers iterate every file in the scan tree and dominate transient allocation, but their outputs are not consumed by the OOM-relevant SBOM path. Disabling them saves ~200 MB peak RSS on gitlab-ee (main) and stacks with upstream selective-indexing + binary-prefilter improvements to ~1.12 GB total (vs 1.62 GB baseline, fits 1.5 GB cgroup).

Signed-off-by: Ben <ben@armosec.io>

* deps: switch to kubescape/syft v1.32.0-ks.2 for memory reduction

Routes anchore/syft imports to the kubescape fork via replace directive. The fork carries selective indexing + binary-cataloger pre-filtering on top of v1.32.0; combined with the file-cataloger disable in the parent commit, this reduces gitlab-ee scan peak RSS from 1,621 MB to 1,123 MB.

Refs: NAUT-1283

Signed-off-by: Ben <ben@armosec.io>

* fix: check dep.Replace for actual fork version; add cataloger removals to sidecar

- packageVersion() now returns dep.Replace.Version when present so the fork tag (v1.32.0-ks.2) propagates to runtime metadata and version-gating logic
- pkg/sbomscanner/v1/server.go: add the same WithCatalogerSelection/WithRemovals as sbom_manager.go so both SBOM paths drop file-digest/metadata/executable catalogers and stay in consistent memory behaviour

Signed-off-by: Ben <ben@armosec.io>

* fix: keep syft tool version at required version

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Signed-off-by: Ben <ben@armosec.io>
Co-authored-by: Matthias Bertschy <matthias.bertschy@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
Signed-off-by: Ben <ben@armosec.io>
Co-authored-by: Matthias Bertschy <matthias.bertschy@gmail.com>
Co-authored-by: Ben Hirschberg <59160382+slashben@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Entlein <eineintlein@gmail.com>
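The refreshRPC helper described in the context-threading commit above is a standard per-call deadline wrapper. A self-contained sketch of that pattern follows; the helper name and the 5 s default come from the commit message, while the call signature and stand-in storage GET are illustrative assumptions:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// refreshRPC runs one storage call under its own deadline derived from the
// parent context, so a single slow RPC cannot stall a whole reconciler tick.
func refreshRPC(ctx context.Context, budget time.Duration, call func(context.Context) error) error {
	callCtx, cancel := context.WithTimeout(ctx, budget)
	defer cancel()
	return call(callCtx)
}

func main() {
	ctx := context.Background()
	// A stand-in for a storage GET that honors context cancellation.
	slowGet := func(ctx context.Context) error {
		select {
		case <-time.After(10 * time.Second):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	err := refreshRPC(ctx, 50*time.Millisecond, slowGet) // tiny budget for the demo
	fmt.Println(err)                                     // context deadline exceeded
}
```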
Summary by CodeRabbit: New Features and Chores (bullet details not captured).