
Replace AP and NN cache with CP #788

Merged
matthyx merged 31 commits into main from cp-cache
Apr 27, 2026

Conversation

@matthyx
Contributor

@matthyx matthyx commented Apr 22, 2026

Summary by CodeRabbit

  • New Features

    • Introduced a unified container-profile cache with reconciliation and retry to improve profile availability.
  • Improvements

    • Enhanced metrics for container-profile caches (hits/misses, entries, reconciler durations, evictions, legacy-loads) and deprecation warnings; improved overlay merging and eviction behavior.
  • Tests

    • Added extensive unit/integration coverage, including concurrency, refresh, eviction, and a test that ensures legacy import paths are removed.
  • Chores

    • Added a Makefile target to run the legacy-package detection test.

matthyx and others added 10 commits April 22, 2026 10:30
…-early)

Additive-only scaffolding for the upcoming migration from the two
workload-keyed caches (applicationprofilecache + networkneighborhoodcache)
to a single container-keyed ContainerProfileCache. No consumers are
rewired yet; all new code is unused.

- Storage client: GetContainerProfile(namespace, name) on ProfileClient
  interface + *Storage impl + mock.
- ContainerProfileCache interface + stub impl (methods return zero values;
  filled in by step 3/4).
- Prometheus metrics: nodeagent_user_profile_legacy_loads_total{kind,completeness}
  deprecation counter + reconciler SLO metrics (entries gauge, hit/miss
  counter, tick duration histogram, eviction counter) registered up front
  so later steps emit cleanly.

Plan artifacts in .omc/plans/; approved by ralplan Planner/Architect/Critic
consensus (v2, iteration 2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…th (steps 3, 3.5, 4)

- CachedContainerProfile entry with Shared/RV/UserAP/UserNNRV fields
- Option A+ fast-path: shared storage pointer when no user overlay
- projection.go ports mergeContainers/mergeNetworkNeighbors from legacy caches
- partial-profile detection with dedup'd WARN log + completeness metric label
- Event-path delete with WithLock+ReleaseLock (Critic #2 lock-gap fix)
- Unit tests T4 (projection) + T6 (callstack parity) + fast-path identity

Step 5 (reconciler) and legacy deletion land in follow-ups.

Plan: .omc/plans/containerprofile-cache-unification-consensus.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
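The "Option A+ fast-path" above can be illustrated with a minimal sketch. All type and field names here are hypothetical stand-ins for the real cache types; the point is only the pointer-identity fast path versus the merge path:

```go
package main

import "fmt"

// profile is a toy stand-in for the shared storage profile object.
type profile struct{ Rules []string }

// project returns the shared object's pointer unchanged when there is no
// user overlay (the fast path), and an allocated merged copy otherwise.
func project(shared, overlay *profile) *profile {
	if overlay == nil {
		return shared // fast path: no copy, no merge, shared read-only pointer
	}
	merged := &profile{
		Rules: append(append([]string{}, shared.Rules...), overlay.Rules...),
	}
	return merged
}

func main() {
	base := &profile{Rules: []string{"exec:/bin/sh"}}
	fast := project(base, nil)
	slow := project(base, &profile{Rules: []string{"open:/etc/passwd"}})
	fmt.Println(fast == base)    // true: identity preserved on the fast path
	fmt.Println(len(slow.Rules)) // 2: base + overlay merged
}
```

The identity check (`fast == base`) is what the "fast-path identity" unit test mentioned above presumably asserts: many containers from one workload can share a single profile allocation.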
- tickLoop drives evict + refresh on one goroutine, refresh gated by atomic
- reconcileOnce evicts entries whose pod is gone or container stopped
- refreshAllEntries snapshots IDs then refreshes outside Range to avoid a
  SafeMap RLock/WLock deadlock (rebuildEntry calls Set)
- isContainerRunning(pod, entry, id): containerID primary, (Name, PodUID)
  fallback for pre-running init containers with empty ContainerID
- ctx.Err() honored inside Range callbacks for graceful shutdown
- T8 end-to-end test: user-AP mutation -> cached projection reflects change

Plan: .omc/plans/containerprofile-cache-unification-consensus.md
Consensus deltas applied: #1 (isContainerRunning signature), #3 (ctx.Err),
#4 (extend fast-skip to overlay RVs), #5 (T8 test), #7 (RPC-cost comment).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
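The snapshot-then-refresh pattern described above (refresh outside Range to avoid a SafeMap RLock/WLock deadlock) can be sketched with a toy map guarded by a sync.RWMutex; names are illustrative, not the real cache API:

```go
package main

import (
	"fmt"
	"sync"
)

// safeMap mimics the relevant SafeMap behavior: Range holds the read lock,
// Set takes the write lock. Calling Set from inside a Range callback on the
// same goroutine would therefore deadlock.
type safeMap struct {
	mu sync.RWMutex
	m  map[string]int
}

func (s *safeMap) Range(f func(k string, v int)) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	for k, v := range s.m {
		f(k, v)
	}
}

func (s *safeMap) Set(k string, v int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[k] = v
}

// refreshAll snapshots the IDs under the read lock, then mutates entries
// after Range has returned, so the write lock is never requested while the
// read lock is held. rebuildEntry in the real cache calls Set the same way.
func refreshAll(s *safeMap) {
	var ids []string
	s.Range(func(k string, _ int) { ids = append(ids, k) })
	for _, id := range ids {
		s.Set(id, 42) // safe here; a deadlock if done inside the Range callback
	}
}

func main() {
	s := &safeMap{m: map[string]int{"c1": 1, "c2": 2}}
	refreshAll(s)
	fmt.Println(s.m["c1"], s.m["c2"]) // 42 42
}
```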
…gregator wiring (step 6a)

Adds the ContainerProfileCache reader to the ObjectCache aggregator interface
so profilehelper can read CP and synthesize the legacy ApplicationProfileContainer /
NetworkNeighborhoodContainer shapes for callers that haven't migrated yet.

- pkg/objectcache/objectcache_interface.go: add ContainerProfileCache() to
  aggregator interface + mock (both AP/NN stay for 6a-6c transit)
- pkg/objectcache/v1/objectcache.go: add cp field, 5-arg NewObjectCache,
  ContainerProfileCache() accessor
- pkg/objectcache/v1/mock.go: RuleObjectCacheMock implements CP surface +
  Get/SetContainerProfile test helpers, Start stub
- pkg/rulemanager/profilehelper/profilehelper.go:
  - GetContainerProfile(objectCache, id) returns (*CP, syncChecksum, error)
    — the forward API
  - GetContainerApplicationProfile + GetContainerNetworkNeighborhood rewritten
    as ~30-LOC CP->legacy-shape shims (consensus delta #2). Marked deprecated;
    step 6c deletes them after CEL callers migrate.
- cmd/main.go: construct ContainerProfileCache alongside APC+NNC, pass to
  NewObjectCache; mock-path uses ContainerProfileCacheMock
- test call sites updated for 5-arg NewObjectCache

Plan: .omc/plans/containerprofile-cache-unification-consensus.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
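The CP-to-legacy-shape shims mentioned above can be pictured with a reduced sketch. The types below only loosely mirror the real v1beta1 shapes; the fields and function names are assumptions for illustration:

```go
package main

import "fmt"

// Reduced stand-ins for the unified ContainerProfile surface.
type ContainerProfileSpec struct {
	Execs        []string
	Capabilities []string
}

type ContainerProfile struct {
	Name string
	Spec ContainerProfileSpec
}

// Legacy per-container shape that un-migrated callers still expect.
type ApplicationProfileContainer struct {
	Name         string
	Execs        []string
	Capabilities []string
}

// toLegacyAPContainer synthesizes the legacy shape from the unified CP so
// callers can migrate incrementally; per the plan, step 6c deletes the shim
// once CEL callers read cp.Spec directly.
func toLegacyAPContainer(cp *ContainerProfile) *ApplicationProfileContainer {
	if cp == nil {
		return nil
	}
	return &ApplicationProfileContainer{
		Name:         cp.Name,
		Execs:        cp.Spec.Execs,
		Capabilities: cp.Spec.Capabilities,
	}
}

func main() {
	cp := &ContainerProfile{Name: "nginx", Spec: ContainerProfileSpec{Execs: []string{"/usr/sbin/nginx"}}}
	legacy := toLegacyAPContainer(cp)
	fmt.Println(legacy.Name, len(legacy.Execs)) // nginx 1
}
```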
- applicationprofile/{capability,exec,http,open,syscall}.go: read fields
  directly off cp.Spec instead of the per-container AP shape
- networkneighborhood/network.go: read Ingress/Egress/LabelSelector off
  cp.Spec directly
- pkg/objectcache/v1/mock.go: extend RuleObjectCacheMock so
  SetApplicationProfile / SetNetworkNeighborhood also project into the
  unified ContainerProfile, and GetContainerProfile honours the shared
  container-ID registry (preserves "invalid container ID -> no profile"
  semantics for existing tests)
- profilehelper CP->legacy shims remain in place; step 6c removes them

Plan: .omc/plans/containerprofile-cache-unification-consensus.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (step 6c)

- pkg/rulemanager/profilehelper/profilehelper.go: delete
  GetContainerApplicationProfile, GetContainerNetworkNeighborhood,
  GetApplicationProfile, GetNetworkNeighborhood, GetContainerFromApplicationProfile,
  GetContainerFromNetworkNeighborhood — CP-direct API is the only surface now
- pkg/rulemanager/rule_manager.go:
  - :202, :399 call profilehelper.GetContainerProfile instead of the shim
  - HasFinalApplicationProfile reads cp via ContainerProfileCache().GetContainerProfile;
    method name preserved (external API on RuleManagerInterface per plan v2 §2.4)
- pkg/rulemanager/rulepolicy.go: Validate takes *v1beta1.ContainerProfile
  and reads cp.Spec.PolicyByRuleId
- pkg/rulemanager/ruleadapters/creator.go: both AP + NN branches use
  ContainerProfileCache().GetContainerProfileState (unified state source)

Plan: .omc/plans/containerprofile-cache-unification-consensus.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(step 6d)

- pkg/objectcache/objectcache_interface.go: drop ApplicationProfileCache()
  and NetworkNeighborhoodCache() methods — the aggregator is now
  {K8s, ContainerProfile, Dns}
- pkg/objectcache/v1/objectcache.go: 3-arg NewObjectCache(k, cp, dc)
- pkg/containerwatcher/v2/container_watcher_collection.go:63-64: two
  ContainerCallback subscriptions (APC + NNC) collapse to one (CPC)
- cmd/main.go: both branches (runtime-detection + mock) construct only
  ContainerProfileCache + Dns; legacy APC/NNC wiring removed with startup
  log: "ContainerProfileCache active; legacy AP/NN caches removed"
- test call sites updated for 3-arg NewObjectCache

Legacy packages still physically present (imports retained where still
referenced, e.g. callstackcache); step 8 deletes them entirely.

Plan: .omc/plans/containerprofile-cache-unification-consensus.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tep 8)

- git rm -r pkg/objectcache/applicationprofilecache/ (766 LOC)
- git rm -r pkg/objectcache/networkneighborhoodcache/ (758 LOC)
- git rm pkg/objectcache/applicationprofilecache_interface.go
- git rm pkg/objectcache/networkneighborhoodcache_interface.go
- mv pkg/objectcache/applicationprofilecache/callstackcache/
    -> pkg/objectcache/callstackcache/ (domain-agnostic, shared)
- Update 4 importers: containerprofilecache_interface.go, v1/mock.go,
  containerprofilecache.go, reconciler.go
- RuleObjectCacheMock drops ApplicationProfileCache()/NetworkNeighborhoodCache()
  accessor methods; SetApplicationProfile/SetNetworkNeighborhood remain as
  test helpers that project into the unified CP
- projection.go comments kept as historical source pointers — git history
  preserves the originals

Plan: .omc/plans/containerprofile-cache-unification-consensus.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… 9 partial)

- tests/containerprofilecache/packages_deleted_test.go: go/packages
  dep-graph assertion that legacy AP/NN paths are absent
- tests/containerprofilecache/lock_stress_test.go: 100-goroutine
  interleaved seed/read for same container IDs, 5s budget, race-safe
- tests/containerprofilecache/init_eviction_test.go: T2a (event-path
  evict) + T2b (reconciler-path evict for missed RemoveContainer)
- tests/containerprofilecache/helpers_test.go: shared test builders
- pkg/objectcache/containerprofilecache: export ReconcileOnce and
  SeedEntryForTest as out-of-package test hooks
- Makefile: check-legacy-packages target

T1 (golden-alert parity) and T3 (memory benchmark) are release-checklist
items per plan v2 §2.7 — the pre-migration baselines those tests require
can no longer be captured from this branch.

Plan: .omc/plans/containerprofile-cache-unification-consensus.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1. Drop ReleaseLock on delete paths (containerprofilecache.go deleteContainer,
   reconciler.go reconcileOnce). Security review flagged a race where the
   deleted mutex could be orphaned while a concurrent GetLock creates a new
   one, breaking mutual exclusion for the same container ID. Trade-off:
   bounded memory growth of stale lock entries, proportional to container
   churn — acceptable for a node-agent lifetime.

2. Extract emitOverlayMetrics helper (metrics.go) to de-duplicate the
   ~20-line overlay metric/deprecation-warn block between buildEntry
   (addContainer path) and rebuildEntry (refresh path). Keeps the two in
   lockstep — code review flagged silent drift risk.

Not addressed in this commit (plan-accepted tradeoffs, follow-up work):
- Shared-pointer read-only invariant is convention-enforced, not type-
  enforced (plan v2 §2.3 step 7, ADR consequences). Retaining as-is;
  downstream consumers must not mutate.
- Storage RPC context propagation (requires storage.ProfileClient interface
  change, out of scope for this migration).

Plan: .omc/plans/containerprofile-cache-unification-consensus.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Apr 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Replaces ApplicationProfileCache and NetworkNeighborhoodCache with a new per-container ContainerProfileCache that projects user overlays, indexes call-stacks, runs a periodic reconciler (evict/refresh/retry pending), extends metrics, updates storage/profile APIs, and adds comprehensive tests and CI checks.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Build & CI: Makefile, tests/containerprofilecache/packages_deleted_test.go | Added .PHONY target check-legacy-packages and a test that fails if legacy import paths remain. |
| Startup & Wiring: cmd/main.go, pkg/containerwatcher/v2/container_watcher_collection.go, pkg/objectcache/objectcache_interface.go, pkg/objectcache/v1/objectcache.go, pkg/objectcache/v1/objectcache_test.go | Rewired startup to construct/use ContainerProfileCache, replaced AP/NN cache registrations with CPC, and updated the ObjectCache API and tests accordingly. |
| New ContainerProfileCache: pkg/objectcache/containerprofilecache/... (containerprofilecache.go, metrics.go, projection.go, reconciler.go, *_test.go, projection_test.go) | Introduced the container-level cache: population (fast-path + overlay projection), pending entries, call-stack indexing, a periodic reconciler (evict/refresh/rebuild/retry pending), metrics emission, and many unit/integration tests and test hooks. |
| Removed Legacy Caches: pkg/objectcache/applicationprofilecache/..., pkg/objectcache/networkneighborhoodcache/..., *_interface.go | Deleted implementations, interfaces, mocks, and tests for ApplicationProfileCache and NetworkNeighborhoodCache. |
| Metrics surface: pkg/metricsmanager/metrics_manager_interface.go, pkg/metricsmanager/metrics_manager_mock.go, pkg/metricsmanager/metrics_manager_noop.go, pkg/metricsmanager/prometheus/prometheus.go | Extended MetricsManager with container-profile metrics and implemented noop/mock/prometheus handlers and unregister/destroy logic. |
| Storage additions: pkg/storage/storage_interface.go, pkg/storage/storage_mock.go, pkg/storage/v1/containerprofile.go | Added context-aware storage method signatures and a new GetContainerProfile(ctx, namespace, name) across the interface, mocks, and storage implementation. |
| Profile helpers & call-sites: pkg/rulemanager/profilehelper/..., pkg/rulemanager/*, pkg/rulemanager/cel/libraries/*, pkg/rulemanager/rulepolicy.go, pkg/rulemanager/ruleadapters/creator.go | Removed legacy per-CR helpers, added the GetContainerProfile helper, and updated CEL libraries, the rule manager, adapters, and validators to use ContainerProfile (cp.Spec.*). |
| ObjectCache mock updates: pkg/objectcache/v1/mock.go | Extended the mock to store per-container ContainerProfile data, exposed ContainerProfileCache methods, and removed old AP/NN methods. |
| Integration tests & helpers: tests/containerprofilecache/* | Added helpers and many integration tests: reconciliation, eviction, lock-stress, shared-pointer race, overlay refresh, and other scenarios exercising pending/retry/metrics. |
| Docs / followups: .omc/plans/* | Added investigation and follow-up planning docs for the containerprofile cache migration. |
| Config: pkg/config/config.go | Added a StorageRPCBudget time.Duration config field. |

Sequence Diagram(s)

sequenceDiagram
    participant CC as ContainerCollection
    participant CPC as ContainerProfileCache
    participant Storage as StorageClient
    participant K8s as K8sObjectCache
    participant Metrics as MetricsManager
    participant App as Consumer

    rect rgba(220,100,100,0.5)
    Note over CC,App: Container Add Event
    App->>CPC: ContainerCallback(Add)
    CPC->>CPC: waitForSharedContainerData (retry/backoff)
    CPC->>K8s: GetSharedContainerData
    K8s-->>CPC: shared metadata
    CPC->>Storage: GetContainerProfile(ctx, ns, name)
    Storage-->>CPC: ContainerProfile or nil
    alt user-overlay present
        CPC->>Storage: GetApplicationProfile(ctx, ...)
        Storage-->>CPC: overlays
        CPC->>CPC: projectUserProfiles (merge)
        CPC->>Metrics: ReportContainerProfileLegacyLoad(kind, completeness)
    end
    CPC->>CPC: build CachedContainerProfile (index call-stacks)
    CPC-->>App: cached entry ready
    end

    rect rgba(100,220,100,0.5)
    Note over CPC,Metrics: Periodic Reconcile
    CPC->>CPC: tickLoop -> reconcileOnce
    CPC->>K8s: GetPod(podRef)
    K8s-->>CPC: Pod or nil
    alt evict condition (pod terminated)
        CPC->>CPC: evict entry
        CPC->>Metrics: ReportContainerProfileReconcilerEviction(reason)
    else alive
        CPC->>Storage: GetContainerProfile(ctx, ns, name) (check RV)
        Storage-->>CPC: profile (maybe unchanged)
        alt RV/overlay changed
            CPC->>CPC: rebuildEntryFromSources (project & index)
            CPC->>Metrics: SetContainerProfileCacheEntries(kind,count)
        end
    end
    end

    rect rgba(100,100,220,0.5)
    Note over App,CPC: Read Path
    App->>CPC: GetContainerProfile(containerID)
    alt hit
        CPC->>Metrics: ReportContainerProfileCacheHit(true)
        CPC-->>App: *ContainerProfile
    else miss
        CPC->>Metrics: ReportContainerProfileCacheHit(false)
        CPC-->>App: nil / error state
    end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related issues

Possibly related PRs

Suggested reviewers

  • YakirOren

Poem

🐰 I stitched small profiles into one bright heap,
Merged overlays, indexed call-stacks deep,
Reconcilers hum and pending entries wake,
Metrics tick softly for each cache remake—
A merry rabbit hops; the tests awake!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 72.16%, below the required 80.00% threshold. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'Replace AP and NN cache with CP' directly and concisely describes the main architectural change: swapping the ApplicationProfile/NetworkNeighborhood caches for a unified ContainerProfile cache. |
| Linked Issues check | ✅ Passed | Check skipped: no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped: no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@github-actions

Performance Benchmark Results

Node-Agent Resource Usage

| Metric | BEFORE | AFTER | Delta |
|---|---|---|---|
| Avg CPU (cores) | 0.158 | 0.155 | -2.0% |
| Peak CPU (cores) | 0.166 | 0.161 | -3.0% |
| Avg Memory (MiB) | 406.443 | 309.240 | -23.9% |
| Peak Memory (MiB) | 407.953 | 311.078 | -23.7% |

Dedup Effectiveness (AFTER only)

| Event Type | Passed | Deduped | Ratio |
|---|---|---|---|
| capabilities | 0 | 0 | N/A |
| hardlink | 6000 | 0 | 0.0% |
| http | 1704 | 119456 | 98.6% |
| network | 900 | 78000 | 98.9% |
| open | 33980 | 622265 | 94.8% |
| symlink | 6000 | 0 | 0.0% |
| syscall | 976 | 1884 | 65.9% |

Event Counters

| Metric | BEFORE | AFTER |
|---|---|---|
| capability_counter | 9 | 7 |
| dns_counter | 1456 | 1442 |
| exec_counter | 7285 | 7217 |
| network_counter | 95846 | 94918 |
| open_counter | 797996 | 789829 |
| syscall_counter | 3642 | 3594 |

@matthyx
Contributor Author

matthyx commented Apr 22, 2026

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Apr 22, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 8

🧹 Nitpick comments (11)
pkg/rulemanager/ruleadapters/creator.go (1)

146-179: Duplicate branches post-migration — consider consolidating.

After the CP migration, both ApplicationProfile and NetworkProfile cases call the exact same GetContainerProfileState(containerID) and build identical ProfileMetadata structs modulo the Type field. This duplication invites drift. You can collapse to a single block that sets profileMetadata.Type = profileType based on the switch, or compute state once before the switch.

♻️ Proposed refactor
 	switch profileType {
-	case armotypes.ApplicationProfile:
-		state := objectCache.ContainerProfileCache().GetContainerProfileState(triggerEvent.GetContainerID())
-		if state != nil {
-			profileMetadata := &armotypes.ProfileMetadata{
-				Status:            state.Status,
-				Completion:        state.Completion,
-				Name:              state.Name,
-				FailOnProfile:     state.Status == helpersv1.Completed,
-				Type:              armotypes.ApplicationProfile,
-				ProfileDependency: profileRequirment,
-			}
-			if state.Error != nil {
-				profileMetadata.Error = state.Error.Error()
-			}
-			baseRuntimeAlert.ProfileMetadata = profileMetadata
-		}
-
-	case armotypes.NetworkProfile:
-		state := objectCache.ContainerProfileCache().GetContainerProfileState(triggerEvent.GetContainerID())
-		if state != nil {
-			profileMetadata := &armotypes.ProfileMetadata{
-				Status:            state.Status,
-				Completion:        state.Completion,
-				Name:              state.Name,
-				FailOnProfile:     state.Status == helpersv1.Completed,
-				Type:              armotypes.NetworkProfile,
-				ProfileDependency: profileRequirment,
-			}
-			if state.Error != nil {
-				profileMetadata.Error = state.Error.Error()
-			}
-			baseRuntimeAlert.ProfileMetadata = profileMetadata
-		}
+	case armotypes.ApplicationProfile, armotypes.NetworkProfile:
+		state := objectCache.ContainerProfileCache().GetContainerProfileState(triggerEvent.GetContainerID())
+		if state != nil {
+			profileMetadata := &armotypes.ProfileMetadata{
+				Status:            state.Status,
+				Completion:        state.Completion,
+				Name:              state.Name,
+				FailOnProfile:     state.Status == helpersv1.Completed,
+				Type:              profileType,
+				ProfileDependency: profileRequirment,
+			}
+			if state.Error != nil {
+				profileMetadata.Error = state.Error.Error()
+			}
+			baseRuntimeAlert.ProfileMetadata = profileMetadata
+		}
 	default:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/rulemanager/ruleadapters/creator.go` around lines 146 - 179, The two
switch branches for armotypes.ApplicationProfile and armotypes.NetworkProfile
duplicate the same call to
objectCache.ContainerProfileCache().GetContainerProfileState(triggerEvent.GetContainerID())
and the identical construction of an armotypes.ProfileMetadata — consolidate by
calling GetContainerProfileState once (e.g., before or at the start of the
switch), guard on state != nil once, build a single ProfileMetadata object and
set profileMetadata.Type = profileType (or assign the differing Type inside the
small switch), preserve the Error handling (state.Error != nil) and assign to
baseRuntimeAlert.ProfileMetadata; update or remove the duplicate cases
accordingly to avoid drift.
pkg/rulemanager/rulepolicy.go (1)

23-35: Consolidate the double map lookup and consider nil-guarding cp.

Lines 24 and 28 perform the same cp.Spec.PolicyByRuleId[ruleId] lookup twice. A single lookup is clearer and slightly faster. Also, if cp could ever be nil at the call site, cp.Spec will panic — worth a defensive guard unless the caller guarantees non-nil.

♻️ Proposed simplification
-func (v *RulePolicyValidator) Validate(ruleId string, process string, cp *v1beta1.ContainerProfile) (bool, error) {
-	if _, ok := cp.Spec.PolicyByRuleId[ruleId]; !ok {
-		return false, nil
-	}
-
-	if policy, ok := cp.Spec.PolicyByRuleId[ruleId]; ok {
-		if policy.AllowedContainer || slices.Contains(policy.AllowedProcesses, process) {
-			return true, nil
-		}
-	}
-
-	return false, nil
-}
+func (v *RulePolicyValidator) Validate(ruleId string, process string, cp *v1beta1.ContainerProfile) (bool, error) {
+	if cp == nil {
+		return false, nil
+	}
+	policy, ok := cp.Spec.PolicyByRuleId[ruleId]
+	if !ok {
+		return false, nil
+	}
+	if policy.AllowedContainer || slices.Contains(policy.AllowedProcesses, process) {
+		return true, nil
+	}
+	return false, nil
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/rulemanager/rulepolicy.go` around lines 23 - 35, In
RulePolicyValidator.Validate, consolidate the duplicate map lookup by retrieving
cp.Spec.PolicyByRuleId[ruleId] once into a local variable (e.g., policy, ok :=
cp.Spec.PolicyByRuleId[ruleId]) and use that for the subsequent checks;
additionally add a nil-guard at the start of Validate to return false,nil if cp
== nil (or cp.Spec is nil if applicable) so the function doesn't panic when cp
is nil, and then perform the single lookup and evaluate policy.AllowedContainer
or slices.Contains(policy.AllowedProcesses, process) before returning.
tests/containerprofilecache/helpers_test.go (1)

39-51: Consider supporting EphemeralContainerStatuses in makeTestPod.

Projection logic (and the ContainerProfile cache) takes the first available container across Containers / InitContainers / EphemeralContainers (e.g., RuleObjectCacheMock.SetNetworkNeighborhood in pkg/objectcache/v1/mock.go). makeTestPod currently accepts only container + init statuses, which limits test coverage for ephemeral-container paths. Optional to address now.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/containerprofilecache/helpers_test.go` around lines 39 - 51, The test
helper makeTestPod currently only accepts containerStatuses and initStatuses;
add an additional parameter (e.g., ephemeralStatuses []corev1.ContainerStatus)
and set Pod.Status.EphemeralContainerStatuses = ephemeralStatuses in the
returned Pod so tests can exercise ephemeral-container code paths (also update
any test call sites to pass nil or a slice where needed). This change targets
the makeTestPod function to enable coverage of projection logic that inspects
EphemeralContainerStatuses.
pkg/objectcache/containerprofilecache/projection_test.go (1)

59-68: Assert the merged rule-policy contents, not just the keys.

This test would still pass if R0901 existed but lost either the base or overlay AllowedProcesses. Add assertions for the merged process set so utils.MergePolicies regressions are caught.

Suggested test assertion
 	// R0901 merged, R0902 added
 	assert.Contains(t, projected.Spec.PolicyByRuleId, "R0901")
 	assert.Contains(t, projected.Spec.PolicyByRuleId, "R0902")
+	assert.ElementsMatch(t, []string{"ls", "cat"}, projected.Spec.PolicyByRuleId["R0901"].AllowedProcesses)
+	assert.ElementsMatch(t, []string{"echo"}, projected.Spec.PolicyByRuleId["R0902"].AllowedProcesses)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/objectcache/containerprofilecache/projection_test.go` around lines 59 -
68, The test currently only checks that rule IDs exist in the projected profile;
instead assert the merged policy contents for those IDs so regressions in
utils.MergePolicies are caught: after calling projectUserProfiles (in this
test), fetch the policies from projected.Spec.PolicyByRuleId for "R0901" and
"R0902" and assert their AllowedProcesses contain the expected merged entries
(use assert.ElementsMatch or equivalent to compare the slices/sets), e.g. verify
"R0901" includes both base and overlay process names and "R0902" includes its
expected set; update the assertions around projected.Spec.PolicyByRuleId to
check AllowedProcesses values rather than just keys.
pkg/metricsmanager/prometheus/prometheus.go (1)

225-246: Keep the Prometheus metric prefix consistent.

All existing metrics use node_agent_, but the new ContainerProfile metrics use nodeagent_. Renaming later would be a breaking change for dashboards/alerts, so it’s better to align before release.

Suggested rename
-			Name: "nodeagent_user_profile_legacy_loads_total",
+			Name: "node_agent_user_profile_legacy_loads_total",
...
-			Name: "nodeagent_containerprofile_cache_entries",
+			Name: "node_agent_containerprofile_cache_entries",
...
-			Name: "nodeagent_containerprofile_cache_hit_total",
+			Name: "node_agent_containerprofile_cache_hit_total",
...
-			Name:    "nodeagent_containerprofile_reconciler_duration_seconds",
+			Name:    "node_agent_containerprofile_reconciler_duration_seconds",
...
-			Name: "nodeagent_containerprofile_reconciler_evictions_total",
+			Name: "node_agent_containerprofile_reconciler_evictions_total",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metricsmanager/prometheus/prometheus.go` around lines 225 - 246, The new
ContainerProfile Prometheus metric names use "nodeagent_" but should match the
existing "node_agent_" prefix; update the Name fields for
cpCacheLegacyLoadsCounter, cpCacheEntriesGauge, cpCacheHitCounter,
cpReconcilerDurationHistogram, and cpReconcilerEvictionsCounter to use
"node_agent_" (e.g., "node_agent_user_profile_legacy_loads_total",
"node_agent_containerprofile_cache_entries",
"node_agent_containerprofile_cache_hit_total",
"node_agent_containerprofile_reconciler_duration_seconds",
"node_agent_containerprofile_reconciler_evictions_total") so dashboards/alerts
remain compatible.
pkg/rulemanager/profilehelper/profilehelper.go (1)

12-25: LGTM.

Accessor is straightforward; map lookup on a nil Annotations is safe in Go (returns "").

Minor doc nit (optional): the comment "legacy callers go through the shims below until step 6c deletes them" (Line 13-14) describes other helpers that aren't in this file anymore per the PR's shim-collapse step. Consider trimming the comment to just describe this function.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/rulemanager/profilehelper/profilehelper.go` around lines 12 - 25, Update
the function comment for GetContainerProfile to remove the outdated "legacy
callers go through the shims below until step 6c deletes them" text; keep a
concise description that this function returns the ContainerProfile for a given
containerID and the SyncChecksumMetadataKey annotation (and possible error), so
edit the comment block above GetContainerProfile to only describe the function's
behavior and return values.
pkg/objectcache/containerprofilecache/reconciler.go (2)

47-49: Metric collision: the same histogram is emitted for two different operations.

ReportContainerProfileReconcilerDuration is called once after reconcileOnce (Line 49) and again after refreshAllEntries (Line 126). They have very different latency profiles (map walk vs. N storage RPCs), so observers will see a bimodal distribution with no way to tell evict-pass from refresh-pass.

Consider separate metric names (e.g. ReportContainerProfileReconcilerDuration for evict and ReportContainerProfileRefreshDuration for refresh), or a label distinguishing phase.

Also applies to: 124-127

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/objectcache/containerprofilecache/reconciler.go` around lines 47 - 49,
The same histogram metric ReportContainerProfileReconcilerDuration is used for
two distinct phases (evict pass and refresh pass) causing bimodal distributions;
update the calls in reconciler.go so that reconcileOnce(...) continues to call
metricsManager.ReportContainerProfileReconcilerDuration(time.Since(start)) but
refreshAllEntries(...) reports a separate metric (e.g.
metricsManager.ReportContainerProfileRefreshDuration(time.Since(start)) or the
existing ReportContainerProfileReconcilerDuration with a new phase label), and
add the corresponding metric method/name or label handling in metricsManager so
observers can distinguish evict vs refresh latencies; relevant symbols:
reconcileOnce, refreshAllEntries, and
metricsManager.ReportContainerProfileReconcilerDuration.

155-216: refreshOneEntry accepts ctx but ignores it.

The signature is refreshOneEntry(_ context.Context, id string, e *CachedContainerProfile) (Line 155), and none of the storage calls (Line 156, 182, 188, 206, 211) are cancellable. On shutdown, a slow apiserver can block each entry for its full timeout regardless of ctx cancellation. Given the PR notes storage RPC ctx propagation is out of scope, at minimum add a ctx.Err() check between storage calls so cancellation is observed between RPCs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/objectcache/containerprofilecache/reconciler.go` around lines 155 - 216,
The refreshOneEntry function currently ignores its context parameter; change the
signature to use ctx (not `_`) and between each storage call
(storageClient.GetContainerProfile, GetApplicationProfile,
GetNetworkNeighborhood) check ctx.Err() and return immediately if cancelled so
the loop observes shutdown quickly; keep using the already-fetched ap/nn values
when rebuilding (rebuildEntry) but bail out when ctx cancelled instead of
proceeding to further RPCs.
pkg/objectcache/containerprofilecache/reconciler_test.go (2)

558-561: Remove the helpersv1 import-sink hack.

The file has no actual use of helpersv1.* anywhere; Line 561 (var _ = helpersv1.CompletionMetadataKey) only exists to silence the unused-import linter. If the symbol is unused, drop the import entirely — the sink adds noise and will rot.

♻️ Proposed cleanup
 import (
 	"context"
 	"sync"
 	"sync/atomic"
 	"testing"
 	"time"

-	helpersv1 "github.com/kubescape/k8s-interface/instanceidhandler/v1/helpers"
 	"github.com/kubescape/node-agent/pkg/config"
...
 )
...
-// silence unused-import linter: helpersv1 is referenced only via the const in
-// containerprofilecache.go (used by some entries). Import explicitly so the
-// file compiles without the import when those constants aren't dereferenced.
-var _ = helpersv1.CompletionMetadataKey
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/objectcache/containerprofilecache/reconciler_test.go` around lines 558 -
561, Remove the unused import-sink hack: delete the import of helpersv1 from the
reconciler_test.go imports and remove the dummy variable line "var _ =
helpersv1.CompletionMetadataKey" so the test no longer references helpersv1;
ensure the import block is updated to remove helpersv1 and run `go test` to
verify the file compiles without the sink.

506-528: Use strconv.Itoa instead of a hand-rolled int-to-string.

strconv is stdlib and already a transitive import of most Go test files. The custom itoa in Lines 508-528 is 21 lines of code (including a negative-number path never exercised by the callers, which all pass i >= 0) replacing a one-line stdlib call. The comment's justification ("so tests don't pull in strconv") cites a cost that isn't real.

♻️ Proposed simplification
 import (
+	"strconv"
 	...
 )
...
-	id := "c-" + itoa(i)
+	id := "c-" + strconv.Itoa(i)
...
-// itoa is a local int-to-string so tests don't pull in strconv just for one
-// call site.
-func itoa(i int) string { ... }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/objectcache/containerprofilecache/reconciler_test.go` around lines 506 -
528, Replace the custom itoa function in reconciler_test.go with the stdlib
strconv.Itoa: remove the itoa(i int) implementation, add "strconv" to the test
file imports, and update any call sites that use itoa(...) to strconv.Itoa(...).
Keep behavior identical (strconv handles negatives) and remove the old comment
about avoiding strconv.
pkg/objectcache/containerprofilecache/containerprofilecache.go (1)

96-99: Swap the jitter and zero-check order.

utils.AddJitter(cfg.ProfilesCacheRefreshRate, 10) is computed first (Line 96) and then potentially discarded (Line 97-99). This is wasted work when ProfilesCacheRefreshRate <= 0, and more importantly the jitter is applied to a value the zero-check replaces — the default path does not get jitter, which may or may not be intentional but is at least worth a comment.

♻️ Proposed reorder
-	reconcileEvery := utils.AddJitter(cfg.ProfilesCacheRefreshRate, 10)
-	if cfg.ProfilesCacheRefreshRate <= 0 {
-		reconcileEvery = defaultReconcileInterval
-	}
+	reconcileEvery := cfg.ProfilesCacheRefreshRate
+	if reconcileEvery <= 0 {
+		reconcileEvery = defaultReconcileInterval
+	}
+	reconcileEvery = utils.AddJitter(reconcileEvery, 10)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/objectcache/containerprofilecache/containerprofilecache.go` around lines
96 - 99, The code currently calls utils.AddJitter(cfg.ProfilesCacheRefreshRate,
10) before checking for a zero/negative refresh rate; change the order so you
first check if cfg.ProfilesCacheRefreshRate <= 0 and set reconcileEvery =
defaultReconcileInterval, otherwise set reconcileEvery =
utils.AddJitter(cfg.ProfilesCacheRefreshRate, 10). Update the reconcileEvery
assignment logic around cfg.ProfilesCacheRefreshRate and utils.AddJitter (and
optionally add a short comment explaining that jitter is applied only to
positive configured intervals).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/objectcache/containerprofilecache_interface.go`:
- Around line 28-30: The mock ContainerProfileCacheMock.GetContainerProfileState
currently returns nil which diverges from
ContainerProfileCacheImpl.GetContainerProfileState's contract (the impl returns
a synthetic *ProfileState with Error set on misses); update
ContainerProfileCacheMock.GetContainerProfileState to return a synthetic error
ProfileState (e.g., &ProfileState{Error: errors.New("mock: profile not found")})
instead of nil so callers like ruleadapters.creator.setProfileMetadata (which
checks if state != nil) follow the same branching as production; ensure the
returned ProfileState mirrors the shape of real error states so tests exercise
the same code paths.

In `@pkg/objectcache/containerprofilecache/containerprofilecache.go`:
- Around line 123-147: The Remove branch in ContainerCallback currently returns
early when cfg.IgnoreContainer(...) is true which can leave stale entries if
labels changed after Add; remove that ignore check so EventTypeRemoveContainer
always schedules deletion via go
c.deleteContainer(notif.Container.Runtime.ContainerID) (preserve host handling
for namespace logic if needed), and add a brief comment near ContainerCallback
explaining we intentionally skip IgnoreContainer on removals to avoid leaking
cache entries; reference ContainerCallback, cfg.IgnoreContainer, and
deleteContainer in the change.

In `@pkg/objectcache/containerprofilecache/projection.go`:
- Around line 78-90: The projected ContainerProfile is storing references to
user-overlay data (e.g., when appending matched.Capabilities, matched.Execs,
matched.Opens, matched.Syscalls, matched.Endpoints and assigning
matched.PolicyByRuleId into projected), which can lead to aliasing if the
original CRD object is mutated later; deep-copy all user-owned values before
appending/assigning into projected (create new slices and copy elements,
deep-copy nested structs/args/selectors, and new maps for PolicyByRuleId
entries), and when merging policies via utils.MergePolicies ensure you copy the
merged result into projected.Spec.PolicyByRuleId rather than storing references;
apply the same defensive copying to the other affected blocks noted (around
lines 107-124, 240-245, 282-285, 300-303, 324-329) so no nested slice/map in
projected points back to matched or the legacy CRD.
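The defensive-copy rule can be shown with a minimal sketch; `Exec` and the helper names are trimmed stand-ins for the PR's spec types, not its actual symbols:

```go
package main

import "fmt"

// Exec is an illustrative stand-in for a profile spec entry that owns a
// nested slice.
type Exec struct {
	Path string
	Args []string
}

// deepCopyExecs copies the slice and every nested Args slice so the
// projection never aliases the overlay CRD's backing arrays.
func deepCopyExecs(in []Exec) []Exec {
	out := make([]Exec, len(in))
	for i, e := range in {
		out[i] = Exec{Path: e.Path, Args: append([]string(nil), e.Args...)}
	}
	return out
}

// deepCopyPolicies rebuilds the map and its value slices for the same reason
// (stand-in for copying PolicyByRuleId entries).
func deepCopyPolicies(in map[string][]string) map[string][]string {
	out := make(map[string][]string, len(in))
	for k, v := range in {
		out[k] = append([]string(nil), v...)
	}
	return out
}

func main() {
	overlay := []Exec{{Path: "/bin/sh", Args: []string{"-c"}}}
	projected := deepCopyExecs(overlay)
	overlay[0].Args[0] = "mutated" // a later mutation of the CRD object
	fmt.Println(projected[0].Args[0])
}
```

A plain `append(projected, matched...)` copies only the outer slice header; the nested slices and maps still point into the CRD, which is exactly the aliasing the comment flags.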

In `@pkg/objectcache/containerprofilecache/reconciler.go`:
- Around line 170-216: The fast-skip and CP-RV-changed paths in refreshOneEntry
are swallowing fetch errors for overlays (userAP/userNN) and then calling
rebuildEntry(..., nil, nil) which causes shared=true even when prev had
overlays; change the overlay fetch logic to detect fetch failures (e.g.
apErr/nnErr booleans) vs "not found" (nil without error), and on error either
(a) skip rebuilding this tick (return) or (b) pass the previous overlay
objects/refs into rebuildEntry so we don't flip Shared to true; ensure
rebuildEntry is only called with nil overlay args when the fetch truly indicates
the overlay no longer exists, and do not set Shared=true when prev.UserAPRef or
prev.UserNNRef were present but the current fetch errored.

In `@pkg/objectcache/v1/mock.go`:
- Around line 45-83: The mock currently projects only the first container into a
single r.cp and then returns it for any container ID; change the mock to store
and return container profiles per-container: create/use a map field (e.g.,
r.cpByID or r.containerProfiles) keyed by container name/ID and, when projecting
in the projection block (the code that builds r.cp today), build and store a
ContainerProfile for each container encountered using that container’s
identifier as the key; update GetContainerProfile to resolve the requested
container via ContainerIDToSharedData (as the comment suggests) to find the
correct container name/ID and then return the corresponding profile from the
per-container map (fall back to nil if not found).

In `@pkg/storage/v1/containerprofile.go`:
- Around line 10-14: The doc comment describing CreateContainerProfileDirect was
accidentally left above GetContainerProfile—move that comment back above
CreateContainerProfileDirect or replace it with a proper comment describing
GetContainerProfile to keep docs accurate; additionally, avoid using
context.Background() in GetContainerProfile: change the signature to accept ctx
context.Context (or plan a follow-up to add ctx) so callers like the CP cache
reconciler can propagate cancellation/timeouts instead of being stuck on
storageClient.ContainerProfiles(...).Get with a background context.

In `@tests/containerprofilecache/helpers_test.go`:
- Around line 67-71: The helper method (*stubStorage).setCP is reported as
unused by golangci-lint; either remove this unused function from the test
helpers or mark it intentionally exempt by adding a comment like //nolint:unused
with a short justification above the function declaration (target the setCP
method on stubStorage) so CI stops flagging it.

In `@tests/containerprofilecache/lock_stress_test.go`:
- Around line 103-128: The evict branch currently only calls GetContainerProfile
so it never exercises deleteContainer/ReconcileOnce or containerLocks; replace
the read-only call with a real eviction flow by creating a terminating Pod
object that matches the seeded entry (same ContainerProfile
id/podName/namespace/podUID or the keys your cache expects) and invoke the
reconciler to drive removal (call ReconcileOnce with that terminating Pod or, if
a direct API exists in the test harness, call the cache's delete/evict method
that triggers deleteContainer). Ensure you pass a context that stops after a
single step if needed and use the same identifying symbols used in the test
(SeedEntryForTest, ReconcileOnce, deleteContainer, containerLocks) so the test
actually exercises the deletion/lock path.

---

Nitpick comments:
In `@pkg/metricsmanager/prometheus/prometheus.go`:
- Around line 225-246: The new ContainerProfile Prometheus metric names use
"nodeagent_" but should match the existing "node_agent_" prefix; update the Name
fields for cpCacheLegacyLoadsCounter, cpCacheEntriesGauge, cpCacheHitCounter,
cpReconcilerDurationHistogram, and cpReconcilerEvictionsCounter to use
"node_agent_" (e.g., "node_agent_user_profile_legacy_loads_total",
"node_agent_containerprofile_cache_entries",
"node_agent_containerprofile_cache_hit_total",
"node_agent_containerprofile_reconciler_duration_seconds",
"node_agent_containerprofile_reconciler_evictions_total") so dashboards/alerts
remain compatible.

In `@pkg/objectcache/containerprofilecache/containerprofilecache.go`:
- Around line 96-99: The code currently calls
utils.AddJitter(cfg.ProfilesCacheRefreshRate, 10) before checking for a
zero/negative refresh rate; change the order so you first check if
cfg.ProfilesCacheRefreshRate <= 0 and set reconcileEvery =
defaultReconcileInterval, otherwise set reconcileEvery =
utils.AddJitter(cfg.ProfilesCacheRefreshRate, 10). Update the reconcileEvery
assignment logic around cfg.ProfilesCacheRefreshRate and utils.AddJitter (and
optionally add a short comment explaining that jitter is applied only to
positive configured intervals).

In `@pkg/objectcache/containerprofilecache/projection_test.go`:
- Around line 59-68: The test currently only checks that rule IDs exist in the
projected profile; instead assert the merged policy contents for those IDs so
regressions in utils.MergePolicies are caught: after calling projectUserProfiles
(in this test), fetch the policies from projected.Spec.PolicyByRuleId for
"R0901" and "R0902" and assert their AllowedProcesses contain the expected
merged entries (use assert.ElementsMatch or equivalent to compare the
slices/sets), e.g. verify "R0901" includes both base and overlay process names
and "R0902" includes its expected set; update the assertions around
projected.Spec.PolicyByRuleId to check AllowedProcesses values rather than just
keys.

In `@pkg/objectcache/containerprofilecache/reconciler_test.go`:
- Around line 558-561: Remove the unused import-sink hack: delete the import of
helpersv1 from the reconciler_test.go imports and remove the dummy variable line
"var _ = helpersv1.CompletionMetadataKey" so the test no longer references
helpersv1; ensure the import block is updated to remove helpersv1 and run `go
test` to verify the file compiles without the sink.
- Around line 506-528: Replace the custom itoa function in reconciler_test.go
with the stdlib strconv.Itoa: remove the itoa(i int) implementation, add
"strconv" to the test file imports, and update any call sites that use itoa(...)
to strconv.Itoa(...). Keep behavior identical (strconv handles negatives) and
remove the old comment about avoiding strconv.

In `@pkg/objectcache/containerprofilecache/reconciler.go`:
- Around line 47-49: The same histogram metric
ReportContainerProfileReconcilerDuration is used for two distinct phases (evict
pass and refresh pass) causing bimodal distributions; update the calls in
reconciler.go so that reconcileOnce(...) continues to call
metricsManager.ReportContainerProfileReconcilerDuration(time.Since(start)) but
refreshAllEntries(...) reports a separate metric (e.g.
metricsManager.ReportContainerProfileRefreshDuration(time.Since(start)) or the
existing ReportContainerProfileReconcilerDuration with a new phase label), and
add the corresponding metric method/name or label handling in metricsManager so
observers can distinguish evict vs refresh latencies; relevant symbols:
reconcileOnce, refreshAllEntries, and
metricsManager.ReportContainerProfileReconcilerDuration.
- Around line 155-216: The refreshOneEntry function currently ignores its
context parameter; change the signature to use ctx (not `_`) and between each
storage call (storageClient.GetContainerProfile, GetApplicationProfile,
GetNetworkNeighborhood) check ctx.Err() and return immediately if cancelled so
the loop observes shutdown quickly; keep using the already-fetched ap/nn values
when rebuilding (rebuildEntry) but bail out when ctx cancelled instead of
proceeding to further RPCs.

In `@pkg/rulemanager/profilehelper/profilehelper.go`:
- Around line 12-25: Update the function comment for GetContainerProfile to
remove the outdated "legacy callers go through the shims below until step 6c
deletes them" text; keep a concise description that this function returns the
ContainerProfile for a given containerID and the SyncChecksumMetadataKey
annotation (and possible error), so edit the comment block above
GetContainerProfile to only describe the function's behavior and return values.

In `@pkg/rulemanager/ruleadapters/creator.go`:
- Around line 146-179: The two switch branches for armotypes.ApplicationProfile
and armotypes.NetworkProfile duplicate the same call to
objectCache.ContainerProfileCache().GetContainerProfileState(triggerEvent.GetContainerID())
and the identical construction of an armotypes.ProfileMetadata — consolidate by
calling GetContainerProfileState once (e.g., before or at the start of the
switch), guard on state != nil once, build a single ProfileMetadata object and
set profileMetadata.Type = profileType (or assign the differing Type inside the
small switch), preserve the Error handling (state.Error != nil) and assign to
baseRuntimeAlert.ProfileMetadata; update or remove the duplicate cases
accordingly to avoid drift.

In `@pkg/rulemanager/rulepolicy.go`:
- Around line 23-35: In RulePolicyValidator.Validate, consolidate the duplicate
map lookup by retrieving cp.Spec.PolicyByRuleId[ruleId] once into a local
variable (e.g., policy, ok := cp.Spec.PolicyByRuleId[ruleId]) and use that for
the subsequent checks; additionally add a nil-guard at the start of Validate to
return false,nil if cp == nil (or cp.Spec is nil if applicable) so the function
doesn't panic when cp is nil, and then perform the single lookup and evaluate
policy.AllowedContainer or slices.Contains(policy.AllowedProcesses, process)
before returning.
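The consolidated Validate shape suggested above, as a minimal sketch; `RulePolicy` and `ContainerProfile` here are trimmed stand-ins for the real spec types:

```go
package main

import (
	"fmt"
	"slices"
)

type RulePolicy struct {
	AllowedContainer bool
	AllowedProcesses []string
}

type ContainerProfile struct {
	PolicyByRuleId map[string]RulePolicy
}

// validate guards against a nil profile, does the map lookup exactly once,
// then evaluates the policy.
func validate(cp *ContainerProfile, ruleId, process string) (bool, error) {
	if cp == nil { // nil-guard: no profile means nothing is allowed
		return false, nil
	}
	policy, ok := cp.PolicyByRuleId[ruleId] // single comma-ok lookup
	if !ok {
		return false, nil
	}
	return policy.AllowedContainer || slices.Contains(policy.AllowedProcesses, process), nil
}

func main() {
	cp := &ContainerProfile{PolicyByRuleId: map[string]RulePolicy{
		"R0901": {AllowedProcesses: []string{"nginx"}},
	}}
	ok, _ := validate(cp, "R0901", "nginx")
	fmt.Println(ok)
}
```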

In `@tests/containerprofilecache/helpers_test.go`:
- Around line 39-51: The test helper makeTestPod currently only accepts
containerStatuses and initStatuses; add an additional parameter (e.g.,
ephemeralStatuses []corev1.ContainerStatus) and set
Pod.Status.EphemeralContainerStatuses = ephemeralStatuses in the returned Pod so
tests can exercise ephemeral-container code paths (also update any test call
sites to pass nil or a slice where needed). This change targets the makeTestPod
function to enable coverage of projection logic that inspects
EphemeralContainerStatuses.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2d99a491-b1ca-4c68-9ffa-0de1d08771a8

📥 Commits

Reviewing files that changed from the base of the PR and between 900e72b and c2966c0.

📒 Files selected for processing (45)
  • Makefile
  • cmd/main.go
  • pkg/containerwatcher/v2/container_watcher_collection.go
  • pkg/metricsmanager/metrics_manager_interface.go
  • pkg/metricsmanager/metrics_manager_mock.go
  • pkg/metricsmanager/metrics_manager_noop.go
  • pkg/metricsmanager/prometheus/prometheus.go
  • pkg/objectcache/applicationprofilecache/applicationprofilecache.go
  • pkg/objectcache/applicationprofilecache/applicationprofilecache_test.go
  • pkg/objectcache/applicationprofilecache_interface.go
  • pkg/objectcache/callstackcache/callstackcache.go
  • pkg/objectcache/callstackcache/callstackcache_test.go
  • pkg/objectcache/containerprofilecache/containerprofilecache.go
  • pkg/objectcache/containerprofilecache/containerprofilecache_test.go
  • pkg/objectcache/containerprofilecache/metrics.go
  • pkg/objectcache/containerprofilecache/projection.go
  • pkg/objectcache/containerprofilecache/projection_test.go
  • pkg/objectcache/containerprofilecache/reconciler.go
  • pkg/objectcache/containerprofilecache/reconciler_test.go
  • pkg/objectcache/containerprofilecache_interface.go
  • pkg/objectcache/networkneighborhoodcache/networkneighborhoodcache.go
  • pkg/objectcache/networkneighborhoodcache/networkneighborhoodcache_test.go
  • pkg/objectcache/networkneighborhoodcache_interface.go
  • pkg/objectcache/objectcache_interface.go
  • pkg/objectcache/v1/mock.go
  • pkg/objectcache/v1/objectcache.go
  • pkg/objectcache/v1/objectcache_test.go
  • pkg/rulemanager/cel/libraries/applicationprofile/capability.go
  • pkg/rulemanager/cel/libraries/applicationprofile/exec.go
  • pkg/rulemanager/cel/libraries/applicationprofile/http.go
  • pkg/rulemanager/cel/libraries/applicationprofile/open.go
  • pkg/rulemanager/cel/libraries/applicationprofile/syscall.go
  • pkg/rulemanager/cel/libraries/k8s/k8s_test.go
  • pkg/rulemanager/cel/libraries/networkneighborhood/network.go
  • pkg/rulemanager/profilehelper/profilehelper.go
  • pkg/rulemanager/rule_manager.go
  • pkg/rulemanager/ruleadapters/creator.go
  • pkg/rulemanager/rulepolicy.go
  • pkg/storage/storage_interface.go
  • pkg/storage/storage_mock.go
  • pkg/storage/v1/containerprofile.go
  • tests/containerprofilecache/helpers_test.go
  • tests/containerprofilecache/init_eviction_test.go
  • tests/containerprofilecache/lock_stress_test.go
  • tests/containerprofilecache/packages_deleted_test.go
💤 Files with no reviewable changes (6)
  • pkg/objectcache/applicationprofilecache/applicationprofilecache_test.go
  • pkg/objectcache/applicationprofilecache_interface.go
  • pkg/objectcache/networkneighborhoodcache/networkneighborhoodcache_test.go
  • pkg/objectcache/networkneighborhoodcache_interface.go
  • pkg/objectcache/networkneighborhoodcache/networkneighborhoodcache.go
  • pkg/objectcache/applicationprofilecache/applicationprofilecache.go

Comment thread pkg/objectcache/containerprofilecache_interface.go
Comment thread pkg/objectcache/containerprofilecache/containerprofilecache.go
Comment thread pkg/objectcache/containerprofilecache/projection.go
Comment thread pkg/objectcache/containerprofilecache/reconciler.go
Comment thread pkg/objectcache/v1/mock.go Outdated
Comment thread pkg/storage/v1/containerprofile.go Outdated
Comment thread tests/containerprofilecache/helpers_test.go Outdated
Comment thread pkg/objectcache/containerprofilecache/lock_stress_test.go
matthyx and others added 2 commits April 22, 2026 14:13
…ner-start

Component tests on PR #788 regressed with "All alerts: []" and 54+
"container X not found in container-profile cache" log entries. Root cause:
addContainer did a one-shot GetContainerProfile at EventTypeAddContainer
time and bailed on 404. But the CP is created asynchronously by
containerprofilemanager ~60s AFTER container-start, so the one-shot GET
almost always missed; the cache entry was never created; rule evaluation
short-circuited as "no profile".

Legacy caches hid this via a periodic ListProfiles scan that picked up
late-arriving profiles on the next tick. The point-lookup model dropped
that scan. This commit adds an equivalent: a pending-container retry path
in the reconciler.

Changes:
- CachedContainerProfile unchanged; new pendingContainer struct captures
  (container, sharedData, cpName) needed to retry the initial GET.
- ContainerProfileCacheImpl.pending SafeMap records containerIDs waiting
  for their CP to land in storage.
- addContainer extracts the populate/GET into tryPopulateEntry. On miss
  (err or nil CP) it records a pending entry; the per-container goroutine
  exits. No more waiting 10 min inside addContainerWithTimeout.
- reconciler.retryPendingEntries iterates pending under per-container
  locks, re-issues the GET, and promotes via tryPopulateEntry on success.
- reconcileOnce gains a pending GC pass: containers whose pod is gone or
  whose status is not Running get dropped from pending so we don't retry
  forever on terminated containers.
- deleteContainer also clears from pending on EventTypeRemoveContainer.
- metrics: cache_entries gauge gains a "pending" kind; reconciler
  eviction counter gets a "pending_pod_stopped" reason.
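The pending-retry pass described above, as a minimal sketch (the names `pending`, `entries`, and `fetch` are illustrative, not the cache's exact symbols; `fetch` stands in for the storage GET):

```go
package main

import (
	"fmt"
	"sync"
)

type cache struct {
	mu      sync.Mutex
	pending map[string]struct{} // containerIDs whose CP GET missed at add time
	entries map[string]string   // promoted entries
	fetch   func(id string) (string, bool)
}

// retryPendingEntries re-issues the GET for each pending container and
// promotes it into entries on success; misses stay pending for the next tick.
func (c *cache) retryPendingEntries() {
	c.mu.Lock()
	defer c.mu.Unlock()
	for id := range c.pending {
		if profile, ok := c.fetch(id); ok {
			c.entries[id] = profile
			delete(c.pending, id)
		}
	}
}

func main() {
	available := false
	c := &cache{
		pending: map[string]struct{}{"c1": {}},
		entries: map[string]string{},
		fetch: func(id string) (string, bool) {
			if available {
				return "profile-" + id, true
			}
			return "", false
		},
	}
	c.retryPendingEntries() // CP not in storage yet: stays pending
	available = true        // CP lands in storage ~60s later
	c.retryPendingEntries() // next tick promotes it
	fmt.Println(len(c.pending), c.entries["c1"])
}
```

The real implementation takes per-container locks rather than one cache-wide mutex; the single mutex here just keeps the sketch self-contained.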

Tests:
- TestRetryPendingEntries_CPCreatedAfterAdd: 404 on add -> pending; CP
  arrives in storage -> one tick promotes; exactly 2 GetCP calls.
- TestRetryPendingEntries_PodGoneIsGCed: pending entry dropped when its
  pod is no longer present in k8s cache.

Full findings and resume doc at
  .omc/plans/containerprofile-cache-component-test-findings.md

Follow-up plan updated at
  .omc/plans/containerprofile-cache-followups.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…efs, resurrection guard

PR #788 component tests continued failing after the pending-retry fix.
Deep investigation uncovered a fundamental slug misuse and three reviewer-
reported correctness gaps. All fixed here.

### Primary bug: wrong slug function

plan v2 §2.3 asserted that GetOneTimeSlug(false) was deterministic. It is
NOT — implementation at k8s-interface v0.0.206:
  func (id *InstanceID) GetOneTimeSlug(noContainer bool) (string, error) {
      u := uuid.New()
      hexSuffix := hex.EncodeToString(u[:])
      ...
  }
So containerprofilemanager.saveContainerProfile writes a *time-series* CP
per tick with a fresh UUID suffix, and the storage-side
ContainerProfileProcessor.consolidateKeyTimeSeries writes the consolidated
profile at the STABLE slug (GetSlug(false), no UUID).

The cache was querying for CPs at GetOneTimeSlug(false), so every GET 404'd
forever — even with the pending-retry in place. 13 component tests failed
with "All alerts: []" and 38+ "container X not found in container-profile
cache" log entries.

Switched addContainer to GetSlug(false). The refresh path inherits the
corrected name via entry.CPName.

### Reviewer #1: resurrection during refresh

refreshAllEntries snapshots entries without a lock. Between snapshot and
per-entry lock acquisition, deleteContainer or reconcile-evict may have
removed the entry. Previously, rebuildEntry's c.entries.Set(id, newEntry)
would resurrect the dead container.

Added a load-under-lock guard at the top of refreshOneEntry.

### Reviewer #2: overlay handling regressions (two parts)

(a) tryPopulateEntry returned "pending" on base-CP 404 BEFORE trying
user-AP/NN. Containers with only a user-defined profile (no base CP yet)
got no entry. Restructured: fetch base CP and user-AP/NN independently;
populate if ANY source is available; synthesize an empty base CP when only
the overlay exists so projection has something to merge into.

(b) UserAPRef / UserNNRef were only recorded on successful fetch. A
transient 404 on add would permanently drop the overlay intent — the
refresh path had nothing to re-fetch. Now, when the label is set, the
refs are always recorded, using the label's name and the container's
namespace. Refresh retries the fetch each tick.

### Reviewer #3: partial profiles reused across container restart

tryPopulateEntry blindly used whatever CP existed at the stable slug,
including Partial completions from the previous container incarnation.
Legacy caches explicitly deleted Partial profiles on non-PreRunning
restart so rule evaluation fell through to "no profile" until Full
arrived.

Now: if CP.completion == Partial && !sharedData.PreRunningContainer, we
treat the CP as absent → stay pending → retry each tick. When the CP
becomes Full (or the container stops), the pending state resolves.

The inverse is preserved: PreRunningContainer (agent-restart scenario)
accepts the Partial CP as-is so Test_19's "alert on partial profile"
semantics still work.
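The partial-on-restart gate above reduces to one predicate; the constant and parameter names here are illustrative, not the PR's exact symbols:

```go
package main

import "fmt"

const (
	completionPartial = "partial"
	completionFull    = "full"
)

// acceptProfile reports whether a fetched CP should populate the cache.
// A Partial profile left over from the previous container incarnation is
// treated as absent unless the container was already running when the agent
// started (PreRunningContainer), preserving the legacy delete-on-restart
// semantics while keeping the agent-restart path (Test_19) working.
func acceptProfile(completion string, preRunning bool) bool {
	if completion == completionPartial && !preRunning {
		return false // stay pending; retry on the next reconciler tick
	}
	return true
}

func main() {
	fmt.Println(acceptProfile(completionPartial, false)) // fresh restart: reject
	fmt.Println(acceptProfile(completionPartial, true))  // agent restart: accept
	fmt.Println(acceptProfile(completionFull, false))    // full profile: accept
}
```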

### Tests

Five new unit tests, all race-clean:
- TestPartialCP_NonPreRunning_StaysPending
- TestPartialCP_PreRunning_Accepted
- TestOverlayLabel_TransientFetchFailure_RefsRetained
- TestRefreshDoesNotResurrectDeletedEntry
- TestUserDefinedProfileOnly_NoBaseCP

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Performance Benchmark Results

Node-Agent Resource Usage
| Metric | BEFORE | AFTER | Delta |
| --- | --- | --- | --- |
| Avg CPU (cores) | 0.181 | 0.000 | -100.0% |
| Peak CPU (cores) | 0.186 | 0.000 | -100.0% |
| Avg Memory (MiB) | 396.760 | 0.000 | -100.0% |
| Peak Memory (MiB) | 401.250 | 0.000 | -100.0% |
Dedup Effectiveness

No data available.

Event Counters
| Metric | BEFORE | AFTER |
| --- | --- | --- |
| capability_counter | 9 | 0 |
| dns_counter | 1428 | 0 |
| exec_counter | 7176 | 0 |
| network_counter | 94327 | 0 |
| open_counter | 785424 | 0 |
| syscall_counter | 3499 | 0 |

The storage server's consolidated ContainerProfile is not exposed via the
public k8s API — ContainerProfiles().Get(stableName) returns 404 even after
consolidation runs. Only time-series CPs (named <stable>-<UUID>) and the
server-aggregated ApplicationProfile / NetworkNeighborhood CRs at the
workload-name are queryable.

The component tests' WaitForApplicationProfileCompletion waits for the
workload-level AP/NN completion — that's what actually exists. The legacy
caches read these directly; we do the same now while the server-side
consolidated-CP plumbing is completed.

Changes:
- addContainer computes both cpName (per-container, forward-compat) and
  workloadName (per-workload, where AP/NN live) via GetSlug(false) and
  GetSlug(true) respectively.
- tryPopulateEntry fetches consolidated CP (kept for forward-compat),
  workload AP, and workload NN. Treats the workload AP/NN as the primary
  data source when the consolidated CP isn't available.
- projection pre-merges workloadAP + workloadNN onto the base (synthesized
  when CP is 404), then buildEntry applies user-overlay AP/NN on top.
- Partial-on-restart gate extended to cover workload AP/NN too — non
  PreRunning containers ignore partial workload profiles until they
  become Full, mirroring legacy deletion-on-restart semantics.
- pendingContainer gains workloadName so retries re-fetch the right CRs.
- fakeProfileClient gains overlayOnly field so tests can scope AP/NN
  returns to the overlay name; existing TestOverlayPath_DeepCopies updated
  accordingly.

Forward-compat: once storage publishes a queryable consolidated CP at
cpName, its fetch becomes primary and the workload AP/NN path becomes a
fallback. No API changes are required to make that transition — just drop
the workload-level fetches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Performance Benchmark Results

Node-Agent Resource Usage
| Metric | BEFORE | AFTER | Delta |
| --- | --- | --- | --- |
| Avg CPU (cores) | 0.157 | 0.000 | -100.0% |
| Peak CPU (cores) | 0.162 | 0.000 | -100.0% |
| Avg Memory (MiB) | 418.124 | 0.000 | -100.0% |
| Peak Memory (MiB) | 421.621 | 0.000 | -100.0% |
Dedup Effectiveness

No data available.

Event Counters
| Metric | BEFORE | AFTER |
| --- | --- | --- |
| capability_counter | 10 | 0 |
| dns_counter | 1441 | 0 |
| exec_counter | 7211 | 0 |
| network_counter | 94832 | 0 |
| open_counter | 789446 | 0 |
| syscall_counter | 3511 | 0 |

@github-actions

Performance Benchmark Results

Node-Agent Resource Usage
| Metric | BEFORE | AFTER | Delta |
| --- | --- | --- | --- |
| Avg CPU (cores) | 0.000 | 0.000 | N/A |
| Peak CPU (cores) | 0.000 | 0.000 | N/A |
| Avg Memory (MiB) | 0.000 | 0.000 | N/A |
| Peak Memory (MiB) | 0.000 | 0.000 | N/A |
Dedup Effectiveness

No data available.

…retry

CI run 24781030436 (commit ce32919) proved the reconciler IS ticking with
retryPendingEntries running, but the pending-GC pass in reconcileOnce was
dropping every pending entry on the first tick (pending_before=4 →
pending_after=0 at the FIRST tick, before retryPendingEntries could run).

Root cause: the GC pass asked k8sObjectCache.GetPod(ns, pod) and also
checked isContainerRunning. On a busy node, the k8s pod cache and
ContainerStatuses lag the containerwatcher Add event by tens of seconds.
So "pod not found" or "container not yet Running" routinely returned true
for a container that had just been registered, causing GC to remove the
pending entry immediately. Retries then ran against an empty pending map
→ no promotions → alerts fired without profile → test failure.

Change: remove the pending GC entirely. Cleanup for terminated containers
flows through deleteContainer (EventTypeRemoveContainer) which clears
both entries and pending under the per-container lock. Memory growth is
bounded by the node's container churn (containers that never got a
profile during their lifetime).

Test updated: TestRetryPendingEntries_PodGoneIsGCed replaced by
TestPendingEntriesAreNotGCedBeforeRetry which asserts the new semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Performance Benchmark Results

Node-Agent Resource Usage
| Metric | BEFORE | AFTER | Delta |
| --- | --- | --- | --- |
| Avg CPU (cores) | 0.000 | 0.000 | N/A |
| Peak CPU (cores) | 0.000 | 0.000 | N/A |
| Avg Memory (MiB) | 0.000 | 0.000 | N/A |
| Peak Memory (MiB) | 0.000 | 0.000 | N/A |
Dedup Effectiveness

No data available.

Two component-test regressions in PR #788:

Fix A (Test_12 / Test_13): the cache now reads the user-managed
ApplicationProfile and NetworkNeighborhood published at
"ug-<workloadName>" and projects them onto the base profile as a
dedicated ladder pass. Legacy caches did this via the
`kubescape.io/managed-by: User` annotation in handleUserManagedProfile;
we read them directly by their well-known name.

Fix B (Test_17 / Test_19): the reconciler refresh path re-fetches the
workload-level AP/NN (and user-managed / label-referenced overlays) on
every tick, not just the consolidated CP. This propagates the Status=
"ready" -> "completed" transition into the cached ProfileState, which
flips fail_on_profile from false to true at rule-eval time.

CachedContainerProfile gained WorkloadName plus WorkloadAPRV /
WorkloadNNRV / UserManagedAPRV / UserManagedNNRV fields so the refresh
can fast-skip when every source's RV matches what's cached.
refreshOneEntry's rebuild now runs the same projection ladder as
tryPopulateEntry: base CP (or synthesized) → workload AP+NN →
user-managed (ug-) AP+NN → label-referenced user AP+NN.
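The fast-skip described above amounts to a field-by-field resource-version comparison before running the projection ladder; a minimal sketch (field names are illustrative, loosely following the commit description, not the real struct):

```go
package main

import "fmt"

// cachedRVs mirrors the resource versions stored on a cache entry
// for each profile source (names hypothetical).
type cachedRVs struct {
	CP, UserManagedAP, UserManagedNN, UserAP, UserNN string
}

// needsRebuild reports whether any source's RV moved since the entry
// was built; when every RV matches, the refresh can fast-skip the
// full projection rebuild.
func needsRebuild(cached, fetched cachedRVs) bool {
	return cached != fetched // comparable struct: all fields are strings
}

func main() {
	cached := cachedRVs{CP: "50", UserManagedAP: "7"}
	fmt.Println(needsRebuild(cached, cached))                                 // false: fast-skip
	fmt.Println(needsRebuild(cached, cachedRVs{CP: "51", UserManagedAP: "7"})) // true: rebuild
}
```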

Also:
- Tick-loop log only fires when entries OR pending count actually moved
  (previously fired whenever pending_before > 0, producing per-tick
  noise while a stuck-pending entry waited for profile data).
- fakeProfileClient in tests returns userManagedAP/userManagedNN when
  the requested name starts with "ug-".
- New tests: TestWorkloadAPMerged_AndRefreshUpdatesStatus (Fix B
  happy-path) and TestUserManagedProfileMerged (Fix A happy-path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Performance Benchmark Results

Node-Agent Resource Usage

| Metric | BEFORE | AFTER | Delta |
| --- | --- | --- | --- |
| Avg CPU (cores) | 0.174 | 0.000 | -100.0% |
| Peak CPU (cores) | 0.180 | 0.000 | -100.0% |
| Avg Memory (MiB) | 423.340 | 0.000 | -100.0% |
| Peak Memory (MiB) | 424.797 | 0.000 | -100.0% |

Dedup Effectiveness

No data available.

Event Counters

| Metric | BEFORE | AFTER |
| --- | --- | --- |
| capability_counter | 11 | 0 |
| dns_counter | 1426 | 0 |
| exec_counter | 7169 | 0 |
| network_counter | 94199 | 0 |
| open_counter | 784952 | 0 |
| syscall_counter | 3662 | 0 |

CI run 24783250693 (commit 32a76c0) showed reconcileOnce evicting live
entries every tick: "entries_before:10, entries_after:0" within 5 seconds
of the agent starting. Same class of bug as the pending-GC fix (c45803f):
the k8s pod cache lags ContainerCallback Add events by tens of seconds,
and evicting on "GetPod returns nil OR !isContainerRunning" churned every
entry before any rules could evaluate.

Change reconcileOnce eviction gate:
- If pod is missing from k8s cache: DO NOT evict. Cache lag is transient;
  deleteContainer handles real-world cleanup via EventTypeRemoveContainer.
- If pod is present and the container is clearly Terminated: evict (preserves
  init-container eviction for Test_02 and T2 acceptance).
- If pod is present and the container is in a Waiting state: retain (new container
  creation, init-container pre-run both legitimately pass through Waiting).

New helper isContainerTerminated mirrors isContainerRunning but gates on
State.Terminated only; "not found in any status" treated as terminated.
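The gate described above can be shown with simplified stand-ins for the corev1 status types (the real helper operates on Kubernetes API structs; these local types are assumptions for illustration):

```go
package main

import "fmt"

// Minimal mirrors of the corev1 types the helper inspects.
type containerStateTerminated struct{ ExitCode int32 }
type containerState struct{ Terminated *containerStateTerminated }
type containerStatus struct {
	Name  string
	State containerState
}

// isContainerTerminated gates on State.Terminated only; a container
// absent from every status list is treated as terminated.
func isContainerTerminated(statuses []containerStatus, name string) bool {
	for _, s := range statuses {
		if s.Name == name {
			return s.State.Terminated != nil
		}
	}
	return true // "not found in any status": treat as terminated
}

func main() {
	statuses := []containerStatus{
		{Name: "app"}, // Running or Waiting: retained
		{Name: "init", State: containerState{Terminated: &containerStateTerminated{}}},
	}
	fmt.Println(isContainerTerminated(statuses, "app"))   // false
	fmt.Println(isContainerTerminated(statuses, "init"))  // true
	fmt.Println(isContainerTerminated(statuses, "ghost")) // true
}
```

Unlike gating on !isContainerRunning, this predicate stays false for containers in Waiting, so new containers passing through Waiting are not churned.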

Tests:
- TestReconcilerEvictsWhenPodMissing → TestReconcilerKeepsEntryWhenPodMissing
- New TestReconcilerEvictsTerminatedContainer
- New TestReconcilerKeepsWaitingContainer

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Performance Benchmark Results

Node-Agent Resource Usage

| Metric | BEFORE | AFTER | Delta |
| --- | --- | --- | --- |
| Avg CPU (cores) | 0.172 | 0.000 | -100.0% |
| Peak CPU (cores) | 0.181 | 0.000 | -100.0% |
| Avg Memory (MiB) | 417.246 | 0.000 | -100.0% |
| Peak Memory (MiB) | 423.527 | 0.000 | -100.0% |

Dedup Effectiveness

No data available.

Event Counters

| Metric | BEFORE | AFTER |
| --- | --- | --- |
| capability_counter | 10 | 0 |
| dns_counter | 1440 | 0 |
| exec_counter | 7326 | 0 |
| network_counter | 96115 | 0 |
| open_counter | 801083 | 0 |
| syscall_counter | 3517 | 0 |

@github-actions

Performance Benchmark Results

Node-Agent Resource Usage

| Metric | BEFORE | AFTER | Delta |
| --- | --- | --- | --- |
| Avg CPU (cores) | 0.163 | 0.172 | +5.9% |
| Peak CPU (cores) | 0.172 | 0.177 | +3.2% |
| Avg Memory (MiB) | 406.013 | 315.948 | -22.2% |
| Peak Memory (MiB) | 408.496 | 320.188 | -21.6% |

Dedup Effectiveness (AFTER only)

| Event Type | Passed | Deduped | Ratio |
| --- | --- | --- | --- |
| capabilities | 0 | 0 | N/A |
| hardlink | 6000 | 0 | 0.0% |
| http | 1701 | 119455 | 98.6% |
| network | 901 | 77997 | 98.9% |
| open | 36300 | 619942 | 94.5% |
| symlink | 6000 | 0 | 0.0% |
| syscall | 986 | 1908 | 65.9% |

Event Counters

| Metric | BEFORE | AFTER |
| --- | --- | --- |
| capability_counter | 11 | 8 |
| dns_counter | 1443 | 1412 |
| exec_counter | 7252 | 7066 |
| network_counter | 95309 | 92962 |
| open_counter | 792957 | 775413 |
| syscall_counter | 3517 | 3376 |

…ritative

The workload-level AP/NN fetch added in d27be01 was a workaround for the
eviction/GC bugs (fixed in c45803f and d9ae0ac), not an architectural
need. The consolidated ContainerProfile IS queryable at the GetSlug(false)
name once storage aggregation runs; the cache simply needs to wait on the
pending-retry path.

This reverts the workload-AP/NN read while keeping:
- consolidated CP as the single base-profile source
- user-managed AP/NN at "ug-<workloadName>" (merged on top) — still needed
  because user-managed profiles are authored independently and are not
  consolidated into the CP server-side
- user-defined overlay via pod UserDefinedProfileMetadataKey label
- eviction fix (d9ae0ac), GC fix (c45803f), resurrection guard

Removed:
- workload-AP/NN fetch in tryPopulateEntry and refreshOneEntry
- WorkloadAPRV / WorkloadNNRV fields on CachedContainerProfile and the
  corresponding rebuildEntryFromSources ladder pass
- Partial-on-restart gate for workload AP/NN (only applies to CP now)
- Synth-CP annotation fallback chain (simplified to Completed/Full)

Tests:
- TestWorkloadAPMerged_AndRefreshUpdatesStatus → TestRefreshUpdatesCPStatus
  (CP now the source; RV transition propagates Status)
- TestUserManagedProfileMerged rewired to use a real base CP + ug- overlay
  instead of workloadAP + ug- overlay

This matches the migration plan's original intent: CP-direct, AP/NN only
as user overlays.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ithOverlayForTest

Add SeedEntryWithOverlayForTest helper so out-of-package integration tests can
set UserAPRef / UserNNRef (which use the internal namespacedName type) without
requiring the type to be exported.

Mirror TestT8_EndToEndRefreshUpdatesProjection at tests/containerprofilecache/
using only the public + test-helper API: seeds an entry with a stale UserAPRV,
mutates storage to apV2 (RV=51), asserts RefreshAllEntriesForTest rebuilds the
projection with the new execs and drops the stale ones.

Add top-of-file block comment to RuleObjectCacheMock documenting the non-
overlapping AP-fields / NN-fields partition between SetApplicationProfile and
SetNetworkNeighborhood and the first-container-wins rule for r.cp.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions

Performance Benchmark Results

Node-Agent Resource Usage

| Metric | BEFORE | AFTER | Delta |
| --- | --- | --- | --- |
| Avg CPU (cores) | 0.000 | 0.000 | N/A |
| Peak CPU (cores) | 0.000 | 0.000 | N/A |
| Avg Memory (MiB) | 0.000 | 0.000 | N/A |
| Peak Memory (MiB) | 0.000 | 0.000 | N/A |

Dedup Effectiveness

No data available.

- reconciler.go: simplify dead-code cpErr/rpcErr guard (refreshRPC returns
  exactly cpErr; the rpcErr != nil && cpErr == nil branch could never fire)
- reconciler_test.go: make blockingProfileClient.blocked a buffered chan(1)
  with a blocking send so the signal is never silently dropped; bump
  rpcBudget to 100ms and timeout to 2s to reduce flakiness on loaded CI
- containerprofilecache.go: extract defaultStorageRPCBudget const alongside
  defaultReconcileInterval for discoverability
- shared_pointer_race_test.go: fix gofmt const-block alignment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions

Performance Benchmark Results

Node-Agent Resource Usage

| Metric | BEFORE | AFTER | Delta |
| --- | --- | --- | --- |
| Avg CPU (cores) | 0.000 | 0.000 | N/A |
| Peak CPU (cores) | 0.000 | 0.000 | N/A |
| Avg Memory (MiB) | 0.000 | 0.000 | N/A |
| Peak Memory (MiB) | 0.000 | 0.000 | N/A |

Dedup Effectiveness

No data available.

@github-actions

Performance Benchmark Results

Node-Agent Resource Usage

| Metric | BEFORE | AFTER | Delta |
| --- | --- | --- | --- |
| Avg CPU (cores) | 0.153 | 0.148 | -3.5% |
| Peak CPU (cores) | 0.159 | 0.153 | -4.3% |
| Avg Memory (MiB) | 417.071 | 314.243 | -24.7% |
| Peak Memory (MiB) | 419.902 | 316.805 | -24.6% |

Dedup Effectiveness (AFTER only)

| Event Type | Passed | Deduped | Ratio |
| --- | --- | --- | --- |
| capabilities | 0 | 0 | N/A |
| hardlink | 5999 | 0 | 0.0% |
| http | 1700 | 119456 | 98.6% |
| network | 899 | 77931 | 98.9% |
| open | 35957 | 619967 | 94.5% |
| symlink | 5999 | 0 | 0.0% |
| syscall | 980 | 1896 | 65.9% |

Event Counters

| Metric | BEFORE | AFTER |
| --- | --- | --- |
| capability_counter | 9 | 8 |
| dns_counter | 1416 | 1417 |
| exec_counter | 7152 | 7120 |
| network_counter | 93911 | 93612 |
| open_counter | 783634 | 779566 |
| syscall_counter | 3483 | 3517 |

Before this fix, a refreshRPC timeout on any overlay GET (user-managed
ug-<workload> AP/NN or user-defined label-referenced AP/NN) left the
overlay variable nil with the error silently discarded. The RV comparison
then saw rvOf(nil)="" != cached RV (e.g. "50"), treated it as a removal,
and rebuilt the entry without the overlay — temporarily stripping
user-managed/user-defined profile data from the cache and altering
alerting until the next successful tick.

Fix: capture each overlay's fetch error and, when it is non-nil and the
entry already has a non-empty cached RV for that overlay, return early
and keep the existing entry unchanged. Legitimate deletions (nil with
err==nil) still propagate correctly. Mirrors the existing CP error-
preservation logic at refreshOneEntry:272-288.

Add TestRefreshPreservesEntryOnTransientOverlayError covering all four
overlay fetch paths (user-managed AP, user-managed NN, user-defined AP,
user-defined NN) via a new overlayErrorClient stub.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/objectcache/containerprofilecache/reconciler.go (1)

38-80: Downstream note: tick-loop blocking from synchronous retryPendingEntries.

retryPendingEntries runs synchronously on the tick (Line 56). Combined with the unbounded-ctx issue in tryPopulateEntry (see the containerprofilecache.go review comment on Lines 280-376), a single hung storage GET against any pending entry will freeze the entire evict + refresh-dispatch loop until shutdown — even though refreshAllEntries is correctly single-flighted on its own goroutine. Fixing the root cause in tryPopulateEntry (route GETs through refreshRPC) resolves this without any change here, but if you prefer defense-in-depth you could also run retryPendingEntries on the same single-flight goroutine as refreshAllEntries (or spawn it separately) so the tick itself never blocks on I/O.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/objectcache/containerprofilecache/reconciler.go` around lines 38 - 80,
The tickLoop currently calls retryPendingEntries synchronously, so a hung
tryPopulateEntry/storage GET can block the entire tick; to fix this, stop
calling retryPendingEntries inline and instead run it asynchronously under the
same single-flight guard used for refreshAllEntries: when
refreshInProgress.CompareAndSwap(false, true) succeeds, start a goroutine that
defers refreshInProgress.Store(false) and invokes retryPendingEntries(ctx) then
refreshAllEntries(ctx) (or vice-versa), ensuring both use the refreshInProgress
guard; alternatively spawn retryPendingEntries in its own goroutine so the
ticker loop never blocks. Reference: tickLoop, retryPendingEntries,
refreshAllEntries, tryPopulateEntry, refreshInProgress.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/objectcache/containerprofilecache/containerprofilecache.go`:
- Around line 280-376: tryPopulateEntry performs storage GETs
(storageClient.GetContainerProfile, GetApplicationProfile,
GetNetworkNeighborhood) using the caller ctx which can be unbounded when called
from retryPendingEntries; update tryPopulateEntry so every storage GET is
invoked through c.refreshRPC(ctx, func(ctx context.Context) error { ... }) (or
the equivalent c.refreshRPC wrapper used elsewhere) instead of calling
storageClient.Get* directly — specifically wrap the cp fetch
(GetContainerProfile) and the four AP/NN fetches (userManagedAP/userManagedNN
via ugName, and userAP/userNN via overlayName) with c.refreshRPC to apply the
StorageRPCBudget per-RPC timeout and avoid blocking the reconciler tick loop.



📥 Commits

Reviewing files that changed from the base of the PR and between 74d4652 and 3909a3b.

📒 Files selected for processing (15)
  • pkg/config/config.go
  • pkg/objectcache/containerprofilecache/containerprofilecache.go
  • pkg/objectcache/containerprofilecache/containerprofilecache_test.go
  • pkg/objectcache/containerprofilecache/reconciler.go
  • pkg/objectcache/containerprofilecache/reconciler_test.go
  • pkg/objectcache/v1/mock.go
  • pkg/storage/storage_interface.go
  • pkg/storage/storage_mock.go
  • pkg/storage/v1/applicationprofile.go
  • pkg/storage/v1/containerprofile.go
  • pkg/storage/v1/networkneighborhood.go
  • tests/containerprofilecache/helpers_test.go
  • tests/containerprofilecache/lock_stress_test.go
  • tests/containerprofilecache/shared_pointer_race_test.go
  • tests/containerprofilecache/t8_overlay_refresh_test.go
✅ Files skipped from review due to trivial changes (2)
  • tests/containerprofilecache/helpers_test.go
  • pkg/objectcache/containerprofilecache/containerprofilecache_test.go
🚧 Files skipped from review as they are similar to previous changes (5)
  • pkg/storage/v1/containerprofile.go
  • pkg/storage/storage_interface.go
  • pkg/storage/storage_mock.go
  • pkg/objectcache/v1/mock.go
  • pkg/objectcache/containerprofilecache/reconciler_test.go

Comment thread pkg/objectcache/containerprofilecache/containerprofilecache.go
@github-actions

Performance Benchmark Results

Node-Agent Resource Usage

| Metric | BEFORE | AFTER | Delta |
| --- | --- | --- | --- |
| Avg CPU (cores) | 0.150 | 0.150 | +0.5% |
| Peak CPU (cores) | 0.154 | 0.156 | +1.3% |
| Avg Memory (MiB) | 407.115 | 312.258 | -23.3% |
| Peak Memory (MiB) | 410.043 | 314.645 | -23.3% |

Dedup Effectiveness (AFTER only)

| Event Type | Passed | Deduped | Ratio |
| --- | --- | --- | --- |
| capabilities | 0 | 0 | N/A |
| hardlink | 6000 | 0 | 0.0% |
| http | 1704 | 119456 | 98.6% |
| network | 901 | 77822 | 98.9% |
| open | 35736 | 620514 | 94.6% |
| symlink | 6000 | 0 | 0.0% |
| syscall | 979 | 1929 | 66.3% |

Event Counters

| Metric | BEFORE | AFTER |
| --- | --- | --- |
| capability_counter | 10 | 8 |
| dns_counter | 1415 | 1382 |
| exec_counter | 7146 | 7002 |
| network_counter | 93838 | 91933 |
| open_counter | 782410 | 767636 |
| syscall_counter | 3518 | 3396 |

@amitschendel
Collaborator

This may cause some false positives for runtime detection; might be worth the risk for the simplicity though.

@matthyx
Contributor Author

matthyx commented Apr 23, 2026

This may cause some false positives for runtime detection; might be worth the risk for the simplicity though.

Why? If done correctly it's the same functionality... But wait for the next one, you'll love it!

- Rename 5 CP cache metrics from nodeagent_* to node_agent_* to match
  the existing metric namespace convention used across node-agent.
- Route all 5 storage GETs in tryPopulateEntry through refreshRPC so
  they respect the per-call SLO (default 5s); prevents a hung GET from
  blocking the entire reconciler tick loop when called from
  retryPendingEntries.
- Add WarmPendingForTest helper to pre-initialise the pending SafeMap
  before concurrent test phases, preventing the goradd/maps
  nil-check-before-lock initialisation race.
- Pre-warm pending SafeMap in TestLockStressAddEvictInterleaved and
  poll for async deleteContainer goroutines to drain before asserting
  goroutine count.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@amitschendel
Collaborator

This may cause some false positives for runtime detection; might be worth the risk for the simplicity though.

Why? If done correctly it's the same functionality... But wait for the next one, you'll love it!

Because, in the case of a Deployment for example, you can have say 2 pods, and k8s can decide to route traffic to only one of them for 24h because there is no load. A container that used to be stale can then start getting traffic once load hits the cluster, which will cause drifts from the profile of the stale container. It's kind of a long shot, but the same concept applies if someone execs into one of the pods, etc. But again, might not be a real concern.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/containerprofilecache/lock_stress_test.go (1)

150-161: ⚠️ Potential issue | 🟡 Minor

200ms drain window may flake for ~2500 spawned deleteContainer goroutines.

Each evict iteration calls ContainerCallback(EventTypeRemoveContainer), which spawns a go deleteContainer(...). With 100 workers × ~25 evicts on average, up to ~2500 async goroutines can be in flight, yet the poll loop only waits up to 200ms before the hard assertion. Under a loaded CI runner this can trip baseline+10 spuriously. Consider lengthening the deadline (e.g. 2s) or converting the check to assert.Eventually, which was the direction of a previous review comment as well.

Proposed tweak
-	drainDeadline := time.Now().Add(200 * time.Millisecond)
-	for runtime.NumGoroutine() > baseline+10 && time.Now().Before(drainDeadline) {
-		runtime.Gosched()
-		time.Sleep(5 * time.Millisecond)
-	}
-	runtime.GC()
-	assert.LessOrEqual(t, runtime.NumGoroutine(), baseline+10,
-		"goroutine count should stay near baseline (no leaked goroutines)")
+	assert.Eventually(t, func() bool {
+		runtime.Gosched()
+		runtime.GC()
+		return runtime.NumGoroutine() <= baseline+10
+	}, 2*time.Second, 10*time.Millisecond,
+		"goroutine count should return near baseline (no leaked goroutines)")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/containerprofilecache/lock_stress_test.go` around lines 150 - 161, The
test's 200ms drainDeadline for waiting on asynchronous goroutines spawned by
ContainerCallback(EventTypeRemoveContainer) (which starts go
deleteContainer(...)) is too short for up to ~2500 in-flight goroutines and
causes flakiness; update the test to wait longer (e.g., 2s) or replace the
manual poll loop with assert.Eventually polling on runtime.NumGoroutine() <=
baseline+10 (or both), referencing the drainDeadline variable, the
runtime.NumGoroutine check and the ContainerCallback/deleteContainer goroutine
source to ensure the assertion only runs after a reasonable drain window.
🧹 Nitpick comments (3)
pkg/metricsmanager/prometheus/prometheus.go (1)

225-246: Namespace prefix fix looks good; consider container_profile word split for consistency.

The nodeagent_ → node_agent_ rename from the prior review is applied correctly. One minor nit: the rest of this file uses snake_case between logical words (e.g., node_agent_rule_evaluation_time_seconds, node_agent_container_start_counter), whereas these new metrics keep containerprofile as a single token. If you want strict consistency, consider node_agent_container_profile_cache_entries, ..._cache_hit_total, ..._reconciler_duration_seconds, ..._reconciler_evictions_total. Not a blocker since these are net-new metrics and easy to rename now before release.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metricsmanager/prometheus/prometheus.go` around lines 225 - 246, The
metric names for ContainerProfile metrics use "containerprofile" as a single
token; rename them to use snake_case "container_profile" for consistency: update
the Name fields in cpCacheLegacyLoadsCounter, cpCacheEntriesGauge,
cpCacheHitCounter, cpReconcilerDurationHistogram, and
cpReconcilerEvictionsCounter to "node_agent_container_profile_..." (e.g.,
node_agent_container_profile_cache_entries,
node_agent_container_profile_cache_hit_total,
node_agent_container_profile_reconciler_duration_seconds,
node_agent_container_profile_reconciler_evictions_total, and
node_agent_user_profile_legacy_loads_total -> if desired rename to
node_agent_user_container_profile_legacy_loads_total), and then update any
code/tests/alerts that reference the old metric names to the new names.
tests/containerprofilecache/lock_stress_test.go (2)

187-200: Prefer strconv.Itoa over the hand-rolled itoa3.

The helper duplicates standard library functionality with no stated reason (the comment just says "without strconv"). Using strconv.Itoa is clearer and eliminates a small custom routine that a future reader has to verify.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/containerprofilecache/lock_stress_test.go` around lines 187 - 200, The
custom itoa3 function duplicates stdlib behavior; replace its usage with
strconv.Itoa by removing or deprecating the itoa3 helper and calling
strconv.Itoa(i) wherever itoa3 is used (or replace the function body with a
simple return strconv.Itoa(i)) so tests use the standard implementation from the
strconv package; ensure you add the strconv import if not present and run tests
to confirm no regressions.

162-163: Dead assertion.

assert.True(t, true, ...) is a no-op — the comment above already explains that a panic would have failed the test. Drop the line (or replace it with a real invariant check, e.g. that at least some entries remain/were evicted) to avoid misleading readers into thinking something is being verified here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/containerprofilecache/lock_stress_test.go` around lines 162 - 163,
Remove the dead assertion assert.True(t, true, "no panic occurred") in the test
and either drop it entirely or replace it with a real invariant check that
verifies cache behavior after the stress run (for example assert that the cache
still contains at least one entry or that eviction occurred as expected); locate
the assertion by searching for the exact call assert.True(t, true, "no panic
occurred") in the lock stress test and update it to assert a meaningful
condition against the cache state (e.g., entry count or eviction/property
invariants).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/containerprofilecache/lock_stress_test.go`:
- Around line 101-135: The per-goroutine RNG is seeded with
time.Now().UnixNano() and multiple workers can get identical seeds; update the
goroutine to accept the worker index (to avoid loop-variable capture) and use it
in the seed, e.g. change go func() { ... }() to go func(worker int) { r :=
rand.New(rand.NewSource(time.Now().UnixNano() + int64(worker))) ... }(w), so
each worker gets a distinct seed and the RNG sequences diverge; keep existing
uses of r, containerIDs, cache.SeedEntryForTest, cache.ContainerCallback and
makeTestContainer unchanged.



📥 Commits

Reviewing files that changed from the base of the PR and between 3909a3b and ff0d1ff.

📒 Files selected for processing (3)
  • pkg/metricsmanager/prometheus/prometheus.go
  • pkg/objectcache/containerprofilecache/containerprofilecache.go
  • tests/containerprofilecache/lock_stress_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/objectcache/containerprofilecache/containerprofilecache.go

Comment thread pkg/objectcache/containerprofilecache/lock_stress_test.go
@github-actions

Performance Benchmark Results

Node-Agent Resource Usage

| Metric | BEFORE | AFTER | Delta |
| --- | --- | --- | --- |
| Avg CPU (cores) | 0.157 | 0.158 | +0.8% |
| Peak CPU (cores) | 0.166 | 0.167 | +0.7% |
| Avg Memory (MiB) | 416.199 | 312.817 | -24.8% |
| Peak Memory (MiB) | 425.223 | 314.773 | -26.0% |

Dedup Effectiveness (AFTER only)

| Event Type | Passed | Deduped | Ratio |
| --- | --- | --- | --- |
| capabilities | 0 | 0 | N/A |
| hardlink | 6000 | 0 | 0.0% |
| http | 1700 | 119458 | 98.6% |
| network | 900 | 77999 | 98.9% |
| open | 36188 | 619986 | 94.5% |
| symlink | 6000 | 0 | 0.0% |
| syscall | 981 | 1895 | 65.9% |

Event Counters

| Metric | BEFORE | AFTER |
| --- | --- | --- |
| capability_counter | 9 | 7 |
| dns_counter | 712 | 1423 |
| exec_counter | 7330 | 7153 |
| network_counter | 88932 | 93993 |
| open_counter | 802690 | 784368 |
| syscall_counter | 3636 | 3455 |

Pass worker index into each goroutine closure and mix it into the
rand.NewSource seed (time.Now().UnixNano() + int64(worker)), so that
100 concurrently-launched goroutines don't all receive the same
nanosecond timestamp and end up with identical add/evict sequences.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
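The seeding fix can be shown in isolation (spawnWorkers is a hypothetical stand-in for the test's worker loop; only the index-into-seed pattern mirrors the commit):

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// spawnWorkers passes the worker index into each closure and mixes it
// into the seed, so concurrently launched goroutines that observe the
// same nanosecond timestamp still get distinct RNG sequences.
func spawnWorkers(n int) []int64 {
	firsts := make([]int64, n)
	var wg sync.WaitGroup
	for w := 0; w < n; w++ {
		wg.Add(1)
		go func(worker int) { // pass the index; don't capture the loop variable
			defer wg.Done()
			r := rand.New(rand.NewSource(time.Now().UnixNano() + int64(worker)))
			firsts[worker] = r.Int63()
		}(w)
	}
	wg.Wait()
	return firsts
}

func main() {
	// Identical timestamps no longer imply identical sequences,
	// because each seed is offset by the worker index.
	fmt.Println(len(spawnWorkers(4))) // 4
}
```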
@github-actions

Performance Benchmark Results

Node-Agent Resource Usage

| Metric | BEFORE | AFTER | Delta |
| --- | --- | --- | --- |
| Avg CPU (cores) | 0.126 | 0.138 | +9.8% |
| Peak CPU (cores) | 0.141 | 0.144 | +2.3% |
| Avg Memory (MiB) | 402.368 | 314.983 | -21.7% |
| Peak Memory (MiB) | 404.605 | 316.484 | -21.8% |

Dedup Effectiveness (AFTER only)

| Event Type | Passed | Deduped | Ratio |
| --- | --- | --- | --- |
| capabilities | 1 | 0 | 0.0% |
| hardlink | 6000 | 0 | 0.0% |
| http | 1704 | 119454 | 98.6% |
| network | 900 | 77999 | 98.9% |
| open | 50393 | 620795 | 92.5% |
| symlink | 6000 | 0 | 0.0% |
| syscall | 979 | 1892 | 65.9% |

Event Counters

| Metric | BEFORE | AFTER |
| --- | --- | --- |
| capability_counter | 9 | 11 |
| dns_counter | 1408 | 1413 |
| exec_counter | 7046 | 7069 |
| network_counter | 92661 | 92996 |
| open_counter | 788240 | 794306 |
| syscall_counter | 3471 | 3463 |

matthyx and others added 2 commits April 23, 2026 13:28
The six *ForTest / ReconcileOnce helpers were previously mixed into
containerprofilecache.go alongside production logic. Move them to a
dedicated testing.go file in the same package.

export_test.go is the idiomatic alternative but is compiled only when
running tests in the same directory; test packages in other directories
(tests/containerprofilecache/) import the non-test version of the
package and never see _test.go contents. A plain testing.go is the
correct pattern here — it signals "test support" by name and groups all
scaffolding in one place, while remaining importable by any test binary.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
export_test.go (package containerprofilecache) is only compiled during
`go test` so test helpers never enter the production binary. This only
works when callers are in the same directory; the prior layout put tests
in tests/containerprofilecache/ (a separate package), forcing helpers
into a plain testing.go that shipped in the binary.

Moving the six test files into pkg/objectcache/containerprofilecache/
as package containerprofilecache_test fixes this correctly:
- export_test.go replaces testing.go (test-binary-only)
- package declaration: containerprofilecache_integration → containerprofilecache_test
- packages_deleted_test.go Dir path: ../.. → ../../.. (module root)
- tests/containerprofilecache/ directory removed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions

Performance Benchmark Results

Node-Agent Resource Usage

| Metric | BEFORE | AFTER | Delta |
| --- | --- | --- | --- |
| Avg CPU (cores) | 0.163 | 0.162 | -0.9% |
| Peak CPU (cores) | 0.170 | 0.171 | +0.5% |
| Avg Memory (MiB) | 312.456 | 262.648 | -15.9% |
| Peak Memory (MiB) | 314.312 | 268.039 | -14.7% |

Dedup Effectiveness (AFTER only)

| Event Type | Passed | Deduped | Ratio |
| --- | --- | --- | --- |
| capabilities | 2 | 0 | 0.0% |
| hardlink | 6000 | 0 | 0.0% |
| http | 1704 | 119456 | 98.6% |
| network | 900 | 77916 | 98.9% |
| open | 36255 | 619921 | 94.5% |
| symlink | 6000 | 0 | 0.0% |
| syscall | 988 | 1896 | 65.7% |

Event Counters

| Metric | BEFORE | AFTER |
| --- | --- | --- |
| capability_counter | 9 | 9 |
| dns_counter | 1454 | 1410 |
| exec_counter | 7309 | 7092 |
| network_counter | 96045 | 93188 |
| open_counter | 801322 | 776818 |
| syscall_counter | 3652 | 3487 |

The Kubernetes generated client (gentype.Client.Get) pre-allocates a
zero-value struct before the HTTP call and returns it as the result even
on error (e.g. 404 not-found). In refreshOneEntry, the four overlay
fetch paths (userManagedAP, userManagedNN, userAP, userNN) guarded only
the "transient error with cached RV → keep old entry" branch; the
"first-time 404, no cached RV" branch fell through with a non-nil
empty-ObjectMeta struct still in the pointer, which reached
rebuildEntryFromSources → emitOverlayMetrics and logged spurious
"user-authored legacy profile merged" warnings with empty
namespace/name/resourceVersion fields.

Add an explicit nil-out after each non-returning error branch, mirroring
the pattern already used in tryPopulateEntry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
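The zero-value-on-error behavior and the nil-out fix can be reproduced in miniature (fetchOverlay is a hypothetical stand-in for the generated client call; the real code goes through gentype.Client.Get):

```go
package main

import (
	"errors"
	"fmt"
)

type profile struct{ Name string }

// fetchOverlay mimics the generated client's behavior described in the
// commit: it returns a pre-allocated zero-value struct even on a 404,
// so the caller cannot rely on "nil pointer" signalling "not found".
func fetchOverlay(exists bool) (*profile, error) {
	out := &profile{} // gentype pre-allocates the result before the HTTP call
	if !exists {
		return out, errors.New("not found")
	}
	out.Name = "ug-workload"
	return out, nil
}

func main() {
	overlay, err := fetchOverlay(false)
	if err != nil {
		overlay = nil // explicit nil-out so the rebuild ladder skips it
	}
	fmt.Println(overlay == nil) // true: no spurious "legacy profile merged" log
}
```

Without the nil-out, the empty-ObjectMeta struct flows into rebuildEntryFromSources and triggers the spurious merge warnings the commit describes.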
@github-actions

Performance Benchmark Results

Node-Agent Resource Usage

| Metric | BEFORE | AFTER | Delta |
| --- | --- | --- | --- |
| Avg CPU (cores) | 0.154 | 0.154 | +0.4% |
| Peak CPU (cores) | 0.162 | 0.159 | -1.3% |
| Avg Memory (MiB) | 341.072 | 258.594 | -24.2% |
| Peak Memory (MiB) | 345.863 | 264.617 | -23.5% |

Dedup Effectiveness (AFTER only)

| Event Type | Passed | Deduped | Ratio |
| --- | --- | --- | --- |
| capabilities | 0 | 0 | N/A |
| hardlink | 6000 | 0 | 0.0% |
| http | 1704 | 119456 | 98.6% |
| network | 900 | 78000 | 98.9% |
| open | 33716 | 622437 | 94.9% |
| symlink | 6000 | 0 | 0.0% |
| syscall | 979 | 1873 | 65.7% |

Event Counters

| Metric | BEFORE | AFTER |
| --- | --- | --- |
| capability_counter | 11 | 8 |
| dns_counter | 1434 | 1415 |
| exec_counter | 7171 | 7079 |
| network_counter | 94295 | 93127 |
| open_counter | 784596 | 775437 |
| syscall_counter | 3529 | 3493 |

@matthyx matthyx merged commit 7c2c899 into main Apr 27, 2026
27 of 28 checks passed
@matthyx matthyx deleted the cp-cache branch April 27, 2026 10:35
@matthyx matthyx moved this to To Archive in KS PRs tracking Apr 27, 2026
entlein pushed a commit to k8sstormcenter/node-agent that referenced this pull request Apr 30, 2026
Upstream PR kubescape#788 (Replace AP and NN cache with CP) deleted the legacy
applicationprofilecache where the fork's emitTamperAlert (commit
c2d681e 'Feat/tamperalert' #22) lived. After the merge, R1016 alerts
no longer fired for tampered user-defined profiles, breaking
Test_31_TamperDetectionAlert (passed 3/3 on main, fails on the merged
branch — confirmed regression introduced by PR #36).

This restores the contract: every cache load of a user-supplied
ApplicationProfile or NetworkNeighborhood overlay re-verifies the
signature, and emits an R1016 'Signed profile tampered' alert
through the rule-alert exporter when the signature is present but no
longer valid. Alert shape preserved from the legacy cache so dashboards
and component tests keep matching.

Implementation:
  - new file pkg/objectcache/containerprofilecache/tamper_alert.go:
    verifyUserApplicationProfile / verifyUserNetworkNeighborhood /
    emitTamperAlert / extractWlidFromContainerID. Self-contained;
    keeps containerprofilecache.go diff small.
  - containerprofilecache.go: new tamperAlertExporter field +
    SetTamperAlertExporter setter + verify hooks immediately after
    GetApplicationProfile / GetNetworkNeighborhood succeed in the
    user-overlay branch of addContainer.
  - cmd/main.go: wire the alert exporter via SetTamperAlertExporter
    after the cache constructor (kept the constructor signature
    unchanged to avoid blast radius on tests).

The setter is nil-safe: when no exporter is wired, verification still
runs and is logged but no alert is emitted — matches the legacy
behavior for unit-tests-with-no-exporter.

Test_31 expanded from one scenario to four subtests, each in its own
namespace to avoid alert cross-contamination:
  31a tampered_user_defined_AP_fires_R1016 — original regression case
  31b untampered_signed_AP_no_R1016        — negative: clean signature
  31c unsigned_AP_no_R1016                 — signing is opt-in
  31d tampered_user_defined_NN_fires_R1016 — parallel NN code path

Verified locally:
  - go build ./... clean
  - go test ./pkg/objectcache/containerprofilecache/... green
  - go test ./pkg/signature/... green
  - go vet -tags=component ./tests/... clean
  - go test -tags=component -run='^$' ./tests/... compiles
entlein added a commit to k8sstormcenter/node-agent that referenced this pull request May 2, 2026
…eCache (kubescape#788) (#36)

* Replace AP and NN cache with CP (kubescape#788)

* feat: foundation for ContainerProfileCache unification (steps 1, 2, 5-early)

Additive-only scaffolding for the upcoming migration from the two
workload-keyed caches (applicationprofilecache + networkneighborhoodcache)
to a single container-keyed ContainerProfileCache. No consumers are
rewired yet; all new code is unused.

- Storage client: GetContainerProfile(namespace, name) on ProfileClient
  interface + *Storage impl + mock.
- ContainerProfileCache interface + stub impl (methods return zero values;
  filled in by step 3/4).
- Prometheus metrics: nodeagent_user_profile_legacy_loads_total{kind,completeness}
  deprecation counter + reconciler SLO metrics (entries gauge, hit/miss
  counter, tick duration histogram, eviction counter) registered up front
  so later steps emit cleanly.

Plan artifacts in .omc/plans/; approved by ralplan Planner/Architect/Critic
consensus (v2, iteration 2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: ContainerProfileCacheImpl + projection + shared-pointer fast-path (steps 3, 3.5, 4)

- CachedContainerProfile entry with Shared/RV/UserAP/UserNNRV fields
- Option A+ fast-path: shared storage pointer when no user overlay
- projection.go ports mergeContainers/mergeNetworkNeighbors from legacy caches
- partial-profile detection with dedup'd WARN log + completeness metric label
- Event-path delete with WithLock+ReleaseLock (Critic #2 lock-gap fix)
- Unit tests T4 (projection) + T6 (callstack parity) + fast-path identity

Step 5 (reconciler) and legacy deletion land in follow-ups.

Plan: .omc/plans/containerprofile-cache-unification-consensus.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: ContainerProfileCache reconciler with evict + refresh (step 5)

- tickLoop drives evict + refresh on one goroutine, refresh gated by atomic
- reconcileOnce evicts entries whose pod is gone or container stopped
- refreshAllEntries snapshots IDs then refreshes outside Range to avoid a
  SafeMap RLock/WLock deadlock (rebuildEntry calls Set)
- isContainerRunning(pod, entry, id): containerID primary, (Name, PodUID)
  fallback for pre-running init containers with empty ContainerID
- ctx.Err() honored inside Range callbacks for graceful shutdown
- T8 end-to-end test: user-AP mutation -> cached projection reflects change

Plan: .omc/plans/containerprofile-cache-unification-consensus.md
Consensus deltas applied: #1 (isContainerRunning signature), #3 (ctx.Err),
#4 (extend fast-skip to overlay RVs), #5 (T8 test), #7 (RPC-cost comment).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
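The snapshot-then-refresh pattern described above can be sketched with a minimal stand-in for the SafeMap (the real cache and entry types are more involved; this only illustrates the lock discipline):

```go
package main

import (
	"fmt"
	"sync"
)

// cache is a minimal stand-in for the SafeMap: iteration holds the read
// lock and Set takes the write lock, so calling Set while iterating
// would deadlock a non-reentrant RWMutex.
type cache struct {
	mu sync.RWMutex
	m  map[string]int
}

func (c *cache) Set(k string, v int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[k] = v
}

func (c *cache) Get(k string) int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.m[k]
}

// snapshotKeys copies the IDs under the read lock and releases it before
// any rebuild runs, mirroring how refreshAllEntries snapshots outside Range.
func (c *cache) snapshotKeys() []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	keys := make([]string, 0, len(c.m))
	for k := range c.m {
		keys = append(keys, k)
	}
	return keys
}

func (c *cache) refreshAll() {
	for _, k := range c.snapshotKeys() {
		c.Set(k, c.Get(k)+1) // safe: no lock held across Set
	}
}

func main() {
	c := &cache{m: map[string]int{"a": 1, "b": 2}}
	c.refreshAll()
	fmt.Println(c.Get("a"), c.Get("b"))
}
```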

* feat: profilehelper CP->legacy-shape shims + ContainerProfileCache aggregator wiring (step 6a)

Adds the ContainerProfileCache reader to the ObjectCache aggregator interface
so profilehelper can read CP and synthesize the legacy ApplicationProfileContainer /
NetworkNeighborhoodContainer shapes for callers that haven't migrated yet.

- pkg/objectcache/objectcache_interface.go: add ContainerProfileCache() to
  aggregator interface + mock (both AP/NN stay for 6a-6c transit)
- pkg/objectcache/v1/objectcache.go: add cp field, 5-arg NewObjectCache,
  ContainerProfileCache() accessor
- pkg/objectcache/v1/mock.go: RuleObjectCacheMock implements CP surface +
  Get/SetContainerProfile test helpers, Start stub
- pkg/rulemanager/profilehelper/profilehelper.go:
  - GetContainerProfile(objectCache, id) returns (*CP, syncChecksum, error)
    — the forward API
  - GetContainerApplicationProfile + GetContainerNetworkNeighborhood rewritten
    as ~30-LOC CP->legacy-shape shims (consensus delta #2). Marked deprecated;
    step 6c deletes them after CEL callers migrate.
- cmd/main.go: construct ContainerProfileCache alongside APC+NNC, pass to
  NewObjectCache; mock-path uses ContainerProfileCacheMock
- test call sites updated for 5-arg NewObjectCache

Plan: .omc/plans/containerprofile-cache-unification-consensus.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: migrate 20 CEL call sites to GetContainerProfile (step 6b)

- applicationprofile/{capability,exec,http,open,syscall}.go: read fields
  directly off cp.Spec instead of the per-container AP shape
- networkneighborhood/network.go: read Ingress/Egress/LabelSelector off
  cp.Spec directly
- pkg/objectcache/v1/mock.go: extend RuleObjectCacheMock so
  SetApplicationProfile / SetNetworkNeighborhood also project into the
  unified ContainerProfile, and GetContainerProfile honours the shared
  container-ID registry (preserves "invalid container ID -> no profile"
  semantics for existing tests)
- profilehelper CP->legacy shims remain in place; step 6c removes them

Plan: .omc/plans/containerprofile-cache-unification-consensus.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: delete profilehelper shims + migrate rule_manager + creator (step 6c)

- pkg/rulemanager/profilehelper/profilehelper.go: delete
  GetContainerApplicationProfile, GetContainerNetworkNeighborhood,
  GetApplicationProfile, GetNetworkNeighborhood, GetContainerFromApplicationProfile,
  GetContainerFromNetworkNeighborhood — CP-direct API is the only surface now
- pkg/rulemanager/rule_manager.go:
  - :202, :399 call profilehelper.GetContainerProfile instead of the shim
  - HasFinalApplicationProfile reads cp via ContainerProfileCache().GetContainerProfile;
    method name preserved (external API on RuleManagerInterface per plan v2 §2.4)
- pkg/rulemanager/rulepolicy.go: Validate takes *v1beta1.ContainerProfile
  and reads cp.Spec.PolicyByRuleId
- pkg/rulemanager/ruleadapters/creator.go: both AP + NN branches use
  ContainerProfileCache().GetContainerProfileState (unified state source)

Plan: .omc/plans/containerprofile-cache-unification-consensus.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: ObjectCache aggregator CP-only + collapse 2 callbacks to 1 (step 6d)

- pkg/objectcache/objectcache_interface.go: drop ApplicationProfileCache()
  and NetworkNeighborhoodCache() methods — the aggregator is now
  {K8s, ContainerProfile, Dns}
- pkg/objectcache/v1/objectcache.go: 3-arg NewObjectCache(k, cp, dc)
- pkg/containerwatcher/v2/container_watcher_collection.go:63-64: two
  ContainerCallback subscriptions (APC + NNC) collapse to one (CPC)
- cmd/main.go: both branches (runtime-detection + mock) construct only
  ContainerProfileCache + Dns; legacy APC/NNC wiring removed with startup
  log: "ContainerProfileCache active; legacy AP/NN caches removed"
- test call sites updated for 3-arg NewObjectCache

Legacy packages still physically present (imports retained where still
referenced, e.g. callstackcache); step 8 deletes them entirely.

Plan: .omc/plans/containerprofile-cache-unification-consensus.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: delete legacy AP/NN cache packages + move callstackcache (step 8)

- git rm -r pkg/objectcache/applicationprofilecache/ (766 LOC)
- git rm -r pkg/objectcache/networkneighborhoodcache/ (758 LOC)
- git rm pkg/objectcache/applicationprofilecache_interface.go
- git rm pkg/objectcache/networkneighborhoodcache_interface.go
- mv pkg/objectcache/applicationprofilecache/callstackcache/
    -> pkg/objectcache/callstackcache/ (domain-agnostic, shared)
- Update 4 importers: containerprofilecache_interface.go, v1/mock.go,
  containerprofilecache.go, reconciler.go
- RuleObjectCacheMock drops ApplicationProfileCache()/NetworkNeighborhoodCache()
  accessor methods; SetApplicationProfile/SetNetworkNeighborhood remain as
  test helpers that project into the unified CP
- projection.go comments kept as historical source pointers — git history
  preserves the originals

Plan: .omc/plans/containerprofile-cache-unification-consensus.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add T2 init-eviction, T5 packages-deleted, T7 lock-stress (step 9 partial)

- tests/containerprofilecache/packages_deleted_test.go: go/packages
  dep-graph assertion that legacy AP/NN paths are absent
- tests/containerprofilecache/lock_stress_test.go: 100-goroutine
  interleaved seed/read for same container IDs, 5s budget, race-safe
- tests/containerprofilecache/init_eviction_test.go: T2a (event-path
  evict) + T2b (reconciler-path evict for missed RemoveContainer)
- tests/containerprofilecache/helpers_test.go: shared test builders
- pkg/objectcache/containerprofilecache: export ReconcileOnce and
  SeedEntryForTest as out-of-package test hooks
- Makefile: check-legacy-packages target

T1 (golden-alert parity) and T3 (memory benchmark) are release-checklist
items per plan v2 §2.7 — the pre-migration baselines those tests require
can no longer be captured from this branch.

Plan: .omc/plans/containerprofile-cache-unification-consensus.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address Phase 4 review P1 findings

1. Drop ReleaseLock on delete paths (containerprofilecache.go deleteContainer,
   reconciler.go reconcileOnce). Security review flagged a race where the
   deleted mutex could be orphaned while a concurrent GetLock creates a new
   one, breaking mutual exclusion for the same container ID. Trade-off:
   bounded memory growth of stale lock entries, proportional to container
   churn — acceptable for a node-agent lifetime.

2. Extract emitOverlayMetrics helper (metrics.go) to de-duplicate the
   ~20-line overlay metric/deprecation-warn block between buildEntry
   (addContainer path) and rebuildEntry (refresh path). Keeps the two in
   lockstep — code review flagged silent drift risk.

Not addressed in this commit (plan-accepted tradeoffs, follow-up work):
- Shared-pointer read-only invariant is convention-enforced, not type-
  enforced (plan v2 §2.3 step 7, ADR consequences). Retaining as-is;
  downstream consumers must not mutate.
- Storage RPC context propagation (requires storage.ProfileClient interface
  change, out of scope for this migration).

Plan: .omc/plans/containerprofile-cache-unification-consensus.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: retry pending ContainerProfile GETs when CP appears after container-start

Component tests on PR kubescape#788 regressed with "All alerts: []" and 54+
"container X not found in container-profile cache" log entries. Root cause:
addContainer did a one-shot GetContainerProfile at EventTypeAddContainer
time and bailed on 404. But the CP is created asynchronously by
containerprofilemanager ~60s AFTER container-start, so the one-shot GET
almost always missed; the cache entry was never created; rule evaluation
short-circuited as "no profile".

Legacy caches hid this via a periodic ListProfiles scan that picked up
late-arriving profiles on the next tick. The point-lookup model dropped
that scan. This commit adds an equivalent: a pending-container retry path
in the reconciler.

Changes:
- CachedContainerProfile unchanged; new pendingContainer struct captures
  (container, sharedData, cpName) needed to retry the initial GET.
- ContainerProfileCacheImpl.pending SafeMap records containerIDs waiting
  for their CP to land in storage.
- addContainer extracts the populate/GET into tryPopulateEntry. On miss
  (err or nil CP) it records a pending entry; the per-container goroutine
  exits. No more waiting 10 min inside addContainerWithTimeout.
- reconciler.retryPendingEntries iterates pending under per-container
  locks, re-issues the GET, and promotes via tryPopulateEntry on success.
- reconcileOnce gains a pending GC pass: containers whose pod is gone or
  whose status is not Running get dropped from pending so we don't retry
  forever on terminated containers.
- deleteContainer also clears from pending on EventTypeRemoveContainer.
- metrics: cache_entries gauge gains a "pending" kind; reconciler
  eviction counter gets a "pending_pod_stopped" reason.

Tests:
- TestRetryPendingEntries_CPCreatedAfterAdd: 404 on add -> pending; CP
  arrives in storage -> one tick promotes; exactly 2 GetCP calls.
- TestRetryPendingEntries_PodGoneIsGCed: pending entry dropped when its
  pod is no longer present in k8s cache.

Full findings and resume doc at
  .omc/plans/containerprofile-cache-component-test-findings.md

Follow-up plan updated at
  .omc/plans/containerprofile-cache-followups.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
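The pending-retry flow above reduces to a small promote-on-tick loop. This is a sketch with maps standing in for the real SafeMaps, storage client, and locking; only the control flow matches the commit:

```go
package main

import "fmt"

// store stands in for profile storage; container profiles appear
// asynchronously, well after the container-start event.
type store map[string]string

// pendingCache sketches the retry flow: a miss at add time records the
// container ID in pending; each reconciler tick re-issues the GET and
// promotes into entries on success.
type pendingCache struct {
	entries map[string]string
	pending map[string]bool
	s       store
}

func (c *pendingCache) addContainer(id string) {
	if p, ok := c.s[id]; ok {
		c.entries[id] = p
		return
	}
	c.pending[id] = true // CP not in storage yet: retry on the next tick
}

func (c *pendingCache) tick() {
	for id := range c.pending {
		if p, ok := c.s[id]; ok {
			c.entries[id] = p
			delete(c.pending, id)
		}
	}
}

func main() {
	c := &pendingCache{entries: map[string]string{}, pending: map[string]bool{}, s: store{}}
	c.addContainer("web-1") // one-shot GET misses: goes pending
	c.tick()                // still pending, nothing in storage
	c.s["web-1"] = "profile-v1"
	c.tick() // profile landed: promoted
	fmt.Println(c.entries["web-1"], len(c.pending))
}
```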

* fix: cache correctness — right CP slug, partial-on-restart, overlay refs, resurrection guard

PR kubescape#788 component tests continued failing after the pending-retry fix.
Deep investigation uncovered a fundamental slug misuse and three reviewer-
reported correctness gaps. All fixed here.

### Primary bug: wrong slug function

plan v2 §2.3 asserted that GetOneTimeSlug(false) was deterministic. It is
NOT — implementation at k8s-interface v0.0.206:
  func (id *InstanceID) GetOneTimeSlug(noContainer bool) (string, error) {
      u := uuid.New()
      hexSuffix := hex.EncodeToString(u[:])
      ...
  }
So containerprofilemanager.saveContainerProfile writes a *time-series* CP
per tick with a fresh UUID suffix, and the storage-side
ContainerProfileProcessor.consolidateKeyTimeSeries writes the consolidated
profile at the STABLE slug (GetSlug(false), no UUID).

The cache was querying for CPs at GetOneTimeSlug(false), so every GET 404'd
forever — even with the pending-retry in place. 13 component tests failed
with "All alerts: []" and 38+ "container X not found in container-profile
cache" log entries.

Switched addContainer to GetSlug(false). The refresh path inherits the
corrected name via entry.CPName.

### Reviewer #1: resurrection during refresh

refreshAllEntries snapshots entries without a lock. Between snapshot and
per-entry lock acquisition, deleteContainer or reconcile-evict may have
removed the entry. Previously, rebuildEntry's c.entries.Set(id, newEntry)
would resurrect the dead container.

Added a load-under-lock guard at the top of refreshOneEntry.

### Reviewer #2: overlay handling regressions (two parts)

(a) tryPopulateEntry returned "pending" on base-CP 404 BEFORE trying
user-AP/NN. Containers with only a user-defined profile (no base CP yet)
got no entry. Restructured: fetch base CP and user-AP/NN independently;
populate if ANY source is available; synthesize an empty base CP when only
the overlay exists so projection has something to merge into.

(b) UserAPRef / UserNNRef were only recorded on successful fetch. A
transient 404 on add would permanently drop the overlay intent — the
refresh path had nothing to re-fetch. Now, when the label is set, the
refs are always recorded, using the label's name and the container's
namespace. Refresh retries the fetch each tick.

### Reviewer #3: partial profiles reused across container restart

tryPopulateEntry blindly used whatever CP existed at the stable slug,
including Partial completions from the previous container incarnation.
Legacy caches explicitly deleted Partial profiles on non-PreRunning
restart so rule evaluation fell through to "no profile" until Full
arrived.

Now: if CP.completion == Partial && !sharedData.PreRunningContainer, we
treat the CP as absent → stay pending → retry each tick. When the CP
becomes Full (or the container stops), the pending state resolves.

The inverse is preserved: PreRunningContainer (agent-restart scenario)
accepts the Partial CP as-is so Test_19's "alert on partial profile"
semantics still work.

### Tests

Five new unit tests, all race-clean:
- TestPartialCP_NonPreRunning_StaysPending
- TestPartialCP_PreRunning_Accepted
- TestOverlayLabel_TransientFetchFailure_RefsRetained
- TestRefreshDoesNotResurrectDeletedEntry
- TestUserDefinedProfileOnly_NoBaseCP

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
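The partial-on-restart gate described above boils down to a single predicate; this sketch uses simplified stand-in names, not the real completion constants or entry fields:

```go
package main

import "fmt"

// acceptProfile sketches the partial-on-restart gate: a Partial profile
// is only usable when the agent (not the container) restarted, i.e. the
// container was already running before the agent attached.
func acceptProfile(completion string, preRunningContainer bool) bool {
	if completion == "Partial" && !preRunningContainer {
		return false // treat as absent: stay pending and retry each tick
	}
	return true
}

func main() {
	fmt.Println(acceptProfile("Partial", false)) // fresh restart: rejected
	fmt.Println(acceptProfile("Partial", true))  // agent restart: accepted
	fmt.Println(acceptProfile("Full", false))    // full profile: accepted
}
```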

* fix: read workload-level AP/NN as primary data source

The storage server's consolidated ContainerProfile is not exposed via the
public k8s API — ContainerProfiles().Get(stableName) returns 404 even after
consolidation runs. Only time-series CPs (named <stable>-<UUID>) and the
server-aggregated ApplicationProfile / NetworkNeighborhood CRs at the
workload-name are queryable.

The component tests' WaitForApplicationProfileCompletion waits for the
workload-level AP/NN completion — that's what actually exists. The legacy
caches read these directly; we do the same now while the server-side
consolidated-CP plumbing is completed.

Changes:
- addContainer computes both cpName (per-container, forward-compat) and
  workloadName (per-workload, where AP/NN live) via GetSlug(false) and
  GetSlug(true) respectively.
- tryPopulateEntry fetches consolidated CP (kept for forward-compat),
  workload AP, and workload NN. Treats the workload AP/NN as the primary
  data source when the consolidated CP isn't available.
- projection pre-merges workloadAP + workloadNN onto the base (synthesized
  when CP is 404), then buildEntry applies user-overlay AP/NN on top.
- Partial-on-restart gate extended to cover workload AP/NN too — non
  PreRunning containers ignore partial workload profiles until they
  become Full, mirroring legacy deletion-on-restart semantics.
- pendingContainer gains workloadName so retries re-fetch the right CRs.
- fakeProfileClient gains overlayOnly field so tests can scope AP/NN
  returns to the overlay name; existing TestOverlayPath_DeepCopies updated
  accordingly.

Forward-compat: once storage publishes a queryable consolidated CP at
cpName, its fetch becomes primary and the workload AP/NN path becomes a
fallback. No API changes are required to make that transition — just drop
the workload-level fetches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* debug: add tick-loop start log + change-detection log in reconciler

* fix: remove overly-aggressive pending GC that dropped entries before retry

CI run 24781030436 (commit ce32919) proved the reconciler IS ticking with
retryPendingEntries running, but the pending-GC pass in reconcileOnce was
dropping every pending entry on the first tick (pending_before=4 →
pending_after=0 at the FIRST tick, before retryPendingEntries could run).

Root cause: the GC pass asked k8sObjectCache.GetPod(ns, pod) and also
checked isContainerRunning. On a busy node, the k8s pod cache and
ContainerStatuses lag the containerwatcher Add event by tens of seconds.
So "pod not found" or "container not yet Running" routinely returned true
for a container that had just been registered, causing GC to remove the
pending entry immediately. Retries then ran against an empty pending map
→ no promotions → alerts fired without profile → test failure.

Change: remove the pending GC entirely. Cleanup for terminated containers
flows through deleteContainer (EventTypeRemoveContainer) which clears
both entries and pending under the per-container lock. Memory growth is
bounded by the node's container churn (containers that never got a
profile during their lifetime).

Test updated: TestRetryPendingEntries_PodGoneIsGCed replaced by
TestPendingEntriesAreNotGCedBeforeRetry which asserts the new semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: merge user-managed AP/NN and refresh workload-level sources

Two component-test regressions in PR kubescape#788:

Fix A (Test_12 / Test_13): the cache now reads the user-managed
ApplicationProfile and NetworkNeighborhood published at
"ug-<workloadName>" and projects them onto the base profile as a
dedicated ladder pass. Legacy caches did this via the
`kubescape.io/managed-by: User` annotation in handleUserManagedProfile;
we read them directly by their well-known name.

Fix B (Test_17 / Test_19): the reconciler refresh path re-fetches the
workload-level AP/NN (and user-managed / label-referenced overlays) on
every tick, not just the consolidated CP. This propagates the Status=
"ready" -> "completed" transition into the cached ProfileState, which
flips fail_on_profile from false to true at rule-eval time.

CachedContainerProfile gained WorkloadName plus WorkloadAPRV /
WorkloadNNRV / UserManagedAPRV / UserManagedNNRV fields so the refresh
can fast-skip when every source's RV matches what's cached.
refreshOneEntry's rebuild now runs the same projection ladder as
tryPopulateEntry: base CP (or synthesized) → workload AP+NN →
user-managed (ug-) AP+NN → label-referenced user AP+NN.

Also:
- Tick-loop log only fires when entries OR pending count actually moved
  (previously fired whenever pending_before > 0, producing per-tick
  noise while a stuck-pending entry waited for profile data).
- fakeProfileClient in tests returns userManagedAP/userManagedNN when
  the requested name starts with "ug-".
- New tests: TestWorkloadAPMerged_AndRefreshUpdatesStatus (Fix B
  happy-path) and TestUserManagedProfileMerged (Fix A happy-path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: reconcileOnce no longer evicts on pod-cache lag, only on Terminated

CI run 24783250693 (commit 32a76c0) showed reconcileOnce evicting live
entries every tick: "entries_before:10, entries_after:0" within 5 seconds
of the agent starting. Same class of bug as the pending-GC fix (c45803f):
the k8s pod cache lags ContainerCallback Add events by tens of seconds,
and evicting on "GetPod returns nil OR !isContainerRunning" churned every
entry before any rules could evaluate.

Change reconcileOnce eviction gate:
- If pod is missing from k8s cache: DO NOT evict. Cache lag is transient;
  deleteContainer handles real-world cleanup via EventTypeRemoveContainer.
- If the pod is present and the container is clearly Terminated: evict
  (preserves init-container eviction for Test_02 and T2 acceptance).
- If the pod is present and the container is in Waiting state: retain (new
  container creation and init-container pre-run both legitimately pass
  through Waiting).

New helper isContainerTerminated mirrors isContainerRunning but gates on
State.Terminated only; "not found in any status" treated as terminated.

Tests:
- TestReconcilerEvictsWhenPodMissing → TestReconcilerKeepsEntryWhenPodMissing
- New TestReconcilerEvictsTerminatedContainer
- New TestReconcilerKeepsWaitingContainer

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
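The corrected eviction gate can be sketched as one function over simplified stand-in types (not the real corev1 pod status API):

```go
package main

import "fmt"

type containerState int

const (
	stateWaiting containerState = iota
	stateRunning
	stateTerminated
)

type containerStatus struct {
	id    string
	state containerState
}

// shouldEvict sketches the corrected reconciler gate: only a clearly
// Terminated container is evicted; a missing pod or an empty status
// list is treated as cache lag and the entry is retained.
func shouldEvict(podFound bool, statuses []containerStatus, id string) bool {
	if !podFound {
		return false // pod cache lag is transient; RemoveContainer cleans up
	}
	if len(statuses) == 0 {
		return false // kubelet lag guard for brand-new pods
	}
	for _, s := range statuses {
		if s.id == id {
			return s.state == stateTerminated // Waiting/Running: retain
		}
	}
	return true // not found in any status: treat as terminated
}

func main() {
	fmt.Println(shouldEvict(false, nil, "c1"))
	fmt.Println(shouldEvict(true, []containerStatus{{"c1", stateWaiting}}, "c1"))
	fmt.Println(shouldEvict(true, []containerStatus{{"c1", stateTerminated}}, "c1"))
}
```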

* refactor: drop workload-level AP/NN fetch; CP-direct reading is authoritative

The workload-level AP/NN fetch added in d27be01 was a workaround for the
eviction/GC bugs (fixed in c45803f and d9ae0ac), not an architectural
need. The consolidated ContainerProfile IS queryable at the GetSlug(false)
name once storage aggregation runs; the cache simply needs to wait on the
pending-retry path.

This reverts the workload-AP/NN read while keeping:
- consolidated CP as the single base-profile source
- user-managed AP/NN at "ug-<workloadName>" (merged on top) — still needed
  because user-managed profiles are authored independently and are not
  consolidated into the CP server-side
- user-defined overlay via pod UserDefinedProfileMetadataKey label
- eviction fix (d9ae0ac), GC fix (c45803f), resurrection guard

Removed:
- workload-AP/NN fetch in tryPopulateEntry and refreshOneEntry
- WorkloadAPRV / WorkloadNNRV fields on CachedContainerProfile and the
  corresponding rebuildEntryFromSources ladder pass
- Partial-on-restart gate for workload AP/NN (only applies to CP now)
- Synth-CP annotation fallback chain (simplified to Completed/Full)

Tests:
- TestWorkloadAPMerged_AndRefreshUpdatesStatus → TestRefreshUpdatesCPStatus
  (CP now the source; RV transition propagates Status)
- TestUserManagedProfileMerged rewired to use a real base CP + ug- overlay
  instead of workloadAP + ug- overlay

This matches the migration plan's original intent: CP-direct, AP/NN only
as user overlays.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: synthetic entry CPName override, PodUID backfill, phase-labeled reconciler histogram

Three review findings from the post-green audit.

### 1 (High) — synthetic entry stored the wrong CPName

When tryPopulateEntry synthesized a CP (consolidated CP still 404), the
synthetic name was workloadName or overlayName, and buildEntry persisted
entry.CPName = cp.Name (i.e. the synthetic name). refreshOneEntry then
queried the synthetic name instead of the real GetSlug(false) name; with
the stored RV also empty, the fast-skip's "absent matches empty" branch
kept the synthetic entry forever and the real consolidated CP could never
replace it.

Fix: after buildEntry, override entry.CPName = cpName (the real
GetSlug(false) result passed into tryPopulateEntry).

### 2 (Medium) — PodUID never backfilled

buildEntry only sets PodUID when the pod is already in k8sObjectCache at
add time. On busy nodes the pod cache lags, so addContainer often runs
before the pod lands and PodUID stays "". isContainerTerminated's
empty-ContainerID fallback matches against (ContainerName, PodUID);
when PodUID == "" and the status also has empty UID, the loop falls
through and returns true (treat as terminated) — evicting a still-live
init container. rebuildEntryFromSources copied prev.PodUID unchanged, so
the error never healed.

Fix: in rebuildEntryFromSources, if prev.PodUID is empty AND the pod is
now in the k8s cache, use the fresh UID.

### 3 (Low) — reconciler duration histogram mixed two phases

tickLoop (evict) and refreshAllEntries (refresh) both emitted
ReportContainerProfileReconcilerDuration into the same plain Histogram,
so nodeagent_containerprofile_reconciler_duration_seconds was a blend of
two very different workloads. Plan v2 §2.9 had specified a HistogramVec
with a "phase" label from the start.

Fix: MetricsManager.ReportContainerProfileReconcilerDuration(phase, d).
Prometheus implementation becomes a HistogramVec with phase label.
tickLoop emits phase="evict", refreshAllEntries emits phase="refresh".
MetricsMock/MetricsNoop signatures updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address all CodeRabbit review comments on PR kubescape#788

- ContainerProfileCacheMock.GetContainerProfileState returns synthetic
  error state instead of nil, matching the real impl's contract
- Remove IgnoreContainer check on EventTypeRemoveContainer to prevent
  stale entries when pod labels change after Add
- Deep-copy userAP/userNN in mergeApplicationProfile and
  mergeNetworkNeighborhood to eliminate aliasing of nested slices
  (Execs[i].Args, Opens[i].Flags, MatchExpressions[i].Values, etc.)
  into the cached ContainerProfile
- Fix Shared=true bug: buildEntry now takes userManagedApplied bool;
  fast-path only sets Shared=true when no overlay was applied at all,
  matching rebuildEntryFromSources logic in reconciler.go
- isContainerTerminated returns false when all status slices are empty
  (kubelet lag guard for brand-new pods)
- Fix misplaced doc comment above GetContainerProfile in storage layer
- Remove unused (*stubStorage).setCP test helper
- Lock stress test evict path now uses ContainerCallback(Remove) to
  exercise deleteContainer and per-container locking
- RuleObjectCacheMock stores per-container profiles in cpByContainerName;
  GetContainerProfile resolves via InstanceID.GetContainerName();
  GetContainerProfileState returns synthetic error state

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
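The aliasing issue the deep-copy fix addresses is easy to reproduce in miniature. This sketch uses a hypothetical execCall type standing in for profile entries with nested slices like Execs[i].Args:

```go
package main

import "fmt"

// execCall stands in for a profile entry with nested slices
// (e.g. Execs[i].Args); a plain struct copy shares the backing arrays.
type execCall struct {
	Path string
	Args []string
}

// deepCopy clones the nested slice so the cached profile cannot be
// mutated through the caller's original slice.
func deepCopy(e execCall) execCall {
	return execCall{
		Path: e.Path,
		Args: append([]string(nil), e.Args...),
	}
}

func main() {
	src := execCall{Path: "/bin/sh", Args: []string{"-c", "id"}}
	shallow := src        // shares the Args backing array
	deep := deepCopy(src) // independent copy
	src.Args[0] = "-x"    // mutate through the original
	fmt.Println(shallow.Args[0], deep.Args[0])
}
```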

* feat: thread context.Context through ProfileClient and add per-call RPC budget

All five ProfileClient methods now accept ctx as their first argument so
callers can enforce cancellation and deadline propagation. Each storage RPC
in the reconciler is wrapped via refreshRPC(ctx, ...) which applies a
configurable per-call timeout (config.StorageRPCBudget, default 5 s) on top
of the parent context, preventing a slow API server from stalling an entire
reconciler burst. Tests cover the fast-skip, rebuild, and context-cancellation
mid-RPC paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test: shared-pointer race-fuzz test + WarmContainerLocksForTest helper

Add TestSharedPointerReadersDoNotCorruptCache: 50 concurrent readers
traverse the returned *ContainerProfile slices while a writer goroutine
alternately calls RefreshAllEntriesForTest + SeedEntryForTest to keep
entry rebuilds active. Runs for 500ms under -race, proving the shared-
pointer fast-path never produces a concurrent read/write pair.

Also add TestSharedPointerFastPathPreservesPointerIdentity: after a
refresh against a storage object with a newer RV, the new entry's
Profile pointer IS the storage object (Shared=true, no DeepCopy), which
keeps the T3 memory budget intact.

Fix the pre-existing goradd/maps SafeMap initialisation race in
TestLockStressAddEvictInterleaved by pre-warming containerLocks via the
new WarmContainerLocksForTest helper (the previous pre-warm via
SeedEntryForTest only covered the entries SafeMap, not containerLocks).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: document SetApplicationProfile / SetNetworkNeighborhood field partition in mock

Add a block comment above RuleObjectCacheMock spelling out the non-overlapping
cp.Spec field partition between the two setters and the first-container-wins
rule for r.cp. Without this, future callers risk aliasing NN fields into an
AP-only profile or vice-versa.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: T8 integration mirror, mock setter contract doc, SeedEntryWithOverlayForTest

Add SeedEntryWithOverlayForTest helper so out-of-package integration tests can
set UserAPRef / UserNNRef (which use the internal namespacedName type) without
requiring the type to be exported.

Mirror TestT8_EndToEndRefreshUpdatesProjection at tests/containerprofilecache/
using only the public + test-helper API: seeds an entry with a stale UserAPRV,
mutates storage to apV2 (RV=51), asserts RefreshAllEntriesForTest rebuilds the
projection with the new execs and drops the stale ones.

Add top-of-file block comment to RuleObjectCacheMock documenting the non-
overlapping AP-fields / NN-fields partition between SetApplicationProfile and
SetNetworkNeighborhood and the first-container-wins rule for r.cp.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address Phase 4 code-review findings

- reconciler.go: simplify dead-code cpErr/rpcErr guard (refreshRPC returns
  exactly cpErr; the rpcErr != nil && cpErr == nil branch could never fire)
- reconciler_test.go: make blockingProfileClient.blocked a buffered chan(1)
  with a blocking send so the signal is never silently dropped; bump
  rpcBudget to 100ms and timeout to 2s to reduce flakiness on loaded CI
- containerprofilecache.go: extract defaultStorageRPCBudget const alongside
  defaultReconcileInterval for discoverability
- shared_pointer_race_test.go: fix gofmt const-block alignment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: preserve cached entry when overlay AP/NN fetch fails transiently

Before this fix, a refreshRPC timeout on any overlay GET (user-managed
ug-<workload> AP/NN or user-defined label-referenced AP/NN) left the
overlay variable nil with the error silently discarded. The RV comparison
then saw rvOf(nil)="" != cached RV (e.g. "50"), treated it as a removal,
and rebuilt the entry without the overlay — temporarily stripping
user-managed/user-defined profile data from the cache and altering
alerting until the next successful tick.

Fix: capture each overlay's fetch error and, when it is non-nil and the
entry already has a non-empty cached RV for that overlay, return early
and keep the existing entry unchanged. Legitimate deletions (nil with
err==nil) still propagate correctly. Mirrors the existing CP error-
preservation logic at refreshOneEntry:272-288.

Add TestRefreshPreservesEntryOnTransientOverlayError covering all four
overlay fetch paths (user-managed AP, user-managed NN, user-defined AP,
user-defined NN) via a new overlayErrorClient stub.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address CodeRabbit review issues on PR kubescape#788

- Rename 5 CP cache metrics from nodeagent_* to node_agent_* to match
  the existing metric namespace convention used across node-agent.
- Route all 5 storage GETs in tryPopulateEntry through refreshRPC so
  they respect the per-call SLO (default 5s); prevents a hung GET from
  blocking the entire reconciler tick loop when called from
  retryPendingEntries.
- Add WarmPendingForTest helper to pre-initialise the pending SafeMap
  before concurrent test phases, preventing the goradd/maps
  nil-check-before-lock initialisation race.
- Pre-warm pending SafeMap in TestLockStressAddEvictInterleaved and
  poll for async deleteContainer goroutines to drain before asserting
  goroutine count.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: distinct RNG seed per stress-test worker

Pass worker index into each goroutine closure and mix it into the
rand.NewSource seed (time.Now().UnixNano() + int64(worker)), so that
100 concurrently-launched goroutines don't all receive the same
nanosecond timestamp and end up with identical add/evict sequences.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: move test helpers out of production source into testing.go

The six *ForTest / ReconcileOnce helpers were previously mixed into
containerprofilecache.go alongside production logic. Move them to a
dedicated testing.go file in the same package.

export_test.go is the idiomatic alternative but is compiled only when
running tests in the same directory; test packages in other directories
(tests/containerprofilecache/) import the non-test version of the
package and never see _test.go contents. A plain testing.go is the
correct pattern here — it signals "test support" by name and groups all
scaffolding in one place, while remaining importable by any test binary.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: move integration tests into package dir; use export_test.go

export_test.go (package containerprofilecache) is only compiled during
`go test` so test helpers never enter the production binary. This only
works when callers are in the same directory; the prior layout put tests
in tests/containerprofilecache/ (a separate package), forcing helpers
into a plain testing.go that shipped in the binary.

Moving the six test files into pkg/objectcache/containerprofilecache/
as package containerprofilecache_test fixes this correctly:
- export_test.go replaces testing.go (test-binary-only)
- package declaration: containerprofilecache_integration → containerprofilecache_test
- packages_deleted_test.go Dir path: ../.. → ../../.. (module root)
- tests/containerprofilecache/ directory removed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
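The layout distinction above can be sketched with a minimal, hypothetical two-file example (names are illustrative, not the real cache API):

```go
// mylib/export_test.go — declared in package mylib (the production
// package), but compiled only when `go test` runs in this directory,
// so the helper never ships in the production binary.
package mylib

func (c *Cache) ResetForTest() { c.entries = make(map[string]entry) }
```

```go
// mylib/cache_test.go — declared in package mylib_test (the external
// test package, same directory). It links into the same test binary,
// so it can call both the public API and the export_test.go helpers.
package mylib_test
```

A test file in a different directory imports only the compiled production package and never sees `_test.go` contents, which is why the prior `tests/containerprofilecache/` layout could not use this mechanism.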

* fix: nil out overlay pointers when k8s client returns zero-value on 404

The Kubernetes generated client (gentype.Client.Get) pre-allocates a
zero-value struct before the HTTP call and returns it as the result even
on error (e.g. 404 not-found). In refreshOneEntry, the four overlay
fetch paths (userManagedAP, userManagedNN, userAP, userNN) guarded only
the "transient error with cached RV → keep old entry" branch; the
"first-time 404, no cached RV" branch fell through with a non-nil
empty-ObjectMeta struct still in the pointer, which reached
rebuildEntryFromSources → emitOverlayMetrics and logged spurious
"user-authored legacy profile merged" warnings with empty
namespace/name/resourceVersion fields.

Add an explicit nil-out after each non-returning error branch, mirroring
the pattern already used in tryPopulateEntry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* feat: extract client CA file from kubelet config YAML and enhance service file handling (kubescape#791)

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* add learning period label to TS CPs (kubescape#797)

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

* perf: switch to kubescape/syft v1.32.0-ks.2 + disable file catalogers (kubescape#798)

* perf: disable file-digest/metadata/executable catalogers

These three catalogers iterate every file in the scan tree and dominate
transient allocation, but their outputs are not consumed by the OOM-relevant
SBOM path. Disabling them saves ~200 MB peak RSS on gitlab-ee (main) and
stacks with upstream selective-indexing + binary-prefilter improvements to
~1.12 GB total (vs 1.62 GB baseline, fits 1.5 GB cgroup).

Signed-off-by: Ben <ben@armosec.io>

* deps: switch to kubescape/syft v1.32.0-ks.2 for memory reduction

Routes anchore/syft imports to the kubescape fork via replace directive.
The fork carries selective indexing + binary-cataloger pre-filtering on
top of v1.32.0; combined with the file-cataloger disable in the parent
commit, this reduces gitlab-ee scan peak RSS from 1,621 MB to 1,123 MB.

Refs: NAUT-1283
Signed-off-by: Ben <ben@armosec.io>

* fix: check dep.Replace for actual fork version; add cataloger removals to sidecar

- packageVersion() now returns dep.Replace.Version when present so the fork
  tag (v1.32.0-ks.2) propagates to runtime metadata and version-gating logic
- pkg/sbomscanner/v1/server.go: add the same WithCatalogerSelection/WithRemovals
  as sbom_manager.go so both SBOM paths drop the file-digest/metadata/executable
  catalogers and keep their memory behaviour consistent
Signed-off-by: Ben <ben@armosec.io>

* fix: keep syft tool version at required version

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Signed-off-by: Ben <ben@armosec.io>
Co-authored-by: Matthias Bertschy <matthias.bertschy@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
Signed-off-by: Ben <ben@armosec.io>
Co-authored-by: Matthias Bertschy <matthias.bertschy@gmail.com>
Co-authored-by: Ben Hirschberg <59160382+slashben@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Entlein <eineintlein@gmail.com>