fix(chart): roll operator pod when its configmap changes by nddq · Pull Request #2162 · microsoft/retina

nddq · 2026-04-07T23:16:28Z

Description

The operator Deployment's pod template has no checksum/config annotation, so helm upgrade doesn't roll the operator when its ConfigMap rendering changes. The operator reads its config once at startup and caches it, so values like remoteContext stay stale until the pod is manually restarted — leaving the operator unable to create retinaendpoint resources that agents in remote-context mode rely on.

This PR adds the annotation, matching what daemonset.yaml already does. The operator ConfigMap is split out of operator.yaml into its own operator-configmap.yaml so the include used by the checksum doesn't recurse into itself.

Found while testing #2152 on a live cluster.

Related Issue

Relates to #2152

Checklist

I have read the contributing guidelines
I have signed the CLA
Unit tests added/updated — N/A, chart template change
Integration tests added/updated — N/A, chart template change
Documentation updated — N/A
Helm chart updated

Testing

helm lint deploy/standard/manifests/controller/helm/retina/
# 1 chart(s) linted, 0 chart(s) failed

Verified the checksum annotation actually moves when operator config values change:

HASH1=$(helm template test deploy/standard/manifests/controller/helm/retina/ \
  --set operator.enabled=true --set remoteContext=false | grep 'checksum/config' | tail -1)
HASH2=$(helm template test deploy/standard/manifests/controller/helm/retina/ \
  --set operator.enabled=true --set remoteContext=true  | grep 'checksum/config' | tail -1)
# different → pod will roll on the next helm upgrade

The operator Deployment's pod template was missing a `checksum/config` annotation, so `helm upgrade` would rewrite `retina-operator-config` without rolling the operator pod. The operator process reads its config once at startup and caches it, so any value change (e.g. flipping `remoteContext`) silently required a manual `kubectl rollout restart`. Found while testing the tcpretrans plugin on a cluster: after enabling `remoteContext=true` via `helm upgrade --reuse-values`, every advanced metric (`adv_drop_count`, `adv_forward_count`, `adv_dns_*`, `adv_tcpretrans_count`) began emitting `source_*="unknown"` and `destination_*="unknown"` for in-cluster pod IPs. Root cause: the daemonset rolled (its template does have the checksum annotation) so agents came up with `remoteContext=true` and started waiting for `retinaendpoint` CRDs per `cmd/standard/daemon.go:252-267`, while the stale operator with `RemoteContext=false` never ran the pod controller at `operator/cmd/standard/deployment.go:202-219` and never created any `retinaendpoint` resources. Agent cache stayed empty, enricher returned nil on every `GetObjByIP`, and every advanced metric emitted "unknown" labels. `kubectl rollout restart deploy/retina-operator` resolved it once; this change makes `helm upgrade` do the right thing. Changes: - Split the operator ConfigMap out of `operator.yaml` into its own `operator-configmap.yaml` template so the Deployment can reference it via `include` without recursing into itself. - Add `checksum/config` annotation on the operator Deployment pod template, pointing at the new ConfigMap template file. Matches the pattern already used by `daemonset.yaml`. Verification: - `helm lint` clean. - `helm template` renders the ConfigMap (1x) and Deployment with the annotation populated (`checksum/config: <sha>`). - Rendering with `remoteContext=false` vs `remoteContext=true` produces different hashes, confirming the pod template changes whenever the ConfigMap changes. Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>

github-actions · 2026-04-07T23:26:04Z

Retina Code Coverage Report

Total coverage no change

Decreased diff

Impacted Files	Coverage
pkg/controllers/operator/retinaendpoint/retinaendpoint_controller.go	`83.28%` ... `82.25%` (`-1.03%`)	⬇️

# Description **PR 1 of 2** for removing inspektor-gadget from retina plugins (split from #2148 per review feedback) Cilium v1.19.2 pins `cilium/ebpf` at a version that removed `CollectionSpec.RewriteConstants()`, which inspektor-gadget relies on. Upgrading IG is not viable — v0.42.0+ removed the built-in `trace/tcpretrans` gadget entirely. This PR replaces the IG-based implementation with a native cilium/ebpf one ahead of the Cilium upgrade. - **BPF program**: New `tracepoint/tcp/tcp_retransmit_skb` program (kernel 5.8+ for `bpf_ktime_get_boot_ns`) with `BPF_CORE_READ` for CO-RE field relocation. Handles the kernel 6.17 tracepoint struct rename via CO-RE type flavors. TCP flags read from `tcp_skb_cb`. Supports IPv4 and IPv6. - **Go plugin**: Perf buffer event loop with worker goroutines, replacing the IG tracer/gadget-context pattern. `SetupChannel` now wires up the external channel (was a no-op under IG), enabling the advanced metrics module to receive events directly. - **Build**: Pre-compiled via `bpf2go@v0.18.0` with per-arch targets (amd64 + arm64) and embedded in the binary — no runtime compilation. `Compile()` and `Generate()` retained as no-ops for plugin manager lifecycle compatibility. - **`ToFlow` IPv6 labeling** (`pkg/utils/flow_utils.go`): now derives `IpVersion` from `sourceIP.To4()` instead of hardcoding `IPVersion_IPv4`, so IPv6 retransmission flows are labeled correctly. Backward-compatible for every existing IPv4 caller. - **No dependency changes**: Works with existing `cilium/ebpf` v0.18.0 in `go.mod`. ## Related Issue Partial fix for #1788 Split from #2148 Related: #2162 (chart fix surfaced while validating this PR — latent helm-upgrade reload bug for the operator ConfigMap) ## Checklist - [x] I have read the [contributing documentation](https://retina.sh/docs/Contributing/overview). - [x] I signed and signed-off the commits (`git commit -S -s ...`). - [x] I have correctly attributed the author(s) of the code. - [x] I have tested the changes locally. - [x] I have followed the project's style guidelines. - [x] I have updated the documentation, if necessary. - [x] I have added tests, if applicable. ## Testing Completed ### Standard / Advanced mode (AKS cluster `retinaTest-v119`, 3-node, k8s v1.33.6, kernel 5.15.0-1102-azure) Image: `acnpublic.azurecr.io/microsoft/retina/retina-agent:f4adc1e0-fix1-linux-amd64` Plugin initialized and attached on all three agent pods: ``` level=info caller=common/common_linux.go:79 msg="perf reader created" Map=PerfEventArray(retina_tcpretrans_events)#100 PageSize=4096 BufferSize=65536 level=info caller=tcpretrans/tcpretrans_linux.go:106 msg="tcpretrans plugin initialized" level=info caller=pluginmanager/pluginmanager.go:174 msg="starting plugin tcpretrans" ``` `BufferSize=65536` = 4096 × 16 pages, confirming the starting buffer size is live (down from an initial oversized default). Advanced metric flowing with real pod-level labels, counter incrementing in real time: ``` networkobservability_adv_tcpretrans_count{direction="egress",ip="10.224.1.183",namespace="kube-system",podname="konnectivity-agent-5858c6b5d7-882m2"} 6 → 7 → 9 networkobservability_adv_tcpretrans_count{direction="egress",ip="10.224.2.41",namespace="kube-system",podname="metrics-server-b957f9d87-tg22p"} 1 ``` Cross-checked against the node-level TCP stats from linuxutil on the same pod: `networkobservability_tcp_connection_stats{statistic_name="TCPLostRetransmit"} = 55` — different data source (kernel TCP stats vs kernel tracepoint), same order of magnitude. ### Advanced mode with remote context (source + destination labels) With `remoteContext=true` and `destinationLabels` set on the `tcp_retransmission_count` entry in the `MetricsConfiguration` CRD: ``` networkobservability_adv_tcpretrans_count{ source_ip="10.224.1.183",source_namespace="kube-system",source_podname="konnectivity-agent-5858c6b5d7-882m2", destination_ip="51.143.116.92",destination_namespace="kubernetes-apiserver",destination_podname="kubernetes-apiserver", direction="EGRESS" } 4 ``` No plugin changes were needed for remote context — the rewrite already populates both sides of the flow through `utils.ToFlow`, the enricher handles both sides, and the metrics module is already wired to emit both label sets when configured. Discovering this path is what led to filing #2162. ### Build / vet / test ```bash go build ./... # clean go vet ./pkg/utils/... ./pkg/plugin/tcpretrans/... # clean gofumpt -l pkg/utils/flow_utils.go pkg/plugin/tcpretrans/ # clean go test -tags=unit,dashboard ./pkg/plugin/... ./pkg/module/... # all pass ``` All existing `utils.ToFlow` callers (dropreason, dns, packetparser, latency_test.go) still pass — the `IpVersion` derivation is backward-compatible for every IPv4 caller. ## Additional Notes - PR 2 of 2 (#2153, `plugin/dns-ebpf`) rewrites the DNS plugin with the same approach. - After both merge, inspektor-gadget can be fully removed in the Cilium v1.19 upgrade PR. Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>

nddq requested a review from a team as a code owner April 7, 2026 23:16

nddq requested review from mainred and rbtr April 7, 2026 23:16

nddq mentioned this pull request Apr 7, 2026

feat(tcpretrans): rewrite plugin with native cilium/ebpf #2152

Merged

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(chart): roll operator pod when its configmap changes#2162

fix(chart): roll operator pod when its configmap changes#2162
nddq wants to merge 1 commit intomainfrom
chore/operator-configmap-checksum

nddq commented Apr 7, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nddq commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related Issue

Checklist

Testing

Uh oh!

github-actions bot commented Apr 7, 2026

Retina Code Coverage Report

Total coverage no change

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

nddq commented Apr 7, 2026 •

edited

Loading