fix(chart): roll operator pod when its configmap changes#2162
Open
fix(chart): roll operator pod when its configmap changes#2162
Conversation
The operator Deployment's pod template was missing a `checksum/config` annotation, so `helm upgrade` would rewrite `retina-operator-config` without rolling the operator pod. The operator process reads its config once at startup and caches it, so any value change (e.g. flipping `remoteContext`) silently required a manual `kubectl rollout restart`. Found while testing the tcpretrans plugin on a cluster: after enabling `remoteContext=true` via `helm upgrade --reuse-values`, every advanced metric (`adv_drop_count`, `adv_forward_count`, `adv_dns_*`, `adv_tcpretrans_count`) began emitting `source_*="unknown"` and `destination_*="unknown"` for in-cluster pod IPs. Root cause: the daemonset rolled (its template does have the checksum annotation) so agents came up with `remoteContext=true` and started waiting for `retinaendpoint` CRDs per `cmd/standard/daemon.go:252-267`, while the stale operator with `RemoteContext=false` never ran the pod controller at `operator/cmd/standard/deployment.go:202-219` and never created any `retinaendpoint` resources. Agent cache stayed empty, enricher returned nil on every `GetObjByIP`, and every advanced metric emitted "unknown" labels. `kubectl rollout restart deploy/retina-operator` resolved it once; this change makes `helm upgrade` do the right thing. Changes: - Split the operator ConfigMap out of `operator.yaml` into its own `operator-configmap.yaml` template so the Deployment can reference it via `include` without recursing into itself. - Add `checksum/config` annotation on the operator Deployment pod template, pointing at the new ConfigMap template file. Matches the pattern already used by `daemonset.yaml`. Verification: - `helm lint` clean. - `helm template` renders the ConfigMap (1x) and Deployment with the annotation populated (`checksum/config: <sha>`). - Rendering with `remoteContext=false` vs `remoteContext=true` produces different hashes, confirming the pod template changes whenever the ConfigMap changes. Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
7 tasks
Retina Code Coverage ReportTotal coverage no changeDecreased diff
|
github-merge-queue bot
pushed a commit
that referenced
this pull request
Apr 13, 2026
# Description **PR 1 of 2** for removing inspektor-gadget from retina plugins (split from #2148 per review feedback) Cilium v1.19.2 pins `cilium/ebpf` at a version that removed `CollectionSpec.RewriteConstants()`, which inspektor-gadget relies on. Upgrading IG is not viable — v0.42.0+ removed the built-in `trace/tcpretrans` gadget entirely. This PR replaces the IG-based implementation with a native cilium/ebpf one ahead of the Cilium upgrade. - **BPF program**: New `tracepoint/tcp/tcp_retransmit_skb` program (kernel 5.8+ for `bpf_ktime_get_boot_ns`) with `BPF_CORE_READ` for CO-RE field relocation. Handles the kernel 6.17 tracepoint struct rename via CO-RE type flavors. TCP flags read from `tcp_skb_cb`. Supports IPv4 and IPv6. - **Go plugin**: Perf buffer event loop with worker goroutines, replacing the IG tracer/gadget-context pattern. `SetupChannel` now wires up the external channel (was a no-op under IG), enabling the advanced metrics module to receive events directly. - **Build**: Pre-compiled via `bpf2go@v0.18.0` with per-arch targets (amd64 + arm64) and embedded in the binary — no runtime compilation. `Compile()` and `Generate()` retained as no-ops for plugin manager lifecycle compatibility. - **`ToFlow` IPv6 labeling** (`pkg/utils/flow_utils.go`): now derives `IpVersion` from `sourceIP.To4()` instead of hardcoding `IPVersion_IPv4`, so IPv6 retransmission flows are labeled correctly. Backward-compatible for every existing IPv4 caller. - **No dependency changes**: Works with existing `cilium/ebpf` v0.18.0 in `go.mod`. ## Related Issue Partial fix for #1788 Split from #2148 Related: #2162 (chart fix surfaced while validating this PR — latent helm-upgrade reload bug for the operator ConfigMap) ## Checklist - [x] I have read the [contributing documentation](https://retina.sh/docs/Contributing/overview). - [x] I signed and signed-off the commits (`git commit -S -s ...`). - [x] I have correctly attributed the author(s) of the code. - [x] I have tested the changes locally. - [x] I have followed the project's style guidelines. - [x] I have updated the documentation, if necessary. - [x] I have added tests, if applicable. ## Testing Completed ### Standard / Advanced mode (AKS cluster `retinaTest-v119`, 3-node, k8s v1.33.6, kernel 5.15.0-1102-azure) Image: `acnpublic.azurecr.io/microsoft/retina/retina-agent:f4adc1e0-fix1-linux-amd64` Plugin initialized and attached on all three agent pods: ``` level=info caller=common/common_linux.go:79 msg="perf reader created" Map=PerfEventArray(retina_tcpretrans_events)#100 PageSize=4096 BufferSize=65536 level=info caller=tcpretrans/tcpretrans_linux.go:106 msg="tcpretrans plugin initialized" level=info caller=pluginmanager/pluginmanager.go:174 msg="starting plugin tcpretrans" ``` `BufferSize=65536` = 4096 × 16 pages, confirming the starting buffer size is live (down from an initial oversized default). Advanced metric flowing with real pod-level labels, counter incrementing in real time: ``` networkobservability_adv_tcpretrans_count{direction="egress",ip="10.224.1.183",namespace="kube-system",podname="konnectivity-agent-5858c6b5d7-882m2"} 6 → 7 → 9 networkobservability_adv_tcpretrans_count{direction="egress",ip="10.224.2.41",namespace="kube-system",podname="metrics-server-b957f9d87-tg22p"} 1 ``` Cross-checked against the node-level TCP stats from linuxutil on the same pod: `networkobservability_tcp_connection_stats{statistic_name="TCPLostRetransmit"} = 55` — different data source (kernel TCP stats vs kernel tracepoint), same order of magnitude. ### Advanced mode with remote context (source + destination labels) With `remoteContext=true` and `destinationLabels` set on the `tcp_retransmission_count` entry in the `MetricsConfiguration` CRD: ``` networkobservability_adv_tcpretrans_count{ source_ip="10.224.1.183",source_namespace="kube-system",source_podname="konnectivity-agent-5858c6b5d7-882m2", destination_ip="51.143.116.92",destination_namespace="kubernetes-apiserver",destination_podname="kubernetes-apiserver", direction="EGRESS" } 4 ``` No plugin changes were needed for remote context — the rewrite already populates both sides of the flow through `utils.ToFlow`, the enricher handles both sides, and the metrics module is already wired to emit both label sets when configured. Discovering this path is what led to filing #2162. ### Build / vet / test ```bash go build ./... # clean go vet ./pkg/utils/... ./pkg/plugin/tcpretrans/... # clean gofumpt -l pkg/utils/flow_utils.go pkg/plugin/tcpretrans/ # clean go test -tags=unit,dashboard ./pkg/plugin/... ./pkg/module/... # all pass ``` All existing `utils.ToFlow` callers (dropreason, dns, packetparser, latency_test.go) still pass — the `IpVersion` derivation is backward-compatible for every IPv4 caller. ## Additional Notes - PR 2 of 2 (#2153, `plugin/dns-ebpf`) rewrites the DNS plugin with the same approach. - After both merge, inspektor-gadget can be fully removed in the Cilium v1.19 upgrade PR. Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
The operator Deployment's pod template has no
checksum/configannotation, sohelm upgradedoesn't roll the operator when its ConfigMap rendering changes. The operator reads its config once at startup and caches it, so values likeremoteContextstay stale until the pod is manually restarted — leaving the operator unable to createretinaendpointresources that agents in remote-context mode rely on.This PR adds the annotation, matching what
daemonset.yamlalready does. The operator ConfigMap is split out ofoperator.yamlinto its ownoperator-configmap.yamlso theincludeused by the checksum doesn't recurse into itself.Found while testing #2152 on a live cluster.
Related Issue
Relates to #2152
Checklist
Testing
helm lint deploy/standard/manifests/controller/helm/retina/ # 1 chart(s) linted, 0 chart(s) failedVerified the checksum annotation actually moves when operator config values change: