Skip to content

fix(chart): roll operator pod when its configmap changes#2162

Open
nddq wants to merge 1 commit intomainfrom
chore/operator-configmap-checksum
Open

fix(chart): roll operator pod when its configmap changes#2162
nddq wants to merge 1 commit intomainfrom
chore/operator-configmap-checksum

Conversation

@nddq
Copy link
Copy Markdown
Member

@nddq nddq commented Apr 7, 2026

Description

The operator Deployment's pod template has no checksum/config annotation, so helm upgrade doesn't roll the operator when its ConfigMap rendering changes. The operator reads its config once at startup and caches it, so values like remoteContext stay stale until the pod is manually restarted — leaving the operator unable to create retinaendpoint resources that agents in remote-context mode rely on.

This PR adds the annotation, matching what daemonset.yaml already does. The operator ConfigMap is split out of operator.yaml into its own operator-configmap.yaml so the include used by the checksum doesn't recurse into itself.

Found while testing #2152 on a live cluster.

Related Issue

Relates to #2152

Checklist

  • I have read the contributing guidelines
  • I have signed the CLA
  • Unit tests added/updated — N/A, chart template change
  • Integration tests added/updated — N/A, chart template change
  • Documentation updated — N/A
  • Helm chart updated

Testing

helm lint deploy/standard/manifests/controller/helm/retina/
# 1 chart(s) linted, 0 chart(s) failed

Verified the checksum annotation actually moves when operator config values change:

HASH1=$(helm template test deploy/standard/manifests/controller/helm/retina/ \
  --set operator.enabled=true --set remoteContext=false | grep 'checksum/config' | tail -1)
HASH2=$(helm template test deploy/standard/manifests/controller/helm/retina/ \
  --set operator.enabled=true --set remoteContext=true  | grep 'checksum/config' | tail -1)
# different → pod will roll on the next helm upgrade

The operator Deployment's pod template was missing a `checksum/config`
annotation, so `helm upgrade` would rewrite `retina-operator-config`
without rolling the operator pod. The operator process reads its config
once at startup and caches it, so any value change (e.g. flipping
`remoteContext`) silently required a manual `kubectl rollout restart`.

Found while testing the tcpretrans plugin on a cluster: after enabling
`remoteContext=true` via `helm upgrade --reuse-values`, every advanced
metric (`adv_drop_count`, `adv_forward_count`, `adv_dns_*`,
`adv_tcpretrans_count`) began emitting `source_*="unknown"` and
`destination_*="unknown"` for in-cluster pod IPs. Root cause: the
daemonset rolled (its template does have the checksum annotation) so
agents came up with `remoteContext=true` and started waiting for
`retinaendpoint` CRDs per `cmd/standard/daemon.go:252-267`, while the
stale operator with `RemoteContext=false` never ran the pod controller
at `operator/cmd/standard/deployment.go:202-219` and never created any
`retinaendpoint` resources. Agent cache stayed empty, enricher returned
nil on every `GetObjByIP`, and every advanced metric emitted "unknown"
labels. `kubectl rollout restart deploy/retina-operator` resolved it
once; this change makes `helm upgrade` do the right thing.

Changes:
- Split the operator ConfigMap out of `operator.yaml` into its own
  `operator-configmap.yaml` template so the Deployment can reference it
  via `include` without recursing into itself.
- Add `checksum/config` annotation on the operator Deployment pod
  template, pointing at the new ConfigMap template file. Matches the
  pattern already used by `daemonset.yaml`.

Verification:
- `helm lint` clean.
- `helm template` renders the ConfigMap (1x) and Deployment with the
  annotation populated (`checksum/config: <sha>`).
- Rendering with `remoteContext=false` vs `remoteContext=true` produces
  different hashes, confirming the pod template changes whenever the
  ConfigMap changes.

Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
@nddq nddq requested a review from a team as a code owner April 7, 2026 23:16
@nddq nddq requested review from mainred and rbtr April 7, 2026 23:16
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 7, 2026

Retina Code Coverage Report

Total coverage no change

Decreased diff

Impacted Files Coverage
pkg/controllers/operator/retinaendpoint/retinaendpoint_controller.go 83.28% ... 82.25% (-1.03%) ⬇️

github-merge-queue bot pushed a commit that referenced this pull request Apr 13, 2026
# Description

**PR 1 of 2** for removing inspektor-gadget from retina plugins (split
from #2148 per review feedback)

Cilium v1.19.2 pins `cilium/ebpf` at a version that removed
`CollectionSpec.RewriteConstants()`, which inspektor-gadget relies on.
Upgrading IG is not viable — v0.42.0+ removed the built-in
`trace/tcpretrans` gadget entirely. This PR replaces the IG-based
implementation with a native cilium/ebpf one ahead of the Cilium
upgrade.

- **BPF program**: New `tracepoint/tcp/tcp_retransmit_skb` program
(kernel 5.8+ for `bpf_ktime_get_boot_ns`) with `BPF_CORE_READ` for CO-RE
field relocation. Handles the kernel 6.17 tracepoint struct rename via
CO-RE type flavors. TCP flags read from `tcp_skb_cb`. Supports IPv4 and
IPv6.
- **Go plugin**: Perf buffer event loop with worker goroutines,
replacing the IG tracer/gadget-context pattern. `SetupChannel` now wires
up the external channel (was a no-op under IG), enabling the advanced
metrics module to receive events directly.
- **Build**: Pre-compiled via `bpf2go@v0.18.0` with per-arch targets
(amd64 + arm64) and embedded in the binary — no runtime compilation.
`Compile()` and `Generate()` retained as no-ops for plugin manager
lifecycle compatibility.
- **`ToFlow` IPv6 labeling** (`pkg/utils/flow_utils.go`): now derives
`IpVersion` from `sourceIP.To4()` instead of hardcoding
`IPVersion_IPv4`, so IPv6 retransmission flows are labeled correctly.
Backward-compatible for every existing IPv4 caller.
- **No dependency changes**: Works with existing `cilium/ebpf` v0.18.0
in `go.mod`.

## Related Issue

Partial fix for #1788
Split from #2148
Related: #2162 (chart fix surfaced while validating this PR — latent
helm-upgrade reload bug for the operator ConfigMap)

## Checklist

- [x] I have read the [contributing
documentation](https://retina.sh/docs/Contributing/overview).
- [x] I signed and signed-off the commits (`git commit -S -s ...`).
- [x] I have correctly attributed the author(s) of the code.
- [x] I have tested the changes locally.
- [x] I have followed the project's style guidelines.
- [x] I have updated the documentation, if necessary.
- [x] I have added tests, if applicable.

## Testing Completed

### Standard / Advanced mode (AKS cluster `retinaTest-v119`, 3-node, k8s
v1.33.6, kernel 5.15.0-1102-azure)

Image:
`acnpublic.azurecr.io/microsoft/retina/retina-agent:f4adc1e0-fix1-linux-amd64`

Plugin initialized and attached on all three agent pods:

```
level=info caller=common/common_linux.go:79 msg="perf reader created" Map=PerfEventArray(retina_tcpretrans_events)#100 PageSize=4096 BufferSize=65536
level=info caller=tcpretrans/tcpretrans_linux.go:106 msg="tcpretrans plugin initialized"
level=info caller=pluginmanager/pluginmanager.go:174 msg="starting plugin tcpretrans"
```

`BufferSize=65536` = 4096 × 16 pages, confirming the starting buffer
size is live (down from an initial oversized default).

Advanced metric flowing with real pod-level labels, counter incrementing
in real time:

```
networkobservability_adv_tcpretrans_count{direction="egress",ip="10.224.1.183",namespace="kube-system",podname="konnectivity-agent-5858c6b5d7-882m2"} 6 → 7 → 9
networkobservability_adv_tcpretrans_count{direction="egress",ip="10.224.2.41",namespace="kube-system",podname="metrics-server-b957f9d87-tg22p"} 1
```

Cross-checked against the node-level TCP stats from linuxutil on the
same pod:
`networkobservability_tcp_connection_stats{statistic_name="TCPLostRetransmit"}
= 55` — different data source (kernel TCP stats vs kernel tracepoint),
same order of magnitude.

### Advanced mode with remote context (source + destination labels)

With `remoteContext=true` and `destinationLabels` set on the
`tcp_retransmission_count` entry in the `MetricsConfiguration` CRD:

```
networkobservability_adv_tcpretrans_count{
  source_ip="10.224.1.183",source_namespace="kube-system",source_podname="konnectivity-agent-5858c6b5d7-882m2",
  destination_ip="51.143.116.92",destination_namespace="kubernetes-apiserver",destination_podname="kubernetes-apiserver",
  direction="EGRESS"
} 4
```

No plugin changes were needed for remote context — the rewrite already
populates both sides of the flow through `utils.ToFlow`, the enricher
handles both sides, and the metrics module is already wired to emit both
label sets when configured. Discovering this path is what led to filing
#2162.

### Build / vet / test

```bash
go build ./...                                                  # clean
go vet ./pkg/utils/... ./pkg/plugin/tcpretrans/...              # clean
gofumpt -l pkg/utils/flow_utils.go pkg/plugin/tcpretrans/       # clean
go test -tags=unit,dashboard ./pkg/plugin/... ./pkg/module/...  # all pass
```

All existing `utils.ToFlow` callers (dropreason, dns, packetparser,
latency_test.go) still pass — the `IpVersion` derivation is
backward-compatible for every IPv4 caller.

## Additional Notes

- PR 2 of 2 (#2153, `plugin/dns-ebpf`) rewrites the DNS plugin with the
same approach.
- After both merge, inspektor-gadget can be fully removed in the Cilium
v1.19 upgrade PR.

Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant