feat(tcpretrans): rewrite plugin with native cilium/ebpf#2152
Conversation
088b7d8 to
01cef4e
Compare
Retina Code Coverage ReportTotal coverage increased from
|
| Impacted Files | Coverage | |
|---|---|---|
| pkg/controllers/operator/retinaendpoint/retinaendpoint_controller.go | 82.25% ... 83.28% (1.03%) |
⬆️ |
| pkg/plugin/tcpretrans/tcpretrans_linux.go | 0.0% ... 46.21% (46.21%) |
⬆️ |
| pkg/utils/flow_utils.go | 43.47% ... 44.47% (1.0%) |
⬆️ |
Decreased diff
| Impacted Files | Coverage | |
|---|---|---|
| pkg/controllers/daemon/namespace/namespace_controller.go | 78.46% ... 76.24% (-2.22%) |
⬇️ |
6cabc3b to
60dea07
Compare
ed4ace3 to
bb850a3
Compare
sardarmscs
left a comment
There was a problem hiding this comment.
Code Review: feat(tcpretrans): rewrite plugin with native cilium/ebpf
Summary
Excellent strategic move — replacing the inspektor-gadget (IG) based tcpretrans plugin with a native cilium/ebpf implementation. The IG dependency was becoming untenable (Cilium v1.19.2 pins cilium/ebpf at a version that removed CollectionSpec.RewriteConstants(), and IG v0.42.0+ removed the built-in trace/tcpretrans gadget). Owning the BPF program directly gives Retina full control over the data path.
BPF Program Correctness
Strengths:
- Correct use of
BPF_CORE_READfor CO-RE field relocation againsttrace_event_raw_tcp_event_sk_skb - Address family derived from
sk->__sk_common.skc_family(robust across kernel versions) - Proper
AF_INET/AF_INET6handling with ABI-stable literal constants
tcp_skb_cb offset — correct but fragile:
The tcp_flags read via offsetof(struct tcp_skb_cb, tcp_flags) is resolved at compile time, not CO-RE relocated. If a future kernel reorders fields in tcp_skb_cb (unlikely but not impossible — it's not a UAPI struct), this would silently read garbage. Worth an explicit code comment noting this is a compile-time offset and flagging it as a known limitation. Consider whether bpf_core_field_offset() could be used as a future improvement.
Go Code Quality
Event loop architecture is solid:
- Reader goroutine → buffered channel → 2 worker goroutines
- Lost sample tracking via
metrics.LostEventsCounterfor both kernel perf ring and Go channel - Clean init/cleanup with
okflag + deferred resource release
Resource leak in Stop():
If Init() succeeds but Start() is never called, isRunning is false, and Stop() returns early without closing BPF objects, tracepoint link, or perf reader. These kernel resources will persist until process exit.
Recommendation: Either set isRunning after Init(), or restructure Stop() to always clean up kernel resources regardless of isRunning.
IPv6 slice aliasing concern:
srcIP = event.SrcIp6[:]
dstIP = event.DstIp6[:]These slices alias the perf buffer memory. Safe today because ToFlow calls .String() (allocating a new string), but fragile if ToFlow internals ever store the net.IP slice directly. Consider defensive copy() into local buffers.
ToFlow IPv6 Fix
if sourceIP.To4() == nil {
ipVersion = flow.IPVersion_IPv6
}Correct and backward-compatible. To4() returns nil for pure IPv6 addresses; IPv4-mapped IPv6 addresses are correctly labeled IPv4 (matching kernel AF_INET behavior).
Performance
- 16 pages/CPU perf buffer (64 KiB) is appropriate for TCP retransmissions (comparatively rare events)
- Graceful degradation on ENOMEM (halving down to 1 page)
- Minimal per-event allocations in the hot path
- Performance equivalent to or better than the IG-based implementation
Testing
This is the weakest area. No unit tests are added for the new Go code. Recommended:
handleTCPRetransEventtests: syntheticperf.Recordbytes for IPv4, IPv6, truncated records, various TCP flagsflagBittests (trivial but good for coverage)Initerror path tests with mocked failures
The manual AKS cluster testing is thorough and demonstrates correctness, but automated regression tests should follow.
Miscellaneous
flagBithelper replacing string-parsinggetTCPFlagswith bitwise ops is a clear improvement- The NS flag documentation (byte 13 only, NS deprecated per RFC 8311) is excellent
ktime.MonotonicOffsettimestamp conversion is correct and matches other Retina plugins- BPF license
"Dual MIT/GPL"is fine for loading GPL-only helpers
Recommended Actions
Should fix before merge:
- Resource leak in
Stop()— BPF objects not cleaned up ifInit()succeeded butStart()never called
Should fix soon (follow-up acceptable):
2. Unit tests for handleTCPRetransEvent (at minimum IPv4, IPv6, truncated record, flag parsing)
3. Defensive copy for IPv6 net.IP slices
4. Explicit comment in BPF C noting tcp_skb_cb offset is compile-time, not CO-RE
Verdict
Well-executed migration from inspektor-gadget to native cilium/ebpf. The BPF program is correct, the Go code follows established Retina patterns, and the IPv6/ToFlow fix is a nice bonus. The resource leak in Stop() is the main actionable issue. Great work, @nddq!
4dc7bfe to
707504f
Compare
There was a problem hiding this comment.
LGTM, thanks @nddq!
To test the happy path locally I used a test nginx pod + ephemeral debug container, then exec into it and run: tc qdisc add dev eth0 root netem loss 30%, which configures the eth0 interface to randomly drop about 30% of outgoing network packets, simulating network packet loss. Metrics is incremented as expected.
Replace the inspektor-gadget based tcpretrans plugin with a native cilium/ebpf implementation using bpf2go for compile-time BPF embedding. BPF program: - Tracepoint on tcp/tcp_retransmit_skb captures 5-tuple, TCP state, and flags from retransmitted packets - CO-RE ___flavor suffixes handle the kernel 6.13+ struct rename from trace_event_raw_tcp_event_sk_skb to trace_event_raw_tcp_retransmit_skb, with bpf_core_type_exists() selecting the right type at load time - IPs read from sock (skc_rcv_saddr/skc_daddr) for kernel-version stability; TCP flags read from tcp_skb_cb via skb control buffer - Works with both static repo vmlinux.h and runtime-generated headers Go plugin: - Perf buffer reader with buffered channel and 2 worker goroutines - Graceful degradation on ENOMEM (halving perf buffer to 1 page) - Lost sample tracking via metrics counters - Stop() cleans up BPF resources regardless of whether Start() was called - IPv6 flow version detection via net.IP.To4() in ToFlow Tests: - BPF tests (ebpf tag): load+verify, tracepoint attach, perf reader, Stop-after-Init lifecycle - Unit tests: handleTCPRetransEvent for IPv4/IPv6/truncated/unknown-AF/ all-flags, flagBit, enricher integration, channel-full drop Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
707504f to
fcba33e
Compare
Description
PR 1 of 2 for removing inspektor-gadget from retina plugins (split from #2148 per review feedback)
Cilium v1.19.2 pins
cilium/ebpfat a version that removedCollectionSpec.RewriteConstants(), which inspektor-gadget relies on. Upgrading IG is not viable — v0.42.0+ removed the built-intrace/tcpretransgadget entirely. This PR replaces the IG-based implementation with a native cilium/ebpf one ahead of the Cilium upgrade.tracepoint/tcp/tcp_retransmit_skbprogram (kernel 5.8+ forbpf_ktime_get_boot_ns) withBPF_CORE_READfor CO-RE field relocation. Handles the kernel 6.17 tracepoint struct rename via CO-RE type flavors. TCP flags read fromtcp_skb_cb. Supports IPv4 and IPv6.SetupChannelnow wires up the external channel (was a no-op under IG), enabling the advanced metrics module to receive events directly.bpf2go@v0.18.0with per-arch targets (amd64 + arm64) and embedded in the binary — no runtime compilation.Compile()andGenerate()retained as no-ops for plugin manager lifecycle compatibility.ToFlowIPv6 labeling (pkg/utils/flow_utils.go): now derivesIpVersionfromsourceIP.To4()instead of hardcodingIPVersion_IPv4, so IPv6 retransmission flows are labeled correctly. Backward-compatible for every existing IPv4 caller.cilium/ebpfv0.18.0 ingo.mod.Related Issue
Partial fix for #1788
Split from #2148
Related: #2162 (chart fix surfaced while validating this PR — latent helm-upgrade reload bug for the operator ConfigMap)
Checklist
git commit -S -s ...).Testing Completed
Standard / Advanced mode (AKS cluster
retinaTest-v119, 3-node, k8s v1.33.6, kernel 5.15.0-1102-azure)Image:
acnpublic.azurecr.io/microsoft/retina/retina-agent:f4adc1e0-fix1-linux-amd64Plugin initialized and attached on all three agent pods:
BufferSize=65536= 4096 × 16 pages, confirming the starting buffer size is live (down from an initial oversized default).Advanced metric flowing with real pod-level labels, counter incrementing in real time:
Cross-checked against the node-level TCP stats from linuxutil on the same pod:
networkobservability_tcp_connection_stats{statistic_name="TCPLostRetransmit"} = 55— different data source (kernel TCP stats vs kernel tracepoint), same order of magnitude.Advanced mode with remote context (source + destination labels)
With
remoteContext=trueanddestinationLabelsset on thetcp_retransmission_countentry in theMetricsConfigurationCRD:No plugin changes were needed for remote context — the rewrite already populates both sides of the flow through
utils.ToFlow, the enricher handles both sides, and the metrics module is already wired to emit both label sets when configured. Discovering this path is what led to filing #2162.Build / vet / test
All existing
utils.ToFlowcallers (dropreason, dns, packetparser, latency_test.go) still pass — theIpVersionderivation is backward-compatible for every IPv4 caller.Additional Notes
plugin/dns-ebpf) rewrites the DNS plugin with the same approach.