feat(tcpretrans): rewrite plugin with native cilium/ebpf#2152

Merged
nddq merged 1 commit into main from plugin/tcpretrans-ebpf on Apr 13, 2026

Conversation

@nddq
Member

@nddq nddq commented Mar 31, 2026

Description

PR 1 of 2 for removing inspektor-gadget from retina plugins (split from #2148 per review feedback)

Cilium v1.19.2 pins cilium/ebpf at a version that removed CollectionSpec.RewriteConstants(), which inspektor-gadget relies on. Upgrading IG is not viable — v0.42.0+ removed the built-in trace/tcpretrans gadget entirely. This PR replaces the IG-based implementation with a native cilium/ebpf one ahead of the Cilium upgrade.

  • BPF program: New tracepoint/tcp/tcp_retransmit_skb program (kernel 5.8+ for bpf_ktime_get_boot_ns) with BPF_CORE_READ for CO-RE field relocation. Handles the kernel 6.17 tracepoint struct rename via CO-RE type flavors. TCP flags read from tcp_skb_cb. Supports IPv4 and IPv6.
  • Go plugin: Perf buffer event loop with worker goroutines, replacing the IG tracer/gadget-context pattern. SetupChannel now wires up the external channel (was a no-op under IG), enabling the advanced metrics module to receive events directly.
  • Build: Pre-compiled via bpf2go@v0.18.0 with per-arch targets (amd64 + arm64) and embedded in the binary — no runtime compilation. Compile() and Generate() retained as no-ops for plugin manager lifecycle compatibility.
  • ToFlow IPv6 labeling (pkg/utils/flow_utils.go): now derives IpVersion from sourceIP.To4() instead of hardcoding IPVersion_IPv4, so IPv6 retransmission flows are labeled correctly. Backward-compatible for every existing IPv4 caller.
  • No dependency changes: Works with existing cilium/ebpf v0.18.0 in go.mod.

Related Issue

Partial fix for #1788
Split from #2148
Related: #2162 (chart fix surfaced while validating this PR — latent helm-upgrade reload bug for the operator ConfigMap)

Checklist

  • I have read the contributing documentation.
  • I signed and signed-off the commits (git commit -S -s ...).
  • I have correctly attributed the author(s) of the code.
  • I have tested the changes locally.
  • I have followed the project's style guidelines.
  • I have updated the documentation, if necessary.
  • I have added tests, if applicable.

Testing Completed

Standard / Advanced mode (AKS cluster retinaTest-v119, 3-node, k8s v1.33.6, kernel 5.15.0-1102-azure)

Image: acnpublic.azurecr.io/microsoft/retina/retina-agent:f4adc1e0-fix1-linux-amd64

Plugin initialized and attached on all three agent pods:

level=info caller=common/common_linux.go:79 msg="perf reader created" Map=PerfEventArray(retina_tcpretrans_events)#100 PageSize=4096 BufferSize=65536
level=info caller=tcpretrans/tcpretrans_linux.go:106 msg="tcpretrans plugin initialized"
level=info caller=pluginmanager/pluginmanager.go:174 msg="starting plugin tcpretrans"

BufferSize=65536 = 4096 × 16 pages, confirming the starting buffer size is live (down from an initial oversized default).

Advanced metric flowing with real pod-level labels, counter incrementing in real time:

networkobservability_adv_tcpretrans_count{direction="egress",ip="10.224.1.183",namespace="kube-system",podname="konnectivity-agent-5858c6b5d7-882m2"} 6 → 7 → 9
networkobservability_adv_tcpretrans_count{direction="egress",ip="10.224.2.41",namespace="kube-system",podname="metrics-server-b957f9d87-tg22p"} 1

Cross-checked against the node-level TCP stats from linuxutil on the same pod: networkobservability_tcp_connection_stats{statistic_name="TCPLostRetransmit"} = 55 — different data source (kernel TCP stats vs kernel tracepoint), same order of magnitude.

Advanced mode with remote context (source + destination labels)

With remoteContext=true and destinationLabels set on the tcp_retransmission_count entry in the MetricsConfiguration CRD:

networkobservability_adv_tcpretrans_count{
  source_ip="10.224.1.183",source_namespace="kube-system",source_podname="konnectivity-agent-5858c6b5d7-882m2",
  destination_ip="51.143.116.92",destination_namespace="kubernetes-apiserver",destination_podname="kubernetes-apiserver",
  direction="EGRESS"
} 4

No plugin changes were needed for remote context — the rewrite already populates both sides of the flow through utils.ToFlow, the enricher handles both sides, and the metrics module is already wired to emit both label sets when configured. Discovering this path is what led to filing #2162.

Build / vet / test

go build ./...                                                  # clean
go vet ./pkg/utils/... ./pkg/plugin/tcpretrans/...              # clean
gofumpt -l pkg/utils/flow_utils.go pkg/plugin/tcpretrans/       # clean
go test -tags=unit,dashboard ./pkg/plugin/... ./pkg/module/...  # all pass

All existing utils.ToFlow callers (dropreason, dns, packetparser, latency_test.go) still pass — the IpVersion derivation is backward-compatible for every IPv4 caller.

Additional Notes

@nddq nddq force-pushed the plugin/tcpretrans-ebpf branch 3 times, most recently from 088b7d8 to 01cef4e Compare March 31, 2026 22:36
@github-actions

github-actions bot commented Mar 31, 2026

Retina Code Coverage Report

Total coverage increased from 34.4% to 34.6%

Increased diff

Impacted Files Coverage
pkg/controllers/operator/retinaendpoint/retinaendpoint_controller.go 82.25% ... 83.28% (1.03%) ⬆️
pkg/plugin/tcpretrans/tcpretrans_linux.go 0.0% ... 46.21% (46.21%) ⬆️
pkg/utils/flow_utils.go 43.47% ... 44.47% (1.0%) ⬆️

Decreased diff

Impacted Files Coverage
pkg/controllers/daemon/namespace/namespace_controller.go 78.46% ... 76.24% (-2.22%) ⬇️

@nddq nddq force-pushed the plugin/tcpretrans-ebpf branch 2 times, most recently from 6cabc3b to 60dea07 Compare March 31, 2026 23:13
@nddq nddq marked this pull request as ready for review March 31, 2026 23:35
@nddq nddq requested a review from a team as a code owner March 31, 2026 23:35
@nddq nddq marked this pull request as draft April 1, 2026 00:58
@nddq nddq force-pushed the plugin/tcpretrans-ebpf branch 2 times, most recently from ed4ace3 to bb850a3 Compare April 7, 2026 23:21

@sardarmscs sardarmscs left a comment

Code Review: feat(tcpretrans): rewrite plugin with native cilium/ebpf

Summary

Excellent strategic move — replacing the inspektor-gadget (IG) based tcpretrans plugin with a native cilium/ebpf implementation. The IG dependency was becoming untenable (Cilium v1.19.2 pins cilium/ebpf at a version that removed CollectionSpec.RewriteConstants(), and IG v0.42.0+ removed the built-in trace/tcpretrans gadget). Owning the BPF program directly gives Retina full control over the data path.


BPF Program Correctness

Strengths:

  • Correct use of BPF_CORE_READ for CO-RE field relocation against trace_event_raw_tcp_event_sk_skb
  • Address family derived from sk->__sk_common.skc_family (robust across kernel versions)
  • Proper AF_INET/AF_INET6 handling with ABI-stable literal constants

tcp_skb_cb offset — correct but fragile:
The tcp_flags read via offsetof(struct tcp_skb_cb, tcp_flags) is resolved at compile time, not CO-RE relocated. If a future kernel reorders fields in tcp_skb_cb (unlikely but not impossible — it's not a UAPI struct), this would silently read garbage. Worth an explicit code comment noting this is a compile-time offset and flagging it as a known limitation. Consider whether bpf_core_field_offset() could be used as a future improvement.


Go Code Quality

Event loop architecture is solid:

  • Reader goroutine → buffered channel → 2 worker goroutines
  • Lost sample tracking via metrics.LostEventsCounter for both kernel perf ring and Go channel
  • Clean init/cleanup with ok flag + deferred resource release
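
A minimal sketch of the reader → buffered channel → workers shape described above. It is not the plugin's code: `record`, `pipeline`, `offer`, and `shutdown` are hypothetical names, and plain atomic counters stand in for metrics.LostEventsCounter. The point it illustrates is the non-blocking send, so that a full channel is counted as a lost event instead of stalling the reader.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// record is a hypothetical stand-in for a raw perf sample.
type record struct{ id int }

// pipeline connects one producer (the reader) to N workers via a buffered channel.
type pipeline struct {
	records chan record
	wg      sync.WaitGroup
	handled atomic.Uint64 // events consumed by workers
	lost    atomic.Uint64 // events dropped because the channel was full
}

// newPipeline starts the given number of worker goroutines.
func newPipeline(buffer, workers int) *pipeline {
	p := &pipeline{records: make(chan record, buffer)}
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for range p.records {
				p.handled.Add(1)
			}
		}()
	}
	return p
}

// offer never blocks: when the channel is full the record is dropped and counted.
func (p *pipeline) offer(r record) bool {
	select {
	case p.records <- r:
		return true
	default:
		p.lost.Add(1)
		return false
	}
}

// shutdown closes the channel and waits for the workers to drain it.
func (p *pipeline) shutdown() {
	close(p.records)
	p.wg.Wait()
}

func main() {
	p := newPipeline(8, 2)
	for i := 0; i < 1000; i++ {
		p.offer(record{id: i})
	}
	p.shutdown()
	fmt.Printf("handled=%d lost=%d\n", p.handled.Load(), p.lost.Load())
}
```

The split between handled and lost varies from run to run, but the two counters always sum to the number of records offered.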

Resource leak in Stop():
If Init() succeeds but Start() is never called, isRunning is false, and Stop() returns early without closing BPF objects, tracepoint link, or perf reader. These kernel resources will persist until process exit.

Recommendation: Either set isRunning after Init(), or restructure Stop() to always clean up kernel resources regardless of isRunning.
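
One way to read the second option, sketched under assumptions: the type and field names below are hypothetical, and io.Closer values stand in for the BPF objects, tracepoint link, and perf reader. Stop() releases whatever Init() acquired whether or not Start() ever ran, and is safe to call twice.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"sync"
)

// fakeCloser is a hypothetical stand-in for a kernel-backed resource.
type fakeCloser struct{ closed int }

func (f *fakeCloser) Close() error { f.closed++; return nil }

// plugin is a simplified, hypothetical model of the lifecycle under discussion.
type plugin struct {
	mu        sync.Mutex
	isRunning bool
	resources []io.Closer // acquired in Init, released in Stop
}

func (p *plugin) Init(rs ...io.Closer) { p.resources = append(p.resources, rs...) }

func (p *plugin) Start() {
	p.mu.Lock()
	p.isRunning = true
	p.mu.Unlock()
}

// Stop always closes resources; isRunning only records whether the loop was started.
func (p *plugin) Stop() error {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.isRunning = false
	var errs []error
	// Close in reverse order of acquisition.
	for i := len(p.resources) - 1; i >= 0; i-- {
		if err := p.resources[i].Close(); err != nil {
			errs = append(errs, err)
		}
	}
	p.resources = nil // makes a second Stop a no-op
	return errors.Join(errs...)
}

func main() {
	link, reader := &fakeCloser{}, &fakeCloser{}
	p := &plugin{}
	p.Init(link, reader)
	// Start is deliberately never called.
	fmt.Println(p.Stop(), link.closed, reader.closed)
}
```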

IPv6 slice aliasing concern:

srcIP = event.SrcIp6[:]
dstIP = event.DstIp6[:]

These slices alias the perf buffer memory. Safe today because ToFlow calls .String() (allocating a new string), but fragile if ToFlow internals ever store the net.IP slice directly. Consider defensive copy() into local buffers.


ToFlow IPv6 Fix

if sourceIP.To4() == nil {
    ipVersion = flow.IPVersion_IPv6
}

Correct and backward-compatible. To4() returns nil for pure IPv6 addresses; IPv4-mapped IPv6 addresses are correctly labeled IPv4 (matching kernel AF_INET behavior).


Performance

  • 16 pages/CPU perf buffer (64 KiB) is appropriate for TCP retransmissions (comparatively rare events)
  • Graceful degradation on ENOMEM (halving down to 1 page)
  • Minimal per-event allocations in the hot path
  • Performance equivalent to or better than the IG-based implementation

Testing

This is the weakest area. No unit tests are added for the new Go code. Recommended:

  1. handleTCPRetransEvent tests: synthetic perf.Record bytes for IPv4, IPv6, truncated records, various TCP flags
  2. flagBit tests (trivial but good for coverage)
  3. Init error path tests with mocked failures

The manual AKS cluster testing is thorough and demonstrates correctness, but automated regression tests should follow.


Miscellaneous

  • flagBit helper replacing string-parsing getTCPFlags with bitwise ops is a clear improvement
  • The NS flag documentation (byte 13 only, NS deprecated per RFC 8311) is excellent
  • ktime.MonotonicOffset timestamp conversion is correct and matches other Retina plugins
  • BPF license "Dual MIT/GPL" is fine for loading GPL-only helpers
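
For readers unfamiliar with the bitwise approach, a sketch of what a flagBit-style helper might look like. The signature and constant names are assumptions, since the PR text does not show them; the bit positions are those of byte 13 of the TCP header.

```go
package main

import "fmt"

// TCP flag masks for byte 13 of the TCP header. Constant names are hypothetical.
const (
	tcpFIN uint8 = 0x01
	tcpSYN uint8 = 0x02
	tcpRST uint8 = 0x04
	tcpPSH uint8 = 0x08
	tcpACK uint8 = 0x10
	tcpURG uint8 = 0x20
	tcpECE uint8 = 0x40
	tcpCWR uint8 = 0x80
)

// flagBit returns 1 when the masked bit is set and 0 otherwise.
// The signature is an assumption made for this sketch.
func flagBit(flags, mask uint8) uint16 {
	if flags&mask != 0 {
		return 1
	}
	return 0
}

func main() {
	synAck := tcpSYN | tcpACK
	fmt.Println(
		"SYN:", flagBit(synAck, tcpSYN),
		"ACK:", flagBit(synAck, tcpACK),
		"FIN:", flagBit(synAck, tcpFIN),
	)
}
```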

Recommended Actions

Should fix before merge:

  1. Resource leak in Stop() — BPF objects not cleaned up if Init() succeeded but Start() never called

Should fix soon (follow-up acceptable):

  2. Unit tests for handleTCPRetransEvent (at minimum IPv4, IPv6, truncated record, flag parsing)
  3. Defensive copy for IPv6 net.IP slices
  4. Explicit comment in BPF C noting tcp_skb_cb offset is compile-time, not CO-RE


Verdict

Well-executed migration from inspektor-gadget to native cilium/ebpf. The BPF program is correct, the Go code follows established Retina patterns, and the IPv6/ToFlow fix is a nice bonus. The resource leak in Stop() is the main actionable issue. Great work, @nddq!

@nddq nddq marked this pull request as ready for review April 8, 2026 13:20
Member

@SRodi SRodi left a comment

@nddq can you please add a BPF test and unit test for the plugin?

@nddq nddq force-pushed the plugin/tcpretrans-ebpf branch 4 times, most recently from 4dc7bfe to 707504f Compare April 10, 2026 22:49
Comment thread pkg/plugin/tcpretrans/tcpretrans_ebpf_test.go
Member

@SRodi SRodi left a comment

LGTM, thanks @nddq!

To test the happy path locally, I used a test nginx pod with an ephemeral debug container, exec'd into it, and ran: tc qdisc add dev eth0 root netem loss 30%. This configures the eth0 interface to randomly drop about 30% of outgoing packets, simulating packet loss. The metric increments as expected.

Replace the inspektor-gadget based tcpretrans plugin with a native
cilium/ebpf implementation using bpf2go for compile-time BPF embedding.

BPF program:
- Tracepoint on tcp/tcp_retransmit_skb captures 5-tuple, TCP state,
  and flags from retransmitted packets
- CO-RE ___flavor suffixes handle the kernel 6.13+ struct rename from
  trace_event_raw_tcp_event_sk_skb to trace_event_raw_tcp_retransmit_skb,
  with bpf_core_type_exists() selecting the right type at load time
- IPs read from sock (skc_rcv_saddr/skc_daddr) for kernel-version
  stability; TCP flags read from tcp_skb_cb via skb control buffer
- Works with both static repo vmlinux.h and runtime-generated headers

Go plugin:
- Perf buffer reader with buffered channel and 2 worker goroutines
- Graceful degradation on ENOMEM (halving perf buffer to 1 page)
- Lost sample tracking via metrics counters
- Stop() cleans up BPF resources regardless of whether Start() was called
- IPv6 flow version detection via net.IP.To4() in ToFlow

Tests:
- BPF tests (ebpf tag): load+verify, tracepoint attach, perf reader,
  Stop-after-Init lifecycle
- Unit tests: handleTCPRetransEvent for IPv4/IPv6/truncated/unknown-AF/
  all-flags, flagBit, enricher integration, channel-full drop

Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
@nddq nddq force-pushed the plugin/tcpretrans-ebpf branch from 707504f to fcba33e Compare April 13, 2026 17:31
@nddq nddq enabled auto-merge April 13, 2026 17:32
@nddq nddq added this pull request to the merge queue Apr 13, 2026
Merged via the queue into main with commit f5239ff Apr 13, 2026
36 checks passed
@nddq nddq deleted the plugin/tcpretrans-ebpf branch April 13, 2026 18:58