Added
Agent: resilient BPF attach and sampler status. The BPF builder now
attaches each program individually and tolerates per-program failures
(load/verify failures stay fatal), so a single dead probe no longer
takes down its sibling programs. A new GET /samplers endpoint
reports each sampler as active/disabled/failed with per-program
attach detail, and recordings capture this status under per_source_metadata.<source>.sampler_status. (#954)
GPU (NVIDIA): gpu_tensor_utilization now breaks out per-tensor-pipe
activity via a pipe label — hmma (FP16/BF16, and FP32 matmul that
runs as TF32), imma (integer), and dfma (FP64) — alongside the
existing aggregate (pipe=any). Collected from NVML GPM, so it
requires Hopper+ and is reported only where the corresponding pipe is
supported. (#946)
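
The per-program attach behaviour in the Agent entry above can be sketched roughly as follows. This is an illustrative Python model, not the agent's actual code: `LoadError`, `SamplerReport`, and `attach_all` are hypothetical names, and the rule for deriving active/disabled/failed from the per-program results is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


class LoadError(Exception):
    """Hypothetical stand-in for a load/verify failure, which stays fatal."""


@dataclass
class SamplerReport:
    name: str
    enabled: bool = True
    # program name -> "attached" or "failed: <reason>"
    programs: Dict[str, str] = field(default_factory=dict)

    @property
    def status(self) -> str:
        # Assumed aggregation rule: one attached program keeps the sampler active.
        if not self.enabled:
            return "disabled"
        if any(result == "attached" for result in self.programs.values()):
            return "active"
        return "failed"


def attach_all(name: str,
               programs: Dict[str, Callable[[], None]],
               enabled: bool = True) -> SamplerReport:
    """Attach each program on its own and record the outcome per program."""
    report = SamplerReport(name=name, enabled=enabled)
    if not enabled:
        return report
    for prog_name, attach in programs.items():
        try:
            attach()
            report.programs[prog_name] = "attached"
        except LoadError:
            raise  # load/verify failures are not tolerated
        except OSError as err:
            # A dead probe is recorded; sibling programs still get attached.
            report.programs[prog_name] = f"failed: {err}"
    return report
```

Calling `attach_all("demo", {"a": ok, "b": dead})`, where `dead` raises `OSError`, returns a report whose `status` is `"active"` and whose `programs` map shows which program failed, which is the kind of per-program detail the entry describes.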
Fixed
BPF samplers that rely on in-kernel BTF (cpu_usage, cpu_migrations, cpu_perf, scheduler_runqueue, syscall_counts) now work on kernels
built without /sys/kernel/btf/vmlinux (e.g. NVIDIA Tegra/L4T). Each tp_btf hook gains a raw_tp twin selected at runtime via kernel_has_btf(), and syscall_counts uses bpf_get_current_task()
instead of bpf_get_current_task_btf(). CO-RE still uses the external
BTF file (btf_path). Stock BTF kernels are unaffected. (#948)
BPF sampler correctness fixes from a full review against docs/principles.md: histogram bucketing used 32-bit shifts,
mis-bucketing values ≥ 2³¹ (long-tail latencies ≥ ~2.15 s were
misreported); blockio latency tracking silently dropped all requests
on kernels < 5.11 due to a tracepoint argument layout difference;
scheduler/runqueue could charge runqueue-wait and off-cpu time to the
wrong cgroup; a full ringbuf no longer permanently suppresses a
cgroup's name; tcp_retransmit now counts segments instead of calls
(it undercounted with TSO/GSO); plus smaller metadata, histogram, and
defensive-check fixes. (#956)
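
To illustrate the histogram bucketing item above, here is a small worked example of why 2³¹ is the threshold. It assumes latencies are recorded in nanoseconds (consistent with the ~2.15 s figure) and uses `ctypes` to emulate two's-complement wraparound of a signed 32-bit integer. It does not reproduce the samplers' actual bucketing code, and `shift32` and `shift64` are hypothetical helper names.

```python
import ctypes

NS_PER_SEC = 1_000_000_000


def shift32(n: int) -> int:
    """Emulate `1 << n` stored in a signed 32-bit integer (wraps on overflow)."""
    return ctypes.c_int32(1 << n).value


def shift64(n: int) -> int:
    """Emulate `1 << n` stored in a signed 64-bit integer."""
    return ctypes.c_int64(1 << n).value


# 2**31 ns expressed in seconds: the point where 32-bit shifts stop being safe.
threshold_seconds = (1 << 31) / NS_PER_SEC
print(threshold_seconds)   # 2.147483648

# The 32-bit result wraps negative, so any bucket boundary built from it is wrong.
print(shift32(31))         # -2147483648
print(shift64(31))         # 2147483648
```

Any comparison or index built from the wrapped 32-bit value lands in the wrong bucket, whereas the 64-bit shift keeps the boundary intact.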