2024-03-21 12:23:35.575638+00:00 info [p:2611192 t:2611192] eBPF program successfully compiled
2024-03-21 12:23:36.010724+00:00 error [p:2611192 t:2611192] Exception during BPFHandler initialization, closing connection: Only up to 128 cpus are currently supported
Failed to compile eBPF code for the Linux distro 'unknown' running kernel version 6.0.0-0.deb11.2-amd64.
troubleshoot item bpf_compilation_failed (os=Linux,flavor=unknown,headers_src=unknown,kernel=6.0.0-0.deb11.2-amd64): Only up to 128 cpus are currently supported
This usually means that kernel headers weren't installed correctly.
Please reach out to support and include this log in its entirety so we can diagnose and fix the problem.
2024-03-21 12:23:36.010872+00:00 error [p:2611192 t:2611192] troubleshoot item bpf_compilation_failed (os=Linux,flavor=unknown,headers_src=unknown,kernel=6.0.0-0.deb11.2-amd64): Only up to 128 cpus are currently supported
level-a
Changed the title from "kernel-collector doesn't work on systems with more than 128 logical CPUs" to "kernel-collector doesn't start on systems with more than 128 logical CPUs" on Mar 21, 2024.
Yes, there is an artificial upper limit on the number of CPUs, which is used to allocate some static memory in the kernel collector. There is no inherent limitation on the number of CPUs that could be supported.
I think the current limitation would only manifest in:
- the perf ring allocation when loading eBPF
- dequeueing events from the perf rings; iirc there is a fixed-size heap to sort incoming events
Happy to review a patch if you have the bandwidth!
cc @open-telemetry/network-maintainers if you remember anywhere else the CPU core count would manifest
PerfContainer.entries_ is a heap that holds indexes of non-empty readers, so messages can be dequeued in-order. It is statically sized to BPF_MAX_CPUS.
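To make the description above concrete, here is a minimal sketch of a statically sized heap of reader indexes plus a membership bitset. It is not the collector's code: `kMaxCpus`, `Entry`, and `ReaderHeap` are hypothetical names, and ordering by a per-reader timestamp is an assumption about how in-order dequeueing could work.

```cpp
#include <algorithm>
#include <array>
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for BPF_MAX_CPUS (assumption, not the real header).
constexpr std::size_t kMaxCpus = 128;

// Hypothetical entry: the oldest pending timestamp of a ring and that ring's index.
struct Entry {
  std::uint64_t timestamp;
  std::size_t reader_index;
};

class ReaderHeap {
public:
  // Queues a non-empty reader. Each reader may be queued at most once,
  // so the heap can never hold more than kMaxCpus entries.
  void push(std::size_t reader_index, std::uint64_t timestamp) {
    if (reader_index >= kMaxCpus)
      throw std::out_of_range("reader index exceeds kMaxCpus");
    if (in_heap_.test(reader_index))
      throw std::logic_error("reader already queued");
    entries_[size_++] = Entry{timestamp, reader_index};
    std::push_heap(entries_.begin(), entries_.begin() + size_, later);
    in_heap_.set(reader_index);
  }

  // Removes and returns the reader that holds the oldest pending event.
  Entry pop() {
    if (size_ == 0)
      throw std::logic_error("heap is empty");
    std::pop_heap(entries_.begin(), entries_.begin() + size_, later);
    Entry top = entries_[--size_];
    in_heap_.reset(top.reader_index);
    return top;
  }

  bool empty() const { return size_ == 0; }

private:
  // std::push_heap builds a max-heap, so comparing with ">" keeps the oldest on top.
  static bool later(const Entry &a, const Entry &b) { return a.timestamp > b.timestamp; }

  std::array<Entry, kMaxCpus> entries_{};  // statically sized, like entries_
  std::bitset<kMaxCpus> in_heap_;          // which readers are currently queued
  std::size_t size_ = 0;
};
```

In this sketch both the array and the bitset take their size from the one constant, which is why a single limit ends up capping the number of per-CPU readers.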
Looking at mentions of BPF_MAX_CPUS it seems like it is used in just a few places:
collector/kernel/perf_reader.h
99: PerfEntry entries_[BPF_MAX_CPUS];
103: std::bitset<BPF_MAX_CPUS> readers_in_entries_;
collector/kernel/bpf_src/render_bpf.h
14:#define BPF_MAX_CPUS 128 // Maximum number of CPUs to support
collector/kernel/bpf_src/tcp-processor/bpf_types.h
108:BPF_ARRAY(bpf_log_globals_per_cpu, struct BPF_LOG_GLOBALS, BPF_MAX_CPUS);
119: if (cpu < 0 || cpu >= BPF_MAX_CPUS) {
collector/kernel/perf_reader.cc
24: if (readers_.size() >= BPF_MAX_CPUS)
25: throw std::runtime_error("Only up to " _STRINGIZE(BPF_MAX_CPUS) " cpus are currently supported");
33: if (data_readers_.size() >= BPF_MAX_CPUS)
34: throw std::runtime_error("Only up to " _STRINGIZE(BPF_MAX_CPUS) " cpus are currently supported");
Bottom line, I think just increasing BPF_MAX_CPUS should get you better coverage. I don't see a large memory requirement or reduction in performance.
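A rough way to sanity-check the memory claim is to compute the static footprint of the heap array and bitset for different limits. The sketch below is an estimate under stated assumptions: `PerfEntrySketch` is a hypothetical layout (the real `PerfEntry` may differ), the helper names are made up for illustration, and the per-CPU BPF array in bpf_types.h is not covered.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical entry layout, assumed only for this estimate.
struct PerfEntrySketch {
  std::uint64_t timestamp;
  std::size_t reader_index;
};

// Static footprint of the heap array plus the membership bitset for a given limit.
template <std::size_t MaxCpus>
constexpr std::size_t static_footprint_bytes() {
  return sizeof(PerfEntrySketch[MaxCpus]) + sizeof(std::bitset<MaxCpus>);
}

// Same shape as the guard in the grep output above: refuse to add a reader
// once the number of existing readers has reached the limit.
template <std::size_t MaxCpus>
void check_can_add_reader(std::size_t current_readers) {
  if (current_readers >= MaxCpus)
    throw std::runtime_error("Only up to " + std::to_string(MaxCpus) +
                             " cpus are currently supported");
}
```

With this assumed layout, `static_footprint_bytes<1024>()` stays in the range of kilobytes on a typical 64-bit platform, and `check_can_add_reader<1024>(255)` passes where `check_can_add_reader<128>(128)` throws.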
What happened?
Description
kernel-collector fails to start on a system with more than 128 logical CPUs
Steps to Reproduce
Run kernel-collector on a bare-metal server with 256 CPUs.
Expected Result
kernel-collector starts and keeps running.
Actual Result
kernel-collector fails at startup with the error "Only up to 128 cpus are currently supported".
eBPF Collector version
v0.10.2
Environment information
Environment
OS: Debian GNU/Linux 11 (bullseye)
Kernel: 6.0.0-0.deb11.2-amd64 (with installed linux-headers-amd64)
eBPF Collector configuration
default
Log output
Additional context
Most likely the BPF_MAX_CPUS constant should be increased here: opentelemetry-network/collector/kernel/bpf_src/render_bpf.h (line 14 in 0e33f66)