kernel-collector doesn't start on systems with more than 128 logical CPUs #258

level-a · 2024-03-21T13:08:33Z

What happened?

Description

kernel-collector fails to start on a system with more than 128 logical CPUs

Steps to Reproduce

Run kernel-collector on a bare-metal server with 256 logical CPUs.

Expected Result

kernel-collector starts and keeps running.

Actual Result

Startup fails with the error "Only up to 128 cpus are currently supported" (see Log output).

eBPF Collector version

v0.10.2

Environment information

Environment

OS: Debian GNU/Linux 11 (bullseye)
Kernel: 6.0.0-0.deb11.2-amd64 (with installed linux-headers-amd64)

eBPF Collector configuration

default

Log output

2024-03-21 12:23:35.575638+00:00 info [p:2611192 t:2611192] eBPF program successfully compiled
2024-03-21 12:23:36.010724+00:00 error [p:2611192 t:2611192] Exception during BPFHandler initialization, closing connection: Only up to 128 cpus are currently supported

Failed to compile eBPF code for the Linux distro 'unknown' running kernel version 6.0.0-0.deb11.2-amd64.

troubleshoot item bpf_compilation_failed (os=Linux,flavor=unknown,headers_src=unknown,kernel=6.0.0-0.deb11.2-amd64): Only up to 128 cpus are currently supported

This usually means that kernel headers weren't installed correctly.

Please reach out to support and include this log in its entirety so we can diagnose and fix
the problem.
2024-03-21 12:23:36.010872+00:00 error [p:2611192 t:2611192] troubleshoot item bpf_compilation_failed (os=Linux,flavor=unknown,headers_src=unknown,kernel=6.0.0-0.deb11.2-amd64): Only up to 128 cpus are currently supported

Additional context

Most likely the BPF_MAX_CPUS constant should be increased here:

opentelemetry-network/collector/kernel/bpf_src/render_bpf.h

Line 14 in 0e33f66

#define BPF_MAX_CPUS 128 // Maximum number of CPUs to support


yonch · 2024-04-02T16:08:15Z

Yes, there is an artificial upper limit on the number of CPUs, which is used to size some statically allocated memory in the kernel collector. There is no inherent limit on the number of CPUs that could be supported.

I think the current limitation would only manifest in:

the perf ring allocation when loading eBPF
dequeueing events from the perf rings (IIRC there is a fixed-size heap used to sort incoming events)

Happy to review a patch if you have the bandwidth!

cc @open-telemetry/network-maintainers in case you remember anywhere else the CPU core count would matter

yonch · 2024-04-09T16:38:58Z

After some further investigation:

PerfContainer holds the perf rings in readers_.
PerfContainer.entries_ is a heap that holds indexes of non-empty readers, so messages can be dequeued in-order. It is statically sized to BPF_MAX_CPUS.
PerfReader::update_when_not_in_entries() performs heap maintenance.
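To make the heap's role concrete, here is a minimal sketch of in-order dequeueing across per-CPU rings using a statically sized heap of reader indexes. It is not the project's code: `Event`, `HeapEntry`, `MergedRings`, `drain_timestamps`, and `kMaxCpus` are hypothetical names, the rings are modeled as `std::deque`s, each ring is assumed to already be in timestamp order, and the `readers_in_entries_` bitset bookkeeping is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

constexpr std::size_t kMaxCpus = 128; // plays the role of BPF_MAX_CPUS

// Hypothetical event: only the timestamp matters for ordering.
struct Event {
  std::uint64_t timestamp;
  int cpu;
};

// One heap slot per non-empty ring: the ring's index and its oldest timestamp.
struct HeapEntry {
  std::uint64_t timestamp;
  std::size_t reader;
};

class MergedRings {
public:
  explicit MergedRings(std::vector<std::deque<Event>> rings) : rings_(std::move(rings))
  {
    // The heap array is statically sized, so the ring count must fit in it.
    assert(rings_.size() <= kMaxCpus);
    for (std::size_t i = 0; i < rings_.size(); ++i)
      if (!rings_[i].empty())
        push(i);
  }

  bool empty() const { return heap_size_ == 0; }

  // Removes and returns the globally oldest event.
  Event pop()
  {
    assert(heap_size_ > 0);
    std::pop_heap(heap_, heap_ + heap_size_, later);
    const std::size_t reader = heap_[--heap_size_].reader;
    Event ev = rings_[reader].front();
    rings_[reader].pop_front();
    if (!rings_[reader].empty())
      push(reader); // re-insert the ring keyed by its next event
    return ev;
  }

private:
  // "Greater than" comparator turns the std heap into a min-heap on timestamp.
  static bool later(const HeapEntry &a, const HeapEntry &b) { return a.timestamp > b.timestamp; }

  void push(std::size_t reader)
  {
    heap_[heap_size_++] = HeapEntry{rings_[reader].front().timestamp, reader};
    std::push_heap(heap_, heap_ + heap_size_, later);
  }

  std::vector<std::deque<Event>> rings_;
  HeapEntry heap_[kMaxCpus]; // fixed capacity, like entries_[BPF_MAX_CPUS]
  std::size_t heap_size_ = 0;
};

// Drains every ring and returns the timestamps in dequeue order.
std::vector<std::uint64_t> drain_timestamps(std::vector<std::deque<Event>> rings)
{
  MergedRings merged(std::move(rings));
  std::vector<std::uint64_t> out;
  while (!merged.empty())
    out.push_back(merged.pop().timestamp);
  return out;
}
```

Because the heap holds at most one entry per ring, its capacity only needs to match the ring count, which is why the array is sized by the CPU limit.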

Looking at mentions of BPF_MAX_CPUS it seems like it is used in just a few places:

collector/kernel/perf_reader.h
99:  PerfEntry entries_[BPF_MAX_CPUS];
103:  std::bitset<BPF_MAX_CPUS> readers_in_entries_;

collector/kernel/bpf_src/render_bpf.h
14:#define BPF_MAX_CPUS 128              // Maximum number of CPUs to support

collector/kernel/bpf_src/tcp-processor/bpf_types.h
108:BPF_ARRAY(bpf_log_globals_per_cpu, struct BPF_LOG_GLOBALS, BPF_MAX_CPUS);
119:  if (cpu < 0 || cpu >= BPF_MAX_CPUS) {

collector/kernel/perf_reader.cc
24:  if (readers_.size() >= BPF_MAX_CPUS)
25:    throw std::runtime_error("Only up to " _STRINGIZE(BPF_MAX_CPUS) " cpus are currently supported");
33:  if (data_readers_.size() >= BPF_MAX_CPUS)
34:    throw std::runtime_error("Only up to " _STRINGIZE(BPF_MAX_CPUS) " cpus are currently supported");
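The checks at perf_reader.cc lines 24-25 and 33-34 are what produce the message in the log. The following is a small sketch of that pattern, not the project's code: `RingReader`, `RingContainer`, `try_register`, and the `SKETCH_STRINGIZE` macros are hypothetical stand-ins (the last one for the project's `_STRINGIZE`), and only the constant's value and the message text come from the issue.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Same value as the constant quoted from render_bpf.h.
#define BPF_MAX_CPUS 128

// Two-step stringification so the macro's value (128), not its name,
// ends up in the message. Hypothetical stand-in for _STRINGIZE.
#define SKETCH_STRINGIZE_IMPL(x) #x
#define SKETCH_STRINGIZE(x) SKETCH_STRINGIZE_IMPL(x)

// Hypothetical stand-in for a per-CPU perf ring reader.
struct RingReader {
  int cpu;
};

// Hypothetical container holding one reader per CPU, capped at BPF_MAX_CPUS.
class RingContainer {
public:
  void add_ring(int cpu)
  {
    // Same shape as the quoted check: refuse once the cap has been reached.
    if (readers_.size() >= BPF_MAX_CPUS)
      throw std::runtime_error("Only up to " SKETCH_STRINGIZE(BPF_MAX_CPUS) " cpus are currently supported");
    readers_.push_back(RingReader{cpu});
  }

  std::size_t size() const { return readers_.size(); }

private:
  std::vector<RingReader> readers_;
};

// Registers one ring per CPU; returns the error text, or "" on success.
std::string try_register(int num_cpus)
{
  assert(num_cpus >= 0);
  RingContainer container;
  try {
    for (int cpu = 0; cpu < num_cpus; ++cpu)
      container.add_ring(cpu);
  } catch (const std::runtime_error &e) {
    return e.what();
  }
  return "";
}
```

In this sketch, `try_register(128)` succeeds, while `try_register(256)` fails on the 129th ring with the same text seen in the log. Since the message is built from the macro, raising the constant also updates the number in the message.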

Bottom line: I think simply increasing BPF_MAX_CPUS should be enough to support machines with more CPUs. I don't expect a large memory cost or a reduction in performance.
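As a rough illustration of the memory point, the sketch below computes the size of the two statically sized members listed for perf_reader.h at different limits. The layout of `PerfEntrySketch` is an assumption (the real `PerfEntry` may differ), `StaticState` and `static_bytes` are hypothetical names, and the BPF array and the perf rings themselves are not modeled.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// Assumed layout for illustration only; the real PerfEntry may differ.
struct PerfEntrySketch {
  std::uint64_t timestamp;
  std::uint32_t reader_index;
};

// Mirrors the two members sized by BPF_MAX_CPUS in perf_reader.h.
template <std::size_t MaxCpus>
struct StaticState {
  PerfEntrySketch entries[MaxCpus];
  std::bitset<MaxCpus> readers_in_entries;
};

// Bytes of statically sized state for a given CPU limit.
template <std::size_t MaxCpus>
constexpr std::size_t static_bytes()
{
  return sizeof(StaticState<MaxCpus>);
}
```

Under these assumptions the footprint grows linearly with the limit and stays in the tens of kilobytes even at 1024 CPUs, for example when comparing `static_bytes<128>()` with `static_bytes<1024>()`.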

level-a added the bug (Something isn't working) label Mar 21, 2024

level-a changed the title ~~kernel-collector doesn't work on systems with more than 128 logical CPUs~~ kernel-collector doesn't start on systems with more than 128 logical CPUs Mar 21, 2024
