Description
While working with a recent version of Omicron (commit 6e97f9e), I noticed sled-agent consuming a constant 15-17% CPU.
rpz@kalm:~$ prstat -c 1
Please wait...
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
2419 root 52M 26M cpu7 0 0 15:29:54 17% sled-agent/16
6514 root 905M 456M sleep 59 0 0:28:54 0.1% cockroach/20
7350 root 853M 280M sleep 59 0 0:33:44 0.1% clickhouse/154
5 root 0K 0K sleep 98 -20 0:22:44 0.0% zpool-rpool/240
2391 root 0K 0K sleep 98 -20 0:04:54 0.0% zpool-oxp_d462a/240
8106 root 42M 29M sleep 59 0 0:01:53 0.0% oximeter/15
2603 root 0K 0K sleep 60 - 0:00:00 0.0% zsched/1
327 root 2204K 804K sleep 59 0 0:00:00 0.0% utmpd/1
79 root 2588K 864K sleep 59 0 0:00:02 0.0% in.mpathd/1
2620 root 2816K 1848K sleep 59 0 0:00:00 0.0% init/1
51 netcfg 2668K 1384K sleep 59 0 0:00:00 0.0% netcfgd/2
38 root 2988K 2004K sleep 59 0 0:00:00 0.0% dlmgmtd/8
637 rpz 8156K 6644K sleep 59 0 0:00:00 0.0% bash/1
260 root 2680K 1580K sleep 59 0 0:00:00 0.0% pfexecd/3
63 netadm 4424K 2812K sleep 59 0 0:00:00 0.0% ipmgmtd/4
Total: 205 processes, 2256 lwps, load averages: 2.04, 2.04, 2.04
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
2419 root 52M 26M cpu1 0 0 15:29:56 17% sled-agent/16
6514 root 905M 456M sleep 59 0 0:28:54 0.1% cockroach/20
7350 root 853M 280M sleep 59 0 0:33:44 0.1% clickhouse/154
5 root 0K 0K sleep 98 -20 0:22:44 0.1% zpool-rpool/240
8106 root 42M 29M sleep 59 0 0:01:53 0.0% oximeter/15
2391 root 0K 0K sleep 98 -20 0:04:54 0.0% zpool-oxp_d462a/240
14722 rpz 4640K 3736K cpu9 54 0 0:00:00 0.0% prstat/1
2603 root 0K 0K sleep 60 - 0:00:00 0.0% zsched/1
327 root 2204K 804K sleep 59 0 0:00:00 0.0% utmpd/1
79 root 2588K 864K sleep 59 0 0:00:02 0.0% in.mpathd/1
2620 root 2816K 1848K sleep 59 0 0:00:00 0.0% init/1
51 netcfg 2668K 1384K sleep 59 0 0:00:00 0.0% netcfgd/2
38 root 2988K 2004K sleep 59 0 0:00:00 0.0% dlmgmtd/8
637 rpz 8156K 6644K sleep 59 0 0:00:00 0.0% bash/1
260 root 2680K 1580K sleep 59 0 0:00:00 0.0% pfexecd/3
Total: 205 processes, 2256 lwps, load averages: 2.04, 2.04, 2.04
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
2419 root 52M 26M cpu0 0 0 15:29:58 17% sled-agent/16
6514 root 905M 457M sleep 59 0 0:28:54 0.1% cockroach/20
7350 root 853M 280M sleep 59 0 0:33:44 0.1% clickhouse/154
5 root 0K 0K sleep 98 -20 0:22:44 0.1% zpool-rpool/240
8106 root 42M 29M sleep 59 0 0:01:53 0.0% oximeter/15
2391 root 0K 0K sleep 98 -20 0:04:54 0.0% zpool-oxp_d462a/240
14722 rpz 4640K 3736K cpu9 59 0 0:00:00 0.0% prstat/1
636 rpz 11M 9416K sleep 59 0 0:00:21 0.0% tmux/1
2603 root 0K 0K sleep 60 - 0:00:00 0.0% zsched/1
327 root 2204K 804K sleep 59 0 0:00:00 0.0% utmpd/1
79 root 2588K 864K sleep 59 0 0:00:02 0.0% in.mpathd/1
2620 root 2816K 1848K sleep 59 0 0:00:00 0.0% init/1
51 netcfg 2668K 1384K sleep 59 0 0:00:00 0.0% netcfgd/2
38 root 2988K 2004K sleep 59 0 0:00:00 0.0% dlmgmtd/8
637 rpz 8156K 6644K sleep 59 0 0:00:00 0.0% bash/1
Total: 205 processes, 2256 lwps, load averages: 2.04, 2.04, 2.04
Checking the microstates shows 2 threads burning 100% of their time in userland.
rpz@kalm:~$ prstat -mLc -p 2419
Please wait...
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWP
2419 root 14 0.0 0.0 0.0 0.0 86 0.1 0.0 0 36 0 0 sled-agent/12
2419 root 14 0.0 0.0 0.0 0.0 86 0.0 0.0 0 36 0 0 sled-agent/8
2419 root 0.0 0.0 0.0 0.0 0.0 36 64 0.0 2 0 5 0 sled-agent/13
2419 root 0.0 0.0 0.0 0.0 0.0 65 35 0.0 1 0 3 0 sled-agent/2
2419 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 sled-agent/16
2419 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 1 0 sled-agent/14
2419 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 sled-agent/11
2419 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 sled-agent/10
2419 root 0.0 0.0 0.0 0.0 0.0 99 0.7 0.0 0 0 0 0 sled-agent/9
2419 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 sled-agent/7
2419 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 sled-agent/6
2419 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 sled-agent/5
2419 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 sled-agent/4
2419 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 sled-agent/3
2419 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 sled-agent/1
Total: 1 processes, 16 lwps, load averages: 2.04, 2.09, 2.23
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWP
2419 root 100 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 260 0 0 sled-agent/12
2419 root 100 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 273 0 0 sled-agent/8
2419 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 3 2 4 0 sled-agent/2
2419 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 1 0 1 0 sled-agent/16
2419 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 sled-agent/14
2419 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 sled-agent/13
2419 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 sled-agent/11
2419 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 sled-agent/10
2419 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 sled-agent/9
2419 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 sled-agent/7
2419 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 sled-agent/6
2419 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 sled-agent/5
2419 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 sled-agent/4
2419 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 sled-agent/3
2419 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 sled-agent/1
Total: 1 processes, 16 lwps, load averages: 2.04, 2.09, 2.23
This seemed odd, as nothing I had done warranted such a load (and sure enough, the CPU usage has remained steady for hours, so I don't believe it's some transient task). Next I figured I'd peek at the user-level stacks to see what sled-agent is up to.
rpz@kalm:~/oxidecomputer/cli$ pfexec pstack 2419 | rustfilt
2419: /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.
--------------------- thread# 1 / lwp# 1 ---------------------
fffffc7fef1311a7 lwp_park (0, 0, 0)
fffffc7fef12a65b cond_wait_queue (21e9728, 21e9710, 0) + 5b
fffffc7fef12ac9a __cond_wait (21e9728, 21e9710) + 8a
fffffc7fef12ad04 cond_wait (21e9728, 21e9710) + 24
fffffc7fef12ad3b pthread_cond_wait (21e9728, 21e9710) + b
0000000001cd592b parking_lot::condvar::Condvar::wait_until_internal () + 55b
0000000001ca3956 tokio::park::thread::Inner::park () + 66
0000000001ca3ea8 <tokio::park::thread::CachedParkThread as tokio::park::Park>::park () + 28
0000000000eefe19 tokio::park::thread::CachedParkThread::block_on () + d9
0000000000f2b7e3 tokio::runtime::Runtime::block_on () + 63
0000000000e65ada sled_agent::main () + da
0000000000f346f6 std::sys_common::backtrace::__rust_begin_short_backtrace () + 6
0000000000eb63cc std::rt::lang_start::{{closure}} () + c
0000000001dcd3b7 std::rt::lang_start_internal () + 1d7
0000000000e65dc9 main () + 29
0000000000da67f7 _start_crt () + 87
0000000000da6758 _start () + 18
--------- thread# 2 / lwp# 2 [tokio-runtime-worker] ----------
fffffc7fef137aba ioctl (3, d001, fffffc7fedbff970)
0000000001cce242 mio::poll::Poll::poll () + 62
0000000001cac589 tokio::io::driver::Driver::turn () + 79
0000000001c8bcb8 <tokio::park::either::Either<A,B> as tokio::park::Park>::park_timeout () + 48
0000000001c6cb86 tokio::time::driver::Driver<P>::park_internal () + 186
0000000001c8baec <tokio::park::either::Either<A,B> as tokio::park::Park>::park () + 4c
0000000001c6ab8f <tokio::runtime::thread_pool::park::Parker as tokio::park::Park>::park () + 8f
0000000001caa690 tokio::runtime::thread_pool::worker::Context::park_timeout () + 170
0000000001ca9e32 tokio::runtime::thread_pool::worker::Context::run () + d22
0000000001ca8f9e tokio::runtime::thread_pool::worker::run () + 6e
0000000001cb3a18 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () + 48
0000000001c6db20 tokio::runtime::task::harness::Harness<T,S>::poll () + a0
0000000001c91d60 tokio::runtime::blocking::pool::Inner::run () + c0
0000000001c7372b std::sys_common::backtrace::__rust_begin_short_backtrace () + 7b
0000000001c78df9 core::ops::function::FnOnce::call_once{{vtable.shim}} () + 79
0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
fffffc7fef130e4c _thrp_setup (fffffc7feef50240) + 6c
fffffc7fef131160 _lwp_start ()
--------- thread# 3 / lwp# 3 [tokio-runtime-worker] ----------
fffffc7fef1311a7 lwp_park (0, 0, 0)
fffffc7fef12a65b cond_wait_queue (21e95d8, 21e95c0, 0) + 5b
fffffc7fef12ac9a __cond_wait (21e95d8, 21e95c0) + 8a
fffffc7fef12ad04 cond_wait (21e95d8, 21e95c0) + 24
fffffc7fef12ad3b pthread_cond_wait (21e95d8, 21e95c0) + b
0000000001cd592b parking_lot::condvar::Condvar::wait_until_internal () + 55b
0000000001c6ac46 <tokio::runtime::thread_pool::park::Parker as tokio::park::Park>::park () + 146
0000000001caa690 tokio::runtime::thread_pool::worker::Context::park_timeout () + 170
0000000001ca9e32 tokio::runtime::thread_pool::worker::Context::run () + d22
0000000001ca8f9e tokio::runtime::thread_pool::worker::run () + 6e
0000000001cb3a18 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () + 48
0000000001c6db20 tokio::runtime::task::harness::Harness<T,S>::poll () + a0
0000000001c91d60 tokio::runtime::blocking::pool::Inner::run () + c0
0000000001c7372b std::sys_common::backtrace::__rust_begin_short_backtrace () + 7b
0000000001c78df9 core::ops::function::FnOnce::call_once{{vtable.shim}} () + 79
0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
fffffc7fef130e4c _thrp_setup (fffffc7feef50a40) + 6c
fffffc7fef131160 _lwp_start ()
--------- thread# 4 / lwp# 4 [tokio-runtime-worker] ----------
fffffc7fef1311a7 lwp_park (0, 0, 0)
fffffc7fef12a65b cond_wait_queue (21e9568, 21e9550, 0) + 5b
fffffc7fef12ac9a __cond_wait (21e9568, 21e9550) + 8a
fffffc7fef12ad04 cond_wait (21e9568, 21e9550) + 24
fffffc7fef12ad3b pthread_cond_wait (21e9568, 21e9550) + b
0000000001cd592b parking_lot::condvar::Condvar::wait_until_internal () + 55b
0000000001c6ac46 <tokio::runtime::thread_pool::park::Parker as tokio::park::Park>::park () + 146
0000000001caa690 tokio::runtime::thread_pool::worker::Context::park_timeout () + 170
0000000001ca9e32 tokio::runtime::thread_pool::worker::Context::run () + d22
0000000001ca8f9e tokio::runtime::thread_pool::worker::run () + 6e
0000000001cb3a18 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () + 48
0000000001c6db20 tokio::runtime::task::harness::Harness<T,S>::poll () + a0
0000000001c91d60 tokio::runtime::blocking::pool::Inner::run () + c0
0000000001c7372b std::sys_common::backtrace::__rust_begin_short_backtrace () + 7b
0000000001c78df9 core::ops::function::FnOnce::call_once{{vtable.shim}} () + 79
0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
fffffc7fef130e4c _thrp_setup (fffffc7feef51240) + 6c
fffffc7fef131160 _lwp_start ()
--------- thread# 5 / lwp# 5 [tokio-runtime-worker] ----------
fffffc7fef1311a7 lwp_park (0, 0, 0)
fffffc7fef12a65b cond_wait_queue (21e94f8, 21e94e0, 0) + 5b
fffffc7fef12ac9a __cond_wait (21e94f8, 21e94e0) + 8a
fffffc7fef12ad04 cond_wait (21e94f8, 21e94e0) + 24
fffffc7fef12ad3b pthread_cond_wait (21e94f8, 21e94e0) + b
0000000001cd592b parking_lot::condvar::Condvar::wait_until_internal () + 55b
0000000001c6ac46 <tokio::runtime::thread_pool::park::Parker as tokio::park::Park>::park () + 146
0000000001caa690 tokio::runtime::thread_pool::worker::Context::park_timeout () + 170
0000000001ca9e32 tokio::runtime::thread_pool::worker::Context::run () + d22
0000000001ca8f9e tokio::runtime::thread_pool::worker::run () + 6e
0000000001cb3a18 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () + 48
0000000001c6db20 tokio::runtime::task::harness::Harness<T,S>::poll () + a0
0000000001c91d60 tokio::runtime::blocking::pool::Inner::run () + c0
0000000001c7372b std::sys_common::backtrace::__rust_begin_short_backtrace () + 7b
0000000001c78df9 core::ops::function::FnOnce::call_once{{vtable.shim}} () + 79
0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
fffffc7fef130e4c _thrp_setup (fffffc7feef51a40) + 6c
fffffc7fef131160 _lwp_start ()
--------- thread# 6 / lwp# 6 [tokio-runtime-worker] ----------
fffffc7fef1311a7 lwp_park (0, 0, 0)
fffffc7fef12a65b cond_wait_queue (21e9488, 21e9470, 0) + 5b
fffffc7fef12ac9a __cond_wait (21e9488, 21e9470) + 8a
fffffc7fef12ad04 cond_wait (21e9488, 21e9470) + 24
fffffc7fef12ad3b pthread_cond_wait (21e9488, 21e9470) + b
0000000001cd592b parking_lot::condvar::Condvar::wait_until_internal () + 55b
0000000001c6ac46 <tokio::runtime::thread_pool::park::Parker as tokio::park::Park>::park () + 146
0000000001caa690 tokio::runtime::thread_pool::worker::Context::park_timeout () + 170
0000000001ca9e32 tokio::runtime::thread_pool::worker::Context::run () + d22
0000000001ca8f9e tokio::runtime::thread_pool::worker::run () + 6e
0000000001cb3a18 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () + 48
0000000001c6db20 tokio::runtime::task::harness::Harness<T,S>::poll () + a0
0000000001c91d60 tokio::runtime::blocking::pool::Inner::run () + c0
0000000001c7372b std::sys_common::backtrace::__rust_begin_short_backtrace () + 7b
0000000001c78df9 core::ops::function::FnOnce::call_once{{vtable.shim}} () + 79
0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
fffffc7fef130e4c _thrp_setup (fffffc7feef52240) + 6c
fffffc7fef131160 _lwp_start ()
--------- thread# 7 / lwp# 7 [tokio-runtime-worker] ----------
fffffc7fef1311a7 lwp_park (0, 0, 0)
fffffc7fef12a65b cond_wait_queue (21e9418, 21e9400, 0) + 5b
fffffc7fef12ac9a __cond_wait (21e9418, 21e9400) + 8a
fffffc7fef12ad04 cond_wait (21e9418, 21e9400) + 24
fffffc7fef12ad3b pthread_cond_wait (21e9418, 21e9400) + b
0000000001cd592b parking_lot::condvar::Condvar::wait_until_internal () + 55b
0000000001c6ac46 <tokio::runtime::thread_pool::park::Parker as tokio::park::Park>::park () + 146
0000000001caa690 tokio::runtime::thread_pool::worker::Context::park_timeout () + 170
0000000001ca9e32 tokio::runtime::thread_pool::worker::Context::run () + d22
0000000001ca8f9e tokio::runtime::thread_pool::worker::run () + 6e
0000000001cb3a18 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () + 48
0000000001c6db20 tokio::runtime::task::harness::Harness<T,S>::poll () + a0
0000000001c91d60 tokio::runtime::blocking::pool::Inner::run () + c0
0000000001c7372b std::sys_common::backtrace::__rust_begin_short_backtrace () + 7b
0000000001c78df9 core::ops::function::FnOnce::call_once{{vtable.shim}} () + 79
0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
fffffc7fef130e4c _thrp_setup (fffffc7feef52a40) + 6c
fffffc7fef131160 _lwp_start ()
--------- thread# 8 / lwp# 8 [tokio-runtime-worker] ----------
0000000001c64c67 futures_core::task::__internal::atomic_waker::AtomicWaker::register () + 57
000000000114c274 futures_util::stream::stream::StreamExt::poll_next_unpin () + 174
000000000100c4f6 <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll () + 4b6
0000000001185c80 tokio::runtime::task::harness::Harness<T,S>::poll () + 70
0000000001cb4baa std::thread::local::LocalKey<T>::with () + 5a
0000000001caa46a tokio::runtime::thread_pool::worker::Context::run_task () + 25a
0000000001ca95ce tokio::runtime::thread_pool::worker::Context::run () + 4be
0000000001ca8f9e tokio::runtime::thread_pool::worker::run () + 6e
0000000001cb3a18 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () + 48
0000000001c6db20 tokio::runtime::task::harness::Harness<T,S>::poll () + a0
0000000001c91d60 tokio::runtime::blocking::pool::Inner::run () + c0
0000000001c7372b std::sys_common::backtrace::__rust_begin_short_backtrace () + 7b
0000000001c78df9 core::ops::function::FnOnce::call_once{{vtable.shim}} () + 79
0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
fffffc7fef130e4c _thrp_setup (fffffc7feef53240) + 6c
fffffc7fef131160 _lwp_start ()
--------- thread# 9 / lwp# 9 [tokio-runtime-worker] ----------
fffffc7fef1311a7 lwp_park (0, 0, 0)
fffffc7fef12a65b cond_wait_queue (21e92c8, 21e92b0, 0) + 5b
fffffc7fef12ac9a __cond_wait (21e92c8, 21e92b0) + 8a
fffffc7fef12ad04 cond_wait (21e92c8, 21e92b0) + 24
fffffc7fef12ad3b pthread_cond_wait (21e92c8, 21e92b0) + b
0000000001cd592b parking_lot::condvar::Condvar::wait_until_internal () + 55b
0000000001c6ac46 <tokio::runtime::thread_pool::park::Parker as tokio::park::Park>::park () + 146
0000000001caa690 tokio::runtime::thread_pool::worker::Context::park_timeout () + 170
0000000001ca9e32 tokio::runtime::thread_pool::worker::Context::run () + d22
0000000001ca8f9e tokio::runtime::thread_pool::worker::run () + 6e
0000000001cb3a18 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () + 48
0000000001c6db20 tokio::runtime::task::harness::Harness<T,S>::poll () + a0
0000000001c91d60 tokio::runtime::blocking::pool::Inner::run () + c0
0000000001c7372b std::sys_common::backtrace::__rust_begin_short_backtrace () + 7b
0000000001c78df9 core::ops::function::FnOnce::call_once{{vtable.shim}} () + 79
0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
fffffc7fef130e4c _thrp_setup (fffffc7feef53a40) + 6c
fffffc7fef131160 _lwp_start ()
-------- thread# 10 / lwp# 10 [tokio-runtime-worker] ---------
fffffc7fef1311a7 lwp_park (0, 0, 0)
fffffc7fef12a65b cond_wait_queue (21e9338, 21e9320, 0) + 5b
fffffc7fef12ac9a __cond_wait (21e9338, 21e9320) + 8a
fffffc7fef12ad04 cond_wait (21e9338, 21e9320) + 24
fffffc7fef12ad3b pthread_cond_wait (21e9338, 21e9320) + b
0000000001cd592b parking_lot::condvar::Condvar::wait_until_internal () + 55b
0000000001c6ac46 <tokio::runtime::thread_pool::park::Parker as tokio::park::Park>::park () + 146
0000000001caa690 tokio::runtime::thread_pool::worker::Context::park_timeout () + 170
0000000001ca9e32 tokio::runtime::thread_pool::worker::Context::run () + d22
0000000001ca8f9e tokio::runtime::thread_pool::worker::run () + 6e
0000000001cb3a18 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () + 48
0000000001c6db20 tokio::runtime::task::harness::Harness<T,S>::poll () + a0
0000000001c91d60 tokio::runtime::blocking::pool::Inner::run () + c0
0000000001c7372b std::sys_common::backtrace::__rust_begin_short_backtrace () + 7b
0000000001c78df9 core::ops::function::FnOnce::call_once{{vtable.shim}} () + 79
0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
fffffc7fef130e4c _thrp_setup (fffffc7feef54240) + 6c
fffffc7fef131160 _lwp_start ()
-------- thread# 11 / lwp# 11 [tokio-runtime-worker] ---------
fffffc7fef1311a7 lwp_park (0, 0, 0)
fffffc7fef12a65b cond_wait_queue (21e9258, 21e9240, 0) + 5b
fffffc7fef12ac9a __cond_wait (21e9258, 21e9240) + 8a
fffffc7fef12ad04 cond_wait (21e9258, 21e9240) + 24
fffffc7fef12ad3b pthread_cond_wait (21e9258, 21e9240) + b
0000000001cd592b parking_lot::condvar::Condvar::wait_until_internal () + 55b
0000000001c6ac46 <tokio::runtime::thread_pool::park::Parker as tokio::park::Park>::park () + 146
0000000001caa690 tokio::runtime::thread_pool::worker::Context::park_timeout () + 170
0000000001ca9e32 tokio::runtime::thread_pool::worker::Context::run () + d22
0000000001ca8f9e tokio::runtime::thread_pool::worker::run () + 6e
0000000001cb3a18 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () + 48
0000000001c6db20 tokio::runtime::task::harness::Harness<T,S>::poll () + a0
0000000001c91d60 tokio::runtime::blocking::pool::Inner::run () + c0
0000000001c7372b std::sys_common::backtrace::__rust_begin_short_backtrace () + 7b
0000000001c78df9 core::ops::function::FnOnce::call_once{{vtable.shim}} () + 79
0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
fffffc7fef130e4c _thrp_setup (fffffc7feef54a40) + 6c
fffffc7fef131160 _lwp_start ()
-------- thread# 12 / lwp# 12 [tokio-runtime-worker] ---------
fffffc7fef091ae9 memcpy () + 1379
000000000114c314 futures_util::stream::stream::StreamExt::poll_next_unpin () + 214
000000000100c4f6 <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll () + 4b6
0000000001185c80 tokio::runtime::task::harness::Harness<T,S>::poll () + 70
0000000001cb4baa std::thread::local::LocalKey<T>::with () + 5a
0000000001caa46a tokio::runtime::thread_pool::worker::Context::run_task () + 25a
0000000001ca95ce tokio::runtime::thread_pool::worker::Context::run () + 4be
0000000001ca8f9e tokio::runtime::thread_pool::worker::run () + 6e
0000000001cb3a18 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () + 48
0000000001c6db20 tokio::runtime::task::harness::Harness<T,S>::poll () + a0
0000000001c91d60 tokio::runtime::blocking::pool::Inner::run () + c0
0000000001c7372b std::sys_common::backtrace::__rust_begin_short_backtrace () + 7b
0000000001c78df9 core::ops::function::FnOnce::call_once{{vtable.shim}} () + 79
0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
fffffc7fef130e4c _thrp_setup (fffffc7feef55240) + 6c
fffffc7fef131160 _lwp_start ()
-------- thread# 13 / lwp# 13 [tokio-runtime-worker] ---------
fffffc7fef1311a7 lwp_park (0, 0, 0)
fffffc7fef12a65b cond_wait_queue (21e9178, 21e9160, 0) + 5b
fffffc7fef12ac9a __cond_wait (21e9178, 21e9160) + 8a
fffffc7fef12ad04 cond_wait (21e9178, 21e9160) + 24
fffffc7fef12ad3b pthread_cond_wait (21e9178, 21e9160) + b
0000000001cd592b parking_lot::condvar::Condvar::wait_until_internal () + 55b
0000000001c6ac46 <tokio::runtime::thread_pool::park::Parker as tokio::park::Park>::park () + 146
0000000001caa690 tokio::runtime::thread_pool::worker::Context::park_timeout () + 170
0000000001ca9e32 tokio::runtime::thread_pool::worker::Context::run () + d22
0000000001ca8f9e tokio::runtime::thread_pool::worker::run () + 6e
0000000001cb3a18 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () + 48
0000000001c6db20 tokio::runtime::task::harness::Harness<T,S>::poll () + a0
0000000001c91d60 tokio::runtime::blocking::pool::Inner::run () + c0
0000000001c7372b std::sys_common::backtrace::__rust_begin_short_backtrace () + 7b
0000000001c78df9 core::ops::function::FnOnce::call_once{{vtable.shim}} () + 79
0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
fffffc7fef130e4c _thrp_setup (fffffc7feef55a40) + 6c
fffffc7fef131160 _lwp_start ()
-------------------- thread# 14 / lwp# 14 --------------------
fffffc7fef1311a7 lwp_park (0, 0, 0)
fffffc7fef12a65b cond_wait_queue (21e67f0, 21e5520, 0) + 5b
fffffc7fef12ac9a __cond_wait (21e67f0, 21e5520) + 8a
fffffc7fef12ad04 cond_wait (21e67f0, 21e5520) + 24
fffffc7fef12ad3b pthread_cond_wait (21e67f0, 21e5520) + b
0000000001dce0a1 std::thread::park () + b1
0000000000f67b95 crossbeam_channel::context::Context::with::{{closure}} () + 395
0000000000f68248 crossbeam_channel::flavors::array::Channel<T>::recv () + 308
0000000000ed935c crossbeam_channel::channel::Receiver<T>::recv () + 4c
0000000000f345de std::sys_common::backtrace::__rust_begin_short_backtrace () + 21e
0000000000f7c6ed core::ops::function::FnOnce::call_once{{vtable.shim}} () + 8d
0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
fffffc7fef130e4c _thrp_setup (fffffc7feef56240) + 6c
fffffc7fef131160 _lwp_start ()
-------------------- thread# 16 / lwp# 16 --------------------
fffffc7fef1311a7 lwp_park (0, fffffc7feedaee60, 0)
fffffc7fef12a65b cond_wait_queue (fffffc7fef2a6820, fffffc7fef2a6840, fffffc7feedaee60) + 5b
fffffc7fef12aaad cond_wait_common (fffffc7fef2a6820, fffffc7fef2a6840, fffffc7feedaee60) + 1dd
fffffc7fef12adb5 __cond_timedwait (fffffc7fef2a6820, fffffc7fef2a6840, fffffc7feedaef50) + 65
fffffc7fef12ae71 cond_timedwait (fffffc7fef2a6820, fffffc7fef2a6840, fffffc7feedaef50) + 31
fffffc7fef259dd6 umem_update_thread (0) + 1c6
fffffc7fef130e4c _thrp_setup (fffffc7feef57240) + 6c
fffffc7fef131160 _lwp_start ()
-------------------- thread# 23 / lwp# 23 --------------------
0000000001df4dc0 std::sys::unix::thread::Thread::new::thread_start(), exit value = 0x0000000000000000
** zombie (exited, not detached, not yet joined) **
Most threads are parked (so doing nothing), but I did notice one thread in memcpy, which got me curious: is there a lot of memcpy going on? So I used DTrace to determine 1) how many times memcpy is called in 1 second, 2) what the distribution of memcpy sizes is, and 3) what the unique userland stacks and their frequencies are.
rpz@kalm:~$ pfexec dtrace -p 2419 -qn 'pid$target::memcpy:entry { @[probefunc] = quantize(arg2); @s[ustack()] = count(); } tick-1s { exit(0); }' | rustfilt
memcpy
value ------------- Distribution ------------- count
64 | 0
128 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1024615
256 | 0
512 | 1
1024 | 0
libc.so.1`memcpy
sled-agent`<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll+0x92c
sled-agent`tokio::runtime::task::harness::Harness<T,S>::poll+0x64
sled-agent`std::thread::local::LocalKey<T>::with+0x5a
sled-agent`tokio::runtime::thread_pool::worker::Context::run_task+0x25a
sled-agent`tokio::runtime::thread_pool::worker::Context::run+0x4be
sled-agent`tokio::runtime::thread_pool::worker::run+0x6e
sled-agent`<core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once+0x48
sled-agent`tokio::runtime::task::harness::Harness<T,S>::poll+0xa0
sled-agent`tokio::runtime::blocking::pool::Inner::run+0xc0
sled-agent`std::sys_common::backtrace::__rust_begin_short_backtrace+0x7b
sled-agent`core::ops::function::FnOnce::call_once{{vtable.shim}}+0x79
sled-agent`std::sys::unix::thread::Thread::new::thread_start+0x27
libc.so.1`_thrp_setup+0x6c
libc.so.1`_lwp_start
1
libc.so.1`memcpy
sled-agent`<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll+0x93c
sled-agent`tokio::runtime::task::harness::Harness<T,S>::poll+0x64
sled-agent`std::thread::local::LocalKey<T>::with+0x5a
sled-agent`tokio::runtime::thread_pool::worker::Context::run_task+0x25a
sled-agent`tokio::runtime::thread_pool::worker::Context::run+0x4be
sled-agent`tokio::runtime::thread_pool::worker::run+0x6e
sled-agent`<core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once+0x48
sled-agent`tokio::runtime::task::harness::Harness<T,S>::poll+0xa0
sled-agent`tokio::runtime::blocking::pool::Inner::run+0xc0
sled-agent`std::sys_common::backtrace::__rust_begin_short_backtrace+0x7b
sled-agent`core::ops::function::FnOnce::call_once{{vtable.shim}}+0x79
sled-agent`std::sys::unix::thread::Thread::new::thread_start+0x27
libc.so.1`_thrp_setup+0x6c
libc.so.1`_lwp_start
1
libc.so.1`memcpy
sled-agent`tokio_tungstenite::compat::cvt+0x168
sled-agent`futures_util::stream::stream::StreamExt::poll_next_unpin+0x223
sled-agent`<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll+0x4b6
sled-agent`tokio::runtime::task::harness::Harness<T,S>::poll+0x70
sled-agent`std::thread::local::LocalKey<T>::with+0x5a
sled-agent`tokio::runtime::thread_pool::worker::Context::run_task+0x25a
sled-agent`tokio::runtime::thread_pool::worker::Context::run+0x4be
sled-agent`tokio::runtime::thread_pool::worker::run+0x6e
sled-agent`<core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once+0x48
sled-agent`tokio::runtime::task::harness::Harness<T,S>::poll+0xa0
sled-agent`tokio::runtime::blocking::pool::Inner::run+0xc0
sled-agent`std::sys_common::backtrace::__rust_begin_short_backtrace+0x7b
sled-agent`core::ops::function::FnOnce::call_once{{vtable.shim}}+0x79
sled-agent`std::sys::unix::thread::Thread::new::thread_start+0x27
libc.so.1`_thrp_setup+0x6c
libc.so.1`_lwp_start
256153
libc.so.1`memcpy
sled-agent`tungstenite::protocol::WebSocket<Stream>::read_message+0x159e
sled-agent`futures_util::stream::stream::StreamExt::poll_next_unpin+0x214
sled-agent`<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll+0x4b6
sled-agent`tokio::runtime::task::harness::Harness<T,S>::poll+0x70
sled-agent`std::thread::local::LocalKey<T>::with+0x5a
sled-agent`tokio::runtime::thread_pool::worker::Context::run_task+0x25a
sled-agent`tokio::runtime::thread_pool::worker::Context::run+0x4be
sled-agent`tokio::runtime::thread_pool::worker::run+0x6e
sled-agent`<core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once+0x48
sled-agent`tokio::runtime::task::harness::Harness<T,S>::poll+0xa0
sled-agent`tokio::runtime::blocking::pool::Inner::run+0xc0
sled-agent`std::sys_common::backtrace::__rust_begin_short_backtrace+0x7b
sled-agent`core::ops::function::FnOnce::call_once{{vtable.shim}}+0x79
sled-agent`std::sys::unix::thread::Thread::new::thread_start+0x27
libc.so.1`_thrp_setup+0x6c
libc.so.1`_lwp_start
256153
libc.so.1`memcpy
sled-agent`tungstenite::protocol::WebSocket<Stream>::read_message+0x1589
sled-agent`futures_util::stream::stream::StreamExt::poll_next_unpin+0x214
sled-agent`<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll+0x4b6
sled-agent`tokio::runtime::task::harness::Harness<T,S>::poll+0x70
sled-agent`std::thread::local::LocalKey<T>::with+0x5a
sled-agent`tokio::runtime::thread_pool::worker::Context::run_task+0x25a
sled-agent`tokio::runtime::thread_pool::worker::Context::run+0x4be
sled-agent`tokio::runtime::thread_pool::worker::run+0x6e
sled-agent`<core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once+0x48
sled-agent`tokio::runtime::task::harness::Harness<T,S>::poll+0xa0
sled-agent`tokio::runtime::blocking::pool::Inner::run+0xc0
sled-agent`std::sys_common::backtrace::__rust_begin_short_backtrace+0x7b
sled-agent`core::ops::function::FnOnce::call_once{{vtable.shim}}+0x79
sled-agent`std::sys::unix::thread::Thread::new::thread_start+0x27
libc.so.1`_thrp_setup+0x6c
libc.so.1`_lwp_start
256154
libc.so.1`memcpy
sled-agent`<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll+0x4e0
sled-agent`tokio::runtime::task::harness::Harness<T,S>::poll+0x70
sled-agent`std::thread::local::LocalKey<T>::with+0x5a
sled-agent`tokio::runtime::thread_pool::worker::Context::run_task+0x25a
sled-agent`tokio::runtime::thread_pool::worker::Context::run+0x4be
sled-agent`tokio::runtime::thread_pool::worker::run+0x6e
sled-agent`<core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once+0x48
sled-agent`tokio::runtime::task::harness::Harness<T,S>::poll+0xa0
sled-agent`tokio::runtime::blocking::pool::Inner::run+0xc0
sled-agent`std::sys_common::backtrace::__rust_begin_short_backtrace+0x7b
sled-agent`core::ops::function::FnOnce::call_once{{vtable.shim}}+0x79
sled-agent`std::sys::unix::thread::Thread::new::thread_start+0x27
libc.so.1`_thrp_setup+0x6c
libc.so.1`_lwp_start
256154
As this output shows, memcpy is called over 1 million times a second to poll futures. On @davepacheco's recommendation, I also profiled the user stacks on CPU and generated a flamegraph, attached to this issue.
The flamegraph was generated from the following DTrace one-liner.
rpz@kalm:~$ pfexec dtrace -x ustackframes=100 -n 'profile-97 /pid == 2419/ { @[ustack()] = count(); } tick-60s { exit(0); }' -o sled-agent-stacks.out
It shows memcpy and AtomicWaker::register() dominating the samples. As Dave pointed out in chat, the total number of samples lines up with 2 cores being burned for an entire second, given my 12-core machine.
dap> trying to calculate the % utilization reflected in the flame graph
97 Hz for 60s would be 5820 samples per core, if we were pegged on CPU
You've got 11627 samples here
so almost 2 cores' worth
or 16% -- or about what you said above!
although from the prstat output above it looks like it's all in two threads
Asking around in chat, no one else reported a sled-agent eating CPU like mine. The only additional data point I can offer is that at one point I caused an error in sled-agent while trying to create a new instance. I had updated to the latest version of xde, but had not updated the sled-agent code to be able to talk to it. This led to sled-agent sending a serialized buffer that was too small for the CreateXde request (described in oxidecomputer/opte#161), which produced the following sled-agent error:
{"msg":"request completed","v":0,"name":"SledAgent","level":30,"time":"2022-06-23T19:32:58.149139933Z","hostname":"kalm","pid":2419,"uri":"/instances/d9356749-e1af-4856-94e4-fd1b827fe5dc","method":"PUT","req_id":"f56b57b4-db98-4975-8585-fac2dc0374c5","remote_addr":"[fd00:1122:3344:101::3]:56544","local_addr":"[fd00:1122:3344:101::1]:12345","component":"dropshot (SledAgent)","error_message_external":"Internal Server Error","error_message_internal":"Error managing instances: Instance error: Failure interacting with the OPTE ioctl(2) interface: command CreateXde failed: DeserCmdReq(\"Hit the end of buffer, expected more data\")","response_code":"500"}
I cannot say whether sled-agent started burning CPU because of this event, or if it was doing so the entire time, as I didn't check before and after.