Skip to content

gratuitous CPU usage by sled-agent #1264

@rzezeski

Description

@rzezeski

While working with a recent version of Omicron (commit 6e97f9e), I noticed sled-agent with a constant CPU usage of 15-17%.

rpz@kalm:~$ prstat -c 1
Please wait...
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
  2419 root       52M   26M cpu7     0    0  15:29:54  17% sled-agent/16
  6514 root      905M  456M sleep   59    0   0:28:54 0.1% cockroach/20
  7350 root      853M  280M sleep   59    0   0:33:44 0.1% clickhouse/154
     5 root        0K    0K sleep   98  -20   0:22:44 0.0% zpool-rpool/240
  2391 root        0K    0K sleep   98  -20   0:04:54 0.0% zpool-oxp_d462a/240
  8106 root       42M   29M sleep   59    0   0:01:53 0.0% oximeter/15
  2603 root        0K    0K sleep   60    -   0:00:00 0.0% zsched/1
   327 root     2204K  804K sleep   59    0   0:00:00 0.0% utmpd/1
    79 root     2588K  864K sleep   59    0   0:00:02 0.0% in.mpathd/1
  2620 root     2816K 1848K sleep   59    0   0:00:00 0.0% init/1
    51 netcfg   2668K 1384K sleep   59    0   0:00:00 0.0% netcfgd/2
    38 root     2988K 2004K sleep   59    0   0:00:00 0.0% dlmgmtd/8
   637 rpz      8156K 6644K sleep   59    0   0:00:00 0.0% bash/1
   260 root     2680K 1580K sleep   59    0   0:00:00 0.0% pfexecd/3
    63 netadm   4424K 2812K sleep   59    0   0:00:00 0.0% ipmgmtd/4
Total: 205 processes, 2256 lwps, load averages: 2.04, 2.04, 2.04
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
  2419 root       52M   26M cpu1     0    0  15:29:56  17% sled-agent/16
  6514 root      905M  456M sleep   59    0   0:28:54 0.1% cockroach/20
  7350 root      853M  280M sleep   59    0   0:33:44 0.1% clickhouse/154
     5 root        0K    0K sleep   98  -20   0:22:44 0.1% zpool-rpool/240
  8106 root       42M   29M sleep   59    0   0:01:53 0.0% oximeter/15
  2391 root        0K    0K sleep   98  -20   0:04:54 0.0% zpool-oxp_d462a/240
 14722 rpz      4640K 3736K cpu9    54    0   0:00:00 0.0% prstat/1
  2603 root        0K    0K sleep   60    -   0:00:00 0.0% zsched/1
   327 root     2204K  804K sleep   59    0   0:00:00 0.0% utmpd/1
    79 root     2588K  864K sleep   59    0   0:00:02 0.0% in.mpathd/1
  2620 root     2816K 1848K sleep   59    0   0:00:00 0.0% init/1
    51 netcfg   2668K 1384K sleep   59    0   0:00:00 0.0% netcfgd/2
    38 root     2988K 2004K sleep   59    0   0:00:00 0.0% dlmgmtd/8
   637 rpz      8156K 6644K sleep   59    0   0:00:00 0.0% bash/1
   260 root     2680K 1580K sleep   59    0   0:00:00 0.0% pfexecd/3
Total: 205 processes, 2256 lwps, load averages: 2.04, 2.04, 2.04
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
  2419 root       52M   26M cpu0     0    0  15:29:58  17% sled-agent/16
  6514 root      905M  457M sleep   59    0   0:28:54 0.1% cockroach/20
  7350 root      853M  280M sleep   59    0   0:33:44 0.1% clickhouse/154
     5 root        0K    0K sleep   98  -20   0:22:44 0.1% zpool-rpool/240
  8106 root       42M   29M sleep   59    0   0:01:53 0.0% oximeter/15
  2391 root        0K    0K sleep   98  -20   0:04:54 0.0% zpool-oxp_d462a/240
 14722 rpz      4640K 3736K cpu9    59    0   0:00:00 0.0% prstat/1
   636 rpz        11M 9416K sleep   59    0   0:00:21 0.0% tmux/1
  2603 root        0K    0K sleep   60    -   0:00:00 0.0% zsched/1
   327 root     2204K  804K sleep   59    0   0:00:00 0.0% utmpd/1
    79 root     2588K  864K sleep   59    0   0:00:02 0.0% in.mpathd/1
  2620 root     2816K 1848K sleep   59    0   0:00:00 0.0% init/1
    51 netcfg   2668K 1384K sleep   59    0   0:00:00 0.0% netcfgd/2
    38 root     2988K 2004K sleep   59    0   0:00:00 0.0% dlmgmtd/8
   637 rpz      8156K 6644K sleep   59    0   0:00:00 0.0% bash/1
Total: 205 processes, 2256 lwps, load averages: 2.04, 2.04, 2.04

Checking the microstates shows 2 threads burning 100% of their time in userland.

rpz@kalm:~$ prstat -mLc -p 2419
Please wait...
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWP   
  2419 root      14 0.0 0.0 0.0 0.0  86 0.1 0.0   0  36   0   0 sled-agent/12
  2419 root      14 0.0 0.0 0.0 0.0  86 0.0 0.0   0  36   0   0 sled-agent/8
  2419 root     0.0 0.0 0.0 0.0 0.0  36  64 0.0   2   0   5   0 sled-agent/13
  2419 root     0.0 0.0 0.0 0.0 0.0  65  35 0.0   1   0   3   0 sled-agent/2
  2419 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 sled-agent/16
  2419 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   1   0 sled-agent/14
  2419 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 sled-agent/11
  2419 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 sled-agent/10
  2419 root     0.0 0.0 0.0 0.0 0.0  99 0.7 0.0   0   0   0   0 sled-agent/9
  2419 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 sled-agent/7
  2419 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 sled-agent/6
  2419 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 sled-agent/5
  2419 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 sled-agent/4
  2419 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 sled-agent/3
  2419 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 sled-agent/1
Total: 1 processes, 16 lwps, load averages: 2.04, 2.09, 2.23
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWP   
  2419 root     100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0 260   0   0 sled-agent/12
  2419 root     100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0 273   0   0 sled-agent/8
  2419 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   3   2   4   0 sled-agent/2
  2419 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   1   0   1   0 sled-agent/16
  2419 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 sled-agent/14
  2419 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 sled-agent/13
  2419 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 sled-agent/11
  2419 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 sled-agent/10
  2419 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 sled-agent/9
  2419 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 sled-agent/7
  2419 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 sled-agent/6
  2419 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 sled-agent/5
  2419 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 sled-agent/4
  2419 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 sled-agent/3
  2419 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 sled-agent/1
Total: 1 processes, 16 lwps, load averages: 2.04, 2.09, 2.23

This seemed odd to me as there was nothing I did that I thought warranted such a load (and sure enough this CPU usage has remained steady for hours, so I don't believe it's some transient task). Next I figured I'd peek at the user-level stacks to see what sled-agent is up to.

pz@kalm:~/oxidecomputer/cli$ pfexec pstack 2419 | rustfilt
2419:   /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.
--------------------- thread# 1 / lwp# 1 ---------------------
 fffffc7fef1311a7 lwp_park (0, 0, 0)
 fffffc7fef12a65b cond_wait_queue (21e9728, 21e9710, 0) + 5b
 fffffc7fef12ac9a __cond_wait (21e9728, 21e9710) + 8a
 fffffc7fef12ad04 cond_wait (21e9728, 21e9710) + 24
 fffffc7fef12ad3b pthread_cond_wait (21e9728, 21e9710) + b
 0000000001cd592b parking_lot::condvar::Condvar::wait_until_internal () + 55b
 0000000001ca3956 tokio::park::thread::Inner::park () + 66
 0000000001ca3ea8 <tokio::park::thread::CachedParkThread as tokio::park::Park>::park () + 28
 0000000000eefe19 tokio::park::thread::CachedParkThread::block_on () + d9
 0000000000f2b7e3 tokio::runtime::Runtime::block_on () + 63
 0000000000e65ada sled_agent::main () + da
 0000000000f346f6 std::sys_common::backtrace::__rust_begin_short_backtrace () + 6
 0000000000eb63cc std::rt::lang_start::{{closure}} () + c
 0000000001dcd3b7 std::rt::lang_start_internal () + 1d7
 0000000000e65dc9 main () + 29
 0000000000da67f7 _start_crt () + 87
 0000000000da6758 _start () + 18
--------- thread# 2 / lwp# 2 [tokio-runtime-worker] ----------
 fffffc7fef137aba ioctl    (3, d001, fffffc7fedbff970)
 0000000001cce242 mio::poll::Poll::poll () + 62
 0000000001cac589 tokio::io::driver::Driver::turn () + 79
 0000000001c8bcb8 <tokio::park::either::Either<A,B> as tokio::park::Park>::park_timeout () + 48
 0000000001c6cb86 tokio::time::driver::Driver<P>::park_internal () + 186
 0000000001c8baec <tokio::park::either::Either<A,B> as tokio::park::Park>::park () + 4c
 0000000001c6ab8f <tokio::runtime::thread_pool::park::Parker as tokio::park::Park>::park () + 8f
 0000000001caa690 tokio::runtime::thread_pool::worker::Context::park_timeout () + 170
 0000000001ca9e32 tokio::runtime::thread_pool::worker::Context::run () + d22
 0000000001ca8f9e tokio::runtime::thread_pool::worker::run () + 6e
 0000000001cb3a18 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () + 48
 0000000001c6db20 tokio::runtime::task::harness::Harness<T,S>::poll () + a0
 0000000001c91d60 tokio::runtime::blocking::pool::Inner::run () + c0
 0000000001c7372b std::sys_common::backtrace::__rust_begin_short_backtrace () + 7b
 0000000001c78df9 core::ops::function::FnOnce::call_once{{vtable.shim}} () + 79
 0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
 fffffc7fef130e4c _thrp_setup (fffffc7feef50240) + 6c
 fffffc7fef131160 _lwp_start ()
--------- thread# 3 / lwp# 3 [tokio-runtime-worker] ----------
 fffffc7fef1311a7 lwp_park (0, 0, 0)
 fffffc7fef12a65b cond_wait_queue (21e95d8, 21e95c0, 0) + 5b
 fffffc7fef12ac9a __cond_wait (21e95d8, 21e95c0) + 8a
 fffffc7fef12ad04 cond_wait (21e95d8, 21e95c0) + 24
 fffffc7fef12ad3b pthread_cond_wait (21e95d8, 21e95c0) + b
 0000000001cd592b parking_lot::condvar::Condvar::wait_until_internal () + 55b
 0000000001c6ac46 <tokio::runtime::thread_pool::park::Parker as tokio::park::Park>::park () + 146
 0000000001caa690 tokio::runtime::thread_pool::worker::Context::park_timeout () + 170
 0000000001ca9e32 tokio::runtime::thread_pool::worker::Context::run () + d22
 0000000001ca8f9e tokio::runtime::thread_pool::worker::run () + 6e
 0000000001cb3a18 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () + 48
 0000000001c6db20 tokio::runtime::task::harness::Harness<T,S>::poll () + a0
 0000000001c91d60 tokio::runtime::blocking::pool::Inner::run () + c0
 0000000001c7372b std::sys_common::backtrace::__rust_begin_short_backtrace () + 7b
 0000000001c78df9 core::ops::function::FnOnce::call_once{{vtable.shim}} () + 79
 0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
 fffffc7fef130e4c _thrp_setup (fffffc7feef50a40) + 6c
 fffffc7fef131160 _lwp_start ()
--------- thread# 4 / lwp# 4 [tokio-runtime-worker] ----------
 fffffc7fef1311a7 lwp_park (0, 0, 0)
 fffffc7fef12a65b cond_wait_queue (21e9568, 21e9550, 0) + 5b
 fffffc7fef12ac9a __cond_wait (21e9568, 21e9550) + 8a
 fffffc7fef12ad04 cond_wait (21e9568, 21e9550) + 24
 fffffc7fef12ad3b pthread_cond_wait (21e9568, 21e9550) + b
 0000000001cd592b parking_lot::condvar::Condvar::wait_until_internal () + 55b
 0000000001c6ac46 <tokio::runtime::thread_pool::park::Parker as tokio::park::Park>::park () + 146
 0000000001caa690 tokio::runtime::thread_pool::worker::Context::park_timeout () + 170
 0000000001ca9e32 tokio::runtime::thread_pool::worker::Context::run () + d22
 0000000001ca8f9e tokio::runtime::thread_pool::worker::run () + 6e
 0000000001cb3a18 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () + 48
 0000000001c6db20 tokio::runtime::task::harness::Harness<T,S>::poll () + a0
 0000000001c91d60 tokio::runtime::blocking::pool::Inner::run () + c0
 0000000001c7372b std::sys_common::backtrace::__rust_begin_short_backtrace () + 7b
 0000000001c78df9 core::ops::function::FnOnce::call_once{{vtable.shim}} () + 79
 0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
 fffffc7fef130e4c _thrp_setup (fffffc7feef51240) + 6c
 fffffc7fef131160 _lwp_start ()
--------- thread# 5 / lwp# 5 [tokio-runtime-worker] ----------
 fffffc7fef1311a7 lwp_park (0, 0, 0)
 fffffc7fef12a65b cond_wait_queue (21e94f8, 21e94e0, 0) + 5b
 fffffc7fef12ac9a __cond_wait (21e94f8, 21e94e0) + 8a
 fffffc7fef12ad04 cond_wait (21e94f8, 21e94e0) + 24
 fffffc7fef12ad3b pthread_cond_wait (21e94f8, 21e94e0) + b
 0000000001cd592b parking_lot::condvar::Condvar::wait_until_internal () + 55b
 0000000001c6ac46 <tokio::runtime::thread_pool::park::Parker as tokio::park::Park>::park () + 146
 0000000001caa690 tokio::runtime::thread_pool::worker::Context::park_timeout () + 170
 0000000001ca9e32 tokio::runtime::thread_pool::worker::Context::run () + d22
 0000000001ca8f9e tokio::runtime::thread_pool::worker::run () + 6e
 0000000001cb3a18 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () + 48
 0000000001c6db20 tokio::runtime::task::harness::Harness<T,S>::poll () + a0
 0000000001c91d60 tokio::runtime::blocking::pool::Inner::run () + c0
 0000000001c7372b std::sys_common::backtrace::__rust_begin_short_backtrace () + 7b
 0000000001c78df9 core::ops::function::FnOnce::call_once{{vtable.shim}} () + 79
 0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
 fffffc7fef130e4c _thrp_setup (fffffc7feef51a40) + 6c
 fffffc7fef131160 _lwp_start ()
--------- thread# 6 / lwp# 6 [tokio-runtime-worker] ----------
 fffffc7fef1311a7 lwp_park (0, 0, 0)
 fffffc7fef12a65b cond_wait_queue (21e9488, 21e9470, 0) + 5b
 fffffc7fef12ac9a __cond_wait (21e9488, 21e9470) + 8a
 fffffc7fef12ad04 cond_wait (21e9488, 21e9470) + 24
 fffffc7fef12ad3b pthread_cond_wait (21e9488, 21e9470) + b
 0000000001cd592b parking_lot::condvar::Condvar::wait_until_internal () + 55b
 0000000001c6ac46 <tokio::runtime::thread_pool::park::Parker as tokio::park::Park>::park () + 146
 0000000001caa690 tokio::runtime::thread_pool::worker::Context::park_timeout () + 170
 0000000001ca9e32 tokio::runtime::thread_pool::worker::Context::run () + d22
 0000000001ca8f9e tokio::runtime::thread_pool::worker::run () + 6e
 0000000001cb3a18 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () + 48
 0000000001c6db20 tokio::runtime::task::harness::Harness<T,S>::poll () + a0
 0000000001c91d60 tokio::runtime::blocking::pool::Inner::run () + c0
 0000000001c7372b std::sys_common::backtrace::__rust_begin_short_backtrace () + 7b
 0000000001c78df9 core::ops::function::FnOnce::call_once{{vtable.shim}} () + 79
 0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
 fffffc7fef130e4c _thrp_setup (fffffc7feef52240) + 6c
 fffffc7fef131160 _lwp_start ()
--------- thread# 7 / lwp# 7 [tokio-runtime-worker] ----------
 fffffc7fef1311a7 lwp_park (0, 0, 0)
 fffffc7fef12a65b cond_wait_queue (21e9418, 21e9400, 0) + 5b
 fffffc7fef12ac9a __cond_wait (21e9418, 21e9400) + 8a
 fffffc7fef12ad04 cond_wait (21e9418, 21e9400) + 24
 fffffc7fef12ad3b pthread_cond_wait (21e9418, 21e9400) + b
 0000000001cd592b parking_lot::condvar::Condvar::wait_until_internal () + 55b
 0000000001c6ac46 <tokio::runtime::thread_pool::park::Parker as tokio::park::Park>::park () + 146
 0000000001caa690 tokio::runtime::thread_pool::worker::Context::park_timeout () + 170
 0000000001ca9e32 tokio::runtime::thread_pool::worker::Context::run () + d22
 0000000001ca8f9e tokio::runtime::thread_pool::worker::run () + 6e
 0000000001cb3a18 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () + 48
 0000000001c6db20 tokio::runtime::task::harness::Harness<T,S>::poll () + a0
 0000000001c91d60 tokio::runtime::blocking::pool::Inner::run () + c0
 0000000001c7372b std::sys_common::backtrace::__rust_begin_short_backtrace () + 7b
 0000000001c78df9 core::ops::function::FnOnce::call_once{{vtable.shim}} () + 79
 0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
 fffffc7fef130e4c _thrp_setup (fffffc7feef52a40) + 6c
 fffffc7fef131160 _lwp_start ()
--------- thread# 8 / lwp# 8 [tokio-runtime-worker] ----------
 0000000001c64c67 futures_core::task::__internal::atomic_waker::AtomicWaker::register () + 57
 000000000114c274 futures_util::stream::stream::StreamExt::poll_next_unpin () + 174
 000000000100c4f6 <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll () + 4b6
 0000000001185c80 tokio::runtime::task::harness::Harness<T,S>::poll () + 70
 0000000001cb4baa std::thread::local::LocalKey<T>::with () + 5a
 0000000001caa46a tokio::runtime::thread_pool::worker::Context::run_task () + 25a
 0000000001ca95ce tokio::runtime::thread_pool::worker::Context::run () + 4be
 0000000001ca8f9e tokio::runtime::thread_pool::worker::run () + 6e
 0000000001cb3a18 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () + 48
 0000000001c6db20 tokio::runtime::task::harness::Harness<T,S>::poll () + a0
 0000000001c91d60 tokio::runtime::blocking::pool::Inner::run () + c0
 0000000001c7372b std::sys_common::backtrace::__rust_begin_short_backtrace () + 7b
 0000000001c78df9 core::ops::function::FnOnce::call_once{{vtable.shim}} () + 79
 0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
 fffffc7fef130e4c _thrp_setup (fffffc7feef53240) + 6c
 fffffc7fef131160 _lwp_start ()
--------- thread# 9 / lwp# 9 [tokio-runtime-worker] ----------
 fffffc7fef1311a7 lwp_park (0, 0, 0)
 fffffc7fef12a65b cond_wait_queue (21e92c8, 21e92b0, 0) + 5b
 fffffc7fef12ac9a __cond_wait (21e92c8, 21e92b0) + 8a
 fffffc7fef12ad04 cond_wait (21e92c8, 21e92b0) + 24
 fffffc7fef12ad3b pthread_cond_wait (21e92c8, 21e92b0) + b
 0000000001cd592b parking_lot::condvar::Condvar::wait_until_internal () + 55b
 0000000001c6ac46 <tokio::runtime::thread_pool::park::Parker as tokio::park::Park>::park () + 146
 0000000001caa690 tokio::runtime::thread_pool::worker::Context::park_timeout () + 170
 0000000001ca9e32 tokio::runtime::thread_pool::worker::Context::run () + d22
 0000000001ca8f9e tokio::runtime::thread_pool::worker::run () + 6e
 0000000001cb3a18 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () + 48
 0000000001c6db20 tokio::runtime::task::harness::Harness<T,S>::poll () + a0
 0000000001c91d60 tokio::runtime::blocking::pool::Inner::run () + c0
 0000000001c7372b std::sys_common::backtrace::__rust_begin_short_backtrace () + 7b
 0000000001c78df9 core::ops::function::FnOnce::call_once{{vtable.shim}} () + 79
 0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
 fffffc7fef130e4c _thrp_setup (fffffc7feef53a40) + 6c
 fffffc7fef131160 _lwp_start ()
-------- thread# 10 / lwp# 10 [tokio-runtime-worker] ---------
 fffffc7fef1311a7 lwp_park (0, 0, 0)
 fffffc7fef12a65b cond_wait_queue (21e9338, 21e9320, 0) + 5b
 fffffc7fef12ac9a __cond_wait (21e9338, 21e9320) + 8a
 fffffc7fef12ad04 cond_wait (21e9338, 21e9320) + 24
 fffffc7fef12ad3b pthread_cond_wait (21e9338, 21e9320) + b
 0000000001cd592b parking_lot::condvar::Condvar::wait_until_internal () + 55b
 0000000001c6ac46 <tokio::runtime::thread_pool::park::Parker as tokio::park::Park>::park () + 146
 0000000001caa690 tokio::runtime::thread_pool::worker::Context::park_timeout () + 170
 0000000001ca9e32 tokio::runtime::thread_pool::worker::Context::run () + d22
 0000000001ca8f9e tokio::runtime::thread_pool::worker::run () + 6e
 0000000001cb3a18 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () + 48
 0000000001c6db20 tokio::runtime::task::harness::Harness<T,S>::poll () + a0
 0000000001c91d60 tokio::runtime::blocking::pool::Inner::run () + c0
 0000000001c7372b std::sys_common::backtrace::__rust_begin_short_backtrace () + 7b
 0000000001c78df9 core::ops::function::FnOnce::call_once{{vtable.shim}} () + 79
 0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
 fffffc7fef130e4c _thrp_setup (fffffc7feef54240) + 6c
 fffffc7fef131160 _lwp_start ()
-------- thread# 11 / lwp# 11 [tokio-runtime-worker] ---------
 fffffc7fef1311a7 lwp_park (0, 0, 0)
 fffffc7fef12a65b cond_wait_queue (21e9258, 21e9240, 0) + 5b
 fffffc7fef12ac9a __cond_wait (21e9258, 21e9240) + 8a
 fffffc7fef12ad04 cond_wait (21e9258, 21e9240) + 24
 fffffc7fef12ad3b pthread_cond_wait (21e9258, 21e9240) + b
 0000000001cd592b parking_lot::condvar::Condvar::wait_until_internal () + 55b
 0000000001c6ac46 <tokio::runtime::thread_pool::park::Parker as tokio::park::Park>::park () + 146
 0000000001caa690 tokio::runtime::thread_pool::worker::Context::park_timeout () + 170
 0000000001ca9e32 tokio::runtime::thread_pool::worker::Context::run () + d22
 0000000001ca8f9e tokio::runtime::thread_pool::worker::run () + 6e
 0000000001cb3a18 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () + 48
 0000000001c6db20 tokio::runtime::task::harness::Harness<T,S>::poll () + a0
 0000000001c91d60 tokio::runtime::blocking::pool::Inner::run () + c0
 0000000001c7372b std::sys_common::backtrace::__rust_begin_short_backtrace () + 7b
 0000000001c78df9 core::ops::function::FnOnce::call_once{{vtable.shim}} () + 79
 0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
 fffffc7fef130e4c _thrp_setup (fffffc7feef54a40) + 6c
 fffffc7fef131160 _lwp_start ()
-------- thread# 12 / lwp# 12 [tokio-runtime-worker] ---------
 fffffc7fef091ae9 memcpy () + 1379
 000000000114c314 futures_util::stream::stream::StreamExt::poll_next_unpin () + 214
 000000000100c4f6 <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll () + 4b6
 0000000001185c80 tokio::runtime::task::harness::Harness<T,S>::poll () + 70
 0000000001cb4baa std::thread::local::LocalKey<T>::with () + 5a
 0000000001caa46a tokio::runtime::thread_pool::worker::Context::run_task () + 25a
 0000000001ca95ce tokio::runtime::thread_pool::worker::Context::run () + 4be
 0000000001ca8f9e tokio::runtime::thread_pool::worker::run () + 6e
 0000000001cb3a18 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () + 48
 0000000001c6db20 tokio::runtime::task::harness::Harness<T,S>::poll () + a0
 0000000001c91d60 tokio::runtime::blocking::pool::Inner::run () + c0
 0000000001c7372b std::sys_common::backtrace::__rust_begin_short_backtrace () + 7b
 0000000001c78df9 core::ops::function::FnOnce::call_once{{vtable.shim}} () + 79
 0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
 fffffc7fef130e4c _thrp_setup (fffffc7feef55240) + 6c
 fffffc7fef131160 _lwp_start ()
-------- thread# 13 / lwp# 13 [tokio-runtime-worker] ---------
 fffffc7fef1311a7 lwp_park (0, 0, 0)
 fffffc7fef12a65b cond_wait_queue (21e9178, 21e9160, 0) + 5b
 fffffc7fef12ac9a __cond_wait (21e9178, 21e9160) + 8a
 fffffc7fef12ad04 cond_wait (21e9178, 21e9160) + 24
 fffffc7fef12ad3b pthread_cond_wait (21e9178, 21e9160) + b
 0000000001cd592b parking_lot::condvar::Condvar::wait_until_internal () + 55b
 0000000001c6ac46 <tokio::runtime::thread_pool::park::Parker as tokio::park::Park>::park () + 146
 0000000001caa690 tokio::runtime::thread_pool::worker::Context::park_timeout () + 170
 0000000001ca9e32 tokio::runtime::thread_pool::worker::Context::run () + d22
 0000000001ca8f9e tokio::runtime::thread_pool::worker::run () + 6e
 0000000001cb3a18 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () + 48
 0000000001c6db20 tokio::runtime::task::harness::Harness<T,S>::poll () + a0
 0000000001c91d60 tokio::runtime::blocking::pool::Inner::run () + c0
 0000000001c7372b std::sys_common::backtrace::__rust_begin_short_backtrace () + 7b
 0000000001c78df9 core::ops::function::FnOnce::call_once{{vtable.shim}} () + 79
 0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
 fffffc7fef130e4c _thrp_setup (fffffc7feef55a40) + 6c
 fffffc7fef131160 _lwp_start ()
-------------------- thread# 14 / lwp# 14 --------------------
 fffffc7fef1311a7 lwp_park (0, 0, 0)
 fffffc7fef12a65b cond_wait_queue (21e67f0, 21e5520, 0) + 5b
 fffffc7fef12ac9a __cond_wait (21e67f0, 21e5520) + 8a
 fffffc7fef12ad04 cond_wait (21e67f0, 21e5520) + 24
 fffffc7fef12ad3b pthread_cond_wait (21e67f0, 21e5520) + b
 0000000001dce0a1 std::thread::park () + b1
 0000000000f67b95 crossbeam_channel::context::Context::with::{{closure}} () + 395
 0000000000f68248 crossbeam_channel::flavors::array::Channel<T>::recv () + 308
 0000000000ed935c crossbeam_channel::channel::Receiver<T>::recv () + 4c
 0000000000f345de std::sys_common::backtrace::__rust_begin_short_backtrace () + 21e
 0000000000f7c6ed core::ops::function::FnOnce::call_once{{vtable.shim}} () + 8d
 0000000001df4de7 std::sys::unix::thread::Thread::new::thread_start () + 27
 fffffc7fef130e4c _thrp_setup (fffffc7feef56240) + 6c
 fffffc7fef131160 _lwp_start ()
-------------------- thread# 16 / lwp# 16 --------------------
 fffffc7fef1311a7 lwp_park (0, fffffc7feedaee60, 0)
 fffffc7fef12a65b cond_wait_queue (fffffc7fef2a6820, fffffc7fef2a6840, fffffc7feedaee60) + 5b
 fffffc7fef12aaad cond_wait_common (fffffc7fef2a6820, fffffc7fef2a6840, fffffc7feedaee60) + 1dd
 fffffc7fef12adb5 __cond_timedwait (fffffc7fef2a6820, fffffc7fef2a6840, fffffc7feedaef50) + 65
 fffffc7fef12ae71 cond_timedwait (fffffc7fef2a6820, fffffc7fef2a6840, fffffc7feedaef50) + 31
 fffffc7fef259dd6 umem_update_thread (0) + 1c6
 fffffc7fef130e4c _thrp_setup (fffffc7feef57240) + 6c
 fffffc7fef131160 _lwp_start ()
-------------------- thread# 23 / lwp# 23 --------------------
 0000000001df4dc0 std::sys::unix::thread::Thread::new::thread_start(), exit value = 0x0000000000000000
        ** zombie (exited, not detached, not yet joined) **

Most threads are parked (so doing nothing), but I did notice one thread was in memcpy, which got me curious: is there a lot of memcpy going on? So I used DTrace to determine 1) how many times is memcpy called in 1 second, 2) what is the distribution of the memcpy sizes, and 3) what are the unique userland stacks and their frequency?

rpz@kalm:~$ pfexec dtrace -p 2419 -qn 'pid$target::memcpy:entry { @[probefunc] = quantize(arg2); @s[ustack()] = count(); } tick-1s { exit(0); }' | rustfilt 
                                                                                 
  memcpy                                             
           value  ------------- Distribution ------------- count    
              64 |                                         0        
             128 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1024615  
             256 |                                         0                                                                                                       
             512 |                                         1                                                                                                       
            1024 |                                         0                                                                                                       
                                                                                 
                                                                                 
              libc.so.1`memcpy                                                                                                                                     
              sled-agent`<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll+0x92c
              sled-agent`tokio::runtime::task::harness::Harness<T,S>::poll+0x64
              sled-agent`std::thread::local::LocalKey<T>::with+0x5a                                                                                                
              sled-agent`tokio::runtime::thread_pool::worker::Context::run_task+0x25a
              sled-agent`tokio::runtime::thread_pool::worker::Context::run+0x4be
              sled-agent`tokio::runtime::thread_pool::worker::run+0x6e                                                                                             
              sled-agent`<core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once+0x48
              sled-agent`tokio::runtime::task::harness::Harness<T,S>::poll+0xa0
              sled-agent`tokio::runtime::blocking::pool::Inner::run+0xc0
              sled-agent`std::sys_common::backtrace::__rust_begin_short_backtrace+0x7b
              sled-agent`core::ops::function::FnOnce::call_once{{vtable.shim}}+0x79
              sled-agent`std::sys::unix::thread::Thread::new::thread_start+0x27
              libc.so.1`_thrp_setup+0x6c 
              libc.so.1`_lwp_start                                                                                                                                 
                1                                                                                                                                                  
                                                                                                                                                                   
              libc.so.1`memcpy                                                   
              sled-agent`<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll+0x93c
              sled-agent`tokio::runtime::task::harness::Harness<T,S>::poll+0x64                                                                                    
              sled-agent`std::thread::local::LocalKey<T>::with+0x5a             
              sled-agent`tokio::runtime::thread_pool::worker::Context::run_task+0x25a
              sled-agent`tokio::runtime::thread_pool::worker::Context::run+0x4be                                                                                   
              sled-agent`tokio::runtime::thread_pool::worker::run+0x6e         
              sled-agent`<core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once+0x48
              sled-agent`tokio::runtime::task::harness::Harness<T,S>::poll+0xa0                                                                                    
              sled-agent`tokio::runtime::blocking::pool::Inner::run+0xc0                                                                                           
              sled-agent`std::sys_common::backtrace::__rust_begin_short_backtrace+0x7b
              sled-agent`core::ops::function::FnOnce::call_once{{vtable.shim}}+0x79
              sled-agent`std::sys::unix::thread::Thread::new::thread_start+0x27
              libc.so.1`_thrp_setup+0x6c 
              libc.so.1`_lwp_start
                1             

              libc.so.1`memcpy                                                                                                                            [13/1593]
              sled-agent`tokio_tungstenite::compat::cvt+0x168      
              sled-agent`futures_util::stream::stream::StreamExt::poll_next_unpin+0x223
              sled-agent`<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll+0x4b6
              sled-agent`tokio::runtime::task::harness::Harness<T,S>::poll+0x70
              sled-agent`std::thread::local::LocalKey<T>::with+0x5a                                                                                                
              sled-agent`tokio::runtime::thread_pool::worker::Context::run_task+0x25a
              sled-agent`tokio::runtime::thread_pool::worker::Context::run+0x4be
              sled-agent`tokio::runtime::thread_pool::worker::run+0x6e                                                                                             
              sled-agent`<core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once+0x48
              sled-agent`tokio::runtime::task::harness::Harness<T,S>::poll+0xa0
              sled-agent`tokio::runtime::blocking::pool::Inner::run+0xc0
              sled-agent`std::sys_common::backtrace::__rust_begin_short_backtrace+0x7b
              sled-agent`core::ops::function::FnOnce::call_once{{vtable.shim}}+0x79
              sled-agent`std::sys::unix::thread::Thread::new::thread_start+0x27
              libc.so.1`_thrp_setup+0x6c 
              libc.so.1`_lwp_start
           256153

              libc.so.1`memcpy
              sled-agent`tungstenite::protocol::WebSocket<Stream>::read_message+0x159e
              sled-agent`futures_util::stream::stream::StreamExt::poll_next_unpin+0x214
              sled-agent`<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll+0x4b6
              sled-agent`tokio::runtime::task::harness::Harness<T,S>::poll+0x70
              sled-agent`std::thread::local::LocalKey<T>::with+0x5a
              sled-agent`tokio::runtime::thread_pool::worker::Context::run_task+0x25a
              sled-agent`tokio::runtime::thread_pool::worker::Context::run+0x4be
              sled-agent`tokio::runtime::thread_pool::worker::run+0x6e
              sled-agent`<core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once+0x48
              sled-agent`tokio::runtime::task::harness::Harness<T,S>::poll+0xa0
              sled-agent`tokio::runtime::blocking::pool::Inner::run+0xc0
              sled-agent`std::sys_common::backtrace::__rust_begin_short_backtrace+0x7b
              sled-agent`core::ops::function::FnOnce::call_once{{vtable.shim}}+0x79
              sled-agent`std::sys::unix::thread::Thread::new::thread_start+0x27
              libc.so.1`_thrp_setup+0x6c 
              libc.so.1`_lwp_start
           256153

              libc.so.1`memcpy
              sled-agent`tungstenite::protocol::WebSocket<Stream>::read_message+0x1589
              sled-agent`futures_util::stream::stream::StreamExt::poll_next_unpin+0x214
              sled-agent`<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll+0x4b6
              sled-agent`tokio::runtime::task::harness::Harness<T,S>::poll+0x70
              sled-agent`std::thread::local::LocalKey<T>::with+0x5a
              sled-agent`tokio::runtime::thread_pool::worker::Context::run_task+0x25a
              sled-agent`tokio::runtime::thread_pool::worker::Context::run+0x4be
              sled-agent`tokio::runtime::thread_pool::worker::run+0x6e
              sled-agent`<core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once+0x48
              sled-agent`tokio::runtime::task::harness::Harness<T,S>::poll+0xa0
              sled-agent`tokio::runtime::blocking::pool::Inner::run+0xc0
              sled-agent`std::sys_common::backtrace::__rust_begin_short_backtrace+0x7b
              sled-agent`core::ops::function::FnOnce::call_once{{vtable.shim}}+0x79
              sled-agent`std::sys::unix::thread::Thread::new::thread_start+0x27
              libc.so.1`_thrp_setup+0x6c 
              libc.so.1`_lwp_start
           256154

              libc.so.1`memcpy
              sled-agent`<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll+0x4e0
              sled-agent`tokio::runtime::task::harness::Harness<T,S>::poll+0x70
              sled-agent`std::thread::local::LocalKey<T>::with+0x5a
              sled-agent`tokio::runtime::thread_pool::worker::Context::run_task+0x25a
              sled-agent`tokio::runtime::thread_pool::worker::Context::run+0x4be
              sled-agent`tokio::runtime::thread_pool::worker::run+0x6e
              sled-agent`<core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once+0x48
              sled-agent`tokio::runtime::task::harness::Harness<T,S>::poll+0xa0
              sled-agent`tokio::runtime::blocking::pool::Inner::run+0xc0
              sled-agent`std::sys_common::backtrace::__rust_begin_short_backtrace+0x7b
              sled-agent`core::ops::function::FnOnce::call_once{{vtable.shim}}+0x79
              sled-agent`std::sys::unix::thread::Thread::new::thread_start+0x27
              libc.so.1`_thrp_setup+0x6c 
              libc.so.1`_lwp_start
           256154

As this output shows, memcpy is called over 1 million times a second to poll futures. On @davepacheco's recommendation, I also profiled the user stacks on CPU and generated a flamegraph, attached to this issue.

The flamegraph was generated from the following DTrace one-liner.

rpz@kalm:~$ pfexec dtrace -x ustackframes=100 -n 'profile-97 /pid == 2419/ { @[ustack()] = count(); } tick-60s { exit(0); }' -o sled-agent-stacks.out

sled-agent-stacks-folded-rust

It shows memcpy and AtomicWaker::register() dominating the samples. As Dave pointed out in chat, the total number of samples lines up with 2 cores being burned for an entire second, given my 12-core machine.

dap> trying to calculate the % utilization reflected in the flame graph
97 Hz for 60s would be 5820 samples per core, if we were pegged on CPU
You've got 11627 samples here
so almost 2 cores' worth
or 16% -- or about what you said above!
although from the prstat output above it looks like it's all in two threads

Asking around in chat no one else piped up about a sled-agent eating CPU like mine. The only additional data point I can offer is that at one point I did cause an error in sled-agent while trying to create a new instance. I updated to the latest version of xde, but did not update the sled-agent code to be able to talk to it. This led to sled-agent sending a serialized buffer that was too small for the CreateXde request (this is described in oxidecomputer/opte#161). This led to the following sled-agent error:

{"msg":"request completed","v":0,"name":"SledAgent","level":30,"time":"2022-06-23T19:32:58.149139933Z","hostname":"kalm","pid":2419,"uri":"/instances/d9356749-e1af-4856-94e4-fd1b827fe5dc","method":"PUT","req_id":"f56b57b4-db98-4975-8585-fac2dc0374c5","remote_addr":"[fd00:1122:3344:101::3]:56544","local_addr":"[fd00:1122:3344:101::1]:12345","component":"dropshot (SledAgent)","error_message_external":"Internal Server Error","error_message_internal":"Error managing instances: Instance error: Failure interacting with the OPTE ioctl(2) interface: command CreateXde failed: DeserCmdReq(\"Hit the end of buffer, expected more data\")","response_code":"500"}

I cannot say if sled-agent starting burning CPU because of this event, or if it was doing so the entire time, as I didn't check before and after.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions