Skip to content

100% CPU spin on CancelAllRequests #1437

@raeanne-marks

Description

@raeanne-marks

Describe your environment. macOS

Steps to reproduce.
Customer hit it randomly, I have their spindump.

What is the expected behavior?
Not 100% CPU :)

What is the actual behavior?
100% CPU usage due to spin yield-loop in CancelAllRequests

Additional context.

Two problems:

  1. No lock on m_httpCallbacks — it reads m_httpCallbacks.empty() without holding m_httpCallbacksMtx, but every other access to m_httpCallbacks (in handleSendRequest and onHttpResponse) is guarded by that mutex. This is a data race (undefined behavior). On arm64 especially, the thread may never see the list become empty due to store/load ordering.

  2. The yield-loop can spin forever — CancelAllRequests() tells the HTTP client to cancel, but the callbacks are removed from m_httpCallbacks in onHttpResponse(), which is scheduled asynchronously via scheduleOnHttpResponse() → scheduleTask(). If the task dispatcher is blocked or shut down (e.g., PauseTransmission has stopped dispatching), the response handler never runs, m_httpCallbacks never empties, and this loop spins indefinitely — which is exactly what the spindump shows.

The call chain in the spindump confirms this: PauseTransmission() calls into cancelAllRequests(), but pausing transmission likely also stops or stalls the task dispatcher that would normally run onHttpResponse to drain the callback list. The yield-loop has no timeout or fallback, so it burns CPU forever.

A fix would need to either:
Use a condition variable signaled from onHttpResponse instead of polling
Add a timeout to the loop and forcibly clear callbacks after it expires
Ensure onHttpResponse can still execute even when transmission is paused (or drain synchronously)

Spindump:

CPU Time:         9.756s (42.6G cycles, 143.3G instructions, 0.30c/i, 4.37GHz avg)
Num threads:      7
Note:             2 idle work queue threads omitted

  Thread 0x3b1aee    1001 samples (1-1001)    priority 0-20 (base 20)    cpu time 9.755s (42.6G cycles, 143.3G instructions, 0.30c/i, 4.37GHz avg)
  1001  thread_start + 8 (libsystem_pthread.dylib + 7196) [0x185ac6c1c]
    1001  _pthread_start + 136 (libsystem_pthread.dylib + 27736) [0x185acbc58]
      1001  void* std::__1::__thread_proxy[abi:ne190102]
<redacted>

        1001  sense::Telemetry::OneDsTelemetryClient::PauseTransmission() (in telemetryd_v2.dwarf) (/Users/cloudtest/vss/_work/1/s/src/edr/libtelemetry/1ds/OneDsTelemetryClient.cpp:348) [0x104c99f64]
          1001  Microsoft::Applications::Events::LogManagerImpl::PauseTransmission() (in telemetryd_v2.dwarf) (/Users/cloudtest/vss/_work/1/s/src/edr/MAT/MAT/lib/api/LogManagerImpl.cpp:461) [0x104be8344]
            1001  std::__1::__function::__func<Microsoft::Applications::Events::TelemetrySystem::TelemetrySystem(Microsoft::Applications::Events::ILogManager&, Microsoft::Applications::Events::IRuntimeConfig&, Microsoft::Applications::Events::IOfflineStorage&, Microsoft::Applications::Events::IHttpClient&, Microsoft::Applications::Events::ITaskDispatcher&, Microsoft::Applications::Events::IBandwidthController*, Microsoft::Applications::Events::LogSessionDataProvider&)::$_2, std::__1::allocator<Microsoft::Applications::Events::TelemetrySystem::TelemetrySystem(Microsoft::Applications::Events::ILogManager&, Microsoft::Applications::Events::IRuntimeConfig&, Microsoft::Applications::Events::IOfflineStorage&, Microsoft::Applications::Events::IHttpClient&, Microsoft::Applications::Events::ITaskDispatcher&, Microsoft::Applications::Events::IBandwidthController*, Microsoft::Applications::Events::LogSessionDataProvider&)::$_2>, bool ()>::operator()() (in telemetryd_v2.dwarf) (/Applications/Xcode_16.4.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.5.sdk/usr/include/c++/v1/__functional/function.h:313) [0x104bda300]
              996   Microsoft::Applications::Events::HttpClientManager::cancelAllRequests() (in telemetryd_v2.dwarf) (/Users/cloudtest/vss/_work/1/s/src/edr/MAT/MAT/lib/http/HttpClientManager.cpp:153) [0x104c01174]
                801   swtch_pri + 8 (libsystem_kernel.dylib + 3208) [0x185a87c88] (running)
                193   swtch_pri + 8 (libsystem_kernel.dylib + 3208) [0x185a87c88]
                 *94    ??? (kernel.release.t6050 + 24180) [0xfffffe000b3dde74] (running)
                 *62    ??? (kernel.release.t6050 + 24180) [0xfffffe000b3dde74]
                   *59    ??? (kernel.release.t6050 + 2149104) [0xfffffe000b5e4af0] (running)
                   *3     ??? (kernel.release.t6050 + 2150656) [0xfffffe000b5e5100]
                     *2     ??? (kernel.release.t6050 + 2122228) [0xfffffe000b5de1f4]
                       *2     ??? (kernel.release.t6050 + 843264) [0xfffffe000b4a5e00]
                         *2     ??? (kernel.release.t6050 + 842980) [0xfffffe000b4a5ce4] (running)
                     *1     ??? (kernel.release.t6050 + 2122276) [0xfffffe000b5de224] (running)
                 *21    ??? (kernel.release.t6050 + 843532) [0xfffffe000b4a5f0c] (runnable)
                 *15    ??? (kernel.release.t6050 + 27000) [0xfffffe000b3de978] (running)
                 *1     ??? (kernel.release.t6050 + 27000) [0xfffffe000b3de978]
                   *1     ??? (kernel.release.t6050 + 843632) [0xfffffe000b4a5f70]
                     *1     ??? (kernel.release.t6050 + 2168000) [0xfffffe000b5e94c0]
                       *1     ??? (kernel.release.t6050 + 2148624) [0xfffffe000b5e4910]
                         *1     ??? (kernel.release.t6050 + 26064) [0xfffffe000b3de5d0]
                           *1     ??? (kernel.release.t6050 + 316424) [0xfffffe000b425408] (running)
                2     cthread_yield + 36 (libsystem_pthread.dylib + 15400) [0x185ac8c28] (running)
              3     cthread_yield + 44 (libsystem_pthread.dylib + 15408) [0x185ac8c30] (running)
              2     Microsoft::Applications::Events::HttpClientManager::cancelAllRequests() (in telemetryd_v2.dwarf) (/Users/cloudtest/vss/_work/1/s/src/edr/MAT/MAT/lib/http/HttpClientManager.cpp:152) [0x104c01168] (running)


  Thread 0x3b1aec    1001 samples (1-1001)    priority 20 (base 20)    last ran 1011.253s ago
  1001  thread_start + 8 (libsystem_pthread.dylib + 7196) [0x185ac6c1c]
    1001  _pthread_start + 136 (libsystem_pthread.dylib + 27736) [0x185acbc58]
      1001  void* std::__1::__thread_proxy[abi:ne190102]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (*)(void*), void*>>(void*) (in telemetryd_v2.dwarf) (/Applications/Xcode_16.4.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.5.sdk/usr/include/c++/v1/__thread/thread.h:209) [0x104c20c58]
        1001  Microsoft::Applications::Events::PlatformAbstraction::WorkerThread::threadFunc(void*) (in telemetryd_v2.dwarf) (/Users/cloudtest/vss/_work/1/s/src/edr/MAT/MAT/lib/pal/WorkerThread.cpp:0) [0x104c20410]
          1001  Microsoft::Applications::Events::PlatformAbstraction::Event::wait(unsigned int) const (in telemetryd_v2.dwarf) (/Users/cloudtest/vss/_work/1/s/src/edr/MAT/MAT/lib/./pal/WorkerThread.hpp:55) [0x104c20924]
            1001  bool std::__1::condition_variable::wait_for[abi:ne190102]<long long, std::__1::ratio<1l, 1000l>, Microsoft::Applications::Events::PlatformAbstraction::Event::wait(unsigned int) const::'lambda0'()>(std::__1::unique_lock<std::__1::mutex>&, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l>> const&, Microsoft::Applications::Events::PlatformAbstraction::Event::wait(unsigned int) const::'lambda0'()) (in telemetryd_v2.dwarf) (/Applications/Xcode_16.4.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.5.sdk/usr/include/c++/v1/__condition_variable/condition_variable.h:205) [0x104c20a54]
              1001  std::__1::condition_variable::__do_timed_wait(std::__1::unique_lock<std::__1::mutex>&, std::__1::chrono::time_point<std::__1::chrono::system_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>) + 104 (libc++.1.dylib + 26864) [0x1859e18f0]
                1001  <patched truncated backtrace>
                  1001  __psynch_cvwait + 8 (libsystem_kernel.dylib + 17676) [0x185a8b50c]
                   *1001  psynch_cvcontinue + 0 (pthread + 17948) [0xfffffe000b316abc]

  Thread 0x3b1aed    1001 samples (1-1001)    priority 20 (base 20)    last ran 22618.041s ago
  1001  thread_start + 8 (libsystem_pthread.dylib + 7196) [0x185ac6c1c]
    1001  _pthread_start + 136 (libsystem_pthread.dylib + 27736) [0x185acbc58]
      1001  void* std::__1::__thread_proxy[abi:ne190102]
< redacted >
        1001  sense::Telemetry::OneDsTelemetryClient::LogEvent(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, bool) (in telemetryd_v2.dwarf) (/Users/cloudtest/vss/_work/1/s/src/edr/libtelemetry/1ds/OneDsTelemetryClient.cpp:344) [0x104c99de8]
          1001  Microsoft::Applications::Events::Logger::LogEvent(Microsoft::Applications::Events::EventProperties const&) (in telemetryd_v2.dwarf) (/Users/cloudtest/vss/_work/1/s/src/edr/MAT/MAT/lib/api/Logger.cpp:334) [0x104bedfac]
            1001  Microsoft::Applications::Events::Logger::submit(CsProtocol::Record&, Microsoft::Applications::Events::EventProperties const&) (in telemetryd_v2.dwarf) (/Users/cloudtest/vss/_work/1/s/src/edr/MAT/MAT/lib/api/Logger.cpp:589) [0x104bf15cc]
              1001  Microsoft::Applications::Events::LogManagerImpl::sendEvent(Microsoft::Applications::Events::IncomingEventContext* const&) (in telemetryd_v2.dwarf) (/Users/cloudtest/vss/_work/1/s/src/edr/MAT/MAT/lib/api/LogManagerImpl.cpp:726) [0x104be8f10]
                1001  std::__1::recursive_mutex::lock() + 16 (libc++.1.dylib + 29536) [0x1859e2360]
                  1001  _pthread_mutex_firstfit_lock_slow + 216 (libsystem_pthread.dylib + 6368) [0x185ac68e0]
                    1001  <patched truncated backtrace>
                      1001  __psynch_mutexwait + 8 (libsystem_kernel.dylib + 14812) [0x185a8a9dc]
                       *1001  psynch_mtxcontinue + 0 (pthread + 9696) [0xfffffe000b314a80] (blocked by turnstile waiting for telemetryd_v2 [29655] [unique pid 129217] thread 0x3b1aee)

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions