Skip to content

Fix runtime data races, memory leak, and shutdown safety#1429

Open
bmehta001 wants to merge 6 commits intomicrosoft:mainfrom
bmehta001:bhamehta/runtime-fixes
Open

Fix runtime data races, memory leak, and shutdown safety#1429
bmehta001 wants to merge 6 commits intomicrosoft:mainfrom
bmehta001:bhamehta/runtime-fixes

Conversation

@bmehta001
Copy link
Copy Markdown
Contributor

@bmehta001 bmehta001 commented Apr 28, 2026

Fixes several thread-safety and resource-management issues that affect normal SDK operation (not just reinit edge cases). Split out from #1415 per reviewer request so runtime behavior changes stay separate from CI/build/test fixes.

Fix HTTP client torn reads and scoped cancellation (3 files)

HttpClient_Apple.mm:

  • Remove [session getTasksWithCompletionHandler:] from Cancel(); the async block uses the shared static NSURLSession and can cancel tasks belonging to other HttpClient_Apple instances (or a newly initialized instance after teardown). Scope cancellation to [m_dataTask cancel] only.
  • Fix torn read on m_requests.empty() in CancelAllRequests(); std::map was read without holding m_requestsMtx while Erase/Add mutate it under the lock.

HttpClientManager.cpp:

  • Fix the same torn read on m_httpCallbacks.empty() in cancelAllRequests() by reading it under m_httpCallbacksMtx.

HttpResponseDecoder.cpp:

  • Add the missing delete ctx->httpResponse before setting it to nullptr in the Abort and RetryNetwork paths. The Accepted and RetryServer paths already handled this correctly.

Fix WorkerThread shutdown safety (1 file)

  • Add an explicit shutdown gate so late Queue() calls are rejected once teardown starts.
  • Enqueue the shutdown sentinel only once under the queue lock and clean up pending tasks under that same lock after a successful join().
  • Avoid a second detach() on repeated Join() calls when the worker thread is already non-joinable.

Make TransmissionPolicyManager scheduling consistently mutex-guarded (2 files)

  • Keep m_runningLatency, m_scheduledUploadTime, and m_isUploadScheduled under m_scheduledUploadMutex instead of mixing atomics with mutex-protected state.
  • Stop holding m_scheduledUploadMutex across DeferredCallbackHandle::Cancel(...) so stop/pause paths do not block against uploadAsync() on the same lock.
  • Use std::chrono::milliseconds(delayMs) for the bandwidth-controller reschedule call so ENABLE_BW_CONTROLLER builds cleanly.

Fix Logger destructor crash (1 file)

  • Remove LOG_TRACE from Logger::~Logger(); it triggers a static-destruction-order crash on iOS when the logging mutex has already been destroyed.

bmehta001 and others added 4 commits April 29, 2026 14:41
- HttpClient_Apple: scope Cancel() to m_dataTask only instead of
  blanket-cancelling every task on the shared session. Fix torn read
  on m_requests.empty() in CancelAllRequests spin loop.
- HttpClientManager: fix torn read on m_httpCallbacks.empty() in
  cancelAllRequests spin loop — read under lock.
- HttpResponseDecoder: add missing delete ctx->httpResponse before
  nullptr in Abort and RetryNetwork paths (memory leak).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Only delete queued tasks after successful join (not after detach,
  where the thread may still access them — undefined behavior)
- Replace catch(...) with std::system_error and std::exception handlers
  that log error code and message
- Log pending queue sizes in both join and detach paths

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Both variables are read and written from different threads during
normal upload scheduling. Declare as std::atomic to eliminate data
races per the C++ memory model. Add .load() for variadic LOG_TRACE
calls. Add comment explaining why unlocked stores in uploadAsync
are safe.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove LOG_TRACE from Logger destructor — it triggers a crash on iOS
simulator when the recursive_mutex used by logging has already been
destroyed during static destruction.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bmehta001 bmehta001 force-pushed the bhamehta/runtime-fixes branch from 3f0289c to de46cb2 Compare April 29, 2026 21:42
Reject new worker-thread tasks once shutdown starts so queue cleanup
cannot race with late producers, and move the TPM scheduled-upload
state back under a single mutex so latency/next-upload decisions stay
consistent without mixed atomic and mutex access.

Files changed:
- lib/pal/WorkerThread.cpp
- lib/tpm/TransmissionPolicyManager.cpp
- lib/tpm/TransmissionPolicyManager.hpp

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR targets runtime correctness in the SDK by addressing thread-safety issues, shutdown safety, and a per-request memory leak in HTTP response handling.

Changes:

  • Tighten HTTP request cancellation scoping and fix torn reads in request/callback tracking loops.
  • Fix a SimpleHttpResponse leak on aborted/network-failure decode paths.
  • Rework worker-thread shutdown behavior to avoid unsafe queue cleanup after detach() and improve error logging.
  • Refactor TransmissionPolicyManager scheduling state synchronization (mutex + new cancelUploadTaskLocked() helper).

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
lib/tpm/TransmissionPolicyManager.hpp Changes upload-scheduling state fields and adds a locked cancellation helper declaration.
lib/tpm/TransmissionPolicyManager.cpp Moves upload-scheduling state access under a mutex and adjusts cancellation/scheduling flow.
lib/pal/WorkerThread.cpp Makes shutdown/join behavior safer and improves exception handling/logging during join/detach.
lib/http/HttpResponseDecoder.cpp Deletes ctx->httpResponse on Abort/RetryNetwork paths to prevent leaks.
lib/http/HttpClient_Apple.mm Limits cancellation to the instance’s task and fixes a torn read in the shutdown wait loop.
lib/http/HttpClientManager.cpp Fixes a torn read in the shutdown wait loop by locking around empty-check.
lib/api/Logger.cpp Removes destructor logging to avoid iOS static-destruction-order crash.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/tpm/TransmissionPolicyManager.cpp Outdated
Comment thread lib/tpm/TransmissionPolicyManager.hpp
Comment thread lib/tpm/TransmissionPolicyManager.cpp Outdated
Keep the scheduled-upload state mutex-based, but stop holding
m_scheduledUploadMutex across DeferredCallbackHandle::Cancel so
shutdown and pause paths do not block uploadAsync behind the same
lock. While touching the path, use std::chrono::milliseconds for the
bandwidth-controller reschedule call so ENABLE_BW_CONTROLLER builds
cleanly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bmehta001 added a commit to bmehta001/cpp_client_telemetry that referenced this pull request May 1, 2026
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bmehta001 bmehta001 self-assigned this May 3, 2026
@bmehta001 bmehta001 requested a review from Copilot May 3, 2026 05:46
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +152 to +160
while (true)
{
{
LOCKGUARD(m_httpCallbacksMtx);
if (m_httpCallbacks.empty())
break;
}
std::this_thread::yield();
}
Comment thread lib/pal/WorkerThread.cpp
Comment on lines +114 to +117
if (m_shuttingDown) {
LOG_WARN("Dropping queued task %p during shutdown", item);
delete item;
return;
Comment thread lib/pal/WorkerThread.cpp
Comment on lines +98 to +105
// Clean up any tasks remaining in the queues after shutdown.
// Only safe after join() — the thread has fully exited.
// After detach(), the thread still needs the shutdown item
// and may still be accessing the queues.
if (joined) {
for (auto task : m_queue) { delete task; }
m_queue.clear();
for (auto task : m_timerQueue) { delete task; }
Comment on lines +173 to +180
scheduledUploadLock.unlock();
if (!cancelUploadTask())
{
LOG_TRACE("Upload either hasn't been scheduled or already done.");
}
scheduledUploadLock.lock();
if (shouldSkipScheduling())
{
Comment on lines +173 to +180
scheduledUploadLock.unlock();
if (!cancelUploadTask())
{
LOG_TRACE("Upload either hasn't been scheduled or already done.");
}
scheduledUploadLock.lock();
if (shouldSkipScheduling())
{
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants