Add gRPC client and worker connection resiliency#135
Merged
Conversation
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extend worker resiliency coverage with an end-to-end silent-disconnect recovery test and an explicit reconnect backoff assertion. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Adds first-class gRPC connection resiliency to the Durable Task Python SDK (core durabletask/) and threads the new configuration through Azure Managed wrappers, with extensive regression tests and design docs.
Changes:
- Introduces
GrpcWorkerResiliencyOptions/GrpcClientResiliencyOptionsand shared internal resiliency helpers (backoff, failure tracking, transport-failure classification). - Updates worker stream loop and sync/async clients to detect transport-shaped failures and safely recreate/retire SDK-owned channels while preserving caller-owned channel semantics.
- Adds comprehensive unit tests (core + azuremanaged) plus docs/specs and changelog entries.
Reviewed changes
Copilot reviewed 16 out of 17 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| durabletask/worker.py | Implements worker stream monitoring, backoff, failure tracking, and safe SDK-owned channel retirement with in-flight tracking. |
| durabletask/client.py | Adds sync/async unary invocation wrappers and SDK-owned channel recreation + retirement handling. |
| durabletask/grpc_options.py | Adds public resiliency option dataclasses with validation. |
| durabletask/internal/grpc_resiliency.py | Adds shared backoff, FailureTracker, and transport-failure classification helpers. |
| durabletask-azuremanaged/durabletask/azuremanaged/client.py | Forwards client resiliency options through Azure Managed client wrappers. |
| durabletask-azuremanaged/durabletask/azuremanaged/worker.py | Forwards worker resiliency options through Azure Managed worker wrapper. |
| tests/durabletask/test_worker_resiliency.py | New worker resiliency tests (silent disconnect, graceful close, recreation thresholds, in-flight close deferral). |
| tests/durabletask/test_grpc_resiliency.py | New tests for option validation, backoff, FailureTracker, and transport-failure classification. |
| tests/durabletask/test_client.py | Adds sync/async client channel recreation/retirement tests and wrapper verification. |
| tests/durabletask/test_worker_concurrency_loop.py | Updates tests to call prepare_for_run() before reusing the worker manager. |
| tests/durabletask/test_worker_concurrency_loop_async.py | Updates async loop tests to call prepare_for_run() before reusing the worker manager. |
| tests/durabletask-azuremanaged/test_azuremanaged_grpc_resiliency.py | New tests validating Azure Managed wrapper pass-through of resiliency options. |
| CHANGELOG.md | Documents new resiliency options and behavior changes in core SDK. |
| durabletask-azuremanaged/CHANGELOG.md | Documents pass-through resiliency options in Azure Managed package. |
| docs/superpowers/specs/2026-04-23-grpc-resiliency-design.md | Adds design spec for resiliency behavior and public API. |
| docs/superpowers/plans/2026-04-23-grpc-resiliency.md | Adds implementation plan document for the work. |
| .gitignore | Ignores .worktrees/ and normalizes coverage.lcov entry formatting. |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
andystaples
reviewed
May 18, 2026
andystaples
reviewed
May 18, 2026
- Make FailureTracker thread-safe with an internal lock so multi-threaded sync clients can't race the consecutive-failure counter (review [3/10]). - Track _AsyncWorkerManager pool shutdown via an explicit _pool_is_shutdown flag instead of reading ThreadPoolExecutor._shutdown (CPython private API, review [4/10]). - Collapse identical wrap_execution/wrap_cancellation closures in the worker stream loop into a single wrap_with_release helper (review [5/10]). - Promote the retired-channel close delay and jitter exponent cap to named module-level constants (review [7/10]). - Key _InFlightChannelTracker on the channel object instead of id(channel) so the lifetime invariant is local to the tracker (review [9/10]). - Rename TaskHubGrpcWorker._can_recreate_channel() to the existing _owns_channel attribute used by the clients, so both files use the same name for the same concept (review [2/10]). - Add regression tests for FailureTracker concurrency and for thread-pool recreation after manager shutdown. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Centralize client failure tracking and channel-recreate triggering in a `ClientResiliencyInterceptor` (sync) and `AsyncClientResiliencyInterceptor` (async) instead of the per-call `_invoke_unary` indirection. This addresses feedback [1/10] on PR #135: resiliency wiring now lives in one place and the call sites read as normal stub calls. - Add `ClientResiliencyInterceptor` and `AsyncClientResiliencyInterceptor` in `durabletask/internal/grpc_resiliency.py`. - Switch `LONG_POLL_METHODS` and `is_client_transport_failure` to use full gRPC method paths (`/TaskHubSidecarService/...`) so the interceptor can match the `method` field on `ClientCallDetails` directly. - Wire the resiliency interceptor into `TaskHubGrpcClient` and `AsyncTaskHubGrpcClient`: it is always prepended (defensive copy of any user interceptors) and re-applied on every channel recreate so all unary calls flow through it. - Remove both `_invoke_unary` methods and revert all 34 call sites to ordinary `self._stub.MethodName(req)` (or `await ...` for async). - Caller-owned channels (sync and async) deliberately bypass the resiliency interceptor since they are never recreated; this preserves the caller's exact channel reference and avoids `grpc.aio`'s lack of a public `intercept_channel` equivalent. - Add test shims (`_ResilientSyncTestStub`/`_ResilientAsyncTestStub` plus `install_resilient_test_stubs`) so tests that patch `stubs.TaskHubSidecarServiceStub` with `MagicMock` still observe the failure-tracking pipeline. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
berndverst
commented
May 22, 2026
berndverst
commented
May 22, 2026
berndverst
commented
May 22, 2026
berndverst
commented
May 22, 2026
berndverst
commented
May 22, 2026
berndverst
commented
May 22, 2026
berndverst
commented
May 22, 2026
berndverst
commented
May 22, 2026
berndverst
commented
May 22, 2026
berndverst
commented
May 22, 2026
berndverst
commented
May 22, 2026
Replace the ad-hoc `hasattr(result, '__await__')` check in `AsyncClientResiliencyInterceptor._record_outcome` with the canonical `inspect.isawaitable` predicate, and tighten the `on_recreate` callback annotation to `Callable[[], Union[None, Awaitable[object]]]` so it reflects the actual contract (sync callbacks return None, async callbacks return an Awaitable that we await). Addresses the github-code-quality 'Statement has no effect' warning surfaced on PR #135 by making the awaitable check explicit and type-driven. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ceptor CodeQL's `py/ineffectual-statement` heuristic re-flagged `await result` in `AsyncClientResiliencyInterceptor._record_outcome` after the previous fix: the rule treats expression statements whose value is discarded as unused, and does not recognise that `await` is always a side-effecting suspension point (the whole purpose of the call is to run the async recreate callback to completion). Rewriting the line as `_ = await result` keeps the exact same runtime behaviour but documents the intent (return value intentionally discarded) and satisfies the linter. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
andystaples
reviewed
May 22, 2026
…nterceptor exception handling Addresses three follow-up review comments on the resiliency interceptor refactor: [1/3] Channel recreate now runs fire-and-forget (daemon thread for sync, asyncio.create_task for async). The original RPC error propagates to the caller without being delayed by DNS, TLS handshake, or contention on _recreate_lock. A client-side single-flight guard avoids spawning duplicate work when many failures land in a burst; the existing cooldown still prevents thrash. close() waits for any in-flight recreate to finish so the teardown path stays deterministic. A _recreate_done_event (test seam) lets tests synchronise on completion without polling. [2/3] Hoisted _closing, _recreate_lock, _last_recreate_time, _retired_channels / _retired_channel_close_tasks above ClientResiliencyInterceptor construction in both __init__ methods so the bound recreate callback is safe to invoke at any time during construction. [3/3] AsyncClientResiliencyInterceptor now uses 'except Exception' (so asyncio.CancelledError, KeyboardInterrupt and SystemExit propagate unchanged) and mirrors the sync interceptor's policy by resetting the failure counter on non-AioRpcError exceptions. _record_outcome is now synchronous on both interceptors because the on_recreate callback no longer awaits the recreate. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Apply three Python 3.10+ idioms to code introduced by this PR: - PEP 604 union syntax: replace Optional[X] with X | None in grpc_options.py and grpc_resiliency.py. - Dataclass tightening: add slots=True, kw_only=True to FailureTracker, GrpcRetryPolicyOptions, GrpcChannelOptions, GrpcWorkerResiliencyOptions, and GrpcClientResiliencyOptions. Update the two positional FailureTracker(...) call sites in client.py to use threshold=... kwargs. - PEP 617 parenthesized context managers: rewrite the 10 chained 'with patch(...), patch(...):' blocks added by this PR in test_client.py. Pre-existing chained sites are left untouched to keep the diff surgical. Internal-only change (no public API or behavior impact). 85/85 resiliency-focused tests pass; 232/8 passed/skipped in the broader non-e2e suite. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
andystaples
approved these changes
May 22, 2026
This was referenced May 26, 2026
berndverst
pushed a commit
that referenced
this pull request
May 26, 2026
Apply the same `__enter__`/`__exit__` typing pattern that PR #145 uses for `TaskHubGrpcClient` (and that `BlobPayloadStore` already follows) to three pre-existing context managers that were either untyped or shadowed a builtin: * `AsyncTaskHubGrpcClient.__aenter__` now returns the concrete type and `__aexit__` takes `*args: object` with `-> None`. * `TaskHubGrpcWorker.__enter__/__exit__` get the same treatment, also removing the `type` parameter that shadowed the builtin. * `EntityLock.__enter__/__exit__` get the same treatment; the file already has `from __future__ import annotations` so the return annotation is the bare class name. Behavior is unchanged: each `__exit__` still delegates to its existing teardown method (`close`/`stop`/`_exit_critical_section`), so the gRPC resiliency teardown added in PR #135 continues to flow through `TaskHubGrpcClient.close()` unchanged. No changelog entry: per the repo's contributor guidance, internal-only type-annotation refactors with no externally observable behavior change are excluded from the changelog. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
berndverst
added a commit
that referenced
this pull request
May 26, 2026
* Add context manager support to TaskHubGrpcClient (#134) Add `__enter__`/`__exit__` to the sync `TaskHubGrpcClient` so callers can use it with a `with` statement, mirroring the existing `AsyncTaskHubGrpcClient` async-context-manager support and the `TaskHubGrpcWorker` pattern. `DurableTaskSchedulerClient` inherits this behavior automatically. `__exit__` delegates to `close()`, so the resiliency-aware teardown introduced in #135 (in-flight recreate thread join, retired-channel timer cancellation, SDK-owned channel cleanup) runs unchanged through the new `with` path. Caller-owned channels remain untouched. Migrate every test and example callsite that previously instantiated `TaskHubGrpcClient(...)` and never closed it to the `with` form so the gRPC channel is deterministically released. Unit tests in `test_client.py` that intentionally test construction (with mocked stubs) are left unchanged. Add focused unit tests for the new context-manager behavior, including a regression test that exits a `with` block while a fire-and-forget channel recreate is pending and asserts the #135 resiliency invariants (retired-channel timers cancelled, recreate thread joined) still hold. Fixes #134 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Address auto-generated code-quality review comments on PR #145 1. tests/durabletask/test_orchestration_e2e.py (test_suspend_and_resume): Drop the unused `state =` assignment around the expected-timeout `wait_for_orchestration_completion` call. The value is never read because the next line asserts False and the only non-failing path raises TimeoutError; `state` is reassigned a few lines down. Silences the "variable defined multiple times" warning that CodeQL flagged because this previously-untouched line was pulled into the diff by the indent change. 2. tests/durabletask/test_client.py (test_sync_client_context_manager_propagates_exception_and_calls_close): Replace the nested `with pytest.raises(...): with client: raise ...` pattern with an explicit try/except so CodeQL no longer reports the post-block assertions as unreachable. Test intent (exception propagation + cleanup verification) is preserved. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Align __exit__ typing with repo convention (BlobPayloadStore) Use `*args: object` for the `__exit__` parameters instead of leaving them untyped. This matches the most recent context-manager class in the repo (`durabletask/extensions/azure_blob_payloads/blob_payload_store.py`), is more type-safe under Pylance/mypy, and avoids the parameter shadowing of the builtin `type` that exists in `TaskHubGrpcWorker.__exit__`. Behavior is unchanged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Align pre-existing context manager methods with repo idiom Apply the same `__enter__`/`__exit__` typing pattern that PR #145 uses for `TaskHubGrpcClient` (and that `BlobPayloadStore` already follows) to three pre-existing context managers that were either untyped or shadowed a builtin: * `AsyncTaskHubGrpcClient.__aenter__` now returns the concrete type and `__aexit__` takes `*args: object` with `-> None`. * `TaskHubGrpcWorker.__enter__/__exit__` get the same treatment, also removing the `type` parameter that shadowed the builtin. * `EntityLock.__enter__/__exit__` get the same treatment; the file already has `from __future__ import annotations` so the return annotation is the bare class name. Behavior is unchanged: each `__exit__` still delegates to its existing teardown method (`close`/`stop`/`_exit_critical_section`), so the gRPC resiliency teardown added in PR #135 continues to flow through `TaskHubGrpcClient.close()` unchanged. No changelog entry: per the repo's contributor guidance, internal-only type-annotation refactors with no externally observable behavior change are excluded from the changelog. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Bernd Verst <beverst@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Test Plan