Skip to content

Add gRPC client and worker connection resiliency#135

Merged
berndverst merged 36 commits into
mainfrom
grpc-resiliency-pr708-impl
May 22, 2026
Merged

Add gRPC client and worker connection resiliency#135
berndverst merged 36 commits into
mainfrom
grpc-resiliency-pr708-impl

Conversation

@berndverst
Copy link
Copy Markdown
Member

@berndverst berndverst commented Apr 24, 2026

Summary

  • Add public gRPC resiliency option types and shared transport helpers, then wire them through core and Azure Managed constructors.
  • Harden worker, sync client, and async client connection recovery to better survive silent disconnects, transport failures, and channel recreation scenarios.
  • Add focused regression coverage plus user-facing changelog and design/plan updates for the new resiliency behavior.

Test Plan

  • Focused pytest run covering grpc resiliency, worker resiliency, worker concurrency loop, client resiliency, and Azure Managed wrapper wiring.
  • flake8 on all changed source and test files.

Bernd Verst and others added 26 commits April 23, 2026 17:19
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extend worker resiliency coverage with an end-to-end silent-disconnect recovery test and an explicit reconnect backoff assertion.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 24, 2026 09:35
Comment thread tests/durabletask/test_client.py Fixed
Comment thread tests/durabletask/test_client.py Fixed
Comment thread durabletask/client.py Fixed
Comment thread durabletask/worker.py Fixed
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds first-class gRPC connection resiliency to the Durable Task Python SDK (core durabletask/) and threads the new configuration through Azure Managed wrappers, with extensive regression tests and design docs.

Changes:

  • Introduces GrpcWorkerResiliencyOptions / GrpcClientResiliencyOptions and shared internal resiliency helpers (backoff, failure tracking, transport-failure classification).
  • Updates worker stream loop and sync/async clients to detect transport-shaped failures and safely recreate/retire SDK-owned channels while preserving caller-owned channel semantics.
  • Adds comprehensive unit tests (core + azuremanaged) plus docs/specs and changelog entries.

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
durabletask/worker.py Implements worker stream monitoring, backoff, failure tracking, and safe SDK-owned channel retirement with in-flight tracking.
durabletask/client.py Adds sync/async unary invocation wrappers and SDK-owned channel recreation + retirement handling.
durabletask/grpc_options.py Adds public resiliency option dataclasses with validation.
durabletask/internal/grpc_resiliency.py Adds shared backoff, FailureTracker, and transport-failure classification helpers.
durabletask-azuremanaged/durabletask/azuremanaged/client.py Forwards client resiliency options through Azure Managed client wrappers.
durabletask-azuremanaged/durabletask/azuremanaged/worker.py Forwards worker resiliency options through Azure Managed worker wrapper.
tests/durabletask/test_worker_resiliency.py New worker resiliency tests (silent disconnect, graceful close, recreation thresholds, in-flight close deferral).
tests/durabletask/test_grpc_resiliency.py New tests for option validation, backoff, FailureTracker, and transport-failure classification.
tests/durabletask/test_client.py Adds sync/async client channel recreation/retirement tests and wrapper verification.
tests/durabletask/test_worker_concurrency_loop.py Updates tests to call prepare_for_run() before reusing the worker manager.
tests/durabletask/test_worker_concurrency_loop_async.py Updates async loop tests to call prepare_for_run() before reusing the worker manager.
tests/durabletask-azuremanaged/test_azuremanaged_grpc_resiliency.py New tests validating Azure Managed wrapper pass-through of resiliency options.
CHANGELOG.md Documents new resiliency options and behavior changes in core SDK.
durabletask-azuremanaged/CHANGELOG.md Documents pass-through resiliency options in Azure Managed package.
docs/superpowers/specs/2026-04-23-grpc-resiliency-design.md Adds design spec for resiliency behavior and public API.
docs/superpowers/plans/2026-04-23-grpc-resiliency.md Adds implementation plan document for the work.
.gitignore Ignores .worktrees/ and normalizes coverage.lcov entry formatting.

Comment thread durabletask/internal/grpc_resiliency.py Outdated
Comment thread docs/superpowers/plans/2026-04-23-grpc-resiliency.md Outdated
Comment thread durabletask/worker.py Outdated
Comment thread durabletask/client.py Outdated
Comment thread durabletask/client.py Outdated
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread durabletask/worker.py
Comment thread durabletask/internal/grpc_resiliency.py Outdated
Comment thread durabletask/client.py
Bernd Verst and others added 2 commits May 21, 2026 16:06
- Make FailureTracker thread-safe with an internal lock so multi-threaded
  sync clients can't race the consecutive-failure counter (review [3/10]).
- Track _AsyncWorkerManager pool shutdown via an explicit _pool_is_shutdown
  flag instead of reading ThreadPoolExecutor._shutdown (CPython private API,
  review [4/10]).
- Collapse identical wrap_execution/wrap_cancellation closures in the worker
  stream loop into a single wrap_with_release helper (review [5/10]).
- Promote the retired-channel close delay and jitter exponent cap to named
  module-level constants (review [7/10]).
- Key _InFlightChannelTracker on the channel object instead of id(channel)
  so the lifetime invariant is local to the tracker (review [9/10]).
- Rename TaskHubGrpcWorker._can_recreate_channel() to the existing
  _owns_channel attribute used by the clients, so both files use the same
  name for the same concept (review [2/10]).
- Add regression tests for FailureTracker concurrency and for thread-pool
  recreation after manager shutdown.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Centralize client failure tracking and channel-recreate triggering in a
`ClientResiliencyInterceptor` (sync) and `AsyncClientResiliencyInterceptor`
(async) instead of the per-call `_invoke_unary` indirection. This addresses
feedback [1/10] on PR #135: resiliency wiring now lives in one place and the
call sites read as normal stub calls.

- Add `ClientResiliencyInterceptor` and `AsyncClientResiliencyInterceptor`
  in `durabletask/internal/grpc_resiliency.py`.
- Switch `LONG_POLL_METHODS` and `is_client_transport_failure` to use full
  gRPC method paths (`/TaskHubSidecarService/...`) so the interceptor can
  match the `method` field on `ClientCallDetails` directly.
- Wire the resiliency interceptor into `TaskHubGrpcClient` and
  `AsyncTaskHubGrpcClient`: it is always prepended (defensive copy of any
  user interceptors) and re-applied on every channel recreate so all unary
  calls flow through it.
- Remove both `_invoke_unary` methods and revert all 34 call sites to
  ordinary `self._stub.MethodName(req)` (or `await ...` for async).
- Caller-owned channels (sync and async) deliberately bypass the resiliency
  interceptor since they are never recreated; this preserves the caller's
  exact channel reference and avoids `grpc.aio`'s lack of a public
  `intercept_channel` equivalent.
- Add test shims (`_ResilientSyncTestStub`/`_ResilientAsyncTestStub` plus
  `install_resilient_test_stubs`) so tests that patch
  `stubs.TaskHubSidecarServiceStub` with `MagicMock` still observe the
  failure-tracking pipeline.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread durabletask/internal/grpc_resiliency.py Fixed
Comment thread durabletask/internal/grpc_resiliency.py
Comment thread durabletask/worker.py
Comment thread durabletask/internal/grpc_resiliency.py
Comment thread durabletask/worker.py
Comment thread durabletask/worker.py
Comment thread durabletask/worker.py
Comment thread durabletask/internal/grpc_resiliency.py
Comment thread durabletask/worker.py
Comment thread tests/durabletask/test_worker_resiliency.py
Comment thread durabletask-azuremanaged/durabletask/azuremanaged/client.py
Comment thread durabletask-azuremanaged/durabletask/azuremanaged/worker.py
Replace the ad-hoc `hasattr(result, '__await__')` check in
`AsyncClientResiliencyInterceptor._record_outcome` with the canonical
`inspect.isawaitable` predicate, and tighten the `on_recreate` callback
annotation to `Callable[[], Union[None, Awaitable[object]]]` so it reflects
the actual contract (sync callbacks return None, async callbacks return an
Awaitable that we await).

Addresses the github-code-quality 'Statement has no effect' warning surfaced
on PR #135 by making the awaitable check explicit and type-driven.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread durabletask/internal/grpc_resiliency.py Fixed
…ceptor

CodeQL's `py/ineffectual-statement` heuristic re-flagged `await result` in
`AsyncClientResiliencyInterceptor._record_outcome` after the previous fix:
the rule treats expression statements whose value is discarded as unused, and
does not recognise that `await` is always a side-effecting suspension point
(the whole purpose of the call is to run the async recreate callback to
completion).

Rewriting the line as `_ = await result` keeps the exact same runtime
behaviour but documents the intent (return value intentionally discarded) and
satisfies the linter.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@andystaples andystaples left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three items - large refactors look good, just a few remaining items to think about

Comment thread durabletask/internal/grpc_resiliency.py Outdated
Comment thread durabletask/client.py Outdated
Comment thread durabletask/internal/grpc_resiliency.py
Bernd Verst and others added 2 commits May 22, 2026 11:32
…nterceptor exception handling

Addresses three follow-up review comments on the resiliency interceptor refactor:

[1/3] Channel recreate now runs fire-and-forget (daemon thread for sync,
asyncio.create_task for async). The original RPC error propagates to the
caller without being delayed by DNS, TLS handshake, or contention on
_recreate_lock. A client-side single-flight guard avoids spawning duplicate
work when many failures land in a burst; the existing cooldown still
prevents thrash. close() waits for any in-flight recreate to finish so the
teardown path stays deterministic. A _recreate_done_event (test seam) lets
tests synchronise on completion without polling.

[2/3] Hoisted _closing, _recreate_lock, _last_recreate_time,
_retired_channels / _retired_channel_close_tasks above ClientResiliencyInterceptor
construction in both __init__ methods so the bound recreate callback is safe
to invoke at any time during construction.

[3/3] AsyncClientResiliencyInterceptor now uses 'except Exception' (so
asyncio.CancelledError, KeyboardInterrupt and SystemExit propagate
unchanged) and mirrors the sync interceptor's policy by resetting the
failure counter on non-AioRpcError exceptions. _record_outcome is now
synchronous on both interceptors because the on_recreate callback no
longer awaits the recreate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Apply three Python 3.10+ idioms to code introduced by this PR:

- PEP 604 union syntax: replace Optional[X] with X | None in grpc_options.py and grpc_resiliency.py.

- Dataclass tightening: add slots=True, kw_only=True to FailureTracker, GrpcRetryPolicyOptions, GrpcChannelOptions, GrpcWorkerResiliencyOptions, and GrpcClientResiliencyOptions. Update the two positional FailureTracker(...) call sites in client.py to use threshold=... kwargs.

- PEP 617 parenthesized context managers: rewrite the 10 chained 'with patch(...), patch(...):' blocks added by this PR in test_client.py. Pre-existing chained sites are left untouched to keep the diff surgical.

Internal-only change (no public API or behavior impact). 85/85 resiliency-focused tests pass; 232/8 passed/skipped in the broader non-e2e suite.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@berndverst berndverst merged commit 5758091 into main May 22, 2026
17 checks passed
@berndverst berndverst deleted the grpc-resiliency-pr708-impl branch May 22, 2026 19:01
berndverst pushed a commit that referenced this pull request May 26, 2026
Apply the same `__enter__`/`__exit__` typing pattern that PR #145 uses
for `TaskHubGrpcClient` (and that `BlobPayloadStore` already follows)
to three pre-existing context managers that were either untyped or
shadowed a builtin:

* `AsyncTaskHubGrpcClient.__aenter__` now returns the concrete type and
  `__aexit__` takes `*args: object` with `-> None`.
* `TaskHubGrpcWorker.__enter__/__exit__` get the same treatment, also
  removing the `type` parameter that shadowed the builtin.
* `EntityLock.__enter__/__exit__` get the same treatment; the file
  already has `from __future__ import annotations` so the return
  annotation is the bare class name.

Behavior is unchanged: each `__exit__` still delegates to its existing
teardown method (`close`/`stop`/`_exit_critical_section`), so the
gRPC resiliency teardown added in PR #135 continues to flow through
`TaskHubGrpcClient.close()` unchanged.

No changelog entry: per the repo's contributor guidance, internal-only
type-annotation refactors with no externally observable behavior change
are excluded from the changelog.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
berndverst added a commit that referenced this pull request May 26, 2026
* Add context manager support to TaskHubGrpcClient (#134)

Add `__enter__`/`__exit__` to the sync `TaskHubGrpcClient` so callers can
use it with a `with` statement, mirroring the existing
`AsyncTaskHubGrpcClient` async-context-manager support and the
`TaskHubGrpcWorker` pattern. `DurableTaskSchedulerClient` inherits this
behavior automatically.

`__exit__` delegates to `close()`, so the resiliency-aware teardown
introduced in #135 (in-flight recreate thread join, retired-channel
timer cancellation, SDK-owned channel cleanup) runs unchanged through
the new `with` path. Caller-owned channels remain untouched.

Migrate every test and example callsite that previously instantiated
`TaskHubGrpcClient(...)` and never closed it to the `with` form so the
gRPC channel is deterministically released. Unit tests in
`test_client.py` that intentionally test construction (with mocked
stubs) are left unchanged.

Add focused unit tests for the new context-manager behavior, including
a regression test that exits a `with` block while a fire-and-forget
channel recreate is pending and asserts the #135 resiliency invariants
(retired-channel timers cancelled, recreate thread joined) still hold.

Fixes #134

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address auto-generated code-quality review comments on PR #145

1. tests/durabletask/test_orchestration_e2e.py (test_suspend_and_resume):
   Drop the unused `state =` assignment around the expected-timeout
   `wait_for_orchestration_completion` call. The value is never read
   because the next line asserts False and the only non-failing path
   raises TimeoutError; `state` is reassigned a few lines down. Silences
   the "variable defined multiple times" warning that CodeQL flagged
   because this previously-untouched line was pulled into the diff by
   the indent change.

2. tests/durabletask/test_client.py
   (test_sync_client_context_manager_propagates_exception_and_calls_close):
   Replace the nested `with pytest.raises(...): with client: raise ...`
   pattern with an explicit try/except so CodeQL no longer reports the
   post-block assertions as unreachable. Test intent (exception
   propagation + cleanup verification) is preserved.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Align __exit__ typing with repo convention (BlobPayloadStore)

Use `*args: object` for the `__exit__` parameters instead of leaving
them untyped. This matches the most recent context-manager class in
the repo (`durabletask/extensions/azure_blob_payloads/blob_payload_store.py`),
is more type-safe under Pylance/mypy, and avoids the parameter
shadowing of the builtin `type` that exists in
`TaskHubGrpcWorker.__exit__`. Behavior is unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Align pre-existing context manager methods with repo idiom

Apply the same `__enter__`/`__exit__` typing pattern that PR #145 uses
for `TaskHubGrpcClient` (and that `BlobPayloadStore` already follows)
to three pre-existing context managers that were either untyped or
shadowed a builtin:

* `AsyncTaskHubGrpcClient.__aenter__` now returns the concrete type and
  `__aexit__` takes `*args: object` with `-> None`.
* `TaskHubGrpcWorker.__enter__/__exit__` get the same treatment, also
  removing the `type` parameter that shadowed the builtin.
* `EntityLock.__enter__/__exit__` get the same treatment; the file
  already has `from __future__ import annotations` so the return
  annotation is the bare class name.

Behavior is unchanged: each `__exit__` still delegates to its existing
teardown method (`close`/`stop`/`_exit_critical_section`), so the
gRPC resiliency teardown added in PR #135 continues to flow through
`TaskHubGrpcClient.close()` unchanged.

No changelog entry: per the repo's contributor guidance, internal-only
type-annotation refactors with no externally observable behavior change
are excluded from the changelog.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Bernd Verst <beverst@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants