Python: Fix MCPStreamableHTTPTool leaking asyncio.CancelledError when MCP server is unreachable#5687
Conversation
microsoft#5667) asyncio.CancelledError is a BaseException (not Exception) in Python 3.8+. When an MCP server is unreachable, the MCP library's internal anyio task group raises CancelledError, which escaped all three 'except Exception' handlers in _connect_on_owner(). This propagated through _run_lifecycle_owner -> _run_on_lifecycle_owner -> connect -> __aenter__, bypassing user except Exception blocks entirely. Fix: change the three except-Exception clauses in _connect_on_owner to 'except (Exception, asyncio.CancelledError)' so spurious CancelledErrors from the MCP transport layer are caught and wrapped in ToolException, consistent with the method's documented contract. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Python Test Coverage Report •
Python Unit Test Overview
|
|||||||||||||||||||||||||||||||||||
moonbox3
left a comment
There was a problem hiding this comment.
Automated Code Review
Reviewers: 4 | Confidence: 88%
✓ Correctness
The functional change correctly intercepts asyncio.CancelledError (a BaseException since Python 3.8) that the MCP library raises via anyio cancel scopes during transport creation, session creation, and session.initialize(). The inner_exception guard (isinstance(ex, Exception)) is necessary to stay within the Exception | None type contract of AgentFrameworkException. The exception chain via
from exis preserved in all three sites. The only minor trade-off is that when a CancelledError is caught, exc_info won't be attached to the debug log (inner_exception is None), but the error message string still includes str(ex). All other diff changes are cosmetic reformatting. No correctness bugs found.
✓ Security Reliability
The fix correctly addresses the real bug: the MCP library (using anyio cancel scopes) raises
asyncio.CancelledErrorfor connection failures, which escaped allexcept Exceptionhandlers. The change catchesCancelledErrorand converts it toToolException. The implementation is clean and tests verify the targeted scenarios. One reliability concern: the catch-all onCancelledErroralso silently suppresses legitimate asyncio task cancellation (e.g., when calers dotask.cancel()), converting it toToolExceptioninstead of properly propagating the cancellation signal. This is a known trade-off given Python 3.10 compatibility (notask.cancelling()available until 3.11), but calers should be aware the task will NOT honour cancellation duringconnect().
✓ Test Coverage
The PR correctly catches
asyncio.CancelledError(aBaseExceptionsubclass, notException) at all three connection stages and wraps it inToolException. The four new regression tests cover each injection point and the end-to-endasync withcatchability. The_skills.pyand__init__.pychanges are cosmetic (line reformatting, alphabetical sort) with no logic impact. Two minor coverage gaps exist: (1) none of the new tests assert that_safe_close_exit_stackis actually called whenCancelledErrorfires, unlike the existingtest_connect_cleanup_on_initialization_failurewhich assertsacloseis called; (2) all new tests useMCPStreamableHTPTool(nocommandattribute), so theif command:branch of the error-message formatting is never exercised forCancelledError. Neither gap is blocking — the primary regression is well-covered.
✗ Design Approach
The MCP fix addresses the intended regression for internally generated
CancelledErrors, but it currently broadens the catch so far that genuine caller-driven task cancellation duringconnect()is also converted intoToolException. That changes cancellation semantics in a way that conflicts with the repo's existing pattern of re-raisingasyncio.CanceledErrorto stop work promptly.
Flagged Issues
- Lines 658, 695, and 703 of
python/packages/core/agent_framework/_mcp.pycatchasyncio.CancelledErrorand unconditionally rethrowToolException, sotask.cancel()duringconnect()no longer propagates cancellation. This conflicts with the repo's existing pattern of re-raisingCancelledError(see_mcp.py:560-564,_mcp.py:574-576,_workflows/_runner.py:115-120). The catch should be narrowed so only MCP-internal anyio cancel-scope failures are wrapped while genuine caller-driven cancellation still propagates.
Suggestions
- When
exis aCancelledError,inner_exceptionisNone, so the debug log atexceptions.py:34will not includeexc_info. Consider explicitly logging theCancelledErrorbefore raisingToolExceptionif richer diagnostics are desired. - All four new tests use
MCPStreamableHTTPTool(nocommandattribute), leaving theif command:error-message branch at_mcp.py:661-662and_mcp.py:709-710unexercised forCancelledError. Adding a test withMCPStdioTool(which has acommandattribute) would confirm the'Failed to start MCP server …'/'MCP server … failed to initialize'message variants are correct underCancelledError.
Automated review by automated agents
There was a problem hiding this comment.
Automated Code Review
Reviewers: 4 | Confidence: 89%
✓ Correctness
The PR correctly addresses the stated bug. The
except (Exception, asyncio.CanceledError)pattern is appropriate sinceCancelledErroris not a subclass ofExceptionin Python 3.9+. Theinner_exception=ex if isinstance(ex, Exception) else Noneconditional correctly satisfies theException | Nonetype annotation while preserving the causal chain viafrom ex. The_safe_close_exit_stackcleanup,asyncioimport, andToolExceptionclass hierarchy are all verified correct. The formatting changes in_skills.pyand test files are purely cosmetic and harmless.
✓ Security Reliability
The PR correctly addresses the CancelledError leakage issue. The exception tuple (Exception, asyncio.CancelledError) is appropriately scoped (won't catch KeyboardInterrupt/SystemExit), the inner_exception type guard is necessary given the Exception|None annotation in AgentFrameworkException.init, and the 'from ex' clause preserves the full exception chain for debugging. The approach is consistent with _safe_close_exit_stack (line 627) which already catches CancelledError. No security or reliability issues found.
✓ Test Coverage
The new tests cover the three CancelledError injection points and end-to-end catchability well. However, none of the new CancelledError tests verify that cleanup (
_safe_close_exit_stack) is actually invoked—a gap that matters because the whole bug is about CancelledError escaping handlers that perform cleanup. The existingtest_connect_cleanup_on_initialization_failureonly tests RuntimeError. Adding cleanup assertions to at least one CancelledError test would confirm the fix handles the resource-leak aspect, not just the wrapping aspect.
✗ Design Approach
I found one design-level issue in the MCP change: broadening the connection setup handlers to wrap every
asyncio.CanceledErrorasToolExceptionalso catches real task cancellation, which conflicts with the existing lifecycle-owner design that explicitly propagates cancellation to callers. The other changes in this diff are formatting-only and do not raise design concerns.
Automated review by moonbox3's agents
…ft#5667) On Python >= 3.11, check task.cancelling() > 0 before wrapping CancelledError as ToolException in the three except blocks inside _connect_on_owner(). When the current task is being cancelled by its caller, the CancelledError now propagates after cleanup, consistent with the existing pattern at _mcp.py:560-564 and _runner.py:115-120. On Python < 3.11 task.cancelling() is unavailable, so MCP-internal CancelledErrors still cannot be reliably distinguished from caller-driven cancellation; they continue to be wrapped as ToolException with a comment documenting the trade-off. Tests: - Add cleanup assertion to transport-creation CancelledError test - Add MCPStdioTool variants exercising the 'command' message branches for both transport-creation and initialize CancelledError paths - Add Python 3.11+-gated tests verifying genuine task cancellation propagates (and still cleans up) for transport and initialize stages Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eption (microsoft#5667) CancelledError inherits from BaseException (not Exception) on Python >= 3.8, so the 'inner_exception=ex if isinstance(ex, Exception) else None' guard always yields None for CancelledError. This means ToolException.__init__ calls logger.log(level, message, exc_info=None), dropping the traceback. Add an explicit logger.debug(error_msg, exc_info=ex) before each raise ToolException(...) in the three CancelledError handlers so the full traceback is preserved in debug logs when MCP-internal cancellation is wrapped rather than propagated. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ling Issue regarding Python MCPStreamableHTTPTool Class
moonbox3
left a comment
There was a problem hiding this comment.
Automated Code Review
Reviewers: 4 | Confidence: 98% | Result: All clear
Reviewed: Correctness, Security Reliability, Test Coverage, Design Approach
Automated review by automated agents
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Fixes a Python behavior where MCP connection failures could leak asyncio.CancelledError (a BaseException on Python 3.8+) instead of raising a catchable ToolException, especially when the MCP server is unreachable.
Changes:
- Catch
asyncio.CancelledErrorduring MCP transport/session/initialize steps and wrap failures asToolException. - Add regression tests covering CancelledError wrapping and ensuring genuine task cancellation still propagates (Py 3.11+).
- Minor test/code formatting cleanup in skills-related tests and error strings.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| python/packages/core/agent_framework/_mcp.py | Wraps MCP connection CancelledError into ToolException while preserving genuine cancellation on Python 3.11+. |
| python/packages/core/tests/core/test_mcp.py | Adds targeted tests for the new CancelledError handling and cancellation propagation behavior. |
| python/packages/core/tests/core/test_skills.py | Formatting-only adjustments (line breaks/blank lines) in tests. |
| python/packages/core/agent_framework/_skills.py | Collapses some multi-line exception messages into single-line strings (no functional change). |
| python/packages/core/agent_framework/init.py | Reorders an export entry in the module’s public list. |
…d exc_info - Extract _should_propagate_cancelled_error() helper to eliminate duplicated genuine-cancellation detection logic across the three connect() except blocks - Fix session-creation ToolException message to include exception details (e.g. 'Failed to create MCP session: <ex>') matching the transport and initialize failure paths - Change exc_info=ex to exc_info=True in all three logger.debug() calls for idiomatic logging - Add tests for _should_propagate_cancelled_error helper - Add regression test asserting session error message includes exception text - Add test verifying logger.debug is called with exc_info=True Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…_owner Addresses review comment on PR microsoft#5687: 1. Add _close_and_check_cancelled() helper method that combines _safe_close_exit_stack() + _should_propagate_cancelled_error() into a single await-able call. This eliminates the duplicated close-then-check pattern that appeared identically in all three connect phases (transport, session, initialize), reducing future drift risk. 2. Comments 2 and 3 (missing {ex} in session error message and non-idiomatic exc_info=ex) were already addressed in the current code: all error messages include {ex} and all logger.debug calls use exc_info=True. 3. Add test_connect_genuine_cancellation_during_session_creation_propagates to cover the previously untested genuine-cancellation path in the session-creation phase (transport and initialize phases already had tests). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Motivation and Context
When an MCP server is unreachable, the underlying MCP/anyio transport raises
asyncio.CancelledErrorinternally. Becauseasyncio.CancelledErroris not a subclass ofExceptionin Python 3.8+, the existingexcept Exceptionhandlers inMCPTool.connect()silently let it escape, making it impossible for calers to catch the error with a normaltry/except Exceptionblock.Fixes #5667
Description
The root cause is that all three
except Exceptionclauses inMCPTool._connect_on_owner(covering transport creation, session creation, andsession.initialize()) do not catchBaseExceptionsubclasses likeasyncio.CancelledError. The fix widens each handler toexcept (Exception, asyncio.CancelledError)and wraps the caught error in aToolException, ensuring connection failures always surface as a catchable, meaningful exception regardless of what the MCP transport raises internally.Contribution Checklist