Skip to content

Python: Fix MCPStreamableHTTPTool leaking asyncio.CancelledError when MCP server is unreachable#5687

Merged
TaoChenOSU merged 7 commits into
microsoft:mainfrom
moonbox3:agent/fix-5667-1
May 7, 2026
Merged

Python: Fix MCPStreamableHTTPTool leaking asyncio.CancelledError when MCP server is unreachable#5687
TaoChenOSU merged 7 commits into
microsoft:mainfrom
moonbox3:agent/fix-5667-1

Conversation

@moonbox3
Copy link
Copy Markdown
Contributor

@moonbox3 moonbox3 commented May 7, 2026

Motivation and Context

When an MCP server is unreachable, the underlying MCP/anyio transport raises asyncio.CancelledError internally. Because asyncio.CancelledError is not a subclass of Exception in Python 3.8+, the existing except Exception handlers in MCPTool.connect() silently let it escape, making it impossible for calers to catch the error with a normal try/except Exception block.

Fixes #5667

Description

The root cause is that all three except Exception clauses in MCPTool._connect_on_owner (covering transport creation, session creation, and session.initialize()) do not catch BaseException subclasses like asyncio.CancelledError. The fix widens each handler to except (Exception, asyncio.CancelledError) and wraps the caught error in a ToolException, ensuring connection failures always surface as a catchable, meaningful exception regardless of what the MCP transport raises internally.

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

Note: PR autogenerated by an agent

microsoft#5667)

asyncio.CancelledError is a BaseException (not Exception) in Python 3.8+.
When an MCP server is unreachable, the MCP library's internal anyio task
group raises CancelledError, which escaped all three 'except Exception'
handlers in _connect_on_owner(). This propagated through
_run_lifecycle_owner -> _run_on_lifecycle_owner -> connect -> __aenter__,
bypassing user except Exception blocks entirely.

Fix: change the three except-Exception clauses in _connect_on_owner to
'except (Exception, asyncio.CancelledError)' so spurious CancelledErrors
from the MCP transport layer are caught and wrapped in ToolException,
consistent with the method's documented contract.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@moonbox3 moonbox3 added the python label May 7, 2026
@moonbox3
Copy link
Copy Markdown
Contributor Author

moonbox3 commented May 7, 2026

Python Test Coverage

Python Test Coverage Report •
FileStmtsMissCoverMissing
packages/core/agent_framework
   _mcp.py6312695%153, 172, 310, 370–371, 500, 562, 575, 599–600, 613–616, 618–619, 623, 706–708, 1074, 1187, 1239–1240, 1243, 1714
   _skills.py5921098%522, 532, 1808–1809, 1882, 1887, 1936, 1941, 2144–2145
TOTAL33604390488% 

Python Unit Test Overview

Tests Skipped Failures Errors Time
6567 30 💤 0 ❌ 0 🔥 1m 45s ⏱️

Copy link
Copy Markdown
Contributor Author

@moonbox3 moonbox3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review

Reviewers: 4 | Confidence: 88%

✓ Correctness

The functional change correctly intercepts asyncio.CancelledError (a BaseException since Python 3.8) that the MCP library raises via anyio cancel scopes during transport creation, session creation, and session.initialize(). The inner_exception guard (isinstance(ex, Exception)) is necessary to stay within the Exception | None type contract of AgentFrameworkException. The exception chain via from ex is preserved in all three sites. The only minor trade-off is that when a CancelledError is caught, exc_info won't be attached to the debug log (inner_exception is None), but the error message string still includes str(ex). All other diff changes are cosmetic reformatting. No correctness bugs found.

✓ Security Reliability

The fix correctly addresses the real bug: the MCP library (using anyio cancel scopes) raises asyncio.CancelledError for connection failures, which escaped all except Exception handlers. The change catches CancelledError and converts it to ToolException. The implementation is clean and tests verify the targeted scenarios. One reliability concern: the catch-all on CancelledError also silently suppresses legitimate asyncio task cancellation (e.g., when calers do task.cancel()), converting it to ToolException instead of properly propagating the cancellation signal. This is a known trade-off given Python 3.10 compatibility (no task.cancelling() available until 3.11), but calers should be aware the task will NOT honour cancellation during connect().

✓ Test Coverage

The PR correctly catches asyncio.CancelledError (a BaseException subclass, not Exception) at all three connection stages and wraps it in ToolException. The four new regression tests cover each injection point and the end-to-end async with catchability. The _skills.py and __init__.py changes are cosmetic (line reformatting, alphabetical sort) with no logic impact. Two minor coverage gaps exist: (1) none of the new tests assert that _safe_close_exit_stack is actually called when CancelledError fires, unlike the existing test_connect_cleanup_on_initialization_failure which asserts aclose is called; (2) all new tests use MCPStreamableHTPTool (no command attribute), so the if command: branch of the error-message formatting is never exercised for CancelledError. Neither gap is blocking — the primary regression is well-covered.

✗ Design Approach

The MCP fix addresses the intended regression for internally generated CancelledErrors, but it currently broadens the catch so far that genuine caller-driven task cancellation during connect() is also converted into ToolException. That changes cancellation semantics in a way that conflicts with the repo's existing pattern of re-raising asyncio.CanceledError to stop work promptly.

Flagged Issues

  • Lines 658, 695, and 703 of python/packages/core/agent_framework/_mcp.py catch asyncio.CancelledError and unconditionally rethrow ToolException, so task.cancel() during connect() no longer propagates cancellation. This conflicts with the repo's existing pattern of re-raising CancelledError (see _mcp.py:560-564, _mcp.py:574-576, _workflows/_runner.py:115-120). The catch should be narrowed so only MCP-internal anyio cancel-scope failures are wrapped while genuine caller-driven cancellation still propagates.

Suggestions

  • When ex is a CancelledError, inner_exception is None, so the debug log at exceptions.py:34 will not include exc_info. Consider explicitly logging the CancelledError before raising ToolException if richer diagnostics are desired.
  • All four new tests use MCPStreamableHTTPTool (no command attribute), leaving the if command: error-message branch at _mcp.py:661-662 and _mcp.py:709-710 unexercised for CancelledError. Adding a test with MCPStdioTool (which has a command attribute) would confirm the 'Failed to start MCP server …' / 'MCP server … failed to initialize' message variants are correct under CancelledError.

Automated review by automated agents

Comment thread python/packages/core/agent_framework/_mcp.py
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review

Reviewers: 4 | Confidence: 89%

✓ Correctness

The PR correctly addresses the stated bug. The except (Exception, asyncio.CanceledError) pattern is appropriate since CancelledError is not a subclass of Exception in Python 3.9+. The inner_exception=ex if isinstance(ex, Exception) else None conditional correctly satisfies the Exception | None type annotation while preserving the causal chain via from ex. The _safe_close_exit_stack cleanup, asyncio import, and ToolException class hierarchy are all verified correct. The formatting changes in _skills.py and test files are purely cosmetic and harmless.

✓ Security Reliability

The PR correctly addresses the CancelledError leakage issue. The exception tuple (Exception, asyncio.CancelledError) is appropriately scoped (won't catch KeyboardInterrupt/SystemExit), the inner_exception type guard is necessary given the Exception|None annotation in AgentFrameworkException.init, and the 'from ex' clause preserves the full exception chain for debugging. The approach is consistent with _safe_close_exit_stack (line 627) which already catches CancelledError. No security or reliability issues found.

✓ Test Coverage

The new tests cover the three CancelledError injection points and end-to-end catchability well. However, none of the new CancelledError tests verify that cleanup (_safe_close_exit_stack) is actually invoked—a gap that matters because the whole bug is about CancelledError escaping handlers that perform cleanup. The existing test_connect_cleanup_on_initialization_failure only tests RuntimeError. Adding cleanup assertions to at least one CancelledError test would confirm the fix handles the resource-leak aspect, not just the wrapping aspect.

✗ Design Approach

I found one design-level issue in the MCP change: broadening the connection setup handlers to wrap every asyncio.CanceledError as ToolException also catches real task cancellation, which conflicts with the existing lifecycle-owner design that explicitly propagates cancellation to callers. The other changes in this diff are formatting-only and do not raise design concerns.


Automated review by moonbox3's agents

Comment thread python/packages/core/agent_framework/_mcp.py
Comment thread python/packages/core/tests/core/test_mcp.py
Copilot and others added 3 commits May 7, 2026 05:14
…ft#5667)

On Python >= 3.11, check task.cancelling() > 0 before wrapping
CancelledError as ToolException in the three except blocks inside
_connect_on_owner(). When the current task is being cancelled by its
caller, the CancelledError now propagates after cleanup, consistent
with the existing pattern at _mcp.py:560-564 and _runner.py:115-120.

On Python < 3.11 task.cancelling() is unavailable, so MCP-internal
CancelledErrors still cannot be reliably distinguished from
caller-driven cancellation; they continue to be wrapped as
ToolException with a comment documenting the trade-off.

Tests:
- Add cleanup assertion to transport-creation CancelledError test
- Add MCPStdioTool variants exercising the 'command' message branches
  for both transport-creation and initialize CancelledError paths
- Add Python 3.11+-gated tests verifying genuine task cancellation
  propagates (and still cleans up) for transport and initialize stages

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eption (microsoft#5667)

CancelledError inherits from BaseException (not Exception) on Python >= 3.8,
so the 'inner_exception=ex if isinstance(ex, Exception) else None' guard
always yields None for CancelledError. This means ToolException.__init__
calls logger.log(level, message, exc_info=None), dropping the traceback.

Add an explicit logger.debug(error_msg, exc_info=ex) before each
raise ToolException(...) in the three CancelledError handlers so the
full traceback is preserved in debug logs when MCP-internal cancellation
is wrapped rather than propagated.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ling Issue regarding Python MCPStreamableHTTPTool Class
Copilot AI review requested due to automatic review settings May 7, 2026 05:22
Copy link
Copy Markdown
Contributor Author

@moonbox3 moonbox3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review

Reviewers: 4 | Confidence: 98% | Result: All clear

Reviewed: Correctness, Security Reliability, Test Coverage, Design Approach


Automated review by automated agents

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes a Python behavior where MCP connection failures could leak asyncio.CancelledError (a BaseException on Python 3.8+) instead of raising a catchable ToolException, especially when the MCP server is unreachable.

Changes:

  • Catch asyncio.CancelledError during MCP transport/session/initialize steps and wrap failures as ToolException.
  • Add regression tests covering CancelledError wrapping and ensuring genuine task cancellation still propagates (Py 3.11+).
  • Minor test/code formatting cleanup in skills-related tests and error strings.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
python/packages/core/agent_framework/_mcp.py Wraps MCP connection CancelledError into ToolException while preserving genuine cancellation on Python 3.11+.
python/packages/core/tests/core/test_mcp.py Adds targeted tests for the new CancelledError handling and cancellation propagation behavior.
python/packages/core/tests/core/test_skills.py Formatting-only adjustments (line breaks/blank lines) in tests.
python/packages/core/agent_framework/_skills.py Collapses some multi-line exception messages into single-line strings (no functional change).
python/packages/core/agent_framework/init.py Reorders an export entry in the module’s public list.

Comment thread python/packages/core/agent_framework/_mcp.py Outdated
Comment thread python/packages/core/agent_framework/_mcp.py
Comment thread python/packages/core/agent_framework/_mcp.py Outdated
Copilot and others added 3 commits May 7, 2026 05:34
…d exc_info

- Extract _should_propagate_cancelled_error() helper to eliminate duplicated
  genuine-cancellation detection logic across the three connect() except blocks
- Fix session-creation ToolException message to include exception details
  (e.g. 'Failed to create MCP session: <ex>') matching the transport and
  initialize failure paths
- Change exc_info=ex to exc_info=True in all three logger.debug() calls
  for idiomatic logging
- Add tests for _should_propagate_cancelled_error helper
- Add regression test asserting session error message includes exception text
- Add test verifying logger.debug is called with exc_info=True

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…_owner

Addresses review comment on PR microsoft#5687:

1. Add _close_and_check_cancelled() helper method that combines
   _safe_close_exit_stack() + _should_propagate_cancelled_error() into a
   single await-able call. This eliminates the duplicated close-then-check
   pattern that appeared identically in all three connect phases (transport,
   session, initialize), reducing future drift risk.

2. Comments 2 and 3 (missing {ex} in session error message and non-idiomatic
   exc_info=ex) were already addressed in the current code: all error messages
   include {ex} and all logger.debug calls use exc_info=True.

3. Add test_connect_genuine_cancellation_during_session_creation_propagates
   to cover the previously untested genuine-cancellation path in the
   session-creation phase (transport and initialize phases already had tests).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@moonbox3 moonbox3 self-assigned this May 7, 2026
@TaoChenOSU TaoChenOSU added this pull request to the merge queue May 7, 2026
Merged via the queue into microsoft:main with commit 2a9b68d May 7, 2026
34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Python: [Bug]: Error Handling Issue regarding Python MCPStreamableHTTPTool Class

5 participants