Conversation


@davidroberts-merlyn davidroberts-merlyn commented Oct 27, 2025

Motivation and Context

Problem

When deploying MCP servers across multiple instances (Kubernetes pods, Docker containers, worker processes), sessions are tied to the specific instance that created them. This requires sticky sessions at the load balancer level and prevents true horizontal scaling. Users are currently forced to choose between:

  1. Sticky sessions - Suboptimal load distribution, sessions lost on pod failure
  2. Single worker - Wastes resources, limits throughput
  3. Stateless mode - Loses session continuity and event replay

This limitation is documented in multiple issues: #520 (multi-worker sessions), #692 (session reuse across instances), #880 (horizontal scalability), and #1350 (sticky session problems).

Solution

This PR enables session roaming - allowing sessions to seamlessly move between server instances without requiring sticky sessions. The key insight is that EventStore already serves as proof of session existence.

When a request arrives with a session ID that is not in the instance's local memory and an EventStore is configured, the instance can safely do the following (the EventStore contract this leans on is sketched just after the list):

  1. Create a transport for that session ID (session roaming)
  2. Let EventStore replay any missed events (continuity)
  3. Handle the request seamlessly
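
The roaming check leans entirely on the existing EventStore contract. Roughly, and with simplified types (the SDK uses its own type aliases and message classes), that interface looks like this:

from abc import ABC, abstractmethod
from collections.abc import Awaitable, Callable

class EventStore(ABC):
    """Simplified sketch of the EventStore ABC, not the exact SDK signatures."""

    @abstractmethod
    async def store_event(self, stream_id: str, message: dict) -> str:
        """Persist an event for a stream and return its event ID."""

    @abstractmethod
    async def replay_events_after(
        self, last_event_id: str, send_callback: Callable[[dict], Awaitable[None]]
    ) -> str | None:
        """Resend events recorded after last_event_id and return the stream ID."""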

What Changed

Modified streamable_http_manager.py (~50 lines):

  • Added session roaming logic in _handle_stateful_request() (its rough shape is sketched after this list)
  • When an unknown session ID arrives and an EventStore exists → create a transport (roaming!)
  • Extracted duplicate server task code into reusable methods
  • Updated docstrings to document session roaming capability
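
The shape of that change, heavily simplified (helper names such as extract_session_id, _create_transport, and _start_server_task are placeholders for this sketch, not the exact code in the diff):

async def _handle_stateful_request(self, scope, receive, send):
    session_id = extract_session_id(scope)  # hypothetical: reads the Mcp-Session-Id header

    if session_id is None:
        # New session: existing behaviour, elided here.
        return await self._handle_new_session(scope, receive, send)  # hypothetical

    transport = self._server_instances.get(session_id)
    if transport is None:
        if self.event_store is None:
            # No shared store, so the session cannot be verified: reject as before.
            return await send_session_not_found(send)  # hypothetical helper
        # EventStore configured: recreate a transport for the roaming session
        # and let event replay restore continuity.
        transport = self._create_transport(session_id)   # hypothetical
        self._server_instances[session_id] = transport
        await self._start_server_task(transport)          # name assumed; the PR extracts this into a helper

    await transport.handle_request(scope, receive, send)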

Added comprehensive tests (test_session_roaming.py, 510 lines):

  • Session roaming with EventStore
  • Rejection without EventStore
  • Concurrent request handling
  • Exception cleanup behavior
  • Fast path verification
  • Logging verification
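
A condensed, hypothetical version of the first two cases (make_manager, send_request, and InMemoryEventStore stand in for the fixtures and helpers the real test file defines):

import pytest

@pytest.mark.anyio
async def test_unknown_session_roams_only_with_event_store(make_manager, send_request):
    # With an EventStore, an unknown session ID gets a new transport (roaming).
    roaming_manager = make_manager(event_store=InMemoryEventStore())
    await send_request(roaming_manager, session_id="abc123")
    assert "abc123" in roaming_manager._server_instances

    # Without an EventStore, the unknown session is rejected as before.
    strict_manager = make_manager(event_store=None)
    await send_request(strict_manager, session_id="abc123")
    assert "abc123" not in strict_manager._server_instances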

Added production-ready example (simple-streamablehttp-roaming/, 13 files):

  • Complete working example with Redis EventStore
  • Multi-instance deployment support
  • Docker Compose configuration (3 instances + Redis + NGINX)
  • Kubernetes deployment example
  • Automated test script demonstrating roaming
  • Comprehensive documentation (README, QUICKSTART, implementation details)
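
In spirit, the RedisEventStore in the example looks something like this (a sketch only; the key layout and class internals below are assumptions, not the example's actual code):

import json
import uuid

from redis.asyncio import Redis

class RedisEventStore:
    """Sketch: events for each stream live in a Redis list keyed by stream ID."""

    def __init__(self, redis_url: str) -> None:
        self._redis = Redis.from_url(redis_url)

    async def store_event(self, stream_id: str, message: dict) -> str:
        event_id = f"{stream_id}:{uuid.uuid4().hex}"
        payload = json.dumps({"id": event_id, "message": message})
        await self._redis.rpush(f"mcp:events:{stream_id}", payload)
        return event_id

    async def replay_events_after(self, last_event_id: str, send_callback) -> str | None:
        stream_id = last_event_id.split(":", 1)[0]
        raw_events = await self._redis.lrange(f"mcp:events:{stream_id}", 0, -1)
        seen_last = False
        for raw in raw_events:
            event = json.loads(raw)
            if seen_last:
                await send_callback(event)        # deliver everything the client missed
            elif event["id"] == last_event_id:
                seen_last = True
        return stream_id

Because every instance points at the same Redis, any of them can replay a session's history, which is what makes roaming safe.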

Why This Approach

Previous Attempts

Two other approaches were explored before arriving at this solution:

  1. Custom Session Store (outside SDK) - Implemented custom session validation in the application layer. This didn't solve the core problem and forced every user to build their own solution; because the SDK's internal session dict was unchanged, it still required sticky sessions.

  2. SessionStore ABC (in SDK) - Added a new SessionStore interface requiring both an EventStore and a SessionStore. While functional, this approach required two separate storage backends and was more complex than necessary. It also meant that omitting either store left the server only partially stateful.

Current Approach: EventStore-Only

The key insight: EventStore already proves sessions existed. If events exist for a session ID, that session must have existed to create those events. No separate SessionStore needed.

Benefits:

  • ✅ One store instead of two (simpler)
  • ✅ Reuses existing EventStore interface (no new APIs)
  • ✅ Impossible to misconfigure (EventStore = both events + proof)
  • ✅ Aligns with SEP-1359 (sessions are conversation context, not auth)
  • ✅ Minimal code changes (~50 lines)
  • ✅ 100% backward compatible (behavior enhancement only)

Usage

Before (Requires Sticky Sessions)

# Without EventStore - sessions in memory only
manager = StreamableHTTPSessionManager(app=app)
# Deployment: requires sticky sessions for multi-instance

After (No Sticky Sessions Needed)

# With EventStore - sessions roam freely
event_store = RedisEventStore(redis_url="redis://redis:6379")
manager = StreamableHTTPSessionManager(
    app=app,
    event_store=event_store  # Enables session roaming!
)
# Deployment: load balancer can route freely, no sticky sessions
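
For completeness, a sketch of how the manager is typically wired into an ASGI app (the Starlette mounting and lifespan below follow the SDK's usual examples and are not part of this PR's diff; manager, app, and event_store are the objects from the snippet above):

import contextlib

from starlette.applications import Starlette
from starlette.routing import Mount

@contextlib.asynccontextmanager
async def lifespan(_app):
    # The manager owns the task group that runs each session's server loop.
    async with manager.run():
        yield

starlette_app = Starlette(
    routes=[Mount("/mcp", app=manager.handle_request)],
    lifespan=lifespan,
)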

How It Works

Client → Instance 1 (creates session "abc123", stores events in Redis)
Client → Instance 2 (with session "abc123")
  ↓
Instance 2 checks memory → not found
Instance 2 sees EventStore exists
Instance 2 creates transport for "abc123" (roaming!)
EventStore replays events from Redis
Session continues seamlessly ✅
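
From the client's point of view nothing changes: it keeps sending the same Mcp-Session-Id header, and whichever instance receives the request can serve it. A rough illustration with placeholder URLs and deliberately simplified payloads:

import httpx

async def demo_roaming() -> None:
    headers = {"Accept": "application/json, text/event-stream"}
    async with httpx.AsyncClient(headers=headers) as client:
        # First request goes through the load balancer; whichever instance
        # answers assigns the session ID in the response headers.
        init = await client.post(
            "http://lb.example/mcp",
            json={"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {}},  # params elided
        )
        session_id = init.headers["mcp-session-id"]

        # Later requests reuse that ID; the balancer may route them to a
        # different instance, which adopts ("roams") the session.
        await client.post(
            "http://lb.example/mcp",
            json={"jsonrpc": "2.0", "id": 2, "method": "ping"},
            headers={"Mcp-Session-Id": session_id},
        )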

How Has This Been Tested?

  • ✅ Existing test suite (no regressions)
  • ✅ 8 new tests for session roaming
  • ✅ Automated roaming test script in example
  • ✅ Testing within our existing infrastructure

The included example also demonstrates:

  • Multi-instance deployment with Docker Compose
  • Kubernetes manifests (3 replicas, no sessionAffinity needed)
  • NGINX load balancing without sticky sessions
  • Redis EventStore for shared state
  • Automated testing and verification with provided test script

Breaking Changes

None. This is a pure behavior enhancement:

  • ✅ Existing code works unchanged
  • ✅ No API changes
  • ✅ No new required parameters
  • ✅ Backward compatible

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

Related Issues

Closes #520, #692, #880, #1350

This implementation addresses the core limitation described in all these issues: the inability to run stateful MCP servers across multiple instances without sticky sessions.

Add session roaming support to StreamableHTTPSessionManager, allowing
sessions to move freely between server instances without requiring
sticky sessions. This enables true horizontal scaling and high
availability for stateful MCP servers.

When a request arrives with a session ID not found in local memory,
the presence of an EventStore allows creating a transport for that
session. EventStore serves dual purposes: storing events (existing)
and proving session existence (new). This eliminates the need for
separate session validation storage.

Changes:
- Add session roaming logic in _handle_stateful_request()
- Extract duplicate server task code into reusable methods
- Update docstrings to document session roaming capability
- Add 8 comprehensive tests for session roaming scenarios
- Add production-ready example with Redis EventStore
- Include Kubernetes and Docker Compose deployment examples

Benefits:
- One store instead of two (EventStore serves both purposes)
- No new APIs or interfaces required
- Minimal code changes (~50 lines in manager)
- 100% backward compatible
- Enables multi-instance deployments without sticky sessions

Example usage:
  event_store = RedisEventStore(redis_url="redis://redis:6379")
  manager = StreamableHTTPSessionManager(
      app=app,
      event_store=event_store  # Enables session roaming
  )

Github-Issue: modelcontextprotocol#520
Github-Issue: modelcontextprotocol#692
Github-Issue: modelcontextprotocol#880
Github-Issue: modelcontextprotocol#1350

Change single quotes to double quotes to comply with prettier formatting requirements.
- Add language specifiers to all code blocks
- Fix heading hierarchy (bold text to proper headings)
- Add blank lines after headings for better readability
- Escape underscores in file paths so names like __init__.py render literally instead of as bold

The transport could be removed from _server_instances by the cleanup
task if it crashed immediately after being started. This caused a
KeyError when trying to access it from the dictionary.

Fixed by keeping a local reference to the transport instead of looking
it up again from the dictionary after starting the server task.

Use @contextlib.asynccontextmanager decorator instead of manual
__aenter__/__aexit__ implementation for mock_connect functions.

Fixes test failures in:
- test_transport_server_task_cleanup_on_exception
- test_transport_server_task_no_cleanup_on_terminated

Add AsyncIterator import and use proper return type annotation for
mock_connect functions: AsyncIterator[tuple[AsyncMock, AsyncMock]]
instead of Any.

The tests were failing because AsyncMock(return_value=None) caused
app.run to complete immediately, which closed the transport streams
and triggered cleanup that removed transports from _server_instances
before assertions could check for them.

Now using mock_app_run that calls anyio.sleep_forever() and blocks
until the test context cancels it. This keeps transports alive during
the test assertions.

@jacksteamdev jacksteamdev left a comment

LGTM 👍 There are a few extra .md files, but the logic looks sound.

Do we need this file? It's nice information, but none of the other examples have it.

Most of this information seems to be in README.md

I'm on the edge about this script. None of the examples have bash scripts, but it's a nice DX sanity check.

@davidroberts-merlyn davidroberts-merlyn marked this pull request as draft October 28, 2025 18:54
@felixweinberger felixweinberger added the pending publish Draft PRs need to be published for team to review label Oct 29, 2025