-
Notifications
You must be signed in to change notification settings - Fork 124
[FEATURE] WebSocket-based Concurrency Architecture #239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: release
Are you sure you want to change the base?
Conversation
…erver capabilities - Introduced WebSocketEnvClient for persistent sessions with multi-step interactions. - Updated HTTPEnvServer to support WebSocket connections and manage multiple concurrent environments. - Added WebSocket message types and responses for better communication. - Enhanced Environment interface with concurrency safety attributes.
|
@burtenshaw draft PR for the ws and concurrency. I have merged the #238 into this as well. Few notes, before #232 gets merged:
|
|
Amazing work @rycerzes . Thanks
I'll integrate this in a new PR for you to merge here.
I think we can leave this for a subsequent PR. Also, this env might be useful to you. It's basically just a benchmarking env that let's you test concurrency asynchronously like this. |
|
@rycerzes could you help me to understand this please:
What do you mean by old structure? afaik #232 |
|
Thanks for the clarification! You're absolutely right - I need to correct my earlier comment.
I must have run I just verified this by running
Thanks! That benchmark env would be perfect for testing the concurrency implementation. I'll take a look at it. |
Wauplin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this very important piece @rycerzes! I've left quite some comments on how I would do things but some parts are left to the maintainers' decisions 🤗 Especially:
- should we allow "instantiate a server by passing an env instead of an env factory" to keep backward compatibility? => I would say "no" since project is still in early phase
- should we maintain both a "HTTP-based interface" and a "websocket-based interface"? => same, I would say "no" at it means doubling the amount of work (2 paths in the http server and 2 very similar clients to maintain with same interface with different internal logic). Better to keep only 1 interface that is more robust for the future. End users should not be impacted by this decision (except for the breaking change to adapt).
Apart from that, I usually tend to advice to simplify logic by not adding too many optional features at first. More options usually means more internal logic and more maintenance burden on the long run. So if something is not explicitly required, let's keep it for later.
Note that I haven't run the code myself. Will give it a try soon!
| model_config = ConfigDict( | ||
| extra="forbid", | ||
| validate_assignment=True, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be factorized in the base config I mentioned above (same for other BaseModels)
| class ServerCapacityStatus(BaseModel): | ||
| """Status of server capacity for concurrent sessions.""" | ||
|
|
||
| model_config = ConfigDict( | ||
| extra="forbid", | ||
| validate_assignment=True, | ||
| ) | ||
|
|
||
| active_sessions: int = Field( | ||
| ge=0, | ||
| description="Number of currently active sessions", | ||
| ) | ||
| max_sessions: int = Field( | ||
| ge=1, | ||
| description="Maximum number of allowed sessions", | ||
| ) | ||
| available_slots: int = Field( | ||
| ge=0, | ||
| description="Number of available session slots", | ||
| ) | ||
| is_at_capacity: bool = Field( | ||
| description="Whether the server has reached maximum capacity", | ||
| ) | ||
|
|
||
| @classmethod | ||
| def from_counts(cls, active: int, max_sessions: int) -> "ServerCapacityStatus": | ||
| """Create status from active and max session counts.""" | ||
| available = max(0, max_sessions - active) | ||
| return cls( | ||
| active_sessions=active, | ||
| max_sessions=max_sessions, | ||
| available_slots=available, | ||
| is_at_capacity=active >= max_sessions, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel this class could be simplified with something like this:
class ServerCapacityStatus(BaseModel):
"""Status of server capacity for concurrent sessions."""
active_sessions: int = Field( ge=0, description="Number of currently active sessions")
max_sessions: int = Field(ge=1, description="Maximum number of allowed sessions")
@model_validator(mode="after")
def check_capacity_bounds(self) -> "ServerCapacityStatus":
if self.active_sessions > self.max_sessions:
raise ValueError(
f"active_sessions ({self.active_sessions}) cannot exceed "
f"max_sessions ({self.max_sessions})"
)
return self
@property
def available_slots(self) -> int:
"""Number of available session slots"""
return max_sessions - active_sessions
@property
def is_at_capacity(self) -> int: # Not sure this property is really necessary
"""Whether the server has reached maximum capacity"""
return self.available_slots == 0This way available_slots and is_at_capacity are inferred properties, not stored values. And we always validate that active and max sessions are coherent.
| # Register concurrency config endpoint | ||
| @app.get( | ||
| "/concurrency", | ||
| response_model=ConcurrencyConfig, | ||
| tags=["Environment Info"], | ||
| summary="Get concurrency configuration", | ||
| description=""" | ||
| Get the current concurrency configuration for this server. | ||
| Returns information about: | ||
| - **max_concurrent_envs**: Maximum number of concurrent WebSocket sessions | ||
| - **session_timeout_seconds**: Timeout for inactive sessions (None if no timeout) | ||
| - **reject_on_capacity**: Whether to reject or queue connections at capacity | ||
| """, | ||
| ) | ||
| async def get_concurrency_config() -> ConcurrencyConfig: | ||
| """Return concurrency configuration.""" | ||
| return self._concurrency_config | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not but not sure it's necessary ?
| msg_type = message_dict.get("type", "") | ||
|
|
||
| try: | ||
| if msg_type == "reset": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel logic could be simplified like this:
try:
match msg_type:
case "reset":
... # todo: implement
response = WSObservationResponse(...)
case "step":
... # todo: implement
response = WSObservationResponse(...)
case "state":
... # todo: implement
response = WSStateResponse(...)
case "close":
... # todo: implement
case _:
response = WSErrorResponse(
data={"message": f"Unknown message type: {msg_type}", "code": "UNKNOWN_TYPE"}
)
await websocket.send_text(response.model_dump_json())
except ValidationError as e:
error_resp = WSErrorResponse(
data={"message": "Invalid message", "code": "VALIDATION_ERROR", "errors": e.errors()}
)
await websocket.send_text(error_resp.model_dump_json())
except Exception as e:
error_resp = WSErrorResponse(
data={"message": str(e), "code": "EXECUTION_ERROR"}
)
await websocket.send_text(error_resp.model_dump_json())There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This way you have clear logic based on msg_type value + the validation errors are all caught in the same place
|
|
||
| def create_app( | ||
| env: Environment, | ||
| env: Union[Environment, Callable[[], Environment], Type[Environment]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| env: Union[Environment, Callable[[], Environment], Type[Environment]], | |
| env: Callable[[], Environment], |
should be enough if we break backward compat'? (at least for now since we don't accept inputs for environment resets yet)
| try: | ||
| import websockets | ||
| from websockets.sync.client import connect as ws_connect | ||
| except ImportError: | ||
| websockets = None # type: ignore | ||
| ws_connect = None # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since websockets is made a required dependency in pyproject.toml I think we should consider it as always available (simplifies a bit the logic)
| ws_url = base_url.rstrip("/") | ||
| if ws_url.startswith("http://"): | ||
| ws_url = "ws://" + ws_url[7:] | ||
| elif ws_url.startswith("https://"): | ||
| ws_url = "wss://" + ws_url[8:] | ||
| elif not ws_url.startswith("ws://") and not ws_url.startswith("wss://"): | ||
| ws_url = "ws://" + ws_url |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) could be a unit-tested helper (can be hard to track all specificities when updating this type of logic in the future)
|
@pankit-eng @zkwentz Can you validate these two backward compatibility points from @Wauplin on this PR . In short, should we go all in on websockets or maintain a http implementation?
Server side app will look like this: # Factory mode for concurrent sessions
app = create_app(
env=MyEnvironment, # Pass class, not instance
max_concurrent_envs=4
)
iiuc, it client code will only look like this: from envs.echo_env import EchoEnv, EchoAction
client = EchoEnv(base_url="ws://localhost:8000/ws")
result = await client.reset()
result = await client.step(EchoAction(...)) |
|
@rycerzes I tested out this branch and it worked well. I updated the PR description myself with a high-level before and after snippet and some benchmarking info. |
|
Thanks @Wauplin for the detailed review! Really appreciate all the feedback - the suggestions on simplifying the message types with discriminators, refactoring the capacity status, and cleaning up the validation logic make a lot of sense. I'll work through these and have them resolved by end of Friday. @burtenshaw Thanks for testing the branch and updating the PR description with the benchmarking info! |
Add WebSocket support with concurrent session management
Adds WebSocket endpoints for persistent environment sessions with configurable concurrency limits #194
High-level Diff
These are the results on the server side:
On the client side, it requires a change or url:
This leads to high concurrency with limited resources:
Changes
/wswith message protocol for reset/step/state/closeConcurrencyConfigfor setting max concurrent sessions, timeout, and capacity behaviorCONCURRENCY_SAFEflag on environments (defaults to False) with startup validationWebSocketEnvClientfor persistent connectionsAPI
New types:
ConcurrencyConfig(max_concurrent_envs, session_timeout_seconds, reject_on_capacity)SessionInfoandServerCapacityStatusfor session metadataWSResetMessage,WSStepMessage,WSStateMessage,WSCloseMessageWSObservationResponse,WSStateResponse,WSErrorResponseUsage:
Defaults to
max_concurrent_envs=1for backward compatibility. Environments must setCONCURRENCY_SAFE=Trueto allow higher concurrency.TODO
openenv initneeds the WebSocket code integrated into the template:reject_on_capacity=FalseCONCURRENCY_SAFE=True