Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge the gateway handlers into the standard handlers. #1261

Merged
merged 21 commits into from May 10, 2023

Conversation

ojarjur
Copy link
Contributor

@ojarjur ojarjur commented Apr 20, 2023

The two WebSocketChannelsHandler and GatewayResourceHandler classes are deprecated and their corresponding functionality is merged into the respective KernelWebsocketHandler and KernelSpecResourceHandler classes.

For the KernelSpecResourceHandler class, this change is rather straightforward as we can simply make the existing handler check if the kernel spec manager has a get_kernel_spec_resource method, and if so delegate to that method instead of trying to read resources from disk.

The KernelWebsocketHandler conversion is more complicated, though. The handling of websocket connections was generalized/extended in #1047 to allow the definition of a kernel_websocket_connection_class as an alternative to replacing the entire websocket handler.

This change builds on that by converting the GatewayWebSocketClient class to be an instance of the kernel websocket connection class, and accordingly renames it to GatewayWebSocketConnection.

When the gateway client is enabled, the default kernel_websocket_connection_class is changed to this GatewayWebSocketConnection class similarly to how the kernel and kernel spec manager default classes are updated.

The two `WebSocketChannelsHandler` and `GatewayResourceHandler` classes are removed and
their corresponding functionality is merged into the respective `KernelWebsocketHandler`
and `KernelSpecResourceHandler` classes.

For the `KernelSpecResourceHandler` class, this change is rather straightforward as
we can simply make the existing handler check if the kernel spec manager has a
`get_kernel_spec_resource` method, and if so delegate to that method instead of
trying to read resources from disk.

The `KernelWebsocketHandler` conversion is more complicated, though. The handling of
websocket connections was generalized/extended in jupyter-server#1047 to allow the definition of
a `kernel_websocket_connection_class` as an alternative to replacing the entire
websocket handler.

This change builds on that by converting the `GatewayWebSocketClient` class to be
an instance of the kernel websocket connection class, and accordingly renames it to
`GatewayWebSocketConnection`.

When the gateway client is enabled, the default `kernel_websocket_connection_class`
is changed to this `GatewayWebSocketConnection` class similarly to how the kernel
and kernel spec manager default classes are updated.
@ojarjur
Copy link
Contributor Author

ojarjur commented Apr 20, 2023

@kevin-bates @Zsailer FYI, this is my attempt at first steps around what we had discussed for the "Kernel Providers" proposal.

The idea here is to first remove the gateway client specific handlers so that the subsequent work to multiplex between various kernel providers has fewer moving pieces to deal with.

This is almost certainly in need of more work before merging as I had multiple open questions I need to ask regarding the GatewayWebSocketConnection class:

  1. Should I move that to a different module or maybe rename the module it is in? The current handlers name doesn't seem to fit anymore, but I'm not sure what a good replacement would be.
  2. What is the appropriate kernel_ws_protocol value? Should this be a flag, some legacy value (as it is now), or maybe the existing flag should be moved out of the ZMQChannelsWebsocketConnection class so that it can be more easily reused?
  3. What is supposed to happen with the close_all classmethod? This was required by serverapp, but I don't know what it should actually do.

Thanks in advance for any advice and/or suggestions you might have here.

@Zsailer
Copy link
Member

Zsailer commented Apr 20, 2023

Thank you for the PR, @ojarjur! This is great stuff.

Should I move that to a different module or maybe rename the module it is in? The current handlers name doesn't seem to fit anymore, but I'm not sure what a good replacement would be.

I think renaming to kernel_connection.py seems appropriate, but I'll defer to @kevin-bates for his thoughts. I agree that handlers.py doesn't make sense anymore.

We may need to leave the old handlers around for a bit with a deprecation warning (set to 3.0), in case someone has subclassed these handlers somewhere in their own setup.

What is the appropriate kernel_ws_protocol value? Should this be a flag, some legacy value (as it is now), or maybe the existing flag should be moved out of the ZMQChannelsWebsocketConnection class so that it can be more easily reused?

Good point. Let me look at this a little bit closer and see what makes sense. I'll post back here when I can a chance too look at this closer

What is supposed to happen with the close_all classmethod? This was required by serverapp, but I don't know what it should actually do.

Ugh, this is annoyingly specific to the ZMQChannelsWebsocketConnection and something I hope we can remove in the near future. It might make sense to change the logic in ServerApp to only call this method if hasattr(self.kernel_websocket_connection_class, 'close_all') == True, because I don't think close_all should be a method on the BaseKernelWebsocketConnection. This method exists because ZMQChannelsWebsocketConnection has some issues where it can leak ZMQ socket clients. So we added the close_all method at the class level to catch any un-closed sockets across all instances to help Jupyter Server shutdown gracefully.

I would propose that we use this opportunity to do the hasattr(self.kernel_websocket_connection_class, 'close_all') check and don't worry about adding a close_all method to your new websocket connection class.

@Zsailer Zsailer self-assigned this Apr 20, 2023
1. Rename the package holding the GatewayWebSocketConnection class from
   `handlers` to `connections`.

2. Make the serverapp call to the `close_all` method on the websocket connection
   class more resilient.

   Specifically, the `self.kernel_websocket_connection_class.close_all()` method invocation
   relies on two different attribute lookups, either of which might fail:

   1. The `self.kernel_websocket_connection_class` lookup can raise an AttributeError
      if the method that sets the default value of that trait raises an AttributeError.
      That, in turn, occurs when the `self.gateway_config` attribute is not yet set
      because the `initialize` method has not yet been invoked.

      During server startup this scenario is gracefully handled, but during cleanup
      it was previously not.

   2. The `self.kernel_websocket_connection_class.close_all` lookup can fail if the
      provided class does not implement the `close_all` class method.

   For both cases, we catch the raised AttributeError and treat it as a no-op.

3. Remove the `close_all` method from the GatewayWebSocketConnection class since it
   was a no-op, and is no longer required by serverapp.
The previous version of the code used the wrong future callback for
messages when the underlying gateway websocket connection had not
yet been created. That resulted in messages received before the
end-to-end websocket creation completing getting dropped.
@kevin-bates
Copy link
Member

kevin-bates commented Apr 21, 2023

Hi @ojarjur - this is looking good - thank you.

I have built this branch and targeted a K8s cluster fronted by Jupyter Kernel Gateway. When creating a kernel for the first time, I see the following stack trace in the jupyter server log:

[I 2023-04-21 09:45:40.142 ServerApp] Connecting to kernel 68779338-feea-4fc0-9120-bb25f4796b28.
[I 2023-04-21 09:45:40.142 ServerApp] Connecting to ws://localhost:53434/api/kernels/68779338-feea-4fc0-9120-bb25f4796b28/channels
[D 2023-04-21 09:45:40.313 ServerApp] 200 GET /api/kernels?1682095540105 (f8a8acdb399c4c8e96b2612eb98b498c@::1) 205.90ms
[D 2023-04-21 09:45:40.357 ServerApp] 200 GET /api/kernelspecs?1682095540117 (f8a8acdb399c4c8e96b2612eb98b498c@::1) 236.19ms
[D 2023-04-21 09:45:40.372 ServerApp] Connection is ready: ws: <tornado.websocket.WebSocketClientConnection object at 0x11144d4c0>
[D 2023-04-21 09:45:40.385 ServerApp] Connection is ready: ws: <tornado.websocket.WebSocketClientConnection object at 0x111442730>
[D 2023-04-21 09:45:40.389 ServerApp] Connection is ready: ws: <tornado.websocket.WebSocketClientConnection object at 0x11133fe80>
[D 2023-04-21 09:45:40.455 ServerApp] 304 GET /static/favicons/favicon-busy-1.ico (f8a8acdb399c4c8e96b2612eb98b498c@::1) 0.79ms
[D 2023-04-21 09:45:40.497 ServerApp] 304 GET /static/favicons/favicon.ico (f8a8acdb399c4c8e96b2612eb98b498c@::1) 0.87ms
[D 2023-04-21 09:45:40.498 ServerApp] 304 GET /static/favicons/favicon-busy-1.ico (f8a8acdb399c4c8e96b2612eb98b498c@::1) 0.83ms
[D 2023-04-21 09:45:40.501 ServerApp] 304 GET /static/favicons/favicon.ico (f8a8acdb399c4c8e96b2612eb98b498c@::1) 0.76ms
[D 2023-04-21 09:45:40.521 ServerApp] 304 GET /static/favicons/favicon-busy-1.ico (f8a8acdb399c4c8e96b2612eb98b498c@::1) 0.96ms
[D 2023-04-21 09:45:40.544 ServerApp] 304 GET /static/favicons/favicon.ico (f8a8acdb399c4c8e96b2612eb98b498c@::1) 0.75ms
[E 2023-04-21 09:45:40.559 ServerApp] Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x10e9ec7c0>>, <Task finished name='Task-441' coro=<GatewayWebSocketConnection._read_messages() done, defined at /opt/miniconda3/envs/server-dev/lib/python3.9/site-packages/jupyter_server/gateway/connections.py:80> exception=WebSocketClosedError()>)
    Traceback (most recent call last):
      File "/opt/miniconda3/envs/server-dev/lib/python3.9/site-packages/tornado/ioloop.py", line 740, in _run_callback
        ret = callback()
      File "/opt/miniconda3/envs/server-dev/lib/python3.9/site-packages/tornado/ioloop.py", line 764, in _discard_future_result
        future.result()
      File "/opt/miniconda3/envs/server-dev/lib/python3.9/site-packages/jupyter_server/gateway/connections.py", line 95, in _read_messages
        self.handle_outgoing_message(
      File "/opt/miniconda3/envs/server-dev/lib/python3.9/site-packages/jupyter_server/gateway/connections.py", line 125, in handle_outgoing_message
        self.websocket_handler.write_message(*args, **kwargs)
      File "/opt/miniconda3/envs/server-dev/lib/python3.9/site-packages/tornado/websocket.py", line 336, in write_message
        raise WebSocketClosedError()
    tornado.websocket.WebSocketClosedError

I've included preceding lines up to the informational messages in hopes that might provide richer context for troubleshooting.

Items of note:

  1. This only happens the first time the kernel is launched following Jupyter Server's startup. Subsequent starts of the same kernel do not show this stack.
  2. When I start a different kernel type (i.e., kernelspec), I see this stack trace (on first startup).
  3. I do not see this behavior using Jupyter Server 2.5.0.
  4. Python version == 3.9.13

Since this first time behavior is fairly consistent (there was one time (out of several) in which I didn't see the stack on the first attempt following server startup), I was wondering if it might have to do with the resource file acquisition, since that's an area of change and might only occur once in the life of a server's duration.

@ojarjur
Copy link
Contributor Author

ojarjur commented Apr 21, 2023

@kevin-bates Thank you so much for the thorough testing and the detailed report.

I'm going to try to see if I can modify the existing test_gateway tests to reproduce this.

I was wondering if it might have to do with the resource file acquisition, since that's an area of change and might only occur once in the life of a server's duration.

I don't think so... from the stacktrace, it looks like the websocket connection to the backend is working and received a message, but the websocket connection to the client is closed.

I'll also compare against the existing code to see why the behavior between the two is different in this scenario.

@ojarjur
Copy link
Contributor Author

ojarjur commented Apr 21, 2023

@kevin-bates I see the issue now.

The new handle_outgoing_message method is playing the same role as the previous WebSocketChannelsHandler.write_message method, but it is missing a check that method included to prevent exactly this issue.

Restoring that check is easy enough, but I also want to try to extend the tests to make sure we don't accidentally drop this again in the future.

@kevin-bates
Copy link
Member

I see the issue now.

Awesome!

Restoring that check is easy enough, but I also want to try to extend the tests to make sure we don't accidentally drop this again in the future.

Excellent idea - thank you. I'll look for the update and spin it again at that time.

@ojarjur
Copy link
Contributor Author

ojarjur commented Apr 22, 2023

@kevin-bates I was eventually able to get a (minimal) test case that reproduces that error showing up in the logs, and fails if it sees that (or any) error in the logs.

I also replicated the corresponding logic from the existing codebase into the new connection class, and it makes that new test now pass.

Thanks for catching that!

Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @ojarjur!

I've confirmed behavior outside of the test framework as well.

@kevin-bates
Copy link
Member

I'd prefer @Zsailer to review as well prior to merging. @ojarjur - would you mind updating the branch in the meantime?

Comment on lines 26 to 31
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.ws = None
self.ws_future: Future = Future()
self.disconnected = False
self.retry = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a LoggingConfigurable, can we make these attributes into traits? In general, it's best to avoid calling __init__ on a subclass of traitlets.HasTraits object.

This would looks something like:

import tornado.websocket as tornado_websocket

...
class GatewayWebSocketConnection(BaseKernelWebsocketConnection):
    ...

    ws = Instance(klass=tornado_websocket.WebSocketHandler, allow_none=True)

    ws_future = Instance(default_value=Future(), klass=Future)

    disconnected = Bool(False)

    retry = Int(0)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the guidance and explanation!

Done.

@Zsailer
Copy link
Member

Zsailer commented Apr 25, 2023

What is the appropriate kernel_ws_protocol value? Should this be a flag, some legacy value (as it is now), or maybe the existing flag should be moved out of the ZMQChannelsWebsocketConnection class so that it can be more easily reused?

I dug into this a bit. We do need to handle the subprotocol here in the GatewayWebSocketConnection class. The appropriate default value would be None, which falls back to the default WS Protocol value (see our documentation page here). But we should handle the case where a client asks for another subprotocol.

As you mentioned, we should move the kernel_ws_protocol flag and logic for serialization/deserialization into the BaseKernelWebsocketConnection class to make it generally available to all subclasses.

We may want to do this work in a separate PR, since the current PR should probably stay focused on absorbing the gateway handlers into the main handlers. That way, our changelog decouples these two changes.

Would you like to make the separate PR moving the subprotocol flag to the base handler? Ideally, that PR would get reviewed and merged first, then rebase here (though this isn't strictly necessary).

@Zsailer
Copy link
Member

Zsailer commented Apr 25, 2023

Otherwise, this PR looks good to me 👍

@ojarjur
Copy link
Contributor Author

ojarjur commented Apr 25, 2023

@Zsailer

I dug into this a bit. We do need to handle the subprotocol here in the GatewayWebSocketConnection class.

Thank you for digging into it!

Would you like to make the separate PR moving the subprotocol flag to the base handler?

I'll give it a shot, but this is well outside my area of experience and my bandwidth for the next week or so is quite limited.

If someone else manages to get that in before me, I would be happy about it.

@ojarjur
Copy link
Contributor Author

ojarjur commented Apr 25, 2023

@Zsailer

I'll give it a shot, but this is well outside my area of experience and my bandwidth for the next week or so is quite limited.

Done: #1264

In retrospect, I don't know why I thought that would be difficult.

@blink1073 blink1073 enabled auto-merge (squash) May 10, 2023 16:05
blink1073 and others added 2 commits May 10, 2023 11:08
…socketConnection

This change removes a temporary work-around to the required `kernel_ws_protocol`
field not being defined in the `BaseKernelWebsocketConnection` base class.

Previously, that field was only defined in the ZMQChannelsWebsocketConnection class,
so we had to reproduce it in the `GatewayWebsocketConnection` class in order to
get that class to work correctly. However, we only provided a minimal default value
of `None` rather than properly supporting the user setting their preferred websocket
protocol.

This workaround was made unnecessary by jupyter-server#1264 which moved the `kernel_ws_protocol`
field from the ZMQChannelsWebsocketConnection class to the base class.

Accordingly, this commit removes that temporary work-around that is no longer needed.
We needed to provide a value for that,
auto-merge was automatically disabled May 10, 2023 16:23

Head branch was pushed to by a user without write access

@ojarjur
Copy link
Contributor Author

ojarjur commented May 10, 2023

Thanks for the reviews, All!

I finished up the clean-up on this PR needed to (semantically) merge in the changes from #1264, and pushed it, so the kernel_ws_protocol issue should be resolved now.

I don't know why some of the tests are failing now; GitHub isn't showing me any logs for them, which is a different behavior than I have seen before.

@ojarjur
Copy link
Contributor Author

ojarjur commented May 10, 2023

I don't know why some of the tests are failing now; GitHub isn't showing me any logs for them, which is a different behavior than I have seen before.

It looks like there is an ongoing GitHub incident that affects actions: https://www.githubstatus.com/incidents/pr3498h3qkfy

@blink1073 blink1073 merged commit 35ffe5d into jupyter-server:main May 10, 2023
29 of 34 checks passed
ojarjur added a commit to ojarjur/jupyter_server that referenced this pull request May 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants