
buffer messages when websocket connection is interrupted #2871

Merged
merged 5 commits into from Oct 6, 2017

Conversation

6 participants
@rgbkrk
Member

rgbkrk commented Sep 27, 2017

This adds an (unbounded) queue to the ZMQChannelsHandler to replay messages when a user reconnects.

This assists those who lose a connection but keep their tab open. It does not help with the general problem of wanting a long-running notebook to be saved in the background.
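In outline, the idea is a queue of messages that is drained back to the client when it reconnects. A minimal sketch of the concept (class and method names here are hypothetical, not the PR's actual code):

```python
from collections import defaultdict

class MessageBuffer:
    """Sketch: queue kernel messages while no websocket is attached."""

    def __init__(self):
        # kernel_id -> list of (channel, msg) pairs, in arrival order
        self._buffers = defaultdict(list)

    def append(self, kernel_id, channel, msg):
        """Record a message that had no live websocket to deliver to."""
        self._buffers[kernel_id].append((channel, msg))

    def replay(self, kernel_id):
        """Return and clear buffered messages for a reconnecting client."""
        return self._buffers.pop(kernel_id, [])

buf = MessageBuffer()
buf.append("k1", "iopub", {"msg_type": "stream", "text": "hi"})
replayed = buf.replay("k1")  # delivered once, then the buffer is empty
```

The buffer here is unbounded, matching the initial implementation; the limits discussed later in the thread would bound it.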

@rgbkrk rgbkrk requested a review from minrk Sep 27, 2017

@rgbkrk

Member Author

rgbkrk commented Sep 27, 2017

D'oh, of course. This buffer is per connection, so it goes away when the connection is severed.

@rgbkrk

Member Author

rgbkrk commented Sep 28, 2017

I'm really unsure of how I could work around this to buffer messages to replay to clients when they reconnect.

@minrk

Member

minrk commented Sep 28, 2017

Yeah, it would need to go somewhere else. Right now, we create a zmq socket per websocket connection. That zmq socket is destroyed when the websocket connection is lost.

What we would need to do is:

  1. leave the zmq connections open after the websocket goes away (potentially with a timeout, but kernel shutdown is also a reasonable time to close)
  2. when clients close, start buffering
  3. persist transient state of cell:msg_id mapping to be restored on reconnect, so that the right handlers are hooked up to deal with the resuming messages

I think this is going to be tricky without moving the document state to the server, but may still be worth exploring.
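For illustration, the transient cell:msg_id mapping in step 3 amounts to something like this on the client side (names are hypothetical; the real frontend state is more involved):

```python
# Sketch of the transient state that would need to survive a reconnect:
# which cell originated each outstanding request.
pending = {}  # msg_id -> cell identifier

def on_execute(cell_id, msg_id):
    """Remember which cell sent an execute_request."""
    pending[msg_id] = cell_id

def route_reply(msg):
    """Route a resumed message back to the cell that sent its parent request."""
    parent_id = msg["parent_header"]["msg_id"]
    return pending.get(parent_id)

on_execute("cell-3", "abc123")
cell = route_reply({"parent_header": {"msg_id": "abc123"}})
```

If this mapping is lost with the page, replayed messages have no cell to land in, which is why the per-session restriction below matters.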

@rgbkrk

Member Author

rgbkrk commented Sep 28, 2017

> I think this is going to be tricky without moving the document state to the server, but may still be worth exploring.

Yeah, I still want that in the long term; it's hard to keep promising it in the short term though.

> Yeah, it would need to go somewhere else. Right now, we create a zmq socket per websocket connection. That zmq socket is destroyed when the websocket connection is lost.

Yikes, alright. Do we just keep one extra zmq socket around in that case, then? I see now how I was hopelessly spinning my wheels -- the zmq connection is created here when the websocket connection is created.

@minrk

Member

minrk commented Sep 28, 2017

> Do we just keep one extra zmq socket around then in that case.

For the lost-connection case (not new tab, new browser, etc.), replay should be fine, and all we need to track is the session_id, which is what identifies a browser session. This should work:

  1. on websocket close, leave zmq stream open and start buffering instead of closing, as we do now.
  2. on websocket open, check session_id and reconnect to zmq_stream and replay if one already exists for that session_id. If session_id doesn't match, clear any buffers that may be open (i.e. don't allow multiple viewers to have this resume behavior on the same kernel).
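That two-step policy can be sketched roughly as follows (a simplification with hypothetical names, not the notebook server's actual kernel-manager API):

```python
class KernelBufferManager:
    """Sketch of the reconnect policy: replay only if the session matches."""

    def __init__(self):
        # kernel_id -> (session_key, [(channel, msg_list), ...])
        self._buffers = {}

    def start_buffering(self, kernel_id, session_key):
        """Called on websocket close: keep the zmq streams, begin buffering."""
        self._buffers[kernel_id] = (session_key, [])

    def buffer_msg(self, kernel_id, channel, msg_list):
        if kernel_id in self._buffers:
            self._buffers[kernel_id][1].append((channel, msg_list))

    def stop_buffering(self, kernel_id, session_key):
        """Called on websocket open: replay on a session match, else discard."""
        stored = self._buffers.pop(kernel_id, None)
        if stored is None:
            return []
        buffered_key, messages = stored
        if buffered_key == session_key:
            return messages
        return []  # new tab or new browser: discard, don't replay

mgr = KernelBufferManager()
mgr.start_buffering("k1", "sess-a")
mgr.buffer_msg("k1", "iopub", ["output"])
replay = mgr.stop_buffering("k1", "sess-a")  # same session, so it replays
```

A connection with a different session_key would get an empty replay and the stale buffer would be dropped, which is exactly the "don't allow multiple viewers to resume" rule above.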
@minrk

Member

minrk commented Sep 28, 2017

Saving the hairy cell:msg_id problem for another task makes sense to me. We can focus on the lost connection case, which will improve several real use cases.

@rgbkrk

Member Author

rgbkrk commented Sep 28, 2017

Based on conversations with Min, I'm going to take a new stab at this within the ZMQChannelsHandler:

  • Attach the buffer to the kernel object
  • Keep only one buffer per kernel
  • Only aim to make reconnecting with the same tab work well (we aren't worried about multiple sessions, since we don't work well with multiple tabs now anyway...)
@blink1073

Member

blink1073 commented Sep 28, 2017

@rgbkrk rgbkrk referenced this pull request Sep 28, 2017

Merged

Add vdom-extension #37

1 of 1 task complete

@rgbkrk rgbkrk force-pushed the rgbkrk:buffer-messages branch from 9e4d902 to 569bb25 Sep 28, 2017

@rgbkrk rgbkrk changed the title [WIP] buffer messages in the ZMQStreamHandler [WIP] buffer messages in the ZMQChannelsHandler Oct 3, 2017

@alexgarel


alexgarel commented Oct 3, 2017

While you implement this particular mechanism, you may also want to think about multi-user concurrent notebooks in this form:

  • multiple users are working on the same notebook
  • as soon as a user submits a cell, its value is updated in the other users' notebooks (in case of conflict, the interface proposes something)
  • as soon as the result is ready, it is displayed in all notebooks
@rgbkrk

Member Author

rgbkrk commented Oct 3, 2017

We're going to tackle multi-user concurrent notebooks with server-side state of the notebook, which doesn't exist today. This is only a band-aid. Check out https://github.com/jupyterlab/jupyterlab-google-drive for the current alpha for realtime.

@minrk minrk force-pushed the rgbkrk:buffer-messages branch from 047f22d to a182e0a Oct 3, 2017

@minrk

Member

minrk commented Oct 3, 2017

I've pushed an implementation that works. It's still an unbounded list.

  • buffer is per-kernel
  • session_key is stored because only the same session that is buffered will resume properly
  • on any new connection to a kernel, buffer is flushed.
    If session_key matches, buffer is replayed.
    Otherwise, it is discarded (this covers refreshed pages and new browser windows).
  • buffer is an unbounded list for now
  • buffer is never discarded until the kernel shuts down

Things we could do:

  1. limit the lifetime of the buffer (e.g. expire and close after 10 minutes?)
  2. limit the size of the buffer (msg count and/or total bytes)
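Both proposed limits could be layered on with a bounded deque plus an expiry check, for example (a sketch, not part of this PR; the numbers are placeholders):

```python
import time
from collections import deque

class BoundedBuffer:
    """Sketch of both proposed limits: max message count and max age."""

    def __init__(self, maxlen=1000, ttl_seconds=600):
        self._q = deque(maxlen=maxlen)  # oldest messages drop automatically
        self._ttl = ttl_seconds
        self._started = time.monotonic()

    def append(self, msg):
        self._q.append(msg)

    def drain(self):
        """Return buffered messages, or nothing if the buffer has expired."""
        if time.monotonic() - self._started > self._ttl:
            return []
        return list(self._q)

buf = BoundedBuffer(maxlen=3, ttl_seconds=600)
for i in range(5):
    buf.append(i)
# only the 3 most recent messages survive the size limit
```

A byte-count limit would need to track message sizes explicitly, since deque only bounds the count.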
implement buffering of messages on last dropped connection
- buffer is per-kernel
- session_key is stored because only a single session can resume the buffer and we can't be sure
- on any new connection to a kernel, buffer is flushed.
  If session_key matches, it is replayed.
  Otherwise, it is discarded.
- buffer is an unbounded list for now

@minrk minrk force-pushed the rgbkrk:buffer-messages branch from a182e0a to 8f8363a Oct 3, 2017

restore actual zmq channels when resuming connection
rather than establishing new connections

fixes failure to resume shell channel

@minrk minrk force-pushed the rgbkrk:buffer-messages branch from 95f5b3f to 38224fb Oct 3, 2017

def open(self, kernel_id):
    super(ZMQChannelsHandler, self).open()
    self.kernel_manager.notify_connect(kernel_id)

    # on new connections, flush the message buffer
    replay_buffer = self.kernel_manager.stop_buffering(kernel_id, self.session_key)


@rgbkrk

rgbkrk Oct 3, 2017

Author Member

Ok cool you made the kernel manager dictate stopping the buffering

self.log.info("Replaying %s buffered messages", len(replay_buffer))
for channel, msg_list in replay_buffer:
    stream = self.channels[channel]
    self._on_zmq_reply(stream, msg_list)


@rgbkrk

rgbkrk Oct 3, 2017

Author Member

What should we do if we fail during the replay?

@rgbkrk

Member Author

rgbkrk commented Oct 3, 2017

PASS ws_closed_error: kernel_idle.Kernel was triggered
Timeout for http://localhost:8888/a@b/
Is the notebook server running?
FAIL "function () {
        return this.evaluate(function (events) {
            return IPython._events_triggered.length >= events.length;
        }, [events]);
    }" did not evaluate to something truthy in 10000ms
#    type: uncaughtError
#    file: /home/travis/build/jupyter/notebook/notebook/tests/services/kernel.js
#    error: "function () {
        return this.evaluate(function (events) {
            return IPython._events_triggered.length >= events.length;
        }, [events]);
    }" did not evaluate to something truthy in 10000ms
#    stack: not provided
Captured console.log:

Should we skip this test for now?

@minrk

Member

minrk commented Oct 4, 2017

> Should we skip this test for now?

No, I think it's a real bug.

minrk added some commits Oct 4, 2017

hookup restart callbacks in open
instead of in `create_stream`, which is not called on reconnect
improve handling of restored connections in js
- dismiss 'connection lost' dialog on reconnect
- set busy status on reconnect (if not busy, idle will come soon after via kernel_ready)

@minrk minrk force-pushed the rgbkrk:buffer-messages branch from 65c9817 to db4ca5e Oct 4, 2017

@minrk

Member

minrk commented Oct 4, 2017

Bug fixed. I also improved the dialog/status handling when connection is lost.

@minrk

Member

minrk commented Oct 4, 2017

Here's what it looks like now with a connection drop and reconnect in the middle of execution:

[animated gif "replay": a connection drop and reconnect mid-execution, with the buffered output replayed]

The dialog is dismissed automatically when the connection is restored.

@rgbkrk

Member Author

rgbkrk commented Oct 4, 2017

This is working pretty well. Even with a flaky VPN plus wifi drop-off, I'm only losing one message within a one-second gap.

@minrk

Member

minrk commented Oct 4, 2017

Including some discussion from other communication channels: this is never going to be rigorous without adding things like ack/generation counters, which we probably shouldn't get into before shifting document state entirely server-side for real-time. Messages can still be lost if the notebook server and the frontend disagree about when the disconnect happened (e.g. when there are more intermediate services than a localhost proxy). But it will still solve a real annoyance for a lot of people.

@rgbkrk what more do you think we should do here? Should we implement limits and/or expiry? We could also start with a bool flag to just turn it off in case it causes trouble.

@rgbkrk

Member Author

rgbkrk commented Oct 4, 2017

The only other things that I think we could add are:

  • Making this a memory-bounded queue, though I don't think we have to do it in this PR
  • The boolean flag you suggest is probably a really good idea

I definitely agree this will solve a real annoyance for a lot of people, so it doesn't have to be perfect this go-round. We've certainly got longer-term plans for making this better with server-side state, so we can only take this band-aid so far.

If we wanted to get a bit more crazy, imagine for a moment that we stored the last N messages before the disconnect happened and added those to the buffer once we realized it was disconnected. We could end up duplicating outputs, but at the same time we'd still recover some of the missed ones.

@blink1073

Member

blink1073 commented Oct 4, 2017

What if this were implemented as a response to a message from the frontend that said "hey, I was disconnected but I'm back, here's the last msg_id I got, please send the rest"?

@rgbkrk

Member Author

rgbkrk commented Oct 4, 2017

> here's the last msg_id I got, please send the rest

Oooooh, I like that. It will be a bit strange, though, as we'll be putting a new API on top of the current API over the websockets.

@minrk

Member

minrk commented Oct 5, 2017

@blink1073 that's definitely possible and would be more robust. To do that, we would need to maintain a cache of messages on all channels all the time, in case of dropped connections in the future, rather than only when connections are lost. Doing that points to a different mechanism, because it makes it inevitable that the cache gets huge, rather than possible. To do that, we need to spill to disk with an efficient culling mechanism. To me, that sounds like a job for sqlite.

If I were doing that, I would make last_msg_id part of the websocket connect request to trigger the replay.
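A sqlite-backed cache along those lines might look like this (purely illustrative; the schema, names, and culling policy are made up, and a real server would use an on-disk file rather than `:memory:`):

```python
import sqlite3

class MessageCache:
    """Sketch: spill messages to sqlite; replay everything after a msg_id."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS msgs ("
            " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
            " msg_id TEXT, channel TEXT, body TEXT)"
        )

    def record(self, msg_id, channel, body):
        """Cache every outgoing message, connected or not."""
        self._db.execute(
            "INSERT INTO msgs (msg_id, channel, body) VALUES (?, ?, ?)",
            (msg_id, channel, body),
        )

    def replay_after(self, last_msg_id):
        """Messages newer than the last one the client acknowledged."""
        row = self._db.execute(
            "SELECT seq FROM msgs WHERE msg_id = ?", (last_msg_id,)
        ).fetchone()
        since = row[0] if row else 0
        return self._db.execute(
            "SELECT msg_id, channel, body FROM msgs WHERE seq > ? ORDER BY seq",
            (since,),
        ).fetchall()

    def cull(self, keep=1000):
        """Efficiently drop all but the most recent `keep` rows."""
        self._db.execute(
            "DELETE FROM msgs WHERE seq <= ("
            " SELECT COALESCE(MAX(seq), 0) - ? FROM msgs)",
            (keep,),
        )

cache = MessageCache()
cache.record("m1", "iopub", "a")
cache.record("m2", "iopub", "b")
cache.record("m3", "iopub", "c")
missed = cache.replay_after("m1")  # everything after the acknowledged msg_id
```

Here `last_msg_id` plays the role Min describes: a parameter on the websocket connect request that triggers the replay.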

@alexgarel


alexgarel commented Oct 5, 2017

If you use sqlite for messages, I would suggest the default sqlite database be an in-memory database.
Of course, a persistent database would allow getting past results even after a Jupyter server restart! That sounds cool. But you'll have to find a policy for removing dead kernels' messages.

@rgbkrk rgbkrk changed the title [wip] buffer messages when websocket connection is interrupted buffer messages when websocket connection is interrupted Oct 6, 2017

@rgbkrk

Member Author

rgbkrk commented Oct 6, 2017

I think this is probably good to go ahead with for the time being, as further enhancements are going to require a lot more coordination between the frontends and the server.

@gnestor gnestor merged commit 43a9780 into jupyter:master Oct 6, 2017

4 checks passed

  • codecov/patch: 69.33% of diff hit (target 0%)
  • codecov/project: 79.08% (-0.17%) compared to de09c12
  • continuous-integration/appveyor/pr: AppVeyor build succeeded
  • continuous-integration/travis-ci/pr: The Travis CI build passed
@gnestor

Contributor

gnestor commented Oct 6, 2017

Thanks @rgbkrk and @minrk!!

@rgbkrk rgbkrk deleted the rgbkrk:buffer-messages branch Oct 6, 2017

@rgbkrk

Member Author

rgbkrk commented Oct 6, 2017

Thanks @gnestor and @minrk!

@minrk

Member

minrk commented Oct 7, 2017

@alexgarel I think it can't be in-memory by default because the main point of using sqlite would be to limit how much memory the message cache will take by spilling to disk.

@rgbkrk rgbkrk referenced this pull request Oct 13, 2017

Closed

Release 5.2 #2929

11 of 11 tasks complete

@gnestor gnestor added this to the 5.2 milestone Oct 13, 2017

@kevin-bates

Member

kevin-bates commented Oct 9, 2018

@rgbkrk & @minrk - First, sorry for posting this to a closed PR, but it has all the necessary context. Second, I apologize for my long-windedness. TL;DR: in looking at the buffering code, it seems like buffer replay never comes into play (at this time).

I needed to look into buffering support relative to a recent NB2KG issue. Since NB2KG proxies the notebook's kernel management to a gateway server (Kernel Gateway or Enterprise Gateway), I was concerned that buffering replay wouldn't work. However, after looking into this more deeply, I find these behaviors between traditional notebook and nb2kg-enabled notebooks to be the same. I also find that when the network connection between the browser and notebook is dropped, the "replay" as indicated by the video above is not taking place from the buffering code, but, rather from ZMQ directly (at least that's my assumption). I say this because I don't see any debug entries about buffering messages and replaying or discarding them. Instead, those messages are produced only after one closes the notebook browser tab, then opens that same active notebook again (which Kyle mentions above as well in describing when the PR changes take place). As a result, I don't understand when the buffering replay will actually occur since it seems like re-opening an active notebook triggers the creation of a new session - which indirectly is part of the key (along with kernel_id) into the replayability of buffered messages.

Do I have the correct understanding with respect to the state of buffering and its replay? If not, under what circumstances does the implemented buffering get replayed?

Could you explain why session is part of the key for determining whether the buffered messages are to be replayed or not? I'm guessing it represents the "channel instances" since different connections of the same kernel instance will use different channels - is that right? If so, would it be possible to equate the channel "sets" from the previous (buffered) session to the new session and drop the use of session_key?

Thank you for your time and increasing my understanding.

@rgbkrk

Member Author

rgbkrk commented Oct 10, 2018

They are attached to a session because otherwise the frontend has no way of associating output messages (by msg_id) to the cells they originated from.

@kevin-bates

Member

kevin-bates commented Oct 10, 2018

Thanks Kyle. So I think buffered message usage from NB2KG is doubly screwed (once replay is figured out), because the session_key generated in the Gateway server will be different from the one generated in the (client) Notebook server where the messages are derived. Is this session key the same value as the session entry in each message (or can it be derived from them)? If so, it would be nice if the message buffer initialized the session key in that manner.

Any hints on how to trigger buffer replays (and not discards)? If not, I'll post in gitter.

Thanks.

@rgbkrk

Member Author

rgbkrk commented Oct 10, 2018

I'm never in gitter, hopefully you'll find someone to help.

I'd have to sit down again with the raw session messages to know for sure. Happy to meet about this at some point in the coming weeks (I'm out of town at the moment).

@alexgarel


alexgarel commented Oct 11, 2018

> They are attached to a session because otherwise the frontend has no way of associating output messages (by msg_id) to the cells they originated from.

Could it be associated with a notebook session id (instead of the connection) that would be generated on the client side (as a document DOM attribute) and passed at websocket opening?

@kevin-bates

Member

kevin-bates commented Oct 11, 2018

If the session id you're referring to is the one in the header (vs parent_header), that does appear to be consistent across the disconnection scenario. However, even if I remove the checks for session_key (so that replay is purely a function of matching kernel_id), the replayed buffer contents (just as with the current message results) have nowhere to go, since the underlying websocket is different (at least that's my suspicion as to why nothing appears). That said, the js code is totally foreign to me, and I think that's where the answers lie.

It seems like there needs to be some tolerance where the received messages can 'switch' to a different "channel set" in cases where the previous 'set' is no longer around.

@gnestor

Contributor

gnestor commented Nov 7, 2018

Does this sound related to jupyter/jupyter#83?

@kevin-bates

Member

kevin-bates commented Nov 8, 2018

> Does this sound related to jupyter/jupyter#83?

Yes, I believe the issue you reference is the portion of this PR that is not addressed. That is, when a notebook is closed and re-opened, buffered messages received in the interim are discarded on the re-open due to it being a different WS connection. I took a run at this via #4105/#4110 but later learned the current implementation is working as designed. (I hadn't been able to get any messages to replay correctly, but Min's evidence on the corresponding PR showed it's possible, so I dropped "my case".)

@rgbkrk

Member Author

rgbkrk commented Nov 8, 2018

The only way you can get messages to replay is if you lose connection, primarily on a remote server.
