Websocket ping timeout #1474

Open
Jeffalltogether opened this Issue May 21, 2016 · 20 comments

Jeffalltogether commented May 21, 2016

I followed these instructions to set up a Jupyter Notebook server on an Amazon EC2 instance. Everything works, except when I run a block of code that takes longer than 2 or 3 minutes to execute. While the kernel is busy running this code (I can see it executing thanks to a simple progress bar), it stops all of a sudden and displays a websocket ping timeout error. These are the messages I receive:

[I 17:22:19.083 NotebookApp] Serving notebooks from local directory: /home/ubuntu/Notebooks
[I 17:22:19.084 NotebookApp] 0 active kernels
[I 17:22:19.084 NotebookApp] The IPython Notebook is running at: https://[all ip addresses on your system]:8888/
[I 17:22:19.084 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[I 17:22:32.645 NotebookApp] 302 GET / (174.47.174.222) 0.72ms
[I 17:22:32.735 NotebookApp] 302 GET /tree (174.47.174.222) 0.93ms
[I 17:22:37.245 NotebookApp] 302 POST /login?next=%2Ftree (174.47.174.222) 0.92ms
[I 17:22:42.437 NotebookApp] Kernel started: 7b436e11-118d-4c28-9777-ec63baec0b5f
[W 17:24:13.097 NotebookApp] WebSocket ping timeout after 90000 ms.
[E 01:12:29.968 NotebookApp] Uncaught exception GET /api/kernels/7ca196a9-e64b-40dd-bd12-d8bc1a323686/channels?session_id=E2631BE0F605403986ED7D8387A07E99 (174.47.174.222)
    HTTPServerRequest(protocol='https', host='ec2-52-9-221-109.us-west-1.compute.amazonaws.com:8888', method='GET', uri='/api/kernels/7ca196a9-e64b-40dd-bd12-d8bc1a323686/channels?session_id=E2631BE0F605403986ED7D8387A07E99', version='HTTP/1.1', remote_ip='174.47.174.222', headers={'Origin': 'https://ec2-52-9-221-109.us-west-1.compute.amazonaws.com:8888', 'Upgrade': 'Websocket', 'Sec-Websocket-Version': '13', 'Connection': 'Upgrade', 'Sec-Websocket-Key': 'KMt575Kx/659v0lUZdkytA==', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; LCTE; rv:11.0) like Gecko', 'Host': 'ec2-52-9-221-109.us-west-1.compute.amazonaws.com:8888', 'Cookie': 'username-ec2-52-9-221-109-us-west-1-compute-amazonaws-com-8888="2|1:0|10:1463773579|62:username-ec2-52-9-221-109-us-west-1-compute-amazonaws-com-8888|48:ZThmOTlkNWItZGQ1Yy00YjlmLWExNGEtMmEyYzJkODNiMjU2|f323ab003cf23bfc7dd105e73f48ee970b4b26807e5ddaaafe221f9123d1ea65"', 'Cache-Control': 'no-cache'})
    Traceback (most recent call last):
      File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/tornado/web.py", line 1401, in _stack_context_handle_exception
        raise_exc_info((type, value, traceback))
      File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/tornado/stack_context.py", line 314, in wrapped
        ret = fn(*args, **kwargs)
      File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 184, in <lambda>
        self.on_recv(lambda msg: callback(self, msg), copy=copy)
      File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/notebook/base/zmqhandlers.py", line 188, in _on_zmq_reply
        self.write_message(msg, binary=isinstance(msg, bytes))
      File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/tornado/websocket.py", line 215, in write_message
        raise WebSocketClosedError()
    WebSocketClosedError
[I 17:24:42.542 NotebookApp] Saving file at /Untitled.ipynb

When accessing the server in Chrome or Internet Explorer I get the same messages.

Additionally (and very strangely), while code is executing on the server, my local laptop's CPU utilization goes up by about 50%.

Any thoughts on this websocket timeout?

takluyver May 21, 2016

Member

Websocket pinging is used because certain proxies close a websocket if there are no messages over it for 60 seconds: we send a ping message every 30 seconds, and the browser sends a pong back. This is part of the websocket protocol, so the browser should do it automatically. If we don't get the pong back after 90 seconds, we assume that the connection is lost and kill it.

I can't think why code executing would affect that, but it's suspicious that it causes high CPU usage on the client. What progress bar library is the code using? Can you disable the progress bar and see if the behaviour still occurs?
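
The timing described above can be modelled in a few lines. This is a toy simulation, not the notebook server's actual code: the function name `connection_closed_at` and its parameters are made up for illustration, and only the 30-second interval and 90-second timeout come from the comment above.

```python
PING_INTERVAL_MS = 30000  # from the comment above: one ping every 30 seconds
PING_TIMEOUT_MS = 90000   # from the comment above: give up after 90 seconds


def connection_closed_at(pong_times_ms, run_until_ms,
                         interval_ms=PING_INTERVAL_MS,
                         timeout_ms=PING_TIMEOUT_MS):
    """Return the tick (in ms) at which a server would drop the socket.

    Hypothetical helper: pong_times_ms lists when pongs arrived, measured
    from the moment the socket opened. Returns None if it stays open.
    """
    pending = sorted(pong_times_ms)
    last_pong = 0  # treat the opening handshake as the first sign of life
    for now in range(interval_ms, run_until_ms + 1, interval_ms):
        # Account for every pong that has arrived by this ping tick.
        while pending and pending[0] <= now:
            last_pong = pending.pop(0)
        if now - last_pong >= timeout_ms:
            return now  # no pong for a full timeout window: close
    return None


healthy = [t + 5 for t in range(30000, 300001, 30000)]  # pong 5 ms after each ping
print(connection_closed_at(healthy, 300000))           # None: stays open
print(connection_closed_at([], 300000))                # 90000: browser never answers
print(connection_closed_at([30005, 60005], 300000))    # 180000: browser stops answering
```

In this model the server only notices a silent browser on a ping tick, so the reported gap can be somewhat longer than the nominal 90 seconds.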

Jeffalltogether May 21, 2016

The progress bar is just part of the library's function. It's a neural network training routine in Keras (from keras.models import Sequential). As the network trains, it shows progress through the training data with a bar in the Jupyter Notebook cell that looks like this:

Epoch 1/150
768/768 [========================> ] - 2s - loss: 0.6826 - acc: 0.6328

I don't think it has anything to do with this issue in particular as I have seen a number of other people with a similar issue.

It seems that when the notebook is in the (busy) state, the ping/pong messaging does not continue, so code essentially only runs for as long as the websocket does not time out. That is fine for short blocks of code.

I see that in the Jupyter Notebook code on GitHub, in the file zmqhandlers.py, there is a timeout if a message is not received after sending a ping. I believe this is what you mentioned in your reply. I am not familiar with this type of code, but is it possible to override this timeout when the notebook is in the "(busy)" state?

takluyver May 21, 2016

Member

> is it possible to override this timeout when the notebook is in the "(busy)" state?

Not easily, and I'm pretty sure it's the wrong fix, in any case. The fact that the kernel is executing something shouldn't stop the browser from responding to websocket pings.

> I have seen a number of other people with a similar issue.

That does look like the same issue that you're seeing, and @minrk is the person most likely to be able to work it out.

Jeffalltogether May 21, 2016

Thanks, I really appreciate your time looking into this! Let's see if @minrk has any comments.

minrk May 22, 2016

Member

@Jeffalltogether what version of tornado do you have (might start with pip list or conda list to be safe)? It's odd that this timeout is raising an error in .get(), which should have returned before the timer started. That suggests that something isn't waiting the way we expect it to - perhaps because your tornado is too old, or perhaps it's a new version that changed something out from under us.
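
To report the versions in question from inside the notebook itself, something like the following could be run in a cell. It is a small sketch: `package_version` is a hypothetical helper name, and it assumes each package exposes either a `__version__` or a `version` attribute (tornado uses `tornado.version`).

```python
import importlib


def package_version(name):
    """Return the version string of an importable package, or None."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None  # not installed in the environment this kernel runs in
    # Most packages use __version__; tornado exposes a plain `version` string.
    version = getattr(module, "__version__", None)
    if version is None:
        version = getattr(module, "version", None)
    return version


for name in ("tornado", "zmq", "notebook"):
    print(name, package_version(name))
```

Running this in the kernel rather than in a shell checks the environment the kernel actually uses, which may differ from the one `pip list` or `conda list` reports.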

Jeffalltogether May 22, 2016

Thanks for the prompt responses @minrk and @takluyver !!!

As it turns out, @takluyver was on the right track all along in asking about disabling the progress bar: there is an error in the function I was using to train the model (keras-team/keras#2110).

I was not receiving that error message in the notebook configuration described in the original post. In an attempt to get things working, I changed the configuration to run behind an Nginx web server on the same EC2 instance, and then got the I/O error discussed in the link above.

Once I disabled the progress bar, both notebook configurations work.

My tornado version is from an anaconda 2.7 python build showing: tornado 4.3 py27_0 defaults

Regarding the local CPU issue: when the progress bar is disabled in that model.fit call, the local CPU is not affected. However, in the Nginx + Jupyter configuration, the local CPU is unaffected whether or not the progress bar is disabled.

If anyone is interested in how I set up the Nginx + Jupyter configuration on the EC2 instance, let me know.

Thanks again for addressing my issue!

takluyver May 22, 2016

Member

I think that I/O operation on closed file error is one we've seen before, though I forget if it got resolved. I'm still puzzled as to how an error in the kernel could cause the symptoms you describe, and why putting it behind nginx helped.

Jeffalltogether May 23, 2016

If it helps, I have provided the code I was running in the .ipynb. It's a relatively small standard data set available for anyone to use. Source of the code is: Deep Learning with Python

from keras.models import Sequential
from keras.layers import Dense
import numpy
import pandas

seed = 7
numpy.random.seed(seed)

#Load Pima Indians dataset with Pandas from url
url = "https://goo.gl/vhm1eU"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names = names)
dataset = dataframe.values
print(dataset)

X = dataset[:,0:8]
Y = dataset[:,8]
model = Sequential()
model.add(Dense(12, input_dim=8, init='uniform', activation='relu'))
model.add(Dense(8, init='uniform', activation='relu'))
model.add(Dense(1, init='uniform', activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model.
# Adding more epochs increases the number of times the training data is fed into the model
# and increases the time it takes to train.
model.fit(X,Y, nb_epoch=200, batch_size = 10, verbose = 0)

scores = model.evaluate(X, Y)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

Sandy4321 Jun 2, 2016

I have the same problem on a Mac, but even without running code the session is closed after a very short time.
I use very simple code, and I edit and run it continuously.
I am using a c3.8xlarge AWS instance.
Output from my screen:
[I 14:31:04.007 NotebookApp] Saving file at /sandercode/S_may31_try1.ipynb
[I 14:44:26.956 NotebookApp] Saving file at /sandercode/S_may31_try1.ipynb
sandercode/.Timeout, server 54.82.67.170 not responding.
[W 15:05:27.008 NotebookApp] WebSocket ping timeout after 119965 ms.

Maybe it is a Mac problem, per this comment from jupyter/help#23:

"Well I just tried connection to my notebook server on AWS from a desktop browser and it works perfectly. Looks like it's some issue with the iOS browsers not allowing a connection with a self-signed certificate (https). I also tested that it works fine with just http."

@gnestor gnestor added this to the no action milestone Sep 14, 2016

zweicoder Nov 4, 2016

Similarly, when running a remote notebook accessed via an SSH tunnel, closing the tunnel causes a timeout and stops all computation. (Related Stack Overflow post)

I was hoping to run computationally intensive training operations remotely and reconnect after a while but it seems that this still doesn't work?

ashishsingal1 Dec 28, 2016

I was getting this error repeatedly when running a long script that looped about 30k times, printing a completion message on each iteration. When I commented out the print, I no longer got the timeout error, so that is a potential temporary workaround.
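
If the per-iteration message is still wanted, a middle ground is to print only occasionally. The sketch below is one way to do that; `ThrottledPrinter` is a hypothetical helper, not part of Jupyter, and the assumption that fewer output messages avoids the timeout rests only on the reports in this thread.

```python
import time


class ThrottledPrinter(object):
    """Print at most one message per `min_interval` seconds."""

    def __init__(self, min_interval=1.0, clock=time.time, write=print):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the helper can be tested
        self._write = write      # where accepted messages are sent
        self._last = None        # time of the last message that was written
        self.suppressed = 0      # how many messages were dropped

    def __call__(self, message):
        now = self._clock()
        if self._last is None or now - self._last >= self.min_interval:
            self._write(message)
            self._last = now
            return True
        self.suppressed += 1
        return False


log = ThrottledPrinter(min_interval=5.0)
for i in range(30000):
    # ... the real work for iteration i would go here ...
    log("completed %d iterations" % (i + 1))
```

The first message is always written; after that, anything arriving less than `min_interval` seconds after the last written message is counted in `suppressed` and dropped.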

nateGeorge Jan 6, 2017

I'm having the same problem with running a Keras model over ssh, and it's super annoying. Could we change the timeout for cells as in here: http://nbconvert.readthedocs.io/en/stable/execute_api.html ?

You can also send the output to a file like so:

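import sys
original_stdout = sys.stdout  # the notebook's stream; sys.__stdout__ would bypass the cell output
with open('keras_output.txt', 'w') as log_file:
    sys.stdout = log_file  # progress bar output now goes to the file
    try:
        history = model.fit(X, y_cat, batch_size=128, nb_epoch=200, verbose=1)
    finally:
        sys.stdout = original_stdout  # restore even if training fails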

That worked for me.
http://stackoverflow.com/questions/4675728/redirect-stdout-to-a-file-in-python

Or you could set the verbose option to 0.
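
On Python 3 the same redirection can be written with `contextlib.redirect_stdout`, which restores the previous stream automatically. In this sketch `noisy_training_step` is a stand-in for `model.fit(..., verbose=1)` and the temporary-directory path is arbitrary; neither comes from the comment above.

```python
import contextlib
import os
import tempfile


def noisy_training_step(epochs):
    """Stand-in for a call such as model.fit(..., verbose=1)."""
    for epoch in range(1, epochs + 1):
        print("Epoch %d/%d" % (epoch, epochs))
    return "done"


log_path = os.path.join(tempfile.gettempdir(), "keras_output.txt")
with open(log_path, "w") as log_file:
    # Everything printed inside this block goes to the file, not the cell.
    with contextlib.redirect_stdout(log_file):
        result = noisy_training_step(3)

print(result)  # back on the normal stdout
with open(log_path) as log_file:
    print(log_file.read().splitlines()[-1])  # last line captured in the file
```

The progress can then be followed from a terminal on the server (for example with `tail -f` on the log file) without sending it through the notebook's websocket.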

@domluna domluna referenced this issue in udacity/sdc-issue-reports Jan 6, 2017

Closed

Keras code crashes ipython notebook #325

@007 007 referenced this issue in keras-team/keras Jan 24, 2017

Closed

Decrease default progress interval to 5/second #5165

brianlan Feb 11, 2017

@nateGeorge Thanks, the redirect solution worked for me.

simonm3 Jun 19, 2017

A lot of people run Jupyter with Keras, and this bug is over a year old now. Is there any way of getting this fixed? It is a big pain running Keras models with no progress bars. I note it says "needs info", but I'm not sure what that means.

@nateGeorge

Could also use keras-tqdm maybe https://github.com/bstriner/keras-tqdm

simonm3 Jun 19, 2017

simonm3 Jun 22, 2017

jlintusaari Jul 21, 2017

This WebSocket ping timeout also occurs frequently if you output a considerable amount of logging messages to the cell in a long running job (not with Keras).
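
For logging-heavy jobs, one option in the same spirit as the workarounds above is to send log records to a file and let only warnings reach the cell. The sketch below uses the standard `logging` module; the logger name and log file location are placeholders, and whether this avoids the timeout is an assumption based on the reports in this thread.

```python
import logging
import os
import tempfile

log_path = os.path.join(tempfile.gettempdir(), "long_job.log")  # placeholder path

logger = logging.getLogger("long_job")  # placeholder logger name
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep records away from handlers on the root logger

file_handler = logging.FileHandler(log_path, mode="w")
file_handler.setLevel(logging.DEBUG)     # full detail goes to the file
cell_handler = logging.StreamHandler()   # stderr, which the notebook shows in the cell
cell_handler.setLevel(logging.WARNING)   # only warnings and errors reach the cell
logger.addHandler(file_handler)
logger.addHandler(cell_handler)

for step in range(1000):
    logger.debug("finished step %d", step)  # written to the file only
logger.warning("job finished")               # written to both handlers

file_handler.flush()
```

The per-handler levels are what separate the two destinations: the logger itself accepts everything, and each handler decides what it passes on.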

@dbl001 dbl001 referenced this issue in jupyterlab/jupyterlab Jan 10, 2018

Open

[ LabApp] WebSocket ping timeout after 133941 ms. #3598

saksham789 Jan 28, 2018

Hey, I am using Jupyter Notebook on a local server, but it crashes very frequently with this error:

Websocket ping timeout after 1290004 ms

This happens with any random piece of code in the script. A potential workaround I found is restarting the browser, but it's very annoying going through that process time and again. Please help!

kaiaeberli Jan 28, 2018

Try disabling Windows Firewall for your private network (if you are running the notebook on localhost:8888); that solved it for me.
