
singleuser-server not reporting activity to JupyterHub sometimes #3099

Open
yuvipanda opened this issue Jun 22, 2020 · 25 comments

@yuvipanda
Contributor

Bug description

In some cases, the jupyterhub-singleuser server doesn't seem to report activity to JupyterHub, yet in a way that makes JupyterHub think the server has never stopped being active. If you look at the admin panel, it shows the user as active '2 minutes ago' or similar. But if you look at the logs, you see:

[I 2020-06-19 09:08:46.556 SingleUserNotebookApp handlers:264] Restoring connection for b08322cf-2857-4ccd-8bea-a445a96aaa95:754fe556a8394153989802bcaba67fdf
[I 2020-06-20 00:43:13.737 SingleUserNotebookApp kernelmanager:217] Starting buffering for b08322cf-2857-4ccd-8bea-a445a96aaa95:754fe556a8394153989802bcaba67fdf
[I 2020-06-20 00:43:13.738 SingleUserNotebookApp kernelmanager:217] Starting buffering for 5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6:5e747954b2bd40e2828015b244ad1f1e
[E 2020-06-20 00:46:50.280 SingleUserNotebookApp singleuser:523] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/srv/conda/envs/data100/lib/python3.7/site-packages/jupyterhub/singleuser.py", line 521, in notify
        await client.fetch(req)
    tornado.simple_httpclient.HTTPTimeoutError: Timeout while connecting
[E 2020-06-20 00:47:10.327 SingleUserNotebookApp singleuser:523] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/srv/conda/envs/data100/lib/python3.7/site-packages/jupyterhub/singleuser.py", line 521, in notify
        await client.fetch(req)
    tornado.simple_httpclient.HTTPTimeoutError: Timeout while connecting
[E 2020-06-20 00:47:31.896 SingleUserNotebookApp singleuser:523] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/srv/conda/envs/data100/lib/python3.7/site-packages/jupyterhub/singleuser.py", line 521, in notify
        await client.fetch(req)
    tornado.simple_httpclient.HTTPTimeoutError: Timeout while connecting
[E 2020-06-20 00:47:31.897 SingleUserNotebookApp singleuser:548] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/srv/conda/envs/data100/lib/python3.7/site-packages/jupyterhub/singleuser.py", line 546, in keep_activity_updated
        await self.notify_activity()
      File "/srv/conda/envs/data100/lib/python3.7/site-packages/jupyterhub/singleuser.py", line 533, in notify_activity
        timeout=60,
      File "/srv/conda/envs/data100/lib/python3.7/site-packages/jupyterhub/utils.py", line 177, in exponential_backoff
        raise TimeoutError(fail_message)
    TimeoutError: Failed to notify Hub of activity
[I 2020-06-20 00:48:21.519 SingleUserNotebookApp log:174] 302 GET /user/user-name/ -> /user/user-name/tree? (@10.20.2.55) 1.41ms
[I 2020-06-20 00:50:12.218 SingleUserNotebookApp log:174] 101 GET /user/user-name/api/kernels/5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6/channels?session_id=5e747954b2bd40e2828015b244ad1f1e (user-name@10.0.0.9) 107.51ms
[I 2020-06-20 00:50:12.218 SingleUserNotebookApp handlers:264] Restoring connection for 5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6:5e747954b2bd40e2828015b244ad1f1e
[W 2020-06-20 19:34:42.219 SingleUserNotebookApp zmqhandlers:182] WebSocket ping timeout after 119837 ms.
[I 2020-06-20 19:34:47.222 SingleUserNotebookApp kernelmanager:217] Starting buffering for 5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6:5e747954b2bd40e2828015b244ad1f1e
[I 2020-06-20 19:34:59.632 SingleUserNotebookApp log:174] 101 GET /user/user-name/api/kernels/5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6/channels?session_id=5e747954b2bd40e2828015b244ad1f1e (user-name@10.0.0.64) 75.16ms
[I 2020-06-20 19:34:59.633 SingleUserNotebookApp handlers:264] Restoring connection for 5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6:5e747954b2bd40e2828015b244ad1f1e
[W 2020-06-20 22:25:31.678 SingleUserNotebookApp handlers:252] Replacing stale connection: 5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6:5e747954b2bd40e2828015b244ad1f1e
[I 2020-06-20 22:25:36.681 SingleUserNotebookApp kernelmanager:217] Starting buffering for 5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6:5e747954b2bd40e2828015b244ad1f1e
[I 2020-06-20 22:25:36.683 SingleUserNotebookApp log:174] 101 GET /user/user-name/api/kernels/5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6/channels?session_id=5e747954b2bd40e2828015b244ad1f1e (user-name@10.0.0.65) 5104.34ms
[I 2020-06-20 22:25:36.684 SingleUserNotebookApp handlers:264] Restoring connection for 5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6:5e747954b2bd40e2828015b244ad1f1e
[W 2020-06-20 22:30:36.685 SingleUserNotebookApp zmqhandlers:182] WebSocket ping timeout after 119842 ms.
[I 2020-06-20 22:30:41.687 SingleUserNotebookApp kernelmanager:217] Starting buffering for 5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6:5e747954b2bd40e2828015b244ad1f1e
[I 2020-06-20 22:30:54.565 SingleUserNotebookApp log:174] 101 GET /user/user-name/api/kernels/5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6/channels?session_id=5e747954b2bd40e2828015b244ad1f1e (user-name@10.0.0.41) 129.91ms
[I 2020-06-20 22:30:54.565 SingleUserNotebookApp handlers:264] Restoring connection for 5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6:5e747954b2bd40e2828015b244ad1f1e
[W 2020-06-20 23:44:41.212 SingleUserNotebookApp handlers:252] Replacing stale connection: 5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6:5e747954b2bd40e2828015b244ad1f1e
[I 2020-06-20 23:44:46.213 SingleUserNotebookApp kernelmanager:217] Starting buffering for 5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6:5e747954b2bd40e2828015b244ad1f1e
[I 2020-06-20 23:44:46.215 SingleUserNotebookApp log:174] 101 GET /user/user-name/api/kernels/5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6/channels?session_id=5e747954b2bd40e2828015b244ad1f1e (user-name@10.0.0.41) 5080.02ms
[I 2020-06-20 23:44:46.215 SingleUserNotebookApp handlers:264] Restoring connection for 5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6:5e747954b2bd40e2828015b244ad1f1e
[W 2020-06-21 20:06:18.366 SingleUserNotebookApp handlers:252] Replacing stale connection: 5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6:5e747954b2bd40e2828015b244ad1f1e
[I 2020-06-21 20:06:23.368 SingleUserNotebookApp kernelmanager:217] Starting buffering for 5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6:5e747954b2bd40e2828015b244ad1f1e
[I 2020-06-21 20:06:23.370 SingleUserNotebookApp log:174] 101 GET /user/user-name/api/kernels/5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6/channels?session_id=5e747954b2bd40e2828015b244ad1f1e (user-name@10.0.0.66) 5134.58ms
[I 2020-06-21 20:06:23.370 SingleUserNotebookApp handlers:264] Restoring connection for 5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6:5e747954b2bd40e2828015b244ad1f1e
[E 2020-06-21 21:24:39.536 SingleUserNotebookApp singleuser:523] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/srv/conda/envs/data100/lib/python3.7/site-packages/jupyterhub/singleuser.py", line 521, in notify
        await client.fetch(req)
    tornado.simple_httpclient.HTTPTimeoutError: Timeout while connecting
[E 2020-06-21 21:24:59.674 SingleUserNotebookApp singleuser:523] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/srv/conda/envs/data100/lib/python3.7/site-packages/jupyterhub/singleuser.py", line 521, in notify
        await client.fetch(req)
    tornado.simple_httpclient.HTTPTimeoutError: Timeout while connecting
[I 2020-06-21 21:25:14.320 SingleUserNotebookApp log:174] 302 GET /user/user-name/ -> /user/user-name/tree? (@10.20.14.74) 0.89ms
[E 2020-06-21 21:25:20.806 SingleUserNotebookApp singleuser:523] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/srv/conda/envs/data100/lib/python3.7/site-packages/jupyterhub/singleuser.py", line 521, in notify
        await client.fetch(req)
    tornado.simple_httpclient.HTTPTimeoutError: Timeout while connecting
[E 2020-06-21 21:25:20.806 SingleUserNotebookApp singleuser:548] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/srv/conda/envs/data100/lib/python3.7/site-packages/jupyterhub/singleuser.py", line 546, in keep_activity_updated
        await self.notify_activity()
      File "/srv/conda/envs/data100/lib/python3.7/site-packages/jupyterhub/singleuser.py", line 533, in notify_activity
        timeout=60,
      File "/srv/conda/envs/data100/lib/python3.7/site-packages/jupyterhub/utils.py", line 177, in exponential_backoff
        raise TimeoutError(fail_message)
    TimeoutError: Failed to notify Hub of activity

This results in pods not being culled for days at a time, making autoscaling much more expensive.

Expected behaviour

The hub knows when the pod has no connections left and terminates it.

Actual behaviour

The pod keeps running forever, basically.

How to reproduce

We see this for a small percentage of our users (<1% is my guess). I can't figure out a way to reproduce it, unfortunately.

Your personal set up

Latest version of z2jh - full config can be found at github.com/berkeley-dsep-infra/datahub

@yuvipanda yuvipanda added the bug label Jun 22, 2020
@elgalu
Contributor

elgalu commented Jun 22, 2020

+1, also can't reproduce it; non-deterministic so far.

(screenshot attached)

@yuvipanda
Contributor Author

I'm looking at /proc/1/net/tcp in the 'stuck' pod to see what connections it's holding. The only interesting one I see is from its pod IP to the pod IP of the proxy. This might mean there's some kind of 'stuck' connection between the proxy pod and this one? But I'm not sure yet how to figure out how long this connection has been open.
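
For reference, a minimal sketch (assuming IPv4 entries and PID 1 being the notebook server) of decoding those /proc/1/net/tcp rows into readable address:port pairs; the kernel writes them as little-endian hex, and there is no per-connection timestamp in that file, so connection age has to come from somewhere else:

# Decode /proc/1/net/tcp into readable local/remote addresses (IPv4 only).
import socket
import struct

def hex_to_addr(entry):
    """Turn '0100007F:1F90' into '127.0.0.1:8080'."""
    ip_hex, port_hex = entry.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))  # little-endian hex -> dotted quad
    return "%s:%d" % (ip, int(port_hex, 16))

with open("/proc/1/net/tcp") as f:
    next(f)  # skip the header line
    for line in f:
        fields = line.split()
        local, remote, state = fields[1], fields[2], fields[3]
        if state == "01":  # 01 == ESTABLISHED
            print(hex_to_addr(local), "->", hex_to_addr(remote))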

@yuvipanda
Contributor Author

There also does not currently seem to be a network issue between the pod and the hub:

curl -H "Authorization: token $JUPYTERHUB_API_TOKEN" -X POST $JUPYTERHUB_ACTIVITY_URL
{"status": 400, "message": "body must be a json dict"}

@elgalu
Contributor

elgalu commented Jun 22, 2020

I don't follow why it uses the IP to connect. In K8s the Hub's pod can die every now and then, and the new Hub pod will have a different internal IP address, so the single-user server would be pointing at an old, invalid IP.

It should always use K8s internal DNS names, e.g. proxy-public.{{k8s__namespace}}.svc.cluster.local

@yuvipanda
Contributor Author

@elgalu is all that for the same user or for different users?

@elgalu
Contributor

elgalu commented Jun 22, 2020

That log was for the same user

@yuvipanda
Contributor Author

@elgalu ah, I think it was mostly because I was looking at it at a fairly low level (/proc/1/net/tcp), where you see addresses after they have already been resolved from service IP to pod IP.

@yuvipanda
Contributor Author

The part that confuses me is: if the pod can't notify the hub of activity, shouldn't that get the pod killed sooner, not later?

@yuvipanda
Contributor Author

Also, the pod itself is much older than the proxy pod, which casts doubt on my 'stuck connection to the proxy' hypothesis.

@elgalu
Contributor

elgalu commented Jun 22, 2020

So it could be pointing to an old IP; it should point to the service domain name instead:
https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/

@yuvipanda
Contributor Author

yuvipanda commented Jun 22, 2020

@elgalu nah, ultimately the kernel only sees IPs, since that's the level of info I was looking at (check out /proc/1/net/tcp on your own Linux host, or try man procfs). I don't think that's related. This is just running the latest Zero to JupyterHub.

I also verified that it is the correct IP.

@minrk
Member

minrk commented Jun 23, 2020

I think there are two issues here, and I'm not sure how related they are:

  • failure to connect to the hub to post activity
  • failure to identify an apparently idle client (or perhaps spurious activity is keeping them active, which I suspect is more likely since it's been a known and common issue)

Is it failing for 100% of requests, or just some? My first hunch is that it's a problem with tornado's SimpleAsyncHTTPClient getting stuck and not sending new requests, in which case opting to use the curl client might resolve it.
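
(If anyone wants to experiment with that hunch, switching tornado's default client to the curl-based one is a one-line configure call; pycurl has to be installed in the singleuser image, and where exactly to run this during singleuser startup is left open here.)

# Opt into tornado's curl-based HTTP client instead of SimpleAsyncHTTPClient.
# Requires pycurl in the image; run before any AsyncHTTPClient is instantiated.
from tornado.httpclient import AsyncHTTPClient

AsyncHTTPClient.configure("tornado.curl_httpclient.CurlAsyncHTTPClient")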

But I'm also not sure that this is really the issue, since the logs you shared have long gaps (unless you snipped some out):

[I 2020-06-20 23:44:46.215 SingleUserNotebookApp handlers:264] Restoring connection for 5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6:5e747954b2bd40e2828015b244ad1f1e
[W 2020-06-21 20:06:18.366 SingleUserNotebookApp handlers:252] Replacing stale connection: 5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6:5e747954b2bd40e2828015b244ad1f1e

that's roughly 20 hours between log messages, so their last activity should be at least that much earlier, at least from this source.

What's weird is that failure to notify activity should have the opposite effect: the user would be culled prematurely due to a lack of activity updates, not kept active. The activity API doesn't allow posting inactivity, only updating to more recent activity.

Are you using traefik or CHP for the proxy? If it's CHP, activity is also logged at the traffic level, so maybe a stale connection is keeping it alive somehow.


Here's my debugging idea: check the Hub logs for any log statements about this user. There might be something left open that's polling and sending requests to the Hub (JupyterLab may still have some overzealous polling behavior).

@yuvipanda
Contributor Author

@minrk this is using CHP, just standard z2jh.

I looked at the hub logs, and I see pretty constant 200 hits to /hub/api/users/<user-name>/activity. So it looks like the issue isn't jupyterhub-singleuser posting activity, but something else, possibly the connection to the proxy that I saw left open.

@minrk
Member

minrk commented Jun 23, 2020

Other debugging ideas:

If using CHP, you can check the activity state there in the routing table. What I think is most likely, since it's happened many times before, is that there's a lingering browser or something polling the backend. I don't know why the requests wouldn't be logged (maybe requests made to a plugin-provided or proxied endpoint that doesn't log, or possibly websocket messages which aren't all logged).
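
A rough sketch of checking that routing table, run from somewhere that can reach the proxy API and has the proxy token (for example the hub pod); the proxy-api host/port and the CONFIGPROXY_AUTH_TOKEN env var are the usual CHP/z2jh defaults, but treat both as deployment-specific assumptions:

# List CHP routes and their last_activity timestamps via the proxy REST API.
import json
import os
import urllib.request

req = urllib.request.Request(
    "http://proxy-api:8001/api/routes",  # adjust host/port for your deployment
    headers={"Authorization": "token " + os.environ["CONFIGPROXY_AUTH_TOKEN"]},
)
with urllib.request.urlopen(req, timeout=10) as resp:
    routes = json.load(resp)

for path, route in sorted(routes.items()):
    print(path, route.get("last_activity"))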

This repeating log pattern:

[I 2020-06-20 00:50:12.218 SingleUserNotebookApp handlers:264] Restoring connection for 5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6:5e747954b2bd40e2828015b244ad1f1e
[W 2020-06-20 19:34:42.219 SingleUserNotebookApp zmqhandlers:182] WebSocket ping timeout after 119837 ms.
[I 2020-06-20 19:34:47.222 SingleUserNotebookApp kernelmanager:217] Starting buffering for 5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6:5e747954b2bd40e2828015b244ad1f1e
[I 2020-06-20 19:34:59.632 SingleUserNotebookApp log:174] 101 GET /user/user-name/api/kernels/5ba4c551-e4b5-4bc0-b88f-bd30a77c83f6/channels?session_id=5e747954b2bd40e2828015b244ad1f1e (user-name@10.0.0.64) 75.16ms

means that there is indeed a held-open websocket connection. The connection is established at 2020-06-20T00:50 and is only lost almost 19 hours later at 19:34, at which point it is promptly re-established 17 seconds later. There should be a log statement when the connection is actually closed, so its absence means the connection was not closed at any point during that time. So this is a classic case of left-open browsers connected to the server.

The internal idle culler we use in Binder is able to make more fine-grained decisions about whether to consider these merely open connections as 'active' or not.

@yuvipanda
Contributor Author

Yeah, I think a 'jupyterlab / jupyter notebook' process in the background is the most likely cause. I wonder if we can use the internal culler, even though we are using lots of non-Jupyter interfaces (like RStudio). But so does Binder, so it should be fine...

@minrk
Member

minrk commented Jun 25, 2020

If you are using jupyter-server-proxy, I believe the internal culler still tracks requests there. I think I remember looking into this, but worth confirming.

@yuvipanda
Contributor Author

@minrk awesome, I'll look into it.

I went and killed the user's running kernels, which seems to have made no difference. That makes sense if the thing keeping it alive is the proxy connection.

@chintanp

I am facing the same issue, in a TLJH installation on an Ubuntu EC2 instance with around 8GB of RAM and a single user using less than 5GB.

Refreshing the browser usually resolves this, but it is inconvenient. Is there any way I can debug or track this further?

yuvipanda added a commit to yuvipanda/datahub-old-fork that referenced this issue Nov 18, 2020
The culler doesn't kill some user pods when it should. We now
get the notebook server to try and stop itself when it's got
no activity either. Not sure if this will help, but let's
try! This is what mybinder.org does.

See jupyterhub/jupyterhub#3099
for more info.
@kesiraa

kesiraa commented Nov 26, 2020

Same issue here.
I am trying a hacky workaround: communicate with the Hub API, and if it returns Unauthorized, redirect the user to the login page.

@manics
Member

manics commented May 22, 2022

@yuvipanda Are you still hitting this problem with the latest JupyterHub?

@yuvipanda
Contributor Author

I think so. I've started culling via notebook config as well, which has helped, so I don't notice if this still happens...

@vivian-rook

vivian-rook commented Jul 1, 2022

We're seeing this in JupyterHub 2.3.0, though the traceback points to mixins.py rather than singleuser.py, so perhaps it's different?

[E 2022-06-23 05:04:39.057 SingleUserLabApp mixins:578] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/srv/paws/lib/python3.10/site-packages/jupyterhub/singleuser/mixins.py", line 576, in notify
        await client.fetch(req)
      File "/usr/lib/python3.10/asyncio/tasks.py", line 304, in __wakeup
        future.result()
    tornado.httpclient.HTTPClientError: HTTP 403: Forbidden

What does it mean to be culling via notebook config?

A few more notes here: https://phabricator.wikimedia.org/T310622

@minrk
Member

minrk commented Jul 13, 2022

Culling via notebook config means using the notebook server's own more detailed idle-culling mechanisms, rather than relying on JupyterHub's coarse-grained activity information. You can see references to 'MappingKernelManager' in the idle-culler readme.
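
For anyone following along, a hedged sketch of what that notebook-side culling config can look like. The option names are the classic notebook server's (jupyter_server uses c.ServerApp plus the same kernel-manager options), and the values are illustrative, not recommendations:

# jupyter_notebook_config.py (illustrative values)

# Cull kernels idle for an hour, checking every five minutes,
# even if a browser still holds an open websocket to them.
c.MappingKernelManager.cull_idle_timeout = 3600
c.MappingKernelManager.cull_interval = 300
c.MappingKernelManager.cull_connected = True

# Shut the whole server down after an hour with no kernels and no activity,
# so JupyterHub can reclaim the pod.
c.NotebookApp.shutdown_no_activity_timeout = 3600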

A 403 on the activity request should generally mean that the API token used by the server has been revoked or has expired. Tokens don't normally expire, so revocation is more likely. This normally happens when a server is shut down, but it's possible some other transaction revoked tokens in bulk, or the token was manually revoked. My best guess is that server shutdown logic has been triggered even though the server is actually still running. That would likely mean orphaned pods, though. I'm not really sure how it could happen, and I would have a hard time tracking it down without relatively complete debug logs from the Hub and the server in question.
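
One quick check from inside the affected pod, if it helps narrow that down: ask the Hub who the server's token belongs to. A 403 there would corroborate a revoked token. This assumes JUPYTERHUB_API_URL and JUPYTERHUB_API_TOKEN are set, as they normally are in a spawned server:

# Check whether the server's API token is still accepted by the Hub.
import os
import urllib.request
from urllib.error import HTTPError

req = urllib.request.Request(
    os.environ["JUPYTERHUB_API_URL"] + "/user",
    headers={"Authorization": "token " + os.environ["JUPYTERHUB_API_TOKEN"]},
)
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(resp.status, resp.read().decode())
except HTTPError as e:
    print(e.code, e.read().decode())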

@priyankade

priyankade commented Oct 10, 2022

I am facing the same issue:

[E 2022-10-10 13:20:05.725 SingleUserLabApp mixins:574] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.10/site-packages/jupyterhub/singleuser/mixins.py", line 572, in notify
        await client.fetch(req)
    tornado.simple_httpclient.HTTPTimeoutError: Timeout while connecting

I am trying to spawn the single-user server on the nodes in Docker swarm mode, and this is the error I get.
The Docker container spawned on the node stays alive for 3-4 minutes, tries to connect to the Hub, and then fails with the above error.
However, the Hub API (via its internal IP) is reachable from within the container on the node:

(base) root@***container********:~# curl -v  http://10.XX.XX.XX:8003/hub/api
*   Trying 10.XX.XX.XX:8003...
* Connected to 10.XX.XX.XX (10.XX.XX.XX) port 8003 (#0)
> GET /hub/api HTTP/1.1
> Host: 10.XX.XX.XX:8003
> User-Agent: curl/7.81.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< server: TornadoServer/6.2
< content-type: application/json
..............................omitted
< content-length: 20
< connection: close
<
* Closing connection 0
{"version": "2.3.1"}

A Docker network is created specifically for this swarm mode, which the container uses when it spawns on the nodes.
Note: the setup works perfectly when the single-user server spawns on the master.
I'm not able to figure out where the network is getting blocked.

@yashali

yashali commented Apr 19, 2023

@minrk we observe this issue intermittently on JupyterHub 3.0.0 as well. We will upgrade to 3.1.1 soon, but we are wondering whether it will resolve this issue.

[E 2023-04-17 08:20:48.862 SingleUserNotebookApp mixins:574] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.8/site-packages/jupyterhub/singleuser/mixins.py", line 572, in notify
        await client.fetch(req)
    tornado.simple_httpclient.HTTPTimeoutError: Timeout while connecting
[E 2023-04-17 08:20:48.866 SingleUserNotebookApp mixins:599] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.8/site-packages/jupyterhub/singleuser/mixins.py", line 597, in keep_activity_updated
        await self.notify_activity()
      File "/opt/conda/lib/python3.8/site-packages/jupyterhub/singleuser/mixins.py", line 579, in notify_activity
        await exponential_backoff(
      File "/opt/conda/lib/python3.8/site-packages/jupyterhub/utils.py", line 236, in exponential_backoff
        raise asyncio.TimeoutError(fail_message)
    asyncio.exceptions.TimeoutError: Failed to notify Hub of activity
[E 2023-04-17 08:52:55.956 SingleUserNotebookApp mixins:574] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.8/site-packages/jupyterhub/singleuser/mixins.py", line 572, in notify
        await client.fetch(req)
    tornado.simple_httpclient.HTTPTimeoutError: Timeout during request
