singleuser-server not reporting activity to JupyterHub sometimes #3099
Bug description

In some cases, the jupyterhub-singleuser server stops notifying JupyterHub of activity in a way that makes JupyterHub think the server never went idle: the admin panel shows the user as active '2 minutes ago' or similar, while the server's activity notifications to the Hub are actually failing. This results in pods not being culled for days at a time, making autoscaling much more expensive.

Expected behaviour

The hub notices when the pod has no connections left and terminates it.

Actual behaviour

The pod keeps running, essentially forever.

How to reproduce

We see this in a small percentage of all our users (under 1% is my guess). Unfortunately, we haven't found a way to reproduce it.

Your personal set up

Latest version of z2jh; the full config can be found at github.com/berkeley-dsep-infra/datahub
Comments
I'm looking at the logs for one affected user's pod. There also does not currently seem to be a network issue between the pod and the hub.
I don't follow why it uses the IP to connect. In K8s, the Hub's pod can die every now and then, and the new Hub pod will have a different internal IP address, so the single-user server would be left pointing at an old, invalid IP. It should always use K8s internal DNS names instead.
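JupyterHub has a setting for exactly this; a minimal sketch, assuming a z2jh-style deployment where the hub sits behind a Kubernetes Service (the Service name "hub", namespace "jhub", and port 8081 are placeholders, not values from this thread):

```python
# jupyterhub_config.py -- a sketch, not the z2jh default configuration.
# Substitute your own Service name, namespace, and port. Spawned
# single-user servers then reach the Hub through the stable Service
# DNS name rather than a pod IP that changes when the hub pod is
# rescheduled.
c.JupyterHub.hub_connect_url = "http://hub.jhub.svc.cluster.local:8081"
```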
@elgalu is all that for the same user or for different users?
That log was for the same user.
@elgalu ah, I think it was mostly because I was looking at it at a fairly low level.
The part that confuses me is: if the pod can't notify the hub of activity, shouldn't that get the pod killed sooner, not later?
Also, the pod itself is much older than the proxy pod, which casts doubt on my 'stuck connection to proxy' hypothesis.
So it could be pointing to an old IP; it should point to the service domain name instead.
@elgalu nah, ultimately the kernel only sees IPs, since that's the level of info I was looking at. I also verified that it is the correct IP.
I think there are two issues here, and I'm not sure how related they are.
Is it failing 100% of requests, or just some? My first hunch is that it's a problem with tornado's SimpleHTTPClient getting stuck and not sending new requests, in which case opting to use the curl client might resolve it (see the sketch after this comment). But I'm also not sure that this is really the issue, since the logs you shared have long gaps, unless you snipped some out: there are 21 hours between log messages, so their last activity should be at least 21 hours earlier, at least from this source.

What's weird is that failure to notify of activity should have the opposite effect: the user would be culled prematurely due to a lack of activity updates, not kept active. The activity API doesn't allow posting inactivity, only updating to more recent activity. Are you using traefik or CHP for the proxy? If it's CHP, activity is also logged at the traffic level, so maybe a stale connection is keeping it alive somehow. Here's my debugging idea: check the Hub logs for any log statements about this user. There might be something left open that's polling and sending requests to the Hub (JupyterLab may still have some overzealous polling behavior).
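For anyone who wants to try the curl-client hunch, tornado supports swapping its HTTP client implementation globally; a minimal sketch, assuming pycurl is installed (whether this actually helps in this situation is exactly what's in question):

```python
# Sketch: replace tornado's default SimpleAsyncHTTPClient with the
# curl-based client. This must run before any AsyncHTTPClient is
# instantiated, and it requires the pycurl package.
from tornado.httpclient import AsyncHTTPClient

AsyncHTTPClient.configure("tornado.curl_httpclient.CurlAsyncHTTPClient")
```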
@minrk this is using CHP, just standard z2jh. I looked at the hub logs, and I see pretty constant 200 hits for this user.
Other debugging ideas: if you're using CHP, you can check the activity state in its routing table (see the sketch below this comment). What I think is most likely, since it's happened many times before, is that there's a lingering browser or something polling the backend. I don't know why the requests wouldn't be logged; maybe they're made to a plugin-provided or proxied endpoint that doesn't log, or they're websocket messages, which aren't all logged. The repeating log pattern means that there is indeed a held-open websocket connection. The internal idle culler we use in Binder is able to make more fine-grained decisions about whether to consider these merely open connections 'active' or not.
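For the routing-table check, CHP's REST API reports a last_activity timestamp for each route; a minimal sketch using requests, assuming z2jh's default proxy-api service address and a CONFIGPROXY_AUTH_TOKEN environment variable (both are assumptions about the deployment):

```python
# Sketch: dump per-route last_activity from configurable-http-proxy's
# REST API. Adjust the host and port if your proxy API listens elsewhere.
import os

import requests

resp = requests.get(
    "http://proxy-api:8001/api/routes",
    headers={"Authorization": f"token {os.environ['CONFIGPROXY_AUTH_TOKEN']}"},
)
resp.raise_for_status()
for route, info in resp.json().items():
    print(route, info.get("last_activity"))
```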
Yeah, I think a lingering jupyterlab / jupyter notebook process in the background is the most likely cause. I wonder if we can use the internal culler, even though we are using lots of non-Jupyter interfaces (like RStudio). But so does Binder, so it should be fine...
If you are using jupyter-server-proxy, I believe the internal culler still tracks requests there. I think I remember looking into this, but it's worth confirming.
@minrk awesome, I'll look into it. I went and killed the user's running kernels, which seems to have made no difference. That makes sense if the thing keeping it alive is the proxy connection.
I am facing the same issue in a TLJH deployment on an Ubuntu EC2 instance with around 8GB of RAM and a single user using less than 5GB. Refreshing the browser usually resolves this, but it is inconvenient. Is there any way I can debug or track this further?
The culler doesn't kill some user pods when it should. We now get the notebook server to try to stop itself when it has had no activity, too (see the sketch below). Not sure if this will help, but let's try! This is what mybinder.org does. See jupyterhub/jupyterhub#3099 for more info.
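The notebook-side self-shutdown referred to here is a stock notebook server option; a minimal sketch (the one-hour timeout is a placeholder, not necessarily what mybinder.org uses):

```python
# jupyter_notebook_config.py -- sketch; the value is a placeholder.
# The server exits on its own after an hour with no running kernels
# and no activity, independent of JupyterHub's external culler.
c.NotebookApp.shutdown_no_activity_timeout = 3600
```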
Same issue here.
@yuvipanda Are you still hitting this problem with the latest JupyterHub?
I think so. I've started culling via notebook config as well, which has helped, so I don't notice whether this still happens...
We're seeing this in JupyterHub 2.3.0.
What does it mean to be culling via notebook config? A few more notes here: https://phabricator.wikimedia.org/T310622
Culling via notebook config means using the notebook server's own, more detailed idle-culling mechanisms, rather than relying on JupyterHub's coarse-grained activity information. You can see references to 'MappingKernelManager' in the idle-culler readme (a sketch of that configuration follows below). A 403 on the activity request should generally mean that the API token used by the server has been revoked or has expired. Tokens don't normally expire, so revocation is more likely. This normally happens when a server is shut down, but it's possible some other transaction revoked tokens in bulk, or the token was manually revoked. My best guess is that server-shutdown logic has been triggered even though the server is actually still running. That would likely mean orphaned pods, though. I'm not really sure how it could happen, and I would have a hard time tracking it down without relatively complete debug logs from the Hub and the server in question.
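The kernel-level culling described here is configured on MappingKernelManager; a minimal sketch with placeholder timeouts:

```python
# jupyter_notebook_config.py -- sketch; all values are placeholders.
# Cull kernels idle for an hour, checking every five minutes. With
# cull_connected = True, kernels are culled even if a browser still
# holds an open (but idle) connection, which is what distinguishes
# this from JupyterHub's connection-based activity tracking.
c.MappingKernelManager.cull_idle_timeout = 3600
c.MappingKernelManager.cull_interval = 300
c.MappingKernelManager.cull_connected = True
```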
I am facing the same issue. I am trying to spawn single-user servers on nodes in Docker Swarm mode, and this is the error I get. A Docker network is created specifically for this swarm, which the containers use when they spawn on the nodes.
@minrk we observe this issue intermittently on JupyterHub 3.0.0 as well. We will upgrade to 3.1.1 soon, but we are wondering whether it will resolve this issue.