Turing unhealthy #2252

Closed
minrk opened this issue May 31, 2022 · 17 comments

@minrk
Member

minrk commented May 31, 2022

The Turing cluster is unhealthy for a couple of reasons, causing health check failures and various errors.

There appear to be at least two issues:

  1. networking, with lots of internal DNS failures and connection timeouts.

    Error: getaddrinfo EAI_AGAIN proxy-patches
     at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:69:26) {
    errno: -3001,
    code: 'EAI_AGAIN',
    syscall: 'getaddrinfo',
    hostname: 'proxy-patches'
    }
    
  2. evictions, apparently tied to disk pressure:

    Warning  Evicted    26m   kubelet            The node was low on resource: ephemeral-storage. Container notebook was using 84Ki, which exceeds its request of 0.
    Warning  Evicted    20s   kubelet            The node had condition: [DiskPressure].
    

I'm not sure if this is just an unhealthy node or system pod, but I haven't been able to track it down. For now, Turing is removed from the federation-redirect while we figure it out. @callummole
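
For reference, a rough way to check both symptoms from inside the cluster (the dnsutils image is the one used in the Kubernetes DNS-debugging docs; namespace and node name are placeholders):

# Run a throwaway pod in the same namespace and try to resolve the failing name
kubectl run -it --rm dns-test -n <namespace> --restart=Never \
  --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 -- nslookup proxy-patches

# Look for evictions and DiskPressure conditions on the nodes
kubectl get events --all-namespaces --field-selector reason=Evicted
kubectl describe node <node-name> | grep -A7 Conditions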

@minrk
Member Author

minrk commented Jun 9, 2022

Since we've had evictions due to disk pressure on two nodes, I think the IOPS throttling issues discussed in Azure/AKS#1320 may be relevant. It could be worth checking the cluster logs in Azure to see if there's anything spamming warnings or errors that could fill up the disk with logs.

@callummole it may be worth checking out the quota issues mentioned here: Azure/AKS#1373. It's possible that picking a new node configuration is the action to take; I'm not sure.
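
One way to see whether node-local disk is actually filling up is the kubelet's stats summary (node name is a placeholder; jq is only used for readability):

# Filesystem capacity/usage as reported by the kubelet for the node and its pods
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" | jq '.node.fs'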

@minrk
Member Author

minrk commented Jun 9, 2022

I think this may be related to the updates to the image-cleaner. Working on a possible fix

@callummole
Contributor

I've had a look on Azure at some of the virtual machine scale set metrics shown in #1373. There are some spikes in use, but none of the %CPU or %IOPS reach levels one would be concerned about (max CPU used is ~25%, max IOPS consumed is ~1.5%).

@minrk
Member Author

minrk commented Jun 9, 2022

Strange. I can't think of what else could cause intermittent DNS issues. Recycling all nodes might help, but it hasn't seemed to yet.

@minrk
Member Author

minrk commented Jun 13, 2022

I spent quite a bit of time poking around and found various issues, such as alan-turing-institute/hub23-deploy#445 and https://medium.com/@amirilovic/how-to-fix-node-dns-issues-5d4ec2e12e95

I didn't deploy node-local-dns, but I doubt that would fix the problem since we have both DNS issues and connection timeout issues (connection timeout can explain DNS, but I don't think DNS can explain connection timeout).

I can't seem to track down what's causing the issues. The network just seems fundamentally unreliable under negligible load. There are no errors or warnings in any networking component that I can find, nothing restarting, etc.

The symptoms are exactly what's described at Azure/AKS#1320 (comment), but if we aren't eating up IOPS, I don't know what could be the issue.

I'm out of ideas other than 'start a new cluster' since even starting entirely new nodes didn't help. Does Turing have a support contract with Azure they can rely on?

@callummole
Contributor

callummole commented Jun 14, 2022

I am considering a redeploy to UK south, since then we can take advantage of some reserved instances (and reduce cost).

So perhaps now would be a good time to do that, perhaps with new node configs.

As for the IOPS issue, there are certainly spikes, it's just that the load is always low. Perhaps it doesn't take much to cause the issue.

@minrk
Member Author

minrk commented Jun 14, 2022

OK, if deploying a new cluster is desirable for other reasons, let's maybe do that and hope the issue gets fixed and we never need to understand it.

@callummole
Contributor

callummole commented Aug 15, 2022

Post-deployment, we are still having intermittency issues, where turing.mybinder.org sometimes shows a 504 Gateway Timeout. My investigations:

  • The hub pod frequently has the event: Readiness probe failed: Get "http://10.244.0.25:8081/hub/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers).

This seems to be concurrent with the log entry:

[W 2022-08-15 16:11:30.253 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[W 2022-08-15 16:11:40.385 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[W 2022-08-15 16:12:00.601 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[E 2022-08-15 16:12:00.606 JupyterHub ioloop:761] Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7f05db27bdf0>>, <Task finished name='Task-12128' coro=<JupyterHub.update_last_activity() done, defined at /usr/local/lib/python3.8/dist-packages/jupyterhub/app.py:2666> exception=TimeoutError('Repeated api_request to proxy path "" failed.')>)
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py", line 741, in _run_callback
        ret = callback()
      File "/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py", line 765, in _discard_future_result
        future.result()
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/app.py", line 2668, in update_last_activity
        routes = await self.proxy.get_all_routes()
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/proxy.py", line 898, in get_all_routes
        resp = await self.api_request('', client=client)
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/proxy.py", line 862, in api_request
        result = await exponential_backoff(
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/utils.py", line 184, in exponential_backoff
        raise TimeoutError(fail_message)
    TimeoutError: Repeated api_request to proxy path "" failed.
  • The binder pod frequently has the event: Readiness probe failed: Get "http://10.244.0.39:8585/versions": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
    • less frequently, the liveness probe fails
  • Example binder log entry:
[E 220815 16:11:39 health:50] Error checking check_jupyterhub_api
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/binderhub/health.py", line 48, in wrapper
        res = await f(*args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/binderhub/health.py", line 26, in wrapper
        return await f(*args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/binderhub/health.py", line 140, in check_jupyterhub_api
        await AsyncHTTPClient().fetch(hub_url + "hub/api/health", request_timeout=3)
    tornado.curl_httpclient.CurlError: HTTP 599: Operation timed out after 3001 milliseconds with 0 bytes received
[W 220815 16:11:39 health:208] Unhealthy services: [{'service': 'JupyterHub API', 'ok': False}]
[E 220815 16:11:39 log:135] 503 GET /health (anonymous@34.70.178.124) 3002.97ms
[E 220815 16:11:52 launcher:122] Error accessing Hub API (using https://hub.mybinder.turing.ac.uk/hub/api/users/quantstack-xeus-cling-t5c6vxn0): HTTP 599: Operation timed out after 20000 milliseconds with 0 bytes received
  • Example proxy pod logs:
16:11:00.278 [ConfigProxy] error: Failed to get custom error page: Error: connect ETIMEDOUT 10.0.84.44:80
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1148:16) {
  errno: -110,
  code: 'ETIMEDOUT',
  syscall: 'connect',
  address: '10.0.84.44',
  port: 80
}
16:11:00.278 [ConfigProxy] error: Failed to get custom error page: Error: connect ETIMEDOUT 10.0.84.44:80
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1148:16) {
  errno: -110,
  code: 'ETIMEDOUT',
  syscall: 'connect',
  address: '10.0.84.44',
  port: 80
}
16:11:00.278 [ConfigProxy] error: 503 GET /user/ines-spacy-course-j0l8zx6m/api/kernels/de027600-a0a4-4b2a-aace-361f98171436/channels connect ETIMEDOUT 10.244.1.92:8888
16:11:55.574 [ConfigProxy] error: 503 GET /user/ines-spacy-course-grkob94g/api/kernels/e1a7749a-5700-491b-b9e1-0ad8c3ed290d/channels read EHOSTUNREACH
16:11:57.622 [ConfigProxy] error: 503 GET /user/ines-spacy-course-0go4ikfz/api/kernels/7e7c12fb-e0f3-4796-9bbc-747f82e32efe/channels read EHOSTUNREACH
16:12:05.041 [ConfigProxy] error: 503 GET /user/holoviz-panel-hp8okeac/lab/extensions/@voila-dashboards/jupyterlab-preview/static/remoteEntry.ec13ba4c2bfbf9fe7e8a.js socket hang up
16:13:05.043 [ConfigProxy] error: 503 GET /user/holoviz-panel-hp8okeac/lab/extensions/@voila-dashboards/jupyterlab-preview/static/remoteEntry.ec13ba4c2bfbf9fe7e8a.js socket hang up
16:13:21.190 [ConfigProxy] error: 503 GET /user/ipython-ipython-in-depth-y9gpcg8f/static/base/images/favicon-busy-1.ico socket hang up

We are also getting the same errors as in @minrk's original post:

16:32:02.803 [ConfigProxy] error: 503 GET /user/gromacs-online---intro-tutorial-jlyknr1m/api/metrics/v1 Error: getaddrinfo EAI_AGAIN proxy-patches
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:69:26) {
  errno: -3001,
  code: 'EAI_AGAIN',
  syscall: 'getaddrinfo',
  hostname: 'proxy-patches'
}

These sorts of errors also appear in the ingress controller logs:

2022/08/15 16:24:41 [error] 433#433: *8198296 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.104.63.72, server: hub.mybinder.turing.ac.uk, request: "GET /user/hanlpbot-hanlp-binder-4l9ipomc/api/kernels/916acfec-876b-4d01-a5f3-2fd2d3ec7a95/channels?session_id=1c240255-84db-4c0e-97b8-ec5b3d4fe519&token=6Sy10PBeQOeDSEbbZwSjUw HTTP/1.1", upstream: "http://10.244.0.24:8000/user/hanlpbot-hanlp-binder-4l9ipomc/api/kernels/916acfec-876b-4d01-a5f3-2fd2d3ec7a95/channels?session_id=1c240255-84db-4c0e-97b8-ec5b3d4fe519&token=6Sy10PBeQOeDSEbbZwSjUw", host: "hub.mybinder.turing.ac.uk"

The nodes are now higher-spec than before, so we can rule that out as a cause.
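
One way to see the intermittency by hand is to probe the hub health endpoint repeatedly from a throwaway pod in the cluster; the service name, port and namespace below follow the standard chart layout and are assumptions:

# Repeated probes should mostly return 200 quickly, with occasional timeouts on an unhealthy network
kubectl run -it --rm curl-test -n <namespace> --restart=Never \
  --image=curlimages/curl --command -- sh -c \
  'for i in $(seq 1 50); do curl -sS -o /dev/null --max-time 3 -w "%{http_code} %{time_total}s\n" http://hub:8081/hub/api/health; done'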

@callummole
Contributor

After some googling, I've learned that CPU throttling can be a property of the container rather than the node it is running on, so perhaps it would not show up in the Azure metrics I looked at before. Similar issues are happening on the turing-staging cluster, so it could be some interaction between a container and Azure. I'm not entirely sure what the next investigation step is there.

We could also try node-local-dns. I know you were skeptical, @minrk, but it seems worth a try given that it is suggested in a few cases (e.g. 1, 2), and I'm currently not sure what else to try.
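
A rough way to check whether a particular container is being CPU-throttled is to read its cgroup CPU stats from inside the pod (pod name is a placeholder, the image needs a shell and cat, and the path assumes cgroup v1; on cgroup v2 nodes it is /sys/fs/cgroup/cpu.stat). If Prometheus is scraping cAdvisor, container_cpu_cfs_throttled_periods_total shows the same thing.

# nr_throttled / throttled_time greater than zero means the container has been hitting its CPU limit
kubectl exec -n <namespace> <pod-name> -- cat /sys/fs/cgroup/cpu/cpu.stat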

@manics
Member

manics commented Aug 15, 2022

What CNI are you using?
https://medium.com/asos-techblog/an-aks-performance-journey-part-2-networking-it-out-e253f5bb4f69
mentions problems with DNS race conditions and NodeJS, similar to your other links, but it also says

In January 2021 Microsoft released an update to the default settings for Azure CNI which changes the default to use Transparent Mode. This new default is applied to all new clusters, and is automatically applied to any existing clusters when they are upgraded.
This change places Pod to Pod communication over layer 3 (L3) instead of the old layer 2 (L2) bridge, using IP routes which the CNI adds. The benefits look very promising, the main ones related to our discussion being:

It provides mitigation for the conntrack race conditions we’ve discussed without the need to set up NodeLocal DNSCache.
It eliminates some additional 5s DNS latency which the old Bridge mode introduced due to a ‘just in time’ setup.

@callummole
Contributor

callummole commented Aug 16, 2022

The cluster network uses kubenet, not Azure CNI: https://docs.microsoft.com/en-us/azure/aks/concepts-network#azure-virtual-networks.

From reading around, it seems that specifying Azure CNI (where every pod gets allocated an IP) is preferred, though I don't see why using kubenet would cause our error.
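
For reference, the plugin in use can be confirmed from the Azure CLI (resource group and cluster name are placeholders):

# Prints "kubenet" or "azure"
az aks show --resource-group <resource-group> --name <cluster-name> \
  --query networkProfile.networkPlugin -o tsv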

@callummole
Contributor

Did some more unfruitful debugging today. It resulted in two changes, but neither of them seems to help. Logging it here in case it is useful.

  1. ClusterRoles:

I was following the Kubernetes docs on Debugging DNS Resolution. I noticed that the coredns ClusterRole does not have the expected permission to get nodes.

➜  ~ k describe clusterrole system:coredns -n kube-system
Name:         system:coredns
Labels:       addonmanager.kubernetes.io/mode=Reconcile
              kubernetes.io/bootstrapping=rbac-defaults
Annotations:  <none>
PolicyRule:
  Resources                        Non-Resource URLs  Resource Names  Verbs
  ---------                        -----------------  --------------  -----
  endpoints                        []                 []              [list watch]
  namespaces                       []                 []              [list watch]
  pods                             []                 []              [list watch]
  services                         []                 []              [list watch]
  endpointslices.discovery.k8s.io  []                 []              [list watch]

An example coredns ClusterRole spec is given here: https://github.com/kelseyhightower/kubernetes-the-hard-way/blob/master/deployments/coredns-1.7.0.yaml

So I changed the ClusterRole to look as the docs suggest:

➜  ~ k describe clusterrole system:coredns -n kube-system
Name:         system:coredns
Labels:       addonmanager.kubernetes.io/mode=Reconcile
              kubernetes.io/bootstrapping=rbac-defaults
Annotations:  <none>
PolicyRule:
  Resources                        Non-Resource URLs  Resource Names  Verbs
  ---------                        -----------------  --------------  -----
  nodes                            []                 []              [get]
  endpoints                        []                 []              [list watch]
  namespaces                       []                 []              [list watch]
  pods                             []                 []              [list watch]
  services                         []                 []              [list watch]
  endpointslices.discovery.k8s.io  []                 []              [list watch]

Though this hasn't helped.
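
For the record, the same change can be expressed as a one-line patch rather than an interactive edit (a sketch, equivalent to the rule added above):

# Append a rule allowing coredns to get nodes to the system:coredns ClusterRole
kubectl patch clusterrole system:coredns --type=json -p='[
  {"op": "add", "path": "/rules/-",
   "value": {"apiGroups": [""], "resources": ["nodes"], "verbs": ["get"]}}
]'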

  2. Associated the subnet with the AKS network security group.

This Microsoft ticket presents very similarly. Our coredns logs show the following warnings:

[WARNING] No files matching import glob pattern: custom/*.override
[WARNING] No files matching import glob pattern: custom/*.override

Note: the warnings are likely due to missing imports and are unimportant; see coredns issue #3600.

But there are also errors in the logs, like:

[ERROR] plugin/errors: 2 www.hanlp.com. A: read udp 10.244.1.236:58932->168.63.129.16:53: i/o timeout
W0816 15:22:54.664468       1 reflector.go:442] pkg/mod/k8s.io/client-go@v0.23.1/tools/cache/reflector.go:167: watch of *v1.Service ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0816 15:22:54.664464       1 reflector.go:442] pkg/mod/k8s.io/client-go@v0.23.1/tools/cache/reflector.go:167: watch of *v1.EndpointSlice ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0816 15:22:54.664464       1 reflector.go:442] pkg/mod/k8s.io/client-go@v0.23.1/tools/cache/reflector.go:167: watch of *v1.Namespace ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
[ERROR] plugin/errors: 2 youtube.com. AAAA: read udp 10.244.1.236:45116->10.255.255.255:53: i/o timeout
[ERROR] plugin/errors: 2 youtube.com. A: read udp 10.244.1.236:59947->10.255.255.255:53: i/o timeout
[ERROR] plugin/errors: 2 youtube.com. AAAA: read udp 10.244.1.236:33466->10.255.255.255:53: i/o timeout
[ERROR] plugin/errors: 2 youtube.com. A: read udp 10.244.1.236:45049->10.255.255.255:53: i/o timeout
[ERROR] plugin/errors: 2 www.hanlp.com. AAAA: read udp 10.244.1.236:42698->168.63.129.16:53: i/o timeout
[ERROR] plugin/errors: 2 www.hanlp.com. AAAA: read udp 10.244.1.236:53056->168.63.129.16:53: i/o timeout
[ERROR] plugin/errors: 2 www.hanlp.com. AAAA: read udp 10.244.1.236:34372->168.63.129.16:53: i/o timeout

Following along with the ticket, I noticed that the Azure subnet was not associated with the relevant AKS network security group. I have changed this to match the AKS NSG.
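
The association can be made with the Azure CLI along these lines (resource group, VNet, subnet and NSG names are placeholders):

# Attach the AKS-managed network security group to the cluster subnet
az network vnet subnet update --resource-group <resource-group> \
  --vnet-name <vnet-name> --name <subnet-name> \
  --network-security-group <aks-nsg-name>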

@callummole
Contributor

FYI, I have raised a support ticket with Azure about this.

@consideRatio
Member

Related, for reference

@manics
Member

manics commented Oct 28, 2022

I regularly see intermittent issues with Turing, sometimes on launch, sometimes after the server is already running (JupyterLab shows the server connection error dialog). The GitHub workflows almost always return an error status for Turing staging and production: the deploy completes but the verification check fails.

Since this is likely to leave a poor impression on users who end up on Turing, should we remove it from federationRedirect?

@consideRatio
Member

I think that would be good!

manics added a commit to manics/mybinder.org-deploy that referenced this issue Oct 30, 2022
consideRatio mentioned this issue Nov 24, 2022
@rgaiacs
Collaborator

rgaiacs commented Mar 30, 2023

I believe we can close this given #2396 and #2528.

manics closed this as completed Mar 30, 2023