Turing unhealthy #2252
Comments
Since we've had evictions due to disk pressure on two nodes, I think the IOPS throttling issues discussed in Azure/AKS#1320 may be relevant. It could be worth checking the cluster logs in Azure to see if anything is spamming warnings or errors that could fill up the disk with logs. @callummole it may also be worth checking out the quota issues mentioned in Azure/AKS#1373. It's possible that picking a new node configuration is the right action to take; I'm not sure.
I think this may be related to the updates to the image-cleaner. Working on a possible fix.
I've had a look on Azure at some of the virtual machine scale set metrics shown in #1373. There are some spikes in use, but none of the %CPU or %IOPS figures reach levels one would be concerned about (max CPU used is about 25%, max IOPS consumed is about 1.5%).
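In case anyone wants to reproduce these numbers from the CLI rather than the portal UI, something like the following should work. This is only a sketch: the node resource group, scale set name, and exact metric names are assumptions and may need adjusting.

```bash
# List the scale sets backing the AKS node pools (node resource group is a placeholder)
az vmss list --resource-group MC_<resource-group>_<cluster>_<region> --query "[].name" -o tsv

# Pull max CPU and OS disk IOPS consumption for one scale set, in 5-minute buckets
az monitor metrics list \
  --resource "/subscriptions/<sub-id>/resourceGroups/MC_<resource-group>_<cluster>_<region>/providers/Microsoft.Compute/virtualMachineScaleSets/<vmss-name>" \
  --metric "Percentage CPU" "OS Disk IOPS Consumed Percentage" \
  --interval PT5M --aggregation Maximum -o table
```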
Strange. I can't think of what else could cause intermittent DNS issues. Recycling all nodes might help, but it hasn't seemed to yet.
I spent quite a bit of time poking around and found various issues, such as alan-turing-institute/hub23-deploy#445 and https://medium.com/@amirilovic/how-to-fix-node-dns-issues-5d4ec2e12e95. I didn't deploy node-local-dns, but I doubt that would fix the problem since we have both DNS issues and connection timeout issues (connection timeouts can explain DNS failures, but I don't think DNS can explain connection timeouts).

I can't seem to track down what's causing the issues. The network just seems fundamentally unreliable under negligible load. There are no errors or warnings in any networking component that I can find, nothing restarting, etc. The symptoms are exactly what's described at Azure/AKS#1320 (comment), but if we aren't eating up IOPS, I don't know what the issue could be.

I'm out of ideas other than 'start a new cluster', since even starting entirely new nodes didn't help. Does Turing have a support contract with Azure they can rely on?
I am considering a redeploy to UK South, since then we can take advantage of some reserved instances (and reduce cost). So perhaps now would be a good time to do that, perhaps with new node configs. As for the IOPS issue, there are certainly spikes; it's just that the load is always low. Perhaps it doesn't take much to cause the issue.
OK, if deploying a new cluster is desirable for other reasons, let's maybe do that and hope the issue gets fixed and we never need to understand it.
Post-deployment, we are still having intermittency issues, where turing.mybinder.org sometimes shows an error.
This seems to be concurrent with the log entry:
We are also getting the same errors as in @minrk's original post:
These sorts of errors also appear in the ingress controller logs:
The nodes are now higher spec than before, so we can rule out the node configuration as the issue.
After some googling, I've learned that CPU throttling can be a property of the container, rather than of the node it is running on, so perhaps it wouldn't show up in the Azure metrics I looked at before. Similar issues are happening on the

We could also try node-local-dns. I know you were skeptical, @minrk, but it seems worth a try given that it is suggested in a few cases (e.g. 1, 2)... and I'm currently not sure what else to try.
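If the throttling is happening at the container level, it should show up in the cgroup stats inside the affected pods rather than in the node/VM metrics. A minimal check, assuming cgroup v1 and placeholder pod/namespace names:

```bash
# nr_throttled / throttled_time steadily increasing means the container is being
# CPU-throttled by its limit, even if node-level CPU looks idle
kubectl exec -n <namespace> <pod-name> -- cat /sys/fs/cgroup/cpu/cpu.stat

# Also worth checking what CPU limits the pod actually has set
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'
```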
What CNI are you using?
The cluster network uses kubenet, not Azure CNI: https://docs.microsoft.com/en-us/azure/aks/concepts-network#azure-virtual-networks. From reading around it seems that specifying Azure CNI (where every pod gets allocated an IP) is preferred, though I don't see why kubenet would cause our error.
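For the record, which plugin the cluster is running can be confirmed from its network profile (resource group and cluster name are placeholders):

```bash
az aks show --resource-group <resource-group> --name <cluster-name> \
  --query "networkProfile.networkPlugin" -o tsv
# prints "kubenet" or "azure"
```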
Did some more unfruitful debugging today. It resulted in two changes, but neither of them seems to have helped. Logging it here in case it is useful.
I was following the Kubernetes docs on Debugging DNS resolution. I noticed that the coredns ClusterRole does not have the expected `get` permission on the `nodes` resource.
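The check from those docs boils down to running a throwaway pod with DNS tooling and resolving cluster services from it; a minimal version (the image is the one the docs used at the time and may have moved since):

```bash
# Run a pod with dig/nslookup and query cluster DNS from inside it
kubectl run dnsutils --image=gcr.io/kubernetes-e2e-test-images/dnsutils:1.3 \
  --restart=Never -- sleep 3600
kubectl exec -it dnsutils -- nslookup kubernetes.default
kubectl exec -it dnsutils -- cat /etc/resolv.conf

# Sanity-check the coredns pods and kube-dns service
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl get svc -n kube-system kube-dns
```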
An example coredns ClusterRole spec is given here: https://github.com/kelseyhightower/kubernetes-the-hard-way/blob/master/deployments/coredns-1.7.0.yaml. So I changed the clusterrole to look as the docs suggest:
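Roughly, the changed role follows the linked coredns-1.7.0.yaml; a sketch of the edit, assuming the role is named system:coredns as in the upstream manifest:

```bash
kubectl apply -f - <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: system:coredns
rules:
- apiGroups: [""]
  resources: ["endpoints", "services", "pods", "namespaces"]
  verbs: ["list", "watch"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get"]    # the permission that was missing
EOF
```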
Though this hasn't helped.
This Microsoft ticket presents very similarly, since our coredns logs are showing the following warnings:
Note: the warnings are likely due to missing imports and are unimportant (see coredns issue #3600). But there are also errors in the logs, like:
Following along with the ticket, I noticed that the Azure subnet was not associated with the relevant AKS network security group. I have changed this so the subnet uses the AKS NSG.
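For reference, that association corresponds to something like the following; the vnet, subnet, and NSG names are placeholders for our actual resources:

```bash
az network vnet subnet update \
  --resource-group <vnet-resource-group> \
  --vnet-name <vnet-name> \
  --name <aks-subnet-name> \
  --network-security-group <aks-nsg-name-or-id>
```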
FYI, I have raised a support ticket with Azure about this.
Related, for reference
I regularly see intermittent issues with Turing, sometimes on launch, sometimes after the server is already running (JupyterLab shows the server connection error dialog). The GitHub workflows almost always return an error status for Turing staging and production: the deploy completes but the verification check fails. Since this is likely to leave a poor impression on users who end up on Turing, should we remove it from the federation for now?
I think that would be good! |
The Turing cluster is unhealthy for a couple of reasons, causing health check failures and various errors.
There appear to be at least two issues:
- networking, with lots of internal DNS failures and connection timeouts.
- evictions, apparently tied to disk pressure (a quick check is sketched below):
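A quick way to surface the evictions and the node pressure condition (node name is a placeholder):

```bash
# Evicted pods show up as Failed; eviction events carry reason=Evicted
kubectl get pods --all-namespaces --field-selector=status.phase=Failed
kubectl get events --all-namespaces --field-selector=reason=Evicted

# Check whether a node is reporting DiskPressure
kubectl describe node <node-name> | grep -i pressure
```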
I'm not sure if this is just an unhealthy node or system pod, but I haven't been able to track it down. For now, turing is removed from the federation-redirect while we figure it out. @callummole