ci: intermittent k8s 1.19 and 1.20 test failures #1962
Comments
I think this will be resolved by a future patch version of k8s 1.20; see kubernetes/kubernetes#97998. |
Yikes, now it is even worse! Both k8s 1.19 and 1.20 are acting up now. I assume a patch release introduced the same issue in k8s 1.19 as well. |
I've just started seeing this error too :( For me, when I follow the singleuser container logs, everything seems OK and then the container shuts down, but I've no idea why.
This is with AKS 1.20.2 |
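For anyone trying to reproduce this, a minimal sketch of following a singleuser server's logs; the `jhub` namespace and the username in the pod name are placeholders, not taken from the reports above:

```bash
# Follow the logs of a running singleuser pod (Z2JH names these pods
# jupyter-<username>; namespace and username here are placeholders).
kubectl logs -f --namespace jhub jupyter-exampleuser

# If the container already terminated and restarted, the previous
# container's logs may still be retrievable.
kubectl logs --previous --namespace jhub jupyter-exampleuser
```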
Unexpected shutdowns could be a different error from what I observed, which was an expected shutdown that turned bad. If you would do ...
Note though that I have not observed these intermittent failures for quite a while now in our CI system, and since the CI system uses the latest available k3s distribution of k8s, I think it's safe to say that the issue has been fixed in a k8s patch release and has propagated all the way to the k3s distribution we use in CI. In general, the solution is to upgrade k8s past the patch version you currently have if you experience these issues; based on a comment in #2060, k8s 1.20.4 has the fix. I'll go ahead and close this issue, as it is an upstream issue that has been resolved. |
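A quick way to check whether your cluster is already past the fixed version; this is a sketch using standard kubectl, with no cluster-specific assumptions:

```bash
# The VERSION column shows the kubelet version running on each node;
# the fix is reported to be in k8s 1.20.4 and later.
kubectl get nodes

# The control plane (server) version is also reported here.
kubectl version
```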
I suspect there may be some other issue causing the shutdown, but then I run into this one, making it hard to debug further:
|
AKS has recently released 1.20.5, so I'm going to give that a go. Redeploying our AKS cluster will be a bit painful though :( I'll post back if that fixes it... |
@dhirschfeld does upgrading k8s on AKS mean redeploying the entire cluster?!!??! |
I thought so, but it looks like it doesn't! 🎉 😅 |
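For reference, AKS supports in-place upgrades; a minimal sketch with the Azure CLI, where the resource group and cluster names are placeholders:

```bash
# List the Kubernetes versions this cluster can be upgraded to.
az aks get-upgrades --resource-group myResourceGroup --name myAKSCluster --output table

# Upgrade the control plane and node pools in place (no redeploy needed).
az aks upgrade --resource-group myResourceGroup --name myAKSCluster \
  --kubernetes-version 1.20.5
```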
...well, that fixed the k8s error:
I'm still none the wiser why it's shutting down (with exit code 0?). More debugging to do... Edit: the web page says:
So maybe I need to increase a timeout somewhere. |
singleuser.startTimeout, I think; I usually have 300 seconds. Isn't that the default in Z2JH? Hmm... |
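A minimal sketch of raising that timeout via a Helm override; the release name, namespace, and chart reference are placeholders for however you installed Z2JH:

```bash
# Raise the singleuser server start timeout to 600 seconds; the release
# and namespace names here are placeholders.
helm upgrade jhub jupyterhub/jupyterhub --namespace jhub --reuse-values \
  --set singleuser.startTimeout=600
```

The same setting can of course live under `singleuser.startTimeout` in a values file passed with `-f` instead of `--set`.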
Turns out it was just user error in my start script 😬 Whilst I did run into this issue, it wasn't the cause of the unexpected termination. |
Thanks for the follow-up @dhirschfeld! ❤️ For any AKS users out there: don't use 1.20.2; upgrade to 1.20.5, or I think 1.20.4 may be enough as well. |
I saw this issue in the CI system; it was showing up in a k8s 1.20 test that was just added. I'm not sure what the cause is yet, but as a first step it would be nice to print more info than the status code, and then debug further.
https://github.com/jupyterhub/zero-to-jupyterhub-k8s/pull/1961/checks?check_run_id=1594410192
Update
The failure is intermittent, occurring about 50% of the time in our CI tests. I added some logging, allowing the failure to be inspected in this run.
The conclusion I draw is that there are some pod termination issues in k8s 1.20, and it says...
To me, this is very confusing, but googling for k8s 1.20 and this message led me to a k8s issue. For reference, we are currently using k3s 1.20.0+k3s2.
kubernetes/kubernetes#97288
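For inspecting a pod termination like this, a minimal sketch with standard kubectl commands; the pod name and namespace are placeholders:

```bash
# Show the pod's status and last state (including exit code and reason),
# plus recent events; pod name and namespace are placeholders.
kubectl describe pod jupyter-exampleuser --namespace jhub

# List recent events in the namespace, oldest first.
kubectl get events --namespace jhub --sort-by=.lastTimestamp
```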