ci: intermittent k8s 1.19 and 1.20 test failures #1962

Closed
consideRatio opened this issue Dec 22, 2020 · 12 comments · Fixed by #2060

consideRatio (Member) commented Dec 22, 2020

I saw this issue in the CI system; it was showing up in a k8s 1.20 test that was just added. I'm not sure what the cause is yet, but as a first step it would be nice to print more info than the status code, and then debug further.

    @pytest.fixture(scope="function")
    def jupyter_user(api_request):
        """
        A temporary unique JupyterHub user
        """
        username = "testuser-" + str(uuid.uuid4())
        r = api_request.post("/users/" + username)
        assert r.status_code == 201
        yield username
        r = api_request.delete("/users/" + username)
>       assert r.status_code == 204
E       assert 400 == 204
E         +400
E         -204

https://github.com/jupyterhub/zero-to-jupyterhub-k8s/pull/1961/checks?check_run_id=1594410192
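
A minimal sketch of what that extra logging could look like, assuming the api_request fixture returns requests-style responses with a .text body (the actual change may end up looking different):

    import uuid

    import pytest


    @pytest.fixture(scope="function")
    def jupyter_user(api_request):
        """
        A temporary unique JupyterHub user
        """
        username = "testuser-" + str(uuid.uuid4())
        r = api_request.post("/users/" + username)
        # Include the response body in the assertion message so intermittent
        # failures show more than just the status code
        assert r.status_code == 201, f"Failed to create user: {r.status_code} {r.text}"
        yield username
        r = api_request.delete("/users/" + username)
        assert r.status_code == 204, f"Failed to delete user: {r.status_code} {r.text}"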

Update

The failure is intermittent, occurring about 50% of the time in our CI tests. I added some logging and such, allowing the failure to be inspected in this run.

The conclusion I draw is that there are some pod termination issues in k8s 1.20, and it says...

    State:          Waiting
      Reason:       PodInitializing
    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted.  The container used to be Running
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000

To me this is very confusing, but googling for k8s 1.20 and this message led me to an upstream k8s issue. For reference, we are currently using k3s 1.20.0+k3s2.

kubernetes/kubernetes#97288

consideRatio changed the title from "ci: intermittent failure to debug" to "ci: k8s 1.20 test failure" on Dec 22, 2020
consideRatio changed the title from "ci: k8s 1.20 test failure" to "ci: intermittent k8s 1.20 test failure" on Dec 22, 2020
consideRatio (Member Author):

I think this will be resolved by a future patch version of k8s 1.20; see kubernetes/kubernetes#97998.

consideRatio (Member Author):

Yikes! Now it is even worse!!!!

At least k8s 1.19 and 1.20 are acting up now. I assume a patch release introduced the same issue in k8s 1.19 as well.

consideRatio changed the title from "ci: intermittent k8s 1.20 test failure" to "ci: intermittent k8s 1.19 and 1.20 test failures" on Feb 25, 2021
dhirschfeld (Contributor) commented May 1, 2021

I've just started seeing this error too :(

For me, when I follow the singleuser container logs, everything seems OK and then it shuts down, but I have no idea why:

    [I 2021-05-01 10:05:30.852 SingleUserLabApp serverapp:2105] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
    WARNING: Insecure writes have been enabled via environment variable 'JUPYTER_ALLOW_INSECURE_WRITES'! If this is not intended, remove the variable or set its value to 'False'.
    [I 2021-05-01 10:05:30.932 SingleUserLabApp mixins:548] Updating Hub with activity every 300 seconds
    [C 2021-05-01 10:05:56.958 SingleUserLabApp serverapp:1652] received signal 15, stopping
    [I 2021-05-01 10:05:56.996 SingleUserLabApp serverapp:1889] Shutting down 0 kernels
    [I 2021-05-01 10:05:56.997 SingleUserLabApp serverapp:1904] Shutting down 0 terminals
    rpc error: code = NotFound desc = an error occurred when try to find container "842a52a6aade7aad35e3db54b193384288266fa9e0381897948e99735d1feee3": not found

This is with AKS 1.20.2

consideRatio (Member Author) commented May 1, 2021

Unexpected shutdowns could be a different error than what I observed, which was an expected shutdown that turned bad. If you run kubectl describe pod <pod name> and see the same message, then I think we have the same error.

    State:          Waiting
      Reason:       PodInitializing
    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted.  The container used to be Running
      Exit Code:    137

Note, though, that I have not observed these intermittent failures for quite a while now in our CI system, and since the CI system uses the latest available k3s distribution of k8s, I think it's safe to say that the issue has been fixed in a k8s patch release and has propagated all the way to the k3s distribution of k8s we are using in the CI system.

In general, I think the solution is to upgrade k8s past the patch version you currently have if you experience these issues; based on a comment in #2060, k8s 1.20.4 has the fix. I'll go ahead and close this issue, as it is an upstream issue that has been resolved.

dhirschfeld (Contributor):

I suspect there may be some other issue causing the shutdown, but then I run into this one, which makes it hard to debug further:

    State:          Waiting
      Reason:       ContainerCreating
    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted.  The container used to be Running
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000

dhirschfeld (Contributor) commented May 1, 2021

AKS has recently released 1.20.5, so I'm going to give that a go. Redeploying our AKS cluster will be a bit painful though :(

I'll post back if that fixes it...

consideRatio (Member Author):

@dhirschfeld does upgrading k8s on AKS mean redeploying the entire cluster?!

dhirschfeld (Contributor):

> @dhirschfeld does upgrading k8s on AKS mean redeploying the entire cluster?!

I thought so, but it looks like it doesn't! 🎉 😅

dhirschfeld (Contributor) commented May 1, 2021

...well, that fixed the k8s error:

    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 01 May 2021 13:44:02 +1000
      Finished:     Sat, 01 May 2021 13:44:31 +1000

I'm still none the wiser as to why it's shutting down (with exit code 0?). More debugging to do...

Edit: The web page says:

Spawn failed: Server at http://10.231.64.156:8888/user/dhirschf/ didn't respond in 30 seconds

So maybe I need to increase a timeout somewhere.

consideRatio (Member Author):

singleuser.startTimeout, I think; I usually have it at 300 seconds. Isn't that the default in Z2JH? Hmm...
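
For reference, a minimal config.yaml sketch of setting that via the Helm chart's singleuser.startTimeout value (the number here is just an example; bump it if 300 isn't enough):

    singleuser:
      # Seconds KubeSpawner waits for the user server to start before the spawn fails
      startTimeout: 300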

dhirschfeld (Contributor):

Turns out it was just user error in my start script 😬

Whilst I did run into this issue, it wasn't the cause of the unexpected termination.

consideRatio (Member Author):

Thanks for the followup @dhirschfeld! ❤️

For any AKS users out there: don't use 1.20.2; upgrade to 1.20.5, or I think 1.20.4 may be enough as well.
