ci: intermittent k8s 1.19 and 1.20 test failures #1962

Closed
consideRatio opened this issue Dec 22, 2020 · 12 comments · Fixed by #2060

consideRatio (Member) commented Dec 22, 2020

I saw this issue in the CI system; it was showing up in a k8s 1.20 test that was just added. I'm not sure what the cause is yet, but as a first step it would be nice to print more info than the status code, and then debug further.

    @pytest.fixture(scope="function")
    def jupyter_user(api_request):
        """
        A temporary unique JupyterHub user
        """
        username = "testuser-" + str(uuid.uuid4())
        r = api_request.post("/users/" + username)
        assert r.status_code == 201
        yield username
        r = api_request.delete("/users/" + username)
>       assert r.status_code == 204
E       assert 400 == 204
E         +400
E         -204

https://github.com/jupyterhub/zero-to-jupyterhub-k8s/pull/1961/checks?check_run_id=1594410192
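
A minimal sketch of what that extra logging could look like, assuming the api_request fixture returns requests-style responses with a .text body (the actual change may end up looking different):

    import uuid

    import pytest


    @pytest.fixture(scope="function")
    def jupyter_user(api_request):
        """
        A temporary unique JupyterHub user
        """
        username = "testuser-" + str(uuid.uuid4())
        r = api_request.post("/users/" + username)
        # Include the response body in the assertion message so intermittent
        # failures show more than just the status code
        assert r.status_code == 201, f"Failed to create user: {r.status_code} {r.text}"
        yield username
        r = api_request.delete("/users/" + username)
        assert r.status_code == 204, f"Failed to delete user: {r.status_code} {r.text}"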

Update

The failure is intermittent, occurring about 50% of the time in our CI tests. I added some logging and such, allowing the failure to be inspected in this run.

The conclusion I draw is that there are some pod termination issues in k8s 1.20, and it says...

    State:          Waiting
      Reason:       PodInitializing
    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted.  The container used to be Running
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000

To me this is very confusing, but googling for k8s 1.20 and this message led me to an upstream k8s issue. For reference, we are currently using k3s 1.20.0+k3s2.

kubernetes/kubernetes#97288

consideRatio changed the title from "ci: intermittent failure to debug" to "ci: k8s 1.20 test failure" on Dec 22, 2020
consideRatio changed the title from "ci: k8s 1.20 test failure" to "ci: intermittent k8s 1.20 test failure" on Dec 22, 2020
consideRatio (Member Author):

I think this will be resolved by a future patch version of k8s 1.20; see kubernetes/kubernetes#97998.

consideRatio (Member Author):

Yikes! Now it is even worse!!!!

At least k8s 1.19 and 1.20 are acting up now. I assume a patch release introduced the same issue in k8s 1.19 as well.

consideRatio changed the title from "ci: intermittent k8s 1.20 test failure" to "ci: intermittent k8s 1.19 and 1.20 test failures" on Feb 25, 2021
dhirschfeld (Contributor) commented May 1, 2021

I've just started seeing this error too :(

For me, when I follow the singleuser container logs, everything seems OK and then it shuts down, but I have no idea why:

    [I 2021-05-01 10:05:30.852 SingleUserLabApp serverapp:2105] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
    WARNING: Insecure writes have been enabled via environment variable 'JUPYTER_ALLOW_INSECURE_WRITES'! If this is not intended, remove the variable or set its value to 'False'.
    [I 2021-05-01 10:05:30.932 SingleUserLabApp mixins:548] Updating Hub with activity every 300 seconds
    [C 2021-05-01 10:05:56.958 SingleUserLabApp serverapp:1652] received signal 15, stopping
    [I 2021-05-01 10:05:56.996 SingleUserLabApp serverapp:1889] Shutting down 0 kernels
    [I 2021-05-01 10:05:56.997 SingleUserLabApp serverapp:1904] Shutting down 0 terminals
    rpc error: code = NotFound desc = an error occurred when try to find container "842a52a6aade7aad35e3db54b193384288266fa9e0381897948e99735d1feee3": not found

This is with AKS 1.20.2

consideRatio (Member Author) commented May 1, 2021

Unexpected shutdowns could be a different error than what I observed, which was an expected shutdown that turned bad. If you run kubectl describe pod <pod name> and see the same message, then I think we have the same error.

    State:          Waiting
      Reason:       PodInitializing
    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted.  The container used to be Running
      Exit Code:    137

Note, though, that I have not observed these intermittent failures for quite a while now in our CI system, and since the CI system uses the latest available k3s distribution of k8s, I think it's safe to say that the issue has been fixed in a k8s patch release and has propagated all the way to the k3s distribution of k8s we are using in the CI system.

In general, I think the solution is to upgrade k8s past the patch version you currently have if you experience these issues; based on a comment in #2060, k8s 1.20.4 has the fix. I'll go ahead and close this issue, as it is an upstream issue that has been resolved.

dhirschfeld (Contributor):

I suspect there may be some other issue causing the shutdown, but then I run into this one, which makes it hard to debug further:

    State:          Waiting
      Reason:       ContainerCreating
    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted.  The container used to be Running
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000

dhirschfeld (Contributor) commented May 1, 2021

AKS has recently released 1.20.5, so I'm going to give that a go. Redeploying our AKS cluster will be a bit painful though :(

I'll post back if that fixes it...

consideRatio (Member Author):

@dhirschfeld does upgrading k8s on AKS mean redeploying the entire cluster?!

dhirschfeld (Contributor):

> @dhirschfeld does upgrading k8s on AKS mean redeploying the entire cluster?!

I thought so, but it looks like it doesn't! 🎉 😅

dhirschfeld (Contributor) commented May 1, 2021

...well, that fixed the k8s error:

    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 01 May 2021 13:44:02 +1000
      Finished:     Sat, 01 May 2021 13:44:31 +1000

I'm still none the wiser as to why it's shutting down (with exit code 0?). More debugging to do...

Edit: The web page says:

Spawn failed: Server at http://10.231.64.156:8888/user/dhirschf/ didn't respond in 30 seconds

So maybe I need to increase a timeout somewhere.

consideRatio (Member Author):

singleuser.startTimeout, I think; I usually have it at 300 seconds. Isn't that the default in Z2JH? Hmm...
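
For reference, a minimal config.yaml sketch of setting that via the Helm chart's singleuser.startTimeout value (the number here is just an example; bump it if 300 isn't enough):

    singleuser:
      # Seconds KubeSpawner waits for the user server to start before the spawn fails
      startTimeout: 300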

dhirschfeld (Contributor):

Turns out it was just user error in my start script 😬

Whilst I did run into this issue, it wasn't the cause of the unexpected termination.

consideRatio (Member Author):

Thanks for the followup @dhirschfeld! ❤️

For any AKS users out there: don't use 1.20.2; upgrade to 1.20.5, or I think 1.20.4 may be enough as well.
