
v1.13 w/ CRIO on GitHub actions: failed to initialize top level QOS containers: root container [kubepods] doesn't exist #9304

Closed
MOZGIII opened this issue Sep 22, 2020 · 7 comments · Fixed by #9508
Assignees: tstromberg
Labels: co/runtime/crio (CRIO related issues), kind/bug (Categorizes issue or PR as related to a bug), priority/backlog (Higher priority than priority/awaiting-more-evidence)

Comments

MOZGIII commented Sep 22, 2020

We're having odd issues in our CI (vectordotdev/vector#4055) with minikube 1.13.1 and CRI-O.

After we switched from minikube 1.11.0 to 1.13.1, the CRI-O setup started failing on every K8s version we test against. Only CRI-O is affected; the other configurations work.

Locally everything works as expected.

I'd be happy to test-run a new version in our CI if needed.

Steps to reproduce the issue:

  1. Run this script on GitHub Actions: https://github.com/timberio/vector/blob/a14a80abe0791ab89330d7a23860a971989f48ae/scripts/ci-setup-minikube.sh
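
For context, the script essentially boils down to a docker-driver minikube start with CRI-O, roughly along these lines (the exact flags and versions live in the linked file, so treat these values as placeholders):

# Illustrative only: the real flags/versions are in scripts/ci-setup-minikube.sh
minikube start --driver=docker --container-runtime=crio --kubernetes-version=v1.18.3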

I can't repro it locally.

More info is available in our CI at Vector: https://github.com/timberio/vector/runs/1146423406?check_suite_focus=true

Full output of failed command:

Sorry, too much data. Please check our CI for logs (multiple runs).

@MOZGIII changed the title from "Odd issues at our CI with minikube 1.13.1 and CRI-O" to "Odd issues at Github Actions based CI with minikube 1.13.1 and CRI-O" on Sep 22, 2020
@tstromberg added the co/runtime/crio (CRIO related issues) and kind/support (Categorizes issue or PR as a support question) labels on Sep 22, 2020

tstromberg commented Sep 22, 2020

I see kubeadm exited with:

error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster

CRI-O basically has no useful logs:

* ==> CRI-O <==
* -- Logs begin at Mon 2020-09-21 21:52:04 UTC, end at Mon 2020-09-21 21:56:24 UTC. --
* Sep 21 21:52:25 minikube systemd[1]: Starting Container Runtime Interface for OCI (CRI-O)...
* Sep 21 21:52:25 minikube systemd[1]: Started Container Runtime Interface for OCI (CRI-O).
* 

kubelet (v1.18.3) mentions this fatal issue:

* Sep 21 21:56:23 minikube kubelet[26186]: E0921 21:56:23.584550   26186 cgroup_manager_linux.go:492] cgroup update failed failed to set supported cgroup subsystems for cgroup [kubepods]: failed to set config for supported subsystems : failed to write "2048" to "/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/cpu.shares": open /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/cpu.shares: no such file or directory
* Sep 21 21:56:23 minikube kubelet[26186]: F0921 21:56:23.584616   26186 kubelet.go:1386] Failed to start ContainerManager failed to initialize top level QOS containers: root container [kubepods] doesn't exist
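
If the node container is still up after the failed start, a quick sanity check (purely illustrative) is whether that kubepods cgroup directory was ever created inside the node:

# Illustrative check; the path comes straight from the kubelet error above
minikube ssh "ls -d /sys/fs/cgroup/cpu,cpuacct/kubepods.slice || ls /sys/fs/cgroup/cpu,cpuacct/"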

I have a suspicion that this may be related to recent work in the entrypoint, but it only comes up in unusual Docker-on-Linux configurations, such as running minikube within a kubelet- or Docker-managed container.

Unfortunately, as you noted, it is not trivial to reproduce this locally, but I believe we can fix it if you provide the output of:

docker logs minikube

I only need the lines between:

+ fix_cgroup_mounts

and:

+ fix_machine_id

If it is possible to do so, this would also be useful:

minikube ssh "cat /proc/self/mountinfo"
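
For example, a CI debug step along these lines would capture both after the failed start (just a sketch; the output file names are placeholders):

# Slice out the entrypoint's cgroup-fixing section, then grab the node's mountinfo
docker logs minikube 2>&1 | sed -n '/+ fix_cgroup_mounts/,/+ fix_machine_id/p' > entrypoint-cgroups.log || true
minikube ssh "cat /proc/self/mountinfo" > node-mountinfo.log || true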

Thank you!

@tstromberg added the triage/needs-information (Indicates an issue needs more information in order to work on it) label on Sep 22, 2020

MOZGIII commented Sep 22, 2020

What would be the right time to invoke that command? Right after minikube start fails?

@tstromberg changed the title from "Odd issues at Github Actions based CI with minikube 1.13.1 and CRI-O" to "v1.13.x CRIO on GitHub actions: failed to initialize top level QOS containers: root container [kubepods] doesn't exist" on Sep 22, 2020
@tstromberg changed the title from "v1.13.x CRIO on GitHub actions: failed to initialize top level QOS containers: root container [kubepods] doesn't exist" to "v1.13 w/ CRIO on GitHub actions: failed to initialize top level QOS containers: root container [kubepods] doesn't exist" on Sep 22, 2020
tstromberg commented Sep 22, 2020

@MOZGIII - yes, after the failed start.

/cc @priyawadhwa who has investigated similar issues in the past.

MOZGIII commented Sep 22, 2020

I prepared this PR for experiments: vectordotdev/vector#4064
I added what you requested in that branch: vectordotdev/vector@a7ee0cd.
Here's the CI run: https://github.com/timberio/vector/actions/runs/267197623

Update: here's another CI run with the right debug commands: https://github.com/timberio/vector/actions/runs/267219520

MOZGIII commented Oct 16, 2020

The problem persists with 1.14.0: https://github.com/timberio/vector/pull/4055/checks?check_run_id=1262088375

@tstromberg added the kind/bug (Categorizes issue or PR as related to a bug), good first issue (Denotes an issue ready for a new contributor, according to the "help wanted" guidelines), and help wanted (Denotes an issue that needs help from a contributor; must meet "help wanted" guidelines) labels, and removed the kind/support (Categorizes issue or PR as a support question) and triage/needs-information (Indicates an issue needs more information in order to work on it) labels on Oct 21, 2020
tstromberg commented Oct 21, 2020

The root cause is that minikube has hardcoded the root directories it expects the cgroups to be mounted from, and this configuration doesn't share those expectations:

632 620 0:42 /actions_job/0924fbbcf7b18d2a00c171482b4600747afc367a9dfbeac9d6b14b35cda80399 /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:263 master:24 - cgroup cgroup rw,memory

This particular line is to blame:

cgroup_mounts=$(egrep -o '(/docker|libpod_parent|/kubepods).*/sys/fs/cgroup.*' /proc/self/mountinfo || true)

We've had one PR since then that widened the regexp: #9092

I'll see about making it even wider.
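
Purely as an illustration (the real change is whatever lands in the follow-up PR, not this snippet), widening the alternation so it also matches the Actions runner's /actions_job/... mounts would look roughly like:

# Sketch only: adds /actions_job to the hardcoded roots matched by the entrypoint
cgroup_mounts=$(egrep -o '(/docker|libpod_parent|/kubepods|/actions_job).*/sys/fs/cgroup.*' /proc/self/mountinfo || true)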

@tstromberg self-assigned this on Oct 21, 2020
@tstromberg added the priority/backlog (Higher priority than priority/awaiting-more-evidence) label, and removed the good first issue and help wanted labels on Oct 21, 2020

MOZGIII commented Oct 22, 2020

Ah, got it, thanks for the response!
I wonder why this only happens with CRI-O, and why it only started with 1.13. Seems rather odd to me.
