
kubelet consistently timing out on attempting to Destroy cgroups #92766

Closed
haircommander opened this issue Jul 2, 2020 · 6 comments · Fixed by #92862
Labels
kind/bug: Categorizes issue or PR as related to a bug.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
Milestone
v1.19
Comments

@haircommander (Contributor) commented Jul 2, 2020

What happened:
The CRI-O CI tests have been in bad shape recently. While debugging, I found that the kubelet logs are filled with:

Timed out while waiting for StopUnit(kubepods-besteffort-pod867fd309_03ba_4715_a044_29393f495cea.slice) completion signal from dbus. Continuing...
grep 'Timed out' /tmp/kubelet.log | wc -l
352562

AFAICT, this comes from the combination of bumping to go 1.14.4 and a03db63: I ran two different PRs, each dropping one of those changes, and neither showed similar problems.

I am fairly certain this is NOT a problem with kubernetes directly, but rather some odd interaction between go 1.14 and either libcontainer, go-systemd, or godbus. But I figure it can be opened here to start the conversation.
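
For background, the timeout message comes from the pattern of asking systemd over D-Bus to stop the pod's transient unit and then waiting on a completion channel. Below is a minimal sketch of that pattern, assuming go-systemd's dbus bindings; the function name, v22 import path, and one-second timeout are illustrative, not the exact libcontainer code:

```go
// Minimal sketch of the StopUnit wait pattern; not the actual libcontainer code.
package stopunit

import (
	"fmt"
	"time"

	systemdDbus "github.com/coreos/go-systemd/v22/dbus"
)

func stopPodUnit(conn *systemdDbus.Conn, unitName string) error {
	statusChan := make(chan string, 1)
	// Ask systemd over D-Bus to stop the transient slice created for the pod.
	if _, err := conn.StopUnit(unitName, "replace", statusChan); err != nil {
		return err
	}
	select {
	case status := <-statusChan:
		if status != "done" {
			return fmt.Errorf("error stopping unit %q: got %q", unitName, status)
		}
	case <-time.After(time.Second):
		// The completion signal never arrives over D-Bus: this is the branch
		// that produces the log line quoted above, and the cgroup is left behind.
		fmt.Printf("Timed out while waiting for StopUnit(%s) completion signal from dbus. Continuing...\n", unitName)
	}
	return nil
}
```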

What you expected to happen:
StopUnit should not time out

How to reproduce it (as minimally and precisely as possible):
  1. Run a node with cgroup v1.
  2. Build hyperkube with go 1.14.4 (as is now required).
  3. Run hack/local-up-cluster.sh.
  4. Create and remove a pod, and observe that the cgroup fails to be torn down.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
    master
  • Cloud provider or hardware configuration:
    aws
  • OS (e.g: cat /etc/os-release):
ID=fedora
VERSION_ID=30
VERSION_CODENAME=""
PLATFORM_ID="platform:f30"
PRETTY_NAME="Fedora 30 (Cloud Edition)"
ANSI_COLOR="0;34"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:30"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f30/system-administrators-guide/"
SUPPORT_URL="https://fedoraproject.org/wiki/Communicating_and_getting_help"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=30
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=30
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
VARIANT="Cloud Edition"
VARIANT_ID=cloud

Though this also happens on our RHEL 7 boxes.

  • Kernel (e.g. uname -a):
uname -a
Linux ip-172-18-11-215.ec2.internal 5.6.13-100.fc30.x86_64 #1 SMP Fri May 15 00:36:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
    build locally
  • Network plugin and version (if this is a network-related bug):
  • Others:
@haircommander haircommander added the kind/bug Categorizes issue or PR as related to a bug. label Jul 2, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 2, 2020
@haircommander (Contributor, Author) commented:

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 2, 2020
@haircommander (Contributor, Author) commented:

cc @giuseppe @kolyshkin @mrunalp

@haircommander (Contributor, Author) commented:

I have tried setting GODEBUG=asyncpreemptoff=1 (which disables Go 1.14's signal-based asynchronous preemption), to no avail thus far (though I may not be doing the ansible correctly).

@haircommander (Contributor, Author) commented:

Related: #92521

@giuseppe (Member) commented Jul 5, 2020

It is a regression in runc: opencontainers/runc#2503

@liggitt liggitt added this to the v1.19 milestone Jul 5, 2020
@liggitt liggitt added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jul 5, 2020
@giuseppe (Member) commented Jul 7, 2020

PR opened here: #92862

CRI-O tests: cri-o/cri-o#3858

giuseppe added a commit to giuseppe/kubernetes that referenced this issue Jul 9, 2020
When the systemd cgroup manager is used, controllers not handled by
systemd are created manually afterwards.
libcontainer didn't correctly clean up these cgroups, which were leaked
on cgroup v1.

Closes: kubernetes#92766

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
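
To make the commit message above concrete: with the systemd cgroup driver on cgroup v1, extra per-controller directories are created by hand for controllers systemd does not manage, and those directories have to be removed explicitly after the unit is stopped, roughly along these lines (a hypothetical sketch, not the actual #92862 diff; the function and paths map are assumptions):

```go
// Rough sketch of the cleanup idea; not the actual change in #92862.
package cgroupcleanup

import (
	"os"

	"golang.org/x/sys/unix"
)

// removeLeakedCgroups removes the per-controller cgroup v1 directories that were
// created manually (e.g. under /sys/fs/cgroup/<controller>/.../kubepods-besteffort-pod<uid>.slice)
// and that systemd does not remove when the unit is stopped.
func removeLeakedCgroups(paths map[string]string) error {
	var firstErr error
	for _, path := range paths {
		// A cgroupfs directory can be rmdir'ed once it has no processes left;
		// the kernel removes its control files together with the directory.
		if err := unix.Rmdir(path); err != nil && !os.IsNotExist(err) && firstErr == nil {
			firstErr = err
		}
	}
	return firstErr
}
```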
sharadg pushed a commit to sharadg/kubernetes that referenced this issue Oct 23, 2020