
kubelet consistently timing out on attempting to Destroy cgroups #92766

Closed
haircommander opened this issue Jul 2, 2020 · 6 comments · Fixed by #92862
Labels
kind/bug: Categorizes issue or PR as related to a bug.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
Milestone
v1.19
Comments

@haircommander (Contributor) commented Jul 2, 2020

What happened:
The CRI-O CI tests have been in bad shape recently. While debugging, I found that the kubelet logs are filled with:

Timed out while waiting for StopUnit(kubepods-besteffort-pod867fd309_03ba_4715_a044_29393f495cea.slice) completion signal from dbus. Continuing...
grep 'Timed out' /tmp/kubelet.log | wc -l
352562

AFAICT, this comes from the combination of bumping to go 1.14.4 and a03db63: I ran two different PRs, each dropping one of those changes, and neither showed similar problems.

I am fairly certain this is NOT a problem with kubernetes directly, but rather some odd interaction between go 1.14 and either libcontainer, go-systemd, or godbus. But I figure it can be opened here to start the conversation.
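
For background, the timeout message comes from the pattern of asking systemd over D-Bus to stop the pod's transient unit and then waiting on a completion channel. Below is a minimal sketch of that pattern, assuming go-systemd's dbus bindings; the function name, v22 import path, and one-second timeout are illustrative, not the exact libcontainer code:

```go
// Minimal sketch of the StopUnit wait pattern; not the actual libcontainer code.
package stopunit

import (
	"fmt"
	"time"

	systemdDbus "github.com/coreos/go-systemd/v22/dbus"
)

func stopPodUnit(conn *systemdDbus.Conn, unitName string) error {
	statusChan := make(chan string, 1)
	// Ask systemd over D-Bus to stop the transient slice created for the pod.
	if _, err := conn.StopUnit(unitName, "replace", statusChan); err != nil {
		return err
	}
	select {
	case status := <-statusChan:
		if status != "done" {
			return fmt.Errorf("error stopping unit %q: got %q", unitName, status)
		}
	case <-time.After(time.Second):
		// The completion signal never arrives over D-Bus: this is the branch
		// that produces the log line quoted above, and the cgroup is left behind.
		fmt.Printf("Timed out while waiting for StopUnit(%s) completion signal from dbus. Continuing...\n", unitName)
	}
	return nil
}
```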

What you expected to happen:
StopUnit should not time out

How to reproduce it (as minimally and precisely as possible):
  1. Run a node with cgroup v1.
  2. Build hyperkube with go 1.14.4 (as is now required).
  3. Run hack/local-up-cluster.sh.
  4. Create and remove a pod, and observe that the cgroup fails to be torn down.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
    master
  • Cloud provider or hardware configuration:
    aws
  • OS (e.g: cat /etc/os-release):
ID=fedora
VERSION_ID=30
VERSION_CODENAME=""
PLATFORM_ID="platform:f30"
PRETTY_NAME="Fedora 30 (Cloud Edition)"
ANSI_COLOR="0;34"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:30"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f30/system-administrators-guide/"
SUPPORT_URL="https://fedoraproject.org/wiki/Communicating_and_getting_help"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=30
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=30
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
VARIANT="Cloud Edition"
VARIANT_ID=cloud

Though this also happens on our RHEL 7 boxes.

  • Kernel (e.g. uname -a):
uname -a
Linux ip-172-18-11-215.ec2.internal 5.6.13-100.fc30.x86_64 #1 SMP Fri May 15 00:36:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
    build locally
  • Network plugin and version (if this is a network-related bug):
  • Others:
@haircommander haircommander added the kind/bug Categorizes issue or PR as related to a bug. label Jul 2, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 2, 2020
@haircommander (Contributor, Author) commented:

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 2, 2020
@haircommander (Contributor, Author) commented:

cc @giuseppe @kolyshkin @mrunalp

@haircommander (Contributor, Author) commented:

I have tried setting GODEBUG=asyncpreemptoff=1 (which disables Go 1.14's signal-based asynchronous preemption), to no avail thus far (though I may not be doing the ansible correctly).

@haircommander (Contributor, Author) commented:

Related: #92521

@giuseppe (Member) commented Jul 5, 2020

It is a regression in runc: opencontainers/runc#2503

@liggitt liggitt added this to the v1.19 milestone Jul 5, 2020
@liggitt liggitt added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jul 5, 2020
@giuseppe (Member) commented Jul 7, 2020

PR opened here: #92862

CRI-O tests: cri-o/cri-o#3858

giuseppe added a commit to giuseppe/kubernetes that referenced this issue Jul 9, 2020
When the systemd cgroup manager is used, controllers not handled by
systemd are created manually afterwards.
libcontainer didn't correctly clean up these cgroups, which were leaked
on cgroup v1.

Closes: kubernetes#92766

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
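
To make the commit message above concrete: with the systemd cgroup driver on cgroup v1, extra per-controller directories are created by hand for controllers systemd does not manage, and those directories have to be removed explicitly after the unit is stopped, roughly along these lines (a hypothetical sketch, not the actual #92862 diff; the function and paths map are assumptions):

```go
// Rough sketch of the cleanup idea; not the actual change in #92862.
package cgroupcleanup

import (
	"os"

	"golang.org/x/sys/unix"
)

// removeLeakedCgroups removes the per-controller cgroup v1 directories that were
// created manually (e.g. under /sys/fs/cgroup/<controller>/.../kubepods-besteffort-pod<uid>.slice)
// and that systemd does not remove when the unit is stopped.
func removeLeakedCgroups(paths map[string]string) error {
	var firstErr error
	for _, path := range paths {
		// A cgroupfs directory can be rmdir'ed once it has no processes left;
		// the kernel removes its control files together with the directory.
		if err := unix.Rmdir(path); err != nil && !os.IsNotExist(err) && firstErr == nil {
			firstErr = err
		}
	}
	return firstErr
}
```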
sharadg pushed a commit to sharadg/kubernetes that referenced this issue Oct 23, 2020