koordlet does not support cgroup v2 #1346

Closed
xuctom opened this issue Jun 2, 2023 · 2 comments · Fixed by #1353
Labels: area/koordlet, kind/bug
Milestone: v1.3

Comments

xuctom commented Jun 2, 2023

What happened:
Installed Koordinator with Helm following https://koordinator.sh/docs/installation/, but the koordlet pod stays in the Error state and keeps restarting.
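For reference, the documented Helm steps from the linked installation docs look like the following (the chart version is assumed to match the v1.2.0 app version reported below; the exact flags used are not recorded in this report):

$ helm repo add koordinator-sh https://koordinator.sh/charts
$ helm repo update
$ helm install koordinator koordinator-sh/koordinator --version 1.2.0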
What you expected to happen:
The koordlet pod should be Running.
How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • App version: v1.2.0
  • Kubernetes version (use kubectl version): v1.23.0
  • docker/containerd version: podman 3.4.4
  • OS (e.g: cat /etc/os-release): Fedora Linux 35
  • Install details (e.g. helm install args):
  • Node environment (for koordlet/runtime-proxy issue):
    • Containerd/Docker version: cri-o 1.23.0
    • Cgroup driver: cgroupfs

Here is the log of the koordlet pod:

I0602 02:42:49.381644 1554037 feature_gate.go:245] feature gates: &{map[Accelerators:true BECPUEvict:true BEMemoryEvict:true CgroupReconcile:true]}
I0602 02:42:49.381771 1554037 main.go:70] Setting up client for koordlet
I0602 02:42:49.392910 1554037 koordlet.go:76] NODE_NAME is master1.com,start time 1.685673769e+09
I0602 02:42:49.392934 1554037 koordlet.go:79] sysconf: &{CgroupRootDir:/host-cgroup/ CgroupKubePath:kubepods/ SysRootDir:/host-sys/ SysFSRootDir:/host-sys-fs/ ProcRootDir:/proc/ VarRunRootDir:/host-var-run/ NodeNameOverride: RuntimeHooksConfigDir:/host-etc-hookserver/ ContainerdEndPoint: DockerEndPoint:},agentMode:dsMode
I0602 02:42:49.392960 1554037 koordlet.go:80] kernel version INFO : {IsAnolisOS:false}
I0602 02:42:49.465764 1554037 koordlet.go:102] can not detect cgroup driver from 'kubepods' cgroup name
I0602 02:42:49.712760 1554037 cgroup_driver_linux.go:132] Cgroup driver is not specify in kubelet config file, use default: 'cgroupfs'
I0602 02:42:49.712785 1554037 koordlet.go:122] Node master1.com use 'cgroupfs' as cgroup driver
I0602 02:42:49.770023 1554037 callback_runner.go:95] states informer callback runtime-hooks-reconciler has registered for type RegisterTypeAllPods
I0602 02:42:49.770054 1554037 runtimehooks.go:125] runtime hook plugin GPUEnvInject enable false
I0602 02:42:49.770066 1554037 hooks.go:50] hook BatchResource is registered
I0602 02:42:49.770069 1554037 hooks.go:50] hook BatchResource is registered
I0602 02:42:49.770072 1554037 hooks.go:50] hook BatchResource is registered
I0602 02:42:49.770079 1554037 reconciler.go:200] register reconcile function set fundamental cgroups value for batch pod (pod cpu shares) finished, info: level=pod, resourceType=cpu.shares, filter=podQOS, conditions=[BE LS ]
I0602 02:42:49.770099 1554037 reconciler.go:200] register reconcile function set fundamental cgroups value for batch pod (pod cfs quota) finished, info: level=pod, resourceType=cpu.cfs_quota_us, filter=podQOS, conditions=[BE LS ]
I0602 02:42:49.770105 1554037 reconciler.go:200] register reconcile function set fundamental cgroups value for batch pod (pod memory limit) finished, info: level=pod, resourceType=memory.limit_in_bytes, filter=podQOS, conditions=[BE LS ]
I0602 02:42:49.770110 1554037 reconciler.go:200] register reconcile function set fundamental cgroups value for batch pod (container cpu shares) finished, info: level=container, resourceType=cpu.shares, filter=podQOS, conditions=[BE LS ]
I0602 02:42:49.770115 1554037 reconciler.go:200] register reconcile function set fundamental cgroups value for batch pod (container cfs quota) finished, info: level=container, resourceType=cpu.cfs_quota_us, filter=podQOS, conditions=[BE LS ]
I0602 02:42:49.770121 1554037 reconciler.go:200] register reconcile function set fundamental cgroups value for batch pod (container memory limit) finished, info: level=container, resourceType=memory.limit_in_bytes, filter=podQOS, conditions=[BE LS ]
I0602 02:42:49.770124 1554037 runtimehooks.go:125] runtime hook plugin BatchResource enable true
I0602 02:42:49.770129 1554037 hooks.go:50] hook GroupIdentity is registered
I0602 02:42:49.770164 1554037 bvt.go:74] update system supported info to false for plugin GroupIdentity, supported msg resource not found
I0602 02:42:49.770172 1554037 reconciler.go:200] register reconcile function reconcile pod level cpu bvt value finished, info: level=pod, resourceType=cpu.bvt_warp_ns, filter=none, conditions=[]
I0602 02:42:49.770178 1554037 reconciler.go:200] register reconcile function reconcile kubeqos level cpu bvt value finished, info: level=kubeqos, resourceType=cpu.bvt_warp_ns, filter=none, conditions=[]
I0602 02:42:49.770181 1554037 runtimehooks.go:125] runtime hook plugin GroupIdentity enable true
I0602 02:42:49.770185 1554037 hooks.go:50] hook CPUSetAllocator is registered
I0602 02:42:49.770188 1554037 hooks.go:50] hook CPUSetAllocator is registered
I0602 02:42:49.770191 1554037 hooks.go:50] hook CPUSetAllocator is registered
I0602 02:42:49.770196 1554037 reconciler.go:200] register reconcile function set container cpuset and unset container cpu quota if needed finished, info: level=container, resourceType=cpuset.cpus, filter=podQOS, conditions=[LSE LSR]
I0602 02:42:49.770207 1554037 reconciler.go:166] register reconcile function unset pod cpu quota if needed finished, info: level=pod, resourceType=cpu.cfs_quota_us, add conditions=[LSE LSR]
I0602 02:42:49.770211 1554037 runtimehooks.go:125] runtime hook plugin CPUSetAllocator enable true
I0602 02:42:49.770215 1554037 callback_runner.go:95] states informer callback runtime-hooks-rule-node-slo has registered for type RegisterTypeNodeSLOSpec
I0602 02:42:49.770219 1554037 callback_runner.go:95] states informer callback runtime-hooks-rule-node-topo has registered for type RegisterTypeNodeTopology
I0602 02:42:49.770406 1554037 main.go:101] Starting the koordlet daemon
I0602 02:42:49.770414 1554037 koordlet.go:154] Starting daemon
I0602 02:42:49.770416 1554037 main.go:91] Starting prometheus server on :9316
I0602 02:42:49.770461 1554037 states_informer.go:137] setup statesInformer
I0602 02:42:49.770480 1554037 states_informer.go:139] starting callback runner
I0602 02:42:49.770485 1554037 states_informer.go:143] starting informer plugins
I0602 02:42:49.770500 1554037 states_informer.go:131] plugin nodeTopoInformer has been setup
I0602 02:42:49.770530 1554037 states_informer.go:131] plugin nodeInformer has been setup
I0602 02:42:49.770535 1554037 states_informer.go:131] plugin podsInformer has been setup
I0602 02:42:49.770577 1554037 states_informer.go:131] plugin nodeMetricInformer has been setup
I0602 02:42:49.770590 1554037 states_informer.go:131] plugin nodeSLOInformer has been setup
I0602 02:42:49.770595 1554037 states_informer.go:179] starting informer plugin nodeSLOInformer
I0602 02:42:49.770600 1554037 states_informer.go:179] starting informer plugin nodeTopoInformer
I0602 02:42:49.770604 1554037 states_informer.go:179] starting informer plugin nodeInformer
I0602 02:42:49.770607 1554037 states_informer.go:179] starting informer plugin podsInformer
I0602 02:42:49.770610 1554037 states_informer.go:179] starting informer plugin nodeMetricInformer
I0602 02:42:49.770614 1554037 states_informer.go:148] waiting for informer syncing
I0602 02:42:49.770629 1554037 states_nodeslo.go:90] starting node slo informer
I0602 02:42:49.770637 1554037 states_nodeslo.go:92] node slo informer started
I0602 02:42:49.770661 1554037 states_nodemetric.go:168] starting nodeMetricInformer
I0602 02:42:49.770657 1554037 states_pods.go:90] starting pod informer
I0602 02:42:49.770687 1554037 reflector.go:219] Starting reflector *v1alpha1.NodeSLO (12h0m0s) from pkg/mod/k8s.io/client-go@v0.22.6/tools/cache/reflector.go:167
I0602 02:42:49.770696 1554037 states_node.go:89] starting node informer
I0602 02:42:49.770697 1554037 reflector.go:255] Listing and watching *v1alpha1.NodeSLO from pkg/mod/k8s.io/client-go@v0.22.6/tools/cache/reflector.go:167
I0602 02:42:49.770701 1554037 states_node.go:91] node informer started
I0602 02:42:49.770702 1554037 states_noderesourcetopology.go:111] starting node topo informer
I0602 02:42:49.770748 1554037 reflector.go:219] Starting reflector *v1.Node (12h0m0s) from pkg/mod/k8s.io/client-go@v0.22.6/tools/cache/reflector.go:167
I0602 02:42:49.770756 1554037 reflector.go:255] Listing and watching *v1.Node from pkg/mod/k8s.io/client-go@v0.22.6/tools/cache/reflector.go:167
I0602 02:42:49.770776 1554037 reflector.go:219] Starting reflector *v1alpha1.NodeMetric (12h0m0s) from pkg/mod/k8s.io/client-go@v0.22.6/tools/cache/reflector.go:167
I0602 02:42:49.770808 1554037 reflector.go:255] Listing and watching *v1alpha1.NodeMetric from pkg/mod/k8s.io/client-go@v0.22.6/tools/cache/reflector.go:167
I0602 02:42:49.771044 1554037 metric_cache.go:867] expired metric data before 2023-06-02 02:12:49.770435351 +0000 UTC m=-1799.512586116 has been recycled, remaining in db size: nodeResCount=0, podResCount=0, containerResCount=0, beCPUResCount=0, podThrottledResCount=0, containerThrottledResCount=0, containerCPIResCount=0, containerPSIResCount=0, podPSIResCount=0
I0602 02:42:49.775295 1554037 states_nodeslo.go:125] update nodeSLO content: old null, new {"kind":"NodeSLO","apiVersion":"slo.koordinator.sh/v1alpha1","metadata":{"name":"master1.com","uid":"284efc4c-c385-4ff3-9b6a-e83043830453","resourceVersion":"596876358","generation":1,"creationTimestamp":"2023-06-01T10:35:59Z","managedFields":[{"manager":"koordinator-manager","operation":"Update","apiVersion":"slo.koordinator.sh/v1alpha1","time":"2023-06-01T10:35:59Z","fieldsType":"FieldsV1","fieldsV1":{"f:spec":{".":{},"f:cpuBurstStrategy":{".":{},"f:cfsQuotaBurstPercent":{},"f:cfsQuotaBurstPeriodSeconds":{},"f:cpuBurstPercent":{},"f:policy":{},"f:sharePoolThresholdPercent":{}},"f:extensions":{".":{},"f:Object":{}},"f:resourceQOSStrategy":{},"f:resourceUsedThresholdWithBE":{".":{},"f:cpuSuppressPolicy":{},"f:cpuSuppressThresholdPercent":{},"f:enable":{},"f:memoryEvictThresholdPercent":{}}}}}]},"spec":{"resourceUsedThresholdWithBE":{"enable":true,"cpuSuppressThresholdPercent":65,"cpuSuppressPolicy":"cpuset","memoryEvictThresholdPercent":70},"resourceQOSStrategy":{"lsrClass":{"cpuQOS":{"enable":false,"groupIdentity":0},"memoryQOS":{"enable":false,"minLimitPercent":0,"lowLimitPercent":0,"throttlingPercent":0,"wmarkRatio":0,"wmarkScalePermill":50,"wmarkMinAdj":0,"priorityEnable":0,"priority":0,"oomKillGroup":0},"resctrlQOS":{"enable":false,"catRangeStartPercent":0,"catRangeEndPercent":100,"mbaPercent":100}},"lsClass":{"cpuQOS":{"enable":false,"groupIdentity":0},"memoryQOS":{"enable":false,"minLimitPercent":0,"lowLimitPercent":0,"throttlingPercent":0,"wmarkRatio":0,"wmarkScalePermill":50,"wmarkMinAdj":0,"priorityEnable":0,"priority":0,"oomKillGroup":0},"resctrlQOS":{"enable":false,"catRangeStartPercent":0,"catRangeEndPercent":100,"mbaPercent":100}},"beClass":{"cpuQOS":{"enable":false,"groupIdentity":0},"memoryQOS":{"enable":false,"minLimitPercent":0,"lowLimitPercent":0,"throttlingPercent":0,"wmarkRatio":0,"wmarkScalePermill":50,"wmarkMinAdj":0,"priorityEnable":0,"priority":0,"oomKillGroup":0},"resctrlQOS":{"enable":false,"catRangeStartPercent":0,"catRangeEndPercent":100,"mbaPercent":100}}},"cpuBurstStrategy":{"policy":"none","cpuBurstPercent":1000,"cfsQuotaBurstPercent":300,"cfsQuotaBurstPeriodSeconds":-1,"sharePoolThresholdPercent":50},"systemStrategy":{"minFreeKbytesFactor":100,"watermarkScaleFactor":150,"memcgReapBackGround":0},"extensions":{"Object":null}},"status":{}}
I0602 02:42:49.775332 1554037 states_nodeslo.go:66] create NodeSLO {"kind":"NodeSLO","apiVersion":"slo.koordinator.sh/v1alpha1","metadata":{"name":"master1.com","uid":"284efc4c-c385-4ff3-9b6a-e83043830453","resourceVersion":"596876358","generation":1,"creationTimestamp":"2023-06-01T10:35:59Z","managedFields":[{"manager":"koordinator-manager","operation":"Update","apiVersion":"slo.koordinator.sh/v1alpha1","time":"2023-06-01T10:35:59Z","fieldsType":"FieldsV1","fieldsV1":{"f:spec":{".":{},"f:cpuBurstStrategy":{".":{},"f:cfsQuotaBurstPercent":{},"f:cfsQuotaBurstPeriodSeconds":{},"f:cpuBurstPercent":{},"f:policy":{},"f:sharePoolThresholdPercent":{}},"f:extensions":{".":{},"f:Object":{}},"f:resourceQOSStrategy":{},"f:resourceUsedThresholdWithBE":{".":{},"f:cpuSuppressPolicy":{},"f:cpuSuppressThresholdPercent":{},"f:enable":{},"f:memoryEvictThresholdPercent":{}}}}}]},"spec":{"resourceUsedThresholdWithBE":{"enable":true,"cpuSuppressThresholdPercent":65,"cpuSuppressPolicy":"cpuset","memoryEvictThresholdPercent":70},"resourceQOSStrategy":{},"cpuBurstStrategy":{"policy":"none","cpuBurstPercent":1000,"cfsQuotaBurstPercent":300,"cfsQuotaBurstPeriodSeconds":-1,"sharePoolThresholdPercent":50},"extensions":{"Object":null}},"status":{}}
I0602 02:42:49.775379 1554037 rule.go:81] applying 3 rules with new RegisterTypeNodeSLOSpec, detail: {"resourceUsedThresholdWithBE":{"enable":true,"cpuSuppressThresholdPercent":65,"cpuSuppressPolicy":"cpuset","memoryEvictThresholdPercent":70},"resourceQOSStrategy":{"lsrClass":{"cpuQOS":{"enable":false,"groupIdentity":0},"memoryQOS":{"enable":false,"minLimitPercent":0,"lowLimitPercent":0,"throttlingPercent":0,"wmarkRatio":0,"wmarkScalePermill":50,"wmarkMinAdj":0,"priorityEnable":0,"priority":0,"oomKillGroup":0},"resctrlQOS":{"enable":false,"catRangeStartPercent":0,"catRangeEndPercent":100,"mbaPercent":100}},"lsClass":{"cpuQOS":{"enable":false,"groupIdentity":0},"memoryQOS":{"enable":false,"minLimitPercent":0,"lowLimitPercent":0,"throttlingPercent":0,"wmarkRatio":0,"wmarkScalePermill":50,"wmarkMinAdj":0,"priorityEnable":0,"priority":0,"oomKillGroup":0},"resctrlQOS":{"enable":false,"catRangeStartPercent":0,"catRangeEndPercent":100,"mbaPercent":100}},"beClass":{"cpuQOS":{"enable":false,"groupIdentity":0},"memoryQOS":{"enable":false,"minLimitPercent":0,"lowLimitPercent":0,"throttlingPercent":0,"wmarkRatio":0,"wmarkScalePermill":50,"wmarkMinAdj":0,"priorityEnable":0,"priority":0,"oomKillGroup":0},"resctrlQOS":{"enable":false,"catRangeStartPercent":0,"catRangeEndPercent":100,"mbaPercent":100}}},"cpuBurstStrategy":{"policy":"none","cpuBurstPercent":1000,"cfsQuotaBurstPercent":300,"cfsQuotaBurstPeriodSeconds":-1,"sharePoolThresholdPercent":50},"systemStrategy":{"minFreeKbytesFactor":100,"watermarkScaleFactor":150,"memcgReapBackGround":0},"extensions":{"Object":null}}
I0602 02:42:49.775391 1554037 rule.go:58] runtime hook plugin BatchResource update rule true, new rule &{true}
I0602 02:42:49.775401 1554037 rule.go:100] rule BatchResource is updated, run update callback for all 0 pods
I0602 02:42:49.775406 1554037 rule.go:88] system unsupported for rule GroupIdentity, do nothing during UpdateRules
I0602 02:42:49.871133 1554037 shared_informer.go:270] caches populated
I0602 02:42:49.871491 1554037 states_pods.go:123] pod informer started
E0602 02:42:49.871523 1554037 pleg.go:150] failed to watch path /host-cgroup/kubepods, err inotify_add_watch /host-cgroup/kubepods: no such file or directory
F0602 02:42:49.871543 1554037 states_pods.go:119] Unable to run the pleg: %!(EXTRA *fs.PathError=inotify_add_watch /host-cgroup/kubepods: no such file or directory)
goroutine 293 [running]:
k8s.io/klog/v2.stacks(0x1)
/go/pkg/mod/k8s.io/klog/v2@v2.10.0/klog.go:1026 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x3ade280, 0x3, {0x0, 0x0}, 0xc000ac61c0, 0x0, {0x2d94c74, 0xc001359f30}, 0x0, 0x0)
/go/pkg/mod/k8s.io/klog/v2@v2.10.0/klog.go:975 +0x63d
k8s.io/klog/v2.(*loggingT).printf(0x32375b3a65706970, 0x34353638, {0x0, 0x0}, {0x0, 0x0}, {0x23cc119, 0x18}, {0xc001359f30, 0x1, ...})
/go/pkg/mod/k8s.io/klog/v2@v2.10.0/klog.go:753 +0x1e5
k8s.io/klog/v2.Fatalf(...)
/go/pkg/mod/k8s.io/klog/v2@v2.10.0/klog.go:1514
github.com/koordinator-sh/koordinator/pkg/koordlet/statesinformer.(*podsInformer).Start.func2()
/go/src/github.com/koordinator-sh/koordinator/pkg/koordlet/statesinformer/states_pods.go:119 +0xcd
created by github.com/koordinator-sh/koordinator/pkg/koordlet/statesinformer.(*podsInformer).Start
/go/src/github.com/koordinator-sh/koordinator/pkg/koordlet/statesinformer/states_pods.go:117 +0x4cf

goroutine 1 [select]:
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext({0x278b210, 0xc001069300}, 0xc00114e318, 0x85740a)
/go/pkg/mod/k8s.io/apimachinery@v0.22.6/pkg/util/wait/wait.go:655 +0xe7
k8s.io/apimachinery/pkg/util/wait.poll({0x278b210, 0xc001069300}, 0x80, 0x856245, 0xc0004fe6d0)
/go/pkg/mod/k8s.io/apimachinery@v0.22.6/pkg/util/wait/wait.go:591 +0x9a
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext({0x278b210, 0xc001069300}, 0x20, 0xc000132c00)
/go/pkg/mod/k8s.io/apimachinery@v0.22.6/pkg/util/wait/wait.go:542 +0x49
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntil(0x2323760, 0xc001170000, 0xc000e5dd70)
/go/pkg/mod/k8s.io/apimachinery@v0.22.6/pkg/util/wait/wait.go:533 +0x7c
k8s.io/client-go/tools/cache.WaitForCacheSync(0x0, {0xc000010388, 0x1, 0x1})
/go/pkg/mod/k8s.io/client-go@v0.22.6/tools/cache/shared_informer.go:255 +0xae
github.com/koordinator-sh/koordinator/pkg/koordlet.(*daemon).Run(0xc001140420, 0xc00103e4e0)
/go/src/github.com/koordinator-sh/koordinator/pkg/koordlet/koordlet.go:169 +0x247
main.main()
/go/src/github.com/koordinator-sh/koordinator/cmd/koordlet/main.go:102 +0x5b9
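
The fatal error above (inotify_add_watch /host-cgroup/kubepods: no such file or directory) is the classic symptom of a node on the unified cgroup v2 hierarchy: on v2 there are no per-controller mounts, so the v1-style kubepods path that koordlet watches does not exist, and the interface files have different names (cpu.max instead of cpu.cfs_quota_us, memory.max instead of memory.limit_in_bytes). Fedora has defaulted to cgroup v2 since Fedora 31, which matches the Fedora 35 node above. A quick check to confirm which hierarchy a node uses (a diagnostic sketch, not part of the original report):

# "cgroup2fs" means the unified v2 hierarchy; "tmpfs" means legacy v1
$ stat -fc %T /sys/fs/cgroup/

# equivalently, cgroup.controllers exists only at the root of a v2 hierarchy
$ test -f /sys/fs/cgroup/cgroup.controllers && echo "cgroup v2" || echo "cgroup v1"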

xuctom added the kind/bug label Jun 2, 2023
xuctom closed this as completed Jun 2, 2023
xuctom reopened this Jun 2, 2023

saintube (Member) commented Jun 2, 2023

/area koordlet
Thanks for the feedback. We will fix this issue soon.

saintube (Member) commented Jun 5, 2023

@xuchenSingle Hi, the bug should be fixed by #1353. Please let us know if the issue persists.
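
After upgrading to a build that includes the fix, the koordlet pods can be re-checked as below (the koordinator-system namespace and the koordlet DaemonSet name are the defaults of the Helm chart; adjust if your install differs):

$ kubectl -n koordinator-system get pods -o wide
$ kubectl -n koordinator-system logs daemonset/koordlet --tail=50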

koordinator-bot (bot) added this to the v1.3 milestone Jun 6, 2023