Description
What happened:
While migrating from 0.10 to 0.16.0:
- all node feature labels got removed
- kubectl get nodefeatures -n node-feature-discovery was unresponsive at the time (likely because the cluster has 4000 nodes and each NodeFeature CR object is about 130 KB by default)
What you expected to happen:
How to reproduce it (as minimally and precisely as possible): Run the nfd Helm chart with default settings on a 4000-node cluster.
Anything else we need to know?:
There were extensive informer sync errors in the nfd-master logs (apparently timing out after 60s). This is likely because a LIST of NodeFeatures is a very expensive call (each object is very large and the cluster has many nodes).
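To put rough numbers on why the LIST might not finish in time, here is a back-of-the-envelope sketch using only the figures from this report (4000 nodes, about 130 KB per object, 60s timeout). It assumes "KB" means 1000 bytes and ignores serialization, compression, and API server overhead, so treat the result as an order of magnitude only. The function names are made up for illustration.

```python
# Rough estimate of the payload of one unpaginated LIST of NodeFeature objects.
# All inputs come from the report; the helper names are hypothetical.

NODES = 4000            # cluster size
OBJECT_SIZE_KB = 130    # approximate size of one NodeFeature object
TIMEOUT_S = 60          # observed informer sync timeout


def list_payload_mb(nodes: int, object_size_kb: float) -> float:
    """Approximate size of a full LIST response in MB (1 MB = 1000 KB)."""
    return nodes * object_size_kb / 1000


def required_throughput_mb_s(payload_mb: float, timeout_s: float) -> float:
    """Sustained transfer rate needed to deliver the payload before the timeout."""
    return payload_mb / timeout_s


payload = list_payload_mb(NODES, OBJECT_SIZE_KB)
rate = required_throughput_mb_s(payload, TIMEOUT_S)
print(f"{payload:.0f} MB")    # 520 MB
print(f"{rate:.1f} MB/s")     # 8.7 MB/s
```

So a single full LIST would be on the order of half a gigabyte, and every retry after a timeout would start that transfer again.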
Attaching logs: nfd-master.log
My suspicion is that nfd-master does not wait for the informer cache to sync (the first informer sync error occurs exactly 60s after the process starts), treats the lack of a response as an "empty set of labels", and clears the labels. (I'm not familiar with the inner workings of the codebase, though, so this is just a theory.)
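To make the theory concrete, here is a minimal sketch of the distinction I have in mind. This is not nfd-master's code (I have not read it). It is a hypothetical Python model in which `FeatureCache`, `reconcile`, and the node names are all invented for illustration. The point is only that "cache not synced yet" should be handled differently from "node has no features".

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

Labels = Dict[str, str]


@dataclass
class FeatureCache:
    """Hypothetical stand-in for an informer-backed cache of NodeFeature data."""
    synced: bool = False
    features: Dict[str, Labels] = field(default_factory=dict)

    def get(self, node: str) -> Optional[Labels]:
        # None means "unknown" (initial sync has not completed).
        # An empty dict means "known, and the node has no feature labels".
        if not self.synced:
            return None
        return self.features.get(node, {})


def reconcile(current: Labels, cache: FeatureCache, node: str) -> Labels:
    """Return the labels the node should carry after this reconcile pass."""
    desired = cache.get(node)
    if desired is None:
        # Cache not synced: leave existing labels alone instead of clearing them.
        return current
    return desired


existing = {"feature.node.kubernetes.io/cpu-cpuid.AVX": "true"}
cache = FeatureCache()

# Before the sync completes, the existing labels are preserved.
print(reconcile(existing, cache, "node-1") == existing)   # True

# After the sync, the cache is authoritative, including for "no features".
cache.features["node-1"] = dict(existing)
cache.synced = True
print(reconcile(existing, cache, "node-1") == existing)   # True
print(reconcile(existing, cache, "node-2"))               # {}
```

If the real code behaves like `reconcile` without the `None` check, a sync timeout would look exactly like every node losing its features. In Go, client-go's `cache.WaitForCacheSync` is the usual way to block until informers have completed their initial sync.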
💡 We don't see the issue on much smaller clusters.
💡 We have not yet tried v0.16.2 (release notes mention it fixes a node removal issue, but it's not clear what the root cause was there).
Environment:
- Kubernetes version (use kubectl version): v1.23.17
- Cloud provider or hardware configuration: Bare-metal
- OS (e.g: cat /etc/os-release): Not applicable
- Kernel (e.g. uname -a): Not applicable
- Install tools: installed via Helm
- Network plugin and version (if this is a network-related bug): Not applicable
- Others: Not applicable