kubelet waits for node lister to sync at least once #94087
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: derekwaynecarr. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@sjenning the idea is to do something similar to what is seen here; we just need to work out the pros and cons of where we wait for the node lister to sync at least once.
after following up with @deads2k, it seems like a good time to move away from list/watch and to a node informer with a filter; then we can just use the regular HasSynced function. will iterate; just trying to determine the best location to wait for the node informer to sync at least once without causing issues (I do not want to do it in the kubelet_getters if possible)
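For illustration, a minimal sketch of that direction using client-go; the name newFilteredNodeLister and the wiring are assumptions for this example, not code from this PR:

package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	corelisters "k8s.io/client-go/listers/core/v1"
	"k8s.io/client-go/tools/cache"
)

// newFilteredNodeLister (hypothetical name) builds an informer restricted to
// this kubelet's own Node object and returns its lister plus the informer's
// HasSynced func, which reports whether the cache has been populated at
// least once.
func newFilteredNodeLister(client kubernetes.Interface, nodeName string, stopCh <-chan struct{}) (corelisters.NodeLister, cache.InformerSynced) {
	factory := informers.NewSharedInformerFactoryWithOptions(client, 0,
		informers.WithTweakListOptions(func(options *metav1.ListOptions) {
			// Watch only the single Node this kubelet runs on.
			options.FieldSelector = fields.OneTermEqualSelector("metadata.name", nodeName).String()
		}))
	nodeInformer := factory.Core().V1().Nodes()
	lister, hasSynced := nodeInformer.Lister(), nodeInformer.Informer().HasSynced
	factory.Start(stopCh)
	return lister, hasSynced
}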
force-pushed from b510d22 to 5a39f16 (Compare)
this might also fix #93338
force-pushed from 5a39f16 to bdd6a2f (Compare)
force-pushed from 3de6020 to 7847919 (Compare)
@sjenning take a look. the idea is to move the predicate function to call GetNode, which, if we have a kubeclient, ensures we have synced the node lister at least once (otherwise, it errors). I am still trying to think through an edge case in the scheduler scenario that we may miss, but this would ensure that we do not fall back to the kubelet "initialNode" and stop tripping up the scheduler.
@@ -245,7 +256,7 @@ func (kl *Kubelet) GetNode() (*v1.Node, error) {
 // zero capacity, and the default labels.
 func (kl *Kubelet) getNodeAnyWay() (*v1.Node, error) {
 	if kl.kubeClient != nil {
-		if n, err := kl.nodeLister.Get(string(kl.nodeName)); err == nil {
+		if n, err := kl.GetNode(); err == nil {
pretty sure this whole function collapses to just return kl.GetNode(), since GetNode() contains the standalone logic
GetNode does not guarantee a response in the case where kubeClient is not nil, whereas getNodeAnyWay does (it falls back to the initial node).
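A hypothetical paraphrase of that distinction (the helper signature is invented for this sketch, not the kubelet's actual code):

package sketch

import (
	v1 "k8s.io/api/core/v1"
)

// getNodeAnyWay must always produce a node, so it cannot collapse into
// GetNode: GetNode may fail when a kube client exists but the lister has
// not yet returned this node.
func getNodeAnyWay(haveClient bool, getNode, initialNode func() (*v1.Node, error)) (*v1.Node, error) {
	if haveClient {
		if n, err := getNode(); err == nil {
			return n, nil
		}
	}
	// Fall back to the locally constructed initial node so callers always
	// receive a usable node object.
	return initialNode()
}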
pkg/kubelet/kubelet_getters.go
Outdated
@@ -235,6 +237,15 @@ func (kl *Kubelet) GetNode() (*v1.Node, error) {
 	if kl.kubeClient == nil {
 		return kl.initialNode(context.TODO())
 	}
+	// if we have a valid kube client, we wait up to 5s for initial lister to sync
+	if !kl.nodeHasSynced() {
+		err := wait.PollImmediate(time.Second, 5*time.Second, func() (bool, error) {
choose a weird number so we can find this later if it comes up. How about 8 seconds?
force-pushed from 524527b to ee3f26d (Compare)
/retest
i am fine removing my own hold on this, as it's possible what i saw was related to another unreliable aspect of CI at that time. /hold cancel
…m-release-1.18 Cherry pick of #94087 upstream release 1.18
…87-upstream-release-1.19 Automated cherry pick of #94087: node sync at least once
…87-upstream-release-1.20 Automated cherry pick of #94087: node sync at least once
klog.Infof("kubelet nodes sync") | ||
return true | ||
} | ||
klog.Infof("kubelet nodes not sync") |
this message is posted a lot in the logs; i think it can be omitted.
Line 447 klog.Infof("kubelet nodes sync") is removed in #98137
it's still present at HEAD:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L452
	// if we have a valid kube client, we wait for initial lister to sync
	if !kl.nodeHasSynced() {
		err := wait.PollImmediate(time.Second, maxWaitForAPIServerSync, func() (bool, error) {
			return kl.nodeHasSynced(), nil
		})
		if err != nil {
			return nil, fmt.Errorf("nodes have not yet been read at least once, cannot construct node object")
		}
	}
this poll is problematic. GetNode is called in a hot loop, which means the poll runs on every call. the sequence looks like:
- GetNode()
  - Poll for nodeHasSynced()
  - Poll ....
- GetNode()
  - Poll for nodeHasSynced()
  - Poll ....

if there is a valid client, arguably the informer sync check should be outside of this loop.
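A minimal sketch of that suggestion, assuming a hypothetical waitForNodeSync helper called once at kubelet startup rather than inside GetNode:

package sketch

import (
	"fmt"

	"k8s.io/client-go/tools/cache"
)

// waitForNodeSync blocks once, at startup, until the node informer has
// synced or stopCh is closed, so a per-call poll inside GetNode becomes
// unnecessary.
func waitForNodeSync(hasSynced cache.InformerSynced, stopCh <-chan struct{}) error {
	// cache.WaitForCacheSync returns false if stopCh closes before every
	// supplied HasSynced func reports true.
	if !cache.WaitForCacheSync(stopCh, hasSynced) {
		return fmt.Errorf("node informer failed to sync")
	}
	return nil
}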
Wouldn't it be enough to check HasSynced once before creating the Kubelet object?
it would only sync after the api server is up.
if the API server is down, there is nothing to sync. Am I missing something?
BTW we are discussing how to improve this in #99336
if the kubelet is managing the first API server instance in the cluster as a static pod, and the HasSynced check happens before the Kubelet object is created, then for such a kubelet instance the single HasSynced check will always be false.
maybe that's what we want, given this is the first node in the cluster and there is no need to sync it on the first kubelet run (ever).
for subsequent runs of the same kubelet, or for additional kubelets, the check should pass if there is an API server.
the Minikube maintainers (cc @medyagh) found that this change introduced a performance regression. i can see the informer wait as something desired, but i think the logic here can be improved. we also backported this change across the support skew, and since we don't have direct performance tests, nobody saw it until now.
If these pods are re-scheduled automatically, can the problem be solved? @derekwaynecarr
What type of PR is this?
/kind bug
What this PR does / why we need it:
If the kubelet has a kube client, it will wait to ensure the node lister has synced at least once before trusting its data.
Which issue(s) this PR fixes:
Fixes #92067
Special notes for your reviewer:
still under discussion: where best to wait (in the getter, or in main kubelet startup)
Does this PR introduce a user-facing change?: