kubelet bootstrap: start hostNetwork pods before we have PodCIDR #35526
Conversation
This should fix the problem that needed a reversion of #33347 (static pods that fail admission because network-not-ready are not rescheduled). Should I apply this on top of #33347 and squash for review? Also not clear whether I should move the disk space check out of the admission check. I'd lean towards doing it separately.
All regular pods getting rejected whenever the kubelet restarts was an equally serious problem :-)
I think we actually want to reject pods if disk is full, so not sure if we want to change that, and definitely not in this PR.
Sounds good. Thanks!
Oops... sorry! I rebased on top of the prior (now reverted) PR. Looks much simpler now!
if errOuter := canRunPod(pod); errOuter != nil || pod.DeletionTimestamp != nil || apiPodStatus.Phase == api.PodFailed {
errOuter := canRunPod(pod)
if errOuter == nil {
    rs := kl.runtimeState.networkErrors()
We don't really distinguish network errors here. Would there be a case where even pods using the host network cannot run?
/cc @bprashanth I don't think this works with the CRI integration, in which case, we'll just sync all the pods and fail?
Not entirely sure what you're saying :-). We now start synchronizing before the network is ready, so that we can see the hostNetwork pods. But we can't actually start pods that don't use hostNetwork, because the network plugin is not ready.
Oh I think I understand... Perhaps if we renamed to networkPluginErrors, or podNetworkErrors? The idea is that if the network plugin has a problem, we put it into this separate list. An error that prevents hostNetwork pods would presumably not be a network plugin error, and would go into the general errors list (which prevents the pod sync loop from even starting).
if errOuter == nil {
    rs := kl.runtimeState.networkErrors()
    if len(rs) != 0 && !podUsesHostNetwork(pod) {
        errOuter = fmt.Errorf("Network is not ready: %v", rs)
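For context, the podUsesHostNetwork helper the diff calls can be sketched as below, with simplified stand-in types (the real 1.4-era API read HostNetwork from the pod's security context; the types here are illustrative, not the actual api package):

```go
package main

import "fmt"

// Simplified stand-ins for the pod API types involved.
type PodSecurityContext struct{ HostNetwork bool }
type PodSpec struct{ SecurityContext *PodSecurityContext }
type Pod struct{ Spec PodSpec }

// podUsesHostNetwork is true only when the pod explicitly opts into the
// host's network namespace, so it needs no network plugin to run.
func podUsesHostNetwork(pod *Pod) bool {
	return pod.Spec.SecurityContext != nil && pod.Spec.SecurityContext.HostNetwork
}

func main() {
	hostPod := &Pod{Spec: PodSpec{SecurityContext: &PodSecurityContext{HostNetwork: true}}}
	fmt.Println(podUsesHostNetwork(hostPod), podUsesHostNetwork(&Pod{})) // true false
}
```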
nit (error strings should start with a lowercase letter): s/Network/network
I don't think you want to set errOuter and have the running containers killed in line 1321. Just returning an error here should be sufficient. In that case, there's no need to reuse the errOuter variable.
Fixed the error nit - thanks.
Can you explain when we do & don't want to kill the pod? Is it the case that we kill if the failure is permanent, but not if the failure is transient?
AFAIK, we only kill pods if the pod is getting deleted (DeletionTimestamp is set), evicted, or if the kubelet is not capable (in terms of security context) of running the pod at all. Although I am not sure how the kubelet could've started the pod for the latter case to begin with...
Most of the pods that the kubelet determines it cannot run would be rejected at admission, and would not enter syncPod at all.
Rebased. @yujuhong are you happy with this, if we renamed to
Jenkins GKE smoke e2e failed for commit 4b37edcb0ed5ef4e6e82ffd735dfdd4343a94d5f. Full PR test history. The magic incantation to run this job again is
I'd like to get this PR in soon, since it's blocking the
We do now admit pods (unlike the first attempt), but now we will stop non-hostnetwork pods from starting if the network is not ready. Issue kubernetes#35409
We no longer pass in a "dummy" pod-cidr (10.123.45.0/29), and rely on reconcile-cidr=true instead (which is the default).
Jenkins GCE e2e failed for commit 68c0b42. Full PR test history. The magic incantation to run this job again is
@dchen1107 now with added unit-test :-)
@justinsb if this passes all tests, when do you expect this fix to be available in kubelet?
@zkmoney I think this is dependent on a Kubernetes member applying the LGTM label. FWIW, LGTM to me. :)
Assuming it gets LGTM-ed before freeze, it should be in 1.5. I don't know if we would backport it... we should probably add a flag if we did.
Ah got it, thank you. I am eagerly awaiting the ability to remove the dummy. If there is any good solution to this that you know of in the meantime, I'd appreciate hearing it.
Yes, I'm in the same boat with the dummy pod-cidr. And this is a problem for daemon-set pods because they are started on the master nodes. I can try to cherry-pick this for 1.4 if someone sets the appropriate label.
@edevil FWIW for daemonsets where they don't actually need to be running on the masters I've used the
@zkmoney This is for the logging component (Fluentd) so I do want it to run on every host. :)
LGTM
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue
Can this be backported to 1.4?
👍 For that @edevil
Network readiness was checked in the pod admission phase, but pods that
fail admission are not retried. Move the check to the pod start phase.
Issue #35409
Issue #35521
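The fix summarized above can be modeled in a few lines (a toy simulation under assumed semantics, not kubelet code): an admission rejection is terminal and the pod is never reconsidered, while a start-phase failure leaves the pod in the sync loop to be retried once the network becomes ready.

```go
package main

import "fmt"

// pod is a toy record of a pod's fate in the simulation.
type pod struct {
	name        string
	hostNetwork bool
	running     bool
	rejected    bool // a terminal admission rejection
}

// startPhase retries every pod that was not terminally rejected: hostNetwork
// pods start immediately, others wait for the network to be ready.
func startPhase(pods []*pod, networkReady bool) {
	for _, p := range pods {
		if p.rejected {
			continue // admission failures are never retried
		}
		if networkReady || p.hostNetwork {
			p.running = true
		}
	}
}

func main() {
	static := &pod{name: "kube-proxy", hostNetwork: true}
	app := &pod{name: "app"}
	pods := []*pod{static, app}

	startPhase(pods, false) // before PodCIDR: only the hostNetwork pod starts
	fmt.Println(static.running, app.running) // true false

	startPhase(pods, true) // network ready: the app pod is retried and starts
	fmt.Println(static.running, app.running) // true true
}
```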