-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP 2999: adding new KEP for host network pod IPs #3000
Conversation
aojea
commented
Oct 4, 2021
- One-line PR description: adding new KEP
- Issue link: Host Network Pods IPs #2999
- Other comments:
/assign @danwinship @thockin @dcbw @khenidak |
should be approved by the remaining approvers and/or the owning SIG (or | ||
SIG Architecture for cross-cutting KEPs). | ||
--> | ||
# KEP-NNNN: Your short, descriptive title |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^^^
|
||
## Motivation | ||
|
||
Kubelet reporting the updated IP in the node.Status, but host network pods reporting an IP that no longer exist is misleading and create confusion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also some confusion / inconsistency about where host-network pod IPs come from; kubelet will fill them in from node.status.addresses
if CRI leaves them unset, but if CRI sets the pod's IPs itself, kubelet won't touch them. (This is bad because, eg, the runtime might not know whether the node is supposed to be dual-stack or not, so it may fill in a single-stack IP, and then kubelet won't add the second node IP to that.)
cri-o leaves the pod IPs unset for host-network pods but I'm not sure what other CRI implementations do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
This is an excellent point. I do believe that PodIPs for host-net pods should be always match node.status.addresses
irrespective of what CRI reports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL, but is not a goal of the KEP, the goal is only to update the PodIP whenever it changes, CRI or Kubelet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry somewhat confused on what the final state of this proposal will be?
What will be the source of truth for Pod.Status.PodIP
and Pod.Status.PodIPs[]
? Will they be populated from the Kubelet? From CRI? From both?
If both... what is the reconciliation logic? How would we manage a delta?
If this is already implemented in the kubelet somewhere already I am assuming we just defer to that logic? Which begs the question, is that logic worth revisting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proposed implementation is here kubernetes/kubernetes#105410
@danwinship I can only see that the IPs are obtained from the Node.Status.Addresses
https://github.com/kubernetes/kubernetes/blob/master/pkg/util/node/node.go#L99
but if CRI sets the pod's IPs itself, kubelet won't touch them.
how that can happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL, but is not a goal of the KEP, the goal is only to update the PodIP whenever it changes, CRI or Kubelet
But right now, if CRI sets the pod IP itself, then kubelet never touches it. If we unconditionally change it when it changes then that means that with some runtimes, if you start a hostNetwork pod on a dual-stack host, it will be single-stack, but if you then change the IPv4 address of the node, then the pod would become dual-stack. That's obviously wrong, so either we need to say that kubelet won't update the pod IP of pods where it didn't initially set the pod IP itself (which would mean we would have to remember somewhere whether we had set it ourselves or not), or else (as Kal suggested) we should override CRI in this case if it did try to set the pod IP itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aojea https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_pods.go#L1522
If podIP was set by the runtime then kubelet never calls doesn't copy the host IP(s) to the pod IP(s)GetNodeHostIPs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the CRI will set IPs on a hostnetwork pod? there are no CNI calls at all, that doesn't seem a realistic scenario, a hostNetwork pod doesn't report ips neither in CRIO or containerd
https://github.com/cri-o/cri-o/blob/8d4d158935929800c4300b82eb4b5a83ded400f4/server/sandbox_network.go#L34-L36
|
||
## Summary | ||
|
||
Host-network pods inherit their Pod IPs from the node where they are running. Nodes may change their IPs during their lifecycle, and kubelet updates the node.Status accordingly, however, host-network pods IPs are not updated and keep reporting the old IPs until they are restarted/recreated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(line-wrap!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/Pod IPs/PodIPs/g
(there is no standards, but i noticed that is how we usually ref them).
|
||
### Risks and Mitigations | ||
|
||
Applications running in a host-network pod that depend on the Downward API or the Pod.Status.PodIPs for auto-configuration should consider that it can change. However, most probably these applications would not be working correctly because of that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, there are two cases here:
- a node IP is removed (which is what you're talking about)
- the node's "primary node IP(s)" change in some way but the old primary node IP(s) still exist
- A. the node goes from single-stack to dual-stack and so starts using dual-stack
podIPs
for new host-network pods. - B. the node goes from dual-stack to single-stack and so stops using dual-stack
podIPs
, although presumably the secondary-IP-family node IP still exists even though kubelet is ignoring it. - C. the node's
.Status.Addresses
gets reordered in a way that causes a different IP to become primary. (eg, kubelet gets restarted with a different--node-ip
value, or the user modifies the cloud configuration for the node to, eg, swap two virtual NICs)
- A. the node goes from single-stack to dual-stack and so starts using dual-stack
In all of the case 2 scenarios, the pod's old IP continues to be at least partially valid/usable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, worth clarifying
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a missing user story around static pods and how they are racy. They can start before CloudProvider sets Node.Status.Addresses
.
I also want - for prosperity - to outline the idea of host-net should always listen to ::0
if they want to avoid node IP change issues.
|
||
## Summary | ||
|
||
Host-network pods inherit their Pod IPs from the node where they are running. Nodes may change their IPs during their lifecycle, and kubelet updates the node.Status accordingly, however, host-network pods IPs are not updated and keep reporting the old IPs until they are restarted/recreated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Host-network pods inherit their Pod IPs from the node where they are running.
I have never found good documentation on this, but there are some considerations for a host with multiple NICs with regard to hostNet: true
pods. Most Linux network managers (tested: systemdnetworkd
and NetworkManager
) enable you to define a default route. So far I have only seen pods pull the same IP based on the default route. Although I can't back this up with paperwork. I haven't looked that far into the code (yet?).
However I am wondering how this change would respond in the following situations given this observation.
Given a node with the following criteria
- eno1
10.0.0.100
(default route) - eno2
192.168.1.1
- What is the expected Pod behavior if the default route changes from
eno1
->eno2
? - What is the expected Pod behavior if
eno1
is disabled (deactivated) andeno2
has a valid connection, but is not the default route? - What is the expected Pod behavior if a new NIC is added, but the default route remains?
- What is the expected Pod behavior if
eno1
at10.0.0.100
remains the default route in the route table, but is assigned a NEW IP address172.0.100.100
? - What is the expected Pod behavior if
eno1
at10.0.0.100
remains the default route in the route table, but is assigned a SECONDARY IP address172.0.100.100
?
tldr; that default route though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two types of node ip changes, in context:
- Soft change This happens because the delta between what kubelet sees (default first IPv4 iirc) and thus Static Pods see (if they start before cloud provider set-ip step) and what a thing like cloud provider sets which can be dual-stacked, different order, or even different IP.
- Hard change Like the one you described or what @aojea described (e.g., DHCP resets) or to be more specific a reconfiguration on the networking stack of the node.
The first type has no baring on anything. It just says the pod is reachable using different set of addresses (assuming pod listens to ::0
or whatever equivalent in other languages). The node itself didn't change its own network stack configuration. To be more specific. The node only changed the IPs it reports to kubernetes from {<IP>}
to !{<IP>}
but the node owned and still owns the joined set of IPs. Please do note that srcIP for host-net pod does have to == pod IP, this is subject to how node networking stack is configured (similar to srcIP for pod-net pods which is subject CNI and its snat rules). This is the most common change we see.
For the second type also - arguably - should result into no change in pod (again assuming it listens to ::0
and it shares host network namespace). To present it differently let us assume the node retooled the entire networking stack (assuming CNI is smart enough to deal with this and no problems with pods in pod-net) what happens to kubelet? in practice kubelet is running as a pod in host-net (or as any other app in host default net namespace). Depending on the severity of change existing connection may need to be terminated and re-established but it does not require kubelet to restart (again assuming listening to ::0
).
Maybe we should limit the kep to first type? @aojea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I can't back this up with paperwork. I haven't looked that far into the code (yet?).
the code is here if you want to have some fun :) https://github.com/kubernetes/kubernetes/blob/6ee6251fea608ff64ff69314624d26fc6b3b61ac/pkg/kubelet/nodestatus/setters.go#L60-L255
There are different scenarios, let's discard the cloud provider one and focus on bare metal/non cloud-provider, the code says
// 1) Use nodeIP if set (and not "0.0.0.0"/"::")
// 2) If the user has specified an IP to HostnameOverride, use it
// 3) Lookup the IP from node name by DNS
// 4) Try to get the IP from the network interface used as default gateway
//
// For steps 3 and 4, IPv4 addresses are preferred to IPv6
if we are in step 4, we execute https://github.com/kubernetes/kubernetes/blob/6ee6251fea608ff64ff69314624d26fc6b3b61ac/staging/src/k8s.io/apimachinery/pkg/util/net/interface.go#L463-L468
// ResolveBindAddress returns the IP address of a daemon, based on the given bindAddress:
// If bindAddress is unset, it returns the host's default IP, as with ChooseHostInterface().
// If bindAddress is unspecified or loopback, it returns the default IP of the same
// address family as bindAddress.
// Otherwise, it just returns bindAddress.
func ResolveBindAddress(bindAddress net.IP) (net.IP, error) {
It is important to mention that these are host-network pods, so they have ALL the IPs in the host, they run in the host namespace, we are talking only about the pod.Status.PodIPs fields, there is no influence in the node network at all.
@khenidak I narrowed the scope of the KEP in goals/non-goals to avoid any change in the kubelet. This is just to set the reported nodeIP by the kubelet in the PodIPs,
You can see Dan's comments too, the whole IP dance inside of the kubelet requires a big big big work if we want to make it perfect
#3000 (comment)
#3000 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is important to mention that these are host-network pods, so they have ALL the IPs in the host, they run in the host namespace, we are talking only about the pod.Status.PodIPs fields, there is no influence in the node network at all.
This is what I came here for.
I haven't always detected this behavior which is why I think I am confused. I might be able to offer reproduction steps when I find time. However I know I have had a node with both a publicly routable IP and an internal one as well and have seen the host net pods only route through the internal IP.
No more action needed. My comment is resolved (albeit it might be wise to add some more language about scope and IP changes like mentioned above). If I have other questions on the code I can bring it up later. Thanks for the great response.
/cc |
Are we going to push for this in 1.24 ? |
We have a simpler solution for the problem reported kubernetes/kubernetes#106715 The dynamic hostIPs use case is still a breaking change for the nodes behaviors without a strong use case that justify it. |
ping @aojea - deadline looming |
let's go with the bug fix and once we have more evidence and feedback we' ll revisit |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: aojea The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
Are we touching this for 1.25? |
I didn't hear more about this problem since we merged kubernetes/kubernetes#106715 I'm going to set it in draft mode just in case we need to revisit |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close |
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |