
KEP-1664: Better Support for Dual-Stack Node Addresses #1665

Closed

Conversation

danwinship
Contributor

@danwinship danwinship commented Apr 3, 2020

There are problems with kubelet/cloud-provider communication regarding Node.Status.Addresses that currently break/complicate some IPv6 and dual-stack scenarios.

Additionally, Node.Status.Addresses has unclear and inconsistent semantics, and has a note attached to it in the documentation warning you not to use it with strategic merge patch because of an incorrect but unfixable annotation.

To resolve these problems, this KEP proposes deprecating the existing Node.Status.Addresses field and replacing it with two new fields, Node.Status.Hostname and Node.Status.IPs, with clearer and more consistent semantics.

This proposes a fix.


Enhancement: #1664

/cc @thockin @aojea @khenidak @lachie83
for the ipv6/dual-stack bits

/cc @liggitt @andrewsykim @cheftako
who were involved in the node and cloud bits of #79391 ("Don't use strategic merge patch on Node.Status.Addresses") and so might have thoughts on this. Feel free to uncc and/or cc other people. I'm not sure who the right people to look at this are in sig-node or sig-cloud-provider.


ftr I initially wrote this as "Clarify the semantics of Node.Status.Addresses and --node-ip", and in the "Alternatives" section, noted that we could maybe deprecate Node.Status.Addresses instead, and then I started thinking that that was really a better idea, so I made that the plan, and left not deprecating as an "Alternative". So anyway, that's also totally a possibility.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 3, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: danwinship
To complete the pull request process, please assign dchen1107
You can assign the PR to them by writing /assign @dchen1107 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Apr 3, 2020

The other information in `Node.Status.Addresses` will be moved to a
new `Node.Status.IPs` array of type `NodeIP`:
Member

Can we add that the PodIPs field from pods with HostNetwork=true will take the value from this field, or is this out of scope?

Member

I like that property, but what value do we get from specifying it?

Member

The other information in Node.Status.Addresses will be moved to a new Node.Status.IPs array of type NodeIP

Can you clarify what you mean by "moved", as long as we're still maintaining the status.addresses field in v1?

Contributor Author

@danwinship danwinship Apr 7, 2020

@liggitt "moved" as in "moved from the perspective of clients who only use non-deprecated fields"? I can clarify...

@aojea @thockin I added a little bit below about HostIP / host network PodIP

Member

## Motivation

### Problems

Member

This will fix this long-standing issue: kubernetes/kubernetes#42125

Contributor Author

I added this as a "Maybe Goal"; there are approaches we could take that would fix that, and approaches that wouldn't.

the new behavior. Perhaps this should warn and continue to use the old
behavior during alpha, then warn and use the new behavior in beta.

(Alternatively, since we need to extend `--node-ip` to support dual
Member

@aojea aojea Apr 3, 2020

The approach taken for dual stack was to modify the flags' behavior to accept comma-separated lists of parameters:
https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/20180612-ipv4-ipv6-dual-stack.md#kubelet-startup-configuration-for-dual-stack-pod-cidrs. It can be an option here too.
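For illustration, here is a minimal sketch of what parsing such a comma-separated `--node-ip` value could look like. This is not the actual kubelet code; `parseNodeIPs` is a hypothetical helper that accepts at most one address per IP family, mirroring the dual-stack flag convention referenced above.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// parseNodeIPs splits a comma-separated --node-ip value and rejects
// more than one address per family (hypothetical sketch, not kubelet code).
func parseNodeIPs(flag string) ([]net.IP, error) {
	var ips []net.IP
	seenV4, seenV6 := false, false
	for _, s := range strings.Split(flag, ",") {
		ip := net.ParseIP(strings.TrimSpace(s))
		if ip == nil {
			return nil, fmt.Errorf("invalid IP %q", s)
		}
		if ip.To4() != nil {
			if seenV4 {
				return nil, fmt.Errorf("more than one IPv4 address in %q", flag)
			}
			seenV4 = true
		} else {
			if seenV6 {
				return nil, fmt.Errorf("more than one IPv6 address in %q", flag)
			}
			seenV6 = true
		}
		ips = append(ips, ip)
	}
	return ips, nil
}

func main() {
	ips, err := parseNodeIPs("192.168.1.10,fd00::10")
	fmt.Println(ips, err) // [192.168.1.10 fd00::10] <nil>
}
```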

`Node.Status.Addresses` will eventually need to be updated to look at
the new fields instead.

### Version Skew Strategy
Member

Naive question: could we run into issues with these new fields like we have with the IPFamily one?

Contributor Author

Do you have a more specific concern?

It seems to me that the problems around IPFamily mostly involve defaulting and unspecified interactions with other fields, which would not be relevant here.

Member

nothing specific, just afraid of possible unspecified interactions with other fields, but you just clarified it, thanks.


// NodeIPPublic indicates an IP address that is expected to be directly
// reachable from anywhere, inside or outside the cluster.
NodeIPPublic NodeIPScope = "Public"
Member

I have doubts about this NodeIPPublic option; it has a lot of connotations :/

an IP address that is expected to be directly reachable from anywhere

Contributor Author

The existing options have a lot of connotations too. eg, this is partly inspired by kubernetes/kubernetes#86918 (comment).

Note that it's "expected to be directly reachable from anywhere", not "required to be directly reachable from anywhere". And if the cloud provider doesn't know, it should just use Unknown

Member

but IPv6 node IPs will most likely be Public, and bare-metal IPs will generally be Unknown

I think that, continuing with the IPv6 analogy, we can merge Unknown and External into Global, taking a more minimalistic approach and avoiding the divergence in the use of IP scopes between private/public cloud providers as you describe below. However, I don't know if the goal is exactly making that difference explicit ....

Contributor Author

The conversation in 86918 was specifically about the distinction between node IPs where traffic is expected to go directly to the node vs node IPs where traffic is expected to go to some piece of cloud infrastructure first, and a concern that we might accidentally end up preferring an IPv6 address of the latter type over an IPv4 address of the former type. The Public vs External distinction is exactly what was needed there.

I had also previously thought about just having some boolean flags on NodeIP instead of a type, like:

Public   bool // IP is assumed to be globally reachable
Indirect bool // Packets sent to this IP will be rewritten to another IP

So Internal would be !public && !indirect, External would be public && indirect, and Public would be public && !indirect. But then that leaves !public && indirect unused, and it doesn't allow expressing unknown public-ness, which is usually the case for autodetected bare-metal IPs.
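For illustration, a hedged sketch of the boolean-flags alternative described above; the scope names come from the KEP text, but scopeFromFlags is purely a hypothetical mapping, not proposed API.

```go
type NodeIPScope string

const (
	NodeIPInternal NodeIPScope = "Internal"
	NodeIPExternal NodeIPScope = "External"
	NodeIPPublic   NodeIPScope = "Public"
	NodeIPUnknown  NodeIPScope = "Unknown" // not expressible with the two booleans
)

// scopeFromFlags maps the two booleans onto the named scopes, showing
// why public=false, indirect=true has no name and why "unknown
// public-ness" can't be represented at all in this scheme.
func scopeFromFlags(public, indirect bool) (NodeIPScope, bool) {
	switch {
	case !public && !indirect:
		return NodeIPInternal, true
	case public && indirect:
		return NodeIPExternal, true
	case public && !indirect:
		return NodeIPPublic, true
	default: // !public && indirect: unused combination
		return "", false
	}
}
```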

Member

| NodeIP Scope | Private | Public |
| --- | --- | --- |
| Belongs to cluster | Internal | Public |
| Does not belong | Unknown | External |

👍 I got it 😄

Contributor Author

@danwinship danwinship May 5, 2020

I got it

not quite; "Private" + "Does not belong" should be empty because there's no name that means that explicitly. Unknown means unknown. Eg, on bare metal, kubelet knows the local IPs aren't External but if they're not from a known private IP range, then it has no way of knowing if they're Internal (ie, firewalled off from the internet) or Public. So it would call them Unknown.

Member

I think we're designing in a vacuum, hence why it's hard to pin down what semantics are useful. What are the use cases for anything other than "reachable by all other nodes within the cluster"?

(Assuming we can even find more than one use case) Perhaps we should have different fields for the different use cases rather than capturing all reachability/cost/policy semantics in a single intrinsic "type"? (ie: "use this address when doing foo-type operation", rather than "this address is intrinsically foo-type")

Member

@andrewsykim andrewsykim left a comment

Wondering if it would be simpler to add new address types for IPv6: InternalIPv6 and ExternalIPv6?

When using an external cloud provider, cloud-controller-manager is
responsible for setting `Node.Status.Addresses`, but it does not know
whether the cluster it is running in is supposed to be single-stack
IPv4, single-stack IPv6, or dual stack. If a node has both IPv4 and IPv6
Member

I wonder if it would be more bang for our buck to just add new node address types for v6 and call it a day. If we add an InternalIPv6 and ExternalIPv6, users can configure the priority based on --kubelet-preferred-address-types. The strategic merge patch issue would still exist, but I don't think it's a major issue since we have workarounds in both the kubelet and the cloud-controller-manager.

Contributor Author

Cloud-controller-manager sets Node.Status.Addresses directly, but it doesn't get any input from kubelet on how to do it. So to implement a --preferred-address-types arg in kubelet we need to improve kubelet/CCM communication as well. (Presumably via a Node annotation, which is how --node-ip gets passed.)

I should call this out more explicitly as a problem in the Motivation section.

Member

@andrewsykim andrewsykim Apr 6, 2020

--kubelet-preferred-address-types is an (existing) apiserver flag btw.

Contributor Author

I wonder if it would be more bang for our buck to just add new node address types for v6 and call it a day.

I was mostly only fiddling with NodeAddressType because if we're going to replace Node.Status.Addresses then we might as well try to fix all the problems with it at once. (Did I miss any?) If we aren't going to replace it, I'm not sure it's super important to distinguish more address types. Calling globally-routable IPv6 addresses InternalIP is confusing but it works fine.

(And changing existing IPv6 addresses to InternalIPv6 will immediately break existing code that only looks at InternalIP and ExternalIP. This is another reason why adding the new fields is nice; because we can still leave everything as it was in the old fields.)

Member

(And changing existing IPv6 addresses to InternalIPv6 will immediately break existing code that only looks at InternalIP and ExternalIP. This is another reason why adding the new fields is nice; because we can still leave everything as it was in the old fields.)

Good point; my assumption is that this only breaks when going from IPv6 -> dual-stack, both of which are alpha features.

If we aren't going replace it, I'm not sure it's super important to distinguish more address types. Calling globally-routable IPv6 addresses InternalIP is confusing but it works fine.

The real benefit is allowing users to configure the preferred family. Setting the type for an IPv6 address to InternalIP works for IPv6 only. In dual-stack it falls apart, since order matters and the order is up to the cloud provider. It's not really feasible for the cloud provider to know the preferred order of addresses, I think. This should be up to the user.

Contributor Author

I wasn't aware of apiserver --kubelet-preferred-address-types. Need to think about how that fits into this.

This is basically for the case where the masters are on a separate network from the nodes and can't reach the "internal" IPs, right?

At first glance, it seems like we ought to be able to decide whether apiserver→node communication should be IPv4 or IPv6 based on whether the apiserver is advertising an IPv4 or an IPv6 IP to kubernetes.default. If node→apiserver communication is going to be over IPv6 then it seems to make sense to assume that apiserver→node communication should also be IPv6?

@liggitt
Member

liggitt commented Apr 6, 2020

this KEP proposes deprecating the existing Node.Status.Addresses field and replacing it with two new fields, Node.Status.Hostname and Node.Status.IPs, with clearer and more consistent semantics.

Note that anything deprecated in v1 will not be removed until a hypothetical v2, so the existing field will need to continue to be populated/maintained in v1.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 6, 2020
Member

@thockin thockin left a comment

I have no major objections to this idea, but I don't understand what all this information is being used for any more, so all I see is a lot of mess :)

## Design Details

The `NodeAddress` of type `NodeHostName` will be moved to a new
`Node.Status.Hostname` field.
Member

What is the meaning of this field? Is it the hostname as the node knows itself (e.g. $(hostname) or $(hostname -f)) or is it any DNS that can be resolved to any one of the node's IPs? Or is it the name you get when you reverse a particular node IP? Where must it be DNS resolvable from?

Member

What is this used for?

Contributor Author

I don't know the answer to any of those questions... currently Addresses contains a single element of type NodeHostName and one or more IP/DNS elements. This proposal was just splitting the hostname out into a separate field from the IP/DNS stuff.

I had an "UNRESOLVED" section noting that I wasn't sure whether the hostname was currently required or optional. I guess I should check if it's even used...

Contributor Author

added a bunch of notes about what NodeHostName is used for...


@aojea
Member

aojea commented Apr 13, 2020

+1

@uablrek

uablrek commented Apr 13, 2020

Sorry, I found this KEP today so my comment is rather long:

About InternalIP

The InternalIPs are used as endpoint addresses for services when the PODs are in hostNetwork. This means that the InternalIP addresses must have one property:

  • The node must be reachable via the InternalIP addresses from every other node and from all PODs in the cluster.

In hindsight, EndpointIP may have been a better name, but it is appropriate to make that association in your mind.

        // NodeIPInternal indicates an IP address that exists only within the scope of
        // some private network (eg, an intranet or private cloud network).
        NodeIPInternal NodeIPScope = "Internal"

This definition is not restrictive enough. Say some special internal network is only reachable from 3 nodes out of 10; then the addresses are certainly "internal", but they cannot be used as endpoints.

In dual-stack, at least one InternalIP must be specified for each family to support services of both families. But I see no reason for a restriction to 2 addresses. If all have the property above, it does not matter which one is used as the endpoint address. There might be a future use-case for more addresses, e.g. with some proxier using multi-homing.

About auto-detecting (guessing) the InternalIP(=EndpointIP) addresses

While this is convenient for the vast majority of K8s installations, it is totally impossible for some.

For instance, nodes can have, say, 15 interfaces with 1-3 IPv4 and 3-10 IPv6 addresses each (exaggerated example, but it can happen).

It should be easy to satisfy both needs: just leave an option to configure the InternalIP(=EndpointIP) addresses. The proposed --node-ips works fine.

For instance, it can't be assumed that the default route points to a network that shall be used by traffic to services; we actually often separate external traffic onto a different (less secure) network which has the default route. And I have tried to use multiple targets for the default route just to see if it works. It does not:

ip ro add default \
  nexthop via 192.168.1.201 \
  nexthop via 192.168.1.202

breaks all guess-by-default-route algorithms I have seen.

About separating IPv4 and IPv6 addresses in configuration

This should not be necessary. It is easy to separate them when parsing. The only pitfall I can think of is IPv6-mapped IPv4 addresses. Some user may try to specify:

  --node-ips=192.168.1.1,::ffff:192.168.1.1

but this is the same (IPv4) address. The Go function To4() will return non-nil for an IPv6-mapped IPv4 address like "::ffff:192.168.1.1".
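For illustration, a small runnable snippet demonstrating the pitfall described above; To4() and Equal() are standard library net.IP methods.

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	a := net.ParseIP("192.168.1.1")
	b := net.ParseIP("::ffff:192.168.1.1")
	fmt.Println(b.To4() != nil) // true: the "IPv6" form is really an IPv4 address
	fmt.Println(a.Equal(b))     // true: both parse to the same address
}
```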

@danwinship
Contributor Author

danwinship commented Apr 13, 2020

  • The node must be reachable via the InternalIP addresses from every other node and from all PODs in the cluster.

Right. I did mention that in the description of "Unknown" ("It is at least reachable by all other nodes within the cluster, but may or may not be accessible beyond that") and at least in my head the "reachable by all other nodes" was implied for all other types too.

It should be easy to satisfy both needs, just leave an option to configure the InternalIP(=EndpointIP) addresses. The proposed --node-ips works fine.

There was no proposal to get rid of manually-overridden node IPs, only to make it work consistently across different deployment types. (eg, currently you can't override the default node IP when using an external cloud provider, which is bad, for the reasons you give)

The only pitfall I can think of is the ipv6-mapped ipv4 addresses.

Any user who puts ipv6-mapped ipv4 addresses in their config files deserves whatever sort of wacky undefined behavior they get.

@k8s-ci-robot k8s-ci-robot added the sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. label Apr 20, 2020
@danwinship danwinship force-pushed the node-ips branch 2 times, most recently from 1e33d13 to bdd3405 Compare April 20, 2020 23:47
@danwinship
Contributor Author

I haven't pushed any updates to the KEP in a while, but kubernetes/kubernetes#95239 implements the subset that has been consistent through every iteration ("you should be able to specify dual-stack node IPs on the command line if you're using bare metal" and "hostnetwork pods on dual-stack nodes should have dual-stack pod IPs")

I'm currently kind of thinking about Tim's "--node-ip-mode" comments (#1665 (comment)), and about falling back to a solution where unless you specify --node-ip-mode SOMETHING, you get exactly the legacy behavior (which then solves all the problems around whether or not to include IPv6 addresses in node.status.addresses)

@andrewsykim
Member

andrewsykim commented Oct 3, 2020

I think a reasonable step forward to make progress here would be to merge kubernetes/kubernetes#95239 (adds dual-stack support for --node-ip) as you suggested. This covers the providerless case. For clusters with cloud providers, we should defer to cloud-provider level config to enable dual-stack.

I would even go so far as to say that maybe we shouldn't support dual-stack for in-tree cloud providers at all, since:

  • we are only a few releases from removing them (target release v1.23), at least at the kubelet level
  • adding dual-stack support to in-tree cloud provider will complicate the already complicated migration
  • this way external cloud providers can implement their own logic / configuration for dual-stack nodes that is optimized based on their own platform.
  • decouples kubelet's dual-stack logic from anything specific to any given cloud provider

@danwinship
Contributor Author

I would even go as far to say that maybe we shouldn't support dual-stack for in-tree cloud providers at all

So one issue in this KEP is the difference in behavior between in-tree cloud providers and external cloud providers with --node-ip:

  • in-tree: passing --node-ip allows you to override the cloud provider's choice about what the primary node IP is
  • external: passing --node-ip makes the cloud provider validate that the given IP exists but does not make that IP be primary

I had been assuming we should make external providers behave more like in-tree providers here, but if we're going to ignore in-tree providers for this KEP, then maybe we shouldn't? Do you know if there are any outstanding issues/complaints about the fact that external cloud providers do not allow you to override the choice of primary node IP?

One issue that may have made the in-tree behavior more necessary before was that node.status.addresses handling used to be pretty broken, and so --node-ip was sometimes necessary even when the IP you wanted was the one that should have been primary anyway. (eg, all cloud providers used to sometimes randomly reorder node.status.addresses (kubernetes/kubernetes#79391) and AWS was unpredictable on nodes with multiple interfaces (kubernetes/kubernetes#80747)). Maybe with those bugs fixed there is no longer a strong argument for overriding the cloud provider's choice of IP.

Well, actually, one case where it's important is if you want to try to create a single-stack IPv6 cluster in a dual-stack cloud. In that case you need to force kubelet to choose an IPv6 IP rather than an IPv4 IP. (There may be some use case for forcing an "IPv6,IPv4" ordering of node IPs in a dual-stack cluster rather than "IPv4,IPv6" too, though in theory it mostly shouldn't matter.)

So, definitely need a way to override the cloud IPv4-vs-IPv6 ordering. It's not clear if we need to be able to override the cloud IP address ordering in any other cases...

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 7, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 6, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@aojea
Member

aojea commented Mar 8, 2021

@danwinship do we still need the KEP?

@danwinship
Contributor Author

@danwinship do we still need the KEP?

so, of the Goals in the KEP:

  1. Assign dual-stack Pod.Status.PodIPs to host-network Pods on nodes that have both IPv4 and IPv6 IPs

This was implemented in kubernetes/kubernetes#95239

  2. Make the necessary changes to kubelet to allow bare-metal clusters to have dual-stack node IPs (either auto-detected or specified on the command line) rather than limiting them to a single node IP.

95239 allows manually specifying dual-stack node IPs for bare metal clusters on the command line. It does not do dual-stack autodetection.

  3. Define how cloud providers should handle IPv4 and IPv6 node IPs in different cluster types (single-stack IPv4, single-stack IPv6, dual-stack) so as to enable IPv6/dual-stack functionality in clusters that want it without accidentally breaking old IPv4-only clusters.

This is not done, but also, it seems like kubernetes/kubernetes#86918 (dual-stack IPs on AWS) may merge now, and people are just leaning toward "well if your cluster has pods that will get confused by seeing dual-stack node addresses then you shouldn't be running on nodes with dual-stack IPs".

  4. Make built-in cloud providers and external cloud providers behave the same way with respect to detecting and overriding the Node IP(s). Allow administrators to override both IPv4 and IPv6 Node IPs in dual-stack clusters.

This is not done but as discussed in #1665 (comment) it's possibly not important, unless we care about letting people install single-stack IPv6 clusters on dual-stack cloud nodes. Arguably we should not care about that, since clouds that don't allow provisioning single-stack IPv6 nodes are probably not going to be able to bring up single-stack IPv6 clusters anyway (eg, due to IPv4-only DNS or other things like that).

  5. Find a home for the node-address-handling code which is shared between kubelet and external cloud providers.

Not done


There was also discussion in the PR about introducing alternative --node-ip behavior (#1665 (comment), #1665 (comment)) which maybe would be a good idea. In particular, the current kubelet bare-metal node-ip autodetection code is basically useless if you have multiple IPs, and in OCP we run a separate program to detect the node IP and then pass it to kubelet explicitly.

Also I think at some point there was discussion about the fact that we never pluralized pod.Status.HostIP, and maybe we should (but also maybe we don't actually need to).


It would be cool to have some kubelet option for better node IP autodetection. eg, the default should be something like "the first non-secondary non-deprecated IP on the lowest-ifindexed NIC that contains a default route which is not out-prioritized by any other default route", and there could maybe also be additional modes ("use dns", "use interface ethX", "use the interface that has the route that would be used to reach IP W.X.Y.Z").
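For illustration, a hedged sketch of the last of those hypothetical modes ("use the interface that has the route that would be used to reach IP W.X.Y.Z"); detectNodeIPToward is an illustrative helper, not existing kubelet code. A UDP "dial" never sends a packet, but it makes the kernel pick a source address via its routing table.

```go
package main

import (
	"fmt"
	"net"
)

// detectNodeIPToward returns the local IP the kernel would use as the
// source address when sending to target (returns an error if there is
// no route to target). Hypothetical sketch, not kubelet code.
func detectNodeIPToward(target string) (net.IP, error) {
	conn, err := net.Dial("udp", net.JoinHostPort(target, "53"))
	if err != nil {
		return nil, err
	}
	defer conn.Close()
	return conn.LocalAddr().(*net.UDPAddr).IP, nil
}

func main() {
	// 192.0.2.1 is a documentation address; any routable target works,
	// including an IPv6 one to pick the node's IPv6 source address.
	ip, err := detectNodeIPToward("192.0.2.1")
	fmt.Println(ip, err)
}
```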

Combining also with some of the discussion in kubernetes/kubernetes#95768, and also kind of kubernetes/kubernetes#96981, I'd say maybe there's an argument for having a mode where Node.Status.Addresses always contains exactly one NodeInternalIP (and nothing else) on single-stack, and exactly two NodeInternalIPs (and nothing else) on dual-stack.

@aojea
Member

aojea commented Apr 13, 2021

I'd say maybe there's an argument for having a mode where Node.Status.Addresses always contains exactly one NodeInternalIP (and nothing else) on single-stack, and exactly two NodeInternalIPs (and nothing else) on dual-stack.

besides having one or more, I think that it will be good to clearly define the "primary" IP of the node, the one that is going to be used as the source for all communications ... that will avoid a lot of problems with multiple interfaces, asymmetric routing, and the consequences of wrong networking configurations from operators .. having only one solves the problem immediately :)

@danwinship
Contributor Author

I think that it will be good to define clearly the "primary" IP of the node

The "primary" IP of the node is already clearly defined; it's the one that kubelet sets as the HostIP for pods on that node, which is defined to be the first InternalIP address, or the first ExternalIP address if there are no InternalIPs.

the one that is going to be used as source for all the communications

Node.Status.Addresses is more about what IP we promise can be used as a target than it is about what IP we think will be used as a source. Almost no clients ever explicitly choose their source IP, and if you have non-trivial routing, then it is possible that one node IP will be used for some purposes and another IP will be used for other purposes. (eg, with many network plugins, hostNetwork-to-pod traffic will not use the official node IP as its source address but will instead use an undocumented IP associated with the virtual network interface that connects the node to the pod network)
