Add IPv4/IPv6 dual stack KEP #2254

Open · wants to merge 1 commit into base: master

Conversation

@leblancd

leblancd commented Jun 12, 2018

Add a KEP for IPv4/IPv6 dual stack functionality to Kubernetes clusters. Dual-stack functionality includes the following concepts:

  • Awareness of multiple IPv4/IPv6 address assignments per pod
  • Native IPv4-to-IPv4 in parallel with IPv6-to-IPv6 communications to, from, and within a cluster

References:
kubernetes/kubernetes issue #62822
kubernetes/features issue #563

@feiskyer

This comment has been minimized.

Member

feiskyer commented Jun 12, 2018

@rpothier

This comment has been minimized.

rpothier commented Jun 12, 2018

/area ipv6

@caseydavenport

@leblancd thanks for making this! I've done a first pass and added some comments.

- Link Local Addresses (LLAs) on a pod will remain implicit (Kubernetes will not display nor track these addresses).
- Kubernetes needs to be configurable for up to two service CIDRs.
- Backend pods for a service can be dual stack. For the first release of dual-stack support, each IPv4/IPv6 address of a backend pod will be treated as a separate Kubernetes endpoint.
- Kube-proxy needs to support IPv4 and IPv6 services in parallel (e.g. drive iptables and ip6tables in parallel).
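
To make the "parallel iptables/ip6tables" point above concrete, here is a minimal, self-contained sketch (not actual kube-proxy code; the RuleWriter interface and rule strings are hypothetical) of keeping one rule writer per IP family and routing each service rule to the writer that matches the address family of the service IP:

```go
// Hypothetical sketch, not kube-proxy's actual implementation: a proxier
// holds one rule writer per IP family and applies each service rule via
// the writer that matches the family of the service IP.
package main

import (
	"fmt"
	"net"
)

// RuleWriter abstracts "iptables-restore" vs. "ip6tables-restore".
type RuleWriter interface {
	Restore(rules []string) error
}

type printWriter struct{ tool string }

func (w printWriter) Restore(rules []string) error {
	fmt.Println(w.tool, "applying", len(rules), "rule(s)")
	return nil
}

type dualStackProxier struct {
	v4, v6 RuleWriter
}

// syncService writes the rule for each cluster IP to the writer of the
// matching IP family, so IPv4 and IPv6 are programmed in parallel.
func (p *dualStackProxier) syncService(clusterIPs []string, rule string) error {
	for _, s := range clusterIPs {
		ip := net.ParseIP(s)
		if ip == nil {
			return fmt.Errorf("invalid cluster IP %q", s)
		}
		w := p.v6
		if ip.To4() != nil {
			w = p.v4
		}
		if err := w.Restore([]string{rule + " -d " + s}); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	p := &dualStackProxier{
		v4: printWriter{"iptables-restore"},
		v6: printWriter{"ip6tables-restore"},
	}
	_ = p.syncService([]string{"10.3.0.10", "fd00:1234::10"}, "-A KUBE-SERVICES -p tcp --dport 80")
}
```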

This comment has been minimized.

@caseydavenport

caseydavenport Jun 14, 2018

Member

We should also consider the impact to the IPVS proxy. I think at this point we need to maintain both (unless we say dual-stack is iptables only?)

This comment has been minimized.

@leblancd

leblancd Jun 19, 2018

Yes, good point. I had added an IPVS section below, but forgot to add it to the proposal summary here.

This comment has been minimized.

@pmichali

pmichali Jun 20, 2018

FYI, I did some exploring into IPVS and have found one problem using it so far. The team reports that it does not currently support IPv6. Seems like we need to document that effort is needed there too.

This comment has been minimized.

@leblancd

leblancd Jun 20, 2018

Thanks for looking into this, it's good to know up front!

### Awareness of Multiple IPs per Pod
Since Kubernetes Version 1.9, Kubernetes users have had the capability to use dual-stack-capable CNI network plugins (e.g. Bridge + Host Local, Calico, etc.), using the
[0.3.1 version of the CNI Networking Plugin API](https://github.com/containernetworking/cni/blob/spec-v0.3.1/SPEC.md), to configure multiple IPv4/IPv6 addresses on pods. However, Kubernetes currently captures and uses only the first IP address in the list of assigned pod IPs that a CNI plugin returns to Kubelet in the [CNI Results structure](https://github.com/containernetworking/cni/blob/spec-v0.3.1/SPEC.md#result).
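
As an illustration of the gap described above, the following self-contained sketch shows how a consumer of a CNI 0.3.1 Result could collect every reported address rather than only the first one. The Result/IPConfig types here are simplified stand-ins for the real CNI library types, not the actual API:

```go
// Simplified stand-ins for the CNI 0.3.1 Result/IPConfig structures; the
// point is only to show collecting every address instead of just IPs[0].
package main

import (
	"fmt"
	"net"
)

type IPConfig struct {
	Version string // "4" or "6"
	Address net.IPNet
}

type Result struct {
	IPs []*IPConfig
}

// podIPsFromResult returns all pod IPs reported by the plugin, preserving
// the plugin's ordering so the first entry can remain the "default" IP.
func podIPsFromResult(r *Result) []string {
	var ips []string
	for _, cfg := range r.IPs {
		ips = append(ips, cfg.Address.IP.String())
	}
	return ips
}

func main() {
	res := &Result{IPs: []*IPConfig{
		{Version: "4", Address: net.IPNet{IP: net.ParseIP("10.244.1.4"), Mask: net.CIDRMask(24, 32)}},
		{Version: "6", Address: net.IPNet{IP: net.ParseIP("fd00:10:244:1::4"), Mask: net.CIDRMask(64, 128)}},
	}}
	fmt.Println(podIPsFromResult(res)) // [10.244.1.4 fd00:10:244:1::4]
}
```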

This comment has been minimized.

@caseydavenport

caseydavenport Jun 14, 2018

Member

I believe Kubernetes uses only the IP address it reads from eth0 within the Pod and ignores the response from CNI altogether, right? Or did that get changed and I missed it? :)

This comment has been minimized.

@squeed

squeed Jun 15, 2018

That is still currently the case... for now. CNI get is getting closer!

This comment has been minimized.

@leblancd

leblancd Jun 19, 2018

Thanks for the clarification!
@squeed - If we do add the capability for the CNI plugin to pass up metadata (labels and such) to Kubernetes, does this have to be done via a CNI "get", or is there a way for kubelet to gather this information directly from the CNI results?

This comment has been minimized.

@squeed

squeed Jun 19, 2018

CNI doesn't allow for any kind of annotations (right now -- that could change) - it only has IPs and routes. Changing that is... out of scope :-)

I think this section is fine as-is; how kubelet gets the list of IPs from a running container is just an implementation detail.

This comment has been minimized.

@leblancd

leblancd Jun 19, 2018

@squeed - Understood, thanks. This was more out of curiosity about where we're headed with the CNI API.

### Support of Health/Liveness/Readiness Probes
Currently, health, liveness, and readiness probes are defined without any concern for IP addresses or families. For the first release of dual-stack support, no configuration "knobs" will be added for probe definitions. A probe for a dual-stack pod will be deemed successful if either an IPv4 or IPv6 response is received. (QUESTION: Does the current probe implementation include DNS lookups, or are IP addresses hard coded?)
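
The "either family succeeds" rule described above could look roughly like the following sketch for a TCP-style check. This is illustrative only, not kubelet's prober code, and the probe behavior is revisited later in this thread; the timeout and helper names are assumptions:

```go
// Illustrative only: a TCP-style check that succeeds if either family's
// pod address answers; timeout and helper names are assumptions.
package main

import (
	"fmt"
	"net"
	"time"
)

func probeTCP(ip string, port int, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", net.JoinHostPort(ip, fmt.Sprint(port)), timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// probeDualStack reports success if any of the pod's addresses (IPv4 or
// IPv6) responds on the probed port.
func probeDualStack(podIPs []string, port int) bool {
	for _, ip := range podIPs {
		if probeTCP(ip, port, 2*time.Second) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(probeDualStack([]string{"10.244.1.4", "fd00:10:244:1::4"}, 8080))
}
```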

This comment has been minimized.

@caseydavenport

caseydavenport Jun 14, 2018

Member

Does the current probe implementation include DNS lookups, or are IP addresses hard coded

I believe you can configure a host for the probe which will be resolved via DNS if it is a DNS name rather than an exact IP. See here

### Load Balancer Operation ???
### Network Policy Considerations ???

This comment has been minimized.

@caseydavenport

caseydavenport Jun 14, 2018

Member

Speaking strictly for Calico, I don't believe there are any (besides testing).

I don't think there are API impacts because the NP API selects using labels rather than addresses (though now that I think about it we should check if the NP CIDR support does validation on IPv4 / IPv6)

This comment has been minimized.

@leblancd

leblancd Jun 19, 2018

Cool, thanks!

#### First Release of Dual Stack: No Service Address Configuration "Knobs"
For the first release of Kubernetes dual-stack support, no new configuration "knobs" will be added for service definitions. This greatly simplifies the design and implementation, but requires imposing the following behavior:
- Service IP allocation: Kubernetes will always allocate a service IP from each service CIDR for each service that is created. (In the future, we might want to consider adding configuration options to allow a user to select e.g. whether a given service should be assigned only IPv4, only IPv6, or both IPv4 and IPv6 service IPs.)

This comment has been minimized.

@danwinship

danwinship Jun 15, 2018

Contributor

If kube-proxy/plugins/etc are going to have to be updated to deal with multiple CIDR ranges anyway, then it would be useful to let the admin say "a.b.c.d/x should be treated as part of the service IP range, but the controller shouldn't allocate any new IPs out of that range". This would let you live-migrate the cluster from one service CIDR to another (something we (OpenShift) have had a handful of requests for).

(The same applies to the cluster CIDRs, though in that case the allocations are done by the plugin, not kube itself, so you could already implement this via plugin-specific configuration.)

This comment has been minimized.

@pmichali

pmichali Jun 15, 2018

Nit: the "In the future" clause in parens could be moved down, as a note, since it is not part of the imposed behavior for not using a knob.

This comment has been minimized.

@thockin

thockin Jun 15, 2018

Member

I have heard a VERY small number of similar requests, but I think it's orthogonal to multi-family. We could add multi-CIDR with metadata (e.g. allow existing, but no more allocations) without touching multi-family support.

That said, if we do add multiple CIDRs, we should at least make it possible to add metadata later (so []struct, not []string).

This comment has been minimized.

@leblancd

leblancd Jun 19, 2018

I agree that we should add metadata within a PodIP structure, as @thockin described earlier, at least as a placeholder for the future. Support for live migration of pod CIDRs could then be done as a followup enhancement and make use of the metadata.

@squeed

This comment has been minimized.

squeed commented Jun 15, 2018

A few thoughts about ExtraPodIPs:

  • Should it always also include the default PodIP? So up-to-date clients don't have to manually concatenate
  • I would like some metadata attached to IPs. Some labels, perhaps?

Some time ago, before multi-network was tabled, I suggested that every IP should be labeled with the name of the network that created it. Then services could have an optional label selector.

Even if we don't use this right now, it would be good to have for room to grow.

The singular --service-cluster-ip-range argument will become deprecated.
#### controller-manager Startup Configuration for Multiple Service CIDRs
A new, plural "service-cluster-ip-ranges" option for the [controller-manager startup configuration](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/) is proposed, in addition to retaining the existing, singular "service-cluster-ip-range" option (for backwards compatibility):

This comment has been minimized.

@squeed

squeed Jun 15, 2018

complete utter bikeshed: what about making --service-cluster-ip-range accept a comma-separated list?

This comment has been minimized.

@thockin

thockin Jun 15, 2018

Member

if we can do that transparently, that might be great for all of these flags (if a bit less obviously named)

This comment has been minimized.

@leblancd

leblancd Jun 19, 2018

That sounds good to me! I wasn't sure if there'd be a problem with not having a trailing 's' for a plural argument, or any backwards compatibility headaches with command line arguments (e.g. old/new manifests working with new/old images).
I'll make this change for this and for the other command line arguments in this doc.

A new, plural "service-cluster-ip-ranges" option for the [controller-manager startup configuration](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/) is proposed, in addition to retaining the existing, singular "service-cluster-ip-range" option (for backwards compatibility):
```
--service-cluster-ip-range ipNet [Singular IP CIDR, Default: 10.0.0.0/24]
--service-cluster-ip-ranges stringSlice [Multiple IP CIDRs, comma separated list of CIDRs, Default: []]

This comment has been minimized.

@squeed

squeed Jun 15, 2018

what happens if the user specifies two v4 ranges?

This comment has been minimized.

@thockin

thockin Jun 15, 2018

Member

good call - err on the side of over-specifying :)
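
A sketch of the kind of startup validation being asked for here, rejecting more than one CIDR of the same IP family; the comma-separated value format and error text are illustrative, not the final design:

```go
// Illustrative validation: at most one IPv4 and one IPv6 service CIDR.
package main

import (
	"fmt"
	"net"
	"strings"
)

func validateServiceCIDRs(ranges string) error {
	var v4, v6 int
	for _, s := range strings.Split(ranges, ",") {
		_, cidr, err := net.ParseCIDR(strings.TrimSpace(s))
		if err != nil {
			return fmt.Errorf("invalid CIDR %q: %v", s, err)
		}
		if cidr.IP.To4() != nil {
			v4++
		} else {
			v6++
		}
	}
	if v4 > 1 || v6 > 1 {
		return fmt.Errorf("at most one IPv4 and one IPv6 service CIDR may be given, got %d IPv4 and %d IPv6", v4, v6)
	}
	return nil
}

func main() {
	fmt.Println(validateServiceCIDRs("10.96.0.0/12,fd00:1234::/110")) // <nil>
	fmt.Println(validateServiceCIDRs("10.96.0.0/12,192.168.0.0/24"))  // error
}
```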

keps/sig-network/0013-20180612-ipv4-ipv6-dual-stack.md
## Motivation
The adoption of IPv6 has increased in recent years, and customers are requesting IPv6 support in Kubernetes clusters. To this end, support for IPv6-only clusters was added as an alpha feature in Kubernetes Version 1.9. Clusters can now be run in an IPv4-only, an IPv6-only, or a very limited dual-stack configuration. This limited dual-stack support might be called "headless", in the sense that a CNI network plugin can configure dual-stack addresses on a pod, but Kubernetes remains aware of only a single IP address per pod. This "headless" dual-stack support is limited by the following restrictions:

This comment has been minimized.

@pmichali

pmichali Jun 15, 2018

For the motivation, consider saying there is a need to go to IPv6 (the end goal), however there are some folks that are unable to adopt a pure IPv6 environment at this time. Because of that, there is the motivation to support stop-gap measures to meet their needs.

This comment has been minimized.

@pmichali

pmichali Jun 15, 2018

The following items are more a description of the limitations of the current situation of trying to do dual-stack in an ad-hoc manner. I guess I wouldn't include that as part of the motivation (see below).

This comment has been minimized.

@leblancd

leblancd Jun 19, 2018

I'm trying to describe what the benefits of the proposal are, and to do that, I think it's clearest to compare what's available today (limited dual stack) with what we'd have if the proposal is implemented (full dual-stack). The last paragraph attempts to describe who (or what scenarios) this proposal would benefit.

- Kubernetes needs to be configurable for up to two service CIDRs.
- Backend pods for a service can be dual stack. For the first release of dual-stack support, each IPv4/IPv6 address of a backend pod will be treated as a separate Kubernetes endpoint.
- Kube-proxy needs to support IPv4 and IPv6 services in parallel (e.g. drive iptables and ip6tables in parallel).
- Health/liveness/readiness probes should work for dual-stack pods. For the first release of dual-stack support, no additional configuration "knobs" will be added for probe definitions, and a probe is deemed successful if either an IPv4 or IPv6 response is received

This comment has been minimized.

@pmichali

pmichali Jun 15, 2018

Is the goal of these checks to ascertain if the pod is "OK" and not necessarily to test the network connection? If so, maybe the phrasing could be that, if there is an IPv6 connection, it will be used (IPv6 only or dual-stack), otherwise IPv4 used? That would avoid the need for a knob or checking both IPv4 and IPv6 mode.

This comment has been minimized.

@leblancd

leblancd Jun 19, 2018

My interpretation is that the intent of the probes is to check whether the pod itself is functional, regardless of how you are actually able to reach it (if it's responding via either V4 or V6 it can, ideally speaking, be declared a functioning pod). What you're suggesting for dual stack (ignoring v4-only and v6-only, since there's no argument there) is that health be determined based solely on v6 connectivity. If this is acceptable from a user perspective, it would save some implementation work (how much, I'm not sure).

This comment has been minimized.

@pmichali

pmichali Jun 19, 2018

That's not what I'm saying, IIUC your response. I'm thinking that the goal is to see if the pod is working irrespective of the protocol used to communicate with the pod (which you agree with). As such, instead of having knobs to give the user the ability (and complexity) to select what protocol(s) to use, could we just pick a protocol? If only one protocol is available, it is the same as today. If both are available, pick one, with my preference being IPv6, as that is the end goal (IPv6 only) IMHO.

Later in this doc (L287-290) it talks about having config for both V4 and V6 health checks. I'm thinking we could just keep the original config with a single address specified.

This comment has been minimized.

@leblancd

leblancd Jun 19, 2018

Here's a corner case to consider. I'm not sure how common this would be, but it may help clarify the tradeoffs we're talking about: Let's say there's a legacy IPv4-only application (one of the reasons we're adding dual stack) that's being run in an otherwise dual-stack cluster. One example is the standard nginx dockerhub image. If you load this image in an (otherwise) dual-stack cluster, its pod would have both IPv4 and IPv6 addresses assigned, but the application, by default, will only bind to (listen on) IPv4 (bind address of 0.0.0.0, or all v4 IPs). If someone were to try to configure a health probe for this, it would fail if health probes only test for health on IPv6. On the other hand, if we design it so that we're trying health probes on both IPv4 and IPv6 (declaring health if either a V4 or V6 response is received), then the probes would work correctly with out-of-the-box nginx. If this scenario (and others like it) is not a big concern, then limiting probes to V6-only support for dual stack might be a worthwhile tradeoff.

As for kube-proxy's healthz bind address config (L287-290), I don't want to conflate this kube-proxy startup config with the per-probe-definition config. Sure, these are both related to health probes (for kube-proxy, anyway), but the big difference is that the per-probe-definition config knobs are tricky to do in a portable and non-controversial manner, whereas adding v4 and v6 bind address startup configs is fairly trivial, if we need them (the "if" being predicated on the tradeoff described above).

- Kube-proxy needs to support IPv4 and IPv6 services in parallel (e.g. drive iptables and ip6tables in parallel).
- Health/liveness/readiness probes should work for dual-stack pods. For the first release of dual-stack support, no additional configuration "knobs" will be added for probe definitions, and a probe is deemed successful if either an IPv4 or IPv6 response is received
- Kubectl commands and output displays will need to be modified for dual-stack.
- Kubeadm support will need to be added to enable spin-up of dual-stack clusters.

This comment has been minimized.

@pmichali

pmichali Jun 15, 2018

Wondering if this should be generalized and kept separate from this proposal, by saying that orchestration tools like MiniKube, KubeAdm, Kubespray, etc. may need to be updated, which is outside the scope of this proposal.

This comment has been minimized.

@leblancd

leblancd Jun 19, 2018

I would prefer to leave Kubeadm included here, since we'll need one reference way of instantiating a dual-stack cluster anyway (similar to what we did with IPv6 only), and because we'll very likely need kubeadm support in implementing dual-stack CI. On the other hand, the other orchestrators should be explicitly out-of-scope, so I'll add an item in the "Non Goals" section to make that clear.

This comment has been minimized.

@pmichali

pmichali Jun 19, 2018

I think it's fine to select an orchestrator for the reference implementation. My question is whether the KubeAdm changes should be part of this effort or a companion effort. My vote is to keep the scope of an already large effort as small as possible.

This comment has been minimized.

@leblancd

leblancd Jun 19, 2018

Can we develop dual stack, along with CI, without Kubeadm changes?

This comment has been minimized.

@pmichali

pmichali Jun 19, 2018

I'm not saying that orchestrator changes are not needed. I'm suggesting breaking up the work into separate efforts ("a companion effort"). Are we going to scope out kubeadm changes in detail as part of this KEP, then?

This comment has been minimized.

@leblancd

leblancd Jun 19, 2018

If it's a required part of implementing and testing the dual stack feature, then I'm proposing that we need to mention that here. Specifically, I want to avoid saying that the kubeadm changes are out-of-scope, out of concern that the decoupled kubeadm dual-stack changes might get a lower priority relative to the rest of the dual-stack work. This would be in the spirit of a KEP intended to be a "feature, and effort tracking document" (from the KEP summary). For kubeadm implementation details, we can always file an issue against kubeadm.

This proposal aims to extend the Kubernetes Pod Status API so that Kubernetes can track and make use of multiple IPv4/IPv6 address assignments per pod.
#### Versioned API Change: PodStatus v1 core
In order to maintain backwards compatibility for the core V1 API, this proposal adds a new "ExtraPodIPs" field to the core V1 version of the [PodStatus V1 core API](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#podstatus-v1-core), while maintaining the existing "PodIP" field. This change is in accordance with the [Kubernetes API change

This comment has been minimized.

@pmichali

pmichali Jun 15, 2018

Are we getting into too much of the implementation details here, specifying the actual changes needed? Would it suffice to just say that the API will need to support multiple IPs, without breaking backwards compatibility? I'm thinking we should focus more on the what and not the how - to avoid bike shedding.

This comment has been minimized.

@pmichali

pmichali Jun 18, 2018

I guess the naive question is... Is the KEP supposed to be a functional spec (what), design spec (how), or some combination?

This comment has been minimized.

@leblancd

leblancd Jun 19, 2018

According to the Summary section of KEP #1, "Kubernetes Enhancement Proposal Process":

A KEP attempts to combine aspects of a

    - feature, and effort tracking document
    - a product requirements document
    - design document

into one file which is created incrementally in collaboration with one or more Special Interest Groups (SIGs).

Either way, I think people in the community are going to want to know what changes are being made to the API. And based on what's written in the Kubernetes API change guidelines, it seems that backwards compatibility has been a tricky thing, so we need to show at least some high level details on how we expect to handle that.

This comment has been minimized.

@pmichali
ExtraClusterIPs []string `json:"extraclusterIP,omitempty" protobuf:"bytes,3,opt,name=ExtraclusterIPs"`
```
#### Internal Representation:

This comment has been minimized.

@pmichali

pmichali Jun 15, 2018

Same question as above about getting into a lot of details.

```
The singular --service-cluster-ip-range argument will become deprecated.
#### 'kubectl get service' Command Display for Multiple Service IPs

This comment has been minimized.

@pmichali

pmichali Jun 15, 2018

Other kubectl commands possibly needing addressing... "get pod", "describe pod"?, "describe svc".

This comment has been minimized.

@leblancd

leblancd Jun 19, 2018

Ooh yeah, we should include those.

#### Kube-Proxy Startup Configuration Changes
##### Multiple bind addresses configuration
New [kube-proxy configuration](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-proxy/) arguments will be added to allow a user to specify separate IPv4 and IPv6 bind addresses for the kube-proxy server:

This comment has been minimized.

@pmichali

pmichali Jun 15, 2018

Mentioned above, would it suffice to just support health checks for one IP mode (IPv6, if available, else IPv4)?

For metrics, wouldn't only one IP mode be needed to access the metrics? Not sure why one would want to access metrics from IPv4 and IPv6. May be able to leave that, as is.

## Drawbacks [optional]
\<TBD\>
## Alternatives [optional]

This comment has been minimized.

@pmichali

pmichali Jun 15, 2018

I think this would be good to have up top, where we discuss the motivation, and then the ways to address the needs and the selection chosen.

This comment has been minimized.

@leblancd

leblancd Jun 19, 2018

This "Alternatives" section is part of the standard KEPs template.

## Implementation History
\<TBD\>
## Drawbacks [optional]

This comment has been minimized.

@pmichali

pmichali Jun 15, 2018

Would be good to list the limitations/restrictions of the proposal as one of the prominent items, IMHO. Would cover this and the Risks/Mitigations item.

This comment has been minimized.

@leblancd

leblancd Jun 19, 2018

The "Non Goals", "Risks and Mitigations", and "Drawbacks" sections are part of the standard KEPs template.

@thockin

Great doc! Pretty complete and well thought out

I have some big concerns about API compat stuff that we need to sort through.

We may want to do this work in a "feature branch". It seems expansive enough. @lavalamp for thoughts on that

- Service Access: IPv4-to-IPv4 and IPv6-to-IPv6 access from pods to Kubernetes services
- External Server Access: IPv4-to-IPv4 and IPv6-to-IPv6 access from pods to external servers
- Ingress Access: Access from IPv4 and/or IPv6 clients to Kubernetes services. Depending on the ingress controller used, connections may cross IP families (IPv4 clients to IPv6 services, or vice versa).
- Service Discovery: Support for both A and AAAA records for kube-DNS and coreDNS

This comment has been minimized.

@thockin

thockin Jun 15, 2018

Member

we maybe don't want to do kube-dns? Not sure, but maybe we can start the EOL'ing process.

This comment has been minimized.

@leblancd

leblancd Jun 19, 2018

If kube-dns will likely be EOL'd before this proposal is implemented (seems likely to me), then we should skip adding dual-stack support for it (and add it to Non-Goals section). Your call. :)

This comment has been minimized.

@thockin

thockin Jul 6, 2018

Member

I think we should NOT plan on a kube-dns update.

@bowei or @johnbelamaric can argue with me now...

- External Server Access: IPv4-to-IPv4 and IPv6-to-IPv6 access from pods to external servers
- Ingress Access: Access from IPv4 and/or IPv6 clients to Kubernetes services. Depending on the ingress controller used, connections may cross IP families (IPv4 clients to IPv6 services, or vice versa).
- Service Discovery: Support for both A and AAAA records for kube-DNS and coreDNS
- Functionality tested with the Bridge CNI plugin and Host-Local IPAM plugin as references

This comment has been minimized.

@thockin

thockin Jun 15, 2018

Member

We might add ptp as a goal here, since I'd love to see most people adopt that rather than bridge (no more hairpin madness)

This comment has been minimized.

@leblancd

leblancd Jun 19, 2018

Sounds good. I'll leave the Bridge plugin in here since it should just work (except for the hairpin gotcha).

- Cross-family connectivity: IPv4-to-IPv6 and IPv6-to-IPv4 connectivity is considered outside of the scope of this proposal, except in the case of ingress access (depending on ingress controller used).
- CNI network plugins other than the Bridge and Host-Local IPAM plugins should support Kubernetes dual stack, but the development and testing of dual stack support for these other plugins is considered outside of the scope of this proposal.
- Code changes should support any number of IPv4 addresses and any number of IPv6 addresses per pod (up to some reasonable limit). However, testing will be done only up to the following limits:
- Pod addresses: 1 IPv4 address and 2 IPv6 addresses per pod maximum

This comment has been minimized.

@thockin

thockin Jun 15, 2018

Member

explain the 2 v6 addresses here, and why this should not be 2 v4?

This comment has been minimized.

@leblancd

leblancd Jun 19, 2018

I'm not sure what the limits should be on this. I'm okay with restricting this to 1 v4 + 1 v6. There had been some earlier discussion around supporting multiple v6 addresses per pod (hadn't heard anything about multiple IPv4s). However, the only (very weak) use case that I can think of for 2 v6 addresses is to support one private (ULA) and one global (GUA) address in a pod. (There are also privacy extension addresses to consider, but I really don't think it's practical to support privacy extensions in Kubernetes, since it would be very difficult for Kubernetes to detect the ephemeral private addresses).

This comment has been minimized.

@pmichali

pmichali Jun 19, 2018

^^ More reasons why it may be useful to call out use cases for dual-stack. One is to identify the cases that could be addressed. Another is to see if the use cases should be addressed. And a third, to ensure we have a common understanding of how dual-stack is expected to be used.

This comment has been minimized.

@leblancd

leblancd Jun 19, 2018

@thockin - I'll switch this to 1 IPv4 + 1 IPv6 for this KEP. The support for multiple IPv6 addresses seems to be generating a lot of questions, and I don't find the 1 ULA + 1 GUA that compelling of a use case. If someone comes up with a compelling use case, we can change the test limits.

// Additional IP addresses allocated to the pod. Routable at least within
// the cluster. Empty if not yet allocated.
// +optional
ExtraPodIPs []string `json:"podIP,omitempty" protobuf:"bytes,6,opt,name=ExtrapodIPs"`

This comment has been minimized.

@thockin

thockin Jun 15, 2018

Member

I am not sure string is sufficient.

Since this API will further open the door to multi-network efforts, we maybe want this to be a list of structs, with each element carrying some meta-data. E.g. “use this for kubelet health-checks” or even arbitrary annotations, so that non-core implementations of multi-networking can link pods to network instances. E.g.:

```
type PodStatus struct {
    // ...
    ExtraPodIPs []PodIP
    // ...
}

type PodIP struct {
    IP         string
    Properties map[string]string
}
```

This comment has been minimized.

@thockin

thockin Jun 15, 2018

Member

and I am not sure map[string]string is what we want either. Worth thinking about what other data we might want here, or maybe even just using a struct with only one field for future expansion?

This comment has been minimized.

@lavalamp

lavalamp Jun 18, 2018

Member

It does seem like naming them might be useful, so you can refer to them before they have been assigned.

This proposal aims to extend the Kubernetes Pod Status API so that Kubernetes can track and make use of multiple IPv4/IPv6 address assignments per pod.
#### Versioned API Change: PodStatus v1 core
In order to maintain backwards compatibility for the core V1 API, this proposal adds a new "ExtraPodIPs" field to the core V1 version of the [PodStatus V1 core API](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#podstatus-v1-core), while maintaining the existing "PodIP" field. This change is in accordance with the [Kubernetes API change

This comment has been minimized.

@thockin

thockin Jun 15, 2018

Member

for context, maybe include an appendix or footnote detailing this?

The issue with compat here is older clients.

First: We can't really stop populating this field, or we break clients who don't understand PodIPs.

A simpler-seeming answer might be to have both PodIP and PodIPs where the plural always includes the singular. Again, the problem is older clients (think about a failed kubelet update which gets rolled back).

Consider an existing pod with PodIP = "X" and PodIPs = ["X", "Y"]. If we have a client that only knows PodIP, and it updates that field to "Z" but does not update PodIPs, the inclusivity invariant is violated.

If we make it optionally inclusive, we still may have muddled the intention - the client changed "X" to "Z", did they mean that "X" is no longer preferred or no longer present at all?

The same problem exists in the other direction. If the older client updated PodIP from "Z" to "X", that collides with PodIPs. Did the user mean the same "X"?

That doesn't leave much room for viable solutions. It can't be guaranteed exclusive nor can it be guaranteed inclusive, and optionally inclusive is confusing at best. The only real option here is to consider the real-life chance of detriment. Pod IPs do change (e.g. if the node reboots quickly, it can come back without evicting the pod, but IPAM can choose a new IP), but it's not expected to be common. It's also uncommon for multiple clients to be writing to Pod.status, so we really only need to consider the upgrade/rollback case for mutations.

Thus, I think it's reasonably safe to say it is optionally inclusive, and to define the edge case behavior.

This might be worth having a very focused sig-architecture chat about. There may be better options that I haven't thought about or precedents that I don't know.

Other options, off the top of my head:

  • Linked fields. Any update to singular with a corresponding update to plural can validate inclusivity/exclusivity. Any update to singular without an update to plural forces plural to some defined state (inclusive or exclusive TBD). Any update to plural without an update to singular forces singular to some defined state.

  • move Pod to v2, make plural be read-only in v1 and collapse singular and plural in v2. Changes through v2 extract the first item (or some other heuristic) into singular in the v1 view.

This comment has been minimized.

@danwinship

danwinship Jun 15, 2018

Contributor

There may be better options that I haven't thought about or precedents that I don't know.

The "API changes" doc explicitly recommends this approach. (The example with Param and ExtraParams.)

This comment has been minimized.

@squeed

squeed Jun 18, 2018

Ah, that makes sense. I hadn't thought about kubelet downgrades (scary!).

I would really like to attach metadata to IPs - thoughts about how?

This comment has been minimized.

@thockin

thockin Jun 18, 2018

Member

I think we need it too. Pluralizing a field is a pattern that doesn't pop up often enough to have an established convention that I know of, so maybe we will have to trailblaze a bit.

This comment has been minimized.

@leblancd

leblancd Jun 19, 2018

@squeed - Yes, I like the idea of adding metadata to IPs as well, and there seem to be a lot of future applications that will need this. If we're changing the API anyway, let's make it extensible. In a comment below the ExtraIPs field, Tim suggested adding a string mapping:

Properties map[string]string

This comment has been minimized.

@pmichali

pmichali Jun 19, 2018

Are there two different suggestions here? One being that the old field has the first (primary) value and the array has additional entries (seems like the API changes doc specifies this). The other being the new array containing the entry held in the original field (duplicating)? Just trying to get this straight.

Would it be good to have a separate section to discuss field extension in general, the pros, cons, corner cases, etc., since we have several of these?

This comment has been minimized.

@leblancd

leblancd Jun 20, 2018

@pmichali - Right, the two main options being discussed could be called exclusive, where you have

podIP="X"
podIPs=["Y", "Z"]

versus inclusive, where you have:

podIP="X"
podIPs=["X", "Y", "Z"]

The exclusive is what was originally proposed, and is based on the pluralized example in the API changes doc. And then there's an "optional inclusive" variant that seems to be an inclusive/exclusive hybrid (old clients only update the singular podIP, so podIPs inclusivity is lost upon modification of podIP). I had been thinking that either a pure exclusive or a pure inclusive model should work - old clients remain agnostic of the pluralized field, but the conversion logic in the server (the code that converts from versioned APIs to internal versions, and vice versa) is smart enough to maintain coherency in the podIP and podIPs fields... but I could be missing nuances.

Yes, the "extra section" is what @thockin is suggesting (in appendix or footnote form), although I'm thinking that if the API changes doc can be updated accordingly, we could just point to that.
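
For illustration of the "inclusive" coherency rule being debated here, the sketch below keeps podIP == podIPs[0] and repairs whichever field a client left stale. This is assumed logic for discussion purposes, not actual apiserver conversion code, and the type and helper names are hypothetical:

```go
// Illustrative sketch of inclusive podIP/podIPs sync, not apiserver code.
package main

import "fmt"

type podStatus struct {
	PodIP  string
	PodIPs []string
}

// syncPodIPFields makes the singular and plural fields coherent after an
// update. oldStatus is the previously stored object, newStatus the update.
func syncPodIPFields(oldStatus, newStatus *podStatus) {
	switch {
	case newStatus.PodIP != oldStatus.PodIP && newStatus.PodIP != "":
		// An (older) client rewrote only the singular field: make it the
		// new default entry in the plural list.
		newStatus.PodIPs = append([]string{newStatus.PodIP},
			dropIPs(newStatus.PodIPs, oldStatus.PodIP, newStatus.PodIP)...)
	case len(newStatus.PodIPs) > 0:
		// A dual-stack-aware client wrote the plural field: mirror the
		// first entry back into the singular field for old readers.
		newStatus.PodIP = newStatus.PodIPs[0]
	}
}

// dropIPs removes the listed addresses from ips, preserving order.
func dropIPs(ips []string, remove ...string) []string {
	var out []string
	for _, ip := range ips {
		keep := true
		for _, r := range remove {
			if ip == r {
				keep = false
			}
		}
		if keep {
			out = append(out, ip)
		}
	}
	return out
}

func main() {
	old := &podStatus{PodIP: "10.244.1.4", PodIPs: []string{"10.244.1.4", "fd00::4"}}
	upd := &podStatus{PodIP: "10.244.9.9", PodIPs: []string{"10.244.1.4", "fd00::4"}} // old client changed only PodIP
	syncPodIPFields(old, upd)
	fmt.Println(upd.PodIP, upd.PodIPs) // 10.244.9.9 [10.244.9.9 fd00::4]
}
```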

This comment has been minimized.

@thockin

thockin Jul 6, 2018

Member

@leblancd Not everyone will do a read-modify-write, and even if they do ... older clients will drop unknown fields AND they can send a patch. So pathologically:

Mandatory exclusive:

  • podIP="X", podIPs=["Y", "Z"]
  • client sends a patch operation changing podIP to "Z".
    => If we fail that at the apiserver, the client has no idea why an update that used to work doesn't any more. If we don't fail it, we can either remove "Z" from podIPs (mutating user input is generally verboten) or we can ignore it and break the invariant.

Mandatory inclusive:

  • podIP="X", podIPs=["X", "Y"]
  • client sends a patch operation changing podIP to "Z".
    => If we fail that at the apiserver, the client has no idea why an update that used to work doesn't any more. If we don't fail it, we can either add "Z" to podIPs (mutating user input is generally verboten) and remove "X" (loses user metadata) or we can ignore it and break the invariant.

I'm saying we need to find a cleaner way and set precedent very carefully. Optionally-inclusive (which I guess is == optionally exclusive :) is still confusing.

  • podIP="X", podIPs=["X", "Y"]
  • client sends a patch operation changing podIP to "Z".
    => Is the "X" in podIPs valid any more?

I want to consider more clever ideas (if anyone has some). This will happen again.

This comment has been minimized.

@leblancd

leblancd Jul 9, 2018

@thockin It's not clear to me how we can solve this.

We talked elsewhere in this review about including metadata (e.g. annotations) along with IPs in the podIPs/extraPodIPs array. If we do include metadata, I think we'd be better off (or it would at least be simpler) using an inclusive model. The reason being that the singular podIP field needs to remain a simple IP address (the "default" IP address) without metadata. If we use an inclusive model, we can store metadata associated with the default IP address in the podIPs array. (For exclusive, we'd have to add yet another field for that IP address' metadata).

Of course, if an old client changes podIP, the old client is agnostic to metadata, so the new podIP will be added without metadata. The podIPs field would then have a mix of entries with/without metadata. I don't know how bad of a problem this would be, but there's probably no way around it.

For the mandatory inclusive scenario, I would consider the "X" in podIP and the "X" in podIPs to be duplicate copies of the same field, so I would think that if client replaces/overwrites the "X" in podIP with "Z", then "Z" should be added to podIPs, and the (duplicate) "X" in podIPs should be removed, along with its metadata, IMO. Is mutating user input so bad that we shouldn't even consider it?

I'm looking for guidance. In the meantime, I'll rewrite this section assuming a mandatory inclusive model for now, because of the metadata concerns I mentioned above.

This comment has been minimized.

@thockin

thockin Jul 9, 2018

Member

@leblancd I think it is about harm reduction. If someone uses an older client, it can't stop working. In practice I doubt we will have many problems with multi-writer, so I think we can opt for the optionally-inclusive mode.

I've put this on the agenda for sig-arch this Thursday, though we have a pretty full list, so it may push to next week.

https://docs.google.com/document/d/1BlmHq5uPyBUDlppYqAAzslVbAO8hilgjqZUTaNXUhKM/edit?ts=5b43a34a#

(Note that this assumed behavior works very well for the typical dual-stack case where there are one IPv4 and one IPv6 service CIDR, and one IPv4 and one IPv6 address per pod.)
#### Versioned API Change: Service v1 core
In order to maintain backwards compatibility for the core V1 API, this proposal adds a new "ExtraClusterIPs" field to the core V1 version of the [Service V1 core API](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#service-v1-core), while maintaining the existing "ClusterIP" field. This change is in accordance with the [Kubernetes API change

This comment has been minimized.

@thockin

thockin Jun 15, 2018

Member

same discussion as above :)

The singular --service-cluster-ip-range argument will become deprecated.
#### controller-manager Startup Configuration for Multiple Service CIDRs
A new, plural "service-cluster-ip-ranges" option for the [controller-manager startup configuration](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/) is proposed, in addition to retaining the existing, singular "service-cluster-ip-range" option (for backwards compatibility):

This comment has been minimized.

@thockin

thockin Jun 15, 2018

Member

if we can do that transparently, that might be great for all of these flags (if a bit less obviously named)

A new, plural "service-cluster-ip-ranges" option for the [controller-manager startup configuration](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/) is proposed, in addition to retaining the existing, singular "service-cluster-ip-range" option (for backwards compatibility):
```
--service-cluster-ip-range ipNet [Singular IP CIDR, Default: 10.0.0.0/24]
--service-cluster-ip-ranges stringSlice [Multiple IP CIDRs, comma separated list of CIDRs, Default: []]

This comment has been minimized.

@thockin

thockin Jun 15, 2018

Member

good call - err on the side of over-specifying :)

### Support of Health/Liveness/Readiness Probes
Currently, health, liveness, and readiness probes are defined without any concern for IP addresses or families. For the first release of dual-stack support, no configuration "knobs" will be added for probe definitions. A probe for a dual-stack pod will be deemed successful if either an IPv4 or IPv6 response is received. (QUESTION: Does the current probe implementation include DNS lookups, or are IP addresses hard coded?)

This comment has been minimized.

@thockin

thockin Jun 15, 2018

Member

Specifically, this means that probes will execute against the primary IP only?

The [Kubernetes ingress feature](https://kubernetes.io/docs/concepts/services-networking/ingress/) relies on the use of an ingress controller. The two "reference" ingress controllers that are considered here are the [GCE ingress controller](https://github.com/kubernetes/ingress-gce/blob/master/README.md#glbc) and the [NGINX ingress controller](https://github.com/kubernetes/ingress-nginx/blob/master/README.md#nginx-ingress-controller).
#### GCE Ingress Controller: Does Not Support External, Dual-Stack Operation
It appears that [GCE ingress controller](https://github.com/kubernetes/ingress-gce/blob/master/README.md#glbc) support is limited to IPv4-only access (QUESTION: Is this true???) from outside of the cluster. Therefore, GCE ingress controllers will not be capable of supporting external, dual-stack access to services running in a dual-stack Kubernetes cluster.

This comment has been minimized.

@thockin

thockin Jun 15, 2018

Member

no, we allow V6, but I don't know if we allow both at the same time? Knowing the data model, I expect we do.

@thockin

This comment has been minimized.

Member

thockin commented Jun 16, 2018

#### Internal Representation:
The internal representation for the service IP (a.k.a. cluster IP) will be changed from a singular string to a slice of strings, and renamed "ClusterIPs":
```
ClusterIPs []string `json:"clusterIPs,omitempty" protobuf:"bytes,3,opt,name=clusterIPs"`

This comment has been minimized.

@lavalamp

lavalamp Jun 18, 2018

Member

I think you can use this construct in the external struct, too. Just require that ClusterIP == ClusterIPs[0]. Convert and default in a few places, and this is not a terrible user experience.
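
A minimal sketch of the validation rule @lavalamp describes here (when both fields are set, the singular ClusterIP must equal ClusterIPs[0]); the type and function names are illustrative, not the final API:

```go
// Illustrative validation for the ClusterIP / ClusterIPs pairing rule.
package main

import (
	"errors"
	"fmt"
)

type serviceSpec struct {
	ClusterIP  string
	ClusterIPs []string
}

func validateClusterIPFields(s *serviceSpec) error {
	if len(s.ClusterIPs) == 0 {
		return nil // singular-only object, e.g. from an old client
	}
	if s.ClusterIP != s.ClusterIPs[0] {
		return errors.New("spec.clusterIP must match spec.clusterIPs[0]")
	}
	return nil
}

func main() {
	ok := &serviceSpec{ClusterIP: "10.0.0.10", ClusterIPs: []string{"10.0.0.10", "fd00::10"}}
	bad := &serviceSpec{ClusterIP: "10.0.0.10", ClusterIPs: []string{"fd00::10"}}
	fmt.Println(validateClusterIPFields(ok), validateClusterIPFields(bad))
}
```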

This comment has been minimized.

@lavalamp

lavalamp Jun 18, 2018

Member

Ah, I see there's already discussion about that for pod IPs--if cluster IP is already mutable then what I said probably won't work.

@lavalamp

This comment has been minimized.

Member

lavalamp commented Jun 18, 2018

re: feature branches... I don't know. If you're careful about syncing from master back into your branch, maybe it will be OK. I don't think this would conflict too much with the apply branch, so it might be worth thinking about. I wouldn't do two branches that make extensive changes like this in the same areas at the same time.

You currently would need a few people working on this that are on the admin list, as it's a bit spotty whether the merge bot will get around to feature branch PRs.

@baodongli

This comment has been minimized.

baodongli commented Jun 21, 2018

I understand that the goal of this KEP is to support dual stack on the pod network, and to make the IPs configured on the pod network available through k8s APIs. I noticed in the comments that it leans toward supporting only one address per family, although the proposal suggests supporting an arbitrary number of IP addresses. However, the semantics of those addresses are not specified outside of the dual-stack context. If one address per family is supported, I think there is an alternative approach to achieve the same without requiring an API change. The idea is to use IPv4-embedded IPv6 addresses as specified here: https://tools.ietf.org/html/rfc6052. This is how it might work:
-- configure the k8s cluster with both an IPv6 CIDR and an IPv4 CIDR
-- configure the k8s cluster with an IPv6 service CIDR (and optionally an IPv4 service CIDR; I see a lot of discussion around this in this KEP)
-- the CNI plugin's IPAM creates an IPv4-embedded IPv6 address based on the two CIDRs, and configures the pods to operate in dual-stack mode
-- the IPv4-embedded IPv6 address is recorded in the podIP field of the pod status struct
-- k8s services (such as kube-proxy) interpret the address based on whether both service CIDRs or only one of them is configured
-- kubectl displays the podIPs based on whether both CIDRs or only one of them is configured

This would require much less change and test effort, and can make the feature available with less effort and time.
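
For reference, the RFC 6052 embedding @baodongli mentions is mechanically simple when a /96 prefix is used: the IPv4 address occupies the last 32 bits of the IPv6 address. A small sketch, where the prefix and pod address are example values only:

```go
// Sketch of RFC 6052 IPv4-embedded IPv6 with a /96 prefix: the IPv4
// address is written into the final 4 bytes of the IPv6 address.
package main

import (
	"fmt"
	"net"
)

// embedIPv4 returns prefix/96 with ip4 written into the final 4 bytes.
func embedIPv4(prefix net.IP, ip4 net.IP) net.IP {
	out := make(net.IP, net.IPv6len)
	copy(out, prefix.To16())
	copy(out[12:], ip4.To4())
	return out
}

func main() {
	prefix := net.ParseIP("64:ff9b::") // RFC 6052 well-known prefix (example)
	pod := net.ParseIP("10.244.1.4")
	fmt.Println(embedIPv4(prefix, pod)) // 64:ff9b::af4:104
}
```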

@jdumars jdumars added this to Backlog in KEP Tracking Jun 25, 2018

@jdumars jdumars moved this from Backlog to In SIG Review in KEP Tracking Jun 25, 2018

@idealhack

This comment has been minimized.

Member

idealhack commented Jul 4, 2018

/ok-to-test

@warmchang

This comment has been minimized.

Contributor

warmchang commented Jul 7, 2018

/retest

@leblancd

This comment has been minimized.

leblancd commented Oct 1, 2018

@thockin - Could you take a look at this when you get a chance? There have been several changes since the last time you reviewed that deserve some thought/discussion, e.g.:

  • Single-family service CIDR
  • Dual-stack endpoint API change
  • "endpointFamily" configuration for services

I've also opened 2 API review PRs in architecture-tracking (one for PodStatus, one for Endpoint API change), and there are a couple of coding PRs (PR #69029 and PR #69153).
Thanks!
@bowei

This comment has been minimized.

Member

bowei commented Oct 2, 2018

cc: @varunmar

@rkamudhan

This comment has been minimized.

rkamudhan commented Oct 4, 2018

@leblancd Regarding dual-stack endpoints, will each endpoints subset have both IPv4 and IPv6 addresses?

@leblancd

This comment has been minimized.

leblancd commented Oct 4, 2018

@rkamudhan - Yes, each endpoints subset will have both IPv4 and IPv6 addresses. Our initial proposal was to treat each IP address as a separate endpoint (the IPv4 address was one endpoint, the IPv6 address a different endpoint, so 2 endpoints per backend). However, we were advised that 2 endpoints per backend was problematic for the reasons listed here: https://github.com/kubernetes/community/pull/2254/files#diff-bd8f8bd5daa9561082402a2ff4795f05R279
The current proposal is to modify the "EndpointAddress" structure in the API to include an IPs[] list (see https://github.com/kubernetes/community/pull/2254/files#diff-bd8f8bd5daa9561082402a2ff4795f05R293).
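
For readers following along, here is a rough Go sketch of the EndpointAddress shape being discussed (one endpoint per backend pod, carrying the addresses of both families). It mirrors the linked proposal only at a high level; the exact field names, tags, and omitted fields are not final:

```go
// Rough, non-authoritative shape of a dual-stack EndpointAddress.
package main

import "fmt"

type EndpointAddress struct {
	// IP remains the primary (default-family) address that old clients read.
	IP string
	// IPs lists all addresses for this backend pod, with IPs[0] == IP;
	// a dual-stack pod would have one IPv4 and one IPv6 entry.
	IPs []string
	// Hostname, NodeName, TargetRef omitted for brevity.
}

func main() {
	ep := EndpointAddress{IP: "10.244.1.4", IPs: []string{"10.244.1.4", "fd00:10:244:1::4"}}
	fmt.Printf("%+v\n", ep)
}
```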

@sb1975

This comment has been minimized.

sb1975 commented Oct 9, 2018

@leblancd - Can you please confirm what the tentative timeline is for this dual-stack feature to be available?
Also, is it possible to utilize two different ingresses - one for IPv4 and another for IPv6 - in the same Kubernetes cluster to reach the same application API with the present Kubernetes release (let's say v1.12)?

@leblancd

This comment has been minimized.

leblancd commented Oct 9, 2018

@sb1975 - The target release for Kubernetes dual stack is now Release 1.14. The schedule for release 1.14 has not been finalized (release schedules are currently published only up to 1.13; see https://github.com/kubernetes/sig-release/tree/master/releases).
Setting up IPv4 ingress in parallel with IPv6 ingress with current Kubernetes might be possible, but complex. For example, you could try setting up an IPv6-only cluster, and then add stateless NAT46 at the cluster edge for access from external IPv4 clients to IPv6 Kubernetes services. It might also be possible to use an NGINX ingress controller that has dual-stack access from external clients, but maps to single-family Kubernetes nodePorts. I haven't tried either approach, so YMMV.

@sb1975

This comment has been minimized.

sb1975 commented Oct 12, 2018

@leblancd - Thanks for the details. Do you have any link that explains the steps involved in creating the two ingresses - one for IPv4 and one for IPv6 separately - and also how to configure the NAT46 in the IPv4 ingress?

@leblancd

This comment has been minimized.

leblancd commented Oct 22, 2018

@sb1975 - With the help of @aojea, we've put together an overview on how to install a dual-stack NGINX ingress controller on an (internally) IPv6-only cluster: "Installing a Dual-Stack Ingress Controller on an IPv6-Only Kubernetes Cluster". This requires that the nodes be configured with dual-stack public/global IPv4/IPv6 addresses, and it runs the ingress controller pods on the host network of each node.

I haven't configured Stateless NAT46 on a Kubernetes IPv6-only cluster, but you can find some good background references on the web; e.g. Citrix has a helpful reference for configuring their NAT46 appliance here, and there's a video on configuring Stateless NAT46 on a Cisco ASA here.

- Link Local Addresses (LLAs) on a pod will remain implicit (Kubernetes will not display nor track these addresses).
- For simplicity, only a single family of service IPs per cluster will be supported (i.e. service IPs are either all IPv4 or all IPv6).
- Backend pods for a service can be dual stack.
- Endpoints for a dual-stack backend pod will be represented as a dual-stack address pair (i.e. 1 IPv4/IPv6 endpoint per backend pod, rather than 2 single-family endpoints per backend pod)

This comment has been minimized.

@thockin

thockin Oct 23, 2018

Member

On this and the previous: If we say that Services are always single-family, and you said that "Cross-family connectivity" is a non-goal, what value do we get from endpoints being dual-stack?

I guess I could see an argument for headless Services or external Services. Is that what motivates this? Is it worth the effort? Could it be deferred?

Or is this about NodePorts being available on both families?

This comment has been minimized.

@leblancd

leblancd Oct 28, 2018

This is a very good question, and your point is well-taken that we probably don't get value out of endpoints being dual-stack. Maybe you can confirm my thought process here. I had added dual-stack endpoints with the thinking that maybe, somehow, ingress controllers or load balancers might need to know about the V4 and V6 addresses of endpoints, in order to provide dual-stack access from outside. Thinking about this more, I don't think this is the case. For ingress controllers and load balancers to provide dual-stack access, support for dual-stack NodePorts and dual-stack externalIPs (and ingress controllers using host-network pods) should be sufficient.

Let me know what you think, so I can modify the spec.

For headless services, I believe that we can get by with a single IP family. The IP assigned for a headless service will match the "primary" IP family. This would put headless services on par with non-headless Kube services.

Re. the "Cross-family connectivity" item, I should remove this from the non-goals. It's confusing and misleading. Family cross-over will be supported, e.g. with a dual-stack ingress controller mapping to a single-family endpoint inside the cluster. Cross-family connectivity won't be supported inside the cluster, but that's pretty obvious.

This comment has been minimized.

@thockin

thockin Oct 29, 2018

Member

I think the non-goal is correct -- Kubernetes itself is NOT doing address family translation. The fact that Ingress controllers can do that is merely a side-effect of the fact that they are almost universally full proxies which accept the frontside connection and open another for the backside. I don't think it is obvious that we won't convert a v4 to a v6 connection via service VIP, and we should be clear that it is NOT part of the scope.

When I wrote this question I was thinking about Service VIP -> Backends. That has to be same-family (because of the non-goal above), so alt-family endpoints is nonsensical. I think this is also true for external IPs and load-balancer IPs -- they are received and processed the same as cluster IPs, with no family conversion.

BUT...

  1. external IPs is a list, so could plausibly include both families.
  2. LB status.ingress is a list, so could plausibly include both families.
  3. nodePorts could reasonably be expected to work on any interface address.
  4. headless services could reasonably be expected to work on any interface address.

You suggest you might be willing to step back from (4), but unless we also step back from (1), (2), AND (3), that wouldn't save any work.

I think the only simplification here is if we can say "all service-related logic is single family". And I am not sure that is very useful -- tell me if you disagree.

Assuming we have ANY service-related functionality with both families, we need dualstack endpoints :(

Or am I missing something?

This comment has been minimized.

@thockin

thockin Oct 29, 2018

Member

...more to the point, are multi-family endpoints, nodeports, LBs, etc. something we can defer to a "phase 2" and iterate? Would it simplify this proposal or just make it not useful?

This comment has been minimized.

@leblancd

leblancd Oct 31, 2018

Your analysis is spot on. I agree, we would need dual-stack endpoints to support (1), (2), (3), and (4) (although I'm not really familiar with how LB status.ingress works, but it sounds like it's also driven by endpoint events/state), and if we support 1 of the 4 we might as well support all 4.

And regarding the idea of NOT supporting 1-4 as a simplification, I believe that would make this proposal not very useful. What we'd have left is informational visibility to dual stack pod addresses, as far as I can tell.

I'd say that the minimum useful subset of support would have to include dual-stack endpoints, nodeports, LBs, externalIPs, and headless services, IMO.

This comment has been minimized.

@thockin

thockin Nov 2, 2018

Member

OK, so any scope-reduction around not doing dual-stack endpoints is rendered moot, and all such comments should be ignored :)

- NodePort: Support listening on both IPv4 and IPv6 addresses
- ExternalIPs: Can be IPv4 or IPv6
- Kube-proxy IPVS mode will support dual-stack functionality similar to kube-proxy iptables mode as described above. IPVS kube-router support for dual stack, on the other hand, is considered outside of the scope of this proposal.
- For health/liveness/readiness probe support, a kubelet configuration will be added to allow a cluster administrator to select a preferred IP family to use for implementing probes on dual-stack pods.

This comment has been minimized.

@thockin

thockin Oct 23, 2018

Member

If we have a "primary" family (e.g. the one used for Services) do we need this flag?

Do we need a per-pod, per-probe flag to request address family?

This comment has been minimized.

@leblancd

leblancd Oct 28, 2018

I think we can do without a global kubelet configuration for preferred IP family for probes. I'll change this to say that health/liveness/readiness probes will use the IP family of the default IP for a pod (which should match the primary IP family in most cases).

I don't think we need a per-pod, per-probe flag for IP family for the initial release. In a future release, we can consider adding a per-pod, per-probe flag to allow e.g. a user to specify that probes can be dual stack, meaning probes are sent for both IP families, and success is declared if either probe is successful, or alternatively if both probes are successful.

// Properties: Arbitrary metadata associated with the allocated IP.
type PodIPInfo struct {
IP string
Properties map[string]string

This comment has been minimized.

@thockin

thockin Oct 23, 2018

Member

Unless we have examples of what we want to put in properties and we are willing to spec, validate, and test things like key formats, content size, etc, we should probably leave this out for now. Just a comment indicating this is left as a followup patch-set, perhaps?

This comment has been minimized.

@leblancd

leblancd Oct 28, 2018

@thockin - Sure, I can take this out. If the Properties map is removed, should the PodIPInfo structure be removed, and just leave PodIPs as a simple slice of strings, to simplify?

This comment has been minimized.

@squeed

squeed Oct 29, 2018

The properties map would be very useful for multi-network. And no sense in changing the data structure twice? I'd prefer to keep it if possible.

This comment has been minimized.

@thockin

thockin Oct 29, 2018

Member

No matter what, I would keep the struct.

@squeed if we keep it, we need to have concrete use-cases for it such that we can flesh out the management of that data as I listed above. We can always ADD fields, with new validation, etc. I'd rather add it when we have real need. I am confident it's something we will want, eventually.

This comment has been minimized.

@varunmar

varunmar Nov 5, 2018

I see this has already generated some discussion :)
I'll add my own comments here - @leblancd you can ignore my other comment up above.

One pattern I've seen used successfully in internal interfaces is to have a mostly-strongly typed struct with a bag-of-strings at the end for "experimental" free-for-all properties. But that requires agreement that the bag-of-strings should not be relied upon, and can change at any time. I do not think we can enforce such a thing if we put this into an external facing API, so my vote is to only add fully-typed fields with validation and strong semantic meaning. Then we can argue about names all at once before they are used instead of after the fact :)

Properties map[string]string
}
// IP addresses allocated to the pod with associated metadata. This list

This comment has been minimized.

@thockin

thockin Oct 23, 2018

Member

We will also want to document the sync logic. I finally sent a PR against docs.

#2838

This comment has been minimized.

@leblancd

leblancd Oct 28, 2018

By "document the sync logic", you mean just adding references to that doc in this spec, right?

This comment has been minimized.

@thockin

thockin Oct 29, 2018

Member

I'd spell it out in the comments. I don't expect end-users to read our API devel docs :)

```
##### Default Pod IP Selection
Older servers and clients that were built before the introduction of full dual stack will only be aware of and make use of the original, singular PodIP field above. It is therefore considered to be the default IP address for the pod. When the PodIP and PodIPs fields are populated, the PodIPs[0] field must match the (default) PodIP entry. If a pod has both IPv4 and IPv6 addresses allocated, then the IP address chosen as the default IP address will match the IP family of the cluster's configured service CIDR. For example, if the service CIDR is IPv4, then the IPv4 address will be used as the default address.

This comment has been minimized.

@thockin

thockin Oct 23, 2018

Member

"When the PodIP and PodIPs fields are populated" implies no sync logic. I think we all settled on sync being a better path?

This comment has been minimized.

@leblancd

leblancd Oct 28, 2018

By "sync logic" you mean how the singular value from old clients (and plural value from new clients) gets fixed up (as described in the "On Compatibility" section?

I'll delete that line. What I meant to say is covered in your API change guide update.
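A minimal Go sketch of the defaulting rule in the excerpt, assuming a single-family service CIDR (the helper name and the standalone main are illustrative, not apiserver code):

```
package main

import (
	"fmt"
	"net"
)

// defaultPodIP picks the pod IP whose family matches the service CIDR; that
// address becomes the singular PodIP and must also be PodIPs[0].
func defaultPodIP(podIPs []string, serviceCIDR *net.IPNet) (string, error) {
	wantV4 := serviceCIDR.IP.To4() != nil
	for _, s := range podIPs {
		ip := net.ParseIP(s)
		if ip == nil {
			return "", fmt.Errorf("invalid pod IP %q", s)
		}
		if (ip.To4() != nil) == wantV4 {
			return s, nil
		}
	}
	return "", fmt.Errorf("no pod IP matches the service CIDR family")
}

func main() {
	_, cidr, _ := net.ParseCIDR("10.96.0.0/12") // IPv4 service CIDR
	ip, _ := defaultPodIP([]string{"fd00:10:244::5", "10.244.1.5"}, cidr)
	fmt.Println(ip) // 10.244.1.5 becomes the default (PodIP, PodIPs[0])
}
```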

- Because service IPs will remain single-family, pods will continue to access the CoreDNS server via a single service IP. In other words, the nameserver entries in a pod's /etc/resolv.conf will typically be a single IPv4 or single IPv6 address, depending upon the IP family of the cluster's service CIDR.
- Non-headless Kubernetes services: CoreDNS will resolve these services to either an IPv4 entry (A record) or an IPv6 entry (AAAA record), depending upon the IP family of the cluster's service CIDR.
- Headless Kubernetes services: CoreDNS will resolve these services to either an IPv4 entry (A record), an IPv6 entry (AAAA record), or both, depending on the service's endpointFamily configuration (see [Configuration of Endpoint IP Family in Service Definitions](#configuration-of-endpoint-ip-family-in-service-definitions)).

This comment has been minimized.

@thockin

thockin Oct 23, 2018

Member

Depends on previous question about this config being per-service vs per-pod

This comment has been minimized.

@leblancd

leblancd Oct 28, 2018

I now think single-family headless services would work (on par with non-headless kubernetes services being single-family).
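For illustration only, the A-versus-AAAA decision above reduces to the family of the address being published; this Go sketch is not CoreDNS code.

```
package sketch

import "net"

// recordType maps an endpoint or service IP to the DNS record type that
// would be served for it: A for IPv4, AAAA for IPv6.
func recordType(ip string) string {
	parsed := net.ParseIP(ip)
	switch {
	case parsed == nil:
		return ""
	case parsed.To4() != nil:
		return "A"
	default:
		return "AAAA"
	}
}
```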

The [Kubernetes ingress feature](https://kubernetes.io/docs/concepts/services-networking/ingress/) relies on the use of an ingress controller. The two "reference" ingress controllers that are considered here are the [GCE ingress controller](https://github.com/kubernetes/ingress-gce/blob/master/README.md#glbc) and the [NGINX ingress controller](https://github.com/kubernetes/ingress-nginx/blob/master/README.md#nginx-ingress-controller).
#### GCE Ingress Controller: Out-of-Scope, Testing Deferred For Now
It is not clear whether the [GCE ingress controller](https://github.com/kubernetes/ingress-gce/blob/master/README.md#glbc) supports external, dual-stack access. Testing of dual-stack access to Kubernetes services via a GCE ingress controller is considered out-of-scope until after the initial implementation of dual-stack support for Kubernetes.

This comment has been minimized.

@thockin

thockin Oct 23, 2018

Member

I'd say this is Google's problem to implement.

@bowei

This comment has been minimized.

@leblancd

leblancd Oct 28, 2018

I'll just say this is out-of-scope for this effort.

This comment has been minimized.

@varunmar

varunmar Nov 5, 2018

The unclear parts should at least be clarified :)
I can take this one.

#### Multiple bind addresses configuration
The existing "--bind-address" option for the will be modified to support multiple IP addresses in a comma-separated list (rather than a single IP string).

This comment has been minimized.

@thockin

thockin Oct 23, 2018

Member

Why is this a sub-heading of cloud-providers?

There are other components that support a flag like this - do we have a list?

This comment has been minimized.

@rpothier

rpothier Oct 30, 2018

Kube-proxy and the kubelet startup config have a similar requirement; it's a good idea to list them.
Also possibly the controller manager if we went with the full Dual Stack.

This comment has been minimized.

@rpothier

rpothier Nov 5, 2018

Also, that line is missing the link to the cloud controller manager, so it should read
The existing "--bind-address" option for the cloud-controller-manager will be modified ...

- name: MY_POD_IPS
valueFrom:
fieldRef:
fieldPath: status.podIPs

This comment has been minimized.

@thockin

thockin Oct 23, 2018

Member

@kubernetes/sig-cli-api-reviews We should get a consult as to whether this is right or whether the fieldpath should be something like status.podIPs[].ip - it was supposed to be a literal syntax.

@leblancd

@thockin - Thank you for your thorough review! I think that eliminating the support for dual-stack endpoints makes sense, let me know if I should go ahead and remove this.

- Link Local Addresses (LLAs) on a pod will remain implicit (Kubernetes will not display nor track these addresses).
- For simplicity, only a single family of service IPs per cluster will be supported (i.e. service IPs are either all IPv4 or all IPv6).
- Backend pods for a service can be dual stack.
- Endpoints for a dual-stack backend pod will be represented as a dual-stack address pair (i.e. 1 IPv4/IPv6 endpoint per backend pod, rather than 2 single-family endpoints per backend pod)

This comment has been minimized.

@leblancd

leblancd Oct 28, 2018

This is a very good question, and your point is well-taken that we probably don't get value out of having endpoints be dual stack. Maybe you can confirm my thought process here. I had added dual-stack endpoints with the thinking that maybe, somehow, ingress controllers or load balancers might need to know about the V4 and V6 addresses of endpoints in order to provide dual-stack access from outside. Thinking about this more, I don't think this is the case. For ingress controllers and load balancers to provide dual-stack access, support of dual-stack NodePorts and dual-stack externalIPs (and ingress controllers using host-network pods) should be sufficient.

Let me know what you think, so I can modify the spec.

For headless services, I believe that we can get by with a single IP family. The IP assigned for a headless service will match the "primary" IP family. This would put headless services on par with non-headless Kube services.

Re. the "Cross-family connectivity", I should remove this from the non-goals. It's confusing and misleading. Family cross over will be supported e.g. with dual-stack ingress controller mapping to a single family endpoint inside the cluster. Cross-family connectivity won't be supported inside the cluster, but that's pretty obvious.

Currently, health, liveness, and readiness probes are defined without any concern for IP addresses or families. For the first release of dual-stack support, a cluster administrator will be able to select the preferred IP family to use for probes when a pod has both IPv4 and IPv6 addresses. For this selection, a new "--preferred-probe-ip-family" argument for the [kubelet startup configuration](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/) will be added:
```
--preferred-probe-ip-family string ["ipv4", "ipv6", or "none". Default: "none", meaning use the pod's default IP]

This comment has been minimized.

@leblancd

leblancd Oct 28, 2018

See my response to your earlier comment. I think we can do without this configuration, and for probes Kubelet should use the family of the default IP for each pod. I don't think we need a per-pod or per-probe configuration in the initial release of dual-stack, maybe do this as a followup (including support of probes that work on both IP families, requiring either both V4 and V6 responses, or either V4 or V6 responses).


This feature requires the use of the [CNI Networking Plugin API version 0.3.1](https://github.com/containernetworking/cni/blob/spec-v0.3.1/SPEC.md)
or later. The dual-stack feature requires no changes to this API.
The versions of CNI plugin binaries that must be used for proper dual-stack functionality (and IPv6 functionality in general) depend upon the version of Docker that is used in the cluster nodes (see [CNI issue #531](https://github.com/containernetworking/cni/issues/531) and [CNI plugins PR #113](https://github.com/containernetworking/plugins/pull/113)):

This comment has been minimized.

@squeed

squeed Oct 29, 2018

Issue 531 was a weird docker interaction that we've fixed; CNI no longer has a Docker version dependency.

This comment has been minimized.

@leblancd

leblancd Nov 1, 2018

@squeed (and @nyren) thanks for taking care of this! I think it's fair to say that CNI 0.7.0 (or newer) no longer has the Docker dependency, i.e. if you're using CNI 0.6.0, you'll still have the dependency? Kubernetes has a bunch of distro pointers that point to CNI 0.6.0 for plugin binaries, so those pointers should be bumped up in the near future.

@sb1975

This comment has been minimized.

sb1975 commented Oct 29, 2018

@sb1975 - With the help of @aojea, we've put together an overview on how to install a dual-stack NGINX ingress controller on an (internally) IPv6-only cluster: "Installing a Dual-Stack Ingress Controller on an IPv6-Only Kubernetes Cluster". This requires that the nodes be configured with dual-stack public/global IPv4/IPv6 addresses, and it runs the ingress controller pods on the host network of each node.

I haven't configured Stateless NAT46 on a Kubernetes IPv6-only cluster, but you can find some good background references on the web; e.g. Citrix has a helpful reference for configuring their NAT46 appliance here, and there's a video on configuring Stateless NAT46 on a Cisco ASA here.

@leblancd : Thanks for the response, this is very helpful, but we have an additional use case: we need an IPv4 client to reach a Kubernetes IPv6 service over non-HTTP traffic (like SNMP). Now I understand the ingress would support only HTTP rules, so how do we enable this, please?

The kubeadm configuration options for advertiseAddress and podSubnet will need to be changed to handle a comma-separated list of CIDRs:
```
api:
advertiseAddress: "fd00:90::2,10.90.0.2" [Multiple IP CIDRs, comma separated list of CIDRs]

This comment has been minimized.

@rpothier

rpothier Oct 30, 2018

Nit: the advertiseAddresses are addresses, not CIDRs

This comment has been minimized.

@leblancd

leblancd Nov 1, 2018

Yes indeedy.

@aojea

This comment has been minimized.

aojea commented Oct 30, 2018

@leblancd : Thanks for the response, this is very helpful, but we have an additional use case: we need an IPv4 client to reach a Kubernetes IPv6 service over non-HTTP traffic (like SNMP). Now I understand the ingress would support only HTTP rules, so how do we enable this, please?

@sb1975 the NGINX ingress controller supports non-HTTP traffic over TCP and UDP; however, it seems that feature is going to be removed (kubernetes/ingress-nginx#3197).

@thockin

This comment has been minimized.

Member

thockin commented Nov 2, 2018

@leblancd Can you make your edits and resolve any comment threads that are old and stale? Ping me and I'll do another top-to-bottom pass. Hopefully we can merge soon and iterate finer points.

@leblancd

This comment has been minimized.

leblancd commented Nov 2, 2018

@thockin - Will get the next editing pass in by Monday. I'm working on some V6 CI changes today. Thanks!

@neolit123

This comment has been minimized.

Member

neolit123 commented Nov 2, 2018

/assign @timothysc
@kubernetes/sig-cluster-lifecycle
for the kubeadm and kubespray related topics.

## Motivation
The adoption of IPv6 has increased in recent years, and customers are requesting IPv6 support in Kubernetes clusters. To this end, the support of IPv6-only clusters was added as an alpha feature in Kubernetes Version 1.9. Clusters can now be run in an IPv4-only, an IPv6-only, or a "single-pod-IP-aware" dual-stack configuration. This "single-pod-IP-aware" dual-stack support is limited by the following restrictions:
- Some CNI network plugins are capable of assigning dual-stack addresses on a pod, but Kubernetes is aware of only one address per pod.

This comment has been minimized.

@varunmar

varunmar Nov 5, 2018

Thanks for the detailed design! We are running some prototype dual stack configurations inside of GCE and starting to find ways to work around the fact that Kubernetes itself is unaware of the IPv6 addresses.
I'd like very much to stay close to the design in this doc, and reinforce our prototyping/testing efforts when the time comes. Please keep me in the loop :)

This proposal aims to extend the Kubernetes Pod Status API so that Kubernetes can track and make use of up to one IPv4 address and up to one IPv6 address assignment per pod.
#### Versioned API Change: PodStatus v1 core
In order to maintain backwards compatibility for the core V1 API, this proposal retains the existing (singular) "PodIP" field in the core V1 version of the [PodStatus V1 core API](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#podstatus-v1-core), and adds a new array of structures that store pod IPs along with associated metadata for each IP. The metadata for each IP (refer to the "Properties" map below) will not be used by the dual-stack feature, but is added as a placeholder for future enhancements, e.g. to allow CNI network plugins to indicate the physical network with which an IP is associated. Retaining the existing "PodIP" field for backwards compatibility is in accordance with the [Kubernetes API change guidelines](https://github.com/kubernetes/community/blob/master/contributors/devel/api_changes.md).

This comment has been minimized.

@varunmar

varunmar Nov 5, 2018

I've grown really leery of a bag-of-strings properties approach - I think that unless we commit to a naming scheme, or at least a conflict-resolution mechanism if two components start using the same keys in incompatible ways, I would really like to see these develop as fully specified types rather than a loose bag of strings.
What do you think?


```
--pod-cidr ipNetSlice [IP CIDRs, comma separated list of CIDRs, Default: []]
```
Only the first address of each IP family will be used; all others will be ignored.

This comment has been minimized.

@varunmar

varunmar Nov 5, 2018

Also, I think you mean "first CIDR" here, not the first address.
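In Go, the first-CIDR-of-each-family rule might look like this hypothetical helper (not the actual kubelet code):

```
package sketch

import "net"

// selectPodCIDRs keeps the first IPv4 and first IPv6 CIDR from a
// comma-separated --pod-cidr list; any additional CIDRs are ignored.
func selectPodCIDRs(cidrs []string) (v4, v6 *net.IPNet, err error) {
	for _, c := range cidrs {
		_, ipnet, perr := net.ParseCIDR(c)
		if perr != nil {
			return nil, nil, perr
		}
		if ipnet.IP.To4() != nil {
			if v4 == nil {
				v4 = ipnet
			}
		} else if v6 == nil {
			v6 = ipnet
		}
	}
	return v4, v6, nil
}
```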

IP string `json:"ip" protobuf:"bytes,1,opt,name=ip"`
// The IPs for this endpoint. The zeroth element (IPs[0]) must match
// the default value set in the IP field.
IPs []string `json:"ips" protobuf:"bytes,5,opt,name=ips"`

This comment has been minimized.

@varunmar

varunmar Nov 5, 2018

If the pod IPs have metadata describing them (the PodIPs struct), then isn't that useful to surface here as well?
It seems like @squeed has a use case for labeling IPs with the network names - is it useful to endpoints controllers (like ingress?) to know that here too?

If we do have the metadata here though, it will need to be exactly the same structure as the PodIPs. I'm not sure if that raises any other API issues.

#### Configuration of Endpoint IP Family in Service Definitions
This proposal adds an option to configure an endpoint IP family for a Kubernetes service:
```
endpointFamily: <ipv4|ipv6|dual-stack> [Default: dual-stack]

This comment has been minimized.

@varunmar

varunmar Nov 5, 2018

If service addresses only come from a single address family, why does this belong in the Service definition?

Or to put it another way - shouldn't the default be the same address family as the service CIDR? If Kubernetes itself isn't going to do any 6/4 translation, could you say more about how this can be used in any other way?
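If the knob is kept, a rough Go sketch of what it would do when publishing a backend pod's addresses follows (illustrative only; the thread above questions whether it is needed given single-family service IPs):

```
package sketch

import "net"

// filterByFamily selects which of a backend pod's IPs are published as
// endpoints for a given endpointFamily setting ("ipv4", "ipv6", or "dual-stack").
func filterByFamily(podIPs []string, family string) []string {
	var out []string
	for _, s := range podIPs {
		ip := net.ParseIP(s)
		if ip == nil {
			continue // skip anything unparsable
		}
		isV4 := ip.To4() != nil
		switch family {
		case "ipv4":
			if isV4 {
				out = append(out, s)
			}
		case "ipv6":
			if !isV4 {
				out = append(out, s)
			}
		default: // "dual-stack"
			out = append(out, s)
		}
	}
	return out
}
```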

The [Kubernetes ingress feature](https://kubernetes.io/docs/concepts/services-networking/ingress/) relies on the use of an ingress controller. The two "reference" ingress controllers that are considered here are the [GCE ingress controller](https://github.com/kubernetes/ingress-gce/blob/master/README.md#glbc) and the [NGINX ingress controller](https://github.com/kubernetes/ingress-nginx/blob/master/README.md#nginx-ingress-controller).
#### GCE Ingress Controller: Out-of-Scope, Testing Deferred For Now
It is not clear whether the [GCE ingress controller](https://github.com/kubernetes/ingress-gce/blob/master/README.md#glbc) supports external, dual-stack access. Testing of dual-stack access to Kubernetes services via a GCE ingress controller is considered out-of-scope until after the initial implementation of dual-stack support for Kubernetes.

This comment has been minimized.

@varunmar

varunmar Nov 5, 2018

The unclear parts should at least be clarified :)
I can take this one.

@timothysc

This comment has been minimized.

Member

timothysc commented Nov 5, 2018

@neolit123 This will affect sig-cluster-lifecycle, but it's squarely on sig-networking.
/assign @thockin

@timothysc timothysc removed their assignment Nov 6, 2018
