Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

set secondary address on host-network pods #106715

Merged
merged 2 commits into from Feb 4, 2022

Conversation

aojea
Copy link
Member

@aojea aojea commented Nov 29, 2021

host-network pods IPs are obtained from the reported kubelet nodeIPs.

Historically, host-network podIPs are immutable once set, but when
we've added dual-stack support, we didn't consider that the secondary
IP address may not be present at the same time that the primary nodeIP.

If a secondary IP address is added to a node after the host-network pods
IPs are set, we can add the secondary host-network pod IP address
maintaining the current behavior of not updating the current podIPs on
host-network pods.

This is an alternative to #105410 ,
without having to change the kubelet and host-network podIPs behavior
xref: kubernetes/enhancements#3000

/kind bug

CRI-API: IPs returned by PodSandboxNetworkStatus are ignored by the kubelet for host-network pods.

Fixes: #105320

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 29, 2021
@k8s-ci-robot
Copy link
Contributor

@aojea: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 29, 2021
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 29, 2021
@aojea
Copy link
Member Author

aojea commented Nov 29, 2021

/assign @danwinship @khenidak @thockin
/cc @hanamantagoudvk

@k8s-ci-robot
Copy link
Contributor

@aojea: GitHub didn't allow me to request PR reviews from the following users: hanamantagoudvk.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/assign @danwinship @khenidak @thockin
/cc @hanamantagoudvk

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -3284,6 +3284,7 @@ func TestGenerateAPIPodStatusHostNetworkPodIPs(t *testing.T) {
criPodIPs: []string{"192.168.0.1"},
podIPs: []v1.PodIP{
{IP: "192.168.0.1"},
{IP: "fd01::1234"},
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting corner case from DanWinships test cases

			name: "CRI PodIPs override NodeAddresses",
			nodeAddresses: []v1.NodeAddress{
				{Type: v1.NodeInternalIP, Address: "10.0.0.1"},
				{Type: v1.NodeInternalIP, Address: "fd01::1234"},
			},
			criPodIPs: []string{"192.168.0.1"},
			podIPs: []v1.PodIP{
				{IP: "192.168.0.1"},
				{IP: "fd01::1234"},

CRIO and containerd doesn't send IPs for host network pods
https://github.com/cri-o/cri-o/blob/8d4d158935929800c4300b82eb4b5a83ded400f4/server/sandbox_network.go#L34-L36

https://github.com/containerd/containerd/blob/4921fb6b63e5476e54fb7e2cf93c3b505f1c09cb/pkg/cri/store/sandbox/sandbox.go#L40-L41

but they can.
I think that the discussion here is if override means removing IPs or just updating the ones that it sets leaving the others as is

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think override means "ignore CRI".

Do we think it is EVER correct for CRI to return host-net IPs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we think it is EVER correct for CRI to return host-net IPs?

I think we all agree that is a bad idea, but it seems that it is the way it was #106715 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danwinship what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If nothing else we should specify what Kubernetes expects host-network pod IPs to be (order, number, type). I don't like leaving that an undefined wild-west.

And I think we should enforce that expectation. If there are corner cases that are currently possible that we break, I think a KEP should get written to implement that corner-case correctly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for reference, the CRI IPs are obtained in kuberuntime in GetPodStatus()

podIPs = m.determinePodSandboxIPs(namespace, name, podSandboxStatus)

and it seems they are stored in the PLEG cache

status, err := g.runtime.GetPodStatus(pod.ID, pod.Name, pod.Namespace)
if klog.V(6).Enabled() {
klog.V(6).ErrorS(err, "PLEG: Write status", "pod", klog.KRef(pod.Namespace, pod.Name), "podStatus", status)
} else {
klog.V(4).ErrorS(err, "PLEG: Write status", "pod", klog.KRef(pod.Namespace, pod.Name))
}
if err == nil {
// Preserve the pod IP across cache updates if the new IP is empty.
// When a pod is torn down, kubelet may race with PLEG and retrieve
// a pod status after network teardown, but the kubernetes API expects
// the completed pod's IP to be available after the pod is dead.
status.IPs = g.getPodIPs(pid, status)
}

This cache is used to syncPod() from the PodWorker

err = p.syncPodFn(ctx, update.Options.UpdateType, pod, update.Options.MirrorPod, status)

and generateAPIPodStatus()

apiPodStatus := kl.generateAPIPodStatus(pod, podStatus)

Copy link
Member Author

@aojea aojea Dec 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems there is no way to differentiate between:

  • the CRI sets the IPs from CRI
  • CRI does NOT return IPs, but Pod status IPs are updated incrementally,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danwinship what do you think?

I'm happy to declare that it was a bug that we ever let the CRI set hostNetwork pod IPs, but we should probably update the CRI docs to clarify this if so.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the commit with the CRI API change

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this sounds good from the CRI API perspective 👍

@aojea
Copy link
Member Author

aojea commented Nov 29, 2021

Kubernetes e2e suite: [sig-node] Pods should delete a collection of pods [Conformance] expand_less

/test pull-kubernetes-e2e-kind

@ehashman ehashman added this to Triage in SIG Node PR Triage Nov 29, 2021
@danwinship
Copy link
Contributor

danwinship commented Dec 4, 2021

we didn't consider that the secondary
IP address may not be present at the same time that the primary nodeIP.

why wouldn't it be? / why wouldn't we expect you to restart kubelet in that case?

CRIO and containerd doesn't send IPs for host network pods
...
but they can.

So, originally, cri-o did send IPs for host network pods, and kubelet just accepted them. But this was a bad idea, because it means that if you wanted to override kubelet's --node-ip, then you probably also had to override cri-o's idea of what the default host IP was or your host-network pods would end up with the wrong IP. Kubelet already had the fallback code to set host-network podIPs itself if the CRI didn't do it, so we just changed cri-o to leave the pod IPs blank for hostNetwork pods. (cri-o/cri-o#2944)

I'm not sure what the historical situation with containerd was.

The unit test is testing that if the CRI does set the pod IPs, that kubelet continues to behave like it did historically, so as to not break CRIs that still do that.

@aojea
Copy link
Member Author

aojea commented Dec 4, 2021

why wouldn't it be? / why wouldn't we expect you to restart kubelet in that case?

Because that is the issue reported, and kubelet will race again 😄 #105320

The unit test is testing that if the CRI does set the pod IPs, that kubelet continues to behave like it did historically, so as to not break CRIs that still do that.

agree, that is why I've commented here #106715 (comment)

I think that the discussion here is if override means removing IPs or just updating the ones that it sets leaving the others as is

Copy link
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall, I think the simpler and more obvious this is, the better. I like this.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jan 7, 2022
@@ -445,6 +445,7 @@ message PodIP{
string ip = 1;
}
// PodSandboxNetworkStatus is the status of the network for a PodSandbox.
// Ignored for pods sharing the host networking namespace.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aojea
Copy link
Member Author

aojea commented Jan 13, 2022

@thockin @danwinship I've updated the CRI API , can you please take another look?

// If the Node changes its IPs the user has to restart the Kubelet
// so the HostNetwork pods update their IPs. We may change this in a future
// https://github.com/kubernetes/enhancements/pull/3000
// but since it changes the current behavior "here be dragons"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure you need to explain that much. Also, you spend several sentences explaining that podIPs is immutable, and then add code that changes it.

// HostNetwork Pods inherit the node IPs as PodIPs. They are immutable once set,
// other than that if the node becomes dual-stack, we add the secondary IP.

(though probably we should remove the second IP on dual-stack-to-single-stack downgrade too)

Copy link
Member Author

@aojea aojea Jan 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I see my gap on the language, I try to mean that once a PodIP exist that never changes, however, you can add a new IP is not the array that is immutable (it can grow), is the element in the array that can't change once set

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(though probably we should remove the second IP on dual-stack-to-single-stack downgrade too)

this can open a new set of corner cases and we don't have a good history here, per example, I think that all the dual stack services that are stored will fail in validation on a migration like this.

@@ -445,6 +445,7 @@ message PodIP{
string ip = 1;
}
// PodSandboxNetworkStatus is the status of the network for a PodSandbox.
// Ignored for pods sharing the host networking namespace.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a chance we might add new fields in the future that it wouldn't make sense to ignore for hostNetwork pods? Maybe hedge a little bit here and prepend "Currently" ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great point

host-network pods IPs are obtained from the reported kubelet nodeIPs.

Historically, host-network podIPs are immutable once set, but when
we've added dual-stack support, we didn't consider that the secondary
IP address may not be present at the same time that the primary nodeIP.

If a secondary IP address is added to a node after the host-network pods
IPs are set, we can add the secondary host-network pod IP address
maintaining the current behavior of not updating the current podIPs on
host-network pods.
@danwinship
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 17, 2022
Copy link
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

s.PodIPs = []v1.PodIP{{IP: s.PodIP}}
if len(hostIPs) == 2 {
// HostNetwork Pods inherit the node IPs as PodIPs. They are immutable once set,
// other than that if the node becomes dual-stack, we add the secondary IP.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also needto downgrade? I feel like eventually we'll have to revisit this...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aojea
Copy link
Member Author

aojea commented Jan 25, 2022

/assign @mrunalp
for the CRI API doc update

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 4, 2022
@aojea
Copy link
Member Author

aojea commented Feb 4, 2022

updated the CRI-API godoc on v1 and v1alpha2 too

@mrunalp
Copy link
Contributor

mrunalp commented Feb 4, 2022

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 4, 2022
@thockin
Copy link
Member

thockin commented Feb 4, 2022

Thanks!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 4, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, mrunalp, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 469c4c4 into kubernetes:master Feb 4, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Feb 4, 2022
k8s-ci-robot added a commit that referenced this pull request Feb 10, 2022
…-of-#106715-upstream-release-1.23

Automated cherry pick of #106715: set secondary address on host-network pods
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
7 participants