
NodePort service for hostNetwork pod fails when both Ready and Terminating pods are present #114440

Closed
jason-i-vv opened this issue Dec 13, 2022 · 14 comments
Assignees
Labels
kind/bug · needs-triage · sig/network · wg/multitenancy

Comments

@jason-i-vv

What happened?

We have a NodePort Service on port 30002 that forwards to a hostNetwork pod listening on port 8002. It has stopped working: we cannot reach nodeIP:30002, but nodeIP:8002 still responds.

Why use a NodePort at all when the pod already uses hostNetwork? Because the workload can run multiple replicas, and the NodePort keeps it reachable when one pod goes down.

  1. We checked the IPVS rules:
ipvsadm -L -n 
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  172.17.0.1:30002 rr
TCP  172.17.0.1:30009 rr
  -> 10.16.203.62:8002

As we can see, the virtual server for port 30002 has no destination (real server) at all.

  2. We checked the Endpoints status:

[screenshot of the Endpoints object, showing the same IP address listed in both Addresses and NotReadyAddresses]

As we can see, the same IP address appears in both Addresses and NotReadyAddresses.

There are several invalid (evicted) pods and one valid pod, which is why the Endpoints object looks like this.

What did you expect to happen?

If an IP address appears in both the ready and not-ready address sets, it should be removed from the not-ready set, so that valid IP addresses are not masked by invalid ones. Then the URL http://nodeIP:30002 would stay reachable whether or not the Deployment has invalid pods.
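
For illustration only, a minimal sketch of that rule over the core/v1 Endpoints types (the package and function name are hypothetical, not the actual controller code):

```go
package endpointsutil

import (
	v1 "k8s.io/api/core/v1"
)

// dropMaskedNotReady removes from NotReadyAddresses any IP that also appears
// in Addresses, so a ready address is never shadowed by a stale not-ready
// duplicate. Hypothetical helper, not the real Endpoints controller logic.
func dropMaskedNotReady(subset v1.EndpointSubset) v1.EndpointSubset {
	ready := make(map[string]bool, len(subset.Addresses))
	for _, a := range subset.Addresses {
		ready[a.IP] = true
	}
	var kept []v1.EndpointAddress
	for _, a := range subset.NotReadyAddresses {
		if !ready[a.IP] {
			kept = append(kept, a)
		}
	}
	subset.NotReadyAddresses = kept
	return subset
}
```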

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a hostNetwork pod pinned to a fixed node. The pod should have no resource requests or limits, so it is BestEffort and will be evicted first.
  2. Create a NodePort Service on port 30002 that selects the pod (a sketch of the equivalent objects follows this list).
  3. Manually trigger DiskPressure on the node, which is easily done by writing a large file with dd.
  4. The pod gets evicted.
  5. Remove the DiskPressure; a replacement pod is created.
  6. At this point you will reproduce it: the Endpoints object has the node IP in both the ready and not-ready address sets.
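
For reference, the objects from steps 1–2 expressed with the Go API types (names, image, and node name are placeholders; applying the equivalent YAML with kubectl works the same way):

```go
package reproducer

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// reproObjects builds the hostNetwork pod and the NodePort service from
// steps 1-2. All names are placeholders.
func reproObjects() (*corev1.Pod, *corev1.Service) {
	labels := map[string]string{"app": "hostnet-app"}
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "hostnet-app", Labels: labels},
		Spec: corev1.PodSpec{
			HostNetwork: true,
			NodeName:    "fixed-node", // pin the pod to one node
			Containers: []corev1.Container{{
				Name:  "app",
				Image: "registry.example/app:latest", // placeholder; listens on 8002
				Ports: []corev1.ContainerPort{{ContainerPort: 8002}},
				// no resources -> BestEffort, so it is evicted first under DiskPressure
			}},
		},
	}
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "hostnet-app"},
		Spec: corev1.ServiceSpec{
			Type:     corev1.ServiceTypeNodePort,
			Selector: labels,
			Ports: []corev1.ServicePort{{
				Port:       8002,
				TargetPort: intstr.FromInt(8002),
				NodePort:   30002,
			}},
		},
	}
	return pod, svc
}
```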

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
# v1.22.4

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
NAME="Kylin Linux Advanced Server"
VERSION="V10 (Tercel)"
ID="kylin"
VERSION_ID="V10"
PRETTY_NAME="Kylin Linux Advanced Server V10 (Tercel)"
ANSI_COLOR="0;31"
$ uname -a
Linux i-24AFE987 4.19.90-21.2.ky10.aarch64 #1 SMP Tue Oct 27 16:46:15 CST 2020 aarch64 aarch64 aarch64 GNU/Linux



Install tools


Container runtime (CRI) and version (if applicable)

containerd -v
containerd github.com/containerd/containerd v1.6.2 de8046a5501db9e0e478e1c10cbcfb21af4c6b2d


Related plugins (CNI, CSI, ...) and versions (if applicable)

@jason-i-vv added the kind/bug label Dec 13, 2022
@k8s-ci-robot added the needs-sig label Dec 13, 2022
@k8s-ci-robot
Contributor

@jason-i-vv: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-triage label Dec 13, 2022
@jason-i-vv
Author

/sig network

@k8s-ci-robot added the sig/network label and removed the needs-sig label Dec 13, 2022
@jason-i-vv
Author

/wg multitenancy

@k8s-ci-robot added the wg/multitenancy label Dec 13, 2022
@danwinship
Contributor

/retitle NodePort service for hostNetwork pod fails when both Ready and Terminating pods are present

The API docs say "No address will appear in both Addresses and NotReadyAddresses in the same subset" but I think that means "no completely identical EndpointAddress including TargetRef".

Also, kube-proxy doesn't actually operate on the Endpoints any more anyway, it operates on EndpointSlices, which are slightly different, and the distinction is whether the Terminating field is set or not.

I think probably we do want to indicate that there are simultaneous Ready and Terminating endpoints for the service on the same node, but the proxy code needs to be more careful to interpret this situation correctly...
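
For illustration, one way a proxy could interpret that situation over the discovery/v1 types (a sketch only, not the actual kube-proxy code): use ready endpoints whenever any exist, and fall back to serving-but-terminating ones only when nothing is ready.

```go
package proxysketch

import discoveryv1 "k8s.io/api/discovery/v1"

// usableEndpoints picks the endpoints a proxy should program for a service:
// ready endpoints win; terminating-but-still-serving endpoints are only a
// fallback when no endpoint is ready. Sketch only.
func usableEndpoints(eps []discoveryv1.Endpoint) []discoveryv1.Endpoint {
	var ready, terminatingServing []discoveryv1.Endpoint
	for _, ep := range eps {
		c := ep.Conditions
		isReady := c.Ready != nil && *c.Ready
		isServing := c.Serving != nil && *c.Serving
		isTerminating := c.Terminating != nil && *c.Terminating
		switch {
		case isReady && !isTerminating:
			ready = append(ready, ep)
		case isTerminating && isServing:
			terminatingServing = append(terminatingServing, ep)
		}
	}
	if len(ready) > 0 {
		return ready
	}
	return terminatingServing
}
```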

/cc @robscott

@k8s-ci-robot changed the title from "NodePort Service Lost Ipvs Target" to "NodePort service for hostNetwork pod fails when both Ready and Terminating pods are present" Dec 16, 2022
@robscott
Member

I'm not sure I'm completely understanding this, but it sounds like we may need to update the kube-proxy code that handles the case where more than one endpoint exists for the same IP:

// This logic ensures we're deduplicating potential overlapping endpoints
// isLocal should not vary between matching endpoints, but if it does, we
// favor a true value here if it exists.
if _, exists := endpointSet[endpointInfo.String()]; !exists || isLocal {
	endpointSet[endpointInfo.String()] = cache.makeEndpointInfo(endpointInfo, svcPortName)
}
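
One possible direction, sketched with simplified, hypothetical types (this is not the actual kube-proxy change): when two endpoints collapse to the same IP:port key, keep the ready, non-terminating one rather than whichever happened to be processed first.

```go
package proxysketch

// endpointEntry is a simplified, hypothetical stand-in for kube-proxy's
// per-endpoint info: just the IP:port key plus the conditions that matter here.
type endpointEntry struct {
	key         string // "IP:port"
	ready       bool
	terminating bool
}

// dedup keeps one entry per IP:port. On a collision, a ready, non-terminating
// entry replaces any other duplicate, so a terminating pod on the same host IP
// cannot shadow the ready one. (The real code also favors isLocal; omitted
// here for brevity.) Sketch only.
func dedup(entries []endpointEntry) map[string]endpointEntry {
	out := make(map[string]endpointEntry, len(entries))
	for _, e := range entries {
		existing, exists := out[e.key]
		if !exists || (e.ready && !e.terminating && !(existing.ready && !existing.terminating)) {
			out[e.key] = e
		}
	}
	return out
}
```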

@aojea
Member

aojea commented Dec 17, 2022

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a hostNetwork pod pinned to a fixed node; the pod has no resource requests or limits, so it is BestEffort and will be evicted first.
  2. Create a NodePort Service on port 30002 that selects the pod.
  3. Manually trigger DiskPressure on the node, e.g. by writing a large file with dd.
  4. The pod gets evicted.
  5. Remove the DiskPressure; a replacement pod is created.
  6. At this point the Endpoints object has the node IP in both the ready and not-ready address sets.

A reproducer would be very useful. It seems like an issue with the endpoints controller, no?

There is an e2e test that forces a disruption and can serve as a basis for the reproducer:

// regression test for https://issues.k8s.io/109414 and https://issues.k8s.io/109718
ginkgo.It("should be rejected for evicted pods (no endpoints exist)", func(ctx context.Context) {
	namespace := f.Namespace.Name
	serviceName := "evicted-pods"
	jig := e2eservice.NewTestJig(cs, namespace, serviceName)
	nodes, err := e2enode.GetBoundedReadySchedulableNodes(cs, e2eservice.MaxNodesForEndpointsTests)
	framework.ExpectNoError(err)
	nodeName := nodes.Items[0].Name

@jason-i-vv
Author

As far as I know, kube-proxy checks the EndpointSlice conditions, so this issue could be solved by updating either the Endpoints controller or the EndpointSlice controller.

In the end, both approaches would result in the EndpointSlice condition being Ready when both Ready and Terminating pods are present.

My first approach is to update the Endpoints controller in my PR. Do you suggest another way? The current EndpointSlice looks like this:

Name:         bingokube-console-thdj9
Namespace:    kube-system
Labels:       app.kubernetes.io/managed-by=Helm
              endpointslice.kubernetes.io/managed-by=endpointslice-controller.k8s.io
              kubernetes.io/cluster-service=true
              kubernetes.io/service-name=bingokube-console
              name=bingokube-console
Annotations:  endpoints.kubernetes.io/last-change-trigger-time: 2022-12-16T02:02:20Z
AddressType:  IPv4
Ports:
  Name               Port  Protocol
  ----               ----  --------
  bingokube-console  8002  TCP
Endpoints:
  - Addresses:  10.202.40.245
    Conditions:
      Ready:    true
    Hostname:   <unset>
    TargetRef:  Pod/bingokube-console-56f4684f76-6dmdq
    NodeName:   i-24afe987
    Zone:       <unset>
Events:         <none>

@jason-i-vv
Author

From TestPackSubsets:
{
			name: "two sets, two mixed ips, two dup ip with uid, dup port, wrong order, ends up with split addresses",
			given: []v1.EndpointSubset{{
				Addresses: []v1.EndpointAddress{{IP: "5.6.7.8"}},
				Ports:     []v1.EndpointPort{{Port: 111}},
			}, {
				NotReadyAddresses: []v1.EndpointAddress{{IP: "5.6.7.8", TargetRef: podRef("uid-1")}},
				Ports:             []v1.EndpointPort{{Port: 111}},
			}, {
				NotReadyAddresses: []v1.EndpointAddress{{IP: "1.2.3.4", TargetRef: podRef("uid-1")}},
				Ports:             []v1.EndpointPort{{Port: 111}},
			}, {
				NotReadyAddresses: []v1.EndpointAddress{{IP: "1.2.3.4"}},
				Ports:             []v1.EndpointPort{{Port: 111}},
			}},
			expect: []v1.EndpointSubset{{
				Addresses: []v1.EndpointAddress{
					{IP: "5.6.7.8"},
				},
				NotReadyAddresses: []v1.EndpointAddress{
					{IP: "1.2.3.4"},
					{IP: "1.2.3.4", TargetRef: podRef("uid-1")},
					{IP: "5.6.7.8", TargetRef: podRef("uid-1")},
				},
				Ports: []v1.EndpointPort{{Port: 111}},
			}},
		}

This test was written to expect the duplicate IP to end up in both the ready and not-ready sets, so I am not sure whether my idea of removing the duplicate IP from the not-ready set is correct.
@danwinship

@aojea
Member

aojea commented Dec 20, 2022

  • Create a hostNetwork pod pinned to a fixed node; the pod has no resource requests or limits, so it is BestEffort and will be evicted first.
  • Create a NodePort Service on port 30002 that selects the pod.
  • Manually trigger DiskPressure, e.g. by writing a large file with dd.
  • The pod gets evicted.
  • Remove the DiskPressure; a replacement pod is created.
  • The Endpoints object then has the node IP in both the ready and not-ready address sets.

Which version are you using?

I think that the problem here is that the endpoint is retaining the PodIP of the evicted pod and that is wrong.

This was fixed by #110255. Please check that you are running a version with that patch included; you seem to be running 1.22.4 and this was fixed in v1.22.10, so you should update to the latest stable version.

@jason-i-vv
Author

Got it, I will check it.

@thockin
Member

thockin commented Jan 5, 2023

Any update here?

@jason-i-vv
Author

Still checking against release-1.22.

@thockin thockin closed this as completed Jan 19, 2023
@jason-i-vv
Author

@thockin I have checked the stable release-1.22 version, and this problem was indeed solved. Sorry for taking so long.

@thockin
Member

thockin commented Jan 21, 2023 via email
