
Endpoint removal delayed when pod is deleted (1.6) #47597

Closed
jsravn opened this issue Jun 15, 2017 · 31 comments

@jsravn
Contributor

commented Jun 15, 2017

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/.):
No
Note: Please file issues for subcomponents under the appropriate repo

Component Repo
kubectl kubernetes/kubectl
kubeadm kubernetes/kubeadm

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):
endpoints stale removed

Is this a BUG REPORT or FEATURE REQUEST? (choose one):

BUG REPORT

Kubernetes version (use kubectl version):
1.6.4

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Ubuntu 16.04.2
  • Kernel (e.g. uname -a): Linux ip-10-10-10-10 4.8.0-54-generic #57~16.04.1-Ubuntu SMP Wed May 24 16:22:28 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Others:

3 node etcd cluster with quorum reads on.

What happened:
We observe transient connection failures between pods while they are being destroyed. Upon further investigation, we discovered that endpoints are not being removed when the pod is deleted. In the failure cases, the endpoint was not removed at all during termination, until the pod finally completed (after about 10s). In other, non-fatal cases we observed endpoint removal delays of several seconds. It's not yet clear whether it's just slow to update or whether it simply doesn't remove the endpoint for some reason.

What you expected to happen:
Endpoint should be removed immediately, or shortly after (<1s), when running kubectl delete pod my-pod.

How to reproduce it (as minimally and precisely as possible):

  1. Observe endpoints with kubectl get ep my-eip -o go-template='{{ range (index .subsets 0).addresses }} {{ .ip }} {{ end }}'.
  2. Delete the oldest pod, wait for replicaset controller to scale up a new one.
  3. Wait a few seconds
  4. Repeat 2.

We observe the stale endpoints with kubectl, as well as directly on nodes themselves by observing iptables.
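
To watch the removal with sub-second resolution while following the steps above, a small client-go program can timestamp endpoint changes. This is a minimal sketch assuming a standard kubeconfig; the path and the printed fields are illustrative.

```go
package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path; adjust for your environment.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Watch all Endpoints objects and print a timestamp whenever one changes,
	// so the gap between deleting a pod and the endpoint update is visible.
	factory := informers.NewSharedInformerFactory(client, 0)
	informer := factory.Core().V1().Endpoints().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			ep := newObj.(*v1.Endpoints)
			addresses := 0
			for _, subset := range ep.Subsets {
				addresses += len(subset.Addresses)
			}
			fmt.Printf("%s %s/%s now has %d ready addresses\n",
				time.Now().Format(time.RFC3339Nano), ep.Namespace, ep.Name, addresses)
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	<-stop
}
```

Running this while deleting the pod in step 2 makes it easy to compare the endpoint update time against the pod's deletion time.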

Anything else we need to know:

This only started happening to us recently, in the last few weeks after we upgraded from 1.4 to 1.6. It could also be related to the fact that we enabled quorum reads: we were experiencing similar problems with stale kube-proxy updates that we were unable to pinpoint, and thought quorum reads would help.

Other notes:

  • Using etcd 2.3.7, 3-node cluster
  • About 240 services/endpoints in this cluster, spread across seven 16-core nodes (pretty small, I think)
  • Using leader election / multi-master
@k8s-github-robot

Contributor

commented Jun 15, 2017

@jsravn There are no sig labels on this issue. Please add a sig label by:
(1) mentioning a sig: @kubernetes/sig-<team-name>-misc
(2) specifying the label manually: /sig <label>

Note: method (1) will trigger a notification to the team. You can find the team list here and label list here

@xiangpengzhao

Member

commented Jun 15, 2017

@k8s-ci-robot

Contributor

commented Jun 15, 2017

@xiangpengzhao: Reiterating the mentions to trigger a notification:
@kubernetes/sig-network-misc.

In response to this:

@kubernetes/sig-network-misc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jsravn

Contributor Author

commented Jun 15, 2017

After more testing, it looks like it is just a delay in removing endpoints. It's bimodal, though: either it's near instant, or it takes 3-10s to remove. So I suspect this may be related to using quorum reads. I'll try disabling that next and retest.

@xiangpengzhao

Member

commented Jun 15, 2017

Does your pod have a graceful deletion process?

@jsravn

Contributor Author

commented Jun 15, 2017

@xiangpengzhao Yes, it waits about 10s before termination.

@xiangpengzhao

Member

commented Jun 15, 2017

@MrHohn

Member

commented Jun 15, 2017

This might be the cause. See: https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods

AFAIK, the pod should be removed from endpoints as soon as it is marked as being deleted (https://github.com/kubernetes/kubernetes/blob/v1.6.4/pkg/controller/endpoint/endpoints_controller.go#L376-L379), which happens before the grace period expires.
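
For context, the linked check boils down to skipping any pod whose DeletionTimestamp is set when the controller builds the endpoint addresses. The snippet below is a simplified sketch of that idea, assuming the usual core/v1 types, not a verbatim copy of the 1.6.4 source:

```go
package sketch

import v1 "k8s.io/api/core/v1"

// buildAddresses collects the pod IPs that should appear in an Endpoints
// object. A pod with a non-nil DeletionTimestamp is already terminating,
// so it is skipped even though it may keep running for the rest of its
// grace period. Simplified sketch, not the actual controller code.
func buildAddresses(pods []*v1.Pod) []v1.EndpointAddress {
	var addrs []v1.EndpointAddress
	for _, pod := range pods {
		if len(pod.Status.PodIP) == 0 {
			continue // no IP assigned yet
		}
		if pod.DeletionTimestamp != nil {
			continue // pod is being deleted; drop it from endpoints right away
		}
		addrs = append(addrs, v1.EndpointAddress{IP: pod.Status.PodIP})
	}
	return addrs
}
```

So the removal should be driven by the deletion mark itself, not by the pod actually exiting.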

@xiangpengzhao

Member

commented Jun 15, 2017

Sorry, I misunderstood :)

@bowei

Member

commented Jun 16, 2017

@wojtek-t this seems to be related to SLO for endpoints update propagation

@jsravn

Contributor Author

commented Jun 16, 2017

I tried with quorum reads disabled, but I'm still seeing the same problem.

I wrote a script that compares pods to endpoints using kubectl on a node. Here is an example where it took around 9s to remove the endpoint:

Jun 16 12:35:35 2 ready pods == 2 endpoints                                           
Jun 16 12:35:37 Discrepancy detected, 1 ready pods != 2 endpoints                     
Jun 16 12:35:37 Pods:                                                                 
Jun 16 12:35:37 172.16.97.5     Ready:true      DeletionTimestamp:2017-06-16T12:36:06Z
Jun 16 12:35:37 172.16.35.4     Ready:true      DeletionTimestamp:<no value>
Jun 16 12:35:37 <no value>      Ready:false     DeletionTimestamp:<no value>                                                            
Jun 16 12:35:37                                                             
Jun 16 12:35:37 Endpoints:                                                            
Jun 16 12:35:37 172.16.35.4                                                           
Jun 16 12:35:37 172.16.97.5                                                           
Jun 16 12:35:37                                                                       
Jun 16 12:35:38 Discrepancy detected, 1 ready pods != 2 endpoints                                                                       
Jun 16 12:35:38 Pods:                                                                 
Jun 16 12:35:38 172.16.97.5     Ready:true      DeletionTimestamp:2017-06-16T12:36:06Z
Jun 16 12:35:38 172.16.35.4     Ready:true      DeletionTimestamp:<no value>          
Jun 16 12:35:38 172.16.58.5     Ready:false     DeletionTimestamp:<no value>
Jun 16 12:35:38                                                                       
Jun 16 12:35:38 Endpoints:                                                            
Jun 16 12:35:38 172.16.35.4                                                           
Jun 16 12:35:38 172.16.97.5                                                           
Jun 16 12:35:38
Jun 16 12:35:39 Discrepancy detected, 1 ready pods != 2 endpoints
Jun 16 12:35:39 Pods:                                                                                                                                       
Jun 16 12:35:39 172.16.97.5     Ready:true      DeletionTimestamp:2017-06-16T12:36:06Z
Jun 16 12:35:39 172.16.35.4     Ready:true      DeletionTimestamp:<no value>          
Jun 16 12:35:39 172.16.58.5     Ready:false     DeletionTimestamp:<no value>          
Jun 16 12:35:39                                                                       
Jun 16 12:35:39 Endpoints:                                                            
Jun 16 12:35:39 172.16.35.4                                                 
Jun 16 12:35:39 172.16.97.5                                                           
Jun 16 12:35:39                                                             
Jun 16 12:35:40 Discrepancy detected, 1 ready pods != 2 endpoints                     
Jun 16 12:35:40 Pods:                                                                 
Jun 16 12:35:40 172.16.97.5     Ready:true      DeletionTimestamp:2017-06-16T12:36:06Z
Jun 16 12:35:40 172.16.35.4     Ready:true      DeletionTimestamp:<no value>          
Jun 16 12:35:40 172.16.58.5     Ready:false     DeletionTimestamp:<no value>          
Jun 16 12:35:40                                                                       
Jun 16 12:35:40 Endpoints:                                                            
Jun 16 12:35:40 172.16.35.4                                                 
Jun 16 12:35:40 172.16.97.5                                                                                                             
Jun 16 12:35:40                                                             
Jun 16 12:35:42 Discrepancy detected, 1 ready pods != 2 endpoints                     
Jun 16 12:35:42 Pods:                                                                 
Jun 16 12:35:42 172.16.97.5     Ready:true      DeletionTimestamp:2017-06-16T12:36:06Z
Jun 16 12:35:42 172.16.35.4     Ready:true      DeletionTimestamp:<no value>          
Jun 16 12:35:42 172.16.58.5     Ready:false     DeletionTimestamp:<no value>          
Jun 16 12:35:42                                                                       
Jun 16 12:35:42 Endpoints:                                                            
Jun 16 12:35:42 172.16.35.4                                                           
Jun 16 12:35:42 172.16.97.5                                                 
Jun 16 12:35:42                                                                       
Jun 16 12:35:43 Discrepancy detected, 1 ready pods != 2 endpoints                     
Jun 16 12:35:43 Pods:                                                                 
Jun 16 12:35:43 172.16.97.5     Ready:true      DeletionTimestamp:2017-06-16T12:36:06Z
Jun 16 12:35:43 172.16.35.4     Ready:true      DeletionTimestamp:<no value>
Jun 16 12:35:43 172.16.58.5     Ready:false     DeletionTimestamp:<no value>
Jun 16 12:35:43
Jun 16 12:35:43 Endpoints:
Jun 16 12:35:43 172.16.35.4
Jun 16 12:35:43 172.16.97.5
Jun 16 12:35:43
Jun 16 12:35:44 Discrepancy detected, 1 ready pods != 2 endpoints
Jun 16 12:35:44 Pods:
Jun 16 12:35:44 172.16.97.5     Ready:true      DeletionTimestamp:2017-06-16T12:36:06Z
Jun 16 12:35:44 172.16.35.4     Ready:true      DeletionTimestamp:<no value>
Jun 16 12:35:44 172.16.58.5     Ready:false     DeletionTimestamp:<no value>
Jun 16 12:35:44
Jun 16 12:35:44 Endpoints:
Jun 16 12:35:44 172.16.35.4
Jun 16 12:35:44 172.16.97.5
Jun 16 12:35:44
Jun 16 12:35:45 Discrepancy detected, 1 ready pods != 2 endpoints
Jun 16 12:35:45 Pods:
Jun 16 12:35:45 172.16.97.5     Ready:true      DeletionTimestamp:2017-06-16T12:36:06Z
Jun 16 12:35:45 172.16.35.4     Ready:true      DeletionTimestamp:<no value>
Jun 16 12:35:45 172.16.58.5     Ready:false     DeletionTimestamp:<no value>
Jun 16 12:35:45
Jun 16 12:35:45 Endpoints:
Jun 16 12:35:45 172.16.35.4
Jun 16 12:35:45 172.16.97.5
Jun 16 12:35:45
Jun 16 12:35:47 Discrepancy detected, 2 ready pods != 1 endpoints
Jun 16 12:35:47 Pods:
Jun 16 12:35:47 172.16.97.5     Ready:true      DeletionTimestamp:2017-06-16T12:36:06Z
Jun 16 12:35:47 172.16.35.4     Ready:true      DeletionTimestamp:<no value>
Jun 16 12:35:47 172.16.58.5     Ready:true      DeletionTimestamp:<no value>
Jun 16 12:35:47
Jun 16 12:35:47 Endpoints:
Jun 16 12:35:47 172.16.35.4
Jun 16 12:35:47
Jun 16 12:35:48 2 ready pods == 2 endpoints

As can be seen, the endpoint isn't removed even though the pod was deleted and has a non-nil deletion timestamp. Oddly, the timestamp is in the future; I'm not sure why that is yet. (edit: Okay, I see, it adds the terminationGracePeriodSeconds to the deletion time.)

@jsravn

Contributor Author

commented Jun 16, 2017

I suspect something is going wrong in endpoints_controller, because I notice it has a maxWait time of 10s in the endpoint work queue, which oddly coincides with the max delay I'm seeing on endpoint updates.

@smarterclayton

Contributor

commented Jun 16, 2017

@jsravn

Contributor Author

commented Jun 16, 2017

@smarterclayton It seems unloaded to me, from a system perspective. For the example I posted above, the master nodes (which host the apiserver and etcd) were lightly loaded. etcd didn't use more than about a second of CPU time per minute. Peak load average was 0.57 (these are 2-core boxes). Average I/O time was also about 1.5ms (but I could check the etcd metrics further). It's a pretty small cluster at 7 nodes with about 250 endpoints total. Since I'm using leader election on the controller manager / scheduler, there is some endpoint spam there, but it's only about once per second or so.

@jsravn

Contributor Author

commented Jun 16, 2017

Max etcd fsync time ranges from ~32ms to ~256ms, 0 unsynced slow watchers.

@jsravn

Contributor Author

commented Jun 16, 2017

I turned up controller manager logging to -v=4.

What I observe is that the endpoint update delay seems to happen whenever the endpoints_controller is doing a full sync. During this time, new endpoint updates don't seem to be processed. It takes about 15s or so for the controller to finish syncing all the endpoints. It looks like it is hitting request throttling limits (I'm using default values on the controller-manager). I see one of these for each of the roughly 250 endpoints:

I0616 15:16:47.169823       1 request.go:638] Throttling request took 244.812348ms, request: PUT:http://localhost:8080/api/v1/namespaces/my-ns/endpoints/my-svc
I0616 15:16:47.179573       1 endpoints_controller.go:303] Finished syncing service "my-ns/my-svc" endpoints. (505.389156ms)

(note: I see both PUTs and GETs getting throttled - I had to sanitise my logs so just listed the single entry above)

@jsravn

Contributor Author

commented Jun 16, 2017

It looks like the endpoint controller makes an API call for every single endpoint every 30s when it does its full resync... which is why it's getting throttled in my case. As a workaround I can probably increase the request limits on the controller manager.

@jsravn

Contributor Author

commented Jun 16, 2017

Default qps is 20, so with about 250 endpoints it takes about 15s to query them all. Is there a guideline to set qps? 20 seems really low with this full sync behaviour.
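
To make that arithmetic concrete: client-go throttles the controller manager's API requests with a token-bucket limiter, and the defaults correspond to the kube-controller-manager flags --kube-api-qps (20) and --kube-api-burst (30). The sketch below, with an illustrative call count, shows why roughly 250 back-to-back calls take on the order of 11-12 seconds:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/flowcontrol"
)

func main() {
	// The same token-bucket behaviour client-go applies to API requests:
	// 20 requests/sec sustained, bursts of up to 30.
	limiter := flowcontrol.NewTokenBucketRateLimiter(20, 30)

	const calls = 250 // roughly one GET per endpoint during a full resync
	start := time.Now()
	for i := 0; i < calls; i++ {
		limiter.Accept() // blocks until a token is available
	}
	// The first ~30 calls consume the burst; the remaining ~220 are paced
	// at 20/sec, so expect roughly 11 seconds here.
	fmt.Printf("%d rate-limited calls took %v\n", calls, time.Since(start))
}
```

Raising those two flags allows more requests per second and shortens the resync window accordingly.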

@smarterclayton

Contributor

commented Jun 19, 2017

@jsravn

Contributor Author

commented Jun 19, 2017

@smarterclayton It's a mixture. We have 223 endpoints; during one resync I counted 93 PUTs and 198 GETs that got throttled. It is a relatively high-churn environment, since it includes developer build deployments. It does look to me like it calls GET for every single endpoint though - so how are you avoiding API throttling when it does the 30s resync? Is it expected for endpoint updates to be delayed during the resync?

@jsravn

Contributor Author

commented Jun 19, 2017

I see the throttling even in smaller clusters which have ~50 endpoints. The sync takes 2-3s in this case, during which I suspect updates are not processed, adding a delay to endpoint updates.

It seems the only workaround is to increase the qps limits on the controller manager to at least the number of endpoints in the cluster. I think it's a bug that updates can't be processed during the resync - but I'm not sure whether this is due to the work queue implementation or the throttling.

@jsravn

Contributor Author

commented Jun 19, 2017

We're seeing ~12 sustained QPS, so this is far below the qps limit. It's just the resync burst that adds a huge delay to endpoint updates. So I'm thinking the qps burst needs to be large enough to accommodate the resync size, i.e. greater than the number of endpoints.

@smarterclayton

Contributor

commented Jun 19, 2017

@jsravn

Contributor Author

commented Jun 19, 2017

I wonder why it doesn't use a cache for the endpoints.

@jsravn jsravn changed the title Endpoints not being removed when pod is deleted (1.6) Endpoint removal delayed when pod is deleted (1.6) Jun 19, 2017

@jsravn

Contributor Author

commented Jun 19, 2017

@smarterclayton I created a PR to move the endpoints controller to use the cache instead of GETs (#47731). Seems reasonable to me. I'll test it out in my cluster and see how it goes.
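
The gist of that change, reading the current Endpoints object from the shared informer cache instead of doing a GET per service, looks roughly like the sketch below (illustrative only, not the actual diff in #47731):

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	listers "k8s.io/client-go/listers/core/v1"
)

// currentEndpoints returns the Endpoints object for a service from the
// shared informer cache, so reads during a full resync cost no API calls;
// only writes (the Create/Update when something actually changed) still
// hit the API server.
func currentEndpoints(lister listers.EndpointsLister, namespace, name string) (*v1.Endpoints, error) {
	eps, err := lister.Endpoints(namespace).Get(name)
	if errors.IsNotFound(err) {
		// No Endpoints object yet: start from an empty one.
		return &v1.Endpoints{ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace}}, nil
	}
	return eps, err
}
```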

@tharpa

commented Jun 21, 2017

How can I manually and permanently clean up the endpoint list for a service? I have a stale entry in the endpoints list which is recreated every time I start a new pod for the service.

k8s-github-robot pushed a commit that referenced this issue Jun 26, 2017

Kubernetes Submit Queue
Merge pull request #47788 from shyamjvs/resync-period-ep-controller
Automatic merge from submit-queue

Get rid of 30s ResyncPeriod in endpoint controller

Ref: #47597 
This should fix one of the demons of endpoint controller.

/cc @smarterclayton @gmarek

k8s-github-robot pushed a commit that referenced this issue Jun 27, 2017

Kubernetes Submit Queue
Merge pull request #47731 from jsravn/use-endpoints-cache-for-endpoin…
…t-controller

Automatic merge from submit-queue

Use endpoints informer for the endpoint controller

This substantially reduces the number of API calls made by the endpoint
controller. Currently the controller makes an API call per endpoint for
each service that is synced. When the 30s resync is triggered, this
results in an API call for every single endpoint in the cluster. This
quickly exceeds the default qps/burst limit of 20/30 even in small
clusters, leading to delays in endpoint updates.

This change modifies the controller to use the endpoint informer cache
for all endpoint GETs. This means we only make API calls for changes in
endpoints. As a result, qps only depends on the pod activity in the
cluster, rather than the number of services.



**What this PR does / why we need it**:

Address endpoint update delays as described in #47597.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #47597

**Special notes for your reviewer**:

**Release note**:

```release-note
```
@shyamjvs

Member

commented Jun 28, 2017

@smarterclayton Regarding the issues with the ep controller we've discussed before:

  1. Doing unnecessary GETs before PUTs - Fixed by #47731, which introduced the ep informer.
  2. 30s resync period for the service informer - Fixed by #47788, which removes it and enqueues services for tombstone pods rather than waiting for the resync.
  3. PUTting things that haven't changed - We don't do that if the ep object hasn't changed.
  4. Flapping state of a pod causing frequent GETs+PUTs - The GETs part of this is solved by the ep informer. The PUTs part requires using PATCH instead, and that might be infeasible (see below).
  5. Using PATCH instead of PUT - It seems infeasible without introducing a breaking API change (details in #47787).

So at this moment the ep controller is performing (almost) as well as it can. I guess we can close this issue now?

@smarterclayton

Contributor

commented Jun 28, 2017

@shyamjvs shyamjvs closed this Jun 28, 2017

@chesterlai

commented Sep 29, 2017

Hi,
We use k8s 1.2 and 1.6.
When our kube-dns restarted, it impacted our service seriously.
k8s 1.8 was launched recently, but we can't upgrade immediately.
We delete our pods after kube-dns restarts, and it impacts our system heavily.
Is there any workaround for this issue?
Please advise.

@jsravn

Contributor Author

commented Sep 29, 2017

@chesterlai This issue is fixed in 1.6. But it might not be related to the problem you're seeing. I suggest making a new bug, with a detailed description of the problem.

@wojtek-t

Member

commented Oct 2, 2017

@chesterlai This issue is fixed in 1.6. But it might not be related to the problem you're seeing. I suggest making a new bug, with a detailed description of the problem.

+1

k8s-github-robot pushed a commit that referenced this issue Nov 29, 2017

Kubernetes Submit Queue
Merge pull request #55474 from jsravn/automated-cherry-pick-of-#47731…
…-upstream-release-1.7

Automatic merge from submit-queue.

Automated cherry pick of #47731

Cherry pick of #47731 to fix #47597.

```release-note
Improve time to update endpoints by reducing number of API calls.
```