Endpoints API Object could not support large number of endpoints #73324

Closed
freehan opened this issue Jan 25, 2019 · 8 comments
@freehan (Contributor) commented Jan 25, 2019

What happened:

When a service selects many backend pods (e.g. 10k), the corresponding Endpoints object becomes very large (>1MB in proto encoding).

On kube-proxy, the API watcher starts dropping the Endpoints object and does not program iptables for the service. See #57073:

k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:86: watch of *core.Endpoints ended with: very short watch: 
k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:86: Unexpected watch close - watch lasted less than a second and no items received

On the master, if the object is too large, writes to etcd can fail:

Error syncing endpoints for service "xxxxx/xxxxxxxxx": etcdserver: request is too large
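
For a rough sense of scale (not part of the original report): each address entry carries an IP plus a TargetRef with the pod's kind, namespace, name, and UID, so 10k backends easily push the serialized object past 1MB. A minimal Go sketch, assuming a made-up "big-service" with 10k pod-backed addresses and using JSON size as a rough proxy for the proto encoding:

    // Back-of-the-envelope size check for a large Endpoints object.
    // All names, namespaces, and UIDs below are made up for illustration.
    package main

    import (
        "encoding/json"
        "fmt"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
    )

    func main() {
        const backends = 10000

        subset := corev1.EndpointSubset{
            Ports: []corev1.EndpointPort{{Name: "http", Port: 8080, Protocol: corev1.ProtocolTCP}},
        }
        for i := 0; i < backends; i++ {
            subset.Addresses = append(subset.Addresses, corev1.EndpointAddress{
                IP: fmt.Sprintf("10.%d.%d.%d", i/65536, (i/256)%256, i%256),
                TargetRef: &corev1.ObjectReference{
                    Kind:      "Pod",
                    Namespace: "default",
                    Name:      fmt.Sprintf("backend-%05d", i),
                    UID:       types.UID(fmt.Sprintf("00000000-0000-0000-0000-%012d", i)),
                },
            })
        }

        ep := corev1.Endpoints{
            ObjectMeta: metav1.ObjectMeta{Name: "big-service", Namespace: "default"},
            Subsets:    []corev1.EndpointSubset{subset},
        }

        raw, _ := json.Marshal(ep)
        fmt.Printf("%d addresses -> roughly %d KiB serialized\n", backends, len(raw)/1024)
    }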

What you expected to happen:

It should just work.

Proposed Short Term Fix:

  1. Add support in the endpoints controller to truncate the Endpoints object to a supported (configurable) size. With this approach, not all endpoints are reflected, but traffic disruption can be avoided because nodes no longer end up with stale iptables state (a sketch follows this list).
  2. Add events when the Endpoints object size approaches the limit and when it exceeds the limit, resulting in truncation. (Ideally, also generate events wherever we would otherwise drop the message.)
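
A minimal sketch of what the truncation in item 1 could look like; maxBytes and the TruncateEndpoints helper are hypothetical names for illustration, not the actual endpoints-controller code:

    // Hypothetical helper for the proposed short-term fix: drop endpoint
    // addresses (not-ready first) until the serialized Endpoints object fits
    // under maxBytes. This is a sketch, not the real endpoints controller.
    package endpointutil

    import (
        "encoding/json"

        corev1 "k8s.io/api/core/v1"
    )

    // TruncateEndpoints returns true if any addresses were dropped.
    func TruncateEndpoints(ep *corev1.Endpoints, maxBytes int) (truncated bool) {
        fits := func() bool {
            raw, err := json.Marshal(ep)
            return err == nil && len(raw) <= maxBytes
        }
        for !fits() {
            trimmed := false
            for i := range ep.Subsets {
                // Prefer dropping not-ready addresses before ready ones.
                if n := len(ep.Subsets[i].NotReadyAddresses); n > 0 {
                    ep.Subsets[i].NotReadyAddresses = ep.Subsets[i].NotReadyAddresses[:n-1]
                    trimmed, truncated = true, true
                } else if n := len(ep.Subsets[i].Addresses); n > 0 {
                    ep.Subsets[i].Addresses = ep.Subsets[i].Addresses[:n-1]
                    trimmed, truncated = true, true
                }
            }
            if !trimmed {
                break // nothing left to drop
            }
        }
        return truncated
    }

The controller could then emit the warning events from item 2 whenever this helper reports that truncation happened.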

Environment: OSS K8s

  • Kubernetes version (use kubectl version): 1.11
@freehan freehan added sig/network Categorizes an issue or PR as relevant to SIG Network. kind/feature Categorizes issue or PR as related to a new feature. labels Jan 25, 2019
@freehan freehan added this to the v1.14 milestone Jan 25, 2019
@freehan freehan self-assigned this Jan 25, 2019
@nikopen (Contributor) commented Mar 1, 2019

Hi @freehan, is another PR needed to close this?
Code freeze is in effect from next Friday.

@thockin thockin added the triage/unresolved Indicates an issue that can not or will not be resolved. label Mar 8, 2019
@liggitt (Member) commented Mar 13, 2019

Given this is not a regression, it doesn't seem release-blocking.

@spiffxp (Member) commented Mar 13, 2019

/milestone clear
v1.14 release lead here. At this late stage in the release cycle, this doesn't seem destined for this release.

Please come talk to us in #sig-release if you feel this was done in error.

@k8s-ci-robot k8s-ci-robot removed this from the v1.14 milestone Mar 13, 2019
@thockin thockin removed the triage/unresolved Indicates an issue that can not or will not be resolved. label Mar 21, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 19, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 19, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alejandrox1 (Contributor)

This is being tackled in https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/0752-endpointslices,
so removing the rotten label (to distinguish this from issues that are labeled as rotten and do need some work done).
/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 14, 2020
jhrozek added a commit to jhrozek/security-profiles-operator that referenced this issue Mar 11, 2021
This PR introduces a new API, secpolnodestatus, that aims to address the
following problems:
 - it was previously impossible to reflect per-node status, such as a profile
   failing to install on a single node or a node not supporting the selected
   security profile
 - simply adding per-node attributes such as a map might not scale: given a
   very high number of nodes, the object might exceed etcd's 1MB request
   limit (see also kubernetes/kubernetes#73324)
 - because the per-profile status is written to by several sources (each
   pod in a DaemonSet), the status might appear to "flap" as different pods
   reach different states at their own pace.

The secpolnodestatus objects are created, managed, and deleted together with
finalizers through an API in a new module. Instead of updating the global state
directly, the DaemonSet pods now call the API's update method, which updates
the node status and, if needed, the global status as well.

When a policy is deleted, the object is marked as terminating; when the policy
payload is removed, the node status object is deleted along with its finalizer.
Finally, when all finalizers are gone, so is the global policy object.
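
For illustration only, a rough Go sketch of the per-node status shape described above; the type and field names are guesses, not the operator's actual secpolnodestatus API:

    // Hypothetical shape of a small per-node status object: per-node state
    // lives in many tiny objects instead of one ever-growing map that could
    // hit etcd's request-size limit.
    package v1alpha1

    import (
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    type SecurityProfileNodeStatus struct {
        metav1.TypeMeta   `json:",inline"`
        metav1.ObjectMeta `json:"metadata,omitempty"`

        NodeName string `json:"nodeName"` // node this status belongs to
        Status   string `json:"status"`   // e.g. Pending, Installed, Error, Terminating
    }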
jhrozek added a commit to jhrozek/security-profiles-operator that referenced this issue Mar 12, 2021
jhrozek added a commit to jhrozek/security-profiles-operator that referenced this issue Mar 12, 2021
jhrozek added a commit to jhrozek/security-profiles-operator that referenced this issue Mar 15, 2021
jhrozek added a commit to jhrozek/security-profiles-operator that referenced this issue Mar 15, 2021
jhrozek added a commit to jhrozek/security-profiles-operator that referenced this issue Mar 17, 2021
jhrozek added a commit to jhrozek/security-profiles-operator that referenced this issue Mar 17, 2021
k8s-ci-robot pushed a commit to kubernetes-sigs/security-profiles-operator that referenced this issue Mar 18, 2021