APIServer watchcache lost events #123072

Open
mengqiy opened this issue Feb 1, 2024 · 30 comments
Labels
area/apiserver kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@mengqiy
Member

mengqiy commented Feb 1, 2024

What happened?

It appears that the APIServer watchcache occasionally loses events. We can confirm that this is NOT a stale watchcache issue.

In some 1.27 clusters, we observed that the watch caches in both APIServer instances are reasonably up-to-date (objects created within the last 60s can be found in both caches). However, we believe some delete events were lost in the APIServer watchcache: in the bad apiserver, a few objects that were deleted more than 24 hours ago still show up in its cache. It's possible that other types of events (e.g. update) also get lost, but they are not as noticeable as delete events, since the cache can recover from the 2nd update event even if the first one is lost.

This issue impacts k8s clients that use an informer cache. Once an informer has consumed the event stream from the bad APIServer and missed the same events, it won't recover until it is restarted. Replacing the bad APIServer with a good one won't help the informer discover the missing events.

We have observed at least 6 EKS clusters run into this issue. 5 of them started to show it shortly after a control plane upgrade, but 1 cluster started to show it more than 1 hour before the control plane upgrade kicked in.

The clusters were running OK on 1.26; the issue started to show up once they were upgraded to 1.27.

We saw apiserver_watch_cache_events_received_total{resource="pods"} diverge between the 2 apiserver instances during the incident, while the counts on the 2 instances are expected to stay roughly the same.

We ran the following command, kubectl get --raw "/api/v1/namespaces/my-ns/pods/my-pod?resourceVersion=0", against 2 different APIServers. One returned an object and the other returned NotFound.

EDIT: add some additional data points.
We did see etcd memory keep increasing during the incident.

We believe the component that triggered this is Falco v0.35.1. It runs as a daemonset and makes a lot of watch requests without a resourceVersion. All 6 clusters had a couple hundred nodes when the incident started.

In my repro cluster, I saw one etcd instance with a much higher etcd_debugging_mvcc_pending_events_total (over 1 million) than the other etcd instances (< 20k).

What did you expect to happen?

APIServer watch cache not to lose events.

How can we reproduce it (as minimally and precisely as possible)?

EDIT:
We have a way to repro in EKS. It may not be the minimal set of steps needed to repro.

  • Create a 1.27 cluster with 800 worker nodes
  • Maintain high pod churn in the cluster. I used 1000 pods per minute
  • Deploy falco using helm: helm install falco falcosecurity/falco --create-namespace --namespace falco --version 3.6.0 --values falco-chart-values.yaml (note: chart version 3.6.0 ships falco version 0.35.1). You need to mirror the falco images to your own container registry; otherwise your kubelets will be heavily throttled by Docker Hub and the daemonset pods will come up slowly.
falco-chart-values.yaml
json_output: true
log_syslog: false
collectors:
  containerd:
    enabled: false
  crio:
    enabled: false
  docker:
    enabled: false
driver:
  enabled: true
  kind: module
  loader:
    enabled: true
    initContainer:
      image:
        registry: your-account.dkr.ecr.us-west-2.amazonaws.com
        repository: david-falco-driver-loader
        tag: 0.35.1
        pullPolicy: IfNotPresent
image:
  registry: your-account.dkr.ecr.us-west-2.amazonaws.com
  repository: david-falco
  tag: 0.35.1
  pullPolicy: IfNotPresent
podPriorityClassName: system-node-critical
tolerations:
  - operator: Exists
falcoctl:
  image:
    registry: your-account.dkr.ecr.us-west-2.amazonaws.com
    repository: david-falcoctl
    tag: 0.5.1
    pullPolicy: IfNotPresent
  artifact:
    install:
      enabled: false
    follow:
      enabled: false
falco:
  syscall_event_drops:
    actions:
    - log
    - alert
    rate: 1
    max_burst: 999
  metadata_download:
    maxMb: 200
extra:
  env:
  - name: SKIP_DRIVER_LOADER
    value: "yes"

Anything else we need to know?

No response

Kubernetes version

1.27

Cloud provider

AWS EKS

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@mengqiy mengqiy added the kind/bug Categorizes issue or PR as related to a bug. label Feb 1, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 1, 2024
@mengqiy
Member Author

mengqiy commented Feb 1, 2024

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 1, 2024
@mengqiy
Member Author

mengqiy commented Feb 1, 2024

/area apiserver

@likakuli
Contributor

likakuli commented Feb 1, 2024

/cc

@wojtek-t
Member

wojtek-t commented Feb 1, 2024

While I obviously can't exclude a bug in the watchcache, we should try to validate it end-to-end too, to ensure that it works fine on the etcd side.

@wojtek-t
Member

wojtek-t commented Feb 1, 2024

We definitely need more info to debug it.

I would also add that the watchcache is populated via a reflector, so I think that if the issue exists, it may potentially happen to any watch (not just the watchcache...).

@mengqiy
Member Author

mengqiy commented Feb 3, 2024

This issue is highly correlated with the PrevKv=nil error. A sample message:
E0129 20:22:12.231725 11 watcher.go:253] watch chan error: etcd event received with PrevKv=nil (key="/registry/pods/foo/bar", modRevision=8143031328, type=PUT)

FWIW we found the symptom is very similar to #76624, which was fixed by #76675 in 1.15.
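
For context, here is a minimal etcd clientv3 sketch (the endpoint and key prefix are placeholders) of a PrevKV-enabled watch like the one the apiserver runs, flagging the condition behind the log line above: a modification or delete of an existing key arriving with PrevKv unset.

package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/api/v3/mvccpb"
	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// The apiserver watches the key space with WithPrevKV(); for an update or a
	// delete of an existing key it expects PrevKv to be populated, and reports
	// the "etcd event received with PrevKv=nil" error seen above when it isn't.
	ch := cli.Watch(context.Background(), "/registry/pods/",
		clientv3.WithPrefix(), clientv3.WithPrevKV())
	for resp := range ch {
		for _, ev := range resp.Events {
			isUpdate := ev.Type == mvccpb.PUT && ev.Kv.CreateRevision != ev.Kv.ModRevision
			if (isUpdate || ev.Type == mvccpb.DELETE) && ev.PrevKv == nil {
				log.Printf("event received with PrevKv=nil (key=%q, modRevision=%d, type=%s)",
					ev.Kv.Key, ev.Kv.ModRevision, ev.Type)
			}
		}
	}
}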

@wojtek-t
Member

wojtek-t commented Feb 5, 2024

Hmm - it seems that this might be a problem around guarantees for PrevKey
@serathius can provide more details here

@mborsz - FYI

@MadhavJivrajani
Contributor

@serathius this is also the reason for my question here: etcd-io/etcd#17352 (comment)

@serathius
Contributor

Right, would like to revisit this in etcd-io/etcd#10681.

@dims
Member

dims commented Feb 7, 2024

cc @chaochn47

@dims
Member

dims commented Feb 7, 2024

waded through old issues ... found #121285 and #100112

@serathius
Contributor

@mengqiy Do you see any asymmetry between apiservers reporting apiserver_watch_cache_events_dispatched_total metric?

What etcd version are you using?

@mengqiy
Member Author

mengqiy commented Feb 7, 2024

etcd version is 3.5.10.

@mengqiy
Member Author

mengqiy commented Feb 7, 2024

Do you see any asymmetry between apiservers reporting apiserver_watch_cache_events_dispatched_total metric?

We no longer have that information, because we had to restart the APIServers in the customer's cluster to mitigate the issue, and the metrics were reset after the restart.

@mengqiy
Member Author

mengqiy commented Feb 7, 2024

Do you see any asymmetry between apiservers reporting apiserver_watch_cache_events_dispatched_total metric?

@serathius Luckily we got this info from another source. It shows that apiserver_watch_cache_events_dispatched_total{resource="pods"} started to diverge after upgrading to 1.27, and the 2 apiservers had dispatched roughly 13 million and 16 million pod events after running for about 22 hours before restarting.

Similarly for the apiserver_watch_cache_events_received_total{resource="pods"} metric: it diverged to about 6.5 million and 8 million for the 2 apiservers.

From my current understanding, apiserver_watch_cache_events_dispatched_total should be almost the same if the 2 apiservers start at almost the same time, and similarly for apiserver_watch_cache_events_received_total.

@seans3
Contributor

seans3 commented Feb 9, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 9, 2024
@dims
Member

dims commented Feb 22, 2024

@mengqiy
Member Author

mengqiy commented Feb 22, 2024

We observed that the Falco daemonset makes a lot of watch requests that don't set a resourceVersion.
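
For illustration, a minimal client-go sketch of what such a request looks like (the kubeconfig path is a placeholder, and pods are used here only as an example resource); the important part is that ListOptions.ResourceVersion is left empty:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// ResourceVersion is left unset. As discussed further down in this issue,
	// since v1.27 the cacher delegates watches like this directly to etcd, so a
	// daemonset doing this on every node creates one etcd watch per node.
	w, err := client.CoreV1().Pods(metav1.NamespaceAll).Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		fmt.Println(ev.Type, "event received")
	}
}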

@mengqiy
Member Author

mengqiy commented Feb 22, 2024

We have a way to repro in EKS. It may not be the minimal set of steps needed to repro.

  • Create a 1.27 cluster with 800 worker nodes
  • Maintain high pod churn in the cluster. I used 1000 pods per minute
  • Deploy falco using helm: helm install falco falcosecurity/falco --create-namespace --namespace falco --version 3.6.0 --values falco-chart-values.yaml (note: chart version 3.6.0 ships falco version 0.35.1). You need to mirror the falco images to your own container registry; otherwise your kubelets will be heavily throttled by Docker Hub and the daemonset pods will come up slowly.
  • Scale the worker nodes from 800 to 1000
falco-chart-values.yaml: same as in the issue description above.

I confirmed the repro with the following checks (a rough Go sketch of checks 2 and 3 follows the list):

  1. Find the highest pod(s) that result in 404s.
  2. Get this pod via an etcd quorum read to confirm it has been deleted.
  3. Get this pod from both apiserver caches (with ?resourceVersion=0). If one of the apiserver caches still has the pod, we can confirm the pod delete event was lost.
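
The sketch below assumes direct access to etcd and to each apiserver endpoint; all endpoints, credentials, and the /registry key layout are placeholders for your environment.

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// checkLostDelete performs check 2 (quorum read against etcd) and check 3
// (cached read against each apiserver) for one suspect pod.
func checkLostDelete(ns, name string) error {
	etcd, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-1:2379"}, // placeholder
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return err
	}
	defer etcd.Close()

	// Check 2: clientv3 Get is linearizable (quorum) by default; Count == 0
	// means the pod really has been deleted from etcd.
	resp, err := etcd.Get(context.TODO(), fmt.Sprintf("/registry/pods/%s/%s", ns, name))
	if err != nil {
		return err
	}
	fmt.Println("etcd quorum read, keys found:", resp.Count)

	// Check 3: read from each apiserver's watch cache via ?resourceVersion=0.
	for _, host := range []string{"https://apiserver-1:443", "https://apiserver-2:443"} {
		client, err := kubernetes.NewForConfig(&rest.Config{Host: host /* plus auth */})
		if err != nil {
			return err
		}
		_, err = client.CoreV1().Pods(ns).Get(context.TODO(), name,
			metav1.GetOptions{ResourceVersion: "0"})
		switch {
		case apierrors.IsNotFound(err):
			fmt.Println(host, "cache: pod not found (expected after deletion)")
		case err != nil:
			return err
		default:
			// etcd says the pod is gone but this cache still returns it:
			// the delete event was lost by this apiserver's watch cache.
			fmt.Println(host, "cache: pod still present -> lost delete event")
		}
	}
	return nil
}

func main() {
	if err := checkLostDelete("my-ns", "my-pod"); err != nil {
		fmt.Println("error:", err)
	}
}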

@mengqiy
Member Author

mengqiy commented Feb 28, 2024

@chaochn47 confirmed that the event was not sent when the pod delete event was not seen in the apiserver watch cache. One of the assumptions may not hold; it's still pending confirmation.

This issue cannot be reproduced in a 1.26 cluster and can be reproduced in a 1.28 cluster.
It seems #123532 can mitigate the issue: I cannot reproduce it on 1.28 after applying the patch.

@chaochn47

FYI @mengqiy found out that 610b670#diff-14c4fb9f290753f50e0af4856d871ff41e2b520d747158cbff7bafc577fbb29eR490 is the trigger, added in v1.27: the cacher now delegates watch requests with an unset resourceVersion to etcd.

@dims
Member

dims commented Feb 29, 2024

FYI @mengqiy found out that 610b670#diff-14c4fb9f290753f50e0af4856d871ff41e2b520d747158cbff7bafc577fbb29eR490 is the trigger, added in v1.27: the cacher now delegates watch requests with an unset resourceVersion to etcd.

cc @MadhavJivrajani

@MadhavJivrajani
Contributor

Thanks for the ping @dims

Considering that sending the watch to etcd is still the correct semantics, the solution proposed here is probably the right way forward: #123448 (comment)

@shyamjvs
Member

This is technically much more expensive request, but given there should be very few of those and eventually we can optimize it, I'm fine with it for now.

@wojtek-t making a great call-out on that PR - but under-estimating how notorious our client ecosystem can be :')

@mengqiy
Member Author

mengqiy commented Feb 29, 2024

Considering that sending the watch to etcd is still the correct semantics, the solution proposed here is probably the right way forward: #123448 (comment)

@MadhavJivrajani I agree your change gives watch the right semantics. But we just didn't anticipate that there could be so many clients making direct etcd watches. We don't have any tests to catch this issue.

@dims
Member

dims commented Feb 29, 2024

@mengqiy @shyamjvs so we will help review things better next time... right? :)

@likakuli
Contributor

likakuli commented Mar 4, 2024

It seems there is another case which can result in lost events.

FWIW we found the symptom is very similar to #76624, which was fixed by #76675 in 1.15.

kube-apiserver checks whether PrevKV == nil for non-create requests. This was OK until #111387, which sets MaxInternalErrorRetryDuration to 30s for the Reflector used in the Cacher; this can result in lost events when etcd returns an event with PrevKV=nil. A StatusError with reason StatusReasonInternalError is returned to the Reflector, and there is a risk of lost events due to the following logic: https://github.com/kubernetes/kubernetes/pull/111387/files#diff-9ccdf713e010f73dbebd01e936cb0077fc63e4f5ab941d865ded42da219d84ecR343-R345. The Reflector will continue the watch with the old objects still in its store.
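
A rough, self-contained paraphrase of that behaviour (illustrative names only, not the actual Reflector code): an InternalError from the watch is retried within a bounded window instead of forcing a relist, so the watch resumes from the last resourceVersion and events dropped in between are never replayed into the store.

package main

import (
	"errors"
	"fmt"
	"time"
)

// errInternal stands in for the StatusReasonInternalError the apiserver
// returns when etcd sends an event with PrevKV=nil.
var errInternal = errors.New("etcd event received with PrevKv=nil")

// watchFrom stands in for re-establishing the watch at a given resourceVersion.
func watchFrom(rv string) error { return errInternal }

func main() {
	const maxInternalErrorRetryDuration = 30 * time.Second
	retryDeadline := time.Now().Add(maxInternalErrorRetryDuration)
	lastSyncedRV := "8143031328" // last resourceVersion the reflector saw

	for attempt := 0; attempt < 3; attempt++ { // bounded so the sketch terminates
		err := watchFrom(lastSyncedRV)
		if errors.Is(err, errInternal) && time.Now().Before(retryDeadline) {
			// Retry the watch from the same resourceVersion without relisting;
			// any events that etcd dropped in the meantime are never replayed,
			// so stale (already deleted) objects stay in the reflector's store.
			fmt.Println("retrying watch from rv", lastSyncedRV, "after:", err)
			continue
		}
		break
	}
}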

@chaochn47

chaochn47 commented Mar 4, 2024

Thanks @likakuli for looking into this. It roughly aligns with my understanding that the reflector keeps retrying the watch with an old resource version. This could explain the watch cache being stale for a few minutes, but it does not explain the indefinitely-lost-event symptoms.

The problem is that etcd is expected to send out a watch response with StartRevision not equal to 0 (which translates to ErrCompacted on the client side) when a watch is not in sync with the progress of the MVCC store. Code-wise, this happens in syncWatchersLoop, where slow watchers cannot send out their events because the watchStream channel is clogged.

Or when a client initializes a watch with an expired resource version.

And the reflector would list and watch again to refresh the watch cache if the watchHandler returns a "The resourceVersion provided watch is too old" error, so the issue should go away. I suspect that this special watch response is also clogged on the etcd server side, and I will create an upstream etcd issue to consult the maintainers.
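
For reference, a minimal clientv3 sketch (placeholder endpoint and revision) of the client-side signal being discussed: when a watch falls behind the compaction window, etcd is expected to cancel it with a non-zero CompactRevision, which the apiserver surfaces to its reflector as the "resourceVersion ... too old" error mentioned above.

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-1:2379"}, // placeholder
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Deliberately start from an old revision (42 is a placeholder) to provoke
	// the compaction signal described above.
	ch := cli.Watch(context.Background(), "/registry/pods/",
		clientv3.WithPrefix(), clientv3.WithRev(42))
	for resp := range ch {
		if resp.CompactRevision != 0 {
			// This is the response the watcher relies on: the requested revision
			// has been compacted away, so the client must relist and start a new
			// watch from a current revision.
			fmt.Println("watch compacted, oldest available revision:", resp.CompactRevision)
			return
		}
		if err := resp.Err(); err != nil {
			fmt.Println("watch error:", err)
			return
		}
		fmt.Println("received", len(resp.Events), "events")
	}
}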

@serathius
Contributor

serathius commented Mar 13, 2024

Update from the etcd maintainers: we have confirmed that this is a real issue. If the etcd watch stream is congested, the watch falls outside of the compaction window, and etcd is unable to send a compacted response to the watcher, etcd will incorrectly synchronize the watch with incomplete data.

Thanks @mengqiy and @chaochn47 for reporting and proposing a fix etcd-io/etcd#17555

@serathius
Contributor

While the watch starvation issue has been resolved, a new problem with watch dropping events in certain edge cases has surfaced (see etcd-io/etcd#18089).
