APIServer watchcache lost events #123072

Open
mengqiy opened this issue Feb 1, 2024 · 30 comments
Labels
area/apiserver kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@mengqiy
Member

mengqiy commented Feb 1, 2024

What happened?

It appears that the APIServer watchcache occasionally loses events. We can confirm that this is NOT a stale watchcache issue.

In some 1.27 clusters, we observed that the watch caches in both APIServer instances are reasonably up-to-date (objects created within the last 60s can be found in both caches). However, we believe some delete events were lost in the APIServer watchcache: in the bad apiserver, a few objects that were deleted more than 24 hours ago still show up in its cache. It's possible that other types of events (e.g. update) also get lost, but they are not as noticeable as delete events, since the cache can recover from the 2nd update event even if the first one is lost.

This issue impacts k8s clients that use an informer cache. Once an informer has consumed the event stream from the bad APIServer and missed the same events, it won't recover until it is restarted. Replacing the bad APIServer with a good one won't help the informer discover the missing events.

We have observed at least 6 EKS clusters run into this issue. 5 of them started to show it shortly after a control plane upgrade, but 1 cluster started to show it more than 1 hour before the control plane upgrade kicked in.

The clusters were running OK on 1.26; the issue started to show up once they were upgraded to 1.27.

We saw apiserver_watch_cache_events_received_total{resource="pods"} diverge between the 2 apiserver instances during the incident, while the counts on the 2 instances are expected to stay roughly the same.

We ran the following command, kubectl get --raw "/api/v1/namespaces/my-ns/pods/my-pod?resourceVersion=0", against 2 different APIServers. One returned an object and the other returned NotFound.

EDIT: add some additional data points.
We did see etcd memory keep increasing during the incident.

We believe the component that triggered this is Falco v0.35.1. It runs as a daemonset and makes a lot of watch requests without a resourceVersion. All 6 clusters had a couple hundred nodes when the incident started.

In my repro cluster, I saw one etcd instance with a much higher etcd_debugging_mvcc_pending_events_total (over 1 million) than the other etcd instances (< 20k).

What did you expect to happen?

APIServer watch cache not to lose events.

How can we reproduce it (as minimally and precisely as possible)?

EDIT:
We have a way to repro in EKS. It may not be the minimal set of steps needed to repro.

  • Create a 1.27 cluster with 800 worker nodes
  • Maintain high pod churn in the cluster. I used 1000 pods per minute
  • Deploy falco using helm: helm install falco falcosecurity/falco --create-namespace --namespace falco --version 3.6.0 --values falco-chart-values.yaml (note: chart version 3.6.0 ships falco version 0.35.1). You need to mirror the falco images to your own container registry; otherwise your kubelets will be heavily throttled by Docker Hub and the daemonset pods will come up slowly.
falco-chart-values.yaml
json_output: true
log_syslog: false
collectors:
  containerd:
    enabled: false
  crio:
    enabled: false
  docker:
    enabled: false
driver:
  enabled: true
  kind: module
  loader:
    enabled: true
    initContainer:
      image:
        registry: your-account.dkr.ecr.us-west-2.amazonaws.com
        repository: david-falco-driver-loader
        tag: 0.35.1
        pullPolicy: IfNotPresent
image:
  registry: your-account.dkr.ecr.us-west-2.amazonaws.com
  repository: david-falco
  tag: 0.35.1
  pullPolicy: IfNotPresent
podPriorityClassName: system-node-critical
tolerations:
  - operator: Exists
falcoctl:
  image:
    registry: your-account.dkr.ecr.us-west-2.amazonaws.com
    repository: david-falcoctl
    tag: 0.5.1
    pullPolicy: IfNotPresent
  artifact:
    install:
      enabled: false
    follow:
      enabled: false
falco:
  syscall_event_drops:
    actions:
    - log
    - alert
    rate: 1
    max_burst: 999
  metadata_download:
    maxMb: 200
extra:
  env:
  - name: SKIP_DRIVER_LOADER
    value: "yes"

Anything else we need to know?

No response

Kubernetes version

1.27

Cloud provider

AWS EKS

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@mengqiy mengqiy added the kind/bug Categorizes issue or PR as related to a bug. label Feb 1, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 1, 2024
@mengqiy
Member Author

mengqiy commented Feb 1, 2024

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 1, 2024
@mengqiy
Member Author

mengqiy commented Feb 1, 2024

/area apiserver

@likakuli
Contributor

likakuli commented Feb 1, 2024

/cc

@wojtek-t
Member

wojtek-t commented Feb 1, 2024

While I obviously can't exclude a bug in the watchcache, we should try to validate it end-to-end too, to ensure that it works fine on the etcd side.

@wojtek-t
Member

wojtek-t commented Feb 1, 2024

We definitely need more info to debug it.

I would also add that the watchcache is populated via a reflector, so I think that if the issue exists, it may potentially happen to any watch (not just the watchcache...).

@mengqiy
Member Author

mengqiy commented Feb 3, 2024

This issue is highly correlated with the PrevKv=nil error. A sample message:
E0129 20:22:12.231725 11 watcher.go:253] watch chan error: etcd event received with PrevKv=nil (key="/registry/pods/foo/bar", modRevision=8143031328, type=PUT)

FWIW we found the symptom is very similar to #76624, which was fixed by #76675 in 1.15.
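
For context, here is a minimal etcd clientv3 sketch (the endpoint and key prefix are placeholders) of a PrevKV-enabled watch like the one the apiserver runs, flagging the condition behind the log line above: a modification or delete of an existing key arriving with PrevKv unset.

package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/api/v3/mvccpb"
	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// The apiserver watches the key space with WithPrevKV(); for an update or a
	// delete of an existing key it expects PrevKv to be populated, and reports
	// the "etcd event received with PrevKv=nil" error seen above when it isn't.
	ch := cli.Watch(context.Background(), "/registry/pods/",
		clientv3.WithPrefix(), clientv3.WithPrevKV())
	for resp := range ch {
		for _, ev := range resp.Events {
			isUpdate := ev.Type == mvccpb.PUT && ev.Kv.CreateRevision != ev.Kv.ModRevision
			if (isUpdate || ev.Type == mvccpb.DELETE) && ev.PrevKv == nil {
				log.Printf("event received with PrevKv=nil (key=%q, modRevision=%d, type=%s)",
					ev.Kv.Key, ev.Kv.ModRevision, ev.Type)
			}
		}
	}
}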

@wojtek-t
Member

wojtek-t commented Feb 5, 2024

Hmm - it seems that this might be a problem around guarantees for PrevKey
@serathius can provide more details here

@mborsz - FYI

@MadhavJivrajani
Contributor

@serathius this is also the reason for my question here: etcd-io/etcd#17352 (comment)

@serathius
Contributor

Right, would like to revisit this in etcd-io/etcd#10681.

@dims
Member

dims commented Feb 7, 2024

cc @chaochn47

@dims
Member

dims commented Feb 7, 2024

waded through old issues ... found #121285 and #100112

@serathius
Contributor

@mengqiy Do you see any asymmetry between apiservers reporting apiserver_watch_cache_events_dispatched_total metric?

What etcd version are you using?

@mengqiy
Member Author

mengqiy commented Feb 7, 2024

etcd version is 3.5.10.

@mengqiy
Member Author

mengqiy commented Feb 7, 2024

Do you see any asymmetry between apiservers reporting apiserver_watch_cache_events_dispatched_total metric?

We no longer have that information, because we had to restart the APIServers in the customer's cluster to mitigate the issue, and the metrics were reset after the restart.

@mengqiy
Member Author

mengqiy commented Feb 7, 2024

Do you see any asymmetry between apiservers reporting apiserver_watch_cache_events_dispatched_total metric?

@serathius Luckily we got this info from another source. It shows that apiserver_watch_cache_events_dispatched_total{resource="pods"} started to diverge after upgrading to 1.27, and the 2 apiservers had dispatched roughly 13 million and 16 million pod events after running for about 22 hours before restarting.

Similarly for the apiserver_watch_cache_events_received_total{resource="pods"} metric: it diverged to about 6.5 million and 8 million for the 2 apiservers.

From my current understanding, apiserver_watch_cache_events_dispatched_total should be almost the same if the 2 apiservers start at almost the same time, and similarly for apiserver_watch_cache_events_received_total.

@seans3
Contributor

seans3 commented Feb 9, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 9, 2024
@dims
Member

dims commented Feb 22, 2024

@mengqiy
Member Author

mengqiy commented Feb 22, 2024

We observed that the Falco daemonset makes a lot of watch requests that don't set a resourceVersion.
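
For illustration, a minimal client-go sketch of what such a request looks like (the kubeconfig path is a placeholder, and pods are used here only as an example resource); the important part is that ListOptions.ResourceVersion is left empty:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// ResourceVersion is left unset. As discussed further down in this issue,
	// since v1.27 the cacher delegates watches like this directly to etcd, so a
	// daemonset doing this on every node creates one etcd watch per node.
	w, err := client.CoreV1().Pods(metav1.NamespaceAll).Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		fmt.Println(ev.Type, "event received")
	}
}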

@mengqiy
Member Author

mengqiy commented Feb 22, 2024

We have a way to repro in EKS. It may not be the minimal set of steps needed to repro.

  • Create a 1.27 cluster with 800 worker nodes
  • Maintain high pod churn in the cluster. I used 1000 pods per minute
  • Deploy falco using helm: helm install falco falcosecurity/falco --create-namespace --namespace falco --version 3.6.0 --values falco-chart-values.yaml (note: chart version 3.6.0 ships falco version 0.35.1). You need to mirror the falco images to your own container registry; otherwise your kubelets will be heavily throttled by Docker Hub and the daemonset pods will come up slowly.
  • Scale the worker nodes from 800 to 1000
falco-chart-values.yaml: same as in the issue description above.

I confirmed the repro with the following checks (a rough Go sketch of checks 2 and 3 follows the list):

  1. Find the highest pod(s) that result in 404s.
  2. Get this pod via an etcd quorum read to confirm it has been deleted.
  3. Get this pod from both apiserver caches (with ?resourceVersion=0). If one of the apiserver caches still has the pod, we can confirm the pod delete event was lost.
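
The sketch below assumes direct access to etcd and to each apiserver endpoint; all endpoints, credentials, and the /registry key layout are placeholders for your environment.

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// checkLostDelete performs check 2 (quorum read against etcd) and check 3
// (cached read against each apiserver) for one suspect pod.
func checkLostDelete(ns, name string) error {
	etcd, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-1:2379"}, // placeholder
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return err
	}
	defer etcd.Close()

	// Check 2: clientv3 Get is linearizable (quorum) by default; Count == 0
	// means the pod really has been deleted from etcd.
	resp, err := etcd.Get(context.TODO(), fmt.Sprintf("/registry/pods/%s/%s", ns, name))
	if err != nil {
		return err
	}
	fmt.Println("etcd quorum read, keys found:", resp.Count)

	// Check 3: read from each apiserver's watch cache via ?resourceVersion=0.
	for _, host := range []string{"https://apiserver-1:443", "https://apiserver-2:443"} {
		client, err := kubernetes.NewForConfig(&rest.Config{Host: host /* plus auth */})
		if err != nil {
			return err
		}
		_, err = client.CoreV1().Pods(ns).Get(context.TODO(), name,
			metav1.GetOptions{ResourceVersion: "0"})
		switch {
		case apierrors.IsNotFound(err):
			fmt.Println(host, "cache: pod not found (expected after deletion)")
		case err != nil:
			return err
		default:
			// etcd says the pod is gone but this cache still returns it:
			// the delete event was lost by this apiserver's watch cache.
			fmt.Println(host, "cache: pod still present -> lost delete event")
		}
	}
	return nil
}

func main() {
	if err := checkLostDelete("my-ns", "my-pod"); err != nil {
		fmt.Println("error:", err)
	}
}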

@mengqiy
Member Author

mengqiy commented Feb 28, 2024

@chaochn47 confirmed that the event was not sent when the pod delete event was not seen in the apiserver watch cache. One of the assumptions may not hold; it's still pending confirmation.

This issue cannot be reproduced in a 1.26 cluster and can be reproduced in a 1.28 cluster.
It seems #123532 can mitigate the issue: I cannot reproduce it on 1.28 after applying the patch.

@chaochn47

FYI @mengqiy found out that 610b670#diff-14c4fb9f290753f50e0af4856d871ff41e2b520d747158cbff7bafc577fbb29eR490 is the trigger, added in v1.27: the cacher now delegates watch requests with an unset resourceVersion to etcd.

@dims
Member

dims commented Feb 29, 2024

FYI @mengqiy found out that 610b670#diff-14c4fb9f290753f50e0af4856d871ff41e2b520d747158cbff7bafc577fbb29eR490 is the trigger, added in v1.27: the cacher now delegates watch requests with an unset resourceVersion to etcd.

cc @MadhavJivrajani

@MadhavJivrajani
Contributor

Thanks for the ping @dims

Considering that sending the watch to etcd is still the correct semantics, the solution proposed here is probably the right way forward: #123448 (comment)

@shyamjvs
Member

This is technically much more expensive request, but given there should be very few of those and eventually we can optimize it, I'm fine with it for now.

@wojtek-t making a great call-out on that PR - but under-estimating how notorious our client ecosystem can be :')

@mengqiy
Member Author

mengqiy commented Feb 29, 2024

Considering that sending the watch to etcd is still the correct semantics, the solution proposed here is probably the right way forward: #123448 (comment)

@MadhavJivrajani I agree your change gives watch the right semantics. But we just didn't anticipate that there could be so many clients making direct etcd watches. We don't have any tests to catch this issue.

@dims
Member

dims commented Feb 29, 2024

@mengqiy @shyamjvs so we will help review things better next time... right? :)

@likakuli
Contributor

likakuli commented Mar 4, 2024

It seems there is another case which can result in lost events.

FWIW we found the symptom is very similar to #76624, which was fixed by #76675 in 1.15.

kube-apiserver checks whether PrevKV == nil for non-create requests. This was OK until #111387, which sets MaxInternalErrorRetryDuration to 30s for the Reflector used in the Cacher; this can result in lost events when etcd returns an event with PrevKV=nil. A StatusError with reason StatusReasonInternalError is returned to the Reflector, and there is a risk of lost events due to the following logic: https://github.com/kubernetes/kubernetes/pull/111387/files#diff-9ccdf713e010f73dbebd01e936cb0077fc63e4f5ab941d865ded42da219d84ecR343-R345. The Reflector will continue the watch with the old objects still in its store.
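
A rough, self-contained paraphrase of that behaviour (illustrative names only, not the actual Reflector code): an InternalError from the watch is retried within a bounded window instead of forcing a relist, so the watch resumes from the last resourceVersion and events dropped in between are never replayed into the store.

package main

import (
	"errors"
	"fmt"
	"time"
)

// errInternal stands in for the StatusReasonInternalError the apiserver
// returns when etcd sends an event with PrevKV=nil.
var errInternal = errors.New("etcd event received with PrevKv=nil")

// watchFrom stands in for re-establishing the watch at a given resourceVersion.
func watchFrom(rv string) error { return errInternal }

func main() {
	const maxInternalErrorRetryDuration = 30 * time.Second
	retryDeadline := time.Now().Add(maxInternalErrorRetryDuration)
	lastSyncedRV := "8143031328" // last resourceVersion the reflector saw

	for attempt := 0; attempt < 3; attempt++ { // bounded so the sketch terminates
		err := watchFrom(lastSyncedRV)
		if errors.Is(err, errInternal) && time.Now().Before(retryDeadline) {
			// Retry the watch from the same resourceVersion without relisting;
			// any events that etcd dropped in the meantime are never replayed,
			// so stale (already deleted) objects stay in the reflector's store.
			fmt.Println("retrying watch from rv", lastSyncedRV, "after:", err)
			continue
		}
		break
	}
}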

@chaochn47

chaochn47 commented Mar 4, 2024

Thanks @likakuli for looking into this. It roughly aligns with my understanding that the reflector keeps retrying the watch with an old resource version. This could explain the watch cache being stale for a few minutes, but it does not explain the indefinitely-lost-event symptoms.

The problem is that etcd is expected to send out a watch response with StartRevision not equal to 0 (which translates to ErrCompacted on the client side) when a watch is not in sync with the progress of the MVCC store. Code-wise, this happens in syncWatchersLoop, where slow watchers cannot send out their events because the watchStream channel is clogged.

Or when a client initializes a watch with an expired resource version.

And the reflector would list and watch again to refresh the watch cache if the watchHandler returns a "The resourceVersion provided watch is too old" error, so the issue should go away. I suspect that this special watch response is also clogged on the etcd server side, and I will create an upstream etcd issue to consult the maintainers.
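
For reference, a minimal clientv3 sketch (placeholder endpoint and revision) of the client-side signal being discussed: when a watch falls behind the compaction window, etcd is expected to cancel it with a non-zero CompactRevision, which the apiserver surfaces to its reflector as the "resourceVersion ... too old" error mentioned above.

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-1:2379"}, // placeholder
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Deliberately start from an old revision (42 is a placeholder) to provoke
	// the compaction signal described above.
	ch := cli.Watch(context.Background(), "/registry/pods/",
		clientv3.WithPrefix(), clientv3.WithRev(42))
	for resp := range ch {
		if resp.CompactRevision != 0 {
			// This is the response the watcher relies on: the requested revision
			// has been compacted away, so the client must relist and start a new
			// watch from a current revision.
			fmt.Println("watch compacted, oldest available revision:", resp.CompactRevision)
			return
		}
		if err := resp.Err(); err != nil {
			fmt.Println("watch error:", err)
			return
		}
		fmt.Println("received", len(resp.Events), "events")
	}
}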

@serathius
Contributor

serathius commented Mar 13, 2024

Update from the etcd maintainers: we have confirmed that this is a real issue. If the etcd watch stream is congested, the watch falls outside of the compaction window, and etcd is unable to send a compacted response to the watcher, etcd will incorrectly synchronize the watch with incomplete data.

Thanks @mengqiy and @chaochn47 for reporting and proposing a fix etcd-io/etcd#17555

@serathius
Contributor

While the watch starvation issue has been resolved, a new problem with watch dropping events in certain edge cases has surfaced (see etcd-io/etcd#18089).
