APIServer watchcache lost events #123072
Comments
/sig api-machinery
/area apiserver
/cc
While I obviously can't exclude a bug in the watchcache, we should try to validate it e2e too, to ensure that it works fine on the etcd side.
We definitely need more info to debug it. I would also add that the watchcache is populated via a reflector, so I think that if the issue exists, it may potentially happen to any watch (not just the watchcache...).
This issue is highly correlated to the control plane upgrade. FWIW, we found the symptom is very similar to #76624, which has been fixed by #76675 since 1.15.
Hmm - it seems that this might be a problem around guarantees for PrevKV. @mborsz - FYI
@serathius this is also the reason for my question here: etcd-io/etcd#17352 (comment)
Right, would like to revisit this in etcd-io/etcd#10681.
cc @chaochn47
@mengqiy Do you see any asymmetry between apiservers reporting apiserver_watch_cache_events_received_total? What etcd version are you using?
etcd version is 3.5.10.
We no longer have that information, because we had to restart the APIServers in the customer's cluster to mitigate the issue, and the metrics were reset after the restart.
@serathius Luckily we got this info from another source. It shows that … Similarly for … From my current understanding, …
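For reference, one way to compare that counter across apiserver instances is to scrape each instance's /metrics endpoint; a rough sketch, assuming you have some way to address each apiserver instance individually (e.g. by pointing kubectl at a specific endpoint):

```sh
# Illustrative only: dump the watch cache event counter for pods from whichever
# apiserver this kubectl session is routed to, then repeat against the other
# instance and compare how the counters grow over time.
kubectl get --raw /metrics | grep apiserver_watch_cache_events_received_total | grep 'resource="pods"'
```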
/triage accepted
We observed that the Falco daemonset makes a lot of watch requests that don't have resourceVersion set.
We have a way to repro in EKS. It may not be the minimal set of steps to repro.
falco-chart-values.yaml:

```yaml
json_output: true
log_syslog: false
collectors:
  containerd:
    enabled: false
  crio:
    enabled: false
  docker:
    enabled: false
driver:
  enabled: true
  kind: module
  loader:
    enabled: true
    initContainer:
      image:
        registry: your-account.dkr.ecr.us-west-2.amazonaws.com
        repository: david-falco-driver-loader
        tag: 0.35.1
        pullPolicy: IfNotPresent
image:
  registry: your-account.dkr.ecr.us-west-2.amazonaws.com
  repository: david-falco
  tag: 0.35.1
  pullPolicy: IfNotPresent
podPriorityClassName: system-node-critical
tolerations:
  - operator: Exists
falcoctl:
  image:
    registry: your-account.dkr.ecr.us-west-2.amazonaws.com
    repository: david-falcoctl
    tag: 0.5.1
    pullPolicy: IfNotPresent
  artifact:
    install:
      enabled: false
    follow:
      enabled: false
falco:
  syscall_event_drops:
    actions:
      - log
      - alert
    rate: 1
    max_burst: 999
  metadata_download:
    maxMb: 200
extra:
  env:
    - name: SKIP_DRIVER_LOADER
      value: "yes"
```

I confirmed the repro by looking for: …
This issue cannot be reproduced in a 1.26 cluster but can be reproduced in a 1.28 cluster.
FYI @mengqiy found out that 610b670#diff-14c4fb9f290753f50e0af4856d871ff41e2b520d747158cbff7bafc577fbb29eR490 is the trigger: since v1.27, the cacher delegates watch requests with resource version unset to etcd.
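To make that concrete, a watch with resource version unset is simply a watch request that omits the resourceVersion parameter; a minimal illustration (the resource and timeout below are only examples):

```sh
# A watch with resourceVersion unset; on v1.27+ the apiserver delegates this
# to etcd instead of serving it from the watch cache.
# timeoutSeconds just makes the command return on its own.
kubectl get --raw "/api/v1/pods?watch=true&timeoutSeconds=10"
```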
Thanks for the ping @dims. Considering that sending the watch to etcd is still the correct semantics, the solution proposed here is probably the right way forward: #123448 (comment)
@wojtek-t is making a great call-out on that PR - but underestimating how notorious our client ecosystem can be :')
@MadhavJivrajani I agree your change gave watch the right semantics. But we just didn't anticipate there could be so many clients making direct etcd watches. We don't have any test to catch this issue.
Seems there is another case which will result in lost events.
kube-apiserver checks whether PrevKV == nil for non-create requests. This was OK until #111387. It uses …
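For context, PrevKV is the previous key-value pair that etcd attaches to a watch event when the watcher asks for it; a quick, purely illustrative way to look at it from the etcd side (the endpoint and key prefix are placeholders for your setup, and auth/TLS flags are omitted):

```sh
# Watch a prefix and ask etcd to include the previous key-value pair on each
# event (-w json shows the prev_kv field). kube-apiserver relies on this field
# to distinguish creates from updates/deletes.
etcdctl --endpoints=https://127.0.0.1:2379 watch --prefix --prev-kv -w json /registry/pods/
```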
Thanks @likakuli for looking into this; it aligns to some extent with my understanding that the reflector keeps retrying the watch with an old resource version. That could explain the watch cache being stale for a few minutes, but it does not explain the indefinitely-lost-events symptom. The problem is that etcd is expected to send out a watch response with a compaction error when the watch falls outside the compaction window, or when a client initializes a watch with an expired resource version. And the reflector would list and watch again to refresh the watch cache if the watchHandler returns an error.
Update from the etcd maintainers: we have confirmed that this is a real issue. If the etcd watch stream is congested and the watch goes outside of the compaction window, and etcd is unable to send a compacted response to the watcher, it will incorrectly synchronize the watch with incomplete data. Thanks @mengqiy and @chaochn47 for reporting and proposing a fix: etcd-io/etcd#17555
While the watch starvation issue has been resolved, a new problem with watch dropping events in certain edge cases has surfaced (see etcd-io/etcd#18089).
What happened?
It appears that APIServer watchcache occasionally lost events. We can confirm that this is NOT a stale watchcache issue.
In some 1.27 clusters, we observed that the watch caches in both APIServer instances are pretty up to date (objects created within 60s can be found in both caches). However, we believe some delete events were lost in the APIServer watch cache. In the bad apiserver, a few objects that were deleted more than 24 hours ago still show up in the cache. It's possible that other types of events (e.g. update) also get lost, but they are not as noticeable as delete events, since the cache can recover from the 2nd update event even if the first update event is lost.
This issue impacts k8s clients that use an informer cache. Once an informer gets its events from the bad APIServer, it won't recover until it is restarted. Replacing the bad APIServer with a good one won't help the informer discover the missing events.
We have observed at least 6 clusters run into this issue in EKS. 5 of them started to have this issue shortly after a control plane upgrade, but 1 cluster started to have this issue more than 1 hour before the control plane upgrade kicked in.
The clusters were running OK on 1.26; the issue started to show up when the clusters were upgraded to 1.27.
We saw apiserver_watch_cache_events_received_total{resource="pods"} diverge between the 2 apiserver instances during the incident, while the delta between the 2 apiserver instances is expected to stay the same.

We ran the following command against 2 different APIServers:

```sh
kubectl get --raw "/api/v1/namespaces/my-ns/pods/my-pod?resourceVersion=0"
```

One returns an object and the other returns NotFound.

EDIT: adding some additional data points below.
We did see etcd memory keep increasing during the incident.
We believe the component that triggered this is Falco v0.35.1. It runs as a daemonset and it made a lot of watch requests without resourceVersion. All 6 clusters had a couple hundred nodes when the incident started.
In my repro cluster, I saw that one etcd instance has a much higher etcd_debugging_mvcc_pending_events_total (over 1 million) than the other etcd instances (< 20k).

What did you expect to happen?
The APIServer watch cache should not lose events.
How can we reproduce it (as minimally and precisely as possible)?
EDIT:
We have a way to repro in EKS. It may not be the minimal set of steps to repro.

```sh
helm install falco falcosecurity/falco --create-namespace --namespace falco --version 3.6.0 --values falco-chart-values.yaml
```

Note that chart version 3.6.0 ships Falco version 0.35.1. You need to mirror the Falco images to your own container registry; otherwise your kubelets will be throttled heavily by Docker Hub and the daemonset pods will come up slowly. The chart values are in falco-chart-values.yaml (shown in the comment above).
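One way to check whether the repro produced a divergent watch cache is to run the resourceVersion=0 GET from the description above against each apiserver instance and compare the results; a hedged sketch, assuming you have some way to reach each instance individually (APISERVER_A and APISERVER_B are placeholder endpoints):

```sh
# resourceVersion=0 is served from the apiserver's watch cache, so a stale
# cache keeps returning objects that were deleted long ago while a healthy
# instance returns NotFound.
kubectl --server "$APISERVER_A" get --raw "/api/v1/namespaces/my-ns/pods/my-pod?resourceVersion=0"
kubectl --server "$APISERVER_B" get --raw "/api/v1/namespaces/my-ns/pods/my-pod?resourceVersion=0"
```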
Anything else we need to know?
No response
Kubernetes version
1.27
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)