List Watch Failed because of "The resourceVersion for the provided list is too old" #6032
Comments
Thanks @yunhjiang. You mentioned on Slack that you thought this commit was responsible: 8980383. It looks like the first list will work fine but, if we ever get into a state where the current revision is "stale", we'll get stuck.
I checked the code again, and it seems this stuck state can only arise in a very small window. Looking at the code in resyncAndCreateWatcher(): it only initiates the full list when currentWatchRevision is "0", so the first round of the loop should be fine because we pass "0" to the wc.client.List() call. The revision from that list is then used for the subsequent wc.client.Watch() call. In the normal case this is fine. However, if for some reason (maybe a burst of updates) that revision becomes too old between the List and Watch calls, we loop forever. A simple solution is to reset the revision to "0" after https://github.com/projectcalico/calico/blob/master/libcalico-go/lib/backend/watchersyncer/watchercache.go#L239 , where performFullResync is set to true. This should be harmless. I will cook up a patch for it.
I think this window is quite small, which is why we only hit this issue after typha had been running for some time. Also, I'm a bit confused by performFullResync=true. performFullResync never seems to be set to false, so setting it to true again is meaningless, right?
Agreed that the code looks a little weird in not clearing the resync flag. I've put up a candidate fix here: #6045. It attempts to clear the resync flag on success, and it also clears the revision number when we get a "too old" error back from the server.
Hi all! Same issue: Calico v 3.24.1, typha logs:
@ivanovpavel1983 Did you follow these steps (Upgrade from Calico versions prior to v3.23.0)?
We are using typha in our deployment, and typha fails after running for some time with the logs below. It then prints these log lines about once per second, and felix no longer receives any updates.
2022-05-03 00:19:06.899 [INFO][7] watchercache.go 188: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/kubernetesendpointslices" error=The resourceVersion for the provided list is too old.
2022-05-03 00:19:07.899 [INFO][7] watchercache.go 175: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/kubernetesendpointslices"
Expected Behavior
When the "resourceVersion too old" error occurs, the list/watch should recover automatically and the calico agent should continue to receive updates.
Current Behavior
Typha continuously prints the error and felix no longer receives updates.
Possible Solution
Bug fixing.
Steps to Reproduce (for bugs)
Start typha pointing at a busy k8s deployment that has a lot of object changes.
Context
We are deploying felix with typha.
Your Environment
Calico version
v3.22.0 calico
v3.22 typha
Orchestrator version (e.g. kubernetes, mesos, rkt):
k8s 1.8.
Operating System and version:
Link to your project (optional):