List Watch Failed because of "The resourceVersion for the provided list is too old" #6032

Closed
yunhjiang opened this issue May 6, 2022 · 6 comments · Fixed by #6045

@yunhjiang

We run Typha in our deployment. After running for some time, Typha fails with the logs below and then prints them about once per second. From that point on, Felix no longer receives updates.

2022-05-03 00:19:06.899 [INFO][7] watchercache.go 188: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/kubernetesendpointslices" error=The resourceVersion for the provided list is too old.
2022-05-03 00:19:07.899 [INFO][7] watchercache.go 175: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/kubernetesendpointslices"

Expected Behavior

When a "resourceVersion too old" error occurs, the list/watch should recover automatically so that the Calico agent keeps receiving updates.

Current Behavior

Typha prints the error continuously and Felix no longer receives updates.

Possible Solution

Bug fixing.

Steps to Reproduce (for bugs)

Start Typha against a busy Kubernetes cluster that has a high rate of object changes.

Context

We deploy Felix with Typha.

Your Environment

  • Calico version
    v3.22.0 calico
    v3.22 typha

  • Orchestrator version (e.g. kubernetes, mesos, rkt):
    k8s 1.8.

  • Operating System and version:

  • Link to your project (optional):

@fasaxc
Member

fasaxc commented May 9, 2022

Thanks @yunhjiang. You mentioned on Slack that you thought this commit was responsible: 8980383

Looks like the first list will work fine but, if we ever get into a state where the current revision is "stale", we'll get stuck.

@yunhjiang
Author

I checked the code again. It seems we can only get stuck with a stale current revision during a very small window.

Looking at resyncAndCreateWatcher():

It only initiates a full list when currentWatchRevision is "0". On the first pass through the loop this is fine, because we pass "0" to the wc.client.List() call and then use the revision it returns for the subsequent wc.client.Watch() call.

In the normal case this is fine. However, if for some reason (perhaps a burst of updates) that revision becomes too old between the List and Watch calls, we end up in an infinite loop: the revision is no longer "0", so each resync retries the list with the same stale revision.

So a simple fix is to reset the revision to "0" after https://github.com/projectcalico/calico/blob/master/libcalico-go/lib/backend/watchersyncer/watchercache.go#L239 (the line that sets performFullResync to true).

This should be harmless. I will cook a patch for it.
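
The stuck loop and the proposed reset can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the watchercache.go code: fakeBackend, resync, and errTooOld are hypothetical names, and revisions are modelled as integers (with 0 meaning "list the current data") rather than the strings the real code uses.

```go
package main

import (
	"errors"
	"fmt"
)

// errTooOld stands in for the server's "resourceVersion too old" error (hypothetical).
var errTooOld = errors.New("the resourceVersion for the provided list is too old")

// fakeBackend is a hypothetical stand-in for the datastore, not the Calico client.
type fakeBackend struct {
	oldest int // oldest revision the backend still retains
	latest int // current revision
}

// List returns the revision of the listed snapshot.
// Revision 0 means "list the current data"; a revision older than
// b.oldest is rejected as too old.
func (b *fakeBackend) List(rev int) (int, error) {
	if rev != 0 && rev < b.oldest {
		return 0, errTooOld
	}
	return b.latest, nil
}

// resync retries List up to maxAttempts times, starting from rev.
// With resetOnTooOld set, a "too old" error resets the revision to 0
// so the next attempt performs a full list.
func resync(b *fakeBackend, rev int, resetOnTooOld bool, maxAttempts int) (int, bool) {
	for i := 0; i < maxAttempts; i++ {
		newRev, err := b.List(rev)
		if err == nil {
			return newRev, true
		}
		if resetOnTooOld && errors.Is(err, errTooOld) {
			rev = 0
		}
	}
	return rev, false
}

func main() {
	b := &fakeBackend{oldest: 100, latest: 150}

	// Stale revision 42, no reset: every attempt fails the same way.
	_, ok := resync(b, 42, false, 5)
	fmt.Println("without reset, recovered:", ok)

	// Same stale revision, with the reset: the second attempt succeeds.
	rev, ok := resync(b, 42, true, 5)
	fmt.Println("with reset, recovered:", ok, "revision:", rev)
}
```

Without the reset, the same stale revision is sent on every attempt, which matches the once-per-second log lines in the report. With the reset, the second attempt is a full list and succeeds.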

@yunhjiang
Author

yunhjiang commented May 9, 2022

I think this window is quite small, which is why we only hit the issue after Typha had been running for some time.

Also, I'm a bit confused by performFullResync=true. performFullResync never seems to be set to false, so setting it to true again is meaningless, right?

@caseydavenport
Member

Agree that the code looks a little weird in not clearing the resync flag.

I've put up a candidate fix here: #6045

It attempts to clear the resync flag on success, and it also clears the revision number when we get a "too old" error back from the server.

@ivanovpavel1983

ivanovpavel1983 commented Dec 12, 2022

Hello. I have the same issue after upgrading from Calico 3.16 to 3.23.3 (migrated from manifest to tigera-operator):

image

After that, I see these info logs:

image

The Kubernetes API server logs these errors:

image

Maybe I need to manually remove some resources?

@daemonadmin

daemonadmin commented Feb 16, 2023

Hi all! Same issue:

Calico v3.24.1, Typha logs:

2023-02-15 07:46:21.197 [INFO][7] sync_server.go 756: Status update to send. client=XXXXX:37056 connID=0x1 newStatus=in-sync thread="kv-sender"
2023-02-15 08:03:21.718 [INFO][7] watchercache.go 125: Watch error received from Upstream ListRoot="/calico/ipam/v2/host/" error=too old resource version: 226184 (136011276)
2023-02-15 08:03:22.211 [INFO][7] watchercache.go 181: Full resync is required ListRoot="/calico/ipam/v2/host/"
2023-02-15 08:59:46.439 [INFO][7] watchercache.go 125: Watch error received from Upstream ListRoot="/calico/ipam/v2/host/" error=too old resource version: 233636 (136011276)
2023-02-15 08:59:46.929 [INFO][7] watchercache.go 181: Full resync is required ListRoot="/calico/ipam/v2/host/"
2023-02-15 09:58:50.411 [INFO][7] watchercache.go 125: Watch error received from Upstream ListRoot="/calico/ipam/v2/host/" error=too old resource version: 3302222 (136011276)
2023-02-15 09:58:50.905 [INFO][7] watchercache.go 181: Full resync is required ListRoot="/calico/ipam/v2/host/"
2023-02-15 10:33:25.134 [INFO][7] watchercache.go 125: Watch error received from Upstream ListRoot="/calico/ipam/v2/host/" error=too old resource version: 226184 (136011276)

@ivanovpavel1983 Did you follow the steps under "Upgrade from Calico versions prior to v3.23.0"?
https://artifacthub.io/packages/helm/projectcalico/tigera-operator
