
Cache sync timeout after upgrading to version 0.11.0 #1786

Closed
erikgb opened this issue Jan 26, 2022 · 9 comments · Fixed by #1808

Comments

erikgb (Contributor) commented Jan 26, 2022

After upgrading our controller's Kubernetes dependencies to more recent versions (controller-runtime 0.11.0, K8s 0.23.1), we see an increased frequency of this type of error:

{
  "level": "error",
  "ts": 1642682485.0051146,
  "logger": "controller.application",
  "msg": "Could not wait for Cache to sync",
  "reconciler group": "stas.mycompany.com",
  "reconciler kind": "Application",
  "error": "failed to wait for application caches to sync: timed out waiting for cache to be synced",
  "stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/builds/stas/k8s-controllers/application-operator/.go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:208\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/builds/stas/k8s-controllers/application-operator/.go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:234\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/builds/stas/k8s-controllers/application-operator/.go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/manager/runnable_group.go:218"
}

Any clues as to what might cause this, and how to fix it? The default cache sync timeout seems to be 2 minutes, and I think that should be (more than) enough time to sync the cache. 🤔 We see the problem in 2 out of 3 controllers.

Is there a change in controller-runtime that might have caused this error to emerge, so that we just have to increase the cache timeout? After reverting to the previous versions (controller-runtime 0.10.3, K8s 0.22.6), this error does not occur at all.
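
For reference, increasing the timeout per controller appears to be possible via controller.Options when building the controller. This is a minimal sketch against the controller-runtime v0.11.0 builder API; the ApplicationReconciler and stasv1.Application names below are placeholders, not the actual types in our operator:

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

// SetupWithManager wires a (placeholder) ApplicationReconciler into the manager
// with a cache sync timeout raised above the 2-minute default.
func (r *ApplicationReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&stasv1.Application{}). // placeholder for the stas.mycompany.com Application type
		WithOptions(controller.Options{
			CacheSyncTimeout: 5 * time.Minute, // default is 2 minutes
		}).
		Complete(r)
}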

FillZpp (Contributor) commented Jan 27, 2022

Do you have a conversion webhook for a multi-version CRD? (That used to cause cache sync timeouts, but it should actually be fixed in 0.11.)

erikgb (Contributor, Author) commented Jan 27, 2022

Do you have a conversion webhook for a multi-version CRD? (That used to cause cache sync timeouts, but it should actually be fixed in 0.11.)

@FillZpp Yes (and no): one of the two failing controllers has a multi-version CRD with a conversion webhook on the owner side. But the problem also occurs in a controller that does not have multiple versions - at least not on the owner side. It might have a multi-version CRD with a conversion webhook on the owned side, though. Could that also be relevant?

alvaroaleman (Member) commented:

So you do use the default two-minute timeout? Does it fail after two minutes or at a different interval?

erikgb (Contributor, Author) commented Feb 8, 2022

Thanks for challenging me to report this issue better! 😊 It seems I didn't investigate the issue well enough, but now I think I've gotten a bit further.

The issue is reproducible when enabling leader election, and it occurs during shutdown of the non-leader replica. It seems the non-leader attempts to start syncing its cache just before it is shut down - the attempt is then aborted, resulting in this annoying and misleading error message.
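
For context, this is roughly how leader election is enabled in the manager. A sketch only: the option names are from controller-runtime v0.11.0, and the election ID and bind addresses are placeholders matching the redacted log output below:

import ctrl "sigs.k8s.io/controller-runtime"

// newManager builds a manager with leader election enabled; only the leader
// replica starts the controllers and their caches.
func newManager() (ctrl.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		MetricsBindAddress:     ":8080",
		HealthProbeBindAddress: ":8081",
		LeaderElection:         true,
		LeaderElectionID:       "b6e1c15e.example.com", // placeholder for the redacted lease name
	})
}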

The full log for a non-leader replica running controller-runtime v0.10.3:

{"level":"info","ts":1644321376.4299257,"logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1644321376.4308634,"msg":"Starting metrics server","path":"/metrics"}
{"level":"info","ts":1644321376.4305346,"logger":"setup","msg":"starting manager"}
I0208 11:56:16.431148       1 leaderelection.go:248] attempting to acquire leader lease identity-provider/b6e1c15e.REDACTED...

And the full log for a non-leader replica running controller-runtime v0.11.0:

{"level":"info","ts":1644313447.3173363,"logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1644313447.3181336,"logger":"setup","msg":"starting manager"}
{"level":"info","ts":1644313447.4092572,"msg":"Starting server","kind":"health probe","addr":"[::]:8081"}
{"level":"info","ts":1644313447.3186686,"msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
I0208 09:44:13.219438       1 leaderelection.go:248] attempting to acquire leader lease identity-provider/b6e1c15e.REDACTED...
{"level":"info","ts":1644321368.1553714,"msg":"Stopping and waiting for leader election runnables"}
{"level":"info","ts":1644321368.1553307,"msg":"Stopping and waiting for non leader election runnables"}
{"level":"info","ts":1644321368.1558619,"msg":"Wait completed, proceeding to shutdown the manager"}
{"level":"info","ts":1644321368.1558402,"msg":"Stopping and waiting for webhooks"}
{"level":"error","ts":1644321368.1557145,"msg":"error received after stop sequence was engaged","error":"failed to wait for serviceaccount caches to sync: timed out waiting for cache to be synced","stacktrace":"sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/builds/stas/k8s-controllers/identity-provider/.go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/manager/internal.go:541"}
{"level":"info","ts":1644321368.1556916,"msg":"Stopping and waiting for caches"}
{"level":"error","ts":1644321368.1556363,"logger":"controller.serviceaccount","msg":"Could not wait for Cache to sync","reconciler group":"","reconciler kind":"ServiceAccount","error":"failed to wait for serviceaccount caches to sync: timed out waiting for cache to be synced","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/builds/stas/k8s-controllers/identity-provider/.go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:208\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/builds/stas/k8s-controllers/identity-provider/.go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:234\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/builds/stas/k8s-controllers/identity-provider/.go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/manager/runnable_group.go:218"}
{"level":"info","ts":1644321368.1555128,"logger":"controller.serviceaccount","msg":"Starting Controller","reconciler group":"","reconciler kind":"ServiceAccount"}
{"level":"info","ts":1644321368.1555092,"logger":"controller.serviceaccount","msg":"Starting EventSource","reconciler group":"","reconciler kind":"ServiceAccount","source":"kind source: *v1.Secret"}
{"level":"info","ts":1644321368.1555035,"logger":"controller.serviceaccount","msg":"Starting EventSource","reconciler group":"","reconciler kind":"ServiceAccount","source":"kind source: *v1.Certificate"}
{"level":"info","ts":1644321368.1554766,"logger":"controller.serviceaccount","msg":"Starting EventSource","reconciler group":"","reconciler kind":"ServiceAccount","source":"kind source: *v1.ServiceAccount"}

FillZpp (Contributor) commented Feb 9, 2022

"error received after stop sequence was engaged","error":"failed to wait for serviceaccount caches to sync: timed out waiting for cache to be synced"

If the manager starts to terminate while the controllers and cache are still starting, the cache sync will time out because of the canceled context:

func (ks *Kind) WaitForSync(ctx context.Context) error {
	select {
	case err := <-ks.started:
		return err
	case <-ctx.Done():
		ks.startCancel()
		return errors.New("timed out waiting for cache to be synced")
	}
}
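
A minimal, runnable sketch (not controller-runtime code) of why a shutdown-time cancellation surfaces with the same "timed out" message - the select fires on ctx.Done() for both a real timeout and a cancellation of the parent context:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitForSync mirrors the select above: ctx.Done() fires for both a timeout
// and a cancel of the parent context, and both get the same error message.
func waitForSync(ctx context.Context, started <-chan error) error {
	select {
	case err := <-started:
		return err
	case <-ctx.Done():
		return errors.New("timed out waiting for cache to be synced")
	}
}

func main() {
	parent, cancel := context.WithCancel(context.Background())
	cancel() // simulate the manager engaging its stop sequence

	// The controller wraps the (already canceled) context with the 2m CacheSyncTimeout.
	syncCtx, cancelTimeout := context.WithTimeout(parent, 2*time.Minute)
	defer cancelTimeout()

	started := make(chan error) // the cache never finishes starting
	fmt.Println(waitForSync(syncCtx, started))              // "timed out waiting for cache to be synced"
	fmt.Println(errors.Is(syncCtx.Err(), context.Canceled)) // true: it was canceled, not timed out
}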

So the question is: what is causing your manager to terminate?

erikgb (Contributor, Author) commented Feb 9, 2022

So the question is: what is causing your manager to terminate?

@FillZpp That is quite easy to answer: there is a new ReplicaSet available, as part of a normal rolling update of the controller Deployment. As noted above, this happens when the non-leader is about to shut down. For some reason it seems to initiate a startup sequence as part of the shutdown, and that is doomed to fail.

If we compare the behavior on c-r 0.10.3 with c-r 0.11.0, it seems some mechanics have changed to cause this behavior. Could it be that the replica thinks it has become the new leader? I would prefer the new leader to be elected from the newest ReplicaSet, but that is probably not possible to control. 😉 At the very least, I would expect this shutdown to be more graceful than always ending with an (IMO misleading) error in the logs.

FillZpp (Contributor) commented Feb 10, 2022

So it happened when you were updating your controller deployment.

When the old leader terminated, another old replica became the leader and began to start its runnables. The Deployment then deleted that replica, so it was terminated while its cache was still starting.

So... I don't think there is anything wrong with the cache sync being canceled; it is the expected behavior IMHO. Maybe you are expecting the log message to be "cache starting has been canceled" instead of "timed out waiting"?

erikgb (Contributor, Author) commented Feb 10, 2022

So... I don't think there is anything wrong with the cache sync being canceled; it is the expected behavior IMHO. Maybe you are expecting the log message to be "cache starting has been canceled" instead of "timed out waiting"?

I agree that this is a probable cause, and to some extent the expected behavior. But I would not expect this to be logged as an error. Would it be possible to fix that? A cancellation from the control plane is not an error IMO.

FillZpp (Contributor) commented Feb 10, 2022

But I would not expect this to be logged as an error.

@alvaroaleman What do you think? Does it make sense to check whether the ctx is already done - i.e. it was canceled rather than timed out - before line 202?

// use a context with timeout for launching sources and syncing caches.
sourceStartCtx, cancel := context.WithTimeout(ctx, c.CacheSyncTimeout)
defer cancel()

// WaitForSync waits for a definitive timeout, and returns if there
// is an error or a timeout
if err := syncingSource.WaitForSync(sourceStartCtx); err != nil {
	err := fmt.Errorf("failed to wait for %s caches to sync: %w", c.Name, err)
	c.Log.Error(err, "Could not wait for Cache to sync")
	return err
}
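
Something along those lines might look like the rough sketch below, written against the snippet above and reusing its c, ctx, syncingSource and sourceStartCtx names; this is not necessarily the change that eventually landed in #1808:

if err := syncingSource.WaitForSync(sourceStartCtx); err != nil {
	err := fmt.Errorf("failed to wait for %s caches to sync: %w", c.Name, err)
	// If the outer ctx is already done, the manager is shutting down and the
	// "timeout" is really a cancellation - log it at a lower severity.
	if ctx.Err() != nil {
		c.Log.Info("Cache sync canceled because the manager is shutting down", "error", err)
		return err
	}
	c.Log.Error(err, "Could not wait for Cache to sync")
	return err
}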
