
k6-operator manager freezes and doesn't recover after reaching timeout in initialize phase #128

Closed
BarthV opened this issue Jul 9, 2022 · 10 comments
Labels: bug, FSM, Type: Improvement

BarthV commented Jul 9, 2022

Hi! It seems I've hit an annoying issue here...

  • I'm building k6-operator from master (commit c63b127)

How to reproduce:

  • create a simple K6 custom resource for a test job (any script, any duration; it doesn't have any impact; see the example manifest after this list)
  • apply it on a Kubernetes cluster running k6-operator
  • a myjob-initializer-12345 pod should be spawned immediately, starting to download the job worker container image
  • force the image pull to take very long (huge image, slow network, slow registry, ...) or force the controller timeout threshold to be very short
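
For reference, the K6 resource I'm applying looks roughly like this (a sketch only: the ConfigMap name my-test and the file name job.js are placeholders, and the exact spec fields may vary between operator versions):

```yaml
# Sketch of a simple K6 custom resource. Assumes a ConfigMap "my-test"
# holding the test script job.js; both names are hypothetical.
apiVersion: k6.io/v1alpha1
kind: K6
metadata:
  name: myjob
  namespace: k6-operator-system
spec:
  parallelism: 2
  script:
    configMap:
      name: my-test
      file: job.js
```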

Controller log (init, before timeout):

2022-07-09T08:30:53.878Z	INFO	controllers.K6	Reconcile(); stage = 	{"k6": "k6-operator-system/myjob"}
2022-07-09T08:30:53.879Z	INFO	controllers.K6	Initialize test	{"k6": "k6-operator-system/myjob"}
2022-07-09T08:30:53.879Z	INFO	controllers.K6	Changing stage of K6 status to initialization	{"k6": "k6-operator-system/myjob"}
2022-07-09T08:30:54.372Z	INFO	controllers.K6	Initializer job is ready to start with image `private.registry.foobar.local/public/load-tests-bundle` and command `[sh -c k6 archive /k6/job.js -O ./job.js.archived.tar --stage 60s:2 --stage 60s:4 --stage 120s:4 --stage 30s:0 && k6 inspect --execution-requirements ./job.js.archived.tar]`	{"k6": "k6-operator-system/myjob"}
2022-07-09T08:30:56.182Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "k6-operator-system/myjob"}
2022-07-09T08:31:01.184Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "k6-operator-system/myjob"}
2022-07-09T08:31:06.185Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "k6-operator-system/k6-test-
[ ... ]
2022-07-09T08:31:56.194Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "k6-operator-system/myjob"}
2022-07-09T08:31:56.276Z	ERROR	controllers.K6	Failed to initialize the script	{"k6": "k6-operator-system/myjob", "error": "timed out waiting for the condition"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
github.com/grafana/k6-operator/controllers.InitializeJobs
	/workspace/controllers/k6_initialize.go:152
github.com/grafana/k6-operator/controllers.(*K6Reconciler).Reconcile
	/workspace/controllers/k6_controller.go:63
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:235
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:209
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:188
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:90
2022-07-09T08:31:56.376Z	ERROR	controller	Reconciler error	{"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "myjob", "namespace": "k6-operator-system", "error": "timed out waiting for the condition"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:237
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:209
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:188
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:90

After hitting the timeout, the k6 controller retries reconciling the K6 resource several times... so the resource seems initialized, but nothing happens next (which is not the expected behaviour):

2022-07-09T08:31:57.382Z	INFO	controllers.K6	Reconcile(); stage = initialization	{"k6": "k6-operator-system/myjob"}
2022-07-09T08:31:57.382Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "myjob", "namespace": "k6-operator-system"}
2022-07-09T08:32:47.298Z	INFO	controllers.K6	Reconcile(); stage = initialization	{"k6": "k6-operator-system/myjob"}
2022-07-09T08:32:47.299Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "myjob", "namespace": "k6-operator-system"}
2022-07-09T08:32:49.295Z	INFO	controllers.K6	Reconcile(); stage = initialization	{"k6": "k6-operator-system/myjob"}
2022-07-09T08:32:49.371Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "myjob", "namespace": "k6-operator-system"}
2022-07-09T08:32:51.275Z	INFO	controllers.K6	Reconcile(); stage = initialization	{"k6": "k6-operator-system/myjob"}
2022-07-09T08:32:51.275Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "myjob", "namespace": "k6-operator-system"}

And that's all... no more logs. The job is stalled (the initializer log output seems correct, though).

> kubectl get pods -n k6-operator-system
NAME                                            READY   STATUS      RESTARTS   AGE
k6-operator-controller-manager-6889f4fc-jnrwc   1/1     Running     0          17m
myjob-initializer-zb89p                         0/1     Completed   0          15m

If you remove the K6 resource, it works. If you then re-add the same resource, everything works as expected, since the (big) image is now already loaded on the only worker node I'm working against.

2022-07-09T08:47:33.271Z	INFO	controllers.K6	Request deleted. Nothing to reconcile.	{"k6": "k6-operator-system/myjob"}

Analysis:
It seems the operator workflow is not really idempotent, so it isn't able to recover from a previous stage or step if the first reconciliation fails for any reason.

In addition, the state stored in the custom resource is not really helpful for understanding what's going on. Maybe consider using the official Kubernetes metav1.Condition to store states and transitions directly inside the live custom resource object.
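
For illustration, something like this could be done in the controller. This is a sketch only, not the current code: it assumes a hypothetical Status.Conditions field of type []metav1.Condition on the K6 resource, and meta.SetStatusCondition needs a newer apimachinery than the v0.18.6 pinned today:

```go
package controllers

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	"github.com/grafana/k6-operator/api/v1alpha1"
)

// markInitialized records the outcome of the initialize phase as a
// metav1.Condition instead of a bare stage string, so failed or stalled
// transitions stay visible on the live object.
// Status.Conditions ([]metav1.Condition) is a hypothetical field here.
func markInitialized(k6 *v1alpha1.K6, initErr error) {
	cond := metav1.Condition{
		Type:    "TestInitialized",
		Status:  metav1.ConditionTrue,
		Reason:  "InitializerFinished",
		Message: "initializer pod completed successfully",
	}
	if initErr != nil {
		cond.Status = metav1.ConditionFalse
		cond.Reason = "InitializerTimeout"
		cond.Message = initErr.Error()
	}
	// SetStatusCondition updates LastTransitionTime only when Status
	// actually changes, giving a usable transition history.
	meta.SetStatusCondition(&k6.Status.Conditions, cond)
}
```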

thanks 👍

yorugac (Collaborator) commented Jul 15, 2022

@BarthV, thanks for the detailed report 👍 I don't remember observing a scenario exactly like yours before, but there are definitely issues with the state right now. IME, the nastiest thing is that the state is not always updated even though the transition has de facto happened, but IIRC I've also seen a couple of cases of a "frozen" state. So your case sounds very similar. In the case of the initializer, there is indeed a timeout here:

err = wait.PollImmediate(time.Second*5, time.Second*60, func() (done bool, err error) {

but since the state hasn't changed, the function should repeat until it succeeds. The logic could probably be improved further, and maybe the timeout increased, but the fact that you don't see Waiting... and the other messages repeat in your logs is definitely weird.
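
Just to illustrate the direction: below is a minimal sketch (not the operator's actual code) of a non-blocking variant that inspects the initializer Job once per reconcile and requeues, instead of blocking inside PollImmediate. It assumes the Job is named <k6-name>-initializer and that K6Reconciler embeds client.Client, as in standard kubebuilder scaffolding:

```go
package controllers

import (
	"context"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	"github.com/grafana/k6-operator/api/v1alpha1"
)

// checkInitializer looks at the initializer Job exactly once and asks
// controller-runtime to requeue, instead of holding the reconcile
// worker inside wait.PollImmediate until a fixed deadline.
func (r *K6Reconciler) checkInitializer(ctx context.Context, k6 *v1alpha1.K6) (ctrl.Result, error) {
	var job batchv1.Job
	key := types.NamespacedName{Namespace: k6.Namespace, Name: k6.Name + "-initializer"}
	if err := r.Get(ctx, key, &job); err != nil {
		// Job not created yet (NotFound), or a transient API error:
		// either way, come back shortly rather than fail the stage.
		return ctrl.Result{RequeueAfter: 5 * time.Second}, client.IgnoreNotFound(err)
	}
	if job.Status.Succeeded < 1 {
		// Image still pulling or pod still running: requeue rather than
		// time out, so a slow pull can never "freeze" the stage.
		return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
	}
	// Initializer finished; safe to parse its output and move on.
	return ctrl.Result{}, nil
}
```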

One question from my side: what image size do you have here, and how long does it take to pull from scratch?

I think both the freezing and the lack of state transitions might be manifestations of the same problem. Currently I suspect there is some kind of hard-to-detect race in the state update cycle. IOW, improving state transitions is planned work for the operator 😂 Btw, thanks for the metav1.Condition suggestion: I'll look into it.

Let's keep this open as a bug for now: once there are improvements in how state is handled, it makes sense to re-check this scenario as well.

yorugac added the bug, evaluation needed, Type: Improvement, and FSM labels and removed the evaluation needed label on Jul 15, 2022
BarthV (Author) commented Jul 15, 2022

My k6 image is about ~450 MB, mainly because I'm embedding Chromium for browser automation using xk6-browser.
The real root cause on our side is our cloud provider's S3 having really poor performance over the last few weeks, which badly slows down our internal container registry.

cdanmar commented Jul 26, 2022

We too have noticed that lately the k6-operator fails to start the load test and only generates the starter. We found that restarting the operator pod fixes this, but it's starting to become a problem for our workloads. The logs are almost identical to the ones above.
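
i.e. something like this (deployment name taken from the pod listing above; adjust to your install):

> kubectl -n k6-operator-system rollout restart deployment k6-operator-controller-manager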

cdanmar commented Jul 26, 2022

It seems the last update completely broke our ability to run multiple load tests at once. Is there any way we can revert to an older release?

manager 2022-07-26T17:57:45.214Z    ERROR    controllers.K6    Failed to start all jobs    {"k6": "k6-workloads/jkidd-prod", "error": "timed out waiting for the condition"}
manager github.com/go-logr/zapr.(*zapLogger).Error

prashanth-volvocars commented

Hello @BarthV

Are you running the setup on EKS with Fargate or GKE Autopilot, by any chance?

BarthV (Author) commented Sep 9, 2022

> Are you running the setup on EKS with Fargate or GKE Autopilot, by any chance?

No, I'm running self-managed Kubernetes clusters backed by a French CSP's VM infrastructure.

bobcode99 commented Oct 23, 2022

> My k6 image is about ~450 MB, mainly because I'm embedding Chromium for browser automation using xk6-browser. The real root cause on our side is our cloud provider's S3 having really poor performance over the last few weeks, which badly slows down our internal container registry.

Hi @BarthV,
Are you using your own custom k6 image, with xk6-browser and Chromium inside, and then using that custom image as the k6-operator's runner?

I was trying to run an xk6-browser test script on k8s via the k6-operator, but it fails because the runner uses k6 instead of xk6-browser to run the script.

yann-soubeyrand commented

Hello,

Trying to use EKS Fargate, I get the same behaviour:

2022-12-05T18:25:06.909Z	INFO	controllers.K6	Reconcile(); stage = 	{"k6": "load-testing/load-test"}
2022-12-05T18:25:06.909Z	INFO	controllers.K6	Initialize test	{"k6": "load-testing/load-test"}
2022-12-05T18:25:06.909Z	INFO	controllers.K6	Changing stage of K6 status to initialization	{"k6": "load-testing/load-test"}
2022-12-05T18:25:06.920Z	INFO	controllers.K6	Initializer job is ready to start with image `ghcr.io/grafana/operator:latest-runner` and command `[sh -c k6 archive --log-output=none /test/load-test.js -O ./load-test.js.archived.tar  && k6 inspect --execution-requirements --log-output=none ./load-test.js.archived.tar]`	{"k6": "load-testing/load-test"}
2022-12-05T18:25:06.933Z	INFO	controllers.K6	No initializing pod found yet	{"k6": "load-testing/load-test"}
2022-12-05T18:25:11.933Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:25:16.933Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:25:21.934Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:25:26.933Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:25:31.934Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:25:36.934Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:25:41.934Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:25:46.934Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:25:51.933Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:25:56.933Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:26:01.934Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:26:06.934Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:26:06.934Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:26:06.934Z	ERROR	controllers.K6	Failed to initialize the script	{"k6": "load-testing/load-test", "error": "timed out waiting for the condition"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
github.com/grafana/k6-operator/controllers.InitializeJobs
	/workspace/controllers/k6_initialize.go:153
github.com/grafana/k6-operator/controllers.(*K6Reconciler).Reconcile
	/workspace/controllers/k6_controller.go:63
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:235
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:209
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:188
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:90
2022-12-05T18:26:06.934Z	ERROR	controller	Reconciler error	{"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "load-test", "namespace": "load-testing", "error": "timed out waiting for the condition"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:237
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:209
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:188
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:90
2022-12-05T18:26:07.935Z	INFO	controllers.K6	Reconcile(); stage = initialization	{"k6": "load-testing/load-test"}
2022-12-05T18:26:07.935Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "load-test", "namespace": "load-testing"}
2022-12-05T18:26:11.059Z	INFO	controllers.K6	Reconcile(); stage = initialization	{"k6": "load-testing/load-test"}
2022-12-05T18:26:11.059Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "load-test", "namespace": "load-testing"}
2022-12-05T18:26:35.114Z	INFO	controllers.K6	Reconcile(); stage = initialization	{"k6": "load-testing/load-test"}
2022-12-05T18:26:35.114Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "load-test", "namespace": "load-testing"}
2022-12-05T18:26:37.103Z	INFO	controllers.K6	Reconcile(); stage = initialization	{"k6": "load-testing/load-test"}
2022-12-05T18:26:37.103Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "load-test", "namespace": "load-testing"}

rverma-dev commented

Same issue: while using Karpenter, the time to start the initializer pod is higher, so the timeout is hit first and no tests are submitted.

yorugac (Collaborator) commented Apr 12, 2023

I think this issue should be resolved since v0.0.9rc2: that's when the timeout via PollImmediate was removed.

@BarthV, could you confirm whether your case is resolved with the latest versions (v0.0.9rc2, v0.0.9rc3, or just latest)?

yorugac closed this as completed on Sep 26, 2023