
k6-operator manager freezes and doesn't recover after reaching timeout in initialize phase #128

Closed
BarthV opened this issue Jul 9, 2022 · 10 comments
Labels: bug, FSM, Type: Improvement

BarthV commented Jul 9, 2022

Hi! It seems I've hit an annoying issue here...

  • I'm building k6-operator from master (commit c63b127)

How to reproduce:

  • create a simple K6 custom resource for a test job (any script, any duration; it doesn't have any impact; see the example manifest after this list)
  • apply it on a Kubernetes cluster running k6-operator
  • a myjob-initializer-12345 pod should be spawned immediately, starting to download the job worker container image
  • force the image pull to take very long (huge image, slow network, slow registry, ...) or force the controller timeout threshold to be very short
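
For reference, the K6 resource I'm applying looks roughly like this (a sketch only: the ConfigMap name my-test and the file name job.js are placeholders, and the exact spec fields may vary between operator versions):

```yaml
# Sketch of a simple K6 custom resource. Assumes a ConfigMap "my-test"
# holding the test script job.js; both names are hypothetical.
apiVersion: k6.io/v1alpha1
kind: K6
metadata:
  name: myjob
  namespace: k6-operator-system
spec:
  parallelism: 2
  script:
    configMap:
      name: my-test
      file: job.js
```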

Controller log (init, before timeout):

2022-07-09T08:30:53.878Z	INFO	controllers.K6	Reconcile(); stage = 	{"k6": "k6-operator-system/myjob"}
2022-07-09T08:30:53.879Z	INFO	controllers.K6	Initialize test	{"k6": "k6-operator-system/myjob"}
2022-07-09T08:30:53.879Z	INFO	controllers.K6	Changing stage of K6 status to initialization	{"k6": "k6-operator-system/myjob"}
2022-07-09T08:30:54.372Z	INFO	controllers.K6	Initializer job is ready to start with image `private.registry.foobar.local/public/load-tests-bundle` and command `[sh -c k6 archive /k6/job.js -O ./job.js.archived.tar --stage 60s:2 --stage 60s:4 --stage 120s:4 --stage 30s:0 && k6 inspect --execution-requirements ./job.js.archived.tar]`	{"k6": "k6-operator-system/myjob"}
2022-07-09T08:30:56.182Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "k6-operator-system/myjob"}
2022-07-09T08:31:01.184Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "k6-operator-system/myjob"}
2022-07-09T08:31:06.185Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "k6-operator-system/k6-test-
[ ... ]
2022-07-09T08:31:56.194Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "k6-operator-system/myjob"}
2022-07-09T08:31:56.276Z	ERROR	controllers.K6	Failed to initialize the script	{"k6": "k6-operator-system/myjob", "error": "timed out waiting for the condition"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
github.com/grafana/k6-operator/controllers.InitializeJobs
	/workspace/controllers/k6_initialize.go:152
github.com/grafana/k6-operator/controllers.(*K6Reconciler).Reconcile
	/workspace/controllers/k6_controller.go:63
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:235
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:209
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:188
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:90
2022-07-09T08:31:56.376Z	ERROR	controller	Reconciler error	{"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "myjob", "namespace": "k6-operator-system", "error": "timed out waiting for the condition"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:237
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:209
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:188
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:90

After hitting the timeout, the k6 controller retries reconciling the K6 resource several times... so the resource seems initialized, but nothing happens next (which is not the expected behaviour):

2022-07-09T08:31:57.382Z	INFO	controllers.K6	Reconcile(); stage = initialization	{"k6": "k6-operator-system/myjob"}
2022-07-09T08:31:57.382Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "myjob", "namespace": "k6-operator-system"}
2022-07-09T08:32:47.298Z	INFO	controllers.K6	Reconcile(); stage = initialization	{"k6": "k6-operator-system/myjob"}
2022-07-09T08:32:47.299Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "myjob", "namespace": "k6-operator-system"}
2022-07-09T08:32:49.295Z	INFO	controllers.K6	Reconcile(); stage = initialization	{"k6": "k6-operator-system/myjob"}
2022-07-09T08:32:49.371Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "myjob", "namespace": "k6-operator-system"}
2022-07-09T08:32:51.275Z	INFO	controllers.K6	Reconcile(); stage = initialization	{"k6": "k6-operator-system/myjob"}
2022-07-09T08:32:51.275Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "myjob", "namespace": "k6-operator-system"}

And that's all... no more logs. The job is stalled (the initializer log output seems correct, though).

> kubectl get pods -n k6-operator-system
NAME                                            READY   STATUS      RESTARTS   AGE
k6-operator-controller-manager-6889f4fc-jnrwc   1/1     Running     0          17m
myjob-initializer-zb89p                         0/1     Completed   0          15m

If you remove the K6 resource, it works. If you then re-add the same resource, everything works as expected, since the (big) image is now already loaded on the only worker node I'm working against.

2022-07-09T08:47:33.271Z	INFO	controllers.K6	Request deleted. Nothing to reconcile.	{"k6": "k6-operator-system/myjob"}

Analysis:
It seems the operator workflow is not really idempotent, so it isn't able to recover from a previous stage or step if the first reconciliation fails for any reason.

In addition, the state stored in the custom resource is not really helpful for understanding what's going on. Maybe consider using the official Kubernetes metav1.Condition to store states and transitions directly inside the live custom resource object.
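
For illustration, something like this could be done in the controller. This is a sketch only, not the current code: it assumes a hypothetical Status.Conditions field of type []metav1.Condition on the K6 resource, and meta.SetStatusCondition needs a newer apimachinery than the v0.18.6 pinned today:

```go
package controllers

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	"github.com/grafana/k6-operator/api/v1alpha1"
)

// markInitialized records the outcome of the initialize phase as a
// metav1.Condition instead of a bare stage string, so failed or stalled
// transitions stay visible on the live object.
// Status.Conditions ([]metav1.Condition) is a hypothetical field here.
func markInitialized(k6 *v1alpha1.K6, initErr error) {
	cond := metav1.Condition{
		Type:    "TestInitialized",
		Status:  metav1.ConditionTrue,
		Reason:  "InitializerFinished",
		Message: "initializer pod completed successfully",
	}
	if initErr != nil {
		cond.Status = metav1.ConditionFalse
		cond.Reason = "InitializerTimeout"
		cond.Message = initErr.Error()
	}
	// SetStatusCondition updates LastTransitionTime only when Status
	// actually changes, giving a usable transition history.
	meta.SetStatusCondition(&k6.Status.Conditions, cond)
}
```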

thanks 👍

yorugac (Collaborator) commented Jul 15, 2022

@BarthV, thanks for the detailed report 👍 I don't remember observing a scenario exactly like yours before, but there are definitely issues with the state right now. IME, the nastiest thing is that the state is not always updated even though the transition has de facto happened, but IIRC I've also seen a couple of cases of a "frozen" state. So your case sounds very similar. In the case of the initializer, there is indeed a timeout here:

err = wait.PollImmediate(time.Second*5, time.Second*60, func() (done bool, err error) {

but since the state hasn't changed, the function should repeat until it succeeds. The logic could probably be improved further, and maybe the timeout increased, but the fact that you don't see Waiting... and the other messages repeat in your logs is definitely weird.
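
Just to illustrate the direction: below is a minimal sketch (not the operator's actual code) of a non-blocking variant that inspects the initializer Job once per reconcile and requeues, instead of blocking inside PollImmediate. It assumes the Job is named <k6-name>-initializer and that K6Reconciler embeds client.Client, as in standard kubebuilder scaffolding:

```go
package controllers

import (
	"context"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	"github.com/grafana/k6-operator/api/v1alpha1"
)

// checkInitializer looks at the initializer Job exactly once and asks
// controller-runtime to requeue, instead of holding the reconcile
// worker inside wait.PollImmediate until a fixed deadline.
func (r *K6Reconciler) checkInitializer(ctx context.Context, k6 *v1alpha1.K6) (ctrl.Result, error) {
	var job batchv1.Job
	key := types.NamespacedName{Namespace: k6.Namespace, Name: k6.Name + "-initializer"}
	if err := r.Get(ctx, key, &job); err != nil {
		// Job not created yet (NotFound), or a transient API error:
		// either way, come back shortly rather than fail the stage.
		return ctrl.Result{RequeueAfter: 5 * time.Second}, client.IgnoreNotFound(err)
	}
	if job.Status.Succeeded < 1 {
		// Image still pulling or pod still running: requeue rather than
		// time out, so a slow pull can never "freeze" the stage.
		return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
	}
	// Initializer finished; safe to parse its output and move on.
	return ctrl.Result{}, nil
}
```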

One question from my side: what image size do you have here, and how long does it take to pull from scratch?

I think both the freezing and the lack of state transitions might be manifestations of the same problem. Currently I suspect there is some kind of hard-to-detect race in the state update cycle. IOW, improving state transitions is planned work for the operator 😂 Btw, thanks for the metav1.Condition suggestion: I'll look into it.

Let's keep this open as a bug for now: once there are improvements in how state is handled, it makes sense to re-check this scenario as well.

yorugac added the bug, evaluation needed, Type: Improvement, and FSM labels and removed the evaluation needed label on Jul 15, 2022
BarthV (Author) commented Jul 15, 2022

My k6 image is about ~450 MB, mainly because I'm embedding Chromium for browser automation using xk6-browser.
The real root cause on our side is our cloud provider's S3 having really poor performance over the last few weeks, which badly slows down our internal container registry.

cdanmar commented Jul 26, 2022

We too have noticed that lately the k6-operator fails to start the load test and only generates the starter. We found that restarting the operator pod fixes this, but it's starting to become a problem for our workloads. The logs are almost identical to the ones above.
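
i.e. something like this (deployment name taken from the pod listing above; adjust to your install):

> kubectl -n k6-operator-system rollout restart deployment k6-operator-controller-manager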

cdanmar commented Jul 26, 2022

It seems the last update completely broke our ability to run multiple load tests at once. Is there any way we can revert to an older release?

manager 2022-07-26T17:57:45.214Z    ERROR    controllers.K6    Failed to start all jobs    {"k6": "k6-workloads/jkidd-prod", "error": "timed out waiting for the condition"}
manager github.com/go-logr/zapr.(*zapLogger).Error

prashanth-volvocars commented

Hello @BarthV

Are you running the setup on EKS with Fargate or GKE Autopilot, by any chance?

BarthV (Author) commented Sep 9, 2022

> Are you running the setup on EKS with Fargate or GKE Autopilot, by any chance?

No, I'm running self-managed Kubernetes clusters backed by a French CSP's VM infrastructure.

bobcode99 commented Oct 23, 2022

> My k6 image is about ~450 MB, mainly because I'm embedding Chromium for browser automation using xk6-browser. The real root cause on our side is our cloud provider's S3 having really poor performance over the last few weeks, which badly slows down our internal container registry.

Hi @BarthV,
Are you using your own custom k6 image, with xk6-browser and Chromium inside, and then using that custom image as the k6-operator's runner?

I was trying to run an xk6-browser test script on k8s via the k6-operator, but it fails because the runner uses k6 instead of xk6-browser to run the script.

yann-soubeyrand commented

Hello,

Trying to use EKS Fargate, I get the same behaviour:

2022-12-05T18:25:06.909Z	INFO	controllers.K6	Reconcile(); stage = 	{"k6": "load-testing/load-test"}
2022-12-05T18:25:06.909Z	INFO	controllers.K6	Initialize test	{"k6": "load-testing/load-test"}
2022-12-05T18:25:06.909Z	INFO	controllers.K6	Changing stage of K6 status to initialization	{"k6": "load-testing/load-test"}
2022-12-05T18:25:06.920Z	INFO	controllers.K6	Initializer job is ready to start with image `ghcr.io/grafana/operator:latest-runner` and command `[sh -c k6 archive --log-output=none /test/load-test.js -O ./load-test.js.archived.tar  && k6 inspect --execution-requirements --log-output=none ./load-test.js.archived.tar]`	{"k6": "load-testing/load-test"}
2022-12-05T18:25:06.933Z	INFO	controllers.K6	No initializing pod found yet	{"k6": "load-testing/load-test"}
2022-12-05T18:25:11.933Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:25:16.933Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:25:21.934Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:25:26.933Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:25:31.934Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:25:36.934Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:25:41.934Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:25:46.934Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:25:51.933Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:25:56.933Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:26:01.934Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:26:06.934Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:26:06.934Z	INFO	controllers.K6	Waiting for initializing pod to finish	{"k6": "load-testing/load-test"}
2022-12-05T18:26:06.934Z	ERROR	controllers.K6	Failed to initialize the script	{"k6": "load-testing/load-test", "error": "timed out waiting for the condition"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
github.com/grafana/k6-operator/controllers.InitializeJobs
	/workspace/controllers/k6_initialize.go:153
github.com/grafana/k6-operator/controllers.(*K6Reconciler).Reconcile
	/workspace/controllers/k6_controller.go:63
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:235
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:209
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:188
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:90
2022-12-05T18:26:06.934Z	ERROR	controller	Reconciler error	{"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "load-test", "namespace": "load-testing", "error": "timed out waiting for the condition"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:237
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:209
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:188
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:90
2022-12-05T18:26:07.935Z	INFO	controllers.K6	Reconcile(); stage = initialization	{"k6": "load-testing/load-test"}
2022-12-05T18:26:07.935Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "load-test", "namespace": "load-testing"}
2022-12-05T18:26:11.059Z	INFO	controllers.K6	Reconcile(); stage = initialization	{"k6": "load-testing/load-test"}
2022-12-05T18:26:11.059Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "load-test", "namespace": "load-testing"}
2022-12-05T18:26:35.114Z	INFO	controllers.K6	Reconcile(); stage = initialization	{"k6": "load-testing/load-test"}
2022-12-05T18:26:35.114Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "load-test", "namespace": "load-testing"}
2022-12-05T18:26:37.103Z	INFO	controllers.K6	Reconcile(); stage = initialization	{"k6": "load-testing/load-test"}
2022-12-05T18:26:37.103Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "load-test", "namespace": "load-testing"}

rverma-dev commented

Same issue: while using Karpenter, the time to start the initializer pod is higher, so the timeout is hit first and no tests are submitted.

yorugac (Collaborator) commented Apr 12, 2023

I think this issue should be resolved since v0.0.9rc2: that's when the timeout via PollImmediate was removed.

@BarthV, could you confirm whether your case is resolved with the latest versions (v0.0.9rc2, v0.0.9rc3, or just latest)?

yorugac closed this as completed on Sep 26, 2023