k6-operator manager freeze and don't recover after reaching timeout in initialize phase #128
@BarthV, thanks for the detailed report 👍 I don't remember observing a scenario exactly like yours before, but there are definitely issues with the state right now. IME, the nastiest thing is that the state is not always updated even though the transition has de-facto happened, but IIRC I saw a couple of cases of "frozen" state as well, so your case sounds very similar. In the case of the initializer, there is indeed a timeout here: k6-operator/controllers/k6_initialize.go, line 64 in c63b127
But since the state hasn't changed, the function should repeat until it succeeds. The logic could probably be improved further, and maybe the timeout increased, but the fact that you don't see Waiting... and the other messages repeat in your logs is definitely weird.
One question from my side: what image size do you have here, and how long does it take to load from scratch? I think both the freezing and the lack of state transitions might be manifestations of the same problem. Currently, I suspect there is some kind of hard-to-detect race in the state update cycle. IOW, improving state transitions is planned work for the operator 😂 Let's keep this open as a bug for now: once there are improvements in how state is handled, it makes sense to re-check this scenario as well.
My k6 image is about 450 MB, mainly because I'm embedding Chromium for browser automation using xk6-browser.
We too have noticed that, as of late, the k6-operator fails to start the load test and only generates the starter. We found that restarting the operator pod fixes this, but it's starting to become a problem for our workloads. Our logs are almost identical to the above.
It seems the last update completely broke our ability to run multiple load tests at once. Is there any way we can revert to an older release?
Hello @BarthV, are you running the setup on EKS with Fargate or GKE Autopilot by any chance?
No, I'm running self-managed Kubernetes clusters backed by a French CSP's VM infrastructure.
Hi @BarthV. I was trying to run
Hello, trying to use EKS Fargate, I get the same behaviour:
Same issue here: while using Karpenter, the time to start the initialiser pod is higher, so the timeout fires first and no tests are submitted.
Hi! It seems that I've hit an annoying issue here...
How to reproduce:
The myjob-initializer-12345 pod should be spawned immediately and start downloading the job worker container image. Controller log, "init & before timeout":
After hitting the timeout, the k6 controller retries reconciling the K6 resource several times, so the resource appears initialized, but nothing happens next (which is not the expected behaviour):
And that's all, no more logs. The job is stalling (though the initializer log output seems correct).
If you remove the K6 resource, it works. If you then add the same resource back, everything works as expected, since the (big) image is now already loaded on the only worker node I'm working against.
Analysis:
It seems that the operator workflow is not really idempotent, so it is not able to recover from any previous stage or step if the first reconciliation fails for any reason.
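A common way to make a controller recoverable is to derive the next step from what actually exists in the cluster on every pass, instead of trusting a stored stage that may be stale. The sketch below is a hypothetical illustration of that pattern; the `observed` fields and action names are made up, not the operator's real logic.

```go
package main

import "fmt"

// observed holds cluster facts the reconciler can re-read on every pass.
// Field names are illustrative; a real operator would inspect Jobs/Pods.
type observed struct {
	initializerDone bool
	runnersCreated  bool
	starterCreated  bool
}

// nextAction derives the next step purely from observed cluster state, so
// a failed or restarted pass can always resume from the right place.
func nextAction(o observed) string {
	switch {
	case !o.initializerDone:
		return "wait-for-initializer"
	case !o.runnersCreated:
		return "create-runners"
	case !o.starterCreated:
		return "create-starter"
	default:
		return "done"
	}
}

func main() {
	// A restarted controller simply re-observes and resumes mid-flow.
	fmt.Println(nextAction(observed{initializerDone: true})) // create-runners
}
```

With this shape, a timeout or controller restart costs nothing: the next reconcile pass recomputes where it left off.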
In addition, the state stored in the custom resource is not really helpful for understanding what's going on. Maybe consider using the official Kubernetes
metav1.Condition
to store states and transitions directly inside the live custom resource object. Thanks 👍
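For reference, conditions carry a status, reason, message, and last transition time per condition type. The sketch below defines a local struct mirroring the shape of metav1.Condition so it runs without cluster dependencies; the condition types and reasons are invented for illustration, and the upsert semantics follow apimachinery's meta.SetStatusCondition (LastTransitionTime changes only when Status changes).

```go
package main

import (
	"fmt"
	"time"
)

// Condition mirrors the shape of metav1.Condition from
// k8s.io/apimachinery; defined locally so this sketch is self-contained.
type Condition struct {
	Type               string
	Status             string // "True", "False", "Unknown"
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// setCondition upserts a condition by Type, refreshing LastTransitionTime
// only when Status actually changes.
func setCondition(conds []Condition, c Condition) []Condition {
	for i, existing := range conds {
		if existing.Type == c.Type {
			if existing.Status != c.Status {
				c.LastTransitionTime = time.Now()
			} else {
				c.LastTransitionTime = existing.LastTransitionTime
			}
			conds[i] = c
			return conds
		}
	}
	c.LastTransitionTime = time.Now()
	return append(conds, c)
}

func main() {
	var status []Condition
	status = setCondition(status, Condition{
		Type: "TestRunInitialized", Status: "False",
		Reason: "WaitingForInitializer", Message: "initializer pod pulling image",
	})
	status = setCondition(status, Condition{
		Type: "TestRunInitialized", Status: "True",
		Reason: "InitializerSucceeded", Message: "initializer completed",
	})
	for _, c := range status {
		fmt.Printf("%s=%s (%s)\n", c.Type, c.Status, c.Reason)
	}
	// prints: TestRunInitialized=True (InitializerSucceeded)
}
```

Storing state this way would make a stalled resource self-describing: `kubectl describe` would show which condition is stuck, with a reason and timestamp, instead of an opaque stage field.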