[JENKINS-67664][JENKINS-59652] Add retry mechanism for container exec… #1212
Conversation
… websocket connections
If the failure is due to too many concurrent requests against the k8s control plane, it would be best to use exponential backoff instead.
Added exponential backoff using powers of 2 over the number of retries, with 5 max retries and a 30-second max backoff time by default. I need to test this in a test environment first, though.
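The backoff described in the comment above can be sketched as follows. This is a minimal illustration, not the plugin's actual code: the class and method names (`BackoffSketch`, `backoffFor`) are invented here; only the power-of-2 growth, the 5-retry default, and the 30-second cap come from the discussion.

```java
// Hypothetical sketch of the proposed exponential backoff: the wait doubles
// with each attempt (2^attempt) and is capped at a maximum backoff.
public class BackoffSketch {
    static final int MAX_RETRIES = 5;          // default max retries (from the PR)
    static final int MAX_BACKOFF_SECONDS = 30; // default cap in seconds (from the PR)

    // Wait time in seconds before retry attempt n (1-based): min(2^n, cap).
    static int backoffFor(int attempt) {
        long wait = 1L << attempt; // 2^attempt
        return (int) Math.min(wait, MAX_BACKOFF_SECONDS);
    }

    public static void main(String[] args) {
        // Prints the wait before each of the 5 attempts: 2, 4, 8, 16, 30 (capped).
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            System.out.println("attempt " + attempt + " -> wait " + backoffFor(attempt) + "s");
        }
    }
}
```

Note how the cap matters only on the last attempt here: 2^5 = 32 exceeds the 30-second maximum, so the final wait is clamped.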
Rather than wasting time trying to fix the current impl of this decorator, which is known to be fundamentally not scalable due to use of the API server, why not actually fix it to stream commands and (non-durable-task) responses over the Remoting channel as has been long proposed?
@jglick it feels like a quick workaround that could have a great impact and help users until we can design an alternative solution.
OK, fine if so. It looked like a big change from the diff, even with whitespace ignored, but maybe that is mostly hunks of code being rearranged in a way that diff cannot grok?
It mostly is. I put the snippet that handles the websocket connection (opening) in a function.
src/main/java/org/csanchez/jenkins/plugins/kubernetes/pipeline/ContainerExecDecorator.java
I put the whole snippet back where it was, and also made some improvements. Hopefully that is simpler to review as is.
@jglick can this review be expedited? We've been running out of options for using the k8s plugin reliably due to the 5000-millisecond error in Jenkins.
@SeriousMatt I am not a maintainer and am not reviewing. There is a simple workaround: run single-container pods and do not use the
Thanks, we use the jnlp and a separate deploy container in a pod. Combining into a single container is unfortunately not a simple process for our setup, but we'll give it a try.
Hi @jglick, this issue is forcing us to migrate a couple of projects from Jenkins to GitLab CI. Jenkins stability is unacceptable: jobs failing randomly due to this error. Please prioritize it if you can. I hope you were joking about the single-container approach; all major CI/CD tools support running jobs in multiple containers, which is really the benefit of having a dynamic agent in K8s. With that approach, we would soon recommend spinning up bare-metal agents instead.
The retry mechanism seems to have helped, according to this comment. @jenkinsci/kubernetes-plugin-developers any chance to get further review and help make some progress here?
As I said, it is a workaround, since the design of the
AFAIK the maintainer of this plugin is unavailable for a couple of weeks. |
BTW @Dohbedoh I merged in |
Looks good to me.
Good job
The retry mechanism looks solid to me; I am only moderately familiar with this plugin overall, though.
Adding a note: we have deployed the incremental build into our environment. We normally have close to 100 k8s agent jobs per hour cycling through our env during business hours. It is probably too early for me to say unequivocally that it has helped, but we have changed our configuration back to a k8s idle timeout of 0 so agents are created and destroyed more often. Last time we had this configuration, it increased the observed rate of the error. So far, no occurrences of the error have been observed. We will report back in a few days with more info if this is still open.
To elaborate on this: you CAN use multiple-container pods, such as in use cases where you want a service next to the agent pod (a database, a Docker server, etc.); however, the
As far as I know, GitLab doesn't have anything equivalent to the
@Dohbedoh noted https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/1775 which sounds potentially analogous.
What I don't understand is how, one month ago, I was able to run multiple containers with shell executions, and now, no matter what I migrate to, I simply get the 500 internal server websocket error very frequently if I use anything outside of the inbound. Retry on Kubernetes doesn't work, and I can't influence the timeout or anything.
Proposing a retry mechanism to mitigate sporadic issues as encountered in JENKINS-67664 and JENKINS-59652, as the mechanism of the decorator can be fragile depending on the environment. Also amends #1159.
If the websocket connection is unsuccessful for any reason, the launcher reattempts it after 3 seconds. After 3 failed attempts it gives up.
- The retry wait time is configurable with `org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionRetryWait`. Defaults to 3 seconds.
- The number of retries is configurable with `org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionMaxRetries`. Defaults to 3 times.
- The maximum number of retries is configurable with `org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionMaxRetry`. Defaults to 5 times.
- The maximum backoff time is configurable with `org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionMaxRetryBackoff`. Defaults to 30 seconds.
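The defaults above are exposed as JVM system properties. A minimal sketch of how such properties might be read, using the standard `Integer.getInteger(key, default)` lookup; the class and field names here are illustrative and not the plugin's actual code, only the property keys and default values come from the description:

```java
// Hedged sketch: reading the retry-related system properties described above.
// Integer.getInteger(key, def) reads a JVM system property (-Dkey=value)
// and falls back to the default when the property is unset or non-numeric.
public class ExecRetryConfig {
    private static final String PREFIX =
            "org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator";

    static final int RETRY_WAIT_SECONDS =
            Integer.getInteger(PREFIX + ".websocketConnectionRetryWait", 3);
    static final int MAX_RETRIES =
            Integer.getInteger(PREFIX + ".websocketConnectionMaxRetries", 3);
    static final int MAX_RETRY =
            Integer.getInteger(PREFIX + ".websocketConnectionMaxRetry", 5);
    static final int MAX_RETRY_BACKOFF_SECONDS =
            Integer.getInteger(PREFIX + ".websocketConnectionMaxRetryBackoff", 30);

    public static void main(String[] args) {
        // With no -D flags set, the defaults from the PR description apply.
        System.out.println(RETRY_WAIT_SECONDS + " " + MAX_RETRIES + " "
                + MAX_RETRY + " " + MAX_RETRY_BACKOFF_SECONDS);
    }
}
```

Operators would then override a value at Jenkins startup, e.g. `-Dorg.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionMaxRetryBackoff=60`.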