
[JENKINS-67664][JENKINS-59652] Add retry mechanism for container exec… #1212

Merged · 9 commits · Aug 8, 2022

Conversation

@Dohbedoh (Contributor) commented Jul 18, 2022

… websocket connections

Proposing a retry mechanism to mitigate sporadic issues as encountered in JENKINS-67664 and JENKINS-59652, since the mechanism of the decorator can be fragile depending on the environment. Also amends #1159.

If the websocket connection is unsuccessful for any reason, the launcher reattempts it after 3 seconds. After 3 failed attempts it gives up; a rough sketch of this loop follows the list below.

  • The retry wait time is configurable with org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionRetryWait. Defaults to 3 seconds.
  • The number of retries is configurable with org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionMaxRetries. Defaults to 3.
  • The number of retries is configurable with org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionMaxRetry. Defaults to 5.
  • The maximum backoff time between retries is configurable with org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionMaxRetryBackoff. Defaults to 30 seconds.
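For illustration, here is a minimal sketch of the retry loop described above. It is not the plugin's actual code: the class name and the openConnection callback are hypothetical, and only the system property names and defaults are taken from this description.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only; not the implementation merged in this PR.
class WebsocketRetrySketch {
    private static final String PREFIX =
            "org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.";
    // Wait between attempts and maximum number of attempts, as listed above.
    private static final long RETRY_WAIT_SECONDS =
            Long.getLong(PREFIX + "websocketConnectionRetryWait", 3L);
    private static final int MAX_RETRIES =
            Integer.getInteger(PREFIX + "websocketConnectionMaxRetries", 3);

    /** Runs the connection attempt, retrying a fixed number of times before giving up. */
    static <T> T withRetry(Callable<T> openConnection) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                return openConnection.call(); // e.g. open the container exec websocket
            } catch (Exception e) {
                last = e;
                if (attempt < MAX_RETRIES) {
                    TimeUnit.SECONDS.sleep(RETRY_WAIT_SECONDS); // wait before retrying
                }
            }
        }
        throw last; // all attempts failed
    }
}
```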

@Vlatombe (Member)

If the failure is due to too many concurrent requests against the k8s control plane, it would be best to use exponential backoff instead.

@Dohbedoh (Contributor, Author) commented Jul 18, 2022

Added exponential backoff using power of 2 with number of retries. With 5 max retries and 30 max backoff time in seconds by default. I need to test this in a test environment first though.
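A rough sketch of how such a power-of-two backoff could be computed, assuming the base wait is the configured retry wait and the result is capped at the configured maximum backoff; the exact formula used in the PR may differ.

```java
// Illustrative only. With a 3-second base wait, a 30-second cap and 5 retries,
// this yields waits of 3, 6, 12, 24 and 30 seconds before giving up.
static long backoffSeconds(long baseWaitSeconds, long maxBackoffSeconds, int retryNumber) {
    // retryNumber is 0-based: 0 before the first retry, 1 before the second, ...
    long delay = baseWaitSeconds * (1L << retryNumber);
    return Math.min(delay, maxBackoffSeconds);
}
```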

@jglick (Member) commented Jul 18, 2022

Rather than wasting time trying to fix the current impl of this decorator, which is known to be fundamentally not scalable due to use of the API server, why not actually fix it to stream commands and (non-durable-task) responses over the Remoting channel as has been long proposed?

@Dohbedoh (Contributor, Author)

@jglick it feels like a quick workaround that could have great impact and help users until we can design an alternative solution.

@jglick (Member) commented Jul 19, 2022

> a quick workaround

OK, fine if so. It looked like a big change from the diff, even with whitespace ignored, but maybe that is mostly hunks of code being rearranged in a way that diff cannot grok?

@Dohbedoh (Contributor, Author)

It mostly is. I put the snippet that handles the websocket connection (opening) into a function openExecInContainer.
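For context, a rough sketch of what such an extracted helper might look like using the fabric8 Kubernetes client DSL; the parameter list and body are illustrative, not the PR's actual method (which, per the later comment, was moved back inline).

```java
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.dsl.ExecWatch;
import java.io.OutputStream;

// Hypothetical shape of the extracted helper: it only opens the exec websocket,
// so the retry logic can wrap this single call.
static ExecWatch openExecInContainer(KubernetesClient client, String namespace, String podName,
                                     String containerName, OutputStream out, String... commands) {
    return client.pods()
            .inNamespace(namespace)
            .withName(podName)
            .inContainer(containerName)
            .redirectingInput()   // stdin is attached by the caller
            .writingOutput(out)   // container stdout
            .writingError(out)    // container stderr
            .exec(commands);
}
```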

@Dohbedoh (Contributor, Author)

I put the whole snippet back where it was, and also made some improvements. Hopefully that is simpler to review as is.

@SeriousMatt

@jglick can this review be expedited? We've been running out of options for using the k8s plugin reliably due to the 5000 millisecond error in Jenkins.

@jglick (Member) commented Jul 27, 2022

@SeriousMatt I am not a maintainer and am not reviewing.

There is a simple workaround: run single-container pods and do not use the container step.

@SeriousMatt

Thanks, we use the jnlp and a separate deploy container in a pod. Combining into a single container is unfortunately not a simple process for our setup, but we'll give it a try.

@brudnyhenry

Hi @jglick, this issue is forcing us to migrate a couple of projects from Jenkins to GitLab CI. Jenkins stability is unacceptable: jobs fail randomly due to this error, so please prioritize it if you can. I hope you were joking about the single-container approach: all major CI/CD tools support running jobs in multiple containers, which is really the benefit of having a dynamic agent in K8s. With this approach, we would soon recommend spinning up bare-metal agents instead.

@Dohbedoh (Contributor, Author)

The retry mechanism seems to have helped, according to this comment. @jenkinsci/kubernetes-plugin-developers any chance to get further review and help make some progress here?

@jglick (Member) commented Jul 29, 2022

> I hope you were joking about the single container approach

As I said, it is a workaround, since the design of the container step has long been known to be flawed. A single-container agent pod has better performance, with the obvious drawback that you are forced to prepare a container image with all required tooling as well as the Jenkins agent and its JRE.

> any chance to get further review

AFAIK the maintainer of this plugin is unavailable for a couple of weeks.

@jglick jglick requested a review from a team as a code owner July 29, 2022 14:52
@jglick (Member) commented Jul 29, 2022

BTW @Dohbedoh I merged in master to make sure this gets an incremental deployment (assuming tests pass). Please refer to that URL in the incrementals repo (you will get a check here) when asking users to test prerelease versions, rather than file attachments or downloads from ci.jenkins.io.

@amuniz (Member) left a comment

Looks good to me.

@twasyl (Contributor) left a comment

Good job

@mikecirioli (Contributor) left a comment

The retry mechanism looks solid to me, though I am only moderately familiar with this plugin overall.

@jmhardison

Adding a note: we have deployed the incremental build into our environment. We normally have somewhere close to 100 k8s agent jobs per hour that cycle through our env during business hours.

Probably too early for me to say unequivocally that it's helped, but we have changed our configuration back to a k8s idle timeout of 0 so agents are created and destroyed more often. Last time we had this configuration, it increased the observed rate of 5000ms errors.

So far, no occurrences of the error have been observed. We will report back in a few days with more info if this is still open.

@Vlatombe (Member) commented Aug 8, 2022

> There is a simple workaround: run single-container pods and do not use the container step.

To elaborate on this: you CAN use multi-container pods, for example in use cases where you want a service next to the agent (a database, a Docker server, etc.). However, the container step has design issues, as Jesse mentioned, so as long as you don't use it to execute any interactive command in a side container, you're safe from this particular issue.

> forcing us to migrate a couple of projects from Jenkins to Gitlab CI

As far as I know, GitLab doesn't have anything equivalent to the container step (even though it has support for services, which you can't interact with at shell level). So whatever you're doing is based on a misinterpretation or misuse of either the Jenkins Kubernetes plugin or GitLab.

@Vlatombe Vlatombe added the bug Bug Fixes label Aug 8, 2022
@Vlatombe Vlatombe merged commit 200a57a into jenkinsci:master Aug 8, 2022
@jglick (Member) commented Aug 8, 2022

> As far as I know, Gitlab doesn't have anything equivalent to the container step

@Dohbedoh noted https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/1775 which sounds potentially analogous.

@Dohbedoh Dohbedoh deleted the JENKINS-67664 branch August 31, 2022 02:02
@s7an-it commented Jan 14, 2023

What I don't understand is how, one month ago, I was able to run multiple containers with shell executions, yet now, no matter what I migrate to, I get the 500 internal server websocket error very frequently whenever I use anything outside of the inbound container; the retry on Kubernetes doesn't work and I can't influence the timeout or anything.
Something went wrong somewhere. Maybe there is a problem with EKS 1.22 specifically, but what is certain is that on EKS 1.20, with old jnlp containers and pre-2.36x Jenkins, it was a powerhouse in comparison to now, which is simply hyper-unstable and lacks multi-container exec.
