
[JENKINS-67664][JENKINS-59652] Add retry mechanism for container exec… #1212

Merged · 9 commits · Aug 8, 2022

Conversation

@Dohbedoh (Contributor) commented Jul 18, 2022

… websocket connections

Proposing a retry mechanism to mitigate sporadic issues as encountered in JENKINS-67664 and JENKINS-59652, since the mechanism of the decorator can be fragile depending on the environment. Also amends #1159.

If the websocket connection is unsuccessful for any reason, the launcher reattempts it after 3 seconds. After 3 failed attempts it gives up; a rough sketch of this loop follows the list below.

  • The retry wait time is configurable with org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionRetryWait. Defaults to 3 seconds.
  • The number of retries is configurable with org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionMaxRetries. Defaults to 3.
  • The number of retries is configurable with org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionMaxRetry. Defaults to 5.
  • The maximum backoff time between retries is configurable with org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionMaxRetryBackoff. Defaults to 30 seconds.
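For illustration, here is a minimal sketch of the retry loop described above. It is not the plugin's actual code: the class name and the openConnection callback are hypothetical, and only the system property names and defaults are taken from this description.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only; not the implementation merged in this PR.
class WebsocketRetrySketch {
    private static final String PREFIX =
            "org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.";
    // Wait between attempts and maximum number of attempts, as listed above.
    private static final long RETRY_WAIT_SECONDS =
            Long.getLong(PREFIX + "websocketConnectionRetryWait", 3L);
    private static final int MAX_RETRIES =
            Integer.getInteger(PREFIX + "websocketConnectionMaxRetries", 3);

    /** Runs the connection attempt, retrying a fixed number of times before giving up. */
    static <T> T withRetry(Callable<T> openConnection) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                return openConnection.call(); // e.g. open the container exec websocket
            } catch (Exception e) {
                last = e;
                if (attempt < MAX_RETRIES) {
                    TimeUnit.SECONDS.sleep(RETRY_WAIT_SECONDS); // wait before retrying
                }
            }
        }
        throw last; // all attempts failed
    }
}
```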

@Vlatombe (Member)

If the failure is due to too many concurrent requests against the k8s control plane, it would be best to use exponential backoff instead.

@Dohbedoh (Contributor, Author) commented Jul 18, 2022

Added exponential backoff using power of 2 with number of retries. With 5 max retries and 30 max backoff time in seconds by default. I need to test this in a test environment first though.
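A rough sketch of how such a power-of-two backoff could be computed, assuming the base wait is the configured retry wait and the result is capped at the configured maximum backoff; the exact formula used in the PR may differ.

```java
// Illustrative only. With a 3-second base wait, a 30-second cap and 5 retries,
// this yields waits of 3, 6, 12, 24 and 30 seconds before giving up.
static long backoffSeconds(long baseWaitSeconds, long maxBackoffSeconds, int retryNumber) {
    // retryNumber is 0-based: 0 before the first retry, 1 before the second, ...
    long delay = baseWaitSeconds * (1L << retryNumber);
    return Math.min(delay, maxBackoffSeconds);
}
```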

@jglick (Member) commented Jul 18, 2022

Rather than wasting time trying to fix the current impl of this decorator, which is known to be fundamentally not scalable due to use of the API server, why not actually fix it to stream commands and (non-durable-task) responses over the Remoting channel as has been long proposed?

@Dohbedoh (Contributor, Author)

@jglick it feels like a quick workaround that could have great impact and help users until we can design an alternative solution.

@jglick (Member) commented Jul 19, 2022

> a quick workaround

OK, fine if so. It looked like a big change from the diff, even with whitespace ignored, but maybe that is mostly hunks of code being rearranged in a way that diff cannot grok?

@Dohbedoh (Contributor, Author)

It mostly is. I put the snippet that handles the websocket connection (opening) into a function openExecInContainer.
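For context, a rough sketch of what such an extracted helper might look like using the fabric8 Kubernetes client DSL; the parameter list and body are illustrative, not the PR's actual method (which, per the later comment, was moved back inline).

```java
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.dsl.ExecWatch;
import java.io.OutputStream;

// Hypothetical shape of the extracted helper: it only opens the exec websocket,
// so the retry logic can wrap this single call.
static ExecWatch openExecInContainer(KubernetesClient client, String namespace, String podName,
                                     String containerName, OutputStream out, String... commands) {
    return client.pods()
            .inNamespace(namespace)
            .withName(podName)
            .inContainer(containerName)
            .redirectingInput()   // stdin is attached by the caller
            .writingOutput(out)   // container stdout
            .writingError(out)    // container stderr
            .exec(commands);
}
```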

@Dohbedoh (Contributor, Author)

I put the whole snippet back where it was, and also made some improvements. Hopefully that is simpler to review as is.

@SeriousMatt

@jglick can this review be expedited? We've been running out of options for using the k8s plugin reliably due to the 5000 millisecond error in Jenkins.

@jglick (Member) commented Jul 27, 2022

@SeriousMatt I am not a maintainer and am not reviewing.

There is a simple workaround: run single-container pods and do not use the container step.

@SeriousMatt

Thanks, we use the jnlp and a separate deploy container in a pod. Combining into a single container is unfortunately not a simple process for our setup, but we'll give it a try.

@brudnyhenry

Hi @jglick, this issue is forcing us to migrate a couple of projects from Jenkins to GitLab CI. Jenkins stability is unacceptable: jobs fail randomly due to this error, so please prioritize it if you can. I hope you were joking about the single-container approach: all major CI/CD tools support running jobs in multiple containers, which is really the benefit of having a dynamic agent in K8s. With this approach, we would soon recommend spinning up bare-metal agents instead.

@Dohbedoh (Contributor, Author)

The retry mechanism seems to have helped, according to this comment. @jenkinsci/kubernetes-plugin-developers any chance to get further review and help make some progress here?

@jglick (Member) commented Jul 29, 2022

> I hope you were joking about the single container approach

As I said, it is a workaround, since the design of the container step has long been known to be flawed. A single-container agent pod has better performance, with the obvious drawback that you are forced to prepare a container image with all required tooling as well as the Jenkins agent and its JRE.

> any chance to get further review

AFAIK the maintainer of this plugin is unavailable for a couple of weeks.

@jglick jglick requested a review from a team as a code owner July 29, 2022 14:52
@jglick (Member) commented Jul 29, 2022

BTW @Dohbedoh I merged in master to make sure this gets an incremental deployment (assuming tests pass). Please refer to that URL in the incrementals repo (you will get a check here) when asking users to test prerelease versions, rather than file attachments or downloads from ci.jenkins.io.

@amuniz (Member) left a comment

Looks good to me.

@twasyl (Contributor) left a comment

Good job

@mikecirioli (Contributor) left a comment

The retry mechanism looks solid to me, though I am only moderately familiar with this plugin overall.

@jmhardison

Adding a note: we have deployed the incremental build into our environment. We normally have somewhere close to 100 k8s agent jobs per hour that cycle through our env during business hours.

Probably too early for me to say unequivocally that it's helped, but we have changed our configuration back to a k8s idle timeout of 0 so agents are created and destroyed more often. Last time we had this configuration, it increased the observed rate of 5000ms errors.

So far, no occurrences of the error have been observed. We will report back in a few days with more info if this is still open.

@Vlatombe (Member) commented Aug 8, 2022

> There is a simple workaround: run single-container pods and do not use the container step.

To elaborate on this: you CAN use multi-container pods, for example in use cases where you want a service next to the agent (a database, a Docker server, etc.). However, the container step has design issues, as Jesse mentioned, so as long as you don't use it to execute any interactive command in a side container, you're safe from this particular issue.

> forcing us to migrate a couple of projects from Jenkins to Gitlab CI

As far as I know, GitLab doesn't have anything equivalent to the container step (even though it has support for services, which you can't interact with at shell level). So whatever you're doing is based on a misinterpretation or misuse of either the Jenkins Kubernetes plugin or GitLab.

@Vlatombe Vlatombe added the bug Bug Fixes label Aug 8, 2022
@Vlatombe Vlatombe merged commit 200a57a into jenkinsci:master Aug 8, 2022
@jglick (Member) commented Aug 8, 2022

> As far as I know, Gitlab doesn't have anything equivalent to the container step

@Dohbedoh noted https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/1775 which sounds potentially analogous.

@Dohbedoh Dohbedoh deleted the JENKINS-67664 branch August 31, 2022 02:02
@s7an-it commented Jan 14, 2023

What I don't understand is how, one month ago, I was able to run multiple containers with shell executions, yet now, no matter what I migrate to, I get the 500 internal server websocket error very frequently whenever I use anything outside of the inbound container; the retry on Kubernetes doesn't work and I can't influence the timeout or anything.
Something went wrong somewhere. Maybe there is a problem with EKS 1.22 specifically, but what is certain is that on EKS 1.20, with old jnlp containers and pre-2.36x Jenkins, it was a powerhouse in comparison to now, which is simply hyper-unstable and lacks multi-container exec.
