Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce Global Time Out #81

Merged
merged 3 commits into from
Mar 21, 2022
Merged

Introduce Global Time Out #81

merged 3 commits into from
Mar 21, 2022

Conversation

sghill
Copy link
Contributor

@sghill sghill commented Aug 16, 2020

This change introduces a global build time out that re-uses the existing strategies and operations. The current functionality requires users to opt-in and update the configuration of each build.

Goal

I'd like to use a global timeout to ensure builds don't run longer than the set time. Occasionally builds hang, and in large deployments it's easy for this to go unnoticed. A hanging build that goes unnoticed occupies an executor indefinitely, reducing the capacity of the system and potentially allowing the queue for a particular job to build up.

Implementation

On each build:

  1. Find a configured global time out
  2. Schedule the configured time out operations on a dedicated ScheduledExecutorService with one thread
  3. Once complete, cancel the task

Implemented with a hudson.model.listeners.RunListener so it can be applied to every build. The existing implementation uses a hudson.tasks.BuildWrapper, which requires a checkbox on each build.

Screenshots

The configuration screen looks like this:
system-configuration

When a build fails due to a global timeout, it is noted in the build's log:
build-log

Extensive logging is available on the FINE level:
logs

@sghill
Copy link
Contributor Author

sghill commented Aug 16, 2020

@oleg-nenashev thanks for the encouragement of a global build timeout. I'd appreciate your thoughts on this implementation if you get a chance.

@oleg-nenashev oleg-nenashev self-requested a review August 17, 2020 07:02
@krisstern
Copy link
Member

Hi @sghill Thanks for this PR! However, could you please fix the sole failing test for the Windows build before I can do a thorough review of it? Please note also that some PR's have recently been merged so please be careful to rebase before pushing again if you have not already done so.

@sghill
Copy link
Contributor Author

sghill commented Feb 1, 2022

Hi @krisstern, thanks for the ping. I rebased the PR and the tests are all passing.

@krisstern krisstern self-assigned this Feb 1, 2022
@krisstern
Copy link
Member

Thanks @sghill for the rebase! Let me start the reviewing process soon.

@krisstern
Copy link
Member

HI @sghill My apologies for having to ask you to rebase one more time, as I have just merged a PR to switch Java support from version 8 to 11. Without this change I cannot check out then try out your new feature, as I was experiencing some setup issues previously. Many thanks for this!

@sghill
Copy link
Contributor Author

sghill commented Feb 18, 2022

Hi @krisstern, I've rebased and updated to Mockito 4 but I'm getting some odd errors from the windows build. Is it possible there is an issue with the Java install on this agent?

[2022-02-18T18:54:49.915Z] [ERROR] C:\Jenkins\workspace\ugins_build-timeout-plugin_PR-81\src\main\java\hudson\plugins\build_timeout\global\Lifecycle.java:31: Undefined reference: java.util.function.Supplier

Separately, I wanted to share that I'd be happy with other implementations as well. I tried to make this PR testable and fit well within the plugin by supporting existing strategies, but I only plan to use a fixed 24 hour timeout on our Jenkins controllers. If there is a simpler way to achieve that, I'm all for it.

@krisstern
Copy link
Member

Hi @sghill I think it may have to do with the signature currently set for the animal-sniffer-maven-plugin to be 17, as there are currently no available signature for Java 11:

<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>animal-sniffer-maven-plugin</artifactId>
<version>1.21</version>
<configuration>
<signature>
<groupId>org.codehaus.mojo.signature</groupId>
<artifactId>java17</artifactId>
<version>1.0</version>
</signature>
</configuration>
<executions>
<execution>
<goals>
<goal>check</goal>
</goals>
</execution>
</executions>
</plugin>

So it may have been introduced by me inadvertently while trying for a quick fix that worked previously. So I would recommend you do try and tweak the POM a bit according to mojohaus/animal-sniffer#62 (comment). I think the error we are encountering is similar to the one for Java 8 discussed on Stack Overflow at https://stackoverflow.com/q/45997103/9959070.

@sghill
Copy link
Contributor Author

sghill commented Feb 26, 2022

@krisstern The build is green.

My read of this comment is the animal-sniffer-plugin can be removed because the maven-compiler-plugin now covers the same featureset when the release flag is set (it already is):

Apart from that you can use the release configuration in maven-compiler-plugin with JDK9+ to have exactly what animal sniffer offers and that's the reason why there are no JDK9+ signatures.

Unfortunately removing the plugin block from the pom doesn't work, so it seems the plugin is requiring 11 but disallowing any new APIs after 8. It'd be nice to keep the minimum version at 8 if that's the case.

@krisstern
Copy link
Member

Yeah, I can understand. On top of the aforementioned comment there is also mojohaus/animal-sniffer#62 (comment), so maybe I could take another look after this PR is reviewed and processed to see what to do with animal-sniffer next. My hunch is sooner or later we will need to get rid of it one way or the other with some config.

@krisstern
Copy link
Member

krisstern commented Feb 28, 2022

Hi @sghill I am in the process of reviewing the code in this PR. So far so good, I cannot help but noticed the following default time limit of 3 minutes or 180 seconds for the Absolute and No Activity timeout strategies accordingly:

Screenshot 2022-02-27 at 12 03 48 PM

Screenshot 2022-02-27 at 12 04 20 PM

I think the timeout settings of the other strategies look okay. But would like to know if we could increase the default to something slightly longer like 5 minutes or 300 seconds for these two strategies only, respectively? Or did you have anything particular in mind when setting the default to 3 minutes effectively to both these strategies. Three minutes seem a bit short for me, but of course it depends on what we are trying to build as well. Say previously in your example you had been using 4 minutes for the Absolute strategy...

@krisstern
Copy link
Member

I have reviewed the code, everything looks okay

@krisstern
Copy link
Member

@oleg-nenashev If you have some time, please review this PR so I could process it promptly.

@sghill
Copy link
Contributor Author

sghill commented Mar 5, 2022

if we could increase the default to something slightly longer like 5 minutes or 300 seconds for these two strategies only, respectively? Or did you have anything particular in mind when setting the default to 3 minutes effectively to both these strategies.

Hi @krisstern, definitely didn't intend for a 3 minute default. I think that's coming from here? I imagine very few users would use the default, so perhaps it's best to clear it out if possible? I intend to set to 24 hours on our instances.

@krisstern
Copy link
Member

krisstern commented Mar 6, 2022

Hi @sghill Or alternatively we could overwrite the timeout duration for the strategy say for the Absolute Timeout Strategy to something else here... I noticed the timeout settings of the Deadline Timeout Strategy, the Elastic Timeout Strategy, and also the Likely Stuck Timeout Strategy seem more agreeable.

So maybe we could look into tweaking the default timeout settings somewhere like the following two places for the Absolute Timeout Strategy class file? I will need time to look into the impact of modifying the BuildTimeoutWrapper.MINIMUM_TIMEOUT_MILLISECONDS more thoroughly later. It may be needed to set the lower bound of the timeout limits still.

this.timeoutMinutes = Integer.toString(Math.max((int) (BuildTimeoutWrapper.MINIMUM_TIMEOUT_MILLISECONDS / MINUTES), timeoutMinutes));

return MINUTES * Math.max((int) (BuildTimeoutWrapper.MINIMUM_TIMEOUT_MILLISECONDS / MINUTES), Integer.parseInt(

@krisstern
Copy link
Member

Or we could leave things as they are... And let the user choose a timeout value manually different from the default one.

@krisstern
Copy link
Member

@sghill Some conflicts have been introduced because of some urgent PR that needed to be closed. Would you like to rebase your PR so I could have it merge as soon as possible? My apologies for the confusion.

On each build:
  1. Find a configured global time out
  2. Schedule the configured time out operations on a
     dedicated ScheduledExecutorService with one thread
  3. Once complete, cancel the task

Implemented with a hudson.model.listeners.RunListener so it can
be applied to every build. The existing implementation uses a
hudson.tasks.BuildWrapper, which requires a checkbox on each build.
Replace deprecated #initMocks with MockitoRule
@sghill
Copy link
Contributor Author

sghill commented Mar 18, 2022

Hi @krisstern. I have rebased, but noticed a couple issues I'd like your input on before merging.

  1. I'd like to disable the global timeout strategy by unchecking the box. Today if I uncheck and reload the page, it's checked again. Is there a recommended way to do this?
  2. when I mvn hpi:run, I'm seeing an error that looks like it may expect a ldap dependency?
WARNING	h.ExtensionFinder$GuiceFinder$FaultTolerantScope$1#error: Failed to instantiate Key[type=jenkins.security.plugins.ldap.FromUserRecordLDAPGroupMembershipStrategy$DescriptorImpl, annotation=[none]]; skipping this component
java.lang.ClassNotFoundException: org.acegisecurity.userdetails.ldap.LdapUserDetails
	at org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1402)
	at org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1357)
	at org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1112)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
Caused: java.lang.NoClassDefFoundError: org/acegisecurity/userdetails/ldap/LdapUserDetails
	at java.lang.Class.getDeclaredMethods0(Native Method)
	at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
	at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
	at java.lang.Class.getMethod0(Class.java:3018)
	at java.lang.Class.getMethod(Class.java:1784)
	at hudson.model.Descriptor.<init>(Descriptor.java:304)
	at jenkins.security.plugins.ldap.LDAPGroupMembershipStrategyDescriptor.<init>(LDAPGroupMembershipStrategyDescriptor.java:32)
	at jenkins.security.plugins.ldap.FromUserRecordLDAPGroupMembershipStrategy$DescriptorImpl.<init>(FromUserRecordLDAPGroupMembershipStrategy.java:94)
	at jenkins.security.plugins.ldap.FromUserRecordLDAPGroupMembershipStrategy$DescriptorImpl$$FastClassByGuice$$491260477.GUICE$TRAMPOLINE(<generated>)
	at jenkins.security.plugins.ldap.FromUserRecordLDAPGroupMembershipStrategy$DescriptorImpl$$FastClassByGuice$$491260477.apply(<generated>)
	at com.google.inject.internal.DefaultConstructionProxyFactory$FastClassProxy.newInstance(DefaultConstructionProxyFactory.java:82)
	at com.google.inject.internal.ConstructorInjector.provision(ConstructorInjector.java:114)
	at com.google.inject.internal.ConstructorInjector.access$000(ConstructorInjector.java:33)
	at com.google.inject.internal.ConstructorInjector$1.call(ConstructorInjector.java:98)
	at com.google.inject.internal.ProvisionListenerStackCallback$Provision.provision(ProvisionListenerStackCallback.java:109)
	at hudson.ExtensionFinder$GuiceFinder$SezpozModule.onProvision(ExtensionFinder.java:568)
	at com.google.inject.internal.ProvisionListenerStackCallback$Provision.provision(ProvisionListenerStackCallback.java:117)
	at com.google.inject.internal.ProvisionListenerStackCallback.provision(ProvisionListenerStackCallback.java:66)
	at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:93)
	at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:296)
	at com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40)
Caused: com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) [Guice/ErrorInjectingConstructor]: NoClassDefFoundError: org/acegisecurity/userdetails/ldap/LdapUserDetails
  at FromUserRecordLDAPGroupMembershipStrategy$DescriptorImpl.<init>(FromUserRecordLDAPGroupMembershipStrategy.java:94)

Learn more:
  https://github.com/google/guice/wiki/ERROR_INJECTING_CONSTRUCTOR

1 error

======================
Full classname legend:
======================
FromUserRecordLDAPGroupMembershipStrategy$DescriptorImpl: "jenkins.security.plugins.ldap.FromUserRecordLDAPGroupMembershipStrategy$DescriptorImpl"
========================
End of classname legend:
========================

@krisstern
Copy link
Member

Hi @sghill Thanks for the prompt response!

Let me answer the checkbox question first. I think to disable it you can modify

<f:optionalBlock name="global" title="Enable Global Time Out" checked="${instance.enabled}">

I think it is better to get rid of the option checked="${instance.enabled}" so that by default the options is disabled, like the example they have for OptionalBlock at https://wiki.jenkins.io/display/JENKINS/Jelly+form+controls.

@krisstern
Copy link
Member

The build seems to have passed all CI/CD tests on GitHub though, so should be okay. Could be something with your local setup.

@krisstern
Copy link
Member

BTW, have you tried running mvn clean before you run mvn hpi:run, and did it make a difference? I will check out this branch and test locally later over the next few days just to make sure everything is okay.

@krisstern krisstern self-requested a review March 18, 2022 12:23
Unchecking the Enable Global Timeout box will now unset any configured
timeout. This also adds several INFO-level logs about what configuration
was loaded, updated, and cleared.

Additionally the constructor-injected Jenkins instance has been moved to
a setter. It's the same amount of code, but no longer requires
suppressing the unused warning on the default constructor.
@sghill
Copy link
Contributor Author

sghill commented Mar 19, 2022

Hi @krisstern,

I think it is better to get rid of the option checked="${instance.enabled}" so that by default the options is disabled

I'm seeing it disabled by default, are you seeing different behavior? When I remove the checked attribute it shows as unchecked on page reload, even if there is a global timeout set. The timeout is active too.

I think it should automatically show on page load as checked if configured, but when unchecked the configuration is cleared. I've updated the code to do this -- if the submitted form object is null, we now manually null out the class properties before saving.

have you tried running mvn clean before you run mvn hpi:run, and did it make a difference?

I tried mvn clean and the stacktrace still appeared. Manually removing the ldap plugin from work/plugins did solve it.

New Logs

I've added toString implementations on all the strategies for use with logging. I don't think this should be a problem, but wanted to call it out since it's new. Here are some examples of the logs it will show now:

On load:

INFO	h.p.b.g.GlobalTimeOutConfiguration#load: global timeout not set
INFO	h.p.b.g.GlobalTimeOutConfiguration#load: global timeout loaded as AbsoluteTimeOutStrategy[timeoutMinutes='1440'] with operations: AbortOperation

On update:

INFO	h.p.b.g.GlobalTimeOutConfiguration#configure: global timeout updated to AbsoluteTimeOutStrategy[timeoutMinutes='300'] with operations: AbortOperation, WriteDescriptionOperation
INFO	h.p.b.g.GlobalTimeOutConfiguration#configure: global timeout updated to LikelyStuckTimeOutStrategy[preferred='10 x estimated duration', fallback='24 hours'] with operations: AbortOperation

On clearing the 'Enable' checkbox:

INFO	h.p.b.g.GlobalTimeOutConfiguration#configure: global timeout has been cleared

@krisstern
Copy link
Member

Hi @sghill My apologies I think I may have misunderstood your question regarding the checkbox earlier. So I have tested your latest branch locally on my computer and everything works fine.

I have approved the PR and will merge it over the next few days. Thanks for your contribution!

@krisstern krisstern merged commit b7a1d57 into jenkinsci:master Mar 21, 2022
@sghill
Copy link
Contributor Author

sghill commented Mar 30, 2022

Thanks @krisstern! Is there a release coming up soon? Also happy to try out a candidate and report back.

@sghill
Copy link
Contributor Author

sghill commented May 11, 2022

Hi @krisstern, can I help create a new release? Looks like the last one was May 2020.

@krisstern
Copy link
Member

Hi @sghill. Yup, I agree it is time for me to cut a new release. Let me study up the docs and get back to you shortly. I had missed your prevous message hence the late reply.

@krisstern
Copy link
Member

Hi @sghill I have been trying to get the plugin code working locally on my new laptop over the last few days. However, I have been running into some problems getting the tests working again. Seems like once I have fixed up the POM file after updating to the latest weekly jenkins version, the hudson.Extension.optional() method from java.lang.* is gone due to some dependency issues.

As a result, I may need more time to get the code running again before I can cut a release.

BTW, would you like to help me to debug this issue? I could send you the new pom.xml file I have for you to see if there is anything that can be done with it to make the code run again.

@sghill
Copy link
Contributor Author

sghill commented May 16, 2022

Hi @krisstern, sure I can take a look. Can you push up another branch with the changes?

@krisstern
Copy link
Member

That's great @sghill! The branch with the new pom.xml is at https://github.com/krisstern/build-timeout-plugin/tree/preparing-for-new-release-2022

@krisstern
Copy link
Member

Hi @sghill New release draft version 1.21 is ready for testing: https://github.com/jenkinsci/build-timeout-plugin/releases/tag/untagged-aea95edaabab5f0616d0

@sghill
Copy link
Contributor Author

sghill commented May 26, 2022

Nice, thanks @krisstern! We have pulled in v1.21. This should be rolling out over the next couple weeks. I'll let you know if anything comes up.

Current plan is to configure like this in a custom initialization plugin:

import hudson.Extension;
import hudson.init.Initializer;
import hudson.plugins.build_timeout.BuildTimeOutOperation;
import hudson.plugins.build_timeout.global.GlobalTimeOutConfiguration;
import hudson.plugins.build_timeout.impl.AbsoluteTimeOutStrategy;
import hudson.plugins.build_timeout.operations.AbortOperation;
import hudson.plugins.build_timeout.operations.WriteDescriptionOperation;
import lombok.extern.slf4j.Slf4j;

import javax.inject.Inject;
import java.time.Duration;
import java.util.LinkedList;
import java.util.List;

import static hudson.init.InitMilestone.SYSTEM_CONFIG_ADAPTED;

@Slf4j
@Extension
public class GlobalBuildTimeoutConfigInit {
    private static final Duration GLOBAL_TIMEOUT = Duration.ofHours(30);
    private GlobalTimeOutConfiguration configuration;

    @Inject
    void setConfiguration(GlobalTimeOutConfiguration configuration) {
        this.configuration = configuration;
    }

    @Initializer(after = SYSTEM_CONFIG_ADAPTED)
    public void init() {
        long hours = GLOBAL_TIMEOUT.toHours();
        String description = String.format("Global timeout of %d hours elapsed", hours);
        configuration.setStrategy(new AbsoluteTimeOutStrategy(String.valueOf(GLOBAL_TIMEOUT.toMinutes())));
        List<BuildTimeOutOperation> operations = new LinkedList<>();
        operations.add(new AbortOperation());
        operations.add(new WriteDescriptionOperation(description));
        configuration.setOperations(operations);
        configuration.save();
        log.info("Builds will abort after {} hours and set description to '{}'", hours, description);
    }
}

@sghill
Copy link
Contributor Author

sghill commented Jun 14, 2022

Following up on this -- plugin has been rolled out and is working. We've timed out 3 runaway builds so far 🎉

Thanks so much @krisstern!

@krisstern
Copy link
Member

You are welcome @sghill

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants