Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIXED JENKINS-20272] Don't monitor response on offline agents #2911

Merged
merged 3 commits into from Jul 30, 2017

Conversation

abayer
Copy link
Member

@abayer abayer commented Jun 7, 2017

Description

See JENKINS-20272.

Details: Don't monitor response time on offline agents. We know they're offline. We don't need to check.

Changelog entries

Proposed changelog entries:

  • Entry 1: JENKINS-20272, Don't monitor response time on offline agents.

Submitter checklist

  • JIRA issue is well described
  • Link to JIRA ticket in description, if appropriate
  • Appropriate autotests or explanation to why this change has no tests
  • For new API and extension points: Link to the reference implementation in open-source (or example in Javadoc)

Desired reviewers

@reviewbybees @oleg-nenashev @daniel-beck

@ghost
Copy link

ghost commented Jun 7, 2017

This pull request originates from a CloudBees employee. At CloudBees, we require that all pull requests be reviewed by other CloudBees employees before we seek to have the change accepted. If you want to learn more about our process please see this explanation.

Copy link
Member

@oleg-nenashev oleg-nenashev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about this approach. isOffline() returns true when the agent is temporaryOffline:

return temporarilyOffline || getChannel()==null;
. It means that the agent is online && just disabled due to whatever reason (should we finally change this confusing terminology?). In such case the channel stays active, and the agent may be reactivated at any time without preliminary checks.

I would argue that we need to run all monitorings against temporaryOffline agents. Otherwise they will be never marked as offline by the system monitoring, which will probably lead to failures once somebody enables them. Hence 🐛

@@ -49,6 +48,9 @@
public static final AbstractNodeMonitorDescriptor<Data> DESCRIPTOR = new AbstractAsyncNodeMonitorDescriptor<Data>() {
@Override
protected Callable<Data,IOException> createCallable(Computer c) {
if (c.isOffline()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to just check that the channel != null and not closing or closed

Otherwise if the agent is taken offline due to the returned metric it cannot come back on-line

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original JIRA suggests not wanting to do this for agents that are marked offline manually too - at one point, I had a check here to make sure the OfflineCause wasn't ResponseTimeMonitor.Data. Think that's reasonable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think checking OfflineCause would be best

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original JIRA suggests not wanting to do this for agents that are marked offline manually too

I wrote that and I just looked at it again, and there's nothing about this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I misinterpreted. =)

@abayer
Copy link
Member Author

abayer commented Jun 14, 2017

@oleg-nenashev So...any suggestions? I am not objecting to your objections. =) I just don't know if there's any way to actually address this ticket that can satisfy your objections - if that's the case, fine, then this is a won't fix, I just want to be sure.

@oleg-nenashev
Copy link
Member

@abayer I think the best way to solve the ticket is to add an optional flag in monitor settings (E.g. "Do not monitor temporary offline nodes"). It would retain the current (correct) behavior and give an option to opt-out. @daniel-beck WDYT?

@daniel-beck
Copy link
Member

daniel-beck commented Jun 17, 2017

optional flag in monitor settings

No. This option will make no sense to anyone. This is how the MS Word settings dialog (ca. 2003) became such a joke. We need to try to fight this impulse to make everything configurable.

A disconnected node doesn't need to be checked. That's what JENKINS-20272 discusses, and what should be uncontroversial. The motivation for going further with nodes marked offline is entirely unclear to me.

@stephenc
Copy link
Member

I agree with DB. We should just not monitor nodes if the connection is off-line.

@abayer
Copy link
Member Author

abayer commented Jun 19, 2017

Ok, I'll switch to just not monitoring disconnected agents.

@oleg-nenashev
Copy link
Member

@abayer I am not sure if it is 100% correct. It will set a monitoring error.

if (d ==null) {
                     // if we failed to monitor, put in the special value that indicates a failure
                     e.setValue(d=new Data(get(c),-1L));
}

@oleg-nenashev oleg-nenashev added the on-hold This pull request depends on another event/release, and it cannot be merged right now label Jun 23, 2017
Copy link
Member

@oleg-nenashev oleg-nenashev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pinged @abayer . I would rather block the PR till he responds. Too late for 2.60.1 anyway, but it still can get into 2.60.2

@abayer
Copy link
Member Author

abayer commented Jun 29, 2017

@oleg-nenashev Not sure I understand your previous comment?

@oleg-nenashev
Copy link
Member

@abayer Well, I need to re-review it. The cache has been already invalidated

@abayer
Copy link
Member Author

abayer commented Jul 3, 2017

heh.

@oleg-nenashev
Copy link
Member

@abayer
OK, got to it. You return null, but null is not "fine". This line handles null as a monitoring failure:

.

So, the end result of the monitor will be "failed to monitor", not "no need to monitor"

Copy link
Member

@oleg-nenashev oleg-nenashev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if I disagree with the current change, I admit it's better than the original behavior.

So I approve it in order to get it in .3 after testing in the community

@oleg-nenashev oleg-nenashev added ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback and removed on-hold This pull request depends on another event/release, and it cannot be merged right now labels Jul 29, 2017
@oleg-nenashev
Copy link
Member

I will handle the fallout if it happens

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback
Projects
None yet
4 participants