Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JENKINS-27708, JENKINS-27871 Ensure that identification of blocked tasks is using the live state. #1645

Merged
merged 3 commits into from Apr 15, 2015

Conversation

6 participants
@stephenc
Copy link
Member

commented Apr 13, 2015

  • The creation of a snapshot itself should be relatively cheap given the expected rate of job execution. You probably would need 100's of jobs starting execution every iteration of maintain() before this could even start to become an issue and likely the calculation of isBuildBlocked(p) will become a bottleneck before updateSnapshot() will. Additionally since the snapshot itself only ever has at most one reference originating outside of the stack it should remain in the eden space and thus be cheap to GC.
  • JENKINS-27708 comments 225819 and 225906 provide more complex but logically equivalent fixes of this issue. I am favouring this approach as it is simpler and provides less scope for error as any new helper methods can just rely on the snapshot being up to date whereas with the other two candidates if a new helper method is introduced there is the potential to miss adding support for the live view. The comment 225819 has the risk of introducing extra lock contention while the comment 225906 version forces every access to the helper methods to pass a second memory barrier

@reviewbybees

[FIXED JENKINS-27708][FIXED JENKINS-27871] Ensure that identification…
… of blocked tasks is using the live state.

- The creation of a snapshot itself should be relatively cheap given the expected rate of
  job execution. You probably would need 100's of jobs starting execution every iteration
  of maintain() before this could even start to become an issue and likely the calculation
  of isBuildBlocked(p) will become a bottleneck before updateSnapshot() will. Additionally
  since the snapshot itself only ever has at most one reference originating outside of the stack
  it should remain in the eden space and thus be cheap to GC.

- JENKINS-27708 comments 225819 and 225906 provide more complex but logically equivalent fixes of
  this issue. I am favouring this approach as it is simpler and provides less scope for error as any
  new helper methods can just rely on the snapshot being up to date whereas with the other
  two candidates if a new helper method is introduced there is the potential to miss adding support
  for the live view. The comment 225819 has the risk of introducing extra lock contention while
  the comment 225906 version forces every access to the helper methods to pass a second memory
  barrier
@oleg-nenashev

This comment has been minimized.

Copy link
Member

commented Apr 13, 2015

I suppose that the approach will cause an extra performance degradation on large-scale installations. BTW, it's better than nothing in the case of the referenced critical issues.

@oleg-nenashev

This comment has been minimized.

Copy link
Member

commented Apr 13, 2015

@stephenc
Do you have a test, which reproduces the issue in the core?

@jtnord

This comment has been minimized.

Copy link
Member

commented Apr 13, 2015

From what I understand this is fixing incorrect assumptions made by plugins.
Whilst this is a workaround it will not help in finding these miss-assumptions and could actually proliferate this in the long term. Not sure I can think of an alternative approach though that both lets plugin authors know they have made incorrect assumptions and lets users continue to use plugins with buggy implementations.
so a :+¾:

@stephenc

This comment has been minimized.

Copy link
Member Author

commented Apr 13, 2015

@oleg-nenashev so in the normal case we would expect maintain to be running for one of two reasons:

  • A job has left the waiting/blocked state. In the general case this will be one and only one job at a time, and as such this change will introduce two additional calls to updateSnapshot()
  • A node has come on-line. In the worst case this will be as many jobs as there are executors on the node. As such this change will introduce N+1 additional calls to updateSnapshot() where N is the number of executors.

The call to updateSnapshot() itself should be relatively cheap as it is just duplicating four lists via the copy constructor. The old instance will also be cheap for GC to clean up.

Ultimately, to completely remove locking from the Queue processing we will need to move to a state where the Snapshot is actually an immutable representation of the entire Queue state that gets updated via a compareAndSwap operation. In that model this behaviour would actually be the end result. It is expected to trade some performance replicating immutable data models and re-spinning to retry a change with the aim of providing threading correctness, but the benefits are true lock free behaviour and better throughput. On that basis I rejected the other two fixes I developed on the basis that they were both overly complex and did not move us away from using locks.

Queue#lock is currently a necessary evil, but one that I hope to be able to completely remove at some later point... once some of the other threading issues are corrected.

@stephenc

This comment has been minimized.

Copy link
Member Author

commented Apr 13, 2015

@oleg-nenashev I do not currently have a test case in core. It would seem that JENKINS-27871 would be a good place to start. Do you want to take a shot?

@stephenc

This comment has been minimized.

Copy link
Member Author

commented Apr 13, 2015

@jtnord it is debatable as to whether this is a bug in core (introduced in 1.536 but masked prior to 1.607 is you have more than one CPU core... and almost everyone these days runs on at least a CPU with support for hyper-threading) or a bug in the plugins.

Certainly the job of a plugin author is simpler if you can assume that jobs get scheduled one at a time and not en-mass. On the other hand life for the Queue would be easier if it can just make an atomic batch assignment of work against executors. In any case absence a clear call out in 1.536 that the unintentional behaviour of pre-1.536 where each job would start individually was unspecified it seems to make more sense to restore the pre-1.536 effective behaviour until such time as we can provide plugins with a simpler API where they can decide on a multitude of jobs being blocked in one go... a Set<Task> filterBlockedTasks(Set<Task>) style method on QueueDecisionHandler might be one initial stab at such an API so that the QDH can just return the set of unblocked tasks and we stop once either the set is empty or we've processed all the QDH's... Then Queue can atomically assign all the unblocked tasks to executors and be done... of course that is not necessarily the API we might want, only one sketch at a potential solution

@oleg-nenashev

This comment has been minimized.

Copy link
Member

commented Apr 13, 2015

Yes, JENKINS-27871 is a right issue to be reproduced. It happens in the core w/o any external plains, so we cannot consider the issue as a bug in plugins as @jtnord proposed

Manual tests confirm that the PR fixes JENKINS-27871. Autotest - WiP

@oleg-nenashev

This comment has been minimized.

Copy link
Member

commented Apr 14, 2015

@stephenc
You can take the unit-test here: stephenc#2

Merge pull request #2 from oleg-nenashev/JENKINS-27871
[JENKINS-27871] - Added a direct unit test for the issue
@jtnord

This comment has been minimized.

Copy link
Member

commented Apr 15, 2015

👍

@oleg-nenashev

This comment has been minimized.

Copy link
Member

commented Apr 15, 2015

👍 from me as well. The change makes both TCB and upstream blocking features operational

// See https://issues.jenkins-ci.org/browse/JENKINS-27708?focusedCommentId=225819&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-225819
// or https://issues.jenkins-ci.org/browse/JENKINS-27708?focusedCommentId=225906&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-225906
// for alternative fixes of this issue.
updateSnapshot();

This comment has been minimized.

Copy link
@jglick

jglick Apr 15, 2015

Member

You are intentionally doing this here even though updateSnapshot is about to be called again in the finally block (before releasing the lock)? The call on line 1340 makes sense but this looks like a mistake unless I am missing something.

This comment has been minimized.

Copy link
@stephenc

stephenc Apr 15, 2015

Author Member

The finally block could probably be split, so that here is the end of the inner-most finally but it would have applied a whole host of formatting changes... and given I want to have this as lts-candidate in the event that 1.607-1.609 are selected for the next LTS I though this was least risk

stephenc added a commit that referenced this pull request Apr 15, 2015

Merge pull request #1645 from stephenc/jenkins-27708
JENKINS-27708, JENKINS-27871  Ensure that identification of blocked tasks is using the live state.

@stephenc stephenc merged commit 152d00a into jenkinsci:master Apr 15, 2015

1 check passed

Jenkins This pull request looks good
Details
projectB.setBlockBuildWhenUpstreamBuilding(true);

final FreeStyleProject projectC = r.createFreeStyleProject(prefix+"C");
projectC.getBuildersList().add(new SleepBuilder(10000));

This comment has been minimized.

Copy link
@jglick

jglick Apr 15, 2015

Member

This forces the test to take at least 10s, but it is still going to fail randomly on a heavily loaded test machine. Would be better to use a semaphore-based builder. (Not sure offhand if we have a standard one lying around.)

@daniel-beck

This comment has been minimized.

Copy link
Member

commented Apr 18, 2015

Where's the changelog entry for this change?

@reviewbybees

This comment has been minimized.

Copy link

commented Apr 18, 2015

My bad. I'm away from my computer all day today. Can you add it? (Stephen pretending to be the review success kid)

Sent from my iPhone

On 18 Apr 2015, at 08:50, Daniel Beck notifications@github.com wrote:

Where's the changelog entry for this change?


Reply to this email directly or view it on GitHub.

You received this message because you are subscribed to the Google Groups "engineering-code-reviews" group.
To unsubscribe from this group and stop receiving emails from it, send an email to engineering-code-reviews+unsubscribe@cloudbees.com.
To post to this group, send email to engineering-code-reviews@cloudbees.com.
To view this discussion on the web visit https://groups.google.com/a/cloudbees.com/d/msgid/engineering-code-reviews/jenkinsci/jenkins/pull/1645/c94141001%40github.com.

daniel-beck added a commit that referenced this pull request Apr 20, 2015

tfennelly added a commit to tfennelly/jenkins that referenced this pull request Apr 26, 2015

Merge branch 'master' into JENKINS-26445-pagination-search
* master: (71 commits)
  Noting JENKINS-28062
  print the computer's name as toString() is probably useless
  Noting jenkinsci#1667
  jesse wants a comment explaining devaition from pattern
  [FIXED JENKINS-28062] A Launcher.afterDisconnect() method that propagates an exception is a sign of a bad Launcher not a problem closing the channel.
  [FIXED JENKINS-28058] Don't send a reference to Computer.class over remoting channels
  Revert "[JENKINS-10629] - Migrate the Tar archives handling code to commons-compress"
  Revert "FIXED JENKINS-10629] - Enable BigNumber mode to support archiving of files with size >8Gb"
  [FIXED JENKINS-27218]
  Change Last Active column to reflect only commit activity.
  Allow the advertised TCP slave agent port number to be modified.
  If CLI main thread fails, kill the JVM no matter what.
  [FIXED JENKINS-28011] Amending jenkinsci#1563, ${descriptor} is not necessarily ${attrs.descriptor}!
  Made minor changes to the contributing section in the README.
  [JENKINS-27995] Added contributing section to the README.
  Noting jenkinsci#1645
  updated changelog for release
  [maven-release-plugin] prepare release jenkins-1.610
  [maven-release-plugin] prepare for next development iteration
  Integrating a new version with better diagnostics
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.