Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIXED JENKINS-28840] Deadlock between Queue.maintain and Executor.interrupt #1738

Merged
merged 4 commits into from Jun 15, 2015

Conversation

@stephenc
Copy link
Member

stephenc commented Jun 15, 2015

See JENKINS-28840

More fun here:

  • All this originates from Executor extending Thread.
  • There is funky logic in the lock handling code of the JVM that makes assumptions
    about how it might proceed with the lock when the thread holding the lock has its
    interrupt flag set.
  • Really it would be better if Executor did not extend Thread as that way we wouldn't
    have to deal with some of that complexity. But OTOH we are where we are and backwards
    compatibility may make such a change not possible without a lot of breakage.
  • Fixing the issue at hand, firstly requires that interrupting a Computer happens with the
    Queue lock held (to speed up tests we have Jenkins.cleanUp get the lock for all Computers)
    That prevents the Queue maintain thread from getting caught
  • Secondly, when removing an executor from a computer we process the removal while
    holding the Queue lock, but we move the removal itself to a separate thread if we cannot
    get the Queue lock in order to avoid deadlock.
  • Also add helper methods to wrap tasks to be performed while holding the lock
    and a helper method for Runnables that exposes the tryLock functionality

@olivergondza hopefully this fixes the issue and it's not that I didn't run the flaky test case enough times on my machine. For back-porting to 1.609.2 we should obviously mark the new API methods in Queue as @Restricted

@reviewbybees

…terrupt

More fun here:

- All this originates from Executor extending Thread.
- There is funky logic in the lock handling code of the JVM that makes assumptions
  about how it might proceed with the lock when the thread holding the lock has its
  interrupt flag set.
- Really it would be better if Executor did not extend Thread as that way we wouldn't
  have to deal with some of that complexity. But OTOH we are where we are and backwards
  compatibility may make such a change not possible without a lot of breakage.
- Fixing the issue at hand, firstly requires that interrupting a Computer happens with the
  Queue lock held (to speed up tests we have Jenkins.cleanup get the lock for all Computers)
  That prevents the Queue maintain thread from getting caught
- Secondly, when removing an executor from a computer we process the removal while
  holding the Queue lock, but we move the removal itself to a separate thread if we cannot
  get the Queue lock in order to avoid deadlock.
- Also add helper methods to wrap tasks to be performed while holding the lock
  and a helper method for Runnables that exposes the tryLock functionality
@rsandell

This comment has been minimized.

Copy link
Member

rsandell commented Jun 15, 2015

👍

@olivergondza

This comment has been minimized.

Copy link
Member

olivergondza commented Jun 15, 2015

@stephenc, did you manage to reproduce the deadlock without the fix? As I reported it did not happen every time.

public static boolean tryWithLock(Runnable runnable) {
final Jenkins jenkins = Jenkins.getInstance();
final Queue queue = jenkins == null ? null : jenkins.getQueue();
if (queue == null) {

This comment has been minimized.

Copy link
@oleg-nenashev

oleg-nenashev Jun 15, 2015

Member

NIT: it should not really happen (?). Otherwise, we still have a risk of surprising behaviors on Jenkins startup

This comment has been minimized.

Copy link
@oleg-nenashev

oleg-nenashev Jun 15, 2015

Member

BTW, I would always use an external lock to keep the consistency

@oleg-nenashev

This comment has been minimized.

Copy link
Member

oleg-nenashev commented Jun 15, 2015

There's a risk of new regressions and deadlocks, but 👍 for merge in order to test it as much as possible before LTS backporting timeslot ends.

@jglick

This comment has been minimized.

Copy link
Member

jglick commented Jun 15, 2015

CommandLauncherTest.commandFails is probably unrelated but should be evaluated.

@Override
public void run() {
synchronized (Computer.this) {
executors.remove(e);
addNewExecutorIfNecessary();
if (!isAlive()) // TODO except from interrupt/doYank this is called while the executor still isActive(), so how could !this.isAlive()?

This comment has been minimized.

Copy link
@jglick

jglick Jun 15, 2015

Member

Is there a reason you deleted my comment?

This comment has been minimized.

Copy link
@stephenc

stephenc Jun 15, 2015

Author Member

Yes as this call can now happen on the background thread so it may now be not alive when the thread scheduler gets around to scheduling execution

@@ -1219,6 +1219,59 @@ public static void withLock(Runnable runnable) {
}
}

/**
* Some operations require to be performed with the {@link Queue} lock held. Use one of these methods rather

This comment has been minimized.

Copy link
@jglick

jglick Jun 15, 2015

Member

grammar

* Some operations require to be performed with the {@link Queue} lock held. Use one of these methods rather
* than locking directly on Queue in order to allow for future refactoring.
* @param runnable the operation to perform.
* @return {@code true} if the lock available and the operation performed.

This comment has been minimized.

Copy link
@jglick

jglick Jun 15, 2015

Member

grammar

* @return {@code true} if the lock available and the operation performed.
* @since 1.FIXME
*/
protected boolean _tryWithLock(Runnable runnable) {

This comment has been minimized.

Copy link
@jglick

jglick Jun 15, 2015

Member

Why is this protected? Queue is not intended to be subclassed.

@jglick

This comment has been minimized.

Copy link
Member

jglick commented Jun 15, 2015

👍 without fully understanding the fix, but sounds reasonable.

@stephenc

This comment has been minimized.

Copy link
Member Author

stephenc commented Jun 15, 2015

@olivergondza over 2 hours of running

while true; do mvn surefire:test ; done

on the unpatched code and I have not got a deadlock, so no. I have not been able to reproduce your failure, but this should fix it.

@stephenc stephenc closed this Jun 15, 2015
@stephenc stephenc reopened this Jun 15, 2015
stephenc added 3 commits Jun 15, 2015
- I suspect the synchronization on Jenkins is a bug also... but not causing the test failure, so will ignore for now
stephenc added a commit that referenced this pull request Jun 15, 2015
[FIXED JENKINS-28840] Deadlock between Queue.maintain and Executor.interrupt
@stephenc stephenc merged commit 71e684a into jenkinsci:master Jun 15, 2015
1 check was pending
1 check was pending
Jenkins Jenkins is validating pull request ...
Details
@stephenc stephenc deleted the stephenc:jenkins-28840 branch Jun 15, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.