Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-48309] - Prevent TimeoutException in AsyncFutureImpl in the case of spurious wakeups #240

Conversation

Projects
None yet
4 participants
@oleg-nenashev
Copy link
Member

commented Nov 30, 2017

TL;DR: A large part of Jenkins’ get-with-timeout logic is a subject for improper exit before the timeout.

It happens, because "wait(timeout)" is not in the loop. Object#wait(long) is explicit about that: "A thread can also wake up without being notified, interrupted, or timing out, a so-called spurious wakeup. While this will rarely occur in practice, applications must guard against it by testing for the condition that should have caused the thread to be awakened, and continuing to wait if the condition is not satisfied. In other words, waits should always occur in loops..."

https://docs.oracle.com/javase/8/docs/api/java/lang/Object.html#wait-long-

Impact:

  • AsyncFutureImpl is being used explicitly in JarCacheSupport. Jar resolution with a timeout may fail due to the issue
  • The class is overridden in Jenkins Core's public API, e.g. hudson.model.queue.FutureImpl. It's hard to estimate risks in the core, but it looks bad.

https://issues.jenkins-ci.org/browse/JENKINS-48309

@reviewbybees @rysteboe @jglick

[JENKINS-48309] - Prevent TimeoutException in AsyncFutureImpl in the …
…case of spurious wakeups.

TL;DR: A large part of Jenkins’ get-with-timeout logic is a subject for improper exit before the timeout.

@oleg-nenashev oleg-nenashev requested review from rysteboe and jglick Nov 30, 2017

@stephenc
Copy link
Member

left a comment

Currenttimemillis will cause issues with ntp clock updates. Use nanotime (and it’s more complex long comparison logic)

long timeoutMs = unit.toMillis(timeout);
// It is not very accurate, but all(proof?) production uses of this class pass seconds.
// Test cases pass 1 millisecond and other such fun, but wait(long) is not time-accurate anyway
long endWaitTime = System.currentTimeMillis() + timeoutMs;

This comment has been minimized.

Copy link
@stephenc

stephenc Nov 30, 2017

Member

🐛 will cause massive issues as the system time changes. You need to use nanotime instead... care for comparison logic though


//TODO: we may be calling a complex get() implementation in the override, which may hang the code
// The only missing behavior is "problem!=null" branch, but probably we cannot just replace the code without
// an override issue risk.

This comment has been minimized.

Copy link
@stephenc

stephenc Nov 30, 2017

Member

Then perhaps move the get outside the synchronised

This comment has been minimized.

Copy link
@oleg-nenashev

oleg-nenashev Nov 30, 2017

Author Member

I will need to investigate the implementations first. get() outside synchronized may help for sure, but I would like to find real examples before touching this code.

This comment has been minimized.

Copy link
@stephenc
@oleg-nenashev

This comment has been minimized.

Copy link
Member Author

commented Nov 30, 2017

Probably it causes https://issues.jenkins-ci.org/browse/JENKINS-20947, not 100% sure.

}

long timeToWaitMs = Math.max(1, timeToWait / 1000000);
wait(timeToWaitMs);

This comment has been minimized.

Copy link
@stephenc

stephenc Nov 30, 2017

Member

There is wait(long,int) to wait with “nanosecond” precision

This comment has been minimized.

Copy link
@oleg-nenashev

oleg-nenashev Nov 30, 2017

Author Member

The implementation in my Java does the same, actually. Decompiled code:

public final void wait(long timeout, int nanos) throws InterruptedException {
        if (timeout < 0) {
            throw new IllegalArgumentException("timeout value is negative");
        }

        if (nanos < 0 || nanos > 999999) {
            throw new IllegalArgumentException(
                                "nanosecond timeout value out of range");
        }

        if (nanos >= 500000 || (nanos != 0 && timeout == 0)) {
            timeout++;
        }

        wait(timeout);
    }

I will patch the impl though


//TODO: we may be calling a complex get() implementation in the override, which may hang the code
// The only missing behavior is "problem!=null" branch, but probably we cannot just replace the code without
// an override issue risk.

This comment has been minimized.

Copy link
@stephenc
[JENKINS-48309] - Use wait with nanos in hope there are Java implemen…
…tations which really support nano timeouts.
@rysteboe
Copy link
Contributor

left a comment

One small nit but 🐝 overall.

break;
}

wait(timeToWait / 1000000, (int)(timeToWait % 1000000));

This comment has been minimized.

Copy link
@rysteboe

rysteboe Dec 1, 2017

Contributor

nit: worth a comment here saying how large the number is, it's just a pain to read that many zeroes in a row

This comment has been minimized.

Copy link
@jtnord

jtnord Dec 5, 2017

Member

underscores in java 8 1_000_000

@oleg-nenashev

This comment has been minimized.

Copy link
Member Author

commented Dec 3, 2017

ping @stephenc

long endWaitTime = System.nanoTime() + unit.toNanos(timeout);
while (!completed) {
long timeToWait = endWaitTime - System.nanoTime();
if (timeout < 0) {

This comment has been minimized.

Copy link
@stephenc

stephenc Dec 5, 2017

Member

s/timeout/timeToWait/

This comment has been minimized.

Copy link
@oleg-nenashev

oleg-nenashev Dec 6, 2017

Author Member

oh yes :( It makes me think that some test coverage could be actually useful

@stephenc
Copy link
Member

left a comment

I think there is a typo

@oleg-nenashev

This comment has been minimized.

Copy link
Member Author

commented Dec 7, 2017

Test hanging is unrelated, it comes from the previous change. Git bisect points to [7bd389e7991efedf0cd67afb7ce821abb50891bb] Javadoc: cleanup empty tag Javadoc warnings after 2 attempts, looks promising 🤷‍♂️

@oleg-nenashev oleg-nenashev merged commit 3f07563 into jenkinsci:master Dec 7, 2017

1 check failed

continuous-integration/jenkins/pr-head The build of this commit was aborted
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.