Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8262094: Handshake timeout scaled wrong #2666

Closed
wants to merge 1 commit into from
Closed

Conversation

@robehn
Copy link
Contributor

@robehn robehn commented Feb 21, 2021

Parameter HandshakeTimeout is in milliseconds.
Internally we use nanoseconds.
HandshakeTimeout must be scaled to nanoseconds.

Passes T1


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

Reviewers

Download

$ git fetch https://git.openjdk.java.net/jdk pull/2666/head:pull/2666
$ git checkout pull/2666

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Feb 21, 2021

👋 Welcome back rehn! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

@openjdk openjdk bot commented Feb 21, 2021

@robehn The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@robehn robehn marked this pull request as ready for review Feb 22, 2021
@openjdk openjdk bot added the rfr label Feb 22, 2021
@mlbridge
Copy link

@mlbridge mlbridge bot commented Feb 22, 2021

Webrevs

Copy link
Member

@dholmes-ora dholmes-ora left a comment

Hi Robbin,

This change seems much clearer, but to be clear the actual result is the same as the "counter" tracks nanoseconds.

This counts as a trivial change IMO.

Thanks,
David

@openjdk
Copy link

@openjdk openjdk bot commented Feb 22, 2021

@robehn This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8262094: Handshake timeout scaled wrong

Reviewed-by: dholmes, stefank

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 24 new commits pushed to the master branch:

  • 8cfea7c: 8261921: ClassListParser::current should be used only by main thread
  • 991f7c1: 8210373: Deadlock in libj2gss.so when loading "j2gss" and "net" libraries in parallel.
  • 0217d69: 8261975: Missing "classpath exception" in VectorSupport.java
  • f2bde05: 8262097: Improve CompilerConfig ergonomics to fix a VM crash after JDK-8261229
  • aea474c: 8261269: When using clhsdb to "inspect" a java object, clhsdb prints "Oop for..." twice
  • a7e2e80: 8260695: The java.awt.color.ICC_Profile#getData/getData(int) are not thread safe
  • a30fb4f: 8255867: SignatureScheme JSSE property does not preserve ordering in handshake messages
  • a867288: 8211227: Inconsistent TLS protocol version in debug output
  • 7b924d8: 8261973: Shenandoah: Cleanup/simplify root verifier
  • 63f8fc8: 8259662: Don't wrap SocketExceptions into SSLExceptions in SSLSocketImpl
  • ... and 14 more: https://git.openjdk.java.net/jdk/compare/18188c2aa7d68f057d4e08d70944d5694dbe73a2...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Feb 22, 2021
@robehn
Copy link
Contributor Author

@robehn robehn commented Feb 22, 2021

Hi Robbin,

This change seems much clearer, but to be clear the actual result is the same as the "counter" tracks nanoseconds.

This counts as a trivial change IMO.

Thanks,
David

Hi David, unfortunately I don't I agree that this is trivial:

open/src/hotspot/os/windows/os_windows.cpp:jlong os::elapsed_frequency() { open/src/hotspot/os/windows/os_windows.cpp- return performance_frequency; open/src/hotspot/os/windows/os_windows.cpp-}

This value comes from QueryPerformanceFrequency, which depends on what resolution the Windows thinks is reasonable given what hardware support there are.

In a virtualized env. this causes a 20.000 ms timeout to timeout after 2 s for me due to low resolution (windows only).
(all other platforms hardcoded to NANASECONDS)

We have two more suspective uses here, which I have not investigated (but we have seem wired timeouts with both ZGC and G1, but I have seen non with parGC):
open/src/hotspot/share/gc/z/zMark.cpp: _timeout(_start.value() + TimeHelper::micros_to_counter(timeout_in_micros))

open/src/hotspot/share/gc/g1/g1ServiceThread.cpp: jlong delay = TimeHelper::millis_to_counter(delay_ms);

Thanks

Copy link
Member

@stefank stefank left a comment

Why are we using os::javaTimeNanos instead os::elapsed_counter()? I think we've tried to limit the usage of to places that explicitly interact with the Java side. For example, the reference processor uses os::javaTimeNanos because it interacts with java.lang.ref.SoftReference.clock. Most other places in the JVM uses os::elapsed_counter().

@kstefanj
Copy link
Contributor

@kstefanj kstefanj commented Feb 22, 2021

G1ServiceThread is using os::elapsed_counter() but if the frequency used in TimeHelper::millis_to_counter(...) is wrong I guess we could be in trouble. @robehn, is your theory that the calculated delay in some cases could be extremely long and because of that cause the timeouts we see?

@robehn
Copy link
Contributor Author

@robehn robehn commented Feb 22, 2021

Why are we using os::javaTimeNanos instead os::elapsed_counter()? I think we've tried to limit the usage of to places that explicitly interact with the Java side. For example, the reference processor uses os::javaTimeNanos because it interacts with java.lang.ref.SoftReference.clock. Most other places in the JVM uses os::elapsed_counter().

Both uses a 'java type', jlong.
For logging and logic for yielding we want a humanly understand unit that is absolute.
Since things goes pretty fast nowadays nanseconds seems reasonable.
Instead of to converting the elapsed_counter to nanos, it's much simpler to just get the nanos.
But yes it would preferably with a os::time_nanos().

@robehn
Copy link
Contributor Author

@robehn robehn commented Feb 22, 2021

G1ServiceThread is using os::elapsed_counter() but if the frequency used in TimeHelper::millis_to_counter(...) is wrong I guess we could be in trouble. @robehn, is your theory that the calculated delay in some cases could be extremely long and because of that cause the timeouts we see?

The logging during handshake timeout doesn't work on windows for some reason, when I locally fixed the logging I notice that timeout was actually 2 seconds instead of 20 seconds. But that only seems to apply to some windows versions/(virtual hardware).

During some tests I have seem results such as:
"Error: time between events 0 ns"
When looking at the code the events are serialized with a mutex, so some time must have passed but the granularity of the system is to coarse. I don't know what else this can cause.

@mlbridge
Copy link

@mlbridge mlbridge bot commented Feb 22, 2021

Mailing list message from David Holmes on hotspot-runtime-dev:

Hi Robbin,

On 22/02/2021 6:19 pm, Robbin Ehn wrote:

On Mon, 22 Feb 2021 07:46:47 GMT, David Holmes <dholmes at openjdk.org> wrote:

Hi Robbin,

This change seems much clearer, but to be clear the actual result is the same as the "counter" tracks nanoseconds.

This counts as a trivial change IMO.

Thanks,
David

Hi David, unfortunately I don't I agree that this is trivial:

`open/src/hotspot/os/windows/os_windows.cpp:jlong os::elapsed_frequency() {
open/src/hotspot/os/windows/os_windows.cpp- return performance_frequency;
open/src/hotspot/os/windows/os_windows.cpp-}`

This value comes from QueryPerformanceFrequency, which depends on what resolution the Windows thinks is reasonable given what hardware support there are.

Yep mea culpa I completely overlooked Windows. So on Windows this change
actually alters the timeout. So a simple fix but not trivial.

In a virtualized env. this causes a 20.000 ms timeout to timeout after 2 s for me due to low resolution (windows only).
(all other platforms hardcoded to NANASECONDS)

We have two more suspective uses here, which I have not investigated (but we have seem wired timeouts with both ZGC and G1, but I have seen non with parGC):
`open/src/hotspot/share/gc/z/zMark.cpp: _timeout(_start.value() + TimeHelper::micros_to_counter(timeout_in_micros))`

`open/src/hotspot/share/gc/g1/g1ServiceThread.cpp: jlong delay = TimeHelper::millis_to_counter(delay_ms);`

These should only be heuristics in any case, but I don't understand why
we would want to express a delay/timeout this way.

Cheers,
David
-----

@mlbridge
Copy link

@mlbridge mlbridge bot commented Feb 22, 2021

Mailing list message from David Holmes on hotspot-runtime-dev:

On 22/02/2021 7:03 pm, Stefan Karlsson wrote:

On Sun, 21 Feb 2021 18:46:46 GMT, Robbin Ehn <rehn at openjdk.org> wrote:

Parameter HandshakeTimeout is in milliseconds.
Internally we use nanoseconds.
HandshakeTimeout must be scaled to nanoseconds.

Passes T1

Why are we using os::javaTimeNanos instead os::elapsed_counter()? I think we've tried to limit the usage of to places that explicitly interact with the Java side. For example, the reference processor uses os::javaTimeNanos because it interacts with java.lang.ref.SoftReference.clock. Most other places in the JVM uses os::elapsed_counter().

Runtime code (threading and sync) uses actual time units (ie
javaTimeNanos) rather than os::elapsed_counter(). And elapsed_counter()
uses javaTimeNanos() on all platforms but Windows. It's not really about
interaction with Java, just about semantics. A timeout/delay in "ticks"
would scale with CPU speed if "ticks" presented that, but
os::elapsed_counter() doesn't - so a timeout/delay of N "counts" is
somewhat arbitrary in duration. Conversely, of course, any timeout/delay
expressed in ms or ns is somewhat arbitrary in terms of how much code
may have executed in that time. But these timeouts are heuristics anyway.

Cheers,
David

@stefank
Copy link
Member

@stefank stefank commented Feb 22, 2021

Mailing list message from David Holmes on hotspot-runtime-dev:

On 22/02/2021 7:03 pm, Stefan Karlsson wrote:

On Sun, 21 Feb 2021 18:46:46 GMT, Robbin Ehn wrote:

Parameter HandshakeTimeout is in milliseconds.
Internally we use nanoseconds.
HandshakeTimeout must be scaled to nanoseconds.
Passes T1

Why are we using os::javaTimeNanos instead os::elapsed_counter()? I think we've tried to limit the usage of to places that explicitly interact with the Java side. For example, the reference processor uses os::javaTimeNanos because it interacts with java.lang.ref.SoftReference.clock. Most other places in the JVM uses os::elapsed_counter().

Runtime code (threading and sync) uses actual time units (ie
javaTimeNanos) rather than os::elapsed_counter(). And elapsed_counter()
uses javaTimeNanos() on all platforms but Windows.

I see. In the GC code we've tried to unify towards using one time source (elapsed_counter/elapsedTime/Ticks all use counters), and then convert to nanos/micros/millis when needed. It's a bit unfortunate that we don't use the same time source throughout the JVM.

It's not really about

interaction with Java, just about semantics.

From the GC's point-of-view it is. We use os::javaTimeNanos() just were we are forced to use it because interactions with Java. Other use-cases uses the counter based source.

A timeout/delay in "ticks"

would scale with CPU speed if "ticks" presented that, but
os::elapsed_counter() doesn't - so a timeout/delay of N "counts" is
somewhat arbitrary in duration.

For those cases you wouldn't pass down counter, but use TimeHelper::counter_to_millis (or equivalent).

But OK, given that Runtime already uses os::javaTimeNanos() I'll back off.

@robehn
Copy link
Contributor Author

@robehn robehn commented Feb 22, 2021

I agree with that we should use the same time source.
But here I just want to fixed this bug so the time-out is working as intended on windows.

@mlbridge
Copy link

@mlbridge mlbridge bot commented Feb 22, 2021

Mailing list message from Robbin Ehn on hotspot-runtime-dev:

Hi David,

In a virtualized env. this causes a 20.000 ms timeout to timeout
after 2 s for me due to low resolution (windows only).
(all other platforms hardcoded to NANASECONDS)

We have two more suspective uses here, which I have not
investigated (but we have seem wired timeouts with both ZGC and G1,
but I have seen non with parGC):
`open/src/hotspot/share/gc/z/zMark.cpp:?????
_timeout(_start.value() +
TimeHelper::micros_to_counter(timeout_in_micros))`

`open/src/hotspot/share/gc/g1/g1ServiceThread.cpp:? jlong delay =
TimeHelper::millis_to_counter(delay_ms);`

FYI: I looked at these and they are fine.

These should only be heuristics in any case, but I don't understand
why
we would want to express a delay/timeout this way.

I don't either.

Thanks, Robbin

Cheers,
David
-----

@robehn
Copy link
Contributor Author

@robehn robehn commented Feb 23, 2021

Thanks David and Stefan!

@robehn
Copy link
Contributor Author

@robehn robehn commented Feb 23, 2021

/integrate

@openjdk openjdk bot closed this Feb 23, 2021
@openjdk openjdk bot added integrated and removed ready rfr labels Feb 23, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Feb 23, 2021

@robehn Since your change was applied there have been 27 commits pushed to the master branch:

  • 29c7263: 8252709: Enable JVMCI when building linux-aarch64 at Oracle
  • 12f6ba0: 8262087: Use atomic boolean type in G1FullGCAdjustTask
  • a5c4b9a: 8260223: Handling of unnamed package in javadoc pages
  • 8cfea7c: 8261921: ClassListParser::current should be used only by main thread
  • 991f7c1: 8210373: Deadlock in libj2gss.so when loading "j2gss" and "net" libraries in parallel.
  • 0217d69: 8261975: Missing "classpath exception" in VectorSupport.java
  • f2bde05: 8262097: Improve CompilerConfig ergonomics to fix a VM crash after JDK-8261229
  • aea474c: 8261269: When using clhsdb to "inspect" a java object, clhsdb prints "Oop for..." twice
  • a7e2e80: 8260695: The java.awt.color.ICC_Profile#getData/getData(int) are not thread safe
  • a30fb4f: 8255867: SignatureScheme JSSE property does not preserve ordering in handshake messages
  • ... and 17 more: https://git.openjdk.java.net/jdk/compare/18188c2aa7d68f057d4e08d70944d5694dbe73a2...master

Your commit was automatically rebased without conflicts.

Pushed as commit 9d9bedd.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@robehn robehn deleted the 8262094 branch Feb 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
4 participants