Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8030616: sun/management/jmxremote/bootstrap/RmiBootstrapTest fails intermittently with cannot find a free port #10322

Closed
wants to merge 3 commits into from

Conversation

jaikiran
Copy link
Member

@jaikiran jaikiran commented Sep 18, 2022

Can I please get a review of this test only change which proposes to fix the recent intermittent failures in RmiBootstrapTest reported in https://bugs.openjdk.org/browse/JDK-8030616?

The test has been intermittently failing with cannot find a free port after 10 tries. The tests uses the jdk.test.lib.Utils.getFreePort() utility method to get a free port to use in the tests. That port is then used to create and bind a JMX connector server. The issue resides in the fact that the getFreePort utility uses loopback address to identify a free port, whereas the JMX connector server uses that identified port to bind to a non-loopback address - there's logic in sun.rmi.transport.tcp.TCPEndpoint line 117 which calls InetAddress.getLocalHost() which can and does return a non-loopback address. This effectively means that the port that was identified as free (on loopback) may not really be free on (some other address) and a subsequent attempt to bind against it by the connector server will end up with a BindException.

Data collected in failures on the CI system has shown that this is indeed the case and the port that was chosen (for loopback) as free was already used by another process on a different address. The test additionally has logic to attempt retries upto a maximum of 10 times to find a new free port on such BindException. Turns out the next free port (on loopback) returned by jdk.test.lib.Utils.getFreePort() is incremental and it so happens that the systems where this failed had a process listening on a range of 10 to 12 ports, so these too ended up with BindException when the JMX connector server used that port for a different address.

The commit here makes sure that the JMX connector server in the test is now configured to loopback address as the address to bind to. That way the test uses the correct address (loopback) on which the port was free.

The commit also has a change to the javadoc of the test utility jdk.test.lib.Utils.getFreePort() to clarify that it returns a free port on loopback address. This now matches what the implementation of that method does.

Multiple test runs after this change hasn't yet run into the failure.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8030616: sun/management/jmxremote/bootstrap/RmiBootstrapTest fails intermittently with cannot find a free port

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/10322/head:pull/10322
$ git checkout pull/10322

Update a local copy of the PR:
$ git checkout pull/10322
$ git pull https://git.openjdk.org/jdk pull/10322/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 10322

View PR using the GUI difftool:
$ git pr show -t 10322

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/10322.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Sep 18, 2022

👋 Welcome back jpai! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Sep 18, 2022
@openjdk
Copy link

openjdk bot commented Sep 18, 2022

@jaikiran The following labels will be automatically applied to this pull request:

  • core-libs
  • hotspot-runtime
  • jmx
  • serviceability

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added serviceability serviceability-dev@openjdk.org hotspot-runtime hotspot-runtime-dev@openjdk.org jmx jmx-dev@openjdk.org core-libs core-libs-dev@openjdk.org labels Sep 18, 2022
@mlbridge
Copy link

mlbridge bot commented Sep 18, 2022

Webrevs

@msheppar
Copy link

While the change is reasonable, I’m not sure about your rationale.
The contributing factors to the intermittent failures are due to the flawed ephemeral port allocation strategy, that incurs a race condition to use the selected ephemeral port within the test before the OS uses it.
So from the time that the port has been acquired, then released via the autoclose of the serversocket,
and then reused in the test, it is possible that the OS may have also relocated that ephemeral port.

This change doesn’t alter this race condition.

The are a number of significant attributes to the port allocation strategy:
The port allocation strategy acquires an ephemeral port from the OS while binding it to a ServerSocket.
The significance of ServerSocket is it surreptitiously sets SO_REUSEADDR socket option on the server socket (Net.socket0() == Java_sun_nio_ch_Net_socket0). The fact that SO_REUSEADDR is set is important in this scenario.

The objective with using SO_REUSEADDR on a serversocket is allow for efficient server restarts due to
a shutdown for whatever reason. With SO_REUSEADDR set the restarting server, which typically uses a well known port, won’t get a BindException, while the port may still be associated with the “terminating” server i.e. it may not have been fully released by the OS, for general reuse.

Thus, this has a number of favourable consequences with your proposed change, as the address binding is the
same as that used for acquiring the port, so it is essential a “server restart scenario”, albeit with an
Ephemeral port. As such, there still exists the possibility that the port will be reused by the OS, where
conditions of heavy ephemeral port allocation is occurring.

As such, the change may reduce the rate of intermittent failure, but it won’t solve the issue fully.
There is a possibility that it may increase the rate of failure as many network tests now use
Loopback address rather than the local host address i.e. the primary configured IP address.
This will depend on the load on a test system, and the amount of concurrently executing network tests,
with the corresponding high volume of ephemeral port allocation.

Nonetheless, it is a reasonable change, provided there is no subtle change in test semantics. I think it
will ameliorate the intermittent failures.

@jaikiran
Copy link
Member Author

jaikiran commented Sep 19, 2022

Hello Mark, this specific change is merely meant to address the specific test case which is running into this problem. As you note the free port identification logic has and will continue to have the race condition. It will be applicable to all tests that use that utility. This change doesn't propose to do anything for such cases.

However, in this specific case it wasn't the race condition which was causing the issue. Data that was collected from those failures shows that the issue was because of using the wrong address while trying to bind the free port that was identified. The data collected from these failed runs has been added to the linked JBS issue as a comment to show the port usage.

@bridgekeeper
Copy link

bridgekeeper bot commented Oct 17, 2022

@jaikiran This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@jaikiran
Copy link
Member Author

Please keep open. Waiting for reviews on this one.

Mark and me ran some experiments related to this issue and we did conclude that this change will help the issue at hand. @msheppar, please correct me if that's not an accurate summary.

Copy link
Contributor

@sspitsyn sspitsyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks okay to me.
Thanks,
Serguei

@openjdk
Copy link

openjdk bot commented Oct 20, 2022

@jaikiran This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8030616: sun/management/jmxremote/bootstrap/RmiBootstrapTest fails intermittently with cannot find a free port

Reviewed-by: sspitsyn, msheppar

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 8 new commits pushed to the master branch:

  • b37421e: 8295564: Norwegian Nynorsk Locale is missing formatting
  • 6707bfb: 8029633: Raw inner class constructor ref should not perform diamond inference
  • 7bc9692: 8294670: Enhanced switch statements have an implicit default which does not complete normally
  • 95dd376: 8291914: generated constructors are considered compact when they shouldn't
  • 9b97162: 7039014: Confusing error message for method conflict
  • 78dc497: 8294550: Sealed check for casts isn't applied to array components
  • c08ff2c: 8294705: Disable an assertion in test/jdk/java/util/DoubleStreamSums/CompensatedSums.java
  • d5a1521: 8295470: Update openjdk.java.net => openjdk.org URLs in test code

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Oct 20, 2022
@jaikiran
Copy link
Member Author

Hello Serguei,

Thank you for your review. I was about to integrate this when I just noticed that I had unintentionally included a new empty file in this commit.

I've now updated this PR to remove that stray file (and no other changes). Could you please review the current state of this PR once more?

@sspitsyn
Copy link
Contributor

Two reviews are required in the Serviceability area.
So, please, wait for one more.

@jaikiran
Copy link
Member Author

Thank you again Serguei for the review.

Two reviews are required in the Serviceability area.
So, please, wait for one more.

I wasn't aware of that. Will certainly wait. I have triggered some tests in our internal CI system too to make sure the latest runs too are clean.

Copy link

@msheppar msheppar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think we found the change will be benefial

@jaikiran
Copy link
Member Author

Thank you Mark for the review.

The CI tests for this came back fine.

@jaikiran
Copy link
Member Author

/integrate

@openjdk
Copy link

openjdk bot commented Oct 21, 2022

Going to push as commit 8b010e0.
Since your change was applied there have been 19 commits pushed to the master branch:

  • b35922b: 8295714: GHA ::set-output is deprecated and will be removed
  • dfd2d83: 8295657: SA: Allow larger object alignments
  • a345df2: 8280131: jcmd reports "Module jdk.jfr not found." when "jdk.management.jfr" is missing
  • ef62b61: 8295703: RISC-V: Remove implicit noreg temp register arguments in MacroAssembler
  • 6240431: 8295697: Resolve conflicts between serviceability/jvmti and nsk/jvmti shared code
  • 1164258: 8295124: Atomic::add to pointer type may return wrong value
  • d3eba85: 8295414: [Aarch64] C2: assert(false) failed: bad AD file
  • 028e8b3: 8137022: Concurrent refinement thread adjustment and (de-)activation suboptimal
  • faa6b66: 8295715: Minimize disabled warnings in serviceability libs
  • de1e0c5: 8295719: Remove unneeded disabled warnings in jdk.sctp
  • ... and 9 more: https://git.openjdk.org/jdk/compare/9d0cfd1130b63f7acd67a52eb35c1ec38d43e514...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Oct 21, 2022
@openjdk openjdk bot closed this Oct 21, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Oct 21, 2022
@openjdk
Copy link

openjdk bot commented Oct 21, 2022

@jaikiran Pushed as commit 8b010e0.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@jaikiran jaikiran deleted the 8030616 branch October 21, 2022 08:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core-libs core-libs-dev@openjdk.org hotspot-runtime hotspot-runtime-dev@openjdk.org integrated Pull request has been integrated jmx jmx-dev@openjdk.org serviceability serviceability-dev@openjdk.org
Development

Successfully merging this pull request may close these issues.

3 participants