8203359: Container level resources events #3126

Closed
jbachorik wants to merge 12 commits

Conversation

jbachorik

@jbachorik jbachorik commented Mar 22, 2021

With this change it becomes possible to surface various cgroup level metrics (available via jdk.internal.platform.Metrics) as JFR events.

Only a subset of the metrics exposed by jdk.internal.platform.Metrics is turned into JFR events to start with.

  • CPU related metrics
  • Memory related metrics
  • I/O related metrics

For each of those subsystems, configuration data will be emitted as well. The initial proposal is to emit the configuration data events at least once per chunk and the metrics values at 30-second intervals.
With these values the emitted events seem to contain useful information without adding noticeable overhead (the metrics values are read from the /proc filesystem, so that should not be done too frequently).
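
A minimal sketch of how a recording could request that emission schedule through the public jdk.jfr API. The event names are assumptions derived from the `jdk.Container` prefix and the class names mentioned in this thread (ContainerConfigurationEvent, ContainerMemoryUsageEvent); the helper class itself is hypothetical.

```java
import jdk.jfr.Recording;
import java.time.Duration;

final class ContainerEventSettings {
    // Assumed event names; check the event metadata of your JDK before relying on them.
    static final String CONFIGURATION = "jdk.ContainerConfiguration";
    static final String MEMORY_USAGE = "jdk.ContainerMemoryUsage";

    // Enables the configuration event with its default period and the
    // usage event with an explicit period (30 s in the proposal above).
    static Recording configure(Recording recording, Duration usagePeriod) {
        recording.enable(CONFIGURATION);
        recording.enable(MEMORY_USAGE).withPeriod(usagePeriod);
        return recording;
    }
}
```

Enabling an event by name only stores settings on the recording, so the call is harmless on a JDK or platform where the event type does not exist.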


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/3126/head:pull/3126
$ git checkout pull/3126

Update a local copy of the PR:
$ git checkout pull/3126
$ git pull https://git.openjdk.java.net/jdk pull/3126/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 3126

View PR using the GUI difftool:
$ git pr show -t 3126

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/3126.diff

@bridgekeeper

bridgekeeper bot commented Mar 22, 2021

👋 Welcome back jbachorik! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Mar 22, 2021

@jbachorik The following labels will be automatically applied to this pull request:

  • core-libs
  • hotspot-jfr

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the core-libs, hotspot-jfr and rfr labels Mar 22, 2021
@mlbridge

mlbridge bot commented Mar 22, 2021

@jerboaa
Contributor

jerboaa commented Mar 24, 2021

@jbachorik Would it make sense for ContainerConfigurationEvent to include the underlying cgroup version info (v1 or legacy vs. v2 or unified)? Metrics.getProvider() should give that info.

@egahlin
Member

egahlin commented Mar 25, 2021

Does each getter call result in parsing /proc, or are things aggregated over several calls or hooks?

Do you have any data on how expensive the invocations are?

You could, for example, try to measure it by temporarily making the events durational, and fetch the values between begin() and end(), and perhaps show a 'jfr print --events Container* recording.jfr' printout.
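
The measurement idea above could be sketched roughly as follows. It uses only the public jdk.jfr API; the event `demo.ContainerProbe`, its field, and the `readUsage` supplier are hypothetical stand-ins for the real container events and for a call into jdk.internal.platform.Metrics.

```java
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;
import java.util.function.LongSupplier;

final class DurationalProbe {
    // Hypothetical event; not one of the events added by this PR.
    @Name("demo.ContainerProbe")
    @Label("Container Probe")
    static class ProbeEvent extends Event {
        @Label("Memory Usage")
        long memoryUsage;
    }

    // Wraps one metric read in begin()/end() so that the event duration
    // reflects the cost of the read. Returns the value that was read.
    static long measure(LongSupplier readUsage) {
        ProbeEvent event = new ProbeEvent();
        event.begin();
        long value = readUsage.getAsLong();
        event.end();
        event.memoryUsage = value;
        event.commit(); // does nothing unless a running recording has the event enabled
        return value;
    }
}
```

With a recording running, the duration of each committed event then appears in the `jfr print` output next to the field values.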

If possible, it would be interesting to get some idea about the startup cost as well.

If not too much overhead, I think it would be nice to skip the "flag" in the .jfcs, and always record the events in a container environment.

I know there is a way to test JFR using Docker, maybe @mseledts could provide information? Some sanity tests would be good to have.

@jerboaa
Contributor

jerboaa commented Mar 26, 2021

Does each getter call result in parsing /proc, or are things aggregated over several calls or hooks?

From the looks of it, the event emitting code uses the Metrics.java interface for retrieving the info. Each call to a method exposed by Metrics results in file IO on some cgroup (v1 or v2) interface file(s) in /sys/fs/.... I don't see any aggregation being done.
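
To illustrate what "file IO on every call" means, here is a minimal sketch of an uncached read of a single-value interface file. It is not the actual Metrics implementation; the class and method names are hypothetical, and the file to read is passed in by the caller. The handling of the literal `max` assumes the cgroup v2 convention for "no limit".

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class InterfaceFileReader {
    // Reads and parses the file on every call; nothing is cached or aggregated.
    // Returns -1 for the literal "max" (assumed to mean "unlimited").
    static long readLongValue(Path file) throws IOException {
        String text = Files.readString(file).trim();
        if (text.equals("max")) {
            return -1L;
        }
        return Long.parseLong(text);
    }
}
```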

On the hotspot side, we implemented some caching for frequent calls (JDK-8232207, JDK-8227006), but we didn't do that yet for the Java side since there wasn't any need (so far). If calls are becoming frequent with this it should be reconsidered.

So +1 on getting some data on what the perf penalty of this is.

@openjdk openjdk bot added and removed the rfr label Apr 1, 2021
@jbachorik
Author

jbachorik commented Apr 1, 2021

Thanks to all for chiming in!

I have added the tests to test/hotspot/jtreg/containers/docker/TestJFREvents.java where there already were some templates for the container event data.

As for the performance - as expected, extracting the data from /proc is not exactly cheap. On my test c5.4xlarge instance I am getting an average wall-clock time to generate the usage/throttling events (one instance of each) of ~15ms.
I would argue that 15ms per 30s (the default emission period for those events) might be acceptable to start with.
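
As a rough sanity check of that argument, assuming the ~15 ms measured above is representative and is paid once per 30 s period:

$$\frac{15\ \text{ms}}{30\,000\ \text{ms}} = 5 \times 10^{-4} = 0.05\%$$

This is a share of wall-clock time, since that is what was measured.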

Caching of cgroups parsed data would help if the emission period is shorter than the cache TTL. This is exacerbated by the fact that (almost) each container event type requires data from a different cgroups control file - hence the data will not be shared between the event type instances even if cached. Realistically, caching benefits would become visible only for sub-second emission periods.
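
A minimal sketch of the kind of TTL cache being discussed, to make the trade-off concrete. This is not the HotSpot caching mentioned earlier in the thread and not a proposed change to Metrics; `TtlCache`, its constructor parameters, and the injectable clock are all hypothetical.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class TtlCache<T> {
    private final Supplier<T> loader;      // e.g. a read of one cgroup control file
    private final LongSupplier nanoClock;  // injectable so the behavior can be tested
    private final long ttlNanos;
    private T value;
    private long loadedAt;
    private boolean loaded;

    TtlCache(Supplier<T> loader, long ttlNanos, LongSupplier nanoClock) {
        this.loader = loader;
        this.ttlNanos = ttlNanos;
        this.nanoClock = nanoClock;
    }

    // Returns the cached value, reloading it once the TTL has elapsed.
    synchronized T get() {
        long now = nanoClock.getAsLong();
        if (!loaded || now - loadedAt >= ttlNanos) {
            value = loader.get();
            loadedAt = now;
            loaded = true;
        }
        return value;
    }
}
```

With a 30 s emission period and a TTL well below that, every emission finds the entry expired and reloads it, which is why such a cache would only pay off for much shorter periods.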

If the caching is still required I would suggest having a follow up ticket just for that - it will require setting up some benchmarks to justify the changes that would need to be done in the metrics implementation.

@jbachorik
Author

I tried to measure the startup regression and here are my observations:

  • Startup is not affected unless the application is started with JFR
  • The extra events and hooks take ~5ms on my work machine
  • It is possible not to register those events and hooks in a non-container env - then the overhead is 20-50us, which is the time it takes to figure out whether we are running in a container

In order to minimize the effect this change will have on the startup I would suggest using conditional registration unless I hear strong objections to that.

@openjdk openjdk bot added and removed the rfr label Apr 2, 2021
@egahlin
Member

egahlin commented Apr 21, 2021

I wonder if something similar to below could be added to jdk.jfr.internal.Utils:

private static Metrics[] metrics;
public static Metrics getMetrics() {
    if (metrics == null) {
        metrics = new Metrics[] { Metrics.systemMetrics() };
    }
    return metrics[0];
}

public static boolean shouldSkipBytecode(String eventName, Class<?> superClass) {
    if (superClass != AbstractJDKEvent.class) {
        return false;
    }
    if (!eventName.startsWith("jdk.Container")) {
        return false;
    }
    return getMetrics() == null;
}

Then we could add checks to jdk.jfr.internal.JVMUpcalls::bytesForEagerInstrumentation(...)

eventName = ei.getEventName();
if (Utils.shouldSkipBytecode(eventName, superClass)) {
    return oldBytes;
}

and jdk.jfr.internal.JVMUpcalls::onRetransform(...)

if (jdk.internal.event.Event.class.isAssignableFrom(clazz) && !Modifier.isAbstract(clazz.getModifiers())) {
    if (Utils.shouldSkipBytecode(clazz.getName(), clazz.getSuperclass())) {
        return oldBytes;
    }

This way we would not pay for generating bytecode for events in a non-container environment.

Not sure if it works, but could perhaps make startup faster? We would still pay for generating the event handlers during registration, but it's much trickier to avoid since we need to store the event type somewhere.

@jbachorik
Author

@egahlin Sounds good.
Any particular reason you are using a Metrics[] array?


private static Metrics getMetrics() {
    if (metrics == null) {
        metrics = Metrics.systemMetrics();
Member

Will this not lead to a lookup every time in a non-container environment?

Author

Yes. Now I see why you used Metrics[] - will fix.
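
The point of the one-element array can be shown in isolation: the array reference records that the lookup has happened, even when the lookup returned null. The sketch below is a generic, hypothetical helper (`NullableLazy`), not the code that went into jdk.jfr.internal.Utils, and like the suggestion above it is not synchronized.

```java
import java.util.function.Supplier;

final class NullableLazy<T> {
    private final Supplier<T> lookup;
    // null means "not looked up yet"; a non-null holder may contain a null result.
    private Object[] holder;

    NullableLazy(Supplier<T> lookup) {
        this.lookup = lookup;
    }

    @SuppressWarnings("unchecked")
    T get() {
        if (holder == null) {
            holder = new Object[] { lookup.get() };
        }
        return (T) holder[0];
    }
}
```

With a plain field and a `== null` check, a lookup that legitimately returns null (the non-container case) would be repeated on every call.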

@@ -719,6 +725,20 @@ public static String formatDuration(Duration d) {
}
}

    public static boolean shouldSkipBytecode(String eventName, Class<?> superClass) {
        if (!superClass.getName().equals("jdk.jfr.events.AbstractJDKEvent")) {
Member

Was there a problem checking against the class instance? If so, perhaps you could add a check that the class is in the boot class loader (null).

Author

Yes, AbstractJDKEvent is package private so it is not accessible from here.
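
A sketch of the combined check suggested in the review: compare by name because the class literal is not accessible, and additionally require the boot class loader so that a user class with the same name cannot match. The helper name `isBootClassNamed` is hypothetical; `Class.getClassLoader()` returning null for classes defined by the boot loader is standard behavior.

```java
final class BootClassCheck {
    // True if clazz has the expected fully qualified name and was defined by
    // the boot class loader, which getClassLoader() reports as null.
    static boolean isBootClassNamed(Class<?> clazz, String expectedName) {
        return clazz != null
                && clazz.getClassLoader() == null
                && clazz.getName().equals(expectedName);
    }
}
```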

@jerboaa
Contributor

jerboaa commented Apr 27, 2021

@jbachorik Has this been tested on cgroups v1 and cgroups v2 Linux systems?

Contributor

@jerboaa jerboaa left a comment

@jbachorik The test needs fixing.

test/hotspot/jtreg/containers/docker/TestJFREvents.java Outdated Show resolved Hide resolved
@jerboaa
Contributor

jerboaa commented Apr 27, 2021

@jbachorik Has this been tested on cgroups v1 and cgroups v2 Linux systems?

OK. I've tested the latest iteration on both (cgroup v2 and cgroup v1). Testing looks good other than the memoryPressure issue.

Member

@egahlin egahlin left a comment

Looks good, but if there are test issues they should be fixed.

@openjdk

openjdk bot commented May 19, 2021

@jbachorik This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8203359: Container level resources events

Reviewed-by: sgehwolf, egahlin

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 161 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label May 19, 2021
@Label("Memory Pressure")
@Description("(attempts per second * 1000), if enabled, that the operating system tries to satisfy a memory request for any " +
        "process in the current container when no free memory is readily available.")
public double memoryPressure;
Contributor

Should this memoryPressure field be removed from the ContainerMemoryUsageEvent class? It's not set anywhere, is it? It would be a cgroup v1 only API, so I'm not sure it should be there for a generic event like this.

Author

Yes. Removing.

@jbachorik
Author

Thanks for the review!
I've fixed the outstanding test failures and the patch is in its final form.

Contributor

@jerboaa jerboaa left a comment

LGTM

@jbachorik
Author

@egahlin Unfortunately, I had to make one late change in the periodic event hook registration.
If the events are registered conditionally, only when running in a container, the event metadata is not correct and the TestDefaultConfigurations.java test fails. When I register the hooks unconditionally, the metadata is correctly generated and the test passes.
I will hold off integration until I hear back from you whether this is acceptable or I should try to find an alternative solution.

@egahlin
Member

egahlin commented May 19, 2021

@egahlin Unfortunately, I had to make one late change in the periodic event hook registration.
If the events are registered conditionally, only when running in a container, the event metadata is not correct and the TestDefaultConfigurations.java test fails. When I register the hooks unconditionally, the metadata is correctly generated and the test passes.
I will hold off integration until I hear back from you whether this is acceptable or I should try to find an alternative solution.

It's not unfortunate :-)

I think we should always register the metadata, even if you can't get the event.

That's how we handle different GCs. Users must always be able to explore events. For example, you should be able to configure container events in JMC (with correct labels/descriptions) without actually connecting to a JVM running in a Docker container.

I think you need to add the hook, for the event metadata to be correct. Otherwise, the "period" setting will not show up.
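
The pattern described here, registering the periodic hook unconditionally and letting the hook decide whether there is anything to emit, could be sketched with the public jdk.jfr API as follows. The event `demo.ContainerUsage`, its field, and the `metrics` supplier (returning null outside a container) are hypothetical; the JDK-internal events are registered through internal code paths that this sketch does not reproduce.

```java
import jdk.jfr.Event;
import jdk.jfr.FlightRecorder;
import jdk.jfr.Label;
import jdk.jfr.Name;
import jdk.jfr.Period;
import java.util.function.Supplier;

final class AlwaysRegistered {
    // Hypothetical periodic event with a default period of 30 s.
    @Name("demo.ContainerUsage")
    @Label("Container Usage (demo)")
    @Period("30 s")
    static class UsageEvent extends Event {
        @Label("Memory Usage")
        long memoryUsage;
    }

    // Emits one event if a metrics value is available; returns whether it did.
    static boolean emitIfAvailable(Supplier<Long> metrics) {
        Long usage = metrics.get();
        if (usage == null) {
            return false; // not in a container: nothing to report
        }
        UsageEvent event = new UsageEvent();
        event.memoryUsage = usage;
        event.commit();
        return true;
    }

    // Registers the hook regardless of the environment, so that the event
    // type and its settings can be explored and configured everywhere.
    static void register(Supplier<Long> metrics) {
        FlightRecorder.addPeriodicEvent(UsageEvent.class, () -> emitIfAvailable(metrics));
    }
}
```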

@jbachorik
Author

I think you need to add the hook, for the event metadata to be correct. Otherwise, the "period" setting will not show up.

Yes. The failed test log also indicates that the rest of the metadata is not in good shape. But with the hook registered everything works fine.

@jbachorik
Author

/integrate

@openjdk openjdk bot closed this May 21, 2021
@openjdk openjdk bot added the integrated label and removed the ready and rfr labels May 21, 2021
@openjdk

openjdk bot commented May 21, 2021

@jbachorik Since your change was applied there have been 186 commits pushed to the master branch:

  • b5d32bb: 8260690: JConsole User Guide Link from the Help menu is not accessible by keyboard
  • e48d7d6: 8264218: Public method javax.swing.JMenu.setComponentOrientation() has no spec
  • 9eaa4af: 8267056: tools/jpackage/share/RuntimePackageTest.java fails with NoSuchFileException
  • e094f3f: 8266856: Make element void
  • 7a63ff7: 8267370: [Vector API] Fix several crashes after JDK-8256973
  • 83b3607: 8266642: improve ResolvedMethodTable hash function
  • 1c7a131: 8267350: Archived old interface extends interface with default method causes crash
  • 005d8a7: 8256372: [macos] Unexpected symbol was displayed on JTextField with Monospaced font
  • 81f39ed: 8261205: AssertionError: Cannot add metadata to an intersection type
  • 7b98400: 8267348: Rewrite gc/epsilon/TestClasses.java to use Metaspace with less classes
  • ... and 176 more: https://git.openjdk.java.net/jdk/compare/3af4efdfcfbbb52d38415374083c66c9e7b22604...master

Your commit was automatically rebased without conflicts.

Pushed as commit ee2651b.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
