Record memory usage after garbage collection #6963

Merged: 5 commits into open-telemetry:main on Nov 8, 2022

Conversation

@jack-berg (Member) opened this pull request:

Per conversation in #6362.

@jack-berg requested a review from a team as a code owner on October 24, 2022 at 17:27
.buildWithCallback(callback(poolBeans, MemoryPoolMXBean::getUsage, MemoryUsage::getMax));

meter
.upDownCounterBuilder("process.runtime.jvm.memory.usage_after_gc")
@rapphil (Contributor) commented on Oct 25, 2022:

Reporting this metric as an absolute value might not be helpful when you have a fleet of hosts, each running a JVM with a different max heap size. I understand that you wanted to align with the JMX metric conventions, but I think there is more value in a normalized metric. Maybe we can add an extra metric, process.runtime.jvm.memory.usage_after_gc_percentage?

Calculating the percentage after the metric is emitted will not be trivial, or even feasible, in all backends when you want to analyze data from multiple hosts.

@jack-berg (Member, Author) replied:

> Calculating the percentage after the metric is emitted will not be trivial, or even feasible, in all backends when you want to analyze data from multiple hosts.

Is this really a challenge for backends? Doing operations across metrics, like dividing one by another, is pretty standard and probably table stakes for a metrics backend. For example, if you exported these metrics to Prometheus, it's trivial to compute a percentage with:

(process_runtime_jvm_memory_usage_after_gc / process_runtime_jvm_memory_limit) * 100

If we buy this argument, wouldn't we also want to report a percentage for usage and committed as well? And if we report a percentage instead of the absolute value, doesn't that stand to frustrate people on the other side who want to analyze their absolute figures instead of relative percentages?

Another argument against reporting utilization is that we don't actually have the data to report it, because not all memory pools report memory usage after GC, and not all memory pools report a limit. Here are the sets of pool attribute values that are reported for usage, usage_after_gc, and limit for an app running with the G1 garbage collector:

  • process.runtime.jvm.memory.usage reports values for 8 pools: CodeHeap 'non-nmethods', CodeHeap 'non-profiled nmethods', CodeHeap 'profiled nmethods', Compressed Class Space, G1 Eden Space, G1 Old Gen, G1 Survivor Space, Metaspace
  • process.runtime.jvm.memory.usage_after_gc reports values for 3 pools: G1 Eden Space, G1 Old Gen, G1 Survivor Space
  • process.runtime.jvm.memory.limit reports values for 5 pools: CodeHeap 'non-nmethods', CodeHeap 'non-profiled nmethods', CodeHeap 'profiled nmethods', G1 Old Gen

Notice that there aren't usage_after_gc or limit values for all of the pools. Reporting utilization would limit the set of pools to those that have both usage_after_gc and limit values, which with G1 is only G1 Old Gen. The same argument applies to reporting utilization instead of usage.
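
To see this gap in practice, here is a quick, hedged sketch: the snippet below (assumed to run on a JVM started with -XX:+UseG1GC; the class name and output format are made up for illustration) enumerates the memory pool MXBeans and prints whether each one exposes a collection usage and a max, which is exactly the missing data that makes a per-pool utilization metric awkward.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public final class PoolCapabilityDump {

  public static void main(String[] args) {
    for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
      // Usage after GC is only tracked for pools managed by a collector;
      // getCollectionUsage() returns null otherwise.
      MemoryUsage afterGc = pool.getCollectionUsage();
      // getMax() is -1 when the pool has no defined limit.
      long max = pool.getUsage().getMax();

      System.out.printf(
          "%-35s usage_after_gc=%s limit=%s%n",
          pool.getName(),
          afterGc == null ? "n/a" : afterGc.getUsed(),
          max < 0 ? "n/a" : max);
    }
  }
}
```

Only pools where both values are defined could feed a per-pool utilization ratio, which matches the breakdown in the list above.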

@rapphil (Contributor) replied:

> Is this really a challenge for backends? Doing operations across metrics, like dividing one by another, is pretty standard and probably table stakes for a metrics backend. For example, if you exported these metrics to Prometheus, it's trivial to compute a percentage with:

If you have multiple JVMs (say hundreds) reporting different process_runtime_jvm_memory_limit values because each JVM is using a different -Xmx, how do you do that without manually creating a query that matches every process_runtime_jvm_memory_usage_after_gc to its respective process_runtime_jvm_memory_limit? Moreover, what if the attributes that uniquely identify the metrics are not predictable?

Having said that, you made a good point about the lack of consistency in the memory pools. It does not make sense to report a normalized metric per memory pool.

Another reviewer (Member) commented:

> Moreover, what if the attributes that uniquely identify the metrics are not predictable?

If it's not possible to recognize which instance emitted particular metrics, then I'd say your data is nonsense anyway: if you can't correlate metric instruments with each other, or with a particular deployment, the telemetry is basically useless.
I think we should operate under the assumption that resource attributes uniquely identify the metrics source, and if this is not true, that is a broader problem that needs to be fixed across the board.
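
As an aside on that "resource attributes uniquely identify the source" assumption, here is a minimal sketch of what it looks like with the OpenTelemetry Java SDK; the service name, instance id, and exporter choice below are placeholders, not anything prescribed by this PR.

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.exporter.otlp.metrics.OtlpGrpcMetricExporter;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.export.PeriodicMetricReader;
import io.opentelemetry.sdk.resources.Resource;
import java.util.UUID;

public final class ResourceSketch {

  public static SdkMeterProvider meterProvider() {
    // service.name and service.instance.id make each JVM's metrics
    // distinguishable on the backend, so the usage_after_gc and limit series
    // from the same process can be matched up.
    Resource resource =
        Resource.getDefault()
            .merge(
                Resource.create(
                    Attributes.of(
                        AttributeKey.stringKey("service.name"), "example-service",
                        AttributeKey.stringKey("service.instance.id"),
                            UUID.randomUUID().toString())));

    return SdkMeterProvider.builder()
        .setResource(resource)
        .registerMetricReader(
            PeriodicMetricReader.builder(OtlpGrpcMetricExporter.getDefault()).build())
        .build();
  }

  private ResourceSketch() {}
}
```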

@trask pushed a commit that referenced this pull request on Nov 2, 2022:
In the 10/27 Java SIG we discussed that it would be valuable to
enumerate the attributes reported for memory pool and GC metrics when
different GCs are used.

I've gone ahead and added a README for the runtime metrics which
includes detailed information on the attributes reported. Note that I
also have the same data for GC metrics added in #6964 and #6963, but
will wait to add it until those PRs are merged.
@trask added this to the v1.20.0 milestone on Nov 8, 2022
@jack-berg (Member, Author) commented:

FYI, the spec PR corresponding to this has been merged. I've pushed a commit that syncs this PR with the naming in the spec.

@trask enabled auto-merge (squash) on November 8, 2022 at 22:22
@trask merged commit 177d9cd into open-telemetry:main on Nov 8, 2022