Continuous profiler #3196

Merged
merged 44 commits into open-telemetry:main from continuous-profiler on Jan 11, 2024

Conversation

Kielek
Contributor

@Kielek Kielek commented Dec 21, 2023

Why

Towards #3074

What

Implementation of an extensible continuous profiler for thread and allocation sampling. It is a donation of Splunk code from the SignalFx repository with some dedicated adjustments.

Separate comments below describe the thread sampling and the allocation sampling features.

As there is no OTel-common way to export these data, I would like to keep this documentation only in the PR.
The changelog is also omitted. There is a plan to address it in the future, when open-telemetry/oteps#239 or open-telemetry/oteps#237 are ready and merged.

Tests

CI + testing with Splunk exporter implemented by plugin: signalfx/splunk-otel-dotnet#393

Checklist

  • [ ] CHANGELOG.md is updated.
  • [x] Documentation is updated. Documentation is included only in this PR.
  • [x] New features are covered by tests.

How to review

As the PR is huge, my recommendation is to start with the plugin implementation and the exporter itself (TestApplication folder). Then check which code is calling it. The last part is verification of the native code, which does most of the tricks.

If needed, I will be happy to do a real-time peer review. I have already done it with @pellared and we found a couple of things to improve.

Notes

While merging, please add the following co-author trailers to the commit message:

Co-authored-by: John Bley <jbley@splunk.com>
Co-authored-by: Paulo Janotti <pjanotti@splunk.com>
Co-authored-by: Robert Pająk <rpajak@splunk.com>
Co-authored-by: Mateusz Łach <mateusza@splunk.com>
Co-authored-by: Dawid Szmigielski <dszmigielski@splunk.com>

@Kielek
Contributor Author

Kielek commented Dec 27, 2023

About the Continuous Profiling - Thread sampling

Thread sampling can be enabled by a custom plugin.
The plugin is responsible for parsing the dense data format and exporting it in the appropriate format.

How does the thread sampler work?

The profiler leverages .NET profiling to perform periodic call stack sampling. For every sampling period, the runtime is suspended, the samples for all managed threads are saved into a buffer, and then the runtime resumes.

A separate managed thread processes the data from the buffer and exports it in the way defined by the plugin.

To make the process more efficient, the sampler uses two independent buffers to store samples alternately.
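
For illustration, a minimal C# sketch of the alternating-buffer idea follows. All names here are hypothetical; the actual buffers are managed by the native sampler, so this only shows the concept.

using System.Collections.Generic;

// Hypothetical sketch of the alternating-buffer idea; not the actual implementation.
internal sealed class AlternatingSampleBuffers
{
    private readonly List<byte[]> _bufferA = new();
    private readonly List<byte[]> _bufferB = new();
    private readonly object _lock = new();
    private bool _writeToA = true;

    // Called while the runtime is suspended: store one encoded sample.
    public void Write(byte[] sample)
    {
        lock (_lock)
        {
            (_writeToA ? _bufferA : _bufferB).Add(sample);
        }
    }

    // Called by the exporting thread: switch the write target and drain the previously active buffer.
    public byte[][] SwapAndDrain()
    {
        lock (_lock)
        {
            var full = _writeToA ? _bufferA : _bufferB;
            _writeToA = !_writeToA;
            var drained = full.ToArray();
            full.Clear();
            return drained;
        }
    }
}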

Requirements

  • .NET 6.0 or higher (ICorProfilerInfo12 available in the runtime). Technically it could use ICorProfilerInfo10 on .NET Core 3.1/.NET 5.0, but these versions are not supported by OpenTelemetry .NET AutoInstrumentation.
  • .NET Framework is not supported. Neither ICorProfilerInfo10 nor ICorProfilerInfo12 is available in .NET Framework.

Enable the profiler

Implement a custom plugin.

Configuration settings by plugin

var threadSamplingEnabled = true;
var threadSamplingInterval = 10000u; // in ms. Splunk uses 10000. Values lower than 1000 are not allowed.
var exportInterval = TimeSpan.FromMilliseconds(500); // interval to read data from the buffers and call the exporter, common for thread and allocation sampling
object continuousProfilerExporter = new ConsoleExporter(); // exporter, common for thread and allocation sampling
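
As an illustration only, a plugin could surface these settings roughly as below. The method name and return shape are assumptions; check the TestApplication plugin in this PR for the exact contract expected by the instrumentation.

using System;

public class ContinuousProfilerPlugin
{
    // Assumed entry point; verify the real signature against the plugin shipped in this PR.
    public Tuple<bool, uint, bool, uint, TimeSpan, object> GetContinuousProfilerConfiguration()
    {
        var threadSamplingEnabled = true;
        var threadSamplingInterval = 10000u;                      // ms, must not be lower than 1000
        var allocationSamplingEnabled = false;                    // thread sampling only in this example
        var maxMemorySamplesPerMinute = 200u;
        var exportInterval = TimeSpan.FromMilliseconds(500);
        object continuousProfilerExporter = new ConsoleExporter();

        return Tuple.Create(
            threadSamplingEnabled,
            threadSamplingInterval,
            allocationSamplingEnabled,
            maxMemorySamplesPerMinute,
            exportInterval,
            continuousProfilerExporter);
    }
}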

Escape hatch

The profiler limits its own behavior when both buffers used to store sampled data are full.

This scenario might happen when the data processing thread is not able
to export the data within the given period of time.

The thread sampler resumes when either of the buffers is empty.
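
A minimal illustration of this escape-hatch decision follows; the names are hypothetical and the real check lives in the native sampler.

internal static class EscapeHatch
{
    // Hypothetical sketch of the escape-hatch check; not the actual implementation.
    public static bool ShouldSkipSamplePeriod(bool bufferAFull, bool bufferBFull)
    {
        // Skip the whole sampling period when both buffers are full; sampling
        // resumes as soon as either buffer has been drained by the exporting thread.
        return bufferAFull && bufferBFull;
    }
}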

Troubleshooting the .NET profiler

How do I know if it's working?

At startup, the OpenTelemetry .NET Automatic Instrumentation logs the string ContinuousProfiler::StartThreadSampling at the info log level.

You can grep for this in the native logs for the instrumentation to see something like this:

10/12/22 12:10:31.962 PM [12096|22036] [info] ContinuousProfiler::StartThreadSampling

How can I see the Continuous Profiling configuration?

The OpenTelemetry .NET Automatic Instrumentation logs the profiling configuration
at the Debug log level during startup. You can grep for the string Continuous profiling configuration:
to see the configuration.

What does the escape hatch do?

The escape hatch automatically discards profiling data
if the ingest limit has been reached.

If the escape hatch activates, it logs the following message:

Skipping a thread sample period, buffers are full.

You can also look for:

** THIS WILL RESULT IN LOSS OF PROFILING DATA **.

If you see these log messages, check the exporter implementation.

What if I'm on an unsupported .NET version?

No .NET Framework version is supported. You have to switch to a supported .NET version.

Can I tell the sampler to ignore some threads?

There is no such functionality. All managed threads are captured by the profiler.

@Kielek
Contributor Author

Kielek commented Jan 2, 2024

About Continuous memory profiling for .NET

The profiler samples allocations, captures the call stack state of the .NET thread that triggered the allocation, and exports it in an appropriate format.

Use the memory allocation data, together with the stack traces and .NET runtime metrics, to investigate memory leaks and unusual consumption patterns in Continuous Profiling.

How does the memory profiler work?

The profiler leverages .NET profiling to perform allocation sampling.
For every sampled allocation, the allocation amount, together with the stack trace of the thread that triggered the allocation and the associated span context, is saved into a buffer.

The managed thread shared with the CPU profiler processes the data from the buffer and exports it in the way defined by the plugin.
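
As a rough illustration of what each sample carries, a hypothetical shape is sketched below; the field names are assumptions, and the actual data is written as a dense binary format that the plugin's exporter decodes.

// Hypothetical shape of a single allocation sample; not the actual wire format.
internal readonly record struct AllocationSample(
    long AllocatedBytes,      // allocation amount reported by the runtime
    string[] StackFrames,     // call stack of the thread that triggered the allocation
    string TraceId,           // associated span context, if a span was active
    string SpanId);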

Requirements

  • .NET 6.0 or higher (ICorProfilerInfo12 available in the runtime). Technically it could also work on .NET 5, which is not supported by OpenTelemetry .NET AutoInstrumentation or Microsoft.

Enable the profiler

Implement a custom plugin.

Configuration settings by the plugin

threadSamplingEnabled, threadSamplingInterval, allocationSamplingEnabled, maxMemorySamplesPerMinute, exportInterval, continuousProfilerExporter

var allocationSamplingEnabled = true;
var maxMemorySamplesPerMinute = 200; // minimum value: 1, Splunk uses 200 by default
var exportInterval = TimeSpan.FromMilliseconds(500); // interval to read data from the buffers and call the exporter, common for thread and allocation sampling
object continuousProfilerExporter = new ConsoleExporter(); // exporter, common for thread and allocation sampling
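
For illustration, a bare-bones exporter sketch consuming the raw buffers is shown below. The method names and parameters are assumptions modelled loosely on the TestApplication's ConsoleExporter; verify the exact contract against the plugin code in this PR.

using System;

public class ConsoleExporter
{
    // The instrumentation is assumed to hand over the raw buffer and the number of
    // bytes written; a real exporter would decode the dense format and send the
    // samples to a backend.
    public void ExportThreadSamples(byte[] buffer, int read)
    {
        Console.WriteLine($"Thread samples buffer: {read} bytes");
    }

    public void ExportAllocationSamples(byte[] buffer, int read)
    {
        Console.WriteLine($"Allocation samples buffer: {read} bytes");
    }
}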

Escape hatch

The profiler limits its own behavior when the buffer
used to store allocation samples is full.

The current maximum size of the buffer is 200 KiB.

This scenario might happen when the data processing thread is not able
to export the data via the plugin in the given timeframe.

Troubleshooting the .NET profiler

How do I know if it's working?

At startup, the OpenTelemetry .NET Automatic Instrumentation logs the string
ContinuousProfiler::MemoryProfiling started at the info log level.

You can grep for this in the native logs for the instrumentation
to see something like this:

10/12/23 12:10:31.962 PM [12096|22036] [info] ContinuousProfiler::MemoryProfiling started.

How can I see the Continuous Profiling configuration?

The OpenTelemetry .NET Automatic Instrumentation logs the profiling configuration
at the Debug log level during startup. You can grep for the string Continuous profiling configuration:
to see the configuration.

What does the escape hatch do?

The escape hatch automatically discards captured allocation data
if the ingest limit has been reached.

If the escape hatch activates, it logs the following message:

Discarding captured allocation sample. Allocation buffer is full.

If you see these log messages, check the configuration and communication layer
between your process and the Collector.

What if I'm on an unsupported .NET version?

No .NET Framework version is supported. You have to switch to a supported .NET version.

@rajkumar-rangaraj
Contributor

nit: I recommend placing a launchSettings.json file with environment variables in the test/test-applications/integrations/TestApplication.ContinuousProfiler/ app. This will make it easier for people to test this project.

@Kielek
Contributor Author

Kielek commented Jan 4, 2024

nit: I recommend placing a launchSettings.json file with environment variables in the test/test-applications/integrations/TestApplication.ContinuousProfiler/ app. This will make it easier for people to test this project.

Done in 2b957d1

Contributor

@RassK RassK left a comment

LGTM

nit: may need to recheck native formatting rules, loads of missing lines, different styles etc. Too much to deal with in this branch.

return;
}

pdvEventsLow |= COR_PRF_MONITOR_THREADS | COR_PRF_ENABLE_STACK_SNAPSHOT;
Contributor

Are these flags needed if only allocation sampling is enabled?

Contributor Author

Ref: https://learn.microsoft.com/en-us/dotnet/framework/unmanaged-api/profiling/cor-prf-monitor-enumeration

  • COR_PRF_MONITOR_THREADS - needed to receive thread events, which are required to correlate stack traces with traces/spans. Needed for both allocation and thread sampling.
  • COR_PRF_ENABLE_STACK_SNAPSHOT - enables calls to the DoStackSnapshot method, which is executed for both thread and allocation sampling.

Member

Wouldn't it be worth adding a comment in code?

@@ -137,6 +167,12 @@ public static void Initialize()
Logger.Information("Initialized lazily-loaded metric instrumentations without initializing sdk.");
}
}
#if NET6_0_OR_GREATER
if (profilerEnabled && (threadSamplingEnabled || allocationSamplingEnabled))
Contributor

Is there a reason this is done last? Could this be moved to the earlier conditional blocks?

Contributor Author

Technically yes, but for traces and metrics we would like to capture all possible data (at least for the AlwaysOn sampler).
For the profiler, we are heavily dropping data by sampling only once per second(s), so dropping data is acceptable, both at the beginning and at the end of the process.

Contributor

@zacharycmontoya zacharycmontoya left a comment

Overall LGTM

@Kielek Kielek mentioned this pull request Jan 10, 2024
Kielek and others added 2 commits January 10, 2024 15:44
@Kielek
Contributor Author

Kielek commented Jan 11, 2024

I am merging this PR as we agreed yesterday. Please let me know if you have any other comments.

@Kielek Kielek merged commit 349fbb1 into open-telemetry:main Jan 11, 2024
31 checks passed
@Kielek Kielek deleted the continuous-profiler branch January 11, 2024 05:48