Continuous profiler #3196

Merged
merged 44 commits into open-telemetry:main from continuous-profiler on Jan 11, 2024

Conversation

Kielek
Contributor

@Kielek Kielek commented Dec 21, 2023

Why

Towards #3074

What

Implementation of an extensible continuous profiler for thread and allocation sampling. It is a donation of Splunk code from the SignalFx repository with some dedicated adjustments.

Separate comments below describe the thread sampling and the allocation sampling features.

As there is no OTel-common way to export these data, I would like to keep this documentation only in the PR.
The changelog is also omitted. There is a plan to address it in the future, when open-telemetry/oteps#239 or open-telemetry/oteps#237 are ready and merged.

Tests

CI + testing with Splunk exporter implemented by plugin: signalfx/splunk-otel-dotnet#393

Checklist

  • [ ] CHANGELOG.md is updated.
  • [x] Documentation is updated. Documentation is included only in this PR.
  • [x] New features are covered by tests.

How to review

As the PR is huge, my recommendation is to start with the plugin implementation and the exporter itself (TestApplication folder). Then check which code is calling it. The last part is verification of the native code, which does most of the tricks.

If needed, I will be happy to do a real-time peer review. I have already done it with @pellared and we found a couple of things to improve.

Notes

While merging, please add the following co-author trailers to the commit message:

Co-authored-by: John Bley <jbley@splunk.com>
Co-authored-by: Paulo Janotti <pjanotti@splunk.com>
Co-authored-by: Robert Pająk <rpajak@splunk.com>
Co-authored-by: Mateusz Łach <mateusza@splunk.com>
Co-authored-by: Dawid Szmigielski <dszmigielski@splunk.com>

@Kielek
Contributor Author

Kielek commented Dec 27, 2023

About the Continuous Profiling - Thread sampling

Thread sampling can be enabled by a custom plugin.
The plugin is responsible for parsing the dense data format and exporting it in the appropriate format.

How does the thread sampler work?

The profiler leverages .NET profiling to perform periodic call stack sampling. For every sampling period, the runtime is suspended, the samples for all managed threads are saved into a buffer, and then the runtime resumes.

A separate managed thread processes the data from the buffer and exports it in the way defined by the plugin.

To make the process more efficient, the sampler uses two independent buffers to store samples alternately.
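
For illustration, a minimal C# sketch of the alternating-buffer idea follows. All names here are hypothetical; the actual buffers are managed by the native sampler, so this only shows the concept.

using System.Collections.Generic;

// Hypothetical sketch of the alternating-buffer idea; not the actual implementation.
internal sealed class AlternatingSampleBuffers
{
    private readonly List<byte[]> _bufferA = new();
    private readonly List<byte[]> _bufferB = new();
    private readonly object _lock = new();
    private bool _writeToA = true;

    // Called while the runtime is suspended: store one encoded sample.
    public void Write(byte[] sample)
    {
        lock (_lock)
        {
            (_writeToA ? _bufferA : _bufferB).Add(sample);
        }
    }

    // Called by the exporting thread: switch the write target and drain the previously active buffer.
    public byte[][] SwapAndDrain()
    {
        lock (_lock)
        {
            var full = _writeToA ? _bufferA : _bufferB;
            _writeToA = !_writeToA;
            var drained = full.ToArray();
            full.Clear();
            return drained;
        }
    }
}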

Requirements

  • .NET 6.0 or higher (ICorProfilerInfo12 available in the runtime). Technically it could use ICorProfilerInfo10 on .NET Core 3.1/.NET 5.0, but these versions are not supported by OpenTelemetry .NET AutoInstrumentation.
  • .NET Framework is not supported. Neither ICorProfilerInfo10 nor ICorProfilerInfo12 is available in .NET Framework.

Enable the profiler

Implement a custom plugin.

Configuration settings by plugin

var threadSamplingEnabled = true;
var threadSamplingInterval = 10000u; // in ms. Splunk uses 10000. Values lower than 1000 are not allowed.
var exportInterval = TimeSpan.FromMilliseconds(500); // interval to read data from the buffers and call the exporter, common for thread and allocation sampling
object continuousProfilerExporter = new ConsoleExporter(); // exporter, common for thread and allocation sampling
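
As an illustration only, a plugin could surface these settings roughly as below. The method name and return shape are assumptions; check the TestApplication plugin in this PR for the exact contract expected by the instrumentation.

using System;

public class ContinuousProfilerPlugin
{
    // Assumed entry point; verify the real signature against the plugin shipped in this PR.
    public Tuple<bool, uint, bool, uint, TimeSpan, object> GetContinuousProfilerConfiguration()
    {
        var threadSamplingEnabled = true;
        var threadSamplingInterval = 10000u;                      // ms, must not be lower than 1000
        var allocationSamplingEnabled = false;                    // thread sampling only in this example
        var maxMemorySamplesPerMinute = 200u;
        var exportInterval = TimeSpan.FromMilliseconds(500);
        object continuousProfilerExporter = new ConsoleExporter();

        return Tuple.Create(
            threadSamplingEnabled,
            threadSamplingInterval,
            allocationSamplingEnabled,
            maxMemorySamplesPerMinute,
            exportInterval,
            continuousProfilerExporter);
    }
}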

Escape hatch

The profiler limits its own behavior when both buffers used to store sampled data are full.

This scenario might happen when the data processing thread is not able
to export the data within the given period of time.

The thread sampler resumes when either of the buffers is empty.
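
A minimal illustration of this escape-hatch decision follows; the names are hypothetical and the real check lives in the native sampler.

internal static class EscapeHatch
{
    // Hypothetical sketch of the escape-hatch check; not the actual implementation.
    public static bool ShouldSkipSamplePeriod(bool bufferAFull, bool bufferBFull)
    {
        // Skip the whole sampling period when both buffers are full; sampling
        // resumes as soon as either buffer has been drained by the exporting thread.
        return bufferAFull && bufferBFull;
    }
}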

Troubleshooting the .NET profiler

How do I know if it's working?

At startup, the OpenTelemetry .NET Automatic Instrumentation logs the string ContinuousProfiler::StartThreadSampling at the info log level.

You can grep for this in the native logs for the instrumentation to see something like this:

10/12/22 12:10:31.962 PM [12096|22036] [info] ContinuousProfiler::StartThreadSampling

How can I see the Continuous Profiling configuration?

The OpenTelemetry .NET Automatic Instrumentation logs the profiling configuration
at the Debug log level during startup. You can grep for the string Continuous profiling configuration:
to see the configuration.

What does the escape hatch do?

The escape hatch automatically discards profiling data
if the ingest limit has been reached.

If the escape hatch activates, it logs the following message:

Skipping a thread sample period, buffers are full.

You can also look for:

** THIS WILL RESULT IN LOSS OF PROFILING DATA **.

If you see these log messages, check the exporter implementation.

What if I'm on an unsupported .NET version?

No .NET Framework version is supported. You have to switch to a supported .NET version.

Can I tell the sampler to ignore some threads?

There is no such functionality. All managed threads are captured by the profiler.

@Kielek
Contributor Author

Kielek commented Jan 2, 2024

About Continuous memory profiling for .NET

The profiler samples allocations, captures the call stack state of the .NET thread that triggered the allocation, and exports it in an appropriate format.

Use the memory allocation data, together with the stack traces and .NET runtime metrics, to investigate memory leaks and unusual consumption patterns in Continuous Profiling.

How does the memory profiler work?

The profiler leverages .NET profiling to perform allocation sampling.
For every sampled allocation, the allocation amount, together with the stack trace of the thread that triggered the allocation and the associated span context, is saved into a buffer.

The managed thread shared with the CPU profiler processes the data from the buffer and exports it in the way defined by the plugin.
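
As a rough illustration of what each sample carries, a hypothetical shape is sketched below; the field names are assumptions, and the actual data is written as a dense binary format that the plugin's exporter decodes.

// Hypothetical shape of a single allocation sample; not the actual wire format.
internal readonly record struct AllocationSample(
    long AllocatedBytes,      // allocation amount reported by the runtime
    string[] StackFrames,     // call stack of the thread that triggered the allocation
    string TraceId,           // associated span context, if a span was active
    string SpanId);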

Requirements

  • .NET 6.0 or higher (ICorProfilerInfo12 available in the runtime). Technically it could also work on .NET 5, which is not supported by OpenTelemetry .NET AutoInstrumentation or Microsoft.

Enable the profiler

Implement a custom plugin.

Configuration settings by the plugin

threadSamplingEnabled, threadSamplingInterval, allocationSamplingEnabled, maxMemorySamplesPerMinute, exportInterval, continuousProfilerExporter

var allocationSamplingEnabled = true;
var maxMemorySamplesPerMinute = 200; // minimum value: 1, Splunk uses 200 by default
var exportInterval = TimeSpan.FromMilliseconds(500); // interval to read data from the buffers and call the exporter, common for thread and allocation sampling
object continuousProfilerExporter = new ConsoleExporter(); // exporter, common for thread and allocation sampling
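
For illustration, a bare-bones exporter sketch consuming the raw buffers is shown below. The method names and parameters are assumptions modelled loosely on the TestApplication's ConsoleExporter; verify the exact contract against the plugin code in this PR.

using System;

public class ConsoleExporter
{
    // The instrumentation is assumed to hand over the raw buffer and the number of
    // bytes written; a real exporter would decode the dense format and send the
    // samples to a backend.
    public void ExportThreadSamples(byte[] buffer, int read)
    {
        Console.WriteLine($"Thread samples buffer: {read} bytes");
    }

    public void ExportAllocationSamples(byte[] buffer, int read)
    {
        Console.WriteLine($"Allocation samples buffer: {read} bytes");
    }
}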

Escape hatch

The profiler limits its own behavior when the buffer
used to store allocation samples is full.

The current maximum size of the buffer is 200 KiB.

This scenario might happen when the data processing thread is not able
to export the data via the plugin in the given timeframe.

Troubleshooting the .NET profiler

How do I know if it's working?

At startup, the OpenTelemetry .NET Automatic Instrumentation logs the string
ContinuousProfiler::MemoryProfiling started at the info log level.

You can grep for this in the native logs for the instrumentation
to see something like this:

10/12/23 12:10:31.962 PM [12096|22036] [info] ContinuousProfiler::MemoryProfiling started.

How can I see the Continuous Profiling configuration?

The OpenTelemetry .NET Automatic Instrumentation logs the profiling configuration
at the Debug log level during startup. You can grep for the string Continuous profiling configuration:
to see the configuration.

What does the escape hatch do?

The escape hatch automatically discards captured allocation data
if the ingest limit has been reached.

If the escape hatch activates, it logs the following message:

Discarding captured allocation sample. Allocation buffer is full.

If you see these log messages, check the configuration and communication layer
between your process and the Collector.

What if I'm on an unsupported .NET version?

No .NET Framework version is supported. You have to switch to a supported .NET version.

@rajkumar-rangaraj
Contributor

nit: I recommend placing a launchSettings.json file with environment variables in the test/test-applications/integrations/TestApplication.ContinuousProfiler/ app. This will make it easier for people to test this project.

@Kielek
Contributor Author

Kielek commented Jan 4, 2024

nit: I recommend placing a launchSettings.json file with environment variables in the test/test-applications/integrations/TestApplication.ContinuousProfiler/ app. This will make it easier for people to test this project.

Done in 2b957d1

Contributor

@RassK RassK left a comment

LGTM

nit: may need to recheck native formatting rules, loads of missing lines, different styles etc. Too much to deal with in this branch.

return;
}

pdvEventsLow |= COR_PRF_MONITOR_THREADS | COR_PRF_ENABLE_STACK_SNAPSHOT;
Contributor

Are these flags needed if only allocation sampling is enabled?

Contributor Author

Ref: https://learn.microsoft.com/en-us/dotnet/framework/unmanaged-api/profiling/cor-prf-monitor-enumeration

  • COR_PRF_MONITOR_THREADS - needed to receive thread events, which are required to correlate stack traces with traces/spans. Needed for both allocation and thread sampling.
  • COR_PRF_ENABLE_STACK_SNAPSHOT - enables calls to the DoStackSnapshot method, which is executed for both thread and allocation sampling.

Member

Wouldn't it be worth adding a comment in code?

@@ -137,6 +167,12 @@ public static void Initialize()
Logger.Information("Initialized lazily-loaded metric instrumentations without initializing sdk.");
}
}
#if NET6_0_OR_GREATER
if (profilerEnabled && (threadSamplingEnabled || allocationSamplingEnabled))
Contributor

Is there a reason this is done last? Could this be moved to the earlier conditional blocks?

Contributor Author

Technically yes, but for traces and metrics we would like to capture all possible data (at least for the AlwaysOn sampler).
For the profiler, we are heavily dropping data by sampling only once per second(s), so dropping data is acceptable, both at the beginning and at the end of the process.

Contributor

@zacharycmontoya zacharycmontoya left a comment

Overall LGTM

@Kielek Kielek mentioned this pull request Jan 10, 2024
Kielek and others added 2 commits January 10, 2024 15:44
@Kielek
Contributor Author

Kielek commented Jan 11, 2024

I am merging this PR as we agreed yesterday. Please let me know if you have any other comments.

@Kielek Kielek merged commit 349fbb1 into open-telemetry:main Jan 11, 2024
31 checks passed
@Kielek Kielek deleted the continuous-profiler branch January 11, 2024 05:48