
TelemetryClient.Flush deadlocks #1186

Open
mlugosan opened this issue Aug 1, 2019 · 29 comments

@mlugosan

mlugosan commented Aug 1, 2019


Repro Steps

  1. netcoreapp2.2 console app
  2. Telemetry configuration:
  • ServerTelemetryChannel
  • DependencyTrackingTelemetryModule and UnobservedExceptionTelemetryModule
  3. Call TelemetryClient.Flush on workload completion (async Task Main); a minimal repro sketch follows below.
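
A minimal repro sketch along these lines (the instrumentation key and workload are placeholders, and the exact channel/module initialization may differ from the reporter's setup):

using System.Threading.Tasks;
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.DependencyCollector;
using Microsoft.ApplicationInsights.Extensibility;
using Microsoft.ApplicationInsights.WindowsServer;
using Microsoft.ApplicationInsights.WindowsServer.TelemetryChannel;

class Program
{
    static async Task Main()
    {
        var channel = new ServerTelemetryChannel();
        var config = new TelemetryConfiguration
        {
            InstrumentationKey = "<instrumentation key>",
            TelemetryChannel = channel
        };
        channel.Initialize(config);
        new DependencyTrackingTelemetryModule().Initialize(config);
        new UnobservedExceptionTelemetryModule().Initialize(config);

        var client = new TelemetryClient(config);
        await Task.Delay(100);                       // stand-in for the real workload
        client.GetMetric("WorkItems").TrackValue(1); // metrics via GetMetric, as discussed later in the thread

        client.Flush(); // the reported hang occurs inside this call, roughly 1 in 8000 runs
    }
}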

Actual Behavior

The app deadlocks during the Flush operation.
It occurs reliably at a rate of about 1 in 8,000 runs, on consistent hardware and in a consistent deployment environment.

Expected Behavior

Flush completes without deadlocking.

Version Info

SDK Version: Application Insights 2.10.0
.NET Version: .NET Core 2.2
How Application was onboarded with SDK (VisualStudio/StatusMonitor/Azure Extension):
OS: win10-x64
Hosting Info (IIS/Azure WebApps/etc.): console

Two stacks seem relevant for this:
Thread1

ntdll.dll!NtWaitForMultipleObjects()	Unknown
KERNELBASE.dll!WaitForMultipleObjectsEx()	Unknown
[Managed to Native Transition]	
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.Extensibility.MetricSeriesAggregatorBase<double>.UpdateAggregate(Microsoft.ApplicationInsights.Metrics.Extensibility.MetricValuesBufferBase<double> buffer)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.Extensibility.MetricSeriesAggregatorBase<double>.CompleteAggregation(System.DateTimeOffset periodEnd)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.GetNonpersistentAggregations(System.DateTimeOffset tactTimestamp, Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.AggregatorCollection aggregators)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.CycleAggregators(ref Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.AggregatorCollection aggregators, System.DateTimeOffset tactTimestamp, Microsoft.ApplicationInsights.Metrics.Extensibility.IMetricSeriesFilter futureFilter, bool stopAggregators)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.StartOrCycleAggregators(Microsoft.ApplicationInsights.Metrics.Extensibility.MetricAggregationCycleKind aggregationCycleKind, System.DateTimeOffset tactTimestamp, Microsoft.ApplicationInsights.Metrics.Extensibility.IMetricSeriesFilter futureFilter)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricManager.Flush(bool flushDownstreamPipeline)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.TelemetryClient.Flush()	Unknown

Thread 2

ntdll.dll!NtDelayExecution()	Unknown
KERNELBASE.dll!SleepEx()	Unknown
[Managed to Native Transition]	
System.Private.CoreLib.dll!System.Threading.SpinWait.SpinOnce(int sleep1Threshold) Line 169	C#
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.Extensibility.MetricValuesBufferBase<double>.GetAndResetValue(int index)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MeasurementAggregator.UpdateAggregate_Stage1(Microsoft.ApplicationInsights.Metrics.Extensibility.MetricValuesBufferBase<double> buffer, int minFlushIndex, int maxFlushIndex)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.Extensibility.MetricSeriesAggregatorBase<double>.UpdateAggregate(Microsoft.ApplicationInsights.Metrics.Extensibility.MetricValuesBufferBase<double> buffer)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.Extensibility.MetricSeriesAggregatorBase<double>.CompleteAggregation(System.DateTimeOffset periodEnd)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.GetNonpersistentAggregations(System.DateTimeOffset tactTimestamp, Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.AggregatorCollection aggregators)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.CycleAggregators(ref Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.AggregatorCollection aggregators, System.DateTimeOffset tactTimestamp, Microsoft.ApplicationInsights.Metrics.Extensibility.IMetricSeriesFilter futureFilter, bool stopAggregators)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.StartOrCycleAggregators(Microsoft.ApplicationInsights.Metrics.Extensibility.MetricAggregationCycleKind aggregationCycleKind, System.DateTimeOffset tactTimestamp, Microsoft.ApplicationInsights.Metrics.Extensibility.IMetricSeriesFilter futureFilter)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.DefaultAggregationPeriodCycle.FetchAndTrackMetrics()	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.DefaultAggregationPeriodCycle.Run()	Unknown
System.Threading.Thread.dll!System.Threading.Thread.ThreadMain_ThreadStart() Line 93	C#
System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state) Line 167	C#
[Native to Managed Transition]	
kernel32.dll!BaseThreadInitThunk()	Unknown
ntdll.dll!RtlUserThreadStart()	Unknown
@mlugosan
Author

I'm encountering this issue on a consistent basis.
Can anyone advise on a solution or workaround?

@TimothyMothra
Member

@macrogreg Can you take a look at this?

I think it's related to the lock in MetricSeriesAggregatorBase.UpdateAggregate():

// This lock is only contended if a user called CreateAggregateUnsafe or CompleteAggregation.
// This is very unlikely to be the case in a tight loop.
lock (buffer)
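
For illustration only (this is not the SDK's actual code): the two stacks above are consistent with one thread spinning inside that lock, waiting for a buffer slot that never gets written, while the other thread waits to enter the same lock. A stripped-down sketch of the hazard, with hypothetical names:

using System.Threading;

class SpinInsideLockHazard
{
    private readonly object bufferLock = new object();
    private double? slot; // null = "slot reserved but value not yet written" (hypothetical)

    // Writer path (e.g. TrackValue): does not take the lock.
    public void Write(double value)
    {
        slot = value;
    }

    // Reader path (e.g. CompleteAggregation, reached from both Flush and the
    // aggregation-cycle thread): spins for the value while holding the lock.
    public double Complete()
    {
        lock (bufferLock)            // Thread 1 in the traces blocks here
        {
            var spin = new SpinWait();
            while (slot == null)     // Thread 2 in the traces spins here
            {
                spin.SpinOnce();     // if the value never arrives, this never exits
            }
            var value = slot.Value;
            slot = null;
            return value;
        }
    }
}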

@TimothyMothra
Member

As a workaround, if you're not using the GetMetric(...).TrackValue(...) API, you can remove the metrics pre-aggregator processor from your configuration.

@mlugosan
Author

Per the recommendation in the API documentation, which discourages the older metric patterns, we're using GetMetric() and TrackValue(). Unfortunately, the above workaround therefore won't help.

@lmolkova
Member

This is one of the problems with the Flush API. This particular issue is pretty bad and we want to fix it along with the broader Flush work, but we won't have the bandwidth for it this cycle, so I'm moving it to the next milestone.

@lmolkova lmolkova modified the milestones: 2.11, 2.12 Aug 26, 2019
@rafist

rafist commented Oct 11, 2019

I think I see the same issue with AI 2.4.
Since I am not using GetMetric().TrackValue(), how can I remove the metrics pre-aggregator processor from my configuration, as suggested above?

@cijothomas
Contributor

@rafist Are you on ASP.NET Core? If so, do the following to remove the metrics pre-aggregator:

public void ConfigureServices(IServiceCollection services)
{
    var aiOptions = new Microsoft.ApplicationInsights.AspNetCore.Extensions.ApplicationInsightsServiceOptions();
    // disable autocollected metric extractor
    aiOptions.AddAutoCollectedMetricExtractor = false;
    services.AddApplicationInsightsTelemetry(aiOptions);
}

If you are on ASP.NET, comment out the corresponding metric extractor processor entry in ApplicationInsights.config.

@cijothomas
Contributor

Moving to the next milestone, as there are no cycles to investigate this.

@cijothomas cijothomas modified the milestones: 2.12, 2.13 Dec 6, 2019
@rafist

rafist commented Dec 31, 2019

Thanks @cijothomas, my issue was not related to the Flush after all.

@anjanchidige

@rafist, could you please share more details on the actual issue, and let us know whether it has been resolved for you?

@anjanchidige

anjanchidige commented Jan 16, 2020

@cijothomas / @lmolkova / @TimothyMothra, we are using Log4Net with the Application Insights appender to log to App Insights. TelemetryClient.Flush is called after every TelemetryClient.TrackTrace call. After a few hours, and sometimes immediately, all other API calls in my application start failing with task-cancelled exceptions. When we disable the Application Insights appender in the log4net.config file, all my API calls succeed. We are using .NET Framework 4.6.2 and Application Insights appender v2.11.0.0. It seems Application Insights is blocking the other API calls. My application is an Azure Worker Role.

Could you please let us know whether this is a known issue in the Flush method? If so, please share any workaround and possible fix timelines.

Thanks
Anjan

@TimothyMothra TimothyMothra modified the milestones: 2.13, 2.15 Mar 20, 2020
@cijothomas cijothomas modified the milestones: 2.15, Future Sep 1, 2020
@cijothomas
Contributor

Moving this further out from the current milestones.

@mladedav

mladedav commented Nov 23, 2020

This is consistently happening to us in production as well. We call Flush from two places, and in some scenarios that seems to cause the deadlock (it may actually be a livelock). I am attaching stack traces from the debugger; they start at the bottom, after the calls from our code into the SDK. When I resume and pause the debugger again, the code doesn't seem to move (though that doesn't tell us much, since it may simply be improbable to catch anything other than the spin-wait).

This manifests when the application is shutting down, as both Flush calls are started only at that point. In most cases the deadlock doesn't happen and shutdown completes normally. When it does occur, metrics about CPU and memory usage are still being sent.

We are using only services.AddApplicationInsightsTelemetryWorkerService() with no default overrides.

(Sorry for these being screenshots, but VS Code ignores me when I ask it to "copy the stack trace".)
[screenshot: stack trace 1]
[screenshot: stack trace 2]
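
A condensed sketch of that scenario (names are placeholders; in the real service the TelemetryClient comes from DI via AddApplicationInsightsTelemetryWorkerService):

using System.Threading.Tasks;
using Microsoft.ApplicationInsights;

static class ShutdownFlush
{
    // Two independent shutdown paths flush the same client; the hang is only
    // hit occasionally, when the two Flush calls overlap.
    public static void FlushOnShutdown(TelemetryClient client)
    {
        var first = Task.Run(() => client.Flush());
        var second = Task.Run(() => client.Flush());
        Task.WaitAll(first, second);
    }
}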

@cijothomas
Contributor

+1 A similar/probably same issue was reported outside of this repo.

The issue is already tagged p1, but not yet assigned a milestone. Will mark for the next release.

@cijothomas cijothomas modified the milestones: Future, 2.17 Nov 23, 2020
@cijothomas
Contributor

2.17 release date is not yet announced. Will update here, once a firm date is set.

@ninlar

ninlar commented Dec 2, 2020

@cijothomas are you sure this will be resolved in 2.17? I've noticed the milestone for this issue keeps getting pushed out. As a potential workaround, do you think we could try invoking Flush() from a background thread and, if it hasn't completed in a reasonable time, attempt to Interrupt() the thread? This occurs for us when a service is shutting down, and the deadlock (or livelock) actually prevents the service from shutting down. So while dispatching on a background thread and interrupting if needed is hacky, it could be a stopgap until there is a more concrete fix. The reason we call Flush() on shutdown is to ensure any buffered telemetry is sent to AI before the process exits.
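
A sketch of that workaround (the timeout value is arbitrary, and Thread.Interrupt only breaks managed waits and sleeps, so it may not stop a pure spin; this is a stopgap, not a fix):

using System;
using System.Threading;
using Microsoft.ApplicationInsights;

static class SafeFlush
{
    // Returns true if Flush completed within the timeout; otherwise interrupts the
    // background thread (best effort) and lets the process continue shutting down.
    public static bool TryFlush(TelemetryClient client, TimeSpan timeout)
    {
        var flushThread = new Thread(() =>
        {
            try { client.Flush(); }
            catch (ThreadInterruptedException) { /* flush abandoned */ }
        })
        { IsBackground = true };

        flushThread.Start();
        if (flushThread.Join(timeout))
        {
            return true;
        }

        flushThread.Interrupt();
        return false;
    }
}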

@shoguns6


Something similar to what @mladedav describes above is happening with my code as well. I have used the settings below:

var aiOptions = new Microsoft.ApplicationInsights.AspNetCore.Extensions.ApplicationInsightsServiceOptions
{
    InstrumentationKey = instrumentationKey,
    EnableAdaptiveSampling = false,
    EnableQuickPulseMetricStream = false,
    EnablePerformanceCounterCollectionModule = false,
};

But I am still getting a high number of threads, which results in 100% CPU. See below for thread details.
[screenshot: thread list]

Any updates would be helpful. In the meantime, I will try setting 'AddAutoCollectedMetricExtractor' to false and monitor.

@cijothomas
Contributor

@shoguns6 It looks like you are reporting a different issue than the Flush deadlock. The metrics aggregator (DefaultAggregationPeriodCycle) is only expected to create one thread. If you are seeing more than one (461 in your screenshot), it likely means the TelemetryConfiguration/TelemetryClient is being recreated instead of reused. Could you check whether that is possible in your application?
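
For reference, the reuse being suggested looks roughly like this minimal sketch (in ASP.NET Core / worker-service apps the DI container already provides a singleton TelemetryClient, so resolve it from DI rather than constructing your own per operation):

using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.Extensibility;

static class Telemetry
{
    // Create the configuration and client once and reuse them everywhere.
    // Recreating them per operation can spin up additional aggregation
    // threads, which matches the runaway thread count described above.
    private static readonly TelemetryConfiguration Configuration =
        TelemetryConfiguration.CreateDefault();

    public static readonly TelemetryClient Client = new TelemetryClient(Configuration);
}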

@TimothyMothra TimothyMothra modified the milestones: 2.17, 2.18 Mar 4, 2021
@cijothomas
Contributor

One more +1 as this was reported by another user.

@cijothomas cijothomas modified the milestones: 2.18, 2.19 Jul 14, 2021
@cijothomas
Contributor

A tactical fix can be made to prevent Flush from waiting (spinning) indefinitely. It should avoid the currently reported deadlock situations.
As 2.18 is nearing its stable release, I'm adding this to the next milestone, 2.19. The ETA for the stable release of 2.19 is not finalized, but it will be before Nov 2021.
The fix is expected to be included in the very first beta of 2.19.
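
For illustration of the general idea only (not the actual SDK change, which may differ), a bounded spin-wait looks roughly like this; the BCL's SpinWait.SpinUntil(Func<bool>, TimeSpan) provides equivalent behavior:

using System;
using System.Threading;

static class BoundedSpin
{
    // Spin for a condition, but give up after a deadline so a stuck producer
    // cannot block the caller (e.g. Flush) indefinitely.
    public static bool TrySpinUntil(Func<bool> condition, TimeSpan timeout)
    {
        var spin = new SpinWait();
        var deadline = DateTime.UtcNow + timeout;
        while (!condition())
        {
            if (DateTime.UtcNow >= deadline)
            {
                return false; // caller can drop the value instead of hanging
            }
            spin.SpinOnce();
        }
        return true;
    }
}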

@TimothyMothra TimothyMothra modified the milestones: 2.19, 2.20 Oct 13, 2021
@TimothyMothra TimothyMothra modified the milestones: 2.20, 2.21 Dec 14, 2021
@viblo

viblo commented Mar 15, 2022

Just another data point, we also have this problem. We have experienced it randomly since at least early 2021. Quite frustrating that it has not been possible to fix it even after 20 point releases of AI.

@toddfoust

Customer request to add this item to the 2.21 release if possible.

@BertscheS

Same problem here. We hit it two weeks ago for the first time, using 2.18. It happened consistently, multiple times during the day and on different servers, so we upgraded to 2.20, and yesterday the issue occurred again. We have to leave Application Insights disabled for now, which is not ideal.
Please include this in 2.21.

@TimothyMothra
Member

Hello All,
We released v2.21-beta2 today, which contains a fix for this issue.
If anyone can test this new beta, please write back and share whether the issue is resolved for you.

@TimothyMothra
Member

Hello All,
We released v2.21.0 today, which contains a fix for this issue.
We're going to leave this issue open until we get confirmation that it has been resolved.

@ninlar

ninlar commented Jul 26, 2022

We have created a backlog item to upgrade the SDK and test the fix.

@cijothomas
Contributor

If anyone who experienced this issue can confirm that the latest version fixes the issue, it would be greatly appreciated.

@ninlar

ninlar commented Sep 16, 2022 via email

@BertscheS

We have had the latest version running for 5 months now, and the issue has not occurred anymore. Thanks for looking into it.
