Application Insights and shared process hosting model issues #887

mkosieradzki · 2018-02-28T15:44:54Z

I am observing very strange behavior in how Reverse Proxy interacts with the Health subsystem.

My scenario is:

A single partition reports that it is unable to create a proper backup (reporting this as Error instead of Warning is probably my mistake, but I really need to think about this one carefully).
The Error propagates up (as described in the documentation), rendering the Service and the entire Application Errored.
Reverse proxy refuses to route some requests (fast timeout) inside this application. I can't determine the exact mechanism, as it seems to be non-deterministic (I believe it has something to do with cached name resolutions).

I might be completely wrong and making invalid assumptions here...

Documentation here would be really helpful.

oanapl · 2018-02-28T16:04:19Z

Reverse proxy doesn't use health subsystem to make decisions, so there is no interaction between the two. There must be another reason why the request routing is not working.
@kavyako to help with reverse proxy.

mkosieradzki · 2018-02-28T16:15:35Z

Thanks Oana,

I will try to diagnose this a bit more and will come back. Any chance it has an indirect interaction, for example when using Naming Service for address resolution?

oanapl · 2018-02-28T16:28:40Z

Currently, we don't take any action based on health other than during upgrades. There can be repair services written on top of health subsystem that restart unhealthy nodes etc, but there is no such repair service in the cluster out of the box.
To address your question, there is no interaction between Naming Service and Health Subsystem.

mkosieradzki · 2018-02-28T16:34:33Z

Thanks a lot - this is exactly how I thought everything worked in the first place, so apparently something else is going on and there is really a lot of coincidence.

I will add more telemetry to try to catch the root cause.

kavyako · 2018-02-28T17:27:11Z

@mkosieradzki Was the service endpoint up when request routing through the reverse proxy failed?
Is this an on-prem cluster or Azure cluster?
What error status code/message did you see at the reverse proxy? What does "fast timeout" correspond to?

mkosieradzki · 2018-02-28T17:55:48Z

Yes, it was receiving requests on different URLs (I don't have enough telemetry to figure out the exact cause).

From the telemetry I have, I can say the following:

There is a high probability that I was receiving 500 when making a POST to a specific URL on the reverse proxy.
This happens only if the target service is in the Errored state (due to a manually reported fault that does not affect the actual endpoint) - might be a COINCIDENCE.
No request was received by the actual service (or at least none has failed).
Some other URLs on the SAME service are working correctly (probably served by a proxy on a different node).

Unfortunately, I don't have the new SF AI integration enabled yet - it would definitely help to get proper telemetry. I also have some trouble tracking down Reverse Proxy with the DependencyTrackingTelemetryModule.

I will try to reproduce this one with Warning instead of an Error.

kavyako · 2018-02-28T18:13:04Z

Can you look at the reverse proxy diagnostic logs?
They will give some more insight into which requests failed and why.

mkosieradzki · 2018-02-28T19:35:59Z

Thanks @kavyako, now I am sure it's health-unrelated. I will track down the root cause now :).

mkosieradzki · 2018-03-01T14:47:41Z

@kavyako @oanapl
I am sorry for the invalid bug report, but that was all I could figure out based on broken telemetry. There is a high probability that the root cause was... ApplicationInsights itself (breaking HTTP requests sent to different services, or at least ruining the telemetry). They have a pending bug that manifests when you register ApplicationInsights multiple times, microsoft/ApplicationInsights-dotnet#613 (comment), which is obviously quite common in Service Fabric (SharedProcess/multiple listeners).

Thank you very much for helping me out.

I am now trying to find a proper workaround and to confirm that this was the root cause.

mkosieradzki · 2018-03-01T21:33:30Z

There is a serious bug in Application Insights that makes it incompatible with the Service Fabric shared process hosting model.

Long story short:

DependencyTrackingTelemetryModule should work as a singleton, because it subscribes to a global DiagnosticListener.
By default it is added whenever a new WebHost is created (using UseApplicationInsights or AddApplicationInsightsTelemetry).
It is not added as an instance (instances are responsible for their own disposal) but as a factory, so its lifetime is managed by the WebHost (and the module is disposed during WebHost shutdown).
During the lifetime of the hosting process, service instances are added and removed, and each service exposes its own WebHost.
After some rounds there are DesktopDiagnosticSourceSubscriber instances attached to the listener whose owning module has already been disposed...
And then when you try to use HttpClient, it crashes IT! This causes a total mindf**ck, because the telemetry that finally arrives in AI makes NO SENSE. It took me 2 days to figure this out.
It does not reproduce on a single-node dev cluster, because you need at least one round of service instance migrations across the cluster.
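
The sequence above can be modelled in isolation. The following is a minimal sketch and not the Application Insights code: `FakeDependencyModule`, `NoopObserver`, the listener name, and the endpoint URL are all hypothetical, and the idea that the exception comes from state cleared on disposal is an assumption inferred from the stack trace further down. It uses only `System.Diagnostics.DiagnosticListener` (built into .NET Core; the System.Diagnostics.DiagnosticSource package on .NET Framework) to show how a predicate registered on a process-wide listener keeps being called after its owner has been disposed.

```csharp
#nullable disable
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical stand-in for a telemetry module owned by one WebHost.
sealed class FakeDependencyModule : IDisposable
{
    // Hypothetical state that the filter predicate depends on.
    private string endpoint = "https://example.invalid/track";

    public FakeDependencyModule(DiagnosticListener sharedListener)
    {
        // The predicate closes over this instance. The IDisposable returned
        // by Subscribe is dropped, so the subscription can never be removed.
        sharedListener.Subscribe(
            new NoopObserver(),
            (name, arg1, arg2) => endpoint.Length > 0);
    }

    // Clears the state but leaves the subscription on the shared listener.
    public void Dispose() => endpoint = null;
}

// Observer that ignores every event; only the predicate matters here.
sealed class NoopObserver : IObserver<KeyValuePair<string, object>>
{
    public void OnCompleted() { }
    public void OnError(Exception error) { }
    public void OnNext(KeyValuePair<string, object> value) { }
}

static class Program
{
    static void Main()
    {
        // Stands in for the process-wide HTTP listener shared by all services.
        var shared = new DiagnosticListener("HypotheticalHttpListener");

        // First "service instance" starts and registers its module.
        var module = new FakeDependencyModule(shared);
        Console.WriteLine(shared.IsEnabled("Request", null, null)); // prints "True"

        // The instance moves away; its WebHost disposes the module.
        module.Dispose();

        try
        {
            // Any later HTTP call in the same process still hits the stale predicate.
            shared.IsEnabled("Request", null, null);
        }
        catch (NullReferenceException)
        {
            Console.WriteLine("stale subscriber threw NullReferenceException");
        }
    }
}
```

In this sketch, keeping the `IDisposable` returned by `Subscribe` and disposing it inside `Dispose` removes the stale predicate. Whether the upstream fix takes that form is not stated in this thread.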

I am testing a workaround that I suppose will finally work.

@nizarq Please help to overcome this issue.

Object reference not set to an instance of an object. ---> System.NullReferenceException: Object reference not set to an instance of an object.
   at Microsoft.ApplicationInsights.DependencyCollector.Implementation.ApplicationInsightsUrlFilter.get_EndpointLeftPart()
   at Microsoft.ApplicationInsights.DependencyCollector.Implementation.ApplicationInsightsUrlFilter.IsApplicationInsightsUrlImpl(String url)
   at Microsoft.ApplicationInsights.DependencyCollector.Implementation.HttpDesktopDiagnosticSourceSubscriber.b__7_0(String evnt, Object r, Object _)
   at System.Diagnostics.DiagnosticListener.IsEnabled(String name, Object arg1, Object arg2)
   at System.Diagnostics.HttpHandlerDiagnosticListener.RaiseRequestEvent(HttpWebRequest request)
   at System.Diagnostics.HttpHandlerDiagnosticListener.HttpWebRequestArrayList.Add(Object value)
   at System.Net.Connection.StartRequest(HttpWebRequest request, Boolean canPollRead)
   at System.Net.Connection.SubmitRequest(HttpWebRequest request, Boolean forcedsubmit)
   at System.Net.ServicePoint.SubmitRequest(HttpWebRequest request, String connName)

mkosieradzki · 2018-03-12T15:37:46Z

This is the upstream bug: microsoft/ApplicationInsights-aspnetcore#621

mkosieradzki · 2019-03-04T01:49:54Z

Fixed in upstream.

oanapl assigned kavyako Feb 28, 2018

oanapl added the question label Feb 28, 2018

mkosieradzki changed the title ~~[ReverseProxy] Please add documentation for Health subsystem and Reverse Proxy interactions~~ Application Insights and shared process hosting model issues Mar 1, 2018

kavyako added external and removed question labels Mar 15, 2018

mkosieradzki closed this as completed Mar 4, 2019
