This repository has been archived by the owner on Jan 16, 2021. It is now read-only.

Application Insights and shared process hosting model issues #887

Closed
mkosieradzki opened this issue Feb 28, 2018 · 12 comments

@mkosieradzki

I am observing very strange behavior in how the Reverse Proxy interacts with the Health subsystem.

My scenario is:

  1. A single partition reports that it is unable to create a proper backup (it was probably my mistake to make it an Error instead of a Warning, but I really need to think about this one carefully).
  2. The Error propagates up (as described in the documentation), rendering the Service and the entire Application Errored.
  3. Reverse proxy refuses to route some requests (fast timeout) inside this application. I can't determine the exact mechanism, as it seems to be non-deterministic (I believe it has something to do with cached name resolutions).

I might be completely wrong and making invalid assumptions here...

Documentation here would be really helpful.

@oanapl

oanapl commented Feb 28, 2018

Reverse proxy doesn't use health subsystem to make decisions, so there is no interaction between the two. There must be another reason why the request routing is not working.
@kavyako to help with reverse proxy.

@mkosieradzki
Author

Thanks Oana,

I will try to diagnose this a bit more and will come back. Is there any chance of an indirect interaction, for example when using the Naming Service for address resolution?

@oanapl

oanapl commented Feb 28, 2018

Currently, we don't take any action based on health other than during upgrades. There can be repair services written on top of health subsystem that restart unhealthy nodes etc, but there is no such repair service in the cluster out of the box.
To address your question, there is no interaction between Naming Service and Health Subsystem.

@mkosieradzki
Author

Thanks a lot - this is exactly how I thought everything worked in the first place, so apparently there is something else going on and really a lot of coincidence.

I will add more telemetry to try to catch the root cause.

@kavyako

kavyako commented Feb 28, 2018

@mkosieradzki Was the service endpoint up when request routing through the reverse proxy failed?
Is this an on-prem cluster or Azure cluster?
What error status code/message did you see at the reverse proxy? What does "fast timeout" correspond to?

@mkosieradzki
Author

Yes, it was receiving requests on different URLs (I don't have enough telemetry to figure out the exact cause).

From the telemetry I have, I can say the following:

  • There is a high probability that I was receiving a 500 when making a POST to a specific URL on the reverse proxy
  • This happens only if the target service is in the Errored state (due to a manually reported fault, not affecting the actual endpoint) - might be a COINCIDENCE
  • No request was received by the actual service (or at least none has failed)
  • Some other URLs on the SAME service are working correctly (probably served by a proxy on a different node)

Unfortunately, I don't have the new SF AI integration enabled yet - it would definitely help to get the proper telemetry. I also have some trouble tracking down Reverse Proxy with the DependencyTrackingTelemetryModule.

I will try to reproduce this one with Warning instead of an Error.

@kavyako

kavyako commented Feb 28, 2018

Can you look at the reverse proxy diagnostic logs?
They will give some more insight into which requests failed and why.

@mkosieradzki
Author

Thanks @kavyako, now I am sure it's health-unrelated. I will track down the root cause now :).

@mkosieradzki
Author

@kavyako @oanapl
I am sorry for the invalid bug report, but that was all I could figure out based on broken telemetry. There is a high probability that the root cause was... ApplicationInsights itself (breaking HttpRequests sent to different services, or at least ruining the telemetry). There is a pending bug that manifests when you register ApplicationInsights multiple times (microsoft/ApplicationInsights-dotnet#613 (comment)), which is obviously quite common in Service Fabric (SharedProcess/multiple listeners).

Thank you very much for helping me out.

I am now trying to find a proper workaround and to confirm that this was the root cause.

@mkosieradzki changed the title from "[ReverseProxy] Please add documentation for Health subsystem and Reverse Proxy interactions" to "Application Insights and shared process hosting model issues" on Mar 1, 2018

@mkosieradzki
Author

There is a serious bug in Application Insights that makes it incompatible with the Service Fabric shared process hosting model.

Long story short:

  1. DependencyTrackingTelemetryModule should behave as a singleton, because it subscribes to a global DiagnosticListener.
  2. It is added by default when creating a new WebHost (using UseApplicationInsights or AddApplicationInsightsTelemetry).
  3. It is not added as an instance (instances are responsible for their own disposal) but as a factory, so its lifetime is managed by the WebHost (and the module is disposed during WebHost shutdown).
  4. During the lifetime of the hosting process, service instances are added and removed, and each service exposes its own WebHost.
  5. After a few rounds, there are DesktopDiagnosticSourceSubscriber instances attached to the listener that have already been disposed...
  6. And then, when you try to use HttpClient, it crashes IT!!!! This causes total mindf**ck, because the telemetry that finally ends up in AI makes NO SENSE. It took me 2 days to figure this out.
  7. It does not reproduce on a single-node dev cluster, because you need at least one round of service instance migrations across the cluster.
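
The lifecycle described in the steps above can be sketched as a small language-neutral simulation. This is only an illustrative model written in Python, not Application Insights or Service Fabric code: `GlobalListener`, `TrackingModule`, `send_request`, and the URLs are hypothetical names invented here. It assumes the failure mode is a per-host module that clears its state on disposal while its callback stays subscribed to a process-wide listener.

```python
class GlobalListener:
    """Hypothetical stand-in for a process-wide diagnostic listener."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def raise_event(self, url):
        # Every outgoing request notifies every callback ever attached.
        for callback in self._subscribers:
            callback(url)


class TrackingModule:
    """Hypothetical stand-in for a tracking module owned by one web host."""

    def __init__(self, listener, endpoint):
        self._endpoint = endpoint
        listener.subscribe(self._on_request)

    def _on_request(self, url):
        # Raises AttributeError once dispose() has cleared the endpoint.
        return url.startswith(self._endpoint.rstrip("/"))

    def dispose(self):
        # Modeled bug: state is torn down, but the subscription on the
        # process-wide listener is left in place.
        self._endpoint = None


def send_request(listener, url):
    """Hypothetical outgoing HTTP call; it only raises the diagnostic event."""
    listener.raise_event(url)
    return "sent"


listener = GlobalListener()

# Two service instances share one process; each host creates its own module.
module_a = TrackingModule(listener, "https://telemetry.example/")
module_b = TrackingModule(listener, "https://telemetry.example/")

first = send_request(listener, "http://localhost:19081/app/svc")

# Instance A moves to another node, so its host shuts down and disposes
# the module it created.
module_a.dispose()

try:
    second = send_request(listener, "http://localhost:19081/app/svc")
except AttributeError as exc:
    second = "failed: " + type(exc).__name__

print(first)
print(second)
```

In this model the first request succeeds, while the request made after one host has shut down fails inside the stale callback, even though the remaining instance never touched its own module. That is also why a single host that never shuts down would not show the problem.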

I am testing a workaround that I suppose will finally work.

@nizarq Please help to overcome this issue.

Object reference not set to an instance of an object. ---> System.NullReferenceException: Object reference not set to an instance of an object.
   at Microsoft.ApplicationInsights.DependencyCollector.Implementation.ApplicationInsightsUrlFilter.get_EndpointLeftPart()
   at Microsoft.ApplicationInsights.DependencyCollector.Implementation.ApplicationInsightsUrlFilter.IsApplicationInsightsUrlImpl(String url)
   at Microsoft.ApplicationInsights.DependencyCollector.Implementation.HttpDesktopDiagnosticSourceSubscriber.b__7_0(String evnt, Object r, Object _)
   at System.Diagnostics.DiagnosticListener.IsEnabled(String name, Object arg1, Object arg2)
   at System.Diagnostics.HttpHandlerDiagnosticListener.RaiseRequestEvent(HttpWebRequest request)
   at System.Diagnostics.HttpHandlerDiagnosticListener.HttpWebRequestArrayList.Add(Object value)
   at System.Net.Connection.StartRequest(HttpWebRequest request, Boolean canPollRead)
   at System.Net.Connection.SubmitRequest(HttpWebRequest request, Boolean forcedsubmit)
   at System.Net.ServicePoint.SubmitRequest(HttpWebRequest request, String connName)

@mkosieradzki
Author

This is the upstream bug: microsoft/ApplicationInsights-aspnetcore#621

@kavyako added the external label and removed the question label on Mar 15, 2018

@mkosieradzki
Author

Fixed in upstream.
