
Various potential bugs #17

Closed
lukos opened this issue Sep 30, 2021 · 6 comments

lukos commented Sep 30, 2021

The snapshot collector seems to work fine in production but we are having trouble in development/build since the snapshot collector is making a significant number of external calls that cannot route through a firewall.

Without the NuGet package installed, everything runs as expected: the only Application Insights events I see in the debug log are the ones I expect, using the correct iKey.

With the package installed, I see a significant number of events like these two:

Application Insights Telemetry: {"name":"AppEvents","time":"2021-09-30T09:20:56.8831014Z","iKey":"8a28de91-bcf6-44ac-9242-f7a64f1146b8","tags":{"ai.application.ver":"1.4.0.0","ai.device.osVersion":"Microsoft Windows 10.0.19043","ai.cloud.roleInstance":"/PU6k0iJgzHB","ai.session.id":"dYYeFjWs/Le7","ai.internal.sdkVersion":"dotnetc:2.18.0-315","ai.internal.nodeName":"DEV-01.smartsurvey.local"},"data":{"baseType":"EventData","baseData":{"ver":2,"name":"TelemetryProcessorCreated","properties":{"Runtime":".NET Core 3.1.19","TelemetryProcessor":"SnapshotCollectorTelemetryProcessor Problems:Count=0"}}}}

Application Insights Telemetry: {"name":"AppDependencies","time":"2021-09-30T09:20:56.8818670Z","iKey":"xxxxxxxx-xxxx-xxxx-xxxx-62d47594c409","tags":{"ai.application.ver":"15.0.0.0","ai.cloud.roleInstance":"DEV-01","ai.operation.id":"b7a8c124c63ab14db3955c8083467fce","ai.internal.sdkVersion":"rdddsc:2.18.0-315","ai.internal.nodeName":"DEV-01.ss.local"},"data":{"baseType":"RemoteDependencyData","baseData":{"ver":2,"name":"GET /metadata/instance/compute","id":"b2d10a6620b6864d","data":"http://169.254.169.254/metadata/instance/compute?api-version=2017-08-01","duration":"00:00:00.3702150","resultCode":"Canceled","success":false,"type":"Http","target":"169.254.169.254","properties":{"AspNetCoreEnvironment":"Development","DeveloperMode":"true","_MS.ProcessedByMetricExtractors":"(Name:'Dependencies', Ver:'1.1')"}}}}

Note that the first one has an incorrect iKey, which I cannot find anywhere in my code or on GitHub. The second has the correct key, which I have masked. The second one is occurring "correctly" in that the snapshot collector is calling out to http://169.254.169.254/metadata/instance/compute, whatever that is.

I then tried to set IsEnabled to false in appsettings and it didn't disable these calls.

"SnapshotCollectorConfiguration": {
  "IsEnabled": false,
  "IsEnabledInDeveloperMode": false
}

Note that we are not hosting these on Azure App Service and don't have any App Services running for any of our apps. I mention this because the first event refers to a role instance and the second to metadata/instance, which I believe is a cloud-service endpoint for local metrics data.

pharring (Member) commented Oct 1, 2021

Both of those are coming from the Snapshot Collector itself.

The iKey on the first one is the Snapshot Collector's own internal telemetry which is used for usage metrics and troubleshooting. You can disable it (and others) by setting ProvideAnonymousTelemetry to false in SnapshotCollectorConfiguration.
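To illustrate, a minimal appsettings.json fragment combining this with the flags from the original report might look like the following (a sketch using only the option names discussed in this thread, not an exhaustive configuration):

```json
"SnapshotCollectorConfiguration": {
  "IsEnabled": true,
  "IsEnabledInDeveloperMode": false,
  "ProvideAnonymousTelemetry": false
}
```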

The dependency is coming from Snapshot Collector attempting to detect whether it's running in an Azure VM via the Instance Metadata Service. It's expected to fail unless the IMDS is available.
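The shape of that probe can be sketched in a few lines (illustrative Python, not the actual Snapshot Collector code; the URL and api-version are taken from the dependency telemetry above):

```python
import socket
import urllib.error
import urllib.request

IMDS_URL = ("http://169.254.169.254/metadata/instance/compute"
            "?api-version=2017-08-01")

def probe_imds(url: str = IMDS_URL, timeout: float = 5.0):
    """Try the Azure Instance Metadata Service once; return None if unreachable."""
    # IMDS requires the Metadata: true header and rejects requests without it.
    request = urllib.request.Request(url, headers={"Metadata": "true"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except (urllib.error.URLError, socket.timeout, OSError):
        # Off Azure the single call times out or is refused; treat that as
        # "not an Azure VM" and move on.
        return None
```

Outside an Azure VM the single attempt simply fails after the timeout, which matches the canceled dependency telemetry shown in the issue.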

You said:

the snapshot collector is making a significant number of external calls that cannot route through a firewall.

The Snapshot Collector's internal telemetry is sent to the public global Application Insights endpoint, https://dc.services.visualstudio.com, on port 443. If that address and port are blocked at the firewall, then no telemetry will get through. I presume, in that case, you've set the ingestion endpoint explicitly for your own telemetry.

With that explanation out of the way, are there actually bugs here? The title says "potential" bugs, so I'd like to understand if there are any blocking issues.

lukos (Author) commented Oct 1, 2021

Thanks @pharring

OK, so the internal iKey makes sense, although it would be good to document this because it was not obvious. It also shows a very old ai version but I guess that is OK.

Although not necessarily a true "bug", I would argue that even a snapshot collector shouldn't sit there endlessly polling a non-existent endpoint. I've realised this IP address is link-local, so it is not an external call, but at most it should try, say, 5 times and then switch off. It is hard to debug some network issues when a lot of failed calls take 20 seconds to fail; at that point the collector could be causing issues in the application that is spawning all of these calls and being blocked waiting for a reply.

This is currently the sort of noise I am talking about:

[Screenshot: Fiddler trace showing repeated failed calls to 169.254.169.254]

lukos (Author) commented Oct 1, 2021

Also, it doesn't explain why setting IsEnabled to false doesn't stop these calls being made.

pharring (Member) commented Oct 1, 2021

Thanks for elaborating. I'll try and unpack:

... it would be good to document [the internal iKey].

Let this issue be the documentation. We could maybe add some exposition to the main docs on the ProvideAnonymousTelemetry option.

It also shows a very old ai version but I guess that is OK.

I see "ai.internal.sdkVersion":"dotnetc:2.18.0-315". That's the latest version. The SnapshotCollector nuget uses the same Application Insights SDK that the rest of your application uses.

snapshot collector shouldn't sit there endlessly polling a non-existent endpoint. [...] at most it should try, say, 5 times and then switch off.

I agree. Snapshot Collector tries to hit that endpoint exactly once. The timeout is set to 5 seconds after which the call is canceled. It doesn't endlessly poll. If Snapshot Collector is enabled, then there will be a second identical call from the SnapshotUploader(64) process as it starts up.

Now, the screenshot from Fiddler shows a couple of interesting things.

  1. Some of those IMDS calls are NOT from Snapshot Collector. They've been truncated, but I can say for sure that the ones that have compute?format=json in them (about half) are NOT from Snapshot Collector. The most likely source is the AzureMetadataRequestor component from the base Application Insights SDK, which is used to generate heartbeat telemetry.
  2. Repeated calls. Again, half of them are definitely NOT from Snapshot Collector, but the other half could be (304, 306, 310, etc.). I'd be interested to see how many of those disappear if you remove Snapshot Collector from your project. If about half of them disappear (indicating that Snapshot Collector is responsible for half of them), then we're probably looking at a bug in your application where more than one TelemetryConfiguration is being created.

setting IsEnabled to false doesn't stop these calls being made.

That's probably a misunderstanding about what IsEnabled does. IsEnabled prevents the Snapshot Collector from contacting the snapshot service "front door", starting the SnapshotUploader process and triggering process snapshots. If you know about TelemetryProcessors, there's basically an IsEnabled check at the top of the SnapshotCollectorTelemetryProcessor.Next method to see if we should do any processing. It is an effective "gate" to stop all snapshotting.
To prevent usage telemetry, set ProvideAnonymousTelemetry to false.
However, neither of these will stop the IMDS call. There's no way to prevent those short of not adding the Snapshot Collector in the first place (either remove the NuGet package, or remove the code that calls "AddSnapshotCollector"). The 502s are safe to ignore.
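The IsEnabled gate described above can be sketched as a tiny telemetry-processor chain (illustrative Python, not the actual C# SnapshotCollectorTelemetryProcessor; names are simplified):

```python
class SnapshotProcessor:
    """Illustrative processor: always forwards telemetry, but only does
    snapshot work when the IsEnabled gate is open."""

    def __init__(self, next_processor, is_enabled: bool = True):
        self.next_processor = next_processor
        self.is_enabled = is_enabled
        self.snapshots_triggered = 0

    def process(self, telemetry: dict) -> None:
        # The gate check at the top of Next(): skip all snapshot work
        # when disabled, but the item still flows down the pipeline.
        if self.is_enabled and telemetry.get("is_exception"):
            self.snapshots_triggered += 1
        self.next_processor(telemetry)
```

With is_enabled set to False nothing is ever snapshotted, yet every telemetry item still reaches the rest of the chain, which is why disabling it has no effect on unrelated traffic such as the IMDS call.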

So, that's a bit of a dive under the hood, but I want to repeat an earlier point, in case you missed it: From the Fiddler trace, it looks like you have a bug in your application that is creating more than one TelemetryConfiguration.

pharring (Member) commented Oct 1, 2021

I read in the linked issue that you're running unit tests. Assuming a telemetry pipeline is created for each test, that could explain why we see multiple calls to IMDS.

lukos (Author) commented Oct 4, 2021

Hi @pharring

The version I was referring to was "ai.application.ver":"1.4.0.0", but that was just an observation. The normal AI ones show application as v15.0 so 1.4 seemed old.

Setting ProvideAnonymousTelemetry doesn't make a noticeable difference; I was trying to get rid of the high number of failed connections. I suspect the issue is that the unit tests, using the WebApplicationFactory, are effectively starting up new instances frequently, which causes the problem.

Also worth noting is that we have tests hitting an App, which itself calls an API and both have the snapshot collector enabled, which is probably the duplicates you mentioned above.

When I removed the snapshot collector, the calls, obviously, disappeared, but we need them in production, so that is not a solution.

It turns out these weren't the cause of the slowness I was seeing; there is something weird on the hosting provider that causes connections from one specific IP address to randomly drop. Therefore, I have to assume that these are not bugs, just noise, and there is probably not much we can do about them.

lukos closed this as completed Oct 4, 2021