Troubleshooting NetworkObserver #61

markwragg · 2020-05-05T16:26:15Z

Hi,

I've just configured NetworkObserver to monitor an Azure SQL connection endpoint (port 1433). Immediately after adding the config it went into warning state as follows:

'FO022' reported Warning for property 'NetworkHealth'.
NetworkObserver detected Warning threshold breach. Outbound Internet connection failure detected for endpoint <redacted>.database.windows.net

I've logged on to the nodes where the app I configured this check for is running and can do a Test-NetConnection on each of them verifying the endpoint/port can be reached. I've turned on verbose logging on one node and it just shows the connection test occurring but not any errors.

What can I do to further troubleshoot why its alerting? Does the test run from all the nodes where the app runs? Does it go into warning if just 1 of them doesn't connect?

The documentation makes reference that it can do ICMP or port checks. ICMP does fail to this address but I can't see any example of how you select whether the test is ICMP or port based.

Thanks,
Mark

The text was updated successfully, but these errors were encountered:

GitTorre · 2020-05-05T17:03:51Z

Thanks for reporting this.

TCP is the right protocol for NetworkObserver tests generally. ICMP is useful for low/no data network communication, like ping or traceroute. For SQL and frankly most other networking you do from an SF app, for example, TCP is the default and built-in protocol. You can't modify this outside of the FO source code.

Can you run FO locally on your dev machine and debug NetworkObserver with the failing hostname/port pair config? NetworkObserver runs on all nodes. The health report is Application level.

Let me spin up a test SQL Azure db and see if this is a bug in NetworkObserver.

GitTorre · 2020-05-05T18:17:28Z

I will re-implement this scenario support with TcpClient. It’s a bug.

markwragg · 2020-05-05T18:25:44Z

Thanks @GitTorre good to know I wasn’t just doing something wrong!

GitTorre · 2020-05-05T23:31:22Z

@markwragg Pushed fix to develop if you want to test it out. DiskObs fix is also there.

markwragg · 2020-05-05T23:58:11Z

I haven't tested the code under develop yet, but one further bug i've noticed (with the existing master code) is that when a warning fires for a particular endpoint it appears under every application that has endpoints configured for monitoring in NetworkObserver rather than just those that it was explicitly listed for.

That is to say if i'm monitoring endpoints for app1, app2, app3 and I have the database.windows.net endpoint under just app2, its now alerting for all 3 apps, even though its not listed as an endpoint for app1 or app3.

This may already be resolved by your latest code changes but I thought i'd mention it in case you think it may not be.

(Note: I didn't notice this previously because I had only added the endpoint for one app, but now i've added further endpoint config for additional apps its become apparent).

GitTorre · 2020-05-06T00:20:01Z

Hmm. I will look into this. Can you share your config?

GitTorre · 2020-05-06T00:37:57Z

Wait a while. I need to do a few more things with this stuff. (thus develop...)

Edit: Fixed a few things. Thanks for the bugs. We really appreciate it! Should push to develop tonight.

markwragg · 2020-05-06T15:57:50Z

I had a further thought on this observer. The fact that the connectivity is tested from every node isn't an issue in my environment, but it seems like it could be in other environments. For example one of my apps runs on just one of my scale sets. If I had set up firewall rules for that individual scale set to be able to access its specific endpoints then the tests would fail on other nodes.

Some way to configure endpoint testing to run only on specific scalesets might be useful. Let me know if you want me to split this into another issue as I know this is more of a feature request now.

Also how does it work if you configure the same endpoint under multiple applications? Does it result in multiple checks for the same endpoint, or does it deduplicate them before it executes them?

GitTorre · 2020-05-06T17:22:15Z

@markwragg Re: duplication, configuration is up to you. If you have shared endpoints, then just add them to a single app versus all of the related ones. In terms of node types (I assume that is what you mean), then you would just configure FO that you deploy to those nodes differently. Node type differentiation is already implicitly supported. That is, you can configure the same app differently on different node types. Or am I misunderstanding your suggestion? Thanks for the thoughts! Let me know. Let's wait to gain clarity here before creating another issue.

markwragg · 2020-05-06T17:27:40Z

I wasn’t aware you could configure the same app differently on different node types, that does sound like it solves it though so I’ll look into it.

It feels like a shame to have to just declare an endpoint once when it’s used by multiple apps, I like that it would show all of the apps impacted when that endpoint was down.

Can you declare endpoints in a way that they apply to all apps?

GitTorre · 2020-05-06T17:39:45Z

The way you do that would be to add the endpoint/port pairs to all the targetApps you want to warn when the services are unreachable. That takes you back to the way it is currently implemented. :)

If what you are asking is for NetObs to filter duplicate endpoints out before conducting tests (which it does for each target app, of course), that can be done, but it will involve some work (not hard, just time and testing). Please create a new Issue for this and I will add it my queue. It is a fine suggestion.

Re: all apps, one thing that could be done is to add support for appTargetType, which would mean apply the settings to all apps of the same type (and thus all of their services). This is what you can do in AppObserver, for example, which is a more complex observer that does multiple things, unlike network observer which is simply an app-specific endpoint reachability tester, nothing more.

markwragg · 2020-05-07T10:13:35Z

All my network endpoint tests are working correctly now since deploying 2.0.6 and having correctly assigned each as either http or tcp with the new "protocol" setting in the config file. Thanks for the quick turnaround!

GitTorre · 2020-05-09T00:25:37Z

@markwragg can you verify this impl works for you? I also wrote a slightly different approach with retries and a remote write-only test that’s been tested and is currently in develop branch. The shipped (Release) impl will also work, but without retries which may be questionable (and it employs both a read and a write test which I think is unnecessary). Of course, the real connection logic (so, using the sql azure sdk and your connection string, firewall rules, etc) is up to your code where you should always employ your own retry logic. NetworkObserver just tries to prove that the service is reachable and in an efficient way.

markwragg · 2020-05-09T10:40:43Z

The current implementation works but for some non standard port HTTP endpoints I had to use TCP. I suspect maybe they didn’t work because of the write part of the query because they are authenticated endpoints so maybe switching to just read only would allow them to be tested via HTTP. Doing retries sounds good too. I say go for it.

GitTorre · 2020-05-09T17:16:59Z

For http, the web request test should work when auth fails. The server will respond in that case and that’s all we care about. Can you let me know more about this particular scenario?

markwragg · 2020-05-09T17:47:19Z

It was two public endpoints our app listens on. One runs on port 8088 the other on 8120. They run on the same SF cluster and I pointed them to the public addresses of their Azure LBs. They require a valid token to be provided in the query string to auth.

GitTorre · 2020-05-09T17:56:54Z

I see. Did you supply a prefix in the config? That is, "https://...". If the supplied port is not 443 but is secured, then you have to add https:// in the hostName string.
https://github.com/microsoft/service-fabric-observer/blob/master/Documentation/Observers.md#networkobserver

markwragg · 2020-05-09T18:21:04Z

No I’m pretty sure I didn’t, so can try that on Monday.

GitTorre · 2020-05-09T18:29:12Z

For endpoints in the the same cluster, you should test with local network IP address for hostname. Either http or tcp.

E.g.,

{
"hostname": "https://10.0.0.10",
"port": 10200,
"protocol": "http"
}

Or

{
"hostname": "10.0.0.10",
"port": 10200,
"protocol": "tcp"
}

Have a great weekend and thank you for all the help!

markwragg · 2020-05-11T17:17:03Z

I’m running 2.0.7 and it looks fine to me. Switching to adding an https:// prefix for my non standard port tests is working fine also.

GitTorre · 2020-05-11T17:32:30Z

Right on. Thanks again for all the feedback, suggestions and good bugs. Please let me know how FO/CO work out for you in production. You can always reach me at ctorre at microsoft.com

GitTorre self-assigned this May 5, 2020

GitTorre added the bug Something isn't working label May 5, 2020

GitTorre added this to the 2.0.6 milestone May 5, 2020

markwragg mentioned this issue May 6, 2020

NetworkObserver: Filter duplicate endpoints before testing #62

Closed

GitTorre linked a pull request May 7, 2020 that will close this issue

FO 2.0.6 #63

Merged

GitTorre closed this as completed May 7, 2020

GitTorre reopened this May 9, 2020

GitTorre modified the milestones: 2.0.6, 2.0.7 May 9, 2020

GitTorre linked a pull request May 9, 2020 that will close this issue

2.0.7 #65

Merged

GitTorre closed this as completed May 11, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Troubleshooting NetworkObserver #61

Troubleshooting NetworkObserver #61

markwragg commented May 5, 2020 •

edited

Loading

GitTorre commented May 5, 2020 •

edited

Loading

GitTorre commented May 5, 2020

markwragg commented May 5, 2020

GitTorre commented May 5, 2020

markwragg commented May 5, 2020 •

edited

Loading

GitTorre commented May 6, 2020 •

edited

Loading

GitTorre commented May 6, 2020 •

edited

Loading

markwragg commented May 6, 2020 •

edited

Loading

GitTorre commented May 6, 2020

markwragg commented May 6, 2020

GitTorre commented May 6, 2020 •

edited

Loading

markwragg commented May 7, 2020

GitTorre commented May 9, 2020

markwragg commented May 9, 2020

GitTorre commented May 9, 2020

markwragg commented May 9, 2020

GitTorre commented May 9, 2020 •

edited

Loading

markwragg commented May 9, 2020

GitTorre commented May 9, 2020 •

edited

Loading

markwragg commented May 11, 2020

GitTorre commented May 11, 2020 •

edited

Loading

Troubleshooting NetworkObserver #61

Troubleshooting NetworkObserver #61

Comments

markwragg commented May 5, 2020 • edited Loading

GitTorre commented May 5, 2020 • edited Loading

GitTorre commented May 5, 2020

markwragg commented May 5, 2020

GitTorre commented May 5, 2020

markwragg commented May 5, 2020 • edited Loading

GitTorre commented May 6, 2020 • edited Loading

GitTorre commented May 6, 2020 • edited Loading

markwragg commented May 6, 2020 • edited Loading

GitTorre commented May 6, 2020

markwragg commented May 6, 2020

GitTorre commented May 6, 2020 • edited Loading

markwragg commented May 7, 2020

GitTorre commented May 9, 2020

markwragg commented May 9, 2020

GitTorre commented May 9, 2020

markwragg commented May 9, 2020

GitTorre commented May 9, 2020 • edited Loading

markwragg commented May 9, 2020

GitTorre commented May 9, 2020 • edited Loading

markwragg commented May 11, 2020

GitTorre commented May 11, 2020 • edited Loading

markwragg commented May 5, 2020 •

edited

Loading

GitTorre commented May 5, 2020 •

edited

Loading

markwragg commented May 5, 2020 •

edited

Loading

GitTorre commented May 6, 2020 •

edited

Loading

GitTorre commented May 6, 2020 •

edited

Loading

markwragg commented May 6, 2020 •

edited

Loading

GitTorre commented May 6, 2020 •

edited

Loading

GitTorre commented May 9, 2020 •

edited

Loading

GitTorre commented May 9, 2020 •

edited

Loading

GitTorre commented May 11, 2020 •

edited

Loading