Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Troubleshooting NetworkObserver #61

Closed
markwragg opened this issue May 5, 2020 · 21 comments · Fixed by #63 or #65
Closed

Troubleshooting NetworkObserver #61

markwragg opened this issue May 5, 2020 · 21 comments · Fixed by #63 or #65
Assignees
Labels
bug Something isn't working
Milestone

Comments

@markwragg
Copy link

markwragg commented May 5, 2020

Hi,

I've just configured NetworkObserver to monitor an Azure SQL connection endpoint (port 1433). Immediately after adding the config it went into warning state as follows:

'FO022' reported Warning for property 'NetworkHealth'.
NetworkObserver detected Warning threshold breach. Outbound Internet connection failure detected for endpoint <redacted>.database.windows.net

I've logged on to the nodes where the app I configured this check for is running and can do a Test-NetConnection on each of them verifying the endpoint/port can be reached. I've turned on verbose logging on one node and it just shows the connection test occurring but not any errors.

What can I do to further troubleshoot why its alerting? Does the test run from all the nodes where the app runs? Does it go into warning if just 1 of them doesn't connect?

The documentation makes reference that it can do ICMP or port checks. ICMP does fail to this address but I can't see any example of how you select whether the test is ICMP or port based.

Thanks,
Mark

@GitTorre
Copy link
Member

GitTorre commented May 5, 2020

Thanks for reporting this.

TCP is the right protocol for NetworkObserver tests generally. ICMP is useful for low/no data network communication, like ping or traceroute. For SQL and frankly most other networking you do from an SF app, for example, TCP is the default and built-in protocol. You can't modify this outside of the FO source code.

Can you run FO locally on your dev machine and debug NetworkObserver with the failing hostname/port pair config? NetworkObserver runs on all nodes. The health report is Application level.

Let me spin up a test SQL Azure db and see if this is a bug in NetworkObserver.

@GitTorre
Copy link
Member

GitTorre commented May 5, 2020

I will re-implement this scenario support with TcpClient. It’s a bug.

@GitTorre GitTorre self-assigned this May 5, 2020
@GitTorre GitTorre added the bug Something isn't working label May 5, 2020
@markwragg
Copy link
Author

Thanks @GitTorre good to know I wasn’t just doing something wrong!

@GitTorre GitTorre added this to the 2.0.6 milestone May 5, 2020
@GitTorre
Copy link
Member

GitTorre commented May 5, 2020

@markwragg Pushed fix to develop if you want to test it out. DiskObs fix is also there.

@markwragg
Copy link
Author

markwragg commented May 5, 2020

I haven't tested the code under develop yet, but one further bug i've noticed (with the existing master code) is that when a warning fires for a particular endpoint it appears under every application that has endpoints configured for monitoring in NetworkObserver rather than just those that it was explicitly listed for.

That is to say if i'm monitoring endpoints for app1, app2, app3 and I have the database.windows.net endpoint under just app2, its now alerting for all 3 apps, even though its not listed as an endpoint for app1 or app3.

This may already be resolved by your latest code changes but I thought i'd mention it in case you think it may not be.

(Note: I didn't notice this previously because I had only added the endpoint for one app, but now i've added further endpoint config for additional apps its become apparent).

@GitTorre
Copy link
Member

GitTorre commented May 6, 2020

Hmm. I will look into this. Can you share your config?

@GitTorre
Copy link
Member

GitTorre commented May 6, 2020

Wait a while. I need to do a few more things with this stuff. (thus develop...)

Edit: Fixed a few things. Thanks for the bugs. We really appreciate it! Should push to develop tonight.

@markwragg
Copy link
Author

markwragg commented May 6, 2020

I had a further thought on this observer. The fact that the connectivity is tested from every node isn't an issue in my environment, but it seems like it could be in other environments. For example one of my apps runs on just one of my scale sets. If I had set up firewall rules for that individual scale set to be able to access its specific endpoints then the tests would fail on other nodes.

Some way to configure endpoint testing to run only on specific scalesets might be useful. Let me know if you want me to split this into another issue as I know this is more of a feature request now.

Also how does it work if you configure the same endpoint under multiple applications? Does it result in multiple checks for the same endpoint, or does it deduplicate them before it executes them?

@GitTorre
Copy link
Member

GitTorre commented May 6, 2020

@markwragg Re: duplication, configuration is up to you. If you have shared endpoints, then just add them to a single app versus all of the related ones. In terms of node types (I assume that is what you mean), then you would just configure FO that you deploy to those nodes differently. Node type differentiation is already implicitly supported. That is, you can configure the same app differently on different node types. Or am I misunderstanding your suggestion? Thanks for the thoughts! Let me know. Let's wait to gain clarity here before creating another issue.

@markwragg
Copy link
Author

I wasn’t aware you could configure the same app differently on different node types, that does sound like it solves it though so I’ll look into it.

It feels like a shame to have to just declare an endpoint once when it’s used by multiple apps, I like that it would show all of the apps impacted when that endpoint was down.

Can you declare endpoints in a way that they apply to all apps?

@GitTorre
Copy link
Member

GitTorre commented May 6, 2020

The way you do that would be to add the endpoint/port pairs to all the targetApps you want to warn when the services are unreachable. That takes you back to the way it is currently implemented. :)

If what you are asking is for NetObs to filter duplicate endpoints out before conducting tests (which it does for each target app, of course), that can be done, but it will involve some work (not hard, just time and testing). Please create a new Issue for this and I will add it my queue. It is a fine suggestion.

Re: all apps, one thing that could be done is to add support for appTargetType, which would mean apply the settings to all apps of the same type (and thus all of their services). This is what you can do in AppObserver, for example, which is a more complex observer that does multiple things, unlike network observer which is simply an app-specific endpoint reachability tester, nothing more.

@GitTorre GitTorre linked a pull request May 7, 2020 that will close this issue
@GitTorre GitTorre closed this as completed May 7, 2020
@markwragg
Copy link
Author

All my network endpoint tests are working correctly now since deploying 2.0.6 and having correctly assigned each as either http or tcp with the new "protocol" setting in the config file. Thanks for the quick turnaround!

@GitTorre
Copy link
Member

GitTorre commented May 9, 2020

@markwragg can you verify this impl works for you? I also wrote a slightly different approach with retries and a remote write-only test that’s been tested and is currently in develop branch. The shipped (Release) impl will also work, but without retries which may be questionable (and it employs both a read and a write test which I think is unnecessary). Of course, the real connection logic (so, using the sql azure sdk and your connection string, firewall rules, etc) is up to your code where you should always employ your own retry logic. NetworkObserver just tries to prove that the service is reachable and in an efficient way.

@GitTorre GitTorre reopened this May 9, 2020
@markwragg
Copy link
Author

The current implementation works but for some non standard port HTTP endpoints I had to use TCP. I suspect maybe they didn’t work because of the write part of the query because they are authenticated endpoints so maybe switching to just read only would allow them to be tested via HTTP. Doing retries sounds good too. I say go for it.

@GitTorre
Copy link
Member

GitTorre commented May 9, 2020

For http, the web request test should work when auth fails. The server will respond in that case and that’s all we care about. Can you let me know more about this particular scenario?

@markwragg
Copy link
Author

It was two public endpoints our app listens on. One runs on port 8088 the other on 8120. They run on the same SF cluster and I pointed them to the public addresses of their Azure LBs. They require a valid token to be provided in the query string to auth.

@GitTorre
Copy link
Member

GitTorre commented May 9, 2020

I see. Did you supply a prefix in the config? That is, "https://...". If the supplied port is not 443 but is secured, then you have to add https:// in the hostName string.
https://github.com/microsoft/service-fabric-observer/blob/master/Documentation/Observers.md#networkobserver

@markwragg
Copy link
Author

No I’m pretty sure I didn’t, so can try that on Monday.

@GitTorre
Copy link
Member

GitTorre commented May 9, 2020

For endpoints in the the same cluster, you should test with local network IP address for hostname. Either http or tcp.

E.g.,

{
"hostname": "https://10.0.0.10",
"port": 10200,
"protocol": "http"
}

Or

{
"hostname": "10.0.0.10",
"port": 10200,
"protocol": "tcp"
}

Have a great weekend and thank you for all the help!

@GitTorre GitTorre modified the milestones: 2.0.6, 2.0.7 May 9, 2020
@GitTorre GitTorre linked a pull request May 9, 2020 that will close this issue
@markwragg
Copy link
Author

I’m running 2.0.7 and it looks fine to me. Switching to adding an https:// prefix for my non standard port tests is working fine also.

@GitTorre
Copy link
Member

GitTorre commented May 11, 2020

Right on. Thanks again for all the feedback, suggestions and good bugs. Please let me know how FO/CO work out for you in production. You can always reach me at ctorre at microsoft.com

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants