Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Periodically perform at least local checks for all uplink interfaces #2885

Merged
merged 1 commit into from
Nov 4, 2022

Conversation

milan-zededa
Copy link
Contributor

@milan-zededa milan-zededa commented Oct 25, 2022

When device has multiple interfaces, the connectivity testing, as
currently implemented, stops as soon as it finds one working uplink
interface. However, this means that some interfaces may not get retested
for a long time, potentially having some old (and now obsolete) error
still reported.

This is often the case when device has eth0 + wwan0 for management.
After a device boot, it may take a while for eth0 to get an IP address
from DHCP and during that time wwan0 will be tested (because eth0 is not
yet providing controller connectivity). But wwan0 can also spend some
time registering and connecting into a cellular network (likely more
than eth0 needs to get an IP address), producing intermittent testing
failures as a result.

For ethernet interface the failure is removed from the port status as
soon as it gets an IP address. It has zero cost so it will be retested
periodically. However, once eth0 starts working, wwan0 is no longer
being tested and may end up having a (potentially obsolete) error
reported indefinitely (or as long as eth0 is working).

Another possible scenario when this can be a problem, is when a EVE
user is troubleshooting LTE connectivity while using eth0 for management
at the same time. The user might not realize that wwan0 is not being
retested at that time, hence any changes made by the user (like
adding/replacing antennas or moving device to another location) will not
have any impact on the reported port status.

We can think of more such scenarious where this is an issue, for example
with multiple ethernet interfaces. Lower priority ones (even if
effectively free) might not be tested for a long time and have a false
negative or false positive information reported.

Simply put, the user is not getting a real-time feedback about fixed
(or broken) connectivity for lower-priority uplink interfaces and may get
confused by an obsolete reported status.

The challenge is that we want to limit the amount of traffic sent or
received via non-free interfaces. The idea of this commit is to change
the connectivity testing algorithm such that all interfaces are at
least verified using local-only checks - i.e. tests that do not involve
sending or receiving any traffic. For example:

  • Does interface (i.e. link) exist in the kernel?
  • Does interface have a routable IP address?
  • Is there any DNS server associated with the interface?
  • If there is a proxy config, is it valid?

Additionally, for wwan we already run (quite low cost) connectivity
probing and we could use the status determined using that here as well.
As part of this commit, the reported probing error was enhanced to
include all detected issues (not responding proxy and/or DNS server, etc.)

Signed-off-by: Milan Lenco milan@zededa.com

@eriknordmark
Copy link
Contributor

It seems a bit assymetric to re-test failed (cost > 0) interfaces but not periodically re-test working (cost > 0) interfaces; if they are fallback one would want to know whether or not they are working before they are needed. For this one can envision a periodic timer (default to never to match current behavior?) so that they can be re-tested e.g., once a day.

Are the typical failures you see local. as in not yet having an IP address? I wonder if we should have specific logic to clear such errors when the interface appears, when it gets an IP address, and when it gets a DNS address. That wouldn't need communicating over LTE.

@milan-zededa
Copy link
Contributor Author

milan-zededa commented Oct 31, 2022

@eriknordmark Yes, those errors are typically local. I will change this PR to flag/unflag these errors without generating traffic.

@milan-zededa milan-zededa changed the title Retest (even non-free) interface if it has been marked with error Periodically perform at least local checks for all uplink interfaces Nov 2, 2022
@milan-zededa milan-zededa force-pushed the interface-retest branch 3 times, most recently from 4d825c3 to 06a485b Compare November 2, 2022 16:23
When device has multiple interfaces, the connectivity testing, as
currently implemented, stops as soon as it finds one working uplink
interface. However, this means that some interfaces may not get retested
for a long time, potentially having some old (and now obsolete) error
still reported.

This is often the case when device has eth0 + wwan0 for management.
After a device boot, it may take a while for eth0 to get an IP address
from DHCP and during that time wwan0 will be tested (because eth0 is not
yet providing controller connectivity). But wwan0 can also spend some
time registering and connecting into a cellular network (likely more
than eth0 needs to get an IP address), producing intermittent testing
failures as a result.

For ethernet interface the failure is removed from the port status as
soon as it gets an IP address. It has zero cost so it will be retested
periodically. However, once eth0 starts working, wwan0 is no longer
being tested and may end up having a (potentially obsolete) error
reported indefinitely (or as long as eth0 is working).

Another possible scenario when this can be a problem, is when a EVE
user is troubleshooting LTE connectivity while using eth0 for management
at the same time. The user might not realize that wwan0 is not being
retested at that time, hence any changes made by the user (like
adding/replacing antennas or moving device to another location) will not
have any impact on the reported port status.

We can think of more such scenarious where this is an issue, for example
with multiple ethernet interfaces. Lower priority ones (even if
effectively free) might not be tested for a long time and have a false
negative or false positive information reported.

Simply put, the user is not getting a real-time feedback about fixed
(or broken) connectivity for lower-priority uplink interfaces and may get
confused by an obsolete reported status.

The challenge is that we want to limit the amount of traffic sent or
received via non-free interfaces. The idea of this commit is to change
the connectivity testing algorithm such that all interfaces are at
least verified using local-only checks - i.e. tests that do not involve
sending or receiving any traffic. For example:
 * Does interface (i.e. link) exist in the kernel?
 * Does interface have a routable IP address?
 * Is there any DNS server associated with the interface?
 * If there is a proxy config, is it valid?

Additionally, for wwan we already run (quite low cost) connectivity
probing and we could use the status determined using that here as well.
AS part of this commit, the reported probing error was enhanced to
include all detected issues (not responding proxy and/or DNS server, etc.)

Signed-off-by: Milan Lenco <milan@zededa.com>
@milan-zededa
Copy link
Contributor Author

milan-zededa commented Nov 2, 2022

@eriknordmark I have updated the commit and the PR description. Algorithm for interface verification was enhanced to periodically perform at least local checks for all uplink interfaces.

Copy link
Contributor

@eriknordmark eriknordmark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great; let's run this through the tests.

@milan-zededa
Copy link
Contributor Author

Integration tests look OK, no idea why arm build has failed though - there is some cache hit issue even after restarting all build jobs.

@giggsoff
Copy link
Contributor

giggsoff commented Nov 3, 2022

Integration tests look OK, no idea why arm build has failed though - there is some cache hit issue even after restarting all build jobs.

Hope the problem will be resolved soon: #2891

@eriknordmark eriknordmark merged commit 65a36ad into lf-edge:master Nov 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants