sled-agent: Change timeout waiting to find switch zones #9699
+78
−98
sled-agent attempts to find the IPs of all switch zones in two places: when setting up NAT entries for services, and during rack setup (RSS).
Prior to this PR, both of these would attempt to find both switches for 5 minutes, and after that point would proceed as long as they'd found at least one. This is sorta-okay-but-not-really fine for NAT entries (more on this in another issue shortly), because Nexus has a background task that will eventually come back around and sync NAT entries for services. But it's not fine for rack setup: if we proceed with only one switch found when the RSS config specifies uplinks for both, we'll fail to hand off to Nexus (details in #9678).
With this change, rack setup waits forever for all switches that have a configured uplink. This means that if a switch hasn't come up yet, RSS won't proceed, but that should be okay. (It would seem better if we could come up with one switch and then have Nexus reconcile things after the fact, but that would be a larger change with more risk and more testing difficulty, I think.)
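To make the difference concrete, here is a minimal sketch of the two wait policies as pure decision functions. It is not the code from this PR: `SwitchSlot`, `Decision`, `old_policy`, and `rack_setup_policy` are hypothetical names invented for illustration, and the real sled-agent logic (discovery, retries, async plumbing) is left out.

```rust
use std::collections::BTreeSet;
use std::time::Duration;

// Hypothetical identifier for the two switches in a rack.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum SwitchSlot {
    Switch0,
    Switch1,
}

#[derive(Debug, PartialEq, Eq)]
enum Decision {
    Proceed,
    KeepWaiting,
}

// Behavior described as "prior to this PR": try to find both switches
// until the timeout elapses, then proceed with at least one.
fn old_policy(
    found: &BTreeSet<SwitchSlot>,
    elapsed: Duration,
    timeout: Duration,
) -> Decision {
    let found_both = found.len() == 2;
    let timed_out_with_one = elapsed >= timeout && !found.is_empty();
    if found_both || timed_out_with_one {
        Decision::Proceed
    } else {
        Decision::KeepWaiting
    }
}

// Behavior described for rack setup after this PR: no timeout at all;
// proceed only once every switch with a configured uplink was found.
fn rack_setup_policy(
    found: &BTreeSet<SwitchSlot>,
    required: &BTreeSet<SwitchSlot>,
) -> Decision {
    if required.is_subset(found) {
        Decision::Proceed
    } else {
        Decision::KeepWaiting
    }
}
```

Under the old policy, a rack whose RSS config lists uplinks on both switches could proceed after 5 minutes with only one switch found, which is the case that breaks handoff. Under the rack setup policy sketched here, that same state keeps waiting, while a config with uplinks on a single switch still proceeds as soon as that switch is found.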
Fixes #9678.