IPv6 Multicast over UDP to share addresses #404
Conversation
```rust
let socket = new_ipv6_udp_socket()?;
socket.set_multicast_loop_v6(loopback)?;
socket.set_multicast_if_v6(interface)?;
let address = SocketAddrV6::new(Ipv6Addr::UNSPECIFIED, 0, 0, 0);
```
I could potentially set up the sender with a more specific address?
I think using the /64 primary underlay address assigned to the server (the one referenced in this comment) would be the address to use.
If we don't use link-local, we'll need some mechanism to know when that stable address is allocated. Should we just poll the interface using something like ipadm show-addr? Is there a better, more recommended mechanism?
That seems like a reasonable way to start. I imagine there may be some sort of SMF machinery that could be helpful here, like starting the sled-agent bootstrapping process after the on-server router has emitted a signal that the basic underlay network setup is complete.
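To make the polling approach concrete, here is a minimal sketch of shelling out to `ipadm show-addr` and waiting for a non-link-local IPv6 address to appear. The interface name, column parsing, and retry interval are illustrative assumptions, not the sled-agent's actual mechanism.

```rust
use std::process::Command;
use std::{thread, time::Duration};

// Hypothetical helper, not the sled-agent implementation: poll
// `ipadm show-addr` until a non-link-local IPv6 address in the "ok" state
// appears on the given interface (e.g. "trial0"), then return it.
fn wait_for_stable_addr(ifname: &str) -> std::io::Result<String> {
    loop {
        let output = Command::new("ipadm").arg("show-addr").output()?;
        let stdout = String::from_utf8_lossy(&output.stdout);
        // Default output columns are: ADDROBJ TYPE STATE ADDR (as in the
        // session captured later in this thread).
        for line in stdout.lines().skip(1) {
            let fields: Vec<&str> = line.split_whitespace().collect();
            if fields.len() < 4 {
                continue;
            }
            let (addrobj, state, addr) = (fields[0], fields[2], fields[3]);
            let is_v6 = addr.contains(':');
            let is_link_local = addr.starts_with("fe80") || addr.starts_with("::1");
            if addrobj.starts_with(ifname) && state == "ok" && is_v6 && !is_link_local {
                return Ok(addr.to_string());
            }
        }
        thread::sleep(Duration::from_secs(1));
    }
}
```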
I filed #443 to follow up on this.
I'm personally okay moving forward with "UNSPEC" for the moment, with plans to replace this.
The "assignment of the prefix from a router" seems like it requires having a programmable router, which, AFAIK, we don't yet have (physically or virtually). What would it take to create a virtual router that we could run on one of the test sleds, until we have a real sidecar that can be used?
```rust
// TODO: I tried binding on the input value of "addr.ip()", but doing so
// returns errno 22 ("Invalid Input").
//
// This may be binding to a larger address range than we want.
```
I want to highlight this TODO - we join a multicast group based on the IP address, but leave it UNSPECIFIED here in the receiver.
Using unspec is fine here. The relevant bit of the bind here is for the port. Because the socket already has a multicast address specified, that is the "bound" address.
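To illustrate the point about the bind only mattering for the port, here is a rough receiver-side sketch using `std::net::UdpSocket` (the PR uses its own `new_ipv6_udp_socket` helper instead); the group address and port below are placeholders.

```rust
use std::net::{Ipv6Addr, SocketAddrV6, UdpSocket};

fn multicast_receiver() -> std::io::Result<UdpSocket> {
    // Placeholder group/port for illustration; the PR defines its own values.
    let group = Ipv6Addr::new(0xff02, 0, 0, 0, 0, 0, 0, 0x1);
    let port = 7645;

    // Binding to UNSPECIFIED is fine: the bind establishes which port we
    // listen on, while group membership (below) determines which multicast
    // traffic is actually delivered to this socket.
    let socket = UdpSocket::bind(SocketAddrV6::new(Ipv6Addr::UNSPECIFIED, port, 0, 0))?;

    // Interface index 0 means "let the system pick".
    socket.join_multicast_v6(&group, 0)?;
    Ok(socket)
}
```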
```rust
let address = SocketAddrV6::new(
    Ipv6Addr::new(scope, 0, 0, 0, 0, 0, 0, 0x1),
    7645,
    0,
    0,
);
```
Other than the scope - which is intentional - the rest of this address is 100% arbitrary.
I think we should use either Admin-Local or Site-Local scope here. It's not guaranteed that there will be an L2 domain across the rack. RFD 63 lays out two possible paths for bootstrapping the rack, one that has an L2 broadcast domain for starting up, and another that starts in L3. I'm personally leaning toward the latter so we do not have to change the shape of the network as a part of starting up. Admin-Local or Site-Local scopes should work for either alternative, but Link-Local will only work for the former.
I'd suggest coming up with a set of constant/well-known multicast group addresses that correspond to particular communication domains. For example, for Rift peering in Maghemite we use ff02::a1f7.
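A sketch of what such a well-known group constant might look like follows; the admin-local (`ff04`) scope follows the suggestion above, while the group ID itself is invented purely for illustration.

```rust
use std::net::Ipv6Addr;

/// Hypothetical well-known, admin-local-scoped multicast group for sled
/// bootstrap discovery. The group ID (0x1de1) is made up for illustration;
/// a real constant would be chosen and documented project-wide.
pub const BOOTSTRAP_DISCOVERY_GROUP: Ipv6Addr =
    Ipv6Addr::new(0xff04, 0, 0, 0, 0, 0, 0, 0x1de1);
```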
This comment definitely led me to some required reading. Thank you for the feedback @rcgoodfellow. I still have a few questions on this though.
- I keep reading that site-local IPv6 addresses are deprecated. What impact would that have on choosing site-local as a multicast scope?
- If we use a site-local multicast scope, does that mean that the "stable" addresses assigned to the link must be in the site-local format?
- It looks like admin-local multicast scope doesn't have a corresponding format for addresses. Are these just global unicast addresses then?
- Will the router on the switches prevent admin-local or site-local multicast from leaving a single rack? Is a "site" or smallest administrative domain used for admin-local just something that we are allowed to determine? In other words, can Oxide just go ahead and say any site-local or admin-local traffic must be contained within a single rack?
- If we went with an L2 domain across the rack, and allowed use of link-local addresses, we could ensure that bootstrap traffic never left the rack, and also ensure that the sled-agent bootstrap server was inaccessible from outside the rack automatically. With global unicast addresses (assuming that's what we use for site-local/admin-local), how do we ensure this? Firewall rules?
- My understanding is that site-local ULAs were deprecated, but site-local multicast addresses are still OK. However, it seems to me that some of the discussion in RFC 3879 also applies to site-local multicast, in particular section 2.5. We do completely own our networks, so we can take it upon ourselves to come up with a definition of "site" that is useful to us (and understood by our routers).
- I do not believe so, but this is a good question to get a concrete answer for.
- By corresponding, do you mean the source address that will be used to send messages to the multicast group? If so, I think the answer is tied to (2), e.g. I do not believe multicast scope constrains source address scope, but I'll find out for sure.
- I believe this is the very purpose of the admin-local scope, to let admins decide what is "local". And I believe in either case the answer is yes.
- For global unicast addresses sending to a scoped multicast address (or any multicast address, for that matter), some router would need to route that traffic out of the rack in order for it to leave. We can constrain the propagation of multicast to only live within a single rack in a number of ways (only allowing multicast to route to servers, limiting TTLs, etc.). For unicast traffic, we could similarly have an address space that is only routed within a rack. Just to be clear, we are talking about the server NICs and not the SP NICs, right?
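As one concrete example of the "limiting TTLs" option, the hop limit on the sending socket can be capped. This is a sketch assuming the `socket2` crate (which the `new_ipv6_udp_socket` helper appears to be built on, given the `set_multicast_if_v6` call); the specific hop value is an arbitrary illustration.

```rust
use socket2::{Domain, Protocol, Socket, Type};

// Sketch: cap the IPv6 multicast hop limit so discovery packets cannot be
// forwarded far beyond the rack, even by a router willing to route them.
// The value 4 is an arbitrary illustration.
fn capped_multicast_socket() -> std::io::Result<Socket> {
    let socket = Socket::new(Domain::IPV6, Type::DGRAM, Some(Protocol::UDP))?;
    socket.set_multicast_hops_v6(4)?;
    Ok(socket)
}
```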
Thanks, Ry.

> By corresponding, do you mean the source address that will be used to send messages to the multicast group? If so, I think the answer is tied to (2), e.g. I do not believe multicast scope constrains source address scope, but I'll find out for sure.

Yes, exactly. There is a site-specific unicast address format and a site-local multicast scope, but for admin there is only a multicast scope, with no related unicast address.

> Just to be clear, we are talking about the server NICs and not the SP NICs, right?

Yes.
```rust
    0,
);
let loopback = false;
let interface = 0;
```
I'd be happy to pick a more specific interface, if there was a good way to do so. Feedback welcome.
This is a tricky question. I have not given much thought to multicast routing yet. At a basic level there are two potentially viable interfaces for this software to choose from, and I think the right answer might be both, so not specifying the interface for now seems reasonable. This means that traffic will egress on both interfaces and also ingress on both interfaces for the receiving servers.
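If a specific interface did become desirable, one way to get the index that `set_multicast_if_v6` expects is `if_nametoindex`. Below is a sketch via the `libc` crate, with the interface name as a hypothetical placeholder.

```rust
use std::ffi::CString;

// Sketch: resolve an interface name (e.g. "igb0") to the numeric index
// expected by set_multicast_if_v6, instead of passing 0 ("any").
fn interface_index(name: &str) -> std::io::Result<u32> {
    let c_name = CString::new(name)
        .map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidInput, e))?;
    // SAFETY: if_nametoindex only reads the NUL-terminated name.
    let index = unsafe { libc::if_nametoindex(c_name.as_ptr()) };
    if index == 0 {
        Err(std::io::Error::last_os_error())
    } else {
        Ok(index)
    }
}
```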
```rust
Ok((_, addr)) => {
    info!(log, "Bootstrap Peer Monitor: Successfully received an address: {}", addr);
    sleds.lock().await.insert(addr);
```
This would ideally be the address we use for subsequent communication with the sled.
Admittedly the sled agent / bootstrap servers are using an SMF file configured to use IPv4 addresses, but this is an IPv6 address. That presents a bit of a challenge - presumably we'd want everyone to be communicating over IPv6 in the long-term, no?
Currently there is no plan to have any IPv4 on the underlay.
I think it's still an open question how stable addresses will find their way onto servers. The approach I've been taking in my testing setups is the following.
- Each router running on a rack switch is started with an IPv6 /56 prefix. Who provides that? ... not sure.
- When routers running on servers peer with a rack-level router, the rack-level router delegates a /64 to them and then the router running on the server automatically assigns the first address in that /64 to the server.
I will make the change to use IPv6 for the bootstrap server explicitly within this change.
I hear you on the "no IPv4 in the underlay" - we should trend in that direction - but that transition can be more gradual.
Filing #442 to track
| // "-1" to account for ourselves. | ||
| // | ||
| // NOTE: Clippy error exists while the compile-time unlock | ||
| // threshold is "1", because we basically don't require any | ||
| // peers to unlock. | ||
| #[allow(clippy::absurd_extreme_comparisons)] | ||
| if other_agents.len() < UNLOCK_THRESHOLD - 1 { | ||
| return Err(BackoffError::Transient( | ||
| BootstrapError::NotEnoughPeers, | ||
| )); | ||
| } |
This bit is admittedly a bit of a "trick". The low UNLOCK_THRESHOLD means that practically, "other_agents.len()" is zero in a single-sled setup, so "sled quorum" is reached by doing very little.
Hey @rcgoodfellow, @andrewjstone - this PR is somewhat WIP, with my largest concerns highlighted in comments above. I'm interested in setting up a configuration for testing this locally on my illumos machine, but I think I might need some assistance. I want to create a Zone which has a unique IPv6 address that I can ping from the Global. So far, I set up a non-global Zone with an exclusive IP stack, gave it a VNIC, created an interface on that VNIC, and tried to create an IPv6 address, but I couldn't ping that IPv6 address from the global zone.

```
root@trial:~# dladm show-vnic
LINK         OVER      SPEED  MACADDRESS        MACADDRTYPE  VID
trial0       ?         1000   2:8:20:fa:6f:be   random
root@trial:~# ipadm show-if
IFNAME     STATE   CURRENT        PERSISTENT
lo0        ok      -m-v------46   ---
trial0     down    bm--------46   -46
root@trial:~# ipadm create-addr -T addrconf trial0/v6
root@trial:~# ipadm show-addr
ADDROBJ           TYPE      STATE  ADDR
lo0/v4            static    ok     127.0.0.1/8
lo0/v6            static    ok     ::1/128
trial0/v6         addrconf  ok     fe80::8:20ff:fefa:6fbe/10
trial0/?          addrconf  ok     2600:6c4a:7c7f:f510:8:20ff:fefa:6fbe/64
trial0/v6         addrconf  ok     2600:6c4a:7c7f:f510::869/128
root@trial:~# routeadm -e ipv6-routing
root@trial:~# routeadm -u
```

Meanwhile, in the global zone, pinging any of those IPv6 addresses just returns "No route to host". I'd be interested in getting this tested, at least manually, but that configuration seems like a bit of a blocker.
andrewjstone left a comment
Sean, thank you very much for implementing this! This is a great step forward towards rack bootstrapping.
While there are some open questions and definitely some discussion to be had about further integration work, I'm fine merging this in and iterating.
```rust
// encrypted, authenticated, or otherwise verified. We're just using
// it as a starting point for swapping addresses.
let message =
    b"We've been trying to reach you about your car's extended warranty";
```
lol
```rust
pub async fn initialize(&self) -> Result<(), BootstrapError> {
    info!(&self.log, "bootstrap service initializing");

    self.establish_sled_quorum().await?;
```
I believe there needs to be a determination about phases here. If a sled does not already know the group of existing sleds (saved in a file) where each sled id is a public key or key fingerprint, that sled is not capable of unlocking itself, or doing much of anything except listening on its multicast address for a request from the primary. (Note that we use the public key as the sled id, since we want to allow movement of sleds throughout the rack and for their IP addresses to change.)
The above scenario is going to be the case for all sleds except the primary during initial rack setup. The primary will start a PeerMonitor and wait for (in the demo case) a predefined number of sleds to respond. The primary will then connect over TCP to a predetermined SPDM port at each of the sled-agents that responded, using the IP address each responded from. The primary will then become an SPDM requester and attempt to create a secure channel (and pretend to attest measurements) over that TCP connection. When all the sleds are connected via secure SPDM channels, a secret will be generated and distributed to each sled-agent along with its individual key share and group information. This is Phase 1 of the protocol and only happens during rack initialization. We aren't yet considering what it means to add a sled to the group at runtime.
Phase 2 is what I believe establish_sled_quorum was meant to encapsulate. This is where each sled already has a key share and knows the group members. In this case, when a sled restarts, it will run a PeerMonitor to get the IPs of a threshold of sleds and then create a secure SPDM channel to each of those sled-agents. Both sides of the SPDM channel should ensure that any received certs or digests actually match the group information, although it's unclear if we need to do this for the demo. This is because while SPDM supports mutual authentication, it's not yet implemented, and so we are going to pretend to set up a secure channel by running the protocol up to the implemented challenge authentication phase. Once our pseudo-SPDM secured channel is established, the remote sled can send the requested key share. When a quorum of shares is retrieved, the rack secret can be reconstructed and the sled unlocked.
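A rough sketch of how those two phases could be modeled; the names and fields here are invented for illustration and are not taken from the sled-agent code.

```rust
// Illustrative only: a minimal way to model the two bootstrap phases
// described above. Names and fields are hypothetical.
enum TrustQuorumPhase {
    /// Phase 1: rack initialization. The primary discovers peers via the
    /// PeerMonitor, opens SPDM channels to each, then generates and
    /// distributes the rack secret's key shares plus group membership.
    RackInit { expected_sleds: usize },

    /// Phase 2: normal restart. The sled already holds a key share and the
    /// group membership; it collects a threshold of shares from peers over
    /// SPDM channels and reconstructs the rack secret to unlock itself.
    Unlock { threshold: usize },
}
```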
Sure, this code - as implemented - does not have a phase explicitly for sharing keys.
I don't have a strong opinion about whether this is implemented in establish_sled_quorum or not - but as long as it happens after the PeerMonitor is up and running, I'm happy.
rcgoodfellow left a comment
This LGTM. Happy to iterate on evolving how addressing works as the systems that will be providing that addressing come together more.
In order to allow for encrypted storage on individual sleds without the need for a user to type a password at bootup, we utilize secret sharing across sleds, where a threshold number of sleds need to communicate in order to generate a `rack secret`. This rack secret can then be used to derive local encryption keys from individual sleds. We therefore provide the ability to prevent an attacker from stealing a subset of sleds or storage devices and obtaining any data. In fact, the control plane software does not even boot until the rack secret is reconstructed and the protected storage unlocked.

There are quite a few moving parts required in order to implement a trust quorum, some of which involve the service processor and hardware root of trust. This commit only implements the part of the trust quorum responsible for retrieving existing key shares over an unfinished SPDM channel. It runs entirely on the host machine as part of the sled-agent. The code builds upon the multicast discovery code in #404, the SPDM negotiation code in #407 and the secret sharing code in #429.

In the "normal" lifetime of an Oxide rack, a rack secret will be generated upon initialization of the new rack by the customer. The shares will then be distributed over SPDM channels to individual sleds such that they can be retrieved and combined at a later time when an individual sled or the entire rack reboots. The initial generation and distribution of shares is *not* part of this commit.

We fake rack initialization through the completely insecure use of a configuration file provided as part of the `omicron-package` install that contains all key shares. The configuration file disables the trust quorum by default, so that the sled-agent continues to run on a single node. When enabled, share retrieval attempts will begin and when a quorum of shares is received, the rack secret will be reconstructed, and the rest of the control plane will begin to boot. In order for this to work, the user also has to edit the config file to ensure that a different `sled_index` (which points to a given unique share) exists in each config file, and then the sled-agent must be restarted with `svcadm restart sled-agent`. The included config file only includes shares for 2 sleds, but a new one can be generated with the provided `gen_trust_quorum_config` program. Lastly, the location of the config file is given in the sled-agent smf file and passed through as `rack_secret_dir` in the `BootstrapConfig` struct.

The SPDM protocol is run over a 2-byte size header framed transport operating over a TCP stream. We generate a client and server to initialize this transport, perform SPDM negotiation, and then begin share retrieval. As noted in #407, only the negotiation phase of the SPDM protocol is currently implemented, and so we simply return the TCP-based transport when negotiation completes, and pretend for now that we are operating over a secure channel. This allows us to test out the end-to-end behavior before we have a production-ready SPDM implementation integrated.

This commit also makes a small change to the SPDM transport to provide for timeouts on `send` and `recv` operations, and no longer requires passing a logger to each call of `recv`.
In order to allow for encrypted storage on individual sleds without the need for a user to type a password at boot, we utilize secret sharing across sleds, where a threshold number of sleds need to communicate in order to generate a `rack secret`. This rack secret can then be used to derive local encryption keys for individual sleds. We therefore provide the ability to prevent an attacker from stealing a subset of sleds or storage devices and obtaining any data. In fact, the control plane software does not even boot until the rack secret is reconstructed and the protected storage unlocked.

There are quite a few moving parts required in order to implement a trust quorum, some of which involve the service processor and hardware root of trust. This commit only implements the part of the trust quorum responsible for retrieving existing key shares over an unfinished SPDM channel. It runs entirely on the host machine as part of the sled-agent. The code builds upon the multicast discovery code in #404, the SPDM negotiation code in #407 and the secret sharing code in #429.

In the "normal" lifetime of an Oxide rack, a rack secret will be generated upon initialization of the new rack by the customer. The shares will then be distributed over SPDM channels to individual sleds such that they can be retrieved and combined at a later time when an individual sled or the entire rack reboots. The initial generation and distribution of shares is *not* part of this commit. Instead, shares are individually distributed along with metadata as a `ShareDistribution` stored in a `share.json` file in the `sled_agent/pkg` directory under the install directory configured for `omicron-package install`. Share generation must be done manually for now, but a follow-up commit is coming for a deployment system that will generate the rack secret and distribute the shares along with the install of omicron. If the `share.json` file is not present, the server operates in single-node mode, and does not try to form a trust quorum. This behavior is required for backwards compatibility with current development setups and will eventually be removed.

The SPDM protocol is run over a 2-byte size header framed transport operating over a TCP stream. We generate a client and server to initialize this transport, perform SPDM negotiation, and then begin share retrieval. As noted in #407, only the negotiation phase of the SPDM protocol is currently implemented, and so we simply return the TCP-based transport when negotiation completes, and pretend for now that we are operating over a secure channel. This allows us to test out the end-to-end behavior before we have a production-ready SPDM implementation integrated.

This commit also makes a small change to the SPDM transport to provide for timeouts on `send` and `recv` operations, and no longer requires passing a logger to each call of `recv`.
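The "2-byte size header framed transport" described above amounts to length-prefixed messages on a TCP stream. A minimal synchronous sketch is shown below; the real implementation is async and enforces timeouts, and the byte order here is an assumption.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

// Sketch of a 2-byte length-prefixed frame over TCP, per the transport
// described above. Error handling, size limits, and byte order (big-endian
// here) are simplifying assumptions.
fn send_frame(stream: &mut TcpStream, payload: &[u8]) -> std::io::Result<()> {
    let len = u16::try_from(payload.len())
        .map_err(|_| std::io::Error::new(std::io::ErrorKind::InvalidInput, "frame too large"))?;
    stream.write_all(&len.to_be_bytes())?;
    stream.write_all(payload)
}

fn recv_frame(stream: &mut TcpStream) -> std::io::Result<Vec<u8>> {
    let mut header = [0u8; 2];
    stream.read_exact(&mut header)?;
    let mut payload = vec![0u8; u16::from_be_bytes(header) as usize];
    stream.read_exact(&mut payload)?;
    Ok(payload)
}
```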
This PR establishes the following:

- `multicast.rs`: A sender/receiver pair of UDP IPv6 multicast sockets which can be used for address discovery.
- `discovery.rs`: A wrapper around the multicast pair in the form of `PeerMonitor`, which just sends/receives packets non-stop.
- `bootstrap/agent.rs`: Integration of `PeerMonitor` into the bootstrap agent.

As implemented, this PR uses a constant `UNLOCK_THRESHOLD` of `1`, meaning that sleds will not be blocked behind this discovery/unlock protocol.
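Finally, a rough synchronous sketch of how the sender/receiver pair and the peer set fit together. The actual `PeerMonitor` is async and uses the project's socket helpers; the group, port, and announcement payload here are placeholders echoing the snippets quoted in the review.

```rust
use std::collections::HashSet;
use std::net::{Ipv6Addr, SocketAddr, SocketAddrV6, UdpSocket};
use std::time::Duration;

// Illustrative only: announce ourselves on a multicast group and record the
// source address of every peer we hear from. The real PeerMonitor is async
// and shares the peer set behind a lock.
fn run_peer_monitor() -> std::io::Result<()> {
    let group = Ipv6Addr::new(0xff02, 0, 0, 0, 0, 0, 0, 0x1);
    let port = 7645;

    let receiver = UdpSocket::bind(SocketAddrV6::new(Ipv6Addr::UNSPECIFIED, port, 0, 0))?;
    receiver.join_multicast_v6(&group, 0)?;
    receiver.set_read_timeout(Some(Duration::from_secs(5)))?;

    let sender = UdpSocket::bind(SocketAddrV6::new(Ipv6Addr::UNSPECIFIED, 0, 0, 0))?;

    let mut sleds: HashSet<SocketAddr> = HashSet::new();
    let mut buf = [0u8; 128];
    loop {
        // Announce ourselves to the group (placeholder payload)...
        sender.send_to(
            b"sled-agent bootstrap announcement",
            SocketAddrV6::new(group, port, 0, 0),
        )?;
        // ...and record the address of any peer that announces itself.
        if let Ok((_, addr)) = receiver.recv_from(&mut buf) {
            sleds.insert(addr);
        }
    }
}
```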