IPv6 Multicast over UDP to share addresses #404
Conversation
```rust
let socket = new_ipv6_udp_socket()?;
socket.set_multicast_loop_v6(loopback)?;
socket.set_multicast_if_v6(interface)?;
let address = SocketAddrV6::new(Ipv6Addr::UNSPECIFIED, 0, 0, 0);
```
I could potentially set up the sender with a more specific address?
I think using the /64 primary underlay address assigned to the server (the one referenced in this comment) would be the address to use.
If we don't use link-local, we'll need some mechanism to know when that stable address is allocated. Should we just poll the interface using something like ipadm show-addr? Is there a better, more recommended mechanism?
That seems like a reasonable way to start. I imagine there may be some sort of SMF machinery that could be helpful here, like starting the sled-agent bootstrapping process after the on-server router has emitted a signal that the basic underlay network setup is complete.
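To make the polling approach concrete, here is a minimal sketch of shelling out to `ipadm show-addr` and waiting for a non-link-local IPv6 address to appear. The interface name, column parsing, and retry interval are illustrative assumptions, not the sled-agent's actual mechanism.

```rust
use std::process::Command;
use std::{thread, time::Duration};

// Hypothetical helper, not the sled-agent implementation: poll
// `ipadm show-addr` until a non-link-local IPv6 address in the "ok" state
// appears on the given interface (e.g. "trial0"), then return it.
fn wait_for_stable_addr(ifname: &str) -> std::io::Result<String> {
    loop {
        let output = Command::new("ipadm").arg("show-addr").output()?;
        let stdout = String::from_utf8_lossy(&output.stdout);
        // Default output columns are: ADDROBJ TYPE STATE ADDR (as in the
        // session captured later in this thread).
        for line in stdout.lines().skip(1) {
            let fields: Vec<&str> = line.split_whitespace().collect();
            if fields.len() < 4 {
                continue;
            }
            let (addrobj, state, addr) = (fields[0], fields[2], fields[3]);
            let is_v6 = addr.contains(':');
            let is_link_local = addr.starts_with("fe80") || addr.starts_with("::1");
            if addrobj.starts_with(ifname) && state == "ok" && is_v6 && !is_link_local {
                return Ok(addr.to_string());
            }
        }
        thread::sleep(Duration::from_secs(1));
    }
}
```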
I filed #443 to follow up on this.
I'm personally okay moving forward with "UNSPEC" for the moment, with plans to replace this.
The "assignment of the prefix from a router" seems like it requires having a programmable router, which, AFAIK, we don't yet have (physically or virtually). What would it take to create a virtual router that we could run on one of the test sleds, until we have a real sidecar that can be used?
```rust
// TODO: I tried binding on the input value of "addr.ip()", but doing so
// returns errno 22 ("Invalid Input").
//
// This may be binding to a larger address range than we want.
```
I want to highlight this TODO - we join a multicast group based on the IP address, but leave it UNSPECIFIED here in the receiver.
Using unspec is fine here. The relevant bit of the bind here is for the port. Because the socket already has a multicast address specified, that is the "bound" address.
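To illustrate the point about the bind only mattering for the port, here is a rough receiver-side sketch using `std::net::UdpSocket` (the PR uses its own `new_ipv6_udp_socket` helper instead); the group address and port below are placeholders.

```rust
use std::net::{Ipv6Addr, SocketAddrV6, UdpSocket};

fn multicast_receiver() -> std::io::Result<UdpSocket> {
    // Placeholder group/port for illustration; the PR defines its own values.
    let group = Ipv6Addr::new(0xff02, 0, 0, 0, 0, 0, 0, 0x1);
    let port = 7645;

    // Binding to UNSPECIFIED is fine: the bind establishes which port we
    // listen on, while group membership (below) determines which multicast
    // traffic is actually delivered to this socket.
    let socket = UdpSocket::bind(SocketAddrV6::new(Ipv6Addr::UNSPECIFIED, port, 0, 0))?;

    // Interface index 0 means "let the system pick".
    socket.join_multicast_v6(&group, 0)?;
    Ok(socket)
}
```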
```rust
let address = SocketAddrV6::new(
    Ipv6Addr::new(scope, 0, 0, 0, 0, 0, 0, 0x1),
    7645,
    0,
    0,
);
```
Other than the scope - which is intentional - the rest of this address is 100% arbitrary.
I think we should use either Admin-Local or Site-Local scope here. It's not guaranteed that there will be an L2 domain across the rack. RFD 63 lays out two possible paths for bootstrapping the rack, one that has an L2 broadcast domain for starting up, and another that starts in L3. I'm personally leaning toward the latter so we do not have to change the shape of the network as a part of starting up. Admin-Local or Site-Local scopes should work for either alternative, but Link-Local will only work for the former.
I'd suggest coming up with a set of constant/well-known multicast group addresses that correspond to particular communication domains. For example, for Rift peering in Maghemite we use ff02::a1f7.
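A sketch of what such a well-known group constant might look like follows; the admin-local (`ff04`) scope follows the suggestion above, while the group ID itself is invented purely for illustration.

```rust
use std::net::Ipv6Addr;

/// Hypothetical well-known, admin-local-scoped multicast group for sled
/// bootstrap discovery. The group ID (0x1de1) is made up for illustration;
/// a real constant would be chosen and documented project-wide.
pub const BOOTSTRAP_DISCOVERY_GROUP: Ipv6Addr =
    Ipv6Addr::new(0xff04, 0, 0, 0, 0, 0, 0, 0x1de1);
```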
This comment definitely led me to some required reading. Thank you for the feedback @rcgoodfellow. I still have a few questions on this though.
- I keep reading that site-local IPv6 addresses are deprecated. What impact would that have on choosing site-local as a multicast scope?
- If we use a site-local multicast scope, does that mean that the "stable" addresses assigned to the link must be in the site-local format?
- It looks like admin-local multicast scope doesn't have a corresponding format for addresses. Are these just global unicast addresses then?
- Will the router on the switches prevent admin-local or site-local multicast from leaving a single rack? Is a "site" or smallest administrative domain used for admin-local just something that we are allowed to determine? In other words, can Oxide just go ahead and say any site-local or admin-local traffic must be contained within a single rack?
- If we went with an L2 domain across the rack, and allowed use of link-local addresses, we could ensure that bootstrap traffic never left the rack, and also ensure that the sled-agent bootstrap server was inaccessible from outside the rack automatically. With global unicast addresses (assuming that's what we use for site-local/admin-local), how do we ensure this? Firewall rules?
- My understanding is that site-local ULAs were deprecated, but site-local multicast addresses are still OK. However, it seems to me that some of the discussion in RFC 3879 also applies to site-local multicast, in particular section 2.5. We do completely own our networks, so we can take it upon ourselves to come up with a definition of "site" that is useful to us (and understood by our routers).
- I do not believe so, but this is a good question to get a concrete answer for.
- By corresponding, do you mean the source address that will be used to send messages to the multicast group? If so, I think the answer is tied to (2), e.g. I do not believe multicast scope constrains source address scope, but I'll find out for sure.
- I believe this is the very purpose of the admin-local scope, to let admins decide what is "local". And I believe in either case the answer is yes.
- For global unicast addresses sending to a scoped multicast address (or any multicast address, for that matter), some router would need to route that traffic out of the rack in order for it to leave. We can constrain the propagation of multicast to only live within a single rack in a number of ways (only allowing multicast to route to servers, limiting TTLs, etc.). For unicast traffic, we could similarly have an address space that is only routed within a rack. Just to be clear, we are talking about the server NICs and not the SP NICs, right?
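As one concrete example of the "limiting TTLs" option, the hop limit on the sending socket can be capped. This is a sketch assuming the `socket2` crate (which the `new_ipv6_udp_socket` helper appears to be built on, given the `set_multicast_if_v6` call); the specific hop value is an arbitrary illustration.

```rust
use socket2::{Domain, Protocol, Socket, Type};

// Sketch: cap the IPv6 multicast hop limit so discovery packets cannot be
// forwarded far beyond the rack, even by a router willing to route them.
// The value 4 is an arbitrary illustration.
fn capped_multicast_socket() -> std::io::Result<Socket> {
    let socket = Socket::new(Domain::IPV6, Type::DGRAM, Some(Protocol::UDP))?;
    socket.set_multicast_hops_v6(4)?;
    Ok(socket)
}
```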
Thanks, Ry.

> By corresponding, do you mean the source address that will be used to send messages to the multicast group? If so, I think the answer is tied to (2), e.g. I do not believe multicast scope constrains source address scope, but I'll find out for sure.

Yes, exactly. There is a site-specific unicast address format and a site-local multicast scope, but for admin there is only a multicast scope, with no related unicast address.

> Just to be clear, we are talking about the server NICs and not the SP NICs, right?

Yes.
```rust
    0,
);
let loopback = false;
let interface = 0;
```
I'd be happy to pick a more specific interface, if there was a good way to do so. Feedback welcome.
This is a tricky question. I have not given much thought to multicast routing yet. At a basic level there are two potentially viable interfaces for this software to choose from, and I think the right answer might be both, so not specifying the interface for now seems reasonable. This means that traffic will egress on both interfaces and also ingress on both interfaces for the receiving servers.
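If a specific interface did become desirable, one way to get the index that `set_multicast_if_v6` expects is `if_nametoindex`. Below is a sketch via the `libc` crate, with the interface name as a hypothetical placeholder.

```rust
use std::ffi::CString;

// Sketch: resolve an interface name (e.g. "igb0") to the numeric index
// expected by set_multicast_if_v6, instead of passing 0 ("any").
fn interface_index(name: &str) -> std::io::Result<u32> {
    let c_name = CString::new(name)
        .map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidInput, e))?;
    // SAFETY: if_nametoindex only reads the NUL-terminated name.
    let index = unsafe { libc::if_nametoindex(c_name.as_ptr()) };
    if index == 0 {
        Err(std::io::Error::last_os_error())
    } else {
        Ok(index)
    }
}
```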
```rust
Ok((_, addr)) => {
    info!(log, "Bootstrap Peer Monitor: Successfully received an address: {}", addr);
    sleds.lock().await.insert(addr);
```
This would ideally be the address we use for subsequent communication with the sled.
Admittedly the sled agent / bootstrap servers are using an SMF file configured to use IPv4 addresses, but this is an IPv6 address. That presents a bit of a challenge - presumably we'd want everyone to be communicating over IPv6 in the long-term, no?
Currently there is no plan to have any IPv4 on the underlay.
I think it's still an open question how stable addresses will find their way onto servers. The approach I've been taking in my testing setups is the following.
- Each router running on a rack switch is started with an IPv6 /56 prefix. Who provides that? ... not sure.
- When routers running on servers peer with a rack-level router, the rack-level router delegates a /64 to them and then the router running on the server automatically assigns the first address in that /64 to the server.
I will make the change to use IPv6 for the bootstrap server explicitly within this change.
I hear you on the "no IPv4 in the underlay" - we should trend in that direction - but that transition can be more gradual.
Filing #442 to track
| // "-1" to account for ourselves. | ||
| // | ||
| // NOTE: Clippy error exists while the compile-time unlock | ||
| // threshold is "1", because we basically don't require any | ||
| // peers to unlock. | ||
| #[allow(clippy::absurd_extreme_comparisons)] | ||
| if other_agents.len() < UNLOCK_THRESHOLD - 1 { | ||
| return Err(BackoffError::Transient( | ||
| BootstrapError::NotEnoughPeers, | ||
| )); | ||
| } |
This bit is admittedly a bit of a "trick". The low UNLOCK_THRESHOLD means that practically, "other_agents.len()" is zero in a single-sled setup, so "sled quorum" is reached by doing very little.
Hey @rcgoodfellow, @andrewjstone - this PR is somewhat WIP, with my largest concerns highlighted in comments above. I'm interested in setting up a configuration for testing this locally on my illumos machine, but I think I might need some assistance. I want to create a Zone which has a unique IPv6 address that I can ping from the Global. So far, I set up a non-global Zone with an exclusive IP stack, gave it a VNIC, created an interface on that VNIC, and tried to create an IPv6 address, but I couldn't ping that IPv6 address from the global zone.

```
root@trial:~# dladm show-vnic
LINK         OVER      SPEED  MACADDRESS        MACADDRTYPE  VID
trial0       ?         1000   2:8:20:fa:6f:be   random
root@trial:~# ipadm show-if
IFNAME     STATE   CURRENT        PERSISTENT
lo0        ok      -m-v------46   ---
trial0     down    bm--------46   -46
root@trial:~# ipadm create-addr -T addrconf trial0/v6
root@trial:~# ipadm show-addr
ADDROBJ           TYPE      STATE  ADDR
lo0/v4            static    ok     127.0.0.1/8
lo0/v6            static    ok     ::1/128
trial0/v6         addrconf  ok     fe80::8:20ff:fefa:6fbe/10
trial0/?          addrconf  ok     2600:6c4a:7c7f:f510:8:20ff:fefa:6fbe/64
trial0/v6         addrconf  ok     2600:6c4a:7c7f:f510::869/128
root@trial:~# routeadm -e ipv6-routing
root@trial:~# routeadm -u
```

Meanwhile, in the global zone, pinging any of those IPv6 addresses just returns "No route to host". I'd be interested in getting this tested, at least manually, but that configuration seems like a bit of a blocker.
andrewjstone left a comment
Sean, thank you very much for implementing this! This is a great step forward towards rack bootstrapping.
While there are some open questions and definitely some discussion to be had about further integration work, I'm fine merging this in and iterating.
```rust
// encrypted, authenticated, or otherwise verified. We're just using
// it as a starting point for swapping addresses.
let message =
    b"We've been trying to reach you about your car's extended warranty";
```
lol
```rust
pub async fn initialize(&self) -> Result<(), BootstrapError> {
    info!(&self.log, "bootstrap service initializing");

    self.establish_sled_quorum().await?;
```
I believe there needs to be a determination about phases here. If a sled does not already know the group of existing sleds (saved in a file) where each sled id is a public key or key fingerprint, that sled is not capable of unlocking itself, or doing much of anything except listening on its multicast address for a request from the primary. (Note that we use the public key as the sled id, since we want to allow movement of sleds throughout the rack and for their IP addresses to change.)
The above scenario is going to be the case for all sleds except the primary during initial rack setup. The primary will start a PeerMonitor and wait for (in the demo case) a predefined number of sleds to respond. The primary will then connect over TCP to a predetermined SPDM port at each of the sled-agents that responded, using the IP address each responded from. The primary will then become an SPDM requester and attempt to create a secure channel (and pretend to attest measurements) over that TCP connection. When all the sleds are connected via secure SPDM channels, a secret will be generated and distributed to each sled-agent along with its individual key share and group information. This is Phase 1 of the protocol and only happens during rack initialization. We aren't yet considering what it means to add a sled to the group at runtime.
Phase 2 is what I believe establish_sled_quorum was meant to encapsulate. This is where each sled already has a key share and knows the group members. In this case, when a sled restarts, it will run a PeerMonitor to get the IPs of a threshold of sleds and then create a secure SPDM channel to each of those sled-agents. Both sides of the SPDM channel should ensure that any received certs or digests actually match the group information, although it's unclear if we need to do this for the demo. This is because while SPDM supports mutual authentication, it's not yet implemented, and so we are going to pretend to set up a secure channel by running the protocol up to the implemented challenge authentication phase. Once our pseudo-SPDM secured channel is established, the remote sled can send the requested key share. When a quorum of shares is retrieved, the rack secret can be reconstructed and the sled unlocked.
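A rough sketch of how those two phases could be modeled; the names and fields here are invented for illustration and are not taken from the sled-agent code.

```rust
// Illustrative only: a minimal way to model the two bootstrap phases
// described above. Names and fields are hypothetical.
enum TrustQuorumPhase {
    /// Phase 1: rack initialization. The primary discovers peers via the
    /// PeerMonitor, opens SPDM channels to each, then generates and
    /// distributes the rack secret's key shares plus group membership.
    RackInit { expected_sleds: usize },

    /// Phase 2: normal restart. The sled already holds a key share and the
    /// group membership; it collects a threshold of shares from peers over
    /// SPDM channels and reconstructs the rack secret to unlock itself.
    Unlock { threshold: usize },
}
```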
Sure, this code - as implemented - does not have a phase explicitly for sharing keys.
I don't have a strong opinion about whether this is implemented in establish_sled_quorum or not - but as long as it happens after the PeerMonitor is up and running, I'm happy.
rcgoodfellow left a comment
This LGTM. Happy to iterate on evolving how addressing works as the systems that will be providing that addressing come together more.
In order to allow for encrypted storage on individual sleds without the need for a user to type a password at bootup, we utilize secret sharing across sleds, where a threshold number of sleds need to communicate in order to generate a `rack secret`. This rack secret can then be used to derive local encryption keys from individual sleds. We therefore provide the ability to prevent an attacker from stealing a subset of sleds or storage devices and obtaining any data. In fact, the control plane software does not even boot until the rack secret is reconstructed and the protected storage unlocked.

There are quite a few moving parts required in order to implement a trust quorum, some of which involve the service processor and hardware root of trust. This commit only implements the part of the trust quorum responsible for retrieving existing key shares over an unfinished SPDM channel. It runs entirely on the host machine as part of the sled-agent. The code builds upon the multicast discovery code in #404, the SPDM negotiation code in #407 and the secret sharing code in #429.

In the "normal" lifetime of an Oxide rack, a rack secret will be generated upon initialization of the new rack by the customer. The shares will then be distributed over SPDM channels to individual sleds such that they can be retrieved and combined at a later time when an individual sled or the entire rack reboots. The initial generation and distribution of shares is *not* part of this commit.

We fake rack initialization through the completely insecure use of a configuration file provided as part of the `omicron-package` install that contains all key shares. The configuration file disables the trust quorum by default, so that the sled-agent continues to run on a single node. When enabled, share retrieval attempts will begin and when a quorum of shares is received, the rack secret will be reconstructed, and the rest of the control plane will begin to boot. In order for this to work, the user also has to edit the config file to ensure that a different `sled_index` (which points to a given unique share) exists in each config file, and then the sled-agent must be restarted with `svcadm restart sled-agent`. The included config file only includes shares for 2 sleds, but a new one can be generated with the provided `gen_trust_quorum_config` program. Lastly, the location of the config file is given in the sled-agent smf file and passed through as `rack_secret_dir` in the `BootstrapConfig` struct.

The SPDM protocol is run over a 2-byte size header framed transport operating over a TCP stream. We generate a client and server to initialize this transport, perform SPDM negotiation, and then begin share retrieval. As noted in #407, only the negotiation phase of the SPDM protocol is currently implemented, and so we simply return the TCP-based transport when negotiation completes, and pretend for now that we are operating over a secure channel. This allows us to test out the end-to-end behavior before we have a production-ready SPDM implementation integrated.

This commit also makes a small change to the SPDM transport to provide for timeouts on `send` and `recv` operations, and no longer requires passing a logger to each call of `recv`.
In order to allow for encrypted storage on individual sleds without the need for a user to type a password at boot, we utilize secret sharing across sleds, where a threshold number of sleds need to communicate in order to generate a `rack secret`. This rack secret can then be used to derive local encryption keys for individual sleds. We therefore provide the ability to prevent an attacker from stealing a subset of sleds or storage devices and obtaining any data. In fact, the control plane software does not even boot until the rack secret is reconstructed and the protected storage unlocked.

There are quite a few moving parts required in order to implement a trust quorum, some of which involve the service processor and hardware root of trust. This commit only implements the part of the trust quorum responsible for retrieving existing key shares over an unfinished SPDM channel. It runs entirely on the host machine as part of the sled-agent. The code builds upon the multicast discovery code in #404, the SPDM negotiation code in #407 and the secret sharing code in #429.

In the "normal" lifetime of an Oxide rack, a rack secret will be generated upon initialization of the new rack by the customer. The shares will then be distributed over SPDM channels to individual sleds such that they can be retrieved and combined at a later time when an individual sled or the entire rack reboots. The initial generation and distribution of shares is *not* part of this commit. Instead, shares are individually distributed along with metadata as a `ShareDistribution` stored in a `share.json` file in the `sled_agent/pkg` directory under the install directory configured for `omicron-package install`. Share generation must be done manually for now, but a follow-up commit is coming for a deployment system that will generate the rack secret and distribute the shares along with the install of omicron. If the `share.json` file is not present, the server operates in single-node mode, and does not try to form a trust quorum. This behavior is required for backwards compatibility with current development setups and will eventually be removed.

The SPDM protocol is run over a 2-byte size header framed transport operating over a TCP stream. We generate a client and server to initialize this transport, perform SPDM negotiation, and then begin share retrieval. As noted in #407, only the negotiation phase of the SPDM protocol is currently implemented, and so we simply return the TCP-based transport when negotiation completes, and pretend for now that we are operating over a secure channel. This allows us to test out the end-to-end behavior before we have a production-ready SPDM implementation integrated.

This commit also makes a small change to the SPDM transport to provide for timeouts on `send` and `recv` operations, and no longer requires passing a logger to each call of `recv`.
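The "2-byte size header framed transport" described above amounts to length-prefixed messages on a TCP stream. A minimal synchronous sketch is shown below; the real implementation is async and enforces timeouts, and the byte order here is an assumption.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

// Sketch of a 2-byte length-prefixed frame over TCP, per the transport
// described above. Error handling, size limits, and byte order (big-endian
// here) are simplifying assumptions.
fn send_frame(stream: &mut TcpStream, payload: &[u8]) -> std::io::Result<()> {
    let len = u16::try_from(payload.len())
        .map_err(|_| std::io::Error::new(std::io::ErrorKind::InvalidInput, "frame too large"))?;
    stream.write_all(&len.to_be_bytes())?;
    stream.write_all(payload)
}

fn recv_frame(stream: &mut TcpStream) -> std::io::Result<Vec<u8>> {
    let mut header = [0u8; 2];
    stream.read_exact(&mut header)?;
    let mut payload = vec![0u8; u16::from_be_bytes(header) as usize];
    stream.read_exact(&mut payload)?;
    Ok(payload)
}
```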
This PR establishes the following:

- `multicast.rs`: A sender/receiver pair of UDP IPv6 multicast sockets which can be used for address discovery.
- `discovery.rs`: A wrapper around the multicast pair in the form of `PeerMonitor`, which just sends/receives packets non-stop.
- `bootstrap/agent.rs`: Integration of `PeerMonitor` into the bootstrap agent.

As implemented, this PR uses a constant `UNLOCK_THRESHOLD` of `1`, meaning that sleds will not be blocked behind this discovery/unlock protocol.
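Finally, a rough synchronous sketch of how the sender/receiver pair and the peer set fit together. The actual `PeerMonitor` is async and uses the project's socket helpers; the group, port, and announcement payload here are placeholders echoing the snippets quoted in the review.

```rust
use std::collections::HashSet;
use std::net::{Ipv6Addr, SocketAddr, SocketAddrV6, UdpSocket};
use std::time::Duration;

// Illustrative only: announce ourselves on a multicast group and record the
// source address of every peer we hear from. The real PeerMonitor is async
// and shares the peer set behind a lock.
fn run_peer_monitor() -> std::io::Result<()> {
    let group = Ipv6Addr::new(0xff02, 0, 0, 0, 0, 0, 0, 0x1);
    let port = 7645;

    let receiver = UdpSocket::bind(SocketAddrV6::new(Ipv6Addr::UNSPECIFIED, port, 0, 0))?;
    receiver.join_multicast_v6(&group, 0)?;
    receiver.set_read_timeout(Some(Duration::from_secs(5)))?;

    let sender = UdpSocket::bind(SocketAddrV6::new(Ipv6Addr::UNSPECIFIED, 0, 0, 0))?;

    let mut sleds: HashSet<SocketAddr> = HashSet::new();
    let mut buf = [0u8; 128];
    loop {
        // Announce ourselves to the group (placeholder payload)...
        sender.send_to(
            b"sled-agent bootstrap announcement",
            SocketAddrV6::new(group, port, 0, 0),
        )?;
        // ...and record the address of any peer that announces itself.
        if let Ok((_, addr)) = receiver.recv_from(&mut buf) {
            sleds.insert(addr);
        }
    }
}
```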