Reliable state machines (concept)

Frederik Walk edited this page Feb 12, 2014 · 7 revisions

Step 1: basic reliable, ordered datagrams

The current Hexabus system suffers from a relatively high degree of packet loss, and as such, the distributed state machine running on devices can easily enter inconsistent states. Furthermore, packet loss from sporadic sensors (such as buttons used for lighting control) is a big problem - losing a "turn on the lights" packet on the way to any actor will not satisfy anybody. A simple state machine for lighting control might not be able to turn of all lights afterwards without resetting the entire distributed state machine, which is also not a good solution. Thus, a multicast packet that can trigger state changes anywhere in the system must be received by all nodes, or by none at all.

Ordering of packets is also important: a multicast packet that trigger state changes must be received by all nodes in the system at approximately the same time, and reordering of such packets must be avoided at all costs. A simple retransmit protocol cannot guarantee consistent ordering of packets for all devices, which can also lead to inconsistent states.

To guarantee reliable transmission, we use a retransmit protocol. To guarantee ordering of multicast packets that may alter the system state, we allow only one node in the system to send such packets. In such a system, individual nodes or groups of nodes may fail (by power failure, temporary loss of network connectivity, or other modes), which must be accounted for. The protocol described here borrows many ideas from Gbcast, but is much simpler. For our purposes, two-phase commit is adequate, and nodes sending state-changing packets need not be notified of the result of their actions.

Changes to the packet format

To allow for retransmission and acknowledgement of packets and for detection of retransmission, a sequence number field is added to the general hexabus packet header. To allow for ordering of packets, this sequence number must be monotonic for any given tuple of (source IP, source port, destination IP, destination port).

The basic hexabus packet header after this addition then looks like this:

-----------------------------------------
| 0..3 | 4    | 5     | 6..7            |
|------|------|-------|-----------------|
| HX0D | type | flags | sequence number |
-----------------------------------------

The HX0D prefix field, the type field and the flags field retain their original meaning. While the original packet format did not define any flags, the new format defines one flag:

WANT_ACK (0x01): sending node requires acknowledgement from recipients in any case
WANT_UL_ACK (0x02): upper layer decides if there is an ACK to be sent, e.g. if the packet is relevant for the state machine.

The packet footer will be removed, since it only contains a 2-byte CRC, which will also be removed.

The other packet types are described by their content and value of the type flag in the packet header, omitting the rest of the header and the footer.

ERROR (0x00)

Additionally to the error code as present now, the ERROR packet also contains the sequence number of the packet that caused the error on the device:

--------------------------------------
| 0          | 1..2                  |
|------------|-----------------------|
| error code | cause sequence number |
--------------------------------------

When sent to a node in response to a unicast packet that caused an error on the recipient node, the cause sequence number field is filled with the sequence number of the packet that caused the error. If a node spontaneously reports an error, the error code must indicate that the cause sequence number field is not valid, and the cause sequence number field must be set to 0.

QUERY (0x02), EPQUERY (0x0A)

These two packets retain their original shape:

--------
| 0..3 |
|------|
| eid  |
--------

INFO (0x01), WRITE (0x04), EPINFO (0x09)

Also retain their original shape:

--------------------------
| 0..3 | 4        | 5..n |
|------|----------|------|
| eid  | datatype | data |
--------------------------

The INFO and EPINFO packets must not be sent in response to QUERY/EPQUERY packets, since this would discard ordering information. Two new packet types are introduced for replies to QUERY and EPQUERY packets.

REPORT (0x03), EPREPORT (0x0B)

Newly introduced to allow ordering of QUERY/EPQUERY queries and responses. These packets contain all information contained in INFO/EPINFO packets, and additionally contain the sequence number of the query packet that caused these packets to be sent.

--------------------------------------
| 0..n       | n+1..n+2              |
|------------|-----------------------|
| <see INFO> | cause sequence number |
--------------------------------------

PINFO (0x11)

This packet is used to multicast information that may change the state of the distributed state machine. Additionally to the contents of a normal INFO packet, it also contains the IP address of the node that caused the PINFO packet to be sent. See the description of the retransmission and ordering protocol below for more information.

--------------------------------------
| 0..n       | n+1..n+16             |
|------------|-----------------------|
| <see INFO> | originator IP address |
--------------------------------------

ACK (0x10)

This new packet is introduced to allow acknowledgements of packets for which no implicit acknowledgement is possible. For WRITE, QUERY and EPQUERY packets an implicit acknowledgement can be sent - reception of the corresponding report packets is an implicit acknowledgement of the write or query command in question. For INFO, EPINFO and PINFO packets, no such thing is possible, thus the need for an explicit ACK packet.

The ACK packet contains only the sequence number of the packet that required acknowledgement:

-------------------------
| 0..1                  |
|-----------------------|
| cause sequence number |
-------------------------

Retransmission and ordering protocol

Most operations in a hexabus network that are not related to the operation of the state machine have much less stringent reliability requirements than operations pertaining to the state machine. For queries and writes to single devices, the requesting node may simply retry the operation if an acknowledgement was requested but not received. Writes should never be done to more than one device at once and not at all by the user, and queries to many nodes at once will be implicitly acknowledged by answers to the queries.

Only INFO packets that may change the state of the distributed state machine must be delivered to all devices in the network reliably and in some specific order that is the same for every device in the network. To ensure this, packets that may change the state of the distributed state machine may no longer be sent by any node in the system, but only by one node: the master. Hexabus device continue to send INFO packets as they do now, but other device must not change their state upon reception of INFO packets. A new packet type, PINFO is introduced for information that may change state, and devices must act upon PINFO packets only if they were sent by the master.

It is then the responsibility of the master node to ensure that the distributed state machine is in a consistent state, that packets affecting the state machine are totally ordered, and that all devices have either acknowledged a packet affecting the state machine or have failed in some manner.

Protocol state machine on the device

To each packet sent to a destination, the device assigns a monotonically increasing sequence number. If the packet to be sent requires acknowledgement, the device waits for this acknowledgement and retransmits the packet as appropriate, until an ACK has been received or the maximum number of retransmission is exceeded, in which case the device considers itself failed. While one packet is being transmitted reliably, no other packet with reliability requirements may be transmitted.

If a device considers itself failed and receives any packet from the master, it will initiate recovery procedures. Until recovery is complete, the state machine is stopped. If it considers itself failed, it will not dequeue and send any packets, and no packets may be enqueued. Successful recovery includes a flush of the entire send and receive queues of the device.

Since packets without reliability constraints may be sent at any time, protocol state machines for the reliability mechanism need not concern themselves with such packets: instead of queueing them for transmission, just send them directly. Pictured below are the protocol state machines for one device and one remote endpoint. Since devices will generally communicate with only the base station, for most of the operational lifetime of a device only one state machine of each type will run. Send and receive state machines run in parallel and communicate via shared variables.

Protocol state machine (send) Protocol state machine (receive)

enqueue_recv is used to trigger appropriate handling for the received packet. If the sender of a packet requests an acknowledgement from the device and implicit ACKs are possible (such as sending an REPORT packet in response to a QUERY packet), enqueue_recv must at some point in the future send a packet that constitutes an implicit acknowledgement.

Note that the sending state machine specifies a recovery procedure. In the simplest case, recovery will consist of only a global, reliable reset of the distributed state machine. More advanced procedures may include state machine action replay on the master, which will allow recovery of failed nodes by giving them their state explicitly after a failure event.

Protocol state machine on the master

On the master, the same state machines run in parallel for each device known to the state machine. Since the master will often send packets via multicast and, for a given packet, all state machine timeouts that do expire will expire at the same time, the master may send a multicast packet only once instead of once for each known destination. Additionally, when the master receives a reset command for the distributed state machine, all send/receive state machines are reset, all queues are cleared and reinitialized with the reset command packet for the distributed state machine.

Example communications

Assume a system with a master M, a sensor S and two actors, A and B. The full specification of the distributed state machine running on the devices is not important for the discussion of the reliability mechanism, we need only know that S may send INFO packets for an endpoint e that may trigger changes in the actors.

In the following diagrams, packets will be denoted as type (seq#, rest). Hence, INFO packets will be INFO (seq#, <endpoint>), ACK packets will be ACK(seq#, cause#) and PINFO packets will be PINFO(seq#, senderIP, <endpoint>).

The simplest case of course does not experience packet loss:

M                         S                         A                         B
|   INFO(1, e)            |                         |                         |
|<------------------------|------------------------>|------------------------>| (A, B ignore INFO)
|   ACK(42, 1)            |                         |                         |
|------------------------>|                         |                         |
|   PINFO(43, S, e)       |                         |                         |
|------------------------>|------------------------>|------------------------>| (A, B act)
|                         |    ACK(43)              |                         |
|<------------------------+-------------------------|                         |
|                         |                         |      ACK(43)            |
|<------------------------+-------------------------+-------------------------|
|                         |                         |                         |

If packet loss on the path from S to M occurs, S will retransmit the packet packet until an ACK is received or S must assume failure. If the INFO packet is lost, a possible communication will look much like this:

M                         S                         A                         B
|   INFO(1, e)            |                         |                         |
|<--//////////------------|------------------------>|------------------------>| (A, B ignore INFO)
|                         |                         |                         |
Z   (timeout occurs)      Z                         Z                         Z
|                         |                         |                         |
|   INFO(1, e)            |                         |                         |
|<------------------------|------------------------>|------------------------>| (A, B ignore INFO)
|   ACK(42, 1)            |                         |                         |
|------------------------>|                         |                         |
|   PINFO(43, S, e)       |                         |                         |
|------------------------>|------------------------>|------------------------>| (A, B act)
|                         |    ACK(43)              |                         |
|<------------------------+-------------------------|                         |
|                         |                         |      ACK(43)            |
|<------------------------+-------------------------+-------------------------|
|                         |                         |                         |

A lost ACK packet from the master to sensor S will behave much the same: when the master detects a retransmittd INFO packet, it will send another ACK, but no new PINFO packets:

M                         S                         A                         B
|   INFO(1, e)            |                         |                         |
|<------------------------|------------------------>|------------------------>| (A, B ignore INFO)
|   ACK(42, 1)            |                         |                         |
|---//////////----------->|                         |                         |
|   PINFO(43, S, e)       |                         |                         |
|------------------------>|------------------------>|------------------------>| (A, B act)
|                         |    ACK(43)              |                         |
|<------------------------+-------------------------|                         |
|                         |                         |      ACK(43)            |
|<------------------------+-------------------------+-------------------------|
|                         |                         |                         |
Z   (timeout occurs)      Z                         Z                         Z
|                         |                         |                         |
|   INFO(1, e)            |                         |                         |
|<------------------------|------------------------>|------------------------>| (A, B ignore INFO)
|   ACK(42, 1)            |                         |                         |
|------------------------>|                         |                         |
|                         |                         |                         |

Now let a PINFO packet be lost on its ways to B, and let the sensor S send another value before B has acknowledges reception of the first packet:

M                         S                         A                         B
|   INFO(1, e1)           |                         |                         |
|<------------------------|------------------------>|------------------------>| (A, B ignore INFO)
|   ACK(42, 1)            |                         |                         |
|------------------------>|                         |                         |
|   PINFO(43, S, e1)      |                         |                         |
|------------------------>|------------------------>|-----/////////---------->| (A acts, PINFO to B lost)
|                         |    ACK(43)              |                         |
|<------------------------+-------------------------|                         |
|                         |                         |                         |
|   INFO(2, e2)           |                         |                         |
|<------------------------|------------------------>|------------------------>| (A, B ignore INFO)
|                         |                         |                         |
Z   (timeout occurs)      Z                         Z                         Z
|                         |                         |                         |
|   PINFO(43, S, e1)      |                         |                         |
|------------------------>|------------------------>|------------------------>| (B acts, A detects retransmission)
|                         |    ACK(43)              |                         |
|<------------------------+-------------------------|                         |
|                         |                         |      ACK(43)            |
|<------------------------+-------------------------+-------------------------|
|                         |                         |                         |
|   PINFO(44, S, e2)      |                         |                         |
|------------------------>|------------------------>|------------------------>| (A, B act)
|                         |    ACK(44)              |                         |
|<------------------------+-------------------------|                         |
|                         |                         |      ACK(44)            |
|<------------------------+-------------------------+-------------------------|
|                         |                         |                         |

Step n+1: but wait, there's more!

Additionally to the basic reliability mechanism, a number of advanced features that require the reliability mechanism should be implemented.

Device groups for the state machine

Instead of repeating the IP address of the originating device in PINFO packets, assign to each device one or more unique group IDs. When generating PINFO packets from INFO packets, the master will generate one PINFO packet for each group ID assigned to the originating device. All of these PINFO packets must then be sent reliably to the devices in the system.

Allow independent packets to be sent concurrently

Since the master can know the entire state machine and the current state of every device, the master can determine when it is safe to send more than one packet with strong reliability requirements. For example, two completely indepent instances of the state machine "staircase light" might be running in one installation. The master will then know that two packets can be sent concurrently if one is only valid for the first staircase light, and the second is only valid for the other. We need not introduce artificial delays by totally ordering these two packets when partially ordering them based on which state machines they affect is sufficient.

Smart recovery mechanism

The recovery mechanism as implemented the current hexabus system (as of 18. Nov. 2013) and as sketched in Step 1 is very brutal. Instead, let the master silently execute the state machine of the entire system and, whenenver a falure is detected and subsequently fixed, recover the failed device. Since the master knows in which state the failed device should be and knows the values of each endpoint modified by the state machine, the master can then WRITE the values to the device and tell it to restart its state machine at the state it should be in.

Upgrade path for the master process

Allow the master process to be updated without losing state or packets. This will require some form of saving the current state of the master process, executing the new binary and reading the state of the old instance without dropping packets. systemd can do it, we should be able to do that too.

Failsafe master

If the master crashes, the distributed state machine should not be required to reset in its entirety. This is closely related to the previous item: if the master were to save it's state to some location that can outlive the master process itself, a new copy of the process can restart where the old left off with minimal interruption.

Clone this wiki locally
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.
Press h to open a hovercard with more details.