Skip to content

Algorithm

HagarMeir edited this page Sep 25, 2019 · 35 revisions

The normal execution path is primarily derived from the PBFT paper. As in the paper, we assume that the system is asynchronous and distributed, nodes are connected by a network. The network may fail to deliver messages, delay them, duplicate them, or deliver them out of order, however we assume authenticated communication channels (such as usage of TLS). We use a Byzantine failure model, i.e., faulty nodes may behave arbitrarily.

The algorithm is used to implement a deterministic state replication service. Clients issue requests to the service and wait for a reply. The service is implemented by n replicas (nodes). The algorithm provides both safety and liveness assuming no more than replicas are faulty.

The nodes move through a succession of configurations called views. In a view one node is the leader and the others are followers. Views are numbered consecutively. The leader of a view is node i such that , where v is the view number. View changes are carried out when it appears that the leader has failed.

The next figure is taken from the PBFT paper and it shows the operation of the algorithm in the normal case. Replica 0 is the leader, replica 3 is faulty, and C is the client.

The client sends its request to all nodes (instead of sending only to the leader to help prevent censorship). As opposed to PBFT, the client does not need to be aware of the leader for submitting requests. Clients' requests are batched together and the leader starts a three-phase protocol. The three phases are pre-prepare, prepare, and commit. The pre-prepare and prepare phases are used to totally order requests sent in the same view even when the leader, which proposes the ordering of requests, is faulty. The prepare and commit phases are used to ensure that requests that commit are totally ordered across views.

In the pre-prepare phase, the leader batches requests to create a proposal, and it then assigns a sequence number to the proposal. Next the leader sends a pre-prepare message to all followers containing the proposal, its sequence number and the current view number v.

A follower accepts the pre-prepare message if the proposal is valid, it is in view v, it is expecting this sequence number, and it has not accepted a different pre-prepare message for this view and sequence number. Once a follower accepts a pre-prepare message it enters the prepare phase by broadcasting a prepare message. The prepare message contains a digest of the accepted proposal, the sequence number, and v.

Next each node waits for a quorum of prepare messages with the same digest, view v, and sequence number. The quorum size should be big enough to ensure that in the intersection of any two quorums there is at least one correct node. So for example, if n=3f+1 then the optimal quorum size is 2f+1.

After the node is "prepared" it sends a commit message with the same content as the prepare message, and a signature piggybacked to it. The node collects a quorum of commit messages and their attached signatures. These signatures are later used as proof for the client upon delivery and for checkpointing. Upon receiving a quorum of validated commit messages the node delivers its response to the client.

The view-change protocol provides liveness by allowing the system to make progress when the leader fails. The algorithm is described in the view-change page.