Key-Value Raft Server

The following is an implementation of a distributed key-value store built with Rust, Raft, and Tokio.

Architecture

There is a client and a server. They communicate using public protobuf requests and responses. There is a single request message type that uses a oneof field to determine which command is desired.

This is done with a framed codec over TCP.
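As an illustration, here is a minimal sketch of what the framing could look like, assuming tokio-util's LengthDelimitedCodec carries the encoded protobuf bytes; the crate and codec choice here are assumptions, not necessarily what the repo actually uses.

```rust
use tokio::net::TcpStream;
use tokio_util::codec::{Framed, LengthDelimitedCodec};

/// Wrap a TCP stream so each frame carries one encoded protobuf request or
/// response; the oneof field inside the message selects the command.
async fn frame_connection(
    addr: &str,
) -> std::io::Result<Framed<TcpStream, LengthDelimitedCodec>> {
    let stream = TcpStream::connect(addr).await?;
    Ok(Framed::new(stream, LengthDelimitedCodec::new()))
}
```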

The client starts with a single IP address to connect to. If there's time, a command could fetch the current peer list so the client could distribute its queries across peers. As is, the client uses Tokio to perform the request and response, but does so in a blocking manner on the main thread.
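A rough sketch of that blocking pattern, assuming the client builds a tokio Runtime and drives each round trip with block_on; everything below is illustrative rather than the repo's actual code.

```rust
use tokio::runtime::Runtime;

fn main() -> std::io::Result<()> {
    // Synchronous from the caller's point of view: the async request/response
    // round trip is driven to completion with block_on on the main thread.
    let rt = Runtime::new()?;
    rt.block_on(async {
        // e.g. connect, send one framed request, await the framed response
        Ok(())
    })
}
```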

The server is composed of a few processes:

  1. client listener
  2. peer listener
  3. database & raft

Client listener

The client listener creates a new connection struct for each incoming connection. A connection struct handles the stream of public proto requests, translating each request into a command and forwarding that command to the database's channel.

A connection contains:

  • Framed stream
  • Framed sink
  • Database handle (a Sender channel wrapped in a newtype)
  • Connection handle (a Receiver channel and Sender channel)

When the client has a request, it will generate a query for the DB and send it on the database handle. The command will contain a connection handle so that the database can reply with a response when it's ready.

The connection handle will deliver replies from the db to the connection. The replies will be mapped to a proper protobuf response and then forwarded to the sink.
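A sketch of those pieces, assuming tokio mpsc channels; the type and field names here are illustrative, not the repo's.

```rust
use tokio::sync::mpsc;

// Illustrative command/reply shapes; the real ones come from the protobufs.
enum Reply {
    Ok,
    Value(Option<String>),
}
enum Command {
    Set { key: String, value: String, reply_to: mpsc::Sender<Reply> },
    Get { key: String, reply_to: mpsc::Sender<Reply> },
}

/// Newtype over the database's command channel (the "database handle").
#[derive(Clone)]
struct DbHandle(mpsc::Sender<Command>);

/// Per-connection state: framed stream/sink plus both ends of the reply channel.
struct Connection {
    // framed: Framed<TcpStream, LengthDelimitedCodec>, // stream + sink
    db: DbHandle,
    reply_tx: mpsc::Sender<Reply>,  // cloned into each Command sent to the db
    replies: mpsc::Receiver<Reply>, // the db's answers arrive here
}
```

On each decoded request, the connection clones reply_tx into the command, sends it through the database handle, and then maps whatever arrives on replies back into a protobuf response for the sink.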

Peer listener

The peer listener listens on a different port and handles messages delivered from other nodes. These don't have replies, so it just forwards them on to the database as raft messages.
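The shape of that loop might look like the following, assuming a tokio TcpListener and an mpsc channel into the database task; decoding is elided, and raft::prelude::Message is the raft-rs message type.

```rust
use tokio::net::TcpListener;
use tokio::sync::mpsc;

async fn peer_listener(
    addr: &str,
    to_db: mpsc::Sender<raft::prelude::Message>,
) -> std::io::Result<()> {
    let listener = TcpListener::bind(addr).await?;
    loop {
        let (stream, _) = listener.accept().await?;
        let to_db = to_db.clone();
        tokio::spawn(async move {
            // Decode framed raft messages from `stream` and forward each one
            // to the database task over `to_db`; no replies are sent back.
            let _ = (stream, to_db);
        });
    }
}
```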

Database & Raft

The main server is a database struct that holds the raft node and other state. The database keeps an in-memory representation of the data (if this OOMs, it dies; I'm not coding around the size of memory). The raft log (the binary data) is composed of protobuf messages. Since only the state of the hash map matters, each normal entry maps to one of the basic edit commands (set and delete) encoded as a protobuf. The snapshot is also a protobuf used to inflate the hash map.
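As a sketch, with Rust enums standing in for the protobuf log-entry and snapshot payloads (the actual message definitions live in the repo's .proto files):

```rust
use std::collections::HashMap;

// Stand-ins for the protobuf messages carried in the raft log.
enum Edit {
    Set { key: String, value: String },
    Delete { key: String },
}

// The snapshot carries the whole key space so a follower can rebuild the map.
struct SnapshotData {
    entries: HashMap<String, String>,
}

fn apply(state: &mut HashMap<String, String>, edit: Edit) {
    match edit {
        Edit::Set { key, value } => {
            state.insert(key, value);
        }
        Edit::Delete { key } => {
            state.remove(&key);
        }
    }
}
```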

Each write command drives a proposal from a connection, pushing a callback onto a lookup table keyed by the command id so the reply can be sent when the command becomes committed. If it's just a read command, we can generate and send the reply immediately.
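A minimal sketch of that lookup table, assuming the reply channel from the connection sketch above; the names are illustrative.

```rust
use std::collections::HashMap;
use tokio::sync::mpsc;

// Illustrative reply type; matches the connection-handle sketch above.
enum Reply {
    Ok,
    Value(Option<String>),
}

/// Write proposals wait here until raft commits the entry carrying their id;
/// read commands bypass the table and are answered immediately.
struct Pending {
    next_id: u64,
    waiting: HashMap<u64, mpsc::Sender<Reply>>,
}

impl Pending {
    /// Register a callback for a write proposal; the returned id is embedded
    /// in the proposed entry so it can be matched up again at commit time.
    fn register(&mut self, reply_to: mpsc::Sender<Reply>) -> u64 {
        self.next_id += 1;
        self.waiting.insert(self.next_id, reply_to);
        self.next_id
    }

    /// When an entry with this id is applied, fire its callback.
    fn complete(&mut self, id: u64, reply: Reply) {
        if let Some(reply_to) = self.waiting.remove(&id) {
            let _ = reply_to.try_send(reply);
        }
    }
}
```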

The read commands could be distributed to other threads if the state of the system were placed behind an RwLock. This is an optimization we can handle later!

When a raft message is received, it is driven into the raft node and then we run through the readiness checks. Timeouts are implemented on the command channel.
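Schematically, the driving loop might look like the following. The method names follow raft-rs's RawNode, but the exact Ready handling and signatures differ between versions of the crate, so treat this as a sketch rather than the repo's implementation.

```rust
use std::time::Duration;
use raft::prelude::*;
use raft::storage::MemStorage;
use tokio::sync::mpsc;
use tokio::time::timeout;

// Commands arriving on the database's channel: peer raft traffic or proposals.
enum DbCommand {
    Raft(Message),
    Propose(Vec<u8>),
}

async fn drive(mut node: RawNode<MemStorage>, mut commands: mpsc::Receiver<DbCommand>) {
    loop {
        // Timeouts live on the command channel: if nothing arrives within the
        // tick interval, advance raft's logical clock instead.
        match timeout(Duration::from_millis(100), commands.recv()).await {
            Ok(Some(DbCommand::Raft(msg))) => {
                let _ = node.step(msg); // peer messages go straight into raft
            }
            Ok(Some(DbCommand::Propose(data))) => {
                let _ = node.propose(vec![], data);
            }
            Ok(None) => break, // all senders dropped; shut down
            Err(_) => {
                node.tick(); // election / heartbeat timeout
            }
        }

        if node.has_ready() {
            let ready = node.ready();
            // Persist entries, send outgoing messages to peers, apply the
            // committed entries to the hash map, then advance.
            node.advance(ready);
        }
    }
}
```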

Storage

While everything above can be handled with in-memory storage, the goal is also to persist changes to disk. To do that, we need to implement raft's storage trait. The data is written out as various protobufs.

The storage module is written so that data is only considered persisted once it is included in a snapshot. This guarantees that the data in a snapshot is exactly what will be restored.

It uses the provided memory storage to manage the entries rather than rebuilding it, but wraps it in a layer that writes to disk. It's the snapshots that actually persist, and they block the main execution loop.
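A sketch of that wrapper is below; the raft Storage trait methods, which would simply delegate to the inner MemStorage, are omitted because their signatures vary between raft-rs versions, and the file layout shown is an assumption.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

use raft::storage::MemStorage;

/// Wraps the crate-provided MemStorage: entries live in memory, and only a
/// snapshot written to disk is treated as durably persisted.
struct DiskBackedStorage {
    mem: MemStorage,
    snapshot_path: PathBuf,
}

impl DiskBackedStorage {
    /// Blockingly write the serialized snapshot (protobuf bytes in the real
    /// implementation) to disk; this runs on the main execution loop.
    fn persist_snapshot(&self, bytes: &[u8]) -> io::Result<()> {
        let tmp = self.snapshot_path.with_extension("tmp");
        fs::write(&tmp, bytes)?;
        fs::rename(&tmp, &self.snapshot_path)
    }

    /// On startup, restore the hash map from the last persisted snapshot.
    fn load_snapshot(&self) -> io::Result<Option<Vec<u8>>> {
        match fs::read(&self.snapshot_path) {
            Ok(bytes) => Ok(Some(bytes)),
            Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(None),
            Err(e) => Err(e),
        }
    }
}
```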

How to use

Clone the repo and build it to start. Then use the docker-compose file to bring up the service:

docker-compose up --build

This will start 3 servers that are not connected at first. You can get them to connect by joining them in order. They need to agree on a leader, and that can take some restarts to get right: 2 nodes can end up going back and forth on the election.

To simplify this, there's a reset.sh script. Run it from the CLI to build the docker cluster and stand it up.

The trick appears to be to stand up one node and let it elect itself. This allows you to propose conf changes such as adding a node. Add the node and it will try to start an election. However, if both nodes think they are the leader, we get into a cycle that requires an interruption so that one wins; that's what the restart is for.

All commands can be issued using the client binary, including set, delete, scan, add_node, remove_node, info, and ping.

These are mapped to the CLI as well, allowing you to interact from the command line. info is very useful when connecting new machines because it tells you the status of whichever node you ask.

When a node dies, it can rejoin and "catch up" through the logs. However, if it's very far behind, it can be faster to remove it completely, kill it, and add it back as a new node. This lets it catch up via a snapshot from another node instead.
