
Dev Guide for Cluster Sharding/Migration #364

Merged
merged 15 commits
May 22, 2024
131 changes: 131 additions & 0 deletions website/docs/dev/cluster/migration.md
@@ -0,0 +1,131 @@
---
id: slot-migration
sidebar_label: Migration
title: Migration Overview
---

# Slot Migration Overview

In Garnet, slot migration describes the process of reassigning slot ownership and transferring the associated key-value pairs from one primary node to another within a fully operational cluster.
This operation allows for efficient resource utilization and load balancing across the cluster when adding or removing nodes.
The migration operation is only available in cluster mode.
Slot migration can be initiated by the owner of a given slot (referred to as the *source* node) and addressed to another already known and trusted primary node (referred to as the *target* node).
Actual data migration is initiated using the ```MIGRATE``` command, which operates in two modes: (1) migrate individual keys, or (2) migrate entire slots or ranges of slots.
This page focuses on the slot migration implementation details.
For more information about using the associated command refer to the slot migration [user guide](../../cluster/key-migration).

# Implementation Details

The implementation of the migration operation is separated into two components:

1. Command parsing and validation component.
2. Migration session operation and management component.

The first component is responsible for parsing and validating the migration command arguments.
Validation involves the following:

1. Parse every migration option and validate the arguments associated with that option.
2. Validate ownership of the associated slot that is about to be migrated.
3. For the ```KEYS``` option, validate that the keys to be migrated do not hash across different slots (see the sketch after this list).
4. For the ```SLOTS/SLOTSRANGE``` option, validate that no slot is referenced multiple times.
5. For the ```SLOTSRANGE``` option, make sure every range is defined by a pair of values.
6. Validate that the target of the migration is known, trusted, and has the role of a primary.
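
To illustrate the same-slot check performed for the ```KEYS``` option, the sketch below computes hash slots in the Redis-cluster-compatible way (CRC16 over the key, or over its {hash tag} if one is present, modulo 16384) and verifies that every key maps to the same slot. The helper names are illustrative assumptions, not Garnet's actual API.

```csharp
using System;
using System.Collections.Generic;

static class SlotValidationSketch
{
    // CRC16-CCITT (XModem), the polynomial used for cluster slot hashing.
    static ushort Crc16(ReadOnlySpan<byte> data)
    {
        ushort crc = 0;
        foreach (var b in data)
        {
            crc ^= (ushort)(b << 8);
            for (int i = 0; i < 8; i++)
                crc = (crc & 0x8000) != 0 ? (ushort)((crc << 1) ^ 0x1021) : (ushort)(crc << 1);
        }
        return crc;
    }

    // Map a key to one of the 16384 slots, honoring {hash tags}.
    public static int HashSlot(byte[] key)
    {
        int start = Array.IndexOf(key, (byte)'{');
        if (start >= 0)
        {
            int end = Array.IndexOf(key, (byte)'}', start + 1);
            if (end > start + 1) // non-empty tag: hash only the tag contents
                return Crc16(key.AsSpan(start + 1, end - start - 1)) & 16383;
        }
        return Crc16(key) & 16383;
    }

    // MIGRATE ... KEYS validation: every provided key must hash to the same slot.
    public static bool KeysMapToSingleSlot(IReadOnlyList<byte[]> keys, out int slot)
    {
        slot = HashSlot(keys[0]);
        foreach (var key in keys)
            if (HashSlot(key) != slot)
                return false;
        return true;
    }
}
```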

When parsing completes successfully, a *migration* session is created and executed.
Depending on the chosen option, the *migration* session executes either as a foreground task (```KEYS``` option) or a background task (```SLOTS/SLOTSRANGE``` option).

The second component is separated into the following sub-components:

1. ```MigrationManager```
2. ```MigrateSessionTaskStore```
3. ```MigrateSession```

The ```MigrationManager``` is responsible for managing the active ```MigrateSession``` tasks.
It uses the ```MigrateSessionTaskStore``` to atomically add or remove new instances of ```MigrateSession```.
It is also responsible for ensuring that a session about to be added does not conflict with any existing session, by checking that the slots referenced by the sessions do not overlap.
Finally, it provides information on the number of ongoing migrate tasks.
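
A minimal sketch of this conflict check is shown below; the type and member names are illustrative assumptions rather than Garnet's exact API.

```csharp
using System.Collections.Generic;
using System.Linq;

// Illustrative session: only the slot set matters for the conflict check.
class MigrateSessionSketch
{
    public HashSet<int> Slots { get; init; } = new();
}

class MigrateSessionTaskStoreSketch
{
    readonly object _lock = new();
    readonly List<MigrateSessionSketch> _active = new();

    // Atomically add a session, but only if none of its slots are already
    // being migrated by another active session.
    public bool TryAdd(MigrateSessionSketch session)
    {
        lock (_lock)
        {
            if (_active.Any(s => s.Slots.Overlaps(session.Slots)))
                return false;
            _active.Add(session);
            return true;
        }
    }

    public bool TryRemove(MigrateSessionSketch session)
    {
        lock (_lock) { return _active.Remove(session); }
    }

    // Number of ongoing migrate tasks, as reported by the MigrationManager.
    public int Count { get { lock (_lock) return _active.Count; } }
}
```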

## Migration Data Integrity

Since slot migration can be initiated while a Garnet cluster is operational, it needs to be carefully orchestrated to avoid any data integrity issues when clients attempt to write new data.
In addition, whatever solution is put forth should not hinder data availability while migration is in progress.

Our implementation leverages the concept of slot-based sharding to ensure that keys mapping to the migrating slot cannot be modified while a migrate task is active.
This is achieved by setting the slot state to ***MIGRATING*** in the source node.
This prevents any write requests, though it still allows reads.
Write requests can be issued towards the target node using ***ASKING***, though Garnet provides no consistency guarantees if this option is used.
Garnet guarantees that no keys can be lost during regular migration (i.e., without using ***ASKING***).

Because Garnet operates in a multi-threaded environment, the transition from ```STABLE``` to ```MIGRATING``` needs to happen safely, so every thread has a chance to observe that state change.
This can happen using epoch protection.
Specifically, when a slot transitions to a new state, the segment that implements the actual state transition will have to spin-wait after making the change and return only after all active client sessions have moved to the next epoch.

An excerpt of the code related to epoch protection during slot state transition is shown below.
Every client session, before processing an incoming packet, will first acquire the current epoch.
This also happens for the client session that issues a slot state transition command (i.e. CLUSTER SETSLOT).
Once the local configuration has been successfully updated, the processing thread of that command performs the following steps:

1. release the current epoch (in order to avoid waiting for itself to transition).
2. spin-wait until all threads transition to the next epoch.
3. re-acquire the next epoch.

This series of steps ensures that, on return, the processing thread guarantees to the command issuer (or the method caller, if the state transition happens internally) that the state transition is visible to all threads that were active at the time.
Therefore, any active client session will process subsequent commands considering the new slot state.

```csharp reference title="Wait for Config"
https://github.com/microsoft/garnet/blob/951cf82c120d4069e940e832db03bfa018c688ea/libs/cluster/Server/ClusterProvider.cs#L271-L296
```
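
The same pattern can be summarized in a simplified sketch; the ```IEpochSketch``` interface and its members are hypothetical stand-ins for Garnet's epoch-protection primitives used in the excerpt above.

```csharp
// Hypothetical epoch abstraction used only for illustration.
interface IEpochSketch
{
    long CurrentEpoch { get; }
    void Bump();                      // advance the global epoch
    void Release();                   // this session stops protecting the old epoch
    void Acquire();                   // this session protects the current epoch
    bool AllSessionsCaughtUp(long e); // have all active sessions observed epoch e?
}

static class SlotStateTransitionSketch
{
    // Called after the local config has been updated (e.g., slot -> MIGRATING).
    public static void WaitForConfigTransition(IEpochSketch epoch)
    {
        epoch.Bump();
        var target = epoch.CurrentEpoch;

        // 1. Release the epoch held by this session, so we do not wait on ourselves.
        epoch.Release();

        // 2. Spin-wait until every active client session has moved to the new epoch
        //    and can therefore observe the new slot state.
        while (!epoch.AllSessionsCaughtUp(target))
            System.Threading.Thread.Yield();

        // 3. Re-acquire the (new) current epoch before resuming normal processing.
        epoch.Acquire();
    }
}
```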

During migration, the change in slot state (i.e., ```MIGRATING```) is transient from the perspective of the source node.
This means that until migration completes the slot is still owned by the source node.
However, because it is necessary to produce ```-ASK``` redirect messages, the *_workerId* variable is set to the *workerId* of the target node, without bumping the current local epoch (to avoid propagating this transient update to the whole cluster).
Therefore, depending on the context of the operation being executed, the actual owner of the slot can be determined through the *workerId* property, while the target node for migration is determined through the *_workerId* variable.
For example, ```CLUSTER NODES``` uses the *workerId* property (through GetSlotRange(*workerId*)) since it has to return the actual owner of the slot even during migration.
At the same time, it needs to return all slots that are in ```MIGRATING``` or ```IMPORTING``` state, along with the node-id associated with that state, which can be done by inspecting the *_workerId* variable (through GetSpecialStates(*workerId*)).

```csharp reference title="CLUSTER NODES Implementation"
https://github.com/microsoft/garnet/blob/951cf82c120d4069e940e832db03bfa018c688ea/libs/cluster/Server/ClusterConfig.cs#L499-L521
```
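
The distinction between the *workerId* property and the *_workerId* variable can be illustrated with a simplified slot entry; the exact layout of Garnet's ```HashSlot``` differs, so treat the names below as assumptions derived from the description above.

```csharp
enum SlotStateSketch : byte { OFFLINE, STABLE, MIGRATING, IMPORTING }

// Simplified slot entry. During MIGRATING, _workerId points to the migration
// target, while the workerId property keeps reporting the current owner
// (the local node, i.e., offset 1 in the local workers array).
struct HashSlotSketch
{
    public SlotStateSketch _state;
    public ushort _workerId;

    // Actual owner from the local node's perspective: while migrating, the
    // source node still owns the slot, so report the local worker (offset 1).
    public ushort workerId => _state == SlotStateSketch.MIGRATING ? (ushort)1 : _workerId;
}
```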

## Migrate KEYS Implementation Details

Using the ```KEYS``` option, Garnet will iterate through the provided list of keys and migrate them in batches to the target node.
When using this option, the issuer of the migration command has to make sure that the slot state is set appropriately in the source and target nodes.
In addition, the issuer has to provide all keys that map to a specific slot, either in one call to ```MIGRATE``` or across multiple calls, before the migration completes.
When all key-value pairs have migrated to the target node, the issuer has to reset the slot state and assign ownership of the slot to the new node.

```MigrateKeys``` is the main driver for the migration operation using the ```KEYS``` option.
This method iterates over the list of provided keys and sends them over to the target node.
This occurs in the following two phases:

1. Find all keys provided by the ```MIGRATE``` command in the main store and send them over to the *target* node.
2. For all remaining keys not found in the main store, look them up in the object store and, if found, send them over to the *target* node.

It is possible that a given key cannot be retrieved from either store because it might have expired.
In that case, execution proceeds to the next available key and no specific error is raised.
When data transmission completes, and depending on whether the ```COPY``` option is enabled, ```MigrateKeys``` deletes the keys from both stores.

```csharp reference title="MigrateKeys Main Method"
https://github.com/microsoft/garnet/blob/951cf82c120d4069e940e832db03bfa018c688ea/libs/cluster/Server/Migration/MigrateSessionKeys.cs#L146-L169
```
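
The two-phase lookup can be sketched as follows; the store and network helpers are hypothetical placeholders for the calls made by the real ```MigrateKeys``` implementation linked above.

```csharp
using System.Collections.Generic;

// Illustrative sketch of the MigrateKeys flow described above.
abstract class MigrateKeysSketch
{
    public bool MigrateKeys(IReadOnlyList<byte[]> keys, bool copyOption)
    {
        var notInMainStore = new List<byte[]>();

        // Phase 1: look up every requested key in the main store and ship the hits.
        foreach (var key in keys)
        {
            if (TryGetFromMainStore(key, out var value))
                EnqueueForTarget(key, value);
            else
                notInMainStore.Add(key);
        }
        if (!FlushBatchToTarget()) return false;

        // Phase 2: look up the remaining keys in the object store and ship the hits.
        foreach (var key in notInMainStore)
        {
            if (TryGetFromObjectStore(key, out var obj))
                EnqueueForTarget(key, obj);
            // else: key expired or was deleted in the meantime; move on, no error.
        }
        if (!FlushBatchToTarget()) return false;

        // Unless COPY was requested, remove the migrated keys from both stores.
        if (!copyOption) DeleteKeysFromBothStores(keys);
        return true;
    }

    // Hypothetical helpers standing in for the real store and network calls.
    protected abstract bool TryGetFromMainStore(byte[] key, out byte[] value);
    protected abstract bool TryGetFromObjectStore(byte[] key, out object obj);
    protected abstract void EnqueueForTarget(byte[] key, object value);
    protected abstract bool FlushBatchToTarget();
    protected abstract void DeleteKeysFromBothStores(IReadOnlyList<byte[]> keys);
}
```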

## Migrate SLOTS Details

The ```SLOTS``` or ```SLOTSRANGE``` option enables Garnet to migrate a collection of slots and all the associated keys mapping to those slots.
These options differ from the ```KEYS``` option in the following ways:

1. There is no need to have specific knowledge of the individual keys being migrated and how they map to the associated slots. The user needs to provide only the slot numbers.
2. State transitions are handled entirely on the server side.
3. For the migration operation to complete, we have to scan both main and object stores to find and migrate all keys associated with a given slot.

It might seem, based on the last point above, that the migration operation using ```SLOTS``` or ```SLOTSRANGE``` is more expensive, especially if the slot being migrated contains only a few keys.
However, it is generally less expensive than the ```KEYS``` option, which requires multiple roundtrips between client and server (so that the relevant keys can be provided as input to the ```MIGRATE``` command) in addition to having to perform a full scan of both stores.

As shown in the code excerpt below, the ```MIGRATE SLOTS``` task will safely transition the state of a slot in the remote node config to ```IMPORTING``` and the slot state of the local node config to ```MIGRATING```, by relying on the epoch protection mechanism as described previously.
Next, it starts migrating the data to the target node in batches.
On completion of data migration, the task concludes by performing the next slot state transition, in which ownership of the slots being migrated is handed to the target node.
The slot ownership exchange becomes visible to the whole cluster by bumping the local node's configuration epoch (i.e., using the RelinquishOwnership command).
Finally, the source node issues ```CLUSTER SETSLOT NODE``` to the target node to explicitly make it the owner of the corresponding slot collection.
This last step is not strictly necessary; it is used only to speed up config propagation.

```csharp reference title="MigrateSlots Background Task"
https://github.com/microsoft/garnet/blob/951cf82c120d4069e940e832db03bfa018c688ea/libs/cluster/Server/Migration/MigrationDriver.cs#L54-L126
```
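
At a high level, the background task follows the sequence sketched below; the helper methods are hypothetical summaries of the steps performed by the real driver linked above.

```csharp
using System.Collections.Generic;

// High-level sketch of the SLOTS/SLOTSRANGE background task (hypothetical helpers).
abstract class MigrateSlotsDriverSketch
{
    public bool Run(HashSet<int> slots)
    {
        // 1. Ask the target node to mark the slots IMPORTING, then mark them
        //    MIGRATING locally, waiting on the epoch-protected transition.
        if (!SetRemoteSlotState(slots, "IMPORTING")) return false;
        if (!SetLocalSlotStateAndWait(slots, "MIGRATING")) return false;

        // 2. Scan both stores and stream matching key-value pairs in batches.
        if (!MigrateStoreData(slots, mainStore: true)) return false;
        if (!MigrateStoreData(slots, mainStore: false)) return false;

        // 3. Hand ownership to the target and make the change cluster-visible
        //    by bumping the local configuration epoch.
        if (!RelinquishOwnership(slots)) return false;

        // 4. Tell the target to claim the slots explicitly (CLUSTER SETSLOT NODE),
        //    which only speeds up config propagation.
        return NotifyTargetSetSlotNode(slots);
    }

    protected abstract bool SetRemoteSlotState(HashSet<int> slots, string state);
    protected abstract bool SetLocalSlotStateAndWait(HashSet<int> slots, string state);
    protected abstract bool MigrateStoreData(HashSet<int> slots, bool mainStore);
    protected abstract bool RelinquishOwnership(HashSet<int> slots);
    protected abstract bool NotifyTargetSetSlotNode(HashSet<int> slots);
}
```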
21 changes: 21 additions & 0 deletions website/docs/dev/cluster/overview.md
@@ -0,0 +1,21 @@
---
id: overview
sidebar_label: Cluster Overview
title: Cluster Overview
---

# Cluster Implementation Overview

The cluster is designed and implemented as an independent plugin component within the standalone Garnet server.
It consists of the following two subcomponents:

1. Cluster session (```IClusterSession```) subcomponent
This component implements the RESP parsing logic for the relevant cluster commands and the associated primitives
which are used to perform slot verification checks and generate the appropriate redirection messages (e.g., ```-MOVED```, ```-ASK```).

2. Cluster provider (```IClusterProvider```) subcomponent
This component implements the core cluster functionality and features such as gossip, key migration, replication, and sharding.

The decision to partition the cluster into distinct components which are separate from the standalone Garnet server serves a dual purpose.
First, it allows for an initial implementation supporting essential cluster features while also enabling developers to contribute new functionality as needed.
Second, developers may opt in to using their own cluster implementation if they find it necessary or if the default implementation does not fit their needs.
96 changes: 96 additions & 0 deletions website/docs/dev/cluster/sharding.md
@@ -0,0 +1,96 @@
---
id: sharding
sidebar_label: Sharding
title: Sharding Overview
---

# Sharding Overview

## Cluster Configuration

Every running cluster instance maintains a local copy of the cluster configuration.
This copy maintains information about the known cluster workers and the related slot assignment.
Both pieces of information are represented using an array of structs as shown below.
Changes to the local copy are communicated to the rest of the cluster nodes through gossiping.

Note that information related to the node characteristics can be updated only by the node itself by issuing the relevant cluster commands.
For example, a node cannot become a **REPLICA** by receiving a gossip message.
It can change its current role only after receiving a ```CLUSTER REPLICATE``` request.
We follow this constraint to avoid having to deal with cluster misconfiguration in the event of network partitions.
This convention also extends to slot assignment, which is managed through direct requests to cluster instances made using the ```CLUSTER [ADDSLOTS|DELSLOTS]``` and ```CLUSTER [ADDSLOTSRANGE|DELSLOTSRANGE]``` commands.

```csharp reference title="Hashlot & Worker Array Declaration"
https://github.com/microsoft/garnet/blob/8856dc3990fb0863141cb902bbf64c13202d5f85/libs/cluster/Server/ClusterConfig.cs#L16-L42
```

Initially, cluster nodes are empty: they take the role of a **PRIMARY**, have no assigned slots, and have no knowledge of any other node in the cluster.
The local node contains information only about itself, stored at workers[1], while workers[0] is reserved for special use to indicate unassigned slots.
Garnet cluster nodes are connected to each other through the ```CLUSTER MEET``` command which generates a special kind of gossip message.
This message forces a remote node to add the sender to its list of trusted nodes.
Remote nodes are stored in any order starting from workers[2].

```csharp reference title="Worker Definition"
https://github.com/microsoft/garnet/blob/951cf82c120d4069e940e832db03bfa018c688ea/libs/cluster/Server/Worker.cs#L28-L85
```
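
The initial layout of the workers array described above can be illustrated with a small sketch; the ```WorkerSketch``` type and its fields are assumptions, while the actual struct is shown in the reference above.

```csharp
// Illustrative initial configuration of a freshly started node.
struct WorkerSketch
{
    public string NodeId;
    public string Address;
    public int Port;
    public long ConfigEpoch;
    public string Role; // PRIMARY or REPLICA
}

static class InitialConfigSketch
{
    public static WorkerSketch[] Create(string localNodeId, string address, int port)
    {
        var workers = new WorkerSketch[2];
        // workers[0] is reserved: slots whose workerId is 0 are unassigned.
        workers[0] = default;
        // workers[1] always describes the local node, which starts as a PRIMARY
        // with no assigned slots and no knowledge of other nodes.
        workers[1] = new WorkerSketch
        {
            NodeId = localNodeId,
            Address = address,
            Port = port,
            ConfigEpoch = 0,
            Role = "PRIMARY"
        };
        // Remote nodes learned through CLUSTER MEET / gossip are appended
        // starting at workers[2].
        return workers;
    }
}
```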

Information about the individual slot assignment is captured within the configuration object using an array of ```HashSlot``` structs.
It maintains information about the slot state and corresponding owner.
The slot owner is represented using the offset in the local copy of the workers array.
The slot state is used to determine how to serve requests for keys that map to the relevant slot.

```csharp reference title="HashSlot Definition"
https://github.com/microsoft/garnet/blob/951cf82c120d4069e940e832db03bfa018c688ea/libs/cluster/Server/HashSlot.cs#L43-L61
```

At cluster startup, all slots are unassigned, hence their initial state is set to **OFFLINE** and their workerId to 0.
When a slot is assigned to a specific node, its state is set to **STABLE** and workerId (from the perspective of the local configuration copy) to the corresponding offset of the owner node in workers array.
Owners of a slot can perform read/write and migrate operations on the data associated with that specific slot.
Replicas can serve read requests only for keys mapped to slots owned by their primary.

```csharp reference title="SlotState Definition"
https://github.com/microsoft/garnet/blob/951cf82c120d4069e940e832db03bfa018c688ea/libs/cluster/Server/HashSlot.cs#L11-L37
```

### Configuration Update Propagation

A given node will accept gossip messages from trusted nodes.
The gossip message contains a serialized byte array representing a snapshot of the remote node's local configuration.
The receiving node will try to atomically merge the incoming configuration to its local copy by comparing the relevant workers' configuration epoch.
Hence any changes to the cluster's configuration can happen at the granularity of a single worker.
We leverage this mechanism to control when local updates become visible to the rest of the cluster.
This is extremely useful for operations that are extended over a long duration and consist of several phases (e.g. [data migration](slot-migration)).
Such operations are susceptible to interruptions and require protective measures to prevent any compromise of data integrity.
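
A minimal sketch of this per-worker merge rule is shown below, assuming an illustrative worker record that carries only a node-id and a configuration epoch; the real merge also reconciles slot assignments.

```csharp
using System.Collections.Generic;

// Illustrative worker entry: only the fields needed for the merge rule.
record WorkerInfoSketch(string NodeId, long ConfigEpoch);

static class ConfigMergeSketch
{
    // Merge an incoming gossip snapshot into the local copy, worker by worker.
    // A remote entry wins only if it carries a higher configuration epoch,
    // so updates propagate at the granularity of a single worker.
    public static List<WorkerInfoSketch> Merge(
        List<WorkerInfoSketch> local, IEnumerable<WorkerInfoSketch> incoming)
    {
        var merged = new List<WorkerInfoSketch>(local);
        foreach (var remote in incoming)
        {
            int i = merged.FindIndex(w => w.NodeId == remote.NodeId);
            if (i < 0)
                merged.Add(remote);                      // previously unknown worker
            else if (remote.ConfigEpoch > merged[i].ConfigEpoch)
                merged[i] = remote;                      // newer info for a known worker
        }
        return merged;
    }
}
```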

As mentioned previously, local updates are propagated through gossiping which can operate in broadcast mode or gossip sampling mode.
In the former case, we periodically broadcast the configuration to all nodes in the cluster, while in the latter case we randomly pick a subset of nodes to gossip with.
This can be configured at server start-up using the ***--gossip-sp*** flag.
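
The difference between the two modes can be sketched as follows; the helper name and the fan-out size are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class GossipTargetSketch
{
    static readonly Random rng = new();

    // Broadcast mode: gossip with every known node each round.
    // Sampling mode: gossip with a small random subset (fan-out) each round.
    public static IEnumerable<string> SelectTargets(
        IReadOnlyList<string> knownNodes, bool broadcastMode, int fanout = 3)
    {
        if (broadcastMode) return knownNodes;
        return knownNodes.OrderBy(_ => rng.Next()).Take(fanout);
    }
}
```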

## Slot Verification

RESP data commands operate either on a single key or a collection of keys.
In addition, they can be classified either as readonly (e.g. *GET* mykey) or read-write (e.g. *SET* mykey foo).
When operating in cluster mode, and before processing any command, Garnet performs an extra slot verification step.
Slot verification involves inspecting the key or keys associated with a given command and validating that they map to a slot that can be served by the node receiving the associated request.
Garnet primary nodes can serve *read* and *read-write* requests for slots that they own, while Garnet replica nodes can only serve read requests for slots that their primary owns.
On failure of the slot verification step, the corresponding command will not be processed and the slot verification method will write a redirection message directly to the network buffer.

```csharp reference title="Slot Verification Methods"
https://github.com/microsoft/garnet/blob/951cf82c120d4069e940e832db03bfa018c688ea/libs/server/Cluster/IClusterSession.cs#L47-L67
```

## Redirection Messages

From the perspective of a single node, any request for keys mapping to an unassigned slot will result in a ```-CLUSTERDOWN Hashslot not served``` error message.
For a single key request an assigned slot is considered ***LOCAL*** if the receiving node owns that slot, otherwise it is classified as a ***REMOTE*** slot since it is owned by a remote node.
In the table below, we provide a summary of the different redirection messages that are generated depending on the slot state and the type of operation being performed.
Read-only and read-write requests for a specific key mapping to a ***REMOTE*** slot will result in a ```-MOVED <slot> <address>:<port>``` redirection message, pointing to the endpoint that claims ownership of the associated slot.
A slot can also be in a special state such as ```IMPORTING``` or ```MIGRATING```.
These states are used primarily during slot migration, with the ```IMPORTING``` state assigned to the slot map of the target node and the ```MIGRATING``` state to the slot map of the source node.
If a slot is in the ```MIGRATING``` state and the key is present (meaning it has not yet been migrated), read requests can continue to be processed as usual.
Otherwise, the receiving node (for both read-only and read-write requests) will return an ```-ASK <slot> <address>:<port>``` redirection message pointing to the target node.
Note that read-write requests on existing keys are not allowed, in order to ensure data integrity during migration.

| Operation/State | ASSIGNED LOCAL | ASSIGNED REMOTE | MIGRATING EXISTS | MIGRATING ~EXISTS | IMPORTING ASKING | IMPORTING ~ASKING |
| --------------- | ---------------- | ---------------- | ---------------- | ----------------- | ---------------- | ----------------- |
| Read-Only | OK | -MOVED | OK | -ASK | OK | -MOVED |
| Read-Write | OK | -MOVED | -MIGRATING | -ASK | OK | -MOVED |
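
The table can be read as a small decision procedure; the sketch below encodes it directly, using illustrative types, whereas the real logic lives in the slot-verification methods referenced earlier.

```csharp
enum RequestSlotState { STABLE, MIGRATING, IMPORTING, OFFLINE }

static class RedirectionSketch
{
    // Returns null when the request can be served locally, otherwise the
    // redirection (or error) message to send back, following the table above.
    public static string Decide(
        RequestSlotState state, bool slotOwnedLocally, bool readOnly,
        bool keyExists, bool askingIssued, int slot, string ownerEndpoint, string targetEndpoint)
    {
        if (state == RequestSlotState.OFFLINE)
            return "-CLUSTERDOWN Hashslot not served";

        if (state == RequestSlotState.MIGRATING)
        {
            if (readOnly && keyExists) return null;          // serve reads for not-yet-migrated keys
            if (!readOnly && keyExists) return "-MIGRATING"; // writes to existing keys are rejected
            return $"-ASK {slot} {targetEndpoint}";          // key absent (possibly already migrated)
        }

        if (state == RequestSlotState.IMPORTING)
            return askingIssued ? null : $"-MOVED {slot} {ownerEndpoint}";

        // STABLE: serve locally owned slots, redirect everything else.
        return slotOwnedLocally ? null : $"-MOVED {slot} {ownerEndpoint}";
    }
}
```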

11 changes: 9 additions & 2 deletions website/docusaurus.config.js
@@ -148,6 +148,7 @@ const config = {
</p>`,
},
prism: {
additionalLanguages: ['csharp'],
theme: prismThemes.github,
darkTheme: prismThemes.dracula,
},
@@ -159,9 +160,15 @@
clarity: {
ID: "loh6v65ww5",
},
// github codeblock theme configuration
codeblock: {
showGithubLink: true,
githubLinkLabel: 'View on GitHub',
showRunmeLink: false,
runmeLinkLabel: 'Checkout via Runme'
},
}),

themes: ['@docusaurus/theme-mermaid'],
themes: ['@docusaurus/theme-mermaid', 'docusaurus-theme-github-codeblock'],
};

export default config;