Add placement change operational guide #998
Conversation
Codecov Report
@@ Coverage Diff @@
## master #998 +/- ##
==========================================
+ Coverage 77.82% 77.85% +0.03%
==========================================
Files 411 411
Lines 34516 34516
==========================================
+ Hits 26863 26874 +11
+ Misses 5778 5770 -8
+ Partials 1875 1872 -3
Continue to review full report at Codecov.
## Overview
M3DB was designed from the ground up to a be a distributed / clustered database that is isolation group aware. Clusters will seamlessly scale with your data, and you can start with a cluster as small as 3 nodes and grow it to a size of several hundred nodes with no downtime or expensive migrations.
I get that the `3 nodes` comment is referring to clusters, but I wonder if this will confuse people in thinking it's not possible to test M3DB on their own (single) machine?
Yeah, perhaps just say "small number of nodes".
In other words, all you have to do is issue the desired instruction and the M3 stack will take care of making sure that your data is distributed with appropriate replication and isolation.
In the case of the M3DB nodes, nodes that have received new shards will immediately begin receiving writes (but not serving reads) for the new shards that they are responsible for. They will also begin streaming in all the data for their newly acquired shards from the peers that already have data for those shards. Once the nodes have finished streaming in the data for the shards that they have acquired, they will mark their status for those shards as `Available` in the placement and begin accepting writes. Simultaneously, the nodes that are losing ownership of any shards will mark their status for those shards as `LEAVING`. Once all the nodes accepting ownership of the new shards have finished streaming data from them, they will relinquish ownership of those shards and remove all the data associated with the shards they lost from memory and from disk.
Should `Available` be in all caps?
no but LEAVING should not be capitalized
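For readers following the thread, here is a rough sketch of what those state transitions look like when inspecting the placement. The endpoint, JSON shape, and node names are assumptions for illustration (based on the coordinator placement API referenced elsewhere in this guide), not literal output:

```
# Fetch the current placement from any coordinator (7201 is the default port used in this guide).
curl http://<M3_COORDINATOR_HOST_NAME>:7201/api/v1/placement

# While shard 42 moves between nodes you would expect to see something like:
#   the node acquiring the shard reports INITIALIZING while it streams data from a peer,
#   the node giving it up reports LEAVING until that streaming completes,
#   and fully owned shards report AVAILABLE.
#
#   "node_new":   { "shards": [ { "id": 42, "state": "INITIALIZING", "sourceId": "node_old" } ] }
#   "node_old":   { "shards": [ { "id": 42, "state": "LEAVING" } ] }
#   "node_other": { "shards": [ { "id": 42, "state": "AVAILABLE" } ] }
```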
M3Coordinator nodes will also pickup the new placement from etcd and alter which M3DB nodes they issue writse and reads to appropriately.
s/writse/writes
#### Isolation Group
This value controls how nodes that own the same M3DB shards are isolated from each other. For example, in a single datacenter configuration this value could be set to the rack that the M3DB node lives on. As a result, the placement will guarantee that nodes that exist on the same rack do not share any shards, allowing the cluster to survive the failure of an entire rack. Alternatively, if M3DB was deployed in an AWS region, the isolation group could be set to the regions availability zone and that would ensure that the cluster would survive the loss of an entire availability zone.
s/regions/region's
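To make the rack example concrete, here is a sketch of the `instances` portion of a placement request body you might submit when building such a placement. The field names follow the coordinator placement APIs, while the host names, rack names, and ports are hypothetical:

```
[
  { "id": "host_a", "isolation_group": "rack-a", "zone": "embedded", "weight": 100, "endpoint": "host_a:9000", "hostname": "host_a", "port": 9000 },
  { "id": "host_b", "isolation_group": "rack-b", "zone": "embedded", "weight": 100, "endpoint": "host_b:9000", "hostname": "host_b", "port": 9000 },
  { "id": "host_c", "isolation_group": "rack-c", "zone": "embedded", "weight": 100, "endpoint": "host_c:9000", "hostname": "host_c", "port": 9000 }
]
```

Because each instance declares a different `isolation_group`, the placement will not put two replicas of the same shard on the same rack.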
#### Weight
This value should be an integer and controls how the cluster will weigh the number of shards that an individual node will own. If you're running the M3DB cluster on homogenous hardware, then you probably want to assign all M3DB nodes the same weight so that shards are distributed evenly. On the otherhand, if you're running the cluster on heterogenous hardware, then this value should be higher for nodes with higher resources for whatever the limiting factor is in your cluster setup. For example, if disk space (as opposed to memory or CPU) is the limiting factor in how many shards any given node in your cluster can tolerate, then you could assign a higher value to nodes in your cluster that have larger disks and the placement calcualtions would assign them a higher number of shards.
s/calcualtions/calculations
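As a sketch of the heterogeneous case described above (the values and host names are hypothetical), a node with roughly twice the usable disk of its peers could simply be given twice the weight; the weights are relative, so any consistent scale should work. This is again the `instances` fragment of a placement request body:

```
[
  { "id": "host_small", "isolation_group": "rack-a", "zone": "embedded", "weight": 100, "endpoint": "host_small:9000", "hostname": "host_small", "port": 9000 },
  { "id": "host_large", "isolation_group": "rack-b", "zone": "embedded", "weight": 200, "endpoint": "host_large:9000", "hostname": "host_large", "port": 9000 }
]
```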
The instructions below all contain sample curl commands, but you can always review the API documentation by navigating to
`http://<M3_COORDINATOR_HOST_NAME>:<CONFIGURED_PORT(default 7201)>/api/v1/openapi`
Also: https://m3db.io/openapi/
nice....how does that get updated
Netlify will automatically update this during the build script, assuming that the assets folder is up to date (via `make asset-gen-query`).
#### Replacing a Node
TODO
Isn't this just a remove and then an add?
@robskillington Is this correct
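If the suggestion in this thread holds, a replacement could be done with the existing add and remove operations rather than a dedicated endpoint. A rough sketch under those assumptions; the `POST /api/v1/placement` add endpoint and the instance definition shown are assumptions, not something this guide confirms:

```
# 1. Add the replacement node (hypothetical instance definition).
curl -X POST http://<M3_COORDINATOR_HOST_NAME>:7201/api/v1/placement -d '{
  "instances": [
    { "id": "host_new", "isolation_group": "rack-a", "zone": "embedded", "weight": 100, "endpoint": "host_new:9000", "hostname": "host_new", "port": 9000 }
  ]
}'

# 2. Poll the placement until all of host_new's shards report AVAILABLE, i.e. it has finished streaming.
curl http://<M3_COORDINATOR_HOST_NAME>:7201/api/v1/placement

# 3. Remove the node being replaced, then wait again for the resulting shard movement to complete.
curl -X DELETE http://<M3_COORDINATOR_HOST_NAME>:7201/api/v1/placement/<OLD_NODE_ID>
```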
## Overview
M3DB was designed from the ground up to a be a distributed / clustered database that is isolation group aware. Clusters will seamlessly scale with your data, and you can start with a cluster as small as 3 nodes and grow it to a size of several hundred nodes with no downtime or expensive migrations.
nit with opening sentence: "to a be" -> "to be a" and "distributed / clustered database" -> "distributed (clustered)".
Perhaps:
M3DB was designed from the ground up to be a distributed (clustered) database that is availability zone or rack aware (by using isolation groups).
@@ -82,7 +82,7 @@ TODO
The operations below include sample CURLs, but you can always review the API documentation by navigating to
`http://<M3_COORDINATOR_IP_ADDRESS>:<CONFIGURED_PORT(default 7201)>/api/v1/openapi`
`http://<M3_COORDINATOR_HOST_NAME>:<CONFIGURED_PORT(default 7201)>/api/v1/openapi` or our [online API documentation](https://m3db.io/openapi/).
unrelated to your change but the deleting a namespace HTTP endpoint isn't correctly quoted: https://github.com/m3db/m3/blob/2eba3b8e0abc28b962d9d0132a7b0e82328a5c4e/docs/operational_guide/namespace_changes.md. mind updating.
FIXED
@@ -402,14 +419,14 @@ definitions:
    instances:
      type: "array"
      items:
-       $ref: "#/definitions/Instance"
+       $ref: "#/definitions/InstanceRequest"
is this just to avoid the extra fields?
correct :(
#### Removing a Node
Send a DELETE request to the `/api/v1/placement/<NODE_ID>` endpoint.
After sending a DELETE, can the node be removed right away or do we need to wait for the cluster to rebalance?
Good point I will clarify, but yes you need to wait for all the other nodes to stream data from the node you removed.
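Putting the endpoint and the caveat from this thread together, a sketch (the host name and node ID are placeholders):

```
# Remove a node from the placement (endpoint from the section above; 7201 is the default coordinator port).
curl -X DELETE http://<M3_COORDINATOR_HOST_NAME>:7201/api/v1/placement/<NODE_ID>

# Do not decommission the hardware yet: per the comment above, the remaining nodes first need to finish
# streaming the removed node's data. Poll the placement and wait until no shards are still INITIALIZING
# or LEAVING before taking the node offline.
curl http://<M3_COORDINATOR_HOST_NAME>:7201/api/v1/placement
```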
Before reading the rest of this document, we recommend familiarizing yourself with the [M3DB placement documentation](placement.md)
**Note**: The primary limiting factor for the maximum size of an M3DB cluster is the number of shards. TODO: Explain how to pick an appropriate number of shards and the tradeoff with a (small) linear increase in required node resources with the number of shards.
@robskillington @prateek Can I get some guidance on this?
Gah, I wish we would do some work to remove this silly requirement to choose some kind of size that matters. If groups of shards use about as many resources as one single shard, then this greatly improves the ergonomics/usability of the DB. You could grow a cluster of size 1 to a size of a few thousand nodes, if say you start off with 16k shards.
Yeah that's fair, but we should give some guidance for now. How about 64 shards for development / testing purposes, 1024 for anything you're gonna use in production, and 4096 if you know it's gonna be a big cluster and you really don't want to worry about it? We've run 1024 and 4096 in production so we know those values are ok.
@@ -120,7 +120,7 @@ Adding a namespace does not require restarting M3DB, but will require modifying
Deleting a namespace is a simple as using the `DELETE` `/api/v1/namespace` API on an M3Coordinator instance.
curl -X DELETE <M3_COORDINATOR_IP_ADDRESS>:<CONFIGURED_PORT(default 7201)>api/v1/namespace/<NAMESPACE_NAME>
`curl -X DELETE <M3_COORDINATOR_IP_ADDRESS>:<CONFIGURED_PORT(default 7201)>api/v1/namespace/<NAMESPACE_NAME>` |
Missing `/` before `api/v1/...` here.
- Resource Constrained / Development clusters: `64 shards`
- Production clusters: `1024 shards`
- Production clusters with high-resource nodes (Over 128GiB of ram, etc) and an expected cluster size of several hundred nodes: `4096 shards`
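These recommendations are applied when the placement is first initialized, since the shard count is chosen up front. A sketch, assuming the init endpoint is `POST /api/v1/placement/init` and using a single hypothetical instance suitable only for a development setup:

```
# Initialize a small development placement with 64 shards and a replication factor of 1.
curl -X POST http://<M3_COORDINATOR_HOST_NAME>:7201/api/v1/placement/init -d '{
  "num_shards": 64,
  "replication_factor": 1,
  "instances": [
    { "id": "dev_host", "isolation_group": "rack-a", "zone": "embedded", "weight": 100, "endpoint": "dev_host:9000", "hostname": "dev_host", "port": 9000 }
  ]
}'
```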
Nice, this is a good recommendation for now.
LGTM