
Add placement change operational guide #998

Merged
merged 12 commits into master from ra/placement_change_guide
Oct 3, 2018

Conversation

richardartoul
Contributor

No description provided.

@codecov

codecov bot commented Oct 1, 2018

Codecov Report

Merging #998 into master will increase coverage by 0.03%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #998      +/-   ##
==========================================
+ Coverage   77.82%   77.85%   +0.03%     
==========================================
  Files         411      411              
  Lines       34516    34516              
==========================================
+ Hits        26863    26874      +11     
+ Misses       5778     5770       -8     
+ Partials     1875     1872       -3
| Flag | Coverage Δ |
|------|------------|
| #dbnode | 81.42% <ø> (+0.04%) ⬆️ |
| #m3ninx | 75.25% <ø> (ø) ⬆️ |
| #query | 64.35% <ø> (ø) ⬆️ |
| #x | 84.72% <ø> (ø) ⬆️ |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 918ee35...061fbdc. Read the comment docs.

@richardartoul richardartoul mentioned this pull request Oct 1, 2018
@richardartoul richardartoul changed the title [WIP] - Add placement change operational guide Add placement change operational guide Oct 1, 2018

## Overview

M3DB was designed from the ground up to a be a distributed / clustered database that is isolation group aware. Clusters will seamlessly scale with your data, and you can start with a cluster as small as 3 nodes and grow it to a size of several hundred nodes with no downtime or expensive migrations.
Collaborator

I get that the 3 nodes comment is referring to clusters, but I wonder if this will confuse people in thinking it's not possible to test M3DB on their own (single) machine?

Collaborator

Yeah, perhaps just say "small number of nodes".


In other words, all you have to do is issue the desired instruction and the M3 stack will take care of making sure that your data is distributed with appropriate replication and isolation.

In the case of the M3DB nodes, nodes that have received new shards will immediately begin receiving writes (but not serving reads) for the new shards that they are responsible for. They will also begin streaming in all the data for their newly acquired shards from the peers that already have data for those shards. Once the nodes have finished streaming in the data for the shards that they have acquired, they will mark their status for those shards as `Available` in the placement and begin accepting writes. Simultaneously, the nodes that are losing ownership of any shards will mark their status for those shards as `LEAVING`. Once all the nodes accepting ownership of the new shards have finished streaming data from them, they will relinquish ownership of those shards and remove all the data associated with the shards they lost from memory and from disk.
Collaborator

Should Available be in all caps?

Contributor Author

no but LEAVING should not be capitalized



M3Coordinator nodes will also pick up the new placement from etcd and alter which M3DB nodes they issue writse and reads to appropriately.
Collaborator

s/writse/writes
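
To make the shard-state transitions described above easier to observe, here is a rough sketch of how you might watch them from the coordinator. It assumes the placement endpoint returns the placement under a `placement` key with an `instances` map and that `jq` is installed; the exact response shape may differ between versions, so verify against the OpenAPI docs.

```bash
# Summarize how many shards are in each state (e.g. INITIALIZING, AVAILABLE, LEAVING)
# across the whole cluster. Hostname and port are placeholders.
curl -sSf "http://<M3_COORDINATOR_HOST_NAME>:7201/api/v1/placement" \
  | jq '[.placement.instances[].shards[].state] | group_by(.) | map({state: .[0], count: length})'
```

When the counts for the non-Available states drop to zero, the placement change has completed.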


#### Isolation Group

This value controls how nodes that own the same M3DB shards are isolated from each other. For example, in a single datacenter configuration this value could be set to the rack that the M3DB node lives on. As a result, the placement will guarantee that nodes that exist on the same rack do not share any shards, allowing the cluster to survive the failure of an entire rack. Alternatively, if M3DB was deployed in an AWS region, the isolation group could be set to the regions availability zone and that would ensure that the cluster would survive the loss of an entire availability zone.
Collaborator

s/regions/region's
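
For illustration, an instance entry in a placement request might set its isolation group to the availability zone it runs in. This is a sketch only: the snake_case field names follow the placement API referenced in this guide, and all values are placeholders.

```json
{
  "id": "m3db-node-01",
  "isolation_group": "us-east-1a",
  "zone": "embedded",
  "weight": 100,
  "endpoint": "m3db-node-01:9000",
  "hostname": "m3db-node-01",
  "port": 9000
}
```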


#### Weight

This value should be an integer and controls how the cluster will weigh the number of shards that an individual node will own. If you're running the M3DB cluster on homogeneous hardware, then you probably want to assign all M3DB nodes the same weight so that shards are distributed evenly. On the other hand, if you're running the cluster on heterogeneous hardware, then this value should be higher for nodes with higher resources for whatever the limiting factor is in your cluster setup. For example, if disk space (as opposed to memory or CPU) is the limiting factor in how many shards any given node in your cluster can tolerate, then you could assign a higher value to nodes in your cluster that have larger disks and the placement calcualtions would assign them a higher number of shards.
Collaborator

s/calcualtions/calculations
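
As a concrete (hypothetical) example of weighting heterogeneous hardware: if disk is the limiting factor, a node with twice the disk capacity could be given twice the weight so the placement assigns it roughly twice as many shards. The request fragment below is a sketch with placeholder names and values.

```json
{
  "instances": [
    { "id": "node-small", "isolation_group": "rack-a", "zone": "embedded",
      "weight": 100, "endpoint": "node-small:9000", "hostname": "node-small", "port": 9000 },
    { "id": "node-large", "isolation_group": "rack-b", "zone": "embedded",
      "weight": 200, "endpoint": "node-large:9000", "hostname": "node-large", "port": 9000 }
  ]
}
```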


The instructions below all contain sample curl commands, but you can always review the API documentation by navigating to

`http://<M3_COORDINATOR_HOST_NAME>:<CONFIGURED_PORT(default 7201)>/api/v1/openapi`
Contributor Author

nice....how does that get updated

Collaborator

Netlify will automatically update this during the build script assuming that the assets folder is up to date (via make asset-gen-query).


#### Replacing a Node
TODO
Collaborator

Isn't this just a remove and then an add?

Contributor Author

@robskillington Is this correct
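
Assuming the answer is yes (i.e. a replace really is a remove followed by an add), a minimal sketch might look like the following. The DELETE endpoint is the one documented below for removing a node, and the POST body reuses the instance fields shown earlier; node IDs, hosts, and ports are placeholders.

```bash
# 1. Remove the node being replaced and wait for its shards to finish
#    streaming to (and become Available on) the remaining nodes.
curl -X DELETE "http://<M3_COORDINATOR_HOST_NAME>:7201/api/v1/placement/<OLD_NODE_ID>"

# 2. Add the replacement node; it will begin streaming data for the shards it is assigned.
curl -X POST "http://<M3_COORDINATOR_HOST_NAME>:7201/api/v1/placement" -d '{
  "instances": [{
    "id": "<NEW_NODE_ID>",
    "isolation_group": "<ISOLATION_GROUP>",
    "zone": "embedded",
    "weight": 100,
    "endpoint": "<NEW_NODE_HOST>:9000",
    "hostname": "<NEW_NODE_HOST>",
    "port": 9000
  }]
}'
```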


Collaborator
@robskillington robskillington Oct 1, 2018

nit with opening sentence: "to a be" -> "to be a" and "distributed / clustered database" -> "distributed (clustered)".

Perhaps:

M3DB was designed from the ground up to be a distributed (clustered) database that is availability zone or rack aware (by using isolation groups).

@@ -82,7 +82,7 @@ TODO

The operations below include sample CURLs, but you can always review the API documentation by navigating to

-`http://<M3_COORDINATOR_IP_ADDRESS>:<CONFIGURED_PORT(default 7201)>/api/v1/openapi`
+`http://<M3_COORDINATOR_HOST_NAME>:<CONFIGURED_PORT(default 7201)>/api/v1/openapi` or our [online API documentation](https://m3db.io/openapi/).
Collaborator

unrelated to your change but the deleting a namespace HTTP endpoint isn't correctly quoted: https://github.com/m3db/m3/blob/2eba3b8e0abc28b962d9d0132a7b0e82328a5c4e/docs/operational_guide/namespace_changes.md. mind updating.

Contributor Author

FIXED

@@ -402,14 +419,14 @@ definitions:
 instances:
   type: "array"
   items:
-    $ref: "#/definitions/Instance"
+    $ref: "#/definitions/InstanceRequest"
Collaborator

is this just to avoid the extra fields?

Contributor Author

correct :(


#### Removing a Node

Send a DELETE request to the `/api/v1/placement/<NODE_ID>` endpoint.
Contributor

After sending a DELETE, can the node be removed right away or do we need to wait for the cluster to rebalance?

Contributor Author

Good point I will clarify, but yes you need to wait for all the other nodes to stream data from the node you removed.
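
To make the waiting step concrete, here is a sketch (placeholders throughout; it assumes `jq` and the same response shape used in the earlier example) that removes a node and then blocks until every remaining shard is back in the Available state before the host is decommissioned.

```bash
# Remove the node from the placement.
curl -X DELETE "http://<M3_COORDINATOR_HOST_NAME>:7201/api/v1/placement/<NODE_ID>"

# Poll until no shard is left in a non-AVAILABLE state.
while curl -sSf "http://<M3_COORDINATOR_HOST_NAME>:7201/api/v1/placement" \
    | jq -e '[.placement.instances[].shards[].state] | any(. != "AVAILABLE")' > /dev/null; do
  echo "waiting for shards to become AVAILABLE..."
  sleep 30
done
```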


Before reading the rest of this document, we recommend familiarizing yourself with the [M3DB placement documentation](placement.md).

**Note**: The primary limiting factor for the maximum size of an M3DB cluster is the number of shards. TODO: Explain how to pick an appropriate number of shards and the tradeoff with a (small) linear increase in required node resources with the number of shards.
Contributor Author

@robskillington @prateek Can I get some guidance on this?

Collaborator
@robskillington robskillington Oct 2, 2018

Gah, I wish we would do some work to remove this silly requirement to choose some kind of size that matters. If groups of shards use about as many resources as one single shard, then this greatly improves the ergonomics/usability of the DB. You could grow a cluster of size 1 to a size of a few thousand nodes, if say you start off with 16k shards.

Contributor Author

Yeah that's fair, but we should give some guidance for now. How about 64 shards for development / testing purposes, 1024 for anything you're gonna use in production, and 4096 if you know it's gonna be a big cluster and you really don't want to worry about it? We've run 1024 and 4096 in production so we know those values are ok

@@ -120,7 +120,7 @@ Adding a namespace does not require restarting M3DB, but will require modifying

Deleting a namespace is as simple as using the `DELETE` `/api/v1/namespace` API on an M3Coordinator instance.

-curl -X DELETE <M3_COORDINATOR_IP_ADDRESS>:<CONFIGURED_PORT(default 7201)>api/v1/namespace/<NAMESPACE_NAME>
+`curl -X DELETE <M3_COORDINATOR_IP_ADDRESS>:<CONFIGURED_PORT(default 7201)>api/v1/namespace/<NAMESPACE_NAME>`
Collaborator

Missing / before api/v1/... here.
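
With that fix applied, the command would read (placeholders unchanged from the original line):

```bash
curl -X DELETE <M3_COORDINATOR_IP_ADDRESS>:<CONFIGURED_PORT(default 7201)>/api/v1/namespace/<NAMESPACE_NAME>
```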


- Resource Constrained / Development clusters: `64 shards`
- Production clusters: `1024 shards`
- Production clusters with high-resource nodes (Over 128GiB of ram, etc) and an expected cluster size of several hundred nodes: `4096 shards`
Collaborator

Nice, this is a good recommendation for now.
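
To connect the recommendation to the API, here is a hedged sketch of initializing a small three-node production placement with 1024 shards. The endpoint path and body fields follow the placement API referenced in this guide, but verify them against the OpenAPI docs; all names and values are placeholders.

```bash
curl -X POST "http://<M3_COORDINATOR_HOST_NAME>:7201/api/v1/placement/init" -d '{
  "num_shards": 1024,
  "replication_factor": 3,
  "instances": [
    { "id": "host1", "isolation_group": "zone-a", "zone": "embedded",
      "weight": 100, "endpoint": "host1:9000", "hostname": "host1", "port": 9000 },
    { "id": "host2", "isolation_group": "zone-b", "zone": "embedded",
      "weight": 100, "endpoint": "host2:9000", "hostname": "host2", "port": 9000 },
    { "id": "host3", "isolation_group": "zone-c", "zone": "embedded",
      "weight": 100, "endpoint": "host3:9000", "hostname": "host3", "port": 9000 }
  ]
}'
```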

@CLAassistant

CLAassistant commented Oct 3, 2018

CLA assistant check
All committers have signed the CLA.

Collaborator
@robskillington robskillington left a comment

LGTM

@richardartoul richardartoul merged commit ccf2888 into master Oct 3, 2018
@prateek prateek deleted the ra/placement_change_guide branch October 13, 2018 06:28