From 83c09b177b13bccb30d3f5355a9fdfc3429b9a3f Mon Sep 17 00:00:00 2001
From: Justin SB
Date: Wed, 23 Jan 2019 12:08:15 -0500
Subject: [PATCH] Update readme for TLS-by-default support

---
 README.md | 97 +++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 69 insertions(+), 28 deletions(-)

diff --git a/README.md b/README.md
index 25c6b49b..29802057 100644
--- a/README.md
+++ b/README.md
@@ -19,6 +19,15 @@ a packaged docker container, but this walkthrough lets you see what's going on here)
 etcd must be installed in `/opt/etcd-v2.2.1-linux-amd64/etcd`, etcdctl in `/opt/etcd-v2.2.1-linux-amd64/etcdctl`. Each version of etcd you want to run must be installed in the same pattern.
-Make sure you've downloaded `/opt/etcd-v3.2.18-linux-amd64` for this demo. (Typically you'll run etcd-manager in a docker image)
+Make sure you've downloaded `/opt/etcd-v3.2.24-linux-amd64` for this demo. (Typically you'll run etcd-manager in a docker image)
+
+On Linux, you can do this with:
+
+```
+bazel build //:etcd-v2.2.1-linux-amd64_etcd //:etcd-v2.2.1-linux-amd64_etcdctl
+bazel build //:etcd-v3.2.24-linux-amd64_etcd //:etcd-v3.2.24-linux-amd64_etcdctl
+sudo cp -r bazel-genfiles/etcd-v* /opt/
+sudo chown -R ${USER} /opt/etcd-v*
+```
+
-NOTE: If you're running on OSX, CoreOS does not ship a version of etcd2 that runs correctly on recent versions os OSX. Running inside Docker avoids this problem.
+NOTE: If you're running on OSX, CoreOS does not ship a version of etcd2 that runs correctly on recent versions of OSX. Running inside Docker avoids this problem.
 
 ```
@@ -32,12 +41,17 @@ ln -sf bazel-bin/cmd/etcd-manager-ctl/linux_amd64_stripped/etcd-manager-ctl
 ln -sf bazel-bin/cmd/etcd-manager/linux_amd64_stripped/etcd-manager
 
 # Start etcd manager
-./etcd-manager --address 127.0.0.1 --cluster-name=test --backup-store=file:///tmp/etcd-manager/backups/test --data-dir=/tmp/etcd-manager/data/test/1 --client-urls=http://127.0.0.1:4001 --quarantine-client-urls=http://127.0.0.1:8001
+./etcd-manager --insecure --etcd-insecure --address 127.0.0.1 --etcd-address 127.0.0.1 --cluster-name=test --backup-store=file:///tmp/etcd-manager/backups/test --data-dir=/tmp/etcd-manager/data/test/1 --client-urls=http://127.0.0.1:4001 --quarantine-client-urls=http://127.0.0.1:8001 --peer-urls=http://127.0.0.1:2380
 
 # Seed cluster creation
-./etcd-manager-ctl --members=1 --backup-store=file:///tmp/etcd-manager/backups/test --etcd-version=2.2.1
+./etcd-manager-ctl --member-count=1 --backup-store=file:///tmp/etcd-manager/backups/test --etcd-version=2.2.1 configure-cluster
 ```
 
+Note the `--insecure` and `--etcd-insecure` flags: these turn off TLS for both
+etcd-manager and etcd. You shouldn't do that in production, but for a
+demo/walkthrough the TLS keys are a little complicated. The test suite and
+production configurations do use TLS.
+
 `etcd-manager` will start a node ready to start running etcd, and `etcd-manager-ctl` will provide the initial settings for the cluster. Those settings are written to the backup store, so the backup store acts as a source of truth when etcd is not running.
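+
+Since this walkthrough uses a `file://` backup store, you can inspect that source of truth directly. The exact layout is an implementation detail, but after `etcd-manager-ctl` has run you should see the stored cluster spec appear under the backup directory:
+
+```
+find /tmp/etcd-manager/backups/test -type f
+```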
-So this will start a single node cluster of etcd (`--members=1`),
+So this will start a single node cluster of etcd (`--member-count=1`),
@@ -55,8 +69,8 @@ You should be able to set and list keys using the etcdctl tool:
 Now if we want to expand the cluster (it's probably easiest to run each of these commands in different windows / tabs / tmux windows / screen windows):
 
 ```
-./etcd-manager --address 127.0.0.2 --cluster-name=test --backup-store=file:///tmp/etcd-manager/backups/test --data-dir=/tmp/etcd-manager/data/test/2 --client-urls=http://127.0.0.2:4001 --quarantine-client-urls=http://127.0.0.2:8001
-./etcd-manager --address 127.0.0.3 --cluster-name=test --backup-store=file:///tmp/etcd-manager/backups/test --data-dir=/tmp/etcd-manager/data/test/3 --client-urls=http://127.0.0.3:4001 --quarantine-client-urls=http://127.0.0.3:8001
+./etcd-manager --insecure --etcd-insecure --address 127.0.0.2 --etcd-address 127.0.0.2 --cluster-name=test --backup-store=file:///tmp/etcd-manager/backups/test --data-dir=/tmp/etcd-manager/data/test/2 --client-urls=http://127.0.0.2:4001 --quarantine-client-urls=http://127.0.0.2:8001 --peer-urls=http://127.0.0.2:2380
+./etcd-manager --insecure --etcd-insecure --address 127.0.0.3 --etcd-address 127.0.0.3 --cluster-name=test --backup-store=file:///tmp/etcd-manager/backups/test --data-dir=/tmp/etcd-manager/data/test/3 --client-urls=http://127.0.0.3:4001 --quarantine-client-urls=http://127.0.0.3:8001 --peer-urls=http://127.0.0.3:2380
 ```
 
 Within a few seconds, the two other nodes will join the gossip cluster, but will not yet be part of etcd. The leader controller will be logging something like this:
@@ -87,13 +101,13 @@ If you do look around the directories:
 We can reconfigure the cluster:
 
 ```
-> curl http://127.0.0.1:4001/v2/keys/kope.io/etcd-manager/test/spec
-{"action":"get","node":{"key":"/kope.io/etcd-manager/test/spec","value":"{\n \"memberCount\": 1,\n \"etcdVersion\": \"2.2.1\"\n}","modifiedIndex":4,"createdIndex":4}}
-
-> curl -XPUT -d 'value={ "memberCount": 3, "etcdVersion": "2.2.1" }' http://127.0.0.1:4001/v2/keys/kope.io/etcd-manager/test/spec
+> ./etcd-manager-ctl --backup-store=file:///tmp/etcd-manager/backups/test get
+etcd-manager-ctl
+member_count:1 etcd_version:"2.2.1"
+> ./etcd-manager-ctl --backup-store=file:///tmp/etcd-manager/backups/test --member-count=3 --etcd-version=2.2.1 configure-cluster
 ```
 
-Within a minute, we should see all 3 nodes in the etcd cluster:
+If you now bounce the first etcd-manager process (Ctrl-C and relaunch), the cluster will reconfigure itself. This bouncing is typically done by a rolling update, though etcd-manager can also be configured to automatically look for configuration changes:
 
 ```
 > curl http://127.0.0.1:4001/v2/members/
@@ -106,8 +120,9 @@ and as long as you allow the cluster to recover before deleting a second data directory
 
 ### Disaster recovery
 
-The etcd-manager performs periodic backups. In the event of a total failure, it will restore automatically
-(TODO: we should make this configurable - if a node _could_ recover we likely want this to be manually triggered)
+The etcd-manager performs periodic backups. In the event of a total failure, we
+can restore from that backup. Note that this means losing any data written
+since the last backup, so we require a manual trigger.
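+
+To see that data loss concretely, you can write a canary key before killing the cluster below. Assuming no further backup is taken before the restore, the canary will be gone after recovery (the key name and value here are arbitrary):
+
+```
+curl -XPUT -d "value=doomed" http://127.0.0.1:4001/v2/keys/canary
+```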
 Verify backup/restore works correctly:
 
@@ -120,25 +135,48 @@ curl -XPUT -d "value=world" http://127.0.0.1:4001/v2/keys/hello
 * Remove the active data: `rm -rf /tmp/etcd-manager/data/test`
 * Restart all 3 processes
 
-Disaster recovery will detect that no etcd nodes are running, will start a cluster on all 3 nodes, and restore the backup.
+A leader will be elected, and it will start logging `etcd has 0 members registered; must issue restore-backup command to proceed`.
+
+List the available backups with:
 
-### Upgrading
+```bash
+> ./etcd-manager-ctl --backup-store=file:///tmp/etcd-manager/backups/test list-backups
+2019-01-14T15:26:45Z-000001
+2019-01-14T15:27:43Z-000001
+2019-01-14T15:29:48Z-000001
+2019-01-14T15:29:48Z-000002
+2019-01-14T15:29:48Z-000003
+```
+
+Issue the restore-backup command:
+
+```bash
+> ./etcd-manager-ctl --backup-store=file:///tmp/etcd-manager/backups/test restore-backup 2019-01-14T15:29:48Z-000003
+added restore-backup command: timestamp:1547480961703914946 restore_backup:<backup:"2019-01-14T15:29:48Z-000003" >
 ```
-> curl http://127.0.0.1:4001/v2/keys/kope.io/etcd-manager/test/spec
-{"action":"get","node":{"key":"/kope.io/etcd-manager/test/spec","value":"{ \"memberCount\": 3, \"etcdVersion\": \"2.2.1\" }","modifiedIndex":8,"createdIndex":8}}
-> curl -XPUT -d 'value={ "memberCount": 3, "etcdVersion": "3.2.18" }' http://127.0.0.1:4001/v2/keys/kope.io/etcd-manager/test/spec
+The controller will shortly restore the backup. Confirm this with:
+
+```bash
+curl http://127.0.0.1:4001/v2/members/
+curl http://127.0.0.1:4001/v2/keys/hello
 ```
+### Upgrading
+
+```
+> ./etcd-manager-ctl --backup-store=file:///tmp/etcd-manager/backups/test get
+member_count:3 etcd_version:"2.2.1"
+> ./etcd-manager-ctl --backup-store=file:///tmp/etcd-manager/backups/test --member-count=3 --etcd-version=3.2.24 configure-cluster
+```
+
+Bounce the etcd-manager that has leadership so that it picks up the reconfiguration. (A quick way to confirm which etcd version is then serving is sketched at the end of this README.)
+
 Dump keys to be sure that everything copied across:
 
 ```
-> ETCDCTL_API=3 /opt/etcd-v3.2.18-linux-amd64/etcdctl --endpoints http://127.0.0.1:4001 get "" --prefix
+> ETCDCTL_API=3 /opt/etcd-v3.2.24-linux-amd64/etcdctl --endpoints http://127.0.0.1:4001 get "" --prefix
 /hello
 world
-/kope.io/etcd-manager/test/spec
-{ "memberCount": 3, "etcdVersion": "3.2.18" }
 ```
 
 You may note that we did the impossible here - we went straight from etcd 2 to etcd 3 in an HA cluster. There was some
@@ -157,7 +195,10 @@ that the etcd-version of each object is changed, meaning all watches are invalidated.
 TODO: We should enable "hot" upgrades where the version change is compatible. (It's easy, but it's nice to have one code path for now)
 
 If you want to try a downgrade:
-`ETCDCTL_API=3 /opt/etcd-v3.2.18-linux-amd64/etcdctl --endpoints http://127.0.0.1:4001 put /kope.io/etcd-manager/test/spec '{ "memberCount": 3, "etcdVersion": "2.3.7" }'`
+
+```
+./etcd-manager-ctl --backup-store=file:///tmp/etcd-manager/backups/test --member-count=3 --etcd-version=2.2.1 configure-cluster
+```
 
 ## Code overview
@@ -194,21 +235,21 @@ Once a leader has been determined, it performs this basic loop:
 
 Help gratefully received:
 
-* We need to split out the backup logic, so that we can run it as a simple coprocess
-  alongside existing etcd implementations.
-* We should better integrate settting the `/kope.io/etcd-manager/test/spec` into kubernetes. A controller could sync it
+* ~We need to split out the backup logic, so that we can run it as a simple coprocess
+  alongside existing etcd implementations.~
+* We should better integrate setting the spec into kubernetes. A controller could sync it
   with a CRD or apimachinery type.
 * We use the VFS library from kops (that is the only dependency on kops, and it's not a big one). We should look at making VFS
   into a true kubernetes shared library.
-* We should probably not recover automatically from a backup in the event of total cluster loss, because backups are periodic
-  and thus we know some data loss is likely. Idea: drop a marker file into the backup store.
+* ~We should probably not recover automatically from a backup in the event of total cluster loss, because backups are periodic
+  and thus we know some data loss is likely. Idea: drop a marker file into the backup store.~
 * The controller leader election currently considers itself the leader when it has consensus amongst all reachable peers,
   and will create a cluster when there are sufficient peers to form a quorum. But with partitions, it's possible to have
   two nodes that both believe themselves to be the leader. If the number of machines is `>= 2 * quorum` then we could
   form two etcd clusters (etcd itself should stop overlapping clusters). A pluggable locking implementation is one
   solution in progress; GCS has good consistency guarantees.
-* Discovery mechanisms are currently mostly fake - they work on a local filesystem. We have an early one backed by VFS,
-  but discovery via the EC2/GCE APIs would be great, as would network scanning or multicast discovery.
+* ~Discovery mechanisms are currently mostly fake - they work on a local filesystem. We have an early one backed by VFS,
+  but discovery via the EC2/GCE APIs would be great, as would network scanning or multicast discovery.~
 * All cluster version changes currently are performed via the "full dump and restore" mechanism. We should learn that
   some version changes are in fact safe, and perform them as a rolling-update (after a backup!)
-* There should be a way to trigger a backup via a convenient mechanism. Idea: drop a marker key into etcd.
+* ~There should be a way to trigger a backup via a convenient mechanism. Idea: drop a marker key into etcd.~
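+
+Incidentally, a quick way to confirm which etcd version is actually serving (for example, after the upgrade walkthrough above) is etcd's `/version` endpoint; it exists in both etcd2 and etcd3, though the two return differently-shaped JSON:
+
+```
+curl http://127.0.0.1:4001/version
+```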