Weekly Release - Enterprise 3.5 - Monday, May 15, 2017 #4416

Merged
merged 15 commits into from May 15, 2017
129 changes: 78 additions & 51 deletions admin_guide/backup_restore.adoc
@@ -453,44 +453,74 @@ member d266df286a41a8a4 is healthy
xref:bringing-openshift-services-back-online[Bringing {product-title}
Services Back Online].

[[backup-restore-adding-etcd-hosts]]
== Adding New etcd Hosts

In cases where etcd members have failed and you still have a quorum of etcd
cluster members running, you can use the surviving members to
add additional etcd members without downtime.

*Suggested Cluster Size*

Having a cluster with an odd number of etcd hosts provides fault
tolerance. Having an odd number of etcd hosts does not change the number needed
for a quorum, but increases the tolerance for failure. For example, with a cluster
size of three members, the quorum is two, leaving a failure tolerance of
one. This ensures that the cluster continues to operate as long as two of the
members are healthy.

Having an in-production cluster of three etcd hosts is recommended.
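
For reference, quorum is a simple majority of the configured members, so the
failure tolerance for common cluster sizes is:

----
Cluster size    Quorum    Failure tolerance
1               1         0
3               2         1
5               3         2
7               4         3
----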

[NOTE]
====
The following presumes you have a backup of the */etc/etcd* configuration for
the etcd hosts.
====

. If the new etcd members will also be {product-title} nodes, see xref:../install_config/adding_hosts_to_existing_cluster.adoc#install-config-adding-hosts-to-cluster[Add
the desired number of hosts to the cluster]. The rest of this procedure presumes
you have added just one host, but if adding multiple, perform all steps on each
host.

. Upgrade etcd and iptables on the surviving nodes:
+
----
# yum update etcd iptables-services
----
+
Ensure version `etcd-2.3.7-4.el7.x86_64` or greater is installed, and that the
same version is installed on each host.

. Install etcd and iptables on the new host:
+
----
# yum install etcd iptables-services
----
+
Ensure version `etcd-2.3.7-4.el7.x86_64` or greater is installed, and that the
same version is installed on the new host.

. xref:cluster-backup[Back up the etcd data store] on surviving hosts before making any cluster configuration changes.
+
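A minimal sketch of a backup invocation (assuming the default data directory
*_/var/lib/etcd_*; adjust paths for your environment):
+
----
# etcdctl backup \
    --data-dir /var/lib/etcd \
    --backup-dir /var/lib/etcd-backup-$(date +%Y%m%d)
----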
. If replacing a failed etcd member, remove the failed member _before_ adding the new member.
+
----
# etcdctl -C https://<surviving host IP>:2379 \
--ca-file=/etc/etcd/ca.crt \
--cert-file=/etc/etcd/peer.crt \
--key-file=/etc/etcd/peer.key cluster-health

# etcdctl -C https://<surviving host IP>:2379 \
--ca-file=/etc/etcd/ca.crt \
--cert-file=/etc/etcd/peer.crt \
--key-file=/etc/etcd/peer.key member remove <failed member identifier>
----
+
Stop the etcd service on the failed etcd member:
+
----
# systemctl stop etcd
----
. On the new host, add the appropriate iptables rules:
+
----
@@ -506,7 +536,7 @@ same version is installed on each host.
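+
A minimal sketch of rules that open the etcd client and peer ports (2379 and
2380, the defaults used throughout this guide); the exact chain names and rule
placement depend on your firewall configuration:
+
----
# iptables -A INPUT -p tcp -m state --state NEW -m tcp --dport 2379 -j ACCEPT
# iptables -A INPUT -p tcp -m state --state NEW -m tcp --dport 2380 -j ACCEPT
----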

. Generate the required certificates for the new host. On a surviving etcd host:
+
.. Make a backup of the *_/etc/etcd/ca/_* directory.

.. Set the variables and working directory for the certificates, ensuring that the *_PREFIX_* directory is created if one does not already exist:
+
@@ -522,7 +552,7 @@ same version is installed on each host.
.. Create the $PREFIX directory:
+
----
$ mkdir -p $PREFIX
----

.. Create the *_server.csr_* and *_server.crt_* certificates:
@@ -555,10 +585,11 @@ $ mkdir $PREFIX
-extensions etcd_v3_ca_peer -batch
----

.. Copy the *_etcd.conf_* and *_ca.crt_* files, and archive the contents of the directory:
+
----
# cp etcd.conf ${PREFIX}
# cp ca.crt ${PREFIX}
# tar -czvf ${PREFIX}${CN}.tgz -C ${PREFIX} .
----

@@ -568,7 +599,7 @@ $ mkdir $PREFIX
# scp ${PREFIX}${CN}.tgz $CN:/etc/etcd/
----

. While still on the surviving etcd host, add the new host to the cluster:

.. Add the new host to the cluster:
+
@@ -586,48 +617,22 @@ ETCD_NAME="<NEW_ETCD_HOSTNAME>"
ETCD_INITIAL_CLUSTER="<NEW_ETCD_HOSTNAME>=https://<NEW_HOST_IP>:2380,<SURVIVING_ETCD_HOST>=https://<SURVIVING_HOST_IP>:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
----

+
[NOTE]
====
Copy the three environment variables in the etcdctl member add output. They will
be used later.
====

.. On the new host, extract the copied configuration data and set the permissions:
+
----
# tar -xf /etc/etcd/<NEW_ETCD_HOSTNAME>.tgz -C /etc/etcd/ --overwrite
# chown -R etcd:etcd /etc/etcd/*
----

.. On the new host, remove any etcd data:
+
----
# rm -rf /var/lib/etcd/member
# chown -R etcd:etcd /var/lib/etcd
----

. On the new etcd host's *_etcd.conf_* file:
.. Replace the following with the values generated in the previous step:
@@ -643,7 +648,7 @@ Replace the IP address with the "NEW_ETCD" value for:
* ETCD_INITIAL_ADVERTISE_PEER_URLS
* ETCD_ADVERTISE_CLIENT_URLS
+
For replacing failed members, you will need to remove the failed hosts from the
etcd configuration.
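+
For reference, a minimal sketch of the resulting values (assuming the default
etcd client and peer ports, 2379 and 2380; hostnames and IP addresses are
placeholders):
+
----
ETCD_NAME="<NEW_ETCD_HOSTNAME>"
ETCD_INITIAL_CLUSTER="<NEW_ETCD_HOSTNAME>=https://<NEW_HOST_IP>:2380,<SURVIVING_ETCD_HOST>=https://<SURVIVING_HOST_IP>:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://<NEW_HOST_IP>:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://<NEW_HOST_IP>:2379"
----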

. Start etcd on the new host:
@@ -652,14 +657,36 @@
# systemctl enable etcd --now
----

. To verify that the new member has been added successfully:
+
----
# etcdctl -C https://${ETCD_CA_HOST}:2379 --ca-file=/etc/etcd/ca.crt \
--cert-file=/etc/etcd/peer.crt \
--key-file=/etc/etcd/peer.key cluster-health
----
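+
Healthy output looks similar to the following (member IDs, hostnames, and IP
addresses are illustrative):
+
----
member 59df5107484b5f79 is healthy: got healthy result from https://<SURVIVING_HOST_IP>:2379
member 9fe227e580dc5833 is healthy: got healthy result from https://<NEW_HOST_IP>:2379
cluster is healthy
----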

. Update the master configuration on all masters to point to the new etcd host.
+
.. On every master in the cluster, edit *_/etc/origin/master/master-config.yaml_*.
.. Find the *etcdClientInfo* section.
.. Add the new etcd host to the *urls* list (see the sketch after this step).
.. If a failed etcd host was replaced, remove it from the list.
.. Restart the master API service.
+
On a single-master cluster installation:
+
----
# systemctl restart atomic-openshift-master
----
+
On a multi-master cluster installation, on each master:
+
----
# systemctl restart atomic-openshift-master-api
----
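
For reference, a minimal sketch of the *etcdClientInfo* section after the edit
(certificate file names are illustrative; keep the values already present in
your *_master-config.yaml_*):

----
etcdClientInfo:
  ca: master.etcd-ca.crt
  certFile: master.etcd-client.crt
  keyFile: master.etcd-client.key
  urls:
    - https://<SURVIVING_ETCD_HOST>:2379
    - https://<NEW_ETCD_HOSTNAME>:2379
----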

The procedure to add an etcd member is complete.


[[bringing-openshift-services-back-online]]
== Bringing {product-title} Services Back Online
@@ -883,4 +910,4 @@ Alternatively, you can scale down the deployment to 0, and then up again:
----
$ oc scale --replicas=0 dc/jenkins
$ oc scale --replicas=1 dc/jenkins
----
4 changes: 2 additions & 2 deletions admin_guide/garbage_collection.adoc
@@ -43,7 +43,7 @@ specified using unit suffixes such as *h* for hour, *m* for minutes, *s* for sec
|The number of instances to retain per pod container. The default is *2*.

|`*maximum-dead-containers*`
|The maximum number of total dead containers in the node. The default is *240*.
|===

The `*maximum-dead-containers*` setting takes precedence over the
@@ -70,7 +70,7 @@ kubeletArguments:
maximum-dead-containers-per-container:
- "2"
maximum-dead-containers:
- "100"
- "240"
----
====

58 changes: 36 additions & 22 deletions admin_guide/high_availability.adoc
@@ -488,7 +488,7 @@ See xref:../admin_guide/high_availability.adoc#ha-vrrp-id-offset[this discussion
|`--check-script`
|`OPENSHIFT_HA_CHECK_SCRIPT`
|
|Full path name in the pod file system of a script that is periodically run to verify the application is operating. See xref:../admin_guide/high_availability.adoc#check-notify[this discussion] for more details.

|`--check-interval`
|`OPENSHIFT_HA_CHECK_INTERVAL`
@@ -516,10 +516,12 @@ When there are multiple ipfailover deployment configuration care must be taken t

[[configuring-a-highly-available-service]]
=== Configuring a Highly-available Service
The following steps describe how to set up highly-available *router* and
*geo-cache* network services with IP failover on a set of nodes.

. Label the nodes that will be used for the services. This step is optional if
you run the services on all the nodes in your {product-title} cluster and will
use VIPs that can float within all nodes in the cluster.
+
The following example defines a label for nodes that are servicing
traffic in the US west geography *ha-svc-nodes=geo-us-west*:
@@ -530,10 +532,12 @@ $ oc label nodes openshift-node-{5,6,7,8,9} "ha-svc-nodes=geo-us-west"
----
====

. Create the service account. Depending on your environment policies, you can
either reuse the *router* service account created previously or create a new
ipfailover service account.
+
The following example creates a new service account named ipfailover in the
*default* namespace:
+
====
----
@@ -553,10 +557,12 @@ $ oadm policy add-scc-to-user privileged system:serviceaccount:default:ipfailove
+
[IMPORTANT]
====
Since the ipfailover runs on all nodes from step 1, it is recommended to also
run the router/service on all the step 1 nodes.
====
+
.. Start the router with the nodes matching the labels used in the first step.
The following example runs five instances using the ipfailover service account:
+
ifdef::openshift-enterprise[]
====
@@ -580,30 +586,37 @@ $ oadm router ha-router-us-west --replicas=5 \
====
endif::[]
+
.. Run the *geo-cache* service with a replica on each of the nodes. See an
link:https://raw.githubusercontent.com/openshift/openshift-docs/master/admin_guide/examples/geo-cache.json[example
configuration] for running a *geo-cache* service.
+
[IMPORTANT]
====
Make sure that you replace the *myimages/geo-cache* Docker image referenced in
the file with your intended image. Change the number of replicas to the number
of nodes in the *geo-cache* label. Check that the label matches the one used in
the first step.
====
+
----
$ oc create -n <namespace> -f ./examples/geo-cache.json
----

. Configure ipfailover for the *router* and *geo-cache* services. Each has its
own VIPs and both use the same nodes labeled with *ha-svc-nodes=geo-us-west* in
the first step. Ensure that the number of replicas matches the number of nodes
listed in the label setup in the first step.
+
[IMPORTANT]
====
The *router*, *geo-cache*, and ipfailover all create deployment configurations,
and all must have different names.
====

. Specify the VIPs and the port number that ipfailover should monitor on the
desired instances.
+
The ipfailover command for the *router*:
+
ifdef::openshift-enterprise[]
====
@@ -631,7 +644,7 @@ $ oadm ipfailover ipf-ha-router-us-west \
endif::[]

+
The following is the `oadm ipfailover` command for the *geo-cache* service that is
listening on port 9736. Since there are two `ipfailover` deployment
configurations, the `--vrrp-id-offset` must be set so that each VIP gets its own
offset. In this case, setting a value of `10` means that the
@@ -676,16 +689,17 @@ that all the *router* VIPs point to the same *router*,
VIPs point to the same *geo-cache* service. As long as one node remains running,
all the VIP addresses are served.

*Deploy IP Failover Pod*

Deploy the ipfailover router to monitor postgresql listening on node
port 32439 and the external IP address, as defined in the *postgresql-ingress*
service:
+
====
----
$ oadm ipfailover ipf-ha-postgresql \
--replicas=1 <1> --selector="app-type=postgresql" <2> \
--virtual-ips=10.9.54.100 <3> --watch-port=32439 <4> \
--credentials=/etc/origin/master/openshift-router.kubeconfig \
--service-account=ipfailover --create
----
<1> Specifies the number of instances to deploy.