Weekly Release - Enterprise 3.5 - Monday, May 15, 2017 #4416

Merged
merged 15 commits into from May 15, 2017
129 changes: 78 additions & 51 deletions admin_guide/backup_restore.adoc
@@ -453,44 +453,74 @@ member d266df286a41a8a4 is healthy
xref:bringing-openshift-services-back-online[Bringing {product-title}
Services Back Online].

[[backup-restore-adding-etcd-hosts]]
== Adding New etcd Hosts

In cases where etcd members have failed and you still have a quorum of etcd
cluster members running, you can use the surviving members to
add additional etcd members without downtime.

*Suggested Cluster Size*

Having a cluster with an odd number of etcd hosts provides fault
tolerance. Having an odd number of etcd hosts does not change the number needed
for a quorum, but increases the tolerance for failure. For example, with a cluster
size of three members, the quorum is two, leaving a failure tolerance of
one. This ensures that the cluster continues to operate as long as two of the
members are healthy.

Having an in-production cluster of three etcd hosts is recommended.
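
For reference, quorum is a simple majority of the configured members, so the
failure tolerance for common cluster sizes is:

----
Cluster size    Quorum    Failure tolerance
1               1         0
3               2         1
5               3         2
7               4         3
----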

[NOTE]
====
The following presumes you have a backup of the */etc/etcd* configuration for
the etcd hosts.
====

. If the new etcd members will also be {product-title} nodes, see xref:../install_config/adding_hosts_to_existing_cluster.adoc#install-config-adding-hosts-to-cluster[Add
the desired number of hosts to the cluster]. The rest of this procedure presumes
you have added just one host, but if adding multiple, perform all steps on each
host.

. Upgrade etcd and iptables on the surviving nodes:
+
----
# yum update etcd iptables-services
----
+
Ensure version `etcd-2.3.7-4.el7.x86_64` or greater is installed, and that the
same version is installed on each host.

. Install etcd and iptables on the new host:
+
----
# yum install etcd iptables-services
----
+
Ensure version `etcd-2.3.7-4.el7.x86_64` or greater is installed, and that the
same version is installed on the new host.

. xref:cluster-backup[Back up the etcd data store] on surviving hosts before making any cluster configuration changes.
+
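A minimal sketch of a backup invocation (assuming the default data directory
*_/var/lib/etcd_*; adjust paths for your environment):
+
----
# etcdctl backup \
    --data-dir /var/lib/etcd \
    --backup-dir /var/lib/etcd-backup-$(date +%Y%m%d)
----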
. If replacing a failed etcd member, remove the failed member _before_ adding the new member.
+
----
# etcdctl -C https://<surviving host IP>:2379 \
--ca-file=/etc/etcd/ca.crt \
--cert-file=/etc/etcd/peer.crt \
--key-file=/etc/etcd/peer.key cluster-health

# etcdctl -C https://<surviving host IP>:2379 \
--ca-file=/etc/etcd/ca.crt \
--cert-file=/etc/etcd/peer.crt \
--key-file=/etc/etcd/peer.key member remove <failed member identifier>
----
+
Stop the etcd service on the failed etcd member:
+
----
# systemctl stop etcd
----
. On the new host, add the appropriate iptables rules:
+
----
@@ -506,7 +536,7 @@ same version is installed on each host.
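+
A minimal sketch of rules that open the etcd client and peer ports (2379 and
2380, the defaults used throughout this guide); the exact chain names and rule
placement depend on your firewall configuration:
+
----
# iptables -A INPUT -p tcp -m state --state NEW -m tcp --dport 2379 -j ACCEPT
# iptables -A INPUT -p tcp -m state --state NEW -m tcp --dport 2380 -j ACCEPT
----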

. Generate the required certificates for the new host. On a surviving etcd host:
+
.. Make a backup of the *_/etc/etcd/ca/_* directory.

.. Set the variables and working directory for the certificates, ensuring that the *_PREFIX_* directory is created if one does not already exist:
+
@@ -522,7 +552,7 @@ same version is installed on each host.
.. Create the $PREFIX directory:
+
----
$ mkdir -p $PREFIX
----

.. Create the *_server.csr_* and *_server.crt_* certificates:
@@ -555,10 +585,11 @@ $ mkdir $PREFIX
-extensions etcd_v3_ca_peer -batch
----

.. Copy the *_etcd.conf_* and *_ca.crt_* files, and archive the contents of the directory:
+
----
# cp etcd.conf ${PREFIX}
# cp ca.crt ${PREFIX}
# tar -czvf ${PREFIX}${CN}.tgz -C ${PREFIX} .
----

@@ -568,7 +599,7 @@ $ mkdir $PREFIX
# scp ${PREFIX}${CN}.tgz $CN:/etc/etcd/
----

. While still on the surviving etcd host, add the new host to the cluster:

.. Add the new host to the cluster:
+
@@ -586,48 +617,22 @@ ETCD_NAME="<NEW_ETCD_HOSTNAME>"
ETCD_INITIAL_CLUSTER="<NEW_ETCD_HOSTNAME>=https://<NEW_HOST_IP>:2380,<SURVIVING_ETCD_HOST>=https://<SURVIVING_HOST_IP>:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
----

+
[NOTE]
====
Copy the three environment variables in the etcdctl member add output. They will
be used later.
====

.. On the new host, extract the copied configuration data and set the permissions:
+
----
# tar -xf /etc/etcd/<NEW_ETCD_HOSTNAME>.tgz -C /etc/etcd/ --overwrite
# chown -R etcd:etcd /etc/etcd/*
----

.. On the new host, remove any etcd data:
+
----
# rm -rf /var/lib/etcd/member
# chown -R etcd:etcd /var/lib/etcd
----

. On the new etcd host's *_etcd.conf_* file:
.. Replace the following with the values generated in the previous step:
@@ -643,7 +648,7 @@ Replace the IP address with the "NEW_ETCD" value for:
* ETCD_INITIAL_ADVERTISE_PEER_URLS
* ETCD_ADVERTISE_CLIENT_URLS
+
For replacing failed members, you will need to remove the failed hosts from the
etcd configuration.
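+
For reference, a minimal sketch of the resulting values (assuming the default
etcd client and peer ports, 2379 and 2380; hostnames and IP addresses are
placeholders):
+
----
ETCD_NAME="<NEW_ETCD_HOSTNAME>"
ETCD_INITIAL_CLUSTER="<NEW_ETCD_HOSTNAME>=https://<NEW_HOST_IP>:2380,<SURVIVING_ETCD_HOST>=https://<SURVIVING_HOST_IP>:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://<NEW_HOST_IP>:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://<NEW_HOST_IP>:2379"
----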

. Start etcd on the new host:
@@ -652,14 +657,36 @@
# systemctl enable etcd --now
----

. To verify that the new member has been added successfully:
+
----
# etcdctl -C https://${ETCD_CA_HOST}:2379 --ca-file=/etc/etcd/ca.crt \
--cert-file=/etc/etcd/peer.crt \
--key-file=/etc/etcd/peer.key cluster-health
----
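+
Healthy output looks similar to the following (member IDs, hostnames, and IP
addresses are illustrative):
+
----
member 59df5107484b5f79 is healthy: got healthy result from https://<SURVIVING_HOST_IP>:2379
member 9fe227e580dc5833 is healthy: got healthy result from https://<NEW_HOST_IP>:2379
cluster is healthy
----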

. Update the master configuration on all masters to point to the new etcd host.
+
.. On every master in the cluster, edit *_/etc/origin/master/master-config.yaml_*.
.. Find the *etcdClientInfo* section.
.. Add the new etcd host to the *urls* list (see the sketch after this step).
.. If a failed etcd host was replaced, remove it from the list.
.. Restart the master API service.
+
On a single-master cluster installation:
+
----
# systemctl restart atomic-openshift-master
----
+
On a multi-master cluster installation, on each master:
+
----
# systemctl restart atomic-openshift-master-api
----
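
For reference, a minimal sketch of the *etcdClientInfo* section after the edit
(certificate file names are illustrative; keep the values already present in
your *_master-config.yaml_*):

----
etcdClientInfo:
  ca: master.etcd-ca.crt
  certFile: master.etcd-client.crt
  keyFile: master.etcd-client.key
  urls:
    - https://<SURVIVING_ETCD_HOST>:2379
    - https://<NEW_ETCD_HOSTNAME>:2379
----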

The procedure to add an etcd member is complete.


[[bringing-openshift-services-back-online]]
== Bringing {product-title} Services Back Online
@@ -883,4 +910,4 @@ Alternatively, you can scale down the deployment to 0, and then up again:
----
$ oc scale --replicas=0 dc/jenkins
$ oc scale --replicas=1 dc/jenkins
----
4 changes: 2 additions & 2 deletions admin_guide/garbage_collection.adoc
@@ -43,7 +43,7 @@ specified using unit suffixes such as *h* for hour, *m* for minutes, *s* for sec
|The number of instances to retain per pod container. The default is *2*.

|`*maximum-dead-containers*`
|The maximum number of total dead containers in the node. The default is *240*.
|===

The `*maximum-dead-containers*` setting takes precedence over the
@@ -70,7 +70,7 @@ kubeletArguments:
maximum-dead-containers-per-container:
- "2"
maximum-dead-containers:
- "100"
- "240"
----
====

58 changes: 36 additions & 22 deletions admin_guide/high_availability.adoc
@@ -488,7 +488,7 @@ See xref:../admin_guide/high_availability.adoc#ha-vrrp-id-offset[this discussion
|`--check-script`
|`OPENSHIFT_HA_CHECK_SCRIPT`
|
|Full path name in the pod file system of a script that is periodically run to verify the application is operating. See xref:../admin_guide/high_availability.adoc#check-notify[this discussion] for more details.

|`--check-interval`
|`OPENSHIFT_HA_CHECK_INTERVAL`
@@ -516,10 +516,12 @@ When there are multiple ipfailover deployment configuration care must be taken t

[[configuring-a-highly-available-service]]
=== Configuring a Highly-available Service
The following steps describe how to set up highly-available *router* and
*geo-cache* network services with IP failover on a set of nodes.

. Label the nodes that will be used for the services. This step is optional if
you run the services on all the nodes in your {product-title} cluster and will
use VIPs that can float within all nodes in the cluster.
+
The following example defines a label for nodes that are servicing
traffic in the US west geography *ha-svc-nodes=geo-us-west*:
@@ -530,10 +532,12 @@ $ oc label nodes openshift-node-{5,6,7,8,9} "ha-svc-nodes=geo-us-west"
----
====

. Create the service account. Depending on your environment policies, you can
either reuse the *router* service account created previously or create a new
ipfailover service account.
+
The following example creates a new service account named ipfailover in the
*default* namespace:
+
====
----
@@ -553,10 +557,12 @@ $ oadm policy add-scc-to-user privileged system:serviceaccount:default:ipfailove
+
[IMPORTANT]
====
Since the ipfailover runs on all nodes from step 1, it is recommended to also
run the router/service on all the step 1 nodes.
====
+
.. Start the router with the nodes matching the labels used in the first step.
The following example runs five instances using the ipfailover service account:
+
ifdef::openshift-enterprise[]
====
@@ -580,30 +586,37 @@ $ oadm router ha-router-us-west --replicas=5 \
====
endif::[]
+
.. Run the *geo-cache* service with a replica on each of the nodes. See an
link:https://raw.githubusercontent.com/openshift/openshift-docs/master/admin_guide/examples/geo-cache.json[example
configuration] for running a *geo-cache* service.
+
[IMPORTANT]
====
Make sure that you replace the *myimages/geo-cache* Docker image referenced in
the file with your intended image. Change the number of replicas to the number
of nodes in the *geo-cache* label. Check that the label matches the one used in
the first step.
====
+
----
$ oc create -n <namespace> -f ./examples/geo-cache.json
----

. Configure ipfailover for the *router* and *geo-cache* services. Each has its
own VIPs and both use the same nodes labeled with *ha-svc-nodes=geo-us-west* in
the first step. Ensure that the number of replicas matches the number of nodes
listed in the label setup in the first step.
+
[IMPORTANT]
====
The *router*, *geo-cache*, and ipfailover all create deployment configurations,
and all must have different names.
====

. Specify the VIPs and the port number that ipfailover should monitor on the
desired instances.
+
The ipfailover command for the *router*:
+
ifdef::openshift-enterprise[]
====
@@ -631,7 +644,7 @@ $ oadm ipfailover ipf-ha-router-us-west \
endif::[]

+
The following is the `oadm ipfailover` command for the *geo-cache* service that is
listening on port 9736. Since there are two `ipfailover` deployment
configurations, the `--vrrp-id-offset` must be set so that each VIP gets its own
offset. In this case, setting a value of `10` means that the
@@ -676,16 +689,17 @@ that all the *router* VIPs point to the same *router*,
VIPs point to the same *geo-cache* service. As long as one node remains running,
all the VIP addresses are served.

*Deploy IP Failover Pod*

Deploy the ipfailover router to monitor postgresql listening on node
port 32439 and the external IP address, as defined in the *postgresql-ingress*
service:
+
====
----
$ oadm ipfailover ipf-ha-postgresql \
--replicas=1 <1> --selector="app-type=postgresql" <2> \
--virtual-ips=10.9.54.100 <3> --watch-port=32439 <4> \
--credentials=/etc/origin/master/openshift-router.kubeconfig \
--service-account=ipfailover --create
----
<1> Specifies the number of instances to deploy.