Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

for issue #99; HA logging + reviewed HA in general #604

Merged
merged 1 commit into from
Aug 14, 2015
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
154 changes: 84 additions & 70 deletions src/en/juju-ha.md
Original file line number Diff line number Diff line change
@@ -1,70 +1,84 @@
# Juju High Availability

As of version 1.20, Juju supports high availability for its state server.

High Availability in general terms means that a Juju environment has 3 or more
(up to 7) redundant state servers. One of these is the master with automatic
failover occurring should something happen to the master.

This section describes how Juju's high availability features works.

### MongoDB

Juju's High Availability (HA) mode is tightly integrated and dependent on
MongoDB. Juju stores all its data about the environment in a MongoDB database.
MongoDB has an HA implementation that connects multiple MongoDB databases in a
cluster called a [replicaset](http://docs.mongodb.org/manual/replication/). The
replicaset has a 1:1 relation to Juju state servers, and the master of the
replicaset is the master of the Juju state servers.

### Ensure Availability

Juju's HA mode is turned on using the `juju ensure-availability` command. By
default this sets the desired number of state machines in the environment to 3.
There is an optional `-n` parameter which can be used to set this number higher.

As stated above, there is a 1:1 relationship between the number of state
machines and the number of MongoDB instances. This means that the number of
state machines must be an odd number to prevent ties during voting for master,
and the number of state servers cannot exceed 7, due to MongoDB limits. In
practice, this means the possibilities are 3, 5 or 7 state machines.

Currently the number of state machines can be increased using the -n flag on
`juju ensure-availability`, but not decreased. The only way to decrease the
number of machines is to create a backup of your environment and then restore
the backup to a new environment, which starts with a single state server.

Whenever you run ensure-availability, the command will report the changes that
it made to the system's desired model, which will shortly be reflected in
reality.

### When State Servers Fail

Juju does not automatically re-spawn state machines if one or more fail.
However, if an environment is already in HA mode, you can recover from state
machine failure by manually re-running `juju ensure-availability`. This can be
done as long as more than half the original number of machines are still
running.

The process to recovering when state servers have failed is:

* Run `juju ensure-availability`. New state servers will be created to replace
the dead ones.
* After some time the new state servers will be ready and the dead state servers
will be removed from Juju's set of high availability state servers. This will
take on the order of 30 seconds to 20 minutes depending on variables like the
load on the machines and the amount of Juju configuration data to
replicate. In the output from `juju status` the new state servers will have a
`state-server-member-status` value of `has-vote` and the dead state servers
will have `no-vote`. At this point, the state servers in the environment are
fully-redundant again.
* To have Juju not treat the dead state server machines as state servers any
more, run `juju ensure-availability` again. The `state-server-member-status`
attribute will disappear from these machines in the `juju status` output.
* The dead state server hosts can now be completely removed from Juju's
configuration by using `juju remove-machine`.

If fewer than half of the original state servers are still running, you cannot
recover by using the ensure-availability command because the MongoDB replicaset
does not have a quorum with which to elect a new master. In this case, you must
restore from a previous backup.
Title: Juju High Availability


# High Availability

Juju High Availability (HA) means that a Juju environment has 3 or more (up to
7) state servers (bootstrap nodes) one of which is the *master*. Automatic
failover occurs should the master lose connectivity.


## Juju HA and MongoDB

Juju HA is tightly integrated with the MongoDB database since that is where all
environment data is stored. The MongoDB software has a native HA implementation
and this is what Juju HA uses. A MongoDB cluster is called a
[replica set](http://docs.mongodb.org/manual/replication/) and, in the context
of Juju HA, i) each of its members corresponds to a different Juju state server
and ii) the MongoDB replica set master corresponds to the Juju state server master.

The number of state servers must be an odd number to prevent ties during voting
for master, and the number of state servers cannot exceed 7, due to MongoDB
limitations. This means a Juju HA cluster can have 3, 5 or 7 state servers.


## Activating and modifying HA

Juju HA is activated and modified with the `juju ensure-availability` command.
As will be shown in the next section, it is also used to recover from failed
state servers.

When activating HA, by default, this command sets the number of state
servers in the environment to 3. The optional `-n` switch can modify this
number.

When modifying HA, the `-n` switch can be used to increase the number of state
servers. The only way to decrease is to create a backup of your environment
and then restore the backup to a new environment, which starts with a single
state server. You can then increase to the desired number.

Whenever you run ensure-availability, the command will report the changes it
intends to make, which will shortly be implemented.

For complete syntax, see the [command reference page](./commands.html#ensure-availability
).


## Recovering from state server failure

In the advent of failed state servers, Juju does not automatically re-spawn new
state servers nor remove the failed ones. However, as long as more than half of
the original number of state servers remain available you can manually recover.
The process is detailed below.

1. Run `juju ensure-availability`.
1. Verify that the output of `juju status` shows a value of `has-vote` for the
`state-server-member-status` attribute for each new server and a value of
`no-vote` for each old server. Once confirmed, the new servers are fully
operational as cluster members and the old servers have been demoted (no longer
part of HA). This process can take between 30 seconds to 20 minutes depending
on machine resources and Juju data volume.
1. Run `juju ensure-availability` again to have Juju no longer consider the
old machines as state servers. The `state-server-member-status` attribute
should disappear from these machines.
1. Use the `juju remove-machine` command to remove the old machines entirely.

You cannot repair the cluster as outlined above if fewer than half of the
original state servers remain available because the MongoDB replica set will not
have the quorum necessary to elect a new master. You must restore from backups
in this case.


## HA and logging

All Juju units send logs to all state servers in the HA cluster and the user
accesses those logs in the usual manner, via `juju debug-log` (see
[Troubleshooting with debug-log](./troubleshooting-debug-log.html)) or by
viewing the logs directly on any state server (/var/log/juju).

Logging to a state server begins once it becomes fully operational. One caveat
is that past cluster logs are not sent to the new "slave" state server. It
should therefore be noted that all state servers are not guaranteed to house
the same logs. In particular, this should be understood when using `juju
debug-log` as this triggers a connection to be made to a random state server.
This will be corrected in the near future (past logs will be synced).
4 changes: 4 additions & 0 deletions src/en/troubleshooting-debug-log.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
Title: Troubleshooting with debug-log


# Troubleshooting with debug-log

When problems arise the first step in determining the cause is to look at the
Expand All @@ -18,6 +19,9 @@ For complete syntax, see the [command reference page](./commands.html).
You can also learn more by running `juju debug-log --help` and `juju help
logging`.

See [Juju High Availability](./juju-ha.html#ha-and-logging) when viewing logs
in an HA context.


## Examples:

Expand Down