### Binary encoding

For data that is used only internally within your organization, there is less pressure to use a lowest-common-denominator encoding format. For example, you could choose a format that is more compact or faster to parse. For a small dataset, the gains are negligible, but once you get into the terabytes, the choice of data format can have a big impact.

JSON is less verbose than XML, but both still use a lot of space compared to binary formats. This observation led to the development of a profusion of binary encodings for JSON and for XML. These formats have been adopted in various niches, but none of them are as widely adopted as the textual versions of JSON and XML.

Some of these formats extend the set of datatypes, but otherwise they keep the JSON/XML data model unchanged. In particular, since they don't prescribe a schema, they need to include all the object field names within the encoded data. That is, in a binary encoding of the JSON document in Example 4-1, they will need to include the strings userName, favoriteNumber, and interests somewhere.

Let's look at an example of MessagePack, a binary encoding for JSON. Figure 4-1 shows the byte sequence that you get if you encode the JSON document in Example 4-1 with MessagePack. The first few bytes are as follows:

1. The first byte, 0x83, indicates that what follows is an object with three fields.
2. The second byte, 0xa8, indicates that what follows is a string that is eight bytes long.
3. The next eight bytes are the field name userName in ASCII. Since the length was indicated previously, there's no need for any marker to tell us where the string ends.
4. The next seven bytes encode the six-letter string value Martin with a prefix 0xa6, and so on.
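The header bytes in these steps can be reproduced by hand. The following is a minimal Python sketch, not a full MessagePack implementation: it covers only small maps (up to 15 entries) and short strings (up to 31 bytes), which MessagePack encodes with a single combined type-and-length byte. The function names are made up for illustration.

```python
def fixmap_header(num_fields: int) -> bytes:
    """One-byte header for a map with at most 15 entries: 0x80 | count."""
    if not 0 <= num_fields <= 15:
        raise ValueError("sketch only covers maps with up to 15 entries")
    return bytes([0x80 | num_fields])


def fixstr(text: str) -> bytes:
    """Short string: one header byte (0xa0 | length), then the UTF-8 bytes."""
    data = text.encode("utf-8")
    if len(data) > 31:
        raise ValueError("sketch only covers strings up to 31 bytes")
    return bytes([0xA0 | len(data)]) + data


# Start of the three-field object: map header, key userName, value Martin.
prefix = fixmap_header(3) + fixstr("userName") + fixstr("Martin")
print(prefix.hex(" "))
```

Because every string carries its length up front, a decoder can skip over it without scanning for a terminator.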

## Thrift and Protocol Buffers

Apache Thrift and Protocol Buffers are binary encoding libraries that are based on the same principle. Protocol Buffers was originally developed at Google, Thrift was originally developed at Facebook, and both were made open source in 2007.

Both Thrift and Protocol Buffers require a schema for any data that is encoded. To encode the data in Example 4-1 in Thrift, you would describe the schema in the Thrift interface definition language like this:

The Thrift CompactProtocol encoding is semantically equivalent to BinaryProtocol, but as you can see in Figure 4-3, it packs the same information into only 34 bytes. It does this by packing the field type and tag number into a single byte, and by using variable-length integers. Rather than using a full eight bytes for the number 1337, it is encoded in two bytes, with the top bit of each byte used to indicate whether there are still more bytes to come. This means numbers between -64 and 63 are encoded in one byte, numbers between -8192 and 8191 are encoded in two bytes, etc. Bigger numbers use more bytes.
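A variable-length integer of this kind can be sketched in a few lines of Python. This is an illustration rather than Thrift's actual code: it assumes the usual "zigzag" mapping, which folds negative numbers into small unsigned values and is consistent with the one-byte and two-byte ranges quoted above. The function names are hypothetical.

```python
def zigzag(n: int) -> int:
    """Map a signed 64-bit integer to unsigned: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) ^ (n >> 63)


def encode_varint(n: int) -> bytes:
    """Zigzag-encode n, then emit 7 bits per byte, least significant first."""
    value = zigzag(n)
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # top bit set: more bytes follow
        else:
            out.append(low7)         # top bit clear: this is the last byte
            return bytes(out)


for n in (63, -64, 64, 1337, 8191, -8192, 8192):
    print(n, len(encode_varint(n)), encode_varint(n).hex(" "))
```

Each byte carries seven bits of payload, so one byte covers 128 distinct values and two bytes cover 16,384, which is where the ranges in the text come from.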

### Field tags and schema evolution

We said previously that schemas inevitably need to change over time. We call this schema evolution. How do Thrift and Protocol Buffers handle schema changes while keeping backward and forward compatibility?

As you can see from the examples, an encoded record is just the concatenation of its encoded fields. Each field is identified by its tag number and annotated with a datatype. If a field value is not set, it is simply omitted from the encoded record. From this you can see that field tags are critical to the meaning of the encoded data. You can change the name of a field in the schema, since the encoded data never refers to field names, but you cannot change a field's tag, since that would make all existing encoded data invalid.
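The difference between renaming a field and changing its tag can be illustrated with a toy model. In this sketch the encoded record is represented as a list of (tag, type, value) triples instead of real bytes, and a schema is just a mapping from tag number to field name. Both representations, and the choice to skip unknown tags, are simplifying assumptions.

```python
# Hypothetical stand-in for an encoded record: (tag, type, value) triples.
encoded = [(1, "string", "Martin"), (2, "i64", 1337)]

schema_v1 = {1: "userName", 2: "favoriteNumber"}
schema_renamed = {1: "user_name", 2: "favoriteNumber"}   # new name, same tag
schema_retagged = {5: "userName", 2: "favoriteNumber"}   # same name, new tag


def decode(record, schema):
    """Look up each field by tag number; tags the schema doesn't know are skipped."""
    return {schema[tag]: value for tag, _type, value in record if tag in schema}


print(decode(encoded, schema_v1))        # both fields found
print(decode(encoded, schema_renamed))   # still readable under the new name
print(decode(encoded, schema_retagged))  # userName is lost: old data used tag 1
```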

If you examine the byte sequence, you can see that there is nothing to identify fields or their datatypes. The encoding simply consists of values concatenated together. A string is just a length prefix followed by UTF-8 bytes, but there's nothing in the encoded data that tells you that it is a string. It could just as well be an integer, or something else entirely. An integer is encoded using a variable-length encoding.

The key idea with Avro is that the writer's schema and the reader's schema don't have to be the same; they only need to be compatible. When data is decoded, the Avro library resolves the differences by looking at the writer's schema and the reader's schema side by side and translating the data from the writer's schema into the reader's schema. The Avro specification defines exactly how this resolution works, and it is illustrated in Figure 4-6.

For example, it's no problem if the writer's schema and the reader's schema have their fields in a different order, because the schema resolution matches up the fields by field name. If the code reading the data encounters a field that appears in the writer's schema but not in the reader's schema, it is ignored. If the code reading the data expects some field, but the writer's schema does not contain a field of that name, it is filled in with a default value declared in the reader's schema.
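The three rules above (match by name, ignore writer-only fields, fill reader-only fields from defaults) can be sketched as follows. This is not the Avro library's API: records are plain dictionaries, the schemas are reduced to field names and defaults, and every reader field is assumed to declare a default. The field favoriteColor is a hypothetical addition for the example.

```python
def resolve(record: dict, writer_fields: list, reader_fields: dict) -> dict:
    """Translate a record decoded with the writer's schema into the reader's schema.

    writer_fields: field names in the writer's order.
    reader_fields: maps each reader field name to its declared default value.
    """
    result = {}
    for name, default in reader_fields.items():
        if name in writer_fields:
            result[name] = record[name]  # matched by name; order is irrelevant
        else:
            result[name] = default       # missing in writer: use reader's default
    return result                        # writer-only fields are never copied


written = {"userName": "Martin", "favoriteNumber": 1337}
writer_fields = ["userName", "favoriteNumber"]
reader_fields = {"favoriteColor": None, "userName": None}

print(resolve(written, writer_fields, reader_fields))
```

Here favoriteNumber is dropped because the reader does not ask for it, and favoriteColor receives its default because the writer never wrote it.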

### Schema evolution rules

With Avro, forward compatibility means that you can have a new version of the schema as writer and an old version of the schema as reader.

## Leaders and Followers

Each node that stores a copy of the database is called a replica. With multiple replicas, a question inevitably arises: how do we ensure that all the data ends up on all the replicas?

This mode of replication is a built-in feature of many relational databases, such as PostgreSQL, MySQL, Oracle Data Guard, and SQL Server's AlwaysOn Availability Groups. It is also used in some nonrelational databases, including MongoDB, RethinkDB, and Espresso.

## Synchronous Versus Asynchronous Replication

The advantage of synchronous replication is that the follower is guaranteed to have an up-to-date copy of the data that is consistent with the leader. If the leader suddenly fails, we can be sure that the data is still available on the follower. The disadvantage is that if the synchronous follower doesn't respond, the write cannot be processed. The leader must block all writes and wait until the synchronous replica is available again.
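The trade-off can be seen in a toy model. In the sketch below, all class and method names are hypothetical, and an unavailable follower raises an exception where a real leader would block and wait. As a further simplification, the leader applies the write locally only after the follower has acknowledged it.

```python
class FollowerUnavailable(Exception):
    """Raised when the synchronous follower does not acknowledge a write."""


class Follower:
    def __init__(self):
        self.data = {}
        self.available = True

    def replicate(self, key, value):
        if not self.available:
            raise FollowerUnavailable("no acknowledgement from follower")
        self.data[key] = value


class Leader:
    def __init__(self, sync_follower):
        self.data = {}
        self.sync_follower = sync_follower

    def write(self, key, value):
        # Synchronous replication: the write is confirmed only once the
        # follower has it, so a silent follower stops the write entirely.
        self.sync_follower.replicate(key, value)
        self.data[key] = value


follower = Follower()
leader = Leader(follower)
leader.write("x", 1)            # succeeds; both copies now hold x = 1

follower.available = False
try:
    leader.write("x", 2)
except FollowerUnavailable:
    print("write rejected; leader still has x =", leader.data["x"])
```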

## Handling Node Outages

Any node in the system can go down, perhaps unexpectedly due to a fault, but just as likely due to planned maintenance. Being able to reboot individual nodes without downtime is a big advantage for operations and maintenance. Thus, our goal is to keep the system as a whole running despite individual node failures, and to keep the impact of a node outage as small as possible.

- If asynchronous replication is used, the new leader may not have received all the writes from the old leader before it failed. If the former leader rejoins the cluster after a new leader has been chosen, what should happen to those writes? The new leader may have received conflicting writes in the meantime. The most common solution is for the old leader's unreplicated writes to simply be discarded, which may violate clients' durability expectations.

## Implementation of Replication Logs

How does leader-based replication work under the hood? Several different replication methods are used in practice, so let's look at each one briefly.

### Statement-based replication

In the simplest case, the leader logs every write request that it executes and sends that statement log to its followers. For a relational database, this means that every INSERT, UPDATE, or DELETE statement is forwarded to followers, and each follower parses and executes that SQL statement as if it had been received from a client.
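As a rough illustration, the following sketch uses two in-memory SQLite databases from Python's standard library to stand in for a leader and a follower. The helper functions and the users table are hypothetical, and shipping the log is reduced to sharing a Python list. A real system streams the log over the network.

```python
import sqlite3

leader = sqlite3.connect(":memory:")
follower = sqlite3.connect(":memory:")
statement_log = []  # statements the leader would send to its followers


def execute_on_leader(sql: str, params: tuple = ()) -> None:
    """Run a write on the leader and record the statement for replication."""
    leader.execute(sql, params)
    statement_log.append((sql, params))


def replay_on_follower() -> None:
    """Parse and execute each logged statement as if a client had sent it."""
    for sql, params in statement_log:
        follower.execute(sql, params)


execute_on_leader("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
execute_on_leader("INSERT INTO users (id, name) VALUES (?, ?)", (1, "Martin"))
execute_on_leader("INSERT INTO users (id, name) VALUES (?, ?)", (2, "Ada"))
execute_on_leader("UPDATE users SET name = ? WHERE id = ?", ("Marty", 1))
execute_on_leader("DELETE FROM users WHERE id = ?", (2,))
replay_on_follower()

leader_rows = leader.execute("SELECT id, name FROM users ORDER BY id").fetchall()
follower_rows = follower.execute("SELECT id, name FROM users ORDER BY id").fetchall()
print(leader_rows == follower_rows)
```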

### Write-ahead log shipping

In Chapter 3 we discussed how storage engines represent data on disk, and we found that usually every write is appended to a log:

- In the case of a log-structured storage engine, that log is the main place for storage. Log segments are compacted and garbage-collected in the background.
- In the case of a B-tree, which overwrites individual disk blocks, every modification is first written to a write-ahead log so that the index can be restored to a consistent state after a crash.

When the follower processes this log, it builds a copy of the exact same data structures as found on the leader.

This method of replication is used in PostgreSQL and Oracle, among others. The main disadvantage is that the log describes the data on a very low level: a WAL contains details of which bytes were changed in which disk blocks.

That may seem like a minor implementation detail, but it can have a big operational impact. If the replication protocol allows the follower to use a newer software version than the leader, you can perform a zero-downtime upgrade of the database software by first upgrading the followers and then performing a failover to make one of the upgraded nodes the new leader. If the replication protocol does not allow this version mismatch, as is often the case with WAL shipping, such upgrades require downtime.

### Logical log replication

An alternative is to use different log formats for replication and for the storage engine, which allows the replication log to be decoupled from the storage engine internals. This kind of replication log is called a logical log, to distinguish it from the storage engine's data representation.

A logical log for a relational database is usually a sequence of records describing writes to database tables at the granularity of a row:

- For an inserted row, the log contains the new values of all columns.
- For a deleted row, the log contains enough information to uniquely identify the row that was deleted. Typically this would be the primary key, but if there is no primary key on the table, the old values of all columns need to be logged.
- For an updated row, the log contains enough information to uniquely identify the updated row, and the new values of all columns.

A transaction that modifies several rows generates several such log records, followed by a record indicating that the transaction was committed. MySQL's binlog uses this approach when it is configured for row-based replication.
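A row-level log of this kind might look like the following sketch. The record layout is hypothetical (real systems use their own binary formats), the table is assumed to have a primary key column named id, and the follower is modeled as nested dictionaries. Changes are buffered and applied only when the commit record arrives.

```python
# Hypothetical logical log for one transaction on a table keyed by "id".
log = [
    {"op": "insert", "table": "users", "row": {"id": 1, "name": "Martin"}},
    {"op": "update", "table": "users", "key": {"id": 1},
     "new": {"id": 1, "name": "Marty"}},
    {"op": "insert", "table": "users", "row": {"id": 2, "name": "Ada"}},
    {"op": "delete", "table": "users", "key": {"id": 2}},
    {"op": "commit"},
]


def apply_log(tables: dict, records: list) -> None:
    """Buffer row changes and apply them only once a commit record is seen."""
    pending = []
    for rec in records:
        if rec["op"] != "commit":
            pending.append(rec)
            continue
        for change in pending:
            rows = tables.setdefault(change["table"], {})
            if change["op"] == "insert":
                rows[change["row"]["id"]] = change["row"]
            elif change["op"] == "update":
                rows[change["key"]["id"]] = change["new"]
            elif change["op"] == "delete":
                del rows[change["key"]["id"]]
        pending.clear()


follower_tables = {}
apply_log(follower_tables, log)
print(follower_tables)
```

Nothing in these records refers to disk blocks or byte offsets, which is what keeps the log independent of the storage engine's internals.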

### Trigger-based replication

The replication approaches described so far are implemented by the database system, without involving any application code. In many cases, that's what you want, but there are some circumstances where more flexibility is needed. For example, if you want to only replicate a subset of the data, or want to replicate from one kind of database to another, or if you need conflict resolution logic, then you may need to move replication up to the application layer.

Some tools, such as Oracle GoldenGate, can make data changes available to an application by reading the database log. An alternative is to use features that are available in many relational databases: triggers and stored procedures.


In this read-scaling architecture, you can increase the capacity for serving read-only requests simply by adding more followers. However, this approach only realistically works with asynchronous replication: if you tried to synchronously replicate to all followers, a single node failure or network outage would make the entire system unavailable for writing. And the more nodes you have, the likelier it is that one will be down, so a fully synchronous configuration would be very unreliable.
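To see how quickly this risk grows, here is a small sketch that assumes nodes fail independently and that each is down with the same probability at any given moment. Both the independence assumption and the 1% figure are illustrative, not measurements from a real system.

```python
def prob_any_node_down(num_nodes: int, p_down: float) -> float:
    """Probability that at least one of num_nodes is down (independent failures)."""
    return 1 - (1 - p_down) ** num_nodes


# With fully synchronous replication, any single node being down blocks writes.
for n in (1, 3, 10, 100):
    print(n, round(prob_any_node_down(n, 0.01), 3))
```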

Unfortunately, if an application reads from an asynchronous follower, it may see outdated information if the follower has fallen behind. This leads to apparent inconsistencies in the database: if you run the same query on the leader and a follower at the same time, you may get different results, because not all writes have been reflected in the follower. This inconsistency is just a temporary state: if you stop writing to the database and wait a while, the followers will eventually catch up and become consistent with the leader. For that reason, this effect is known as eventual consistency.

The term eventually is deliberately vague: in general, there is no limit to how far a replica can fall behind. In normal operation, the delay between a write happening on the leader and being reflected on a follower (the replication lag) may be only a fraction of a second, and not noticeable in practice. However, if the system is operating near capacity or if there is a problem in the network, the lag can easily increase to several seconds or even minutes.

When the lag is so large, the inconsistencies it introduces are not just a theoretical issue but a real problem for applications. In this section we will highlight three examples of problems that are likely to occur when there is replication lag and outline some approaches to solving them.
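The stale reads and eventual catch-up described above can be made concrete with a toy model. In this sketch the class names are hypothetical, replication is a pull from a shared in-memory log, and lag is measured in unapplied log entries, not in seconds.

```python
class Leader:
    def __init__(self):
        self.log = []   # ordered (key, value) writes
        self.data = {}

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class AsyncFollower:
    def __init__(self, leader):
        self.leader = leader
        self.applied = 0  # number of log entries applied so far
        self.data = {}

    def lag(self) -> int:
        """Replication lag as the count of log entries not yet applied."""
        return len(self.leader.log) - self.applied

    def catch_up(self, max_entries=None):
        """Apply pending entries in order, optionally only the next few."""
        end = len(self.leader.log)
        if max_entries is not None:
            end = min(end, self.applied + max_entries)
        for key, value in self.leader.log[self.applied:end]:
            self.data[key] = value
        self.applied = end


leader = Leader()
follower = AsyncFollower(leader)
leader.write("x", 1)
leader.write("x", 2)

follower.catch_up(max_entries=1)          # follower has fallen behind
stale_read = follower.data["x"]           # differs from leader.data["x"]
lag_before = follower.lag()

follower.catch_up()                       # writes stop; follower catches up
print(stale_read, lag_before, follower.data == leader.data)
```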
