
Log Storage

santip edited this page Jan 4, 2012 · 1 revision

The log storage serves three main purposes:


  1. Guaranteeing the durability of the user's data (namely their index documents).
  2. Serving as the source for indexes recovery from failure and upgrades.
  3. Removing the dependency on EC2 or other providers.

File format

Log files are basically plain sequences of log records concatenated one after another. Each log record is a binary serialization of a Thrift object called LogRecord. Each record represents one or more modifications to a document in an index; for details on LogRecord and how multiple modifications can be merged into a single record, see Log Storage Records.
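To make the "concatenated records" idea concrete, here is a minimal sketch of writing and iterating such a file. The 4-byte length prefix is an assumption for illustration; the real framing is whatever the Thrift serialization of LogRecord produces.

```python
import struct
from typing import Iterator, List

def write_records(path: str, payloads: List[bytes]) -> None:
    """Append serialized record blobs one after the other.
    Hypothetical framing: 4-byte big-endian length prefix per record."""
    with open(path, "ab") as f:
        for blob in payloads:
            f.write(struct.pack(">I", len(blob)))
            f.write(blob)

def read_records(path: str) -> Iterator[bytes]:
    """Iterate over the concatenated records in a segment file."""
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break  # end of segment
            (size,) = struct.unpack(">I", header)
            yield f.read(size)
```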

Log Writer Server

This is the most critical process of the service. It's in charge of answering the write calls from the API and writing them to append-only segments of up to 50MB. These segments are simply named <timestamp>.raw, where timestamp is the number of milliseconds since epoch at the time the segment was created.
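The segment lifecycle described above can be sketched as follows. The class and method names are illustrative, not IndexTank's actual code; only the 50MB cap and the <timestamp>.raw naming come from the text.

```python
import os
import time

SEGMENT_LIMIT = 50 * 1024 * 1024  # 50MB cap per segment, per the design above

class SegmentWriter:
    """Illustrative append-only segment writer with rollover."""

    def __init__(self, directory: str, limit: int = SEGMENT_LIMIT):
        self.directory = directory
        self.limit = limit
        self._open_new_segment()

    def _open_new_segment(self) -> None:
        # Segments are named <timestamp>.raw, where timestamp is
        # milliseconds since epoch at creation time.
        name = "%d.raw" % int(time.time() * 1000)
        self.path = os.path.join(self.directory, name)
        self.size = 0

    def append(self, record: bytes) -> None:
        if self.size + len(record) > self.limit:
            self._open_new_segment()  # roll over to a fresh segment
        with open(self.path, "ab") as f:
            f.write(record)
        self.size += len(record)
```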

Fault tolerance

The writer was designed to make fault tolerance simple with the following principles:

  • New writers can be spawned and start receiving writes at any time
  • Once a writer receives its first write, it must receive every subsequent write.
  • A writer can stop receiving writes at any time, but must not receive any write thereafter.

If these are enforced, each writer will contain all the modifications of a portion of the timeline. Upon failure, the output of different writers (either from the actual boxes or from backups) can be merged by placing all the segment files in a single directory. If no portion of the timeline is missing, then reading the totality of these files will yield a valid representation of the write history.
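Because segment names embed their creation timestamp, reconstructing the write history from a merged directory is just a matter of sorting filenames numerically. A small sketch, assuming the <timestamp>.raw naming described above:

```python
import os
from typing import List

def replay_order(directory: str) -> List[str]:
    """Return the .raw segments in a directory in timeline order.
    Segments from multiple writers can simply be dropped into one
    directory and read back in timestamp order."""
    segments = [f for f in os.listdir(directory) if f.endswith(".raw")]
    # Millisecond-since-epoch timestamps sort correctly as integers.
    return sorted(segments, key=lambda name: int(name[:-len(".raw")]))
```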

This way, multiple writers can be used for durability; moreover, when one is deemed unusable, it can easily be left out of the writing process and a new one can easily be spawned, keeping the probability of data loss very low.

Indexes Log Server

This process has two main purposes:

  • Arranging the data in a more compact and easily accessible format.
  • Providing read access to the modifications history on a per-index basis.

Due to its nature, only one of these processes can be actively arranging the data at any given time.


One of the responsibilities of this server is distributing the live data produced by the writers; we call this dealing the logs. A live segment is only dealt once a newer segment is being written. Dealing a segment means grouping all the modifications made in that period by index code and dumping these portions to separate logs kept for each index.

These separate logs have segments named <timestamp>__unsorted_<record_count>, where timestamp is a formatted version of the timestamp of the raw segment that originated this segment, and record_count is the number of records that compose the file. Each dealing operation creates new unsorted segments, so the segment files are immutable.
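The dealing step can be sketched as follows. The (index_code, payload) record shape is an assumption for illustration; only the grouping by index code and the <timestamp>__unsorted_<record_count> naming come from the text.

```python
import os
from collections import defaultdict
from typing import Iterable, List, Tuple

def deal_segment(records: Iterable[Tuple[str, bytes]],
                 out_dir: str, raw_timestamp: str) -> List[str]:
    """Group a raw segment's records by index code and dump each
    group to that index's own immutable unsorted segment."""
    by_index = defaultdict(list)
    for index_code, payload in records:
        by_index[index_code].append(payload)
    created = []
    for index_code, payloads in by_index.items():
        index_dir = os.path.join(out_dir, index_code)
        os.makedirs(index_dir, exist_ok=True)
        # <timestamp>__unsorted_<record_count>: written once, never modified.
        name = "%s__unsorted_%d" % (raw_timestamp, len(payloads))
        with open(os.path.join(index_dir, name), "wb") as f:
            f.write(b"".join(payloads))
        created.append(index_code + "/" + name)
    return sorted(created)
```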

Unsorted segments accumulate until there are more than 30 of them or their combined size exceeds 30MB. At that point, the dealer loads all the unsorted segments into memory, sorts the records by docid, merges multiple records for the same docid, and writes the result to a new segment named <timestamp>__sorted_<record_count>, where timestamp is the timestamp of the first unsorted segment that composes this segment and record_count is the number of distinct docids contained in the sorted segment. After this operation, the unsorted segments are removed. Sorted segments are also immutable: once written, nobody will modify them.
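The sort-and-merge step above reduces to the following sketch. Records are modeled as (docid, op) pairs, and merge_fn is a stand-in for the real LogRecord merge logic described in Log Storage Records.

```python
from typing import Callable, Iterable, List, Tuple

def sort_and_merge(unsorted_records: Iterable[Tuple[str, str]],
                   merge_fn: Callable[[str, str], str]) -> List[Tuple[str, str]]:
    """Sort records by docid and merge records that share a docid.
    The length of the result is the distinct-docid count that the
    <timestamp>__sorted_<record_count> name encodes."""
    merged = {}
    for docid, op in sorted(unsorted_records, key=lambda r: r[0]):
        if docid in merged:
            merged[docid] = merge_fn(merged[docid], op)
        else:
            merged[docid] = op
    return sorted(merged.items())
```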


Another responsibility is optimizing the data so it can be read faster and takes up less space. For this, the server simply takes several sorted segments for an index and merges them into one big optimized segment. This is done by sequentially iterating over the segments in question; since they are all sorted by docid, merging them is trivial, and when multiple records are found for the same docid they are merged into a single aggregated one. Additionally, if the last operation for a docid is a delete, that docid can be skipped in the output segment. The policies for optimization are still under development. Optimized segments are named <timestamp>__optimized_<record_count>, where timestamp is the timestamp of the latest sorted segment merged in.
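Since every input segment is already sorted by docid, the optimize step is a classic k-way merge. A sketch, with merge_fn and is_delete standing in for the real LogRecord semantics:

```python
import heapq
from typing import Callable, List, Tuple

def optimize(sorted_segments: List[List[Tuple[str, str]]],
             merge_fn: Callable[[str, str], str],
             is_delete: Callable[[str], bool]) -> List[Tuple[str, str]]:
    """K-way merge several docid-sorted segments, aggregating records
    for the same docid and dropping a docid entirely when its final
    aggregated operation is a delete."""
    stream = heapq.merge(*sorted_segments, key=lambda r: r[0])
    out, current = [], None
    for docid, op in stream:
        if current is not None and current[0] == docid:
            current = (docid, merge_fn(current[1], op))
        else:
            if current is not None and not is_delete(current[1]):
                out.append(current)
            current = (docid, op)
    if current is not None and not is_delete(current[1]):
        out.append(current)
    return out
```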


Another responsibility is serving the data to be read for specific indexes. For this, the server implements a paged reading mechanism that goes through the following steps:

  1. Serves 5MB pages of the largest optimized segment for the index.
  2. Serves 5MB pages of the available sorted segments for the index.
  3. Closes and sorts the current unsorted segments, and serves the product in 5MB pages.
  4. Continues reading 5MB portions of the live segments written after the last dealt segment (before the forced close-and-sort in step 3) and serves pages with the records of those portions that belong to the requested index.
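The paging mechanism underlying these steps can be sketched as a generator that streams segment files in order and yields fixed-size pages. The function name and file layout are assumptions; only the 5MB page size comes from the text.

```python
from typing import Iterator, List

PAGE_SIZE = 5 * 1024 * 1024  # the 5MB page size used by the reader

def pages(segment_paths: List[str], page_size: int = PAGE_SIZE) -> Iterator[bytes]:
    """Stream one or more segment files in order, yielding fixed-size
    pages so a recovering index can pull its history incrementally."""
    buf = b""
    for path in segment_paths:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(page_size - len(buf))
                if not chunk:
                    break
                buf += chunk
                if len(buf) == page_size:
                    yield buf
                    buf = b""
    if buf:
        yield buf  # final, possibly short page
```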

Durability plans

See Log Storage Durability plan
