
ZS: a file format for compressed sets

ZS is a simple, read-only, binary file format designed for distributing, querying, and archiving arbitrarily large data sets (up to tens of terabytes and beyond) -- so long as those data sets can be represented as a set of arbitrary binary records. Of course it works on small data sets too. You can think of it as an alternative to storing data in tab- or comma-separated files -- each line in such a file becomes a record in a ZS file. But ZS has a number of advantages over these traditional formats:

  • ZS files are small: ZS files (optionally) store data in compressed form. The 3-gram counts from the 2012 US English release of the Google N-grams are distributed as a set of gzipped text files in tab-separated format, and take 1.3 terabytes of space. Uncompressed, this data set comes to more than 10 terabytes (and would be even more if loaded into a database). The same data in a ZS file with the default settings (LZMA compression) takes just 0.75 terabytes -- this is more than 41% smaller than the current distribution format, and 13.5x smaller than the raw data.
  • Nonetheless, ZS files are fast: Decompression is an inherently slow and serial operation, which means that reading compressed files can easily become the bottleneck in an analysis. Google distributes the 3-gram counts in many separate .gz files; one of these, for example, contains just the n-grams that begin with the letters "th". Using a single core on a handy compute server, we find that we can get decompressed data out of this .gz file at ~190 MB/s. At this rate, reading this one file takes more than 47 minutes -- and that's before we even begin analyzing the data inside it.

    The LZMA compression used in our ZS file is, on its own, slower than gzip. If we restrict ourselves to a single core, then we can only read our ZS file at ~50 MB/s. However, ZS files allow for multithreaded decompression. Using 8 cores, gunzip runs at... still ~190 MB/s, because gzip decompression cannot be parallelized. On those same 8 cores, our ZS file decompresses at ~390 MB/s -- a nearly linear speedup. This is also ~3x faster than our test server can read an uncompressed file from disk.

  • In fact, ZS files are really, REALLY fast: Suppose we want to know how many different Google-scanned books published in the USA in 1955 used the phrase "this is fun". ZS files have a limited indexing ability that lets you quickly locate any arbitrary span of records that fall within a given sorted range, or share a certain textual prefix. This isn't as nice as a full-fledged database system that can query on any column, but it can be extremely useful for data sets where the first column (or first several columns) are usually used for lookup. Using our example file, finding the "this is fun" entry takes 5 disk seeks and ~25 milliseconds of CPU time -- something like 85 ms all told. (And hot cache performance -- e.g., when performing repeated queries in the same file -- is even better.) The answer, by the way, is 27 books:

    $ zs dump --prefix='this is fun\t1955\t' google-books-eng-us-all-20120701-3gram.zs
    this is fun     1955    27      27
    

    When this data is stored as gzipped text, the only way to locate an individual record, or a span of similar records, is to start decompressing the file from the beginning and wait until the records we want happen to scroll by, which in this case -- as noted above -- could take more than 47 minutes. Using ZS makes this query ~33,000x faster.

  • ZS files contain rich metadata: In addition to the raw data records, every ZS file contains a set of structured metadata in the form of an arbitrary JSON document. You can use this to store information about the file's record format (e.g., column names), notes on data collection or preprocessing steps, recommended citation information, or whatever you like, and be confident that it will follow your data wherever it goes.

  • ZS files are network friendly: Suppose you know you just want to look up a few individual records that are buried inside that 0.75 terabyte file, or want a large span of records that is still much smaller than the full file (e.g., all 3-grams that begin "this is"). With ZS, you don't have to actually download the full 0.75 terabytes of data. Given a URL to the file, the ZS tools can find and fetch just the parts of the file you need, using nothing but standard HTTP. Of course going back and forth to the server does add overhead; if you need to make a large number of queries then it might be faster (and kinder to whoever's hosting the file!) to just download it. But there's no point in throwing around gigabytes of data to answer a kilobyte question.

    If you have the ZS tools installed, you can try it right now. Here's a live trace of the readthedocs.org servers searching the 3-gram database stored at UC San Diego. Note that the computer in San Diego has no special software installed at all -- this is just a static file that's available for download over HTTP:

    .. command-output:: time zs dump --prefix='this is fun\t' http://cpl-data.ucsd.edu/zs/google-books-20120701/eng-us-all/google-books-eng-us-all-20120701-3gram.zs
       :shell:
       :ellipsis: 2,-4
    
    
  • ZS files are splittable: If you're using a big distributed data processing system (e.g. Hadoop), then it's useful to split up your file into pieces that approximately match the underlying storage chunks, so each CPU can work on locally stored data. This is only possible, though, if your file format makes it possible to efficiently start reading near arbitrary positions in a file. With ZS files, this is possible (though because this requires multiple index lookups, it's not as convenient as in file formats designed with this as a primary consideration).

  • ZS files are ever-vigilant: Computer hardware is simply not reliable, especially on scales of years and terabytes. I've dealt with RAID cards that would occasionally flip a single bit in the data that was being read from disk. How confident are you that this won't be a key bit that totally changes your results? Standard text files provide no mechanism for detecting data corruption. Gzip and other traditional compression formats provide some protection, but it's only guaranteed to work if you read the entire file from start to finish and then remember to check the error code at the end, every time. But ZS is different: it protects every bit of data with 64-bit CRC checksums, and the software we distribute will never show you any data that hasn't first been double-checked for correctness. (Fortunately, the cost of this checking is negligible; all the times quoted above include these checks.) If it matters to you whether your analysis gets the right answer, then ZS is a good choice.

  • Relying on the ZS format creates minimal risk: The ZS file format is simple and :ref:`fully documented <format>`; it's not hard to write an implementation for your favorite language. In an emergency, an average programmer with access to standard libraries could write a minimal but working decompressor in just an hour or two. The reference implementation is BSD-licensed, undergoes exhaustive automated testing (>98% coverage) after every checkin, and just in case there are any ambiguities in the English spec, we also have a complete :ref:`file format validator <zs validate>`, so you can confirm that your files match the spec and be confident that they will be readable by any compliant implementation.

  • ZS files have a name composed entirely of sibilants: How many file formats can say that?
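
The prefix queries described above depend on the records being kept in sorted order, so that all records sharing a prefix form one contiguous span. The following is a minimal sketch of that idea using an in-memory sorted list and Python's ``bisect`` module. It is not the ZS implementation, which works through an on-disk index; the names ``prefix_span`` and ``RECORDS`` are hypothetical, and every record other than the ``this is fun`` / 1955 line quoted above is invented for illustration.

.. code-block:: python

   import bisect

   # Toy stand-in for a sorted ZS record set. Only the 1955 "this is fun"
   # record comes from the text above; the others are invented.
   RECORDS = sorted([
       b"this is fine\t1955\t1\t1",
       b"this is fun\t1954\t1\t1",
       b"this is fun\t1955\t27\t27",
       b"this is fun\t1956\t1\t1",
       b"this is funny\t1955\t1\t1",
   ])

   def _prefix_upper_bound(prefix):
       """Smallest byte string that sorts after every string starting with prefix."""
       stripped = prefix.rstrip(b"\xff")
       if not stripped:
           return None  # no finite upper bound: the span runs to the end
       return stripped[:-1] + bytes([stripped[-1] + 1])

   def prefix_span(sorted_records, prefix):
       """Return the contiguous run of records that start with prefix."""
       lo = bisect.bisect_left(sorted_records, prefix)
       upper = _prefix_upper_bound(prefix)
       if upper is None:
           hi = len(sorted_records)
       else:
           hi = bisect.bisect_left(sorted_records, upper)
       return sorted_records[lo:hi]

   for record in prefix_span(RECORDS, b"this is fun\t1955\t"):
       print(record.decode("utf-8"))

Both ends of the span are found by binary search, so the work grows with the logarithm of the number of records rather than with the size of the data set.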

This manual documents the reference implementation of the ZS file format, which includes both a command-line zs tool for manipulating ZS files and a fast and featureful Python API, and also provides a complete specification of the ZS file format in enough detail to allow independent implementations.

Contents:

.. toctree::
   :maxdepth: 2

   logistics.rst

   cmdline.rst

   library.rst

   conventions.rst

   datasets.rst

   format.rst

   changes.rst

Indices and tables