Implementation of the JSON semi-index described in the paper "Semi-Indexing Semi-Structured Data in Tiny Space"
C++ C Other
Switch branches/tags
Nothing to show
Clone or download
Permalink
Failed to load latest commit information.
3rd_party Workaround bug in mongodb nonce initialization Dec 7, 2012
perftest Switch to external succinct library as a submodule Oct 16, 2011
semi_index Do not build unnecessary indexes Dec 7, 2012
succinct @ f0c9452
.gitignore Add initial .gitignore Aug 8, 2011
.gitmodules Switch to external succinct library as a submodule Oct 16, 2011
CMakeLists.txt Small changes to README Oct 16, 2011
LICENSE Add Apache license Jul 30, 2011
README.md
drop_caches.sh Initial import Aug 8, 2011
example.json Initial import Aug 8, 2011
json_select.cpp Switch to external succinct library as a submodule Oct 16, 2011

README.md

semi_index

This library implements the algorithm described in Semi-Indexing Semi-Structured Data in Tiny Space.

The aim is to speed-up the processing of large collections of JSON documents when only a small subset of the values in each document is needed at each scan. An example is given by MapReduce jobs on semi-structured logging data, where usually two or three values are needed but the average document size can be in the hundreds of kilobytes. Another example is the creation of indexes in document databases such as CouchDB and MongoDB.

The main idea is, given a collection of JSON documents, to generate an auxiliary file that encodes the structure of the documents, so that in the parsing phase the parser can be pointed directly at the positions where the values are, without having to parse the whole document. Such an auxiliary file is called a semi-index.

This auxiliary file can be create once and used in all the subsequent scans. By using succinct data structures the size of the auxiliary file can be kept relatively small compared to the size of the collection.

The reasons for the speed-up are two-fold:

  • Since the structure of the document is known, the parser can be pointed to the position in the file where the values we are interested in are located. This way a considerable amount of I/O can be saved.
  • For the same reason, there is no need to perform a full parsing of the document into a document tree. This reduces both CPU time and memory usage.

Depending on the characteristics of the documents and how the semi-index is used, the speedups reported in the paper range between 2x and 12x.

Usage example: json_select

An example application is provided with the library: json_select.

json_select scans a collection of JSON documents given as lines of a text file, extracts a set of given attributes and outputs them as a collection of JSON lists.

For example consider the file example.json, which is a sample from Wikipedia edits log.

$ head -3 example.json 
{"comment": "content was: '[[Media:Example.og[http://www.example.com link title][http://www.example.com link title]''Italic text'''''Bold text'''jjhkjhkjhkjhkjhjggghg]]'", "logtitle": "Vivian Blaine", "timestamp": "2004-12-23T03:20:32Z", "action": "delete", "params": "", "contributor": {"username": "Slowking Man", "id": 56299}, "type": "delete", "id": 1}
{"comment": "{{GFDL}} {{cc-by-sa-2.0}}", "logtitle": "File:Mini Christmas tree.png", "timestamp": "2004-12-23T03:24:26Z", "action": "upload", "params": "", "contributor": {"username": "Fredrik", "id": 26675}, "type": "upload", "id": 2}
{"comment": "content was: 'Daniel Li is an amazing human being.'", "logtitle": "Daniel Li", "timestamp": "2004-12-23T03:27:51Z", "action": "delete", "params": "", "contributor": {"username": "Slowking Man", "id": 56299}, "type": "delete", "id": 3}

If we want to extract the attributes logtitle and contributor.id from attribute we can use the following command:

$ ./json_select naive_parse_stream logtitle,contributor.id < example.json | head -3
["Vivian Blaine",56299]
["File:Mini Christmas tree.png",26675]
["Daniel Li",56299]

naive_parse_stream works by loading each document and parsing it with JsonCpp.

To show the speed-up that can be obtained by using a semi-index it is better to use a bigger file. We use wp_history.json from the datasets used in the paper.

We get the id, title and timestamp of the last contribution from each page.

$ sudo ./drop_caches.sh && time ./json_select naive_parse_stream id,title,revision[-1].timestamp < wp_history.json > /dev/null

real	1m0.408s
user	0m34.630s
sys	0m3.260s

drop_cache.sh drops all the kernel page caches to ensure that the file is read from disk each time. The script supports only Linux.

Now we create a semi-index on the file with the following command:

$ sudo ./drop_caches.sh && time ./json_select si_save wp_history.json.si < wp_history.json > /dev/null
[ ... DEBUG OUTPUT ... ]
real	0m37.710s
user	0m8.890s
sys	0m2.880s

Note that the time needed to create the semi-index is less then the time needed for a scan+parse!

For this very low density file the semi-index is negligibly small, compared to the raw collection:

2.8G	wp_history.json
9.7M	wp_history.json.si

For more typical file, the overhead is around 10%.

We can now extract the same attributes as before but taking advantage of the semi-index.

$ sudo ./drop_caches.sh && time ./json_select saved_si_parse_mapped wp_history.json wp_history.json.si id,title,revision[-1].timestamp > /dev/null

real	0m11.282s
user	0m0.030s
sys 	0m0.380s

Using the semi-index, the extraction is almost 6 times faster than normal parsing. Using a compressor that supports random-access on the JSON file further speedups are possible thanks to the reduced I/O. See the source code of json_select for the details.

How it works

Semi-indexing is a technique that can be applied to most text serialization formats (such as XML). The generic scheme is described in full detail in the paper. Here we give a simple explanation of how it works when applied to JSON.

The JSON semi-index consists of two binary strings, pos and bp, which encode respectively the positions of the structural elements of JSON (i.e. {}[]:,) and the structure of the parsing tree, encoded as a sequence of balanced parentheses.

    {"a": 1, "b": {"l": [1, null], "v": true}}
pos 110000100100001100001100100000010000100000
bp  (()()()(()(()())()()))

Note that each 1 in the pos string is associated to two consecutive parentheses in the bp string.

One property of pos is that it is very sparse, because usually keys and values are at least a few characters long, hence it is very compressible. Using an Elias-Fano encoding, a space close to the information-theoretical optimum can be reached while allowing efficient random-access and powerful operations such as select, which gives the position of the i-th 1 bit in the string.

On the other hand bp, which is very small (it is twice as long as the number of ones in pos) can be augmented with data structures that enable quick tree-like navigation. This structure take only a negligible amount of space compared to the string.

Navigation in the JSON object is performed by navigating the tree represented by bp and then retrieving the key/values from the JSON document by pointing to the the positions obtained from pos.

How to build the code

Dependencies

The following dependencies have to be installed to compile the library.

  • CMake >= 2.6, for the build system
  • zlib
  • Boost >= 1.42

Also, the library succinct has to be downloaded as a git submodule, so the following two commands have to be executed before running cmake:

$ git submodule init
$ git submodule update

json_select and the performance tests also depend on MongoDB and JsonCpp, but their sources are included in the source tree so they don't need to be installed.

Supported systems

The code has been developed and tested mainly on Linux, but it has been tested also on Mac OS X and Windows 7.

The code has been tested only on x86-64. Compiling it on 32bit architectures would probably require some work.

Building on Unix

The project uses CMake. To build it on Unix systems it should be sufficient to do the following:

$ cmake . -DCMAKE_BUILD_TYPE=Release
$ make

It is also advised to perform a make test, which runs the unit tests.

Building on Windows

On Windows, Boost and zlib are not installed in default locations, so it is necessary to set some environment variables to allow the build system to find them.

  • For Boost BOOST_ROOT must be set to the directory which contains the boost include directory.
  • For zlib CMAKE_PREFIX_PATH must be set to the directory that contains zlib.h
  • Both the directories that contain the Boost and zlib DLLs must be added to PATH so that the executables find them

Once the env variables are set, the quickest way to build the code is by using NMake (instead of the default Visual Studio). Run the following commands in a Visual Studio x64 Command Prompt:

$ cmake -G "NMake Makefiles" -DCMAKE_BUILD_TYPE=Release .
$ nmake
$ nmake test