NOTES

mcdb - fast, reliable, simple code to create and read constant databases

mcdb enhancements to cdb (on which mcdb is based; see References section below)
- updated to C99 and POSIX.1-2001 (not available/portable when djb wrote cdb)
- optimized for mmap access to constant db (and avoid double buffering)
- redesigned for use in threaded programs (thread-safe interface available)
- convenience routines to check for updated constant db and to refresh mmap
- support mcdb > 4 GB with 64-bit program (required to mmap() mcdb > 4 GB)
- 64-bit safe (for use in 64-bit programs)

Advantages over external database
- performance: better; avoids context switch to external database process
Advantages over specialized hash map
- generic, reusable
- maintained (created and verified) externally from process (less overhead)
- shared across processes (though shared-memory could be used for hash map)
- read-only (though memory pages could also be marked read-only for hash map)
Disadvantages to specialized hash map
- performance: slightly slower than specialized hash map
Disadvantages to djb cdb
- mmap requires address space be available into which to mmap the const db
  (i.e. large const db might fail to mmap into 32-bit process)
- mmap page alignment requirements and use of address space limits const db
  max size when created by 32-bit process.  Sizes approaching 4 GB may fail.
- arbitrary limit of each key or data set to (INT_MAX - 8 bytes; almost 2 GB)
  (djb cdb doc states there is no limit besides cdb fitting into 4 GB)
  (writev() on some platforms in 32-bit exe might also have 2 GB limit)

Incompatibilities with djb cdb
- padding added at the end of key,value data to 16-byte align hash tables
  (incompatible with djb cdbdump)
- initial table and hash tables have 8-byte values instead of 4-byte values
  in order to support mcdb > 4 GB.  cdb uses 24 bytes per record plus 2048,
  whereas mcdb uses 24 bytes per record plus 4096 when data section < 4 GB,
  and mcdb uses 40 bytes per record plus 4096 when data section >= 4 GB.
- packing of integral lengths into char strings is done big-endian for
  performance in packing/unpacking integer data in 4-byte (or better)
  aligned addresses.  (incompatible with all djb cdb* tools and cdb's)
  (djb cdb documents all 32-bit quantities stored in little-endian form)
  Memory load latency is limiting factor, not the x86 assembly instruction
  to convert uint32_t to and from big-endian (when data is 4-byte aligned).

Limitations
- 2 billion keys
  As long as djb hash is 32-bit, mcdb_make.c limits number of hash keys to
  2 billion.  mcdb handles hash collisions, but there is a small expense each
  collision.  As the key space becomes denser within the 2 billion, there is
  greater chance of collisions.  Input strings also affect this probability,
  as do the sizes of the hash tables.
- process must mmap() entire mcdb
  Each mcdb is mmap()d in its entirety into the address space.  For 32-bit
  programs that means there is a 4 GB limit on size of mcdb, minus address
  space used by the program (including stack, heap, shared libraries, shmat
  and other mmaps, etc).  Compile and link 64-bit to remove this limitation.


References
----------
  Dan Bernstein's (djb) reference implementation of cdb (public domain)
    http://cr.yp.to/cdb.html
  Michael Tokarev's TinyCDB implementation of cdb (public domain)
    http://www.corpit.ru/mjt/tinycdb.html
  How To Write Shared Libraries, by Ulrich Drepper
    http://www.akkadia.org/drepper/dsohowto.pdf


There is plenty more information which will eventually (hopefully) be here.
In the meantime, here are some snippets, and more will come after the
initial release, as time allows.


Technical Asides
----------------


mcdbctl creates databases with limited permissions (0400)
---------------------------------------------------------
mcdbctl creates new databases with limited permissions (0400), important for
security when creating mcdb for sensitive data, such as /etc/shadow data.
mcdbctl recreates existing databases and preserves the permission modes that
were applied to the previous mcdb (replaced by the recreated mcdb).

mcdb performant design tidbits
------------------------------
Data locality is one of the primary keys to performance: djb cdb linearly
probed open hash tables minimize arbitrary jumps around memory.
Endian tranformations are absorbed in the noise of memory load latency.

An immediate performance boost can be seen when using mcdb for nsswitch
databases on machines that run Apache with suexec which was the reason behind
using cdb, and now mcdb, for passwd and group databases.  See INSTALL file.

smallest mcdb
-------------
Empty mcdb with no keys is 4K
$ echo | mcdbctl make empty.mcdb -

mcdb supports empty keys and values
-----------------------------------
$ echo -e "+0,0:->\n" | mcdbctl make empty.mcdb -
In practice, empty values can be useful when interested only in existence,
or not, of keys.  However, more than one empty key is not generically useful,
(or a use case is not immediately apparent to me).

mcdb supports one-char tag prefix on keys
-----------------------------------------
One method of storing multiple types of constant data in a single mcdb is
to prefix the key with a single character that indicate the type of data in
the key.  This feature is used by nss_mcdb for various nsswitch.conf databases.
For example, passwd databaes can be queried by name or by uid.  Both are stored
in the same mcdb database, with a different character prefixing the keys for
names versus the keys for uids.  This prefix tag character feature of mcdb
allows for tagged-key queries without the extra work of having to create a
temporary buffer to set the tag followed by a copy of the key.

mcdb creation speed
-------------------
mcdb creation supports the same input format as cdb.  In fact, creating an
mcdb from a cdb is as simple as: $ cdbdump foo.cdb | mcdbctl make foo.mcdb -
However, there is an overhead to formating the input as ASCII numbers, as well
as parsing and translating the ASCII back to native numbers.  (The cost adds up
with lots of keys.)  Another method to create mcdb which is even faster is
for a program to link with and call mcdb_make_* routines directly, as is done
in nss_mcdbctl.  For any mcdb beyond trivial size, disk speed has a large impact
on performance.  An encrpyted filesystem (e.g. homedirs in some Linux distros)
can be multiple times slower than unencrypted.  Where absolute performance
matters, solid state disks (SSDs) are recommended over spinning hard disks,
and in-memory filesystems (tmpfs on Linux, e.g. mounted on /dev/shm on Fedora
Linux) are even faster as long as the mcdb fits in the in-memory filesystem.

mcdb in memory only
-------------------
Creating an mcdb in memory without filesystem backing is possible.  This might
be useful for testing, but in practice, a hash function customized to the
specific purpose at hand would be faster.  Setting the fd to -1 in the call to
mcdb_make_start() is how to tell mcdb routines the mcdb is not backed by the
filesystem.  Then, caller must create mmap large enough for all keys and data
(filling in nkeys (num keys), total_klen (key len), and total_dlen (data len)).
  struct mcdb_make m;
  size_t msz;
  mcdb_make_start(&m, -1, malloc, free);
  /* preallocated mcdb mmap to proper full size; (msz+15) & ~15 for alignment */
  msz = (MCDB_HEADER_SZ + nkeys*8 + total_klen + total_dlen + 15) & ~15;
  msz+= (msz < UINT_MAX ? nkeys*16 : nkeys*32);
  m.msz = (msz = (msz + ~m.pgalign) & m.pgalign); /*align to page size*/
  /*(mmap or malloc or other mem alloc scheme, but must be sufficiently sized)*/
  m.map = (char *)mmap(0, msz, PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
  if (m.map == MAP_FAILED) return false;
  /* ... loop and mcdb_make_add(), then mcdb_make_finish() */
  /* ... when finished using the mmap, then munmap(m.map, m.msz) */
If caller is keeping track of size using the formula above, then caller could
resize the map, copy the existing data ([0,m.pos)) and set new m.map and m.msz.
Either way, this technique could be used to create an mcdb, which is then added
as a value to a larger mcdb (an mcdb where the values are also mcdbs).  Upon
retrieving such a value, code could manually initialize struct mcdb_mmap and
then query it, though take care not to destroy or reopen it since the mcdb will
not be backed by the filesystem.  Initializing struct mcdb_mmap after querying
an example larger_mcdb:
  struct mcdb m;
  struct mcdb_mmap map;
  memset(&map, '\0', sizeof(struct mcdb_mmap));
  map.ptr  = mcdb_dataptr(larger_mcdb);
  map.size = mcdb_datalen(larger_mcdb);
  map->b   = map.size < UINT_MAX || *(uint32_t *)map.ptr == 0 ? 3 : 4;
  map->n   = ~0;
  m.map = &map;
  /* ... mcdb_find(&m, str, len) ... */

compiler intrinsics/builtins not (yet) tested on all platforms
--------------------------------------------------------------
The compiler intrinsics/builtins in plasma_attr.h have been tested on i686
and x86_64 platforms, but have not (yet) been tested on all other platforms.
They were written from vendor documentation available on the internet, but I do
not have private access to all those platforms in order to test.  Please send
to me advice, suggestions, and corrections.

mcdb database file validation
-----------------------------
mcdb database file is assumed to be consistent (not corrupt) once created.
A failure in 'mcdbctl make' results in corrupt database temporary file -not-
replacing an existing, good database.  An expensive method to validate an
existing mcdb is to run 'mcdbctl stats foo.mcdb', which reads the mcdb and
queries every key.  If 'mcdbctl stats' exits 0, then the database is consistent
(good).  A faster, non-destructive (and non-crashing) heuristic to validate if
a file is an mcdb is to call mcdb_validate_slots() which validates the slots
table in the mcdb header.  It the slot table is coherent, then the file is very
likely an mcdb.

mcdb queries into a file that is not an mcdb has strong possibility of causing
a program crash upon first access to slot and hash tables in mcdb.  Programs
that pass an invalid file (not an mcdb) to mcdb_mmap_create() will likely crash
upon first access into that invalid mcdb, unless the mcdb is first checked with
a call to mcdb_validate_slots().

mcdb unique keys
----------------
mcdb supports multi-valued keys, where the same key is in the mcdb more than
once, each time with another value.  If there is a need to have unique keys in
the mcdb, then mcdb creator has a few options: keep track of the keys already
entered in the database and check before adding another, create a hash table of
keys and then use that to create the mcdb, or post-process the mcdb.  The first
two options work well for small mcdb, but might be expensive for large mcdb.
mcdbctl provides an easy way to accomplish the third option, post-processing the
mcdb.  Given an mcdb, mcdbctl uniq file.mcdb will create a new mcdb with unique
keys, using the first key,value found for each key.  If the optional parameter
"last" is given (mcdbctl uniq file.mcdb "last"), mcdbctl will do the same, but
will use the last (final) value found for each key.  In both cases, a new mcdb
is only created (and then renamed into the original mcdb) if a multi-valued key
is detected in the original.

mcdb limit of a billion keys (on that order of magnitude)
----------------------------
(See "Limitations" above)
Just came across https://github.com/pcarrier/cdb64, which changes all structures
in cdb to 64-bit, including the hash function, so that it can handle billions of
keys.  By comparison, mcdb uses 32-bit structures in select places, and changes
some 32-bit structures to 64-bit only when the data exceeds 4 GB.  Doing this
in mcdb makes mcdb much smaller and faster than cdb64 in the vast majority of
use cases.  However, if there is a need for billions of keys, then cdb64 might
be considered, though at that point it is probably advisable to write custom
code to handle the specific data set, including a replacement for the djb hash
function.  (mcdb code could be easily tweaked to support 64-bit hslots and
64-bit hash values, but the on-disk format would be incompatible with mcdb.)

mcdb support for user-provided (custom) hash function
-----------------------------------------------------
The ability to use other hash functions can measurably improve mcdb performance
if a specific dataset hashes more quickly and/or with better distribution with
a different hash function. 

Experimental support exists for a user-provided (custom) hash function to
replace the djb hash function.  It is only available in C and documented here
in the raw.  (If useful, please send feedback to me and I might create simple
function interfaces to configure a custom hash function.  It would even be
possible to have a structured footer at the end of the mcdb which, if present,
could contain info about which hash_fn to use from a future pre-configured set.)

During mcdb creation, a custom hash function can be set after mcdb_make_start()
but prior to the first mcdb_make_add().  By default, struct mcdb_make is
initialized in mcdb_make_start() to use the hash_fn = djb hash and initial hash
value hash_init = djb hash initial value.
  uint32_t hash_init;         /* hash init value */
  uint32_t (*hash_fn)(uint32_t, const void * restrict, size_t); /* hash func */
Similarly, mcdb_mmap_init() sets the djb hash function and initial value in
struct mcdb_mmap, using members named the same as above, and so the mcdb
consumer must set the custom hash function after mcdb_mmap_init() and prior to
querying the mcdb.  Key and value arguments passed to mcdb functions typically
expect (const char *) and so casting might be necessary to avoid compiler
warnings/errors if the hash function expects a different type.

mcdb makes no attempt to validate the hash function matches the hash function
used to create the mcdb.  Doing so is the responsibilty of mcdb creator and
consumers.  If the hash function is lost, the key/value data in the mcdb is
still retrievable via mcdb_iter() or mcdbctl dump.  Other options to the
command-line tool mcdbctl will not function properly when operating on an mcdb
created using a custom hash function.

One interesting example of a high-performance custom hash function is the
identity function, which can be used when the data in the mcdb always has
keys that are 4-byte integers (with reasonable distribution in the lower bits),
and hash_init is set to 0 to be initialized, but is otherwise ignored.
  uint32_t
  hash_uint32_identity(uint32_t h       __attribute__((unused)),
                       const void * const restrict vbuf,
                       const size_t sz  __attribute__((unused)))
  {
      return *(uint32_t *)vbuf;
  }
Additionally, there are other options.  An excellent starting point -- but by
no means exhaustive -- comparison of some alternative hash functions can be
found at: http://burtleburtle.net/bob/hash/doobs.html


Portability Notes
-----------------


warning: visibility attribute not supported in this configuration; ignored
--------------------------------------------------------------------------
This is a harmless warning and occurs on some platforms where the gcc extension
__attribute__((visibility("..."))) is not supported.  On Solaris on SPARC,
newer versions of gcc (e.g. 4.6.1) support the visibility attribute, while
earlier versions of gcc might not.

HP-UX
-----
Compilation requires newer version of gcc (e.g. 4.6.1)

-D_FORTIFY_SOURCE=2
------------------
infinite loop in 'mcdbctl make xxx.mcdb -' reading from stdin on Fedora 16
with clang 2.9 and mcdb_makefmt.c compiled with -D_FORTIFY_SOURCE=2
(workaround is to compile mcdb_makefmt.c without -D_FORTIFY_SOURCE=2)


Distribution Notes
------------------
mcdb is licensed under the GNU Lesser General Public License (see file COPYING)
mcdb is originally based on Dan J. Berstein cdb package.  cdb is public domain:

http://cr.yp.to/distributors.html
D. J. Bernstein
Frequently asked questions from distributors
[...]
What are the distribution terms for cdb?
2009.07.21: I hereby place the cdb package (in particular, cdb-0.75.tar.gz, with MD5 checksum 81fed54d0bde51b147dd6c20cdb92d51) into the public domain. The package is no longer copyrighted.