[PDB Docs] Add documentation for the hash table format.

Zachary Turner · Zachary Turner · commit cb70fe1c69a2 · 2019-04-05T22:09:30.000Z
llvm-svn: 357826
diff --git a/llvm/docs/PDB/HashTable.rst b/llvm/docs/PDB/HashTable.rst
@@ -1,2 +1,103 @@
 The PDB Serialized Hash Table Format
 ====================================
+
+.. contents::
+   :local:
+
+.. _hash_intro:
+
+Introduction
+============
+
+One of the design goals of the PDB format is to provide accelerated access to
+debug information, and for this reason there are several occasions where hash
+tables are serialized and embedded directly to the file, rather than requiring
+a consumer to read a list of values and reconstruct the hash table on the fly.
+
+The serialization format supports hash tables of arbitrarily large size and
+capacity, as well as value types and hash functions.  The only supported key
+value type is a uint32.  The only requirement is that the producer and consumer
+agree on the hash function.  As such, the hash function can is not discussed
+further in this document, it is assumed that for a particular instance of a PDB
+file hash table, the appropriate hash function is being used.
+
+On-Disk Format
+==============
+
+.. code-block:: none
+
+  .--------------------.-- +0
+  |        Size        |
+  .--------------------.-- +4
+  |      Capacity      |
+  .--------------------.-- +8
+  | Present Bit Vector |
+  .--------------------.-- +N
+  | Deleted Bit Vector |
+  .--------------------.-- +M                  ─╮
+  |        Key         |                        │
+  .--------------------.-- +M+4                 │
+  |       Value        |                        │
+  .--------------------.-- +M+4+sizeof(Value)   │
+           ...                                  ├─ |Capacity| Bucket entries
+  .--------------------.                        │
+  |        Key         |                        │
+  .--------------------.                        │
+  |       Value        |                        │
+  .--------------------.                       ─╯
+
+- **Size** - The number of values contained in the hash table.
+  
+- **Capacity** - The number of buckets in the hash table.  Producers should
+  maintain a load factor of no greater than ``2/3*Capacity+1``.
+  
+- **Present Bit Vector** - A serialized bit vector which contains information
+  about which buckets have valid values.  If the bucket has a value, the
+  corresponding bit will be set, and if the bucket doesn't have a value (either
+  because the bucket is empty or because the value is a tombstone value) the bit
+  will be unset.
+  
+- **Deleted Bit Vector** - A serialized bit vector which contains information
+  about which buckets have tombstone values.  If the entry in this bucket is
+  deleted, the bit will be set, otherwise it will be unset.
+
+- **Keys and Values** - A list of ``Capacity`` hash buckets, where the first
+  entry is the key (always a uint32), and the second entry is the value.  The
+  state of each bucket (valid, empty, deleted) can be determined by examining
+  the present and deleted bit vectors.
+
+
+.. _hash_bit_vectors:
+
+Present and Deleted Bit Vectors
+===============================
+
+The bit vectors indicating the status of each bucket are serialized as follows:
+
+.. code-block:: none
+
+  .--------------------.-- +0
+  |     Word Count     |
+  .--------------------.-- +4
+  |        Word_0      |        ─╮
+  .--------------------.-- +8    │
+  |        Word_1      |         │
+  .--------------------.-- +12   ├─ |Word Count| values
+           ...                   │
+  .--------------------.         │
+  |       Word_N       |         │
+  .--------------------.        ─╯
+
+The words, when viewed as a contiguous block of bytes, represent a bit vector with
+the following layout:
+
+.. code-block:: none
+
+    .------------.         .------------.------------.
+    |   Word_N   |   ...   |   Word_1   |   Word_0   |
+    .------------.         .------------.------------.
+    |            |         |            |            |
+  +N*32      +(N-1)*32    +64          +32          +0
+
+where the k'th bit of this bit vector represents the status of the k'th bucket
+in the hash table.