# Skip Lists, Commutative Hashing, and Authenticated Dictionaries

Goodrich and Tamassia's paper ["Efficient Authenticated Dictionaries with Skip Lists and Commutative Hashing"](http://www.cs.jhu.edu/~goodrich/cgc/pubs/hashskip.pdf) discusses a few interesting concepts.

The Goodrich paper uses the skip list data structure and a concept called _commutative hashing_ to build an _authenticated dictionary_. The use case the paper describes for an authenticated dictionary is a certificate revocation status system. In this example, a certificate authority (CA) is the authoritative source for a set of certificates and their revocation status. Instead of querying the CA directly for trustworthy revocation checks, entities interested in revocation status can query other (untrusted) services, which return the status together with a non-interactive proof of that status originating from the CA.

**TODO(kkl)**: Add a diagram of Checker -> Untrusted Service -> CA.

**TODO(kkl)**: Diagrams of Plateau and Tower elements.

## Other versions of the paper

* Updated, different algorithms: https://pdfs.semanticscholar.org/7eff/1b4529cd99168245cf0ff17cc23b27dd69b1.pdf

* More implementation details: https://www.ics.uci.edu/~goodrich/pubs/discex2001.pdf

## An overview of terminology used in the paper

$h$ - A commutative hash function

$S$ - A set of elements for which we want to enable authenticated membership queries.

$S_t$ - A linked list at level $t$. $S_0$ is the base list that contains all elements of $S$ in order.

$v$ - A node within a linked list $S_t$.

$s$ - A "start" node when traversing a skip list.

$f(v)$ - A "label" for a node $v$, computed from $h$ and the node's position in the skip list.

$f(s)$ - The label of the start node.

$elem(v)$ - The element stored at the node $v$. E.g. The number 32.

"plateau" - The top-most node of a "column" in a skip list. 

"tower" - Any node that isn't a plateau node.

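The plateau/tower distinction can be read straight off the level lists. The sketch below is a hypothetical helper (`classify` is not from the paper or from `kelbz.skiplist`); it assumes the levels are given bottom-first and uses the strings `"-inf"` and `"inf"` as stand-ins for the sentinels.

```python
def classify(levels):
    """Map (level, elem) -> "plateau" or "tower".

    `levels` is a bottom-first list of sorted element lists; level 0 is the
    base list S_0. A node is a plateau when its element is absent from the
    level directly above it, i.e. it is the top of its column.
    """
    kinds = {}
    for t, row in enumerate(levels):
        above = set(levels[t + 1]) if t + 1 < len(levels) else set()
        for elem in row:
            kinds[(t, elem)] = "tower" if elem in above else "plateau"
    return kinds


levels = [
    ["-inf", 1, 2, 3, "inf"],  # S_0
    ["-inf", 2, 3, "inf"],     # S_1
    ["-inf", "inf"],           # S_2
]
kinds = classify(levels)
```

Here the base node holding 1 is a plateau (1 never appears above level 0), while the base node holding 2 is a tower node whose plateau sits on level 1.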

## Skip lists

Skip lists are a data structure introduced by [William Pugh](https://epaperpress.com/sortsearch/download/skiplist.pdf) as a simpler-to-implement alternative to self-balancing trees, with reasonably efficient (expected $O(\log n)$) search, insertion, and deletion.

For demonstration purposes, let's build an incomplete version of a skip list and some helper functions. You can skim the details here if you are familiar with skip lists.

For clarity, I'll stick with naming conventions used in Goodrich's paper. First, let's build a "Node" element which will store integer values (i.e. "elements" in Goodrich's paper) and pointers to other Nodes.

In [1]:
import hashlib
from kelbz.skiplist import SkipList, INF

# Using the skip list from Figure 5 of Goodrich's paper.
skip_list = SkipList([
    [-INF,                                             INF], # Level 5
    [-INF,     17,                                     INF], # Level 4
    [-INF,     17,         25,                     55, INF], # Level 3
    [-INF,     17,         25, 31,                 55, INF], # Level 2
    [-INF, 12, 17,         25, 31, 38,     44,     55, INF], # Level 1
    [-INF, 12, 17, 20, 22, 25, 31, 38, 39, 44, 50, 55, INF], # Level 0
])

assert skip_list.height == 6
assert skip_list.get(12).elem == 12
assert skip_list.get(12).level == 0
assert skip_list.get(77) is None

Now let's build a partially complete skip list. This skip list implementation doesn't support deletions or a dynamic number of levels, and I'm not certain it handles duplicates correctly. But for demonstration purposes, it should suffice.

The implementation details are not critical, so feel free to skim over them.

To get a feel for the `SkipList` API (and to test it's mostly functional), let's insert some elements and search for elements.
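
The source of `kelbz.skiplist` isn't reproduced here, so the following is a minimal stand-in sketch rather than that module's actual code. The attribute names (`elem`, `level`, `right`, `down`, `up`, `is_plateau`, `is_tower`) mirror the ones the later cells rely on; the free functions `build` and `search` and the use of `math.inf` for `INF` are assumptions made for illustration.

```python
import math

INF = math.inf  # assumed stand-in for the INF imported from kelbz.skiplist


class Node:
    """Hypothetical node: one element plus links to its neighbours."""

    def __init__(self, elem, level):
        self.elem = elem
        self.level = level
        self.right = None
        self.down = None
        self.up = None

    @property
    def is_plateau(self):
        # Top of its column: nothing is stacked above this node.
        return self.up is None

    @property
    def is_tower(self):
        return not self.is_plateau


def build(levels):
    """Link nodes from level lists given top level first; return the start node.

    Assumes no duplicate elements, since nodes are matched by value.
    """
    below = {}  # elem -> node on the level directly beneath
    start = None
    for level, elems in enumerate(reversed(levels)):
        row = {}
        prev = None
        for elem in elems:
            node = Node(elem, level)
            if prev is not None:
                prev.right = node
            if elem in below:
                node.down = below[elem]
                below[elem].up = node
            row[elem] = node
            prev = node
        below = row
        start = row[elems[0]]
    return start


def search(start, x):
    """Return every node visited while searching for x, top-left node first."""
    stack = []
    node = start
    while node is not None:
        stack.append(node)
        if node.right is not None and node.right.elem <= x:
            node = node.right
        else:
            node = node.down
    return stack


# The skip list from Figure 5 of the paper, as used in the notebook.
start = build([
    [-INF, INF],
    [-INF, 17, INF],
    [-INF, 17, 25, 55, INF],
    [-INF, 17, 25, 31, 55, INF],
    [-INF, 12, 17, 25, 31, 38, 44, 55, INF],
    [-INF, 12, 17, 20, 22, 25, 31, 38, 39, 44, 50, 55, INF],
])
path = [(n.level, n.elem) for n in search(start, 39)]
```

Searching for 39 walks right whenever the next element is no larger than the target and drops down a level otherwise, so `path` ends at the base node holding 39. Searching for the absent element 42 visits exactly the same nodes.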

## Commutative Hashing

Commutative hash functions are a relaxation of typical collision-resistant hash functions. A commutative hash function $h(x, y)$ is defined to return the same result regardless of the order of its inputs, so there is a trivial case that would normally be considered a collision but is not: $h(x, y) = h(y, x)$. The notion of collision resistance for commutative hash functions explicitly excludes this case.

**TODO: Write about the value-add of commutative hashing in this context**

> Note that our scheme requires only the repeated accumulation of a sequence of values with a hash function. Unlike the previous best hash tree schemes, there is no need to provide auxiliary information about the order of the arguments to be hashed at each step, as determined by the topology of the path in a hash tree.

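To make the quoted point concrete, here is a small sketch contrasting the two verification styles. It is a generic illustration, not the paper's construction: the four-leaf tree, the helper names (`H`, `h`, `verify_ordered`, `verify_commutative`), and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


def H(data: bytes) -> bytes:
    """Ordinary (order-sensitive) hash."""
    return hashlib.sha256(data).digest()


def h(x: bytes, y: bytes) -> bytes:
    """Commutative hash of two 32-byte values: the smaller input goes first."""
    return H(min(x, y) + max(x, y))


def verify_ordered(leaf, path):
    """Ordinary hash tree: each step carries a sibling plus a side flag."""
    acc = leaf
    for sibling, sibling_is_left in path:
        acc = H(sibling + acc) if sibling_is_left else H(acc + sibling)
    return acc


def verify_commutative(leaf, path):
    """Commutative hashing: each step carries only the sibling value."""
    acc = leaf
    for sibling in path:
        acc = h(acc, sibling)
    return acc


a, b, c, d = (H(x) for x in (b"a", b"b", b"c", b"d"))

# Four-leaf tree with ordered hashing; proving `c` needs the side flags.
root_ordered = H(H(a + b) + H(c + d))
assert verify_ordered(c, [(d, False), (H(a + b), True)]) == root_ordered

# Same tree shape with the commutative hash; the flags disappear.
root_commutative = h(h(a, b), h(c, d))
assert verify_commutative(c, [d, h(a, b)]) == root_commutative
```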
The commutative hash function they implement is simple. We first hash the smaller element of the two inputs and then the larger. This is implemented below.

In [2]:
def chash(x, y):
    """Commutative hashing algorithm as described in Goodrich section 3.2.
    
    This function truncates the output of SHA-256 for readability 
    but it is not required.
    """

    def _to_bytes(i):
        """Serialize a number into a byte string.

        Assumes input will always be a 256-bit integer. Sentinel values
        -Inf/Inf are serialized using values that are not otherwise
        possible under this assumption.
        """
        if i == INF:
            return b"\xff" * ((256//8)+1)
        elif i == -INF:
            return b"\x00" * ((256//8)+1)
        else:
            return i.to_bytes(256//8, byteorder='big')
    
    # The Goodrich paper doesn't explicitly state how to handle the -Infinity
    # case, but there are cases where it expects us to hash -Infinity elements.
    # E.g. The bottom-left node will hit case (1a) which requires
    # a hash of elem(v).
    #
    # The Goodrich paper *also* says we don't need to hash Infinity elements,
    # but I think it contradicts itself, since case (1a) would require the
    # hash of elem(w), which could be infinity if v was a non-plateau node
    # immediately to the left of an Inf node.
    fst, sec = min(x, y), max(x, y)
    sha256 = hashlib.sha256()
    sha256.update(_to_bytes(fst))
    sha256.update(_to_bytes(sec))
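    # Reduce the digest modulo 0xff (255) so labels print as small integers
    # in the range 0..254. This is for readability only.
    return int.from_bytes(sha256.digest(), byteorder='big') % 0xff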

def chash_sequence(seq):
    """Hash a sequence of values using the algorithm described in Section 3.
    """
    if len(seq) == 2:
        return chash(seq[0], seq[1])
    return chash(seq[0], chash_sequence(seq[1:]))

assert chash(1, 2) == chash(2, 1)
assert chash(1, 2) != chash(2, 3)
assert chash_sequence([1,2]) == chash(1, 2)
assert chash_sequence([1,2,3]) == chash(1, chash(2, 3))
assert chash_sequence([1,3,2]) == chash(1, chash(2, 3))
assert chash_sequence([1,2,3,4]) == chash(1, chash(2, chash(3, 4)))

## Hashing the skip list

Now that we have a skip list implementation, we can build the "label" function described in Goodrich 4.1. This label is derived from the values to the right and below a given node $v$ and the commutative hash function $h$ we just implemented.

This function, given a node in the skip list, returns its label.

In [3]:
def f(v):
    w = v.right
    u = v.down
    # Base case: w/right is None. Default to a zero value.
    if w is None:
        return 0
    
    # Case 1: We are at the bottom level of the skiplist.
    if u is None:
        # Case 1a
        if w.is_tower:
            return chash(v.elem, w.elem)
        # Case 1b
        return chash(v.elem, f(w))
    else:
        # Case 2a
        if w.is_tower:
            return f(u)
        # Case 2b
        return chash(f(u), f(w))
    
def f_verbose(v):
    return _f_verbose(v, {})

def _f_verbose(v, d):
    w = v.right
    u = v.down
    # Base case: w/right is None. Default to a zero value.
    if w is None:
        d["f(%s)" % v._debug()] = "0"
        return 0, d
    
    # Case 1: We are at the bottom level of the skiplist.
    if u is None:
        # Case 1a
        if w.is_tower:
            d["f(%s)" % v._debug()] = "h(%s, %s)" % (v.elem, w.elem)
            return chash(v.elem, w.elem), d
        # Case 1b
        d["f(%s)" % v._debug()] = "h(%s, f(%s))" % (v.elem, w._debug())
        fw, d = _f_verbose(w, d)
        return chash(v.elem, fw), d
    else:
        # Case 2a
        if w.is_tower:
            d["f(%s)" % v._debug()] = "f(%s)" % u._debug()
            return _f_verbose(u, d)
        # Case 2b
        d["f(%s)" % v._debug()] = "h(f(%s), f(%s))" % (u._debug(), w._debug())
        fu, d = _f_verbose(u, d)
        fw, d = _f_verbose(w, d)
        return chash(fu, fw), d

# f should be deterministic for a given start node in a list.
assert f(skip_list.start) == f(skip_list.start)

smol = SkipList([
    [-INF,                      INF], # Level 3
    [-INF,       3,             INF], # Level 2
    [-INF,    2, 3,       6,    INF], # Level 1
    [-INF, 1, 2, 3, 4, 5, 6, 7, INF], # Level 0
])
_, d = f_verbose(smol.start)

def replace(d, match, replace):
    new_d = d.copy()
    for k, v in d.items():
        if match in v:
            new_d[k] = v.replace(match, replace)
    return new_d

def expand(d):
    new_d = d.copy()
    _continue = True
    while _continue:
        for k in sorted(d.keys()):
            v = d[k]
            new_d = replace(new_d, k, v)
        _continue = any(["Node{" in v for v in new_d.values()])
    return new_d

def labels(skip_list):
    d = {}
    for i in reversed(range(skip_list.height)):
        nodes = skip_list._level(i)
        for node in nodes:
            d["f(%s)" % node._debug()] = f(node)
    return d

    
expanded = expand(d)
for k in sorted(expanded.keys()):
    print("%s = %s" % (k, expanded[k]))
    
label_dict = labels(smol)
for k in sorted(label_dict.keys()):
    print("%s = %s" % (k, label_dict[k]))

h = chash
inf = INF
f(skip_list.start), h(h(h(h(-inf, 12), h(12, 17)), h(h(17, h(20, h(22, 25))), h(h(h(25, 31), h(h(31, 38), h(h(38, h(39, 44)), h(44, h(50, 55))))), h(55, inf)))), 0)

f(Node{level: 0, elem:-inf}) = h(-inf, h(1, 2))
f(Node{level: 0, elem:1}) = h(1, 2)
f(Node{level: 0, elem:2}) = h(2, 3)
f(Node{level: 0, elem:3}) = h(3, h(4, h(5, 6)))
f(Node{level: 0, elem:4}) = h(4, h(5, 6))
f(Node{level: 0, elem:5}) = h(5, 6)
f(Node{level: 0, elem:6}) = h(6, h(7, inf))
f(Node{level: 0, elem:7}) = h(7, inf)
f(Node{level: 1, elem:-inf}) = h(h(-inf, h(1, 2)), h(2, 3))
f(Node{level: 1, elem:2}) = h(2, 3)
f(Node{level: 1, elem:3}) = h(h(3, h(4, h(5, 6))), h(6, h(7, inf)))
f(Node{level: 1, elem:6}) = h(6, h(7, inf))
f(Node{level: 2, elem:-inf}) = h(h(h(-inf, h(1, 2)), h(2, 3)), h(h(3, h(4, h(5, 6))), h(6, h(7, inf))))
f(Node{level: 2, elem:3}) = h(h(3, h(4, h(5, 6))), h(6, h(7, inf)))
f(Node{level: 3, elem:-inf}) = h(h(h(h(-inf, h(1, 2)), h(2, 3)), h(h(3, h(4, h(5, 6))), h(6, h(7, inf)))), 0)
f(Node{level: 3, elem:inf}) = 0
f(Node{level: 0, elem:-inf}) = 53
f(Node{level: 0, elem:1}) = 17
f(Node{level: 0, elem:2}) = 159
f(Node{level: 0, elem:3}) = 178
f(Node{level: 0, elem

(53, 53)

## TODO: The search stack

In [4]:
def _check_stack(search_stack, expected_search_stack):
    assert len(search_stack) == len(expected_search_stack)
    for (exp_level, exp_elem), node in zip(expected_search_stack, search_stack):
        assert node.level == exp_level
        assert node.elem == exp_elem

# Figure 3
search_stack_39 = skip_list.search(39)
search_stack_42 = skip_list.search(42)

expected_search_stack = [
    (5, -INF),
    (4, -INF), (4, 17),
               (3, 17), (3, 25),
                        (2, 25), (2, 31),
                                 (1, 31), (1, 38),
                                          (0, 38), (0, 39),
]

_check_stack(search_stack_39, expected_search_stack)
_check_stack(search_stack_42, expected_search_stack)

# TODO: Authenticated queries

Once we have a defined algorithm for labeling nodes in the skip list, we can start providing authenticated queries. Authenticated queries can prove that a given element does or does not exist in a skip list, even if the answer was provided by a potentially untrusted party.

To provide authenticated queries using the skip list structure, we need a few things:

* A signature (e.g. ed25519) of the start node's label. In the CRL example given in the paper, this would be published by a certificate authority. An entity wanting to verify queries would possess, and trust, the corresponding public key.

* We'd need a way of encoding a path.

> In [both membership and non-membership cases], the query authentication information is a single sequence of values, together with the signed timestamp and value $f(s)$.

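On the verifying side, the check amounts to folding the returned sequence with $h$ and comparing the result to the signed $f(s)$. The sketch below is a self-contained approximation under stated assumptions: it reuses the notebook's sentinel encoding but keeps the full 256-bit digest, `verify` is a hypothetical name, and `trusted_root` is assumed to have already been checked against the CA's signature (signature verification itself is not shown). It also only checks the hash, not that the sequence actually mentions the queried element. The example values come from the three-level list `[-inf, 1, 2, 3, inf]` / `[-inf, 2, 3, inf]` / `[-inf, inf]` used in the query cell.

```python
import hashlib
import math

INF = math.inf


def _to_bytes(i):
    # Sentinels get 33-byte encodings that no 32-byte integer can produce.
    if i == INF:
        return b"\xff" * 33
    if i == -INF:
        return b"\x00" * 33
    return i.to_bytes(32, byteorder="big")


def h(x, y):
    """Commutative hash: smaller input first, full 256-bit output."""
    lo, hi = min(x, y), max(x, y)
    digest = hashlib.sha256(_to_bytes(lo) + _to_bytes(hi)).digest()
    return int.from_bytes(digest, byteorder="big")


def hash_sequence(seq):
    """h(x1, x2, ..., xm) = h(x1, h(x2, ..., xm))."""
    acc = seq[-1]
    for x in reversed(seq[:-1]):
        acc = h(x, acc)
    return acc


def verify(q, trusted_root):
    """Accept the answer only if the sequence hashes to the trusted f(s)."""
    return hash_sequence(q) == trusted_root


# Label of the level-1 node holding 2, and the start label f(s).
f_1_2 = h(h(2, 3), h(3, INF))
root = h(h(h(-INF, h(1, 2)), f_1_2), 0)

# Q(1) as printed by the notebook: [f(top-right node) = 0, f_1_2, -inf, 1, 2].
q1 = [0, f_1_2, -INF, 1, 2]
assert verify(q1, root)
assert not verify([0, f_1_2, -INF, 1, 3], root)  # a tampered answer fails
```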
In [48]:
def authenticated_query(skiplist, element):
    search_stack = skiplist.search(element)
    verbose_stack = []
    px = list(reversed(search_stack))
    # build Q(X)
    qx = []
    w1 = px[0].right
    if w1.is_plateau:
        # print("Case 0: Appending f(%s) = %s" % (w1._debug(), f(w1)))
        verbose_stack.append("f(%s)" % w1._debug())
        qx.append(f(w1))
    else:
        # print("Case 1: Appending %s.elem" % w1._debug())
        verbose_stack.append("%s" % w1.elem)
        qx.append(w1.elem)
    
    # the algorithm in Figure 6 has a typo. It assigns $x1$ (The second element of Q(x))
    # to the value $x$. This cannot be correct, as $x$ is not guaranteed to exist in the
    # skip list. Based on other examples in the paper, I believe $v1$ is the intended value here.
    # px[0] = the paper's $v1$.
    qx.append(px[0].elem)
    verbose_stack.append("%s" % px[0].elem)
    # print("Case 2: Appending %s.elem" % px[0]._debug())
    # print("Before: QX %s" % qx)
    # Paper is using one-indexing but Python is using zero-indexing.
    # Hence, I start at 1 here, not 2.
    for i in range(1, len(search_stack)):
        wi = px[i].right
        if wi.is_plateau:
            if wi != px[i-1]:
                # print("Case 3: Appending f(%s) = %s" % (wi._debug(), f(wi)))
                verbose_stack.append("f(%s)" % wi._debug())
                qx.append(f(wi))
            else:
                if px[i].level == 0:
                    # print("Case 4: Appending %s.elem" % px[i]._debug())
                    verbose_stack.append("%s" % px[i].elem)
                    qx.append(px[i].elem)
                else:
                    # print("Case 5: Appending f(%s) = %s" % (px[i].down._debug(), f(px[i].down)))
                    verbose_stack.append("f(%s)" % px[i].down._debug())
                    qx.append(f(px[i].down))
            # print("QX: %s" % (str(qx)))
            
    # The special cases from 4.3, where the searched for element does not match $v1$.
    v1 = px[0]
    if v1.elem != element:
        w1 = v1.right
        z = v1.right.right
        # w1 is a tower node, return Q(x) as-is
        if w1.is_tower:
            # print("Case 4: Just %s" % qx)
            print("Q(%s) = %s" % (element, list(reversed(verbose_stack))))
            return list(reversed(qx))
        if w1.is_plateau and z.is_tower:
            # print("Case 5: Adding [%s, %s] + %s" % (z._debug(), w1._debug(), qx))
            print("Q(%s) = %s" % (element, list(reversed([z._debug(), w1._debug() ] + verbose_stack))))
            return list(reversed([z.elem, w1.elem] + qx))
        if w1.is_plateau and z.is_plateau:
            # print("Case 6: Adding [f(%s), %s] + %s" % (z._debug(), w1._debug(), qx))
            print("Q(%s) = %s" % (element, list(reversed(["f(%s)" % z._debug(), w1._debug() ] + verbose_stack[1:]))))
            return list(reversed([f(z), w1.elem] + qx[1:]))
    print("Q(%s) = %s" % (element, list(reversed(verbose_stack))))
    return list(reversed(qx))

# From Figure 7 description, the authentication information should be the same
# assert authenticated_query(skip_list, 39) == authenticated_query(skip_list, 42)

# q39 = list(reversed(authenticated_query(skip_list, 39)))
# assert chash_sequence(q39) == f(skip_list.start)

# q42 = list(reversed(authenticated_query(skip_list, 42)))
# assert chash_sequence(q42) == f(skip_list.start)

# q18 = list(reversed(authenticated_query(skip_list, 18)))
# assert chash_sequence(q18) == f(skip_list.start)



# Observations:
# * If there is not another column at max height, then f(Node{level:MAX, elem:inf}) is first in the list.
# * I believe the hash of every taller column to the right is included.
# * A chain of lone base nodes will be included as an exposed elem until a column is reached on the left. E.g.
# smol = SkipList([
#    [-INF,    2,          INF], # Level 1
#    [-INF, 1, 2, 3, 4, 5, INF], # Level 0
# ])
# Q(2) = [... 'f(Node{level: 1, elem:inf})', '2', 'f(Node{level: 0, elem:3})']
# Q(3) = [... 'f(Node{level: 1, elem:inf})', '2', '3', 'f(Node{level: 0, elem:4})']
# Q(4) = [... 'f(Node{level: 1, elem:inf})', '2', '3', '4', 'f(Node{level: 0, elem:5})']
# Q(5) = [... 'f(Node{level: 1, elem:inf})', '2', '3', '4', '5', 'inf']

smol = SkipList([
    [-INF,                   INF], # Level 3
    [-INF,             5,    INF], # Level 2
    [-INF,    2,       5,    INF], # Level 1
    [-INF, 1, 2, 3, 4, 5, 6, INF], # Level 0
 ])

smol = SkipList([
    [-INF,          INF], # Level 2
    [-INF,    2, 3, INF], # Level 1
    [-INF, 1, 2, 3, INF], # Level 0
])

for i in range(1, 7):
    qi = authenticated_query(smol, i)
#    print("Q(%d) = %s" % (i, qi))
    assert chash_sequence(qi) == f(smol.start)

_, d = f_verbose(smol.start)

expanded = expand(d)
label_dict = labels(smol)
for k in sorted(expanded.keys()):
    print("%s = %s = %s" % (k, expanded[k], label_dict[k]))

print(f(smol.start))

Q(1) = ['f(Node{level: 2, elem:inf})', 'f(Node{level: 1, elem:2})', '-inf', '1', '2']
Q(2) = ['f(Node{level: 2, elem:inf})', 'f(Node{level: 0, elem:-inf})', 'f(Node{level: 1, elem:3})', '2', '3']
Q(3) = ['f(Node{level: 2, elem:inf})', 'f(Node{level: 0, elem:-inf})', 'f(Node{level: 0, elem:2})', '3', 'inf']
Q(4) = ['f(Node{level: 2, elem:inf})', 'f(Node{level: 0, elem:-inf})', 'f(Node{level: 0, elem:2})', '3', 'inf']
Q(5) = ['f(Node{level: 2, elem:inf})', 'f(Node{level: 0, elem:-inf})', 'f(Node{level: 0, elem:2})', '3', 'inf']
Q(6) = ['f(Node{level: 2, elem:inf})', 'f(Node{level: 0, elem:-inf})', 'f(Node{level: 0, elem:2})', '3', 'inf']
f(Node{level: 0, elem:-inf}) = h(-inf, h(1, 2)) = 53
f(Node{level: 0, elem:1}) = h(1, 2) = 17
f(Node{level: 0, elem:2}) = h(2, 3) = 159
f(Node{level: 0, elem:3}) = h(3, inf) = 115
f(Node{level: 1, elem:-inf}) = h(h(-inf, h(1, 2)), h(h(2, 3), h(3, inf))) = 74
f(Node{level: 1, elem:2}) = h(h(2, 3), h(3, inf)) = 3
f(Node{level: 1, elem:3}) = h(3, inf) = 115

# Notes

## Intuition for hashing algorithm

To the authors' credit, the implementation of the label function $f$ is straightforward. I did, however, find the underlying intuition for what the algorithm was accomplishing to be a bit opaque. After working out a few examples, I felt I could approximate the algorithm in my head. My notes follow.

To avoid confusion with the "tower"/"plateau" terminology (which applies to a single node), I'll call a vertical stack of at least two nodes (one plateau node and at least one tower node) a "column".

First let's focus on the base level of the skip list (Cases 1a and 1b in Section 4.1). My shortest explanation for the base level is that $f$ recursively hashes a path to the nearest "column" to $v$'s right.

For demonstration, consider this snippet of the bottom row of a skip list.

```
    |         |                 |                           |
[ -Inf ] -> [ 1 ] -> [ 2 ] -> [ 3 ] -> [ 4 ] -> [ 5 ] -> [ Inf ]
```

Working out a few computations of $f(v)$:

* The label of $N_1$ (The base node containing element 1) is: $f(N_1) = h(1, h(2, 3))$.

* The label of $N_2$ is: $f(N_2) = h(2, 3)$.

* The label of $N_3$ is: $f(N_3) = h(3, h(4, h(5, Inf)))$

* The label of $N_5$ is: $f(N_5) = h(5, Inf)$

Note how the hash chain for $N_3$ is longer as there are more lone plateau nodes between the 3-column and the Inf-column.
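
The base-level walk above can be reproduced symbolically. This is a small sketch, not code from the paper: `base_label` is a hypothetical helper that builds the label as a string, given the base row and the set of elements that have a column above them (the ones marked with `|` in the diagram).

```python
def base_label(elems, has_column, i):
    """Symbolic label of the base node at index i (Cases 1a and 1b)."""
    v, w = elems[i], elems[i + 1]
    if w in has_column:
        # Case 1a: the right neighbour is a tower node, so the chain stops.
        return "h(%s, %s)" % (v, w)
    # Case 1b: the right neighbour is a lone plateau node, keep chaining.
    return "h(%s, %s)" % (v, base_label(elems, has_column, i + 1))


elems = ["-Inf", 1, 2, 3, 4, 5, "Inf"]
has_column = {"-Inf", 1, 3, "Inf"}

for i in (1, 2, 3, 5):
    print("f(N_%s) = %s" % (elems[i], base_label(elems, has_column, i)))
```

The recursion stops as soon as it reaches a node that sits under a column, which is why $N_2$ has the shortest chain and $N_3$ the longest.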

Nodes not on the base level (Cases 2a and 2b in Section 4.1) are a little trickier to explain all at once. Breaking it down into two observations may help:

* Non-base nodes simultaneously hash the nodes below and to the right of themselves.

* A non-base node $v$'s hash will include the labels of other columns if the column's plateau node is at $v$'s height or below. This is derived from Cases 2a and 2b. As you move down from a starting node $v$, you check $w = right(v)$ at each level. If $w$ is a plateau node (i.e. the top of a new column), you recursively compute the label of that new column (which, in turn, includes the labels of all columns of the same or lesser height from $w$).

Putting both cases together and hand-waving a bit: the label for a node $v$ covers all same-sized or smaller columns to the right of $v$, as well as a hash chain of elements at the base connecting each column. Note that the label of the top-left "start" node $s$ therefore depends on the entire structure of the skip list, as every node is at or below $s$'s height and to the right of $s$.
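
One way to check this intuition is to expand the start label symbolically and confirm that every element shows up in it. The sketch below works directly on level lists instead of linked nodes; `label` is a hypothetical helper, the levels are assumed to be given bottom-first, and the strings `"-inf"`/`"inf"` stand in for the sentinels. The example list is the four-level one from the labeling cell.

```python
def label(levels, lvl, elem):
    """Symbolic f(v) for the node holding `elem` on level `lvl`."""
    row = levels[lvl]
    i = row.index(elem)
    if i + 1 == len(row):
        return "0"  # no right neighbour: the notebook's base case
    w = row[i + 1]
    # w is a tower node when its element also appears one level up.
    w_is_tower = lvl + 1 < len(levels) and w in levels[lvl + 1]
    if lvl == 0:
        if w_is_tower:
            return f"h({elem}, {w})"                    # Case 1a
        return f"h({elem}, {label(levels, 0, w)})"      # Case 1b
    if w_is_tower:
        return label(levels, lvl - 1, elem)             # Case 2a
    return f"h({label(levels, lvl - 1, elem)}, {label(levels, lvl, w)})"  # Case 2b


levels = [
    ["-inf", 1, 2, 3, 4, 5, 6, 7, "inf"],  # Level 0
    ["-inf", 2, 3, 6, "inf"],              # Level 1
    ["-inf", 3, "inf"],                    # Level 2
    ["-inf", "inf"],                       # Level 3
]
start_label = label(levels, 3, "-inf")
print(start_label)
```

Every element from 1 to 7 appears somewhere in `start_label`, which is what lets a single signature over $f(s)$ commit to the whole list.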

## Potential attacks

* Assuming you have a signed node $s$ for some instance of a skiplist, you can likely create collisions by copying $s$ and adding prepended nodes.

* Encoding of -Inf/Inf.

* Hashing tuples?

* Lack of hash customization.

## Duplicates in a skip list

The paper makes no mention of duplicate values in the skip list.

**TODO(kkl): Test problems with duplicates?**

## h(-inf, y)

The paper doesn't explicitly mention that you need to handle hashing the element value of `-inf`. Specifically, take the bottom left element of a skip list to be $v$. When applying the label function $f$ (Described in section 4.1) to $v$ we will hit cases 1a/1b (dependent on the node right of $v$). In both cases, we compute $h(elem(v), ...)$ where $elem(v)$ is `-inf`.

Since I used SHA256 as the hash function (The authors mentioned they used MD5 in their paper which would suffer from the same problem) one would have to design a mapping from `-inf` to hashable byte strings. For implementation simplicity, I chose to just convert `-inf` cases to 0.

This is a problem, as this introduces collisions! E.g. h(-inf, 0) == h(0, 0). **TODO(kkl): Demonstrate this**.
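
A minimal sketch of that collision, assuming the naive choice described above (encode `-inf` as the integer 0) and SHA-256 over two 32-byte big-endian integers. The function names are hypothetical. For comparison, it also shows a length-distinct sentinel encoding like the one the `_to_bytes` helper in the `chash` cell uses, which avoids this particular collision.

```python
import hashlib

NEG_INF = float("-inf")


def naive_bytes(i):
    """Naive encoding: -inf is folded onto the integer 0."""
    if i == NEG_INF:
        i = 0
    return i.to_bytes(32, byteorder="big")


def tagged_bytes(i):
    """Sentinel gets a 33-byte encoding no 32-byte integer can produce."""
    if i == NEG_INF:
        return b"\x00" * 33
    return i.to_bytes(32, byteorder="big")


def h(x, y, encode):
    """Commutative hash: smaller input first, using the given encoding."""
    lo, hi = min(x, y), max(x, y)
    return hashlib.sha256(encode(lo) + encode(hi)).hexdigest()


# With the naive encoding, two different input pairs hash identically.
assert h(NEG_INF, 0, naive_bytes) == h(0, 0, naive_bytes)

# With the length-distinct encoding, the same pairs hash differently.
assert h(NEG_INF, 0, tagged_bytes) != h(0, 0, tagged_bytes)
```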

## plateau vs. tower

I added `up` pointers, which aren't strictly necessary. When traversing the skip list, any move `right` will land you on a plateau.

## Related concepts that are worth exploring

* Merkle trees

* Object Hash