# Chapter 10: Maps, Hash Tables, and Skip Lists

## 10.1 Maps and Dictionaries

Python's `dict` class is arguably the most significant data structure in the language. It represents an abstraction known as a dictionary in which unique keys are mapped to associated values. We note that the keys are assumed to be unique, but the values are not necessarily unique.

**Common applications of maps include the following.**

* A university's information system relies on some form of a student ID as a key that is mapped to that student's associated record serving as the value.
* The domain-name system (DNS) maps a host name, such as `www.wiley.com`, to an Internet-Protocol (IP) address, such as `208.215.179.146`.
* A social media site typically relies on a (nonnumeric) username as a key that can be efficiently mapped to a particular user's associated information.
* A computer graphics system may map a color name, such as `turquoise`, to the triple of numbers that describes the color's RGB representation, such as `(64, 224, 208)`.
* Python uses a dictionary to represent each namespace, mapping an identifying string, such as `pi`, to an associated object, such as `3.14159`.

In this chapter and the next we demonstrate that a map may be implemented so that a search for a key, and its associated value, can be performed very efficiently, thereby supporting fast lookup in such applications.

### 10.1.1 The Map ADT

We begin by listing what we consider the five most significant behaviors of a map `M`:

* `M[k]`: Return the value `v` associated with key `k` in map `M`, if one exists; otherwise raise a `KeyError`. In Python, this is implemented with the special method `__getitem__`.
* `M[k] = v`: Associate value `v` with key `k` in map `M`, replacing the existing value if the map already contains an item with key equal to `k`. In Python, this is implemented with the special method `__setitem__`.
* `del M[k]`: Remove from map `M` the item with key equal to `k`; if `M` has no such item, then raise a `KeyError`. In Python, this is implemented with the special method `__delitem__`.
* `len(M)`: Return the number of items in map `M`. In Python, this is implemented with the special method `__len__`.
* `iter(M)`: The default iteration for a map generates a sequence of keys in the map. In Python, this is implemented with the special method `__iter__`, and it allows loops of the form `for k in M`.

We have highlighted the above five behaviors because they demonstrate the core functionality of a map: namely, the ability to query, add, modify, or delete a key-value pair, and the ability to report all such pairs. For additional convenience, map `M` should also support the following behaviors:

* `k in M`: Return `True` if the map contains an item with key `k`. In Python, this is implemented with the special `__contains__` method.
* `M.get(k, d=None)`: Return `M[k]` if key `k` exists in the map; otherwise return default value `d`. This provides a way to query `M[k]` without risk of a `KeyError`.
* `M.setdefault(k, d)`: If key `k` exists in the map, simply return `M[k]`; if key `k` does not exist, set `M[k] = d` and return that value.
* `M.pop(k[, d])`: Remove the item associated with key `k` from the map and return its associated value `v`. If key `k` is not in the map, return default value `d` (or raise `KeyError` if no default `d` is supplied).
* `M.popitem()`: Remove an arbitrary key-value pair from the map, and return a `(k,v)` tuple representing the removed pair. If map is empty, raise a `KeyError`.
* `M.clear()`: Remove all key-value pairs from the map.
* `M.keys()`: Return a set-like view of all keys of `M`.
* `M.values()`: Return a view of all values of `M` (not set-like, since values need not be unique).
* `M.items()`: Return a set-like view of `(k, v)` tuples for all entries of `M`.
* `M.update(M2)`: Assign `M[k] = v` for every `(k, v)` pair in map `M2`.
* `M == M2`: Return `True` if maps `M` and `M2` have identical key-value associations.
* `M != M2`: Return `True` if maps `M` and `M2` do not have identical key-value associations.
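
As a quick check of the convenience behaviors listed above, the following sketch exercises several of them on Python's built-in `dict`, which supports the whole map ADT. The color names and values are made up for illustration.

```python
# Illustrative data only: color names mapped to RGB triples.
M = {'red': (255, 0, 0), 'turquoise': (64, 224, 208)}

missing = M.get('green')                     # None: no KeyError for a missing key
fallback = M.get('green', (0, 0, 0))         # explicit default value instead
added = M.setdefault('green', (0, 255, 0))   # key absent: insert it and return it
kept = M.setdefault('red', (0, 0, 0))        # key present: existing value is returned
removed = M.pop('turquoise')                 # remove the item and return its value
absent = M.pop('turquoise', (0, 0, 0))       # key already gone: default is returned

M.update({'blue': (0, 0, 255)})              # copy every pair from another map
same = M == {'blue': (0, 0, 255), 'green': (0, 255, 0), 'red': (255, 0, 0)}

print(sorted(M.keys()))
print(same)   # equality depends on the key-value associations, not on order
```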

### 10.1.2 Application: Counting Word Frequencies

In [13]:
text_sample = "ALEXEY Fyodorovitch Karamazov was the third son of Fyodor Pavlovitch Karamazov, a landowner well known in our district in his own day, and still remembered among us owing to his gloomy and tragic death, which happened thirteen years ago, and which I shall describe in its proper place. For the present I will only say that this 'landowner'- for so we used to call him, although he hardly spent a day of his life on his own estate- was a strange type, yet one pretty frequently to be met with, a type abject and vicious and at the same time senseless. But he was one of those senseless persons who are very well capable of looking after their worldly affairs, and, apparently, after nothing else. Fyodor Pavlovitch, for instance, began with next to nothing; his estate was of the smallest; he ran to dine at other men's tables, and fastened on them as a toady, yet at his death it appeared that he had a hundred thousand roubles in hard cash. At the same time, he was all his life one of the most senseless, fantastical fellows in the whole district. I repeat, it was not stupidity- the majority of these fantastical fellows are shrewd and intelligent enough- but just senselessness, and a peculiar national form of it."

In [18]:
freq = {}
words = [c for c in text_sample.lower().split() if c.isalpha()]

In [19]:
for word in words:
    freq[word] = 1 + freq.get(word, 0)

In [20]:
freq

{'alexey': 1,
 'fyodorovitch': 1,
 'karamazov': 1,
 'was': 6,
 'the': 8,
 'third': 1,
 'son': 1,
 'of': 8,
 'fyodor': 2,
 'pavlovitch': 1,
 'a': 7,
 'landowner': 1,
 'well': 2,
 'known': 1,
 'in': 5,
 'our': 1,
 'district': 1,
 'his': 7,
 'own': 2,
 'and': 8,
 'still': 1,
 'remembered': 1,
 'among': 1,
 'us': 1,
 'owing': 1,
 'to': 5,
 'gloomy': 1,
 'tragic': 1,
 'which': 2,
 'happened': 1,
 'thirteen': 1,
 'years': 1,
 'i': 3,
 'shall': 1,
 'describe': 1,
 'its': 1,
 'proper': 1,
 'for': 3,
 'present': 1,
 'will': 1,
 'only': 1,
 'say': 1,
 'that': 2,
 'this': 1,
 'so': 1,
 'we': 1,
 'used': 1,
 'call': 1,
 'although': 1,
 'he': 5,
 'hardly': 1,
 'spent': 1,
 'day': 1,
 'life': 2,
 'on': 2,
 'strange': 1,
 'yet': 2,
 'one': 3,
 'pretty': 1,
 'frequently': 1,
 'be': 1,
 'met': 1,
 'type': 1,
 'abject': 1,
 'vicious': 1,
 'at': 4,
 'same': 2,
 'time': 1,
 'but': 2,
 'those': 1,
 'senseless': 1,
 'persons': 1,
 'who': 1,
 'are': 2,
 'very': 1,
 'capable': 1,
 'looking': 1,
 'after': 2,
 

In [22]:
max_word = ''
max_count = 0
for (w,c) in freq.items():
    if c > max_count:
        max_word = w
        max_count = c

print('The most frequent word is', max_word)
print('Its number of occurrences is', max_count)


The most frequent word is the
Its number of occurrences is 8
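
The counting loop and the search for the maximum can also be expressed with tools from the standard library. The sketch below is one possible alternative, not part of the original notebook; it uses `collections.Counter` on a short made-up sentence rather than the full passage.

```python
from collections import Counter

# A short, made-up sample; the notebook above uses a longer passage.
sample = "the quick brown fox jumps over the lazy dog and the cat"

# Same filtering rule as above: lowercase, split on whitespace, keep alphabetic tokens.
words = [w for w in sample.lower().split() if w.isalpha()]

freq = Counter(words)                          # a dict subclass that counts occurrences
max_word, max_count = freq.most_common(1)[0]   # (word, count) pair with the highest count

# The same answer from a plain dict-style lookup, using the count as the sort key.
max_word_alt = max(freq, key=freq.get)

print('The most frequent word is', max_word)
print('Its number of occurrences is', max_count)
```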


### 10.1.3 Python's `MutableMapping` Abstract Base Class

The `collections.abc` module provides two abstract base classes that are relevant to our current discussion: the `Mapping` and `MutableMapping` classes. The `Mapping` class includes all nonmutating methods supported by Python's `dict` class, while the `MutableMapping` class extends that to include the mutating methods. What we define as the map ADT in Section 10.1.1 is akin to the `MutableMapping` abstract base class in Python's `collections.abc` module.

The significance of these abstract base classes is that they provide a framework to assist in creating a user-defined map class. In particular, `MutableMapping` supplies concrete implementations of every behavior other than the five core ones (`__getitem__`, `__setitem__`, `__delitem__`, `__len__`, and `__iter__`), so a subclass need only define those five.

In [3]:
from collections.abc import Mapping, MutableMapping

### 10.1.4 Our `MapBase` Class

In [4]:
class MapBase(MutableMapping):
    """Our own abstract base class that includes a nonpublic _Item class."""

    #-------------------------- nested _Item class --------------------------
    class _Item:
        """Lightweight composite to store key-value pairs as map items"""

        __slots__ = '_key', '_value'

        def __init__(self, k, v):
            self._key = k
            self._value = v

        def __eq__(self, other):
            return self._key == other._key  # compare items based on their keys

        def __ne__(self, other):
            return not (self == other)  # opposite of __eq__

        def __lt__(self, other):
            return self._key < other._key  # compare items based on their keys




### 10.1.5 Simple Unsorted Map Implementation

We demonstrate the use of the `MapBase` class with a very simple concrete implementation of the map ADT.

In [8]:
class UnsortedTableMap(MapBase):
    """Map implementation using an unordered list."""

    def __init__(self):
        """Create an empty map."""
        self._table = []

    def __getitem__(self, k):
        """Return value associated with key k (raise KeyError if not found)."""

        for item in self._table:
            if k == item._key:
                return item._value
        raise KeyError('Key Error: ' + repr(k))

    def __setitem__(self, k, v):
        """Assign value v to key k, overwriting existing value if present."""
        for item in self._table:
            if k == item._key:  # found a match:
                item._value = v  # reassign value
                return  # and quit, so no duplicate item is appended
        # did not find match for key
        self._table.append(self._Item(k,v))

    def __delitem__(self, k):
        """Remove item associated with key k (raise KeyError if not found)."""
        for j in range(len(self._table)):
            if k == self._table[j]._key:
                self._table.pop(j)
                return
        raise KeyError("Key Error: " + repr(k))

    def __len__(self):
        """Return number of items in the map."""
        return len(self._table)

    def __iter__(self):
        """Generate iteration of the map's keys."""
        for item in self._table:
            yield item._key
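
To see what the five core methods buy us, here is a minimal, self-contained sketch. `ListMap` is a hypothetical compact stand-in for `UnsortedTableMap`: it skips `MapBase` and `_Item` and stores `[key, value]` pairs directly so that the block runs on its own. Behaviors such as `in`, `get`, and `items` are never defined here; they are inherited from `MutableMapping`.

```python
from collections.abc import MutableMapping

class ListMap(MutableMapping):
    """Hypothetical stand-in for UnsortedTableMap: [key, value] pairs in a list."""

    def __init__(self):
        self._table = []

    def __getitem__(self, k):
        for key, value in self._table:      # linear scan for a matching key
            if k == key:
                return value
        raise KeyError(k)

    def __setitem__(self, k, v):
        for pair in self._table:
            if k == pair[0]:
                pair[1] = v                 # overwrite the existing value
                return
        self._table.append([k, v])          # no match: add a new pair

    def __delitem__(self, k):
        for j, pair in enumerate(self._table):
            if k == pair[0]:
                self._table.pop(j)
                return
        raise KeyError(k)

    def __len__(self):
        return len(self._table)

    def __iter__(self):
        for key, _ in self._table:
            yield key

M = ListMap()
M['a'] = 1
M['b'] = 2
M['a'] = 10                   # overwrites; the map still holds two items

print(len(M))
print('b' in M)               # __contains__ is inherited
print(M.get('z', 0))          # get is inherited
print(list(M.items()))        # items is inherited
```

Every lookup, insertion, and deletion here scans the list, so each takes time proportional to the number of items in the worst case. That cost is what motivates the hash tables of the next section.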

## 10.2 Hash Tables

In this section, we introduce one of the most practical data structures for implementing a map, and the one that is used by Python's own implementation of the `dict` class. This structure is known as a ***hash table***.

The novel concept for a hash table is the use of a **hash function** to map general keys to corresponding indices in a table. Ideally, keys will be well distributed in the range from $0$ to $N-1$ by a hash function, but in practice there may be two or more distinct keys that get mapped to the same index. As a result, we will conceptualize our table as a ***bucket array***, in which each bucket may manage a collection of items that are sent to a specific index by the hash function.

### 10.2.1 Hash Functions

The goal of a ***hash function***, $h$, is to map each key $k$ to an integer in the range $[0,N-1]$, where $N$ is the capacity of the bucket array for a hash table. Equipped with such a hash function, $h$, the main idea of this approach is to use the hash function value, $h(k)$, as an index into our bucket array, $A$, instead of the key $k$. That is, we store item $(k,v)$ in the bucket $A[h(k)]$.

If there are two or more keys with the same hash value, then two different items will be mapped to the same bucket in $A$. In this case, we say that a ***collision*** has occurred. To be sure, there are ways of dealing with collisions, which we will discuss later, but the best strategy is to try to avoid them in the first place. We say that a hash function is "good" if it maps the keys in our map so as to sufficiently minimize collisions. For practical reasons, we also would like a hash function to be fast and easy to compute.
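
The bucket array and a collision can be made concrete with a small sketch. It assumes, purely for illustration, that $h(k)$ is Python's built-in `hash` reduced modulo $N$, and it relies on the CPython behavior that small nonnegative integers hash to themselves; the keys and the capacity `N = 8` are arbitrary choices.

```python
N = 8                                   # capacity of the bucket array (arbitrary)
A = [[] for _ in range(N)]              # each bucket is a list of (k, v) items

def h(k):
    """Assumed hash function for this sketch: built-in hash reduced modulo N."""
    return hash(k) % N

for k, v in [(3, 'three'), (11, 'eleven'), (5, 'five')]:
    A[h(k)].append((k, v))              # store item (k, v) in bucket A[h(k)]

# Keys 3 and 11 differ, yet both land at index 3: a collision.
print(h(3), h(11), h(5))
print(A[3])
print(A[5])
```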

It is common to view the evaluation of a hash function, $h(k)$, as consisting of two portions: a ***hash code*** that maps a key $k$ to an integer, and a ***compression function*** that maps the hash code to an integer within a range of indices, $[0, N-1]$, for a bucket array.

The advantage of separating the hash function into two such components is that the hash code portion of that computation is independent of a specific hash table size. This allows the development of a general hash code for each object that can be used for a hash table of any size; only the compression function depends upon the table size. This is particularly convenient because the underlying bucket array for a hash table may be dynamically resized, depending on the number of items currently stored in the map.
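
A minimal sketch of this two-part split follows. The function names are hypothetical, and the compression step shown (remainder modulo the capacity) is just one simple choice assumed here for illustration. Note that the hash code is computed once and then reused unchanged for tables of different sizes.

```python
def compute_hash_code(key):
    """Hash code: any integer, possibly negative; independent of table size."""
    return hash(key)

def compress(code, capacity):
    """Compression: map a hash code into [0, capacity - 1].

    Python's % returns a nonnegative result for a positive capacity,
    even when the hash code is negative.
    """
    return code % capacity

def bucket_index(key, capacity):
    return compress(compute_hash_code(key), capacity)

code = compute_hash_code(-7)      # CPython gives -7 for this small integer
small = compress(code, 8)         # index in a table of capacity 8
large = compress(code, 16)        # same hash code after resizing to capacity 16

print(code, small, large)
```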

#### Hash Codes

The first action that a hash function performs is to take an arbitrary key $k$ in our map and compute an integer that is called the ***hash code*** for $k$; this integer need not be in the range $[0, N-1]$, and may even be negative. We desire that the set of hash codes assigned to our keys should avoid collisions as much as possible. For if the hash codes of our keys cause collisions, then there is no hope for our compression function to avoid them. In this subsection, we begin by discussing the theory of hash codes. Following that, we discuss practical implementations of hash codes in Python.

**Treating the Bit Representation as an Integer**

For any type whose bit representation is no longer than the desired hash code, we can simply interpret those bits as an integer and use it as the hash code. For a type whose bit representation is longer than the desired hash code, this scheme is not directly applicable. For example, suppose we want 32-bit hash codes (the width used by 32-bit builds of CPython; 64-bit builds use 64-bit hash codes). If a floating-point number uses a 64-bit representation, its bits cannot be viewed directly as a hash code. One possibility is to use only the high-order 32 bits (or the low-order 32 bits). This hash code, of course, ignores half of the information present in the original key, and if many of the keys in our map differ only in the ignored bits, then they will collide using this simple hash code.

A better approach is to combine in some way the high-order and low-order portions of a 64-bit key to form a 32-bit hash code, which takes all the original bits into consideration. A simple implementation is to add the two components as 32-bit numbers (ignoring overflow), or to take the exclusive-or of the two components. These approaches of combining components can be extended to any object $x$ whose binary representation can be viewed as an $n$-tuple $(x_0, x_1, \ldots, x_{n-1})$ of 32-bit integers, for example, by forming a hash code for $x$ as $\sum_{i=0}^{n-1} x_{i}$ or as $x_0 \oplus x_1 \oplus \cdots \oplus x_{n-1}$, where the $\oplus$ symbol represents the bitwise exclusive-or operation (which is `^` in Python).
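
The difference between discarding half the bits and combining both halves can be sketched as follows. The helper names are hypothetical; the sketch uses the standard `struct` module to obtain the 64-bit representation of a Python `float`, and compares `1.0` with the next representable value, which differs from it only in the lowest-order bit.

```python
import struct

MASK32 = (1 << 32) - 1

def float_bits(x):
    """Return the 64-bit IEEE 754 representation of x as an unsigned integer."""
    return struct.unpack('>Q', struct.pack('>d', x))[0]

def high_only(x):
    """Hash code that keeps only the high-order 32 bits."""
    return float_bits(x) >> 32

def fold_sum(x):
    """Add the two 32-bit halves, ignoring overflow."""
    bits = float_bits(x)
    return ((bits >> 32) + (bits & MASK32)) & MASK32

def fold_xor(x):
    """Exclusive-or of the two 32-bit halves."""
    bits = float_bits(x)
    return (bits >> 32) ^ (bits & MASK32)

a = 1.0
b = 1.0 + 2**-52      # next float after 1.0: only the lowest bit differs

print(hex(float_bits(a)), hex(float_bits(b)))
print(high_only(a) == high_only(b))   # the ignored low bits cause a collision
print(fold_xor(a) == fold_xor(b))     # folding sees the difference
```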

**Polynomial Hash Codes**

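The summation and exclusive-or hash codes are poor choices for character strings and other variable-length objects in which the order of the components $x_i$ matters, because both are unaffected by reordering. Summing the character codes of a string, for example, produces lots of unwanted collisions for common groups of strings: `"stop"`, `"tops"`, `"pots"`, and `"spot"` all receive the same hash code. A better hash code should take the positions of the $x_i$ into account. An alternative hash code, which does exactly this, is to choose a nonzero constant, $a \neq 1$, and use as a hash code the value: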

$$ x_0 a^{n-1} + x_1 a^{n-2} + \cdots +x_{n-2}a + x_{n-1} $$

Mathematically speaking, this is simply a polynomial in $a$ that takes the components $(x_0, x_1, \cdots, x_{n-1})$ of an object $x$ as its coefficients. This hash code is therefore called a ***polynomial hash code***. We should choose the constant $a$ so that it has some nonzero, low-order bits, which will serve to preserve some of the information content even as we are in an overflow situation.
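
The polynomial above can be evaluated with Horner's rule, one multiplication and one addition per component. The sketch below is illustrative only: the function names, the constant `a = 33`, and the 32-bit truncation are assumptions made for the example, not values fixed by the text.

```python
def sum_hash(s):
    """Order-insensitive hash code: the sum of the character codes."""
    return sum(ord(c) for c in s)

def polynomial_hash(s, a=33, bits=32):
    """Evaluate x_0*a^(n-1) + ... + x_(n-1) by Horner's rule, truncated to `bits` bits."""
    mask = (1 << bits) - 1
    h = 0
    for c in s:
        h = (h * a + ord(c)) & mask    # discard overflow beyond `bits` bits
    return h

for word in ("stop", "tops", "pots", "spot"):
    print(word, sum_hash(word), polynomial_hash(word))
```

All four words share one summation hash code, while the polynomial hash codes differ because each character's contribution depends on its position.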

**Cyclic-Shift Hash Codes**

A variant of the polynomial hash code replaces multiplication by $a$ with a cyclic shift of a partial sum by a certain number of bits. It accomplishes the goal of varying the bits of the calculation. In Python, a cyclic shift of bits can be accomplished through careful use of the bitwise operations `<<` and `>>`, taking care to truncate results to 32-bit integers.

In [1]:
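def hash_code(s):
    """Return a cyclic-shift hash code for string s."""
    mask = (1 << 32) - 1  # limit to 32-bit integers
    h = 0
    for character in s:
        h = ((h << 5) & mask) | (h >> 27)  # 5-bit cyclic shift of running sum
        h += ord(character)  # add in value of next character
    return h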

In [10]:
mask = (1 << 32) - 1

In [11]:
(1 << 5) & mask

32

As with the traditional polynomial hash code, fine-tuning is required when using a cyclic-shift hash code, as we must wisely choose the amount to shift by for each new character.
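
To experiment with the shift amount, the function can be parameterized. The following is a sketch under stated assumptions: the name `cyclic_shift_hash` and the `shift` and `bits` parameters are hypothetical, and unlike the listing above it also masks after the addition so the result always fits in `bits` bits. A shift of 0 degenerates into a plain sum of character codes, which cannot tell anagrams apart.

```python
def cyclic_shift_hash(s, shift=5, bits=32):
    """Cyclic-shift hash code with a configurable shift amount (0 <= shift < bits)."""
    mask = (1 << bits) - 1
    h = 0
    for ch in s:
        # rotate the running value left by `shift` bits within a `bits`-bit word
        h = ((h << shift) & mask) | (h >> (bits - shift))
        h = (h + ord(ch)) & mask       # add the next character, keep `bits` bits
    return h

for shift in (0, 5):
    codes = [cyclic_shift_hash(w, shift) for w in ("stop", "pots")]
    print(shift, codes, "collision" if codes[0] == codes[1] else "distinct")
```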

**Hash Codes in Python**

The standard mechanism for computing hash codes in Python is a built-in function with signature `hash(x)` that returns an integer value that serves as the hash code for object `x`. However, only ***immutable*** data types are deemed hashable in Python. This restriction is meant to ensure that a particular object's hash code remains constant during that object's lifespan. This is an important property for an object's use as a key in a hash table. A problem could occur if a key were inserted into the hash table, yet a later search were performed for that key based on a different hash code than that which it had when inserted; the wrong bucket would be searched.
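
The failure described above can be reproduced with a deliberately flawed class. `MutableKey` is a hypothetical example whose hash code depends on an attribute that can change after the object has been used as a key.

```python
class MutableKey:
    """Hypothetical key whose hash code depends on a mutable attribute (a bad idea)."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, MutableKey) and self.value == other.value

    def __hash__(self):
        return hash(self.value)

k = MutableKey(1)
table = {k: 'stored'}

found_before = k in table     # the key is found while its hash code is unchanged
k.value = 2                   # mutating the key changes its hash code
found_after = k in table      # the search now uses the new hash code and misses

print(found_before, found_after, len(table))
```

The item is still stored in the dictionary, as `len(table)` shows, but it can no longer be reached through its own key.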

Among Python's built-in data types, the immutable `int`, `float`, `str`, `tuple`, and `frozenset` classes produce robust hash codes, via the `hash` function, using techniques similar to those discussed earlier in this section. Hash codes for character strings are well crafted based on a technique similar to polynomial hash codes, except using exclusive-or computations rather than additions. Hash codes for tuples are computed with a similar technique based upon a combination of the hash codes of the individual elements of the tuple. When hashing a `frozenset`, the order of the elements should be irrelevant, and so a natural option is to compute the exclusive-or of the individual hash codes without any shifting. If `hash(x)` is called for an instance `x` of a mutable type, such as a `list`, a `TypeError` is raised.

In [14]:
x = [1,2,3]
hash(x)

TypeError: unhashable type: 'list'

In [16]:
x = (1,2,3)
hash(x)

529344067295497451
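
The order-insensitivity of `frozenset` hashing mentioned above can be checked directly. This is a small sketch; it prints no specific hash values, since those can vary between Python versions and platforms.

```python
a = frozenset([1, 2, 3])
b = frozenset([3, 2, 1])

# Equal sets must hash equally, whatever order the elements were given in.
same_set_hash = hash(a) == hash(b)

# Tuples are ordered, so these two are different keys.
t1, t2 = (1, 2, 3), (3, 2, 1)
tuples_equal = t1 == t2

# A mutable list has no hash code at all.
try:
    hash([1, 2, 3])
    list_hashable = True
except TypeError:
    list_hashable = False

print(same_set_hash, tuples_equal, list_hashable)
```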

Instances of user-defined classes are hashable by default, with a hash code derived from the object's identity rather than its contents. However, a class that defines `__eq__` without also defining `__hash__` becomes unhashable, and the `hash` function raises a `TypeError` for its instances. In either case, a function that computes hash codes can be implemented in the form of a special method named `__hash__` within a class. The returned hash code should reflect the immutable attributes of an instance. It is common to return a hash code that is itself based on the computed hash of the combination of such attributes. For example, a `Color` class that maintains three numeric red, green, and blue components might implement the method as:

```
def __hash__(self):
    return hash((self._red, self._green, self._blue))  # hash combined tuple
```
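
For context, here is one way the surrounding class might look. This is a sketch, not the original author's class: everything other than the `__hash__` body (the constructor, `__slots__`, and `__eq__`) is assumed for illustration. The `__eq__` method compares the same three attributes that `__hash__` combines, so two equal colors always share a hash code.

```python
class Color:
    """Hypothetical RGB color; components are treated as fixed after construction."""

    __slots__ = '_red', '_green', '_blue'

    def __init__(self, red, green, blue):
        self._red = red
        self._green = green
        self._blue = blue

    def __eq__(self, other):
        if not isinstance(other, Color):
            return NotImplemented
        return ((self._red, self._green, self._blue) ==
                (other._red, other._green, other._blue))

    def __hash__(self):
        return hash((self._red, self._green, self._blue))  # hash combined tuple

names = {Color(64, 224, 208): 'turquoise'}

# A separate but equal instance finds the same entry.
lookup = names[Color(64, 224, 208)]
print(lookup)
```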

An important rule to obey is that if a class defines equivalence through `__eq__`, then any implementation of `__hash__` must be consistent, in that if `x == y`, then `hash(x) == hash(y)`. This is important because if two instances are considered to be equivalent and one is used as a key in a hash table, a search for the second instance should result in the discovery of the first. It is therefore important that the hash code for the second match the hash code for the first, so that the proper bucket is examined. This rule extends to any well-defined comparisons between objects of different classes. For example, since Python treats the expression `5 == 5.0` as true, it ensures that `hash(5)` and `hash(5.0)` are the same.

In [17]:
hash(5)

5

In [18]:
hash(5.0)

5

In [21]:
hash(3.1415) == hash(3.14159)

False

In [22]:
hash(3.1415) == hash(3.1415000)

True