# Fundamentals of Computer Science 30398 - Lecture 14

In this lecture we will discuss a `Dictionary` data structure. Often, when processing large quantities of data, we might want to access the data associated with a particular identifier, which is not necessarily just an integer index. For example, we might be managing a long list with the names of runners in a particular marathon and the time each of them took to complete the run, like this:

In [95]:
results = [ ("Jim", 5.1), ("John", 5.8), ("Jane", 4.2)] # and so on...

At various stages in the code, we might want to look up the time a particular person got, given their name, or update their record. To this end, we would like to store the data in some data structure such that we can efficiently implement the following three operations:
```python
dataset = Dictionary() # we create a new dictionary
dataset.set("Jim", 5.1) # we would like to set the value associated to the key "Jim" to 5.1
dataset.get("Jim") # should return 5.1
```

The "keys" for us are strings, and the value could be any Python value. Of course, the simplest way to implement a data structure like that is to just internally store a long list as above, with pairs `(key, value)`. To add a new pair `(key, value)` we just append it at the end of the list. If we want to get the value associated with a specific key, we iterate over the list and check whether the key at a given element is equal to the key we are looking for; if so, we return the associated value:

In [142]:
class DictionaryList:
    def __init__(self):
        self.buffer = []

    def add(self, key, value):
        self.buffer.append( (key, value) )

    def get(self, key):
        for k, v in self.buffer:
            if k == key:
                return v
                
    def set(self, key, value):
        for i in range(len(self.buffer)):
            if key == self.buffer[i][0]:
                self.buffer[i] = (key, value)
                return
        self.add(key, value)
        

Let's check that it works:

In [143]:
dataset = DictionaryList()

In [144]:
dataset.add("Jim", 12)
dataset.add("John", 15)

In [145]:
dataset.get("Jim")

12

In [146]:
dataset.get("John")

15

In [147]:
dataset.get("Jane")

In [148]:
dataset.set("Jim", 17)

In [149]:
dataset.get("Jim")

17

Sounds good. The benefit of this solution is its simplicity: we can very easily implement it. The downside is that it is extremely inefficient: to find the value associated with a specific key, we may need to iterate over the entire list, which takes time $\Theta(n)$ in the worst case. If the database is large, and we need to access elements with specific keys repeatedly over the course of the program, this becomes prohibitively expensive.

### Better idea - Hash Map

We will discuss here a different way of implementing the Dictionary interface: a data structure supporting `get(key)` and `set(key, value)` as above, but such that both of those operations typically run in time $O(1)$. To do this we need the concept of "hashing":

We will attempt to write a simple function that takes a string as input and produces a number in some large range (here between $0$ and $2^{30} - 1$) that depends on this string, by "scrambling" the string.

The key property of the hash function is that (being a function in the mathematical sense) if we evaluate it on the same string twice, we always get the same hash; on the other hand, for two different strings their hashes should look unrelated. There is a large amount of theory on how to produce good hashes for various applications (including cryptographic hash functions), but in many cases quite simple scrambling will do the job.

First of all, the Python built-in function `ord` takes a character as input and produces its Unicode code point (for ASCII characters, a number between 0 and 127 equal to the ASCII code of this character):

In [103]:
ord('A')

65

**Exercise**
Write a function `string_hash` that takes a string `key` as an argument, and computes the hash of the string. The value of the hash should be:
$$\sum_{i=0}^{n-1} \mathrm{ord}(\mathrm{key}[i]) \, p^{n-1-i} \pmod{2^{30}}$$

where $n$ is the length of `key` and $p$ is some fixed prime (for example $p=181$).

In [66]:
def string_hash(key):
    p = 181
    result = 0
    for c in key:
        result = (p*result + ord(c)) % (2**30)
    return result

In [104]:
string_hash("Jim")

2443428

In [105]:
string_hash("John")

442456239

In [106]:
string_hash("Jim")

2443428
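As a sanity check, the loop in `string_hash` evaluates the polynomial from the exercise by Horner's rule. For `"Jim"`, whose characters have codes $74$, $105$ and $109$, and with $p = 181$, this gives:

$$
(74 \cdot 181 + 105) \cdot 181 + 109 = 74 \cdot 181^2 + 105 \cdot 181 + 109 = 2443428
$$

This value is below $2^{30}$, so the reduction modulo $2^{30}$ leaves it unchanged, in agreement with the output above.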

**Remark** Of course, since the number of strings of length $10$ is already astronomically larger than $2^{30}$, there will always exist pairs of different short strings that hash to the same value. The key property of a good hash function is that this (a so-called "hash collision") shouldn't happen much more often than it would by chance.
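To make the notion of a collision concrete, here is a small sketch with two strings constructed by hand so that they collide under the `string_hash` from the exercise. The particular strings (which use a non-ASCII and a control character) are an illustrative choice, not something a typical dataset would contain.

```python
def string_hash(key):
    # Same polynomial hash as in the exercise above.
    p = 181
    result = 0
    for c in key:
        result = (p * result + ord(c)) % (2**30)
    return result

# Hand-picked two-character strings: the first characters differ by 1,
# the second characters differ by exactly p = 181, so
# 65*181 + 200 == 66*181 + 19.
first = "A" + chr(200)
second = "B" + chr(19)

print(first != second)                           # True
print(string_hash(first), string_hash(second))   # 11965 11965
```

A collision like this one is adversarially constructed; it does not by itself say how often collisions occur for ordinary keys.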

Once we have a hash function associated with the key datatype we can try to implement a `Dictionary` as a `HashMap`.

Let's say that we already know some upper bound `capacity` on how many elements will eventually be stored in our Dictionary. The way the hash map works is that during the initialization, we create a list of length `m = 2*capacity`, each element of which is initially an empty list that we will refer to as a "bucket".

Whenever we want to add a pair `(key, value)` to the data structure, we first calculate the hash of the key, and look at the remainder from dividing the hash by `m` (so that we get a number between `0` and `m-1`). This will be the index of the bucket we want to insert the new pair into.
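For instance, the bucket index computation can be sketched as follows. The value `capacity = 50` is just an assumed example; `string_hash` is repeated so that the snippet runs on its own.

```python
def string_hash(key):
    # Same polynomial hash as in the exercise above.
    p = 181
    result = 0
    for c in key:
        result = (p * result + ord(c)) % (2**30)
    return result

capacity = 50        # assumed upper bound on the number of stored elements
m = 2 * capacity     # number of buckets

for name in ["Jim", "John"]:
    # The remainder modulo m is the index of the bucket for this key.
    print(name, string_hash(name) % m)
# Jim 28
# John 39
```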

Each bucket, similarly to the previous implementation, will store a list of pairs `(key, value)` --- but crucially, in each bucket we will have only the keys that hash to the bucket index. To get an element with a given key, we just iterate over all pairs `(k, v)` in the appropriate bucket, and for each pair we check if `k` is equal to the `key` we are looking for --- in which case we found the right value.

The crucial observation is that if the hash function is reasonable, then, since the number of buckets is larger than the number of elements we want to store, the vast majority of the buckets will have at most a few elements, so iterating over a bucket is going to take constant time.

Indeed, although we will not discuss the details here, if we had access to a "perfect" (truly random) hash function, then not only would almost all buckets have constant size (making the vast majority of operations run in time $O(1)$), but with high probability the largest bucket would also have size at most $O(\frac{\log n}{\log \log n})$, so even the slowest access is fast.
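One way to get a feeling for this is to count how many keys land in each bucket. The sketch below does so for `n` made-up keys (`"runner0"`, `"runner1"`, ...) and `2n` buckets, using the simple `string_hash` from above rather than a perfect hash function, so how evenly the keys spread depends on both the hash and the keys.

```python
from collections import Counter

def string_hash(key):
    # Same polynomial hash as in the exercise above.
    p = 181
    result = 0
    for c in key:
        result = (p * result + ord(c)) % (2**30)
    return result

n = 1000
keys = ["runner" + str(i) for i in range(n)]   # hypothetical keys
m = 2 * n                                      # number of buckets

# bucket_sizes[j] is the number of keys that fall into bucket j.
bucket_sizes = [0] * m
for key in keys:
    bucket_sizes[string_hash(key) % m] += 1

# histogram[s] is the number of buckets that hold exactly s keys.
histogram = Counter(bucket_sizes)
for size in sorted(histogram):
    print("buckets with", size, "keys:", histogram[size])
print("largest bucket:", max(bucket_sizes))
```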

Let's try to write it:

In [150]:
class Dictionary:
    def __init__(self, capacity):
        self.buffer = [ [] for i in range(2*capacity) ]

    def add(self, key, value):
        key_hash = string_hash(key)
        bucket_index = key_hash % len(self.buffer)
        self.buffer[bucket_index].append( (key, value) )

    def get(self, key):
        key_hash =  string_hash(key)
        bucket_index = key_hash % len(self.buffer)
        for k, v in self.buffer[bucket_index]:
            if key == k:
                return v
                
    def set(self, key, value):
        key_hash = string_hash(key)
        bucket_index = key_hash % len(self.buffer)
        bucket = self.buffer[bucket_index]
        
        for i in range(len(bucket)):
            if bucket[i][0] == key:
                bucket[i] = (key, value)
                return
        self.add(key, value)

In [151]:
dataset = Dictionary(capacity = 50)

dataset.add("Jim", 12)
dataset.add("John", 17)

In [152]:
dataset.get("Jim")

12

In [153]:
dataset.get("John")

17

In [154]:
dataset.set("John", 33)

In [155]:
dataset.get("John")

33

### A bit of generalization --- arbitrary key type
One small improvement: it was never important that the keys in the dictionary data structure are strings. As long as we can provide a hash function for a specific type, we can use it as a key in the dictionary. To make our data structure a bit more general, we can pass the hash function for whatever the key type is going to be as an argument when initializing the dictionary.

In [156]:
class Dictionary:
    def __init__(self, capacity, hash_function):
        self.hash_function = hash_function
        self.buffer = [ [] for i in range(2*capacity) ]

    def add(self, key, value):
        key_hash = self.hash_function(key)
        bucket_index = key_hash % len(self.buffer)
        self.buffer[bucket_index].append( (key, value) )

    def get(self, key):
        key_hash = self.hash_function(key)
        bucket_index = key_hash % len(self.buffer)
        for k, v in self.buffer[bucket_index]:
            if key == k:
                return v
                
    def set(self, key, value):
        key_hash = self.hash_function(key)
        bucket_index = key_hash % len(self.buffer)
        bucket = self.buffer[bucket_index]
        
        for i in range(len(bucket)):
            if bucket[i][0] == key:
                bucket[i] = (key, value)
                return
        self.add(key, value)

Let's see if it still works:

In [157]:
dataset = Dictionary(capacity = 50, hash_function=string_hash)

dataset.add("Jim", 12)
dataset.add("John", 17)

In [158]:
dataset.get("Jim")

12

In [159]:
dataset.set("John", 33)

In [160]:
dataset.get("John")

33

Looks fine.
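To illustrate a key type other than strings, here is a sketch that uses pairs of integers as keys. The function `pair_hash` is a hypothetical hash chosen for this example, and the class is a compact copy of the `Dictionary` above (with a small helper `_bucket` added) so that the snippet runs on its own.

```python
class Dictionary:
    # Compact copy of the class above; _bucket is a helper added here.
    def __init__(self, capacity, hash_function):
        self.hash_function = hash_function
        self.buffer = [[] for _ in range(2 * capacity)]

    def _bucket(self, key):
        return self.buffer[self.hash_function(key) % len(self.buffer)]

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v

    def set(self, key, value):
        bucket = self._bucket(key)
        for i in range(len(bucket)):
            if bucket[i][0] == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))


def pair_hash(pair):
    # Hypothetical hash for pairs of integers (x, y).
    x, y = pair
    return (181 * x + y) % (2**30)


grid = Dictionary(capacity=50, hash_function=pair_hash)
grid.set((2, 3), "start")
grid.set((7, 1), "finish")
grid.set((2, 3), "checkpoint")   # overwrites the value stored for (2, 3)

print(grid.get((2, 3)))   # checkpoint
print(grid.get((7, 1)))   # finish
print(grid.get((0, 0)))   # None
```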

### Exercise
Using the just-implemented `Dictionary`, write a function `calc_occurences(lst)` that takes a list of strings and returns a list:
for each $i$, the $i$-th entry of the result should be the number of times the string `lst[i]` occurred before position `i`.

**Example**
```python
calc_occurences(["Jim", "John", "Jim", "Jim", "Frank"]) == [ 0, 0, 1, 2, 0 ]
```
**Solution**

In [64]:
def calc_occurences(lst):
    # The generalized Dictionary also needs the hash function for its keys.
    my_dict = Dictionary(len(lst), string_hash)
    result = []
    for x in lst:
        count = my_dict.get(x)
        if count is None:
            count = 0
        result.append(count)
        my_dict.set(x, count + 1)
    return result

## Python built-in dictionaries

As it turns out, we do not need to implement all of this on our own. Python already provides built-in dictionaries that implement essentially the same data structure.

To create a new empty dictionary we can write either one of the following

In [70]:
my_dictionary = {}

In [71]:
my_dictionary = dict()

Now to set the value associated with a specific key (as we did using `set` method in our own implementations), we can just use the indexing syntax, similar to setting the $i$-th element of a list:

In [124]:
my_dictionary["Jim"] = 17

In [125]:
my_dictionary["John"] = 12

To get an element associated with a specific key, we use the same syntax again.

In [126]:
my_dictionary["Jim"]

17

Importantly, those operations can be assumed to be quite fast on average, as we discussed above (although still several times slower than just accessing the $i$-th element of a list).

The `my_dictionary = {}` syntax to create a new empty dictionary is just a special case of a syntax to create a dictionary with some constant number of elements already inserted:

In [127]:
my_dictionary = { "John": 17, 
                  "Jim" : [1,2,3,4],
                  "Jane" : 0.14 }

In [128]:
my_dictionary["John"]

17

In [130]:
my_dictionary["Kim"] = [1,2,3]

In [131]:
my_dictionary["Kim"]

[1, 2, 3]

If we try to access an element corresponding to the key that is not present in the dictionary, we will see an error:

In [132]:
my_dictionary["Frank"]

KeyError: 'Frank'

As an alternative we can use a method `get` which will just return `None` in this case:

In [133]:
print(my_dictionary.get("Frank"))

None


In fact, the method `get` accepts an additional optional argument `default`: the value it should return if the `key` is not present in our dictionary:

In [134]:
? my_dictionary.get

Signature:  my_dictionary.get(key, default=None, /)
Docstring: Return the value for key if key is in the dictionary, else default.
Type:      builtin_function_or_method

In [135]:
my_dictionary.get("key", 0)

0

We can rewrite the `calc_occurences` function using built-in dictionaries instead of our custom ones.

In [136]:
def calc_occurences(lst):
    my_dict = {}
    result = []
    for x in lst:
        count = my_dict.get(x, 0)
        result.append(count)
        my_dict[x] = count + 1
    return result

In [137]:
calc_occurences(["Jim", "John", "Jim", "Jim", "Frank"])

[0, 0, 1, 2, 0]

Additionally, the built-in dictionaries support iterating over all keys, all values, and all pairs `key, value` in the dictionary as follows:

In [138]:
for x in my_dictionary:
    print(x)

John
Jim
Jane
Kim


In [139]:
for x in my_dictionary.values():
    print(x)

17
[1, 2, 3, 4]
0.14
[1, 2, 3]


In [140]:
for k, v in my_dictionary.items():
    print("Key:", k, "value:", v)

Key: John value: 17
Key: Jim value: [1, 2, 3, 4]
Key: Jane value: 0.14
Key: Kim value: [1, 2, 3]


Of course, we can combine it with list comprehension syntax:

In [141]:
[ v for v in my_dictionary.values() ]

[17, [1, 2, 3, 4], 0.14, [1, 2, 3]]

### Dynamic size of a hash-map - amortized constant time

In our own implementation of the `HashMap` we asked a user to provide a `capacity` during construction of a dictionary. We don't need to do this with python dictionaries. How does it work?

Essentially, when we create an empty Python dictionary, we get one with some fixed capacity (e.g. $10$). When we insert an element such that the number of elements becomes larger than the capacity, the dictionary doubles its capacity: it allocates an entirely new buffer that is $2$ times larger, rehashes all elements, and adds all of them to the new buffer.

This means that any particular insertion operation could potentially be extremely slow: if we happen to trigger resizing of the buffer, the dictionary will have to iterate over all elements and move them to the new buffer, spending time $\Omega(n)$. Luckily this does not happen too often: a sequence of $n$ insertions to a dynamic hash map like this still takes time $O(n)$ in total, so on average $O(1)$ time per insertion.

Indeed, imagine that we add $2^k$ elements to a dynamic hash map like that. The total amount of time we spend on resizing the buffer is proportional to
$$
2^0 + 2^1 + 2^2 + \ldots + 2^k = \sum_{j \leq k} 2^j = 2^{k+1} - 1
$$
That is less than $2$ times the number of elements inserted.
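The same bound can be checked with a small simulation. The sketch below does not build a hash map at all; it only counts how many elements would be moved to a new buffer over `n` insertions, assuming the capacity starts at $1$ and doubles whenever an insertion would exceed it. The function name `count_moves` is made up for this illustration.

```python
def count_moves(n, initial_capacity=1):
    # Count the elements moved during resizes over n insertions.
    capacity = initial_capacity
    size = 0
    moves = 0
    for _ in range(n):
        if size == capacity:
            # The buffer is full: move every stored element to a new
            # buffer that is twice as large.
            moves += size
            capacity *= 2
        size += 1
    return moves

for n in [16, 1024, 1000000]:
    print(n, count_moves(n))
# 16 15
# 1024 1023
# 1000000 1048575
```

In each case the number of moves stays below $2n$, so the resizing work averages out to a constant per insertion.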

**Exercise**
Try to implement such a dynamic resizing in the class `Dictionary` you wrote above.
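One possible sketch of a solution is given below. The class name `DynamicDictionary`, the initial number of buckets and the rule "resize when there are fewer than twice as many buckets as elements" are choices made for this illustration; this is not how Python's built-in `dict` is actually implemented.

```python
def string_hash(key):
    # Same polynomial hash as in the exercise above.
    p = 181
    result = 0
    for c in key:
        result = (p * result + ord(c)) % (2**30)
    return result


class DynamicDictionary:
    def __init__(self, hash_function, initial_buckets=8):
        self.hash_function = hash_function
        self.buffer = [[] for _ in range(initial_buckets)]
        self.size = 0   # number of stored (key, value) pairs

    def _bucket(self, key):
        return self.buffer[self.hash_function(key) % len(self.buffer)]

    def _resize(self):
        # Allocate twice as many buckets and rehash every stored pair.
        old_pairs = [pair for bucket in self.buffer for pair in bucket]
        self.buffer = [[] for _ in range(2 * len(self.buffer))]
        for k, v in old_pairs:
            self._bucket(k).append((k, v))

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v

    def set(self, key, value):
        bucket = self._bucket(key)
        for i in range(len(bucket)):
            if bucket[i][0] == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1
        # Keep at least twice as many buckets as stored elements.
        if 2 * self.size > len(self.buffer):
            self._resize()


d = DynamicDictionary(string_hash)
for i in range(1000):
    d.set("runner" + str(i), i)

print(d.get("runner0"), d.get("runner999"), d.get("nobody"))   # 0 999 None
print(len(d.buffer))                                           # 2048
```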

### Sets

Python provides an additional data structure, `set()`, which is implemented similarly to the dictionary, using a hash map. As opposed to a dictionary, it does not provide a mapping between keys and values; it is just used to store a set of keys. As opposed to a list, the keys are not guaranteed to be stored in any particular order, but on the other hand, we can check efficiently whether a specific key is in the set (typically in time $O(1)$).

**Example**

We could use a list to store a set of elements:

In [164]:
A = [1, 3, 6, 10, 2]

And to check if an element is in the list we can write:

In [165]:
3 in A

True

In [166]:
4 in A

False

Unfortunately, if `A` is a long list, each innocuous instruction `3 in A` internally involves iterating over the entire list `A`: it takes linear time, which might be prohibitively inefficient. What Python is doing internally is essentially:

In [168]:
def element_in_list(key, lst):
    for x in lst:
        if x == key:
            return True
    return False

Using sets we can do instead:

In [169]:
A = set([1,3,6,10])

In [162]:
3 in A

True

In [173]:
15 in A

False

In [172]:
A.add(15)

In [175]:
15 in A

True

On the other hand, if we iterate over all elements in the set, they are not guaranteed to be in any particular order.

In [176]:
for k in A:
    print(k)

1
3
6
10
15
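As a small illustration of where fast membership tests pay off, the sketch below checks whether a list contains a repeated element by remembering the elements seen so far in a set. The function name `has_duplicates` is made up for this example.

```python
def has_duplicates(lst):
    seen = set()
    for x in lst:
        # Membership test in a set: typically O(1), unlike `x in some_list`.
        if x in seen:
            return True
        seen.add(x)
    return False

print(has_duplicates(["Jim", "John", "Jane"]))   # False
print(has_duplicates(["Jim", "John", "Jim"]))    # True
```

With a list in place of the set, each `x in seen` test could take linear time, making the whole function quadratic in the worst case.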
