# Playing Around with HashTables
Here we build a hash table in pure Python. This is mainly a learning exercise, and the goal is to recreate many of the techniques used by the standard Python dictionary.  
So a few things to duplicate:
- Closed hashing (also called open addressing) for dealing with collisions; see the following Stack Overflow answers:
  + [how-python-dict-stores-key-value-when-collision-occurs](https://stackoverflow.com/questions/21595048/how-python-dict-stores-key-value-when-collision-occurs)
  + [why-can-a-python-dict-have-multiple-keys-with-the-same-hash](https://stackoverflow.com/questions/9010222/why-can-a-python-dict-have-multiple-keys-with-the-same-hash)
  + [why-is-early-return-slower-than-else](https://stackoverflow.com/questions/8271139/why-is-early-return-slower-than-else)
- Compact ordered storage; see the following links:
  + https://mail.python.org/pipermail/python-dev/2012-December/123028.html
  + [faster-more-memory-efficient-and-more](https://morepypy.blogspot.com/2015/01/faster-more-memory-efficient-and-more.html)
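
To make the compact ordered storage idea concrete, here is a minimal sketch that contrasts a classic sparse table with the compact layout described in the links above. The function names are my own, the keys are chosen so that no two of them land in the same slot, and collisions are deliberately ignored, so treat this as an illustration of the layout only.

```python
def classic_layout(pairs, size=8):
    """One sparse array that stores the full (hash, key, value) entries."""
    table = [None] * size
    for key, value in pairs:
        h = hash(key)
        table[h & (size - 1)] = (h, key, value)  # collisions ignored here
    return table


def compact_layout(pairs, size=8):
    """A sparse array of small indices plus a dense, ordered entry list."""
    indices = [None] * size
    entries = []
    for key, value in pairs:
        h = hash(key)
        indices[h & (size - 1)] = len(entries)  # collisions ignored here
        entries.append((h, key, value))
    return indices, entries


# Small ints hash to themselves, so the slots are easy to predict.
pairs = [(6, "x"), (1, "y"), (4, "z")]
print(classic_layout(pairs))
indices, entries = compact_layout(pairs)
print(indices)  # [None, 1, None, None, 2, None, 0, None]
print(entries)  # [(6, 6, 'x'), (1, 1, 'y'), (4, 4, 'z')]
```

The dense `entries` list keeps insertion order, while the sparse part only holds small integers instead of whole entries.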

Apparently the CPython implementation starts with only 8 slots. The table is resized once it is 2/3 full, and my understanding is that it doubles in size at that point.
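
As a quick sanity check of that rule, the sketch below lists the table sizes and the approximate number of entries each can hold before a resize. It assumes a start of 8 slots, a 2/3 load limit, and doubling on every resize; the doubling is my understanding rather than something verified against the CPython source, and `growth_schedule` is a hypothetical helper.

```python
def growth_schedule(start=8, steps=4):
    """Return (table_size, usable_entries) pairs.

    Assumes a 2/3 load limit and that the table doubles on each resize.
    """
    schedule = []
    size = start
    for _ in range(steps):
        schedule.append((size, (2 * size) // 3))
        size *= 2
    return schedule


for size, limit in growth_schedule():
    print(f"{size:>3} slots -> resize at about {limit} entries")
```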

The CPython implementation uses bit masking instead of the classic `hash % len(hash_table)` approach. Because the table size is always a power of two, `hash & (size - 1)` keeps only the low bits of the hash and yields the same slot as `hash % size`.
[This article](https://www.data-structures-in-practice.com/hash-tables/) has a great explanation of bit masking that helped me understand what was going on.  
From what I understand, using the bit mask is faster than division on modern CPUs.
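
A small check of that equivalence, as a sketch: for a power-of-two table size, masking and modulo pick the same slot, including for negative hash values. The helper names are only for illustration.

```python
def index_by_mask(hash_value, size):
    """Slot index via bit mask; only valid when size is a power of two."""
    return hash_value & (size - 1)


def index_by_mod(hash_value, size):
    """Slot index via the classic modulo approach."""
    return hash_value % size


size = 32  # power of two, so the mask is 31 (binary 11111)
for h in (5165093096324751164, -1062710959628165386, 7, 40):
    assert index_by_mask(h, size) == index_by_mod(h, size)
    print(h, "->", index_by_mask(h, size))
```

Note that the two only agree for power-of-two sizes: with a size of 31, for example, `h & 30` and `h % 31` generally differ.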

The following function mimics the C code ` size_t i = (size_t)hash & mask;` in the CPython dictionary implementation ([see here](https://github.com/python/cpython/blob/22415ad62555d79bd583b4a7d6a96006624a8277/Objects/dictobject.c#L867) for the code in the CPython repo).

In [8]:
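def bit_mask(i, j):
    # Note: `i & j` is already an int, so the bin()/int() round trip
    # is redundant; it is kept here because it is what the cells below time.
    return int(bin(i & j), 2)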

In [11]:
%%timeit
bit_mask(5165093096324751164, 31)
# Inlining the expression below is slightly faster, probably because it
# avoids the function call overhead:
# int(bin(5165093096324751164 & 31), 2)

341 ns ± 21.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [12]:
%%timeit
5165093096324751164 % 31

7.27 ns ± 0.487 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)


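This Python implementation is certainly not faster than just writing out the mod operator. The comparison is not quite fair, though: a mask of 31 corresponds to `% 32` rather than `% 31`, and CPython most likely folds the constant expression at compile time, so the second timing measures little more than loop overhead. But in the spirit of duplicating the CPython implementation, we will use the bit mask for now.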
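
For a somewhat fairer comparison, here is a sketch using the standard `timeit` module with the operands held in variables and a matching table size of 32 (mask 31). The loop count is arbitrary, and results will vary by machine and Python version, so no timings are claimed here.

```python
import timeit

h = 5165093096324751164
size = 32
mask = size - 1

# Both expressions pick the same slot when size is a power of two.
assert h & mask == h % size

# Operands live in variables, so neither expression is precomputed.
t_mask = timeit.timeit("h & mask", globals={"h": h, "mask": mask}, number=200_000)
t_mod = timeit.timeit("h % size", globals={"h": h, "size": size}, number=200_000)

print(f"mask: {t_mask:.4f} s for 200,000 runs")
print(f"mod:  {t_mod:.4f} s for 200,000 runs")
```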

In [13]:
class Hashtable:
    def __init__(self, size=8):
        # Array size
        self.size = size
        self.sparse_key = [None] * size
        self.data = []

    def _get_index(self, hash_value):
        return bit_mask(hash_value, self.size - 1)

    def add(self, key, value):
        """Add a key value pair to Hashtable"""
        # TODO: Handle collisions (growth is handled by _doublesize)
        if (len(self.data) + 1) >= ((2 * self.size) // 3):
            print("resizing...")
            self._doublesize()

        hsh = hash(key)
        idx = self._get_index(hsh)
        # Check that sparse array is empty, if it is, fill it!
        if self.sparse_key[idx] is None:
            self.sparse_key[idx] = len(self.data)
            self.data.append((hsh, key, value))

        # If it's not empty, check to see if the data is the same
        else:
            pos = self.sparse_key[idx]
            val = self.data[pos]
            # If it's the same key/hash just replace
            if (hsh == val[0]) and (key == val[1]):
                self.data[pos] = (hsh, key, value)
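            else:
                # Collision: a different key already occupies this slot.
                # _probe is still a stub, so this pair is NOT stored.
                self._probe()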

    def get(self, key):
        hsh = hash(key)
        idx = self._get_index(hsh)
        pos = self.sparse_key[idx]
        # If it's not none, make sure it's the
        # right value
        if pos is not None:
            val = self.data[pos]
            if (hsh == val[0]) and (key == val[1]):
                return val[2]
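            else:
                # Collision: _probe is still a stub, so this falls
                # through and returns None instead of the value.
                self._probe()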
        else:
            raise KeyError(f"The key {key} was not found.")

    def _doublesize(self):
        """Double the size of the table"""
        # TODO: Make more efficient after _probe method is complete
        self.size *= 2
        old_data = self.data
        self.data = []
        self.sparse_key = [None] * self.size
        for _, i, j in old_data:
            self.add(i, j)

    def __repr__(self):
        s = ""
        for _, i, j in self.data:
            s += f"{i}: {j}\n"
        return s

    def _probe(self):
        """Placeholder for collision resolution; not implemented yet."""
        print("You have been probed")
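
Since `_probe` is still a stub, here is a minimal sketch of how collision handling could work on the same two-array layout. `find_slot` is a hypothetical standalone helper, not part of the class above, and it uses plain linear probing as a simplification rather than whatever CPython actually does. It assumes the table always has at least one empty slot, which a 2/3 load limit would guarantee.

```python
def find_slot(sparse_key, data, hsh, key):
    """Return the slot holding `key`, or the first empty slot on its probe path.

    sparse_key maps slots to positions in data; data holds
    (hash, key, value) tuples. Assumes len(sparse_key) is a power of two
    and that at least one slot is empty, otherwise this loops forever.
    """
    mask = len(sparse_key) - 1
    idx = hsh & mask
    while True:
        pos = sparse_key[idx]
        if pos is None:
            return idx  # empty slot: key is absent, insert here
        stored_hash, stored_key, _ = data[pos]
        if stored_hash == hsh and stored_key == key:
            return idx  # found the existing key
        idx = (idx + 1) & mask  # collision: try the next slot


# Usage: insert two keys that collide in an 8-slot table (1 and 9).
sparse_key, data = [None] * 8, []
for key, value in [(1, "a"), (9, "b")]:
    hsh = hash(key)
    slot = find_slot(sparse_key, data, hsh, key)
    sparse_key[slot] = len(data)
    data.append((hsh, key, value))

slot = find_slot(sparse_key, data, hash(9), 9)
print(slot, data[sparse_key[slot]][2])  # 2 b
```

Both `add` and `get` could share a helper like this: `add` writes to the returned slot, and `get` raises `KeyError` when the returned slot is empty.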

In [14]:
ht = Hashtable()

In [15]:
hash("ahasd")

-1062710959628165386

In [16]:
ht.add("1", 1)
ht

1: 1

In [17]:
ht.add("20", 20)
ht.add("10", 5)
ht

1: 1
20: 20
10: 5

In [18]:
ht.add("1", 40)
ht

1: 40
20: 20
10: 5

In [19]:
ht.get("1")

40

In [20]:
ht.add("30", 30)

In [21]:
ht.add("50", 50)

resizing...


In [22]:
ht

1: 40
20: 20
10: 5
30: 30
50: 50

In [23]:
for i in range(30):
    ht.add(str(i), i)

You have been probed
You have been probed
resizing...
You have been probed
You have been probed
You have been probed
You have been probed
You have been probed
You have been probed
You have been probed
You have been probed
You have been probed
You have been probed
You have been probed
You have been probed
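
Each "You have been probed" line above is a collision that the stub does not resolve. As a sketch of where the next slot could come from, the generator below follows the probing recurrence as I understand it from the CPython source (`i = i*5 + perturb + 1`, with `perturb` shifted right by 5 each step). The name `probe_sequence` is hypothetical, and treating the hash as an unsigned 64-bit value is an assumption made here to mimic the C cast.

```python
from itertools import islice

PERTURB_SHIFT = 5  # value used in CPython, as far as I can tell


def probe_sequence(hash_value, size):
    """Yield candidate slots for a hash; size must be a power of two."""
    mask = size - 1
    # Assumption: emulate C's unsigned size_t on a 64-bit build.
    perturb = hash_value & 0xFFFFFFFFFFFFFFFF
    i = perturb & mask
    while True:
        yield i
        perturb >>= PERTURB_SHIFT
        i = (i * 5 + perturb + 1) & mask


# Once perturb reaches 0, the recurrence visits every slot in turn.
print(list(islice(probe_sequence(1, 8), 8)))  # [1, 6, 7, 4, 5, 2, 3, 0]
```

The early steps mix in the higher bits of the hash, so keys that share the same low bits tend to diverge quickly instead of following the same path.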


In [24]:
ht._get_index(hash("1"))

21

In [25]:
ht.sparse_key[2]