# Hash Functions and Hash Tables

A hash table is a data structure that stores data for fast lookup. It is typically implemented as an array of buckets. Each data item is indexed by its key: a hash function converts the key (here, a text string) into an integer called the hash value, and the hash value modulo the number of buckets gives the index of the bucket that holds the item.

In Python there is usually no need to write your own hash functions and hash tables: the built-in function `hash()` computes hash values, and dictionaries serve as hash tables.
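
As a brief illustration of these built-in tools, the sketch below maps a key to a bucket index with `hash()` and then performs lookups with a dictionary. The bucket count of 8 and the sample data are assumptions chosen for illustration. Python salts string hashes per process by default, so the printed bucket index can differ between runs.

```python
# Illustrative sketch: num_buckets and the sample data are assumptions.
num_buckets = 8

key = "Italy"
# hash() returns an integer; for strings it is salted per process by default,
# so the value (and the resulting index) may change from run to run.
index = hash(key) % num_buckets
print(f"{key!r} falls into bucket {index} of {num_buckets}")

# A dict is Python's built-in hash table.
capitals = {"France": "Paris", "Italy": "Rome"}
capitals["Canada"] = "Ottawa"      # insert a new key-value pair
print(capitals["Italy"])           # look up by key
print(capitals.get("Spain"))       # a missing key yields None instead of KeyError
```

In Python the `%` operator returns a non-negative result for a positive modulus, so the index stays in range even when `hash()` returns a negative integer.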

For the purpose of learning how hash functions and hash tables work, however, it helps to see Python code that implements both.

In [1]:
import pprint

def myhash(key):  # key is a text string
    hash_value = 7  # start from a small prime
    for ch in key:
        # multiply by the prime 31 and add the character code
        hash_value = hash_value * 31 + ord(ch)
    return hash_value  # return only after every character has been processed

class Hashtable:
    def __init__(self, elements):
        # use as many buckets as there are elements
        self.bucket_size = len(elements)
        self.buckets = [[] for _ in range(self.bucket_size)]
        self._assign_buckets(elements)

    def _assign_buckets(self, elements):
        for key, value in elements:
            hashed_value = myhash(key)
            # division method: the bucket index is the hash value modulo the number of buckets
            index = hashed_value % self.bucket_size
            self.buckets[index].append((key, value))

    def get_value(self, input_key):
        hashed_value = myhash(input_key)
        index = hashed_value % self.bucket_size
        bucket = self.buckets[index]
        # linear search within the bucket
        for key, value in bucket:
            if key == input_key:
                return value
        return None

    def __str__(self):
        return pprint.pformat(self.buckets)  # pformat returns a printable representation of the buckets

if __name__ == "__main__":
    capitals = [
        ('France', 'Paris'),
        ('United States', 'Washington D.C.'),
        ('Italy', 'Rome'),
        ('Canada', 'Ottawa')
    ]
    hashtable = Hashtable(capitals)
    print(hashtable)
    print(f"The capital of Italy is {hashtable.get_value('Italy')}")


[[('United States', 'Washington D.C.'), ('Italy', 'Rome')],
 [('Canada', 'Ottawa')],
 [('France', 'Paris')],
 []]
The capital of Italy is Rome


# Collision Remedies

When two different keys have the same hash value, or more generally map to the same bucket, a collision occurs. Collisions are unavoidable once the number of keys exceeds the number of buckets, and they are likely well before that. The common remedy is chaining: each bucket holds a linked list (a chain) of all the items whose keys map to it. To search for a key, we first identify its bucket and then scan the chain linearly. We therefore want a hash function that distributes keys uniformly across the buckets, so that all chains have approximately the same length.

The `Hashtable` class we built above uses the simple division method, which works quite well in practice. Namely, let $b$ denote the number of buckets and let $k$ be the integer obtained from the input key string; the bucket index is $k \bmod b$ (`k % b` in Python), which tends to be fairly evenly distributed for typical keys.
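
To see how evenly the division method spreads keys, a small experiment can hash a batch of keys and tally how many land in each bucket. The sketch below re-implements the polynomial hash from above under the hypothetical name `poly_hash`; the 1000 random lowercase keys, the fixed seed, and the choice of 10 buckets are all assumptions made for illustration.

```python
import random
import string
from collections import Counter

def poly_hash(key):
    # Same polynomial scheme as above: start at 7, multiply by 31 per character.
    hash_value = 7
    for ch in key:
        hash_value = hash_value * 31 + ord(ch)
    return hash_value

def bucket_counts(keys, num_buckets):
    # Division method: bucket index = hash value modulo the number of buckets.
    counts = Counter(poly_hash(k) % num_buckets for k in keys)
    return [counts.get(i, 0) for i in range(num_buckets)]

rng = random.Random(0)  # fixed seed so the run is repeatable
keys = ["".join(rng.choices(string.ascii_lowercase, k=8)) for _ in range(1000)]
counts = bucket_counts(keys, 10)
print(counts)  # one chain length per bucket
print("shortest chain:", min(counts), "longest chain:", max(counts))
```

The closer the shortest and longest chains are to the average of 100 keys per bucket, the more uniform the distribution is for this particular key set.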

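Let $n$ be the number of elements stored in the table and let $\alpha = n/b$ be the load factor, i.e., the average chain length. Then, in a hash table in which collisions are resolved by chaining, a successful search takes expected time $O(1+\alpha)$ under the assumption of simple uniform hashing, that is, each key is equally likely to hash to any of the $b$ buckets, independently of the other keys. The result is intuitive: a search computes the hash in $O(1)$ time and then scans a chain whose expected length is about $\alpha$. It can be proved formally as follows: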

We assume that the element being searched for is equally likely to be any of the $n$ elements stored in the table. A new element can be inserted either at the front or at the end of its chain in $O(1)$ time (the latter with a tail pointer), as long as one convention is followed throughout; here we assume insertion at the front. Let $x_1, \ldots, x_n$ be the elements in the order in which they were inserted. Because new elements go to the front, the elements that precede $x_i$ in its chain are exactly those $x_j$ with $j > i$ that collide with $x_i$. The number of elements examined during a successful search for $x_i$ is therefore one more than the number of elements that were inserted into the same chain after $x_i$. (With insertion at the end, the roles of earlier and later elements are swapped, and by symmetry the expected value is the same.)

Under simple uniform hashing, we have $p(\mbox{keys $k_i$ and $k_j$ collide}) = 1/b$ for $i \neq j$, where $k_i$ denotes the key of $x_i$. Let $X_{ij}$ be an indicator random variable such that $X_{ij} = 1$ if $x_i$ and $x_j$ collide, and $0$ otherwise. Thus, $E[X_{ij}] = 1/b$. The expected number of elements examined in a successful search on an $n$-element hash table, averaged over the $n$ equally likely search targets, is
\begin{align*}
&E\left[\frac{1}{n}\sum_{i=1}^n\left(1 + \sum_{j=i+1}^n X_{ij}\right)\right] \\
&= 1 + \frac{1}{n}\sum_{i=1}^n\sum_{j=i+1}^n E\left[X_{ij}\right] \\
&= 1 + \frac{1}{nb}\sum_{i=1}^n (n-i) \\
&= 1 + \frac{n(n-1)}{2nb} \\
&= 1 + \frac{\alpha}{2} - \frac{\alpha}{2n} \\
&< 1 + \frac{\alpha}{2}.
\end{align*}
Adding the $O(1)$ time needed to compute the hash value, a successful search takes expected time $O(1+\alpha)$. This completes the proof.
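
The bound can also be checked empirically. The following sketch simulates simple uniform hashing by assigning each of $n$ elements to one of $b$ buckets uniformly at random, and compares the average number of elements examined per successful search with $1 + \alpha/2 - \alpha/(2n)$. The function names, the sizes $n = 50$ and $b = 20$, the number of trials, and the seed are assumptions chosen for illustration.

```python
import random

def simulated_search_cost(n, b, trials=2000, seed=0):
    """Average number of elements examined in a successful search,
    estimated by simulation under simple uniform hashing."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        chain_lengths = [0] * b
        for _ in range(n):
            chain_lengths[rng.randrange(b)] += 1  # uniform bucket choice
        # A chain of length L holds elements at positions 1..L, so searching
        # for each of them once examines 1 + 2 + ... + L = L(L+1)/2 elements.
        examined = sum(L * (L + 1) // 2 for L in chain_lengths)
        total += examined / n
    return total / trials

def predicted_search_cost(n, b):
    # Closed form from the derivation: 1 + alpha/2 - alpha/(2n).
    alpha = n / b
    return 1 + alpha / 2 - alpha / (2 * n)

n, b = 50, 20  # illustrative sizes (assumptions)
print(f"simulated: {simulated_search_cost(n, b):.3f}")
print(f"predicted: {predicted_search_cost(n, b):.3f}")
```

The two numbers should agree up to sampling noise, which shrinks as `trials` grows.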

In [73]:
# Capacity for internal array
INITIAL_CAPACITY = 10

class Node:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.next = None
    def __str__(self):
        return "<Node: (%s, %s), next: %s>" % (self.key, self.value, self.next is not None)
    def __repr__(self):
        return str(self)

class HashTable:
    def __init__(self):
        self.capacity = INITIAL_CAPACITY
        self.size = 0
        self.buckets = [None]*self.capacity

    def __iter__(self):
        for bucket in self.buckets:
            node = bucket
            while node is not None:
                yield node
                node = node.next

    def hash(self, key):
        hash_value = 7
        for i in range(len(key)):
            hash_value = hash_value * 31 + ord(key[i])
        hash_value = hash_value % self.capacity
        return hash_value

    def resize(self, new_capacity):
        new_table = HashTable()
        new_table.capacity = new_capacity
        new_table.buckets = [None] * new_capacity

        # __iter__ already yields every node of every chain, so one pass suffices
        for node in self:
            new_table.insert(node.key, node.value)

        self.buckets = new_table.buckets
        self.capacity = new_table.capacity

    def insert(self, key, value):
        load_factor = self.size / self.capacity
        if load_factor > 0.5:
            new_capacity = self.capacity * 2
            self.resize(new_capacity)

        index = self.hash(key)

        node = self.buckets[index]
        prev = None
        while node is not None and node.key != key:
            prev = node
            node = node.next

        if node is not None:
            # Key already present: overwrite the value stored in the existing node
            node.value = value
        else:
            new_node = Node(key, value)
            if prev is not None:
                prev.next = new_node
            else:
                self.buckets[index] = new_node
            self.size += 1


    # Search a data value based on key
    # Input:  key - string
    # Output: value stored under "key" or None if not found
    def search(self, key):

        # 1. Compute hash
        index = self.hash(key)
        # 2. Go to first node in list at bucket
        node = self.buckets[index]
        # 3. Traverse the linked list at this node
        while node is not None and node.key != key:
            node = node.next
        # 4. Now, node is the requested key/value pair or None
        if node is None:
            # Not found
            return None
        else:
            # Found - return the data value
            return node.value

    # Remove node stored at key
    # Input:  key - string
    # Output: removed data value or None if not found
    def remove(self, key):
        # 1. Compute index of key
        index = self.hash(key)

        # 2. Walk the chain in this bucket, remembering the previous node
        node = self.buckets[index]
        prev = None
        while node is not None and node.key != key:
            prev = node
            node = node.next

        # 3. Key not found
        if node is None:
            return None

        # 4. Unlink the node from its chain and keep the rest of the chain intact
        result = node.value
        if prev is None:
            self.buckets[index] = node.next
        else:
            prev.next = node.next
        self.size -= 1

        # 5. Check load factor and contract if necessary
        load_factor = self.size / self.capacity
        if load_factor < 0.25 and self.capacity > INITIAL_CAPACITY:
            new_capacity = self.capacity // 2
            self.resize(new_capacity)

        return result


In [74]:
# Create a new HashTable
ht = HashTable()

# Create some data to be stored
phone_numbers = ["555-555-5555", "444-444-4444"]

# Insert our data under the key "phoneDirectory"
ht.insert("phoneDirectory", phone_numbers)

# Do whatever we need with the phone_numbers variable
phone_numbers = None

# Later on...
# Retrieve the data we stored in the HashTable
phone_numbers = ht.search("phoneDirectory")

# search() retrieved our list object
print(phone_numbers)
# phone_numbers is now equal to ["555-555-5555", "444-444-4444"]

['555-555-5555', '444-444-4444']


In [75]:
#from hashtable import HashTable
import unittest

class TestHashTable(unittest.TestCase):
    def setUp(self):
        self.ht = HashTable()
    def test_hash(self):
        self.assertEqual(self.ht.hash("hello"), self.ht.hash("hello"))
        self.assertTrue(self.ht.hash("hello") < self.ht.capacity)
    def test_insert(self):
        self.assertEqual(self.ht.size, 0)
        self.ht.insert("test_key", "test_value")
        self.assertEqual(self.ht.size, 1)
        self.assertEqual(self.ht.buckets[self.ht.hash("test_key")].value, "test_value")
    def test_search(self):
        self.assertEqual(self.ht.size, 0)
        obj = "hello"
        self.ht.insert("key1", obj)
        self.assertEqual(obj, self.ht.search("key1"))
        obj = ["this", "is", "a", "list"]
        self.ht.insert("key2", obj)
        self.assertEqual(obj, self.ht.search("key2"))
    def test_remove(self):
        self.assertEqual(self.ht.size, 0)
        obj = "test object"
        self.ht.insert("key1", obj)
        self.assertEqual(1, self.ht.size)
        self.assertEqual(obj, self.ht.remove("key1"))
        self.assertEqual(0, self.ht.size)
        self.assertEqual(None, self.ht.remove("some random key"))
    def test_capacity(self):
        # Test all public methods in one run at a large capacity
        for i in range(0,10):
            self.assertEqual(i, self.ht.size)
            self.ht.insert("key" + str(i), "value")
        self.assertEqual(self.ht.size, 10)
        for i in range(0,10):
            self.assertEqual(10-i, self.ht.size)
            self.assertEqual(self.ht.search("key" + str(i)), self.ht.remove("key" + str(i)))
    def test_issue2(self):
        self.assertEqual(self.ht.size, 0)
        self.ht.insert('A', 5)
        self.assertEqual(self.ht.size, 1)
        self.ht.insert('B', 10)
        self.assertEqual(self.ht.size, 2)
        self.ht.insert('Ball', 'hello')
        self.assertEqual(self.ht.size, 3)

        self.assertEqual(5, self.ht.remove('A'))
        self.assertEqual(self.ht.size, 2)
        self.assertEqual(None, self.ht.remove('A'))
        self.assertEqual(self.ht.size, 2)
        self.assertEqual(None, self.ht.remove('A'))
        self.assertEqual(self.ht.size, 2)
        
unittest.main(argv=[''], verbosity=2, exit=False)

test_capacity (__main__.TestHashTable) ... ok
test_hash (__main__.TestHashTable) ... ok
test_insert (__main__.TestHashTable) ... ok
test_issue2 (__main__.TestHashTable) ... ok
test_remove (__main__.TestHashTable) ... ok
test_search (__main__.TestHashTable) ... ok

----------------------------------------------------------------------
Ran 6 tests in 0.009s

OK



In [84]:
import random
import statistics

def generate_random_binary_string(length):
    # Generate a random binary string of given length
    return "".join(str(random.randint(0, 1)) for i in range(length))

def generate_random_key():
    # Generate a random 3-digit key
    return str(random.randint(100, 999))

def run_experiment():
    # Initialize hash table and counters
    table = HashTable()
    expansions = 0
    contractions = 0
    
    # Generate random binary string
    binary_string = generate_random_binary_string(100)
    
    # Iterate over each character in binary string
    for char in binary_string:
        if char == '1':
            # Insert a random key-value pair
            key = generate_random_key()
            table.insert(key, None)
        elif char == '0':
            # Delete a random key-value pair
            keys = [node.key for node in table if node is not None]
            if len(keys) > 0:
                key_to_delete = random.choice(keys)
                table.remove(key_to_delete)
    
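        # Update counters. Note: these tally the steps at which the enlarged
        # table is at least half full ("expansions") or at most a quarter full
        # ("contractions"); they do not count individual resize events.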
        if table.capacity > INITIAL_CAPACITY and table.size >= table.capacity/2:
            expansions += 1
        elif table.capacity > INITIAL_CAPACITY and table.size <= table.capacity/4:
            contractions += 1
    
    # Return expansion and contraction counts
    return expansions, contractions

# Run experiments and collect results
expansion_counts = []
contraction_counts = []
for i in range(20):
    expansions, contractions = run_experiment()
    expansion_counts.append(expansions)
    contraction_counts.append(contractions)

# Compute and print averages
avg_expansions = statistics.mean(expansion_counts)
avg_contractions = statistics.mean(contraction_counts)
print("Average number of expansions:", avg_expansions)
print("Average number of contractions:", avg_contractions)


Average number of expansions: 15.15
Average number of contractions: 2.5
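
The counters in the experiment above are updated from the table's state after each step. If the goal is to count actual resize events, one option is to compare the capacity before and after every operation. The sketch below illustrates this idea; `count_resizes` and `DemoTable` are hypothetical names, and `DemoTable` is only a dictionary-backed stand-in that is assumed to use the same thresholds as the `HashTable` class (grow when the load factor exceeds 0.5 before an insert, shrink when it drops below 0.25 after a removal).

```python
INITIAL_CAPACITY = 10

class DemoTable:
    # Hypothetical stand-in with the same insert/remove/capacity interface
    # and the same resize thresholds as the HashTable class above.
    def __init__(self):
        self.capacity = INITIAL_CAPACITY
        self.items = {}

    def insert(self, key, value):
        if len(self.items) / self.capacity > 0.5:
            self.capacity *= 2
        self.items[key] = value

    def remove(self, key):
        if key not in self.items:
            return None
        value = self.items.pop(key)
        if len(self.items) / self.capacity < 0.25 and self.capacity > INITIAL_CAPACITY:
            self.capacity //= 2
        return value

def count_resizes(table, operations):
    """Apply ("insert", key) / ("remove", key) operations to `table` and
    count how often its capacity grows or shrinks."""
    expansions = contractions = 0
    for action, key in operations:
        before = table.capacity
        if action == "insert":
            table.insert(key, None)
        else:
            table.remove(key)
        # A resize event shows up as a change in capacity across one operation.
        if table.capacity > before:
            expansions += 1
        elif table.capacity < before:
            contractions += 1
    return expansions, contractions

# Insert 12 keys, then remove them again.
operations = [("insert", f"k{i}") for i in range(12)]
operations += [("remove", f"k{i}") for i in range(12)]
print(count_resizes(DemoTable(), operations))
```

Because `count_resizes` only relies on `insert`, `remove`, and `capacity`, it should work unchanged with any table that exposes that interface.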
