## Problem Statement
  
In this assignment, you will recreate Python dictionaries from scratch using a data structure called a hash table. Dictionaries in Python are used to store key-value pairs. Keys are used to store and retrieve values. For example, here's a dictionary for storing and retrieving phone numbers using people's names.

In [None]:
phone_numbers = {
  'Aakash' : '9489484949',
  'Hemanth' : '9595949494',
  'Siddhant' : '9231325312'
}
phone_numbers

You can access a person's phone number using their name as follows:

In [None]:
phone_numbers['Aakash']

You can store new phone numbers, or update existing ones as follows:

In [None]:
# Add a new value
phone_numbers['Vishal'] = '8787878787'
# Update existing value
phone_numbers['Aakash'] = '7878787878'
# View the updated dictionary
phone_numbers

You can also view all the names and phone numbers stored in `phone_numbers` using a `for` loop.

In [None]:
for name in phone_numbers:
    print('Name:', name, ', Phone Number:', phone_numbers[name])

## The Method

Here's the systematic strategy we'll apply for solving problems:

1. State the problem clearly. Identify the input & output formats.
2. Come up with some example inputs & outputs. Try to cover all edge cases.
3. Come up with a correct solution for the problem. State it in plain English.
4. Implement the solution and test it using example inputs. Fix bugs, if any.
5. Analyze the algorithm's complexity and identify inefficiencies, if any.
6. Apply the right technique to overcome the inefficiency. Repeat steps 3 to 6.

Let's apply this approach step-by-step.

## Solution


### 1. State the problem clearly. Identify the input & output formats.

Dictionaries in Python are implemented using a data structure called a **hash table**. A hash table uses a list/array to store the key-value pairs, and uses a _hashing function_ to determine the index for storing or retrieving the data associated with a given key.

Here's a visual representation of a hash table:

<img src="images/03-hash-tables/hash-table.png" width="480">  

**Problem**

Your objective in this assignment is to implement a `HashTable` class which supports the following operations:

1. **Insert**: Insert a new key-value pair
2. **Find**: Find the value associated with a key
3. **Update**: Update the value associated with a key
4. **List**: List all the keys stored in the hash table

<br/>

Based on the above, we can now outline the class and its method signatures:

In [None]:
class HashTable:
    def insert(self, key, value):
        """Insert a new key-value pair"""
        pass
    
    def find(self, key):
        """Find the value associated with a key"""
        pass
    
    def update(self, key, value):
        """Change the value associated with a key"""
        pass
    
    def list_all(self):
        """List all the keys"""
        pass

### Data List 

We'll build the `HashTable` class step-by-step. The first step is to create a Python list which will hold all the key-value pairs. We'll start by creating a list of a fixed size.

In [None]:
MAX_HASH_TABLE_SIZE = 4096

**QUESTION 1: Create a Python list of size `MAX_HASH_TABLE_SIZE`, with all the values set to `None`.**

_Hint_: Use the `*` operator

In [None]:
# List of size MAX_HASH_TABLE_SIZE with all values None
data_list = [None]*MAX_HASH_TABLE_SIZE

In [None]:
len(data_list) == 4096

In [None]:
data_list[99] is None

### Hashing Function

A hashing function is used to convert strings and other non-numeric data types into numbers, which can then be used as list indices.  

For instance, if a hashing function converts the string "Aakash" into the number 4, then the key-value pair `('Aakash', '7878787878')` will be stored at index 4 of the data list.

Here's a simple algorithm for hashing, which can convert strings into numeric list indices.  

* Iterate over the string, character by character.
* Convert each character to a number using Python's built-in `ord` function.
* Add the numbers for each character to obtain the hash for the entire string.
* Take the remainder of the result with the size of the data list.

Complete the `get_index` function below, which implements the hashing algorithm described above.

In [35]:
def get_index(data_list, a_string):
    # Variable to store the result (updated after each iteration)
    result = 0
    
    for a_character in a_string:
        # Convert the character to a number (using ord)
        a_number = ord(a_character)
        # Update result by adding the number
        result += a_number
    
    # Take the remainder of the result with the size of the data list
    list_index = result % len(data_list)
    return list_index

In [None]:
get_index(data_list, '') == 0

In [None]:
get_index(data_list, 'Aakash') == 585

In [None]:
get_index(data_list, 'Don O Leary') == 941

#### Insert

To insert a key-value pair into a hash table, we can simply get the hash of the key, and store the pair at that index in the data list.

In [None]:
key, value = 'Aakash', '7878787878'

In [None]:
idx = get_index(data_list, key)
idx

In [None]:
data_list[get_index(data_list, 'Hemanth')] = ('Hemanth', '9595949494')

In [None]:
data_list[idx] = (key, value)

In [None]:
data_list[idx]

#### Find

To retrieve the value associated with a key, we can get the hash of the key and look up that index in the data list.

In [None]:
idx = get_index(data_list, 'Aakash')
key, value = data_list[idx]

In [None]:
key, value

#### List

To get the list of keys, we can use a simple list comprehension.

In [None]:
keys = [kv[0] for kv in data_list if kv is not None]

In [None]:
keys

## Basic Hash Table Implementation

We can now use the hashing function defined above to implement a basic hash table in Python.

In [67]:
class BasicHashTable:
    def __init__(self, max_size=MAX_HASH_TABLE_SIZE):
        # 1. Create a list of size `max_size` with all values None
        self.data_list = [None]*max_size
     
    
    def insert(self, key, value):
        # 1. Find the index for the key using get_index
        idx = get_index(self.data_list, key)
        
        # 2. Store the key-value pair at the right index
        self.data_list[idx] = (key, value)
    
    
    def find(self, key):
        # 1. Find the index for the key using get_index
        idx = get_index(self.data_list, key)
        
        # 2. Retrieve the data stored at the index
        kv = self.data_list[idx]
        
        # 3. Return the value if found, else return None
        if kv is None:
            return None
        else:
            key, value = kv
            return value
    
    
    def update(self, key, value):
        # 1. Find the index for the key using get_index
        idx = get_index(self.data_list, key)
        
        # 2. Store the new key-value pair at the right index
        self.data_list[idx] = (key, value)

    
    def list_all(self):
        # 1. Extract the key from each key-value pair 
        return [kv[0] for kv in self.data_list if kv is not None]

    def __iter__(self):
        # Iterate over every slot in the data list, including empty (None) slots.
        # Returning a fresh iterator avoids relying on a `self.index` attribute,
        # which is never initialized in `__init__`.
        return iter(self.data_list)

In [73]:
basic_table = BasicHashTable(max_size=1024)
len(basic_table.data_list) == 1024

True

In [74]:
# Insert some values
basic_table.insert('Aakash', '9999999999')
basic_table.insert('Hemanth', '8888888888')

# Find a value
basic_table.find('Hemanth') == '8888888888'

True

In [75]:
# Update a value
basic_table.update('Aakash', '7777777777')

# Check the updated value
basic_table.find('Aakash') == '7777777777'

True

In [76]:
[kv for kv in basic_table if kv is not None]

[('Aakash', '7777777777'), ('Hemanth', '8888888888')]

### Handling Collisions with Linear Probing

As you might have guessed, multiple keys can have the same hash. For instance, the keys "listen" and "silent" have the same hash. This is referred to as a collision. Data stored against one key may overwrite the data stored against another if they have the same hash.
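
The collision follows directly from the hashing algorithm described earlier: adding character codes ignores the order of the characters, so anagrams always produce the same sum. The snippet below is a small sketch of that idea; `char_sum_hash` is a hypothetical helper that mirrors the logic of `get_index` but takes the table size directly.

```python
def char_sum_hash(a_string, size):
    """Sum the character codes and take the remainder with the table size."""
    return sum(ord(ch) for ch in a_string) % size

size = 4096  # same value as MAX_HASH_TABLE_SIZE in this notebook
for word in ['listen', 'silent', 'enlist']:
    # Every anagram contains the same characters, so the sums are identical
    print(word, char_sum_hash(word, size))
# listen 655
# silent 655
# enlist 655
```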

In [77]:
basic_table.insert('listen', 99)

In [78]:
basic_table.insert('silent', 200)

In [79]:
basic_table.find('listen')

200

As you can see above, the value for the key `listen` was overwritten by the value for the key `silent`. Our hash table implementation is incomplete because it does not handle collisions correctly.

To handle collisions we'll use a technique called linear probing. Here's how it works:  

1. While inserting a new key-value pair, if the target index for a key is occupied by another key, then we try the next index, followed by the next, and so on until we find the closest empty location.
2. While finding a key-value pair, we apply the same strategy, but instead of searching for an empty location, we look for a location which contains a key-value pair with the matching key.
3. While updating a key-value pair, we apply the same strategy, but instead of searching for an empty location, we look for a location which contains a key-value pair with the matching key, and update its value.  

We'll define a function called `get_valid_index`, which starts searching the data list from the index determined by the hashing function `get_index` and returns the first index which is either empty or contains a key-value pair matching the given key.


In [80]:
def get_valid_index(data_list, key):
    # Start with the index returned by get_index
    idx = get_index(data_list, key)
    
    while True:
        # Get the key-value pair stored at idx
        kv = data_list[idx]
        
        # If it is None, return the index
        if kv is None:
            return idx
        
        # If the stored key matches the given key, return the index
        k, v = kv
        if k == key:
            return idx
        
        # Move to the next index
        idx += 1
        
        # Go back to the start if you have reached the end of the array
        if idx == len(data_list):
            idx = 0

In [81]:
# Create an empty hash table
data_list2 = [None] * MAX_HASH_TABLE_SIZE

# New key 'listen' should return expected index
get_valid_index(data_list2, 'listen') == 655

True

In [82]:
# Insert a key-value pair for the key 'listen'
data_list2[get_index(data_list2, 'listen')] = ('listen', 99)

# Colliding key 'silent' should return next index
get_valid_index(data_list2, 'silent') == 656

True
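
One caveat worth noting: `get_valid_index` uses `while True`, so if every slot is occupied and none of them holds the given key, the loop never ends. A possible guard, sketched below under the assumption that raising an error is acceptable, is to stop after one full pass over the list. The name `get_valid_index_bounded` and its `start` parameter (the index produced by the hashing function) are hypothetical and not part of the assignment.

```python
def get_valid_index_bounded(data_list, key, start):
    """Linear probing that gives up after one full pass over the list."""
    size = len(data_list)
    for offset in range(size):
        # Wrap around to the beginning when the end of the list is reached
        idx = (start + offset) % size
        kv = data_list[idx]
        # An empty slot or a slot with the matching key is a valid index
        if kv is None or kv[0] == key:
            return idx
    # Every slot was inspected and none was usable
    raise RuntimeError('hash table is full')

# A tiny table in which every slot is occupied
full = [('a', 1), ('b', 2), ('c', 3)]
print(get_valid_index_bounded(full, 'b', start=0))  # 1

try:
    get_valid_index_bounded(full, 'z', start=0)
except RuntimeError as error:
    print(error)  # hash table is full
```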

### Hash Table with Linear Probing

In [83]:
class ProbingHashTable:
    def __init__(self, max_size=MAX_HASH_TABLE_SIZE):
        # 1. Create a list of size `max_size` with all values None
        self.data_list = [None] * max_size
     
    
    def insert(self, key, value):
        # 1. Find the index for the key using get_valid_index
        idx = get_valid_index(self.data_list, key)
        
        # 2. Store the key-value pair at the right index
        self.data_list[idx] = (key, value)
    
    
    def find(self, key):
        # 1. Find the index for the key using get_valid_index
        idx = get_valid_index(self.data_list, key)
        
        # 2. Retrieve the data stored at the index
        kv = self.data_list[idx]
        
        # 3. Return the value if found, else return None
        return None if kv is None else kv[1]
    
    
    def update(self, key, value):
        # 1. Find the index for the key using get_valid_index
        idx = get_valid_index(self.data_list, key)
        
        # 2. Store the new key-value pair at the right index
        self.data_list[idx] = (key, value)

    
    def list_all(self):
        # 1. Extract the key from each key-value pair 
        return [kv[0] for kv in self.data_list if kv is not None]

In [84]:
# Create a new hash table
probing_table = ProbingHashTable()

# Insert a value
probing_table.insert('listen', 99)

# Check the value
probing_table.find('listen') == 99

True

In [85]:
# Insert a colliding key
probing_table.insert('silent', 200)

# Check the new and old keys
probing_table.find('listen') == 99 and probing_table.find('silent') == 200

True

In [86]:
# Update a key
probing_table.update('listen', 101)

# Check the value
probing_table.find('listen') == 101

True

In [87]:
probing_table.list_all() == ['listen', 'silent']

True

### Python Dictionaries using Hash Tables

We can now implement Python dictionaries using hash tables. Also, Python provides a built-in function called `hash`, which we can use instead of our custom hash function. It is likely to produce far fewer collisions.
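
Before using `hash` in a class, it may help to see how it behaves on its own. The sketch below turns a few keys into list indices with `hash(key) % size`; the variable names are only illustrative. Two details are worth keeping in mind: `hash` can return negative numbers, but Python's `%` with a positive size always gives a non-negative index, and in CPython the hashes of strings are randomized per process by default, so the indices for string keys can differ from one run to the next.

```python
size = 4096

# Integers, strings and tuples of hashable values can all be hashed
for key in [42, 'Aakash', ('a', 1)]:
    idx = hash(key) % size
    print(repr(key), '->', idx)

# Within a single run, equal keys always produce equal hashes
print(hash('Aakash') == hash('Aakash'))  # True

# Mutable objects such as lists are not hashable
try:
    hash(['a', 'b'])
except TypeError as error:
    print('TypeError:', error)
```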

In [96]:
MAX_HASH_TABLE_SIZE = 4096

class HashTable:
    def __init__(self, max_size=MAX_HASH_TABLE_SIZE):
        self.data_list = [None] * max_size
        
    def get_valid_index(self, key):
        # Use Python's built-in `hash` function and implement linear probing
        idx = hash(key) % len(self.data_list)

        while True:
            kv = self.data_list[idx]

            if kv is None:
                return idx
            
            k, v = kv
            if k == key:
                return idx
            
            idx += 1

            if idx == len(self.data_list):
                idx = 0
        
    def __getitem__(self, key):
        # Implement the logic for "find" here
        idx = self.get_valid_index(key)
        kv = self.data_list[idx]

        return None if kv is None else kv[1]
    
    def __setitem__(self, key, value):
        # Implement the logic for "insert/update" here
        idx = self.get_valid_index(key)
        self.data_list[idx] = (key, value)
    
    def __iter__(self):
        return (x for x in self.data_list if x is not None)
    
    def __len__(self):
        return len([x for x in self])
    
    def __repr__(self):
        from textwrap import indent
        pairs = [indent("{} : {}".format(repr(kv[0]), repr(kv[1])), '  ') for kv in self]
        return "{\n" + "{}".format(',\n'.join(pairs)) + "\n}"
    
    def __str__(self):
        return repr(self)

In [97]:
# Create a hash table
table = HashTable()

# Insert some key-value pairs
table['a'] = 1
table['b'] = 34

# Retrieve the inserted values
table['a'] == 1 and table['b'] == 34

True

In [98]:
# Update a value
table['a'] = 99

# Check the updated value
table['a'] == 99

True

In [99]:
# Get a list of key-value pairs (sorted, because slot order depends on the hash values)
sorted(table) == [('a', 99), ('b', 34)]

True

Since we have also implemented the `__repr__` and `__str__` methods, the output of the next cell should look like this (the order of the pairs can vary between runs):
```
{
  'a' : 99,
  'b' : 34
}
```

In [100]:
table

{
  'a' : 99,
  'b' : 34
}

### Hash Table Improvements

Here are some more improvements/changes you can make to your hash table implementation:  

* Track the size of the hash table, i.e. the number of key-value pairs, so that `len(table)` has complexity O(1).
* Implement deletion with tombstones as described here: https://research.cs.vt.edu/AVresearch/hashing/deletion.php
* Implement dynamic resizing to automatically grow/shrink the data list: https://charlesreid1.com/wiki/Hash_Maps/Dynamic_Resizing
* Implement separate chaining, an alternative to linear probing for collision resolution: https://www.youtube.com/watch/T9gct6Dx-jo
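
As a rough illustration of the first two items in the list, here is a minimal sketch that keeps a running count of pairs and deletes with tombstones. It is one possible approach under stated assumptions, not a reference solution: the class name `TombstoneHashTable`, the `_TOMBSTONE` marker, the `_probe` helper and the small default size are all hypothetical. Integer keys are used in the usage example because small integers hash to themselves in CPython, which makes the collisions predictable.

```python
_TOMBSTONE = object()  # marks a slot whose pair was deleted

class TombstoneHashTable:
    def __init__(self, max_size=8):
        self.data_list = [None] * max_size
        self.count = 0  # number of live key-value pairs

    def _probe(self, key):
        """Yield slot indices in linear-probing order, one full pass at most."""
        size = len(self.data_list)
        start = hash(key) % size
        for offset in range(size):
            yield (start + offset) % size

    def __setitem__(self, key, value):
        first_free = None  # first tombstone or empty slot seen while probing
        for idx in self._probe(key):
            slot = self.data_list[idx]
            if slot is None:
                # The key is not in the table; stop probing here
                if first_free is None:
                    first_free = idx
                break
            if slot is _TOMBSTONE:
                if first_free is None:
                    first_free = idx
            elif slot[0] == key:
                self.data_list[idx] = (key, value)  # update in place
                return
        if first_free is None:
            raise RuntimeError('hash table is full')
        self.data_list[first_free] = (key, value)
        self.count += 1

    def __getitem__(self, key):
        for idx in self._probe(key):
            slot = self.data_list[idx]
            if slot is None:
                break  # an empty slot ends the search
            if slot is not _TOMBSTONE and slot[0] == key:
                return slot[1]
        return None  # same convention as the HashTable class: None if missing

    def __delitem__(self, key):
        for idx in self._probe(key):
            slot = self.data_list[idx]
            if slot is None:
                break
            if slot is not _TOMBSTONE and slot[0] == key:
                # Leave a marker so later keys in the probe chain stay reachable
                self.data_list[idx] = _TOMBSTONE
                self.count -= 1
                return
        raise KeyError(key)

    def __len__(self):
        return self.count  # O(1): no scan of the data list


table = TombstoneHashTable(max_size=8)
table[1] = 'one'
table[9] = 'nine'         # 9 % 8 == 1, collides with key 1
table[17] = 'seventeen'   # 17 % 8 == 1, collides again
del table[9]
print(table[17])   # seventeen
print(len(table))  # 2
```

If the deleted slot were simply reset to `None`, the search for key `17` would stop at that slot and report the key as missing; the tombstone keeps the probe chain intact.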

### Complexity Analysis

With a good hashing function and other improvements like dynamic resizing, you can achieve the following time complexities:
  

| Operation | Average-case time complexity | Worst-case time complexity |
|-----------|------------------------------|----------------------------|
| Insert/Update | O(1) | O(n) |
| Find | O(1) | O(n) |
| Delete | O(1) | O(n) |
| List | O(n) | O(n) |
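
To get a feel for the O(n) worst case in the table, the sketch below counts how many slots linear probing inspects when every key collides. It reuses the sum-of-character-codes hash from earlier in this notebook, for which all anagrams share one hash. The helper names `char_sum_index` and `insert_counting_probes` are hypothetical and exist only for this experiment.

```python
from itertools import permutations

def char_sum_index(key, size):
    """Sum-of-character-codes hash, as in get_index."""
    return sum(ord(ch) for ch in key) % size

def insert_counting_probes(data_list, key, value):
    """Insert with linear probing and return how many slots were inspected."""
    size = len(data_list)
    idx = char_sum_index(key, size)
    probes = 1
    # Keep moving while the slot is taken by a different key
    while data_list[idx] is not None and data_list[idx][0] != key:
        idx = (idx + 1) % size
        probes += 1
    data_list[idx] = (key, value)
    return probes

data_list = [None] * 64
# 24 distinct anagrams of 'abcd', all with the same hash
keys = [''.join(p) for p in permutations('abcd')]
probe_counts = [insert_counting_probes(data_list, key, i)
                for i, key in enumerate(keys)]

print(probe_counts[:5])   # [1, 2, 3, 4, 5]
print(sum(probe_counts))  # 300
```

Each new colliding key has to walk past all the keys inserted before it, so the cost of a single insert grows linearly with the number of stored keys.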

Here are some questions to ponder:

* What is average case complexity? How does it differ from worst-case complexity?
* Do you see why insert/find/update have average-case complexity of O(1) and worst-case complexity of O(n)?
* How is the complexity of hash tables different from binary search trees?
* When should you prefer using a hash table over a binary search tree, or vice versa?
