# Outline

* Simple hashing: a hotel with 26 rooms, labeled A-Z. Guests are assigned to a room based on their first-name initial. If the room is occupied, the guest is not admitted.
* Not a good business model. Lots of demand for rooms `J` and `M` because John and Mary are very popular names. Less demand for `X` or `Z`.
* Superfast to tell if a person is a guest at the hotel. Check the room corresponding to the person's first-name initial: if it is empty, the person is not a guest; if it is occupied, compare the occupant's name. $\mathcal O(1)$.
* Many rooms remain empty. Introduce probing. John goes to `J`. James cannot go to `J` because John is there, but what if the next room, `K`, is available? James can go there. Now it may take 2 steps to find out if James is in the hotel. We expect him in `J`, but we'll also probe `K`. Still faster than linear search or binary search.
* Eventually we'll end up underutilizing the hotel. That is the compromise: we underutilize storage for the sake of performance.
* Demonstrate with a hotel (a list, really) with 1000 rooms where guests are sent to a room based on some hash function.
* Possible hash: $\left(\sum_{i=0}^{\text{len}(\texttt{name})-1} \texttt{ord(name[i])}\right)\bmod 1000$.
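
The 26-room hotel with probing can be sketched as follows. This is a minimal illustration, not the assignment's reference code: the names `ROOMS`, `home_room`, `check_in`, and `find` are hypothetical, names are assumed to start with a letter A-Z, and probing wraps around from `Z` back to `A`.

```python
from typing import Optional

ROOMS = 26  # rooms labeled A-Z (hypothetical constant for this sketch)
hotel = [None] * ROOMS  # hotel[i] is the guest in room i, or None if empty

def home_room(name: str) -> int:
    """Room index from the first-name initial (assumes the name starts with A-Z)."""
    return ord(name[0].upper()) - ord('A')

def check_in(name: str) -> Optional[int]:
    """Place the guest in the home room or the next free room; None if the hotel is full."""
    start = home_room(name)
    for step in range(ROOMS):
        room = (start + step) % ROOMS  # wrap around after room Z
        if hotel[room] is None:
            hotel[room] = name
            return room
    return None

def find(name: str) -> Optional[int]:
    """Probe from the home room until the guest or an empty room is found."""
    start = home_room(name)
    for step in range(ROOMS):
        room = (start + step) % ROOMS
        if hotel[room] is None:
            return None  # an empty room ends the probe: the guest is not here
        if hotel[room] == name:
            return room
    return None

print(check_in("John"))   # John takes his home room J (index 9)
print(check_in("James"))  # J is taken, so James is probed into K (index 10)
print(find("James"))      # found after two probes: J, then K
```

Note that `find` stops at the first empty room. That shortcut is only valid as long as guests never check out; supporting check-outs would need extra bookkeeping that this sketch leaves out.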

# Summary of assignments



# Coming up with the right hash code

Consider a hotel with $N$ rooms (say, $N = 1{,}024$). Create $N$ guest names. Try to assign them to rooms using a simple hashcode and measure how many collisions you detect.

For this exercise you'll have to write several methods to manage the complexity of the problem. The file `simulate_collisions.py` suggests an outline for these methods -- you don't have to follow it if you prefer a different approach.

For a simple hashcode, implement the following function.

$$
\text{hashcode}(\texttt{name}) = \left( \sum_{i=0}^{\text{len}(\texttt{name})-1} \texttt{ord(name[i])} \right) \bmod N
$$

Since you are using a random string generator, you may want to run the code a few times and compute the average number of collisions.
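
One possible way to write the sum-based hashcode is sketched below. The function name `sum_hashcode` and its parameter `n` are assumptions made for this illustration, not part of `simulate_collisions.py`.

```python
N = 1_024  # number of rooms, as in the exercise

def sum_hashcode(name: str, n: int = N) -> int:
    """Sum of the character codes, reduced modulo the number of rooms."""
    return sum(ord(char) for char in name) % n

print(sum_hashcode("AB"))                        # (65 + 66) % 1024
print(sum_hashcode("AB") == sum_hashcode("BA"))  # a sum ignores character order
```

Because addition is commutative, any two names made of the same letters land in the same room, which is worth keeping in mind when you count collisions.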

Next, change the hashcode to a slightly different function:

$$
\text{hashcode}(\texttt{name}) = \left( \prod_{i=0}^{\text{len}(\texttt{name})-1} \texttt{ord(name[i])} \right) \bmod N
$$
and repeat the measurements. 
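
A minimal sketch of the product-based hashcode, assuming Python 3.8 or later for `math.prod`. The name `product_hashcode` is hypothetical.

```python
import math

N = 1_024  # number of rooms, as in the exercise

def product_hashcode(name: str, n: int = N) -> int:
    """Product of the character codes, reduced modulo the number of rooms."""
    return math.prod(ord(char) for char in name) % n

print(product_hashcode("AB"))          # (65 * 66) % 1024
print(product_hashcode("BBBBBBBBBB"))  # 66**10 contains the factor 2**10 = 1024
```

When measuring collisions, it may help to look at how many character codes in a name are even: every even code contributes at least one factor of 2 to the product, and $N = 1{,}024 = 2^{10}$.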

Next, change the hashcode to a more sophisticated function:

$$
\text{hashcode}(\texttt{name}) = \left( \sum_{i=0}^{\text{len}(\texttt{name})-1} \texttt{ord(name[i])}\times 31^{\text{len}(\texttt{name})-1-i} \right) \bmod N
$$
and repeat the measurements. 
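
The polynomial formula can be evaluated term by term or with Horner's rule, which avoids computing the powers of 31 explicitly. The sketch below shows both and checks that they agree; the function names are hypothetical and chosen only for this comparison.

```python
N = 1_024  # number of rooms, as in the exercise

def polynomial_hashcode(name: str, n: int = N) -> int:
    """Direct evaluation: sum of ord(name[i]) * 31**(len(name) - 1 - i), modulo n."""
    length = len(name)
    return sum(ord(char) * 31 ** (length - 1 - i) for i, char in enumerate(name)) % n

def horner_hashcode(name: str, n: int = N) -> int:
    """Same polynomial, evaluated left to right with Horner's rule."""
    h = 0
    for char in name:
        h = h * 31 + ord(char)
    return h % n

print(polynomial_hashcode("AB"), horner_hashcode("AB"))  # both evaluate (65 * 31 + 66) % 1024
print(polynomial_hashcode("AB") == polynomial_hashcode("BA"))  # order now matters
```

Unlike the plain sum, each character is weighted by its position, so rearranging the letters of a name generally changes the result.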


What do you observe, and why do you think that is?



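import random

# Random guest names are 10 to 15 uppercase letters long.
DEFAULT_MIN_LENGTH = 10
DEFAULT_MAX_LENGTH = 15
ASCII_OFFSET = ord('A')
ASCII_SIZE = 26
N = 1_024  # number of rooms in the hotel

hotel = [None] * N  # hotel[i] holds the guest name in room i, or None if empty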

def generate_random_string(min_length=DEFAULT_MIN_LENGTH, max_length=DEFAULT_MAX_LENGTH):
    length = random.randint(min_length, max_length)
    return ''.join(chr(ASCII_OFFSET + random.randint(0, ASCII_SIZE - 1)) for _ in range(length))

def hashcode(name:str) -> int:
    # Polynomial hash with base 31, evaluated with Horner's rule.
    h = 0
    for char in name:
        h = h * 31 + ord(char)
    return h

def hash_function(name:str) -> int:
    return hashcode(name) % N

def check_in(name:str) -> bool:
    h = hash_function(name)
    if hotel[h] is None:
        hotel[h] = name
        return True
    else:
        return False

def simulate_check_in(num_guests:int = N) -> int:
    success = 0
    for _ in range(num_guests):
        name = generate_random_string()
        if check_in(name):
            success += 1
    return success

def main():
    global hotel  # each simulation rebinds the module-level hotel
    num_guests = N
    num_simulations = 10
    total_success = 0
    for _ in range(num_simulations):
        hotel = [None] * N  # start every simulation with an empty hotel
        current_success = simulate_check_in(num_guests)
        total_success += current_success

    print(f"Average success over {num_simulations} simulations: {total_success / num_simulations}")

if __name__ == "__main__":
    main()

Average success over 10 simulations: 647.5
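
As a sanity check on the reported average, the sketch below computes the number of occupied rooms one would expect if every guest were sent to a room uniformly at random, independently of the others. That uniformity is an idealizing assumption about the hash function, not something the simulation guarantees, and the name `expected_occupied` is hypothetical.

```python
N = 1_024  # number of rooms, as in the simulation above

def expected_occupied(rooms: int, guests: int) -> float:
    """Expected number of occupied rooms when each guest picks a room uniformly at random.

    A given room stays empty with probability (1 - 1/rooms) ** guests,
    so it is occupied with one minus that probability.
    """
    return rooms * (1 - (1 - 1 / rooms) ** guests)

print(expected_occupied(N, N))
```

For as many guests as rooms, this is roughly $N(1 - 1/e)$, or about 647 when $N = 1{,}024$, which is close to the simulated average of 647.5. Under the stated assumption, that would suggest the base-31 hashcode spreads these random names about as evenly as a uniformly random assignment.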
