# Outline

* Simple hashing - hotel with 26 rooms, labeled A-Z. Guests are assigned to a room based on their first-name initial. If the room is occupied, the guest is not admitted.
* Not a good business model. Lots of demand for rooms `J` and `M` because John and Mary are very popular names. Less demand for `X` or `Z`.
* Superfast to tell if a person is a guest at the hotel. If the room corresponding to the person's first-name initial is not empty, a person with that initial is present. $\mathcal O(1)$.
* Brief/simple intro to $\mathcal O$ notation.
* Many rooms remain empty. Introduce probing. John goes to `J`. James cannot go to `J` because John is there, but what if the next room, `K`, is available? James can go there. Now it may take 2 steps to find out whether James is in the hotel: we expect him in `J`, but we'll also probe `K`. Still faster than linear search or binary search.
* Eventually we'll end up underutilizing the hotel. So we reach a compromise, to underutilize storage for the sake of performance.
* Demonstrate with a hotel (a list, really) with 1000 rooms where guests are sent to a room based on some hash function.
* Possible hash: ${\sum_{i=0}^{\text{len}(\texttt{name})-1} \texttt{ord(name[i])}}\bmod 1000$.
* Resolve collisions by chaining data -- `hotel` effectively becomes an array of linked lists. No need to invoke a SLL class; just nodes.
* $\mathcal O(1)$ insertions possible by placing the new node at the head of an existing linked list.
* If we guarantee max length of node chains, say $L$, search operations guaranteed at $\mathcal O(L)=\mathcal O(1)$.
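
The chaining idea in the last few bullets can be sketched as follows. This is a minimal illustration rather than a prescribed design: the names `Node`, `room_number`, `check_in`, and `is_guest` are made up for this sketch, and the hash is the sum-based one suggested above.

```python
class Node:
    """One link in a room's chain: a guest name plus a reference to the next node."""

    def __init__(self, name, next_node=None):
        self.name = name
        self.next = next_node


def room_number(name, n_rooms):
    """Sum-based hash: sum of the character codes, modulo the number of rooms."""
    return sum(ord(ch) for ch in name) % n_rooms


def check_in(hotel, name):
    """O(1) insertion: the new node becomes the head of its room's chain."""
    room = room_number(name, len(hotel))
    hotel[room] = Node(name, hotel[room])


def is_guest(hotel, name):
    """Walk the chain in the guest's room; the cost grows with the chain length."""
    node = hotel[room_number(name, len(hotel))]
    while node is not None:
        if node.name == name:
            return True
        node = node.next
    return False


hotel = [None] * 1000  # 1000 rooms, each starting as an empty chain
for guest in ["John", "James", "Mary", "Myra"]:
    check_in(hotel, guest)

print(is_guest(hotel, "Mary"), is_guest(hotel, "Myra"), is_guest(hotel, "Zed"))
```

`Mary` and `Myra` are anagrams, so the sum-based hash sends both to the same room, and the later arrival becomes the new head of that chain.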

# Summary of assignments

This assignment comprises *** problems.

* Understanding hashcode choices



# Coming up with the right hash code

Consider a hotel with $N$ rooms (maybe, $N=1024$). Create $N$ guest names. Try to assign them to rooms using a simple hashcode and measure how many collisions you detect.

For this exercise you'll have to write several methods to manage the complexity of the problem. The file `simulate_collisions.py` suggests an outline for these methods -- you don't have to follow it if you prefer a different approach.

A hash function $H$ takes the hashcode of a string $s$ and maps it to an integer value in the interval $[0,N)$ using the transformation 

$$
H(s) = {\color{purple}\text{hashcode}(s)}\bmod N
$$ 

Here, $\bmod$ denotes the remainder of integer division.

A simple hash function can be obtained using a sum-based hashcode:

$$
H(\texttt{name}) = {\color{purple}\underbrace{{\color{blue}\left(\sum_{i=0}^{\text{len}(\texttt{name})-1} \texttt{ord(name[i])}\right)}}_{\text{hashcode}}} \bmod N
$$
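
One possible way to write the sum-based hash in Python is sketched below. The function name `sum_hash` and the default `n_rooms=1024` are assumptions made for this illustration; iterating over the characters covers the indices $0$ through $\text{len}(\texttt{name})-1$.

```python
def sum_hash(name, n_rooms=1024):
    """Sum the character codes of `name`, then reduce modulo the number of rooms."""
    hashcode = sum(ord(ch) for ch in name)
    return hashcode % n_rooms


print(sum_hash("John"))                    # 74 + 111 + 104 + 110 = 399
print(sum_hash("Mary"), sum_hash("Myra"))  # anagrams share the same sum
```

Because addition ignores order, any two names made of the same letters land in the same room.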

Since you are using a random string generator, you may want to run the code a few times and compute the average number of collisions.

Next, change the hashcode to a slightly different function:

$$
H(\texttt{name}) = \left( \prod_{i=0}^{\text{len}(\texttt{name})-1} \texttt{ord(name[i])} \right) \bmod N
$$
and repeat the measurements. 
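
A sketch of the product-based variant, under the same assumptions (the name `product_hash` and the default room count are illustrative only). It relies on `math.prod`, available in Python 3.8 and later.

```python
from math import prod


def product_hash(name, n_rooms=1024):
    """Multiply the character codes of `name`, then reduce modulo the number of rooms."""
    hashcode = prod(ord(ch) for ch in name)
    return hashcode % n_rooms


print(product_hash("John"))
# ord("h") = 104 = 8 * 13, so four of them contribute 2**12, a multiple of 1024.
print(product_hash("hhhh"))  # 0
```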

Finally, change the hashcode to a more sophisticated function:

$$
H(\texttt{name}) = \left( \sum_{i=0}^{\text{len}(\texttt{name})-1} \texttt{ord(name[i])}\times 31^{\text{len}(\texttt{name})-1-i} \right) \bmod N
$$
and repeat the measurements. 
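
The polynomial hashcode does not require computing powers of 31 explicitly. A minimal sketch using Horner's rule is shown below; the name `polynomial_hash` and its default parameters are assumptions for this illustration.

```python
def polynomial_hash(name, n_rooms=1024, base=31):
    """Evaluate sum(ord(name[i]) * base**(len(name)-1-i)) by Horner's rule, then reduce."""
    hashcode = 0
    for ch in name:
        # Shift everything seen so far by one power of the base, then add the new code.
        hashcode = hashcode * base + ord(ch)
    return hashcode % n_rooms


print(polynomial_hash("ab"))  # (97 * 31 + 98) % 1024 = 33
print(polynomial_hash("Mary"), polynomial_hash("Myra"))  # anagrams no longer coincide
```

Unlike the sum, this hashcode depends on the position of each character, so rearranging the letters generally changes the result.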


What do you observe, and why do you think that is?



## Solution

This problem showcases several classic hashing gotchas in one go. What we typically observe for $N = 1024$:

* Sum-based hash: OK-ish but not great. With random letters and variable lengths, we see a distribution that looks somewhat uniform but carries low-bit patterns; the collision count is often worse than ideal.

* Product-based hash: Disastrously bad with a power-of-two table size. Most names end up in bucket 0 (or a tiny set of buckets). Collisions skyrocket.

* Polynomial (base 31)-based hash: Best of the three. We see collisions close to the uniform baseline.
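
These observations can be reproduced with a small experiment along the following lines. It is only a sketch under stated assumptions: the name generator (distinct strings of 8 to 16 random ASCII letters), the number of trials, and all function names are choices made for this illustration, not part of `simulate_collisions.py`. A collision is counted whenever a guest hashes to a room that is already taken.

```python
import random
import string
from math import prod

N = 1024  # number of rooms, as in the exercise


def sum_code(name):
    return sum(ord(ch) for ch in name)


def product_code(name):
    return prod(ord(ch) for ch in name)


def polynomial_code(name, base=31):
    code = 0
    for ch in name:  # Horner's rule
        code = code * base + ord(ch)
    return code


def random_names(rng, count, min_len=8, max_len=16):
    """Assumed generator: `count` distinct names made of random ASCII letters."""
    names = set()
    while len(names) < count:
        length = rng.randint(min_len, max_len)
        names.add("".join(rng.choice(string.ascii_letters) for _ in range(length)))
    return sorted(names)


def count_collisions(names, hashcode, n_rooms=N):
    """Guests minus distinct occupied rooms = guests who found their room taken."""
    rooms = {hashcode(name) % n_rooms for name in names}
    return len(names) - len(rooms)


def average_collisions(hashcode, trials=20, seed=0):
    rng = random.Random(seed)
    total = sum(count_collisions(random_names(rng, N), hashcode) for _ in range(trials))
    return total / trials


results = {
    "sum": average_collisions(sum_code),
    "product": average_collisions(product_code),
    "polynomial": average_collisions(polynomial_code),
}
for label, value in results.items():
    print(f"{label:<12}{value:8.1f}")
```

The exact numbers depend on the seed and on how the names are generated, the sum-based result in particular, so treat any single run as indicative only.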

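**Sanity check:** if $m$ guests are hashed uniformly at random into $N$ rooms, and a collision is counted each time a guest lands in an already occupied room, the expected number of collisions is $m$ minus the expected number of occupied rooms: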

$$
\mathbf{E}[\text{collisions}]
= m - N\left(1 - \left(1 - \frac{1}{N}\right)^{m}\right).
$$

Under uniform hashing with $m=N$ guests and $N$ rooms, the expected number of collisions is about

$$
\mathbf{E}[\text{collisions}]
= N - N\left(1 - \left(1 - \frac{1}{N}\right)^{N}\right)
= N\left(1 - \frac{1}{N}\right)^{N}
\approx \frac{N}{e}
\approx 0.368\,N.
$$
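
The baseline can be evaluated numerically to see how close the $N/e$ approximation is. The helper name `expected_collisions` is an assumption for this sketch.

```python
import math


def expected_collisions(m, n):
    """Expected collisions when m guests are hashed uniformly at random into n rooms."""
    expected_occupied = n * (1 - (1 - 1 / n) ** m)
    return m - expected_occupied


N = 1024
exact = expected_collisions(N, N)
approx = N / math.e
print(f"exact:  {exact:.2f}")
print(f"approx: {approx:.2f}")
```

For $N = 1024$ both values come out near 377, which is the figure to compare measured collision counts against.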

**Why these outcomes happen**

* Sum-based hash: Reducing modulo $2^{10}$ keeps only the lowest 10 bits of the sum. ASCII letters share high-bit patterns, and the low five bits of `ord` aren’t uniform (for `A`..`Z`, they range over 1–26 rather than 0–31). Summing several such values tends toward uniformity by convolution, but low-bit periodicity and length coupling remain, and the sum ignores character order, so anagrams always collide. The result is noticeable bias and more collisions than ideal.

* Product-based hash: Catastrophic with $N=2^{10}$. Each character with an even code contributes at least one factor of 2. For uppercase and lowercase letters alike, the expected 2-adic valuation per letter (the number of factors of 2 in its code) is about 0.885, so after only ~12 characters the product tends to be divisible by $2^{10}$. Result: the hash is very often 0, flooding one bucket → massive collisions.

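* Polynomial-based hash: Multiplies the running value by an odd base (31) and adds each `ord`, so both the position and the value of every character affect the result. Because 31 is odd, it is coprime to $N=2^{10}$, and multiplying by it is invertible modulo $N$, so no low-bit information is lost. This scheme spreads names much more evenly, and collision counts land close to the uniform baseline.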
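
The 0.885 figure quoted for the product-based hash can be checked directly by counting factors of 2 in each letter's code. The helper name `two_adic_valuation` is an assumption for this sketch.

```python
import string


def two_adic_valuation(n):
    """Number of times 2 divides the positive integer n."""
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count


totals = {}
for label, letters in [("uppercase", string.ascii_uppercase),
                       ("lowercase", string.ascii_lowercase)]:
    totals[label] = sum(two_adic_valuation(ord(ch)) for ch in letters)
    mean = totals[label] / len(letters)
    print(f"{label}: {totals[label]} factors of 2 over {len(letters)} letters, mean {mean:.3f}")
```

Both alphabets give $23/26 \approx 0.885$, so on average roughly $10 / 0.885 \approx 11.3$ letters supply the ten factors of 2 needed to make the product a multiple of $2^{10}$.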


