# Week 09: Hash collisions

You were asked to study the behavior of three simple hash codes and compare their performance.


A hash function $H$ takes the hash code of a string $s$ and maps it to an integer in the interval $[0,N)$ using the transformation

$$H(s) =\text{hashcode}(s) \bmod N$$

Here, $\bmod$ is the remainder after integer division. The three hash codes differ only in how $\text{hashcode}(s)$ is computed:

* Sum-based

$$
H(\texttt{name}) = \underbrace{\left(\sum_{i=0}^{\text{len}(\texttt{name})-1} \texttt{ord(name[i])}\right)}_{\text{hashcode}} \bmod N
$$


* Product-based

$$
H(\texttt{name}) = \left( \prod_{i=0}^{\text{len}(\texttt{name})-1} \texttt{ord(name[i])} \right) \bmod N
$$

* Polynomial


$$
H(\texttt{name}) = \left( \sum_{i=0}^{\text{len}(\texttt{name})-1} \texttt{ord(name[i])}\times 31^{\text{len}(\texttt{name})-1-i} \right) \bmod N
$$
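The three definitions above can be sketched directly in Python. This is a minimal sketch rather than part of the assignment code: the function names `sum_hashcode`, `product_hashcode`, `polynomial_hashcode`, and `hash_to_room` are illustrative assumptions.

```python
from math import prod


def sum_hashcode(name: str) -> int:
    """Sum of the character codes of name."""
    return sum(ord(c) for c in name)


def product_hashcode(name: str) -> int:
    """Product of the character codes of name."""
    return prod(ord(c) for c in name)


def polynomial_hashcode(name: str) -> int:
    """Character codes weighted by powers of 31; the first character gets the highest power."""
    n = len(name)
    return sum(ord(c) * 31 ** (n - 1 - i) for i, c in enumerate(name))


def hash_to_room(hashcode: int, N: int) -> int:
    """Compress a hash code into the interval [0, N)."""
    return hashcode % N


# Compare the three hash codes on one short string (hypothetical input).
for h in (sum_hashcode, product_hashcode, polynomial_hashcode):
    print(h.__name__, h("ad"), hash_to_room(h("ad"), 1024))
```

Keeping the hash code separate from the final `% N` step mirrors the definition of $H$: only the hash code changes between the three variants.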



In [None]:
import random


class SimulateCollisions:

    # Constants for random string generation
    DEFAULT_MIN_LENGTH = 10
    DEFAULT_MAX_LENGTH = 15
    ASCII_OFFSET = ord("A")
    ASCII_SIZE = 26

    # Default hotel size
    DEFAULT_N = 1_024
    DEFAULT_GUESTS = DEFAULT_N

    # Default simulations
    DEFAULT_TRIALS = 10

    def __init__(
        self,
        N: int = DEFAULT_N,
        guests: int = DEFAULT_GUESTS,
        trials: int = DEFAULT_TRIALS,
    ):
        self.N = N
        self.guests = guests
        self.hotel = [None] * N
        self.trials = trials
        self.min_length = self.DEFAULT_MIN_LENGTH
        self.max_length = self.DEFAULT_MAX_LENGTH

    def reset(self):
        """Reset the hotel for a new simulation. """
        self.hotel = [None] * self.N

    def generate_random_string(self):
        """Generate a random string of length between min_length and max_length."""
        length = random.randint(self.min_length, self.max_length)
        return "".join(
            chr(self.ASCII_OFFSET + random.randint(0, self.ASCII_SIZE - 1))
            for _ in range(length)
        )

    def hashcode(self, name: str) -> int:
        """A simple hash function for strings."""
        h = 0
        for char in name:
            h = h * 31 + ord(char)
        return h

    def hash_function(self, name: str) -> int:
        """Hash function to map a name to a hotel room."""
        return self.hashcode(name) % self.N

    def check_in(self, name: str) -> bool:
        """Attempt to check in a guest. Return True if successful, False if there's a collision."""
        # Find where to place the guest
        h = self.hash_function(name)
        # Check if the room is available
        checked_in = self.hotel[h] is None
        if checked_in:
            # No collision, check in the guest
            self.hotel[h] = name
        return checked_in

    def simulate_check_in(self) -> int:
        """Simulate the check-in process for all guests."""
        success = 0
        for _ in range(self.guests):
            # Generate a random name
            name = self.generate_random_string()
            # Attempt to check in the guest
            if self.check_in(name):
                # No collision, increment success count
                success += 1
        return success

    def main(self):
        """Run multiple simulations and report average success rate."""
        num_simulations = self.trials
        total_success = 0
        # Run the simulations
        for _ in range(num_simulations):
            # Reset the hotel for a new simulation
            self.reset()
            # Simulate check-ins and accumulate success count
            current_success = self.simulate_check_in()
            # Accumulate total success
            total_success += current_success
        print(
            f"\nN={self.N:,d}; simulations={num_simulations}; average admissions: {total_success / num_simulations}"
        )


if __name__ == "__main__":
    experiment = SimulateCollisions()
    experiment.main()



N=1,024; simulations=10; average admissions: 650.8


# Discussion

## Sum of ASCII values

The sum-based hash cannot distinguish between anagrams: for example, the strings `"forest"` and `"foster"` yield the same hash code. Unrelated strings can collide as well; for example, the ASCII values of `"ad"` and `"bc"` both add up to 197. When strings have similar lengths and draw from the same character set (letters only), the sums cluster in a relatively narrow band, which leads to more collisions.
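The collisions and the narrow band can be checked with a few lines of Python. This is a sketch under the simulation's settings (uppercase letters only, lengths 10 to 15); the helper name `sum_hashcode` is an assumption made for illustration.

```python
def sum_hashcode(name: str) -> int:
    """Sum of the character codes of name."""
    return sum(ord(c) for c in name)


# Anagrams always collide because addition ignores order.
print(sum_hashcode("forest") == sum_hashcode("foster"))  # True

# Unrelated strings can collide too.
print(sum_hashcode("ad"), sum_hashcode("bc"))  # 197 197

# Band of possible sums for uppercase strings of length 10 to 15.
lowest = 10 * ord("A")   # shortest string, all "A"
highest = 15 * ord("Z")  # longest string, all "Z"
print(lowest, highest, highest - lowest + 1)  # 650 1350 701
```

Under these assumptions there are only 701 possible sums, so at most 701 of the 1,024 rooms can ever be reached, no matter how many guests arrive.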

## Multiplication of ASCII values

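The product of several character codes grows very quickly. For example, the product of the ASCII values in `"computer"` is

$$99 \times 111 \times 109 \times 112 \times 117 \times 116 \times 101 \times 114 = 20{,}963{,}933{,}340{,}045{,}696$$

With hash codes this large, we might expect a wide spread of hashed values. However, because multiplication is commutative, anagrams share the same hash: `"percomut"` (a made-up word) has the same product hash as `"computer"`. Furthermore, many ASCII values share the small prime factors 2, 3, and 5, and these factors accumulate in the product. Every even character code contributes at least one factor of 2, so when $N$ is a power of two, such as the default $N = 1{,}024 = 2^{10}$, any name whose character codes contain ten or more factors of 2 in total hashes to 0, and many names pile up in the same few rooms.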
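The effect of shared factors is easy to see when $N$ is a power of two. The sketch below assumes the default $N = 1{,}024$ from the simulation; `product_hashcode` and `factors_of_two` are hypothetical helper names, and the test strings `"PPP"` and `"HHHH"` are chosen only because their character codes (80 and 72) are rich in factors of 2.

```python
from math import prod

N = 1024  # 2**10, the default hotel size used in the simulation


def product_hashcode(name: str) -> int:
    """Product of the character codes of name."""
    return prod(ord(c) for c in name)


def factors_of_two(n: int) -> int:
    """Count how many times 2 divides the positive integer n."""
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count


# Anagrams collide because multiplication is commutative.
print(product_hashcode("computer") == product_hashcode("percomut"))  # True

# Unrelated names collide at room 0 once the product holds
# ten or more factors of 2.
for name in ("PPP", "HHHH"):
    code = product_hashcode(name)
    print(name, factors_of_two(code), code % N)
```

Both test strings land in room 0 because each product is divisible by $2^{10}$. Longer names contain more even character codes, so this becomes more likely as names grow.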

## Polynomial 

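Each character is weighted by a power of 31 that depends on its position, so reordering the characters generally changes the hash code, and anagrams no longer collide systematically. The `hashcode` method in the simulation code computes this polynomial incrementally using Horner's rule. Collisions are still possible, especially after reducing modulo $N$, but they are no longer forced by the structure of the hash.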
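A short check can confirm that the loop `h = h * 31 + ord(char)` evaluates the polynomial formula, and that the pairs that collided under the sum-based hash now separate. This is a sketch; the names `polynomial_hashcode` and `horner_hashcode` are assumptions made for illustration.

```python
def polynomial_hashcode(name: str) -> int:
    """Direct evaluation of the polynomial formula."""
    n = len(name)
    return sum(ord(c) * 31 ** (n - 1 - i) for i, c in enumerate(name))


def horner_hashcode(name: str) -> int:
    """The same polynomial, evaluated with Horner's rule."""
    h = 0
    for char in name:
        h = h * 31 + ord(char)
    return h


# Both evaluations agree on every test word.
for word in ("forest", "foster", "ad", "bc"):
    assert polynomial_hashcode(word) == horner_hashcode(word)

# Pairs that collide under the sum-based hash now differ.
print(horner_hashcode("forest") != horner_hashcode("foster"))  # True
print(horner_hashcode("ad"), horner_hashcode("bc"))  # 3107 3137

# Position weighting does not rule out collisions altogether.
print(horner_hashcode("Aa"), horner_hashcode("BB"))  # 2112 2112
```

The last line is a reminder that a polynomial hash reduces collisions without eliminating them: `"Aa"` and `"BB"` are different strings with the same hash code.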

## Summary

Among the given choices, polynomial hashing is the best, followed by the sum-based hash. The product-based hash is the worst.
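The ranking can be checked empirically by sending the same random guests through all three hash codes. The sketch below assumes the simulation's settings (1,024 rooms, 1,024 guests, uppercase names of length 10 to 15). The helper names and the seed are arbitrary choices for illustration, and the exact counts depend on the seed.

```python
import random
from math import prod

N = 1024       # number of rooms, as in the simulation
GUESTS = 1024  # number of guests, as in the simulation


def random_name(rng: random.Random) -> str:
    """Random uppercase string of length 10 to 15."""
    length = rng.randint(10, 15)
    return "".join(chr(ord("A") + rng.randint(0, 25)) for _ in range(length))


def sum_hashcode(name: str) -> int:
    return sum(ord(c) for c in name)


def product_hashcode(name: str) -> int:
    return prod(ord(c) for c in name)


def polynomial_hashcode(name: str) -> int:
    h = 0
    for char in name:
        h = h * 31 + ord(char)
    return h


def admissions(hashcode, names, n: int = N) -> int:
    """Count guests admitted without a collision.

    Only the first guest mapped to a room is admitted, so the count
    equals the number of distinct rooms that are hit.
    """
    return len({hashcode(name) % n for name in names})


rng = random.Random(9)  # arbitrary seed, for repeatability only
names = [random_name(rng) for _ in range(GUESTS)]

results = {
    h.__name__: admissions(h, names)
    for h in (sum_hashcode, product_hashcode, polynomial_hashcode)
}
for label, count in results.items():
    print(f"{label}: {count} admissions out of {GUESTS}")
```

Using the same list of names for all three hash codes keeps the comparison fair: any difference in admissions comes from the hash code rather than from the random guests.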