# Hash Collisions

## Lesson Overview

Hash functions are useful because they let us store complex data types and values in integer-valued hash buckets. However, a problem arises when a hash function maps two distinct values to the same bucket: looking at the bucket alone, we can't distinguish between the two values.

> A **hash collision** is when a hash function maps two different values to the same bucket.

For a hash function $H$, a collision exists for two values $a$ and $b$ if $a \neq b$ but $H(a) = H(b)$.

When there are more input values than buckets, the hash function is guaranteed to have collisions. Even when there are fewer input values than buckets, it is still possible to have collisions.
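Both statements can be checked on a toy example. The sketch below uses a hypothetical three-bucket hash function `toy_hash` (not part of the lesson) and a helper `is_collision`, also introduced here, that encodes the definition above.

```python
def toy_hash(x):
  # Hypothetical hash function with 3 buckets: 0, 1, and 2.
  return x % 3

def is_collision(hash_fn, a, b):
  # A collision needs two *different* values that share a bucket.
  return a != b and hash_fn(a) == hash_fn(b)

# Four inputs, three buckets: at least two inputs must share a bucket.
many_inputs = [0, 1, 2, 3]
print([toy_hash(x) for x in many_inputs])  # [0, 1, 2, 0]
print(is_collision(toy_hash, 0, 3))        # True

# Two inputs, three buckets: a collision is still possible...
print(is_collision(toy_hash, 1, 4))        # True
# ...but it is not guaranteed.
print(is_collision(toy_hash, 1, 2))        # False
```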

## Question 1

Which of the following statements about a hash collision are true? There may be more than one correct response.


**a)** A hash collision occurs when a hash function maps two different inputs to the same bucket.

**b)** If a hash function has more input values than buckets, a collision cannot be avoided.

**c)** If a hash function has fewer input values than buckets, a collision cannot occur.

### Solution

The correct answers are **a)** and **b)**. 

**c)** is false. A collision can still occur even when there are more buckets than input values.

## Question 2

Your incoming data can be any single lower-case letter. Will the following hash function have any potential collisions? If so, for what values and buckets?

In [None]:
import string

def hash_bucket(char):
  # Raise an error if the character is not a lower-case character.
  if len(char) != 1 or not char.islower():
    raise ValueError('Input must be a single lower-case letter.')

  # string.ascii_lowercase.index returns the position of a letter in the
  # alphabet. For example:
  # - string.ascii_lowercase.index('a') = 0
  # - string.ascii_lowercase.index('e') = 4
  return string.ascii_lowercase.index(char) % 25

In [None]:
#freetext

### Solution

This hash function has 25 buckets, since the `return` statement ends with `% 25`. In general, whenever a hash function ends with `% n` for a positive integer `n`, it has $n$ buckets, since in Python `x % n` is always an integer between 0 and $n-1$ for any integer `x`.
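As a quick check of that range, the sketch below (with an arbitrary `n = 5` chosen for illustration) collects every result of `x % n` over a run of integers. Negative integers are included, since Python also maps them into 0 through $n-1$ when `n` is positive.

```python
n = 5  # Arbitrary number of buckets, chosen for illustration.

# Collect every distinct result of x % n for integers from -20 to 20.
results = {x % n for x in range(-20, 21)}

print(sorted(results))  # [0, 1, 2, 3, 4]
```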

There are 26 possible input values, `'a'` through `'z'`. Since there are more input values than buckets, this function is guaranteed to have collisions.

The only collision is between `'a'` and `'z'`, which both map to bucket 0.

```python
hash_bucket('a')  # 0 % 25 = 0
hash_bucket('z')  # 25 % 25 = 0
```

In [None]:
import string

def hash_bucket(char):
  # Raise an error if the character is not a lower-case character.
  if len(char) != 1 or not char.islower():
    raise ValueError('Input must be a single lower-case letter.')

  # string.ascii_lowercase.index returns the position of a letter in the
  # alphabet. For example:
  # - string.ascii_lowercase.index('a') = 0
  # - string.ascii_lowercase.index('e') = 4
  return string.ascii_lowercase.index(char) % 25

In [None]:
print(hash_bucket('a'))
print(hash_bucket('z'))
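
To confirm that `'a'` and `'z'` are the only colliding pair, one option is to group all 26 letters by bucket. This is a sketch: `letters_by_bucket` and `collisions` are names introduced here for illustration, and `hash_bucket` is redefined without input validation so that the block is self-contained.

```python
import string

def hash_bucket(char):
  # Same mapping as the lesson's function, without the input validation.
  return string.ascii_lowercase.index(char) % 25

# Map each bucket to the list of letters that land in it.
letters_by_bucket = {}
for letter in string.ascii_lowercase:
  letters_by_bucket.setdefault(hash_bucket(letter), []).append(letter)

# Keep only the buckets that hold more than one letter.
collisions = {b: ls for b, ls in letters_by_bucket.items() if len(ls) > 1}
print(collisions)  # {0: ['a', 'z']}
```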

## Question 3

Consider the following hash function for integers.

In [None]:
def hash_bucket(i):
  return i**2 % 10

Below are the data you need to store using this hash function.

In [None]:
data = [0, 6, -6, 3, 9, -5, 2, 1]

For this data and hash function, do you expect any collisions? If so, for what values and buckets?

In [None]:
# TODO(you): Write code to check for collisions.

### Solution

This hash function has 10 buckets and there are 8 data points. Counting buckets alone, we are not *guaranteed* to have collisions, but there may be some anyway.

The easiest way to check for collisions is to compute the hash bucket for every input value. (In practice this is not always feasible, for example when you have billions of inputs.) This can be accomplished with a `for` loop.

In [None]:
def hash_bucket(i):
  return i**2 % 10

data = [0, 6, -6, 3, 9, -5, 2, 1]

In [None]:
for i in data:
  print("input: %d, hash_bucket: %d" % (i, hash_bucket(i)))

Both 6 and -6 map to 6, and both 1 and 9 map to 1. Therefore, two buckets (6 and 1) have collisions, each containing two data entries.
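
A side observation, offered as a sketch rather than as part of the original solution: the last digit of a square depends only on the last digit of the input, so `i**2 % 10` cannot reach all ten buckets. Listing the buckets for the digits 0 through 9 shows which ones are reachable (`reachable` is a name introduced here).

```python
def hash_bucket(i):
  return i**2 % 10

# The bucket depends only on the last digit of i, so checking the digits
# 0 through 9 is enough to find every reachable bucket.
reachable = sorted({hash_bucket(d) for d in range(10)})
print(reachable)       # [0, 1, 4, 5, 6, 9]
print(len(reachable))  # 6
```

Only six buckets are ever used, so the bucket count of 10 is an upper bound, and any eight distinct inputs to this particular function must produce at least one collision.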

## Question 4

Consider the following hash function for integers.

In [None]:
def hash_bucket(i):
  return i**2 % 10

Below are the data you need to store using this hash function.

In [None]:
data = [0, 6, -6, 3, 9, -5, 2, 1]

In the previous question, we showed that both 6 and -6 map to 6, and both 1 and 9 map to 1. What changes might you make to this hash function in order to reduce the number of collisions to 0?

In [None]:
def hash_bucket(i):
  # TODO(you): Edit this function to avoid collisions.
  return i**2 % 10

### Solution

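This is not the only correct solution. Squaring discards the sign of the input, so `i` and `-i` always land in the same bucket. Cubing preserves the sign, and Python's modulo operator returns a non-negative result for negative numbers (for example, `-216 % 10` is 4, while `216 % 10` is 6), so 6 and -6 no longer collide. You can therefore change `hash_bucket` to cube the number rather than square it.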

In [None]:
def hash_bucket(i):
  return i**3 % 10

data = [0, 6, -6, 3, 9, -5, 2, 1]

In [None]:
for i in data:
  print("input: %d, hash_bucket: %d" % (i, hash_bucket(i)))

Now, no two inputs map to the same output.
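
The claim can also be checked without reading the printed output line by line. The sketch below compares squaring and cubing for 6 and -6, then counts distinct buckets; `cube_bucket` is a name used here only to avoid redefining `hash_bucket`.

```python
def cube_bucket(i):
  return i**3 % 10

data = [0, 6, -6, 3, 9, -5, 2, 1]

print((-6)**2 == 6**2)  # True: squaring maps 6 and -6 to the same value.
print(6**3 % 10)        # 6, since 216 % 10 = 6
print((-6)**3 % 10)     # 4, since -216 = -22 * 10 + 4

# No collisions means every input lands in its own bucket.
buckets = [cube_bucket(i) for i in data]
print(len(set(buckets)) == len(data))  # True
```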

## Question 5

Your coworker needs some help with a hash function, which is being used to store employee identification values. The company has 1,000 employees. Each employee's ID is a random integer between 0 and 99,999. The hash function used by your coworker has 2,000 buckets. Since there are more hash buckets than input values, your coworker doesn't know for sure if there are any collisions.

Instead of going through each value one by one, can you automate the checking of collisions? Assume:

- The hash function has the signature `def hash_bucket(employee_id)`, where:
  - `employee_id` is an integer
  - `hash_bucket` returns an integer between 0 and 1,999

- The employee identification values are stored in a list called `employee_ids`.
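
Before automating the check, a rough estimate shows why collisions should be expected here. The sketch below assumes, hypothetically, that the hash function spreads IDs uniformly and independently across the 2,000 buckets (nothing in the problem guarantees this). Under that assumption, it computes the probability that 1,000 IDs all land in different buckets.

```python
n_buckets = 2000
n_employees = 1000

# Under the uniform assumption, the k-th ID avoids the k buckets that are
# already occupied with probability (n_buckets - k) / n_buckets.
p_no_collision = 1.0
for k in range(n_employees):
  p_no_collision *= (n_buckets - k) / n_buckets

print(p_no_collision)
```

The printed probability is far smaller than $10^{-100}$, so under this assumption at least one collision is practically certain even though it is not guaranteed.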

In [None]:
def check_collisions(hash_bucket, employee_ids):
  """Checks if a hash bucket function has collisions for a list of employee IDs.

  Args:
    hash_bucket: A hash function of employee ID to integer.
    employee_ids: A list of employee_ids.

  Returns:
    A list of tuples, one for each bucket that has multiple entries. The
    first element of each tuple is the hash bucket. The second element is a
    list of all the employee IDs in that bucket.
  """
  # TODO(you): Implement
  print("This function has not been implemented.")

### Unit Tests

Run the following cell to check your answer against some unit tests.

In [None]:
# Get some random employee IDs in the range 0 to 99.
import random
# Set a seed for consistent results.
# If you want random results, comment the following line out.
random.seed(1)

employee_ids = []
n_employees = 10
for _ in range(n_employees):
  employee_ids.append(random.randrange(100))

print(check_collisions(lambda x: x % 20, employee_ids))
# Should print: [(17, [17, 97, 97, 57]), (12, [72, 32])]

### Solution

This can be automated in a `for` loop. Note that this is not the only solution.

In [None]:
def check_collisions(hash_bucket, employee_ids):
  """Checks if a hash bucket function has collisions for a list of employee IDs.

  Args:
    hash_bucket: A hash function of employee ID to integer.
    employee_ids: A list of employee_ids.

  Returns:
    A list of tuples, one for each bucket that has multiple entries. The
    first element of each tuple is the hash bucket. The second element is a
    list of all the employee IDs in that bucket.
  """
  # Since we need to check if a bucket has already been hit, it makes sense to
  # use a dictionary, mapping a bucket to all of its entries.
  buckets = {}

  # Loop through all the employee_ids.
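  # (The loop variable is not named `id`, to avoid shadowing the built-in.)
  for employee_id in employee_ids:
    bucket = hash_bucket(employee_id)
    if bucket in buckets:
      buckets[bucket].append(employee_id)
    else:
      buckets[bucket] = [employee_id]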

  # Return the buckets that have more than one entry.
  output = []
  for key, val in buckets.items():
    if len(val) > 1:
      output.append((key, val))

  return output
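
As one alternative, the grouping step can be written with `collections.defaultdict`, which removes the explicit membership test. This is a sketch intended to behave like the solution above; the name `check_collisions_defaultdict` is introduced here for illustration.

```python
import collections

def check_collisions_defaultdict(hash_bucket, employee_ids):
  # Map each bucket to the list of employee IDs that hash to it.
  buckets = collections.defaultdict(list)
  for employee_id in employee_ids:
    buckets[hash_bucket(employee_id)].append(employee_id)

  # Keep only the buckets that hold more than one ID.
  return [(bucket, ids) for bucket, ids in buckets.items() if len(ids) > 1]

print(check_collisions_defaultdict(lambda x: x % 20, [17, 72, 97, 8, 32]))
# [(17, [17, 97]), (12, [72, 32])]
```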

## Question 6

Design a hash function with 10 buckets that satisfies the following map. The hash function only accepts non-negative integers.

Input | Hash Bucket
----- | -----------
0     | 1
3     | 8
4     | 6
6     | 4
7     | 8
10    | 4

In [None]:
def hash_bucket(i):
  if not isinstance(i, int) or i < 0:
    raise ValueError("Input must be a non-negative integer.")
  # TODO(you): Implement
  print("This function has not been implemented.")

### Hint

Notice that 0 is the only input that has an odd bucket. What function produces even results for all positive integers, but 1 for 0?

### Unit Tests

Run the following cell to check your answer against some unit tests.

In [None]:
inputs = [0, 3, 4, 6, 7, 10]
buckets = []

for i in inputs:
  buckets.append(hash_bucket(i))

print(buckets)
# Should print: [1, 8, 6, 4, 8, 4]

### Solution

There may be several solutions. Here is just one.

In [None]:
def hash_bucket(i):
  if not isinstance(i, int) or i < 0:
    raise ValueError("Input must be a non-negative integer.")
  return 2**i % 10

We can check our answers using a `for` loop.

In [None]:
inputs = [0, 3, 4, 6, 7, 10]
buckets = []

for i in inputs:
  buckets.append(hash_bucket(i))

print(buckets)
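
It is worth noting how few buckets this solution uses. The sketch below lists the buckets for the first several inputs. The use of Python's built-in three-argument `pow` is a substitution made here; it computes the same value as `2**i % 10` without building the large intermediate power.

```python
def hash_bucket(i):
  if not isinstance(i, int) or i < 0:
    raise ValueError("Input must be a non-negative integer.")
  # pow(2, i, 10) equals 2**i % 10.
  return pow(2, i, 10)

print([hash_bucket(i) for i in range(9)])
# [1, 2, 4, 8, 6, 2, 4, 8, 6]
```

After bucket 1 for input 0, the outputs cycle through 2, 4, 8, 6. Only 5 of the 10 buckets are ever used, and positive inputs that differ by a multiple of 4 collide (for example, 6 and 10).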

## Question 7

Your colleague has written the following hash function, which maps a word to an integer bucket.

In [None]:
def hash_bucket(word):
  return len(word) % 100

They then use this hash function to map the words in the following sentence ([source](https://en.wikipedia.org/wiki/Collision_(computer_science))) to integer buckets.

> *In computer science, a collision or clash is a situation that occurs when two distinct pieces of data have the same hash value.*

However, they notice that there are some collisions, even though the number of buckets (100) is much greater than the number of unique words in the sentence (22).

Can you explain to your colleague why this might be a suboptimal choice of hash function for this use case?

In [None]:
#freetext

### Solution

The problem with this hash function is that while there are theoretically 100 buckets, in order for a word to be put in bucket 99, it needs to have 99 characters in it. This is basically never true in English. Therefore, while there are 100 buckets, the *distribution* of values into buckets is not uniform. Most words will fall into buckets 1-15, with almost none falling in buckets above 20.

In [None]:
words = ["In", "computer", "science", "a", "collision", "or", "clash", "is",
         "a", "situation", "that", "occurs", "when", "two", "distinct",
         "pieces", "of", "data", "have", "the", "same", "hash", "value"]

unique_words = set(words)

def hash_bucket(word):
  return len(word) % 100

for word in unique_words:
  print("'%s' hashes to: %d" % (word, hash_bucket(word)))

Words with the same length always have the same hash bucket.
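
To see how uneven the distribution is, one option is to count how many unique words land in each bucket. The sketch below uses `collections.Counter`; `bucket_counts` is a name introduced here for illustration.

```python
import collections

words = ["In", "computer", "science", "a", "collision", "or", "clash", "is",
         "a", "situation", "that", "occurs", "when", "two", "distinct",
         "pieces", "of", "data", "have", "the", "same", "hash", "value"]

def hash_bucket(word):
  return len(word) % 100

# Count how many unique words fall into each bucket.
bucket_counts = collections.Counter(hash_bucket(w) for w in set(words))

for bucket, count in sorted(bucket_counts.items()):
  print("bucket %d: %d word(s)" % (bucket, count))

print("buckets used: %d of 100" % len(bucket_counts))  # buckets used: 9 of 100
```

The 22 unique words occupy only 9 buckets, so collisions were unavoidable for this sentence despite the 100 nominal buckets.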