# Hash Functions

## Lesson Overview

In order to understand the use cases and implementation of a hash table, it is important to first understand the underlying mechanism, the hash function.

Hash tables use hash functions to map keys of almost any type to integers. The key of the hash table is passed through a hash function, and the resulting integer identifies the storage location of the corresponding value. This key must be **hashable**: calling the hash function on the same key must return the same result every time.

### Definition

> A **hash function** is a [function](https://en.wikipedia.org/wiki/Function_(mathematics)) that maps an arbitrary input to an integer within a pre-specified range.

A hash function can produce only a fixed number of distinct outputs (called **buckets**). For example, the hash function below accepts an arbitrary integer and has 10 hash buckets. It does so with the **modulo** operator ($\%$): for two numbers $n$ and $m$, $n \% m$ is the remainder when $n$ is divided by $m$.

In [None]:
def hash_bucket(value):
  return value % 10

In [None]:
print(hash_bucket(107))
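
To see how the modulo operator groups inputs, the short sketch below runs the same `hash_bucket` function on a handful of integers. The sample values are chosen purely for illustration; inputs that share a last digit land in the same bucket.

```python
def hash_bucket(value):
  # Same function as above: 10 buckets, numbered 0 through 9.
  return value % 10

# Sample inputs chosen only for illustration.
samples = [7, 17, 107, 42, 1000]
buckets = [hash_bucket(v) for v in samples]

for value, bucket in zip(samples, buckets):
  print(value, '->', bucket)
```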

## Question 1

Which of the following statements about a hash function are true? There may be more than one correct response.

**a)** A hash function can be used only with integer inputs.

**b)** A hash function should produce only integer outputs.

**c)** A hash function's output should be randomized so that a given input can produce different outputs when run through the same hash function.

**d)** The hash function you use depends on the number of hash buckets you have available.

### Solution

The correct answers are **b)** and **d)**.

**a)** As long as you have the correct hash function, your inputs can be arbitrary. The input itself just needs to be hashable.

**c)** A hash function should be consistent. If you input a value, a hash function should always return the same output for the given input.

## Question 2

Which one of the following statements about a hash bucket is true?

**a)** Hash tables use hash buckets to store values.

**b)** A hash table uses a single hash bucket that can store multiple values.

**c)** Hash buckets can store only integers.

### Solution

The correct answer is **a)**.

**b)** Generally hash tables attempt to store values so that they're fairly evenly spread across multiple buckets.

**c)** A hash bucket should be able to store any arbitrary data. We use integer outputs from hash functions to identify the hash bucket that we will store our data in.
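
As a rough illustration of these points, here is a minimal sketch that models buckets as a list of lists. The names `N_BUCKETS`, `buckets`, and `store` are hypothetical and exist only for this example. The integer from the hash function selects the bucket, while the bucket itself can hold arbitrary data.

```python
N_BUCKETS = 4  # Hypothetical bucket count for this sketch.

# One empty list per bucket.
buckets = [[] for _ in range(N_BUCKETS)]

def store(key, value):
  # The hash output picks the bucket; the bucket holds the (key, value) pair.
  index = key % N_BUCKETS
  buckets[index].append((key, value))

store(1, 'a string')
store(6, [1.5, 2.5])         # Any type of value can be stored.
store(5, {'nested': True})   # 5 % 4 == 1, so this shares a bucket with key 1.

print(buckets)
```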

## Question 3

Which of the following statements about the modulo operator are true? There may be more than one correct response.


**a)** Any number modulo 0 raises an error.

**b)** The modulo operator can return fractional / decimal values.

**c)** Modulo raises a `NoRemainderError` if no remainder exists.

**d)** A number modulo a larger number raises an error, such as `5 % 6`.

### Solution

The correct answers are **a)** and **b)**. 

**c)** In that case, it just returns 0. We can check this by verifying that `10 % 10 == 0`.

**d)** Modulo would just return the number on the left-hand side. In this case, `5 % 6 == 5` because 5 divided by 6 is 0, with a remainder of 5.
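
Each of the four statements can be checked directly in Python. The sketch below is only a quick demonstration; the variable names are arbitrary.

```python
zero = 0

# a) Modulo by zero raises ZeroDivisionError.
try:
  7 % zero
  raised = False
except ZeroDivisionError:
  raised = True

float_result = 5.5 % 2   # b) Float operands can give a fractional result.
no_remainder = 10 % 10   # c) No remainder: the result is simply 0.
smaller_left = 5 % 6     # d) The left operand is returned unchanged.

print(raised, float_result, no_remainder, smaller_left)
```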

## Question 4

["Duck, Duck, Goose"](https://en.wikipedia.org/wiki/Duck,_duck,_goose) is a common children's game in which children sit in a circle while one player (the caller) walks around the circle saying "Duck" or "Goose" to the other players as they pass by. If "Goose" is ever said, the indicated player must quickly stand up and run clockwise around the circle to tag the caller before the caller can sit down in the player's spot. This action ends the caller's turn. We can trace this out in code using the modulo operator.

Write a `duck_duck_goose` function that takes a list of players as input and some integer `num_calls_before_goose` which is the number of times that the caller will say "duck" before saying "goose". Your function should print a player's name and "Duck..." or "Goose!", depending on which should be said. You can assume `num_calls_before_goose` is an integer greater than 0.

In [None]:
def duck_duck_goose(player_list, num_calls_before_goose):
  # TODO(you): implement
  print('This function has not been implemented.')

### Hint

You'll need to use the modulo operator to iterate through a list. You can use `% len(player_list)` to avoid indexing out of bounds as you iterate. You'll also need to print the player's name and "Duck" or "Goose", like so:

```python
current_player = 0 # You'll need to determine the current player.
print(player_list[current_player], '- Duck...')
```

### Unit Tests

Run the following cell to check your answer against some unit tests.

In [None]:
duck_duck_goose(['Jerry', 'Tom'], 4)
# Should print:
# Jerry - Duck...
# Tom - Duck...
# Jerry - Duck...
# Tom - Duck...
# Jerry - Goose!

duck_duck_goose(['Abigail', 'Billy', 'Chih'], 4)
# Should print:
# Abigail - Duck...
# Billy - Duck...
# Chih - Duck...
# Abigail - Duck...
# Billy - Goose!

duck_duck_goose(['Abigail', 'Billy', 'Chih'], 6)
# Should print:
# Abigail - Duck...
# Billy - Duck...
# Chih - Duck...
# Abigail - Duck...
# Billy - Duck...
# Chih - Duck...
# Abigail - Goose!

### Solution

Don't forget that `num_calls_before_goose` doesn't include calling "goose", so you need to add an extra call for the final player.

In [None]:
def duck_duck_goose(player_list, num_calls_before_goose):
  for i in range(num_calls_before_goose):
    print(player_list[i % len(player_list)], '- Duck...')
  # Now we can call goose. The ducks used indices 0 through
  # num_calls_before_goose - 1, so the goose is at index
  # num_calls_before_goose. Don't forget to use modulo!
  print(player_list[num_calls_before_goose % len(player_list)], '- Goose!')

## Question 5

Parameterize the following hash function so that the number of buckets (10 in the example) is an input to the function.

In [None]:
def hash_bucket(value):
  # TODO(you): Parameterize so that the number of buckets is a function input.
  return value % 10

### Solution

In [None]:
def hash_bucket(value, n_buckets):
  return value % n_buckets

## Question 6

For what values of `n_buckets` is this hash function undefined? Add some lines to raise a `ValueError` for these input values.

In [None]:
def hash_bucket(value, n_buckets):
  # TODO(you): Raise a ValueError if the hash function is undefined.
  return value % n_buckets

### Hint

You can use the `isinstance` function to check if an object is a given type.

In [None]:
isinstance(3, int) # Is 3 an integer?

### Unit Tests

Run the following cell to check your answer against some unit tests.

In [None]:
print(hash_bucket(3, 2))
# Should print: 1

print(hash_bucket(3, -2))
# Should raise: ValueError

print(hash_bucket(3, 1.5))
# Should raise: ValueError

### Solution

The function should raise an error whenever `n_buckets` is not a positive integer.

In [None]:
def hash_bucket(value, n_buckets):
  if not isinstance(n_buckets, int) or n_buckets <= 0:
    raise ValueError('n_buckets must be a positive integer.')

  return value % n_buckets
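
Because an uncaught exception stops a notebook cell, the invalid cases are easier to check one at a time. The sketch below repeats the solution and wraps each call in `try`/`except`. The helper name `raises_value_error` is made up for this example.

```python
def hash_bucket(value, n_buckets):
  if not isinstance(n_buckets, int) or n_buckets <= 0:
    raise ValueError('n_buckets must be a positive integer.')
  return value % n_buckets

def raises_value_error(value, n_buckets):
  # Hypothetical helper: True if hash_bucket rejects these arguments.
  try:
    hash_bucket(value, n_buckets)
  except ValueError:
    return True
  return False

for n in [2, 0, -2, 1.5]:
  print(n, raises_value_error(3, n))
```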

## Question 7

Write a hash function with 100 buckets for a `float` input. The output buckets should be the integers from 0 to 99 inclusive. (There are several correct answers to this question.)

In [None]:
def hash_bucket(f):
  """Hash function for a float-valued input f with 100 buckets."""
  # TODO(you): Implement
  print("This function has not been implemented.")

### Unit Tests

Run the following cell to check your answer against some unit tests.

In [None]:
def output_is_ok(x):
  # Check that a bucket is an integer in [0, 99].
  return isinstance(x, int) and 0 <= x <= 99

print(output_is_ok(hash_bucket(0.123456789)))
# Should print: True

print(output_is_ok(hash_bucket(-9876543210)))
# Should print: True

print(output_is_ok(hash_bucket(-2718281.828459045)))
# Should print: True

print(output_is_ok(hash_bucket(3.141592653589793)))
# Should print: True

### Solution

As stated in the question, there are several possible answers; this is only one of them.

In this hash function, we take the [floor](https://en.wikipedia.org/wiki/Floor_and_ceiling_functions) of the value (the highest integer that is less than or equal to the value), then return the modulo of that number by 100.

`int(f//1)` is just an easy way to calculate `floor(f)` without having to import any packages.

In [None]:
def hash_bucket(f):
  """Hash function for a float-valued input f with 100 buckets."""
  return int(f//1) % 100
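
The difference between flooring and plain `int` conversion only shows up for negative inputs. The following sketch compares the two on an arbitrary example value, using `math.floor` from the standard library as a reference.

```python
import math

def hash_bucket(f):
  """Hash function for a float-valued input f with 100 buckets."""
  return int(f//1) % 100

x = -2.5  # Arbitrary negative example.
print(int(x//1), math.floor(x))  # Both round down, toward negative infinity.
print(int(x))                    # int() alone truncates toward zero instead.
print(hash_bucket(x))            # % with a positive modulus is never negative.
```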

## Question 8

Write a hash function that takes a string of any length and has 25 buckets. (Use the integers from 0 to 24 as the values for your hash buckets.)

In [None]:
def hash_bucket(string):
  # TODO(you): Implement
  print("This function has not been implemented.")

### Hint

If you like, you can use the `ord` function, which takes a single character and returns its integer Unicode code point. For characters in the [ASCII](https://en.wikipedia.org/wiki/ASCII) range, this is the same as the ASCII numeric value.

In [None]:
for char in 'P@s$w0Rd':
  print(ord(char))

### Unit Tests

Run the following cell to check your answer against some unit tests.

In [None]:
def output_is_ok(x):
  return isinstance(x, int) and 0 <= x <= 24

print(output_is_ok(hash_bucket('')))
# Should print: True

print(output_is_ok(hash_bucket('¿Cómo estás?')))
# Should print: True

print(output_is_ok(hash_bucket('how_much_wood_could_a_woodchuck_chuck')))
# Should print: True

print(output_is_ok(hash_bucket('What does Anakin think about sand?')))
# Should print: True

### Solution

This function sums the values returned by `ord` for each character in the string, then takes the modulus by 25 to ensure that there are 25 buckets. (Note that this is not the only solution.)

In [None]:
def hash_bucket(string):
  # Sum the value of all the characters in the string.
  char_sum = 0
  for i in string:
    char_sum += ord(i)

  # Return the sum of the values, mod the number of buckets.
  return char_sum % 25
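
The same idea can be written more compactly with `sum` and a generator expression. The sketch below is one equivalent form, not the only one, and the name `hash_bucket_compact` is hypothetical. It also illustrates a limitation of summing characters: the order of the characters is ignored, so anagrams always share a bucket.

```python
def hash_bucket_compact(string):
  # Sum the code point of every character, then reduce to 25 buckets.
  return sum(ord(char) for char in string) % 25

print(hash_bucket_compact(''))  # The empty string sums to 0.
print(hash_bucket_compact('listen') == hash_bucket_compact('silent'))
```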

## Question 9

Use your hash function from the previous exercise to calculate the hash buckets for the following strings.

- "Hash"
- "functions"
- "are"
- "really..."
- "REALLY"
- "cool!"

The code from the solution to the previous exercise is copied below. If you have a different `hash_bucket` function, then your solution may be different.

In [None]:
def hash_bucket(string):
  # Sum the value of all the characters in the string.
  char_sum = 0
  for i in string:
    char_sum += ord(i)

  # Return the sum of the values, mod the number of buckets.
  return char_sum % 25

In [None]:
# TODO(you): Calculate the hash bucket for the following strings:
# - "Hash"
# - "functions"
# - "are"
# - "really..."
# - "REALLY"
# - "cool!"

### Solution

If you have a different `hash_bucket` function in the previous exercise, your solution may be different, but not necessarily incorrect.

Let's use a `for` loop to print the hash bucket for each of the words in the list.

In [None]:
# This is the hash_bucket function in the solution to a previous question.

def hash_bucket(string):
  # Sum the value of all the characters in the string.
  char_sum = 0
  for i in string:
    char_sum += ord(i)

  # Return the sum of the values, mod the number of buckets.
  return char_sum % 25

In [None]:
words = ["Hash", "functions", "are", "really...", "REALLY", "cool!"]

for word in words:
  print("'%s' is mapped to bucket %d." % (word, hash_bucket(word)))

In this case, "are", "really...", and "cool!" are all mapped to bucket 12. This is called a *hash collision*, and you'll learn more about collisions in a subsequent lesson.
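
One way to spot collisions without reading the output line by line is to group the words by bucket. The sketch below assumes the `hash_bucket` solution from this exercise and uses `collections.defaultdict`; the name `words_by_bucket` is chosen for illustration.

```python
from collections import defaultdict

def hash_bucket(string):
  char_sum = 0
  for i in string:
    char_sum += ord(i)
  return char_sum % 25

words = ["Hash", "functions", "are", "really...", "REALLY", "cool!"]

# Map each bucket number to the list of words that landed in it.
words_by_bucket = defaultdict(list)
for word in words:
  words_by_bucket[hash_bucket(word)].append(word)

# Report only the buckets that hold more than one word.
for bucket, bucket_words in sorted(words_by_bucket.items()):
  if len(bucket_words) > 1:
    print("Collision in bucket %d: %s" % (bucket, bucket_words))
```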

## Question 10

Python implements its own hash function by default, called `hash`. Hashing works on a variety of different objects and variables, including instances of a class, if correctly implemented. By default, classes implement a `__hash__` method, but you can override it if you want a specific piece of information about that class to be the main hash feature. Take this `Student` class, for instance:

In [None]:
class Student:

  def __init__(self, id, name, grade):
    self.id = id
    self.name = name
    self.grade = grade

  def __hash__(self):
    return hash(self.id)

If we call `hash` on an instance of the `Student` class, the result now depends on the student's ID number. Write a `student_hash` function that takes an instance of the `Student` class as input and uses `hash` to return a value between 0 and `num_buckets - 1` as output.

In [None]:
def student_hash(student, num_buckets):
  # TODO(you): implement
  print('This method has not been implemented!')

### Unit Tests

Run the following cell to check your answer against some unit tests.

In [None]:
students = [Student(1048, 'M. Morales', 11),
            Student(65, 'G. Stacey', 9)]

print(student_hash(students[0], 5))
# Should print: 3

print(student_hash(students[1], 5))
# Should print: 0

print(student_hash(students[0], 10))
# Should print: 8

### Solution

`hash` works essentially like the hash functions we've already written, even though we don't see specifically what the `hash` function is doing.

In [None]:
def student_hash(student, num_buckets):
  return hash(student) % num_buckets
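
Python expects objects that compare equal to have equal hashes, so a custom `__hash__` is usually paired with `__eq__`. The sketch below is a hypothetical variant of the class, named `StudentWithEq` for illustration only and not part of the exercise, that treats two students with the same ID as equal.

```python
class StudentWithEq:
  """Hypothetical variant of Student that also defines equality by ID."""

  def __init__(self, id, name, grade):
    self.id = id
    self.name = name
    self.grade = grade

  def __hash__(self):
    return hash(self.id)

  def __eq__(self, other):
    return isinstance(other, StudentWithEq) and self.id == other.id

a = StudentWithEq(65, 'G. Stacey', 9)
b = StudentWithEq(65, 'G. Stacey', 10)  # Same ID, different grade.

print(a == b, hash(a) == hash(b))
print(len({a, b}))  # A set keeps only one of two equal objects.
```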

## Question 11

How many buckets does the following hash function have?

In [None]:
def hash_function(i):
  if not isinstance(i, float) or not (0 < i < 1):
    raise ValueError("Input must be a float between 0 and 1 exclusive.")
  return int(i**i * 500)

In [None]:
#freetext

### Solution


This hash function returns integers between 0 and 499, so it has 500 buckets.

The first thing to notice is that this function is only valid for float inputs in the range $(0, 1)$. Since $0 < i < 1$, we know that $0 < i^j < 1$ for any positive $j$. And since $i$ itself is positive, we know that $0 < i^i < 1$.

Since $0 < i^i < 1$, it follows that $0 < 500i^i < 500$. For non-negative inputs, the `int` function acts as a `floor` function, returning the highest integer less than or equal to the input. Therefore, $0 \leq \textrm{floor} (500i^i) < 500$. (Note that the lower inequality changed from strict to non-strict in the last step.)
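
These bounds can be checked empirically. The sketch below evaluates the function on an evenly spaced grid of inputs (the grid size of 10,000 is an arbitrary choice for illustration) and reports the smallest and largest bucket observed. Every output stays within 0 to 499, but only the upper part of that range is reached, because $i^i$ never drops below $e^{-1/e} \approx 0.692$ on $(0, 1)$.

```python
def hash_function(i):
  if not isinstance(i, float) or not (0 < i < 1):
    raise ValueError("Input must be a float between 0 and 1 exclusive.")
  return int(i**i * 500)

# Evaluate on an evenly spaced grid strictly inside (0, 1).
grid_size = 10000  # Arbitrary resolution, chosen for illustration.
outputs = [hash_function(k / grid_size) for k in range(1, grid_size)]

print(min(outputs), max(outputs))
```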

## Question 12

You wrote the following hash function to hash a float-valued number to one of 100 buckets. (Remember that `int(x//1)` is equivalent to `floor(x)`.)

In [None]:
def hash_bucket(f):
  """Hash function for a float-valued input f with 100 buckets."""
  return int(f//1) % 100

The function worked fine when the input `f` was any random float. But what if the input is restricted to the range 0 to 1 exclusive? If you use this same function on the new inputs, all values are hashed to 0. This isn't a very effective hash function, since it is supposed to have 100 buckets.

Modify this function to raise a `ValueError` if the input `f` is not within the required range $(0, 1)$, and to spread its outputs evenly across all of the possible bucket values.

In [None]:
def hash_bucket(f):
  """Hash function for a float-valued input f with 100 buckets."""
  # TODO(you): Raise a ValueError if f is not in (0, 1).
  # TODO(you): Modify the return statement so that the output is more even.
  return int(f//1) % 100

### Solution

Since the function should still have 100 buckets, it makes sense to keep the `% 100` operation. Instead of applying the modulus to the floor of `f`, we can apply it to the first two decimal places of `f`. These digits can be accessed by multiplying `f` by 100 and taking the floor.

In [None]:
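def hash_bucket(f):
  """Hash function for a float f in (0, 1) with 100 buckets."""
  if not (0 < f < 1):
    raise ValueError("Input must be between 0 and 1 exclusive.")

  # Multiply by 100 to expose the first two decimal places, then take the floor.
  return int((f*100)//1) % 100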
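
A quick way to confirm that every bucket is reachable is to feed the function one input from the middle of each hundredth of the interval. The sketch below assumes the solution with the `% 100` operation kept, as the explanation describes; the test inputs are chosen for illustration.

```python
def hash_bucket(f):
  """Hash function for a float f in (0, 1) with 100 buckets."""
  if not (0 < f < 1):
    raise ValueError("Input must be between 0 and 1 exclusive.")
  return int((f*100)//1) % 100

# One input from the middle of each hundredth: 0.005, 0.015, ..., 0.995.
midpoints = [(k + 0.5) / 100 for k in range(100)]
buckets = [hash_bucket(f) for f in midpoints]

print(buckets == list(range(100)))
```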

## Question 13

A few days ago, a colleague came to you for some advice on how to store data. You suggested a hash function. Now, they are having some problems getting their hash function to work. Once the hash is used to store data, your colleague cannot retrieve the data. Can you identify why the hash function is not working as intended? What possible solutions might you suggest?

In [None]:
import random

def my_hash(input):
  """Hash function with 100 buckets."""
  # Return a random integer between 0 and 99.
  return random.randrange(100)

### Solution

The main problem with this hash function is that it is *independent of the input*. You might notice that `input` does not appear in the function body. As a result, the same key can be sent to a different bucket on every call, so data stored under a key cannot be found again. The function is useless for data storage; it is just a random number generator.

Your suggestion for a better hash function would probably depend on what kind of data your colleague is expecting. If it is equally likely to be any float in the computer's range, it may make sense just to use `int(input // 1) % 100`, as in the earlier float exercise. However, if all the inputs are within a given range (e.g. between 0 and 10), you would need to ensure a reasonable distribution between buckets. If the input is a string, then a different approach would be required.
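
As one possible fix for numeric inputs, the sketch below replaces the random number with a value computed from the input. It assumes the data are floats spread over a wide range, and it renames the parameter to `value` to avoid shadowing Python's built-in `input`.

```python
def my_hash(value):
  """Hash function with 100 buckets for a numeric input."""
  # Take the floor, then reduce to the range 0..99.
  return int(value // 1) % 100

# Calling the function twice on the same input gives the same bucket,
# which is what makes stored data retrievable.
print(my_hash(1234.5), my_hash(1234.5))
```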

## Question 14

Your colleague is working with word data, and has written the following hash function. The input to the function is always a single word.

In [None]:
def my_hash(word):
  """Hash function with a string input and 250 buckets."""
  return len(word) % 250

print(my_hash('tundra'))
print(my_hash('taiga'))
print(my_hash('savanna'))
print(my_hash('grassland'))

In [None]:
# TODO(you): Write a better hash function for these inputs.

They are finding that even though this function should have 250 buckets, the hash buckets for their input words are usually below 10, and always below 20. Why is this? Can you think of a way to fix this?

### Solution

The input to `my_hash` is always a single word. Therefore, when `my_hash` calculates `len(word)` for a single word, it is just the number of letters in the word, which is below 20 for most words.

We could consider squaring or cubing `len(word)` before applying `% 250`. However, this will result in all 3-letter words having the same bucket, all 4-letter words having the same bucket, and so on.

A better approach would be to convert characters to numbers and then sum the values of each character. The [ASCII](https://en.wikipedia.org/wiki/ASCII) encoding is the most ubiquitous method of converting a character to an integer value, and Python's `ord` function returns this value for ASCII characters (and, more generally, the Unicode code point of any character).

In [None]:
def my_hash(word):
  """Hash function with a string input and 250 buckets."""
  # Sum the value of all the characters in the string.
  char_sum = 0
  for i in word:
    char_sum += ord(i)

  # Return the sum of the values, mod the number of buckets.
  return char_sum % 250

If this approach doesn't spread the words across enough buckets, we could instead sum the squared values `ord(i)**2` of each letter in the word.
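
A minimal sketch of that squared variant follows. The function name `my_hash_squared` is hypothetical, and whether it actually spreads a given set of words better is something to check against the real data.

```python
def my_hash_squared(word):
  """Hypothetical variant: sums squared character values, 250 buckets."""
  char_sum = 0
  for i in word:
    char_sum += ord(i)**2

  # Return the sum of the squared values, mod the number of buckets.
  return char_sum % 250

for word in ['tundra', 'taiga', 'savanna', 'grassland']:
  print(word, my_hash_squared(word))
```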