# Crafting my own hash function

When aiming to develop an effective hash function, several key features should be considered:

- **Deterministic:** Ensuring that the same input consistently produces the same output.
- **Universal input:** Handling a wide range of input types or data structures.
- **Fixed-sized output:** Generating a hash of consistent length regardless of input size.
- **Fast to compute:** Optimizing for efficiency to process hashes quickly.
- **Uniformly distributed:** Distributing hash values evenly across the output space.
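
To make two of these properties concrete, here is a minimal sketch that uses SHA-256 from Python's standard `hashlib` module as a stand-in (it is not the function built in this notebook, and the helper name `sha256_hex` is made up for the example). It checks that equal inputs give equal digests and that the digest length stays the same for inputs of very different sizes.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


short_digest = sha256_hex(b"a")
long_digest = sha256_hex(b"a" * 1_000_000)

# Deterministic: hashing the same bytes twice gives the same digest.
print(sha256_hex(b"a") == short_digest)

# Fixed-size output: 64 hex characters (256 bits) whatever the input length.
print(len(short_digest), len(long_digest))
```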

Some additional properties are desirable, especially for cryptographic use:

- **Randomly distributed:** Producing hash values that appear random and reveal no pattern in the inputs.
- **Randomized seed:** Mixing a random seed into the computation so hash values cannot be predicted ahead of time.
- **One-way function:** Making it infeasible to recover the original input from its hash.
- **Avalanche effect:** Producing significant changes in output with minimal changes in input.
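
Two of these extras are easy to observe with standard-library hashes. The sketch below is only an illustration under my own assumptions: it uses `hashlib.sha256` for the avalanche effect and the `salt` parameter of `hashlib.blake2b` as a stand-in for a randomized seed, and `differing_bits` is a helper name invented for this example.

```python
import hashlib


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Avalanche effect: the two inputs differ in their last character only.
first = hashlib.sha256(b"hash me").digest()
second = hashlib.sha256(b"hash mf").digest()
print(f"{differing_bits(first, second)} of {len(first) * 8} bits differ")

# Randomized seed: the same message hashed under two different salts.
message = b"hash me"
salted_a = hashlib.blake2b(message, salt=b"seed-A").hexdigest()
salted_b = hashlib.blake2b(message, salt=b"seed-B").hexdigest()
print(salted_a != salted_b)
```

With a good avalanche effect, roughly half of the 256 output bits should flip.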

Designing a hash function that fulfills all these requirements from scratch is undoubtedly challenging, and **for now** that isn't my goal.

Still, building a hash function from scratch is a valuable learning exercise. The result may be rudimentary and far from perfect, but the insights gained along the way are worth the effort.

In [1]:
import sys

In [2]:
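def hash_function(value):
    # Reject the mutable built-in containers, as the built-in hash() does.
    if isinstance(value, (dict, set, list)):
        raise TypeError("Only works for immutable types.")

    # Target length: the number of digits in sys.maxsize (19 on 64-bit builds).
    output_size = len(str(sys.maxsize))

    # Position-weighted sum of code points over the value's repr.
    result = 0
    for i, char in enumerate(repr(value).lstrip("'"), 1):
        if len(str(result)) >= output_size:
            # result is an int, so convert it to str before slicing.
            return int(str(result)[:output_size])
        result += i * ord(char)

    # Stretch the sum by concatenating its remainders modulo 1, 11, 21, ...
    # Computing the sum up front also avoids an endless loop on "" (sum 0).
    basic_hash = ""
    order = 1
    while len(basic_hash) < output_size:
        basic_hash += str(result % order)
        order += 10

    return int(basic_hash[:output_size])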
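
To see what the function does internally, the following sketch splits it into its two stages and traces them for a sample string. The helper names `weighted_sum` and `remainder_chunks` are introduced here for illustration only, and the early-return branch for extremely long inputs is left out.

```python
def weighted_sum(value):
    """Stage 1: sum of position * code point over the repr of value."""
    text = repr(value).lstrip("'")
    return sum(i * ord(char) for i, char in enumerate(text, 1))


def remainder_chunks(total, count):
    """Stage 2: the first `count` remainders of total modulo 1, 11, 21, ..."""
    return [str(total % order) for order in range(1, 10 * count, 10)]


total = weighted_sum("I love Python!")
chunks = remainder_chunks(total, 11)
print(total)            # 10167
print("".join(chunks))  # 0333040184114426667
```

The first chunk is always `0`, because any integer modulo 1 is 0.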

## Testing

Hash output example:

In [3]:
hash_function("I love Python!")

333040184114426667

Deterministic: `True`

In [4]:
hash_function(18) == hash_function(18)

True

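Universal input: `True` (numbers and strings are both accepted; here the float and the string simply hash to different values)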

In [5]:
hash_function(3.14) == hash_function("3.14")

False

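Fixed-sized output: `True` for these inputs, though not guaranteed: the result is returned as an `int`, so leading zeros in the digit string are dropped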

In [6]:
len(str(hash_function(10_000_028))) == len(str(hash_function("Data Science is awesome!")))

True
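
The check above compares only two inputs, so it seems worth probing a few more. The sketch below restates the function compactly (same arithmetic, minus the early-return branch for extremely long inputs) and groups the inputs 0 to 99 by the number of digits in their hash.

```python
import sys
from collections import Counter


def hash_function(value):
    """Compact restatement of the notebook's function for small inputs."""
    output_size = len(str(sys.maxsize))
    text = repr(value).lstrip("'")
    result = sum(i * ord(char) for i, char in enumerate(text, 1))
    basic_hash = ""
    order = 1
    while len(basic_hash) < output_size:
        basic_hash += str(result % order)
        order += 10
    return int(basic_hash[:output_size])


# Count how many of the inputs 0..99 produce each output length.
lengths = Counter(len(str(hash_function(n))) for n in range(100))
print(sorted(lengths.items()))

# 7 is one input whose hash comes out shorter than that of 18.
print(len(str(hash_function(7))), len(str(hash_function(18))))
```

One possible remedy is to return the digits as a zero-padded string (for example with `str.zfill`) instead of converting them to `int`.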

Fast to compute: `False`

In [7]:
from time import perf_counter

In [8]:
start = perf_counter()
hash("This is a looooong string" * 1_000_000) # hash built-in function
stop = perf_counter()
print(f"Elapsed time: {stop - start}")

Elapsed time: 0.011321300000417978


In [9]:
start = perf_counter()
hash_function("This is a looooong string" * 1_000_000) # custom hash_function
stop = perf_counter()
print(f"Elapsed time: {stop - start}")

Elapsed time: 7.3015291999909095
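
A likely contributor to the gap is that the loop calls `len(str(result))` once per character, which rebuilds the decimal string of a growing integer about 25 million times for this input. As a rough sketch (the helper names `weighted_sum_checked` and `weighted_sum_plain` are made up for this comparison, and a shorter sample keeps the run quick), the weighted sum can be timed with and without that check using `timeit`:

```python
from timeit import timeit


def weighted_sum_checked(text, limit=19):
    """Weighted sum with the per-character length check, as in the loop above."""
    result = 0
    for i, char in enumerate(text, 1):
        if len(str(result)) >= limit:
            break
        result += i * ord(char)
    return result


def weighted_sum_plain(text):
    """The same sum without converting the running total to a string."""
    return sum(i * ord(char) for i, char in enumerate(text, 1))


sample = "This is a looooong string" * 10_000
print(f"checked: {timeit(lambda: weighted_sum_checked(sample), number=1):.3f} s")
print(f"plain:   {timeit(lambda: weighted_sum_plain(sample), number=1):.3f} s")
```

Both helpers return the same sum for this sample, since the running total never reaches 19 digits.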


Uniformly distributed: `False`

In [10]:
from hash_distribution import plot, distribute
from string import printable

In [11]:
plot(distribute(printable, num_containers=6)) # hash built-in function

  0 ■■■■■■■■■■■■■■■■     (16)
  1 ■■■■■■■■■■■■■■■■     (16)
  2 ■■■■■■■■■■■■■■■■■■■  (19)
  3 ■■■■■■■■■■■■■■■■■■■■ (20)
  4 ■■■■■■■■■■■■■        (13)
  5 ■■■■■■■■■■■■■■■■     (16)


In [12]:
plot(distribute(printable, num_containers=6, hash_function=hash_function)) # custom hash_function

  0 ■■■■■■■■■                     (9)
  1 ■■■■■■■■■■■■                  (12)
  2 ■■■■■■■■■■■■■                 (13)
  3 ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ (29)
  4 ■■■■■■■■■■■■■■■               (15)
  5 ■■■■■■■■■■■■■■■■■■■■■■        (22)
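
The `hash_distribution` module imported above is a local helper whose source is not shown here. As an assumption about its behavior (the signatures are inferred from the calls above, not taken from the actual module), `distribute` could count how many items fall into each bucket `hash(item) % num_containers`, and `plot` could print one bar per bucket:

```python
from collections import Counter
from string import printable


def distribute(items, num_containers, hash_function=hash):
    """Hypothetical helper: count items per bucket hash(item) % num_containers."""
    return Counter(hash_function(item) % num_containers for item in items)


def plot(histogram):
    """Hypothetical helper: print one text bar per bucket, smallest key first."""
    width = max(histogram.values())
    for key in sorted(histogram):
        count = histogram[key]
        print(f"{key:3} {'■' * count:<{width}} ({count})")


# ord() stands in as a trivially simple hash of a single character.
plot(distribute(printable, num_containers=6, hash_function=ord))
```

Python salts the built-in `hash()` of strings per process unless `PYTHONHASHSEED` is fixed, so a chart based on the built-in function will vary between runs.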
