### UC Berkeley, MICS, W202-Cryptography
### Week 03 Breakout 2

### Collisions in Hash Functions and using the Birthday Paradox to help understand them



In our lectures, we learned that the resistance of a hash function is categorized in three ways:

* Preimage resistance

* Second preimage resistance

* Collision resistance

In this breakout, we will see an example of each of the above, starting with collision resistance, then second preimage resistance, and then preimage resistance.

Next, we will see how the collision resistance of a hash function is measured.

Collision resistance is generally measured using the Birthday Paradox, which is usually stated as: how many persons do we need before the probability of at least one common birthday is > 50%?  (This ignores leap years and assumes 365 days per year.)

The answer to the birthday paradox is 23, which surprises most people.  Intuition usually suggests a much larger number, on the order of half of 365, which is wrong.  Below we will see the probabilities of the birthday paradox grow as we increase the number of persons.
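As a quick check of the figure 23, here is a minimal sketch in plain Python. The function names `shared_birthday_probability` and `smallest_group_over` are illustrative, not from the lectures. It multiplies the chance that each new person avoids every birthday already taken, which avoids computing large factorials.

```python
def shared_birthday_probability(persons, days=365):
    """Probability that at least two of `persons` people share a birthday."""
    p_all_distinct = 1.0
    for i in range(persons):
        # the (i+1)-th person must avoid the i birthdays already taken
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


def smallest_group_over(threshold=0.5, days=365):
    """Smallest group size whose shared-birthday probability exceeds `threshold`."""
    persons = 1
    while shared_birthday_probability(persons, days) <= threshold:
        persons += 1
    return persons


print(smallest_group_over(0.5))  # prints 23
```

Changing `days` to a larger value shows how slowly the required group size grows compared with the size of the range.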

The birthday paradox is the basis for measuring the collision resistance of a hash function.  To translate the birthday paradox into hash function terms, we treat the 365 possible birthdays as the hash value range, each person as an input value, and that person's birthday as the resulting hash value.

Below we will see for various hash sizes the number of input values needed before we reach a 50% chance of a collision.

Attacks based on collisions are often called "birthday attacks" because they use the principles of the birthday paradox.  
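To make the idea of a birthday attack concrete, here is a minimal sketch against a deliberately weakened toy hash: the first 3 bytes (24 bits) of an MD5 digest. The helper names `truncated_md5` and `find_collision` and the choice of counter strings as inputs are assumptions made for illustration. Because of the birthday effect, a collision is expected after roughly 2^12 = 4,096 inputs rather than 2^24.

```python
import hashlib


def truncated_md5(data, n_bytes=3):
    """First n_bytes of the MD5 digest: a deliberately weak toy hash."""
    return hashlib.md5(data).digest()[:n_bytes]


def find_collision(n_bytes=3):
    """Hash the inputs b"0", b"1", b"2", ... until two share a truncated digest."""
    seen = {}  # truncated digest -> first input that produced it
    counter = 0
    while True:
        message = str(counter).encode()
        digest = truncated_md5(message, n_bytes)
        if digest in seen:
            # two different inputs, same truncated digest
            return seen[digest], message, counter + 1
        seen[digest] = message
        counter += 1


m1, m2, tried = find_collision(3)
print("colliding inputs:", m1, m2)
print("inputs hashed:", tried)
```

The dictionary is what makes this a birthday attack: every new digest is compared against all earlier ones, not against one fixed target.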


In [1]:
import hashlib
import binascii
import subprocess
from sage.all import *

### Example of an Attack on Collision Resistance

The example below shows two hexadecimal strings which are not identical, yet they both yield the same MD5 hash value, which demonstrates a collision.

In [2]:
hex_string_1 = 'd131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f8955ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5bd8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70'

byte_buffer_1 = binascii.a2b_hex(hex_string_1)

hex_string_2 = 'd131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4004583eb8fb7f8955ad340609f4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c5bd8823e3156348f5bae6dacd436c919c6dd53e23487da03fd02396306d248cda0e99f33420f577ee8ce54b67080280d1ec69821bcb6a8839396f965ab6ff72a70'

byte_buffer_2 = binascii.a2b_hex(hex_string_2)

In [3]:
# the hex strings above are not identical, but they only differ in 6 hex digits
# the following code will show you the 6 positions in which they differ
# positions are zero based

for (index, c1) in enumerate(hex_string_1):
    if c1 != hex_string_2[index]:
        print ("position:", index, "hex_string_1:", c1, "hex_string_2:", hex_string_2[index])

position: 38 hex_string_1: 8 hex_string_2: 0
position: 90 hex_string_1: 7 hex_string_2: f
position: 118 hex_string_1: f hex_string_2: 7
position: 166 hex_string_1: b hex_string_2: 3
position: 218 hex_string_1: a hex_string_2: 2
position: 246 hex_string_1: 2 hex_string_2: a


In [4]:
# the MD5 hash values are identical for the two different byte strings
# this demonstrates that a collision has occurred

print ("string 1 MD5:", hashlib.md5(byte_buffer_1).hexdigest())
print ("string 2 MD5:", hashlib.md5(byte_buffer_2).hexdigest())

string 1 MD5: 79054025255fb1a26e4bc422aef54eb4
string 2 MD5: 79054025255fb1a26e4bc422aef54eb4


### Example of an Attack on Second Preimage Resistance

In this example, we have two executables, both with the same MD5 hash value, yet they are two totally different programs with totally different outputs.  This shows that someone can substitute one executable for another executable and it will still pass a verification of an MD5 hash check, which is the outcome second preimage resistance is meant to prevent.  (Strictly speaking, a true second preimage attack starts from a file the attacker did not choose; a matching pair like this one can also be built from a collision, with both files constructed together.)

You will need to upload the executables hello and erase.  Make sure you don't accidentally open them in the editor, as that can make slight changes to them and change their MD5 hash value.  Do not alter them in any way.

Note: if you are running this Jupyter notebook on Windows, you will need to use different versions of the executables and make appropriate changes to the cells below.

**Note: the programs hello & erase are Linux executables.  You will have to have shell access to a Linux shell where the Jupyter Notebook is running for this to work.  If you can't get it to work, it's not a big deal, you can follow along in class to understand it.  You can still run the code cells below this section, as they will still work.**

In [5]:
# change the mode of our files to be executable

print (subprocess.Popen("chmod 744 hello erase", shell=True, stdout=subprocess.PIPE).stdout.read().decode())




In [6]:
# see the listing of our two executables

print (subprocess.Popen("ls -l hello erase", shell=True, stdout=subprocess.PIPE).stdout.read().decode())

-rwxr--r-- 1 user user 4072 Aug 10 12:22 erase
-rwxr--r-- 1 user user 4072 Aug 10 12:22 hello



In [7]:
# see that the MD5 hash value is the same for both executables

with open("hello", "rb") as f1:
    r1 = f1.read()
print ("hello program MD5:", hashlib.md5(r1).hexdigest())

with open("erase", "rb") as f2:
    r2 = f2.read()
print ("erase program MD5:", hashlib.md5(r2).hexdigest())


hello program MD5: da5c61e1edc0f18337e46418e48c1290
erase program MD5: da5c61e1edc0f18337e46418e48c1290


In [8]:
# run the hello program and see its output

print (subprocess.Popen("./hello", shell=True, stdout=subprocess.PIPE).stdout.read().decode())

Hello, world!

(press enter to quit)


In [9]:
# run the erase program and see its output

# it does NOT really erase, it just demonstrates that a malicious program can be substituted for a good program
# and still have a valid MD5 hash

print (subprocess.Popen("./erase", shell=True, stdout=subprocess.PIPE).stdout.read().decode())

This program is evil!!!
Erasing hard drive...1Gb...2Gb... just kidding!
Nothing was erased.

(press enter to quit)
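The kind of MD5 hash check that such a substitution would pass can be sketched as follows. The helper names `md5_of_file` and `passes_md5_check` are hypothetical, and the demonstration uses a throwaway temporary file rather than the course executables so that it runs anywhere.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=65536):
    """MD5 hex digest of a file, read in binary mode one chunk at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def passes_md5_check(path, expected_hex):
    """True when the file's MD5 digest equals the published value."""
    return md5_of_file(path) == expected_hex.lower()


# demonstration on a temporary file with known contents
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example contents")
    tmp_path = tmp.name

published = hashlib.md5(b"example contents").hexdigest()
check_ok = passes_md5_check(tmp_path, published)   # matching digest
check_bad = passes_md5_check(tmp_path, "0" * 32)   # wrong digest
os.remove(tmp_path)

print(check_ok, check_bad)  # prints True False
```

A check like this only compares digests, so any two files with the same MD5 value are indistinguishable to it. Given the digests printed above, both `hello` and `erase` would be expected to pass when checked against `da5c61e1edc0f18337e46418e48c1290`.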


### Example of an Attack on Preimage Resistance

In this example, let's suppose that a big box store has an in-house credit card that requires a 5 digit pin.  The computer system for point of sale does not store the actual pin.  It generates a SHA256 hash of the pin and stores the hash value.  Suppose some hackers attack the big box store and recover the list of SHA256 hashes.  Since they know pins must be 5 digits, they can easily recover the pin for every account by simply looping through all 100,000 possibilities and finding the one whose hash matches, which shows they have recovered the preimage of the hash.

Even though SHA256 itself is considered preimage and collision resistant, when the set of possible inputs is this small an attacker can simply check every one.

In [10]:
# suppose a customer sets their pin to 74858
# the big box store's computer generates an SHA256 hash and stores the hash value

pin = 74858

pin_sha256 = hashlib.sha256(str(pin).encode())

print (pin_sha256.hexdigest())

15b75c24361d4ab459be3a5893e2323d1b018182ae8aa16f0e85f9c634e2388e


In [11]:
# we can simply loop from 0 to 99,999 to cover all possible values for a 5 digit pin
# if the SHA256 hash value of a candidate matches the stored SHA256 hash value
# we have successfully recovered the customer's pin value of 74858

target_hash = pin_sha256.hexdigest()

for i in range(100000):
    if hashlib.sha256(str(i).encode()).hexdigest() == target_hash:
        print ("preimage is recovered as:", i)
        break

preimage is recovered as: 74858
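The loop above builds each candidate with `str(i)`, which never produces leading zeros. Here is a minimal variant, assuming the store hashes pins as zero-padded 5 character strings so that a pin such as 00042 is possible. The function name `recover_pin` and the sample pin are illustrative.

```python
import hashlib


def recover_pin(target_hex, digits=5):
    """Try every zero-padded pin of the given length; return the match or None."""
    for candidate in range(10 ** digits):
        pin_text = str(candidate).zfill(digits)  # e.g. 42 -> "00042"
        if hashlib.sha256(pin_text.encode()).hexdigest() == target_hex:
            return pin_text
    return None


# hypothetical stored hash for a pin with leading zeros
stored_hash = hashlib.sha256(b"00042").hexdigest()
print(recover_pin(stored_hash))  # prints 00042
```

Returning the pin as text rather than as an integer keeps the leading zeros intact.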


### Birthday Paradox: How many persons do we need before the probability of at least one common birthday is > 50%? (This ignores leap years and assumes 365 days per year.)
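The probability evaluated in the next cell comes from counting the ways that $n$ persons can all have distinct birthdays and subtracting that share from 1. Written out, with $n$ as the number of persons:

$$
P(n) \;=\; 1 - \frac{365!}{365^{\,n}\,(365-n)!} \;=\; 1 - \prod_{i=0}^{n-1} \frac{365 - i}{365}
$$

For $n = 2$ this gives $1 - \tfrac{364}{365} = \tfrac{1}{365} \approx 0.27397\%$, which matches the second line of the output below.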

In [12]:
def my_birthday_paradox(stop_value):
    "loop from 1 to stop_value - 1; for each number of persons, calculate the probability that at least two persons have the same birthday"
    
    for i in range(1, stop_value):
        
        p = (1 - float(factorial(365) / ( (365 ** i) * factorial(365 - i)))) * 100
        
        print ("persons:", i, "     probability:","{:.5g}".format(p), "%")

In [13]:
# watch the probability grow as the number of persons grows

my_birthday_paradox(91)

persons: 1      probability: 0 %
persons: 2      probability: 0.27397 %
persons: 3      probability: 0.82042 %
persons: 4      probability: 1.6356 %
persons: 5      probability: 2.7136 %
persons: 6      probability: 4.0462 %
persons: 7      probability: 5.6236 %
persons: 8      probability: 7.4335 %
persons: 9      probability: 9.4624 %
persons: 10      probability: 11.695 %
persons: 11      probability: 14.114 %
persons: 12      probability: 16.702 %
persons: 13      probability: 19.441 %
persons: 14      probability: 22.31 %
persons: 15      probability: 25.29 %
persons: 16      probability: 28.36 %
persons: 17      probability: 31.501 %
persons: 18      probability: 34.691 %
persons: 19      probability: 37.912 %
persons: 20      probability: 41.144 %
persons: 21      probability: 44.369 %
persons: 22      probability: 47.57 %
persons: 23      probability: 50.73 %
persons: 24      probability: 53.834 %
persons: 25      probability: 56.87 %
persons: 26      probability: 59.824 %
[output truncated]

### Using the same logic of the birthday paradox, for all of the common hash sizes, calculate the number of input values needed before we reach a 50% chance of a collision.
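For reference, the standard approximation for a hash with $n = 2^{\text{bits}}$ possible values is $p \approx 1 - e^{-k^2/(2n)}$ for $k$ inputs. Setting $p = 1/2$ gives $k \approx \sqrt{2\,n \ln 2} \approx 1.1774\sqrt{n}$. The sketch below evaluates that bound in plain Python; the function names `inputs_for_half_collision_chance` and `collision_probability` are illustrative.

```python
import math


def inputs_for_half_collision_chance(hash_bits):
    """Approximate k with a 50% collision chance: sqrt(2 * ln(2) * 2**hash_bits)."""
    return math.sqrt(2.0 * math.log(2.0)) * 2.0 ** (hash_bits / 2.0)


def collision_probability(k, hash_bits):
    """Approximation 1 - exp(-k*(k-1) / (2n)) for k inputs and n = 2**hash_bits values."""
    n = 2.0 ** hash_bits
    return 1.0 - math.exp(-k * (k - 1.0) / (2.0 * n))


for bits in (16, 32, 64, 128):
    k = inputs_for_half_collision_chance(bits)
    # report k as a power of two, since the raw numbers get very long
    print(bits, "bits: about 2^{:.2f} inputs".format(math.log2(k)))
```

The exponent is always close to half the number of hash bits, which is why an n-bit hash is usually credited with about n/2 bits of collision resistance.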

In [14]:
def my_hash_collision_probability(hash_bits):
    "for the given number of hash bits, calculate the hash value range and the approximate number of input points needed for a 50% probability of a collision"
    
    print ("\nbits:", hash_bits)
    
    print ("\nhash value range:", "{:,}".format(2 ** hash_bits))
    print ("\n    digits:", (2 ** hash_bits).ndigits())
    
    n = float(2 ** hash_bits)
    
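    # approximate k where the collision probability passes 50%
    # note: the expression below evaluates sqrt(2n + ln 2), which is essentially sqrt(2n);
    # at that k the collision probability is closer to 63% than 50%
    # the usual 50% estimate is sqrt(2 * n * ln 2), about 1.1774 * sqrt(n), roughly 17% smaller
    
    k = float(sqrt( ( log(2) + (2 * n) ) ))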
    
    print ("\nApproximate number of input points to have a 50% probability of a collision:", "{:,}".format(floor(k)))
    print ("\n    digits:", (floor(k)).ndigits())
    

In [15]:
my_hash_collision_probability(16)


bits: 16

hash value range: 65,536

    digits: 5

Approximate number of input points to have a 50% probability of a collision: 362

    digits: 3


In [16]:
my_hash_collision_probability(32)


bits: 32

hash value range: 4,294,967,296

    digits: 10

Approximate number of input points to have a 50% probability of a collision: 92,681

    digits: 5


In [17]:
my_hash_collision_probability(64)


bits: 64

hash value range: 18,446,744,073,709,551,616

    digits: 20

Approximate number of input points to have a 50% probability of a collision: 6,074,000,999

    digits: 10


In [18]:
# MD5 is 128 bits

my_hash_collision_probability(128)


bits: 128

hash value range: 340,282,366,920,938,463,463,374,607,431,768,211,456

    digits: 39

Approximate number of input points to have a 50% probability of a collision: 26,087,635,650,665,566,208

    digits: 20


In [19]:
# SHA1 is 160 bits

my_hash_collision_probability(160)


bits: 160

hash value range: 1,461,501,637,330,902,918,203,684,832,716,283,019,655,932,542,976

    digits: 49

Approximate number of input points to have a 50% probability of a collision: 1,709,679,290,002,018,547,007,488

    digits: 25


In [20]:
# SHA224 is 224 bits

my_hash_collision_probability(224)


bits: 224

hash value range: 26,959,946,667,150,639,794,667,015,087,019,630,673,637,144,422,540,572,481,103,610,249,216

    digits: 68

Approximate number of input points to have a 50% probability of a collision: 7,343,016,637,207,169,433,382,599,627,112,448

    digits: 34


In [21]:
# SHA256 is 256 bits

my_hash_collision_probability(256)


bits: 256

hash value range: 115,792,089,237,316,195,423,570,985,008,687,907,853,269,984,665,640,564,039,457,584,007,913,129,639,936

    digits: 78

Approximate number of input points to have a 50% probability of a collision: 481,231,938,336,009,055,986,162,049,162,441,392,128

    digits: 39


In [22]:
# SHA384 is 384 bits

my_hash_collision_probability(384)


bits: 384

hash value range: 39,402,006,196,394,479,212,279,040,100,143,613,805,079,739,270,465,446,667,948,293,404,245,721,771,497,210,611,414,266,254,884,915,640,806,627,990,306,816

    digits: 116

Approximate number of input points to have a 50% probability of a collision: 8,877,162,406,579,535,435,504,187,527,070,416,120,068,134,593,633,712,078,848

    digits: 58


In [23]:
# SHA512 is 512 bits

my_hash_collision_probability(512)


bits: 512

hash value range: 13,407,807,929,942,597,099,574,024,998,205,846,127,479,365,820,592,393,377,723,561,443,721,764,030,073,546,976,801,874,298,166,903,427,690,031,858,186,486,050,853,753,882,811,946,569,946,433,649,006,084,096

    digits: 155

Approximate number of input points to have a 50% probability of a collision: 163,754,743,014,928,266,429,063,303,432,457,985,651,110,375,675,065,736,754,496,156,385,529,317,818,368

    digits: 78
