# Hashing


https://www.sentinelone.com/cybersecurity-101/hashing/

https://www.geeksforgeeks.org/introduction-to-hashing-data-structure-and-algorithm-tutorials/

https://crypto.stanford.edu/~mironov/papers/hash_survey.pdf

A hash function takes an arbitrary file as input and produces a fixed-length code that acts as a compact fingerprint of that file.

A *perfect hash function* would be one-to-one: no two different files would ever have the same hash code.

Because there are infinitely many possible input files but only finitely many hash codes of a fixed length, there *cannot be a perfect hash function* over all files. However, good hash functions are designed so that the chance of two different files having the same hash code is extremely small.
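To make the counting argument concrete, here is a minimal sketch that uses Python's hashlib module (introduced in the next section). The helper names `tiny_hash` and `find_collision` are made up for this illustration. The sketch keeps only the first 16 bits of a SHA-256 digest, so there are just 65,536 possible codes, and a collision must appear within the first 65,537 distinct inputs.

```python
import hashlib

def tiny_hash(data: bytes) -> str:
    """Return only the first 2 bytes (16 bits) of the SHA-256 digest, as hex."""
    return hashlib.sha256(data).hexdigest()[:4]

def find_collision():
    """Hash b'0', b'1', b'2', ... until two inputs share a tiny_hash code."""
    seen = {}  # maps a tiny hash code to the first input that produced it
    i = 0
    while True:
        data = str(i).encode()
        code = tiny_hash(data)
        if code in seen:
            # Two different inputs, same (truncated) hash code
            return seen[code], data, code
        seen[code] = data
        i += 1

first, second, code = find_collision()
print(first, second, code)
```

The full 256-bit digest has far too many possible values for this kind of search to succeed in practice, which is the sense in which collisions are unlikely rather than impossible.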

## The hashlib Package

The Python standard-library module hashlib implements many of the best-known hash functions.

The attribute hashlib.algorithms_guaranteed is a set of algorithm names that are guaranteed to be available in the package.

Additional algorithms may be present, depending on the Python version and the OpenSSL library it was built against. The set of all available algorithms is in the attribute hashlib.algorithms_available.

In [2]:
import hashlib

hashlib.algorithms_guaranteed

{'blake2b',
 'blake2s',
 'md5',
 'sha1',
 'sha224',
 'sha256',
 'sha384',
 'sha3_224',
 'sha3_256',
 'sha3_384',
 'sha3_512',
 'sha512',
 'shake_128',
 'shake_256'}

In [3]:
hashlib.algorithms_available

{'blake2b',
 'blake2s',
 'md4',
 'md5',
 'md5-sha1',
 'mdc2',
 'ripemd160',
 'sha1',
 'sha224',
 'sha256',
 'sha384',
 'sha3_224',
 'sha3_256',
 'sha3_384',
 'sha3_512',
 'sha512',
 'sha512_224',
 'sha512_256',
 'shake_128',
 'shake_256',
 'sm3',
 'whirlpool'}
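Since the available set varies between installations, code that wants a non-guaranteed algorithm may need to check first. The following is a small sketch of one way to do that; `new_if_available` is a hypothetical helper name, not part of hashlib. It also catches ValueError, on the assumption that some builds list a name but still refuse to construct it.

```python
import hashlib

# Every guaranteed name should also appear in the available set.
print(hashlib.algorithms_guaranteed <= hashlib.algorithms_available)

def new_if_available(name):
    """Return a hash object for `name`, or None if this build cannot provide it."""
    if name not in hashlib.algorithms_available:
        return None
    try:
        return hashlib.new(name)
    except ValueError:
        # Listed by the build, but not usable on this system
        return None

for name in ['sha256', 'whirlpool', 'no_such_algorithm']:
    print(name, new_if_available(name) is not None)
```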

We will see examples of hashing data files, but we start by using a short bytes object:

In [4]:
message = b'gpu envy'  # a short bytes object to hash

The hashlib package has a new() constructor that can take the name of an algorithm and create an object that will produce that type of hash.

Example:

In [5]:
h = hashlib.new('md5')

h.update(b'gpu envy')

h.hexdigest()

'3398e366e9d659d6cf5398101012f972'

There are also named constructors for the algorithms:

In [8]:
h = hashlib.md5(b'gpu envy')
h.hexdigest()

'3398e366e9d659d6cf5398101012f972'

In this example I compute the hash codes that a few of the guaranteed algorithms produce for the input b'gpu envy':

In [10]:
for a in ['md5','sha256','blake2b']:
    h = hashlib.new(a)
    h.update(b'gpu envy')
    print(a,'\n',h.hexdigest(),'\n\n')

md5 
 3398e366e9d659d6cf5398101012f972 


sha256 
 54d880294627db997dc0af19671f472bcf2d34d955a4681e9331d8b2c2fdfdaf 


blake2b 
 a04bf3e12ad5529c424485d414ab16f43faab171fe99e1b5bfc6372ceca42a9b6b5daeea5b71c370021c08c3e10fd521c56dd497c9e6d7ca48fdf69787f92b36 
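The three codes above differ in length because each algorithm has its own fixed digest size. As a quick check that the length depends on the algorithm and not on the input, the sketch below hashes messages of very different sizes; `hex_lengths` is a helper name invented for this example.

```python
import hashlib

def hex_lengths(algorithm, messages):
    """Return the hex digest length that `algorithm` gives for each message."""
    return [len(hashlib.new(algorithm, m).hexdigest()) for m in messages]

# An empty input, the short input used above, and an 8000-byte input
messages = [b'', b'gpu envy', b'gpu envy' * 1000]

for a in ['md5', 'sha256', 'blake2b']:
    # digest_size is in bytes; each byte becomes two hex characters
    print(a, hashlib.new(a).digest_size, hex_lengths(a, messages))

# md5 16 [32, 32, 32]
# sha256 32 [64, 64, 64]
# blake2b 64 [128, 128, 128]
```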




## Civitai Hashes

Civitai has thousands of models, and it uses hash codes to identify them.

Here is a screenshot of the Civitai info on one model:

<img src='../data/realistic.png'>

Notice in the lower right that it shows the beginning of the SHA256 hash of the Realistic Vision model.

For this demo notebook, I have put that model in the directory /tmp. The file is 4.0G, and files in /tmp will eventually disappear, so it will not take up space permanently.

In [13]:
!ls -lh /tmp/real*

-rw-r--r-- 1 brownj brownj 4.0G Jan 31 07:26 /tmp/realisticVisionV60B1_v60B1VAE.safetensors


In the examples we have seen so far, I created the hash object, then called its update() method one time with a short bytes object.

To hash a file I will open the file for reading bytes, then read chunks of the file, calling update on each chunk. When I have read the complete file I call hexdigest().
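This works because repeated calls to update() behave like a single call on the concatenated bytes. Here is a small in-memory sketch of that property, using a made-up chunk size of 64 bytes so that no large file is needed:

```python
import hashlib

message = b'gpu envy' * 1000   # 8000 bytes of sample data
CHUNK = 64                     # illustrative chunk size, chosen arbitrarily

# Hash everything in one call
whole = hashlib.sha256(message).hexdigest()

# Hash the same bytes piece by piece
h = hashlib.sha256()
for start in range(0, len(message), CHUNK):
    h.update(message[start:start + CHUNK])
chunked = h.hexdigest()

print(whole == chunked)  # True
```

The chunk size affects only how much memory is used per read, not the resulting hash.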

This model file is large, so the cell below takes over 10 seconds to complete.

In [15]:
BUF_SIZE = 65536

sha256 = hashlib.sha256()

with open('/tmp/realisticVisionV60B1_v60B1VAE.safetensors', 'rb') as f:
    
    while True:
        data = f.read(BUF_SIZE)

        if not data:
            break

        sha256.update(data)
        
sha256.hexdigest()

'e5f3cbc5f7669457d3bec1fd492420995fb9a79e735dce438b81af61fd5d77f0'

This hash code is different from the one in the screenshot because the screenshot shows a pruned version of the model.

hashlib also has a file_digest() function (added in Python 3.11) that handles reading the chunks.

In [16]:
with open('/tmp/realisticVisionV60B1_v60B1VAE.safetensors', 'rb') as f:
    digest = hashlib.file_digest(f,'sha256')
    
digest.hexdigest()

'e5f3cbc5f7669457d3bec1fd492420995fb9a79e735dce438b81af61fd5d77f0'
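Because the Civitai page shows only the beginning of the hash, a natural follow-up is to compare a local file against a displayed prefix. The sketch below is one possible way to do that. The names `sha256_of_file` and `matches_prefix` are hypothetical helpers, the comparison is case-insensitive on the assumption that a site may display the hex digits in either case, and it uses the chunked loop so it also runs on Python versions older than 3.11.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks and return the SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def matches_prefix(path, prefix):
    """True if the file's SHA-256 hex digest starts with `prefix`, ignoring case."""
    return sha256_of_file(path).startswith(prefix.lower())

# Demonstrate on a small temporary file instead of the 4.0G model.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'gpu envy')

try:
    # Stand-in for a prefix copied from a web page: first 10 hex digits, upper case
    shown = hashlib.sha256(b'gpu envy').hexdigest()[:10].upper()
    print(matches_prefix(tmp.name, shown))
finally:
    os.remove(tmp.name)
```

A short prefix can confirm that a file is probably the expected one, but only a match on the full digest gives the collision resistance discussed at the start of this notebook.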