# Cryptographic hash functions
## Table of Contents

1. [Introduction](#introduction)  
2. [Introduction to Cryptographic Hashing](#introduction-to-cryptographic-hashing)  
3. [Basic Rationale and Design of Hash Functions](#basic-rationale-and-design-of-hash-functions)  
4. [Example of Cryptographic Hashing in Python with SHA-256](#example-of-cryptographic-hashing-in-python-with-sha-256)  
5. [Applications of Cryptographic Hashing](#applications-of-cryptographic-hashing)  
6. [Security of Cryptographic Hashing](#security-of-cryptographic-hashing)  
   - [Pre-image Resistance](#pre-image-resistance)  
   - [Collision Resistance](#collision-resistance)  
   - [Hash Length](#hash-length)  
7. [Commonly Used Cryptographic Hash Functions](#commonly-used-cryptographic-hash-functions)  
8. [Quantum Risks to Traditional Cryptographic Hashing](#quantum-risks-to-traditional-cryptographic-hashing)  
9. [Summary](#summary)


## Introduction

In this lesson we will look at cryptographic hash functions, which see extensive use in fast validation and authentication.

By the end of the lesson we will have covered:

- What cryptographic hash functions are
- Python code examples demonstrating the use of hash functions
- A look at applications of cryptographic hashing
- The security of cryptographic hashing
- Threats to these algorithms from both classical and quantum computers

## Introduction to Cryptographic Hashing


Hash functions are a valuable construct in cryptography because they enable validation while preserving confidentiality. As such, they form an important component of mechanisms for data authentication and integrity, such as hash-based message authentication codes (HMAC) and digital signatures. This lesson discusses the basic ideas and security considerations underpinning cryptographic hash functions and outlines potential vulnerabilities arising from the advent of quantum computing.

## Basic Rationale and Design of Hash Functions



There are many situations where authentication and integrity verification need to be performed cheaply and without revealing private information to the party performing the validation.

For example, when downloading software from a remote server, an efficient mechanism is needed to verify that the software actually downloaded has not been modified since being created by the original author of the software. Similarly, when authenticating users of web applications, it would be desirable to use a mechanism that does not involve physically storing or transmitting the actual passwords, which can potentially compromise their confidentiality.

*Cryptographic hash functions* (CHFs) address such needs efficiently and securely.

Fundamentally, a cryptographic hash function takes an input (or *message*) of arbitrary length and returns a fixed-size string of $n$ bits as output. The output of a CHF is also referred to as a *digest*.

---

### A useful CHF should satisfy several key properties:

1. **Uniformity**: The digests produced by a CHF should be distributed uniformly and should look random. The aim is to ensure the output leaks no information about the input.

2. **Determinism**: For a given input, a CHF must always produce the same *digest*, i.e., it must be deterministic.

3. **Irreversibility**: A CHF is a *one-way function* in that, given a digest, it should be infeasible to invert the hashing and obtain the input.

4. **Approximate injectivity**: While CHFs are many-to-one functions, they should appear to look like one-to-one functions. This is achieved by combining an enormous output space size with the avalanche effect whereby tiny changes in the input lead to wildly divergent digests. This characteristic is known as approximate injectivity.
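
Determinism and the fixed digest size are easy to check empirically. The following is a minimal sketch that uses SHA-256 from Python's standard `hashlib` module as a stand-in for a generic CHF; the helper name `sha256_hex` and the sample inputs are illustrative choices, not part of the lesson's own code.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()

short_input = b"a"
long_input = b"a" * 1_000_000  # one million bytes

# Determinism: hashing the same input twice gives the same digest
print(sha256_hex(short_input) == sha256_hex(short_input))  # True

# Fixed output size: 256 bits = 64 hex characters, whatever the input length
print(len(sha256_hex(short_input)), len(sha256_hex(long_input)))  # 64 64
```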

---

Given this, it's possible to validate a piece of data against the original instance by comparing a digest of the data to a digest of the original.

- If the two digests match, we can be confident with high probability that the data is identical to the original.
- If the digests differ, we can be sure that the data differs from the original, whether through tampering or accidental corruption.

Since the CHF digests themselves do not reveal the actual contents of the data or the original, they enable validation while preserving privacy.
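
The comparison described above can be sketched as follows. This is an illustration under stated assumptions rather than a prescribed procedure: it uses `hashlib` and `hmac.compare_digest` from the standard library, and the function names and the payload bytes are hypothetical.

```python
import hashlib
import hmac

def digest_of(data: bytes) -> bytes:
    """SHA-256 digest of data as raw bytes."""
    return hashlib.sha256(data).digest()

def matches_published_digest(data: bytes, published_digest: bytes) -> bool:
    """Check received data against a digest published by the original author."""
    # compare_digest takes time independent of where the two values differ
    return hmac.compare_digest(digest_of(data), published_digest)

original = b"installer-v1.0 contents"   # hypothetical payload
published = digest_of(original)         # digest the author would publish

print(matches_published_digest(original, published))                    # True
print(matches_published_digest(b"installer-v1.0 c0ntents", published))  # False
```

Only the digest needs to be shared with the verifier, which is why the check does not expose the original data.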

---

### Mathematical description

A hash function $\mathcal{H}$ can be defined as:

$$
\mathcal{H} : \Sigma^* \rightarrow \{0, 1\}^n
$$

where $\Sigma^*$ is the set of all possible strings which we may consider to be binary strings of any length.

The fact that the size of the input domain $\Sigma^*$ of $\mathcal{H}$ is *unbounded* while that of the co-domain $\{0, 1\}^n$ is *bounded* means that $\mathcal{H}$ is necessarily *many-to-one*: infinitely many different inputs map to each $n$-bit string.

The properties of uniformity and determinism may be nicely encapsulated within the *random oracle model* of cryptographic hashing.
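
The many-to-one property can be made concrete with a deliberately tiny co-domain. The sketch below assumes a toy hash with $n = 8$, built by keeping only the first byte of SHA-256 (the names `tiny_hash` and `find_collision_by_pigeonhole` are hypothetical). With only $2^8 = 256$ possible digests, any 257 distinct inputs must contain two that share a digest.

```python
import hashlib

def tiny_hash(data: bytes) -> int:
    """Toy 8-bit hash: the first byte of the SHA-256 digest (n = 8)."""
    return hashlib.sha256(data).digest()[0]

def find_collision_by_pigeonhole():
    """Hash 257 distinct inputs; with only 256 possible outputs, two must collide."""
    seen = {}  # maps tiny digest -> first input that produced it
    for i in range(257):
        message = str(i).encode()
        h = tiny_hash(message)
        if h in seen:
            return seen[h], message, h
        seen[h] = message
    raise AssertionError("unreachable: 257 inputs cannot fit in 256 outputs")

m1, m2, shared = find_collision_by_pigeonhole()
print(m1, m2, shared)
```

For a real CHF the same argument guarantees that collisions exist, but the co-domain is so large that this exhaustive approach is out of reach.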


## Example of Cryptographic Hashing in Python with SHA-256

This simple example demonstrates cryptographic hashing using the popular SHA-256 algorithm as provided by the `cryptography` Python library.

First we show how a minor difference in plain texts leads to a very large difference in the hash digests.


```python
# Begin by importing some necessary modules
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import hashes

# Helper function that returns the number of positions at which
# two equal-length strings differ
def char_diff(str1, str2):
    return sum(str1[i] != str2[i] for i in range(len(str1)))

# Messages to be hashed
message_1 = b"Buy 10000 shares of WXYZ stock now!"
message_2 = b"Buy 10000 shares of VXYZ stock now!"

print(f"The two messages differ by {char_diff(message_1, message_2)} characters")
```

```text
The two messages differ by 1 characters
```


The two messages differ in exactly one character.

Next, we instantiate `Hash` objects to carry out the hashing, which involves calls to two methods: `update` and `finalize`.

We see that due to the avalanche effect in the SHA-256 CHF, a one-character difference in input messages yields two very different digests.

```python
# Create new SHA-256 hash objects, one for each message
chf_1 = hashes.Hash(hashes.SHA256(), backend=default_backend())
chf_2 = hashes.Hash(hashes.SHA256(), backend=default_backend())

# Update each hash object with the bytes of the corresponding message
chf_1.update(message_1)
chf_2.update(message_2)

# Finalize the hash process and obtain the digests
digest_1 = chf_1.finalize()
digest_2 = chf_2.finalize()

# Convert the resulting digests to hexadecimal strings for convenient printing
digest_1_str = digest_1.hex()
digest_2_str = digest_2.hex()

# Print out the digests as strings
print(f"digest-1: {digest_1_str}")
print(f"digest-2: {digest_2_str}")

print(f"The two digests differ by {char_diff(digest_1_str, digest_2_str)} characters")
```

```text
digest-1: 6e0e6261b7131bd80ffdb2a4d42f9d042636350e45e184b92fcbcc9646eaf1e7
digest-2: 6b0abb368c3a1730f935b68105e3f3ae7fd43d7e786d3ed3503dbb45c74ada46
The two digests differ by 57 characters
```
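
The character-level comparison above can be refined by counting differing bits instead. In the idealized model each output bit flips with probability 1/2, so roughly half of the 256 digest bits would be expected to change. The sketch below is one way to check this; it uses the standard `hashlib` module rather than the `cryptography` library so that it runs without extra dependencies, and `bit_diff` is a hypothetical helper.

```python
import hashlib

def bit_diff(a: bytes, b: bytes) -> int:
    """Number of bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

message_1 = b"Buy 10000 shares of WXYZ stock now!"
message_2 = b"Buy 10000 shares of VXYZ stock now!"

digest_1 = hashlib.sha256(message_1).digest()
digest_2 = hashlib.sha256(message_2).digest()

print(f"input bits changed:  {bit_diff(message_1, message_2)} of {8 * len(message_1)}")
print(f"digest bits changed: {bit_diff(digest_1, digest_2)} of {8 * len(digest_1)}")
```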


## Applications of Cryptographic Hashing

The unique properties of CHFs make them suitable for a wide array of applications:

- **Data integrity checks**: Hash functions can be used to create a checksum for a set of data. Any modifications to the data, intentional or not, will result in a different checksum, alerting systems or users to the change. The checksum is also typically much more compact than the original data, which makes checksum comparisons very fast.

![data-integrity.png](attachment:data-integrity.png)

*Figure 1. Secure hashing for data integrity*

- **Digital signatures**: Cryptographic hashes are essential to the functioning of digital signatures as they involve comparing cryptographically hashed messages to establish authenticity and integrity while preserving privacy.

![digital-signature.png](attachment:digital-signature.png)

*Figure 2. Digital signatures*

- **Blockchain and cryptocurrencies**: Cryptocurrencies like Bitcoin rely heavily on CHFs, particularly for ensuring transaction integrity and enabling consensus mechanisms like proof of work.
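
To give a feel for how proof of work leans on hashing, here is a toy sketch. It assumes a simplified puzzle: find a nonce such that SHA-256 of the block data plus the nonce starts with a given number of zero bits. The block contents, the function names, and the difficulty value are all hypothetical, and real systems such as Bitcoin use a different block format and difficulty encoding.

```python
import hashlib
from itertools import count

def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    as_int = int.from_bytes(digest, "big")
    return 8 * len(digest) - as_int.bit_length()

def mine(block_data: bytes, difficulty_bits: int) -> int:
    """Find the smallest nonce whose digest starts with difficulty_bits zero bits."""
    for nonce in count():
        digest = hashlib.sha256(block_data + str(nonce).encode()).digest()
        if leading_zero_bits(digest) >= difficulty_bits:
            return nonce

block = b"toy block: Alice pays Bob 1 coin;"  # hypothetical block contents
nonce = mine(block, difficulty_bits=12)        # about 2**12 attempts on average
print(nonce)
```

Finding the nonce takes many hash evaluations, while checking a claimed nonce takes just one. That asymmetry is what the consensus mechanism relies on.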

## Security of Cryptographic Hashing

The security of a CHF is typically assessed based on resistance to two types of attacks: [pre-image](#pre-image-resistance) and [collision](#collision-resistance).



### Pre-image Resistance

*Pre-image resistance* means that, given a digest, it should be infeasible to find any input that hashes to it.

This is related to the one-way property of CHFs.

A good CHF is designed in such a way that a party wishing to conduct a pre-image attack has no better option than a brute-force approach, which for an $n$-bit digest takes on the order of $2^n$ hash evaluations.

#### *Mathematical details*
Given a CHF $\mathcal{H}$ and digest $g$, it should be computationally infeasible to find any input $m$ from the pre-image of $g$ whereby  
$$
\mathcal{H}(m) = g.
$$
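
A brute-force pre-image search becomes tractable if the digest is made artificially short. The sketch below assumes a toy hash that keeps only the first 16 bits of SHA-256 (`N_BITS`, `toy_hash`, and `brute_force_preimage` are hypothetical names) and simply tries candidate inputs in order until one hashes to the target.

```python
import hashlib
from itertools import count

N_BITS = 16  # toy digest length; real CHFs use far longer digests

def toy_hash(data: bytes) -> int:
    """First N_BITS bits of SHA-256, as an integer."""
    full = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return full >> (256 - N_BITS)

def brute_force_preimage(target: int):
    """Try candidate inputs 0, 1, 2, ... until one hashes to target."""
    for attempts, i in enumerate(count(), start=1):
        candidate = i.to_bytes(8, "big")
        if toy_hash(candidate) == target:
            return candidate, attempts

target = toy_hash(b"secret message")  # the digest g whose pre-image is sought
found, attempts = brute_force_preimage(target)
print(f"found a pre-image after {attempts} attempts (2**{N_BITS} = {2**N_BITS})")
```

The input found is most likely not the original message, which is fine: any $m$ with $\mathcal{H}(m) = g$ counts as a successful pre-image. Each extra digest bit doubles the expected work, so the same loop is hopeless at $n = 256$.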


### Collision Resistance

*Collision resistance* means that it is difficult to find two different inputs that hash to the same digest.

A *cryptographic hash collision* occurs when two inputs hash to the same digest. While collisions inevitably exist given the many-to-one nature of CHFs, a good CHF nevertheless makes it infeasible to locate one at will.

Collision resistance is crucial for applications like digital signatures and certificates, where it could be disastrous if a malicious party were able to create a forgery that hashes to the same value.

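#### *Mathematical details of hash collisions*
A collision for a CHF $\mathcal{H}$ is a pair of distinct inputs $m_1 \neq m_2$ such that
$$
\mathcal{H}(m_1) = \mathcal{H}(m_2).
$$
Collision resistance requires that finding such a pair be computationally infeasible.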


### Hash Length

Collision resistance is a harder requirement than pre-image resistance and necessitates output lengths twice as long as those needed for pre-image resistance. This is because a brute-force attack known as the *birthday attack*, which can be used to identify hash collisions, has a time complexity of about $2^{n/2}$, compared with $2^n$ for a brute-force pre-image search.

In the absence of cryptanalytic weaknesses, the security of a hash function is therefore primarily influenced by its hash length.  
The longer the hash, the more secure it is, as it becomes harder to mount brute force attacks.
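
The birthday attack can be demonstrated on a shortened digest. The sketch below assumes a toy hash made of the first 32 bits of SHA-256 (`N_BITS`, `toy_hash`, and `birthday_collision` are hypothetical names). It stores every digest seen so far and stops at the first repeat, which typically happens after roughly $2^{n/2} = 2^{16}$ hashes rather than $2^{32}$.

```python
import hashlib
from itertools import count

N_BITS = 32  # toy digest length (a multiple of 8 in this sketch)

def toy_hash(data: bytes) -> bytes:
    """First N_BITS bits of SHA-256."""
    return hashlib.sha256(data).digest()[: N_BITS // 8]

def birthday_collision():
    """Hash distinct inputs, remembering every digest, until one repeats."""
    seen = {}  # digest -> input that produced it
    for attempts, i in enumerate(count(), start=1):
        message = i.to_bytes(8, "big")
        h = toy_hash(message)
        if h in seen:
            return seen[h], message, attempts
        seen[h] = message

m1, m2, attempts = birthday_collision()
print(f"collision after {attempts} hashes; 2**(n/2) = {2 ** (N_BITS // 2)}")
```

The price of the speed-up is memory: the table of stored digests grows with the number of attempts.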


## Commonly Used Cryptographic Hash Functions

The following table lists some commonly used cryptographic hash functions, along with their hash lengths and primary application domains:

| **Hash Function** | **Output Length (bits)** | **Common Applications** |
|------------------|--------------------------|--------------------------|
| MD5              | 128                      | File integrity checking, older systems, non-crypto uses |
| SHA-1            | 160                      | Legacy systems, Git for version control |
| SHA-256          | 256                      | Cryptocurrency (Bitcoin), digital signatures, certificates |
| SHA-3            | Variable (up to 512)     | Various cryptographic applications, successor to SHA-2 |
| Blake2           | Variable (up to 512)     | Cryptography, replacing MD5/SHA-1 in some systems |
| Blake3           | 256 by default (extendable) | Cryptography, file hashing, data integrity |

- **MD5** and **SHA-1**, while still seen in less sensitive applications, are deprecated due to weak collision resistance.  
- **SHA-256**, part of the SHA-2 family, is widely used and secure for most modern applications.  
- **SHA-3**, standardized by NIST in 2015, is built on a sponge construction rather than the Merkle-Damgård design of SHA-2, which among other things makes it immune to length-extension attacks.  
- **Blake2** and **Blake3** are typically faster in software than MD5, SHA-1, SHA-2, and SHA-3, and are being adopted in speed-critical applications.
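
Several of the functions in the table are available in Python's standard `hashlib` module, so their default digest lengths can be checked directly. The selection below is an illustrative subset: MD5 is left out because some hardened Python builds may disable it, and BLAKE3 is not part of the standard library.

```python
import hashlib

# Constructor names as exposed by hashlib
algorithms = ["sha1", "sha256", "sha3_256", "sha3_512", "blake2s", "blake2b"]

# digest_size is reported in bytes, so multiply by 8 to get bits
digest_bits = {name: getattr(hashlib, name)().digest_size * 8 for name in algorithms}

for name, bits in digest_bits.items():
    print(f"{name:10s} {bits:4d} bits")

# BLAKE2 accepts a shorter digest length as a parameter
short = hashlib.blake2b(b"example", digest_size=16).digest()
print(len(short) * 8, "bits")  # 128 bits
```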

## Quantum Risks to Traditional Cryptographic Hashing


The primary quantum threat to cryptographic hashing comes from accelerated **brute-force attacks**.

In a classical brute-force pre-image attack, an attacker who is given a digest tries candidate inputs until one matches. For an $n$-bit digest, each attempt succeeds with probability about $2^{-n}$, so on the order of $2^n$ attempts are needed.

---

### Grover’s algorithm

For such an unstructured search, **Grover’s algorithm** can speed up brute-force attacks via **quantum amplitude amplification**, reducing the time complexity from about $2^n$ to about $2^{n/2}$.

- A 256-bit CHF, which offers 256-bit security against classical pre-image attacks, therefore offers roughly **128-bit security** against Grover-assisted attacks.
- **Birthday attacks** remain effective on quantum computers, meaning **collision resistance still requires double the length**.

> For instance, performing Grover's search on SHA-256 would require about $2^{128}$ operations, which is still infeasible for the foreseeable future.

### BHT algorithm

A quantum algorithm that combines aspects of the birthday attack with Grover search was proposed in 1997 by  
**Brassard, Høyer, and Tapp** (BHT) and affords a theoretical scaling of $O(2^{n/3})$ for finding hash collisions.  
However, this improved scaling assumes the existence of **quantum random access memory** ([QRAM](https://en.wikipedia.org/wiki/Quantum_RAM)) technology, which does not currently exist.

Without QRAM, the realizable scaling is $\tilde{O}(2^{2n/5})$. For hash lengths currently in use, this marginal improvement in collision-finding capability relative to the birthday attack does not represent a critical threat.
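
The scalings quoted in this section can be tabulated for concrete hash lengths. The sketch below only restates the exponents given in the text for an idealized $n$-bit hash; it ignores constant and polylogarithmic factors, and the function name `security_bits` is hypothetical.

```python
def security_bits(n: int) -> dict:
    """Approximate attack exponents (log2 of the work) for an ideal n-bit hash."""
    return {
        "classical pre-image (brute force)": n,
        "quantum pre-image (Grover)": n / 2,
        "classical collision (birthday)": n / 2,
        "quantum collision (BHT, needs QRAM)": n / 3,
        "quantum collision (no QRAM)": 2 * n / 5,
    }

for n in (256, 384, 512):
    print(f"n = {n}")
    for attack, exponent in security_bits(n).items():
        print(f"  {attack:38s} ~2^{exponent:.1f}")
```

For $n = 256$ this gives exponents of 256, 128, 128, about 85.3, and 102.4 respectively, all of which are far beyond practical reach.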



## Summary

Cryptographic hash functions (CHFs) are essential for ensuring data integrity and privacy in digital information systems and find widespread application in many contexts.

- The security requirements of CHFs are mainly based on **resistance to pre-image and collision attacks**.
- For strong CHFs, **hash length** is a good proxy for the **security level**.

**Quantum computers** running the Grover and BHT algorithms could weaken CHF security, but:
- The **long hash lengths** already in use (e.g., SHA-256, SHA-3) make pre-image and collision attacks infeasible even for quantum adversaries.
- Unless a **new cryptanalytic technique** is discovered, modern CHFs remain **secure**.

**CHFs** continue to serve as a **core building block** for quantum-resistant cryptographic systems, ensuring data security even in the face of future quantum advancements.