# Chapter 23: Hashing and HMAC

Cryptographic hashing is the foundation of data integrity verification, password storage,
and message authentication. This notebook explores Python's `hashlib` and `hmac` modules
for computing secure digests and verifying data authenticity.

## Topics Covered
- **hashlib module**: Available algorithms, SHA-256, SHA-512, MD5, BLAKE2b
- **Creating digests**: `hexdigest()`, `digest()`, `digest_size`
- **Incremental hashing**: `update()` for large files
- **File integrity checking** with SHA-256
- **HMAC**: Message authentication codes with the `hmac` module
- **hmac.new()**: Key, message, and hash algorithm
- **hmac.compare_digest()**: Constant-time comparison
- **Practical**: Verifying data integrity and authenticity

## hashlib Module: Available Algorithms

The `hashlib` module provides a common interface to many secure hash and message digest
algorithms. `hashlib.algorithms_guaranteed` lists algorithms available on all platforms,
while `hashlib.algorithms_available` lists everything the current interpreter supports
(which may include additional algorithms from OpenSSL).
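
Because availability varies by platform, code that takes an algorithm name from configuration may want to handle an unsupported name gracefully. The following is a minimal sketch, assuming a hypothetical helper `make_hasher` (not part of `hashlib`) that falls back to SHA-256 when `hashlib.new()` rejects the requested name:

```python
import hashlib


def make_hasher(name: str, fallback: str = "sha256"):
    """Return a hash object for `name`, or for `fallback` if unavailable.

    Hypothetical helper for illustration; not part of hashlib.
    """
    try:
        # hashlib.new() raises ValueError for unknown or unsupported names
        return hashlib.new(name)
    except ValueError:
        return hashlib.new(fallback)


print(make_hasher("sha512").digest_size)             # digest size of SHA-512
print(make_hasher("no-such-algorithm").digest_size)  # digest size of the fallback
```

Silently falling back is only one possible policy; for security-sensitive code it may be preferable to let the `ValueError` propagate.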

In [None]:
import hashlib

# Algorithms guaranteed on all platforms
print("Guaranteed algorithms:")
print(sorted(hashlib.algorithms_guaranteed))

# Algorithms available on this platform (may include extras from OpenSSL)
print(f"\nAvailable algorithms ({len(hashlib.algorithms_available)} total):")
print(sorted(hashlib.algorithms_available))

# Extra algorithms only available on this platform
extras: set[str] = hashlib.algorithms_available - hashlib.algorithms_guaranteed
if extras:
    print(f"\nPlatform-specific extras: {sorted(extras)}")

## Creating Digests: SHA-256, SHA-512, MD5, BLAKE2b

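Each hash algorithm used here produces a fixed-size output (the digest) regardless of input size.
The SHAKE extendable-output functions are the exception: their `digest()` and `hexdigest()`
methods take the desired output length as an argument.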
- `hexdigest()` returns the digest as a hexadecimal string
- `digest()` returns the raw bytes
- `digest_size` gives the output length in bytes
- `block_size` gives the internal block size in bytes

**Important**: MD5 and SHA-1 are considered cryptographically broken for collision
resistance. Use SHA-256 or better for security-sensitive applications.
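
The fixed-size property can be checked directly. This short sketch hashes inputs of very different lengths (the sample inputs are arbitrary) and confirms that SHA-256 always yields 32 bytes, with `hexdigest()` using two hex characters per byte:

```python
import hashlib

inputs = [b"", b"abc", b"x" * 1_000_000]
for data in inputs:
    h = hashlib.sha256(data)
    # digest() length always equals digest_size; hexdigest() is twice as long
    assert len(h.digest()) == h.digest_size == 32
    assert len(h.hexdigest()) == 2 * h.digest_size
    print(f"{len(data):>9} input bytes -> {h.digest_size} digest bytes")

# Changing a single input character yields an unrelated-looking digest
print(hashlib.sha256(b"abc").hexdigest())
print(hashlib.sha256(b"abd").hexdigest())
```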

In [None]:
import hashlib

message: bytes = b"The quick brown fox jumps over the lazy dog"

# SHA-256: the workhorse of modern hashing (32 bytes / 256 bits)
sha256_hash = hashlib.sha256(message)
print("SHA-256:")
print(f"  hexdigest:   {sha256_hash.hexdigest()}")
print(f"  digest_size: {sha256_hash.digest_size} bytes")
print(f"  block_size:  {sha256_hash.block_size} bytes")
print(f"  name:        {sha256_hash.name}")

# SHA-512: longer digest from the same SHA-2 family (64 bytes / 512 bits)
sha512_hash = hashlib.sha512(message)
print("\nSHA-512:")
print(f"  hexdigest:   {sha512_hash.hexdigest()}")
print(f"  digest_size: {sha512_hash.digest_size} bytes")

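# MD5: fast but NOT cryptographically secure (16 bytes / 128 bits)
# usedforsecurity=False (Python 3.9+) marks this as a non-security use,
# which lets it run on builds that restrict MD5 (e.g. FIPS mode)
md5_hash = hashlib.md5(message, usedforsecurity=False)
print("\nMD5 (not secure -- legacy use only):")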
print(f"  hexdigest:   {md5_hash.hexdigest()}")
print(f"  digest_size: {md5_hash.digest_size} bytes")

# BLAKE2b: modern, fast, and secure (configurable size, default 64 bytes)
blake2_hash = hashlib.blake2b(message)
print("\nBLAKE2b:")
print(f"  hexdigest:   {blake2_hash.hexdigest()}")
print(f"  digest_size: {blake2_hash.digest_size} bytes")

In [None]:
import hashlib

message: bytes = b"Hello, world!"

# hexdigest() vs digest(): string representation vs raw bytes
h = hashlib.sha256(message)

hex_result: str = h.hexdigest()
raw_result: bytes = h.digest()

print(f"hexdigest() type: {type(hex_result).__name__}")
print(f"hexdigest():      {hex_result}")
print(f"Length:           {len(hex_result)} characters\n")

print(f"digest() type:    {type(raw_result).__name__}")
print(f"digest():         {raw_result}")
print(f"Length:           {len(raw_result)} bytes\n")

# They represent the same value -- hex is just the human-readable form
print(f"Hex of raw bytes: {raw_result.hex()}")
print(f"Match:            {raw_result.hex() == hex_result}")

# BLAKE2b with custom digest size (e.g., 16 bytes for a short hash)
short_hash = hashlib.blake2b(message, digest_size=16)
print(f"\nBLAKE2b (16-byte): {short_hash.hexdigest()}")
print(f"Digest size:       {short_hash.digest_size} bytes")

## Incremental Hashing with update()

For large data that cannot fit in memory, you can feed chunks incrementally using
`update()`. Calling `h.update(a); h.update(b)` is equivalent to `h.update(a + b)`.
This is essential for hashing large files without loading them entirely into memory.
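
The same idea applies to any stream of chunks, not just files. As a minimal sketch, the hypothetical helpers `hash_chunks` and `numbered_lines` below (names chosen for illustration) hash lazily generated data without ever joining it in memory:

```python
import hashlib
from collections.abc import Iterable, Iterator


def hash_chunks(chunks: Iterable[bytes], algorithm: str = "sha256") -> str:
    """Hash a stream of byte chunks one at a time."""
    h = hashlib.new(algorithm)
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()


def numbered_lines(count: int) -> Iterator[bytes]:
    """Yield sample data lazily, one encoded line at a time."""
    for i in range(count):
        yield f"line {i}\n".encode("utf-8")


streamed = hash_chunks(numbered_lines(1000))
joined = hashlib.sha256(b"".join(numbered_lines(1000))).hexdigest()
print(f"Streamed digest matches joined digest: {streamed == joined}")
```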

In [None]:
import hashlib

# Demonstrate that update() is equivalent to hashing concatenated data
chunk1: bytes = b"Hello, "
chunk2: bytes = b"world!"
full_message: bytes = chunk1 + chunk2

# Method 1: Hash all at once
h_once = hashlib.sha256(full_message)
print(f"All at once: {h_once.hexdigest()}")

# Method 2: Hash incrementally with update()
h_incremental = hashlib.sha256()
h_incremental.update(chunk1)
h_incremental.update(chunk2)
print(f"Incremental: {h_incremental.hexdigest()}")

# They produce identical digests
print(f"Match:       {h_once.hexdigest() == h_incremental.hexdigest()}")

# copy() creates a snapshot of the hash state
h_base = hashlib.sha256(b"base data")
h_copy = h_base.copy()
h_base.update(b" + extra")

print(f"\nOriginal (with extra): {h_base.hexdigest()[:32]}...")
print(f"Copy (without extra):  {h_copy.hexdigest()[:32]}...")
print(f"Different:             {h_base.hexdigest() != h_copy.hexdigest()}")

## File Integrity Checking with SHA-256

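A common use of hashing is verifying file integrity. By computing a SHA-256 checksum
of a file and comparing it to a known-good value, you can detect accidental corruption.
Detecting deliberate tampering additionally requires that the known-good value comes
from a source the attacker cannot modify. The `update()` approach lets you hash files
of any size while holding only one chunk in memory at a time.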
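
On Python 3.11 and later, `hashlib.file_digest()` performs the chunked reading for you. The sketch below assumes a hypothetical wrapper `file_sha256` that uses it when available and otherwise falls back to a manual loop; the 64 KiB chunk size in the fallback is an arbitrary choice:

```python
import hashlib
import sys
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    with open(path, "rb") as f:
        if sys.version_info >= (3, 11):
            # file_digest() reads from the binary file object in chunks
            return hashlib.file_digest(f, "sha256").hexdigest()
        # Fallback for older interpreters: manual chunked reads
        h = hashlib.sha256()
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
        return h.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    payload = b"example payload\n" * 1000
    sample.write_bytes(payload)
    digest = file_sha256(sample)

print(f"Matches in-memory hash: {digest == hashlib.sha256(payload).hexdigest()}")
```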

In [None]:
import hashlib
import tempfile
from pathlib import Path


def compute_file_hash(filepath: Path, algorithm: str = "sha256",
                      chunk_size: int = 8192) -> str:
    """Compute the hash of a file by reading it in chunks.

    Args:
        filepath: Path to the file to hash.
        algorithm: Hash algorithm name (default: sha256).
        chunk_size: Number of bytes to read per iteration.

    Returns:
        Hex digest string of the file contents.
    """
    h = hashlib.new(algorithm)
    with open(filepath, "rb") as f:
        while True:
            chunk: bytes = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()


def verify_file_integrity(filepath: Path, expected_hash: str,
                          algorithm: str = "sha256") -> bool:
    """Verify a file matches an expected hash."""
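    actual_hash: str = compute_file_hash(filepath, algorithm)
    # hexdigest() is lowercase; normalize the expected value so that
    # uppercase or whitespace-padded checksums still compare equal
    return actual_hash == expected_hash.strip().lower()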


# Create a temporary file to demonstrate
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as f:
    f.write("This is important data that must not be tampered with.\n")
    temp_path = Path(f.name)

# Compute the checksum
original_hash: str = compute_file_hash(temp_path)
print(f"File:     {temp_path.name}")
print(f"SHA-256:  {original_hash}")

# Verify integrity (should pass)
is_valid: bool = verify_file_integrity(temp_path, original_hash)
print(f"\nIntegrity check (unmodified): {'PASS' if is_valid else 'FAIL'}")

# Tamper with the file and re-check
with open(temp_path, "a") as f:
    f.write("sneaky modification")

is_valid_after = verify_file_integrity(temp_path, original_hash)
print(f"Integrity check (tampered):   {'PASS' if is_valid_after else 'FAIL'}")

new_hash: str = compute_file_hash(temp_path)
print(f"\nNew hash: {new_hash}")
print(f"Hashes differ: {original_hash != new_hash}")

# Cleanup
temp_path.unlink()

## HMAC: Message Authentication Codes

A plain hash verifies **integrity** (data was not corrupted), but it cannot verify
**authenticity** (data came from a trusted source). An attacker who modifies data can
simply recompute the hash.

**HMAC** (Hash-based Message Authentication Code) solves this by incorporating a secret
key into the hash. Only someone who knows the key can produce a valid HMAC. The `hmac`
module implements this using any hash algorithm from `hashlib`; the `digestmod` argument
has no default (it has been required since Python 3.8) and must always be given.
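
The difference can be made concrete with a small simulation. In this sketch (the message contents and variable names are illustrative assumptions), an attacker replaces the data and recomputes the plain hash, which the receiver accepts, but cannot produce a matching HMAC tag without the key:

```python
import hashlib
import hmac
import secrets

key = secrets.token_bytes(32)  # random shared secret, unknown to the attacker
data = b"amount=100"

# Sender publishes the data with a plain hash and an HMAC tag
plain = hashlib.sha256(data).hexdigest()
tag = hmac.new(key, data, hashlib.sha256).hexdigest()

# Attacker alters the data and recomputes the plain hash (no key required)
forged_data = b"amount=9999"
forged_plain = hashlib.sha256(forged_data).hexdigest()
# Without the key, the attacker can only reuse the old tag or guess one
forged_tag = tag

# Receiver recomputes both values over the data it received
plain_check = hashlib.sha256(forged_data).hexdigest() == forged_plain
expected_tag = hmac.new(key, forged_data, hashlib.sha256).hexdigest()
hmac_check = hmac.compare_digest(expected_tag, forged_tag)

print(f"Plain hash changed:         {forged_plain != plain}")
print(f"Plain hash accepts forgery: {plain_check}")
print(f"HMAC accepts forgery:       {hmac_check}")
```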

In [None]:
import hmac
import hashlib

# A shared secret key (in practice, use secrets.token_bytes())
secret_key: bytes = b"my-super-secret-key-2024"
message: bytes = b"Transfer $500 to account 12345"

# Create an HMAC with SHA-256
mac = hmac.new(key=secret_key, msg=message, digestmod=hashlib.sha256)

print(f"Message:     {message.decode()}")
print(f"HMAC-SHA256: {mac.hexdigest()}")
print(f"Digest size: {mac.digest_size} bytes")
print(f"Block size:  {mac.block_size} bytes")
print(f"Name:        {mac.name}")

# HMAC also supports incremental update()
mac2 = hmac.new(key=secret_key, digestmod=hashlib.sha256)
mac2.update(b"Transfer $500 ")
mac2.update(b"to account 12345")
print(f"\nIncremental: {mac2.hexdigest()}")
print(f"Match:       {mac.hexdigest() == mac2.hexdigest()}")

In [None]:
import hmac
import hashlib

secret_key: bytes = b"my-super-secret-key-2024"
message: bytes = b"Transfer $500 to account 12345"

# Different keys produce completely different HMACs
mac_correct = hmac.new(secret_key, message, hashlib.sha256)
mac_wrong = hmac.new(b"wrong-key", message, hashlib.sha256)

print("Same message, different keys:")
print(f"  Correct key: {mac_correct.hexdigest()[:32]}...")
print(f"  Wrong key:   {mac_wrong.hexdigest()[:32]}...")

# Different messages with the same key also differ
mac_tampered = hmac.new(secret_key, b"Transfer $5000 to account 12345", hashlib.sha256)
print("\nSame key, tampered message:")
print(f"  Original: {mac_correct.hexdigest()[:32]}...")
print(f"  Tampered: {mac_tampered.hexdigest()[:32]}...")

# You can also use string names for the digestmod
mac_str = hmac.new(secret_key, message, "sha256")
print(f"\nUsing string digestmod: {mac_str.hexdigest()[:32]}...")
print(f"Same result:           {mac_str.hexdigest() == mac_correct.hexdigest()}")

## hmac.compare_digest(): Constant-Time Comparison

When verifying an HMAC, **never** use `==` to compare the received value with the expected
one. A regular string comparison short-circuits on the first differing character, which
leaks timing information that an attacker can exploit to guess the correct HMAC byte by byte.

`hmac.compare_digest()` performs a **constant-time** comparison: its running time does not
depend on how many leading bytes match, which prevents this attack. Both arguments must be
of the same kind (two ASCII `str` values or two bytes-like objects), and a length mismatch
may still be observable through timing.
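
`compare_digest()` works on raw bytes as well as hex strings. The following sketch uses the one-shot `hmac.digest()` function (Python 3.7+) to produce raw tags; the key, message, and the helper name `verify_tag` are placeholders for illustration:

```python
import hashlib
import hmac

key = b"demo-key"  # placeholder; use a random secret in practice
msg = b"payload"

# hmac.digest() is a one-shot helper that returns the raw tag bytes
tag: bytes = hmac.digest(key, msg, "sha256")


def verify_tag(key: bytes, msg: bytes, received: bytes) -> bool:
    """Recompute the tag and compare without content-based short-circuiting."""
    return hmac.compare_digest(hmac.digest(key, msg, "sha256"), received)


print(f"Correct tag:  {verify_tag(key, msg, tag)}")
print(f"All-zero tag: {verify_tag(key, msg, bytes(len(tag)))}")

# Mixing bytes and str is rejected rather than silently compared
try:
    hmac.compare_digest(tag, tag.hex())
except TypeError as exc:
    print(f"Mixed types rejected: {exc}")
```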

In [None]:
import hmac
import hashlib

secret_key: bytes = b"server-secret-key"


def sign_message(key: bytes, message: bytes) -> str:
    """Create an HMAC signature for a message."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_message(key: bytes, message: bytes, signature: str) -> bool:
    """Verify a message signature using constant-time comparison.

    IMPORTANT: Always use hmac.compare_digest() instead of == to prevent
    timing attacks.
    """
    expected: str = hmac.new(key, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


# Simulate sender creating a signed message
original_message: bytes = b"user_id=42&action=withdraw&amount=100"
signature: str = sign_message(secret_key, original_message)
print(f"Message:   {original_message.decode()}")
print(f"Signature: {signature}")

# Receiver verifies the message
is_authentic: bool = verify_message(secret_key, original_message, signature)
print(f"\nVerification (authentic): {is_authentic}")

# Attacker tries to tamper with the message
tampered_message: bytes = b"user_id=42&action=withdraw&amount=10000"
is_tampered_valid: bool = verify_message(secret_key, tampered_message, signature)
print(f"Verification (tampered):  {is_tampered_valid}")

# Attacker tries a forged signature
forged_sig: str = "a" * 64
is_forged_valid: bool = verify_message(secret_key, original_message, forged_sig)
print(f"Verification (forged):    {is_forged_valid}")

## Practical: Verifying Data Integrity and Authenticity

This example combines file hashing and HMAC to build a simple system that signs
files with a secret key and later verifies both integrity (content unchanged) and
authenticity (signed by someone with the key).
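
An HMAC computed over file contents alone does not cover metadata such as the filename. One possible extension, sketched here with the hypothetical helpers `sign_manifest` and `verify_manifest` (the field names and sample values are assumptions), is to compute a second HMAC over a deterministic JSON encoding of the manifest fields:

```python
import hashlib
import hmac
import json


def sign_manifest(fields: dict[str, str], key: bytes) -> str:
    """HMAC a canonical JSON encoding of the manifest fields."""
    # sort_keys and fixed separators make the byte encoding deterministic
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_manifest(fields: dict[str, str], tag: str, key: bytes) -> bool:
    """Recompute the manifest tag and compare in constant time."""
    return hmac.compare_digest(sign_manifest(fields, key), tag)


key = b"release-signing-key-v1"
fields = {"filename": "config.dat", "sha256": hashlib.sha256(b"data").hexdigest()}
tag = sign_manifest(fields, key)

# Renaming the file in the manifest invalidates the tag
renamed = {**fields, "filename": "other.dat"}
print(f"Original manifest: {verify_manifest(fields, tag, key)}")
print(f"Renamed manifest:  {verify_manifest(renamed, tag, key)}")
```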

In [None]:
import hashlib
import hmac
import json
import tempfile
from pathlib import Path


def sign_file(filepath: Path, key: bytes,
              chunk_size: int = 8192) -> dict[str, str]:
    """Create a signed manifest for a file.

    Returns a dict with the filename, SHA-256 hash, and HMAC signature.
    """
    sha256 = hashlib.sha256()
    mac = hmac.new(key, digestmod=hashlib.sha256)

    with open(filepath, "rb") as f:
        while True:
            chunk: bytes = f.read(chunk_size)
            if not chunk:
                break
            sha256.update(chunk)
            mac.update(chunk)

    return {
        "filename": filepath.name,
        "sha256": sha256.hexdigest(),
        "hmac_sha256": mac.hexdigest(),
    }


def verify_file(filepath: Path, manifest: dict[str, str],
                key: bytes, chunk_size: int = 8192) -> dict[str, bool]:
    """Verify a file against a signed manifest.

    Returns a dict indicating whether integrity and authenticity checks pass.
    """
    sha256 = hashlib.sha256()
    mac = hmac.new(key, digestmod=hashlib.sha256)

    with open(filepath, "rb") as f:
        while True:
            chunk: bytes = f.read(chunk_size)
            if not chunk:
                break
            sha256.update(chunk)
            mac.update(chunk)

    integrity_ok: bool = sha256.hexdigest() == manifest["sha256"]
    authenticity_ok: bool = hmac.compare_digest(
        mac.hexdigest(), manifest["hmac_sha256"]
    )

    return {
        "integrity": integrity_ok,
        "authenticity": authenticity_ok,
    }


# Create a test file
secret: bytes = b"release-signing-key-v1"
with tempfile.NamedTemporaryFile(mode="w", suffix=".dat", delete=False) as f:
    f.write("Critical configuration data: server=prod, port=443\n")
    test_file = Path(f.name)

# Sign the file
manifest: dict[str, str] = sign_file(test_file, secret)
print("Signed manifest:")
print(json.dumps(manifest, indent=2))

# Verify unmodified file
result = verify_file(test_file, manifest, secret)
print(f"\nVerification (original):  {result}")

# Tamper with the file
with open(test_file, "a") as f:
    f.write("server=evil-server\n")

result_tampered = verify_file(test_file, manifest, secret)
print(f"Verification (tampered):  {result_tampered}")

# Verify with wrong key (authenticity fails even if hash matches)
# Re-create the original file to test key mismatch
with open(test_file, "w") as f:
    f.write("Critical configuration data: server=prod, port=443\n")

result_wrong_key = verify_file(test_file, manifest, b"wrong-key")
print(f"Verification (wrong key): {result_wrong_key}")

# Cleanup
test_file.unlink()

## Hashing Pitfalls and Best Practices

Understanding when and how to use hashing correctly is just as important as knowing
the API. Here are common mistakes and their solutions.
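
A related text pitfall is Unicode normalization: two strings that render identically can consist of different code points and therefore hash differently even under the same encoding. A minimal sketch, assuming NFC as the agreed normalization form and a hypothetical helper `text_digest`:

```python
import hashlib
import unicodedata

composed = "caf\u00e9"     # accented e as a single code point
decomposed = "cafe\u0301"  # plain e followed by a combining acute accent


def text_digest(text: str) -> str:
    """Normalize to NFC, encode as UTF-8, then hash with SHA-256."""
    normalized = unicodedata.normalize("NFC", text)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


raw_equal = (
    hashlib.sha256(composed.encode("utf-8")).digest()
    == hashlib.sha256(decomposed.encode("utf-8")).digest()
)
normalized_equal = text_digest(composed) == text_digest(decomposed)

print(f"Raw UTF-8 hashes equal:  {raw_equal}")
print(f"Normalized hashes equal: {normalized_equal}")
```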

In [None]:
import hashlib
import hmac

# Pitfall 1: Hashing strings directly (must encode to bytes first)
text: str = "Hello, world!"
try:
    hashlib.sha256(text)  # type: ignore[arg-type]
except TypeError as e:
    print(f"Pitfall 1 - Cannot hash str directly: {e}")

# Correct: encode the string to bytes
correct_hash: str = hashlib.sha256(text.encode("utf-8")).hexdigest()
print(f"Correct:   {correct_hash[:32]}...")

# Pitfall 2: Encoding matters! Different encodings produce different hashes
text_special: str = "caf\u00e9"  # cafe with accent
hash_utf8: str = hashlib.sha256(text_special.encode("utf-8")).hexdigest()
hash_latin1: str = hashlib.sha256(text_special.encode("latin-1")).hexdigest()
print("\nPitfall 2 - Encoding matters:")
print(f"  UTF-8:   {hash_utf8[:32]}...")
print(f"  Latin-1: {hash_latin1[:32]}...")
print(f"  Same?    {hash_utf8 == hash_latin1}")

# Pitfall 3: Using == instead of compare_digest for security checks
sig_a: str = "abc123"
sig_b: str = "abc123"

# BAD: vulnerable to timing attacks
insecure: bool = sig_a == sig_b

# GOOD: constant-time comparison
secure: bool = hmac.compare_digest(sig_a, sig_b)

print("\nPitfall 3 - Use compare_digest for security:")
print(f"  == (insecure):           {insecure}")
print(f"  compare_digest (secure): {secure}")

## Summary

### Key Takeaways

| Concept | Tool | Purpose |
|---------|------|---------|
| **Hash algorithms** | `hashlib.sha256()`, `sha512()`, `blake2b()` | Produce fixed-size digests from arbitrary data |
| **Digest output** | `hexdigest()`, `digest()` | Get hash as hex string or raw bytes |
| **Incremental hashing** | `update()` | Hash large data in chunks without loading it all |
| **File integrity** | `compute_file_hash()` pattern | Verify files have not been corrupted |
| **HMAC** | `hmac.new(key, msg, digestmod)` | Authenticate messages with a shared secret |
| **Safe comparison** | `hmac.compare_digest()` | Prevent timing attacks when verifying signatures |

### Best Practices
- Use SHA-256 or BLAKE2b for new applications; avoid MD5 and SHA-1 for security
- Always encode strings to bytes before hashing, and be consistent about the encoding
- Use `update()` to hash large files in chunks rather than loading them entirely
- Use HMAC (not plain hashing) when you need to verify both integrity and authenticity
- Always use `hmac.compare_digest()` for security-sensitive comparisons
- Never use hashing alone for passwords -- use `hashlib.pbkdf2_hmac()` or similar (see next notebook)
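
As a brief preview of the last point, here is a minimal sketch of salted password hashing with `hashlib.pbkdf2_hmac()`. The helper names, the 16-byte salt, and the iteration count are assumptions for illustration, not recommendations; choose the work factor according to current guidance and your hardware:

```python
import hashlib
import hmac
import secrets
from typing import Optional

ITERATIONS = 200_000  # assumed work factor for this sketch; tune for your hardware


def hash_password(password: str, salt: Optional[bytes] = None) -> tuple[bytes, bytes]:
    """Derive a key from a password with PBKDF2-HMAC-SHA256.

    Returns (salt, derived_key). A fresh random salt is generated if none is given.
    """
    if salt is None:
        salt = secrets.token_bytes(16)
    derived = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, derived


def check_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-derive the key with the stored salt and compare in constant time."""
    _, derived = hash_password(password, salt)
    return hmac.compare_digest(derived, expected)


salt, stored = hash_password("correct horse battery staple")
print(f"Correct password: {check_password('correct horse battery staple', salt, stored)}")
print(f"Wrong password:   {check_password('guess', salt, stored)}")
```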