# Part 4, Lab 4: NVFP4 and Microscaling Formats

**Time:** ~30 minutes

NVIDIA's Blackwell architecture introduces NVFP4, a 4-bit floating-point format with per-block FP8 scaling. This lab explores microscaling formats and how to implement them.

## Learning Objectives

1. Understand NVFP4 (E2M1) format
2. Implement block scaling with FP8 scale factors
3. Compare NVFP4 vs INT4
4. Understand OCP Microscaling (MX) formats

In [None]:
import numpy as np
import torch

np.random.seed(42)

---
## 1. FP4 E2M1 Format

NVFP4 uses E2M1: 1 sign bit, 2 exponent bits, 1 mantissa bit.

This gives 16 bit patterns but only 15 distinct values, because +0 and -0 coincide. Combined with block scaling, this coarse grid still works well for neural network weights.
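
To make the bit layout concrete, here is a minimal sketch that decodes a 4-bit code into its E2M1 value. It assumes the sign bit is the most significant bit, followed by the two exponent bits and then the mantissa bit. The helper name `decode_e2m1` is illustrative and not part of any library.

```python
def decode_e2m1(code):
    """Decode a 4-bit E2M1 code (0..15) into a Python float.

    Assumed bit layout, most significant bit first: s e e m.
    """
    if not 0 <= code <= 0b1111:
        raise ValueError("E2M1 codes are 4 bits wide")
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    mantissa = code & 1
    if exp == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias = 0
        magnitude = mantissa / 2
    else:
        # Normalized: implicit leading 1, exponent bias of 1
        magnitude = (1 + mantissa / 2) * 2.0 ** (exp - 1)
    return sign * magnitude

# Positive half of the code space, in code order
positive_values = [decode_e2m1(c) for c in range(8)]
print(positive_values)
```

Codes 8 through 15 mirror these magnitudes with the sign bit set, which is where the duplicate zero comes from.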

In [None]:
# E2M1 representable values (without scaling)
# Exponent bias = 1, so stored exponents 1, 2, 3 map to 2^0, 2^1, 2^2.
# Stored exponent 0 is subnormal and reuses the smallest normal exponent, 2^0.
# Value = (1 + m/2) * 2^(e-1) for normalized, or (m/2) * 2^0 for subnormal

def get_fp4_e2m1_values():
    """Generate all representable FP4 E2M1 values."""
    values = []
    
    for sign in [0, 1]:
        for exp in range(4):  # 2 bits: 0-3
            for mantissa in range(2):  # 1 bit: 0-1
                if exp == 0:  # Subnormal: 0 or 0.5
                    value = (mantissa / 2) * (2 ** 0)
                else:  # Normalized
                    value = (1 + mantissa / 2) * (2 ** (exp - 1))
                
                if sign:
                    value = -value
                values.append(value)
    
    return sorted(set(values))

fp4_values = get_fp4_e2m1_values()
print("FP4 E2M1 representable values:")
print(fp4_values)
print(f"\nRange: [{min(fp4_values)}, {max(fp4_values)}]")
print(f"Unique values: {len(fp4_values)}")

---
## 2. Block Scaling for NVFP4

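NVFP4 groups values into 16-element blocks, each with its own FP8 (E4M3) scale factor. Elements are stored as FP4 codes and multiplied by the block's scale on dequantization, so the representable range tracks each block's magnitude instead of being fixed at ±6.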
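
Because the scale itself is stored in 8 bits, it also has to be rounded. The following is a minimal sketch of rounding a value onto an FP8 E4M3 grid, assuming the variant with exponent bias 7, three mantissa bits, subnormals, and a maximum finite value of 448. The helper name `round_to_e4m3` is hypothetical, and this is a simulation in NumPy, not a hardware-accurate conversion.

```python
import numpy as np

E4M3_MAX = 448.0      # assumed largest finite E4M3 magnitude
E4M3_MIN_EXP = -6     # assumed exponent of the smallest normal value
E4M3_MANT_BITS = 3

def round_to_e4m3(x):
    """Round to the nearest value on the assumed E4M3 grid, saturating at ±448."""
    x = np.asarray(x, dtype=np.float64)
    mag = np.minimum(np.abs(x), E4M3_MAX)
    # frexp returns mag = m * 2**e with m in [0.5, 1), so floor(log2(mag)) = e - 1
    _, e = np.frexp(mag)
    exp = np.maximum(e - 1, E4M3_MIN_EXP)  # subnormals share the minimum exponent
    step = np.ldexp(1.0, exp - E4M3_MANT_BITS)  # grid spacing within this binade
    rounded = np.minimum(np.round(mag / step) * step, E4M3_MAX)
    return np.sign(x) * rounded

# Example: the scale for a block whose largest magnitude is 1.8
ideal_scale = 1.8 / 6.0
stored_scale = float(round_to_e4m3(ideal_scale))
print(f"ideal scale {ideal_scale:.4f} -> stored scale {stored_scale:.4f}")
```

Under these assumptions the largest representable magnitude after scaling is 6 × 448 = 2688, compared with 6 for unscaled E2M1.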

In [None]:
def quantize_nvfp4(x, block_size=16):
    """
    Quantize to NVFP4 with block scaling.
    
    Each block of 16 values shares one FP8 scale factor.
    Values are quantized to the nearest FP4 E2M1 representable value.
    """
    original_shape = x.shape
    x_flat = x.flatten()
    
    # Pad to multiple of block_size
    pad_size = (block_size - len(x_flat) % block_size) % block_size
    if pad_size > 0:
        x_flat = np.pad(x_flat, (0, pad_size), mode='constant')
    
    # Reshape into blocks
    x_blocks = x_flat.reshape(-1, block_size)
    
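    # Compute scale per block (map to FP4 range of ±6)
    # Note: scales stay in floating point here; they are not rounded to FP8.
    abs_max = np.abs(x_blocks).max(axis=1, keepdims=True)
    scales = abs_max / 6.0  # FP4 E2M1 max positive value
    scales = np.maximum(scales, 1e-8)  # avoid division by zero for all-zero blocks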
    
    # Normalize values
    x_normalized = x_blocks / scales
    
    # Quantize to nearest FP4 value (vectorized nearest-neighbor lookup).
    # Each midpoint between adjacent grid values is a decision boundary;
    # exact ties go to the smaller grid value, matching argmin.
    fp4_values = np.array(get_fp4_e2m1_values())
    midpoints = (fp4_values[:-1] + fp4_values[1:]) / 2
    idx = np.searchsorted(midpoints, x_normalized)
    x_quantized = fp4_values[idx].astype(x_normalized.dtype)
    
    # squeeze only the block axis so a single-block input still yields a 1-D array
    return x_quantized, scales.squeeze(axis=1), original_shape, pad_size

def dequantize_nvfp4(x_quantized, scales, original_shape, pad_size, block_size=16):
    """Dequantize NVFP4."""
    x_dequant = x_quantized * scales[:, np.newaxis]
    x_flat = x_dequant.flatten()
    if pad_size > 0:
        x_flat = x_flat[:-pad_size]
    return x_flat.reshape(original_shape)

# Test NVFP4 quantization
weights = np.random.randn(1024, 1024).astype(np.float32)

x_quant, scales, shape, pad = quantize_nvfp4(weights)
x_dequant = dequantize_nvfp4(x_quant, scales, shape, pad)

mse = np.mean((weights - x_dequant) ** 2)
rel_error = np.abs(weights - x_dequant) / (np.abs(weights) + 1e-8)

print("NVFP4 Quantization Results:")
print(f"  MSE: {mse:.6f}")
print(f"  Mean relative error: {rel_error.mean() * 100:.2f}%")
print(f"  Bits per value: 4 + {len(scales) * 8 / weights.size:.2f} (scale overhead)")

---
## 3. NVFP4 vs INT4 Comparison

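FP4 values are non-uniformly spaced: the grid is denser near zero and coarser near the block maximum. This tends to suit bell-shaped weight distributions, where most values are small, while INT4's uniform grid tends to suit flatter distributions.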
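
The difference in spacing is easy to see by normalizing both grids to [0, 1] and printing the gaps between adjacent levels. This is a small illustrative sketch; the INT4 grid assumes symmetric scaling in which the block maximum maps to 7.

```python
import numpy as np

# Non-negative E2M1 magnitudes, normalized so the largest maps to 1.0
fp4_grid = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]) / 6.0
# Non-negative INT4 levels under symmetric scaling (abs_max maps to 7)
int4_grid = np.arange(8) / 7.0

fp4_gaps = np.diff(fp4_grid)
int4_gaps = np.diff(int4_grid)

print("FP4 gaps: ", np.round(fp4_gaps, 3))
print("INT4 gaps:", np.round(int4_gaps, 3))
```

The FP4 gaps start smaller than the INT4 gaps near zero and end larger near the maximum, so which format has lower error depends on where the data mass sits.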

In [None]:
def quantize_int4_grouped(x, block_size=16):
    """INT4 with same block size for fair comparison."""
    original_shape = x.shape
    x_flat = x.flatten()
    
    pad_size = (block_size - len(x_flat) % block_size) % block_size
    if pad_size > 0:
        x_flat = np.pad(x_flat, (0, pad_size), mode='constant')
    
    x_blocks = x_flat.reshape(-1, block_size)
    
    # Symmetric scale: abs_max maps to ±7 (the -8 code of the INT4 range [-8, 7] goes unused)
    abs_max = np.abs(x_blocks).max(axis=1, keepdims=True)
    scales = abs_max / 7.0
    scales = np.maximum(scales, 1e-8)
    
    x_quantized = np.round(x_blocks / scales).clip(-8, 7)
    
    # squeeze only the block axis so a single-block input still yields a 1-D array
    return x_quantized, scales.squeeze(axis=1), original_shape, pad_size

def dequantize_int4_grouped(x_quantized, scales, original_shape, pad_size):
    x_dequant = x_quantized * scales[:, np.newaxis]
    x_flat = x_dequant.flatten()
    if pad_size > 0:
        x_flat = x_flat[:-pad_size]
    return x_flat.reshape(original_shape)

# Compare on different distributions
print("Comparison: NVFP4 vs INT4 (block_size=16)")
print("=" * 50)

distributions = [
    ("Normal(0, 1)", np.random.randn(10000)),
    ("Normal(0, 0.1)", np.random.randn(10000) * 0.1),
    ("Uniform[-1, 1]", np.random.uniform(-1, 1, 10000)),
    ("Laplace(0, 0.5)", np.random.laplace(0, 0.5, 10000)),
]

for name, data in distributions:
    data = data.astype(np.float32)
    
    # NVFP4
    q_fp4, s_fp4, sh, p = quantize_nvfp4(data)
    d_fp4 = dequantize_nvfp4(q_fp4, s_fp4, sh, p)
    mse_fp4 = np.mean((data - d_fp4) ** 2)
    
    # INT4
    q_int4, s_int4, sh, p = quantize_int4_grouped(data)
    d_int4 = dequantize_int4_grouped(q_int4, s_int4, sh, p)
    mse_int4 = np.mean((data - d_int4) ** 2)
    
    winner = "NVFP4" if mse_fp4 < mse_int4 else "INT4"
    print(f"{name:20s}: NVFP4={mse_fp4:.6f}, INT4={mse_int4:.6f} → {winner}")

---
## 4. OCP Microscaling (MX) Formats

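The Open Compute Project (OCP) defines standardized microscaling formats. MXFP4 uses the same E2M1 elements as NVFP4, but its blocks hold 32 elements and share a power-of-two E8M0 scale (an exponent with no mantissa) rather than an FP8 scale per 16 elements.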
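
The practical consequence of a power-of-two scale can be sketched as follows. The floor-based rule below is one common choice for picking the shared exponent and is an assumption here, as are the helper names; it does not encode the scale as an E8M0 byte.

```python
import numpy as np

E2M1_MAX = 6.0
E2M1_EMAX = 2  # exponent of the largest E2M1 binade (4 = 2**2)

def pow2_block_scale(block):
    """Power-of-two scale (MX-style), using an assumed floor-based rule."""
    amax = np.max(np.abs(block))
    if amax == 0:
        return 1.0
    return float(2.0 ** (np.floor(np.log2(amax)) - E2M1_EMAX))

def fractional_block_scale(block):
    """Arbitrary real-valued scale that maps the block maximum exactly to 6."""
    amax = np.max(np.abs(block))
    return float(amax / E2M1_MAX) if amax > 0 else 1.0

block = np.array([0.1, -0.7, 0.35, 0.9])
s_pow2 = pow2_block_scale(block)
s_frac = fractional_block_scale(block)

print(f"power-of-two scale: {s_pow2}, max normalized value: {0.9 / s_pow2:.2f}")
print(f"fractional scale:   {s_frac:.3f}, max normalized value: {0.9 / s_frac:.2f}")
```

With this rule the normalized block maximum lands somewhere in [4, 8), so values above 6 must saturate, whereas a fractional scale places the maximum exactly on the largest FP4 value.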

In [None]:
# MX format overview
mx_formats = {
    "MXFP8 E5M2": {"elem_bits": 8, "block_size": 32, "scale_bits": 8, "desc": "FP8 elements, shared E8M0 scale"},
    "MXFP8 E4M3": {"elem_bits": 8, "block_size": 32, "scale_bits": 8, "desc": "FP8 elements, shared E8M0 scale"},
    "MXFP6 E3M2": {"elem_bits": 6, "block_size": 32, "scale_bits": 8, "desc": "FP6 elements, shared E8M0 scale"},
    "MXFP6 E2M3": {"elem_bits": 6, "block_size": 32, "scale_bits": 8, "desc": "FP6 elements, shared E8M0 scale"},
    "MXFP4 E2M1": {"elem_bits": 4, "block_size": 32, "scale_bits": 8, "desc": "FP4 elements, shared E8M0 scale"},
    "MXINT8": {"elem_bits": 8, "block_size": 32, "scale_bits": 8, "desc": "INT8 elements, shared E8M0 scale"},
}

print("OCP Microscaling Formats:")
print("=" * 70)
for name, info in mx_formats.items():
    effective_bits = info["elem_bits"] + info["scale_bits"] / info["block_size"]
    print(f"{name:15s}: {info['elem_bits']}b elements, {info['block_size']}-elem blocks, "
          f"~{effective_bits:.2f} effective bits/elem")
    print(f"                {info['desc']}")

---
## Exercises

1. **E8M0 Scale**: Implement the 8-bit E8M0 (exponent-only) format for scale factors
2. **Hardware Sim**: Simulate NVFP4 Tensor Core MMA operation with proper accumulation
3. **Quality Eval**: Compare NVFP4 vs AWQ INT4 on actual LLM weights

## Key Takeaways

- NVFP4 E2M1 has only 16 encodings (15 distinct values) but works well with block scaling
- Non-uniform FP4 spacing can match or beat INT4 for many distributions
- Block size of 16-32 balances accuracy vs scale factor overhead
- MX formats are standardized for interoperability across hardware vendors
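
The block-size trade-off in the third takeaway comes down to simple arithmetic: one 8-bit scale is amortized over the block. A minimal sketch, ignoring any additional per-tensor scale (the function name is illustrative):

```python
def effective_bits(elem_bits, scale_bits, block_size):
    """Average storage per element when one scale is shared by a block."""
    return elem_bits + scale_bits / block_size

for block_size in (8, 16, 32, 64, 128):
    bits = effective_bits(4, 8, block_size)
    print(f"block_size={block_size:3d}: {bits:.3f} bits/value")
```

Smaller blocks track local magnitudes more closely but pay more overhead per value; at 16 and 32 elements the overhead is 0.5 and 0.25 bits respectively.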