# Model Quantization Principles

This notebook covers the fundamentals of model quantization, including number formats, quantization strategies, and hands‑on examples for quantization and dequantization of float matrices.


## 1. Floating Point Numbers vs Fixed Point Numbers

Floating point numbers represent real values with a sign bit, exponent field, and mantissa (fraction) field. Fixed point numbers use a fixed number of integer and fractional bits without an explicit exponent.

**Float32 representation (32 bits total):**

```
| S (1 bit) | E (8 bits)        | M (23 bits)                         |
|-----------|-------------------|-------------------------------------|
```
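
As an illustration of the float32 layout above, the following sketch splits a value into its three bit fields using only the standard library. The helper name `float32_fields` is an assumption made for this example, not part of any library.

```python
import struct

def float32_fields(value: float) -> tuple[int, int, int]:
    """Return (sign, biased_exponent, mantissa) of value encoded as IEEE 754 float32."""
    # Pack as a big-endian float32, then reinterpret the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF        # 23 bits; normal numbers have an implicit leading 1
    return sign, exponent, mantissa

# -6.5 = -1.625 * 2**2, so the biased exponent is 2 + 127 = 129
# and the fraction 0.625 is stored as 0.625 * 2**23 = 5242880.
print(float32_fields(-6.5))
```

The same shifts and masks apply to Float16 and BFloat16 once the field widths are changed to match their layouts.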

**Float16 representation (16 bits total):**

```
| S (1) | E (5)        | M (10)         |
```

**BFloat16 representation (16 bits total):**

```
| S (1) | E (8)        | M (7)          |
```

**int8 (8 bits signed, two's complement):**

```
| S (1) | V (7 bits)      |
```

**uint8 (8 bits unsigned):**

```
| V (8 bits)              |
```

| Format   | Sign bits | Exponent bits | Mantissa bits | Approx. range (floats: smallest positive normal to largest finite) |
| -------- | --------- | ------------- | ------------- | ------------------------------------------------------------------ |
| Float32  | 1         | 8             | 23            | \~1.2×10^−38 to \~3.4×10^38 |
| Float16  | 1         | 5             | 10            | \~6.1×10^−5 to \~6.5×10^4   |
| BFloat16 | 1         | 8             | 7             | \~1.2×10^−38 to \~3.4×10^38 |
| int8     | 1         | –             | –             | −128 to 127                 |
| uint8    | 0         | –             | –             | 0 to 255                    |
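
The floating-point ranges in the table follow directly from the exponent and mantissa widths. The sketch below derives them, assuming IEEE-style encoding (bias of `2**(exp_bits - 1) - 1`, largest exponent code reserved for inf/NaN); `float_range` is a hypothetical helper written for this example.

```python
def float_range(exp_bits: int, man_bits: int) -> tuple[float, float]:
    """Smallest positive normal value and largest finite value of an IEEE-style format."""
    bias = 2 ** (exp_bits - 1) - 1
    # Smallest normal number: exponent field 1, mantissa field 0.
    min_normal = 2.0 ** (1 - bias)
    # Largest finite number: exponent field just below all-ones, mantissa all ones.
    max_finite = (2.0 - 2.0 ** (-man_bits)) * 2.0 ** bias
    return min_normal, max_finite

for name, e, m in [("Float32", 8, 23), ("Float16", 5, 10), ("BFloat16", 8, 7)]:
    lo, hi = float_range(e, m)
    print(f"{name:9s} min normal = {lo:.3e}, max finite = {hi:.3e}")
```

BFloat16 keeps the Float32 exponent width, which is why its range matches Float32 while its precision (7 mantissa bits) is much lower.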


## 2. Quantization Approaches

* **Uniform vs Non‑Uniform Quantization**: Uniform quantization divides the range into equal steps; non‑uniform uses variable spacing (e.g., logarithmic) to allocate more precision where data is dense.
* **Symmetric vs Asymmetric Quantization**: Symmetric quantization maps real zero to integer zero and uses an integer range centered on it; asymmetric quantization adds an offset (zero point) to handle non‑zero‑centered distributions.
* **Layer‑wise vs Channel‑wise Quantization**: Layer‑wise uses a single scale/zero point per tensor; channel‑wise assigns separate parameters per output channel for finer granularity.
* **Post‑Training Quantization vs Quantization‑Aware Training**: Post‑training quantization is applied after model training (fast but may lose accuracy); quantization‑aware training simulates quantization during training (more accurate, more compute).
* **Clustering‑based Quantization**: Uses vector quantization (VQ) or product quantization (PQ) to cluster values or sub‑vectors into a small set of centroids (a codebook), then encodes the data as centroid indices.
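
To make the layer‑wise versus channel‑wise distinction concrete, here is a minimal sketch that quantizes and immediately dequantizes a weight matrix with either one scale for the whole tensor or one scale per row. The weights are hypothetical, rows are assumed to be the output channels, and every row is assumed to contain at least one non-zero value.

```python
import numpy as np

def fake_quantize(w: np.ndarray, per_channel: bool, num_bits: int = 8) -> np.ndarray:
    """Symmetric quantize-then-dequantize with one scale per tensor or per row."""
    qmax = 2 ** (num_bits - 1) - 1
    if per_channel:
        max_abs = np.max(np.abs(w), axis=1, keepdims=True)  # shape (rows, 1)
    else:
        max_abs = np.max(np.abs(w))                          # single scalar
    scale = max_abs / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return (q * scale).astype(np.float32)

# Hypothetical weights: two output channels whose magnitudes differ by 100x.
w = np.array([[0.01, -0.02, 0.015, 0.005],
              [1.0, -2.0, 1.5, 0.5]], dtype=np.float32)

err_tensor = np.abs(w - fake_quantize(w, per_channel=False)).mean(axis=1)
err_channel = np.abs(w - fake_quantize(w, per_channel=True)).mean(axis=1)
print('Mean abs error per row, per-tensor scale: ', err_tensor)
print('Mean abs error per row, per-channel scale:', err_channel)
```

With a single scale, the small-magnitude row is squeezed into a handful of integer levels; a per-row scale restores its resolution at the cost of storing one extra float per channel.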


## 3. Uniform, Symmetric Quantization

This section demonstrates symmetric uniform quantization and its corresponding de‑quantization.

### 3.1 Quantization Example (Uniform, Symmetric)

We will quantize a small Float32 matrix to int8 using symmetric uniform quantization.

In [1]:
import numpy as np
from typing import Tuple

# Define a 3×4 float32 matrix
def create_sample_matrix() -> np.ndarray:
    return np.array([
        [ 0.1, -0.5,  0.0,  1.0],
        [ 2.5, -1.2,  0.7, -0.3],
        [ 0.9,  1.5, -2.0,  0.2]
    ], dtype=np.float32)

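# Quantization function for symmetric uniform quantization
# Note: the int8 cast below assumes num_bits <= 8
def quantize_symmetric(
    x: np.ndarray, num_bits: int
) -> Tuple[np.ndarray, float]:
    # Maximum absolute value
    max_val = np.max(np.abs(x))
    # Scale: map max_val -> max representable int
    qmax = 2**(num_bits - 1) - 1
    # Guard against an all-zero input, which would give scale == 0
    scale: float = max_val / qmax if max_val > 0 else 1.0
    # Quantize: round, then clip to [-qmax, qmax] before casting,
    # so out-of-range values cannot wrap around in int8
    q: np.ndarray = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale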

# Apply quantization
matrix = create_sample_matrix()
quantized_matrix, scale = quantize_symmetric(matrix, num_bits=8)
print('Original matrix:\n', matrix)
print('Quantized matrix (int8):\n', quantized_matrix)
print('Scale used:', scale)

Original matrix:
 [[ 0.1 -0.5  0.   1. ]
 [ 2.5 -1.2  0.7 -0.3]
 [ 0.9  1.5 -2.   0.2]]
Quantized matrix (int8):
 [[   5  -25    0   51]
 [ 127  -61   36  -15]
 [  46   76 -102   10]]
Scale used: 0.01968504



### 3.2 De‑quantization Example (Uniform, Symmetric)

To map the quantized integers back to approximate float32 values, multiply them by the scale:

In [2]:
def dequantize_symmetric(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

reconstructed_matrix = dequantize_symmetric(quantized_matrix, scale)
print('Reconstructed matrix:\n', reconstructed_matrix)

Reconstructed matrix:
 [[ 0.09842519 -0.492126    0.          1.003937  ]
 [ 2.5        -1.2007874   0.70866144 -0.2952756 ]
 [ 0.9055118   1.496063   -2.007874    0.19685039]]
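
The reconstruction is close to the original but not identical. Because each value is rounded to the nearest integer level, the per-element error should stay within half a quantization step. The sketch below checks this on the same matrix; `roundtrip_error` is a helper introduced only for this illustration and assumes a non-zero input.

```python
import numpy as np

def roundtrip_error(x: np.ndarray, num_bits: int = 8) -> tuple[float, float]:
    """Quantize symmetrically, dequantize, and return (max_abs_error, scale)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = float(np.max(np.abs(x))) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    x_hat = q * scale
    return float(np.max(np.abs(x - x_hat))), scale

x = np.array([
    [0.1, -0.5, 0.0, 1.0],
    [2.5, -1.2, 0.7, -0.3],
    [0.9, 1.5, -2.0, 0.2],
], dtype=np.float32)

for bits in (8, 4):
    max_err, step = roundtrip_error(x, num_bits=bits)
    print(f'{bits}-bit: max abs error = {max_err:.5f}, half step = {step / 2:.5f}')
```

Dropping from 8 to 4 bits makes the step, and therefore the error bound, roughly 18 times larger (127 levels per side versus 7).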


This completes a basic demonstration of symmetric uniform quantization and de‑quantization on a small float32 matrix. Each reconstructed element differs from the original by at most half a quantization step (`scale / 2`). Asymmetric quantization extends this scheme with a zero‑point offset.
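
For comparison, here is a minimal sketch of the asymmetric variant with a zero point, targeting uint8. It is one common formulation rather than a definitive implementation; the function names and the sample data are assumptions made for this example.

```python
import numpy as np

def quantize_asymmetric(x: np.ndarray, num_bits: int = 8):
    """Uniform asymmetric quantization to unsigned integers in [0, 2**num_bits - 1]."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that real zero is exactly representable.
    x_min = min(float(x.min()), 0.0)
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:           # all-zero input
        scale = 1.0
    # Integer that real 0.0 maps to.
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_asymmetric(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale

# Hypothetical skewed data: the range [-1, 4] is not centered on zero.
x = np.array([[-1.0, 0.0, 0.7], [1.2, 3.0, 4.0]], dtype=np.float32)
q, scale, zero_point = quantize_asymmetric(x)
x_hat = dequantize_asymmetric(q, scale, zero_point)
print('q:\n', q)
print('scale:', scale, 'zero_point:', zero_point)
print('reconstructed:\n', x_hat)
```

A symmetric scheme would have to cover [-4, 4] for this data and leave most of the negative integer codes unused; the zero point lets all 256 codes span the actual range [-1, 4].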

## 4. Clustering-based Quantization

Clustering-based quantization uses vector quantization (VQ) or product quantization (PQ) to map high-dimensional float values to discrete indices referencing a small set of centroids (the codebook).

### 4.1 Vector Quantization (VQ)

In VQ, each scalar (or vector) is treated as a sample, and the samples are grouped into $K$ clusters, each represented by its centroid:

1. **Form samples**: Flatten the matrix into a 1D list of values (or treat each row as a vector).
2. **K-Means clustering**: Compute $K$ centroids.
3. **Quantize**: Replace each sample by the index of its nearest centroid, producing an index matrix of the same shape.
4. **Store**: Save the centroid array (codebook) and the index matrix.
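
To show what the clustering and quantize steps above do internally, here is a minimal sketch of Lloyd's algorithm on scalars using only NumPy. The evenly spaced initialization and the fixed iteration count are simplifying assumptions chosen for this example; library implementations typically use smarter initialization and convergence checks.

```python
import numpy as np

def kmeans_1d(values: np.ndarray, init_centroids: np.ndarray, num_iters: int = 20):
    """Minimal Lloyd's algorithm on scalars; returns (centroids, assignments)."""
    centroids = init_centroids.astype(np.float64).copy()
    for _ in range(num_iters):
        # Assignment step: index of the nearest centroid for every value.
        assign = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        # Update step: move each centroid to the mean of its assigned values.
        for k in range(len(centroids)):
            members = values[assign == k]
            if members.size > 0:      # leave empty clusters where they are
                centroids[k] = members.mean()
    # Final assignment against the converged centroids.
    assign = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return centroids, assign

values = np.array([0.1, -0.5, 0.0, 1.0, 2.5, -1.2, 0.7, -0.3, 0.9, 1.5, -2.0, 0.2])
# Assumed initialization: K = 4 centroids spread evenly over the value range.
init = np.linspace(values.min(), values.max(), 4)
centroids, assign = kmeans_1d(values, init)
print('Centroids:', centroids)
print('Indices:  ', assign)
```

On this data the sketch settles on the centroids -1.6, -0.1, 1.025, and 2.5. Cluster numbering is arbitrary, so a different initialization can return the same centroids in a different order.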


In [3]:
from sklearn.cluster import KMeans
import numpy as np

# Using the same sample matrix
data = create_sample_matrix()                # shape (3, 4)
flat_data = data.reshape(-1, 1)              # treat each value as a 1D sample

# Train k-means
K = 4                                       # number of centroids
kmeans = KMeans(n_clusters=K, random_state=0).fit(flat_data)
codebook = kmeans.cluster_centers_.flatten()  # shape (K,)

# Quantize: assign each value to nearest centroid
indices = kmeans.predict(flat_data)         # shape (12,)
index_matrix = indices.reshape(data.shape)  # shape (3, 4)

print('Codebook (centroids):', codebook)
print('Index matrix:\n', index_matrix)

Codebook (centroids): [ 1.025      -1.6         2.5        -0.10000001]
Index matrix:
 [[3 3 3 0]
 [2 1 0 3]
 [0 0 1 3]]


In [4]:
# Dequantize: look up each index in the codebook to reconstruct approximate floats
reconstructed_flat = codebook[indices]
reconstructed = reconstructed_flat.reshape(data.shape)
print('Reconstructed matrix:\n', reconstructed)

Reconstructed matrix:
 [[-0.10000001 -0.10000001 -0.10000001  1.025     ]
 [ 2.5        -1.6         1.025      -0.10000001]
 [ 1.025       1.025      -1.6        -0.10000001]]



### 4.2 Product Quantization (PQ)

PQ splits each row vector into $M$ sub-vectors and applies VQ within each subspace:

1. **Split** each row into $M$ equal-length sub-vectors.
2. **Train** K-Means separately on each subspace to obtain $M$ codebooks.
3. **Encode**: For each sub-vector, store the centroid index, resulting in an index matrix of shape $(\text{num\_rows}, M)$.
4. **Store**: Save all $M$ codebooks and the index matrix.

**Benefits of PQ in Computation**

* **Memory-bandwidth and cache efficiency:** By compressing weights to small integer indices (e.g., a few bits per sub-vector), the number of bytes fetched from memory during matrix operations is reduced, alleviating memory-bound bottlenecks on modern hardware.
* **Lookup-based dot-products:** Instead of reconstructing every float weight, small lookup tables of dot-products between input sub-vectors and each centroid are precomputed. At inference, each sub-vector multiply becomes a single table lookup plus an addition, reducing the number of floating-point multiplies.
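
The memory argument can be made concrete with a back-of-the-envelope calculation. The sketch below compares the bytes needed for a dense float32 matrix with the bytes needed for PQ codes plus codebooks. It assumes indices are bit-packed at `ceil(log2(K))` bits each; the function name and the layer sizes are hypothetical.

```python
import math

def pq_storage_bytes(n: int, d: int, M: int, K: int, float_bytes: int = 4) -> tuple[int, int]:
    """Return (dense_bytes, pq_bytes) for an n x d float matrix stored with PQ."""
    assert d % M == 0, "d must be divisible by the number of subspaces"
    dense = n * d * float_bytes
    bits_per_code = math.ceil(math.log2(K))
    codes = math.ceil(n * M * bits_per_code / 8)      # bit-packed index matrix
    codebooks = M * K * (d // M) * float_bytes        # M codebooks of K centroids each
    return dense, codes + codebooks

# Toy matrix from this notebook: 3 x 4, M = 2, K = 3.
print(pq_storage_bytes(3, 4, 2, 3))

# Hypothetical large layer: 4096 x 4096, sub-vectors of length 8, 256 centroids.
dense, pq = pq_storage_bytes(4096, 4096, 512, 256)
print(dense, pq, f'compression ratio = {dense / pq:.1f}x')
```

For the toy matrix the codebooks alone are as large as the original data, so PQ saves nothing; the benefit appears only when the number of rows is much larger than $K$.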


#### 4.2.1 PQ Quantization & Dequantization


In [5]:
from sklearn.cluster import KMeans
import numpy as np
from typing import List

# Prepare data
matrix = create_sample_matrix()             # shape (3, 4)
n, d = matrix.shape
M = 2                                      # number of subspaces
pc = 3                                     # clusters per subspace
sub_dim = d // M                           # sub-vector dimension

# Split into subspaces
subspaces = [matrix[:, i*sub_dim:(i+1)*sub_dim] for i in range(M)]
codebooks: List[np.ndarray] = []
codes: List[np.ndarray] = []
for sub in subspaces:
    kmeans = KMeans(n_clusters=pc, random_state=0).fit(sub)
    codebooks.append(kmeans.cluster_centers_)  # shape (pc, sub_dim)
    codes.append(kmeans.predict(sub))          # shape (n,)

print('Codebooks:\n', codebooks)

# Combined codes (n x M)
pq_codes = np.stack(codes, axis=1)
print('PQ Codes:\n', pq_codes)

# Dequantize: reconstruct and concatenate
reconstructed_pq = np.hstack([codebooks[i][pq_codes[:, i]] for i in range(M)])
print('PQ reconstructed matrix:\n', reconstructed_pq)

Codebooks:
 [array([[ 2.5       , -1.2       ],
       [ 0.9       ,  1.5       ],
       [ 0.10000002, -0.5       ]], dtype=float32), array([[ 0.70000005, -0.30000004],
       [-2.        ,  0.2       ],
       [ 0.        ,  1.        ]], dtype=float32)]
PQ Codes:
 [[2 2]
 [0 0]
 [1 1]]
PQ reconstructed matrix:
 [[ 0.10000002 -0.5         0.          1.        ]
 [ 2.5        -1.2         0.70000005 -0.30000004]
 [ 0.9         1.5        -2.          0.2       ]]



#### 4.2.2 Lookup-based Dot-Product Inference

At inference time, given an input activation (e.g., a query or key) vector in float32:

1. Split it into the same $M$ sub-vectors as used for weight PQ.

2. Compute lookup tables: for each subspace $i$, build a table with one entry per centroid ($K$ entries; `pc` in the code above), where

    ```python
    lookup_tables[i][j] = np.dot(query_subs[i], codebooks[i][j])  # j = 0..K-1
    ```
    This pre‑computes the dot-product (or distance) between the query sub-vector and each centroid in subspace $i$.

3. Dot-product via lookup: to compute the dot-product between this activation and any weight row $r$ (represented by the indices in `pq_codes`), sum the corresponding table entries:

```python
# Approximates dot(query, weight_row_r)
dot_r = sum(
    lookup_tables[i][pq_codes[r, i]]
    for i in range(M)
)
```

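Once the tables exist, each dot-product needs only $M$ lookups and $M-1$ additions instead of $d$ multiplies, and the weights never have to be reconstructed. Building the tables costs about $K \cdot d$ multiplies per query, which is amortized over all weight rows, so the computational and memory-bandwidth savings become significant when the number of rows is much larger than $K$.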
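
A rough operation count makes the trade-off visible. The sketch below is a simplified cost model, not a benchmark: it counts only multiplies, table reads, and the additions that combine looked-up values, and the function name and layer sizes are assumptions.

```python
def pq_dot_costs(n: int, d: int, M: int, K: int) -> dict[str, int]:
    """Rough operation counts for scoring one query against n weight rows."""
    return {
        "dense_multiplies": n * d,    # plain dot-products on float weights
        "table_multiplies": K * d,    # M tables x K entries x (d / M) multiplies
        "lookups": n * M,             # one table read per row and subspace
        "additions": n * (M - 1),     # summing the M looked-up values per row
    }

# Toy setting from this notebook: building the tables costs as much as the dense path.
print(pq_dot_costs(n=3, d=4, M=2, K=3))

# Hypothetical large layer: table construction is a small fraction of the dense cost.
print(pq_dot_costs(n=4096, d=4096, M=512, K=256))
```

In the toy setting the table construction already needs as many multiplies as the dense computation, which is why the lookup strategy only pays off for matrices with many rows.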

In [6]:
# Given a query activation vector
query = np.array([0.2, -0.1, 0.5, 1.0], dtype=np.float32)  # shape (4,)
print('Query:\n', query)

# Split query into same subspaces
query_subs = [query[i*sub_dim : (i+1)*sub_dim] for i in range(M)]
print('Query subspaced:\n', query_subs)

# Build lookup tables: dot(query_sub, centroid) for each centroid
lookup_tables = [
    np.dot(query_subs[i], codebooks[i].T)  # shape (pc,)
    for i in range(M)
]
print('Lookup tables:\n', lookup_tables)

# Compute dot-product for each row of reconstructed weights via lookup
#    dot = sum over subspaces of lookup_tables[i][pq_codes[row, i]]

dot_products: list[np.float32] = []
for row_idx in range(n):
    dp = sum(
        lookup_tables[i][pq_codes[row_idx, i]]
        for i in range(M)
    )
    dot_products.append(dp)

print('Dot-products via PQ lookup:\n', dot_products)

Query:
 [ 0.2 -0.1  0.5  1. ]
Query subspaced:
 [array([ 0.2, -0.1], dtype=float32), array([0.5, 1. ], dtype=float32)]
Lookup tables:
 [array([0.62      , 0.02999999, 0.07000001], dtype=float32), array([ 0.04999998, -0.8       ,  1.        ], dtype=float32)]
Dot-products via PQ lookup:
 [np.float32(1.07), np.float32(0.66999996), np.float32(-0.77000004)]



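Because this toy example uses as many centroids per subspace as there are rows (`pc = n = 3`), every row gets its own centroid, so the PQ reconstruction and the dot-products above are exact up to float32 rounding. In practice $K$ is much smaller than the number of rows, and the lookup result then equals the dot-product with the reconstructed weights, which only approximates the dot-product with the original weights. Replacing $d$ multiplies by $M$ lookups and $M-1$ additions yields faster inference when $d \gg M$.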
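
As a sanity check, the lookup path can be compared against an explicit reconstruction followed by a matrix-vector product. The sketch below is self-contained: the codebooks and codes are copied (rounded) from the outputs printed above, and the vectorized gather is one possible formulation, not the only one.

```python
import numpy as np

# Values copied from the printed codebooks and PQ codes (rounded).
codebooks = [
    np.array([[2.5, -1.2], [0.9, 1.5], [0.1, -0.5]], dtype=np.float32),
    np.array([[0.7, -0.3], [-2.0, 0.2], [0.0, 1.0]], dtype=np.float32),
]
pq_codes = np.array([[2, 2], [0, 0], [1, 1]])
query = np.array([0.2, -0.1, 0.5, 1.0], dtype=np.float32)

M = len(codebooks)
sub_dim = codebooks[0].shape[1]

# Path 1: build lookup tables of shape (M, K), then gather one entry per subspace and sum.
tables = np.stack([
    codebooks[i] @ query[i * sub_dim:(i + 1) * sub_dim]
    for i in range(M)
])
via_lookup = tables[np.arange(M), pq_codes].sum(axis=1)     # shape (n,)

# Path 2: reconstruct the weights from the codes, then do a dense product.
reconstructed = np.hstack([codebooks[i][pq_codes[:, i]] for i in range(M)])
via_dense = reconstructed @ query

print('Via lookup:', via_lookup)
print('Via dense: ', via_dense)
```

Both paths agree up to floating-point rounding, which confirms that the lookup tables compute the dot-product with the quantized weights without ever materializing them.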

This section shows both simple VQ and PQ quantization and dequantization, as well as an optimized inference strategy using lookup-based dot-products. You can adjust **K** (number of centroids) or **M** (number of subspaces) to trade off compression, accuracy, and computational cost.
