# Quantization in Depth - Part 4

**Weight Packing and LLM Quantization Challenges**

This final lesson in the Hugging Face quantization series covers the practical and theoretical limitations of ultra-low-bit quantization (e.g., 2-bit and 4-bit) and challenges in applying quantization to large language models (LLMs).

## 1. Why Weight Packing is Necessary

PyTorch does not currently support native 2-bit or 4-bit tensors:

```python
torch.tensor([0, 1], dtype=torch.int4)  # ❌ Not supported
```

In practice, we must store these quantized values in `uint8` tensors (8 bits each). If every low-bit value occupies its own byte, a 2-bit weight wastes 6 of its 8 bits and a 4-bit weight wastes 4, so nothing is saved relative to int8. To truly benefit from quantization, we need to **pack multiple low-bit weights into a single byte**.

```python
import torch

torch.tensor([0, 1], dtype=torch.int4)
```

```
AttributeError: module 'torch' has no attribute 'int4'
```
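As a rough illustration of what packing buys, the arithmetic below compares one-value-per-byte storage with packed storage. The helper name `storage_bytes` and the 7B weight count are assumptions chosen for illustration, not part of the lesson.

```python
def storage_bytes(num_weights: int, bits: int, packed: bool) -> int:
    """Bytes needed to store `num_weights` values of `bits` bits each."""
    if packed:
        # Packed: 8 // bits values share one byte (round up for a partial byte).
        per_byte = 8 // bits
        return -(-num_weights // per_byte)
    # Unpacked: every value occupies a full uint8, whatever its bit width.
    return num_weights

n = 7_000_000_000  # hypothetical 7B-parameter model
for bits in (4, 2):
    unpacked_b = storage_bytes(n, bits, packed=False)
    packed_b = storage_bytes(n, bits, packed=True)
    print(f"{bits}-bit: {unpacked_b / 1e9:.2f} GB unpacked vs {packed_b / 1e9:.2f} GB packed")
```

Without packing, both bit widths cost the same as int8; with packing, the footprint shrinks in proportion to the bit width.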

## 2. Manual Weight Packing (2-bit)

To store four 2-bit weights in a single `uint8`, we manually shift and OR their bits together.

### Example:
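Given `[1, 0, 3, 2]` with 2-bit quantization, we want to pack all four values into one byte. The first value goes into the two least-significant bits, so the values appear in reverse order when the byte is written out:
```
Unpacked (one byte each):  0000 0001 - 0000 0000 - 0000 0011 - 0000 0010
2-bit values in order:     01 00 11 10
Packed (last value first): 10 11 00 01  =>  10110001  =>  177
```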
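The example above can be traced step by step with plain Python integers. This is only a sketch of the shift-and-OR arithmetic, with no tensors involved.

```python
values = [1, 0, 3, 2]  # four 2-bit values
bits = 2

packed = 0
for j, v in enumerate(values):
    shifted = v << (bits * j)  # move value j into bit positions 2j and 2j + 1
    packed |= shifted          # merge it into the byte built so far
    print(f"j={j}: {v:02b} << {bits * j} = {shifted:08b} -> packed = {packed:08b}")

print(packed)  # 177
```

Each value lands in its own pair of bits, so the OR never overwrites a previously packed value.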

```python
import torch

def pack_weights(uint8tensor, bits):
    """
    Packs a tensor of low-bit values into a uint8 tensor.

    Args:
        uint8tensor (torch.Tensor): 1D tensor of integers (e.g., values between 0-3 for 2-bit).
        bits (int): Number of bits per weight (e.g., 2 or 4).

    Returns:
        torch.Tensor: Packed tensor of type uint8.
    """
    # The input must fill a whole number of bytes, i.e. its length is a multiple of 8 // bits
    if uint8tensor.shape[0] * bits % 8 != 0:
        raise ValueError(f"The input length must be a multiple of {8 // bits}")

    # Number of packed bytes in the output
    num_values = uint8tensor.shape[0] * bits // 8
    # Number of low-bit values stored in each byte
    num_steps = 8 // bits

    # Initialize the packed tensor
    packed_tensor = torch.zeros((num_values,), dtype=torch.uint8)
    unpacked_idx = 0

    # Pack the weights
    for i in range(num_values):
        for j in range(num_steps):
            # Shift value j to bit offset bits * j and OR it into byte i
            packed_tensor[i] |= uint8tensor[unpacked_idx] << (bits * j)
            # Move to the next value
            unpacked_idx += 1

    return packed_tensor

# Test the function
unpacked = torch.tensor([1, 0, 3, 2], dtype=torch.uint8)
packed = pack_weights(unpacked, 2)
print("Packed:", packed)
```

```
Packed: tensor([177], dtype=torch.uint8)
```
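The nested Python loop is easy to follow but slow for real weight matrices. A loop-free variant with the same bit layout might look like the sketch below. The name `pack_weights_vectorized` is hypothetical, and the sketch assumes every input value already fits in `bits` bits.

```python
import torch

def pack_weights_vectorized(values: torch.Tensor, bits: int) -> torch.Tensor:
    """Loop-free packing; the first value of each group lands in the low bits."""
    per_byte = 8 // bits
    if values.numel() % per_byte != 0:
        raise ValueError(f"The input length must be a multiple of {per_byte}")
    # One row per output byte, one column per slot inside that byte.
    groups = values.to(torch.int32).reshape(-1, per_byte)
    # Slot j is shifted left by bits * j.
    shifts = torch.arange(per_byte, dtype=torch.int32) * bits
    shifted = torch.bitwise_left_shift(groups, shifts)
    # The shifted slots occupy disjoint bits, so summing them equals OR-ing them.
    return shifted.sum(dim=1).to(torch.uint8)

print(pack_weights_vectorized(torch.tensor([1, 0, 3, 2], dtype=torch.uint8), 2))
```

The intermediate `int32` cast avoids overflow concerns during the shift; the result is cast back to `uint8` only once the byte is assembled.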


## 3. Manual Weight Unpacking (2-bit)

To use the packed weights in operations, we need to unpack them back into individual low-bit values.
We apply **bit-shifting** and a **bitmask** to recover the original values.

The bitmask for 2-bit is:
$$
\text{mask} = 2^\text{bits} - 1 = 3
$$
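To see the shift and the mask working together, the sketch below extracts each 2-bit slot from the packed byte 177 using plain Python integers.

```python
bits = 2
mask = 2**bits - 1  # 0b11 == 3
packed = 177        # 0b10110001

recovered = []
for j in range(8 // bits):
    # Shift slot j down to the lowest bits, then zero out everything above them.
    value = (packed >> (bits * j)) & mask
    recovered.append(value)
    print(f"slot {j}: {value}")

print(recovered)  # [1, 0, 3, 2]
```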

```python
def unpack_weights(uint8tensor, bits):
    """
    Unpacks a tensor of packed uint8 values into original low-bit weights.

    Args:
        uint8tensor (torch.Tensor): Packed uint8 tensor.
        bits (int): Number of bits per weight.

    Returns:
        torch.Tensor: Unpacked weights as uint8.
    """
    # Total number of low-bit values stored in the packed tensor
    num_values = uint8tensor.shape[0] * 8 // bits
    # Number of low-bit values stored in each byte
    num_steps = 8 // bits
    # Initialize the unpacked tensor
    unpacked_tensor = torch.zeros((num_values,), dtype=torch.uint8)

    # Create a mask for the bits we want to extract
    mask = 2 ** bits - 1
    unpacked_idx = 0

    # Unpack the weights
    for i in range(uint8tensor.shape[0]):
        for j in range(num_steps):
            # Shift slot j of byte i down to the lowest bits
            val = uint8tensor[i] >> (bits * j)
            # Keep only the lowest `bits` bits
            unpacked_tensor[unpacked_idx] = val & mask
            # Move to the next value
            unpacked_idx += 1

    return unpacked_tensor

# Test unpacking
recovered = unpack_weights(torch.tensor([177], dtype=torch.uint8), 2)
print("Unpacked:", recovered)
```

```
Unpacked: tensor([1, 0, 3, 2], dtype=torch.uint8)
```
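Unpacking can also be done without Python loops by broadcasting each byte against all shift amounts. The function name `unpack_weights_vectorized` is hypothetical; this is a sketch assuming the same low-bits-first layout used above.

```python
import torch

def unpack_weights_vectorized(packed: torch.Tensor, bits: int) -> torch.Tensor:
    """Loop-free unpacking of a 1D uint8 tensor into low-bit values."""
    per_byte = 8 // bits
    mask = 2**bits - 1
    shifts = torch.arange(per_byte, dtype=torch.int32) * bits
    # Shape (num_bytes, 1) against (per_byte,) broadcasts to (num_bytes, per_byte).
    slots = torch.bitwise_right_shift(packed.to(torch.int32).unsqueeze(1), shifts)
    # Mask off the higher slots and flatten back to the original value order.
    return (slots & mask).reshape(-1).to(torch.uint8)

print(unpack_weights_vectorized(torch.tensor([177], dtype=torch.uint8), 2))
```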


## 4. Why Quantizing LLMs is Hard

Quantizing LLMs introduces unique challenges due to **emergent outlier features** at scale. As shown in the [LLM.int8()](https://arxiv.org/abs/2208.07339) paper, outliers appear in more layers and in a larger share of sequence positions as models grow.

This causes linear quantization to break down:

- **Outliers dominate the quantization range**: the scale is set by the largest value, so typical values are squeezed into only a few quantization levels.
- **Final predictions are sensitive to errors**, especially in autoregressive settings, where an error in one generated token affects every token that follows.

To address this, recent research proposes **outlier-aware** methods such as LLM.int8(), SmoothQuant, GPTQ, and AWQ.
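The effect of a single outlier on the quantization range can be reproduced in a few lines. The sketch below uses symmetric absmax quantization with one scale per tensor; the helper name `absmax_quantize_dequantize` and the toy values are assumptions for illustration.

```python
import torch

def absmax_quantize_dequantize(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Symmetric linear quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1    # 127 for 8 bits
    scale = x.abs().max() / qmax  # one scale for the whole tensor
    q = torch.clamp(torch.round(x / scale), -qmax, qmax)
    return q * scale

typical = torch.linspace(-1.0, 1.0, steps=101)               # well-behaved values
with_outlier = torch.cat([typical, torch.tensor([60.0])])    # same values plus one outlier

# Error on the typical values only, with and without the outlier in the tensor.
err_clean = (absmax_quantize_dequantize(typical) - typical).abs().mean().item()
err_outlier = (absmax_quantize_dequantize(with_outlier)[:-1] - typical).abs().mean().item()
print(f"mean abs error without outlier: {err_clean:.5f}")
print(f"mean abs error with outlier:    {err_outlier:.5f}")
```

With the outlier present, the scale grows by a factor of 60, so the values in [-1, 1] collapse onto a handful of integer levels and their reconstruction error rises accordingly.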

## 5. Recent SOTA Quantization in LLMs

| Method     | Bits | Key Idea                                                 | Calibration? | Summary                                                                                                                          | Paper Link                                     |
|------------|------|----------------------------------------------------------|--------------|----------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------|
| LLM.int8() | 8    | Mixed-precision decomposition + vector-wise quantization | No           | Computes the few outlier feature dimensions in FP16 and the remaining dimensions in INT8.                                        | [LLM.int8()](https://arxiv.org/abs/2208.07339) |
| QLoRA      | 4    | LoRA adapters on a frozen 4-bit (NF4) base model         | No           | Fine-tunes low-rank adapters on top of a frozen, 4-bit quantized model to recover task performance.                              | [QLoRA](https://arxiv.org/abs/2305.14314)      |
| AWQ        | 4    | Per-channel activation-aware scaling                     | Yes          | Searches per-channel scales from activation statistics to protect salient weights in low-bit, weight-only quantization.         | [AWQ](https://arxiv.org/abs/2306.00978)        |
| GPTQ       | 4    | Hessian-based error compensation                         | Yes          | Quantizes weights column by column and updates the remaining weights using approximate second-order information.                | [GPTQ](https://arxiv.org/abs/2210.17323)       |
| HQQ        | 2    | Half-quadratic optimization of quantization parameters   | No           | Calibration-free; fits the quantization parameters with a half-quadratic solver under a sparsity-promoting error loss.          | [HQQ](https://mobiusml.github.io/hqq_blog/)    |
| QuIP       | 2    | Incoherence processing + adaptive rounding               | Yes          | Multiplies weights by random orthogonal matrices to make them incoherent, then rounds adaptively against a Hessian-based proxy. | [QuIP](https://arxiv.org/abs/2307.13304)       |

Quantizing large language models (LLMs) requires much more than simply converting weights to int8 or int4. These models exhibit **emergent behaviors** at scale, including the appearance of **activation outliers**, which break the assumption behind naive uniform quantization that values are spread fairly evenly across the quantization range.

---

#### The Outlier Problem in LLMs (From LLM.int8() Paper)

A critical insight from [LLM.int8() (Dettmers et al., 2022)](https://arxiv.org/abs/2208.07339) is that **outliers (extremely large activation values concentrated in a few feature dimensions) appear in more layers and across more sequence dimensions** as model size grows.

> **Key figure:** The paper shows that for sufficiently large models (from roughly 6.7B parameters upward):
> - Outliers are present in **all layers**
> - About **75% of the sequence dimensions** are affected
> - The effect becomes stronger as **C4 perplexity improves** (i.e., more capable models)

This phenomenon makes naive linear quantization **impractical**: the quantization range is dominated by a small number of large values, leaving too few levels for the normal features.

---

#### LLM.int8() (2022)

**Key idea:** Split each matrix multiplication into two parts:
1. **Regular feature dimensions:** quantized to INT8 with vector-wise scaling
2. **Outlier feature dimensions:** the few activation columns (and matching weight rows) that contain large-magnitude values, kept in higher precision (e.g. FP16)

Outlier dimensions are detected at runtime by thresholding activation magnitudes and are **processed outside of quantization**; the two partial results are then added together. The approach allows:
- High accuracy preservation
- Real-world deployment at 8-bit
- No calibration required, simple two-stage implementation.
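The two-part split can be sketched as follows. This is a simplified simulation in floating point, not the `bitsandbytes` implementation: the function names are hypothetical, INT8 arithmetic is emulated with rounding, and the threshold of 6.0 is an assumed default.

```python
import torch

def int8_matmul_vectorwise(x, w):
    """Simulated INT8 matmul: one scale per row of x and per column of w."""
    sx = x.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127  # (tokens, 1)
    sw = w.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / 127  # (1, out_features)
    xq = torch.round(x / sx)
    wq = torch.round(w / sw)
    return (xq @ wq) * sx * sw  # dequantize the integer product

def mixed_precision_matmul(x, w, threshold=6.0):
    # Feature dimensions (columns of x) that contain any value above the threshold.
    outlier_cols = (x.abs() > threshold).any(dim=0)
    # Outlier part stays in full precision.
    high = x[:, outlier_cols] @ w[outlier_cols, :]
    # Everything else goes through the simulated INT8 path.
    low = int8_matmul_vectorwise(x[:, ~outlier_cols], w[~outlier_cols, :])
    return high + low

torch.manual_seed(0)
x = torch.randn(16, 64)
x[:, 3] *= 20.0  # inject one outlier feature dimension
w = torch.randn(64, 32)

ref = x @ w
err_plain = (int8_matmul_vectorwise(x, w) - ref).abs().mean().item()
err_mixed = (mixed_precision_matmul(x, w) - ref).abs().mean().item()
print(f"INT8 only:       {err_plain:.4f}")
print(f"mixed precision: {err_mixed:.4f}")
```

Removing the outlier column before quantization keeps the per-row scales small, which is why the mixed-precision result tracks the reference more closely in this toy setup.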

---

#### SmoothQuant (2022)

**Problem:** Activations are harder to quantize than weights, due to outliers stretching their range.

**Key idea:** Smooth the activation distributions by **migrating the quantization difficulty from activations to weights**.

Let:
- \( X \) = activation
- \( W \) = weight

SmoothQuant introduces a **scaling factor \( s \)** (one value per input channel) and applies:
$$
\tilde{X} = X / s \quad \text{and} \quad \tilde{W} = W \cdot s
$$

Because each input channel of \( X \) is divided by the same factor that multiplies the matching row of \( W \), the product is unchanged: \( \tilde{X}\tilde{W} = XW \). The transformation only **balances** the dynamic range, so both \( \tilde{X} \) and \( \tilde{W} \) are easier to quantize.

- Works with **W8A8** quantization (both weights and activations in int8).
- Requires **calibration dataset**.
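A small numerical check of the smoothing transform is sketched below. The per-channel formula for `s` and the migration strength `alpha = 0.5` follow the SmoothQuant paper rather than the notes above, so treat both as assumptions; the tensors are toy data.

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 16, dtype=torch.float64)  # activations: (tokens, in_features)
x[:, 5] *= 30.0                              # one input channel with large outliers
w = torch.randn(16, 4, dtype=torch.float64)  # weights: (in_features, out_features)

alpha = 0.5  # migration strength (assumed value)
# Per-input-channel factor: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)
s = x.abs().amax(dim=0).pow(alpha) / w.abs().amax(dim=1).pow(1 - alpha)

x_smooth = x / s               # divide each activation channel by s_j
w_smooth = w * s.unsqueeze(1)  # multiply the matching weight row by s_j

# The layer output is unchanged by the transform.
same_output = torch.allclose(x @ w, x_smooth @ w_smooth)
print("output unchanged:", same_output)
# The largest activation magnitude shrinks after smoothing.
print("max |X| before:", x.abs().max().item())
print("max |X| after: ", x_smooth.abs().max().item())
```

In a real calibration run, the activation maxima would come from a calibration dataset rather than from a single batch.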

---

#### AWQ (Activation-aware Weight Quantization, 2023)

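**Key idea:** LLM performance depends disproportionately on a small subset of weights (called **salient weights**), and these are best identified from activation magnitudes rather than from the weight values themselves.

**How it works:**
- Run a small calibration set
- Identify the weight channels that receive the **largest activations**; these hold the salient weights (roughly the top ~1%)
- Keeping those weights in FP16 would preserve accuracy, but mixed precision is hardware-unfriendly. AWQ instead **scales the salient channels up before quantization** and folds the inverse scale into the activations, so all weights stay low-bit

The **per-channel scales** are searched using activation statistics from the calibration set.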

- Produces **high accuracy** at 4-bit with **low overhead**  
- Efficient for **deployment on edge devices**  
- Compatible with many LLMs (OPT, LLaMA, etc.)
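The scale search can be illustrated with a heavily simplified sketch. It assumes symmetric per-output-channel rounding and a five-point grid over the exponent `alpha`, whereas the paper uses group-wise quantization and a finer grid; all names here are hypothetical.

```python
import torch

def quantize_weights(w, bits=4):
    """Simulated symmetric round-to-nearest quantization, one scale per output channel."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

torch.manual_seed(0)
x = torch.randn(64, 32, dtype=torch.float64)  # calibration activations (tokens, in_features)
x[:, :2] *= 10.0                              # two salient input channels
w = torch.randn(32, 16, dtype=torch.float64)  # weights (in_features, out_features)
ref = x @ w

act_mag = x.abs().mean(dim=0)  # per-input-channel activation magnitude
errors = {}
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    s = act_mag.pow(alpha)                      # alpha = 0 gives s = 1, i.e. plain rounding
    w_q = quantize_weights(w * s.unsqueeze(1))  # scale up salient rows, then quantize
    out = (x / s) @ w_q                         # fold the inverse scale into the input
    errors[alpha] = (out - ref).pow(2).mean().item()

best_alpha = min(errors, key=errors.get)
print("output MSE per alpha:", errors)
print("best alpha:", best_alpha)
```

Because `alpha = 0` is part of the grid, the selected scale is never worse than plain round-to-nearest on the calibration data.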

---

> These methods show that successful quantization of LLMs is not only a matter of reducing bit width, but of **respecting the structure of information flow** within the network. Outlier handling, scaling, and weight importance must be taken into account.


## 6. Conclusion & Further Resources

**Key Takeaways:**
- Manual packing/unpacking is needed to realize the memory savings of 2-bit and 4-bit quantization, since the smallest storage unit used here is a `uint8`.
- Quantizing LLMs is harder due to outlier sensitivity.
- New methods address these challenges by selectively preserving precision.

**Further Reading:**
- [LLM.int8()](https://arxiv.org/abs/2208.07339)
- [QLoRA](https://arxiv.org/abs/2305.14314)
- [GPTQ](https://arxiv.org/abs/2210.17323)
- [SmoothQuant](https://arxiv.org/abs/2211.10438)
- [AWQ](https://arxiv.org/abs/2306.00978)
- [QuIP](https://arxiv.org/abs/2307.13304)
- [HQQ Blog](https://mobiusml.github.io/hqq_blog/)

> See also the `llama.cpp` repo, MIT HAN Lab resources (https://hanlab.mit.edu/ and their course "TinyML and Efficient Deep Learning Computing" https://hanlab.mit.edu/courses/2024-fall-65940 ), and the Hugging Face quantization docs (https://huggingface.co/docs/transformers/quantization).