## GPTQ

#### What is GPTQ
- GPTQ: Generative Pre-trained Transformer Quantization
- Goal: Compress LLM weights from FP16/FP32 -> INT4/INT8 without significant accuracy loss
- Key Innovation: Uses second-order information (Hessian) to intelligently compensate for quantization errors

#### Core Problem Formulation
- For each layer with weight matrix W, quantized weights W_q, and calibration inputs X:
    - Minimize: ||WX - W_qX||^2
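To make the objective concrete, here is a minimal sketch that measures the layer-wise output error for plain round-to-nearest quantization (no GPTQ compensation). The helper names `rtn_quantize` and `layer_error` and the random tensors are illustrative assumptions, not part of GPTQ itself.

```python
import torch

def rtn_quantize(w, n_bits=4):
    """Symmetric round-to-nearest with a single scale (illustrative helper)."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / q_max
    return torch.clamp(torch.round(w / scale), -q_max, q_max) * scale

def layer_error(W, W_q, X):
    """Squared error ||WX - W_q X||^2 between original and quantized layer outputs."""
    return ((W @ X - W_q @ X) ** 2).sum()

torch.manual_seed(0)
W = torch.randn(8, 16)    # [d_out, d_in]
X = torch.randn(16, 64)   # [d_in, N] stand-in for calibration inputs

err_4bit = layer_error(W, rtn_quantize(W, 4), X)
err_8bit = layer_error(W, rtn_quantize(W, 8), X)
print(f"4-bit error: {err_4bit.item():.4f}")
print(f"8-bit error: {err_8bit.item():.4f}")
```

GPTQ aims to push this number lower than round-to-nearest can at the same bit-width.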

#### The Hessian
- H = 2XX^T / N
- What it captures:
    - Diagonal (H_ii): Importance of weight i (how strongly the output error reacts to changing it)
    - Off-Diagonal (H_ij): Correlation between weights i and j
- Why we need H^(-1):
    - Forward (H): "Change weight -> affects output"
    - Backward (H^(-1)): "Output is wrong -> adjust weight by this much"
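A small numerical check can make the role of H more tangible. The sketch below assumes random calibration inputs and a hypothetical weight perturbation `delta` for one output row, and compares the output error computed directly with the same quantity computed from H alone.

```python
import torch

torch.manual_seed(0)
d_in, N = 6, 200
X = torch.randn(d_in, N, dtype=torch.float64)   # calibration inputs [d_in, N]
H = 2 * (X @ X.T) / N                            # Hessian as defined above

# Hypothetical perturbation of one weight row (e.g. a rounding error)
delta = 0.01 * torch.randn(d_in, dtype=torch.float64)

direct = ((delta @ X) ** 2).sum()              # ||delta X||^2 over all N samples
via_hessian = (N / 2) * (delta @ H @ delta)    # same quantity using only H

print(direct.item(), via_hessian.item())
print(torch.allclose(direct, via_hessian))
```

Because the two values agree, H is all that needs to be stored from the calibration data: the effect of any weight change on the layer output can be evaluated without revisiting X.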

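```python
import torch

def quantize(w, n_bits=8):
    """Symmetric round-to-nearest quantization, returned in dequantized (float) form."""
    q_max = 2 ** (n_bits - 1) - 1
    q_min = -q_max

    # One scale for the whole tensor; the clamp avoids dividing by zero for all-zero input
    max_val = w.abs().max().clamp(min=1e-8)
    scale = max_val / q_max

    # int8 storage is sufficient for any n_bits <= 8
    w_q = torch.clamp(
        torch.round(w / scale),
        q_min, q_max
    ).to(torch.int8)

    return w_q.float() * scale

def gptq_quantize_layers(W, X, bits=4, block_size=128):
    """
    Quantize a single layer using a simplified, block-wise GPTQ-style update.

    Each block is rounded at once and its error is pushed onto the weights that
    are not yet quantized. The reference GPTQ algorithm also compensates inside
    each block and updates H^(-1) (via a Cholesky factorization) as weights are
    fixed; both steps are omitted here for brevity.

    Args:
        W: Weight matrix [d_out, d_in]
        X: Calibration data (layer inputs) [d_in, N]
        bits: Target bit-width (typically 4)
        block_size: Number of weights to quantize together (typically 128)

    Returns:
        W_quant: Quantized weights [d_out, d_in]
    """
    d_out, d_in = W.shape

    hessian = 2 * (X @ X.T) / X.shape[1]

    # Dampen the diagonal so the inverse exists even when N < d_in
    damp = 0.01 * hessian.diag().mean()
    hessian = hessian + damp * torch.eye(d_in, dtype=hessian.dtype, device=hessian.device)
    hessian_inv = torch.linalg.inv(hessian)

    W_quant = torch.zeros_like(W)
    for row_idx in range(d_out):  # one pass per output row
        w_row = W[row_idx, :].clone()

        for block_start in range(0, d_in, block_size):
            block_end = min(block_start + block_size, d_in)
            block_idx = slice(block_start, block_end)

            w_block = w_row[block_idx]
            w_block_quant = quantize(w_block, n_bits=bits)

            W_quant[row_idx, block_idx] = w_block_quant

            # Rounding error introduced by this block
            error = w_block - w_block_quant

            if block_end < d_in:
                future_idx = slice(block_end, d_in)

                hessian_inv_block = hessian_inv[block_idx, future_idx]
                hessian_inv_diag = hessian_inv[block_idx, block_idx].diag()

                # Per-weight update: error_q / [H^(-1)]_qq * H^(-1)[q, future]
                compensation = error.unsqueeze(1) * (hessian_inv_block / hessian_inv_diag.unsqueeze(1))

                # Shift the not-yet-quantized weights to absorb the error
                w_row[future_idx] -= compensation.sum(dim=0)

    return W_quant
```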

#### Limitations
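- Weight-only: Doesn't quantize activations, so activations stay in FP16
- Calibration dependency: The Hessian is built from a calibration set, so quality depends on how well that set matches deployment data
- Memory intensive: Quantization needs a d_in x d_in Hessian and its inverse for every layer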

#### Why it works
- Hessian captures weight importance and correlations
- When we quantize one weight, H^(-1) tells us how to adjust the remaining weights so that the output error on the calibration data is minimized
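The compensation step can be sketched on a single weight row. This is a minimal illustration, assuming correlated random inputs, a hypothetical grid spacing `step`, and quantization of only the first weight; it compares the output error with and without adjusting the remaining weights along H^(-1).

```python
import torch

torch.manual_seed(0)
d_in, N = 8, 256
# Correlated calibration inputs: independent noise plus a component shared by all features
X = torch.randn(d_in, N, dtype=torch.float64) + 0.8 * torch.randn(1, N, dtype=torch.float64)
w = torch.randn(d_in, dtype=torch.float64)   # one row of W

H = 2 * (X @ X.T) / N
H_inv = torch.linalg.inv(H)

step = 0.25                                   # hypothetical quantization grid spacing
w0_q = torch.round(w[0] / step) * step        # quantize only weight 0
error = w[0] - w0_q

# Option A: quantize weight 0 and leave the other weights untouched
w_plain = w.clone()
w_plain[0] = w0_q

# Option B: also shift the remaining weights to absorb the error
w_comp = w_plain.clone()
w_comp[1:] -= error / H_inv[0, 0] * H_inv[0, 1:]

def output_error(w_new):
    """Squared output difference between the original row and a modified row."""
    return ((w @ X - w_new @ X) ** 2).sum()

print("no compensation:  ", output_error(w_plain).item())
print("with compensation:", output_error(w_comp).item())
```

The compensated row produces a smaller output error because correlated inputs let the other weights make up for part of the rounding error. The error is reduced, not eliminated.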

#### Pros
- High accuracy
- No retraining required
- Per-channel quantization

#### Likely Interview Questions

**Q: "Why use Hessian instead of just gradients?"**
- "Gradients (first-order) only show sensitivity - how much output changes with weight. Hessian (second-order) shows curvature and correlations - how weights interact. This lets us adjust multiple weights together optimally when compensating for quantization errors."

**Q: "What if calibration data doesn't match deployment?"**
- "Accuracy degrades because the Hessian is computed on the wrong distribution. The compensation will be suboptimal. We need domain-representative calibration - for example, if deploying for code, calibrate on code, not general text."

**Q: "Can you quantize below 4-bit with GPTQ?"**
- "Yes, but 3-bit is the boundary where accuracy starts dropping noticeably. At 2-bit, quantization noise dominates even with optimal compensation. You would need either mixed-precision (4-bit for important layers, 2-bit for others) or fine-tuning."
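To see how quickly the raw rounding noise grows as the bit-width shrinks, the sketch below applies plain round-to-nearest (not GPTQ) to random Gaussian weights. The helper `rtn_quantize` and the synthetic weights are assumptions for illustration; real layers will show different absolute numbers.

```python
import torch

def rtn_quantize(w, n_bits):
    """Symmetric round-to-nearest with one scale for the whole tensor."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / q_max
    return torch.clamp(torch.round(w / scale), -q_max, q_max) * scale

torch.manual_seed(0)
w = torch.randn(4096)   # synthetic stand-in for a weight tensor

errors = {}
for n_bits in (8, 4, 3, 2):
    levels = 2 * (2 ** (n_bits - 1) - 1) + 1          # symmetric grid size
    errors[n_bits] = ((w - rtn_quantize(w, n_bits)) ** 2).mean().item()
    print(f"{n_bits}-bit: {levels} levels, MSE = {errors[n_bits]:.5f}")
```

This is the noise that error compensation has to work against: each bit removed roughly halves the number of available levels.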

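**Q: "How do you handle outlier weights?"**
- "GPTQ has no special mechanism for outlier weights. The Hessian depends only on the layer inputs X, so a large H_ii marks an input channel with large activations, not a large weight. An outlier weight mainly hurts by stretching the quantization range for everything that shares its scale. Error compensation absorbs part of the resulting rounding error, and per-channel scales (one quantization range per output channel) keep an outlier from widening the range of other channels."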
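The effect of per-channel scales can be illustrated with a small experiment. The sketch below assumes a random weight matrix with one injected outlier and compares a single shared scale against one scale per output row; `quantize_sym` is a hypothetical helper.

```python
import torch

def quantize_sym(w, scale, n_bits=4):
    """Symmetric round-to-nearest using the given scale (scalar or per-row)."""
    q_max = 2 ** (n_bits - 1) - 1
    return torch.clamp(torch.round(w / scale), -q_max, q_max) * scale

torch.manual_seed(0)
W = torch.randn(4, 64)   # [d_out, d_in]
W[0, 0] = 20.0           # inject one outlier into row 0

q_max = 2 ** (4 - 1) - 1
scale_tensor = W.abs().max() / q_max                        # one scale for the whole matrix
scale_channel = W.abs().amax(dim=1, keepdim=True) / q_max   # one scale per output row

err_tensor = ((W - quantize_sym(W, scale_tensor)) ** 2).mean(dim=1)
err_channel = ((W - quantize_sym(W, scale_channel)) ** 2).mean(dim=1)
print("per-tensor  MSE by row:", err_tensor)
print("per-channel MSE by row:", err_channel)
```

With a shared scale, the outlier in row 0 coarsens the grid for every row. With per-row scales, only row 0 pays for it.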

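**Q: "Why block-wise quantization instead of one weight at a time?"**
- "Mainly hardware efficiency. Updating all remaining weights after every single quantized column is memory-bound and uses the GPU poorly. GPTQ's lazy batch updates process a block of 128 columns at a time, apply compensation inside the block immediately, and push the accumulated update to the rest of the matrix once per block. This does not reduce the theoretical amount of compute, but it turns many small updates into a few large matrix multiplications. The result matches the column-by-column procedure, so there is no accuracy cost."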

**Q: "Does GPTQ help with inference speed?"**
- "Yes, but it requires custom kernels. 4-bit weights are about 4× smaller than FP16, and token-by-token generation is usually limited by memory bandwidth, so loading fewer bytes is where the speedup comes from. Specialized CUDA/Metal kernels dequantize the weights on the fly inside the matrix multiply. Without them, you dequantize to FP16 as a separate step before computing, which saves memory but not compute time."
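The size reduction is easy to estimate with back-of-the-envelope arithmetic. The sketch below assumes a hypothetical 7B-parameter model and, for the grouped case, one FP16 scale per group of 128 weights; zero-points and non-quantized tensors such as embeddings are ignored.

```python
def weight_bytes(n_params, bits, group_size=None, scale_bits=16):
    """Approximate weight storage in bytes; group_size adds one scale per group."""
    total_bits = n_params * bits
    if group_size is not None:
        total_bits += (n_params // group_size) * scale_bits
    return total_bits / 8

n_params = 7_000_000_000   # hypothetical 7B-parameter model
fp16 = weight_bytes(n_params, 16)
int4 = weight_bytes(n_params, 4)
int4_g128 = weight_bytes(n_params, 4, group_size=128)

GiB = 1024 ** 3
print(f"FP16:              {fp16 / GiB:.2f} GiB")
print(f"INT4:              {int4 / GiB:.2f} GiB ({fp16 / int4:.2f}x smaller)")
print(f"INT4, group = 128: {int4_g128 / GiB:.2f} GiB ({fp16 / int4_g128:.2f}x smaller)")
```

The per-group scales are why the practical reduction lands slightly below the ideal 4×.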

## AWQ