🧠 Quantization Explained Simply:
4-bit quantization = Replace full precision (16/32-bit) weights with compressed 4-bit versions.

Typically used for inference and fine-tuning, not pretraining.

During the forward/backward pass, the 4-bit weights are temporarily dequantized to a higher precision (compute_dtype) for the actual computation.

In [None]:
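import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store the weights in 4-bit
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for the actual matmuls
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",             # NormalFloat4
)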

🔁 3. How Does LoRA Work with Quantized Models?
LoRA (Low-Rank Adaptation) + Quantization = a powerful combo for efficient fine-tuning:

Instead of modifying all model weights, LoRA injects trainable matrices (A, B) into selected layers (like attention).

You freeze the base model.

You train only the small LoRA adapters, while the base model stays in quantized (4-bit) form.

This keeps training cost and memory usage low, which makes fine-tuning feasible on consumer GPUs.
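
To make the idea concrete, here is a minimal NumPy sketch of a single LoRA-adapted linear layer. The sizes, the scaling factor alpha / r, and the zero initialization of B follow the usual LoRA recipe, but all names and numbers here are illustrative assumptions, not code taken from any library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed, not from a real model).
d_in, d_out, r, alpha = 64, 64, 4, 8

# Frozen base weight. In a quantized setup this would be stored in 4-bit
# and dequantized on the fly; here it is kept in float32 for simplicity.
W = rng.normal(size=(d_out, d_in)).astype(np.float32)

# Trainable LoRA factors: A is small and random, B starts at zero,
# so the adapter initially leaves the layer's output unchanged.
A = rng.normal(scale=0.01, size=(r, d_in)).astype(np.float32)
B = np.zeros((d_out, r), dtype=np.float32)

def lora_forward(x):
    """Compute y = x W^T + (alpha / r) * x A^T B^T."""
    base = x @ W.T                # frozen path
    update = (x @ A.T) @ B.T      # low-rank trainable path
    return base + (alpha / r) * update

x = rng.normal(size=(2, d_in)).astype(np.float32)
y = lora_forward(x)

full_params = W.size
lora_params = A.size + B.size
print(f"full: {full_params}, LoRA: {lora_params}")
```

With these toy sizes the adapter has 512 trainable values against 4096 in the full weight matrix; the gap grows quickly as the layer gets wider.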

🔧 What Is Quantization?
Quantization is the process of reducing the precision of the numbers used to represent a model’s weights, activations, or gradients.

Traditionally, models store their weights as 32-bit floats (float32).

Quantization replaces them with smaller types, like:

float16 (16-bit floating point)

int8 (8-bit integer)

int4 (4-bit integer)
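
As a rough illustration of what "reducing precision" means, the sketch below uses simple absmax (symmetric) quantization in NumPy. This is an assumed, simplified scheme chosen for clarity: it is not the NF4 method used by bitsandbytes, and the 4-bit values are kept in an int8 container rather than packed two per byte.

```python
import numpy as np

def quantize_absmax(weights, bits):
    """Map floats to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8-bit, 7 for 4-bit
    max_abs = float(np.abs(weights).max())
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the integers."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)

for bits in (8, 4):
    q, scale = quantize_absmax(w, bits)
    err = np.abs(w - dequantize(q, scale)).mean()
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

The 4-bit round trip loses noticeably more detail than the 8-bit one, which is the trade-off quantization makes for the smaller footprint.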

🧠 Why Do We Quantize?
Because:

Memory is expensive.

4-bit numbers take 1/8th the space of 32-bit floats.

You can't run 7B+ models on small GPUs unless you compress them.

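Inference and training can also speed up, because smaller weights mean less data to move through memory (though 4-bit weights must be dequantized before compute, so a speedup is not guaranteed).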
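
The memory argument can be checked with quick arithmetic. The helper below is a hypothetical function written for this note; it counts the weights only and ignores activations, optimizer state, and the small overhead of quantization constants.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 7_000_000_000  # a "7B" model

for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>8}: {weight_memory_gb(num_params, bits):.1f} GB")
```

For a 7B model this gives about 28 GB in float32, 14 GB in float16, 7 GB in int8, and 3.5 GB in 4-bit, which is why 4-bit loading brings such models within reach of small GPUs.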

⚙️ Real-Life Analogy:
Imagine storing prices:

32-bit: $32.149732 → very precise, but takes space.

8-bit: $32.14 → enough precision for many tasks.

4-bit: $32 → rough, but still usable.

Quantization keeps the structure and just approximates the values.

🧱 4. Where Is LoRA Applied?
Usually applied to:

Attention layers:

q_proj (query projection)

v_proj (value projection)

Optionally:

k_proj, o_proj, MLP layers

Why not all?
→ Because adapting the attention projections usually gives most of the benefit with far fewer trainable parameters.
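
With the PEFT library, the choice of layers is expressed through the target_modules argument of LoraConfig. The values below (r, lora_alpha, lora_dropout) are illustrative assumptions, and the module names q_proj and v_proj only match architectures that name their attention projections this way (Llama-style models do; others may differ).

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                                  # rank of the A and B matrices (assumed value)
    lora_alpha=16,                        # scaling factor (assumed value)
    lora_dropout=0.05,                    # dropout on the adapter path (assumed value)
    target_modules=["q_proj", "v_proj"],  # attention query and value projections
    bias="none",                          # leave bias terms frozen
    task_type="CAUSAL_LM",
)
```

The config is then typically passed to get_peft_model together with the (optionally quantized) base model. To also adapt k_proj, o_proj, or the MLP layers, add their module names to target_modules.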

🧩 What is PEFT?
PEFT stands for:

🔧 Parameter-Efficient Fine-Tuning

It is:

A Python library by Hugging Face

Built to make fine-tuning large models easier and cheaper

Especially helpful for resource-limited environments (like laptops or small GPUs)