<a href="https://colab.research.google.com/github/premkumarkora/HF_Model_Quantization/blob/main/HF_Model_Quantization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


```bash
!pip install -q requests torch bitsandbytes transformers sentencepiece accelerate
```

After running this install, you have all the core pieces needed to download, run, quantize, and accelerate modern NLP models from Hugging Face, and to make HTTP calls too. Here’s what each library gives you, and some of the things you can do right away:


## 1. `requests`

* **What it is:** A simple, human-friendly HTTP client for Python.
* **What you can do:**

  * Fetch data from web APIs (e.g. pull text from a REST service).
  * Download files or model weights if you want to manage HTTP yourself.
  * Send results of your model to another service (for example, posting generated text to a web server).


## 2. `torch`

* **What it is:** PyTorch, one of the most widely used deep-learning frameworks in Python.
* **What you can do:**

  * Load and run neural networks (including transformers).
  * Write and train your own models, or fine-tune existing ones.
  * Manipulate tensors, move them onto GPU, and do low-level operations when needed.


## 3. `bitsandbytes`

* **What it is:** A library for ultra-memory-efficient model loading and inference via quantization.
* **What you can do:**

  * Load giant transformer models in 4-bit or 8-bit precision.
  * Drastically reduce GPU RAM usage so you can run large models on smaller cards.
  * Combine with HuggingFace’s `quantization_config` to trade a tiny bit of accuracy for big memory savings.


## 4. `transformers`

* **What it is:** HuggingFace’s flagship library for all things transformer-based NLP (and beyond).
* **What you can do:**

  * Download thousands of pre-trained models (GPT, BERT, T5, BLOOM, etc.) with a single line.
  * Tokenize text, run model inference, and decode outputs back to human text.
  * Use pipelines for common tasks (text generation, summarization, translation, question answering) in just a couple lines of code.


## 5. `sentencepiece`

* **What it is:** A fast, language-agnostic tokenizer library often used under the hood by big models.
* **What you can do:**

  * Tokenize or detokenize your own text in the same way models like T5 or mT5 do.
  * Train new subword tokenizers if you have domain-specific text.
  * Ensure compatibility when you load models that explicitly require SentencePiece (e.g. certain multilingual or translation models).


## 6. `accelerate`

* **What it is:** HuggingFace’s lightweight tool for easily running code on CPU, single GPU, or multi-GPU setups.
* **What you can do:**

  * Spin up your training or inference with zero boilerplate for distributed/parallel setups.
  * Automatically move your model and data to the right devices.
  * Scale from your laptop’s CPU to a multi-GPU server by changing one command-line flag.


With these six packages in place, you’re set to explore or build anything from simple demos to large-scale, memory-efficient NLP services. Enjoy!


In [None]:
!pip install -q requests torch bitsandbytes transformers sentencepiece accelerate

In [None]:
from google.colab import userdata
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig
import torch
import gc



Sign in to Hugging Face

In [None]:
hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

Models that are going to be used in this notebook

In [None]:
# instruct models

LLAMA = "meta-llama/Meta-Llama-3.1-8B-Instruct"
PHI3 = "microsoft/Phi-3-mini-4k-instruct"
GEMMA2 = "google/gemma-2-2b-it"
QWEN2 = "Qwen/Qwen2-7B-Instruct" # exercise for you
MIXTRAL = "mistralai/Mixtral-8x7B-Instruct-v0.1" # If this doesn't fit in your GPU memory, try others from the hub

Creating a message template

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of ship captains"}
  ]

# Function to Quantize the Model


The `BitsAndBytesConfig` defined below tells HuggingFace’s loading logic how to quantize your model down to 4-bit weights. Quantization lets you squeeze a large model into GPU memory (or even CPU RAM) by storing its weights in fewer bits, at the cost of a (usually very small) drop in numerical precision. Here’s what each argument does:



### 1. `load_in_4bit=True`

* **What it does:**
  Instructs the loader to convert the weights of the model's linear layers from 16- or 32-bit floats down to 4 bits per value as you load the model.

  A 4-bit weight uses one-eighth the memory of a 32-bit weight. For a 10 billion-parameter model, that’s a drop from \~40 GB down to \~5 GB, often the difference between “fits on one consumer GPU” and “doesn’t.”
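
The arithmetic behind those figures can be checked with a few lines of Python. This is a rough sketch that counts only the weights themselves (no activations, scale factors, or framework overhead); the helper name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

params = 10_000_000_000  # the 10 billion-parameter example from the text

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(params, bits):5.1f} GB")
```

Real footprints are somewhat higher than this estimate, because some layers stay in 16-bit and the quantization scale factors also take space.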



### 2. `bnb_4bit_use_double_quant=True`

* **What it does:**
  Applies a two-step (“double”) quantization:

  1. **Primary quantization** reduces each block of weights to 4 bits.
  2. **Secondary quantization** further compresses the scale factors themselves (the small floating-point numbers that map 4-bit integers back to approximate original floats).

  You get even smaller overall memory footprint without materially worsening accuracy. Double-quantized scale factors take less space, but you still recover enough precision to generate sensible outputs.
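
To get a feel for how much the second step saves, the sketch below estimates the storage overhead that scale factors add per weight. The numbers are assumptions taken from the QLoRA paper's description (64 weights per scale factor, 32-bit scale factors compressed to 8 bits, 256 scale factors per second-level block), not values read from this notebook, and the function names are hypothetical.

```python
def scale_overhead_bits(block_size=64, scale_bits=32):
    """Extra bits per weight when each block stores one full-precision scale."""
    return scale_bits / block_size

def double_quant_overhead_bits(block_size=64, second_block_size=256,
                               quant_scale_bits=8, second_scale_bits=32):
    """Extra bits per weight when the scale factors are themselves quantized."""
    # Each block of weights now stores an 8-bit scale instead of a 32-bit one.
    first_level = quant_scale_bits / block_size
    # Every group of 256 scales shares one 32-bit second-level scale.
    second_level = second_scale_bits / (block_size * second_block_size)
    return first_level + second_level

plain = scale_overhead_bits()
double = double_quant_overhead_bits()
print(f"Single quantization overhead: {plain:.3f} bits per weight")
print(f"Double quantization overhead: {double:.3f} bits per weight")
```

Under these assumptions the saving is roughly 0.37 bits per weight, which adds up across billions of parameters.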



### 3. `bnb_4bit_compute_dtype=torch.bfloat16`


* **What it does:**
  Chooses the data type for all *compute* (matrix multiplications, attention, etc.) once the 4-bit weights have been de-quantized on the fly.

* **Why `bfloat16`?**

  * **bfloat16** (Brain Floating Point) is a 16-bit format with the same exponent range as 32-bit floats, but fewer mantissa bits.
  * It keeps dynamic range high (so very large and very small numbers behave well) while still cutting memory in half versus full 32 bit.

* **Result:**
  When you actually run the model, tensors are cast to `bfloat16` for speed and moderate precision, rather than back to full `float32` which costs more memory and compute.
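
The range-versus-precision trade-off can be illustrated without a GPU. The sketch below emulates `bfloat16` by keeping only the top 16 bits of a `float32` encoding. It truncates rather than rounds, which is a simplification of what real hardware does, and `to_bfloat16` is a hypothetical helper name.

```python
import struct

def to_bfloat16(x):
    """Emulate bfloat16 by zeroing the low 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

FLOAT16_MAX = 65504.0  # largest finite value of the standard 16-bit float

# Wide range: a value far beyond what float16 can hold survives in bfloat16.
print(to_bfloat16(1e38) > FLOAT16_MAX)

# Coarse precision: only 7 mantissa bits remain, so small differences vanish.
print(to_bfloat16(1.001))
```

The first line prints `True` and the second prints `1.0`: the exponent range is preserved, while fine detail in the mantissa is dropped.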


### 4. `bnb_4bit_quant_type="nf4"`

* **What it does:**
  Chooses which 4-bit quantization scheme to use. `"nf4"` stands for **NormalFloat-4**, a recent format developed to give better accuracy than simple integer quantization.
* **How it works at a high level:**

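  * **NF4** stores each 4-bit value not as a straight 0–15 integer on an evenly spaced grid, but as an index into a table of 16 fixed levels placed at the quantiles of a normal distribution and scaled to the range [-1, 1]. Each block of weights shares one scale factor (its absolute maximum).
  * Because trained weights are roughly normally distributed, this puts more levels near zero, where most weights lie, so they are encoded more faithfully than with evenly spaced (linear) quantization.

  In practice, models quantized with NF4 lose less performance (measured in perplexity or downstream task scores) than ones quantized with uniform 4-bit integers.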
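
The sketch below illustrates the idea of normal-quantile levels using only the Python standard library. It is a simplified stand-in, not the bitsandbytes implementation: the level table here is built from evenly spaced normal quantiles, whereas the real NF4 table is a fixed set of values that also includes an exact zero. All function names and the example block are hypothetical.

```python
from statistics import NormalDist

def normal_quantile_levels(bits=4):
    """Build 2**bits levels at normal quantiles, rescaled to [-1, 1]."""
    n = 2 ** bits
    dist = NormalDist()
    # Offset the probabilities from 0 and 1 so inv_cdf stays finite.
    raw = [dist.inv_cdf((i + 0.5) / n) for i in range(n)]
    peak = max(abs(v) for v in raw)
    return [v / peak for v in raw]

def quantize_block(block, levels):
    """Return one 4-bit code per weight plus the block's shared scale."""
    absmax = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
        for w in block
    ]
    return codes, absmax

def dequantize_block(codes, absmax, levels):
    """Map codes back to approximate weights."""
    return [levels[c] * absmax for c in codes]

levels = normal_quantile_levels()
block = [0.02, -0.5, 0.13, 0.9]
codes, absmax = quantize_block(block, levels)
restored = dequantize_block(codes, absmax, levels)
print(codes)
print([round(w, 3) for w in restored])
```

Printing `levels` shows that neighbouring levels sit closer together near zero than at the extremes, which is the property the bullet points above describe.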


In [None]:
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)
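
The `bitsandbytes` section earlier mentions 8-bit loading as well. For comparison, a minimal 8-bit configuration might look like the fragment below. The variable name `quant_config_8bit` is made up for illustration, and the rest of this notebook keeps using the 4-bit `quant_config`.

```python
from transformers import BitsAndBytesConfig

# 8-bit alternative to the 4-bit config above: the quantized weights take
# roughly twice the memory of 4-bit, in exchange for a smaller loss of precision.
quant_config_8bit = BitsAndBytesConfig(load_in_8bit=True)
```

It would be passed to `from_pretrained` through the same `quantization_config` argument.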

# LLAMA

In [None]:
# Tokenizer

tokenizer = AutoTokenizer.from_pretrained(LLAMA)
tokenizer.pad_token = tokenizer.eos_token
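# add_generation_prompt=True appends the assistant header so the model starts its reply directly
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")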

# Quantization Configuration

In [None]:
# The model

model = AutoModelForCausalLM.from_pretrained(LLAMA, device_map="auto", quantization_config=quant_config)

In [None]:
memory = model.get_memory_footprint() / 1e6
print(f"Memory footprint: {memory:,.1f} MB")

Memory footprint: 5,591.5 MB


# Looking under the hood at the Transformer model
The next cell prints the HuggingFace model object for Llama.

This model object is a Neural Network, implemented with the Python framework PyTorch. The Neural Network uses the architecture invented by Google scientists in 2017: the Transformer architecture.

While we're not going to go deep into the theory, this is an opportunity to get some intuition for what the Transformer actually is.

If you're completely new to Neural Networks, check out my YouTube intro playlist for the foundations.

Now take a look at the layers of the Neural Network that get printed in the next cell. Look out for this:

* It consists of layers.
* There's something called "embedding": this takes tokens and turns them into 4,096-dimensional vectors. We'll learn more about this in Week 5.
* There are then 32 groups of layers called "Decoder layers". Each Decoder layer contains three types of layer: (a) self-attention layers, (b) multi-layer perceptron (MLP) layers, and (c) RMS normalization layers (`LlamaRMSNorm`).
* There is an LM Head layer at the end; this produces the output.
* Notice the mention that the model has been quantized to 4 bits (the `Linear4bit` layers).

It's not required to go any deeper into the theory at this point, but if you'd like to, I've asked our mutual friend to take this printout and make a tutorial to walk through each layer. This also looks at the dimensions at each point. If you're interested, work through this tutorial after running the next cell:

https://chatgpt.com/canvas/shared/680cbea6de688191a20f350a2293c76b

In [None]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((409
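
The shapes in this printout are enough to estimate both the parameter count and the memory footprint reported earlier. The sketch below is a back-of-the-envelope calculation under several assumptions that the truncated printout does not confirm: the final norm has 4,096 weights, the LM head is an unquantized 4096-to-128256 linear layer without bias, every non-`Linear4bit` tensor is stored in 16 bits, and the quantization scale factors are ignored.

```python
# Shapes copied from the printout above.
VOCAB, HIDDEN, KV_DIM, MLP_DIM, LAYERS = 128256, 4096, 1024, 14336, 32

embedding = VOCAB * HIDDEN
attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q_proj, o_proj, k_proj, v_proj
mlp = 3 * HIDDEN * MLP_DIM                              # gate_proj, up_proj, down_proj
norms = 2 * HIDDEN                                      # two RMSNorm vectors per layer
final_norm = HIDDEN                                     # ASSUMPTION: printout is cut off here
lm_head = HIDDEN * VOCAB                                # ASSUMPTION: not shown in the printout

quantized = LAYERS * (attention + mlp)                  # the Linear4bit weights
unquantized = embedding + lm_head + LAYERS * norms + final_norm

total_params = quantized + unquantized
# 4-bit weights take half a byte each; everything else is assumed to take 2 bytes.
footprint_mb = (quantized * 0.5 + unquantized * 2) / 1e6

print(f"Parameters: {total_params:,}")
print(f"Estimated footprint: {footprint_mb:,.1f} MB")
```

Under these assumptions the estimate lands on the 5,591.5 MB that `get_memory_footprint()` reported, and it shows that the unquantized embedding and LM head account for a large share of what remains.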

In [None]:
outputs = model.generate(inputs, max_new_tokens=80)
print(tokenizer.decode(outputs[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a light-hearted joke for a room of ship captains<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Here's one that's sure to anchor a laugh:

Why did the ship captain bring a ladder to the party?

Because he heard the drinks were on the house!

I hope that one charted a course for some chuckles among your audience!<|eot_id|>
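
The decoded output above shows how the chat template wraps each message in special tokens before the model sees it. The sketch below reproduces that layout in plain Python to make the structure visible. It is a simplified illustration based only on the output shown here: it omits the date lines that the real template adds to the system message, and `render_llama3_style` is a hypothetical helper, not part of `transformers`.

```python
def render_llama3_style(messages, add_generation_prompt=True):
    """Simplified illustration of the Llama 3 chat layout seen in the output above."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, and an end-of-turn token.
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content']}<|eot_id|>"
    if add_generation_prompt:
        # An open assistant header tells the model that it is its turn to speak.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of ship captains"},
]
prompt = render_llama3_style(messages)
print(prompt)
```

In the notebook itself, `tokenizer.apply_chat_template` does this work using the template shipped with each model, so the same `messages` list works across Llama, Phi-3, and Gemma.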


In [None]:
# Clean up memory
# Thank you Kuan L. for helping me get this to properly free up memory!
# If you select "Show Resources" on the top right to see GPU memory, it might not drop down right away
# But it does seem that the memory is available for use by new models in the later code.

del model, inputs, tokenizer, outputs
gc.collect()
torch.cuda.empty_cache()

In [None]:
messages = [
    {"role": "user", "content": "Tell a light-hearted joke for a room of Super Heros"}
  ]

In [None]:
def generate(model_name, messages):
  # Load the tokenizer and build the chat-formatted prompt
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  tokenizer.pad_token = tokenizer.eos_token
  inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")
  streamer = TextStreamer(tokenizer)
  # Load the model with the 4-bit quantization config defined earlier
  model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=quant_config)
  # Stream tokens to the output as they are generated
  outputs = model.generate(inputs, max_new_tokens=80, streamer=streamer)
  memory = model.get_memory_footprint() / 1e6
  print(f"Memory footprint: {memory:,.1f} MB")
  # Free GPU memory before the next model is loaded
  del model, inputs, tokenizer, outputs, streamer
  gc.collect()
  torch.cuda.empty_cache()

How to quantize a model and push it back to Hugging Face

In [None]:
from huggingface_hub import notebook_login

model_name = "google/gemma-2-2b-it"

# Load the tokenizer and the 4-bit quantized model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=quant_config)

notebook_login()                # enter your HF token
# Upload the quantized weights and the tokenizer to your own Hub repository
model.push_to_hub("premkumarkora/gemma-2-2b-it")
tokenizer.push_to_hub("premkumarkora/gemma-2-2b-it")

CommitInfo(commit_url='https://huggingface.co/premkumarkora/gemma-2-2b-it/commit/70afc201f2cfcfbac614520ff079c3f75d219c84', commit_message='Upload tokenizer', commit_description='', oid='70afc201f2cfcfbac614520ff079c3f75d219c84', pr_url=None, repo_url=RepoUrl('https://huggingface.co/premkumarkora/gemma-2-2b-it', endpoint='https://huggingface.co', repo_type='model', repo_id='premkumarkora/gemma-2-2b-it'), pr_revision=None, pr_num=None)

In [None]:
generate(GEMMA2, messages)

<bos><start_of_turn>user
Tell a light-hearted joke for a room of Super Heros<end_of_turn>
<start_of_turn>model
Why did the superhero bring a ladder to the bank robbery? 

Because he heard the interest rates were sky-high! 😜 
<end_of_turn>
Memory footprint: 2,192.3 MB


In [None]:
# Clean up memory
# Thank you Kuan L. for helping me get this to properly free up memory!
# If you select "Show Resources" on the top right to see GPU memory, it might not drop down right away
# But it does seem that the memory is available for use by new models in the later code.

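# `outputs` was already deleted earlier and generate() keeps its own copy local,
# so only the globals created in the push-to-hub cell are removed here
del model, inputs, tokenizer, streamer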
gc.collect()
torch.cuda.empty_cache()