# Quantization Examples

This notebook loads [AdamLucek/Orpo-Llama-3.2-1B-15k](https://huggingface.co/AdamLucek/Orpo-Llama-3.2-1B-15k) three ways: at its default precision, quantized to [8-bit](https://arxiv.org/abs/2208.07339), and quantized to [4-bit](https://arxiv.org/abs/2305.14314).

In [None]:
%%capture
!pip install -U transformers bitsandbytes accelerate

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

In [None]:
# Loading the Model
# No torch_dtype is passed, so the base weights load in float32
model_id = 'AdamLucek/Orpo-Llama-3.2-1B-15k'
base_model = AutoModelForCausalLM.from_pretrained(model_id).to("cuda")

# INT8 Config
bnb_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True,
)

model_8bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config_8bit, low_cpu_mem_usage=True)

# 4 Bit Config
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config_4bit, low_cpu_mem_usage=True)

# Loading Tokenizer (same repository as the model)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Print model size
print(f"Base Model size: {base_model.get_memory_footprint():,} bytes\n")
print(f"INT8 Model size: {model_8bit.get_memory_footprint():,} bytes\n")
print(f"4Bit Model size: {model_4bit.get_memory_footprint():,} bytes")

Base Model size: 4,943,276,160 bytes

INT8 Model size: 1,498,560,640 bytes

4Bit Model size: 1,012,021,376 bytes


### VRAM Requirements

**Base Model**: 4.9 GB / 4921 MiB   
**INT8 Model**: 1.7 GB / 1667 MiB  
**4Bit Model**: 1.2 GB / 1235 MiB

### Storage Requirements

**Base Model**: 4.94 GB  / 4,943,276,160 bytes  
**INT8 Model**: 1.50 GB / 1,498,560,640 bytes  
**4Bit Model**: 1.01 GB / 1,012,021,376 bytes  
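
The reported footprints can be reproduced from the layer shapes in the architecture printout in the next section. The sketch below is a back-of-the-envelope check, not something bitsandbytes or transformers exposes. It rests on several assumptions that the notebook does not state: the output head shares the embedding matrix, only the `Linear` layers inside the decoder blocks are quantized, embeddings and norms are held in float16 in the quantized models, and the remaining 2,176 bytes are small non-parameter buffers.

```python
# Back-of-the-envelope footprint check from the printed layer shapes.
# Assumptions (not stated in the notebook): the LM head is tied to the
# embedding, only decoder Linear layers are quantized, everything else is
# float16 in the quantized models, and BUFFER_BYTES covers small buffers.

HIDDEN, KV_DIM, FFN, VOCAB, LAYERS = 2048, 512, 8192, 128258, 16
BUFFER_BYTES = 2176  # assumed non-parameter buffers (e.g. rotary frequencies)

# Linear weights in one decoder layer
linear_per_layer = (
    HIDDEN * HIDDEN      # q_proj
    + HIDDEN * KV_DIM    # k_proj
    + HIDDEN * KV_DIM    # v_proj
    + HIDDEN * HIDDEN    # o_proj
    + HIDDEN * FFN       # gate_proj
    + HIDDEN * FFN       # up_proj
    + FFN * HIDDEN       # down_proj
)
linear_params = linear_per_layer * LAYERS
embed_params = VOCAB * HIDDEN
# Two norms per layer plus a final norm (standard for Llama, but cut off
# in the truncated printout, so treat it as an assumption too).
norm_params = 2 * HIDDEN * LAYERS + HIDDEN

def footprint(linear_bytes_per_weight, other_bytes_per_weight):
    """Estimated bytes given the storage width of each weight group."""
    linear_bytes = int(linear_params * linear_bytes_per_weight)
    other_bytes = (embed_params + norm_params) * other_bytes_per_weight
    return linear_bytes + other_bytes + BUFFER_BYTES

estimates = {
    "base (float32)": footprint(4, 4),
    "int8": footprint(1, 2),
    "4-bit": footprint(0.5, 2),
}
for name, size in estimates.items():
    print(f"{name}: {size:,} bytes")
```

If these assumptions hold, they also explain why INT8 is roughly 3.3x smaller than the base model rather than 4x: the embedding table, about 525 MB in float16, is not quantized.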

---
## Comparing the First Layer

A closer look at the raw weights of the first layer's query projection (`q_proj`) in each version.

In [None]:
# Looking at the Full Model Architecture
base_model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128258, 2048, padding_idx=128257)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
     

In [None]:
# Weights from the First layer
base_weights = base_model.model.layers[0].self_attn.q_proj.weight.data
print("Original weights:")
print(base_weights)
print("Shape: ", base_weights.shape, "\n")
print("-" * 50, "\n")

# Weights From the First Layer - 8bit
weights_8bit = model_8bit.model.layers[0].self_attn.q_proj.weight.data
print("INT8 weights:")
print(weights_8bit)
print("Shape: ", weights_8bit.shape, "\n")

print("-" * 50, "\n")

# Weights From the First Layer - 4bit
weights_4bit = model_4bit.model.layers[0].self_attn.q_proj.weight.data
print("4Bit weights:")
print(weights_4bit)
print("Shape: ", weights_4bit.shape, "\n")

Original weights:
tensor([[-0.0183,  0.0071,  0.0218,  ..., -0.0070, -0.0089,  0.0149],
        [ 0.0112,  0.0593,  0.0630,  ..., -0.0334, -0.0147,  0.0058],
        [ 0.0182,  0.0141,  0.0362,  ..., -0.0432, -0.0388, -0.0233],
        ...,
        [ 0.0305,  0.0289,  0.0801,  ..., -0.0768, -0.0311, -0.0333],
        [ 0.0242, -0.0325,  0.0369,  ..., -0.0124, -0.0268, -0.0150],
        [-0.0264, -0.0498, -0.0210,  ...,  0.0602,  0.0130, -0.0008]],
       device='cuda:0')
Shape:  torch.Size([2048, 2048]) 

-------------------------------------------------- 

INT8 weights:
tensor([[-49,  19,  59,  ..., -19, -24,  40],
        [ 13,  69,  73,  ..., -39, -17,   7],
        [ 17,  13,  34,  ..., -40, -36, -22],
        ...,
        [ 16,  15,  43,  ..., -41, -17, -18],
        [ 36, -48,  54,  ..., -18, -39, -22],
        [-14, -27, -11,  ...,  32,   7,   0]], device='cuda:0',
       dtype=torch.int8)
Shape:  torch.Size([2048, 2048]) 

-------------------------------------------------- 

4B
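
The integer values above are consistent with row-wise absmax quantization, the scheme LLM.int8() applies to weights: each row is scaled so that its largest magnitude maps to 127, then rounded. Below is a minimal sketch in plain Python. It assumes a hypothetical row maximum of 0.0470, because the true maximum of the first row is hidden behind the `...` in the printout, and it ignores the outlier handling that LLM.int8() performs at matmul time.

```python
def quantize_row(row):
    """Row-wise absmax quantization to signed 8-bit integers."""
    absmax = max(abs(x) for x in row)
    codes = [round(x / absmax * 127) for x in row]
    return codes, absmax

def dequantize_row(codes, absmax):
    """Map int8 codes back to approximate float values."""
    return [q * absmax / 127 for q in codes]

# First three values of the original first row, plus a HYPOTHETICAL
# row maximum (0.0470) standing in for the elided part of the row.
row = [-0.0183, 0.0071, 0.0218, 0.0470]
codes, absmax = quantize_row(row)
restored = dequantize_row(codes, absmax)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(codes)
print(f"largest round-trip error: {max_error:.6f}")
```

With that assumed maximum, the first three codes come out as -49, 19, and 59, matching the start of the printed INT8 row. The round-trip error stays below one quantization step (`absmax / 127`).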

---
## Testing Generation

Running the same prompt through each model version

In [None]:
# Function for Generating With the Models
def generate_response(model, message, tokenizer):

    # Format message
    messages = [{"role": "user", "content": message}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    # Tokenize and generate
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.95
    )

    # Decode
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

prompt = "What is a language model?"

base_response = generate_response(base_model, prompt, tokenizer)
print("Base Model Response:\n")
print(base_response)
print("-" * 50)

int8_response = generate_response(model_8bit, prompt, tokenizer)
print("INT8 Model Response:\n")
print(int8_response)
print("-" * 50)

bit4_response = generate_response(model_4bit, prompt, tokenizer)
print("4Bit Model Response:\n")
print(bit4_response)

Base Model Response:

user
What is a language model?
assistant
A language model is a mathematical model that can generate text based on the language it has been trained on. It is used in various applications, such as natural language processing (NLP), machine translation, and speech recognition.
--------------------------------------------------
INT8 Model Response:

user
What is a language model?
assistant
A language model is a mathematical function that maps input text into output text. It is used in natural language processing, a field that involves the analysis, translation, and generation of human languages. The output of a language model is often a sequence of words or phrases, and it is trained on a large corpus of text.
--------------------------------------------------
4Bit Model Response:

user
What is a language model?
assistant
A language model is a mathematical model that is used to predict the likelihood of a sequence of characters in a text or a speech sequence. It is tr

## Calculating Perplexity

Perplexity measures how well the model predicts the next token:
- Lower perplexity = model is more confident/accurate in its predictions
- Higher perplexity = model is more uncertain/confused
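
Concretely, perplexity is the exponential of the average negative log-probability the model assigns to each observed token. A minimal sketch with hand-picked probabilities (illustrative values, not model outputs):

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

confident = perplexity([0.9, 0.8, 0.95])    # high probability on each token
uncertain = perplexity([0.25, 0.25, 0.25])  # like guessing among 4 options
print(f"confident: {confident:.2f}, uncertain: {uncertain:.2f}")
```

A model that assigns probability 0.25 to every observed token has a perplexity of 4, which is why perplexity is often read as the effective number of choices the model is weighing at each step.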



In [None]:
def calculate_perplexity(model, text):
    # Encode the text
    encodings = tokenizer(text, return_tensors='pt').to("cuda")

    # Define input_ids and target_ids
    input_ids = encodings.input_ids
    target_ids = input_ids.clone()

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

    # Loss calculation
    neg_log_likelihood = outputs.loss

    # Perplexity calculation
    ppl = torch.exp(neg_log_likelihood)

    return ppl

In [None]:
# Each model is scored on its own response. The [41:] slice drops the echoed
# chat-template prefix ("user\nWhat is a language model?\nassistant\n").
base_perplexity = calculate_perplexity(base_model, base_response[41:])
print(f"Base Model Perplexity:  {base_perplexity.item():.2f}", "\n")
int8_perplexity = calculate_perplexity(model_8bit, int8_response[41:])
print(f"INT8 Model Perplexity:  {int8_perplexity.item():.2f}", "\n")
bit4_perplexity = calculate_perplexity(model_4bit, bit4_response[41:])
print(f"4Bit Model Perplexity:  {bit4_perplexity.item():.2f}")

Base Model Perplexity:  3.04 

INT8 Model Perplexity:  3.48 

4Bit Model Perplexity:  3.46


---
# Quantizing and Converting to GGUF Format

[GGUF](https://huggingface.co/docs/hub/gguf) is a binary file format developed by Georgi Gerganov as part of [GGML](https://github.com/ggerganov/ggml?tab=readme-ov-file), designed for the [llama.cpp](https://github.com/ggerganov/llama.cpp) library. It has become a popular way of storing and distributing language models for a variety of downstream tasks, and comes with built-in quantization capabilities.

## Quantization methods

The names of the quantization methods follow the convention "q" + the number of bits + the variant used. K denotes the [k-quants quantization method](https://github.com/ggerganov/llama.cpp/pull/1684), which was developed for LLMs and treats different parts of the model differently based on their sensitivity to quantization. Here's a simplified summary of popular methods:

* `q2_k`: Uses Q4_K for the attention.wv and feed_forward.w2 tensors, Q2_K for the other tensors.
* `q3_k_l`: Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
* `q3_k_m`: Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
* `q3_k_s`: Uses Q3_K for all tensors
* `q4_0`: Original quant method, 4-bit.
* `q4_1`: Higher accuracy than q4_0 but not as high as q5_0, with quicker inference than the q5 models.
* `q4_k_m`: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K
* `q4_k_s`: Uses Q4_K for all tensors
* `q5_0`: Higher accuracy, higher resource usage and slower inference.
* `q5_1`: Even higher accuracy, resource usage and slower inference.
* `q5_k_m`: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K
* `q5_k_s`: Uses Q5_K for all tensors
* `q6_k`: Uses Q6_K for all tensors
* `q8_0`: Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.

We'll use q4_k_m, a 4-bit k-quants method, following the outline [here](https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/README.md).
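
The naming convention described above can be made concrete with a small parser. This is an illustrative helper, not part of llama.cpp: the function name and the returned fields are assumptions made for this sketch, and the pattern only covers the names in the list above.

```python
import re

# Illustrative only: matches names such as q4_0, q6_k, q4_k_m.
QUANT_NAME = re.compile(
    r"^q(?P<bits>\d+)_(?P<variant>[01k])(?:_(?P<size>[sml]))?$"
)

def parse_quant_name(name):
    """Split a quantization name into bits, variant, and optional size."""
    match = QUANT_NAME.match(name.lower())
    if match is None:
        raise ValueError(f"unrecognized quantization name: {name!r}")
    return {
        "bits": int(match["bits"]),
        "k_quant": match["variant"] == "k",
        "variant": match["variant"],
        "size": match["size"],  # "s", "m", "l", or None
    }

for name in ["q4_0", "q6_k", "Q4_K_M"]:
    print(name, parse_quant_name(name))
```

For `Q4_K_M`, the parser reports 4 bits, the k-quants variant, and the medium size, which is the configuration used in the rest of this section.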



In [None]:
# Install llama.cpp
!git clone https://github.com/ggerganov/llama.cpp
!pip install -r llama.cpp/requirements.txt

Cloning into 'llama.cpp'...
remote: Enumerating objects: 37013, done.
remote: Counting objects: 100% (8999/8999), done.
remote: Compressing objects: 100% (629/629), done.
remote: Total 37013 (delta 8679), reused 8516 (delta 8368), pack-reused 28014 (from 1)
Receiving objects: 100% (37013/37013), 60.72 MiB | 16.21 MiB/s, done.
Resolving deltas: 100% (27010/27010), done.
Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cpu, https://download.pytorch.org/whl/cpu, https://download.pytorch.org/whl/cpu, https://download.pytorch.org/whl/cpu
Collecting transformers<5.0.0,>=4.45.1 (from -r llama.cpp/./requirements/requirements-convert_legacy_llama.txt (line 3))
  Downloading transformers-4.46.2-py3-none-any.whl.metadata (44 kB)
Collecting gguf>=0.1.0 (from -r llama.cpp/./requirements/requirements-convert_legacy_llama.txt (line 4))


In [None]:
%cd llama.cpp
!make

/content/llama.cpp
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE -DGGML_USE_AMX  -std=c11   -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -fopenmp -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -fopenmp  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE -DGGML_USE_AMX 
I NVCCFL

In [None]:
# Download model
%cd ..
!git lfs install
!git clone https://huggingface.co/AdamLucek/Orpo-Llama-3.2-1B-15k

/content
Git LFS initialized.
Cloning into 'Orpo-Llama-3.2-1B-15k'...
remote: Enumerating objects: 39, done.
remote: Counting objects: 100% (36/36), done.
remote: Compressing objects: 100% (36/36), done.
remote: Total 39 (delta 16), reused 0 (delta 0), pack-reused 3 (from 1)
Unpacking objects: 100% (39/39), 17.60 KiB | 1.76 MiB/s, done.


In [None]:
# Convert the model to GGUF FP16 format
!python3 llama.cpp/convert_hf_to_gguf.py Orpo-Llama-3.2-1B-15k

INFO:hf-to-gguf:Loading model: Orpo-Llama-3.2-1B-15k
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {32}
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:token_embd.weight,           torch.float16 --> F16, shape = {2048, 128258}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.float16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.float16 --> F16, shape = {8192, 2048}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.float16 --> F16, shape = {2048, 8192}
INFO:hf-to-gguf:blk.0.ffn_up.weight,         torch.float16 --> F16, shape = {2048, 8192}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,       torch.float16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.float16 --> F16, shape = {2048, 512}
INFO:hf-to-gguf:blk.0.attn_output.weight,    torch.float16 --> F16, shape = {2048, 2048}


In [None]:
# quantize the model to 4-bits (using Q4_K_M method)
!llama.cpp/llama-quantize Orpo-Llama-3.2-1B-15k/Llama-3.2-1B-15k-F16.gguf ./Orpo-Llama-3.2-1B-15k/Orpo-Llama-3.2-1B-15k-Q4_K_M.gguf Q4_K_M

main: build = 4067 (54ef9cfc)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing 'Orpo-Llama-3.2-1B-15k/Llama-3.2-1B-15k-F16.gguf' to './Orpo-Llama-3.2-1B-15k/Orpo-Llama-3.2-1B-15k-Q4_K_M.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 37 key-value pairs and 147 tensors from Orpo-Llama-3.2-1B-15k/Llama-3.2-1B-15k-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 1B
llama_model_loader: - kv   3:                       general.organization str              = Meta Llama
llama_model_loader: - kv   4:                           general.finetune str              = 
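
The loader line above reports the GGUF version, the tensor count, and the number of metadata key-value pairs, all of which sit in the fixed-size header at the start of the file. Below is a minimal sketch of reading that header, assuming the little-endian GGUF v2/v3 layout (4-byte magic `GGUF`, `uint32` version, `uint64` tensor count, `uint64` metadata count). `read_gguf_header` is a hypothetical helper, and the example feeds it a synthetic header rather than the real file.

```python
import io
import struct

def read_gguf_header(stream):
    """Read magic, version, tensor count and KV count from a GGUF stream."""
    header = stream.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    # "<" = little-endian, no padding: uint32 + uint64 + uint64 = 20 bytes
    version, n_tensors, n_kv = struct.unpack("<IQQ", header[4:])
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Synthetic header mirroring the numbers in the log (V3, 147 tensors, 37 KVs).
fake_file = io.BytesIO(b"GGUF" + struct.pack("<IQQ", 3, 147, 37))
header_info = read_gguf_header(fake_file)
print(header_info)
```

To inspect the real file, open it with `open(path, "rb")` and pass the handle to the same function.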

## Command Line Testing

`llama.cpp/llama-cli -m Orpo-Llama-3.2-1B-15k/Orpo-Llama-3.2-1B-15k-Q4_K_M.gguf -cnv -p "You are a helpful assistant"`

Model has been uploaded to [AdamLucek/Orpo-Llama-3.2-1B-15k-Q4_K_M-GGUF](https://huggingface.co/AdamLucek/Orpo-Llama-3.2-1B-15k-Q4_K_M-GGUF)

Alternatively, you can use the [gguf-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space to quickly convert your HF model to GGUF format with any quantization method.