<a href="https://colab.research.google.com/github/harshyadav1508/GGUF_Quantization_of_LLM/blob/main/gguf_quant_simple.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install llama.cpp
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make
!pip install -r llama.cpp/requirements.txt

In [1]:
# Variables
MODEL_ID = "Qwen/Qwen1.5-1.8B"
QUANTIZATION_METHODS = ["q4_k_m", "q5_k_m"]

# Constants
MODEL_NAME = MODEL_ID.split('/')[-1]

In [2]:
print(f"our MODEL_NAME is : {MODEL_NAME}")
print(f"our MODEL_ID is : {MODEL_ID}")

our MODEL_NAME is : Qwen1.5-1.8B
our MODEL_ID is : Qwen/Qwen1.5-1.8B


In [4]:
# Download model
!git lfs install
!git clone https://huggingface.co/{MODEL_ID}

Git LFS initialized.
Cloning into 'Qwen1.5-1.8B'...
remote: Enumerating objects: 70, done.[K
remote: Counting objects: 100% (66/66), done.[K
remote: Compressing objects: 100% (66/66), done.[K
remote: Total 70 (delta 31), reused 0 (delta 0), pack-reused 4[K
Unpacking objects: 100% (70/70), 3.61 MiB | 4.99 MiB/s, done.


In [16]:
# Convert to fp16
fp16 = f"{MODEL_NAME}/{MODEL_NAME.lower()}.fp16.bin"
print(f"our fp16 model will be : {fp16}\n")

!python llama.cpp/convert-hf-to-gguf.py {MODEL_NAME} --outtype f16 --outfile {fp16}

our fp16 model will be : Qwen1.5-1.8B/qwen1.5-1.8b.fp16.bin

Loading model: Qwen1.5-1.8B
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
gguf: Adding 151387 merge(s).
gguf: Setting special token type eos to 151643
gguf: Setting special token type pad to 151643
gguf: Setting special token type bos to 151643
gguf: Setting chat_template to {% for message in messages %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}
Exporting model to 'Qwen1.5-1.8B/qwen1.5-1.8b.fp16.bin'
gguf: loading model part 'model.safetensors'
output.weight, n_dims = 2, torch.bfloat16 --> float16
token_embd.weight, n_dims = 2, torch.bfloat16 --> float16
blk.0.attn_norm.weight, n_dims = 1, torch.bfloat16 --> float32
blk.0.ffn_down.weight, n_dims 

In [17]:
# Quantize the model for each method in the QUANTIZATION_METHODS list
for method in QUANTIZATION_METHODS:
    qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{method.upper()}.gguf"
    !./llama.cpp/quantize {fp16} {qtype} {method}

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: yes
main: build = 2442 (d84c4850)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing 'Qwen1.5-1.8B/qwen1.5-1.8b.fp16.bin' to 'Qwen1.5-1.8B/qwen1.5-1.8b.Q4_K_M.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from Qwen1.5-1.8B/qwen1.5-1.8b.fp16.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = Qwen1.5-1.8B
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 24
llama_model_loader: - kv   3:                       qwen2.co

In [18]:
!cd Qwen1.5-1.8B; du -sh qwen1*


3.5G	qwen1.5-1.8b.fp16.bin
1.2G	qwen1.5-1.8b.Q4_K_M.gguf
1.3G	qwen1.5-1.8b.Q5_K_M.gguf


In [19]:
import os

model_list = [file for file in os.listdir(MODEL_NAME) if "gguf" in file]

print(f"gguf model which we are going to run: {model_list}")

prompt = input("Enter your prompt: ")
chosen_method = input("Name of the model (options: " + ", ".join(model_list) + "): ")

# Verify the chosen method is in the list
if chosen_method not in model_list:
    print("Invalid name")
else:
    qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{method.upper()}.gguf"
    print(f"new quatized model will be {qtype}")
    !./llama.cpp/main -m {qtype} -n 128 --color -ngl 35 -p "{prompt}"

gguf model which we are going to run: ['qwen1.5-1.8b.Q4_K_M.gguf', 'qwen1.5-1.8b.Q5_K_M.gguf']
Enter your prompt: funny quote
Name of the model (options: qwen1.5-1.8b.Q4_K_M.gguf, qwen1.5-1.8b.Q5_K_M.gguf): qwen1.5-1.8b.Q4_K_M.gguf
new quatized model will be Qwen1.5-1.8B/qwen1.5-1.8b.Q5_K_M.gguf
Log start
main: build = 2442 (d84c4850)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1710593447
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: yes
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from Qwen1.5-1.8B/qwen1.5-1.8b.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:   