# Quantizing the Qwen1.5-1.8B LLM with llama.cpp: conversion to FP16 GGUF, then 4-bit quantization with Q4_K_M

## Cloning llama.cpp from GitHub

In [1]:
! git clone https://github.com/ggerganov/llama.cpp

Cloning into 'llama.cpp'...
remote: Enumerating objects: 25694, done.
remote: Counting objects: 100% (9361/9361), done.
remote: Compressing objects: 100% (585/585), done.
remote: Total 25694 (delta 9066), reused 8846 (delta 8775), pack-reused 16333
Receiving objects: 100% (25694/25694), 47.57 MiB | 25.02 MiB/s, done.
Resolving deltas: 100% (18277/18277), done.


## Building llama.cpp and installing its requirements for converting an LLM to GGUF

In [2]:
!cd llama.cpp && LLAMA_CUBLAS=1 make && pip install -r requirements/requirements-convert-hf-to-gguf.txt

I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include -DGGML_CUDA_USE_GRAPHS  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/i
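
Before moving on, it can help to confirm that the build produced the binaries the later cells call. A minimal sketch, assuming the Makefile build of this llama.cpp revision places `quantize` and `main` directly in the `llama.cpp` folder (the paths used later in this notebook); `missing_binaries` is a hypothetical helper name:

```python
import os

def missing_binaries(root, names=("quantize", "main")):
    """Return the names under `root` that are not executable files."""
    missing = []
    for name in names:
        path = os.path.join(root, name)
        # A usable binary must exist as a file and carry the executable bit.
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(name)
    return missing

print(missing_binaries("./llama.cpp"))
```

An empty list means both binaries are in place; any name that is printed points to a build step that did not finish.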

## Downloading the LLM from the Hugging Face Hub and defining folders for the original and quantized models in the workspace

In [3]:
from huggingface_hub import snapshot_download

In [4]:
model_name="Qwen/Qwen1.5-1.8B"

In [5]:
base_model="./original_model"

In [6]:
quantized_path = "./quantized_model"

In [7]:
snapshot_download(repo_id=model_name,local_dir=base_model,local_dir_use_symlinks=False)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/download#download-files-to-local-folder.


Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]

README.md:   0%|          | 0.00/2.79k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/138 [00:00<?, ?B/s]

LICENSE:   0%|          | 0.00/7.28k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.29k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.67G [00:00<?, ?B/s]

'/content/original_model'
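
The download log above shows `model.safetensors` at 3.67G, so it is worth checking which precision the checkpoint is actually stored in before converting. A minimal sketch, assuming the downloaded `config.json` follows the usual Transformers layout with a `torch_dtype` field (an assumption; the field may be absent); `stored_dtype` is a hypothetical helper:

```python
import json
import os

def stored_dtype(model_dir):
    """Return the `torch_dtype` recorded in config.json, or None if absent."""
    with open(os.path.join(model_dir, "config.json"), encoding="utf-8") as f:
        config = json.load(f)
    return config.get("torch_dtype")

# Only query the real folder if the download has already happened.
if os.path.isfile("./original_model/config.json"):
    print(stored_dtype("./original_model"))
```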

## Defining the FP16 GGUF output file and converting the original model to FP16 GGUF format

In [8]:
quantized_model_path=quantized_path+'/FP16.gguf'

In [9]:
quantized_model_path

'./quantized_model/FP16.gguf'

In [10]:
!mkdir -p quantized_model

In [11]:
!python llama.cpp/convert-hf-to-gguf.py ./original_model/ --outtype f16 --outfile ./quantized_model/FP16.gguf

INFO:hf-to-gguf:Loading model: original_model
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 32768
INFO:hf-to-gguf:gguf: embedding length = 2048
INFO:hf-to-gguf:gguf: feed forward length = 5504
INFO:hf-to-gguf:gguf: head count = 16
INFO:hf-to-gguf:gguf: key-value head count = 16
INFO:hf-to-gguf:gguf: rope theta = 1000000.0
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-06
INFO:hf-to-gguf:gguf: file type = 1
INFO:hf-to-gguf:Set model tokenizer
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:gguf.vocab:Adding 151387 merge(s).
INFO:gguf.vocab:Setting special token type eos to
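
A quick sanity check on the converted file is to read its header. The sketch below assumes the GGUF version 3 layout (a 4-byte `GGUF` magic, a `uint32` version, then `uint64` tensor and metadata key-value counts, all little-endian, consistent with the "Little Endian only" log line above); `read_gguf_header` is a hypothetical helper, not part of llama.cpp:

```python
import struct

def read_gguf_header(path):
    """Read magic, version, tensor count and metadata KV count from a GGUF file."""
    with open(path, "rb") as f:
        raw = f.read(24)  # 4 (magic) + 4 (version) + 8 + 8 (counts)
    if len(raw) < 24:
        raise ValueError("file too short to be GGUF")
    magic, version, n_tensors, n_kv = struct.unpack("<4sIQQ", raw)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}
```

Calling `read_gguf_header("./quantized_model/FP16.gguf")` returns values that can be compared with the `llama_model_loader` line printed when a model is loaded.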

## Further quantizing the model to 4-bit using the Q4_K_M method

In [12]:
methods=["q4_k_m"]

In [13]:
quantized_path = "./quantized_model"

In [14]:
import os

for method in methods:
  qtype=f"{quantized_path}/{method.upper()}.gguf"
  print(qtype)

./quantized_model/Q4_K_M.gguf


In [15]:
import subprocess

for method in methods:
  qtype=f"{quantized_path}/{method.upper()}.gguf"
  # Quantize the FP16 GGUF file; check=True raises if the quantize binary fails
  subprocess.run(["./llama.cpp/quantize", f"{quantized_path}/FP16.gguf", qtype, method], check=True)
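
To get a feel for what the 4-bit step saves, the file sizes reported in this notebook's own progress bars (3.67G for the downloaded `model.safetensors`, 1.22G for the uploaded `Q4_K_M.gguf`) can be turned into approximate bits per weight. This is back-of-the-envelope arithmetic: it assumes a nominal 1.8 billion parameters taken from the model name, treats "G" as 10^9 bytes, and ignores metadata overhead:

```python
def bits_per_weight(size_bytes, n_params):
    """Average number of bits stored per parameter."""
    return size_bytes * 8 / n_params

N_PARAMS = 1.8e9         # assumption: nominal count from the model name
ORIGINAL_BYTES = 3.67e9  # model.safetensors size from the download log
Q4_K_M_BYTES = 1.22e9    # Q4_K_M.gguf size from the upload log

print(f"original: {bits_per_weight(ORIGINAL_BYTES, N_PARAMS):.1f} bits/weight")
print(f"Q4_K_M:   {bits_per_weight(Q4_K_M_BYTES, N_PARAMS):.1f} bits/weight")
print(f"size ratio: {ORIGINAL_BYTES / Q4_K_M_BYTES:.2f}x")
```

Under these assumptions the downloaded checkpoint comes out at roughly 16 bits per weight and the Q4_K_M file at roughly 5.4, about a 3x reduction in size.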

## Finally, using the quantized model to chat with the bot

In [16]:
! /content/llama.cpp/main -m ./quantized_model/Q4_K_M.gguf -n 90 --repeat_penalty 1.0 --color -i -r "User:" -f llama.cpp/prompts/chat-with-bob.txt

Log start
main: build = 3043 (972b555a)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1717061848
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from ./quantized_model/Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = original_model
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 24
llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 2048
llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 5504
llama_model_loader: - kv   
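
The `-i` flag above starts an interactive session, which can be awkward inside a notebook cell. A hedged alternative is a one-shot generation driven from Python. The sketch below only builds the command; it assumes the `main` binary of this llama.cpp revision accepts `-p` for an inline prompt alongside the `-m` and `-n` flags already used above, and `build_main_command` is a hypothetical helper:

```python
import subprocess

def build_main_command(model_path, prompt, n_predict=90,
                       binary="./llama.cpp/main"):
    """Build an argument list for a single, non-interactive generation."""
    return [binary, "-m", model_path, "-n", str(n_predict), "-p", prompt]

cmd = build_main_command("./quantized_model/Q4_K_M.gguf", "Hello, who are you?")
print(cmd)
# To actually run it (requires the built binary and the model file):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains spaces or quotes.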

## Pushing the quantized model to the Hugging Face Hub

In [17]:
from huggingface_hub import notebook_login

In [23]:
notebook_login()

In [19]:
from huggingface_hub import HfApi, create_repo

In [20]:
model_path="/content/quantized_model/Q4_K_M.gguf"

In [21]:
repo_name="qwen1.5-llm-quantized"

In [24]:
repo_url=create_repo(repo_name,private=False)

In [25]:
api=HfApi()

In [27]:
api.upload_file(
    path_or_fileobj=model_path,
    path_in_repo="Q4_K_M.gguf",
    repo_id="Kiranmai97/qwen1.5-llm-quantized",
    repo_type="model",
)

Q4_K_M.gguf:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Kiranmai97/qwen1.5-llm-quantized/commit/fa5e327450c1d9f7b5c458add2fb9f9158d5f411', commit_message='Upload Q4_K_M.gguf with huggingface_hub', commit_description='', oid='fa5e327450c1d9f7b5c458add2fb9f9158d5f411', pr_url=None, pr_revision=None, pr_num=None)
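
After the upload it is easy to confirm that the file is really in the repository. A minimal sketch using `HfApi.list_repo_files`; the helper name `is_uploaded` is hypothetical, and it takes the API object as a parameter so it can be exercised without network access:

```python
def is_uploaded(api, repo_id, filename):
    """Return True if `filename` appears in the repo's file listing."""
    # `api` is expected to be a huggingface_hub.HfApi instance.
    return filename in api.list_repo_files(repo_id, repo_type="model")
```

With the `api` object created above, `is_uploaded(api, "Kiranmai97/qwen1.5-llm-quantized", "Q4_K_M.gguf")` should return `True` once the commit has gone through.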