
## Let's go step by step through quantizing a model with the llama.cpp tool.
## In this tutorial, we'll apply quantization to a "google/gemma-2b-it" model; the cells below load it from the "mirajbhandari/gemma-2b-it_yungri_final" repository.

#### First, check whether your model's architecture is supported: https://github.com/ggerganov/llama.cpp
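
One way to find the name to look for in the supported-models list is to read the `architectures` field of the model's `config.json`. The helper below is a minimal sketch; `model_architectures` is a hypothetical name, and it assumes the model files are already on disk (for example in `./original_model/` after step 3).

```python
import json
from pathlib import Path


def model_architectures(model_dir):
    """Return the `architectures` list from a Hugging Face config.json.

    Hypothetical helper: returns an empty list if the field is absent.
    """
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text())
    return config.get("architectures", [])


# Example usage once the model has been downloaded (step 3):
# print(model_architectures("./original_model/"))
```

Gemma checkpoints typically report `GemmaForCausalLM` here, which can then be compared against the list in the llama.cpp README.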

**1) First, we need to clone the llama.cpp repository:**

In [None]:
!git clone https://github.com/ggerganov/llama.cpp

**2) This command compiles the llama.cpp binaries with CUDA support for GPU acceleration and installs the Python dependencies required for the conversion scripts.**

In [None]:
!cd llama.cpp && LLAMA_CUBLAS=1 make && pip install -r requirements.txt

Cloning into 'llama.cpp'...
remote: Enumerating objects: 20666, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 20666 (delta 0), reused 1 (delta 0), pack-reused 20663
Receiving objects: 100% (20666/20666), 24.76 MiB | 20.14 MiB/s, done.
Resolving deltas: 100% (14549/14549), done.
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual 
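
Before moving on, it can help to confirm that the build produced the binaries used later. The sketch below assumes the `make` build places executables named `quantize` and `main` at the top level of the `llama.cpp` directory; those names are assumptions and may differ in other versions of the repository. `missing_binaries` is a hypothetical helper.

```python
from pathlib import Path


def missing_binaries(repo_dir="llama.cpp", names=("quantize", "main")):
    """Return the expected binary names that are not present in repo_dir.

    The default names are assumptions about this version of llama.cpp.
    """
    repo = Path(repo_dir)
    return [name for name in names if not (repo / name).is_file()]


# Example usage after the build cell:
# print(missing_binaries())  # an empty list means both files were found
```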

**3) Next, we download the model from the Hugging Face Hub using `snapshot_download`:**

In [None]:
from huggingface_hub import snapshot_download

model_name = "mirajbhandari/gemma-2b-it_yungri_final"
base_model = "./original_model/"
snapshot_download(repo_id=model_name, local_dir=base_model, local_dir_use_symlinks=False)

Fetching 11 files:   0%|          | 0/11 [00:00<?, ?it/s]

.gitattributes:   0%|          | 0.00/1.57k [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.18k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/522 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/13.5k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.16k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/67.1M [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

'/content/original_model'
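
Because the weights arrive as several shards, a quick completeness check can catch an interrupted download. The sketch below reads the `weight_map` in `model.safetensors.index.json` and reports any shard file that is not on disk; `missing_shards` is a hypothetical helper, not part of `huggingface_hub`.

```python
import json
from pathlib import Path


def missing_shards(model_dir):
    """List shard files named in the safetensors index that are absent on disk."""
    model_dir = Path(model_dir)
    index = json.loads((model_dir / "model.safetensors.index.json").read_text())
    # weight_map maps each tensor name to the shard file that stores it
    shards = sorted(set(index["weight_map"].values()))
    return [shard for shard in shards if not (model_dir / shard).is_file()]


# Example usage:
# print(missing_shards("./original_model/"))  # an empty list means all shards exist
```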

**4) Next, we make the model compatible with llama.cpp by converting it to the GGUF format. We also choose a precision level (here f16, half-precision floating point) and specify where to save the converted file.**

#### Choosing a lower precision level, such as 8-bit or 4-bit, shrinks the file and its memory footprint but can reduce the quality of the model's output.

In [None]:
!mkdir -p ./quantized_model/
!python llama.cpp/convert-hf-to-gguf.py ./original_model/ --outtype f16 --outfile ./quantized_model/FP16.gguf

Loading model: original_model
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
gguf: Setting special token type bos to 2
gguf: Setting special token type eos to 1
gguf: Setting special token type unk to 3
gguf: Setting special token type pad to 1
gguf: Setting add_bos_token to True
gguf: Setting add_eos_token to False
gguf: Setting chat_template to {{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '
' + message['content'] | trim + '<end_of_turn>
' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model
'}}{% endif %}
Exporting model to
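
To confirm that the conversion wrote a valid file, one option is to read the first bytes of the output. A GGUF file starts with the 4-byte magic `GGUF` followed by a 32-bit version number, little-endian for files like the one produced here. The sketch below checks only that header; `read_gguf_version` is a hypothetical helper.

```python
import struct


def read_gguf_version(path):
    """Return the GGUF version from the file header, or raise if the magic is wrong."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"{path} does not start with the GGUF magic: {magic!r}")
        # Version is an unsigned 32-bit little-endian integer right after the magic
        (version,) = struct.unpack("<I", f.read(4))
    return version


# Example usage:
# print(read_gguf_version("./quantized_model/FP16.gguf"))
```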

**5) We use the quantization method q4_k_m, which is specified in the `methods` list. This is a 4-bit "k-quant" method: according to the pull request linked below, the `M` (medium) variant stores most weights in the 4-bit Q4_K type and keeps part of the attention and feed-forward tensors in the higher-precision Q6_K type to limit quality loss.**

##### For other methods, see https://github.com/ggerganov/llama.cpp/pull/1684

`Model--> F16,	Q2_K,	Q3_K_S,	Q3_K_M,	Q3_K_L,	Q4_K_S,	Q4_K_M,	Q5_K_S,	Q5_K_M,	Q6_K`
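
A rough size estimate for each method follows from parameters × bits per weight ÷ 8. The sketch below uses the bits-per-weight figures for the underlying tensor types as I recall them from the pull request (Q4_K at 4.5, Q6_K at 6.5625); the parameter count of 2.5 billion is a round-number assumption for illustration, and metadata overhead is ignored.

```python
# Approximate bits per weight for a few tensor types (assumed values, see lead-in)
BITS_PER_WEIGHT = {"F16": 16.0, "Q6_K": 6.5625, "Q4_K": 4.5}


def estimated_size_gb(n_params, bits_per_weight):
    """Estimate file size in GB (10^9 bytes) from parameter count and bits per weight."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 2.5e9  # hypothetical round number, not a measured value
for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{estimated_size_gb(n_params, bpw):.2f} GB")
```

Since Q4_K_M mixes Q4_K and Q6_K tensors, its real size should fall between the Q4_K and Q6_K estimates.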

In [None]:
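import os
import subprocess

methods = ["q4_k_m"]
quantized_path = "./quantized_model"

for m in methods:
    # Output file is named after the method, e.g. ./quantized_model/Q4_K_M.gguf
    qtype = os.path.join(quantized_path, f"{m.upper()}.gguf")
    # Arguments to quantize: <input gguf> <output gguf> <method>; check=True raises on failure
    subprocess.run(["./llama.cpp/quantize", os.path.join(quantized_path, "FP16.gguf"), qtype, m], check=True)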
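
After quantization, comparing the two file sizes gives a quick sanity check that the output is actually smaller. This is a minimal sketch; `compression_ratio` is a hypothetical helper, and the paths in the usage comment assume the file names used in this tutorial.

```python
import os


def compression_ratio(original_path, quantized_path):
    """Return how many times smaller the quantized file is than the original."""
    return os.path.getsize(original_path) / os.path.getsize(quantized_path)


# Example usage:
# ratio = compression_ratio("./quantized_model/FP16.gguf", "./quantized_model/Q4_K_M.gguf")
# print(f"Quantized file is {ratio:.1f}x smaller")
```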

**6) Read the Hugging Face token from the Colab secrets and expose it as an environment variable:**

In [None]:
from google.colab import userdata
os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')
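
The cell above only works inside Google Colab. For running the notebook elsewhere, a fallback along the following lines may be useful: it tries the `HF_TOKEN` environment variable first, then Colab secrets, then an interactive prompt. `get_hf_token` is a hypothetical helper, not part of any library.

```python
import os
from getpass import getpass


def get_hf_token():
    """Return a Hugging Face token from the environment, Colab secrets, or a prompt."""
    token = os.environ.get("HF_TOKEN")
    if token:
        return token
    try:
        # Only available inside Google Colab; fails elsewhere or if the secret is unset
        from google.colab import userdata
        token = userdata.get("HF_TOKEN")
    except Exception:
        token = None
    # Fall back to asking the user without echoing the input
    return token or getpass("Hugging Face token: ")


# Example usage:
# os.environ["HF_TOKEN"] = get_hf_token()
```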


## Uploading the GGUF Model to Hugging Face

In [None]:
from huggingface_hub import HfApi, create_repo

# Upload the quantized file produced in step 5, not the FP16 intermediate
model_path = "/content/quantized_model/Q4_K_M.gguf"
repo_name = "gemma-2b-it-GGUF-quantizedd"

# exist_ok=True avoids an error if the repo has already been created
create_repo(repo_name, private=False, exist_ok=True)

api = HfApi()
api.upload_file(
    path_or_fileobj=model_path,
    path_in_repo="Q4_K_M.gguf",  # file name shown on Hugging Face
    repo_id="mirajbhandari/gemma-2b-it-GGUF-quantizedd",  # change to <your-username>/<repo_name>
    repo_type="model",
)


FP16.gguf:   0%|          | 0.00/10.0G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/mirajbhandari/gemma-2b-it-GGUF-quantizedd/commit/298516cb50e1935ec4e34797c644258ae01ff33c', commit_message='Upload Q4_K_M.gguf with huggingface_hub', commit_description='', oid='298516cb50e1935ec4e34797c644258ae01ff33c', pr_url=None, pr_revision=None, pr_num=None)
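
To double-check that the uploaded file matches the local one, one option is to compute the local SHA-256 and compare it by hand with the checksum the Hub displays on the file's page for large files. The sketch below covers only the local side; `sha256_of_file` is a hypothetical helper.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example usage:
# print(sha256_of_file("/content/quantized_model/Q4_K_M.gguf"))
```

Reading in chunks keeps memory use low even for multi-gigabyte GGUF files.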