
## Let's go step by step through the procedure of quantizing a model using the llama.cpp tool.
## In this tutorial, we'll be applying quantization to the "google/gemma-2b-it" model.

#### First, check whether your model architecture is supported by llama.cpp: https://github.com/ggerganov/llama.cpp

**1) First, clone the llama.cpp repository:**

In [None]:
!git clone https://github.com/ggerganov/llama.cpp

**2) Next, compile the llama.cpp binaries with CUDA support for GPU acceleration and install the Python dependencies required for the conversion step:**

In [1]:
!cd llama.cpp && LLAMA_CUBLAS=1 make && pip install -r requirements.txt

**3) Next, download the model from the Hugging Face Hub using `snapshot_download`:**

In [2]:
from huggingface_hub import snapshot_download

model_name = "mirajbhandari/gemma-2b-it_yungri_final"
base_model = "./original_model/"
snapshot_download(repo_id=model_name, local_dir=base_model, local_dir_use_symlinks=False)
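
Before converting, it can help to confirm that the snapshot actually contains a config file and weight files. The helper below is a minimal sketch: `summarize_snapshot` is a hypothetical name, and the file names it looks for (`config.json`, `*.safetensors`, `*.bin`) are assumptions about a typical Hugging Face checkpoint layout, not something the conversion script guarantees.

```python
from pathlib import Path

def summarize_snapshot(model_dir):
    """Return (has_config, weight_files) for a downloaded model directory."""
    root = Path(model_dir)
    has_config = (root / "config.json").is_file()
    # Assumed weight extensions for a typical Hugging Face checkpoint
    weight_files = sorted(
        p.name for p in root.iterdir()
        if p.is_file() and p.suffix in {".safetensors", ".bin"}
    )
    return has_config, weight_files

# Only inspect the directory if the download step has already run
if Path("./original_model/").is_dir():
    print(summarize_snapshot("./original_model/"))
```

If `has_config` is false or the weight list is empty, the download probably did not complete.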

**4) Next, we make the model compatible with llama.cpp by converting it to the GGUF file format. We also choose an output precision (here f16, half-precision floating point) and specify where to save the converted file.**

#### Note: choosing a lower precision level, such as 8-bit or 4-bit, makes the model smaller but can reduce the quality of its outputs.
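
To make the precision trade-off concrete, here is a toy illustration of how rounding error grows as the bit width shrinks. It uses plain symmetric round-to-nearest quantization on one block of random values. This is an assumption chosen for simplicity and is not the scheme llama.cpp implements; the function names are hypothetical.

```python
import random

def quantize_roundtrip(values, bits):
    """Quantize one block to signed `bits`-bit integers, then dequantize."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit
    largest = max(abs(v) for v in values)
    scale = largest / qmax if largest > 0 else 1.0
    return [round(v / scale) * scale for v in values]

def mean_abs_error(values, bits):
    """Average absolute difference between original and restored values."""
    restored = quantize_roundtrip(values, bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)

rng = random.Random(0)                       # fixed seed for repeatability
weights = [rng.gauss(0.0, 1.0) for _ in range(256)]

err8 = mean_abs_error(weights, 8)
err4 = mean_abs_error(weights, 4)
print(f"8-bit mean abs error: {err8:.5f}")
print(f"4-bit mean abs error: {err4:.5f}")
```

The 4-bit grid has far fewer levels than the 8-bit one, so each value lands further from its original on average.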

In [3]:
!mkdir -p ./quantized_model/
!python llama.cpp/convert-hf-to-gguf.py ./original_model/ --outtype f16 --outfile ./quantized_model/FP16.gguf
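
A quick sanity check on the converted file is to read its header. GGUF files begin with the four ASCII bytes `GGUF` followed by a 32-bit version number. The sketch below assumes a little-endian file, and `read_gguf_header` is a hypothetical helper rather than part of llama.cpp.

```python
import os
import struct

def read_gguf_header(path):
    """Return the GGUF version if the file starts with the GGUF magic, else None."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        return None
    # Assumes the version field is stored little-endian
    (version,) = struct.unpack("<I", header[4:8])
    return version

fp16_path = "./quantized_model/FP16.gguf"
if os.path.exists(fp16_path):
    print("GGUF version:", read_gguf_header(fp16_path))
```

A return value of `None` suggests the conversion did not produce a valid GGUF file.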

**5) Now we quantize the FP16 file using the q4_k_m method, which is specified in the `methods` list. This is a 4-bit "k-quant" method; the `M` suffix marks the medium variant, which keeps a few tensors at higher precision to preserve quality.**

##### For other methods, see https://github.com/ggerganov/llama.cpp/pull/1684

`Model --> F16, Q2_K, Q3_K_S, Q3_K_M, Q3_K_L, Q4_K_S, Q4_K_M, Q5_K_S, Q5_K_M, Q6_K`

In [None]:
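import os
import subprocess

methods = ["q4_k_m"]
quantized_path = "./quantized_model"

for m in methods:
    # Output file is named after the method, e.g. Q4_K_M.gguf
    qtype = os.path.join(quantized_path, f"{m.upper()}.gguf")
    fp16 = os.path.join(quantized_path, "FP16.gguf")
    # check=True raises an error if the quantize binary fails
    subprocess.run(["./llama.cpp/quantize", fp16, qtype, m], check=True)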
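
After quantizing, comparing file sizes is a simple way to confirm that the step did something. The sketch below assumes the file names used in this notebook (`FP16.gguf` and `Q4_K_M.gguf`); `gguf_sizes` and `compression_ratio` are hypothetical helper names.

```python
import os

def gguf_sizes(directory):
    """Map each .gguf file name in `directory` to its size in bytes."""
    return {
        name: os.path.getsize(os.path.join(directory, name))
        for name in sorted(os.listdir(directory))
        if name.endswith(".gguf")
    }

def compression_ratio(sizes, baseline="FP16.gguf", quantized="Q4_K_M.gguf"):
    """How many times smaller the quantized file is than the baseline."""
    return sizes[baseline] / sizes[quantized]

# Only report if both files from the earlier steps are present
if os.path.isdir("./quantized_model"):
    sizes = gguf_sizes("./quantized_model")
    if "FP16.gguf" in sizes and "Q4_K_M.gguf" in sizes:
        print(sizes)
        print(f"FP16 is {compression_ratio(sizes):.2f}x larger than Q4_K_M")
```

A ratio close to 1 would suggest that the quantized file was not written as expected.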

**6) Get your Hugging Face token from the Colab secrets and expose it as an environment variable:**

In [None]:
import os
from google.colab import userdata

# Reads the secret named HF_TOKEN stored in Colab
os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')


## Uploading GGUF Model To Hugging Face

In [4]:
from huggingface_hub import HfApi, create_repo

# Upload the quantized file, not the FP16 intermediate
model_path = "/content/quantized_model/Q4_K_M.gguf"
repo_name = "gemma-2b-it-GGUF-quantizedd"

# exist_ok=True avoids an error if the repo already exists
repo_url = create_repo(repo_name, private=False, exist_ok=True)

api = HfApi()
api.upload_file(
    path_or_fileobj=model_path,
    path_in_repo="Q4_K_M.gguf", # file name on Hugging Face
    repo_id="mirajbhandari/gemma-2b-it-GGUF-quantizedd", # change to your own repo
    repo_type="model",
)
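
When several quantization methods are produced, deriving the local path and the repo file name from the same method string keeps the two from drifting apart. The helper below is a sketch under the naming convention used in this notebook (`<METHOD>.gguf`); `plan_uploads` is a hypothetical name, and the actual upload call is left as a comment because it needs network access and a valid token.

```python
import os

def plan_uploads(methods, quantized_dir):
    """Pair each quantization method with its local file and repo file name."""
    plan = []
    for m in methods:
        filename = f"{m.upper()}.gguf"
        plan.append((os.path.join(quantized_dir, filename), filename))
    return plan

for local_path, repo_filename in plan_uploads(["q4_k_m"], "./quantized_model"):
    print(local_path, "->", repo_filename)
    # With an HfApi instance `api` and your own repo_id, each pair could be
    # passed to api.upload_file(path_or_fileobj=local_path,
    #                           path_in_repo=repo_filename, ...)
```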
