# Quantize an LLM to GGUF format with llama.cpp

Reference: [llama.cpp](https://github.com/ggerganov/llama.cpp)

## Quantization methods

Quantization method names follow the convention "q" + the number of bits + the variant used. The table below lists the methods described in model cards by [TheBloke](https://huggingface.co/TheBloke/):

| Quantization method | Remarks |
| ------------------- | ------- |
| `q2_k` | Uses Q4_K for the attention.wv and feed_forward.w2 tensors, Q2_K for the other tensors |
| `q3_k_l` | Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K |
| `q3_k_m` | Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K |
| `q3_k_s` | Uses Q3_K for all tensors |
| `q4_0` | Original quant method, 4-bit |
| `q4_1` | Higher accuracy than q4_0 but lower than q5_0; faster inference than the q5 methods |
| `q4_k_m` | Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K **(recommended)** |
| `q4_k_s` | Uses Q4_K for all tensors |
| `q5_0` | Higher accuracy, higher resource usage and slower inference |
| `q5_1` | Even higher accuracy, resource usage and slower inference |
| `q5_k_m` | Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K **(recommended)** |
| `q5_k_s` | Uses Q5_K for all tensors |
| `q6_k` | Uses Q8_K for all tensors |
| `q8_0` | Almost indistinguishable from float16. High resource use and slow. Not recommended for most users |
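
The naming convention can be made concrete with a small parser. This is a minimal sketch: `parse_quant_method` is a hypothetical helper, not part of llama.cpp, and it assumes only the "q" + bits + variant pattern described above.

```python
import re

# Hypothetical helper (not a llama.cpp API): matches names such as "q4_k_m".
_QUANT_NAME = re.compile(r"^q(\d+)_(\w+)$")


def parse_quant_method(name):
    """Split a quantization method name into (bits, variant)."""
    match = _QUANT_NAME.match(name.lower())
    if match is None:
        raise ValueError(f"unrecognized quantization method: {name!r}")
    return int(match.group(1)), match.group(2)


for method in ("q2_k", "q4_0", "q4_k_m", "q8_0"):
    bits, variant = parse_quant_method(method)
    print(f"{method}: {bits}-bit, variant {variant}")
```

The lookup is case-insensitive, so the upper-case form used later in the output file name (for example `Q4_K_M`) parses the same way.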

In [None]:
# Install llama.cpp
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make
!pip install -r llama.cpp/requirements.txt

Cloning into 'llama.cpp'...
remote: Enumerating objects: 15726, done.
remote: Counting objects: 100% (5714/5714), done.
remote: Compressing objects: 100% (443/443), done.
remote: Total 15726 (delta 5499), reused 5377 (delta 5270), pack-reused 10012
Receiving objects: 100% (15726/15726), 18.25 MiB | 23.34 MiB/s, done.
Resolving deltas: 100% (10994/10994), done.
Already up to date.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion 
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn

In [None]:
quant_method = "q4_k_m"

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

model_path = model_name.split("/")[-1]

In [None]:
# Download model
!git lfs install
!git clone https://huggingface.co/{model_name}

Git LFS initialized.
Cloning into 'TinyLlama-1.1B-Chat-v1.0'...
remote: Enumerating objects: 54, done.
remote: Counting objects: 100% (51/51), done.
remote: Compressing objects: 100% (51/51), done.
remote: Total 54 (delta 18), reused 0 (delta 0), pack-reused 3
Unpacking objects: 100% (54/54), 517.46 KiB | 4.07 MiB/s, done.


In [None]:
# Convert to fp16
fp16_path = f"{model_path}/{model_path.lower()}.fp16.bin"
!python llama.cpp/convert.py {model_path} --outtype f16 --outfile {fp16_path}

/content/llama.cpp/gguf-py
Loading model file TinyLlama-1.1B-Chat-v1.0/model.safetensors
params = Params(n_vocab=32000, n_embd=2048, n_layer=22, n_ctx=2048, n_ff=5632, n_head=32, n_head_kv=4, f_norm_eps=1e-05, n_experts=None, n_experts_used=None, rope_scaling_type=None, f_rope_freq_base=10000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('TinyLlama-1.1B-Chat-v1.0'))
Loading vocab file 'TinyLlama-1.1B-Chat-v1.0/tokenizer.model', type 'spm'
Permuting layer 0
Permuting layer 1
Permuting layer 2
Permuting layer 3
Permuting layer 4
Permuting layer 5
Permuting layer 6
Permuting layer 7
Permuting layer 8
Permuting layer 9
Permuting layer 10
Permuting layer 11
Permuting layer 12
Permuting layer 13
Permuting layer 14
Permuting layer 15
Permuting layer 16
Permuting layer 17
Permuting layer 18
Permuting layer 19
Permuting layer 20
Permuting layer 21
lm_head.weight                                   -> output.weight              

In [None]:
# Quantize the model with quant_method
quant_path = f"{model_path}/{model_path.lower()}.{quant_method.upper()}.gguf"
!./llama.cpp/quantize {fp16_path} {quant_path} {quant_method}

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: yes
main: build = 1833 (326b418)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing 'TinyLlama-1.1B-Chat-v1.0/tinyllama-1.1b-chat-v1.0.fp16.bin' to 'TinyLlama-1.1B-Chat-v1.0/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 21 key-value pairs and 201 tensors from TinyLlama-1.1B-Chat-v1.0/tinyllama-1.1b-chat-v1.0.fp16.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                       llama.context_length u32              = 204
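
To see how much smaller the quantized file is than the fp16 file, the two file sizes can be compared. This is a minimal sketch using only the standard library; `size_mib` and `compression_ratio` are hypothetical helper names, and the paths passed in are assumed to be the `fp16_path` and `quant_path` values defined in the cells above.

```python
import os


def size_mib(path):
    """Return the size of the file at `path` in MiB."""
    return os.path.getsize(path) / 2**20


def compression_ratio(original_path, quantized_path):
    """Return how many times smaller the quantized file is than the original."""
    return os.path.getsize(original_path) / os.path.getsize(quantized_path)
```

In the notebook, `compression_ratio(fp16_path, quant_path)` would report the size reduction for the chosen `quant_method`.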

## Run inference

Test the quantized model. To speed up inference, pass `--n-gpu-layers` (short form `-ngl`) to offload layers to the GPU.

In [None]:
prompt = """<|system|>
You are a friendly chatbot who always responds in the style of a pirate.</s>
<|user|>
How many helicopters can a human eat in one sitting?</s>
<|assistant|>"""

In [None]:
!./llama.cpp/main -m {quant_path} -n 128 --color -ngl 35 -p "{prompt}"

Log start
main: build = 1833 (326b418)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1705044390
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: yes
llama_model_loader: loaded meta data with 22 key-value pairs and 201 tensors from TinyLlama-1.1B-Chat-v1.0/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 2048
llama_model_loader
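
The shell invocation above interpolates the prompt inside double quotes, which may break if the prompt itself contains a double quote or other shell-special characters. As a hedged alternative, the same command can be built as an argument list and run without a shell. `build_main_command` and `run_inference` are hypothetical helpers; the binary path and flags are the ones used in the cell above, and the defaults mirror its values.

```python
import subprocess


def build_main_command(model_file, prompt, n_predict=128, n_gpu_layers=35):
    """Build the argument list for the llama.cpp `main` binary."""
    return [
        "./llama.cpp/main",
        "-m", model_file,            # path to the quantized GGUF file
        "-n", str(n_predict),        # number of tokens to generate
        "--color",
        "-ngl", str(n_gpu_layers),   # layers to offload to the GPU
        "-p", prompt,                # passed as one argument, no quoting needed
    ]


def run_inference(model_file, prompt):
    """Run `main` without a shell; raises CalledProcessError on failure."""
    return subprocess.run(build_main_command(model_file, prompt), check=True)
```

In the notebook, `run_inference(quant_path, prompt)` would stand in for the `!./llama.cpp/main ...` line.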

## Push to hub

To push the model to the Hugging Face Hub, add your Hugging Face token to Colab's "Secrets" tab under the name `HF_TOKEN`.

In [None]:
# !pip install -qU huggingface_hub

In [None]:
from huggingface_hub import HfApi
from google.colab import userdata

username = "kesamet"

# Authenticate with the token stored in Colab's "Secrets" tab
api = HfApi(token=userdata.get("HF_TOKEN"))

# Create the target repo through the authenticated client
api.create_repo(
    repo_id=f"{username}/{model_path}-GGUF",
    repo_type="model",
    exist_ok=True,
)

In [None]:
api.upload_folder(
    folder_path=model_path,
    repo_id=f"{username}/{model_path}-GGUF",
    allow_patterns="*.gguf",
)

tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf:   0%|          | 0.00/668M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/kesamet/TinyLlama-1.1B-Chat-v1.0-GGUF/commit/bfcf0fb89756864d966ea741bce8e9d48bcf8c64', commit_message='Upload folder using huggingface_hub', commit_description='', oid='bfcf0fb89756864d966ea741bce8e9d48bcf8c64', pr_url=None, pr_revision=None, pr_num=None)