# GGUF
GGUF, the successor to the older GGML format, is popular in the community because it runs efficiently on CPUs and Apple devices and can offload layers to a GPU when one is available. This makes it a good choice for local testing and deployment, since it can use both RAM and, if present, VRAM.
### Quantizing with [llama.cpp](https://github.com/ggerganov/llama.cpp)
Here is a list of possible quantizations with llama.cpp:
- `q2_k`
- `q3_k_l`
- `q3_k_m`
- `q3_k_s`
- `q4_0`: 4-bit
- `q4_1`
- `q4_k_s`
- `q4_k_m`: recommended
- `q5_0`: 5-bit
- `q5_1`
- `q5_k_s`
- `q5_k_m`: recommended
- `q6_k`: recommended
- `q8_0`: 8-bit, very close to lossless compared to the original weights
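
To get a feel for what these names mean for file size, here is a rough sketch covering only the legacy block formats. It assumes the usual layout of 32 weights per block with an fp16 scale (plus an fp16 offset for the `_1` variants) and a parameter count of about 7.25 billion for Mistral 7B. The helper names and the parameter count are illustrative assumptions, and real files will differ somewhat because some tensors stay in higher precision. The k-quant types mix several block layouts, so they are left out here.

```python
# Bytes used by one block of 32 weights in each legacy format (assumed layout:
# fp16 scale, optional fp16 offset, packed quantized values).
BLOCK_WEIGHTS = 32
BLOCK_BYTES = {
    "q4_0": 18,  # 2 (scale) + 16 (32 x 4-bit)
    "q4_1": 20,  # 2 (scale) + 2 (offset) + 16
    "q5_0": 22,  # 2 (scale) + 4 (high bits) + 16
    "q5_1": 24,  # 2 (scale) + 2 (offset) + 4 + 16
    "q8_0": 34,  # 2 (scale) + 32 (32 x 8-bit)
}


def bits_per_weight(method: str) -> float:
    """Effective bits per weight, including the per-block overhead."""
    return BLOCK_BYTES[method] * 8 / BLOCK_WEIGHTS


def estimate_size_gib(n_params: int, method: str) -> float:
    """Rough size estimate if every weight used the given format."""
    return n_params * bits_per_weight(method) / 8 / 2**30


N_PARAMS = 7_250_000_000  # assumption: approximate size of Mistral 7B

for method in BLOCK_BYTES:
    bpw = bits_per_weight(method)
    size = estimate_size_gib(N_PARAMS, method)
    print(f"{method}: {bpw:.1f} bits/weight, roughly {size:.1f} GiB")
```

Note that "4-bit" and "8-bit" refer to the stored values only; the per-block scale adds about half a bit per weight on top.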

Let's do a short demo and quantize Mistral 7B!

First, let's install `llama.cpp` and its dependencies, and download the model.

In [1]:
model_id = "mistralai/Mistral-7B-Instruct-v0.3"
model_name = model_id.split('/')[-1]
user_name = "huggingface_username"
hf_token = "read_token"

# install llama.cpp
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make
!pip install -r llama.cpp/requirements.txt
!(cd llama.cpp && make)

# download model
!git lfs install
!git clone https://{user_name}:{hf_token}@huggingface.co/{model_id}

Cloning into 'llama.cpp'...
remote: Enumerating objects: 32489, done.
remote: Counting objects: 100% (11391/11391), done.
remote: Compressing objects: 100% (708/708), done.
remote: Total 32489 (delta 11079), reused 10702 (delta 10682), pack-reused 21098 (from 1)
Receiving objects: 100% (32489/32489), 56.06 MiB | 14.27 MiB/s, done.
Resolving deltas: 100% (23481/23481), done.
Already up to date.
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -fopenmp -Wdouble-promotion 
I CXXFLAGS:  -std=c+
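
Before converting, it can be worth checking that the clone actually contains real weights. If Git LFS was not set up correctly, the weight files are only small text pointers. The sketch below is one way to check this; the function name and the choice of files to look for (`config.json` and `*.safetensors`) are assumptions for illustration, not something `llama.cpp` requires you to run.

```python
from pathlib import Path

# Git LFS pointer files are small text files that start with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def check_model_dir(model_dir):
    """Return a list of problems; an empty list means the checkout looks usable."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{model_dir} is not a directory"]

    problems = []
    if not (root / "config.json").is_file():
        problems.append("config.json is missing")

    weights = sorted(root.glob("*.safetensors"))
    if not weights:
        problems.append("no *.safetensors weight files found")

    # A weight file that starts with the LFS header was never downloaded.
    for path in weights:
        with open(path, "rb") as f:
            if f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                problems.append(f"{path.name} is a Git LFS pointer, not real weights")
    return problems


print(check_model_dir("Mistral-7B-Instruct-v0.3"))
```

An empty list means the directory has a config and at least one downloaded weight file; anything else is worth fixing before running the conversion script.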

Once everything is installed and downloaded, we can convert our model to an fp16 GGUF file, which serves as the input for quantization.

In [2]:
fp16 = f"{model_name}/{model_name.lower()}.fp16.bin"
!python llama.cpp/convert_hf_to_gguf.py {model_name} --outtype f16 --outfile {fp16}

INFO:hf-to-gguf:Loading model: Mistral-7B-Instruct-v0.3
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00003.safetensors'
INFO:hf-to-gguf:token_embd.weight,           torch.bfloat16 --> F16, shape = {4096, 32768}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.bfloat16 --> F16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.bfloat16 --> F16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_up.weight,         torch.bfloat16 --> F16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.bfloat16 --> F16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.0.attn_output.weight,    torch.bfloat16
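
The converted file uses a `.bin` extension here, but it is a GGUF file. A quick way to confirm that is to read its header. The sketch below assumes a little-endian GGUF file of version 2 or later, where the header is the magic bytes `GGUF`, a 32-bit version, and two 64-bit counts (tensors and metadata key-value pairs). The function name is hypothetical.

```python
import os
import struct


def read_gguf_header(path):
    """Read the fixed-size GGUF header (assumes little-endian, version >= 2)."""
    with open(path, "rb") as f:
        raw = f.read(24)
    if len(raw) < 24:
        raise ValueError("file is too short to be a GGUF file")

    # 4-byte magic, uint32 version, uint64 tensor count, uint64 metadata count.
    magic, version, tensor_count, kv_count = struct.unpack("<4sIQQ", raw)
    if magic != b"GGUF":
        raise ValueError(f"unexpected magic bytes: {magic!r}")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Path as produced by the conversion cell above; skipped if the file is absent.
path = "Mistral-7B-Instruct-v0.3/mistral-7b-instruct-v0.3.fp16.bin"
if os.path.exists(path):
    print(read_gguf_header(path))
```

The same function works on the quantized `.gguf` file, since quantization changes the tensor data but not the header layout.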

Now that our model is ready, we can quantize it. In this example we will quantize to `q4_k_m`, but feel free to change the method.

In [None]:
method = "q4_k_m"
qtype = f"{model_name}/{model_name.lower()}.{method.upper()}.gguf"
!./llama.cpp/llama-quantize {fp16} {qtype} {method}
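
If you want to compare several methods, the same command can be generated in a loop. This is a minimal sketch that mirrors the file naming used in the cell above; the `quantize_command` helper is hypothetical, and the commands are only executed when both the `llama-quantize` binary and the fp16 file exist.

```python
import subprocess
from pathlib import Path


def quantize_command(model_name, method, binary="./llama.cpp/llama-quantize"):
    """Build the llama-quantize call: <binary> <input> <output> <method>."""
    stem = f"{model_name}/{model_name.lower()}"
    fp16 = f"{stem}.fp16.bin"
    out = f"{stem}.{method.upper()}.gguf"
    return [binary, fp16, out, method]


for method in ["q4_k_m", "q5_k_m", "q8_0"]:
    cmd = quantize_command("Mistral-7B-Instruct-v0.3", method)
    print(" ".join(cmd))
    # Only run when the binary and the converted model are actually present.
    if Path(cmd[0]).exists() and Path(cmd[1]).exists():
        subprocess.run(cmd, check=True)
```

Each run writes a separate file, so you can compare sizes and output quality side by side before picking one.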

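Perfect, we can now test it with `llama.cpp` using the following command. Here `-n 128` limits generation to 128 tokens, `-ngl 35` offloads up to 35 layers to the GPU, and `-cnv` starts conversation mode using the `mistral` chat template: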

In [10]:
!./llama.cpp/llama-cli -m {qtype} -n 128 --color -ngl 35 -cnv --chat-template mistral

Log start
main: build = 3613 (fc54ef0d)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1724246949
llama_model_loader: loaded meta data with 37 key-value pairs and 291 tensors from Mistral-7B-Instruct-v0.3/mistral-7b-instruct-v0.3.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Mistral 7B Instruct v0.3
llama_model_loader: - kv   3:                            general.version str              = v0.3
llama_model_loader: - kv   4:                           general.finetune str              = Instruct
llama_model_loader: - kv   5:                           general.basename str