# Converting Hugging Face safetensors models to GGUF format

## Step 0: Clone llama.cpp

In [1]:
%%bash

SRC_DIR=../src/llama-cpp
# Clone over HTTPS so that no SSH key is needed
if [ ! -d "$SRC_DIR" ]; then git clone https://github.com/ggerganov/llama.cpp.git "$SRC_DIR"; fi


## Step 1: Clone a model from Hugging Face

In [2]:
%%bash

# Requires Git LFS (run `git lfs install` once); without it only small pointer files are cloned
git clone https://huggingface.co/allenai/OLMo-7B-0724-Instruct-hf ../src/allenai/OLMo-7B-0724-Instruct-hf

Cloning into '../src/allenai/OLMo-7B-0724-Instruct-hf'...
Filtering content: 100% (3/3), 4.83 GiB | 15.75 MiB/s, done.


In [3]:
%%bash

ls -lh ../src/allenai/OLMo-7B-0724-Instruct-hf/

total 26955920
-rw-r--r--  1 pughdr  KAUST\Domain Users   9.2K Oct 18 11:51 README.md
-rw-r--r--  1 pughdr  KAUST\Domain Users   637B Oct 18 11:51 config.json
-rw-r--r--  1 pughdr  KAUST\Domain Users   115B Oct 18 11:51 generation_config.json
-rw-r--r--  1 pughdr  KAUST\Domain Users   4.7G Oct 18 11:56 model-00001-of-00003.safetensors
-rw-r--r--  1 pughdr  KAUST\Domain Users   4.6G Oct 18 11:56 model-00002-of-00003.safetensors
-rw-r--r--  1 pughdr  KAUST\Domain Users   3.6G Oct 18 11:55 model-00003-of-00003.safetensors
-rw-r--r--  1 pughdr  KAUST\Domain Users    18K Oct 18 11:51 model.safetensors.index.json
-rw-r--r--  1 pughdr  KAUST\Domain Users   293B Oct 18 11:51 special_tokens_map.json
-rw-r--r--  1 pughdr  KAUST\Domain Users   2.0M Oct 18 11:51 tokenizer.json
-rw-r--r--  1 pughdr  KAUST\Domain Users   5.6K Oct 18 11:51 tokenizer_config.json
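
Before choosing an `--outtype` for the conversion, it can help to know which dtype the checkpoint is stored in. The sketch below reads the JSON header of each `.safetensors` shard (an 8-byte little-endian length followed by that many bytes of JSON) and counts tensors per dtype. The helper names `read_safetensors_header` and `dtype_counts` are illustrative and not part of any library.

```python
import json
import struct
from collections import Counter
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict."""
    with open(path, "rb") as f:
        # First 8 bytes: header length as an unsigned little-endian 64-bit int.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


def dtype_counts(model_dir):
    """Count tensors per dtype across all .safetensors shards in model_dir."""
    counts = Counter()
    for shard in sorted(Path(model_dir).glob("*.safetensors")):
        for name, info in read_safetensors_header(shard).items():
            if name == "__metadata__":  # optional free-form metadata entry
                continue
            counts[info["dtype"]] += 1
    return counts
```

Calling `dtype_counts("../src/allenai/OLMo-7B-0724-Instruct-hf")` on the directory cloned above would show whether the weights are stored as `BF16`, `F16`, or `F32`, which is a reasonable guide for picking a matching `--outtype`.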


## Step 2: Convert the model to GGUF format

In [12]:
%%bash

export TOKENIZERS_PARALLELISM=false

OUTTYPE=bf16
python ../src/llama-cpp/convert_hf_to_gguf.py ../src/allenai/OLMo-7B-0724-Instruct-hf/ \
    --outtype $OUTTYPE \
    --outfile ../models/allenai-OLMo-7B-0724-Instruct-hf-$OUTTYPE.gguf


INFO:hf-to-gguf:Loading model: OLMo-7B-0724-Instruct-hf
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00003.safetensors'
INFO:hf-to-gguf:token_embd.weight,         torch.bfloat16 --> BF16, shape = {4096, 50304}
INFO:hf-to-gguf:blk.0.ffn_down.weight,     torch.bfloat16 --> BF16, shape = {11008, 4096}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,     torch.bfloat16 --> BF16, shape = {4096, 11008}
INFO:hf-to-gguf:blk.0.ffn_up.weight,       torch.bfloat16 --> BF16, shape = {4096, 11008}
INFO:hf-to-gguf:blk.0.attn_k.weight,       torch.bfloat16 --> BF16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.0.attn_output.weight,  torch.bfloat16 --> BF16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.0.attn_q.weight,       torch.bfloat16 --> BF16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.0.attn_v.weight,       torch.bfloa
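
As a quick sanity check on the converted file, the fixed-size GGUF header can be read directly. This is a minimal sketch that assumes a little-endian GGUF file of version 2 or later, where the header consists of the 4-byte magic `GGUF`, a `uint32` version, a `uint64` tensor count, and a `uint64` metadata key-value count. `read_gguf_header` is a hypothetical helper name, not a llama.cpp API.

```python
import struct


def read_gguf_header(path):
    """Read the magic, version, and counts from the start of a GGUF file."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic = {magic!r})")
        # "<" = little-endian, no padding: uint32 + uint64 + uint64 = 20 bytes.
        version, tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

For the file written above, this could be called as `read_gguf_header("../models/allenai-OLMo-7B-0724-Instruct-hf-bf16.gguf")`. A `ValueError` here usually means the conversion did not finish or the path points at something other than a GGUF file.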

## Step 3: Run the model

In [9]:
%%bash

llama-cli \
    --model ../models/allenai-OLMo-7B-0724-Instruct-hf-bf16.gguf \
    --prompt "Why is the sky blue?" \
    --seed 42


build: 3865 (00b7317e) with Apple clang version 16.0.0 (clang-1600.0.26.3) for arm64-apple-darwin24.0.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 31 key-value pairs and 226 tensors from ../models/allenai-OLMo-7B-0724-Instruct-hf-bf16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = olmo
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = OLMo 7B 0724 Instruct Hf
llama_model_loader: - kv   3:                            general.version str              = 0724
llama_model_loader: - kv   4:                           general.finetune str              = Instruct-hf
llama_model_loader: - kv   5:                 

Why is the sky blue? Why is the sun red at sunrise and sunset?

The first answer is because of the scattering of sunlight by the atmosphere. Blue light travels shorter distances than other colors, so it scatters more. This is why the sky appears blue to us. 

The second answer is also related to scattering, but this time it's the scattering of sunlight by tiny dust particles in the atmosphere. At sunrise and
Interrupted by user


llama_perf_sampler_print:    sampling time =       2.48 ms /    91 runs   (    0.03 ms per token, 36678.76 tokens per second)
llama_perf_context_print:        load time =   19021.70 ms
llama_perf_context_print: prompt eval time =    9247.50 ms /     6 tokens ( 1541.25 ms per token,     0.65 tokens per second)
llama_perf_context_print:        eval time =  136342.13 ms /    84 runs   ( 1623.12 ms per token,     0.62 tokens per second)
llama_perf_context_print:       total time =  146843.55 ms /    90 tokens


Process is interrupted.


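To run a quantized model instead (for example, a `Q4_K_M` file produced with llama.cpp's `llama-quantize` tool), pass it to `llama-cli`: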

```bash
# start inference on a gguf model
./llama-cli -m ./models/mymodel/ggml-model-Q4_K_M.gguf -cnv -p "You are a helpful assistant"
```

When running the larger models, make sure you have enough disk space to store all the intermediate files.

## Memory/Disk Requirements

As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same.

| Model | Original size | Quantized size (Q4_0) |
|------:|--------------:|----------------------:|
|    7B |         13 GB |                3.9 GB |
|   13B |         24 GB |                7.8 GB |
|   30B |         60 GB |               19.5 GB |
|   65B |        120 GB |               38.5 GB |
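
The sizes in this table follow from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes. The sketch below is a rough estimator under that assumption. It ignores file metadata and the fact that some tensors may be kept at higher precision, and the function name `estimate_size_gb` is illustrative.

```python
def estimate_size_gb(n_params, bits_per_weight):
    """Approximate model size in decimal gigabytes (1 GB = 1e9 bytes).

    Because the model is fully loaded into memory, the same number is a
    lower bound on the RAM needed for the weights.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: the "7B" model has roughly 6.74 billion parameters.
N_PARAMS_7B = 6.74e9
print(f"F16  (16.0 bits/weight): {estimate_size_gb(N_PARAMS_7B, 16.0):.1f} GB")
print(f"Q4_0 ( 4.5 bits/weight): {estimate_size_gb(N_PARAMS_7B, 4.5):.1f} GB")
```

Under that assumed parameter count the estimates come out at roughly 13.5 GB and 3.8 GB, in the same range as the 7B row above. Actual memory use at inference time is somewhat higher, since the context (KV cache) and compute buffers are allocated on top of the weights.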

## Quantization

Several quantization methods are supported. They differ in the resulting model's disk size, inference speed, and output quality (reported below as perplexity).

The quantization formats `Q4_0_4_4`, `Q4_0_4_8` and `Q4_0_8_8` are block interleaved variants of the `Q4_0` format, providing a data layout that is better suited for specific implementations of optimized mulmat kernels. Since these formats differ only in data layout, they have the same quantized size as the `Q4_0` format.

*(The measurements in the following table are outdated.)*

| Model | Measure            |    F16 |   Q4_0 |   Q4_1 |   Q5_0 |   Q5_1 |   Q8_0 |
|------:|--------------------|-------:|-------:|-------:|-------:|-------:|-------:|
|    7B | perplexity         | 5.9066 | 6.1565 | 6.0912 | 5.9862 | 5.9481 | 5.9070 |
|    7B | file size          |  13.0G |   3.5G |   3.9G |   4.3G |   4.7G |   6.7G |
|    7B | ms/tok @ 4 threads |    127 |     55 |     54 |     76 |     83 |     72 |
|    7B | ms/tok @ 8 threads |    122 |     43 |     45 |     52 |     56 |     67 |
|    7B | bits/weight        |   16.0 |    4.5 |    5.0 |    5.5 |    6.0 |    8.5 |
|   13B | perplexity         | 5.2543 | 5.3860 | 5.3608 | 5.2856 | 5.2706 | 5.2548 |
|   13B | file size          |  25.0G |   6.8G |   7.6G |   8.3G |   9.1G |    13G |
|   13B | ms/tok @ 4 threads |      - |    103 |    105 |    148 |    160 |    131 |
|   13B | ms/tok @ 8 threads |      - |     73 |     82 |     98 |    105 |    128 |
|   13B | bits/weight        |   16.0 |    4.5 |    5.0 |    5.5 |    6.0 |    8.5 |
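
The fractional bits/weight values in the table come from per-block overhead. These legacy formats store weights in blocks of 32, and each block carries one 16-bit scale (the `_0` variants) or a 16-bit scale plus a 16-bit minimum (the `_1` variants) alongside the quantized values. The sketch below reproduces the bits/weight row from those block layouts; the names `LEGACY_FORMATS` and `bits_per_weight` are illustrative.

```python
BLOCK_SIZE = 32  # weights per quantization block

# name -> (bits per quantized weight, number of fp16 constants per block)
LEGACY_FORMATS = {
    "Q4_0": (4, 1),  # scale only
    "Q4_1": (4, 2),  # scale + minimum
    "Q5_0": (5, 1),
    "Q5_1": (5, 2),
    "Q8_0": (8, 1),
}


def bits_per_weight(quant_bits, n_fp16_constants, block_size=BLOCK_SIZE):
    """Effective bits per weight, including the per-block fp16 constants."""
    block_bits = block_size * quant_bits + 16 * n_fp16_constants
    return block_bits / block_size


for name, (quant_bits, n_consts) in LEGACY_FORMATS.items():
    print(f"{name}: {bits_per_weight(quant_bits, n_consts):.1f} bits/weight")
```

For `Q4_0`, for example, a block holds 32 × 4 + 16 = 144 bits, which is 4.5 bits per weight.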

- [k-quants](https://github.com/ggerganov/llama.cpp/pull/1684)
- recent k-quants improvements and new i-quants
  - [#2707](https://github.com/ggerganov/llama.cpp/pull/2707)
  - [#2807](https://github.com/ggerganov/llama.cpp/pull/2807)
  - [#4773 - 2-bit i-quants (inference)](https://github.com/ggerganov/llama.cpp/pull/4773)
  - [#4856 - 2-bit i-quants (inference)](https://github.com/ggerganov/llama.cpp/pull/4856)
  - [#4861 - importance matrix](https://github.com/ggerganov/llama.cpp/pull/4861)
  - [#4872 - MoE models](https://github.com/ggerganov/llama.cpp/pull/4872)
  - [#4897 - 2-bit quantization](https://github.com/ggerganov/llama.cpp/pull/4897)
  - [#4930 - imatrix for all k-quants](https://github.com/ggerganov/llama.cpp/pull/4930)
  - [#4951 - imatrix on the GPU](https://github.com/ggerganov/llama.cpp/pull/4957)
  - [#4969 - imatrix for legacy quants](https://github.com/ggerganov/llama.cpp/pull/4969)
  - [#4996 - k-quants tuning](https://github.com/ggerganov/llama.cpp/pull/4996)
  - [#5060 - Q3_K_XS](https://github.com/ggerganov/llama.cpp/pull/5060)
  - [#5196 - 3-bit i-quants](https://github.com/ggerganov/llama.cpp/pull/5196)
  - [quantization tuning](https://github.com/ggerganov/llama.cpp/pull/5320), [another one](https://github.com/ggerganov/llama.cpp/pull/5334), and [another one](https://github.com/ggerganov/llama.cpp/pull/5361)

**Llama 2 7B**

| Quantization | Bits per Weight (BPW) |
|--------------|-----------------------|
| Q2_K         | 3.35                  |
| Q3_K_S       | 3.50                  |
| Q3_K_M       | 3.91                  |
| Q3_K_L       | 4.27                  |
| Q4_K_S       | 4.58                  |
| Q4_K_M       | 4.84                  |
| Q5_K_S       | 5.52                  |
| Q5_K_M       | 5.68                  |
| Q6_K         | 6.56                  |

**Llama 2 13B**

| Quantization | Bits per Weight (BPW) |
|--------------|-----------------------|
| Q2_K         | 3.34                  |
| Q3_K_S       | 3.48                  |
| Q3_K_M       | 3.89                  |
| Q3_K_L       | 4.26                  |
| Q4_K_S       | 4.56                  |
| Q4_K_M       | 4.83                  |
| Q5_K_S       | 5.51                  |
| Q5_K_M       | 5.67                  |
| Q6_K         | 6.56                  |

**Llama 2 70B**

| Quantization | Bits per Weight (BPW) |
|--------------|-----------------------|
| Q2_K         | 3.40                  |
| Q3_K_S       | 3.47                  |
| Q3_K_M       | 3.85                  |
| Q3_K_L       | 4.19                  |
| Q4_K_S       | 4.53                  |
| Q4_K_M       | 4.80                  |
| Q5_K_S       | 5.50                  |
| Q5_K_M       | 5.65                  |
| Q6_K         | 6.56                  |
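
One way to use these tables is to pick the highest-precision quantization whose weights fit a given memory budget. The sketch below does this for the Llama 2 7B values listed above. It assumes about 6.74 billion parameters for that model and counts the weights only, so the budget should leave headroom for the context and compute buffers. `largest_quant_that_fits` is a hypothetical helper name.

```python
# Bits per weight for Llama 2 7B, copied from the table above.
LLAMA2_7B_BPW = {
    "Q2_K": 3.35, "Q3_K_S": 3.50, "Q3_K_M": 3.91,
    "Q3_K_L": 4.27, "Q4_K_S": 4.58, "Q4_K_M": 4.84,
    "Q5_K_S": 5.52, "Q5_K_M": 5.68, "Q6_K": 6.56,
}

N_PARAMS = 6.74e9  # assumption: approximate parameter count of Llama 2 7B


def largest_quant_that_fits(bpw_table, n_params, budget_gb):
    """Return the quantization with the most bits per weight whose
    estimated size (decimal GB) is within budget_gb, or None if none fit."""
    best = None
    for name, bpw in bpw_table.items():
        size_gb = n_params * bpw / 8 / 1e9
        if size_gb <= budget_gb and (best is None or bpw > bpw_table[best]):
            best = name
    return best


print(largest_quant_that_fits(LLAMA2_7B_BPW, N_PARAMS, budget_gb=4.5))
```

Bits per weight is used here as a stand-in for quality, which is a simplification: more bits generally means lower perplexity, but the relationship is not linear.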

You can also use the [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space on Hugging Face to build your own quants without any setup.
