# GGUF Converter Notebook

llama.cpp is a great way to run LLMs efficiently on CPUs and GPUs. The downside, however, is that models must first be converted to a format that llama.cpp supports, which is now the GGUF file format.

This notebook is a guide to converting your model weights/binaries into a GGUF file compatible with llama.cpp.

**References**

- https://www.substratus.ai/blog/converting-hf-model-gguf-model/

- https://github.com/ggerganov/llama.cpp/discussions/2948

- https://hackernoon.com/the-cheapskates-guide-to-fine-tuning-llama-2-and-running-it-on-your-laptop

## Setup

In [1]:
import os
import getpass
import logging
import sys
import dotenv

### Logging

In [2]:
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")

console_handler = logging.StreamHandler(sys.stdout)
console_handler.setFormatter(formatter)
logger.addHandler(console_handler)
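
Re-running the logging cell in a notebook attaches one more handler to the same logger each time, so every message is then printed once per run. A minimal sketch of an idempotent variant follows; `get_notebook_logger` is a hypothetical helper name, not part of the original notebook.

```python
import logging
import sys


def get_notebook_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Return a logger with a single stdout handler, even if called repeatedly."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Attach a handler only the first time; later calls reuse the existing one.
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = get_notebook_logger(__name__)
```

Calling the helper again with the same name returns the same logger object without adding a second handler.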

### Env Vars

In [4]:
logger.info("Looking .env files at current directory...")
dotenv_exists = dotenv.load_dotenv(dotenv.find_dotenv(usecwd=True))
HUGGINGFACE_TOKEN = os.getenv("HUGGINGFACE_TOKEN")
if not HUGGINGFACE_TOKEN:
    # Fall back to an interactive prompt when no .env file (or no token in it) is found.
    logger.warning("No .env found! Prompting user for credentials...")
    HUGGINGFACE_TOKEN = getpass.getpass("Enter HUGGINGFACE TOKEN: ")
logger.info("Loaded .env variables")

2023-10-12 12:20:30,225 - __main__ - INFO - Looking .env files at current directory...
2023-10-12 12:20:30,231 - __main__ - INFO - Loaded .env variables


## Step 1 - Clone llama.cpp repo

In [7]:
! rm -rf llama.cpp && \
    git clone https://github.com/ggerganov/llama.cpp.git

Cloning into 'llama.cpp'...
remote: Enumerating objects: 10326, done.
remote: Counting objects: 100% (10326/10326), done.
remote: Compressing objects: 100% (3185/3185), done.
remote: Total 10326 (delta 7189), reused 10073 (delta 7071), pack-reused 0
Receiving objects: 100% (10326/10326), 11.37 MiB | 21.89 MiB/s, done.
Resolving deltas: 100% (7189/7189), done.
Updating files: 100% (272/272), done.
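
The scripts used in the following steps (`convert.py`, `convert-lora-to-ggml.py`, `examples/make-ggml.py`) come from this clone, and the repository layout changes over time, so a fresh clone of the default branch may not match what this notebook ran against. The sketch below builds the git commands for a clone that is optionally pinned to a tag or commit. The function names are hypothetical, and the notebook does not record which revision it used, so the `ref` value has to be supplied by the reader.

```python
import subprocess
from typing import List, Optional

REPO_URL = "https://github.com/ggerganov/llama.cpp.git"


def clone_commands(dest: str = "llama.cpp", ref: Optional[str] = None) -> List[List[str]]:
    """Build the git commands for a clone, optionally pinned to a tag or commit."""
    commands = [["git", "clone", REPO_URL, dest]]
    if ref is not None:
        # Check out the pinned revision inside the fresh clone.
        commands.append(["git", "-C", dest, "checkout", ref])
    return commands


def run_clone(dest: str = "llama.cpp", ref: Optional[str] = None) -> None:
    """Run the commands in order, stopping at the first failure."""
    for command in clone_commands(dest, ref):
        subprocess.run(command, check=True)
```

Unlike the cell above, `run_clone` does not delete an existing `llama.cpp` directory first, so `git clone` fails if the destination already exists and is not empty.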


## Step 2 - Install required libraries

In [8]:
%pip install -r llama.cpp/requirements.txt

Looking in indexes: https://artifacts.dell.com/artifactory/api/pypi/python/simple, https://artifacts.dell.com/artifactory/api/pypi/ailfc-1003745-pypi-prd-local/simple, https://artifacts.dell.com/artifactory/api/pypi/aia-1001238-pypi-prd-local/simple, https://artifacts.dell.com/artifactory/api/pypi/aiops-1002685-pypi-prd-local/simple
Collecting numpy==1.24.4 (from -r llama.cpp/requirements.txt (line 1))
  Using cached https://artifacts.dell.com/artifactory/api/pypi/python/packages/packages/10/be/ae5bf4737cb79ba437879915791f6f26d92583c738d7d960ad94e5c36adf/numpy-1.24.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Collecting sentencepiece==0.1.98 (from -r llama.cpp/requirements.txt (line 2))
  Using cached https://artifacts.dell.com/artifactory/api/pypi/python/packages/packages/e2/1b/5f1374ba4c4009bd300566ea60697a4e37a82d8b36420999420463d85e56/sentencepiece-0.1.98-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting gguf>=0.1.0 (from -r ll

## Step 3 - Convert fine-tuned adapter model to GGML

In [9]:
!python llama.cpp/convert-lora-to-ggml.py 'gguf-fine-tuned-chat-model'

base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight => blk.0.attn_q.weight.loraA (4096, 64) float32 1.00MB
base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight => blk.0.attn_q.weight.loraB (4096, 64) float32 1.00MB
base_model.model.model.layers.0.self_attn.v_proj.lora_A.weight => blk.0.attn_v.weight.loraA (4096, 64) float32 1.00MB
base_model.model.model.layers.0.self_attn.v_proj.lora_B.weight => blk.0.attn_v.weight.loraB (4096, 64) float32 1.00MB
base_model.model.model.layers.1.self_attn.q_proj.lora_A.weight => blk.1.attn_q.weight.loraA (4096, 64) float32 1.00MB
base_model.model.model.layers.1.self_attn.q_proj.lora_B.weight => blk.1.attn_q.weight.loraB (4096, 64) float32 1.00MB
base_model.model.model.layers.1.self_attn.v_proj.lora_A.weight => blk.1.attn_v.weight.loraA (4096, 64) float32 1.00MB
base_model.model.model.layers.1.self_attn.v_proj.lora_B.weight => blk.1.attn_v.weight.loraB (4096, 64) float32 1.00MB
base_model.model.model.layers.2.self_attn.q_proj.lora_A.
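
Each log line above maps an adapter tensor name to its GGML name and reports the tensor's shape, dtype and size. The sketch below reproduces only the naming pattern visible in this log (the `q_proj` and `v_proj` LoRA matrices) and checks the reported size: 4096 × 64 float32 values at 4 bytes each is 1,048,576 bytes, so the "1.00MB" figure corresponds to exactly 1 MiB. Both function names are hypothetical, and this is an illustration of the log, not the converter script's actual implementation.

```python
import re
from typing import Sequence

# Matches only the tensor names that appear in the log above.
_LORA_NAME = re.compile(
    r"^base_model\.model\.model\.layers\.(\d+)\.self_attn\.([qv])_proj\.lora_([AB])\.weight$"
)


def ggml_lora_name(adapter_name: str) -> str:
    """Translate an adapter tensor name into the GGML name shown in the log."""
    match = _LORA_NAME.match(adapter_name)
    if match is None:
        raise ValueError(f"unsupported tensor name: {adapter_name}")
    layer, projection, matrix = match.groups()
    return f"blk.{layer}.attn_{projection}.weight.lora{matrix}"


def tensor_size_mib(shape: Sequence[int], bytes_per_element: int = 4) -> float:
    """Size of a dense tensor in MiB (float32 uses 4 bytes per element)."""
    elements = 1
    for dim in shape:
        elements *= dim
    return elements * bytes_per_element / 2**20


name = "base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight"
print(ggml_lora_name(name), tensor_size_mib((4096, 64)))
```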

## Step 4 - Convert model into f16 or f32 GGUF versions

In [10]:
!python llama.cpp/convert.py -h

usage: convert.py [-h] [--dump] [--dump-single] [--vocab-only]
                  [--outtype {f32,f16,q8_0}] [--vocab-dir VOCAB_DIR]
                  [--outfile OUTFILE] [--vocabtype {spm,bpe}] [--ctx CTX]
                  [--concurrency CONCURRENCY]
                  model

Convert a LLaMa model to a GGML compatible file

positional arguments:
  model                 directory containing model file, or model file itself
                        (*.pth, *.pt, *.bin)

options:
  -h, --help            show this help message and exit
  --dump                don't convert, just show what's in the model
  --dump-single         don't convert, just show what's in a single model file
  --vocab-only          extract only the vocab
  --outtype {f32,f16,q8_0}
                        output format - note: q8_0 may be very slow (default:
                        f16 or f32 based on input)
  --vocab-dir VOCAB_DIR
                        directory containing tokenizer.model, if separate from
         

In [12]:
!python llama.cpp/convert.py 'gguf-fine-tuned-chat-model'

Loading model file gguf-fine-tuned-chat-model/model-00001-of-00002.safetensors
Loading model file gguf-fine-tuned-chat-model/model-00001-of-00002.safetensors
Loading model file gguf-fine-tuned-chat-model/model-00002-of-00002.safetensors
params = Params(n_vocab=32000, n_embd=4096, n_layer=32, n_ctx=4096, n_ff=11008, n_head=32, n_head_kv=32, f_norm_eps=1e-06, f_rope_freq_base=None, f_rope_scale=None, ftype=None, path_model=PosixPath('gguf-fine-tuned-chat-model'))
Loading vocab file 'gguf-fine-tuned-chat-model/tokenizer.model', type 'spm'
Permuting layer 0
Permuting layer 1
Permuting layer 2
Permuting layer 3
Permuting layer 4
Permuting layer 5
Permuting layer 6
Permuting layer 7
Permuting layer 8
Permuting layer 9
Permuting layer 10
Permuting layer 11
Permuting layer 12
Permuting layer 13
Permuting layer 14
Permuting layer 15
Permuting layer 16
Permuting layer 17
Permuting layer 18
Permuting layer 19
Permuting layer 20
Permuting layer 21
Permuting layer 22
Permuting layer 23
Permuting la
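
The `Params(...)` line in the output above is enough to estimate how large the converted file will be. The sketch below counts the weights of a LLaMA-style decoder from those values and multiplies by the bytes per weight for f16 and f32. It assumes an untied output head, no bias terms, and no grouped-query attention (consistent with `n_head=32, n_head_kv=32` in the log); the real file is slightly larger because it also stores metadata and the vocabulary. `LlamaShape` and `count_parameters` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LlamaShape:
    n_vocab: int
    n_embd: int
    n_layer: int
    n_ff: int


def count_parameters(shape: LlamaShape) -> int:
    """Count the weights of a LLaMA-style decoder under the stated assumptions."""
    embeddings = shape.n_vocab * shape.n_embd      # token embedding table
    output_head = shape.n_vocab * shape.n_embd     # untied output projection
    attention = 4 * shape.n_embd * shape.n_embd    # q, k, v and o projections
    feed_forward = 3 * shape.n_embd * shape.n_ff   # gate, up and down projections
    norms = 2 * shape.n_embd                       # two RMSNorm vectors per layer
    per_layer = attention + feed_forward + norms
    final_norm = shape.n_embd
    return embeddings + output_head + shape.n_layer * per_layer + final_norm


# Values copied from the Params(...) line in the log above.
shape = LlamaShape(n_vocab=32000, n_embd=4096, n_layer=32, n_ff=11008)
n_params = count_parameters(shape)
print(f"parameters: {n_params:,}")
for name, bytes_per_weight in (("f16", 2), ("f32", 4)):
    print(f"{name}: ~{n_params * bytes_per_weight / 2**30:.2f} GiB")
```

Under these assumptions the f32 file is twice the size of the f16 file, which is worth keeping in mind when picking `--outtype`.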

## Step 5 - Quantize model to 4 bits

In [13]:
!llama.cpp/quantize 'gguf-fine-tuned-chat-model/ggml-model-f16.gguf' 'gguf-fine-tuned-chat-model/ggml-model-q4_0.gguf' 'q4_0'

/bin/bash: line 1: llama.cpp/quantize: No such file or directory
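
The error above means the `quantize` executable does not exist yet: it is a compiled binary, so it only appears after llama.cpp has been built (the `make-ggml.py` run further down prints `Building llama.cpp` before it quantizes). A small pre-flight check, sketched below with a hypothetical helper name, turns the shell error into a more explicit message before the command is attempted.

```python
import os
from pathlib import Path


def find_executable(path: str) -> Path:
    """Return the path if it points at an executable file, otherwise raise."""
    candidate = Path(path)
    if not candidate.is_file():
        raise FileNotFoundError(
            f"{candidate} not found; llama.cpp probably has not been built yet"
        )
    if not os.access(candidate, os.X_OK):
        raise PermissionError(f"{candidate} exists but is not executable")
    return candidate
```

For example, `find_executable("llama.cpp/quantize")` raises `FileNotFoundError` in the state shown above and returns the path once the build has produced the binary.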


In [25]:
!python llama.cpp/examples/make-ggml.py -h

Looking in indexes: https://artifacts.dell.com/artifactory/api/pypi/python/simple, https://artifacts.dell.com/artifactory/api/pypi/ailfc-1003745-pypi-prd-local/simple, https://artifacts.dell.com/artifactory/api/pypi/aia-1001238-pypi-prd-local/simple, https://artifacts.dell.com/artifactory/api/pypi/aiops-1002685-pypi-prd-local/simple
Collecting huggingface-hub==0.16.4
  Using cached https://artifacts.dell.com/artifactory/api/pypi/python/packages/packages/7f/c4/adcbe9a696c135578cabcbdd7331332daad4d49b7c43688bc2d36b3a47d2/huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Installing collected packages: huggingface-hub
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 0.17.3
    Uninstalling huggingface-hub-0.17.3:
      Successfully uninstalled huggingface-hub-0.17.3
Successfully installed huggingface-hub-0.16.4
usage: make-ggml.py [-h] --model_type
                    {llama,starcoder,falcon,baichuan,gptneox}
                    [--outname OUTNAME] [-

In [15]:
!cd llama.cpp/examples && \
    python make-ggml.py '/home/dell/llama-v2-evaluation/social_content_gen/gguf-fine-tuned-chat-model' \
      --model_type 'llama' \
      --outname 'llama-2-7b-chat-social_media_gen' \
      --outdir '/home/dell/llama-v2-evaluation/social_content_gen/gguf-fine-tuned-chat-model' \
      --quants 'Q4_K_M'

Looking in indexes: https://artifacts.dell.com/artifactory/api/pypi/python/simple, https://artifacts.dell.com/artifactory/api/pypi/ailfc-1003745-pypi-prd-local/simple, https://artifacts.dell.com/artifactory/api/pypi/aia-1001238-pypi-prd-local/simple, https://artifacts.dell.com/artifactory/api/pypi/aiops-1002685-pypi-prd-local/simple
Building llama.cpp
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native 
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn 
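
Once a conversion or quantization run has finished, a cheap sanity check is to confirm that the output really is a GGUF file. A GGUF file starts with the four magic bytes `GGUF`, followed by a little-endian 32-bit format version. The sketch below reads only those eight bytes; `read_gguf_version` is a hypothetical helper, and it does not validate anything beyond the header.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_version(path: str) -> int:
    """Return the GGUF format version, or raise ValueError if the magic is wrong."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not start with the GGUF magic bytes")
    # The version is stored as an unsigned 32-bit little-endian integer.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

Because the check looks at the file content rather than its name, it also works for outputs whose extension is not `.gguf`.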

## Optional: Upload to HuggingFace

In [6]:
# %pip install "huggingface_hub[cli]" --upgrade --user # already included in dev container environment

Looking in indexes: https://artifacts.dell.com/artifactory/api/pypi/python/simple, https://artifacts.dell.com/artifactory/api/pypi/ailfc-1003745-pypi-prd-local/simple, https://artifacts.dell.com/artifactory/api/pypi/aia-1001238-pypi-prd-local/simple, https://artifacts.dell.com/artifactory/api/pypi/aiops-1002685-pypi-prd-local/simple
Collecting huggingface_hub[cli]
  Using cached https://artifacts.dell.com/artifactory/api/pypi/python/packages/packages/ef/b5/b6107bd65fa4c96fdf00e4733e2fe5729bb9e5e09997f63074bb43d3ab28/huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Installing collected packages: huggingface_hub
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sentence-transformers 2.2.2 requires sentencepiece, which is not installed.
tokenizers 0.14.1 requires huggingface_hub<0.18,>=0.16.4, but you have huggingface-hub 0.18.0 which is incompatible.


In [7]:
from huggingface_hub import login, HfApi

logger.info("Logging to Hugging Face...")
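# Keep the token resolved in the Env Vars cell; os.getenv alone returns None when the token was typed at the prompt.
HUGGINGFACE_TOKEN = os.getenv("HUGGINGFACE_TOKEN") or HUGGINGFACE_TOKEN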
login(token=HUGGINGFACE_TOKEN, write_permission=True)

logger.info("Uploading fine tuned model folder to Hugging Face...")
api = HfApi()

REPO_ID = "kevinknights29/llama-2-7b-chat-social_media_gen.gguf"
api.create_repo(
    repo_id=REPO_ID,
    token=HUGGINGFACE_TOKEN,
    repo_type="model",
    exist_ok=True,
)
api.upload_folder(
    repo_id=REPO_ID,
    token=HUGGINGFACE_TOKEN,
    repo_type="model",
    folder_path=os.path.join(os.getcwd(), "llama-2-7b-chat-social_media_gen_GGUF"),
)

2023-10-12 12:21:52,101 - __main__ - INFO - Logging to Hugging Face...
Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/dell/.cache/huggingface/token
Login successful
2023-10-12 12:21:52,382 - __main__ - INFO - Uploading fine tuned model folder to Hugging Face...


llama-2-7b-chat-social_media_gen.gguf.Q4_K_M.bin:   0%|          | 0.00/4.08G [00:00<?, ?B/s]

'https://huggingface.co/kevinknights29/llama-2-7b-chat-social_media_gen.gguf/tree/main/'
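
The upload progress bar reports a 4.08G file for the `Q4_K_M` quantization. Dividing the file size by the number of weights gives the average bits per weight, a quick way to confirm that the quantization took effect. The sketch below rests on two assumptions: that "4.08G" means decimal gigabytes, and that the model has about 6.74 billion weights, which is what the `Params(...)` values from Step 4 imply for a standard LLaMA layout.

```python
def bits_per_weight(file_size_bytes: float, n_parameters: int) -> float:
    """Average number of bits stored per model weight."""
    return file_size_bytes * 8 / n_parameters


FILE_SIZE_BYTES = 4.08e9        # "4.08G" from the progress bar, read as decimal GB (assumption)
N_PARAMETERS = 6_738_415_616    # implied by the Step 4 Params line (assumption)

print(f"{bits_per_weight(FILE_SIZE_BYTES, N_PARAMETERS):.2f} bits per weight")
```

A result somewhat above 4 bits is expected here, since a quantized file also stores scaling factors and metadata alongside the 4-bit values.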