# GGUF Converter Notebook

llama.cpp is an efficient way to run LLMs on CPUs and GPUs. The downside is that models must first be converted to a format llama.cpp supports, which is now the GGUF file format.

This notebook is a guide to converting your model weights/binaries into a GGUF file compatible with llama.cpp.

**Reference**

- https://www.substratus.ai/blog/converting-hf-model-gguf-model/

- https://github.com/ggerganov/llama.cpp/discussions/2948

- https://hackernoon.com/the-cheapskates-guide-to-fine-tuning-llama-2-and-running-it-on-your-laptop

**Additional Resources**

- https://stackoverflow.com/questions/37890898/how-to-set-env-variable-in-jupyter-notebook

- https://unix.stackexchange.com/questions/558350/is-there-in-bash-a-builtin-command-to-get-the-absolute-path-of-a-relative-file

- https://www.gnu.org/software/bash/manual/html_node/Command-Substitution.html

- https://cgold.readthedocs.io/en/latest/first-step/installation.html

## Setup

In [12]:
import os
import getpass
import logging
import sys
import dotenv

from huggingface_hub import login, HfApi

### Logging

In [2]:
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")

console_handler = logging.StreamHandler(sys.stdout)
console_handler.setFormatter(formatter)
logger.addHandler(console_handler)
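
Re-running the logging cell in a notebook attaches another handler to the same logger, so each message is then printed more than once. A minimal sketch of an idempotent setup follows; the helper name `get_notebook_logger` and the logger name `gguf_converter` are hypothetical and not part of the original notebook.

```python
import logging
import sys


def get_notebook_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Return a logger with a single stdout handler, even if called repeatedly."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Attach a handler only the first time; later calls reuse the existing one.
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = get_notebook_logger("gguf_converter")
logger = get_notebook_logger("gguf_converter")  # simulates re-running the cell
```

With this guard, running the cell several times should still leave one handler on the logger.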

### Env Vars

In [3]:
logger.info("Looking .env files at current directory...")
dotenv_exists = dotenv.load_dotenv(dotenv.find_dotenv(usecwd=True))
if not dotenv_exists:
    logger.warning("No .env found! Prompting user for credentials...")
# Use the token from the environment when it is set; otherwise prompt for it.
HUGGINGFACE_TOKEN = os.getenv("HUGGINGFACE_TOKEN") or getpass.getpass("Enter HUGGINGFACE TOKEN: ")
logger.info("Loaded .env variables")

2024-03-03 21:49:21,375 - __main__ - INFO - Looking .env files at current directory...
2024-03-03 21:49:21,377 - __main__ - INFO - Loaded .env variables
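
A `.env` file can exist without defining `HUGGINGFACE_TOKEN`, in which case the token lookup returns `None`. The sketch below shows one way to fall back to a prompt whenever the variable is missing or empty. The helper `resolve_secret` and its `environ`/`prompt` parameters are hypothetical, introduced only so the logic can be exercised without real credentials.

```python
import getpass
import os
from typing import Callable, Mapping


def resolve_secret(
    name: str,
    environ: Mapping[str, str] = os.environ,
    prompt: Callable[[str], str] = getpass.getpass,
) -> str:
    """Return environ[name] if it is set and non-empty, otherwise ask the user."""
    value = environ.get(name, "").strip()
    if value:
        return value
    return prompt(f"Enter {name}: ")


# Example with an in-memory environment and a stand-in prompt (no real secrets).
token = resolve_secret("HUGGINGFACE_TOKEN", environ={}, prompt=lambda _: "typed-by-user")
```

In the notebook itself the call would be `resolve_secret("HUGGINGFACE_TOKEN")`, which reads `os.environ` and prompts via `getpass` by default.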


## Step 1 - Clone llama.cpp repo

In [4]:
! rm -rf llama.cpp && \
    git clone https://github.com/ggerganov/llama.cpp.git

Cloning into 'llama.cpp'...
remote: Enumerating objects: 19504, done.
remote: Counting objects: 100% (11747/11747), done.
remote: Compressing objects: 100% (639/639), done.
remote: Total 19504 (delta 11327), reused 11176 (delta 11107), pack-reused 7757
Receiving objects: 100% (19504/19504), 20.40 MiB | 12.90 MiB/s, done.
Resolving deltas: 100% (13945/13945), done.


## Step 2 - Install required libraries

In [5]:
%pip install -r llama.cpp/requirements.txt

Collecting numpy~=1.24.4 (from -r llama.cpp/./requirements/requirements-convert.txt (line 1))
  Downloading numpy-1.24.4-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.metadata (5.6 kB)
Collecting sentencepiece~=0.1.98 (from -r llama.cpp/./requirements/requirements-convert.txt (line 2))
  Downloading sentencepiece-0.1.99-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.metadata (7.7 kB)
Collecting gguf>=0.1.0 (from -r llama.cpp/./requirements/requirements-convert.txt (line 4))
  Downloading gguf-0.6.0-py3-none-any.whl.metadata (3.2 kB)
Collecting protobuf<5.0.0,>=4.21.0 (from -r llama.cpp/./requirements/requirements-convert.txt (line 5))
  Downloading protobuf-4.25.3-cp37-abi3-manylinux2014_aarch64.whl.metadata (541 bytes)
Collecting torch~=2.1.1 (from -r llama.cpp/./requirements/requirements-convert-hf-to-gguf.txt (line 2))
  Downloading torch-2.1.2-cp311-cp311-manylinux2014_aarch64.whl.metadata (25 kB)
Collecting einops~=0.7.0 (from -r llama.cpp/./requiremen

## Step 3 - Convert fine-tuned adapter model to GGML

In [9]:
# !python llama.cpp/convert-lora-to-ggml.py 'gguf-fine-tuned-chat-model'

base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight => blk.0.attn_q.weight.loraA (4096, 64) float32 1.00MB
base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight => blk.0.attn_q.weight.loraB (4096, 64) float32 1.00MB
base_model.model.model.layers.0.self_attn.v_proj.lora_A.weight => blk.0.attn_v.weight.loraA (4096, 64) float32 1.00MB
base_model.model.model.layers.0.self_attn.v_proj.lora_B.weight => blk.0.attn_v.weight.loraB (4096, 64) float32 1.00MB
base_model.model.model.layers.1.self_attn.q_proj.lora_A.weight => blk.1.attn_q.weight.loraA (4096, 64) float32 1.00MB
base_model.model.model.layers.1.self_attn.q_proj.lora_B.weight => blk.1.attn_q.weight.loraB (4096, 64) float32 1.00MB
base_model.model.model.layers.1.self_attn.v_proj.lora_A.weight => blk.1.attn_v.weight.loraA (4096, 64) float32 1.00MB
base_model.model.model.layers.1.self_attn.v_proj.lora_B.weight => blk.1.attn_v.weight.loraB (4096, 64) float32 1.00MB
base_model.model.model.layers.2.self_attn.q_proj.lora_A.

## Step 4 - Convert model into f16 or f32 GGUF versions

In [6]:
!python llama.cpp/convert.py -h

usage: convert.py [-h] [--awq-path AWQ_PATH] [--dump] [--dump-single]
                  [--vocab-only] [--outtype {f32,f16,q8_0}]
                  [--vocab-dir VOCAB_DIR] [--vocab-type VOCAB_TYPE]
                  [--outfile OUTFILE] [--ctx CTX] [--concurrency CONCURRENCY]
                  [--big-endian] [--pad-vocab] [--skip-unknown]
                  model

Convert a LLaMA model to a GGML compatible file

positional arguments:
  model                 directory containing model file, or model file itself
                        (*.pth, *.pt, *.bin)

options:
  -h, --help            show this help message and exit
  --awq-path AWQ_PATH   Path to scale awq cache file
  --dump                don't convert, just show what's in the model
  --dump-single         don't convert, just show what's in a single model file
  --vocab-only          extract only the vocab
  --outtype {f32,f16,q8_0}
                        output format - note: q8_0 may be very slow (default:
                      

In [5]:
# Store relative path as an ENV VAR
%env GGML_OUTPUT_DIR=./models/Llama-2-7b-chat-hf
!echo $GGML_OUTPUT_DIR
!realpath $GGML_OUTPUT_DIR

env: GGML_OUTPUT_DIR=./models/Llama-2-7b-chat-hf
./models/Llama-2-7b-chat-hf
/workspaces/Llama_to_Llama.cpp/models/Llama-2-7b-chat-hf


In [6]:
# Convert ENV VAR relative path to absolute path
os.environ["GGML_OUTPUT_DIR"] = os.popen("realpath $GGML_OUTPUT_DIR").read().strip()
print(os.environ["GGML_OUTPUT_DIR"])

/workspaces/Llama_to_Llama.cpp/models/Llama-2-7b-chat-hf
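
The same conversion can be done without spawning a shell. The sketch below uses `pathlib.Path.resolve()`, which resolves a path against the current working directory and follows symlinks, roughly like `realpath`. The helper name `absolutize_env_path` is hypothetical.

```python
import os
from pathlib import Path


def absolutize_env_path(var_name: str) -> str:
    """Replace the relative path stored in an environment variable with its absolute form."""
    absolute = Path(os.environ[var_name]).expanduser().resolve()
    os.environ[var_name] = str(absolute)
    return os.environ[var_name]


os.environ["GGML_OUTPUT_DIR"] = "./models/Llama-2-7b-chat-hf"
print(absolutize_env_path("GGML_OUTPUT_DIR"))
```

One difference worth noting: `Path.resolve()` is non-strict by default, so it returns a path even when the directory does not exist yet.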


In [7]:
!python llama.cpp/convert.py $GGML_OUTPUT_DIR

Loading model file models/Llama-2-7b-chat-hf/model-00001-of-00002.safetensors
Loading model file models/Llama-2-7b-chat-hf/model-00001-of-00002.safetensors
Loading model file models/Llama-2-7b-chat-hf/model-00002-of-00002.safetensors
params = Params(n_vocab=32000, n_embd=4096, n_layer=32, n_ctx=4096, n_ff=11008, n_head=32, n_head_kv=32, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=None, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=None, path_model=PosixPath('models/Llama-2-7b-chat-hf'))
Found vocab files: {'spm': PosixPath('models/Llama-2-7b-chat-hf/tokenizer.model'), 'bpe': None, 'hfft': PosixPath('models/Llama-2-7b-chat-hf/tokenizer.json')}
Loading vocab file PosixPath('models/Llama-2-7b-chat-hf/tokenizer.model'), type 'spm'
Vocab info: <SentencePieceVocab with 32000 base tokens and 0 added tokens>
Special vocab info: <SpecialVocab with 0 merges, special tokens {'bos': 1, 'eos': 2, 'unk': 0}, add special tokens {'bo
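
The call above relies on the defaults of `convert.py`. The help output lists `--outtype` and `--outfile`, so the output type and file name can also be set explicitly. The sketch below only builds the command; `build_convert_command` is a hypothetical helper, the output file naming is an assumption that mirrors the file name used in this notebook, and the flags are assumed to match the `convert.py` revision cloned in Step 1.

```python
import subprocess
import sys
from pathlib import Path


def build_convert_command(model_dir: str, outtype: str = "f16") -> list[str]:
    """Build the argv for llama.cpp/convert.py with an explicit output type and file."""
    if outtype not in {"f32", "f16", "q8_0"}:  # choices shown in the help output
        raise ValueError(f"unsupported outtype: {outtype}")
    outfile = Path(model_dir) / f"ggml-model-{outtype}.gguf"
    return [
        sys.executable,
        "llama.cpp/convert.py",
        model_dir,
        "--outtype", outtype,
        "--outfile", str(outfile),
    ]


cmd = build_convert_command("models/Llama-2-7b-chat-hf", "f32")
print(" ".join(cmd))
# To run the conversion (requires the cloned repo and the model files):
# subprocess.run(cmd, check=True)
```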

In [7]:
# Store relative path as an ENV VAR
%env GGML_MODEL_PATH=./models/Llama-2-7b-chat-hf/ggml-model-f16.gguf
!echo $GGML_MODEL_PATH
!realpath $GGML_MODEL_PATH

env: GGML_MODEL_PATH=./models/Llama-2-7b-chat-hf/ggml-model-f16.gguf
./models/Llama-2-7b-chat-hf/ggml-model-f16.gguf
/workspaces/Llama_to_Llama.cpp/models/Llama-2-7b-chat-hf/ggml-model-f16.gguf


In [8]:
# Convert ENV VAR relative path to absolute path
os.environ["GGML_MODEL_PATH"] = os.popen("realpath $GGML_MODEL_PATH").read().strip()
print(os.environ["GGML_MODEL_PATH"])

/workspaces/Llama_to_Llama.cpp/models/Llama-2-7b-chat-hf/ggml-model-f16.gguf
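
A quick sanity check on the converted file is to read its header. GGUF files begin with the four ASCII bytes `GGUF`, followed by a 32-bit version number. The sketch below assumes the default little-endian layout (that is, `--big-endian` was not passed to `convert.py`); `read_gguf_version` is a hypothetical helper, and the demonstration runs on a tiny stand-in file rather than a real model.

```python
import struct
import tempfile
from pathlib import Path


def read_gguf_version(path: str) -> int:
    """Return the GGUF version stored in the file header, or raise if it is not GGUF."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not start with the GGUF magic bytes")
    (version,) = struct.unpack("<I", header[4:8])
    return version


# Demonstration on a stand-in file; point it at GGML_MODEL_PATH for a real check.
with tempfile.TemporaryDirectory() as tmp:
    fake = Path(tmp) / "fake.gguf"
    fake.write_bytes(b"GGUF" + struct.pack("<I", 3))
    demo_version = read_gguf_version(str(fake))
print(demo_version)
```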


## Step 5 - Quantize model to 4 bits

In [61]:
!apt-get -y install cmake && \
    cd llama.cpp && \
    mkdir -p build && \
    cd build && \
    cmake .. && \
    cmake --build . --config Release

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
cmake is already the newest version (3.25.1-1).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
-- The C compiler identification is GNU 12.2.0
-- The CXX compiler identification is GNU 12.2.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.39.2") 
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- ARM detected
-- Performing Test COMPILER_SUPPORTS_FP16_FORMAT_I3E
-- Perfor
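
The build places its tools under `llama.cpp/build/bin`. At the revision used in this notebook the quantization tool is named `quantize`; newer llama.cpp revisions are assumed to ship it as `llama-quantize`. The sketch below looks for either name before the tool is invoked. `find_quantize_binary` is a hypothetical helper, and the demonstration uses a temporary directory in place of the real build folder.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_quantize_binary(bin_dir: str) -> Optional[Path]:
    """Return the first quantize executable found in bin_dir, or None."""
    for name in ("quantize", "llama-quantize"):
        candidate = Path(bin_dir) / name
        if candidate.is_file():
            return candidate
    return None


# Demonstration with a temporary directory standing in for llama.cpp/build/bin.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "quantize").touch()
    found = find_quantize_binary(tmp)
    found_name = found.name if found else None
print(found_name)
```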

In [9]:
# Store quantization option and path as ENV VARs
os.environ["QUANTIZE_OPTION"] = "q4_0"
os.environ["GGML_QUANTIZE_MODEL_PATH"] = (
    f"{os.popen('realpath $GGML_OUTPUT_DIR').read().strip()}"
    f"/ggml-model-{os.environ['QUANTIZE_OPTION']}.gguf"
)
print(os.environ["QUANTIZE_OPTION"])
print(os.environ["GGML_QUANTIZE_MODEL_PATH"])

q4_0
/workspaces/Llama_to_Llama.cpp/models/Llama-2-7b-chat-hf/ggml-model-q4_0.gguf


In [66]:
!cd llama.cpp/build/bin && \
    ./quantize $GGML_MODEL_PATH $GGML_QUANTIZE_MODEL_PATH $QUANTIZE_OPTION

main: build = 2321 (97311342)
main: built with cc (Debian 12.2.0-14) 12.2.0 for aarch64-linux-gnu
main: quantizing '/workspaces/Llama_to_Llama.cpp/models/Llama-2-7b-chat-hf/ggml-model-f16.gguf' to '/workspaces/Llama_to_Llama.cpp/models/Llama-2-7b-chat-hf/ggml-model-q4_0.gguf' as Q4_0
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /workspaces/Llama_to_Llama.cpp/models/Llama-2-7b-chat-hf/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:         
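
To get a feel for what the quantization buys, a rough size estimate helps. Q4_0 stores weights in blocks of 32: 16 bytes of packed 4-bit values plus a 2-byte fp16 scale, which works out to 4.5 bits per weight. The sketch below assumes about 6.74 billion parameters for Llama-2-7B and treats every tensor as stored in the same type. Real files keep some tensors at higher precision and add metadata, so the actual size will differ somewhat.

```python
def bits_per_weight_q4_0() -> float:
    """Effective bits per weight of a Q4_0 block (32 weights per block)."""
    block_weights = 32
    block_bytes = 16 + 2  # 32 x 4-bit values + one fp16 scale
    return block_bytes * 8 / block_weights


def estimated_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate tensor data size in GiB for a given parameter count."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 6.74e9  # assumption: approximate parameter count of Llama-2-7B
for label, bpw in (("f16", 16.0), ("q4_0", bits_per_weight_q4_0())):
    print(f"{label}: ~{estimated_size_gib(N_PARAMS, bpw):.1f} GiB")
```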

## Optional: Upload to HuggingFace

In [67]:
# %pip install "huggingface_hub[cli]" --upgrade --user  # already included in the dev container environment

In [13]:
logger.info("Logging in to Hugging Face...")
HUGGINGFACE_USER = os.getenv("HUGGINGFACE_USER")
HUGGINGFACE_TOKEN = os.getenv("HUGGINGFACE_TOKEN")
login(token=HUGGINGFACE_TOKEN, write_permission=True)


api = HfApi()

REPO_ID = f"{HUGGINGFACE_USER}/llama-2-7b-chat-{os.environ['QUANTIZE_OPTION']}.gguf"
logger.info("Creating repo %s in Hugging Face...", REPO_ID)
api.create_repo(
    repo_id=REPO_ID,
    token=HUGGINGFACE_TOKEN,
    repo_type="model",
    exist_ok=True,
)

logger.info("Uploading fine-tuned / quantized model to %s in Hugging Face...", REPO_ID)
api.upload_folder(
    repo_id=REPO_ID,
    token=HUGGINGFACE_TOKEN,
    repo_type="model",
    folder_path=os.environ["GGML_OUTPUT_DIR"],
    allow_patterns="*.gguf",    # Remove if you would like to upload different file types.
    multi_commits=True,         # For large file uploads, refer: https://huggingface.co/docs/huggingface_hub/guides/upload
    multi_commits_verbose=True, # For large file uploads, refer: https://huggingface.co/docs/huggingface_hub/guides/upload
)

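
Since `allow_patterns="*.gguf"` decides what gets uploaded, it can be useful to preview the matching files first. The sketch below lists them with `fnmatch`-style globbing, which is assumed to approximate how the library applies the pattern. `preview_upload` is a hypothetical helper, and the demonstration uses empty placeholder files.

```python
import tempfile
from fnmatch import fnmatch
from pathlib import Path


def preview_upload(folder: str, pattern: str = "*.gguf") -> list[str]:
    """List files under folder (as relative paths) that match the glob pattern."""
    root = Path(folder)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and fnmatch(p.relative_to(root).as_posix(), pattern)
    )


# Demonstration with placeholder files standing in for the model directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("ggml-model-f16.gguf", "ggml-model-q4_0.gguf", "config.json"):
        (Path(tmp) / name).touch()
    matches = preview_upload(tmp)
print(matches)
```

In the notebook, `preview_upload(os.environ["GGML_OUTPUT_DIR"])` would show what the upload is expected to pick up, including the large f16 file if it is still in the folder.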