<a href="https://colab.research.google.com/github/limcheekin/LLM-Engineers-Handbook/blob/main/notebooks/hf_to_gguf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Quantize HF models using GGUF and llama.cpp
> üó£Ô∏è The notebook is updated version of [Quantize_Llama_2_models_using_GGUF_and_llama_cpp.ipynb](https://colab.research.google.com/github/mlabonne/llm-course/blob/main/Quantize_Llama_2_models_using_GGUF_and_llama_cpp.ipynb) from the [Large Language Model Course](https://github.com/mlabonne/llm-course)

‚ù§Ô∏è Created by [@limcheekin](https://github.com/limcheekin).

## Usage

* `MODEL_ID`: The ID of the model to quantize (e.g., `mlabonne/EvolCodeLlama-7b`).
* `QUANTIZATION_METHOD`: The quantization method to use.

## Quantization methods

The names of the quantization methods follow the naming convention: "q" + the number of bits + the variant used (detailed below). Here is a list of all the possible quant methods and their corresponding use cases, based on model cards made by [TheBloke](https://huggingface.co/TheBloke/):

* `q2_k`: Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.
* `q3_k_l`: Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
* `q3_k_m`: Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
* `q3_k_s`: Uses Q3_K for all tensors
* `q4_0`: Original quant method, 4-bit.
* `q4_1`: Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.
* `q4_k_m`: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K
* `q4_k_s`: Uses Q4_K for all tensors
* `q5_0`: Higher accuracy, higher resource usage and slower inference.
* `q5_1`: Even higher accuracy, resource usage and slower inference.
* `q5_k_m`: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K
* `q5_k_s`:  Uses Q5_K for all tensors
* `q6_k`: Uses Q8_K for all tensors
* `q8_0`: Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.

As a rule of thumb, **I recommend using Q5_K_M** as it preserves most of the model's performance. Alternatively, you can use Q4_K_M if you want to save some memory. In general, K_M versions are better than K_S versions. I cannot recommend Q2_K or Q3_* versions, as they drastically decrease model performance.

In [None]:
# Variables
MODEL_ID = "limcheekin/TwinLlama-3.2-3B"
QUANTIZATION_METHODS = ["q4_k_m", "q5_k_m", "q8_0"]

# Constants
MODEL_NAME = MODEL_ID.split('/')[-1]

!rm -r $MODEL_NAME
!rm -r llama.cpp

# Install llama.cpp
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && cmake -B build && cmake --build build --config Release
!pip install -r llama.cpp/requirements.txt

# Download model
!git lfs install
!git clone https://huggingface.co/{MODEL_ID}

# Uncomment the code below to download gated model
#from huggingface_hub import login, snapshot_download
#from google.colab import userdata
#hf_token = userdata.get('huggingface') # defined in the secrets tab in Google Colab
#login(hf_token)
#model_path = snapshot_download(repo_id=MODEL_ID, local_dir=MODEL_NAME)
#print(f"Model downloaded to {model_path}")

# Convert to fp16
fp16 = f"{MODEL_NAME}/{MODEL_NAME.lower()}.F16.gguf"
!python llama.cpp/convert_hf_to_gguf.py {MODEL_NAME} --outtype f16 --outfile {fp16}

# Quantize the model for each method in the QUANTIZATION_METHODS list
for method in QUANTIZATION_METHODS:
    qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{method.upper()}.gguf"
    !./llama.cpp/build/bin/llama-quantize {fp16} {qtype} {method}

Cloning into 'llama.cpp'...
remote: Enumerating objects: 38964, done.[K
remote: Counting objects: 100% (24070/24070), done.[K
remote: Compressing objects: 100% (1273/1273), done.[K
remote: Total 38964 (delta 23266), reused 22839 (delta 22796), pack-reused 14894 (from 1)[K
Receiving objects: 100% (38964/38964), 56.53 MiB | 11.62 MiB/s, done.
Resolving deltas: 100% (28953/28953), done.
Already up to date.
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.34.1")
-- Performing Test CMAKE_HAVE_LI

## Run inference

Here is a simple script to run your quantized models. I'm offloading every layer to the GPU (35 for a 7b parameter model) to speed up inference.

In [7]:
import os

model_list = [file for file in os.listdir(MODEL_NAME) if "gguf" in file]

prompt = input("Enter your prompt: ")
chosen_method = input("Name of the model (options: " + ", ".join(model_list) + "): ")

# Verify the chosen method is in the list
if chosen_method not in model_list:
    print("Invalid name")
else:
    qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{method.upper()}.gguf"
    !./llama.cpp/build/bin/llama-cli -m {qtype} -n 128 --color -ngl 35 -p "{prompt}"

Enter your prompt: Below is an instruction that describes a task. Write a response that appropriately completes the request.  ### Instruction: Summarize the key takeaways from the book 'The Lean Startup.'  ### Response:
Name of the model (options: twinllama-3.2-3b.Q8_0.gguf, twinllama-3.2-3b.Q4_K_M.gguf, twinllama-3.2-3b.F16.gguf, twinllama-3.2-3b.Q5_K_M.gguf): twinllama-3.2-3b.Q4_K_M.gguf
build: 4263 (253b7fde) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from TwinLlama-3.2-3B/twinllama-3.2-3b.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              =

## Push to hub

To push your model to the hub, you'll need to input your Hugging Face token (https://huggingface.co/settings/tokens) in Google Colab's "Secrets" tab. The following code creates a new repo with the "-GGUF" suffix. Don't forget to change the `username` variable.

In [6]:
!pip install -q huggingface_hub
from huggingface_hub import create_repo, HfApi
from google.colab import userdata

# Defined in the secrets tab in Google Colab
hf_token = userdata.get('huggingface')

api = HfApi()
username = "limcheekin"

# Create empty repo
create_repo(
    repo_id = f"{username}/{MODEL_NAME}-GGUF",
    repo_type="model",
    exist_ok=True,
    token=hf_token
)

# Upload gguf files
api.upload_folder(
    folder_path=MODEL_NAME,
    repo_id=f"{username}/{MODEL_NAME}-GGUF",
    allow_patterns=f"*.gguf",
    token=hf_token
)

[0m

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


twinllama-3.2-3b.Q4_K_M.gguf:   0%|          | 0.00/2.02G [00:00<?, ?B/s]

twinllama-3.2-3b.Q5_K_M.gguf:   0%|          | 0.00/2.32G [00:00<?, ?B/s]

twinllama-3.2-3b.F16.gguf:   0%|          | 0.00/6.43G [00:00<?, ?B/s]

twinllama-3.2-3b.Q8_0.gguf:   0%|          | 0.00/3.42G [00:00<?, ?B/s]

Upload 4 LFS files:   0%|          | 0/4 [00:00<?, ?it/s]

CommitInfo(commit_url='https://huggingface.co/limcheekin/TwinLlama-3.2-3B-GGUF/commit/9a0aac50b92d1de127c87959b6c450e2deb27080', commit_message='Upload folder using huggingface_hub', commit_description='', oid='9a0aac50b92d1de127c87959b6c450e2deb27080', pr_url=None, repo_url=RepoUrl('https://huggingface.co/limcheekin/TwinLlama-3.2-3B-GGUF', endpoint='https://huggingface.co', repo_type='model', repo_id='limcheekin/TwinLlama-3.2-3B-GGUF'), pr_revision=None, pr_num=None)