# Converting Model to GGUF


If you’re running inference, you can use the GPU for acceleration. However, llama.cpp quantization is performed on the CPU.

First, we clone llama.cpp and build it. The build can take around 10 minutes. It uses CMake because the Makefile-based build is deprecated.

In [None]:
!git clone https://github.com/ggerganov/llama.cpp
%cd llama.cpp
!cmake -B build
!cmake --build build --config Release
!pip install -r requirements.txt
%cd ..

Cloning into 'llama.cpp'...
remote: Enumerating objects: 49065, done.
remote: Counting objects: 100% (179/179), done.
remote: Compressing objects: 100% (123/123), done.
remote: Total 49065 (delta 124), reused 56 (delta 56), pack-reused 48886 (from 4)
Receiving objects: 100% (49065/49065), 103.16 MiB | 21.53 MiB/s, done.
Resolving deltas: 100% (35316/35316), done.
/content/llama.cpp
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.34.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- 
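
Before moving on, it can help to confirm that the build produced the tools used later in this notebook. The sketch below is a hypothetical helper (`missing_binaries` is not part of llama.cpp). It assumes the binaries land in `llama.cpp/build/bin`, which is where the CMake build above puts them in this notebook.

```python
from pathlib import Path

def missing_binaries(bin_dir, names=("llama-quantize", "llama-cli")):
    """Return the names from `names` that are not present as files in `bin_dir`."""
    bin_path = Path(bin_dir)
    return [name for name in names if not (bin_path / name).is_file()]

# Assumed location of the build output for the CMake build above.
missing = missing_binaries("llama.cpp/build/bin")
if missing:
    print("Missing binaries:", ", ".join(missing))
else:
    print("All required binaries were built.")
```

If anything is reported missing, rerun the build cell and check its output for errors.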

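Run the cell below if you want GPU acceleration for inference. Uncomment the commands and run them from inside the llama.cpp directory (the previous cell ends with `%cd ..`).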

In [None]:
#!cmake -B build -DGGML_CUDA=ON
#!cmake --build build --config Release
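
One way to choose between the two builds automatically is to look for the NVIDIA driver tool on the PATH. This is only a sketch: `cmake_configure_args` is a hypothetical helper, and the presence of `nvidia-smi` is assumed here to be a good-enough signal that a CUDA GPU is attached. It does not check that the CUDA toolkit itself is installed.

```python
import shutil

def cmake_configure_args(use_cuda):
    """Build the argument list for the CMake configure step."""
    args = ["cmake", "-B", "build"]
    if use_cuda:
        # Same flag as in the commented-out cell above.
        args.append("-DGGML_CUDA=ON")
    return args

# Assumption: nvidia-smi on the PATH means a CUDA-capable GPU is available.
has_gpu = shutil.which("nvidia-smi") is not None
print(" ".join(cmake_configure_args(has_gpu)))
```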

## Downloading libraries and models

In [None]:
!pip install -U huggingface_hub[hf_xet]
#!pip install -U huggingface_hub
#!pip uninstall -y huggingface_hub

Collecting huggingface_hub[hf_xet]
  Downloading huggingface_hub-0.30.2-py3-none-any.whl.metadata (13 kB)
Collecting hf-xet>=0.1.4 (from huggingface_hub[hf_xet])
  Downloading hf_xet-1.0.3-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (494 bytes)
Downloading hf_xet-1.0.3-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (53.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.8/53.8 MB 9.4 MB/s eta 0:00:00
Downloading huggingface_hub-0.30.2-py3-none-any.whl (481 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 481.4/481.4 kB 16.6 MB/s eta 0:00:00
Installing collected packages: hf-xet, huggingface_hub
  Attempting uninstall: huggingface_hub
    Found existing installation: huggingface-hub 0.23.5
    Uninstalling huggingface-hub-0.23.5:
      Successfully uninstalled huggingface-hub-0.23.5
Successfully installed hf-xet-1.0.3 huggingface_hub-0.30.2


If you have Xet early access, install the Xet-enabled version (huggingface_hub[hf_xet]). If not, install plain huggingface_hub, which transfers files through Git LFS. If your Python environment has an hf_xet-aware version of huggingface_hub, your uploads and downloads will automatically use Xet.
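
To see which of the two setups is active, you can check whether the `hf_xet` module is importable. This is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, and an importable `hf_xet` only tells you the package is present, not that a given repository is served through Xet.

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module with this name can be imported."""
    return importlib.util.find_spec(module_name) is not None

print("hf_xet installed:", is_installed("hf_xet"))
```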



In [None]:
import os
from huggingface_hub import HfApi, login, hf_hub_download, snapshot_download
import subprocess
from getpass import getpass

If you are not familiar with access tokens, see https://huggingface.co/docs/hub/en/security-tokens. A read-only token is not enough here: to create a new repo and upload to it, generate a token with write access.

In [None]:
hf_token = getpass("Enter your Hugging Face token: ")
login(token=hf_token)

Enter your Hugging Face token: ··········
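
If you rerun the notebook often, typing the token each time gets tedious. The sketch below reads it from an environment variable first and falls back to the prompt. `get_hf_token` is a hypothetical helper, and using the `HF_TOKEN` variable name is an assumption about how you store the token in your environment.

```python
import os
from getpass import getpass

def get_hf_token(env=None, prompt=getpass):
    """Return the token from HF_TOKEN if set, otherwise ask for it interactively."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if token:
        return token
    return prompt("Enter your Hugging Face token: ")
```

You could then write `hf_token = get_hf_token()` in place of the `getpass` call above and pass the result to `login` as before.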


Set your configuration here: your username, the model name, and the quantization methods you want.

In [None]:
model_name = "meta-llama/Llama-3.2-3B-Instruct"
quant_levels = ["Q3_K_S"]  # Other quant levels: "Q3_K_M", "Q3_K_L", "Q4_0", "Q4_1", "Q4_K_S", "Q4_K_M", "Q5_0", "Q5_1", "Q5_K_S", "Q5_K_M", "Q6_K", "Q8_0"
new_repo_name = "username/Llama-3.2-3B-Instruct-GGUF"  # replace "username" with your Hugging Face username
local_dir = "./model"
model_short_name = model_name.split("/")[-1]
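
A typo in `quant_levels` only shows up later, when the quantization step fails. As a sketch, the helper below checks the levels up front and shows the file names the rest of the notebook will produce. `quantized_filenames` is a hypothetical helper, and `KNOWN_QUANT_LEVELS` contains only the levels listed in the comment above, not everything `llama-quantize` may support.

```python
# Only the levels mentioned in this notebook; llama-quantize may accept more.
KNOWN_QUANT_LEVELS = {
    "Q3_K_S", "Q3_K_M", "Q3_K_L", "Q4_0", "Q4_1", "Q4_K_S", "Q4_K_M",
    "Q5_0", "Q5_1", "Q5_K_S", "Q5_K_M", "Q6_K", "Q8_0",
}

def quantized_filenames(model_name, quant_levels):
    """Validate the quant levels and return the output file names for them."""
    unknown = [q for q in quant_levels if q not in KNOWN_QUANT_LEVELS]
    if unknown:
        raise ValueError(f"Unknown quantization level(s): {unknown}")
    short_name = model_name.split("/")[-1]
    return [f"{short_name}-{q.lower()}.gguf" for q in quant_levels]

print(quantized_filenames("meta-llama/Llama-3.2-3B-Instruct", ["Q3_K_S", "Q4_K_M"]))
```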

Optionally, remove files left over from a previous run. This deletes everything in local_dir.

In [None]:
!df -h .
if os.path.exists(local_dir):
    !rm -rf {local_dir}
os.makedirs(local_dir, exist_ok=True)

Filesystem      Size  Used Avail Use% Mounted on
overlay         226G   45G  182G  20% /
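
To judge whether the free space shown by `df` is enough, a rough estimate helps: 16-bit weights take about 2 bytes per parameter. The sketch below uses hypothetical helper names and a deliberately crude model. It ignores file metadata and the tensors kept in 32-bit, and it assumes you need room for both the downloaded weights and the f16 GGUF (quantized outputs come on top of that).

```python
import shutil

def f16_size_gib(n_params):
    """Approximate size in GiB of n_params weights stored at 2 bytes each."""
    return n_params * 2 / 2**30

def has_room(path, needed_gib):
    """Return True if the filesystem holding `path` has at least needed_gib free."""
    free_gib = shutil.disk_usage(path).free / 2**30
    return free_gib >= needed_gib

# Assumption: about 3.2e9 parameters for the 3B model, counted twice
# (original download plus the f16 GGUF).
needed = 2 * f16_size_gib(3.2e9)
print(f"Roughly {needed:.1f} GiB needed; enough space: {has_room('.', needed)}")
```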


Download the model, keeping only the files needed for conversion. The allow and ignore patterns skip unwanted files such as .pth checkpoints.

In [None]:
print(f"Downloading repository: {model_name}")
model_dir = snapshot_download(
    repo_id=model_name,
    local_dir=local_dir,
    token=hf_token,
    force_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.model", "*.txt"],
    ignore_patterns=["*.*pth"]
)
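
To preview which files the patterns above would keep, you can approximate the filtering with the standard library's `fnmatch`. This is a sketch under the assumption that a file is downloaded when it matches at least one allow pattern and no ignore pattern; the library's own filter may differ in detail. `is_kept` and the example file names are illustrative.

```python
from fnmatch import fnmatch

ALLOW = ["*.json", "*.safetensors", "*.model", "*.txt"]
IGNORE = ["*.*pth"]

def is_kept(filename, allow=ALLOW, ignore=IGNORE):
    """Return True if filename matches an allow pattern and no ignore pattern."""
    if not any(fnmatch(filename, pattern) for pattern in allow):
        return False
    return not any(fnmatch(filename, pattern) for pattern in ignore)

# Illustrative file names, not a listing of the actual repository.
for name in ["config.json", "model-00001-of-00002.safetensors",
             "original/consolidated.00.pth", "README.md"]:
    print(f"{name}: {'download' if is_kept(name) else 'skip'}")
```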

Begin quantization

## Quantization

Here we first convert the model to a 16-bit (f16) GGUF file. Conversion and quantization can take a while.


In [None]:
f16_output_file = f"{local_dir}/{model_short_name}-f16.gguf"
print(f"Converting model to GGUF (f16)...")

!./llama.cpp/convert_hf_to_gguf.py "{model_dir}" --outfile "{f16_output_file}" --outtype f16

Converting model to GGUF (f16)...
INFO:hf-to-gguf:Loading model: model
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00002.safetensors'
INFO:hf-to-gguf:token_embd.weight,           torch.bfloat16 --> F16, shape = {3072, 128256}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.bfloat16 --> F32, shape = {3072}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.bfloat16 --> F16, shape = {8192, 3072}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.bfloat16 --> F16, shape = {3072, 8192}
INFO:hf-to-gguf:blk.0.ffn_up.weight,         torch.bfloat16 --> F16, shape = {3072, 8192}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {3072}
INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.bflo
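
A quick sanity check on the converted file is to read its header: a GGUF file starts with the four bytes `GGUF`, followed by a 32-bit version number. The sketch below assumes a little-endian file, which matches the "Little Endian only" line in the log above. `read_gguf_version` is a hypothetical helper, not part of llama.cpp.

```python
import struct

def read_gguf_version(path):
    """Return the GGUF version stored in the file header, or raise ValueError."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Bytes 4..8 hold the version as a little-endian unsigned 32-bit integer.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

Calling `read_gguf_version(f16_output_file)` after the conversion cell has finished is one way to confirm the file was written and is not truncated at the very start.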

This step converts the 16-bit GGUF file into a quantized version, once for each entry in quant_levels. Add other quantization methods to that list to produce several variants in one run.

In [None]:
for quant_level in quant_levels:
    print(f"Quantizing model to {quant_level}...")
    output_file = f"{local_dir}/{model_short_name}-{quant_level.lower()}.gguf"

    !./llama.cpp/build/bin/llama-quantize {f16_output_file} {output_file} {quant_level}

    if os.path.exists(output_file):
        print(f"Quantized model saved as {output_file}")
    else:
        print(f"Error quantizing to {quant_level}: Output file not created")
        continue

Quantizing model to Q3_K_S...
main: build = 5174 (56304069)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing './model/Llama-3.2-3B-Instruct-f16.gguf' to './model/Llama-3.2-3B-Instruct-q3_k_s.gguf' as Q3_K_S
llama_model_loader: loaded meta data with 26 key-value pairs and 255 tensors from ./model/Llama-3.2-3B-Instruct-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Model
llama_model_loader: - kv   3:                         general.size_label str              = 3.2B
llama_model_loader: - kv   4:                          llama.block_count u32              = 28
llama_model_lo
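
To build some intuition for what quantization trades away, here is a toy block quantizer. It is not the algorithm `llama-quantize` runs for `Q3_K_S`. It is a simplified 8-bit scheme (one scale per block, values rounded to integers in [-127, 127]), written in plain Python purely for illustration, and the function names are hypothetical.

```python
def quantize_block(values):
    """Map floats to integers in [-127, 127] that share a single scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return scale, quantized

def dequantize_block(scale, quantized):
    """Recover approximate floats from the integers and their scale."""
    return [scale * q for q in quantized]

block = [0.5, -1.0, 0.25, 0.0]
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.6f}, max round-trip error={max_error:.6f}")
```

The round-trip error stays within half a quantization step (`scale / 2`). Lower-bit formats use fewer integer levels per block, so the step and the error grow while the file shrinks.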

In [None]:
!ls llama.cpp/build/bin/

libggml-base.so		       llama-q8dot
libggml-cpu.so		       llama-quantize
libggml.so		       llama-quantize-stats
libllama.so		       llama-qwen2vl-cli
libllava_shared.so	       llama-retrieval
libmtmd_shared.so	       llama-run
llama-batched		       llama-save-load-state
llama-batched-bench	       llama-server
llama-bench		       llama-simple
llama-cli		       llama-simple-chat
llama-convert-llama2c-to-ggml  llama-speculative
llama-cvector-generator        llama-speculative-simple
llama-embedding		       llama-tokenize
llama-eval-callback	       llama-tts
llama-export-lora	       llama-vdot
llama-gbnf-validator	       test-arg-parser
llama-gemma3-cli	       test-autorelease
llama-gen-docs		       test-backend-ops
llama-gguf		       test-barrier
llama-gguf-hash		       test-c
llama-gguf-split	       test-chat
llama-gritlm		       test-chat-template
llama-imatrix		       test-gguf
llama-infill		       test-grammar-integration
llama-llava-cli		       test-grammar-parser
llama-llava-clip-

## Creating Repo and Uploading Models

In [None]:
api = HfApi()
api.create_repo(repo_id=new_repo_name, token=hf_token, exist_ok=True)
print(f"Created repository: {new_repo_name}")

In [None]:
print("Uploading quantized models...")
for quant_level in quant_levels:
    quantized_file = f"{local_dir}/{model_short_name}-{quant_level.lower()}.gguf"
    if not os.path.exists(quantized_file):
        print(f"File {quantized_file} not found.")
        continue
    api.upload_file(
        path_or_fileobj=quantized_file,
        path_in_repo=f"{model_short_name}-{quant_level.lower()}.gguf",
        repo_id=new_repo_name,
        repo_type="model",
        token=hf_token
    )
    print(f"Uploaded {quantized_file}")
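
Since the quantized files are large, it can be useful to record a checksum of each one before uploading, so that a later download can be compared against it. The sketch below uses only the standard library. `sha256_of_file` is a hypothetical helper, and reading in 1 MiB chunks is an arbitrary choice that keeps memory use low.

```python
import hashlib

def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `sha256_of_file(quantized_file)` inside the upload loop would give you a digest to print next to each uploaded file.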

## Inference (In Development)

You can run the models on either GPU or CPU. Enter your prompt when the cell asks for it. To continue the conversation, click at the end of the output and type there (your input text is shown hidden, as ***).

In [None]:
model_list = [file for file in os.listdir(local_dir) if file.endswith(".gguf")]

prompt = input("Enter your prompt: ")
chosen_model = input("Name of the model (options: " + ", ".join(model_list) + "): ")

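# -n limits the number of generated tokens; to offload layers to the GPU, add -ngl <number of layers> (needs the CUDA build)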
if chosen_model not in model_list:
    print("Invalid name")
else:
    model_path = f"{local_dir}/{chosen_model}"
    !./llama.cpp/build/bin/llama-cli -m {model_path} -t 2 --color -c 256 --temp 0.7 -n 56 -p "USER: {prompt}\nASSISTANT:"

Enter your prompt: Who went to space first?
Name of the model (options: Llama-3.2-3B-Instruct-q3_k_s.gguf, Llama-3.2-3B-Instruct-f16.gguf): Llama-3.2-3B-Instruct-q3_k_s.gguf
build: 5174 (56304069) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 26 key-value pairs and 255 tensors from ./model/Llama-3.2-3B-Instruct-q3_k_s.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Model
llama_model_loader: - kv   3:                         general.size_label str              = 3.2B
llama_model_loader:
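
Prompts that contain quotes or other shell characters can break the `!` command above. One alternative, sketched here, is to build the argument list in Python and pass it to `subprocess.run`, which bypasses the shell. `llama_cli_args` is a hypothetical helper; the flags are the ones used in the cell above, plus `-ngl` for GPU layer offload, which only has an effect with the CUDA build.

```python
import subprocess

def llama_cli_args(model_path, prompt, n_predict=56, n_gpu_layers=0,
                   binary="./llama.cpp/build/bin/llama-cli"):
    """Build the llama-cli argument list without going through a shell."""
    args = [
        binary, "-m", model_path,
        "-t", "2", "--color", "-c", "256", "--temp", "0.7",
        "-n", str(n_predict),
        "-p", f"USER: {prompt}\nASSISTANT:",
    ]
    if n_gpu_layers > 0:
        args += ["-ngl", str(n_gpu_layers)]
    return args

# Example (not run here because it needs the built binary and a model file):
# subprocess.run(llama_cli_args(model_path, prompt), check=True)
```

Because each argument is passed as a separate list element, the prompt reaches `llama-cli` exactly as typed.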