# Converting Hugging Face Models to GGUF Format

This notebook walks through converting a Large Language Model (LLM) from Hugging Face to GGUF format for efficient local deployment and inference. The conversion works for model architectures that llama.cpp supports (see Section 3.1).

## 1. Overview of Hugging Face Model Structure and Format

Hugging Face models are typically stored in one of the following formats:

### 1.1 PyTorch Format
- Models saved using `model.save_pretrained()` are stored as:
  - `config.json`: Model architecture and hyperparameters
  - `pytorch_model.bin` or sharded files like `pytorch_model-00001-of-00003.bin`: Contains model weights
  - `tokenizer.json` and related files: Tokenization configuration
  - `generation_config.json`: Parameters for text generation

### 1.2 Safetensors Format
- Same structure as PyTorch format but with `.safetensors` files instead of `.bin`
- Safer to load: unlike the pickle-based `.bin` files, loading a `.safetensors` file cannot execute arbitrary code
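
The layout described above can be checked programmatically before conversion. The following is a minimal sketch, assuming a locally downloaded model folder; the helper name `describe_model_dir` is hypothetical, and it relies only on the file names listed above plus the `architectures` key commonly found in `config.json`.

```python
import json
from pathlib import Path


def describe_model_dir(model_dir):
    """Summarize a Hugging Face model folder: architecture and weight format."""
    root = Path(model_dir)
    config = json.loads((root / "config.json").read_text())

    # Weight files end in .safetensors or .bin; a sharded model has several of them.
    safetensors = sorted(p.name for p in root.glob("*.safetensors"))
    pytorch_bins = sorted(p.name for p in root.glob("*.bin"))

    if safetensors:
        weight_format = "safetensors"
    elif pytorch_bins:
        weight_format = "pytorch"
    else:
        weight_format = "none"

    return {
        "architectures": config.get("architectures", []),
        "weight_format": weight_format,
        "weight_files": safetensors or pytorch_bins,
        "has_tokenizer": (root / "tokenizer.json").exists(),
    }
```

Calling `describe_model_dir("models/Llama-3.2-3B-Instruct")` after the download step in Section 5.2 would report which weight format the folder contains.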

### 1.3 Transformers Library Integration
Hugging Face models are designed to work with the Transformers library and are organized around four components:
- Model architecture (e.g., GPT-2, LLaMA, Mistral)
- Pretrained weights
- Tokenizer for text processing
- Configuration parameters

## 2. What is GGUF Format and Why Use It?

### 2.1 GGUF (GPT-Generated Unified Format)
- GGUF is the successor to GGML, the earlier file format of llama.cpp (the "GG" in GGML comes from the initials of its author, Georgi Gerganov)
- It's a binary format optimized for efficient inference on consumer hardware
- Replaced GGML in August 2023 as the standard format for llama.cpp

### 2.2 Benefits of GGUF
- **Efficiency**: Designed for CPU inference; the file can be memory-mapped, so weights load without an extra in-memory copy
- **Quantization**: Supports multiple precision levels to reduce model size
- **Speed**: Faster loading and inference times
- **Portability**: Run models on consumer hardware without specialized GPUs
- **Embedded metadata**: Includes model info, tokenizer, and parameters in a single file
- **Local deployment**: Run models completely offline without cloud dependencies
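
Because metadata is embedded in the file, a GGUF file can be identified from its first bytes. The sketch below assumes the little-endian header layout used by GGUF version 2 and later (4 magic bytes, a 32-bit version, then 64-bit tensor and metadata counts); the function name `read_gguf_header` is hypothetical.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != GGUF_MAGIC:
            raise ValueError(f"not a GGUF file: magic bytes are {magic!r}")
        # Little-endian: uint32 version, uint64 tensor count, uint64 key/value count.
        version, tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
    return version, tensor_count, kv_count
```

The counts returned here correspond to the "key-value pairs" and "tensors" figures that llama.cpp prints when it loads a model.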

## 3. Key Considerations When Converting to GGUF

### 3.1 Model Architecture Compatibility
- Not all model architectures are supported by llama.cpp
- Best compatibility: LLaMA, Mistral, Falcon, MPT, and similar decoder-only transformers
- Encoder-decoder models may require additional steps or might not be fully supported

### 3.2 Memory Requirements
- Conversion process temporarily requires more memory than the original model size
- For 7B parameter models, you'll need at least 16GB RAM
- For larger models (13B+), you may need 32GB+ RAM
- A rough formula for estimating the RAM needed to hold an FP16 (half-precision) model is:

  RAM required (in GB) = (Number of parameters × 2 × 1.2) / 10^9

  Where:
  - Number of parameters: The size of the model (e.g., 7 × 10^9 for a 7B model)
  - × 2: FP16 precision (2 bytes per parameter)
  - × 1.2: Overhead factor for additional memory needs
  - Divided by 10^9 to convert bytes to GB

  For example:
  - 7B parameter model in FP16: ~16.8 GB
  - 13B parameter model in FP16: ~31.2 GB
  - 70B parameter model in FP16: ~168 GB
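
The estimate above is easy to reproduce in code. This is a small sketch of the same formula; the function name and the default arguments (2 bytes per parameter, 1.2 overhead factor) simply mirror the values used in this section.

```python
def estimate_ram_gb(num_parameters, bytes_per_parameter=2, overhead=1.2):
    """Rough RAM estimate in GB: parameters x bytes per parameter x overhead."""
    return num_parameters * bytes_per_parameter * overhead / 1e9


for label, count in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    print(f"{label}: ~{estimate_ram_gb(count):.1f} GB")
```

Passing a different `bytes_per_parameter` (for example 4 for FP32) adapts the estimate to other precisions.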

### 3.3 Quantization Considerations
- Higher-bit quantization types preserve more accuracy but produce larger files and need more memory
- Lower-bit types allow models to run on devices with limited resources
- The right choice balances model size against output quality

## 4. Understanding GGUF Quantization Notation

When you see GGUF models, they often have suffixes that indicate their quantization level:

### 4.1 Quantization Types
- **F16**: 16-bit floating point (no quantization; the weights are only converted to the GGUF container)
- **Q8_0**: 8-bit quantization in blocks of 32 weights, one scale per block (legacy format, not a K-quant)
- **Q6_K**: 6-bit K-quant (super-blocks of 256 weights)
- **Q5_K**: 5-bit K-quant (super-blocks of 256 weights)
- **Q5_1**: 5-bit quantization in blocks of 32 weights, with a scale and a minimum per block
- **Q5_0**: 5-bit quantization in blocks of 32 weights, with a scale per block
- **Q4_K**: 4-bit K-quant (super-blocks of 256 weights; good balance)
- **Q4_1**: 4-bit quantization in blocks of 32 weights, with a scale and a minimum per block
- **Q4_0**: 4-bit quantization in blocks of 32 weights, with a scale per block
- **Q3_K**: 3-bit K-quant (super-blocks of 256 weights)
- **Q2_K**: 2-bit K-quant (very small, significant quality loss)

K-quant names usually carry a size suffix (`_S`, `_M`, `_L`); `llama-quantize` accepts `Q3_K`, `Q4_K`, and `Q5_K` as aliases for the `_M` variants.
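
To make the idea of block-wise quantization concrete, here is a minimal sketch in the spirit of Q8_0, assuming blocks of 32 weights that share one scale. It is an illustration only, not llama.cpp's implementation: the real format stores the scale as a 16-bit float and packs the integers into a binary layout, and all names below are hypothetical.

```python
BLOCK_SIZE = 32  # weights per block in this sketch


def quantize_block(values):
    """Quantize one block to signed 8-bit integers plus a single scale."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude in the block to 127; avoid dividing by zero.
    scale = max_abs / 127 if max_abs else 1.0
    quants = [round(v / scale) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate float values from a quantized block."""
    return [scale * q for q in quants]


def quantize(weights):
    """Split a flat list of weights into blocks and quantize each block."""
    return [
        quantize_block(weights[start:start + BLOCK_SIZE])
        for start in range(0, len(weights), BLOCK_SIZE)
    ]
```

Each reconstructed weight differs from the original by at most half a scale step, which is why blocks with a small value range lose less precision than blocks containing outliers.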

### 4.2 Performance vs. Quality Trade-offs
- Higher-bit formats (Q8_0, F16): Higher quality, larger file size
- Mid-range quantization (Q4_K, Q5_K): Good balance for most use cases
- Lower-bit quantization (Q2_K, Q3_K): Smaller size, significant quality degradation
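
The size side of this trade-off can be approximated from the storage cost per weight. The sketch below assumes the simple block layout of the legacy types (32 weights per block, one 16-bit scale, plus one 16-bit minimum for the `_1` types); K-quants use a different layout, and real files also contain metadata and may keep some tensors at higher precision, so treat the results as rough estimates. Function names are hypothetical.

```python
BLOCK_WEIGHTS = 32  # weights per block (assumed legacy layout)
SCALE_BYTES = 2     # one 16-bit scale per block


def bits_per_weight(bits, has_minimum=False):
    """Effective bits per weight, including the per-block scale (and minimum)."""
    block_bytes = BLOCK_WEIGHTS * bits / 8 + SCALE_BYTES
    if has_minimum:
        block_bytes += SCALE_BYTES
    return block_bytes * 8 / BLOCK_WEIGHTS


def estimate_file_size_gb(num_parameters, bpw):
    """Approximate file size in GB for a given bits-per-weight figure."""
    return num_parameters * bpw / 8 / 1e9


for name, bpw in [("Q8_0", bits_per_weight(8)), ("Q4_0", bits_per_weight(4))]:
    print(f"{name}: {bpw} bits/weight, ~{estimate_file_size_gb(3e9, bpw):.2f} GB for 3B parameters")
```

The per-block overhead explains why an "8-bit" file is slightly larger than one byte per parameter.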

### 4.3 Choosing the Right Quantization
- **Desktop/laptop with 16GB+ RAM**: Q6_K, Q5_K, or Q8_0
- **Desktop/laptop with 8GB RAM**: Q4_K or Q5_1
- **Low-end devices**: Q3_K or Q4_0
- **For most general purposes**: Q4_K offers a good balance

## 5. Implementation: Converting Models to GGUF

### 5.1 Required Installations

In [1]:
# Install required packages
!pip install torch transformers huggingface_hub sentencepiece protobuf

# Clone the llama.cpp repository (for conversion tools)
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && mkdir build && cd build && cmake .. && cmake --build . --config Release

In [4]:
from huggingface_hub import notebook_login

notebook_login()

### 5.2 Download Model from Hugging Face

In [5]:
from huggingface_hub import snapshot_download
import os

# Define model name (replace with your desired model)
model_name = "meta-llama/Llama-3.2-3B-Instruct"

# Create directory for the model
os.makedirs("models", exist_ok=True)

# Download the model files.
# This repository ships safetensors weights, so skip any duplicate
# pickle-based *.bin checkpoints (drop ignore_patterns for .bin-only repos).
model_path = snapshot_download(
    repo_id=model_name,
    local_dir=f"models/{model_name.split('/')[-1]}",
    ignore_patterns=["*.bin"],
)

print(f"Model downloaded to: {model_path}")

Model downloaded to: /content/models/Llama-3.2-3B-Instruct


### 5.3 Convert Model to GGUF Format

We'll use the conversion script in llama.cpp to convert the model to GGUF format.

In [None]:
import os

# Get the model name without path
model_short_name = model_name.split('/')[-1]
model_folder = f"models/{model_short_name}"
output_path = f"models/{model_short_name}_GGUF"

# Create output directory
os.makedirs(output_path, exist_ok=True)

# Run the conversion script
!cd llama.cpp && python convert_hf_to_gguf.py /content/{model_folder} --outfile /content/{output_path}/{model_short_name}-f16.gguf --outtype f16

INFO:hf-to-gguf:Loading model: Llama-3.2-3B-Instruct
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00002.safetensors'
INFO:hf-to-gguf:token_embd.weight,           torch.bfloat16 --> F16, shape = {3072, 128256}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.bfloat16 --> F32, shape = {3072}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.bfloat16 --> F16, shape = {8192, 3072}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.bfloat16 --> F16, shape = {3072, 8192}
INFO:hf-to-gguf:blk.0.ffn_up.weight,         torch.bfloat16 --> F16, shape = {3072, 8192}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {3072}
INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.bfloat16 --> F16, shap
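
Outside a notebook, the same conversion can be launched with `subprocess` instead of a `!` shell line. The following is a sketch under the assumption that llama.cpp is cloned into a local `llama.cpp` folder; the helper name `build_convert_command` is hypothetical, and only the `--outfile` and `--outtype` flags used in the cell above appear in it.

```python
import sys
from pathlib import Path


def build_convert_command(model_dir, output_dir, outtype="f16", llama_cpp_dir="llama.cpp"):
    """Build the argument list for llama.cpp's convert_hf_to_gguf.py."""
    model_dir = Path(model_dir)
    outfile = Path(output_dir) / f"{model_dir.name}-{outtype}.gguf"
    return [
        sys.executable,
        str(Path(llama_cpp_dir) / "convert_hf_to_gguf.py"),
        str(model_dir),
        "--outfile", str(outfile),
        "--outtype", outtype,
    ]


# To run it: subprocess.run(build_convert_command(model_folder, output_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.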

### 5.4 Quantize the GGUF Model

Now that we have an unquantized 16-bit (F16) GGUF model, we can quantize it to reduce its size.

In [11]:
# Define quantization types to create
quant_types = ['q4_k', 'q5_k', 'q8_0'] # Select the quantization levels you want

for quant in quant_types:
    output_file = f"/content/{output_path}/{model_short_name}-{quant}.gguf"
    input_file = f"/content/{output_path}/{model_short_name}-f16.gguf"

    print(f"Quantizing to {quant}...")
    !cd llama.cpp && ./build/bin/llama-quantize {input_file} {output_file} {quant}

    # Check file size
    file_size_gb = os.path.getsize(output_file) / (1024 * 1024 * 1024)
    print(f"Created {output_file} ({file_size_gb:.2f} GB)")

Quantizing to q4_k...
main: build = 4820 (1a24c462)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing '/content/models/Llama-3.2-3B-Instruct_GGUF/Llama-3.2-3B-Instruct-f16.gguf' to '/content/models/Llama-3.2-3B-Instruct_GGUF/Llama-3.2-3B-Instruct-q4_k.gguf' as Q4_K
llama_model_loader: loaded meta data with 31 key-value pairs and 255 tensors from /content/models/Llama-3.2-3B-Instruct_GGUF/Llama-3.2-3B-Instruct-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instr
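
After quantizing to several types, it helps to compare the resulting files side by side. This is a small sketch that lists every `.gguf` file in a folder with its size in GiB, smallest first; the helper name `gguf_sizes_gib` is hypothetical.

```python
from pathlib import Path


def gguf_sizes_gib(directory):
    """Map each .gguf file name in a directory to its size in GiB, smallest first."""
    sizes = {
        path.name: path.stat().st_size / 1024**3
        for path in Path(directory).glob("*.gguf")
    }
    return dict(sorted(sizes.items(), key=lambda item: item[1]))


# Example: for name, size in gguf_sizes_gib(output_path).items(): print(f"{name}: {size:.2f} GiB")
```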

## 6. Running Models with GGUF Format

### 6.1 Using llama.cpp for Inference

In [24]:
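# Select a quantized model to run
quantized_model = f"/content/{output_path}/{model_short_name}-q8_0.gguf"

# Run inference with llama-cli. llama-simple only parses -m, -n and -ngl and
# treats everything else as prompt text, so it cannot take the sampling flags below.
# -n caps the number of generated tokens; -no-cnv (available in recent llama.cpp
# builds) disables interactive chat mode so the cell exits after generating.
!cd llama.cpp && ./build/bin/llama-cli -m {quantized_model} \
    --ctx-size 8192 \
    --threads 4 \
    --temp 0.7 \
    --repeat-penalty 1.1 \
    -n 256 -no-cnv \
    -p "What is gRPC? Give an example of how to use it."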

llama_model_loader: loaded meta data with 31 key-value pairs and 255 tensors from /content/models/Llama-3.2-3B-Instruct_GGUF/Llama-3.2-3B-Instruct-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                            general.license str              = llama3.2
llama_mod

### 6.2 Batch Processing with llama.cpp

In [25]:
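# Create a file with one prompt per line
prompts = [
    "What is artificial intelligence?",
    "Explain quantum computing in simple terms.",
    "Write a haiku about programming.",
]
with open("prompts.txt", "w") as f:
    f.write("\n".join(prompts) + "\n")

# Run each prompt separately. Passing the file with -f would feed all lines to the
# model as one combined prompt, and --batch-size only sets how many prompt tokens
# are processed per step; it does not split the file into separate requests.
for prompt in prompts:
    !cd llama.cpp && ./build/bin/llama-cli -m {quantized_model} --ctx-size 2048 --threads 4 --temp 0.7 --repeat-penalty 1.1 -n 256 -no-cnv -p "{prompt}"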

llama_model_loader: loaded meta data with 31 key-value pairs and 255 tensors from /content/models/Llama-3.2-3B-Instruct_GGUF/Llama-3.2-3B-Instruct-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                            general.license str              = llama3.2
llama_mod
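
When running many prompts from a script rather than a notebook cell, one approach is to read the prompt file and build one command per prompt. The sketch below is an illustration under assumptions: the binary path defaults to `./build/bin/llama-cli` (adjust it to the binary you built), the helper names are hypothetical, and only flags already used in this notebook are included.

```python
from pathlib import Path


def read_prompts(path):
    """Return the non-empty lines of a prompt file, one prompt per line."""
    return [line.strip() for line in Path(path).read_text().splitlines() if line.strip()]


def build_inference_commands(model_path, prompts, binary="./build/bin/llama-cli"):
    """Build one argument list per prompt, suitable for subprocess.run."""
    return [
        [binary, "-m", str(model_path),
         "--ctx-size", "2048", "--threads", "4", "--temp", "0.7",
         "-p", prompt]
        for prompt in prompts
    ]


# To run them: for cmd in commands: subprocess.run(cmd, cwd="llama.cpp", check=True)
```

Running one process per prompt reloads the model each time; for larger workloads a long-running server process is usually a better fit.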

### 6.3 Using Text Generation WebUI

If you prefer a graphical interface, you can use Text Generation WebUI with your GGUF models.

In [26]:
# Install Text Generation WebUI
!git clone https://github.com/oobabooga/text-generation-webui
!cd text-generation-webui && pip install -r requirements.txt

In [29]:
# Run Text Generation WebUI with your model
# --model takes a file name relative to the models directory, so point --model-dir
# at the folder that holds the GGUF file. The GPTQ-only flags --wbits and --groupsize
# do not apply to GGUF models and are omitted.
!cd text-generation-webui && python server.py --model-dir /content/models/Llama-3.2-3B-Instruct_GGUF --model Llama-3.2-3B-Instruct-q8_0.gguf

In [28]:
quantized_model

'/content/models/Llama-3.2-3B-Instruct_GGUF/Llama-3.2-3B-Instruct-q8_0.gguf'

### 6.4 Using Other GGUF-Compatible Tools

Your GGUF models are now compatible with various tools:

1. **LM Studio**: A desktop application for running local models
2. **GPT4All**: Cross-platform GUI for running LLMs
3. **Ollama**: Command-line tool and API for running models
4. **KoboldCPP**: UI focused on creative writing and storytelling

Each of these applications can load a local GGUF file; Ollama imports it through a Modelfile whose `FROM` line points at the file.
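
As one example of handing the file to another tool, Ollama reads a Modelfile whose `FROM` line points at a local GGUF file. The sketch below writes such a file; the helper name `write_modelfile` and the model name `my-model` in the usage note are hypothetical.

```python
from pathlib import Path


def write_modelfile(gguf_path, modelfile_path="Modelfile"):
    """Write a minimal Ollama Modelfile that points at a local GGUF file."""
    content = f"FROM {Path(gguf_path).as_posix()}\n"
    Path(modelfile_path).write_text(content)
    return content
```

With the Modelfile in place, `ollama create my-model -f Modelfile` registers the model and `ollama run my-model` starts it.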

## 7. Advanced Options and Troubleshooting

### 7.1 Custom Conversion Parameters

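The `convert_hf_to_gguf.py` script accepts several parameters to customize the conversion:

- `--outfile`: Path of the GGUF file to write
- `--outtype`: Output precision, e.g. `f32`, `f16`, `bf16`, `q8_0`, or `auto`
- `--vocab-only`: Export only the tokenizer vocabulary, without the weights
- `--use-temp-file`: Buffer tensors in a temporary file to lower peak memory use

Run `python convert_hf_to_gguf.py --help` for the full list supported by your llama.cpp checkout.
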
### 7.2 Common Issues and Solutions

1. **Out of Memory During Conversion**:
   - Try using a machine with more RAM
   - Use swap space or virtual memory
   - Pass `--use-temp-file` to `convert_hf_to_gguf.py` so tensors are buffered on disk during conversion

2. **Model Architecture Not Supported**:
   - Check llama.cpp documentation for supported architectures
   - You may need custom conversion scripts for certain architectures

3. **Tokenizer Issues**:
   - Ensure tokenizer files are properly included
   - Check if the model uses a special tokenizer format

4. **Slow Inference**:
   - Try different quantization levels
   - Adjust number of threads to match your CPU
   - Reduce context size if not needed
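
Several of the issues above (unsupported layout, missing tokenizer files) can be caught before starting a long conversion. The following is a minimal pre-flight sketch, assuming the file names described in Section 1; the helper name `preflight_problems` is hypothetical, and it does not check whether llama.cpp supports the architecture.

```python
from pathlib import Path

TOKENIZER_CANDIDATES = ["tokenizer.json", "tokenizer.model"]
WEIGHT_PATTERNS = ["*.safetensors", "*.bin"]


def preflight_problems(model_dir):
    """Return a list of human-readable problems found in a model folder."""
    root = Path(model_dir)
    problems = []
    if not (root / "config.json").exists():
        problems.append("missing config.json")
    if not any((root / name).exists() for name in TOKENIZER_CANDIDATES):
        problems.append("no tokenizer file (tokenizer.json or tokenizer.model)")
    if not any(any(root.glob(pattern)) for pattern in WEIGHT_PATTERNS):
        problems.append("no weight files (*.safetensors or *.bin)")
    return problems
```

An empty list means the basic files are present; any entries point to what should be re-downloaded before conversion.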

## 8. Conclusion

You've now learned how to:
- Understand Hugging Face model structures
- Convert models to GGUF format
- Quantize models to different precision levels
- Run inference with converted models

GGUF format allows you to run powerful language models locally on modest consumer hardware, making AI more accessible and private.