# Converting to GGUF
GGUF (GPT-Generated Unified Format) is a file format designed for storing large models, optimized for efficiency and compatibility with consumer hardware.

In practice, it is a very convenient format when we want to run our model on consumer hardware, CPU-only servers, or low-cost devices (e.g. Raspberry Pi, mini PCs).
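
As a small illustration of what a GGUF file looks like on disk, the sketch below reads its fixed-size header. It assumes the layout used by GGUF versions 2 and 3 (a 4-byte `GGUF` magic, a little-endian uint32 version, then two uint64 counts for tensors and metadata entries). `read_gguf_header` is a hypothetical helper, not part of llama.cpp, and the demo file is synthetic rather than a real model.

```python
import os
import struct
import tempfile


def read_gguf_header(path):
    """Read the fixed-size GGUF header (hypothetical helper, assumes GGUF v2/v3 layout)."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"{path} is not a GGUF file (magic={magic!r})")
        # Little-endian: uint32 version, uint64 tensor count, uint64 metadata entry count
        version, tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
    return {"version": version, "tensors": tensor_count, "metadata_entries": kv_count}


# Demo on a synthetic header (not a real model): version 3, 2 tensors, 5 metadata entries
demo_path = os.path.join(tempfile.mkdtemp(), "demo.gguf")
with open(demo_path, "wb") as f:
    f.write(b"GGUF" + struct.pack("<IQQ", 3, 2, 5))

header = read_gguf_header(demo_path)
print(header)
```

The same function can be pointed at any `.gguf` file produced later in this notebook as a quick sanity check that the conversion wrote a valid file.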

The simplest way to read and write this format is to use the popular llama.cpp library.

Besides plain conversion, this library also supports quantization, a technique that stores the weights at lower precision: the model becomes smaller and usually faster to run, at the cost of some accuracy.
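
To make the trade-off concrete, here is a minimal sketch of symmetric 8-bit quantization applied to one block of 32 weights. It is a simplified illustration written in plain Python, not llama.cpp's actual implementation; the function names and the random weights are made up for the example.

```python
import random


def quantize_block(values, bits=8):
    """Symmetric quantization of one block: small integers plus a single float scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    quants = [round(v / scale) for v in values]
    return quants, scale


def dequantize_block(quants, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quants]


random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(32)]  # illustrative weights

quants, scale = quantize_block(weights)
restored = dequantize_block(quants, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(f"scale={scale:.6f}, max abs error={max_error:.6f}")
```

Each weight now needs one byte instead of four (plus one shared scale per block), and the rounding error per weight is bounded by half the scale. Fewer bits shrink the file further but increase that error.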

In this exercise we convert the model trained in the function calling exercise to GGUF format.

First, let's install the dependencies.

In [None]:
!pip install huggingface-hub numpy torch transformers


We import the libraries and log in to Hugging Face with our API key.

In [None]:
import os
import subprocess
from huggingface_hub import snapshot_download, login

login()


We define a function that clones the llama.cpp repository. We need it because the repository contains a script that converts from the Hugging Face format to GGUF.

In [None]:
def clone_llama_cpp():
    """Clone llama.cpp repository if not already present."""
    llama_cpp_dir = os.path.abspath(os.path.join('.', "llama.cpp"))
    if not os.path.exists(llama_cpp_dir):
        print("Cloning llama.cpp repository...")
        subprocess.run(
            ["git", "clone", "https://github.com/ggerganov/llama.cpp.git", llama_cpp_dir],
            check=True
        )
    return llama_cpp_dir

Now we define a function that runs that script by spawning a Python subprocess.

In [None]:
def convert_to_gguf(model_path: str, output_path: str, quantization: str = "q8_0"):
    """
    Convert a Hugging Face model to GGUF format using llama.cpp's conversion script.

    Args:
        model_path: Path to the merged LoRA model
        output_path: Directory where the GGUF model will be saved
        quantization: Output type passed to --outtype (e.g., "f16", "q8_0", "auto").
            K-quants such as "q4_k_m" are not produced by this script; they
            require llama.cpp's separate quantize tool.
    """
    # Convert paths to absolute
    model_path = os.path.abspath(model_path)
    output_path = os.path.abspath(output_path)
    os.makedirs(output_path, exist_ok=True)

    print(f"Converting model from {model_path} to GGUF format...")
    print("Model directory contents:")
    for item in os.listdir(model_path):
        print(f"- {item}")

    # Get llama.cpp repository
    llama_cpp_dir = clone_llama_cpp()

    # Convert to GGUF using llama.cpp's convert_hf_to_gguf.py
    print("Converting to GGUF format...")
    gguf_path = os.path.join(output_path, f"model-{quantization}.gguf")

    # Run the conversion script
    convert_cmd = [
        "python3",
        os.path.join(llama_cpp_dir, "convert_hf_to_gguf.py"),
        "--outfile", gguf_path,
        "--outtype", quantization,
        "--verbose",  # Add verbose flag to see more details
        model_path  # Use absolute path to model directory
    ]

    try:
        # Change to llama.cpp directory for conversion
        original_dir = os.getcwd()
        os.chdir(llama_cpp_dir)

        # Install required dependencies
        print("Installing llama.cpp dependencies...")
        subprocess.run(["pip", "install", "-r", "requirements.txt"], check=True)

        # Run conversion with captured output
        print("Running conversion script...")
        print(f"Command: {' '.join(convert_cmd)}")
        result = subprocess.run(
            convert_cmd,
            capture_output=True,
            text=True,
            check=False  # Don't raise exception immediately
        )

        # Print the output regardless of success/failure
        if result.stdout:
            print("Conversion script output:")
            print(result.stdout)
        if result.stderr:
            print("Conversion script errors:")
            print(result.stderr)

        # Now check if the command was successful
        if result.returncode != 0:
            raise subprocess.CalledProcessError(
                result.returncode,
                convert_cmd,
                output=result.stdout,
                stderr=result.stderr
            )

        # Change back to original directory
        os.chdir(original_dir)

        print(f"Successfully converted model to GGUF format: {gguf_path}")
        return gguf_path
    except subprocess.CalledProcessError as e:
        print(f"Error during conversion: {e}")
        raise
    except Exception as e:
        print(f"Unexpected error: {e}")
        raise
    finally:
        # Ensure we return to the original directory
        os.chdir(original_dir)
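
Because an unsupported output type only fails once the subprocess is already running, it can help to validate the argument first. The sketch below is one possible way to do that: `build_convert_command` is a hypothetical helper, and `SUPPORTED_OUTTYPES` simply mirrors the output types used later in this notebook rather than being read from the script itself.

```python
import sys

# Output types used in this notebook for convert_hf_to_gguf.py's --outtype flag
SUPPORTED_OUTTYPES = ("f32", "f16", "q8_0", "tq1_0", "tq2_0", "auto")


def build_convert_command(script_path, model_path, gguf_path, outtype):
    """Build the argument list for the conversion script (hypothetical helper)."""
    if outtype not in SUPPORTED_OUTTYPES:
        raise ValueError(
            f"Unsupported output type {outtype!r}; expected one of {SUPPORTED_OUTTYPES}"
        )
    return [
        sys.executable,  # same interpreter as the notebook, so installed packages are visible
        script_path,
        "--outfile", gguf_path,
        "--outtype", outtype,
        model_path,
    ]


cmd = build_convert_command(
    "llama.cpp/convert_hf_to_gguf.py", "./model", "./model-q8_0.gguf", "q8_0"
)
print(cmd[1:])

# A K-quant name is rejected before any subprocess is started
try:
    build_convert_command("llama.cpp/convert_hf_to_gguf.py", "./model", "./out.gguf", "q4_k_m")
except ValueError as err:
    rejected = str(err)
    print(rejected)
```

Using `sys.executable` instead of a hard-coded `python3` is a design choice: it keeps the conversion in the same environment where the requirements were installed.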


We download the model we trained earlier from its Hugging Face repository.

This call returns the path of the local cache folder where the model is stored.

In [None]:
model_path = snapshot_download(repo_id="monadestudio/smol-function-calling")


/root/.cache/huggingface/hub/models--monadestudio--smol-function-calling/snapshots/76ae67811058acd154412ff756dcaa3d21e398b9
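
The conversion needs the full (merged) model weights and configuration, not only the LoRA adapter files, so a quick pre-flight check on the downloaded folder can save a failed run. The sketch below is an assumption-laden example: `check_model_dir` is a hypothetical helper, its list of required files is based on the contents of this particular repository, and the demo runs on a temporary folder instead of the real snapshot.

```python
import os
import tempfile

# Files this notebook's model repository provides and the conversion relies on (assumption)
REQUIRED_FILES = ("config.json", "model.safetensors", "tokenizer.json")


def check_model_dir(model_dir, required=REQUIRED_FILES):
    """Return the required files that are missing from model_dir (hypothetical helper)."""
    present = set(os.listdir(model_dir))
    return [name for name in required if name not in present]


# Demo on a temporary folder that only contains config.json
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "config.json"), "w").close()

missing = check_model_dir(demo_dir)
print(missing)
```

With the real snapshot, `check_model_dir(model_path)` would be expected to return an empty list.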


Let's check that it works correctly before the conversion:

In [None]:
# Import the dependencies
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Pick the device to run the model on, based on availability:
# CUDA (GPU), MPS (Apple Silicon) or CPU.
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float32,
    device_map=device,
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

def generate_response(model, tokenizer, prompt, system_prompt=None):
    # Build the chat messages for the prompt.
    messages = [{"role": "user", "content": prompt}]
    # If a system prompt is given, put it at the start of the message list.
    if system_prompt:
        messages.insert(0, {"role": "system", "content": system_prompt})
    # Apply the chat template; add_generation_prompt=True appends the assistant header.
    formatted_prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    # Tokenize the prompt.
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)
    # Pass it to the model and generate a response.
    outputs = model.generate(**inputs, max_new_tokens=512)
    # Decode the model's response.
    output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Return only the last part of the output, i.e. the assistant's reply.
    return output.split("assistant")[-1].strip()

print(generate_response(model, tokenizer, "What's the capital of France?"))

Loading adapter weights from /root/.cache/huggingface/hub/models--monadestudio--smol-function-calling/snapshots/76ae67811058acd154412ff756dcaa3d21e398b9 led to missing keys in the model: model.layers.0.self_attn.q_proj.lora_A.default.weight, model.layers.0.self_attn.q_proj.lora_B.default.weight, model.layers.0.self_attn.k_proj.lora_A.default.weight, model.layers.0.self_attn.k_proj.lora_B.default.weight, model.layers.0.self_attn.v_proj.lora_A.default.weight, model.layers.0.self_attn.v_proj.lora_B.default.weight, model.layers.0.self_attn.o_proj.lora_A.default.weight, model.layers.0.self_attn.o_proj.lora_B.default.weight, model.layers.1.self_attn.q_proj.lora_A.default.weight, model.layers.1.self_attn.q_proj.lora_B.default.weight, model.layers.1.self_attn.k_proj.lora_A.default.weight, model.layers.1.self_attn.k_proj.lora_B.default.weight, model.layers.1.self_attn.v_proj.lora_A.default.weight, model.layers.1.self_attn.v_proj.lora_B.default.weight, model.layers.1.self_attn.o_proj.lora_A.defa

The capital of France is Paris. It is a city that has been a beacon of culture, art, and innovation for centuries. Paris is known for its iconic landmarks like the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe. It is also famous for its historical sites, such as the Palace of Versailles and the Louvre Museum.

Paris is a city that has a rich history, and its capital is a place where people from all over the world come to experience its beauty and culture.


Now we run the model conversion once for each quantization method.

In [None]:
output_path = "./gguf-models"

# Available quantization methods (from convert_hf_to_gguf.py help)
quantization_methods = [
    "f32",    # Full precision (32-bit float)
    "f16",    # Half precision (16-bit float)
    "q8_0",   # 8-bit quantization
    "tq1_0",  # Ternary quantization
    "tq2_0",  # Ternary quantization (alternative)
    "auto"    # Automatic selection
]

# Convert with different quantization methods
for quant in quantization_methods:
    try:
        print(f"\nConverting with {quant} quantization...")
        gguf_path = convert_to_gguf(model_path, output_path, quant)
        print(f"Model saved to: {gguf_path}")
    except Exception as e:
        print(f"Failed to convert with {quant} quantization: {e}")
        continue



Converting with f32 quantization...
Converting model from /root/.cache/huggingface/hub/models--monadestudio--smol-function-calling/snapshots/76ae67811058acd154412ff756dcaa3d21e398b9 to GGUF format...
Model directory contents:
- adapter_config.json
- adapter_model.safetensors
- config.json
- special_tokens_map.json
- tokenizer.json
- .gitattributes
- tokenizer_config.json
- generation_config.json
- chat_template.jinja
- vocab.json
- model.safetensors
- merges.txt
- README.md
Cloning llama.cpp repository...
Converting to GGUF format...
Installing llama.cpp dependencies...
Running conversion script...
Command: python3 /content/llama.cpp/convert_hf_to_gguf.py --outfile /content/gguf-models/model-f32.gguf --outtype f32 --verbose /root/.cache/huggingface/hub/models--monadestudio--smol-function-calling/snapshots/76ae67811058acd154412ff756dcaa3d21e398b9
Conversion script errors:
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time on

Done! We now have one file for each quantization type.

How do we choose the best quantization? As mentioned, it is a trade-off between speed and accuracy, so the variants have to be tried out!
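
A first, cheap comparison is the size of each file on disk, since it drives memory use and load time. The sketch below lists the `.gguf` files in a folder from smallest to largest. `list_gguf_sizes` is a hypothetical helper, and the demo uses tiny placeholder files whose sizes are arbitrary, not real measurements.

```python
import os
import tempfile


def list_gguf_sizes(directory):
    """Return (file name, size in MiB) for every .gguf file, smallest first."""
    sizes = []
    for name in os.listdir(directory):
        if name.endswith(".gguf"):
            size_mib = os.path.getsize(os.path.join(directory, name)) / (1024 ** 2)
            sizes.append((name, size_mib))
    return sorted(sizes, key=lambda item: item[1])


# Demo with placeholder files (sizes are arbitrary, not real measurements)
demo_dir = tempfile.mkdtemp()
for name, n_bytes in [("model-f16.gguf", 2048), ("model-q8_0.gguf", 1024), ("notes.txt", 10)]:
    with open(os.path.join(demo_dir, name), "wb") as f:
        f.write(b"\0" * n_bytes)

report = list_gguf_sizes(demo_dir)
for name, size_mib in report:
    print(f"{name}: {size_mib:.4f} MiB")
```

In this notebook the call would be `list_gguf_sizes(output_path)`. Size is only half of the picture: answer quality still has to be checked by running the same prompts against each variant.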

Let's run a test:

In [None]:
!pip install llama-cpp-python


In [None]:
from llama_cpp import Llama
from textwrap import dedent

def get_chat_interaction(prompt):
    """Create a chat interaction with the given prompt."""
    return [
        {
            "role": "system",
            "content": dedent("""
You are a helpful assistant with access to the following functions. Use them if required -
{
    "name": "get_weather",
    "description": "Get the weather at a given location",
    "parameters": {
        "type": "object",
        "properties": {
            "latitude": {
                "type": "number",
                "description": "The latitude of the location"
            },
            "longitude": {
                "type": "number",
                "description": "The longitude of the location"
            }
        },
        "required": [
            "latitude",
            "longitude"
        ]
    }
}
{
    "name": "get_time",
    "description": "Get the current time at a given location",
    "parameters": {
        "type": "object",
        "properties": {
            "latitude": {
                "type": "number",
                "description": "The latitude of the location"
            },
            "longitude": {
                "type": "number",
                "description": "The longitude of the location"
            }
        },
        "required": [
            "latitude",
            "longitude"
        ]
    }
}
""")
        },
        {
            "role": "user",
            "content": prompt
        }
    ]

def format_prompt(chat_interaction):
    """Format the chat interaction into a prompt string."""
    # AutoTokenizer comes from the transformers import in an earlier cell
    tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
    # add_generation_prompt=True appends the assistant header so the model starts its reply
    formatted_prompt = tokenizer.apply_chat_template(
        chat_interaction, tokenize=False, add_generation_prompt=True
    )

    return formatted_prompt

example = "What time is it in Tokyo (Lat: 35.6895, Long: 139.6917)?"
chat = get_chat_interaction(example)
prompt = format_prompt(chat)

# Load the GGUF model (we use the auto variant)
model = Llama(
    model_path="./gguf-models/model-auto.gguf",
    n_ctx=512,  # Match the training context length
    n_batch=512,
    n_threads=4,  # Adjust based on your CPU
    n_gpu_layers=-1  # Set to -1 for all layers on GPU, 0 for CPU only
)

max_tokens = 128

response = model(
    prompt,
    max_tokens=max_tokens,
    temperature=0.7,
    top_p=0.95,
    stop=["</s>", "<|user|>", "<|system|>", "<|im_end|>"],  # Stop at these tokens
    echo=False  # Don't include the prompt in the output
)

# Extract the generated text
generated_text = response["choices"][0]["text"]

# Clean up the response
generated_text = generated_text.strip()

print(generated_text)

Now we can download our converted model in its various quantized formats.

In [None]:
import os
from google.colab import files

os.system(f"zip -r gguf_models.zip {output_path}")
files.download("gguf_models.zip")
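
Outside Colab, where `google.colab` and the `zip` command may not be available, the archive can be built with the standard library instead. This is a sketch under that assumption; `zip_directory` is a hypothetical helper and the demo folder holds a placeholder file rather than a converted model.

```python
import os
import shutil
import tempfile
import zipfile


def zip_directory(source_dir, archive_base):
    """Zip the contents of source_dir into archive_base + '.zip' and return the archive path."""
    return shutil.make_archive(archive_base, "zip", root_dir=source_dir)


# Demo folder with a placeholder file (not a real model)
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "model-f16.gguf"), "wb") as f:
    f.write(b"GGUF")

archive_path = zip_directory(demo_dir, os.path.join(tempfile.mkdtemp(), "gguf_models"))
with zipfile.ZipFile(archive_path) as zf:
    names = zf.namelist()

print(archive_path)
print(names)
```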