## Mounting Your Drive

In this first step, we mount your Google Drive and set the working directory. Files written there persist across Colab sessions, and every later cell reads from and writes to the same folder.

In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Create a directory for the LLM models if it doesn't exist
os.makedirs('/content/drive/My Drive/llm', exist_ok=True)

# Change working directory to the new folder in Google Drive
os.chdir('/content/drive/My Drive/llm')

print('Current working directory:', os.getcwd())
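
The cell above only works inside Colab, because `google.colab` is not importable elsewhere. The following is a minimal sketch of a fallback, not part of the original notebook: the helper name `setup_workdir` and the choice of the system temporary directory are assumptions made for illustration.

```python
import os
import tempfile

def setup_workdir(subdir="llm"):
    """Return a Drive folder in Colab, or a local temp folder elsewhere (hypothetical helper)."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        # Assumption: outside Colab, a folder under the system temp dir is good enough
        workdir = os.path.join(tempfile.gettempdir(), subdir)
    else:
        drive.mount('/content/drive')
        workdir = os.path.join('/content/drive/My Drive', subdir)
    os.makedirs(workdir, exist_ok=True)
    return workdir

workdir = setup_workdir()
print('Working directory:', workdir)
```

Calling `os.chdir(workdir)` afterwards would reproduce the behavior of the original cell.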

## Introduction

This notebook converts a Hugging Face model into a simplified GGUF-style file with q8 (8-bit) quantization. The process involves downloading the model, loading its weights and configuration, and quantizing each tensor to 8-bit integers. The writer used here is a minimal illustration, not an implementation of the full GGUF specification, so the output file is not expected to load in GGUF consumers such as llama.cpp.

## Installing Dependencies

We install the necessary Python libraries: NumPy for numerical operations, Hugging Face Hub for downloading the model, and safetensors for reading model files.

In [None]:
# Install necessary dependencies
!pip install numpy huggingface_hub safetensors

## Downloading the Model

Using the Hugging Face Hub, we download the model specified by its repository ID. The files are stored in a subfolder of the current working directory. Adjust the repository ID (`model_repo`) as needed.

In [None]:
import os
from huggingface_hub import snapshot_download

# Set the HF repository ID for your model
model_repo = "tomg-group-umd/huginn-0125"
cache_dir = os.getcwd()

# Target folder named after the repo id (replace '/' with '-')
expected_model_dir = os.path.join(cache_dir, model_repo.replace('/', '-'))

if os.path.exists(expected_model_dir):
    print(f"Model already downloaded at: {expected_model_dir}")
    model_path = expected_model_dir
else:
    print(f"Downloading model {model_repo}...")
    # local_dir puts the files directly in expected_model_dir, so the check above
    # finds them on the next run (cache_dir would use a models--<org>--<name> layout instead)
    model_path = snapshot_download(repo_id=model_repo, local_dir=expected_model_dir)
    print(f"Model downloaded to: {model_path}")

print('Final model path:', model_path)
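
Before converting, it can help to confirm that the download actually contains weight files and a configuration file. The sketch below is one way to do that with the standard library only. `summarize_model_dir` is a hypothetical helper, not something provided by `huggingface_hub`.

```python
import os

def summarize_model_dir(model_dir):
    """List weight files and the config file found under model_dir (hypothetical helper)."""
    weight_files = []
    for root, _dirs, files in os.walk(model_dir):
        for name in files:
            if name.endswith(".safetensors"):
                weight_files.append(os.path.join(root, name))
    # Same lookup order the conversion code uses: config.json first, then params.json
    config = next(
        (n for n in ("config.json", "params.json")
         if os.path.exists(os.path.join(model_dir, n))),
        None,
    )
    total_bytes = sum(os.path.getsize(p) for p in weight_files)
    return {"weights": sorted(weight_files), "config": config, "total_bytes": total_bytes}
```

Running `summarize_model_dir(model_path)` after the download cell should report at least one weight file and a non-`None` config if the snapshot is complete.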

## Loading and Converting the Model

We load the model's tensors from the `.safetensors` files and its hyperparameters from a configuration file (`config.json` or `params.json`). The conversion then applies q8 quantization to each tensor and writes the result in a simplified GGUF-style layout. Each tensor is scaled so that its largest absolute value maps to 127, rounded to 8-bit integers, and stored together with its single scale factor, so an original value is recovered approximately as `int8_value * scale`.
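
To make the scale-factor idea concrete, here is a small worked example of the same per-tensor scheme on toy data. The values in `weights` are made up for illustration. The round trip shows that the error per element stays within about half a quantization step.

```python
import numpy as np

# Toy weights standing in for a real tensor (illustrative values)
weights = np.array([0.5, -1.27, 0.003, 1.0], dtype=np.float32)

# Symmetric per-tensor scale: the largest magnitude maps to 127
scale = float(np.max(np.abs(weights))) / 127.0
quantized = np.round(weights / scale).astype(np.int8)

# Dequantize to see how much precision was lost
restored = quantized.astype(np.float32) * scale
max_error = float(np.max(np.abs(weights - restored)))

print("int8 values:", quantized)
print("scale:", scale)
print("max abs error:", max_error)
```

Small values such as `0.003` collapse to zero here, which is the main cost of using a single scale for a whole tensor.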

In [None]:
import os
import json
import numpy as np
from safetensors import safe_open

########################################
# HF Model loading and conversion code with q8 quantization
########################################

def load_hf_model(model_dir):
    """
    Recursively scan the model directory for .safetensors files and load all tensors.
    Returns a dictionary mapping tensor names to NumPy arrays.
    """
    model = {}
    for root, dirs, files in os.walk(model_dir):
        for file in files:
            if file.endswith(".safetensors"):
                file_path = os.path.join(root, file)
                print(f"Loading tensors from {file_path}")
                # NumPy has no native bfloat16 dtype, so checkpoints stored in bf16 may fail to load here
                with safe_open(file_path, framework="np") as f:
                    for key in f.keys():
                        if key in model:
                            print(f"Warning: key {key} already exists. Overwriting.")
                        model[key] = f.get_tensor(key)
    return model

def load_hf_hparams(model_dir):
    """
    Load hyperparameters from a config file in the model directory.
    Tries config.json first, then params.json.
    """
    for fname in ["config.json", "params.json"]:
        config_path = os.path.join(model_dir, fname)
        if os.path.exists(config_path):
            print(f"Loading hyperparameters from {config_path}")
            with open(config_path, "r") as f:
                return json.load(f)
    raise ValueError("No config.json or params.json found in the model directory.")

########################################
# Minimal GGUF writer (simplified example)
########################################

class GGUFWriter:
    def __init__(self, outfile, hparams, outtype):
        self.outfile = outfile
        self.hparams = hparams  # hyperparameters dictionary
        self.outtype = outtype  # e.g., "q8", "f16", or "f32"
        self.tensors = []

    def add_tensor(self, name, tensor, scale=None):
        self.tensors.append((name, tensor, scale))
        if scale is not None:
            print(f"Added tensor: {name}, shape: {tensor.shape}, dtype: {tensor.dtype}, scale: {scale}")
        else:
            print(f"Added tensor: {name}, shape: {tensor.shape}, dtype: {tensor.dtype}")

    def finalize(self):
        with open(self.outfile, "wb") as f:
            # Write a simple header (placeholder for the GGUF header)
            f.write(b"GGUF\n")

            # Write hyperparameters as text
            for key, value in self.hparams.items():
                line = f"{key}: {value}\n".encode('utf-8')
                f.write(line)

            f.write(b"--TENSORS--\n")

            # Write each tensor's metadata and raw data
            for name, tensor, scale in self.tensors:
                meta = f"{name} | shape: {tensor.shape} | dtype: {tensor.dtype}".encode('utf-8')
                if scale is not None:
                    meta += f" | scale: {scale}".encode('utf-8')
                meta += b"\n"
                f.write(meta)
                f.write(tensor.tobytes())
                f.write(b"\n")

        print(f"Finalized GGUF file at {self.outfile}")

########################################
# Quantization function for q8 output
########################################

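def quantize_to_q8(tensor):
    """
    Symmetric per-tensor quantization to int8.
    Returns (quantized, scale) such that tensor is approximately quantized * scale.
    """
    # Ensure tensor is float32
    tensor_f32 = tensor.astype(np.float32)
    # Compute scale factor: max absolute value divided by 127 (empty tensors have no max)
    max_abs = float(np.max(np.abs(tensor_f32))) if tensor_f32.size else 0.0
    scale = max_abs / 127.0
    if scale == 0:
        scale = 1.0
    # Quantize: divide by scale, round, clip to the int8 range, then cast to int8
    quantized = np.clip(np.round(tensor_f32 / scale), -127, 127).astype(np.int8)
    return quantized, scale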

########################################
# Conversion function that handles different output types
########################################

def convert_tensor(tensor, outtype):
    if outtype == "q8":
        quantized, scale = quantize_to_q8(tensor)
        return quantized, scale
    elif outtype == "f16":
        return tensor.astype(np.float16), None
    elif outtype == "f32":
        return tensor.astype(np.float32), None
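    else:
        # Fail loudly instead of silently writing unconverted data
        raise ValueError(f"Unsupported outtype: {outtype!r} (expected 'q8', 'f16', or 'f32')")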

def convert_model_to_gguf(model, hparams, outfile, outtype="q8"):
    writer = GGUFWriter(outfile, hparams, outtype)
    for tensor_name, tensor_data in model.items():
        converted, scale = convert_tensor(tensor_data, outtype)
        writer.add_tensor(tensor_name, converted, scale=scale)
    writer.finalize()
    print(f"GGUF conversion complete: {outfile}")

########################################
# End of conversion code
########################################

# Load your HF model from the downloaded directory
print("Loading HF model from:", model_path)
real_model = load_hf_model(model_path)
print(f"Loaded {len(real_model)} tensors from the model.")

# Load hyperparameters from config.json (or params.json)
real_hparams = load_hf_hparams(model_path)
print("Hyperparameters loaded.")

# Define the output GGUF file
output_filename = "output_model.gguf"
output_type = "q8"

# Convert the real model to GGUF with q8 quantization
convert_model_to_gguf(real_model, real_hparams, output_filename, outtype=output_type)
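
Because the file layout is specific to this notebook, reading it back needs a matching parser. The sketch below assumes exactly the layout produced by `GGUFWriter.finalize` above: a `GGUF` line, `key: value` lines, a `--TENSORS--` marker, then one metadata line plus raw bytes per tensor. `read_simplified_gguf` is a hypothetical helper and does not read files that follow the real GGUF specification.

```python
import ast
import numpy as np

def read_simplified_gguf(path):
    """Parse the text-plus-raw-bytes layout written by the notebook's GGUFWriter (hypothetical helper)."""
    hparams, tensors = {}, {}
    with open(path, "rb") as f:
        if f.readline() != b"GGUF\n":
            raise ValueError("Unexpected header")
        # Hyperparameters: 'key: value' lines until the tensor marker
        for line in iter(f.readline, b"--TENSORS--\n"):
            if not line:
                raise ValueError("Missing --TENSORS-- marker")
            key, _, value = line.decode("utf-8").rstrip("\n").partition(": ")
            hparams[key] = value  # kept as text, since the writer stored str(value)
        # Tensors: one metadata line, then the raw bytes, then a newline
        while True:
            meta = f.readline()
            if not meta:
                break
            fields = meta.decode("utf-8").rstrip("\n").split(" | ")
            name = fields[0]
            info = dict(field.split(": ", 1) for field in fields[1:])
            shape = ast.literal_eval(info["shape"])
            dtype = np.dtype(info["dtype"])
            count = int(np.prod(shape, dtype=np.int64))
            raw = f.read(count * dtype.itemsize)
            data = np.frombuffer(raw, dtype=dtype).reshape(shape)
            f.read(1)  # skip the newline written after the raw bytes
            scale = float(info["scale"]) if "scale" in info else None
            tensors[name] = (data, scale)
    return hparams, tensors
```

For a q8 file, `data.astype(np.float32) * scale` gives the approximate original tensor. The byte count is taken from the shape and dtype in the metadata line, so newline bytes inside the raw data do not confuse the parser.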

## Verifying the Output

Finally, we list the contents of our working directory to verify that the GGUF file has been created successfully.

In [None]:
# List the contents of the current directory to verify the output file
!ls -lh
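
A directory listing shows that a file exists but not that it was written by the converter. As a slightly stronger check, the sketch below confirms that the file starts with the `GGUF` marker written by `finalize` and returns its size. `check_output_file` is a hypothetical helper introduced for illustration.

```python
import os

def check_output_file(path):
    """Return the size in bytes after confirming the file starts with b'GGUF' (hypothetical helper)."""
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    with open(path, "rb") as f:
        magic = f.read(4)
    if magic != b"GGUF":
        raise ValueError(f"Unexpected leading bytes: {magic!r}")
    return os.path.getsize(path)
```

In this notebook the call would be `check_output_file("output_model.gguf")`. A size far smaller than the sum of the tensor sizes would suggest the conversion was interrupted.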