# Hugging Face model to GGUF conversion

This notebook describes the steps to download a Hugging Face model and convert it to GGUF format so that it can be imported into Ollama.

> This notebook assumes that the llama.cpp repo has been downloaded to another directory.
> Ollama should also be installed on the system.

## To download llama.cpp from GitHub

``` bash
cd <YOUR_GIT_DIRECTORY>
git clone https://github.com/ggml-org/llama.cpp
```

This assumes that the llama.cpp contents will be in the `<YOUR_GIT_DIRECTORY>/llama.cpp` directory.
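Before going further, it can help to confirm that the checkout contains the two files this notebook relies on. The following is a minimal sketch; `missing_llama_cpp_files` is a hypothetical helper name, and the file list is an assumption based on what the later cells use.

```python
from pathlib import Path

# Files this notebook uses from the llama.cpp checkout.
REQUIRED_FILES = ("convert_hf_to_gguf.py", "requirements.txt")


def missing_llama_cpp_files(llama_cpp_dir):
    """Return the required files that are absent from llama_cpp_dir."""
    root = Path(llama_cpp_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list means the directory looks usable; anything else names the files that still need to be fetched.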

We then have to initialize this notebook's Python environment and install the necessary Python packages from the llama.cpp `requirements.txt` file.

In [35]:
# llama_cpp_dir = "<YOUR_LLAMA_CPP_DIR>"
llama_cpp_dir = "/Users/Shared/_ws/llama.cpp"
! pip install -q -r "{llama_cpp_dir}/requirements.txt"

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
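The `huggingface/tokenizers` warning above is harmless for this workflow, but it is printed for every `!` shell command. As the message itself suggests, setting `TOKENIZERS_PARALLELISM` explicitly should silence it. A small sketch, assuming it runs in a cell before the first shell command:

```python
import os

# Setting the variable explicitly (to either value) is what the warning asks for;
# "false" keeps tokenizers from using parallelism in this process.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```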


## Downloading models from Hugging Face

For this sample, we're going to download the `intfloat/e5-large-v2` embedding model, which is not available in Ollama yet.

In [36]:
import os
from huggingface_hub import snapshot_download

model_id = "intfloat/e5-large-v2"

# For organization, I like to lay out the directory in the following manner
# root_dir
# ├── model_id
# │   ├── /hf
# │   ├── model_name.q.gguf
# │   └── Modelfile.model_name.q
root_dir = "/Users/Shared/_models"
model_dir = os.path.normpath(os.path.join(root_dir, model_id))
hf_model_dir = os.path.normpath(os.path.join(model_dir, "hf"))

print(f"Model directory: {model_dir}")
print(f"Hugging Face model directory: {hf_model_dir}")

snapshot_download(model_id, local_dir=hf_model_dir)


Model directory: /Users/Shared/_models/intfloat/e5-large-v2
Hugging Face model directory: /Users/Shared/_models/intfloat/e5-large-v2/hf


Fetching 20 files: 100%|██████████| 20/20 [00:00<00:00, 8339.41it/s]


'/Users/Shared/_models/intfloat/e5-large-v2/hf'
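To see what was actually downloaded, the snapshot directory can be listed with file sizes. This is a rough sketch: both helper names are hypothetical, and the check for `config.json` plus a `.safetensors` file is an assumption about the usual Hugging Face layout, not a requirement stated by llama.cpp.

```python
from pathlib import Path


def list_snapshot_files(snapshot_dir):
    """Return sorted (relative_path, size_in_bytes) pairs for files under snapshot_dir."""
    root = Path(snapshot_dir)
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )


def has_model_files(snapshot_dir):
    """Check for config.json and at least one .safetensors weight file (assumed layout)."""
    names = [name for name, _ in list_snapshot_files(snapshot_dir)]
    return "config.json" in names and any(n.endswith(".safetensors") for n in names)
```

Calling `has_model_files(hf_model_dir)` after the download gives a quick yes/no before starting the conversion.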

In [37]:
script = os.path.join(llama_cpp_dir, "convert_hf_to_gguf.py")
print(f"Script path: {script}")

quant = "f32" # full precision
model_name = os.path.basename(model_id)
gguf_file = f"{model_name}.{quant}.gguf"
gguf_model_dir = os.path.normpath(os.path.join(model_dir, gguf_file))

print(f"GGUF model directory: {gguf_model_dir}")

Script path: /Users/Shared/_ws/llama.cpp/convert_hf_to_gguf.py
GGUF model directory: /Users/Shared/_models/intfloat/e5-large-v2/e5-large-v2.f32.gguf
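The path handling in the cells above can be gathered into one function so the layout stays consistent when trying other models or quantizations. A minimal sketch, assuming POSIX-style paths as used in this notebook; `model_paths` is a hypothetical helper name.

```python
from pathlib import PurePosixPath


def model_paths(root_dir, model_id, quant):
    """Build the hf directory, GGUF file and Modelfile paths for one model."""
    model_dir = PurePosixPath(root_dir) / model_id
    model_name = PurePosixPath(model_id).name  # "intfloat/e5-large-v2" -> "e5-large-v2"
    return {
        "hf": str(model_dir / "hf"),
        "gguf": str(model_dir / f"{model_name}.{quant}.gguf"),
        "modelfile": str(model_dir / f"Modelfile.{model_name}.{quant}"),
    }


paths = model_paths("/Users/Shared/_models", "intfloat/e5-large-v2", "f32")
print(paths["gguf"])
# /Users/Shared/_models/intfloat/e5-large-v2/e5-large-v2.f32.gguf
```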


In [38]:
! python "{script}" "{hf_model_dir}" --outfile "{gguf_model_dir}" --outtype "{quant}"


INFO:hf-to-gguf:Loading model: hf
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:token_embd_norm.bias,            torch.float32 --> F32, shape = {1024}
INFO:hf-to-gguf:token_embd_norm.weight,          torch.float32 --> F32, shape = {1024}
INFO:hf-to-gguf:position_embd.weight,            torch.float32 --> F32, shape = {1024, 512}
INFO:hf-to-gguf:token_types.weight,              torch.float32 --> F32, shape = {1024, 2}
INFO:hf-to-gguf:token_embd.weight,               torch.float32 --> F32, shape = {1024, 30522}
INFO:hf-to-gguf:blk.0.attn_output_norm.bias,     torch.float32 --> F32, shape = {1024}
INFO:hf-to-gguf:blk.0.attn_output_norm.weight,   torch.float32 --> F32, shape = {1024}
INFO:hf-to-gguf:blk.0.attn_output.bias,          torch.float32 --> F32, shape = {1024}
INFO:hf-to-gguf:blk.0.attn_output.weight,        torch.float32 --> F32, shape = {1024, 102
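Once the conversion finishes, a cheap sanity check is to read the header of the output file. The sketch below assumes a little-endian GGUF file of version 2 or later, where the header is the 4-byte magic `GGUF`, a `uint32` version, a `uint64` tensor count and a `uint64` metadata key/value count; `read_gguf_header` is a hypothetical helper, not part of llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path):
    """Read version, tensor count and metadata KV count from a GGUF file header."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return {"version": version, "tensor_count": tensor_count, "kv_count": kv_count}
```

In this notebook it could be called as `read_gguf_header(gguf_model_dir)`; a `ValueError` suggests the conversion did not produce a valid file.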

## Creating a model file for Ollama

Referring back to our model directory layout, we create a `Modelfile` for Ollama.

```
root_dir
├── model_id
│   ├── /hf
│   ├── model_name.q.gguf
│   └── Modelfile.model_name.q
```

We'll name it using the format `Modelfile.<MODEL_NAME>.<QUANT>` and write the following contents, where `{gguf_model_dir}` is the path to the GGUF file:

``` text
FROM {gguf_model_dir}
```

For more information, refer to the Ollama documentation at https://github.com/ollama/ollama/blob/main/docs/modelfile.md

In [39]:
modelfile = f"Modelfile.{model_name}.{quant}"
modelfile_dir = os.path.normpath(os.path.join(model_dir, modelfile))

with open(modelfile_dir, "w") as f:
    f.write(f"FROM {gguf_model_dir}")
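A slightly more defensive variant of the cell above checks that the GGUF file exists before writing the Modelfile, so a failed conversion is caught here rather than during import. This is a sketch only; `write_modelfile` is a hypothetical helper and the trailing newline is a stylistic choice.

```python
from pathlib import Path


def write_modelfile(modelfile_path, gguf_path):
    """Write a one-line Modelfile pointing at gguf_path and return its contents."""
    gguf = Path(gguf_path)
    if not gguf.is_file():
        raise FileNotFoundError(f"GGUF file not found: {gguf}")
    content = f"FROM {gguf}\n"
    Path(modelfile_path).write_text(content)
    return content
```

In this notebook the call would be `write_modelfile(modelfile_dir, gguf_model_dir)`.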

## Importing the model into Ollama

Now that we have a Modelfile, we can import the model into Ollama using the following command:

``` bash
ollama create <MODEL_NAME> -f <MODELFILE_PATH>
```
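Outside a notebook, the same command can be run from Python with `subprocess`. A minimal sketch, assuming `ollama` is on the `PATH`; both function names are hypothetical. Note that `-f` takes the path to the Modelfile, which in turn points at the GGUF file.

```python
import shutil
import subprocess


def build_create_command(model_name, modelfile_path):
    """Return the argument list for `ollama create <name> -f <Modelfile>`."""
    return ["ollama", "create", model_name, "-f", str(modelfile_path)]


def create_model(model_name, modelfile_path):
    """Run `ollama create`, raising if the CLI is missing or the command fails."""
    if shutil.which("ollama") is None:
        raise RuntimeError("ollama executable not found on PATH")
    return subprocess.run(build_create_command(model_name, modelfile_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.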

In [48]:
# For reference I have the following models in Ollama

! ollama list

NAME                                   ID              SIZE      MODIFIED    
llama3.2:latest                        a80c4f17acd5    2.0 GB    6 weeks ago    
nomic-embed-text:latest                0a109f422b47    274 MB    6 weeks ago    
deepseek-r1:8b-llama-distill-q4_K_M    28f8fd6cdc67    4.9 GB    7 weeks ago    
llama3.1:latest                        46e0c10c039e    4.9 GB    7 weeks ago    
deepseek-r1:latest                     0a8c26691023    4.7 GB    7 weeks ago    



In [49]:
! ollama create "{model_name}" -f "{modelfile_dir}"

gathering model components
copying file sha256:da0fce4302df4c14caa6a938e9b513dedebe740938eafae2bc61432971afaa11 0%
copying file sha256:da0fce4302df4c14caa6a938e9b513dedebe740938eafae2bc61432971afaa11 4%
copying file sha256:da0fce4302df4c14caa6a938e9b513dedebe740938eafae2bc61432971afaa11 18%
copying file sha256:da0fce4302df4c14caa6a938e9b513dedebe740938eafae2bc61432971afaa11 32%
copying file sha256:da0fce4302df4c14caa6a938e9b513dedebe7

In [50]:
# Let's check if the model is in Ollama

! ollama list

NAME                                   ID              SIZE      MODIFIED       
e5-large-v2:latest                     cfb0e6095fc4    1.3 GB    11 seconds ago    
llama3.2:latest                        a80c4f17acd5    2.0 GB    6 weeks ago       
nomic-embed-text:latest                0a109f422b47    274 MB    6 weeks ago       
deepseek-r1:8b-llama-distill-q4_K_M    28f8fd6cdc67    4.9 GB    7 weeks ago       
llama3.1:latest                        46e0c10c039e    4.9 GB    7 weeks ago       
deepseek-r1:latest                     0a8c26691023    4.7 GB    7 weeks ago       
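Rather than reading the table by eye, the check can be automated by parsing the `ollama list` output. This sketch assumes the column layout shown above (a header row, then one model per line with the name in the first column); `parse_model_names` is a hypothetical helper.

```python
def parse_model_names(list_output):
    """Extract model names from the text printed by `ollama list`."""
    lines = [line for line in list_output.splitlines() if line.strip()]
    # Skip the header row (NAME ID SIZE MODIFIED); the name is the first column.
    return [line.split()[0] for line in lines[1:]]


sample = """\
NAME                       ID              SIZE      MODIFIED
e5-large-v2:latest         cfb0e6095fc4    1.3 GB    11 seconds ago
llama3.2:latest            a80c4f17acd5    2.0 GB    6 weeks ago
"""
print("e5-large-v2:latest" in parse_model_names(sample))
# True
```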



## Checking the model

Let's try the model. Since I downloaded an embedding model, we're going to use Ollama's embeddings endpoint. We expect it to return a vector of length 1024.

In [43]:
import requests

url = "http://localhost:11434/api/embeddings"
payload = {
    "model": model_name,
    "prompt": "The quick brown fox jumps over the lazy dog."
}
# `json=` serializes the payload and sets the Content-Type header
response = requests.post(url, json=payload)

json_response = response.json()

In [44]:
# 1024
len(json_response["embedding"])

1024
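The length check can be wrapped in a small validator that also surfaces error responses. A minimal sketch, assuming that a failed request carries an `error` field in the JSON body and a successful one carries `embedding`, as in the response used above; `extract_embedding` is a hypothetical helper.

```python
def extract_embedding(response_json, expected_dim):
    """Return the embedding from a parsed response, validating its length."""
    if "error" in response_json:
        raise RuntimeError(f"Ollama returned an error: {response_json['error']}")
    embedding = response_json.get("embedding")
    if not embedding:
        raise ValueError("response has no 'embedding' field")
    if len(embedding) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(embedding)}")
    return embedding
```

For this model the call would be `extract_embedding(json_response, 1024)`.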

## Compare the results of both models

Since we downloaded an embedding model, we can compare the embeddings produced by the Hugging Face libraries with those returned by Ollama.

In [45]:
# Copied from 
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
input_texts = ['query: The lazy dog was jumped over by the quick brown fox.']

tokenizer = AutoTokenizer.from_pretrained(hf_model_dir)
model = AutoModel.from_pretrained(hf_model_dir)

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
embeddings = embeddings.T

embeddings.shape

torch.Size([1024, 1])

In [46]:
import numpy as np
from numpy.linalg import norm

# Result from Ollama
A = np.array(json_response["embedding"])
B = embeddings.detach().numpy()

 
# compute cosine similarity
cosine = np.dot(A,B)/(norm(A)*norm(B))
cosine

array([0.92337578])

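Good result! A cosine similarity of about 0.92 shows that the two embeddings are close. Note that the two inputs are not identical: the Ollama prompt is "The quick brown fox jumps over the lazy dog.", while the Hugging Face input is the reworded, `query: `-prefixed "The lazy dog was jumped over by the quick brown fox.". This makes it a sanity check that the converted model behaves similarly rather than a strict equivalence test.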
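For reference, cosine similarity is the dot product of the two vectors divided by the product of their Euclidean norms. The dependency-free sketch below works on flat lists of floats; `cosine_similarity` is a hypothetical helper, not the NumPy code used in the cell above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length sequences of numbers."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

For a stricter comparison, embedding the exact same string through both Ollama and Hugging Face and passing the two flat vectors to this function should give a value much closer to 1 if the conversion preserved the weights.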