---
aliases:
- /ollama/
categories:
- Large Language Models
date: '2024-08-27'
colab: <a href="https://colab.research.google.com/drive/1glOts4otk-ZEdYWx-pyUkxnrLGkVAzcW#scrollTo=ivP5DvZiIJ-O"><img src="images/colab.png" alt="Open In Colab"></a>
image: /images/ollama/thumbnail.jpg
title: "From Fine-Tuning to Deployment"
subtitle: "Harnessing Custom LLMs with Ollama and Quantization"
---

# From Fine-Tuning to Deployment: Harnessing Custom LLMs with Ollama and Quantization

Imagine running large language models (LLMs) on your own machine, without relying on costly cloud services. This is what Ollama is built for. While Ollama offers a range of ready-to-use models, there are times when a custom model is necessary, whether it is fine-tuned on specific data or designed for a particular task. Deploying such custom models efficiently on local hardware often requires optimization techniques like quantization.
In this article, we explore the concept of quantization and demonstrate how to apply it to a fine-tuned model from Hugging Face. We then cover how to install Ollama, create a Modelfile for the custom model, and integrate the model into Ollama.
All the code used in this article is available on Google Colab and in the LLM Tutorial.

## Ollama
Ollama is an open-source platform for running large language models (LLMs) locally, without the need for cloud-based services. Designed with accessibility in mind, Ollama simplifies the installation and management of a wide range of pre-trained LLMs and embedding models, so deployment does not require extensive technical expertise. The platform provides a local API for application integration and supports frameworks like LangChain. Tool calling was introduced recently: it allows models to interact with external tools - like APIs, web browsers, and code interpreters - so they can perform more complex tasks. Thanks to a large open-source community, Ollama continues to evolve, making it a robust, cost-effective solution for local AI deployment.

## Quantization in Large Language Models
**Quantization** is a crucial technique in machine learning that involves reducing the precision of a model's weights and activations. Traditionally, these models operate using 32-bit floating point (FP32) formats, but quantization allows for the conversion of these weights to lower precision formats such as 16-bit (FP16), 8-bit (INT8), 4-bit, or even 2-bit. The primary goals of quantization are to reduce the model's memory footprint and computational demands, thereby making it possible to deploy the model on resource-constrained hardware.
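
To put the memory savings into numbers, here is a small back-of-the-envelope sketch. It assumes a round figure of 7 billion parameters for a "7B" model and counts only the weights, ignoring activations, the KV cache, and the per-block metadata that real quantized formats add. The helper name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: a round figure for a "7B" model

for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("4-bit", 4), ("2-bit", 2)]:
    print(f"{label:>5}: {weight_memory_gb(NUM_PARAMS, bits):5.1f} GB")
```

Halving the bit width halves the weight storage, which is why a model that needs a data-center GPU in FP32 can fit on a laptop at 4 bits.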

### Types of Quantization
**• Post-Training Quantization (PTQ)**: PTQ is a straightforward technique in which the model is quantized after it has been fully trained. This method is quick to implement and does not require retraining the model, making it ideal for scenarios where time or resources are limited. However, it may result in a slight decrease in accuracy since the model was not trained with quantization in mind.
**• Quantization-Aware Training (QAT)**: QAT integrates quantization into the training process, allowing the model to learn to compensate for the reduced precision. This approach generally results in better performance compared to PTQ, as the model adapts to the quantized environment during training. However, QAT requires more computational resources during training and is more complex to implement.
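
As a rough illustration of the post-training idea, the sketch below quantizes a handful of already-trained weights to 8-bit integers with a single scale factor (symmetric "absmax" rounding) and then restores them. This is a deliberately simplified toy, not the scheme any particular library uses; the function names and the sample weights are invented for the example.

```python
def quantize_int8(weights):
    """Symmetric (absmax) quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.05, 0.89, -0.33]  # made-up "trained" weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

No retraining is involved: the rounding error of at most half a scale step is simply accepted, which is the trade-off PTQ makes. QAT would instead simulate this rounding during training so the weights can adapt to it.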

## Quantizing a Custom Model
In our example, we use GGUF (GPT-Generated Unified Format), the model file format introduced by Georgi Gerganov and the llama.cpp team. llama.cpp quantizes a model after training (PTQ) and stores the result as a GGUF file. It supports a range of quantization methods, allowing developers to balance model accuracy and efficiency based on their specific needs.

### Installing the llama.cpp library
To start quantizing our model, we need to install the llama.cpp library. The library includes utilities to convert models into GGUF format and tools to quantize these models into various bit-widths depending on the hardware constraints.

In [3]:
!git clone https://github.com/ggerganov/llama.cpp
!pip install -r llama.cpp/requirements.txt -q
!cd llama.cpp && make -j 8

Cloning into 'llama.cpp'...
remote: Enumerating objects: 32548, done.
remote: Counting objects: 100% (11448/11448), done.
remote: Compressing objects: 100% (721/721), done.
remote: Total 32548 (delta 11126), reused 10759 (delta 10726), pack-reused 21100 (from 1)
Receiving objects: 100% (32548/32548), 54.44 MiB | 26.43 MiB/s, done.
Resolving deltas: 100% (23535/23535), done.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
diambra-engine 2.2.1 requires protobuf<4.0dev,>=3.12.0, but you have protobuf 4.25.4 which is incompatible.
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Darwin
I UNAME_P:   arm
I UNAME_M:   arm64
I CFLAGS:    -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DGGML_USE_BLAS -DACCE

## Downloading and Preparing the Model
Once we have the necessary tools, the next step is to download the model we want to quantize. In this example, we are using the Mistral-v0.3-7B-ORPO model that we fine-tuned in the last article. We clone the model repository and rename the cloned directory to ./model, the path the conversion step expects.

In [9]:
# Get the model from Hugging Face and rename the cloned directory to ./model
!git lfs install
!git clone https://huggingface.co/llmat/Mistral-v0.3-7B-ORPO Mistral-v0.3-7B-ORPO
!mv Mistral-v0.3-7B-ORPO model


Git LFS initialized.
Cloning into 'Mistral-v0.3-7B-ORPO'...
remote: Enumerating objects: 23, done.
remote: Counting objects: 100% (17/17), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 23 (delta 3), reused 0 (delta 0), pack-reused 6 (from 1)
Unpacking objects: 100% (23/23), 13.30 KiB | 973.00 KiB/s, done.
Filtering content: 100% (4/4), 1.50 GiB | 8.34 MiB/s, done.


Once we have our model, we need to convert it to a GGUF file in F16 precision.

In [10]:
!python ./llama.cpp/convert_hf_to_gguf.py ./model --outfile ./model/Mistral-v0.3-7B-ORPO-f16.gguf --outtype f16

INFO:hf-to-gguf:Loading model: model
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00003.safetensors'
INFO:hf-to-gguf:token_embd.weight,           torch.bfloat16 --> F16, shape = {4096, 32768}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.bfloat16 --> F16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.bfloat16 --> F16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_up.weight,         torch.bfloat16 --> F16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.bfloat16 --> F16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.0.attn_output.weight,    torch.bfloat16 --> F16, shape = {

Now we can choose the method by which our model is quantized.
In the context of **llama.cpp**, quantization methods are typically named following a specific convention: **Q#_K_M**
Let's break down what each component means:
<br><br>**• Q:** Stands for "Quantization," indicating that the model has undergone a process to reduce its numerical precision.
<br><br>**• #:** Refers to the nominal number of bits per weight. For example, the 4 in Q4_K_M indicates 4-bit quantization. The effective size per weight is slightly higher, because each block of weights also stores scaling factors.
<br><br>**• K:** Denotes the "k-quant" family of methods in llama.cpp. K-quants group weights into blocks and super-blocks and store a compact scale (and, for some types, a minimum) per block, which lowers the quantization error compared with the older `_0` and `_1` formats at a similar size.
<br><br>**• M:** Indicates the variant of the quantization mix, that is, how many tensors are kept at a higher precision than the nominal bit width. File size and accuracy both grow from S to L:
<br><br>**• S** = Small
<br><br>**• M** = Medium
<br><br>**• L** = Large
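
To see why formats that store one scale per small block of weights can help, consider the following toy comparison. It is a simplified sketch, not llama.cpp's actual storage layout: it assumes blocks of 32 weights, signed 4-bit values in [-7, 7], and one floating-point scale per block, and it compares that against a single scale for the whole tensor when one outlier weight is present. All names and the random data are made up for the illustration.

```python
import random

BLOCK_SIZE = 32  # assumption for this sketch


def quantize_block_4bit(block):
    """Quantize one block to signed 4-bit integers in [-7, 7] with a single scale."""
    max_abs = max(abs(w) for w in block)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    return [round(w / scale) for w in block], scale


def roundtrip(weights, block_size):
    """Quantize block by block, then dequantize, returning the restored weights."""
    restored = []
    for start in range(0, len(weights), block_size):
        q, scale = quantize_block_4bit(weights[start:start + block_size])
        restored.extend(v * scale for v in q)
    return restored


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]
weights[10] = 1.0  # a single outlier weight

per_block = mean_abs_error(weights, roundtrip(weights, BLOCK_SIZE))
whole_tensor = mean_abs_error(weights, roundtrip(weights, len(weights)))
print(f"per-block error:    {per_block:.5f}")
print(f"whole-tensor error: {whole_tensor:.5f}")
```

With a single scale, the outlier stretches the quantization grid so far that most small weights collapse to zero. With per-block scales, only the block that contains the outlier suffers, which is the intuition behind block-wise formats.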

## Quantization Methods Explained
Here's a closer look at the different quantization methods supported by llama.cpp and Ollama, following the Q#_K_M naming convention:
<br><br>`Q2_K`: A 2-bit k-quant, offering the most significant size reduction but with a considerable loss in accuracy. It's mainly used in highly constrained environments where memory and processing power are extremely limited.
<br><br>`Q3_K_S`: The small variant of the 3-bit k-quant. It provides the largest memory savings of the 3-bit options and is used when accuracy can be somewhat compromised.
<br><br>`Q3_K_M`: The medium variant of the 3-bit k-quant. It keeps some tensors at a higher precision than Q3_K_S, offering a balanced trade-off between memory usage and accuracy.
<br><br>`Q3_K_L`: The large variant of the 3-bit k-quant. It keeps more tensors at a higher precision, making it the largest and most accurate of the 3-bit options.
<br><br>`Q4_0`: A legacy 4-bit method that stores one scale per block of weights and does not use the k-quant scheme. It is a widely used baseline that offers a reasonable balance between size reduction and model accuracy.
<br><br>`Q4_1`: Similar to Q4_0, but each block additionally stores a minimum value (an offset), which gives slightly better accuracy at the cost of a slightly larger file.
<br><br>`Q4_K_S`: The small variant of the 4-bit k-quant. It reduces the model size significantly while preserving reasonable accuracy.
<br><br>`Q4_K_M`: The medium variant of the 4-bit k-quant, offering an excellent balance between size and accuracy. It's one of the most recommended methods for general use.
<br><br>`Q5_0`: A legacy 5-bit method with one scale per block. It offers higher precision than the 4-bit methods and is a good choice when you have slightly more memory available and need to maintain higher accuracy.
<br><br>`Q5_1`: Similar to Q5_0, but with an additional per-block minimum, providing slightly greater accuracy at the cost of a slightly larger file.
<br><br>`Q5_K_S`: The small variant of the 5-bit k-quant, providing higher accuracy than 4-bit methods with only a moderate increase in resource use.
<br><br>`Q5_K_M`: The medium variant of the 5-bit k-quant, providing high accuracy with reasonable memory efficiency. This method is often recommended for scenarios where accuracy is critical but resources are still somewhat limited.
<br><br>`Q6_K`: A 6-bit k-quant, providing a middle ground between 4-bit and 8-bit methods. It's suitable when you need more accuracy than 4-bit offers but can't afford the higher resource demands of 8-bit quantization.
<br><br>`Q8_0`: Uses 8-bit quantization, which is nearly as accurate as the original float16 model. This method is best for scenarios where you need to preserve as much accuracy as possible while still reducing the model size.
<br><br>In our example we choose to quantize our model to 4-bit, using the Q4_K_M method.

In [17]:
!mkdir Mistral-v0.3-7B-ORPO_Q4_K_M
!./llama.cpp/llama-quantize ./model/Mistral-v0.3-7B-ORPO-f16.gguf ./Mistral-v0.3-7B-ORPO_Q4_K_M/Mistral-v0.3-7B-ORPO_Q4_K_M.gguf Q4_K_M

main: build = 3615 (1731d423)
main: built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for arm64-apple-darwin23.6.0
main: quantizing './model/Mistral-v0.3-7B-ORPO-f16.gguf' to './Mistral-v0.3-7B-ORPO_Q4_K_M/Mistral-v0.3-7B-ORPO_Q4_K_M.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 42 key-value pairs and 291 tensors from ./model/Mistral-v0.3-7B-ORPO-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Mistral 7b v0.3 Bnb 4bit
llama_model_loader: - kv   3:                            general.version str              = v0.3
llama_model_loader: - kv   4:                       general.organization str          
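
As an optional sanity check, the header of the resulting file can be inspected from Python. The sketch below assumes the GGUF header layout used by recent format versions: the 4-byte magic `GGUF`, followed by a little-endian 32-bit version, a 64-bit tensor count, and a 64-bit metadata key-value count. The helper name `read_gguf_header` is made up for this example.

```python
import os
import struct


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file header."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"{path} is not a GGUF file (magic={magic!r})")
        # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count
        version, tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
    return version, tensor_count, kv_count


path = "./Mistral-v0.3-7B-ORPO_Q4_K_M/Mistral-v0.3-7B-ORPO_Q4_K_M.gguf"
if os.path.exists(path):
    print(read_gguf_header(path))
```

The values can be compared with what the quantization log reports for the input file (GGUF V3, 291 tensors, 42 key-value pairs), keeping in mind that the quantized output may carry a slightly different number of metadata entries.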

## Push model to hub (Optional)

In [18]:
# imports from huggingface
import os

from huggingface_hub import HfApi

username = "llmat"
# Read the access token from the environment (for example, loaded from a .env file).
# The variable name HUGGINGFACE_TOKEN is an assumption; never hard-code the token itself.
HUGGINGFACE_TOKEN = os.environ["HUGGINGFACE_TOKEN"]
MODEL_NAME = "Mistral-v0.3-7B-ORPO_Q4_K_M"

api = HfApi(token=HUGGINGFACE_TOKEN)

# Create empty repo (using the authenticated client)
api.create_repo(
    repo_id=f"{username}/{MODEL_NAME}-GGUF",
    repo_type="model",
    exist_ok=True,
)

# Upload gguf files
api.upload_folder(
    folder_path=MODEL_NAME,
    repo_id=f"{username}/{MODEL_NAME}-GGUF",
    allow_patterns="*.gguf",
)

Mistral-v0.3-7B-ORPO_Q4_K_M.gguf:   0%|          | 0.00/4.37G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/llmat/Mistral-v0.3-7B-ORPO_Q4_K_M-GGUF/commit/982c1f2854cb26577fa3fe88c42d2aae4d43158c', commit_message='Upload folder using huggingface_hub', commit_description='', oid='982c1f2854cb26577fa3fe88c42d2aae4d43158c', pr_url=None, pr_revision=None, pr_num=None)

## Install Ollama
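With our model now quantized, the next step is to install and start Ollama.
On Linux (including Google Colab), install Ollama by running the following command. The script supports Linux only, which is why it fails with the error shown below when run on macOS; on macOS and Windows, download the installer from https://ollama.com/download instead.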

In [19]:
!curl -fsSL https://ollama.com/install.sh | sh

ERROR This script is intended to run on Linux only.


Once the installation is complete, start the Ollama server.
If you are using Google Colab, execute the following commands to open a terminal in which Ollama can be started:

In [None]:
!pip install colab-xterm #https://pypi.org/project/colab-xterm/
%load_ext colabxterm
%xterm

A terminal window opens. Enter the command `ollama serve` in it.

If you are running in a local environment, use this command to start the Ollama server:

In [20]:
!ollama serve

Error: listen tcp 127.0.0.1:11434: bind: address already in use
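
Whether a server is already listening can also be checked from Python before starting another one. The sketch below assumes Ollama's default address `http://localhost:11434`, whose root path answers with HTTP status 200 while the server is up; the helper name `is_ollama_running` is made up for this example.

```python
import urllib.error
import urllib.request


def is_ollama_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: treat as "not running"
        return False


print(is_ollama_running())
```

If this prints `True`, there is no need to run `ollama serve` again.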


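An `address already in use` error from `ollama serve` means that an Ollama server is already listening on port 11434, so there is nothing more to start.

Before we can add our quantized model to the Ollama server, Ollama requires us to create a Modelfile.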

## Modelfile for Ollama
The Modelfile is Ollama's blueprint for a model: it defines the key components and parameters that customize model behavior within the Ollama ecosystem. Its syntax is still evolving. The Modelfile includes several key instructions:
<br><br>**• FROM (Required):** Specifies the base model or file to build from.
<br><br>**• PARAMETER:** Sets various operational parameters like temperature, context window size, and stopping conditions, influencing model output and behavior.
<br><br>**• TEMPLATE:** Defines the prompt structure, including system messages and user prompts, which guides the model's responses.
<br><br>**• SYSTEM:** Sets the system message to dictate the model's behavior.
<br><br>**• ADAPTER:** Applies LoRA adapters to the model for further customization.
<br><br>**• LICENSE:** Specifies the legal license under which the model is shared.
<br><br>**• MESSAGE:** Provides a message history to influence how the model generates responses.
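
To make the instruction syntax more tangible, here is a minimal sketch that assembles Modelfile text from a few of these instructions. The helper `render_modelfile`, the file path, and the parameter values are hypothetical and chosen only for illustration; `temperature`, `num_ctx`, and `stop` are parameter names that Ollama accepts.

```python
def render_modelfile(from_path, parameters=None, system=None):
    """Build Modelfile text from a FROM path, PARAMETER pairs and an optional SYSTEM message."""
    lines = [f"FROM {from_path}"]
    for name, value in (parameters or []):
        # Quote string values such as stop sequences; leave numbers bare
        rendered = f'"{value}"' if isinstance(value, str) else str(value)
        lines.append(f"PARAMETER {name} {rendered}")
    if system:
        lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"


print(render_modelfile(
    "./my-model.gguf",  # hypothetical path
    parameters=[("temperature", 0.7), ("num_ctx", 4096), ("stop", "[INST]")],
    system="You are a concise assistant.",
))
```

Each instruction occupies its own line, and a parameter such as `stop` may be repeated to register several values.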

## Create Custom Modelfile
In our example, we only need to set the path, define the template, and set parameters for the stopping conditions.
<br><br>**Path to the Quantized Model:** The Modelfile needs to specify the path to the quantized model stored on our system. This ensures the correct model is loaded for processing.
<br><br>**Template for Message Processing:** The template within the Modelfile is based on the chat template used in the base model. It is responsible for processing and formatting messages according to their roles, such as "user" or "assistant." This structure guarantees that the model's output adheres to the dialogue format it was fine-tuned for.
<br><br>**Stop Parameters:** The stop parameters tell Ollama where to cut off generation. The markers "[INST]" and "[/INST]" delimit the user's input in the prompt. If the model starts to produce one of them, it is beginning to write a new user turn rather than continuing its answer, so generation stops there.
<br><br>Below is how we define the path to our quantized model, construct the template content and stopping parameters within the Modelfile for our example:

In [23]:
# Creating the content for the Modelfile
template_content = """TEMPLATE """
template_content += '''"""
{{- if .Messages }}
    {{- range $index, $_ := .Messages }}
        {{- if eq .Role "user" }}
            [INST] 
            {{ .Content }}[/INST]
        {{- else if eq .Role "assistant" }}
            {{- if .Content }} {{ .Content }}
            {{- end }}</s>
        {{- end }}
    {{- end }}
{{- else }}
    [INST] 
    {{ .Prompt }}[/INST]
{{- end }} 
{{ .Response }}
{{- if .Response }}</s>
{{- end }}
"""'''

# Write the rest of the parameters to the file
with open('./Mistral-v0.3-7B-ORPO_Q4_K_M/modelfile', 'w') as file:
    file.write('FROM ./Mistral-v0.3-7B-ORPO_Q4_K_M.gguf\n\n')
    file.write('PARAMETER stop "[INST]"\n')
    file.write('PARAMETER stop "[/INST]"\n\n')
    file.write(template_content)

Let's break down the template content:
<br><br>**• Processing Messages:** The template processes the list of messages (.Messages) by identifying the role of each sender (.Role), effectively structuring the conversation.
<br><br>**• Formatting User Messages:** Messages from the "user" are enclosed within [INST] and [/INST] tags. Messages with any other role, such as a system message, are not rendered by this template.
<br><br>**• Formatting Assistant Messages:** Messages from the "assistant" are output directly without additional tags, with a </s> tag appended to signify the end of the response.
<br><br>**• Handling Edge Cases:** If no message list is present, the template falls back to the plain prompt (.Prompt) and wraps it in [INST] tags, so that the model still receives input in the format it was fine-tuned on.
<br><br>**• Final Response Handling:** The final response is appended and closed with a </s> tag, ensuring the conversation is properly terminated.
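
<br><br>The effect of the template can be approximated in a few lines of Python. This is only a sketch of the resulting prompt shape: it ignores the exact whitespace and line breaks that the Go template emits, and the function name `render_prompt` is invented for the example.

```python
def render_prompt(messages):
    """Approximate the prompt text the template produces for a list of chat messages."""
    parts = []
    for message in messages:
        if message["role"] == "user":
            # User turns are wrapped in instruction tags
            parts.append(f"[INST] {message['content']}[/INST]")
        elif message["role"] == "assistant":
            # Assistant turns are emitted as-is and closed with the end-of-sequence tag
            parts.append(f" {message['content']}</s>")
        # Any other role is skipped, as in the template
    return "".join(parts)


conversation = [
    {"role": "user", "content": "What is one plus one?"},
    {"role": "assistant", "content": "One plus one equals two."},
    {"role": "user", "content": "And two plus two?"},
]
print(render_prompt(conversation))
```

The prompt ends with an open `[/INST]`, which is the point at which the model is expected to continue with its next answer.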
<br><br>After creating the Modelfile, you can display the file content to verify:

In [24]:
# Display the file content
with open('./Mistral-v0.3-7B-ORPO_Q4_K_M/modelfile', 'r') as file:
    content = file.read()

print(content)

With the Modelfile ready, we can now create and add our quantized model to Ollama. This command registers the quantized model with Ollama using the configurations specified in the Modelfile:

In [26]:
!ollama create mistral-v0.3-7B-orpo_Q4_K_M -f ./Mistral-v0.3-7B-ORPO_Q4_K_M/modelfile

transferring model data

We can now check whether our quantized model is listed and ready to use.

In [27]:
!ollama list

NAME                               	ID          	SIZE  	MODIFIED      
mistral-v0.3-7B-orpo_Q4_K_M:latest 	980962cd1ac6	4.4 GB	3 seconds ago	
mxbai-embed-large:latest           	468836162de7	669 MB	21 hours ago 	
dema:latest                        	98643ff5e8b0	7.7 GB	6 days ago   	
mistral:7b-text-v0.2-fp16          	06b91ca50a5e	14 GB 	12 days ago  	
llama3.1:8b-instruct-fp16          	1fbd8c253427	16 GB 	4 weeks ago  	
mistral-nemo:12b-instruct-2407-q8_0	550a4a7f593a	13 GB 	4 weeks ago  	
mistral-nemo:12b-instruct-2407-fp16	6a38e02e88ec	24 GB 	4 weeks ago  	
mathstral:7b-v0.1-fp16             	ce1616d5e8c6	14 GB 	5 weeks ago  	
phi3:14b-medium-4k-instruct-f16    	3bc888c6960f	27 GB 	6 weeks ago  	
gemma2:9b-instruct-fp16            	9de55d4bf6ae	18 GB 	6 weeks ago  	
gemma2:27b-instruct-fp16           	d98a00774ab4	54 GB 	7 weeks ago  	
gemma:7b-instruct-fp16             	f689ad351c8d	17 GB 	2 months ago 	
qwen2:7b-instruct-fp16             	99954bc7f7d6	15 GB 	2 months ago 	
starco
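
The same check can be automated against Ollama's local API, which exposes the model list as JSON at `GET /api/tags`. The sketch below assumes the default server address and a response of the form `{"models": [{"name": ...}, ...]}`; the helper names and the sample payload are written for this example.

```python
import json
import urllib.request


def fetch_tags(base_url="http://localhost:11434"):
    """Return the raw JSON body of GET /api/tags (requires a running server)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return response.read().decode("utf-8")


def model_is_listed(tags_payload, model_name):
    """Check whether model_name appears in the JSON body returned by GET /api/tags."""
    names = [entry["name"] for entry in json.loads(tags_payload).get("models", [])]
    # Ollama appends ":latest" when no tag was given at creation time
    return model_name in names or f"{model_name}:latest" in names


# Hand-written sample payload mirroring the first row of the table above
sample = '{"models": [{"name": "mistral-v0.3-7B-orpo_Q4_K_M:latest"}]}'
print(model_is_listed(sample, "mistral-v0.3-7B-orpo_Q4_K_M"))
```

With a server running, `model_is_listed(fetch_tags(), "mistral-v0.3-7B-orpo_Q4_K_M")` performs the check against the live model list.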

Next, we install the libraries needed to test the model with the LangChain framework:

In [28]:
!pip install langchain-community langchain-core

Collecting langchain-community
  Using cached langchain_community-0.2.12-py3-none-any.whl.metadata (2.7 kB)
Collecting langchain-core
  Downloading langchain_core-0.2.34-py3-none-any.whl.metadata (6.2 kB)
Collecting SQLAlchemy<3,>=1.4 (from langchain-community)
  Downloading SQLAlchemy-2.0.32-cp310-cp310-macosx_11_0_arm64.whl.metadata (9.6 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Using cached dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting langchain<0.3.0,>=0.2.13 (from langchain-community)
  Using cached langchain-0.2.14-py3-none-any.whl.metadata (7.1 kB)
Collecting langsmith<0.2.0,>=0.1.0 (from langchain-community)
  Downloading langsmith-0.1.101-py3-none-any.whl.metadata (13 kB)
Collecting tenacity!=8.4.0,<9.0.0,>=8.1.0 (from langchain-community)
  Using cached tenacity-8.5.0-py3-none-any.whl.metadata (1.2 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain-core)
  Using cached jsonpatch-1.33-py2.py3-none-any.whl.metadata (3.0 kB)


Now we run and test the model on Ollama:

In [31]:
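# The Ollama wrapper lives in langchain-community, which was installed above
from langchain_community.llms import Ollama

ollama = Ollama(base_url="http://localhost:11434", model="mistral-v0.3-7B-orpo_Q4_K_M")

TEXT_PROMPT = "What is one plus one?"

# invoke() replaces the deprecated direct call ollama(TEXT_PROMPT)
print(ollama.invoke(TEXT_PROMPT))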

One plus one equals two.


The model should return a correct response.
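
LangChain is optional here. The model can also be queried through Ollama's local REST API using only the standard library. The following sketch assumes the default server address and the `/api/generate` endpoint with `stream` set to false, in which case the generated text is returned in the `response` field of a single JSON object. The helper names are made up for this example.

```python
import json
import urllib.request


def build_generate_request(model, prompt, base_url="http://localhost:11434"):
    """Create a non-streaming POST request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, base_url="http://localhost:11434", timeout=120):
    """Send the request and return the generated text from the "response" field."""
    request = build_generate_request(model, prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


# Requires a running Ollama server with the model created earlier:
# print(generate("mistral-v0.3-7B-orpo_Q4_K_M", "What is one plus one?"))
```

Talking to the API directly keeps dependencies minimal, while LangChain becomes more convenient once the model is part of a larger chain or agent.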

## Conclusion
This article has walked you through the process of quantizing a custom model, integrating it with Ollama, and testing it locally. Using llama.cpp, we quantized our custom model to the Q4_K_M format and pushed it to the Hugging Face Hub. We then created the corresponding Modelfile and integrated our model into Ollama.
<br><br>Quantization offers significant benefits, including a reduced memory footprint and, on most hardware, faster inference and lower power consumption. These advantages make it feasible to deploy sophisticated AI models across a variety of hardware configurations, from high-performance servers to low-power edge devices. I hope you enjoyed reading this article and learned something new.
<br><br>You can find the quantized model from this example on Hugging Face.
