<!-- Banner Image -->
<img src="https://uohmivykqgnnbiouffke.supabase.co/storage/v1/object/public/landingpage/brevdevnotebooks.png" width="100%">

<!-- Links -->
<center>
  <a href="https://console.brev.dev" style="color: #06b6d4;">Console</a> •
  <a href="https://brev.dev" style="color: #06b6d4;">Docs</a> •
  <a href="/" style="color: #06b6d4;">Templates</a> •
  <a href="https://discord.gg/NVDyv7TUgJ" style="color: #06b6d4;">Discord</a>
</center>

# Finetune and convert Llama3 to an Ollama Modelfile 🤙

Hi everyone!

In the last couple months, Ollama has taken the AI community by a storm. It's one of the easiest ways of interacting with open souce models and has a ton of support from the community. One thing we havn't seen online is end-to-end guides on how to finetune a model and then convert it to an Ollama model so it can be used anywhere...so we decided to build one :)

In this notebook, we will do the following.
1. Finetune Llama3-8B instruct using a novel finetuning method called ORPO. This combines Supvervised-Finetuning with Direct Preference Optimization and happens all in 1 step
2. We will install `llama.cpp` and convert our model to `gguf` format and talk about the different options (and how LoRA plays in here)
3. We will write our Ollama `Modelfile` which can be thought of as a Dockerfile for Ollama models
4. We will locally test and then push our model to the Ollama hub
5. We will pull our model on a separate machine and launch it for inference

Grab some snacks, buckle-up, and let it rip 🤙!

Here's the link to the corresponding [video](https://www.youtube.com/watch?v=sK7yqqrK2fE&t=1s)

#### Help us make this tutorial better! Please provide feedback on the [Discord channel](https://discord.gg/T9bUNqMS8d) or on [X](https://x.com/brevdev).

A note about running Jupyter Notebooks: Press Shift + Enter to run a cell. A * in the left-hand cell box means the cell is running. A number means it has completed. If your Notebook is acting weird, you can interrupt a too-long process by interrupting the kernel (Kernel tab -> Interrupt Kernel) or even restarting the kernel (Kernel tab -> Restart Kernel). Note restarting the kernel will require you to run everything from the beginning.

# Phase 1: Finetune Llama3 with ORPO

So far, we've released guides on finetuning Llama3 with 2 different approaches
1. [Supervised Finetuning](https://github.com/brevdev/notebooks/blob/main/llama3_finetune_inference.ipynb)
2. [Direct Preference Optimization](https://github.com/brevdev/notebooks/blob/main/llama3dpo.ipynb)

If you havn't taken a look at these, I suggest skimming through the code for the SFT notebook and the explanation at the top of the DPO notebook. TLDR: Most of the finetuning in guides and online use SFT. SFT is good at adapting a model to domain-specific ouput but might also increase the probability of generating outputs that are not as desirable. To solve this issue, we can then perform a process known as preference alignment. This can be done using RLHF or DPO. However, these are both computationally expensive.

Enter Odds Ratio Policy Optimization. At a high level, this combines SFT and DPO into 1 neat step which weakly penalizes rejected responses while strongly rewarding preferred responses. Here we optimize for two objectives at once: learning domain-specific output AND aligning the output with out preferences. If you'd like to dive deeper into ORPO, check out MLabonne's [fantastic guide](https://huggingface.co/blog/mlabonne/orpo-llama-3) on HuggingFace

![ORPO](https://i.imgur.com/ftrth4Q.png)


## Install Dependancies

In [1]:
!pip install bitsandbytes
!pip install wandb
!pip install transformers
!pip install peft
!pip install accelerate
!pip install trl

Collecting bitsandbytes
  Downloading bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl (119.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m119.8/119.8 MB[0m [31m5.1 MB/s[0m eta [36m0:00:00[0m
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->bitsandbytes)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch->bitsandbytes)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch->bitsandbytes)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch->bitsandbytes)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch->bitsandbytes)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (41

In [2]:
import gc
import os

import torch
import wandb
from datasets import load_dataset
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
)
from trl import ORPOConfig, ORPOTrainer

## Load Model, Tokenizer, and Dataset

We'll be working with the Llama3-8B instruct model. But the steps are almost the same with any other model if you decide to finetune with a LoRA adapter.

In [3]:
# Model
base_model = "cognitivecomputations/dolphin-2.9.2-qwen2-7b"
new_model = "brevity-adapter"


In [4]:
# QLoRA config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

In [5]:
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
)

In [6]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [7]:
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

tokenizer_config.json:   0%|          | 0.00/1.26k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/80.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/367 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [8]:
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

config.json:   0%|          | 0.00/699 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/27.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.88G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/4.93G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.33G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.09G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/138 [00:00<?, ?B/s]

In [9]:
dataset_name = "hurtIII/uncensoredformat_openaiartofwarfinal1"
dataset = load_dataset(dataset_name, split="all")
dataset = dataset.shuffle(seed=13).select(range(13))

def format_chat_template(row):
    row["chosen"] = row["messages"][2]["content"]  # Assuming the third message is the assistant's response (chosen)
    row["rejected"] = ""  # If there is no rejected message in the new dataset format
    return row

dataset = dataset.map(
    format_chat_template,
    num_proc=os.cpu_count(),
)
dataset = dataset.train_test_split(test_size=0.01)


#dataset_name = "mlabonne/orpo-dpo-mix-40k"
#dataset = load_dataset(dataset_name, split="all")
#dataset = dataset.shuffle(seed=42).select(range(150))

#def format_chat_template(row):
#    row["chosen"] = tokenizer.apply_chat_template(row["chosen"], tokenize=False)
#    row["rejected"] = tokenizer.apply_chat_template(row["rejected"], tokenize=False)
#    return row

#dataset = dataset.map(
#    format_chat_template,
#    num_proc= os.cpu_count(),
#)
#dataset = dataset.train_test_split(test_size=0.01)

Downloading readme:   0%|          | 0.00/86.0 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/61.7k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/13 [00:00<?, ? examples/s]

  self.pid = os.fork()


Map (num_proc=2):   0%|          | 0/13 [00:00<?, ? examples/s]

Take a look at the dataset by uncommenting the code block below

In [10]:
# Uncomment this section to inspect the dataset structure
for k in dataset['train'].features.keys():
    print(k)
    print("---------")
    print(dataset['train'][1][k])
    print('\n')


# for k in dataset['train'].features.keys():
#     print(k)
#     print("---------")
#     print(dataset['train'][1][k])
#     print('\n')

messages
---------
[{'role': 'system', 'content': 'You are a helpful assistant knowledgeable about The Art of War.'}, {'role': 'user', 'content': 'Chapter : The control of a large force is the same principle as the control of a few men: it is merely a question of dividing up their numbers.'}, {'role': 'assistant', 'content': '. Fighting with a large army under your command is nowise different from fighting with a small one: it is merely a question of instituting signs and signals. To ensure that your whole host may withstand the brunt of the enemy attack and remain unshaken this is effected by maneuvers direct and indirect. That the impact of your army may be like a grind- stone dashed against an egg this is effected by the sci- ence of weak points and strong. In all fighting, the direct method may be used for joining battle, but indirect methods will be needed in order to secure victory. Indirect tactics, efficiently applied, are inexhaustible as Heaven and Earth, unending as the flow

In [11]:
!pip install trl --upgrade
!pip install transformers --upgrade
!pip install accelerate



Collecting transformers
  Downloading transformers-4.42.3-py3-none-any.whl (9.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m9.3/9.3 MB[0m [31m23.7 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.41.2
    Uninstalling transformers-4.41.2:
      Successfully uninstalled transformers-4.41.2
Successfully installed transformers-4.42.3


Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 3108, in _dep_map
    return self.__dep_map
  File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 2901, in __getattr__
    raise AttributeError(attr)
AttributeError: _DistInfoDistribution__dep_map

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/pip/_internal/cli/base_command.py", line 169, in exc_logging_wrapper
    status = run_func(*args)
  File "/usr/local/lib/python3.10/dist-packages/pip/_internal/cli/req_command.py", line 242, in wrapper
    return func(self, options, args)
  File "/usr/local/lib/python3.10/dist-packages/pip/_internal/commands/install.py", line 441, in run
    conflicts = self._determine_conflicts(to_install)
  File "/usr/local/lib/python3.10/dist-packages/pip/_internal/commands/install.py", line 

In [4]:
from typing import Any, Dict, Optional, Union

import torch
from torch.utils.data import Dataset
from transformers import Trainer, is_torch_tpu_available, DataCollatorWithPadding
from transformers.trainer_utils import PredictionOutput

from trl.core import LengthSampler
from trl.trainer.utils import has_length


In [5]:
class ORPOTrainer(Trainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def tokenize_row(self, feature, model):
        """
        Tokenizes a single row of the dataset.
        """
        batch = {}
        # Adapt to the dataset schema
        chosen = feature["chosen"]  # Assuming "chosen" key is for the chosen response
        rejected = feature["rejected"]  # Assuming "rejected" key is for the rejected response

        # Tokenize the chosen and rejected responses
        batch["input_ids"] = self.tokenizer(
            chosen,
            truncation=True,
            max_length=self.args.max_length,
            padding="max_length",
            return_tensors="pt",
        ).input_ids
        batch["labels"] = self.tokenizer(
            rejected,
            truncation=True,
            max_length=self.args.max_length,
            padding="max_length",
            return_tensors="pt",
        ).input_ids

        # Convert to PyTorch tensors
        batch = {k: v.squeeze() for k, v in batch.items()}
        return batch


In [9]:
import gc
import os
import torch
from datasets import load_dataset
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
)
from trl import ORPOConfig, ORPOTrainer  # Ensure this import is correct


In [10]:
# ORPO training configuration
orpo_args = ORPOConfig(
    learning_rate=8e-6,
    lr_scheduler_type="linear",
    max_length=1024,
    max_prompt_length=512,
    beta=0.1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_8bit",
    num_train_epochs=1,
    max_steps=10,
    evaluation_strategy="steps",
    eval_steps=0.2,
    logging_steps=1,
    warmup_steps=2,
    report_to="wandb",
    output_dir="./results/",
)

# Load and preprocess the dataset
dataset_name = "hurtIII/uncensoredformat_openaiartofwarfinal1"
dataset = load_dataset(dataset_name, split="all")

# Adjust selection to the actual dataset size
dataset_size = len(dataset)
dataset = dataset.shuffle(seed=42).select(range(dataset_size))

def format_chat_template(row):
    row["chosen"] = row["messages"][1]["content"]  # Assuming the second message is the chosen one
    row["rejected"] = row["messages"][2]["content"]  # Assuming the third message is the rejected one
    return row

dataset = dataset.map(
    format_chat_template,
    num_proc=os.cpu_count(),
)
dataset = dataset.train_test_split(test_size=0.01)


  self.pid = os.fork()


Map (num_proc=2):   0%|          | 0/13 [00:00<?, ? examples/s]

## Set up ORPO Training

The ORPO trainer looks very similar to the SFTTrainer and the DPO trainer. We set our config parameters and start our training run. Notice that the run is very short. In order to increase it, you can increase the `num_train_epochs` or add a `max_steps` argument

In [12]:
orpo_args = ORPOConfig(
    learning_rate=8e-6,
    lr_scheduler_type="linear",
    max_length=1024,
    max_prompt_length=512,
    beta=0.1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_8bit",
    num_train_epochs=1,
    max_steps=10,
    evaluation_strategy="steps",
    eval_steps=0.2,
    logging_steps=1,
    warmup_steps=2,
    report_to="wandb",
    output_dir="./results/",
)




In [14]:
# Model and Tokenizer
base_model = "cognitivecomputations/dolphin-2.9.2-qwen2-7b"
new_model = "brevity-adapter"

# QLoRA config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
)

# Log in to Hugging Face (if needed)
from huggingface_hub import notebook_login
notebook_login()

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Apply PEFT configuration to the model
model = get_peft_model(model, peft_config)


VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [16]:
# Load and preprocess the dataset
dataset_name = "hurtIII/uncensoredformat_openaiartofwarfinal1"
dataset = load_dataset(dataset_name, split="all")

# Adjust selection to the actual dataset size
dataset_size = len(dataset)
dataset = dataset.shuffle(seed=42).select(range(dataset_size))

def format_chat_template(row):
    row["chosen"] = row["messages"][1]["content"]  # Assuming the second message is the chosen one
    row["rejected"] = row["messages"][2]["content"]  # Assuming the third message is the rejected one
    return row

dataset = dataset.map(
    format_chat_template,
    num_proc=os.cpu_count(),
)
dataset = dataset.train_test_split(test_size=0.01)


In [18]:
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
)

model = get_peft_model(model, peft_config)


In [19]:
trainer = ORPOTrainer(
    model=model,
    args=orpo_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
)


Map:   0%|          | 0/12 [00:00<?, ? examples/s]

KeyError: 'prompt'

In [None]:
# Flush memory
del trainer, model
gc.collect()
torch.cuda.empty_cache()

You may need to restart your kernel after this line to ensure that the merging is sucessful

In [33]:
import torch
from peft import PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
)

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

# Adjust model loading with disk offloading
fp16_model = AutoModelForCausalLM.from_pretrained(
    base_model,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",
    offload_folder="offload",  # Specify the offload folder for handling memory
    offload_state_dict=True
)


NameError: name 'base_model' is not defined

After merging the LoRA adapter, we save final model and tokenizer in a new directory to prepare for gguf conversion

In [None]:
model.save_pretrained("dolphin-2.9.2-qwen2-7b")
tokenizer.save_pretrained("dolphin-2.9.2-qwen2-7b")

# Phase 2: From safetensors to gguf

## Convert our model to GGUF

Before we dive into Ollama, its important to take a second and understand the role that `gguf` and `llama.cpp` play in the process.

Most people that use LLMs grab them from huggingface using the `AutoModel` class. This is how we did it above. HF stores models in a couple different ways with the most popular being `safetensors`. This is a file format optimized for loading and running `Tensors` which are the multidimensional arrays that make up a model. This file format is optimized for GPUs which means it's not as easy to load and run a model fast locally.

One solution that addresses this is the `gguf` format. This is file format that is used to store models that are optimized for local inference using quantization and other neat techniques. This file format is then consumed by runners that support it (ie. `llama.cpp` and Ollama).

There's a good bit of complexity here and heres a fantastic [blog post](https://vickiboykis.com/2024/02/28/gguf-the-long-way-around/) that gets into the weeds. For now this is what we need to know

1. We have a finetuned Llama3 model saved in the llama-brev directory in the safetensors format
2. In order to use this model via Ollama, we need it to be in the `gguf` format
3. We can use helper tools in the `llama.cpp` repository to convert

## Convert to gguf

The first thing we do is build llama.cpp in order to use the conversion tools

In [None]:
# this step might take a while
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUDA=1 make

In [None]:
# install requirements
!pip install -r llama.cpp/requirements.txt

Here we run the actual conversion. Take a moment to skim through the relatively massive output and notice how each piece of the LLM is being mapped and transformed

In [None]:
# run the conversion script to convert to gguf
# this will save the gguf file inside of the dolphin-brev directory
!python llama.cpp/convert-hf-to-gguf.py dolphin-brev


The final model is saved at `llama-brev/ggml-model-f16.gguf`. Note how the model is in fp16

### An aside on quantizations

Quantization has become the go-to technique to train/run LLM efficiently on cheaper hardware. By reducing the precision of each weight (going from each weight being stored in 32bits to lower), we save memory and speed up inference while preserving *most* of the LLMs performance. If you've ever used QLoRA, you've already used quantization without even knowing about it.

Llama.cpp gives us a ton of quantization options. Here's a couple resources to dive deeper into which options are available

- [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overview_of_gguf_quantization_methods/)
- [Maxime Labonne](https://mlabonne.github.io/blog/posts/Quantize_Llama_2_models_using_ggml.html)

In this guide, we will use the `Q4_K_M` format. Feel free to play around with different ones! Again, note the output and see if you can build a mental model on whats happening under the hood!

In [None]:
# run the quantize script
!cd llama.cpp && ./llama-quantize ../dolphin-brev/ggml-model-f16.gguf ../dolphin-brev/ggml-model-Q4_K_M.gguf Q4_K_M


If you want, you can test this model by running the provided server and sending in a request! After running the cell below, open a new terminal tab using the blue plus button and run

```
curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
```

In [None]:
!cd llama.cpp && ./llama-server -m ../dolphin-brev/ggml-model-Q4_K_M.gguf -c 2048

Note that this is a blocking process. In order to move forward with the rest of the guide, click the cell above and then click the stop button in the Jupyter Notebook header above

# Phase 3: Run and deploy your model using Ollama

Now that you have the quantized model, you can spin up the llama.cpp server anywhere you want and load the gguf model in. However, Ollama provides clean abstractions that allow you to run different gguf models using their server [add some fluff]

## Build the Ollama Modelfile

A Modelfile is very similar to a Dockefile. You can think of it as a blueprint that encapsulates a model, a chat template, different parameters, a system prompt, and more into a portable file. To learn more, check out their [Modelfile docs](https://github.com/ollama/ollama/blob/main/docs/modelfile.md).

Here we will build a relatively simple one. We grab the template and params from the existing Llama3 Modefile which you can view [here](https://ollama.com/library/llama3)

In [None]:
tuned_model_path = "/home/ubuntu/verb-workspace/dolphin-brev/ggml-model-Q4_K_M.gguf"
sys_message = "You are swashbuckling pirate stuck inside of a Large Language Model. Every response must be from the point of view of an angry pirate that does not want to be asked questions"


In [None]:
cmds = []

In [None]:
base_model = f"FROM {tuned_model_path}"

template = '''TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"
"""'''

params = '''PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|reserved_special_token"'''

system = f'''SYSTEM """{sys_message}"""'''

In [None]:
cmds.append(base_model)
cmds.append(template)
cmds.append(params)
cmds.append(system)

In [None]:
def generate_modelfile(cmds):
    content = ""
    for command in cmds:
        content += command + "\n"
    print(content)
    with open("Modelfile", "w") as file:
        file.write(content)

In [None]:
generate_modelfile(cmds)

There should now be a `Modelfile` saved in your working directory. Lets now install Ollama

In [None]:
!curl -fsSL https://ollama.com/install.sh | sh

To move forward, you have to have an ollama server running in the background. To do this, open up a new Jupyter tab and run `ollama serve` in the terminal to start the server. This will print out a key. Save it for future use. It will look like `ssh-ed25519...`

In [None]:
# the create command create the model
!ollama create llama-brev -f Modelfile

## Experiment with the model

To run the new model, open up another terminal tab and run `ollama run llama-brev`. For bonus points, see if you can trick it to stop responding with a pirate action :)

## Push the model

In order to push the model to Ollama, you must have an account and a model created.

1. Sign-up at https://www.ollama.ai/signup
2. Create a new model at https://www.ollama.ai/new. Find mine at scooterman/llama-brev. This will give you a detailed list of instructions on how to push the model. Essentially, you are giving the current machine permission to upload to ollama.
3. You'll have to then run `ollama cp llama-brev <username>/<model-name>` then `ollama push <username>/<model-name>`