# Conversation models finetuning

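In this notebook we will finetune an instruction/chat model to imitate a persona. The original version targeted Paul Graham (https://www.paulgraham.com/); the cells below load a dataset of Bart Simpson and Homer Simpson conversations instead.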

Adapted from the Unsloth notebooks. If something is broken, check:
https://unsloth.ai/

### Installation

In [1]:
%%capture
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton
!pip install --no-deps cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
!pip install --no-deps unsloth
!pip install -U datasets
!pip install fsspec==2023.9.2

### Load base model

In [2]:
from unsloth import FastLanguageModel
import torch
from google.colab import userdata

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
    #token=userdata.get('HF_ACCESS_TOKEN')
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.7.8: Fast Llama patching. Transformers: 4.53.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [3]:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

formatted_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(formatted_text)


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
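To make the printed format easier to read, here is a simplified, hand-written stand-in for what `apply_chat_template` produces for Llama 3.1. The function `render_llama3_chat` is hypothetical and is not part of any library. It also leaves out the "Cutting Knowledge Date" and "Today Date" lines that the real template injects into the system turn, so treat it as a sketch of the turn structure only.

```python
def render_llama3_chat(messages, add_generation_prompt=True):
    """Simplified stand-in for the Llama 3.1 chat template (illustration only)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header cues the model to write the next turn.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
print(render_llama3_chat(messages))
```

For training data we want complete turns, so `add_generation_prompt` is set to `False` there. For inference it must be `True`, otherwise the model has no open assistant turn to continue.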




### Add LoRA to base model and patch with Unsloth

In [4]:
# More info about parameters: https://huggingface.co/docs/peft/v0.11.0/en/package_reference/lora#peft.LoraConfig

target_modules =  ["q_proj", "k_proj", "v_proj", "o_proj",
                   "gate_proj", "up_proj", "down_proj"]

# Set to True only when you have added new special tokens to the tokenizer
train_embeddings = False

if train_embeddings:
  # Training both "lm_head" and "embed_tokens" runs out of memory on Colab:
  # target_modules = target_modules + ["lm_head", "embed_tokens"]
  # so on Colab, if you added new tokens, train only "lm_head" instead
  target_modules = target_modules + ["lm_head"]


model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # rank of the LoRA matrices; the LoRA paper reports little quality loss at relatively low ranks
    target_modules = target_modules, # modules of the LLM that get LoRA adapters
    lora_alpha = 16,  # scaling factor for the adapters (higher = more influence on the base model); 16 was recommended on Reddit
    lora_dropout = 0, # a tutorial used 0.05, but Unsloth says 0 is optimized
    bias = "none",   # "none" is optimized
    use_gradient_checkpointing = "unsloth", # "unsloth" reduces VRAM use, helpful for very long context
    random_state = 3407,
    use_rslora = False, # rank-stabilized LoRA: scales adapters by lora_alpha/sqrt(r) instead of lora_alpha/r; Hugging Face says this works better
    loftq_config = None, # LoftQ initialization is not used
)

Unsloth 2025.7.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
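As a sanity check on what `r = 16` costs, the number of trainable LoRA parameters can be worked out by hand. Each adapted linear layer gets two small matrices, A (r x in_features) and B (out_features x r). The layer dimensions below are assumptions taken from the public Llama 3.1 8B config, not from this notebook, so treat this as a sketch.

```python
# Assumed Llama 3.1 8B dimensions (from the public model config, not this notebook).
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size (MLP width)
KV_DIM = 1024          # 8 KV heads x 128 head_dim (grouped-query attention)
NUM_LAYERS = 32
RANK = 16              # r passed to get_peft_model

# (in_features, out_features) of each targeted linear layer
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    # A has rank * in_features entries, B has out_features * rank entries.
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 41,943,040
```

Under these assumptions the result matches the "Trainable parameters = 41,943,040" line that the trainer prints later in this notebook.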


### Create dataset

In [5]:
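def formatting_prompts_func(examples):
    # Render each conversation (a list of {"role", "content"} dicts) into one training string.
    # The chat template already closes every turn with <|eot_id|>, so no EOS token is appended here.
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }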

from datasets import load_dataset
dataset = load_dataset("OscarIsmael47/Bart_Simpson_and_Homer_Simpson_conversations", split = "train")
#dataset = load_dataset("pookie3000/pg_chat", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

README.md:   0%|          | 0.00/349 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/206k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2482 [00:00<?, ? examples/s]

Map:   0%|          | 0/2482 [00:00<?, ? examples/s]
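With `batched = True`, `dataset.map` passes the function a dict that maps each column name to a list of values, and expects a dict of new columns back. The sketch below shows that shape in plain Python. `fake_template` is a hypothetical stand-in for `tokenizer.apply_chat_template`, and the two conversations are made up for illustration.

```python
def fake_template(convo):
    # Hypothetical stand-in for tokenizer.apply_chat_template(convo, tokenize=False).
    return "".join(f"[{turn['role']}] {turn['content']}\n" for turn in convo)


def format_batch(examples):
    # `examples` maps each column name to a list with one entry per row in the batch.
    return {"text": [fake_template(convo) for convo in examples["conversations"]]}


# Made-up batch with two rows, each a list of role/content turns.
batch = {
    "conversations": [
        [
            {"role": "user", "content": "Hi Dad."},
            {"role": "assistant", "content": "Hello, son."},
        ],
        [
            {"role": "user", "content": "Can I have a dollar?"},
            {"role": "assistant", "content": "No."},
        ],
    ]
}

result = format_batch(batch)
print(result["text"][0])
```

The returned list must have one entry per input row, which is why the function builds `text` with a list comprehension over the whole batch.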

In [6]:
for i, sample in enumerate(dataset):
    print(f"\n------ Sample {i + 1} ----")
    print(sample["text"])
    if i > 2:
      break



------ Sample 1 ----
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

Whoa, somebody was bound to say it one day. I just can't believe it was her.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Did you hear that, Marge? She called me a baboon! The stupidest, ugliest, smelliest ape of them all!<|eot_id|>

------ Sample 2 ----
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

Ah, Dad, if just me, Milhouse and Lewis had voted...<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey son, would you have gotten any money for being class president?<|eot_id|>

------ Sample 3 ----
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id

### Train the model


In [7]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 1,
        #max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Tokenizing ["text"]:   0%|          | 0/2482 [00:00<?, ? examples/s]

In [8]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 2,482 | Num Epochs = 1 | Total steps = 311
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)


Step,Training Loss
1,5.153
2,4.7732
3,4.5608
4,3.9438
5,3.328
6,3.0594
7,2.6708
8,2.3774
9,2.0207
10,1.6566


Unsloth: Will smartly offload gradients to save VRAM!
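The "Total steps = 311" figure in the banner follows from the dataset size and the batch settings. A minimal sketch of the arithmetic, assuming the last partial batch still counts as one optimizer step (which is what the log suggests for this run):

```python
import math

num_examples = 2482               # "Num examples" in the banner
per_device_batch_size = 2         # per_device_train_batch_size
gradient_accumulation_steps = 4
num_gpus = 1
num_epochs = 1

# Examples consumed per optimizer step.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus

# Round up so the final, smaller batch is counted as a step.
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs

print(effective_batch_size, total_steps)  # 8 311
```

Raising `gradient_accumulation_steps` increases the effective batch size without using more VRAM, at the cost of fewer optimizer steps per epoch.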


### Inference


In [17]:
FastLanguageModel.for_inference(model)

messages = [
    {"role": "user", "content": "Who is Marge?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128, use_cache = True)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

Who is Marge?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Marge is your mother.<|eot_id|>
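The streamed text includes the whole prompt as well as the reply. If we only want the reply as a string, one option is to cut the decoded transcript at the last assistant header. The helper `extract_reply` below is a hypothetical convenience function, not part of Unsloth or Transformers, and it assumes the Llama 3 special tokens shown in the output above.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"


def extract_reply(decoded_text):
    """Return the text of the last assistant turn in a decoded Llama 3 transcript."""
    # Everything after the last assistant header is the generated reply.
    _, _, tail = decoded_text.rpartition(ASSISTANT_HEADER)
    # Cut at the end-of-turn marker if generation finished normally.
    reply, _, _ = tail.partition(END_OF_TURN)
    return reply.strip()


decoded = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Who is Marge?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    "Marge is your mother.<|eot_id|>"
)
print(extract_reply(decoded))  # Marge is your mother.
```

If generation stops at `max_new_tokens` before an `<|eot_id|>` is produced, the function simply returns everything after the assistant header.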


### Save lora adapter

This is useful both for inference and for loading the finetuned model again later.

In [None]:
model.push_to_hub(
    "OscarIsmael47/Meta-Llama-3.1-8B-Instruct-Bart_Simpson_and_Homer_Simpson-LORA",
    tokenizer,
    token = 'HUGGINGFACE_TOKEN' # placeholder: replace with your own Hugging Face write token
)

README.md:   0%|          | 0.00/612 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Saved model to https://huggingface.co/OscarIsmael47/Meta-Llama-3.1-8B-Instruct-Bart_Simpson_and_Homer_Simpson-LORA


### Merge model with LoRA weights and save to GGUF

You can then run inference locally with Ollama or llama.cpp.

##### Popular quantization methods

- **q4_k_m**  
  4-bit quantization. Low memory. Many default model tags in the Ollama library use this quantization.
- **q8_0**  
  8-bit quantization. Medium memory.
- **f16**  
  16-bit floating point. Many models are already stored in 16 bit, so no quantization happens.
- **not_quantized**  
  Often the same as f16.
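To get a feel for what these options mean on disk, file size can be estimated as parameters times bits per weight. The bits-per-weight values below are approximations and should be read as assumptions: q8_0 stores a 16-bit scale per block of 32 weights (about 8.5 bits), and q4_k_m mixes block types, so 4.85 is only a rough average. The parameter count is likewise an assumed round figure for an 8B Llama 3.1 model.

```python
# Approximate bits per weight for each method (assumed values, see the text above).
BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": 8.5, "q4_k_m": 4.85}

NUM_PARAMS = 8.03e9  # assumed parameter count for an 8B Llama 3.1 model


def estimate_size_gb(num_params, bits_per_weight):
    # bits -> bytes -> gigabytes (10^9 bytes); ignores metadata and tokenizer data.
    return num_params * bits_per_weight / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{estimate_size_gb(NUM_PARAMS, bits):.1f} GB")
```

The q4_k_m estimate comes out a little under 5 GB, which is in line with the size of the `unsloth.Q4_K_M.gguf` upload shown in the log of the next cell.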

In [None]:
model.push_to_hub_gguf(
    "Meta-Llama-3.1-8B-q4_k_m-Bart_Simpson_and_Homer_Simpson-GGUF",
    tokenizer,
    quantization_method = "q4_k_m",
    token = 'HUGGINGFACE_TOKEN' # placeholder: replace with your own Hugging Face write token
)
# local
#model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 5.7G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.52 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


 50%|█████     | 16/32 [00:01<00:01, 13.56it/s]
We will save to Disk and not RAM now.
100%|██████████| 32/32 [03:29<00:00,  6.54s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving Meta-Llama-3.1-8B-q4_k_m-Bart_Simpson_and_Homer_Simpson-GGUF/pytorch_model-00001-of-00004.bin...
Unsloth: Saving Meta-Llama-3.1-8B-q4_k_m-Bart_Simpson_and_Homer_Simpson-GGUF/pytorch_model-00002-of-00004.bin...
Unsloth: Saving Meta-Llama-3.1-8B-q4_k_m-Bart_Simpson_and_Homer_Simpson-GGUF/pytorch_model-00003-of-00004.bin...
Unsloth: Saving Meta-Llama-3.1-8B-q4_k_m-Bart_Simpson_and_Homer_Simpson-GGUF/pytorch_model-00004-of-00004.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at Meta-Llama-3.1-8B-q4_k_m-Bart_Simpson_and_Homer_Simpson-GGUF into f16 GGUF format.
The output location will be /content/Meta-Llama-3.1-8B-q4_k_m-Bart_Simpson_and_Homer_Simpson-GGUF/unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: Meta-Llama-3.1-8B-q4_k_m-Bart_Simpson_and_Homer_Simpson-GGUF
INFO:hf-to-gguf:Model architecture: LlamaForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q4_K_M.gguf:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/OscarIsmael47/Meta-Llama-3.1-8B-q4_k_m-Bart_Simpson_and_Homer_Simpson-GGUF


In [None]:
!curl -fsSL https://ollama.com/install.sh | sh

In [29]:
!pip install llama-cpp-python

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.3.14.tar.gz (51.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51.0/51.0 MB 8.9 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 4.0 MB/s eta 0:00:00
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.3.14-cp311-cp311-linux_x86_64.whl size=4299383 sha256=8648ae8c712e9aba929

In [2]:
from llama_cpp import Llama

llm = Llama(
      model_path="Meta-Llama-3.1-8B-q4_k_m-Bart_Simpson_and_Homer_Simpson-GGUF/unsloth.Q4_K_M.gguf",
)

llama_model_loader: loaded meta data with 31 key-value pairs and 292 tensors from Meta-Llama-3.1-8B-q4_k_m-Bart_Simpson_and_Homer_Simpson-GGUF/unsloth.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Q4_K_M Bart_Simpson...
llama_model_loader: - kv   3:                           general.finetune str              = Bart_Simpson_and_Homer_Simpson-GGUF
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                        

In [9]:
response = llm.create_chat_completion(
      messages = [
        {"role": "user", "content": "Who is Marge?"},
      ]
)

print(response)
print("-----------------")
print(response["choices"][0]["message"]["content"])

Llama.generate: 39 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print:        load time =    2424.10 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =    2226.58 ms /    10 runs   (  222.66 ms per token,     4.49 tokens per second)
llama_perf_context_print:       total time =    2237.71 ms /    11 tokens


{'id': 'chatcmpl-03c277c1-ac71-42a8-bc58-5506c1a62001', 'object': 'chat.completion', 'created': 1753299115, 'model': 'Meta-Llama-3.1-8B-q4_k_m-Bart_Simpson_and_Homer_Simpson-GGUF/unsloth.Q4_K_M.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': 'Marge is the mother of your friends.'}, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 40, 'completion_tokens': 9, 'total_tokens': 49}}
-----------------
Marge is the mother of your friends.


'Marge is my wife.'

### Load model and saved LoRA adapters

Use this if you want to continue finetuning or run inference with the model in safetensors format.

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "OscarIsmael47/Meta-Llama-3.1-8B-Instruct-Bart_Simpson_and_Homer_Simpson-LORA",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
    token="HUGGINGFACE_TOKEN" # placeholder: replace with your own Hugging Face token
)

FastLanguageModel.for_inference(model)

messages = [
    {"role": "user", "content": "What is your job?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128, use_cache = True)

==((====))==  Unsloth 2025.7.8: Fast Llama patching. Transformers: 4.53.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

What is your job?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I am a... I am a... I am a... I am a... I am a...<|eot_id|>


# Run GGUF on Ollama

Create Modelfile

Put the correct template in it; which one to use depends on whether the model is a chat or a completion model. Tip: look on the Ollama website for templates.
Ollama Modelfile docs: https://github.com/ollama/ollama/blob/main/docs/modelfile.md

Meta llama3 template: https://ollama.com/library/llama3/blobs/8ab4849b038c


Create model

ollama create my-model -f Modelfile
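As a starting point, the Modelfile can be generated from Python. This is a sketch under assumptions: the template text below is believed to correspond to the Meta llama3 template linked above, so verify it against that link before relying on it, and the GGUF path is a hypothetical location for the file produced earlier. `build_modelfile` is a helper written for this example, not an Ollama API.

```python
# Template believed to match the Llama 3 template linked above (verify before use).
LLAMA3_TEMPLATE = '''{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>'''


def build_modelfile(gguf_path):
    """Return Modelfile text that points Ollama at a local GGUF file."""
    lines = [
        f"FROM {gguf_path}",
        f'TEMPLATE """{LLAMA3_TEMPLATE}"""',
        # Stop tokens keep the model from running on into the next turn.
        'PARAMETER stop "<|start_header_id|>"',
        'PARAMETER stop "<|end_header_id|>"',
        'PARAMETER stop "<|eot_id|>"',
    ]
    return "\n".join(lines) + "\n"


# Hypothetical path: adjust to wherever the GGUF file was saved or downloaded.
modelfile_text = build_modelfile("./unsloth.Q4_K_M.gguf")
print(modelfile_text)
```

Save the printed text to a file named Modelfile next to the GGUF file, then run the `ollama create` command above followed by `ollama run my-model`.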