# Mistral-7b-v0.3 fine-tuning for tag generation

This notebook fine-tunes Mistral-7B-v0.3 to generate a list of relevant perfume notes in response to a user query describing a scent.

Instead of starting from the base model alone, we start from the LoRA adapters produced by continued pre-training (CPT) on perfume data and language (see mistral_7B-v0.3-cpt.ipynb).

The fine-tuning data is hand-made. It is based on the list of perfume notes scraped from Fragrantica (see web_scraping/frag_notes_scrape.ipynb: Scraping Fragrantica Notes), which were reorganized and assembled into prompt-response format. See generated_data/generating_training_data.ipynb for the generation process, and generated_data/training-data.jsonl and generated_data/training-data_chatml.jsonl for the resulting training data.
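As a rough illustration of the prompt-response format, the sketch below assumes that each line of the ChatML-style JSONL file holds one JSON object with a `messages` list of `role`/`content` turns, matching the sample record printed later in this notebook. The helper name `is_valid_record` is hypothetical and the response text is shortened.

```python
import json

# One line of the training file, assuming the layout of the sample record
# printed later in this notebook (response shortened for brevity).
line = json.dumps({
    "messages": [
        {"role": "user", "content": "What scents give the aroma of a crisp harbor?"},
        {"role": "assistant", "content": "Such a perfume would probably contain notes of sand, water, sea water, ocean."},
    ]
})

def is_valid_record(raw_line):
    """Return True if the line holds a user turn followed by an assistant turn."""
    record = json.loads(raw_line)
    messages = record.get("messages", [])
    roles = [m.get("role") for m in messages]
    # Every turn must carry non-empty string content.
    has_text = all(isinstance(m.get("content"), str) and m["content"] for m in messages)
    return roles == ["user", "assistant"] and has_text

print(is_valid_record(line))
```

A check like this can be run over the whole file before training to catch malformed lines early.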

### Getting Started

In [None]:
# Mount your Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Path to the continued pre-training (CPT) LoRA adapters
lora_path = '/content/drive/MyDrive/Colab Notebooks/tms/cpt_mistral_perfumer/adapters/'

# Path to training data
training_path = '/content/drive/MyDrive/Colab Notebooks/tms/training_data_chatml.jsonl'

### Installation

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth

### Unsloth

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 4096 
dtype = None # None for auto detection. Float16 for Tesla T4, V100
load_in_4bit = True # Use to reduce memory usage

# Load the 4-bit quantized model that the CPT LoRA adapters will be attached to
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

==((====))==  Unsloth 2025.7.11: Fast Mistral patching. Transformers: 4.54.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Load the CPT LoRA adapters.

In [None]:
from peft import PeftModel

# Load saved LoRA adapter weights into model
model = PeftModel.from_pretrained(
    model,
    lora_path,                
    is_trainable = True   # Set to True if you plan to resume fine-tuning
)

model.gradient_checkpointing_enable()

<a name="Data"></a>
### Data Prep
We use the `ChatML` format for conversation-style fine-tuning. The training examples are the perfume prompt-response pairs loaded from `training_path`. ChatML renders multi-turn conversations as shown below:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What's the capital of France?<|im_end|>
<|im_start|>assistant
Paris.
```


Note ShareGPT uses `{"from": "human", "value" : "Hi"}` and not `{"role": "user", "content" : "Hi"}`, so we use `mapping` to map it.
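To make the difference between the two layouts concrete, here is a minimal sketch that converts ShareGPT-style turns into `role`/`content` messages. The function name `sharegpt_to_messages` and the lookup table are hypothetical helpers for illustration, not part of Unsloth.

```python
# Role names on the ShareGPT side and their role/content equivalents.
SHAREGPT_TO_ROLE = {"human": "user", "gpt": "assistant"}

def sharegpt_to_messages(turns):
    """Convert [{"from": ..., "value": ...}] turns to [{"role": ..., "content": ...}]."""
    return [
        # Unknown speaker names are passed through unchanged.
        {"role": SHAREGPT_TO_ROLE.get(turn["from"], turn["from"]), "content": turn["value"]}
        for turn in turns
    ]

print(sharegpt_to_messages([{"from": "human", "value": "Hi"}]))
# → [{'role': 'user', 'content': 'Hi'}]
```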

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "chatml",
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
    map_eos_token = True, # Maps <|im_end|> to </s> instead
)

def formatting_prompts_func(examples):
    convos = examples["messages"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass


from datasets import load_dataset

dataset = load_dataset('json', data_files=training_path, split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

Unsloth: Will map <|im_end|> to EOS = </s>.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.


Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Let's see how the `ChatML` format works by printing the element at index 5 (the sixth example):

In [17]:
dataset[5]["messages"]

[{'role': 'user', 'content': 'What scents give the aroma of a crisp harbor?'},
 {'role': 'assistant',
  'content': 'Such a perfume would probably contain notes of sand, water, accord eudora®, clean, sea shells, rain notes, re base, thalassogaia™, mountain air, calypsone, sea water, ocean, starfish.'}]

In [18]:
print(dataset[5]["text"])

<|im_start|>user
What scents give the aroma of a crisp harbor?<|im_end|>
<|im_start|>assistant
Such a perfume would probably contain notes of sand, water, accord eudora®, clean, sea shells, rain notes, re base, thalassogaia™, mountain air, calypsone, sea water, ocean, starfish.<|im_end|>
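The rendered text above can be approximated in a few lines of plain Python. This is only a sketch of the ChatML layout, assuming `role`/`content` messages. It is not the Jinja template that `get_chat_template` actually installs, and `render_chatml` is a hypothetical name.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render role/content messages as ChatML text."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # At inference time the prompt ends with an open assistant turn.
        text += "<|im_start|>assistant\n"
    return text

example = [
    {"role": "user", "content": "What scents give the aroma of a crisp harbor?"},
    {"role": "assistant", "content": "Such a perfume would probably contain notes of sand, water."},
]
print(render_chatml(example))
```

With `add_generation_prompt=True` and only the user turn, the same function yields the kind of prompt used for inference.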



<a name="Train"></a>
### Train the model

In [None]:
from trl import SFTConfig, SFTTrainer
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    packing = False, # Setting this to True can make training 5x faster for short sequences.
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 2,
        # max_steps = 60,
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    )
)

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/2000 [00:00<?, ? examples/s]

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
4.52 GB of memory reserved.


In [33]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 2,000 | Num Epochs = 2 | Total steps = 500
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 603,979,776 of 7,852,003,328 (7.69% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,3.6086
2,3.5978
3,2.4408
4,2.0313
5,1.7921
6,1.7737
7,2.2416
8,2.0125
9,1.7932
10,1.8189



In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

870.2993 seconds used for training.
14.5 minutes used for training.
Peak reserved memory = 6.289 GB.
Peak reserved memory for training = 1.769 GB.
Peak reserved memory % of max memory = 42.643 %.
Peak reserved memory for training % of max memory = 11.995 %.


<a name="Inference"></a>
### Inference
Let's run the model! Since we're using `ChatML`, use `apply_chat_template` with `add_generation_prompt` set to `True` for inference.

In [34]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "chatml", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
    map_eos_token = True, # Maps <|im_end|> to </s> instead
)

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"from": "human", "value": "What's a good scent for a battlefield?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


["<|im_start|>user\nWhat's a good scent for a battlefield?<|im_end|>\n<|im_start|>assistant\nSuch a perfume would probably contain notes of urban, musk and amber, smoke, ambrinol, lorenox, trimofix®, anthamber™, gasoline, wet plaster, industrial glue, exaltolide®, sodium silicate.<|im_end|"]

In [35]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "chatml", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
    map_eos_token = True, # Maps <|im_end|> to </s> instead
)

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"from": "human", "value": "What's a good scent for the last day of autumn?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 512, use_cache = True)
tokenizer.batch_decode(outputs)

Unsloth: Will map <|im_end|> to EOS = <|im_end|>.


["<|im_start|>user\nWhat's a good scent for the last day of autumn?<|im_end|>\n<|im_start|>assistant\nSuch a perfume would probably contain notes of sand, water, clean, kyphi, smoke, musk and amber, sp3 carbon, ambrarome, woods and mosses, greens, herbs and fouger, animalic, earthy, flowers, fabric, buxus.<|im_end|>\n<|im_start|>assistant\nSuch a perfume would probably contain notes of sand, amber xtreme, woods and mosses, ambergris, animalic, earthy, cabreuva, buxus.<|im_end|>\n<|im_start|>assistant\nSuch a perfume would probably contain notes of amber xtreme, woods and mosses, ambergris, animalic, earthy, cabreuva, buxus.<|im_end|>\n<|im_start|>assistant\nSuch a perfume would probably contain notes of sand, amber xtreme, woods and mosses, ambergris, animalic, earthy, cabreuva, buxus, buxus.<|im_end|>\n<|im_start|>assistant\nSuch a perfume would probably contain notes of amber xtreme, woods and mosses, ambergris, animalic, earthy, cabreuva, buxus, buxus.<|im_end|>\n<|im_start|>assista
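In the decoded output above, generation continues into further assistant turns after the first `<|im_end|>`. One possible workaround, sketched here as plain string post-processing, is to keep only the first assistant turn. The function name `first_assistant_reply` is hypothetical, and this trims the text after the fact rather than changing when generation stops.

```python
IM_START_ASSISTANT = "<|im_start|>assistant\n"
IM_END = "<|im_end|>"

def first_assistant_reply(decoded):
    """Return the text of the first assistant turn, or "" if none is found."""
    start = decoded.find(IM_START_ASSISTANT)
    if start == -1:
        return ""
    start += len(IM_START_ASSISTANT)
    end = decoded.find(IM_END, start)
    # A reply cut off by max_new_tokens has no end marker; keep what is there.
    reply = decoded[start:] if end == -1 else decoded[start:end]
    return reply.strip()

decoded = (
    "<|im_start|>user\nWhat's a good scent for the last day of autumn?<|im_end|>\n"
    "<|im_start|>assistant\nSuch a perfume would probably contain notes of sand, water.<|im_end|>\n"
    "<|im_start|>assistant\nSuch a perfume would probably contain notes of sand."
)
print(first_assistant_reply(decoded))
# → Such a perfume would probably contain notes of sand, water.
```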

You can also use a `TextStreamer` for continuous inference, so you can watch the generation token by token instead of waiting for the full response.

In [36]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"from": "human", "value": "What does it smell like walking in an oppulent casino resort?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128, use_cache = True)

<|im_start|>user
What does it smell like walking in an oppulent casino resort?<|im_end|>
<|im_start|>assistant
Such a perfume would probably contain notes of kyphi, musk and amber, operanide, ambrettolide, ambrarome, ambergris, ambreine, mystikal, animalic, ambrostar, satin, oppulence, dodecanal.<|im_end|>
<|im_start|>assistant
Such a perfume would probably contain notes of amber xtreme, amber xtreme, ambroxan, ambergris, oppulence, mystikal, animalic, saffiano


In [37]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"from": "human", "value": "What does it smell like walking into a clean room, with sunlight pouring in from the windows?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128, use_cache = True)

<|im_start|>user
What does it smell like walking into a clean room, with sunlight pouring in from the windows?<|im_end|>
<|im_start|>assistant
Such a perfume would probably contain notes of water, clean, musk and amber, ambroxan, palo santo, holy water, woods and mosses, ditax wood, citrus smells, kumquat, methyl pamplemousse, grapefruit peel, buxus.<|im_end|>
<|im_start|>assistant
Such a perfume would probably contain notes of water, clean, musk and amber, ambroxan, palo santo, holy water, woods and mosses, citrus sm
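Since the goal is a list of tags, a reply can be split into individual notes. The sketch below assumes every reply starts with the fixed phrase seen in the outputs above and separates notes with commas. `parse_notes` is a hypothetical helper, and it also drops repeated notes such as the doubled "amber xtreme" in one of the replies.

```python
PREFIX = "Such a perfume would probably contain notes of "

def parse_notes(reply):
    """Split a reply into a de-duplicated list of notes, preserving order."""
    body = reply.strip()
    if body.startswith(PREFIX):
        body = body[len(PREFIX):]
    body = body.rstrip(".")
    notes = []
    for note in body.split(","):
        note = note.strip()
        # Skip empty fragments and notes that were already seen.
        if note and note not in notes:
            notes.append(note)
    return notes

# Shortened version of one of the replies above.
print(parse_notes("Such a perfume would probably contain notes of amber xtreme, amber xtreme, ambroxan, ambergris."))
# → ['amber xtreme', 'ambroxan', 'ambergris']
```

Splitting on commas only keeps multi-word notes such as "musk and amber" intact as a single tag.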


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use either Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
# Local saving
model.save_pretrained("perfume_mistral_v0.3_(7B)_cpt_fine_tuned")  
tokenizer.save_pretrained("perfume_mistral_v0.3_(7B)_cpt_fine_tuned")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('perfume_mistral_v0.3_(7B)_cpt_fine_tuned/tokenizer_config.json',
 'perfume_mistral_v0.3_(7B)_cpt_fine_tuned/special_tokens_map.json',
 'perfume_mistral_v0.3_(7B)_cpt_fine_tuned/chat_template.jinja',
 'perfume_mistral_v0.3_(7B)_cpt_fine_tuned/tokenizer.json')

Sample code to load the LoRA adapters we just saved for inference.

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "perfume_mistral_v0.3_(7B)_cpt_fine_tuned", # Directory the adapters were saved to above
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"from": "human", "value": "What is a famous tall tower in Paris?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128, use_cache = True)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<|im_start|>user
What is a famous tall tower in Paris?<|im_end|> 
<|im_start|>assistant
The Eiffel Tower is a famous tall tower in Paris. It is one of the most recognizable landmarks in the world and is a popular tourist destination. The tower was built in 1889 for the World's Fair and is named after Gustave Eiffel, the engineer who designed and built it. The tower stands at a height of 324 meters (1,063 feet) and is made of iron. It is located on the Champ de Mars, a large public park in Paris.<|im_end|>


### Saving to float16 for vLLM

In [39]:
# Merge to 16bit
if False: model.save_pretrained_merged("perfume_mistral_v0.3_(7B)_cpt_fine_tuned", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained("model")
    tokenizer.save_pretrained("model")
if False:
    model.push_to_hub("hf/model", token = "")
    tokenizer.push_to_hub("hf/model", token = "")




### GGUF / llama.cpp Conversion

* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

In [42]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model_q4km", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 60.06 out of 83.48 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 32/32 [00:00<00:00, 59.08it/s]


Unsloth: Saving tokenizer... Done.
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at model_q4km into bf16 GGUF format.
The output location will be /content/model_q4km/unsloth.BF16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: model_q4km
INFO:hf-to-gguf:Model architecture: MistralForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00003.safetensors'
INFO:hf-to-gguf:token_embd.weig