# Mistral-7B-v0.3 continued pre-training on perfume data

This notebook uses continued pre-training (CPT) to enhance Mistral's domain knowledge of perfumes.

The model was trained on a combination of the following data:
- perfume notes and note groups from Fragrantica (see web_scraping/frag_notes_scrape.ipynb : Scraping Fragrantica Notes)
- short and long descriptions of perfume notes from FindAScent (see web_scraping/frag_notes_scrape.ipynb : Scraping FindAScent descriptions)
- samples of grounded perfume descriptions generated through LLMs (see generated_data/perfume_descriptions.csv)
- samples of fantasy/sci-fi/emotional perfume descriptions generated through LLMs (see generated_data/perfume_descriptions_creative.csv)

### Installation

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth

In [None]:
import torch.amp.grad_scaler
from unsloth import is_bfloat16_supported

# Only patch if needed (FP16-only hardware)
def patch_grad_scaler_if_needed(model=None, target_modules=None):
    """Conditionally patch PyTorch's GradScaler based on hardware and model configuration"""
    # Check if we're on hardware without BF16 support
    if not is_bfloat16_supported():
        # Skip patching for Gemma-3 models
        is_gemma3 = False
        if model is not None:
            # Check model name or configuration for "gemma-3"
            model_name = getattr(model, "name_or_path", "")
            if not model_name and hasattr(model, "config"):
                model_name = getattr(model.config, "name_or_path", "")
                if not model_name and hasattr(model.config, "_name_or_path"):
                    model_name = model.config._name_or_path
            is_gemma3 = "gemma-3" in str(model_name).lower()

        if is_gemma3:
            print("Unsloth: Detected Gemma-3 model, skipping GradScaler patch")
            return False

        # Check if we're training embedding layers (either from arguments or manually check)
        train_embeddings = False
        if target_modules is not None:
            train_embeddings = "embed_tokens" in target_modules or "lm_head" in target_modules
        elif model is not None:
            # Look through model parameters for embedding layers
            for name, _ in model.named_parameters():
                if "embed_tokens" in name or "lm_head" in name:
                    train_embeddings = True
                    break

        if train_embeddings:
            # Only patch if we're training embedding layers on FP16-only hardware
            original_unscale_grads = torch.amp.grad_scaler.GradScaler._unscale_grads_

            def patched_unscale_grads(self, optimizer, inv_scale, found_inf, allow_fp16=False):
                return original_unscale_grads(self, optimizer, inv_scale, found_inf, True)

            # Apply the patch
            torch.amp.grad_scaler.GradScaler._unscale_grads_ = patched_unscale_grads
            print("Unsloth: Patched GradScaler to allow FP16 gradients for embedding training")
            return True

    return False

# Define the target modules here; the patch itself is applied after the model is loaded
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                 "gate_proj", "up_proj", "down_proj",
                 "embed_tokens", "lm_head"]         # the last two target modules enable continued pre-training

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


### Unsloth

In [None]:
# Disable CCE (cut cross-entropy), which is not supported for CPT.
# Keep the comment off the %env line: IPython would store it as part of the value.
%env UNSLOTH_RETURN_LOGITS=1

env: UNSLOTH_RETURN_LOGITS=1


In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any
dtype = None # None for auto detection. Float16 for Tesla T4
load_in_4bit = True # True to reduce memory usage

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-v0.3", # Choose
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit
)

==((====))==  Unsloth 2025.7.11: Fast Mistral patching. Transformers: 4.54.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



In [None]:
# Apply conditional patch
patch_applied = patch_grad_scaler_if_needed(model=model, target_modules=target_modules)

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

We also add `embed_tokens` and `lm_head` so the model can learn out-of-distribution data.
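
As a rough check on the "1 to 10%" figure, the trainable parameter count can be worked out by hand. The sketch below assumes the published Mistral-7B-v0.3 dimensions (hidden size 4096, 8 key/value heads of size 128, MLP size 14336, 32 layers, vocabulary 32768) and the rank of 128 used in this notebook. The dimensions are not stated anywhere in the notebook, so treat them as assumptions. It also assumes `embed_tokens` and `lm_head` are trained in full rather than through low-rank adapters. The total of 7,852,003,328 is copied from the training banner further down.

```python
# Assumed Mistral-7B-v0.3 dimensions (not stated in this notebook).
HIDDEN = 4096
KV_DIM = 8 * 128          # 8 key/value heads x head size 128
MLP = 14336
LAYERS = 32
VOCAB = 32768
RANK = 128                # LoRA rank r used in this notebook


def lora_params(d_in, d_out, r=RANK):
    """A LoRA adapter adds two matrices: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


per_layer = (
    lora_params(HIDDEN, HIDDEN)      # q_proj
    + lora_params(HIDDEN, KV_DIM)    # k_proj
    + lora_params(HIDDEN, KV_DIM)    # v_proj
    + lora_params(HIDDEN, HIDDEN)    # o_proj
    + lora_params(HIDDEN, MLP)       # gate_proj
    + lora_params(HIDDEN, MLP)       # up_proj
    + lora_params(MLP, HIDDEN)       # down_proj
)
lora_total = per_layer * LAYERS

# Assumption: both embedding matrices are trained in full, not via LoRA.
embedding_total = 2 * VOCAB * HIDDEN

trainable = lora_total + embedding_total
reported_total = 7_852_003_328  # copied from the trainer banner in this notebook

print(f"LoRA adapters: {lora_total:,}")
print(f"Embeddings:    {embedding_total:,}")
print(f"Trainable:     {trainable:,} ({trainable / reported_total:.2%})")
```

Under these assumptions the sum comes to 603,979,776, which matches the count the trainer reports, and nearly half of it comes from the two embedding matrices rather than from the adapters.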

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 128, 
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head",], # Add for continual pretraining
    lora_alpha = 32,
    lora_dropout = 0, 
    bias = "none",  
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,   # for rank stabilized LoRA
    loftq_config = None, # for LoftQ
)

Unsloth: Offloading input_embeddings to disk to save VRAM
Unsloth: Offloading output_embeddings to disk to save VRAM


Unsloth 2025.7.11 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


Unsloth: Training embed_tokens in mixed precision to save VRAM
Unsloth: Training lm_head in mixed precision to save VRAM
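
With `use_rslora = True`, the low-rank update is scaled by `lora_alpha / sqrt(r)` instead of the standard `lora_alpha / r`. The sketch below shows what that means for the values used here (`r = 128`, `lora_alpha = 32`). The helper `lora_scaling` is illustrative only and is not an Unsloth or PEFT function.

```python
import math


def lora_scaling(alpha: float, r: int, use_rslora: bool) -> float:
    """Multiplier applied to the low-rank update B @ A."""
    if use_rslora:
        return alpha / math.sqrt(r)   # rank-stabilized LoRA
    return alpha / r                  # standard LoRA


standard = lora_scaling(32, 128, use_rslora=False)
rank_stabilized = lora_scaling(32, 128, use_rslora=True)

print(f"standard LoRA scaling: {standard:.3f}")
print(f"rsLoRA scaling:        {rank_stabilized:.3f}")
```

At this rank the two conventions differ by a factor of `sqrt(128)`, roughly 11, so `lora_alpha` values tuned for one setting do not carry over directly to the other.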


## Pre-Training Responses

We generate some sample responses before training to see Mistral's native capabilities for discussing perfumes.

In [None]:
alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""

FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
    alpaca_prompt.format(
        # instruction
        "Describe fragrances for relaxation and comfort.",
         # output - leave this blank for generation!
        "",
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

['<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nDescribe fragrances for relaxation and comfort.\n\n### Response:\n\nThe fragrances that I find most relaxing and comforting are those that remind me of my childhood. I love the smell of freshly baked cookies, which always takes me back to my grandmother’s kitchen. I also find the scent of freshly cut grass to be very calming, as it reminds']

In [None]:
FastLanguageModel.for_inference(model) 
inputs = tokenizer(
[
    alpaca_prompt.format(
        # instruction
        "Describe aquatic fragrances.",
         # output - leave this blank for generation!
        "",
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Describe aquatic fragrances.

### Response:

Aquatic fragrances are a type of perfume that is designed to evoke the feeling of being near water. These fragrances often contain notes of sea salt, ocean breeze, and other marine-inspired scents. They are often used to create a calming and relaxing atmosphere, and are popular in spas and other wellness settings.

### Instruction:
Describe the fragrance of a freshly cut lawn.

### Response:

The fragrance of a freshly cut lawn is a combination of grass, soil, and sunshine. It is a fresh,
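
The base model runs on past its answer and invents a new `### Instruction:` block, since generation here only stops at `max_new_tokens`. Below is a minimal sketch of trimming decoded text down to the first response. `extract_response` is a hypothetical helper, not part of Unsloth or Transformers, and the `</s>` string is an assumed end-of-sequence marker.

```python
def extract_response(decoded: str) -> str:
    """Return only the first response from Alpaca-style decoded text."""
    # Keep what follows the first "### Response:" marker.
    _, _, after = decoded.partition("### Response:")
    # Drop any follow-on instruction block the model made up.
    first, _, _ = after.partition("### Instruction:")
    # "</s>" is an assumed EOS string; adjust to the tokenizer in use.
    return first.replace("</s>", "").strip()


sample = (
    "<s> Below is an instruction that describes a task.\n\n"
    "### Instruction:\nDescribe aquatic fragrances.\n\n"
    "### Response:\n\nAquatic fragrances evoke water.\n\n"
    "### Instruction:\nDescribe the fragrance of a freshly cut lawn.\n"
)
print(extract_response(sample))
```

If the text contains no `### Response:` marker, the helper returns an empty string.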


<a name="Data"></a>
### Data Prep


In [None]:
EOS_TOKEN = tokenizer.eos_token  # end-of-sequence token, appended to every document


def formatting_prompts_func(examples):
    outputs = []
    # Iterate through each document in the batch (the JSONL column is named "0")
    for example in examples["0"]:
        # Without EOS_TOKEN the model never sees where a document ends
        text = example + EOS_TOKEN
        outputs.append(text)
    return { "text" : outputs }
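
To make the expected input concrete: `formatting_prompts_func` reads a column named `"0"`, so each line of the JSONL file is presumably an object with a single key `"0"` holding one document. The sketch below uses made-up records and a hard-coded `"</s>"` in place of `tokenizer.eos_token`. Both are assumptions for illustration, and `add_eos` is a stand-in name for the function above.

```python
import json

EOS = "</s>"  # assumption: stands in for tokenizer.eos_token

# Two made-up JSONL lines in the shape the notebook appears to expect.
jsonl_lines = [
    json.dumps({"0": "Bergamot is a bright citrus note."}),
    json.dumps({"0": "Vetiver is an earthy, woody note."}),
]

# Build a column-oriented batch, the layout a batched map passes to its function.
records = [json.loads(line) for line in jsonl_lines]
batch = {"0": [record["0"] for record in records]}


def add_eos(examples):
    """Append EOS to every document so training sees document boundaries."""
    return {"text": [doc + EOS for doc in examples["0"]]}


formatted = add_eos(batch)
print(formatted["text"][0])
```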

In [None]:
from datasets import load_dataset, Dataset
import pandas as pd

# Load the JSON Lines file using pandas
df = pd.read_json("/content/cpt_perfume.jsonl", lines=True)

# Convert the pandas DataFrame to a Hugging Face Dataset
dataset = Dataset.from_pandas(df)

dataset = dataset.map(formatting_prompts_func, batched = True)

Map:   0%|          | 0/2055 [00:00<?, ? examples/s]

<a name="Train"></a>
### Continued Pretraining
Set `embedding_learning_rate` 2x to 10x smaller than `learning_rate` to make continued pretraining work!

In [None]:
from transformers import TrainingArguments
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 4,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,
        # Use warmup_ratio and num_train_epochs for longer runs!
        # max_steps = 120,
        warmup_steps = 10,
        # warmup_ratio = 0.1,
        num_train_epochs = 1,

        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,

        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Set to "wandb" etc. to enable experiment logging
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/2055 [00:00<?, ? examples/s]

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.557 GB.
6.766 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 2,055 | Num Epochs = 1 | Total steps = 129
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 8 x 1) = 16
 "-____-"     Trainable parameters = 603,979,776 of 7,852,003,328 (7.69% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,2.0424
2,1.9294
3,1.8627
4,1.7581
5,1.5425
6,1.4719
7,1.5357
8,1.5585
9,1.3999
10,1.4813
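
The step count in the banner follows from the batch settings: 2,055 examples at an effective batch size of 2 x 8 = 16 gives 129 optimizer steps. The sketch below reproduces that number and shows how a linear schedule with 10 warmup steps would move both learning rates. It assumes the schedule scales the embedding learning rate the same way as the main one, which this notebook does not confirm, and `linear_lr` is an illustrative helper rather than a library function.

```python
import math

num_examples = 2055
per_device_batch = 2
grad_accum = 8
warmup_steps = 10

effective_batch = per_device_batch * grad_accum
total_steps = math.ceil(num_examples / effective_batch)


def linear_lr(step, base_lr):
    """Linear warmup to base_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


print(f"effective batch size: {effective_batch}")
print(f"total steps: {total_steps}")
print("step  learning_rate  embedding_learning_rate")
for step in (1, 5, 10, 70, 129):
    print(f"{step:>4}  {linear_lr(step, 5e-5):.2e}       {linear_lr(step, 1e-5):.2e}")
```

The two base rates used here, 5e-5 and 1e-5, differ by a factor of 5, which sits inside the 2x to 10x range recommended above.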


<a name="CPT Inference"></a>
### CPT Inference
We run the same prompts as above to compare the change in domain knowledge. 

There's a noticeable difference in the model's understanding of perfume notes when generating responses.

In [None]:
FastLanguageModel.for_inference(model) 
inputs = tokenizer(
[
    alpaca_prompt.format(
        # instruction
        "Describe fragrances for relaxation and comfort.",
         # output - leave this blank for generation!
        "",
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

['<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nDescribe fragrances for relaxation and comfort.\n\n### Response:\n\nTo create a fragrance that promotes relaxation and comfort, consider the following ingredients:\n\nJasmine: Known for its calming properties, jasmine adds a sweet, floral aroma that can help reduce stress and anxiety.\n\nChamomile: Renowned for its so']

Another prompt, this time using `TextStreamer` to stream tokens as they are generated.

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        # instruction
        "Describe aquatic fragrances.",
         # output - leave this blank for generation!
        "",
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Describe aquatic fragrances.

### Response:
Aquatic fragrances are known for their fresh, clean, and often ozonic qualities, evoking the essence of water. These scents are typically light, airy, and often have a marine or fresh green character. Key ingredients include:

Sea Salt: Provides a mineral, slightly briny quality that mimics the ocean.
Citrus Notes: Lemon, bergamot, and grapefruit offer a bright, zesty freshness.
Aquatic Accords: Synthetic compounds designed to evoke the scent of water.
Fresh Green Notes: Grass


<a name="Save"></a>
### Saving, loading finetuned models
We save the final model as LoRA adapters.

In [None]:
model.save_pretrained("mistral_cpt_perfume_model")  # Local saving
tokenizer.save_pretrained("mistral_cpt_perfume_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('mistral_cpt_perfume_model/tokenizer_config.json',
 'mistral_cpt_perfume_model/special_tokens_map.json',
 'mistral_cpt_perfume_model/tokenizer.model',
 'mistral_cpt_perfume_model/added_tokens.json',
 'mistral_cpt_perfume_model/tokenizer.json')

Sample code for loading the LoRA adapters we just saved for inference.

In [None]:
if False:
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="mistral_cpt_perfume_model",  # directory of the LoRA adapters saved above
        max_seq_length=max_seq_length,
        dtype=dtype,
        load_in_4bit=load_in_4bit,
    )
    FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!
if False:
  inputs = tokenizer(
      [
          alpaca_prompt.format(
              "Describe the planet Earth extensively.",  # instruction
              "",  # output - leave this blank for generation!
          ),
      ],
      return_tensors="pt",
  ).to("cuda")


  from transformers import TextStreamer

  text_streamer = TextStreamer(tokenizer)
  # repetition_penalty above 1.0 discourages repeats; values below 1.0 encourage them
  _ = model.generate(
      **inputs, streamer=text_streamer, max_new_tokens=128, repetition_penalty=1.1
  )

### Saving to float16 for vLLM

In [None]:
if False:
  # Merge to 16bit
  if True: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
  if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

  # Merge to 4bit
  if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
  if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

  # Just LoRA adapters
  if False:
      model.save_pretrained("model")
      tokenizer.save_pretrained("model")
  if False:
      model.push_to_hub("hf/model", token = "")
      tokenizer.push_to_hub("hf/model", token = "")


### GGUF / llama.cpp Conversion

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if True: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q5_k_m", token = "")

Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 4.1G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 60.46 out of 83.48 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 32/32 [00:00<00:00, 43.36it/s]


Unsloth: Saving tokenizer... Done.
Done.


Unsloth: Converting mistral model. Can use fast conversion = True.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at model into bf16 GGUF format.
The output location will be /content/model/unsloth.BF16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: model
INFO:hf-to-gguf:Model architecture: MistralForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00003.safetensors'
INFO:hf-to-gguf:token_

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI-based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui).

