To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth?tab=readme-ov-file#-installation-instructions).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save) (e.g., for Llama.cpp).

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

Features in the notebook:
1. Uses Maxime Labonne's [FineTome 100K](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset.
2. Converts ShareGPT to HuggingFace format via `standardize_sharegpt`.
3. Trains on completions / assistant responses only via `train_on_responses_only`.
4. Unsloth now supports Torch 2.4, all TRL & Xformers versions & Python 3.12!

In [1]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
#!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git


# Unsloth reports up to 2x faster fine-tuning than a standard training setup, with up to 70% less memory usage.
# This makes it suitable for environments with constrained compute, such as Google Colab or low-end GPUs.

# Unsloth uses LoRA (Low-Rank Adaptation): the pre-trained weights stay frozen and only small low-rank adapter matrices are trained,
# so roughly 1-10% of the parameter count is updated instead of the entire model.
# This drastically reduces compute and memory requirements while often achieving comparable performance,
# and it lets a model adapt to domain-specific tasks with faster iterations.

# By supporting 4-bit quantization, Unsloth reduces memory usage during training and inference.
# Quantization lowers the precision of the stored weights, which cuts memory demands, usually with only a small loss in accuracy.
#     - The model weights are stored in a 4-bit representation and converted back to 16-bit on the fly for computation.
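
The idea behind 4-bit weight storage can be sketched in plain Python. This is a simplified illustration, not the scheme the library uses: it assumes uniform absmax quantization with one scale per block, and the weight values, `block_size`, and the 3-billion-parameter figure are made up for the example.

```python
def quantize_4bit(weights, block_size=4):
    """Uniform absmax quantization to signed integers in [-7, 7], one scale per block."""
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # One floating-point scale per block; fall back to 1.0 for an all-zero block.
        scale = max(abs(w) for w in block) / 7 or 1.0
        scales.append(scale)
        codes.extend(round(w / scale) for w in block)
    return codes, scales


def dequantize_4bit(codes, scales, block_size=4):
    """Map the integer codes back to approximate floating-point weights."""
    return [code * scales[i // block_size] for i, code in enumerate(codes)]


weights = [0.12, -0.53, 0.31, 0.07, 1.20, -0.88, 0.02, 0.45]  # made-up values
codes, scales = quantize_4bit(weights)
restored = dequantize_4bit(codes, scales)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print("largest rounding error:", max_error)

# Rough weight-only memory for a hypothetical 3-billion-parameter model.
params = 3e9
gib_16bit = params * 2 / 1024**3    # 2 bytes per weight
gib_4bit = params * 0.5 / 1024**3   # half a byte per weight, ignoring the per-block scales
print(f"16-bit: {gib_16bit:.2f} GiB, 4-bit: {gib_4bit:.2f} GiB")
```

Each weight is stored as a small integer plus a shared scale, so the rounding error per weight is at most half a scale step, while the weight storage shrinks by about 4x compared with 16-bit.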

In [10]:
from google.colab import drive
drive.mount('/content/drive')

output_dir = "/content/drive/My Drive/Colab Notebooks/outputs"

!ls "/content/drive/My Drive/Colab Notebooks"

Mounted at /content/drive
'Copy of CUDA_Colab_tutorial1.ipynb'   output_dir  'SVM algorithm steg,springa.ipynb'


In [None]:
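# Print the GPU status once (the -l flag repeats the query until interrupted, which stalls "Run all").
!nvidia-smi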

Wed Nov 27 13:57:42 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
* [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
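
The RoPE scaling mentioned above can be illustrated with linear position interpolation, the idea usually attributed to kaiokendev. This is a minimal sketch under stated assumptions: `rope_angles`, `head_dim`, `base`, and the two sequence lengths are hypothetical, and Unsloth's internal implementation is not shown in this notebook.

```python
def rope_angles(position, head_dim=8, base=10000.0, scale=1.0):
    """Rotation angles RoPE applies at one token position.

    scale < 1 squeezes position indices (linear position interpolation).
    """
    scaled_position = position * scale
    return [scaled_position / base ** (2 * i / head_dim) for i in range(head_dim // 2)]


trained_len, target_len = 2048, 8192   # hypothetical lengths, chosen for illustration
scale = trained_len / target_len       # 0.25

# Unscaled, position 8188 produces angles far outside the range covered in training.
unscaled = rope_angles(8188)
# Scaled, it lands exactly on the angles of position 2047, which training did cover.
scaled = rope_angles(8188, scale=scale)
print("unscaled first angle:", unscaled[0])
print("scaled first angle:  ", scaled[0])
```

Because every position is multiplied by the same factor, a longer sequence is mapped back into the position range the model already knows, at the cost of finer spacing between neighbouring tokens.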

In [2]:
# FROM LECTURE:
# Need to understand...
#    What does it mean to quantize the weights in the network?
#           - To reduce the precision of the weights from the common 16 or 32 bits to 4 bits.

#    What are some of the hyperparameters?
#           - LoRA settings (r, lora_alpha, lora_dropout) and training settings (learning rate, batch size,
#             gradient accumulation steps, warmup steps, max_steps), explained next to the cells that set them.

#    Why did you choose a specific model?
#           - The 3B model is a balance between computational efficiency and performance.
#           - Larger models (e.g., 70B) can provide better accuracy on complex tasks but require significantly more hardware resources.
#           - A 3B model is suitable for general-purpose tasks with constrained computational budgets.
#           - It is particularly effective for applications like instruction-following, summarization, and general-purpose reasoning tasks.

#    Properties of the model:

from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
                      # RoPE scaling stretches the rotary position embeddings so a pre-trained model can process sequences longer than it was originally trained on.
                      # Llama-2, Llama-3, and many other models use RoPE (Rotary Position Embeddings) to encode token positions.
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",     # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!

    "unsloth/Llama-3.2-1B-bnb-4bit",           # NEW! Llama 3.2 models.
                                               # The Meta Llama 3.2 collection of multilingual LLMs is a collection of pretrained and instruction-tuned generative models
                                               # in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text only models are optimized for multilingual
                                               # dialogue use cases, including agentic retrieval and summarization tasks.
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(   # This loads and configures pre-trained LLMs.
    model_name = "unsloth/Llama-3.2-3B-Instruct", # or choose "unsloth/Llama-3.2-1B-Instruct".
                                                        # Name or path of the pre-trained model to load,
                                                        # here a model identifier on the Hugging Face Hub.
    max_seq_length = max_seq_length,    # Maximum sequence length (number of tokens) the model processes in a single forward pass.
                                                        # Affects memory usage and context size:
                                                        # larger values allow more context (e.g., long documents) but require more VRAM.
                                                        # Llama 3.2 itself supports much longer contexts (up to 128K tokens per its model card); 2048 keeps memory use modest on a T4.
    dtype = dtype,   # Data type for the model's computations (e.g., FP16 or BF16); None auto-detects.
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.12.2: Fast Llama patching. Transformers:4.46.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/184 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128.
            # Rank of the low-rank decomposition used in LoRA.
            # Higher values give the adapters more capacity to fit the fine-tuning data, but need more memory and compute.

    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",   # Which modules receive LoRA adapters:
                      "gate_proj", "up_proj", "down_proj",],    # the attention projections (query, key, value, output) and the feed-forward projections (gate, up, down).
    lora_alpha = 16,  # Scaling factor for the LoRA update; the update is multiplied by lora_alpha / r (here 16 / 16 = 1).
                      # A higher value amplifies the LoRA adjustments, a lower value reduces their effect.
    lora_dropout = 0, # Supports any, but = 0 is optimized
                      # Dropout probability applied to the input of the LoRA layers during training to reduce overfitting.
                      # 0 uses the optimized path; non-zero values (e.g., 0.1) add regularization.
    bias = "none",    # Supports any, but = "none" is optimized
                      # Which bias terms are trained:
                      #   "none": no bias terms are trained (the optimized setting).
                      #   "all" or "lora_only": train all bias terms, or only those of the LoRA-adapted layers, at some cost in efficiency.
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context.
                      # A memory-saving technique that enables training on GPUs with limited memory.
                      #   True: saves GPU memory by recomputing intermediate activations during backpropagation instead of storing them.
                      #   "unsloth": Unsloth's optimized variant, reported to use 30% less VRAM and to fit larger batch sizes.
    random_state = 3407, # Seed for random operations (e.g., adapter initialization), so results are reproducible.
    use_rslora = False,  # We support rank stabilized LoRA
                      # Rank-Stabilized LoRA (rsLoRA) scales the update by lora_alpha / sqrt(r) instead of lora_alpha / r,
                      # which keeps the size of the update stable as the rank r grows and helps higher ranks train well.
    loftq_config = None, # And LoftQ
                      # Configuration for LoftQ, which initializes the LoRA adapters to compensate for the error introduced by quantizing the base weights.
                      # None disables it.
)

Unsloth 2024.12.2 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
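
To make the role of `r`, `lora_alpha`, and `use_rslora` concrete, here is a small pure-Python sketch of a LoRA forward pass for one linear layer. The matrices are tiny made-up examples, and the 3072 x 3072 layer size used for the parameter count is an assumption for illustration, not a value read from the model.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, alpha, r, use_rslora=False):
    """Frozen base output W x plus the scaled low-rank update B (A x)."""
    scale = alpha / math.sqrt(r) if use_rslora else alpha / r
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


def lora_param_count(d_in, d_out, r):
    """Trainable parameters added by one adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]       # frozen 2 x 3 weight
A = [[0.5, -0.5, 1.0], [1.0, 0.0, -1.0]]      # r x d_in with r = 2
B_zero = [[0.0, 0.0], [0.0, 0.0]]             # B starts at zero, so training starts from the base model
B_trained = [[0.2, 0.0], [-0.1, 0.1]]         # made-up values after some training
x = [1.0, 2.0, 3.0]

start = lora_forward(x, W, A, B_zero, alpha=16, r=2)
trained = lora_forward(x, W, A, B_trained, alpha=16, r=2)
trained_rs = lora_forward(x, W, A, B_trained, alpha=16, r=2, use_rslora=True)
print(start, trained, trained_rs)

added = lora_param_count(3072, 3072, r=16)    # hypothetical square layer
print(f"adapter adds {added} parameters, {added / (3072 * 3072):.2%} of the layer")
```

With `B` at zero the adapted layer reproduces the frozen layer exactly, and the only difference between standard LoRA and rsLoRA in this sketch is the scaling factor applied to the update.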


In [None]:
'''
How Gradient Checkpointing Works
Regular Training Process:

During the forward pass, the network computes and stores activations at each layer.
These activations are later used during the backward pass to compute gradients.
Storing all activations can consume a large amount of memory for deep networks.
With Gradient Checkpointing:

Instead of storing all activations, only some "checkpoint" activations are stored.
When backpropagation requires an intermediate activation, it is recomputed from the closest checkpoint and input.
This trades memory (less storage of activations) for computation (recomputing activations when needed).
'''

'''
When to Use rsLoRA?
Rank-stabilized LoRA is mainly useful in the following scenarios:
- When using higher ranks (large r), where the standard lora_alpha / r scaling shrinks the update and slows learning.
- When increasing r in standard LoRA does not improve results as expected.
- When comparing runs across different ranks without retuning lora_alpha each time.
'''
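
The trade described in the gradient checkpointing notes can be demonstrated on a toy chain of scalar layers. This is a minimal sketch, assuming a hypothetical layer `tanh(w * x)` and a checkpoint every 4 layers; it only counts stored activations and is not how a deep learning framework implements the technique.

```python
import math


def layer(w, x):
    return math.tanh(w * x)


def layer_grad(w, x):
    """Derivative of layer(w, x) with respect to its input x."""
    return w * (1.0 - math.tanh(w * x) ** 2)


def grad_full(weights, x):
    """Regular backprop: keep the input of every layer."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grad = 1.0
    for w, inp in zip(reversed(weights), reversed(inputs)):
        grad *= layer_grad(w, inp)
    return grad, len(inputs)


def grad_checkpointed(weights, x, every=4):
    """Keep only every `every`-th input and recompute the rest per segment."""
    checkpoints = {}
    for i, w in enumerate(weights):
        if i % every == 0:
            checkpoints[i] = x
        x = layer(w, x)
    grad, peak = 1.0, len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        segment = weights[start:start + every]
        inputs, h = [], checkpoints[start]
        for w in segment:                      # recompute this segment's activations
            inputs.append(h)
            h = layer(w, h)
        peak = max(peak, len(checkpoints) + len(inputs))
        for w, inp in zip(reversed(segment), reversed(inputs)):
            grad *= layer_grad(w, inp)
    return grad, peak


weights = [0.9 + 0.01 * i for i in range(16)]  # made-up 16-layer chain
g_full, stored_full = grad_full(weights, 0.5)
g_ckpt, stored_ckpt = grad_checkpointed(weights, 0.5)
print(g_full, stored_full)
print(g_ckpt, stored_ckpt)
```

Both functions return the same gradient, but the checkpointed version holds fewer activations at its peak and pays for it with one extra forward pass through each segment.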

<a name="Data"></a>
### Data Prep
We now use the `Llama-3.1` format for conversation style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style, but convert it to HuggingFace's normal multiturn format `("role", "content")` instead of `("from", "value")`. Llama-3 renders multi turn conversations like below:

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>

I'm great thanks!<|eot_id|>
```
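
The rendering above can be reproduced by hand to see where each special token goes. This is only a sketch: `render_llama3` is a hypothetical helper, and the real `llama-3.1` chat template does more (for example, it adds a system block by default).

```python
def render_llama3(messages):
    """Render role/content messages in the Llama-3 layout shown above."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    return text


conversation = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there! How are you?"},
    {"role": "user", "content": "I'm great thanks!"},
]
rendered = render_llama3(conversation)
print(rendered)
```

Every turn is wrapped in a role header and closed with `<|eot_id|>`, which is what later lets the trainer tell user turns and assistant turns apart.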

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3` and more.

In [4]:
from unsloth.chat_templates import get_chat_template    # Ensures the data matches the structure the model expects, including special tokens for roles (e.g., "user" and "assistant") and end-of-turn markers.

tokenizer = get_chat_template(         # Returns the tokenizer with the chosen chat template attached, so structured conversations can be rendered into the text format the model expects.
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(examples):     # Renders each conversation into a single string with the chat template and stores the result in a new "text" column.
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("mlabonne/FineTome-100k", split = "train")     # FineTome-100k contains pre-defined conversations suitable for fine-tuning conversational models.

README.md:   0%|          | 0.00/982 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/117M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/100000 [00:00<?, ? examples/s]

We now use `standardize_sharegpt` to convert ShareGPT style datasets into HuggingFace's generic format. This changes the dataset from looking like:
```
{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}
```
to
```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```
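
The conversion shown above amounts to renaming keys and roles. A minimal sketch, assuming only the three ShareGPT roles from the example; `ROLE_MAP` and `to_role_content` are hypothetical names, and the real `standardize_sharegpt` may handle more cases.

```python
# Mapping from ShareGPT speaker names to the generic role names.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def to_role_content(turns):
    """Convert ShareGPT-style turns ("from"/"value") to "role"/"content" turns."""
    return [{"role": ROLE_MAP[turn["from"]], "content": turn["value"]} for turn in turns]


sharegpt_turns = [
    {"from": "system", "value": "You are an assistant"},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "It's 4."},
]
converted = to_role_content(sharegpt_turns)
for turn in converted:
    print(turn)
```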

In [5]:
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)

Standardizing format:   0%|          | 0/100000 [00:00<?, ? examples/s]

Map:   0%|          | 0/100000 [00:00<?, ? examples/s]

We look at how the conversations are structured for item 5:

In [7]:
dataset[5]["conversations"]

[{'content': 'How do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?',
  'role': 'user'},
 {'content': 'Astronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.',
  'role': 'assistant'}]

And we see how the chat template transformed these conversations.

**[Notice]** Llama 3.1 Instruct's default chat template adds `"Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024"`, so do not be alarmed!

In [8]:
dataset[5]["text"]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). The original notebook runs 60 steps to speed things up; the cell below uses `max_steps = 300`. For a full run, set `num_train_epochs=1` and remove the `max_steps` setting. We also support TRL's `DPOTrainer`!

In [24]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported
import torch
import os

# Checkpoint file path
checkpoint_path = "checkpoint.pth"

# Split dataset into training and validation sets
dataset_split = dataset.train_test_split(test_size=0.1)
# Extract train and validation datasets
train_data = dataset_split['train']
eval_data = dataset_split['test']

# Initializes an SFTTrainer object with the specified model, tokenizer, dataset, and training parameters.

trainer = SFTTrainer(
    model = model,                               # The pre-trained model being fine-tuned (here Llama-3.2-3B-Instruct with LoRA adapters).
    tokenizer = tokenizer,                       # The tokenizer used to turn the input text into token IDs.
    train_dataset = train_data,                  # Training split; already rendered to text by dataset.map(formatting_prompts_func, batched = True).
    eval_dataset = eval_data,                    # Validation split used for evaluation during training.
    dataset_text_field = "text",                 # The dataset column that contains the input text.
    max_seq_length = max_seq_length,             # Maximum sequence length for inputs. Longer sequences are truncated.
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
                                                 # Batches examples and pads input IDs and labels to a common length.
    dataset_num_proc = 2,                        # Number of processes used to tokenize the dataset.
                                                 # This only speeds up preprocessing; it does not change the training itself.
    packing = False, # Can make training 5x faster for short sequences.
                                                 # If True, several short sequences are concatenated into one input of up to max_seq_length tokens,
                                                 # so less compute is spent on padding.
                                                 # It is left at False here, so each conversation stays a separate example.
    # Defines the training hyperparameters using TrainingArguments.
    args = TrainingArguments(
        per_device_train_batch_size = 2,     # Batch size per device (GPU); 2 sequences are processed per forward/backward pass.
        gradient_accumulation_steps = 4,     # Accumulates gradients over 4 batches before each optimizer update.
                                             # The effective batch size becomes 2 x 4 = 8 while only 2 sequences are held in memory at a time.
        warmup_steps = 5,                    # Number of steps over which the learning rate ramps up from 0. Helps stabilize early training.
        #num_train_epochs = 3, # Set this instead of max_steps for full passes over the dataset.
                                             # num_train_epochs: The total number of training epochs (full passes over the dataset).
                                             # It is commented out here because max_steps takes precedence when both are given.
        max_steps = 300,                     # Stop after 300 optimizer updates (the original 60 steps is quite short).
        learning_rate = 2e-4,                # The step size for the optimizer, 0.0002 in this case.
        fp16 = not is_bfloat16_supported(),  # Uses 16-bit floating-point mixed precision when bfloat16 is not supported. Reduces memory usage and speeds up training.
        bf16 = is_bfloat16_supported(),      # Uses bfloat16 if the hardware supports it (Ampere or newer GPUs); preferred over fp16 there.
        logging_steps = 1,                   # Logs metrics (e.g., loss) after every step for better monitoring.
        optim = "adamw_8bit",                # The optimizer. adamw_8bit keeps the AdamW optimizer states in 8-bit to save memory.
        weight_decay = 0.01,                 # A regularization technique that penalizes large weights to reduce overfitting.
        lr_scheduler_type = "linear",        # The learning rate schedule. "linear" reduces the learning rate linearly to 0 after warmup.
        seed = 3407,                         # Random seed for reproducibility.
        output_dir = "output_dir",           # Directory where checkpoints and logs are saved (a relative path, so it is created in the current working directory).
        report_to = "none", # Use this for WandB etc
                                             # Where logs and metrics are reported. "none" disables reporting to external tools (e.g., WandB).
        save_steps=60,                      # Save a checkpoint every 60 steps
        # save_total_limit=3,  # Keep the last 3 checkpoints to save disk space
        save_strategy="steps",  # Save strategy, can be "steps" or "epoch"
        evaluation_strategy="steps",  # Evaluate during training (newer transformers releases name this eval_strategy)
        eval_steps=60,  # Evaluate every 60 steps
    ),
)



Map (num_proc=2):   0%|          | 0/90000 [00:00<?, ? examples/s]

Map (num_proc=2):   0%|          | 0/10000 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
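
The batch and schedule settings above imply some simple arithmetic, sketched here in plain Python. The values are copied from the cell above; `num_devices = 1` is an assumption (a single T4), and `linear_lr` is a hypothetical helper that approximates a linear warmup followed by linear decay, so the library's exact step indexing may differ.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 300
warmup_steps = 5
learning_rate = 2e-4
train_examples = 90_000   # size of the training split in the log above
num_devices = 1           # assumption: a single GPU

# Sequences contributing to each optimizer update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Sequences processed over the whole run (packing is off, so one example is one sequence).
examples_seen = effective_batch * max_steps
fraction_of_epoch = examples_seen / train_examples
print(f"effective batch size: {effective_batch}")
print(f"examples seen: {examples_seen} ({fraction_of_epoch:.1%} of one epoch)")


def linear_lr(step):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    return learning_rate * max(0.0, (max_steps - step) / (max_steps - warmup_steps))


for step in (0, 1, 5, 150, 300):
    print(step, linear_lr(step))
```

The run therefore covers only a small part of the training split, which is worth keeping in mind when reading the loss curve.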


In [27]:
# Look for a checkpoint saved by a previous run.
# The Trainer writes checkpoints as "checkpoint-<step>" folders inside output_dir, not as a single .pth file,
# so we locate the newest folder instead of calling torch.load on checkpoint_path.
from transformers.trainer_utils import get_last_checkpoint

ckpt_dir = trainer.args.output_dir
last_checkpoint = get_last_checkpoint(ckpt_dir) if os.path.isdir(ckpt_dir) else None
if last_checkpoint is not None:
    print(f"Found checkpoint {last_checkpoint}.")
    print("Pass resume_from_checkpoint=last_checkpoint to trainer.train() to resume from it.")
else:
    print("No checkpoint found. Starting training from scratch.")

No checkpoint found. Starting training from scratch.


In [None]:
"""
Purpose of Key Parameters
Checkpoints:
- The model saves intermediate states (checkpoints) in the output_dir after an epoch or defined steps. These can be used to resume or analyze training progress.

Learning Rate Warmup:
- Gradual increase in learning rate during the first 5 steps to avoid unstable training.

Memory Optimization:
- Gradient accumulation and 8-bit AdamW optimizer reduce memory overhead, enabling training with smaller GPUs.

Precision:
- fp16 or bf16 enables mixed-precision training for faster computation with less memory.

Logging:
- Logs metrics after every step (logging_steps=1), helpful for debugging and tracking progress.
"""

We also use Unsloth's `train_on_responses_only` function to train only on the assistant outputs and ignore the loss on the user's inputs.

In [28]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map:   0%|          | 0/90000 [00:00<?, ? examples/s]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

We verify masking is actually done:

In [29]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow can I create a Python program that takes a distinct date input in the format YYYY-MM-DD and returns the corresponding day of the week based on the ISO 8601 standard? For example, if the date is \'2020-12-30\', I want to get the day of the week as the output.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nTo solve this problem, you can utilize Python\'s built-in datetime library. This library provides useful methods to parse the date string into a datetime object and extract the day of the week.\n\nHere\'s an example Python program that achieves this:\n\n```python\nimport datetime\n\n# Assuming the date as \'2020-12-30\'\ndate_string = \'2020-12-30\'\n\n# Convert the string to datetime object\ndate_object = datetime.datetime.strptime(date_string, \'%Y-%m-%d\')\n\n# Get the day of the week\n

In [30]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                                                  \n\nTo solve this problem, you can utilize Python\'s built-in datetime library. This library provides useful methods to parse the date string into a datetime object and extract the day of the week.\n\nHere\'s an example Python program that achieves this:\n\n```python\nimport datetime\n\n# Assuming the date as \'2020-12-30\'\ndate_string = \'2020-12-30\'\n\n# Convert the string to datetime object\ndate_object = datetime.datetime.strptime(date_string, \'%Y-%m-%d\')\n\n# Get the day of the week\n# The weekday() function in Python\'s datetime module returns the day of the week as an integer, where Monday is 0 and Sunday is 6\n# We can define a list of days of the week starting from \'Monday\' to \'Sunday\'\nday_of_week_list = [\'Monday\', \'Tuesday\', \'Wednesday\', \'Thursday\', \'Friday\', \'Saturday\', \'Sunday\']\nday_of_week = day_of_week_list[date_object.weekday()]\n\nprint(\'The given 

In [31]:
"""
Masking the instruction prompt means excluding the system and user tokens from the loss.
Their labels are set to -100, which the loss function ignores, so the model is trained only to predict the assistant responses.
"""

'\nMasking the instruction prompt means excluding the system and user tokens from the loss.\nTheir labels are set to -100, which the loss function ignores, so the model is trained only to predict the assistant responses.\n'

We can see the System and Instruction prompts are successfully masked!
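
The effect of the masking can be sketched without any library. The token IDs and per-token losses below are made up, and `mask_labels` and `masked_mean_loss` are hypothetical helpers; the only detail taken from the notebook is that masked positions carry the label `-100`.

```python
IGNORE_INDEX = -100  # label value that the loss skips, as in the cell above


def mask_labels(input_ids, is_response):
    """Keep the token ID as label for response tokens, mask everything else."""
    return [tok if keep else IGNORE_INDEX for tok, keep in zip(input_ids, is_response)]


def masked_mean_loss(token_losses, labels):
    """Average the per-token losses over unmasked positions only."""
    kept = [loss for loss, label in zip(token_losses, labels) if label != IGNORE_INDEX]
    return sum(kept) / len(kept)


input_ids = [1, 5, 6, 11, 12]                      # made-up token IDs
is_response = [False, False, False, True, True]    # last two tokens belong to the assistant
token_losses = [2.0, 3.0, 1.0, 0.5, 1.5]           # made-up per-token losses

labels = mask_labels(input_ids, is_response)
loss = masked_mean_loss(token_losses, labels)
print(labels)
print(loss)
```

The prompt tokens still pass through the model as context, but only the assistant tokens contribute to the loss and therefore to the gradient.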

In [32]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
7.957 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

# Starts training the model with the configuration and data given to the trainer object.

'''
What Happens During trainer.train()?
Training Initialization:
- Prepares the model, dataset, optimizer, and scheduler based on the arguments provided during the trainer's initialization (e.g., learning rate, batch size, etc.).
- Training resumes from a checkpoint in output_dir only if resume_from_checkpoint is passed to trainer.train(); otherwise it starts from the current model weights.

Data Loading:
- Loads batches from the already tokenized train_dataset and uses the data collator to pad them to a common length.
- If gradient_accumulation_steps is greater than 1, gradients from that many batches are accumulated before each update of the model weights.

Forward Pass:
- Passes the input data through the model to calculate the predictions and the loss (e.g., cross-entropy loss for text generation).

Backward Pass:
- Computes the gradients for each trainable parameter by backpropagating the loss.

Gradient Updates:
- Applies the optimizer (e.g., AdamW) to update the model's weights using the gradients.
If gradient clipping or weight decay is enabled, those adjustments are applied during this step.

Learning Rate Scheduler:
- Updates the learning rate based on the schedule defined in the TrainingArguments (e.g., linear decay).

Logging:
- Tracks and logs metrics (e.g., loss, learning rate) at specified intervals (logging_steps).

Checkpointing:
- Saves the model and optimizer state to output_dir at regular intervals or at the end of an epoch, depending on configuration.

Repeat for Epochs:
- The above steps repeat for the number of epochs specified in TrainingArguments, or until max_steps is reached if that is set.


Return Value: trainer_stats
trainer.train() returns a TrainOutput named tuple (stored in trainer_stats) with three fields:
- global_step:
   - Total number of optimizer steps completed during training.
- training_loss:
   - The average training loss over the run (per-step losses appear in the logs, not here).
- metrics:
   - A dict of summary metrics, such as train_runtime (total training time in seconds).

It does not contain the model weights; the trained model remains available as trainer.model.
This data is useful for analyzing training performance and debugging.


Why Is This Line Important?
It triggers the actual training of the model.
It provides feedback (trainer_stats) on how well the training progressed.
It can be further used to evaluate the model or fine-tune training parameters for better results.
'''

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 90,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 300
 "-____-"     Number of trainable parameters = 24,313,856


Step,Training Loss,Validation Loss
60,0.8055,0.762726
120,0.9103,0.755152
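
The numbers in the training banner above fit together with simple arithmetic. The sketch below recomputes the total batch size and estimates how much of the dataset 300 steps cover. The helper name `effective_batch_size` is made up for illustration. That the run stops early because of a step cap is an assumption, since the cap itself is not shown in this part of the notebook.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus=1):
    """Number of examples contributing to one optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus


# Values copied from the training banner.
per_device_batch = 2
grad_accum_steps = 4
num_gpus = 1
total_steps = 300
num_examples = 90_000

batch = effective_batch_size(per_device_batch, grad_accum_steps, num_gpus)
examples_seen = batch * total_steps
percent_of_dataset = round(examples_seen / num_examples * 100, 2)

print(f"Total batch size = {batch}")
print(f"Examples seen in {total_steps} steps = {examples_seen}")
print(f"Share of the dataset = {percent_of_dataset} %")
```

So with these settings the run sees only a small slice of the 90,000 examples, which is typical for a quick demonstration run.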


In [22]:
# Save checkpoint after training
checkpoint = {
    "epoch": trainer.args.num_train_epochs,
    "model_state_dict": trainer.model.state_dict(),
    "optimizer_state_dict": trainer.optimizer.state_dict(),
}
torch.save(checkpoint, checkpoint_path)
print("Checkpoint saved!")

Checkpoint saved!


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

447.7124 seconds used for training.
7.46 minutes used for training.
Peak reserved memory = 4.43 GB.
Peak reserved memory for training = 0.043 GB.
Peak reserved memory % of max memory = 30.038 %.
Peak reserved memory for training % of max memory = 0.292 %.
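
As a quick sanity check, the percentages printed above can be reproduced from the other printed values. This is only a sketch of the same arithmetic the cell performs; the helper name `percent_of` is made up for illustration. Note that the cell divides by 1024 three times, so the "GB" figures are binary gigabytes (GiB).

```python
def percent_of(part, whole, digits=3):
    """Return part as a percentage of whole, rounded like the notebook cell."""
    return round(part / whole * 100, digits)


# Values copied from the printed output.
max_memory = 14.748        # total GPU memory
used_memory = 4.43         # peak reserved memory
used_for_training = 0.043  # peak reserved memory minus the pre-training baseline

used_percentage = percent_of(used_memory, max_memory)
training_percentage = percent_of(used_for_training, max_memory)
implied_baseline = round(used_memory - used_for_training, 3)

print(used_percentage, training_percentage, implied_baseline)
```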


<a name="Inference"></a>
### Inference
Let's run the model! You can change the user message in `messages`; the model generates the assistant reply.

We use `min_p = 0.1` and `temperature = 1.5`. Read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why.
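
To show what these two settings do, here is a minimal pure-Python sketch of temperature scaling followed by min-p filtering. It is an illustration under stated assumptions, not the code that `model.generate` runs: the function names and the toy logits are made up, and applying temperature before the min-p filter is assumed to mirror the usual order in Hugging Face generation.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature > 1 flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_filter(probs, min_p):
    """Zero out tokens whose probability is below min_p times the top probability."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


def sample(logits, temperature=1.5, min_p=0.1, rng=random):
    """Sample one token index after temperature scaling and min-p filtering."""
    probs = min_p_filter(softmax(logits, temperature), min_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [5.0, 4.0, 1.0, -2.0]  # toy logits for a 4-token vocabulary
probs = softmax(logits, temperature=1.5)
filtered = min_p_filter(probs, min_p=0.1)
print([round(p, 3) for p in filtered])
print(sample(logits))
```

The high temperature spreads probability across more tokens, while the min-p cutoff removes the long tail of tokens that are much less likely than the best candidate.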

In [None]:
from unsloth.chat_templates import get_chat_template

# get_chat_template configures the tokenizer to use the "llama-3.1" chat template,
# so the input text follows the formatting the model expects (e.g., user and assistant roles).
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable Unsloth's native 2x faster inference mode.

messages = [   # A structured representation of the conversation
    {"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")   # Move the inputs to the GPU ("cuda"). CPU-only environments, such as free Hugging Face hardware, run much slower.

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nContinue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n9, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181.<|eot_id|>']
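
The decoded output above shows the layout that the `llama-3.1` chat template produces. The sketch below rebuilds that layout by hand for illustration. The function `render_llama31_prompt` is hypothetical and deliberately simplified: it omits the default system block with the knowledge and date lines that the real template inserts, so prefer `tokenizer.apply_chat_template` in practice.

```python
def render_llama31_prompt(messages, add_generation_prompt=True):
    """Render chat messages in the Llama 3.1 layout seen in the decoded output.

    Simplified sketch: the real template also adds a default system block.
    """
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        parts.append(f"{message['content']}<|eot_id|>")
    if add_generation_prompt:
        # An open assistant header tells the model that it should reply next.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [{"role": "user", "content": "Hello!"}]
prompt = render_llama31_prompt(messages)
print(prompt)
```

Without the trailing assistant header (`add_generation_prompt = False`), the model has no cue that a reply is expected, which is why the notebook marks that flag as required for generation.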

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!
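
To illustrate the streaming idea without loading a model, here is a toy sketch. `ToyStreamer`, `fake_generate`, and the tiny vocabulary are all hypothetical stand-ins; the `put` / `end` method names and the `skip_prompt` behaviour are modelled on the streamer interface in `transformers`, but this is not that library's implementation.

```python
class ToyStreamer:
    """Prints decoded text as soon as new token ids arrive (illustrative only)."""

    def __init__(self, decode, skip_prompt=True):
        self.decode = decode            # function mapping a list of token ids to text
        self.skip_prompt = skip_prompt
        self.next_is_prompt = True      # the first call to put() carries the prompt
        self.pieces = []                # kept so the streamed text can be inspected

    def put(self, token_ids):
        if self.skip_prompt and self.next_is_prompt:
            self.next_is_prompt = False
            return
        self.next_is_prompt = False
        text = self.decode(token_ids)
        self.pieces.append(text)
        print(text, end="", flush=True)

    def end(self):
        print()
        self.next_is_prompt = True


VOCAB = {0: "1, 1, 2, 3, 5, 8,", 1: " 13,", 2: " 21,", 3: " 34"}


def decode(token_ids):
    return "".join(VOCAB[i] for i in token_ids)


def fake_generate(prompt_ids, new_ids, streamer):
    """Stand-in for generation: pushes the prompt, then one token at a time."""
    streamer.put(prompt_ids)
    for token_id in new_ids:
        streamer.put([token_id])
    streamer.end()


streamer = ToyStreamer(decode, skip_prompt=True)
fake_generate([0], [1, 2, 3], streamer)
```

With `skip_prompt=True` only the newly generated pieces are printed, which matches how the notebook's streamed output shows the reply without the prompt.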

In [None]:
''' PURPOSE OF THIS BLOCK:
Configures and runs a conversational model to respond to a user’s message in real time.
Specifically:
- Prepares the model and tokenizer for optimized inference.
- Formats the input message into a chat-compatible format.
- Generates text based on the user’s prompt.
- Streams the response as it’s generated for a more dynamic experience.
'''


'''
Purpose: Switches the model into Unsloth's inference mode via FastLanguageModel.for_inference, which Unsloth describes as natively up to 2x faster.
Why?:
- Generation does not need the training-time machinery, so an inference-specific code path can reduce computational overhead.
- The exact optimizations are internal to Unsloth and are not detailed in this notebook.
Impact:
- Improves throughput during tasks like text generation; it is not intended to change model accuracy.
'''
FastLanguageModel.for_inference(model) # Enable native 2x faster inference.

# Purpose: Defines a chat-style message structure for input.
# Why?: Conversational language models expect this structured format, which mirrors a multi-turn dialogue.
messages = [
    {"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]

# Purpose: Prepares the input text for the model in a chat-specific format and optimizes it for GPU-based inference.
# Why?: The model expects its inputs to be properly tokenized and formatted to generate meaningful output.
inputs = tokenizer.apply_chat_template(  # Formats the messages in the style the model expects (e.g., <|start_header_id|>user<|end_header_id|> headers and <|eot_id|> turn endings).
    messages,
    tokenize = True,                     # Converts the formatted text into numerical IDs (tokens), which the model can process.
    add_generation_prompt = True,        # Must add for generation. Appends the assistant header (<|start_header_id|>assistant<|end_header_id|>) so the model starts a reply.
    return_tensors = "pt",               # Returns the tokenized input as a PyTorch tensor, suitable for computation.
).to("cuda")                             # Moves the data to the GPU for faster processing.

'''
What Is TextStreamer?
- A utility that streams tokens (words or subwords) generated by the model in real time.
- Displays the model’s output incrementally as it generates tokens.
Why Use It?: Enhances interactivity by allowing users to see results as they are generated instead of waiting for the entire output.
'''
from transformers import TextStreamer

'''
Purpose:
- Configures the TextStreamer to display generated text interactively.
   - tokenizer: Decodes tokens into readable text.
   - skip_prompt=True: Ensures only the generated text is streamed, excluding the input prompt.
Why?: Improves the user experience by focusing only on the new response generated by the model.
'''
text_streamer = TextStreamer(tokenizer, skip_prompt = True)

# Purpose: Generates a continuation for the input prompt and streams it to the console as it is produced.
# Why?: Real-time streaming improves interactivity, and temperature and min_p control how varied the response is.
#   input_ids = inputs:       uses the formatted, tokenized input prepared earlier.
#   streamer = text_streamer: prints the generated text incrementally.
#   max_new_tokens = 128:     limits the generated text to at most 128 new tokens.
#   use_cache = True:         reuses cached attention key/value states so earlier tokens are not recomputed.
#   temperature = 1.5:        values above 1 flatten the token distribution, producing more diverse output.
#   min_p = 0.1:              drops tokens whose probability is below 10% of the most likely token's probability.
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

10, 13, 21, 34, 55, 89, 144, 233, 377, 610<|eot_id|>


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
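
A rough size estimate shows why the adapter file is small. The training banner in this notebook reports 24,313,856 trainable parameters, and the upload log for the save cell reports `adapter_model.safetensors` at 97.3M. The sketch below checks that these agree if each parameter is stored as a 4-byte float32 and the log uses decimal megabytes; both of those are assumptions, and the small safetensors header is ignored.

```python
trainable_params = 24_313_856  # from the training banner

BYTES_PER_FLOAT32 = 4
BYTES_PER_FLOAT16 = 2

size_fp32_mb = trainable_params * BYTES_PER_FLOAT32 / 1e6
size_fp16_mb = trainable_params * BYTES_PER_FLOAT16 / 1e6

print(f"float32 adapters: {size_fp32_mb:.1f} MB")
print(f"float16 adapters: {size_fp16_mb:.1f} MB")
```

The float32 estimate lines up with the reported file size, which is consistent with only the LoRA weights being written and not the base model.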

In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
model.push_to_hub("ivwhy/lora_model", token = "") # Online saving
tokenizer.push_to_hub("ivwhy/lora_model", token = "") # Online saving

README.md:   0%|          | 0.00/594 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/97.3M [00:00<?, ?B/s]

Saved model to https://huggingface.co/ivwhy/lora_model


  0%|          | 0/1 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

Now if you want to load the LoRA adapters we just saved for inference, make sure the `if` condition in the next cell is `True` (change `False` to `True` if needed):

In [None]:
# Purpose: This block demonstrates how to reload the saved model and tokenizer for inference.

if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Describe a tall tower in the capital of France."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

==((====))==  Unsloth 2024.11.11: Fast Llama patching. Transformers:4.46.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
I don't know anything about a tower in the capital of France, but the capital of France is Paris.<|eot_id|>


You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # Strongly NOT recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.
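
Conceptually, a merged save folds the LoRA update into the base weights so that no separate adapter is needed at inference time. The sketch below shows the standard LoRA merge formula, W' = W + (alpha / r) * B A, on tiny hand-written matrices. It is an illustration of the general technique only: the function names and numbers are made up, and Unsloth's `save_pretrained_merged` additionally has to handle details such as dequantizing 4-bit base weights, which are not shown here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


W = [[1.0, 0.0], [0.0, 1.0]]  # base weight, shape (out=2, in=2)
A = [[1.0, 2.0]]              # LoRA A, shape (rank=1, in=2)
B = [[0.5], [-0.5]]           # LoRA B, shape (out=2, rank=1)

merged = merge_lora(W, A, B, alpha=2, rank=1)
print(merged)
```

After merging, the result has the same shape as the base weight, which is why a merged model can be loaded by tools that know nothing about LoRA.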

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
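
To give a feel for what a quantization method does, here is a pure-Python sketch of the idea behind `q8_0`: weights are split into blocks of 32, and each block stores one scale plus 32 signed 8-bit integers. This is a simplified illustration, not llama.cpp's code: the function names are made up, the scale is kept as a Python float instead of a 16-bit float, and Python's rounding of exact ties may differ from the C implementation.

```python
BLOCK_SIZE = 32  # number of weights sharing one scale


def quantize_block_q8_0(values):
    """Return (scale, int8-range quants) for one block of floats."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 0.0
    quants = [round(v / scale) if scale else 0 for v in values]
    return scale, quants


def dequantize_block_q8_0(scale, quants):
    """Reconstruct approximate floats from a quantized block."""
    return [scale * q for q in quants]


block = [((i % 7) - 3) * 0.1 for i in range(BLOCK_SIZE)]  # toy weights
scale, quants = quantize_block_q8_0(block)
restored = dequantize_block_q8_0(scale, quants)
max_error = max(abs(a - b) for a, b in zip(block, restored))

# 32 weights at 8 bits each, plus one 16-bit scale per block.
bits_per_weight = (BLOCK_SIZE * 8 + 16) / BLOCK_SIZE

print(f"scale = {scale:.6f}, max error = {max_error:.6f}")
print(f"bits per weight = {bits_per_weight}")
```

The lower-bit methods such as `q4_k_m` follow the same block-and-scale idea with fewer bits per weight and a more elaborate layout, trading some accuracy for a smaller file.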

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [None]:
'''
This code saves the model in various GGUF quantization formats, either to local storage or to the Hugging Face Hub. GGUF is the model file format used by llama.cpp.
This format is designed for efficient model storage and inference, especially in low-resource environments like edge devices.
'''


'''
What It Does:
save_pretrained_gguf: Saves the model locally in the Q8_0 format (8-bit quantization) along with its tokenizer.
push_to_hub_gguf: Uploads the quantized model to the Hugging Face Hub for shared access.
Why?
- 8-bit Quantization (Q8_0) reduces the model size and memory usage without significantly affecting performance.
- It's a balance between speed and accuracy.
'''
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

'''
- f16 (16-bit precision) maintains higher numerical precision, which is ideal for models where accuracy is more critical than memory efficiency.
- Useful for high-performance environments like GPUs with FP16 support.
'''
# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

'''
Why?:
- 4-bit Quantization significantly reduces model size and memory requirements, making it ideal for deployment on resource-constrained devices like mobile phones or edge devices.
- However, this can lead to a slight loss in accuracy.
'''
# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")


'''
What It Does:
- Uploads the model to the Hugging Face Hub in multiple quantization formats in one operation.
- Supported formats in this example: q4_k_m, q8_0, and q5_k_m.
- Requires an authentication token (token) to upload the model.
Why?:
- This is a time-saving approach if you need to make the model available in several quantized formats to support various deployment scenarios:
- q4_k_m: Very efficient for edge devices but less accurate.
- q8_0: Balanced performance and precision.
- q5_k_m: A compromise between q4_k_m and q8_0.
*An edge device is hardware located near the end user (at the "edge" of a network) rather than in centralized cloud servers or data centers,
 e.g., a laptop. These devices perform computing tasks locally.
'''
# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/u54VK8m8tk) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)
10. [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
11. [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
12. [**NEW**] We make Mistral NeMo 12B 2x faster and fit in under 12GB of VRAM! [Mistral NeMo notebook](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Support our work if you can! Thanks!
</div>