# Training Pipeline

## Installation

In [1]:
%%capture
import os

if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth comet_ml==3.48.1
else:
    # Do this only in Colab notebooks and Kaggle notebooks!
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29 peft trl triton
    !pip install --no-deps cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer comet_ml==3.48.1
    !pip install --no-deps unsloth

### Troubleshooting Unsloth

Unsloth evolves quickly, so the installation steps or its API might change.

If any issues arise while installing or using Unsloth, check out their official [Llama3.1 8B Notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_\(8B\)-Alpaca.ipynb) as a reference (Unsloth keeps it up to date).

This notebook follows the same template, so spotting what changed should be easy.



## Prepare environment

If you want to upload your dataset to Hugging Face, enter your Hugging Face token (it's free).

If you want to use Comet as an experiment tracker, enter your Comet API key (it's free).

In [2]:
import os
from getpass import getpass

hf_token = getpass("Enter your Hugging Face token. Press Enter to skip: ")
enable_hf = bool(hf_token)
print(f"Is Hugging Face enabled? '{enable_hf}'")

comet_api_key = getpass("Enter your Comet API key. Press Enter to skip: ")
enable_comet = bool(comet_api_key)
comet_project_name = "second-brain-course"
print(f"Is Comet enabled? '{enable_comet}'")

if enable_hf:
    os.environ["HF_TOKEN"] = hf_token
if enable_comet:
    os.environ["COMET_API_KEY"] = comet_api_key
    os.environ["COMET_PROJECT_NAME"] = comet_project_name

Enter your Hugging Face token. Press Enter to skip: ··········
Is Hugging Face enabled? 'True'
Enter your Comet API key. Press Enter to skip: ··········
Is Comet enabled? 'True'


## Global variables

Make sure you have an Nvidia GPU active. You can choose it from the **Runtime** tab.

In [3]:
import torch


def get_gpu_info() -> str | None:
    """Gets GPU device name if available.

    Returns:
        str | None: Name of the GPU device if available, None if no GPU is found.
    """
    if not torch.cuda.is_available():
        return None

    gpu_name = torch.cuda.get_device_properties(0).name

    return gpu_name


active_gpu_name = get_gpu_info()

print("GPU type:")
print(active_gpu_name)

GPU type:
Tesla T4


In [4]:
dataset_id = (
    input(
        "Enter your Hugging Face dataset_id (which you generated in Lesson 3). Hit enter to use our precomputed version: "
    )
    or "pauliusztin/second_brain_course_summarization_task"
)
print(f"{dataset_id=}")

Enter your Hugging Face dataset_id (which you generated in Lesson 3). Hit enter to use our precomputed version: 
dataset_id='pauliusztin/second_brain_course_summarization_task'


The GPU type determines the training settings, because training in 4-bit (QLoRA) takes substantially longer than training in 16-bit (LoRA). On a T4 Nvidia GPU, which is available in Google Colab's free tier, we cannot train with 16-bit LoRA without encountering issues, so we fall back to QLoRA and train for fewer steps to avoid waiting an eternity for the fine-tuning to complete.
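To make the memory side of this trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes about 8.03 billion parameters for Llama 3.1 8B (an approximation that is not stated in this notebook) and counts only the model weights, ignoring activations, LoRA adapters, optimizer state, and quantization overhead. The helper name `weight_memory_gib` is hypothetical.

```python
# Rough estimate of the memory needed just to hold the model weights.
# Assumption: ~8.03e9 parameters for Llama 3.1 8B (approximate, not from this notebook).
NUM_PARAMS = 8.03e9


def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Return the weight footprint in GiB for a given precision."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1024**3


mem_16bit = weight_memory_gib(NUM_PARAMS, 16)  # LoRA on 16-bit weights
mem_4bit = weight_memory_gib(NUM_PARAMS, 4)  # QLoRA on 4-bit weights

print(f"16-bit weights: {mem_16bit:.2f} GiB")
print(f" 4-bit weights: {mem_4bit:.2f} GiB")
```

Under these assumptions, the 16-bit weights alone come to roughly 15 GiB, which is already above the 14.741 GB that the T4 reports later in this notebook, while the 4-bit weights need less than 4 GiB. This is likely one reason why 16-bit LoRA is impractical on a T4.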

In [5]:
max_seq_length = 4096  # Choose any! We auto support RoPE Scaling internally!
dtype = (
    None  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
)
if active_gpu_name and "T4" in active_gpu_name:
    load_in_4bit = True  # Use 4bit quantization to reduce memory usage.
    max_steps = 25  # Reduce training steps to avoid waiting too long.
elif active_gpu_name and ("A100" in active_gpu_name or "L4" in active_gpu_name):
    load_in_4bit = False  # Disable 4bit quantization for faster training.
    max_steps = 250  # As we train without 4bit quantization, we can train for more steps without waiting too long.
elif active_gpu_name:
    load_in_4bit = False  # Disable 4bit quantization for faster training.
    max_steps = 150  # As we train without 4bit quantization, we can train for more steps without waiting too long.
else:
    raise ValueError("No Nvidia GPU found.")

print("--- Parameters ---")
print(f"{max_steps=}")
print(f"{load_in_4bit=}")
print(f"{dtype=}")

--- Parameters ---
max_steps=25
load_in_4bit=True
dtype=None


## Load LLM using Unsloth

In [6]:
from unsloth import FastLanguageModel

base_model = "Meta-Llama-3.1-8B-Instruct"  # or unsloth/Qwen2.5-7B-Instruct
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=f"unsloth/{base_model}",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.9: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/55.5k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

We now add LoRA adapters, so we only need to update a small fraction of all parameters (typically 1 to 10%; the training log of this run reports 0.90%).

In [7]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,  # Supports any, but = 0 is optimized
    bias="none",  # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=False,  # We support rank stabilized LoRA
    loftq_config=None,  # And LoftQ
)

Unsloth 2025.3.9 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
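As a sanity check on how few parameters LoRA trains, the adapter size can be computed by hand: each adapted linear layer with `in_features` and `out_features` adds two low-rank matrices with `r * (in_features + out_features)` parameters in total. The sketch below assumes the layer dimensions of the public Llama 3.1 8B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers), which are not stated in this notebook. The constant and function names are hypothetical.

```python
# Assumed Llama 3.1 8B dimensions (from the public model config, not from this notebook).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128  # 8 key/value heads x head dimension 128
NUM_LAYERS = 32
RANK = 16  # Same value as r=16 in get_peft_model above.

# (in_features, out_features) for every module listed in target_modules.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in LoRA matrices A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_param_count(i, o, RANK) for i, o in MODULE_SHAPES.values())
total_trainable = per_layer * NUM_LAYERS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total_trainable:,}")
```

With these dimensions the total is 41,943,040, which matches the `Trainable parameters` value that the training banner reports later in this notebook.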


### Data Preparation

We now use the Alpaca format to map the instruct dataset into input prompts.

**[NOTE]** Remember to append the **EOS_TOKEN** to every training text before it is tokenized! Otherwise, the model never learns when to stop and you'll get infinite generations!

If you want to use the `llama-3` template for ShareGPT datasets, try Unsloth's conversational [notebook](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [8]:
from datasets import load_dataset

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
You are a helpful assistant specialized in summarizing documents. Generate a concise TL;DR summary in markdown format having a maximum of 512 characters of the key findings from the provided documents, highlighting the most significant insights

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN


def formatting_prompts_func(examples):
    inputs = examples["instruction"]
    outputs = examples["answer"]
    texts = []
    for instruction, answer in zip(inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, answer) + EOS_TOKEN

        texts.append(text)
    return {
        "text": texts,
    }
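
To see what the batched formatting produces without loading a tokenizer or dataset, here is a small self-contained sketch. It uses a shortened template and a placeholder end-of-sequence string; both `SHORT_TEMPLATE` and `FAKE_EOS` are hypothetical stand-ins for `alpaca_prompt` and `tokenizer.eos_token`.

```python
# Shortened stand-in for the Alpaca template (hypothetical, for illustration only).
SHORT_TEMPLATE = "### Input:\n{}\n\n### Response:\n{}"
# Placeholder EOS string; the real value comes from tokenizer.eos_token.
FAKE_EOS = "<EOS>"


def format_batch(examples: dict, eos_token: str = FAKE_EOS) -> dict:
    """Format a batch (dict of lists) the same way a batched map function would."""
    texts = [
        SHORT_TEMPLATE.format(instruction, answer) + eos_token
        for instruction, answer in zip(examples["instruction"], examples["answer"])
    ]
    return {"text": texts}


batch = {
    "instruction": ["Document A", "Document B"],
    "answer": ["Summary A", "Summary B"],
}
formatted = format_batch(batch)

for text in formatted["text"]:
    print(repr(text))
```

Every formatted text ends with the EOS marker, so during training the model sees an explicit stop signal right after each summary.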

In [9]:
dataset = load_dataset(dataset_id)
dataset = dataset.map(
    formatting_prompts_func,
    batched=True,
)

README.md:   0%|          | 0.00/529 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/3.25M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/1.11M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/955k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/575 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/72 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/72 [00:00<?, ? examples/s]

Map:   0%|          | 0/575 [00:00<?, ? examples/s]

Map:   0%|          | 0/72 [00:00<?, ? examples/s]

Map:   0%|          | 0/72 [00:00<?, ? examples/s]

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer).

In [10]:
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=True,  # Can make training 5x faster for short sequences.
    args=TrainingArguments(
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        # num_train_epochs=1,  # Set this for 1 full training run, while commenting out 'max_steps'.
        max_steps=max_steps,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="comet_ml" if enable_comet else "none",
    ),
)

Tokenizing to ["text"] (num_proc=2):   0%|          | 0/575 [00:00<?, ? examples/s]

Packing train dataset (num_proc=2):   0%|          | 0/575 [00:00<?, ? examples/s]
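
The trainer above combines `warmup_steps=5` with `lr_scheduler_type="linear"`. As a minimal sketch of what that implies for the T4 run (`max_steps=25`), the function below reimplements the usual linear warmup followed by linear decay in plain Python. It is meant to mirror the behavior of the linear schedule in `transformers` as I understand it, not to replace it, and the function name `linear_schedule_lr` is hypothetical.

```python
def linear_schedule_lr(
    step: int,
    base_lr: float = 2e-4,
    warmup_steps: int = 5,
    total_steps: int = 25,
) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr at the end of warmup to 0 at total_steps.
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


for step in (0, 1, 5, 15, 25):
    print(f"step {step:>2}: lr = {linear_schedule_lr(step):.6f}")
```

With only 25 steps, a fifth of the run is spent warming up, and the learning rate starts decaying immediately afterwards.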

In [11]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
5.752 GB of memory reserved.


In [12]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 501 | Num Epochs = 1 | Total steps = 25
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040/4,670,623,744 (0.90% trained)
COMET INFO: Experiment is live on comet.com https://www.comet.com/luluspace34/second-brain-course/9e61d7b1b561454181581f43047386cd

COMET INFO: Couldn't find a Git repository in '/content' nor in any parent directory. Set `COMET_GIT_DIRECTORY` if your Git Repository is elsewhere.


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,1.5082
2,1.5511
3,1.6154
4,1.6292
5,1.6093
6,1.6047
7,1.6124
8,1.4821
9,1.368
10,1.5666
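
The numbers in the training banner can be reproduced with a little arithmetic. The sketch below uses the values from this run (batch size 2, 4 gradient accumulation steps, 1 GPU, 25 steps, 501 packed examples) to show how much of the training set a 25-step run actually covers. The variable names are chosen for illustration.

```python
# Values taken from the trainer configuration and the training banner of this run.
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
max_steps = 25
num_packed_examples = 501  # "Num examples" after packing

# One optimizer step consumes this many packed sequences.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus

# Sequences processed over the whole run, and the share of the dataset they represent.
sequences_seen = effective_batch_size * max_steps
fraction_of_epoch = sequences_seen / num_packed_examples

print(f"Effective batch size: {effective_batch_size}")
print(f"Sequences seen:       {sequences_seen}")
print(f"Fraction of an epoch: {fraction_of_epoch:.2f}")
```

In other words, the 25-step T4 run sees well under half of the packed training set, so the loss values above come from a partial pass over the data rather than a full epoch.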


In [14]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime'] / 60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

2491.924 seconds used for training.
41.53 minutes used for training.
Peak reserved memory = 10.693 GB.
Peak reserved memory for training = 4.941 GB.
Peak reserved memory % of max memory = 72.539 %.
Peak reserved memory for training % of max memory = 33.519 %.


### Inference
Let's run the model! You can change the instruction and input, but leave the output blank!

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the full output!

In [15]:
from transformers import TextStreamer

FastLanguageModel.for_inference(model)  # Enable native 2x faster inference
text_streamer = TextStreamer(tokenizer)


def generate_text(
    instruction, streaming: bool = True, trim_input_message: bool = False
):
    message = alpaca_prompt.format(
        instruction,
        "",  # output - leave this blank for generation!
    )
    inputs = tokenizer([message], return_tensors="pt").to("cuda")

    if streaming:
        return model.generate(
            **inputs, streamer=text_streamer, max_new_tokens=256, use_cache=True
        )
    else:
        output_tokens = model.generate(**inputs, max_new_tokens=256, use_cache=True)
        output = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)[0]

        if trim_input_message:
            return output[len(message) :]
        else:
            return output

In [16]:
# Feed the "instruction" of the first validation sample to the fine-tuned model and stream the generated summary.
generate_text(dataset["validation"][0]["instruction"], streaming=True)

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
You are a helpful assistant specialized in summarizing documents. Generate a concise TL;DR summary in markdown format having a maximum of 512 characters of the key findings from the provided documents, highlighting the most significant insights

### Input:
[![](assets/figures/nvidia.svg) Toronto AI Lab  ](https://research.nvidia.com/labs/toronto-ai/)

# 

Align your Latents:High-Resolution Video Synthesis with Latent Diffusion Models

[Andreas Blattmann1 *,†](https://twitter.com/andi_blatt) [Robin Rombach1 *,†](https://twitter.com/robrombach) [Huan Ling2,3,4 *](https://www.cs.toronto.edu/~linghuan/) [Tim Dockhorn2,3,5 *,†](https://timudk.github.io/) [Seung Wook Kim2,3,4](https://seung-kim.github.io/seungkim/) [Sanja Fidler2,3,4](https://www.cs.toronto.edu/~fidler/) [Karsten Kreis2](https://karste

tensor([[128000,  39314,    374,  ...,    696,  11874,  19130]],
       device='cuda:0')

In [17]:
generate_text(dataset["validation"][0]["instruction"], streaming=False)

'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nYou are a helpful assistant specialized in summarizing documents. Generate a concise TL;DR summary in markdown format having a maximum of 512 characters of the key findings from the provided documents, highlighting the most significant insights\n\n### Input:\n[![](assets/figures/nvidia.svg) Toronto AI Lab  ](https://research.nvidia.com/labs/toronto-ai/)\n\n# \n\nAlign your Latents:High-Resolution Video Synthesis with Latent Diffusion Models\n\n[Andreas Blattmann1 *,†](https://twitter.com/andi_blatt) [Robin Rombach1 *,†](https://twitter.com/robrombach) [Huan Ling2,3,4 *](https://www.cs.toronto.edu/~linghuan/) [Tim Dockhorn2,3,5 *,†](https://timudk.github.io/) [Seung Wook Kim2,3,4](https://seung-kim.github.io/seungkim/) [Sanja Fidler2,3,4](https://www.cs.toronto.edu/~fidler/) [Karsten Kreis2](https://karstenkre
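
The decoded output echoes the full prompt before the generated summary. As a minimal sketch (not part of the original notebook), the summary can be isolated by splitting on the `### Response:` marker of the Alpaca template rather than trimming by character offset. The helper `extract_response` is hypothetical, and it assumes the marker's last occurrence is the one that ends the prompt, which may not hold if the generated text repeats the marker.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded: str, marker: str = RESPONSE_MARKER) -> str:
    """Return the text after the last occurrence of the response marker.

    Falls back to the whole (stripped) string if the marker is missing.
    """
    _, found, tail = decoded.rpartition(marker)
    if not found:
        return decoded.strip()
    return tail.strip()


# Toy decoded output standing in for the real model generation.
decoded_output = (
    "### Instruction:\nSummarize the document.\n\n"
    "### Input:\nSome long document...\n\n"
    "### Response:\nA short summary."
)
print(extract_response(decoded_output))
```

Splitting on the marker avoids depending on the decoded prompt having exactly the same length as the original message string.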


## Saving Fine-tuned LLM

The last step is to save the fine-tuned LLM locally and on Hugging Face if the Hugging Face token is available.

In [18]:
from huggingface_hub import HfApi

model_name = f"{base_model}-Second-Brain-Summarization"
print(f"Model name: {model_name}")
model.save_pretrained_merged(
    model_name,
    tokenizer,
    save_method="merged_16bit",
)  # Local saving

if enable_hf:
    api = HfApi()
    user_info = api.whoami(token=hf_token)
    huggingface_user = user_info["name"]
    print(f"Current Hugging Face user: {huggingface_user}")

    model.push_to_hub_merged(
        f"{huggingface_user}/{model_name}",
        tokenizer=tokenizer,
        save_method="merged_16bit",
        token=hf_token,
    )  # Online saving to Hugging Face

Model name: Meta-Llama-3.1-8B-Instruct-Second-Brain-Summarization


Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 6.0G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.81 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


 28%|██▊       | 9/32 [00:00<00:01, 12.26it/s]
We will save to Disk and not RAM now.
100%|██████████| 32/32 [00:45<00:00,  1.43s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving Meta-Llama-3.1-8B-Instruct-Second-Brain-Summarization/pytorch_model-00001-of-00004.bin...
Unsloth: Saving Meta-Llama-3.1-8B-Instruct-Second-Brain-Summarization/pytorch_model-00002-of-00004.bin...
Unsloth: Saving Meta-Llama-3.1-8B-Instruct-Second-Brain-Summarization/pytorch_model-00003-of-00004.bin...
Unsloth: Saving Meta-Llama-3.1-8B-Instruct-Second-Brain-Summarization/pytorch_model-00004-of-00004.bin...
Done.
Current Hugging Face user: Deepri24


Unsloth: You are pushing to hub, but you passed your HF username = Deepri24.
We shall truncate Deepri24/Meta-Llama-3.1-8B-Instruct-Second-Brain-Summarization to Meta-Llama-3.1-8B-Instruct-Second-Brain-Summarization


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.7 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 32/32 [00:57<00:00,  1.80s/it]


Unsloth: Saving tokenizer...

  0%|          | 0/1 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

 Done.
Unsloth: Saving Meta-Llama-3.1-8B-Instruct-Second-Brain-Summarization/pytorch_model-00001-of-00004.bin...
Unsloth: Saving Meta-Llama-3.1-8B-Instruct-Second-Brain-Summarization/pytorch_model-00002-of-00004.bin...
Unsloth: Saving Meta-Llama-3.1-8B-Instruct-Second-Brain-Summarization/pytorch_model-00003-of-00004.bin...
Unsloth: Saving Meta-Llama-3.1-8B-Instruct-Second-Brain-Summarization/pytorch_model-00004-of-00004.bin...


README.md:   0%|          | 0.00/623 [00:00<?, ?B/s]

  0%|          | 0/4 [00:00<?, ?it/s]

pytorch_model-00003-of-00004.bin:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

pytorch_model-00002-of-00004.bin:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

pytorch_model-00004-of-00004.bin:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

pytorch_model-00001-of-00004.bin:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

Done.
Saved merged model to https://huggingface.co/Deepri24/Meta-Llama-3.1-8B-Instruct-Second-Brain-Summarization
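
Because the model is saved with LoRA weights merged into 16-bit weights, it should be loadable with plain `transformers`, without Unsloth or PEFT. The sketch below is an assumption about how you might reload it, not a step from the original notebook. The function name `load_merged_model` is hypothetical, and `model_dir` can be either the local folder or the Hugging Face repository ID created above.

```python
def load_merged_model(model_dir: str):
    """Load a merged 16-bit checkpoint and its tokenizer with plain transformers."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(
        model_dir,
        torch_dtype="auto",  # Keep the dtype stored in the checkpoint.
        device_map="auto",  # Let accelerate place the weights on the available devices.
    )
    return model, tokenizer


# Example call (commented out because it downloads or reads several GB of weights):
# model, tokenizer = load_merged_model("Meta-Llama-3.1-8B-Instruct-Second-Brain-Summarization")
```

Note that the merged 16-bit model is much larger than the 4-bit base used for training, so loading it needs correspondingly more memory.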



And we're done! If you have questions about Unsloth, find a bug, need help, or want to join projects and keep up with the latest LLM news, join their [Discord](https://discord.gg/unsloth) server!

Also, you can visit their docs to learn about their supported [models](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).

Some other links:
1. Llama 3.2 Conversational notebook. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, continued pretraining, conversational finetuning and more in their [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join their Discord if you need help + ⭐️ <i>Star their <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>