# Setup

In [None]:
!pip install pip3-autoremove
!pip-autoremove torch torchvision torchaudio -y
!pip install -q  torch torchvision torchaudio xformers --index-url https://download.pytorch.org/whl/cu121
!pip install -q unsloth

Collecting pip3-autoremove
  Downloading pip3_autoremove-1.2.2-py2.py3-none-any.whl.metadata (2.2 kB)
Downloading pip3_autoremove-1.2.2-py2.py3-none-any.whl (6.7 kB)
Installing collected packages: pip3-autoremove
Successfully installed pip3-autoremove-1.2.2
The 'jedi>=0.16' distribution was not found and is required by the application
Skipping jedi
nvidia-cuda-nvrtc-cu12 12.5.82 is installed but nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == "Linux" and platform_machine == "x86_64" is required
Redoing requirement with just package name...
nvidia-cuda-runtime-cu12 12.5.82 is installed but nvidia-cuda-runtime-cu12==12.4.127; platform_system == "Linux" and platform_machine == "x86_64" is required
Redoing requirement with just package name...
nvidia-cuda-cupti-cu12 12.5.82 is installed but nvidia-cuda-cupti-cu12==12.4.127; platform_system == "Linux" and platform_machine == "x86_64" is required
Redoing requirement with just package name...
nvidia-cudnn-cu12 9.3.0.75 is installed but n

This code manages Python package installations related to PyTorch, a deep learning framework. Let's go through each line:

The first line installs pip3-autoremove, a utility that provides the pip-autoremove command for removing a package together with any of its dependencies that no other installed package needs.

The second line uses this newly installed tool to remove the existing PyTorch installation and its related packages (torch, torchvision, and torchaudio). The -y flag automatically confirms the removal without asking for user input.

The third line reinstalls PyTorch and its components, but with specific requirements:
- The -q flag makes the installation "quiet" with minimal output
- It installs torch, torchvision, torchaudio, and xformers
- The --index-url parameter directs pip to download from PyTorch's custom package repository
- The cu121 in the URL selects wheels built against CUDA 12.1 (NVIDIA's GPU computing platform)

The final line installs unsloth, a library that speeds up fine-tuning of transformer language models and reduces its memory cost. The -q flag again keeps the installation output minimal.
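After a reinstall like this, it can be useful to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library's importlib.metadata; the helper name installed_versions is illustrative, and packages that are not installed are reported as None rather than raising an error.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["torch", "torchvision", "torchaudio", "xformers", "unsloth"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Running this in a fresh cell after the install gives a quick view of whether the torch build and the unsloth package are both present.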


In [None]:
# General-purpose imports (os and pprint are standard library; matplotlib is third-party)
import matplotlib.pyplot as plt
import os
from pprint import pp

# Third-party library imports
import torch
from datasets import load_dataset
from transformers import (
    TextStreamer,
    TrainerCallback,
    TrainingArguments,
    Trainer,
    AutoModelForCausalLM,
    AutoTokenizer
)
from trl import SFTTrainer

# Unsloth-specific imports
from unsloth import FastLanguageModel, is_bfloat16_supported



os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
os.environ['WANDB_DISABLED']="true"

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


This code sets up the foundation for fine-tuning a large language model, particularly using the Unsloth optimization framework. Let's break it down section by section to understand what each part does:

The first section imports general-purpose utilities:
- matplotlib.pyplot (a third-party plotting library, not part of the standard library) allows creating visualizations to track training progress
- os provides functions to interact with the operating system and environment variables
- pp, the pretty-print function from the standard pprint module, displays complex data structures in a readable format

The second section brings in specialized machine learning libraries:
- torch is the PyTorch deep learning framework that provides the core functionality for neural networks
- datasets from Hugging Face makes it easy to load and process training data
- The transformers imports include several critical components:
  - TextStreamer handles real-time text generation output
  - TrainerCallback allows customizing the training process
  - TrainingArguments and Trainer handle the training loop and configuration
  - AutoModelForCausalLM and AutoTokenizer automatically load appropriate models and tokenizers based on the chosen architecture
- SFTTrainer from trl (Transformer Reinforcement Learning) specifically handles supervised fine-tuning

The third section imports Unsloth-specific tools:
- FastLanguageModel provides optimized versions of language models
- is_bfloat16_supported checks if the hardware supports bfloat16 precision, which can speed up training while maintaining good numerical stability

Finally, two environment variables are set:
- HF_HUB_ENABLE_HF_TRANSFER="1" enables the hf_transfer backend for faster downloads from the Hugging Face Hub (this relies on the hf_transfer package being installed)
- WANDB_DISABLED="true" turns off Weights & Biases logging, which is a popular tool for experiment tracking but isn't being used in this setup

The imports are grouped roughly by source (general-purpose, third-party, and Unsloth-specific), a common convention that keeps dependencies easy to scan. Together with the two environment variables, this prepares the environment for fine-tuning a large language model with Unsloth: the tools for data loading, model training, and loss tracking are all in place.

In [None]:
class CFG:
    device = 'cuda'
    model = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
    max_seq_length = 2048
    dtype = None
    load_in_4bit = True
    seed = 1096
    dataset = "krnk97/summary_dataset"

This code defines a configuration class called CFG that contains essential settings for training a large language model. Let's examine each setting to understand its purpose and impact:

The device setting of 'cuda' indicates that the model will run on an NVIDIA GPU rather than CPU, which is crucial for the intensive computations involved in training large language models. CUDA (Compute Unified Device Architecture) provides significant speed advantages over CPU processing.

The model setting selects Meta's Llama 3.1 8-billion-parameter model in a pre-quantized checkpoint published by Unsloth. The "bnb-4bit" suffix indicates 4-bit quantization through the bitsandbytes library, which sharply reduces memory usage at a small cost in accuracy. The "Instruct" part indicates the instruction-tuned variant rather than the base model.

max_seq_length of 2048 defines the maximum number of tokens that can be processed in a single sequence. This is like setting the size of the model's "working memory" - it can only consider 2048 tokens at once when generating or processing text. This is a common sequence length that balances between processing longer contexts and managing memory usage.

The dtype being set to None leaves the choice of compute data type to Unsloth, which picks one automatically based on the hardware: typically bfloat16 on GPUs that support it and float16 otherwise.

load_in_4bit being True confirms that the model should be loaded using 4-bit quantization. This is a form of model compression that stores the weights in roughly a quarter of the space needed at 16-bit precision (about an eighth of 32-bit), making it possible to run larger models on consumer hardware. While this comes with a small accuracy trade-off, modern quantization techniques preserve most of the model's quality.
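The memory saving can be checked with back-of-the-envelope arithmetic. The sketch below assumes a round 8 billion parameters and counts only the weights themselves (no quantization constants, activations, or optimizer state), so the figures are rough lower bounds rather than measurements. The function name weight_memory_gb is illustrative.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # assumption: a round 8 billion parameters

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Relative to 32-bit weights the 4-bit figure is one eighth; relative to 16-bit weights it is one quarter. A real 4-bit checkpoint is somewhat larger than this estimate, presumably because not every tensor is stored in 4 bits.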

The seed value of 1096 is intended to make training reproducible. With a fixed random seed, random operations (like initialization or data shuffling) happen the same way each time the code runs, which is essential for debugging and comparing different training runs. The value only takes effect where it is actually passed to a component, so it has to be handed to the trainer or other seeded calls explicitly.
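The effect of a fixed seed can be illustrated with Python's standard random module. This is only an analogy for what seeded shuffling does; the training stack uses its own random number generators, and the helper name shuffled_indices is hypothetical.

```python
import random


def shuffled_indices(n, seed):
    """Return the indices 0..n-1 in an order determined entirely by the seed."""
    rng = random.Random(seed)  # a private generator, independent of global state
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


first = shuffled_indices(12, seed=1096)
second = shuffled_indices(12, seed=1096)
print(first == second)  # the same seed reproduces the same order
```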

The dataset points to "krnk97/summary_dataset" on Hugging Face's dataset hub, indicating this setup is intended for training on a summarization task. The model will learn to create summaries by studying examples from this dataset.

Together, these settings create a balanced configuration that enables training a large language model efficiently on consumer hardware while maintaining good performance. The choices reflect common practices in modern NLP, particularly the use of quantization and GPU acceleration to make large model training more accessible.

# Functions

In [None]:
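def formatting_prompts_func(examples):
    # `examples` is a batch: each key maps to a list of column values.
    # Relies on the globals `alpaca_prompt` and `EOS_TOKEN` defined in the Data prep section.
    instructions = examples["instruction"]
    contexts = examples["context"]
    outputs = examples["response"]
    texts = []
    for instruction, context, output in zip(instructions, contexts, outputs):
        # Placeholder order in the template: instruction, context, EOS token, response
        text = alpaca_prompt.format(instruction, context, EOS_TOKEN, output)
        texts.append(text)
    return {"text": texts}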

This function transforms a dataset of instruction-response pairs into a format suitable for training a language model. Let's break down how it works and why each part matters.

The function takes a collection of examples as input, where each example contains three key pieces: an instruction (what we want the model to do), a context (background information needed), and a response (the ideal output we want the model to learn to generate). Think of these as similar to how a teacher might structure a lesson - giving instructions, providing context, and showing the desired outcome.

The heart of the function is a formatting step that combines these elements using a template called alpaca_prompt (defined later, in the Data prep section). The template follows the Alpaca format, a widely-used approach for instruction-tuning language models that was introduced by Stanford researchers. An EOS_TOKEN (end-of-sequence token) is inserted between the input and response sections to mark where the instruction and context end and where the expected response begins. Note that this differs from the more common convention, in which the EOS token is appended after the response so that the model learns where to stop generating.

The function walks through these elements in lockstep using Python's zip function, which is like dealing out cards - it takes one instruction, one input, and one output at a time and combines them. For each trio, it creates a formatted text string using the template and adds it to a list. Iterating this way ensures that corresponding pieces stay together, much like keeping a student's question paired with its correct answer.

The final output is a dictionary with a single key "text" containing all the formatted examples. This structure matters for two reasons: when a function is mapped over a dataset in batched mode it is expected to return a dictionary of lists with one entry per example, and a single text column is what supervised fine-tuning trainers such as SFTTrainer are commonly pointed at.

One subtle but important detail is that the function maintains the relationship between instructions, contexts, and responses while transforming them into a format the model can learn from effectively. This preservation of relationships is crucial for the model to learn not just to memorize responses, but to understand how different types of instructions should modify its output based on the given context.

This formatting step is a critical bridge between raw data and model training - it's like translating teaching materials into a language that both the model and the training process can understand and work with effectively.
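To make the input and output shapes concrete, here is a minimal, self-contained sketch of the same formatting logic applied to a two-example batch. The shortened template and the "<EOS>" string are stand-ins chosen for this illustration; the notebook's real template and tokenizer-provided EOS token are defined later.

```python
# Stand-ins for the notebook's globals (assumptions for this sketch)
EOS_TOKEN = "<EOS>"
alpaca_prompt = "### Instruction:\n{}\n\n### Input:\n{}\n{}\n\n### Response:\n{}"


def formatting_prompts_func(examples):
    """Turn a dict-of-lists batch into a dict with one formatted string per example."""
    texts = []
    for instruction, context, output in zip(
        examples["instruction"], examples["context"], examples["response"]
    ):
        texts.append(alpaca_prompt.format(instruction, context, EOS_TOKEN, output))
    return {"text": texts}


batch = {
    "instruction": ["Summarize the text.", "Summarize the text."],
    "context": ["First passage.", "Second passage."],
    "response": ["First summary.", "Second summary."],
}

formatted = formatting_prompts_func(batch)
print(len(formatted["text"]))
print(formatted["text"][0])
```

The batch goes in as one list per column and comes out as one formatted string per example, with the placeholder order matching the template.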

In [None]:
class CustomCallback(TrainerCallback):
    def __init__(self):
        self.train_losses = []
        self.eval_losses = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None:
            if 'loss' in logs:
                self.train_losses.append(logs['loss'])
            if 'eval_loss' in logs:
                self.eval_losses.append(logs['eval_loss'])

This code creates a custom callback system to track the model's learning progress during training.

The CustomCallback class inherits from TrainerCallback, which is like creating a specialized version of a general-purpose monitoring system. Think of it as customizing a basic teaching assistant to track specific metrics we care about.

In the initialization method (__init__), the class creates two empty lists:
- train_losses tracks how well the model is learning from the training data, like keeping track of quiz scores during a course
- eval_losses tracks the model's performance on validation data, similar to monitoring performance on practice tests that weren't used for learning

The heart of this callback is the on_log method, which springs into action whenever the training process logs new information. It takes several parameters:
- args contains training configuration details
- state holds information about the current training state
- control allows for modifying the training flow
- logs contains the actual performance metrics we want to track
- **kwargs catches any additional parameters that might be passed

The method checks if there are any logs (if logs is not None) and then looks for two specific metrics:
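- 'loss': This measures how far the model's predictions are from the target tokens on the training data (for language models, typically a cross-entropy value rather than a count of mistakes). Think of it like grading homework - a higher loss means the answers are further from correct.
- 'eval_loss': This is the same measure computed on the validation examples, which the model is not trained on. It's like seeing how well students can apply what they've learned to fresh problems.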

When either of these metrics is found, it's added to the appropriate list for tracking. This creates a historical record of the model's performance, allowing us to:
- Monitor if the model is actually improving over time
- Detect if the model is overfitting (doing well on training data but poorly on evaluation data)
- Make informed decisions about when to stop training or adjust the learning process

This tracking system is crucial because training a language model isn't a one-and-done process - it's iterative, like teaching a complex subject over many lessons. By keeping detailed records of both training and evaluation performance, we can ensure the model is learning effectively and make adjustments if it's struggling in certain areas.
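The logging behaviour can be tried out without running a training job. The sketch below uses a plain class, LossRecorder, as a stand-in for CustomCallback so that it does not depend on the transformers package, and feeds it a few hand-written log dictionaries; the loss values are made up for illustration and are not results from this notebook.

```python
class LossRecorder:
    """Stand-in for CustomCallback with the same on_log logic, minus the base class."""

    def __init__(self):
        self.train_losses = []
        self.eval_losses = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None:
            if "loss" in logs:
                self.train_losses.append(logs["loss"])
            if "eval_loss" in logs:
                self.eval_losses.append(logs["eval_loss"])


recorder = LossRecorder()

# Hypothetical log payloads, shaped like the dictionaries a trainer emits
simulated_logs = [
    {"loss": 2.10, "learning_rate": 2e-4},
    {"loss": 1.80, "learning_rate": 1e-4},
    {"eval_loss": 1.95},
    None,  # on_log must tolerate a missing payload
]

for logs in simulated_logs:
    recorder.on_log(args=None, state=None, control=None, logs=logs)

print(recorder.train_losses)
print(recorder.eval_losses)
```

Training and evaluation entries arrive in separate log calls, which is why the two lists generally end up with different lengths.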


In [None]:
def extract_response(text):
    # Strip list brackets and single quotes left over from list-style formatting
    cleaned_text = text.strip("[]'")

    # Locate the Response section; bail out early if it is missing
    marker = "### Response:"
    response_start = cleaned_text.find(marker)
    if response_start == -1:
        return "Response section not found"

    # The response ends at the next <|eot_id|>, or at the end of the text if none is present
    next_eot = cleaned_text.find("<|eot_id|>", response_start)
    if next_eot == -1:
        next_eot = len(cleaned_text)

    # Skip past the marker itself (13 characters); strip() removes surrounding whitespace
    response_text = cleaned_text[response_start + len(marker):next_eot].strip()

    return response_text

This function extracts the model's response from a specially formatted text string, similar to how you might locate and extract a specific answer from a standardized test response. Let me walk through how it works and why each step matters.

The function starts by cleaning up the input text. The strip("[]'") call removes any leading or trailing characters that are square brackets or single quotes (the argument is treated as a set of characters, not as a literal substring). Think of this like removing the packaging around a product to get to the actual content inside. This cleaning step is necessary because the text likely comes from a list or array format that includes these extra characters.

Next, the function searches for two key markers in the text:
1. "### Response:" marks the beginning of the model's response, like a section header in a document
2. "<|eot_id|>" (Llama 3's end-of-turn token) marks where the response ends

The find() method is used to locate these markers, returning the position (index) where each marker appears in the text. It's similar to using a bookmark to quickly find a specific chapter in a book. The response_start variable stores the position where "### Response:" begins.

There's an important safety check in the middle of the function: if response_start is -1 (meaning the "### Response:" marker wasn't found), the function returns an error message. This is like having a backup plan when you can't find what you're looking for - instead of crashing or returning nonsense, it clearly communicates that something is missing.

The actual extraction of the response happens in the final steps. The function:
1. Starts from response_start + 13 (to skip past "### Response:" itself)
2. Takes all text from that point until the next_eot position (where "<|eot_id|>" was found)
3. Uses strip() to remove any extra whitespace at the beginning or end

This extraction process is precise - it's like using surgical tools to carefully remove exactly the piece we want, no more and no less. The offset of 13 is the length of the string "### Response:", which ensures the marker itself is not included in the final output.

The function returns this extracted text, which should contain just the model's response, cleaned and ready for use. This extracted response could then be used for evaluation, display, or further processing.

This kind of text extraction is common in machine learning pipelines where we need to separate different parts of model outputs. It's particularly useful when working with instruction-tuned models that follow specific formatting conventions for their inputs and outputs.
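The marker-and-offset logic can be exercised on a hand-written string. The following is a minimal sketch of the same find-and-slice technique, not the notebook's function itself: the name extract_between, the choice to return None when the start marker is absent, and the fallback to the end of the text when the end marker is absent are all assumptions made for this illustration.

```python
MARKER = "### Response:"
END_TOKEN = "<|eot_id|>"


def extract_between(text, start_marker=MARKER, end_marker=END_TOKEN):
    """Return the text between start_marker and the next end_marker, or None."""
    start = text.find(start_marker)
    if start == -1:
        return None
    end = text.find(end_marker, start)
    if end == -1:
        end = len(text)  # no end marker: take everything that remains
    return text[start + len(start_marker):end].strip()


sample = "### Instruction:\nSummarize the text.\n\n### Response:\nA short summary.<|eot_id|>"

print(len(MARKER))
print(extract_between(sample))
print(extract_between("no markers here"))
```

Using len(start_marker) instead of a hard-coded number keeps the offset correct if the marker text ever changes.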

# Data prep

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = CFG.model,
    max_seq_length = CFG.max_seq_length,
    dtype = CFG.dtype,
    load_in_4bit = CFG.load_in_4bit,
)

==((====))==  Unsloth 2025.1.8: Fast Llama patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/55.4k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/340 [00:00<?, ?B/s]

This code initializes a large language model and its accompanying tokenizer using Unsloth's FastLanguageModel system. Let's explore what's happening and why each part matters.

When we load a language model, we're essentially setting up an intricate neural network that has been pre-trained to understand and generate human language. Think of it like preparing a highly sophisticated instrument before a performance - everything needs to be configured just right.

The from_pretrained method does the heavy lifting here, taking several carefully chosen parameters from our CFG (configuration) class. Let's examine each parameter to understand its role:

model_name (CFG.model) specifies which model we're loading - in this case, it's a 4-bit quantized version of Meta's Llama 3.1 8B Instruct model. The "8B" indicates it has 8 billion parameters, like having 8 billion tiny knobs that have been carefully tuned during pre-training. The "Instruct" part means it's been specially trained to follow instructions, making it more reliable for specific tasks.

max_seq_length (CFG.max_seq_length) sets how much text the model can process at once - here it's 2048 tokens. Think of this like setting the size of the model's working memory. If you were reading a book, this would be like determining how many words you can hold in your mind at once before needing to "reset" and start fresh.

dtype (CFG.dtype) determines the numerical precision used for the model's calculations. Being set to None lets Unsloth choose the precision based on your hardware. In this run the startup banner reports Bfloat16 = FALSE for the Tesla T4, so float16 is the likely choice. This is like choosing the right level of precision for a measuring instrument - too precise might waste resources, while not precise enough could lead to errors.

load_in_4bit (CFG.load_in_4bit) enables 4-bit quantization, an advanced compression technique that dramatically reduces the model's memory footprint. Instead of storing each parameter with high precision, it uses just 4 bits per parameter. This is like compressing a high-resolution image - you lose a tiny bit of precision, but the file becomes much more manageable while remaining perfectly usable.

The function returns two crucial components: the model and its tokenizer. The tokenizer is like the model's interface to human language - it converts raw text into numbers the model can process (tokens) and back again. Having both components properly initialized and working together is essential for the model to function correctly.

This initialization step is particularly important because it sets up the foundation for all subsequent operations. The choices made here - like using 4-bit quantization and setting a specific sequence length - will affect everything from memory usage to processing speed to the types of tasks the model can handle effectively.

In [None]:
# load the dataset splits
train_dataset = load_dataset(CFG.dataset, split = "train")
eval_dataset = load_dataset(CFG.dataset, split = "validation")


README.md:   0%|          | 0.00/465 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/19.8k [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/14.5k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/12 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/4 [00:00<?, ? examples/s]

This code loads our training data by creating two distinct dataset splits - one for training and one for validation. Let's explore why we need both splits and how this loading process works.

Think of this setup like preparing materials for teaching a class. Just as a teacher needs both practice exercises (training data) and test questions (validation data), our model needs different datasets for learning and for checking its progress.

The load_dataset function comes from the Hugging Face datasets library, which acts like a vast digital library of machine learning datasets. When we call this function with CFG.dataset (which we saw earlier was set to "krnk97/summary_dataset"), we're essentially reaching into this library and pulling out our chosen dataset.

The split parameter is crucial here - it tells the function which portion of the dataset we want. By specifying "train" for train_dataset, we're getting the larger portion of data (12 examples here, according to the loading output) that will be used to actually teach the model. This is like gathering all the practice problems and examples we'll use during lessons. The model will learn patterns, relationships, and rules from this data through repeated exposure and correction.

For eval_dataset, we specify "validation" as the split. This portion of data is kept completely separate from the training data, much like how a teacher keeps final exam questions separate from practice materials. The validation dataset serves as our testing ground - it helps us answer the crucial question: "Is the model truly learning to generalize, or is it just memorizing the training examples?"

The separation between training and validation data is a fundamental principle in machine learning. If we tested the model on the same data we used for training, it would be like giving students the exact same questions they practiced with - we wouldn't know if they truly understood the material or just memorized specific answers.

It's worth noting that when these datasets are loaded, they come with all their original structure intact - the instructions, contexts, and responses we saw being handled in the formatting_prompts_func earlier. This structure will be crucial for the actual training process, where we'll need to format each example properly for the model to learn from it.

This two-split approach provides us with the tools we need to both train the model effectively and measure its progress honestly. As we move forward in the training process, we'll use these datasets in tandem - the training data to improve the model's performance, and the validation data to keep track of how well it's really learning.

In [None]:
# show data
train_dataset[0]


{'instruction': 'Summarize the text about supervised learning.',
 'context': 'Supervised learning is a type of machine learning where an algorithm is trained on labeled data. The goal is to learn a mapping from inputs to outputs that can be used to make predictions on new, unseen data. Supervised learning algorithms can be classified into regression and classification. Regression algorithms predict continuous values, while classification algorithms predict discrete labels. Examples include linear regression for predicting prices and logistic regression for classifying emails as spam or not. Supervised learning is widely used in applications such as financial forecasting, medical diagnosis, and image recognition.',
 'response': '{"title": "Introduction to Supervised Learning","summary_text":"Supervised learning involves training algorithms on labeled data to predict outcomes for new data. It includes regression for continuous values and classification for discrete labels. Applications s

In [None]:

# format the text columns using the Alpaca format
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n

### Instruction:\n
{}\n\n

### Input:\n
{}\n
{}\n\n

### Response:\n
{}"""

EOS_TOKEN = tokenizer.eos_token

This code defines a template for formatting training examples in the Alpaca style - a widely-used format for instruction-tuning language models. Let me explain both the format and why it's structured this way.

The alpaca_prompt template creates a consistent structure: a preamble followed by three sections, each clearly marked with a "###" header. This structure helps the model understand different parts of each example, similar to how a standardized worksheet helps students understand what information goes where.

The template starts with a clear mission statement: "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request." This preamble sets expectations for what follows, like a teacher explaining the format of an assignment.

Let's examine each section and its purpose:

"### Instruction:" is where we place the primary task or command for the model. The {} placeholder will be filled with specific instructions like "Summarize the following text" or "Explain this concept." This section tells the model what it needs to do.

"### Input:" provides the context or background information the model needs to complete the task. It contains two {} placeholders. The first is for the actual input content, while the second is for the EOS_TOKEN (End of Sequence token). This token, defined in the code as tokenizer.eos_token, acts as a boundary marker, helping the model understand where the input section ends.

"### Response:" is where the expected output goes, marked by the final {} placeholder. During training, this section contains the ideal response we want the model to learn from. During actual use, this is where the model will generate its response.

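The template writes \n escapes in addition to the literal line breaks inside the triple-quoted string. Because the string is not a raw string, both produce newline characters, so the rendered prompt contains more blank lines between sections than the escapes alone suggest (four newlines after the preamble, for example). This spacing isn't just for human readability - it is part of the text the model sees, so the same spacing has to be reproduced when prompting the model at inference time.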

The structure is particularly effective because:
1. It maintains consistency across all examples, helping the model learn patterns
2. It clearly separates instructions from context, preventing confusion
3. The EOS_TOKEN provides a clear signal for where input ends and response should begin
4. The format mirrors natural question-answer interactions, making it easier for the model to understand its role

This template will be used by the formatting_prompts_func we saw earlier to transform raw dataset examples into properly structured training instances. Each instance will follow this exact format, creating a uniform training experience that helps the model learn to respond appropriately to different types of instructions and contexts.

Understanding this format is crucial because it influences how well the model learns to follow instructions and generate appropriate responses. The clarity and consistency of this structure contribute significantly to the success of instruction-tuning language models.
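One detail worth checking when writing templates like this is how escape sequences and literal line breaks combine inside a triple-quoted string. The sketch below uses a deliberately shortened, hypothetical template (not the full alpaca_prompt) and prints the rendered string with repr so that every newline is visible.

```python
# A shortened, hypothetical template that mixes \n escapes with literal line breaks
template = """Preamble.\n\n

### Instruction:\n
{}"""

rendered = template.format("Summarize the text.")
print(repr(rendered))

# Count the newline characters between the preamble and the first header
gap = rendered.split("Preamble.")[1].split("### Instruction:")[0]
print(gap.count("\n"))
```

Each escape and each literal line break contributes its own newline character, so the gap is twice as long as the two escapes alone would produce.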

In [None]:
# map the formatting function on train and eval sets
train_dataset = train_dataset.map(formatting_prompts_func, batched = True,)
eval_dataset = eval_dataset.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/12 [00:00<?, ? examples/s]

Map:   0%|          | 0/4 [00:00<?, ? examples/s]

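This code applies a formatting function to two datasets - one for training and one for evaluation. With batched=True, the function is called on a batch of examples at a time: it receives a dictionary that maps each column name to a list of values, rather than a single example. This is batching, not parallelism; the batches are still processed one after another.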

The "map" operation transforms each example in the dataset by running it through the formatting_prompts_func. Think of it like an assembly line where each data point goes through the same formatting station.

Batched processing is more efficient than formatting one example at a time because the per-call overhead is paid once per batch rather than once per example. It's similar to how a restaurant kitchen might prepare several orders of the same dish in one go rather than cooking each order separately.

The same formatting function is applied to both datasets to ensure consistency - the training data and evaluation data go through identical preprocessing steps. This consistency is crucial because machine learning models need their input data to be formatted the same way during both training and testing.

The formatted datasets are stored back into the same variables (train_dataset and eval_dataset), overwriting the original unformatted versions.
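The batched calling convention can be sketched in plain Python. The helper below, map_batched, is a simplified illustration written for this explanation and is not how the datasets library is implemented: it groups rows into dict-of-lists batches, calls the function once per batch, and attaches the returned columns to each row.

```python
def map_batched(rows, fn, batch_size=1000):
    """Apply fn to dict-of-lists batches and merge its output columns into each row."""
    result = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Convert a list of row dicts into one dict of column lists
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = fn(batch)
        for i, row in enumerate(chunk):
            merged = dict(row)
            for key, values in new_columns.items():
                merged[key] = values[i]
            result.append(merged)
    return result


def add_text(batch):
    """Toy formatting function: one output string per example in the batch."""
    return {"text": [f"{q} -> {a}" for q, a in zip(batch["instruction"], batch["response"])]}


rows = [
    {"instruction": "A", "response": "1"},
    {"instruction": "B", "response": "2"},
    {"instruction": "C", "response": "3"},
]

mapped = map_batched(rows, add_text, batch_size=2)
print(mapped[0])
print(len(mapped))
```

With a batch size of 2, the toy function is called twice (on two rows, then on one), yet every row still ends up with its own text value.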

In [None]:
train_dataset[0]

{'instruction': 'Summarize the text about supervised learning.',
 'context': 'Supervised learning is a type of machine learning where an algorithm is trained on labeled data. The goal is to learn a mapping from inputs to outputs that can be used to make predictions on new, unseen data. Supervised learning algorithms can be classified into regression and classification. Regression algorithms predict continuous values, while classification algorithms predict discrete labels. Examples include linear regression for predicting prices and logistic regression for classifying emails as spam or not. Supervised learning is widely used in applications such as financial forecasting, medical diagnosis, and image recognition.',
 'response': '{"title": "Introduction to Supervised Learning","summary_text":"Supervised learning involves training algorithms on labeled data to predict outcomes for new data. It includes regression for continuous values and classification for discrete labels. Applications s

# Model

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,                                 # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj",],
    lora_alpha = 16,
    lora_dropout = 0,                       # Supports any, but = 0 is optimized
    bias = "none",                          # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,                      # We support rank stabilized LoRA
    loftq_config = None,                    # And LoftQ
)

Not an error, but Unsloth cannot patch MLP layers with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Unsloth 2025.1.8 patched 32 layers with 32 QKV layers, 32 O layers and 0 MLP layers.


This code configures a language model for efficient fine-tuning using a technique called LoRA (Low-Rank Adaptation). Let me break down each parameter and explain its significance:

The first parameter `r=32` sets the rank of the LoRA matrices. Think of this as controlling how much flexibility the model has to learn new tasks. Higher values like 32 or 64 give more learning capacity but use more memory and computation. It's like giving the model more "degrees of freedom" to adapt to new tasks.

The `target_modules` parameter specifies which parts of the model receive LoRA adapters. In this case, it's targeting the attention mechanism components (query, key, value, and output projections) and two of the feed-forward projections (gate and up). Note that `down_proj` is not in the list, which is most likely why the log above reports 0 patched MLP layers. This is analogous to selecting which parts of a complex machine you want to adjust while leaving the rest unchanged.

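`lora_alpha=16` is a scaling factor applied to the LoRA update before it is added to the frozen weights. Standard LoRA scales the update by `lora_alpha / r`; with `use_rslora=True`, as in this configuration, the scale is `lora_alpha / sqrt(r)` instead. Either way, it works together with the rank to determine how strongly the adapted components influence the model's output.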

The code uses optimized settings for efficiency: `lora_dropout=0` disables dropout (a regularization technique), and `bias="none"` means we're not training bias terms. These choices prioritize speed and memory efficiency over flexibility.

`use_gradient_checkpointing="unsloth"` enables a memory-saving technique specifically optimized for long sequences. Instead of storing all intermediate computations, it recomputes some values during the backward pass, trading computation for memory savings.

The `random_state=3407` parameter sets a seed for reproducibility, ensuring the same initialization every time the code runs.

`use_rslora=True` enables rank-stabilized LoRA, an enhancement that helps maintain more stable training dynamics, particularly for larger rank values. Think of it as adding guardrails to prevent the fine-tuning process from becoming unstable.
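To make the interaction between `r`, `lora_alpha`, and `use_rslora` concrete, here is a small sketch of the scaling factor applied to the LoRA update. It assumes the usual conventions (`lora_alpha / r` for standard LoRA, `lora_alpha / sqrt(r)` for rank-stabilized LoRA). The helper name `lora_scale` is made up for illustration and is not part of Unsloth or PEFT.

```python
import math

def lora_scale(lora_alpha: float, r: int, use_rslora: bool) -> float:
    """Scaling factor multiplied into the low-rank update B @ A (illustrative helper)."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)  # rank-stabilized LoRA
    return lora_alpha / r                 # standard LoRA

# Values taken from the configuration in this notebook
standard = lora_scale(16, 32, use_rslora=False)
rslora = lora_scale(16, 32, use_rslora=True)
print(f"standard LoRA scale: {standard:.3f}")
print(f"rsLoRA scale:        {rslora:.3f}")
```

With this notebook's settings the rank-stabilized scale is several times larger than the standard one, so the same `lora_alpha` has a noticeably stronger effect.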

Finally, `loftq_config=None` indicates that LoftQ (another optimization technique) is not being used in this configuration.

All these parameters work together to create a memory-efficient, computationally optimized setup for fine-tuning large language models while maintaining good performance. The choices reflect a balance between adaptation capability and computational efficiency.
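As a rough sanity check on what these settings cost, the sketch below counts the weights LoRA adds. Each adapted linear layer with input size `d_in` and output size `d_out` gets two small matrices totalling `r * (d_in + d_out)` parameters. The model dimensions used here are an assumption (Llama-3-8B-style: hidden size 4096, feed-forward size 14336, key/value projection size 1024, 32 layers); the notebook section shown here does not state them.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A is (r x d_in) and B is (d_out x r), so r * (d_in + d_out) weights."""
    return r * (d_in + d_out)

# ASSUMED dimensions (Llama-3-8B-style); not stated in this notebook.
hidden, intermediate, kv_dim, n_layers, r = 4096, 14336, 1024, 32, 32

# (input dim, output dim) of each projection listed in target_modules
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
}

per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in shapes.values())
total = per_layer * n_layers
print(f"{total:,} trainable LoRA parameters")
```

Under these assumed dimensions the total is 65,011,712, which agrees with the "Number of trainable parameters" line that the trainer prints when training starts.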

# Training

In [None]:

training_args = TrainingArguments(
    per_device_train_batch_size=2,        # examples per device in each forward/backward pass
    gradient_accumulation_steps=4,        # micro-batches accumulated before each optimizer step
    warmup_steps=2,                       # steps to go from 0 to specified lr
    num_train_epochs=4,                   # number of full passes over the training set
    learning_rate=2e-4,
    fp16=not is_bfloat16_supported(),
    bf16=is_bfloat16_supported(),
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=3407,
    output_dir="outputs",
    evaluation_strategy="steps",          # Evaluate during training
    eval_steps=1,                         # Evaluate every 1 step
    save_strategy="steps",
    save_steps=10,                        # save model state every 10 steps
    load_best_model_at_end=True,          # Load the best model when finished training
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Let me explain these training arguments for a machine learning model by walking through each parameter and its role in the training process.

The `per_device_train_batch_size=2` determines how many examples the model processes simultaneously. Think of this like a teacher grading two student papers at once. A smaller batch size of 2 uses less memory but might take longer to process all examples.

Working together with batch size is `gradient_accumulation_steps=4`. Instead of updating the model after every batch, it accumulates gradients from 4 batches before making an update. This is similar to a teacher collecting feedback from several assignments before adjusting their teaching strategy. This technique allows for effectively larger batch sizes without requiring as much memory.

The learning process starts with `warmup_steps=2`, where the learning rate gradually increases from zero to the specified rate. This gentle start helps prevent early training instability - like how you might ease into exercise rather than starting at full intensity.

`num_train_epochs=4` means the model will see the entire dataset four times. Each pass through the data gives the model another chance to refine its understanding, similar to how students might review material multiple times to better grasp it.
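The batch size, accumulation, and epoch settings together determine how many optimizer steps the run takes. The arithmetic below is a simplified sketch, not the trainer's exact logic (which can differ between library versions); the dataset size of 12 comes from the training log, and a single GPU is assumed.

```python
import math

num_examples = 12            # training set size reported in the training log
per_device_batch_size = 2
grad_accum_steps = 4
num_epochs = 4
num_devices = 1              # assumption: single GPU

# Examples contributing to each optimizer step
effective_batch = per_device_batch_size * grad_accum_steps * num_devices

# Micro-batches the data loader yields per epoch
micro_batches_per_epoch = math.ceil(num_examples / per_device_batch_size)

# Simplified rule: whole accumulation groups per epoch, at least one
steps_per_epoch = max(micro_batches_per_epoch // grad_accum_steps, 1)
total_steps = steps_per_epoch * num_epochs

print(f"effective batch size = {effective_batch}")
print(f"total optimizer steps = {total_steps}")
```

These values match the "Total batch size = 8 | Total steps = 4" line printed when training starts.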

The `learning_rate=2e-4` controls how big the model's adjustments are during training. Too high, and the model might overshoot optimal solutions; too low, and training becomes unnecessarily slow. 2e-4 (0.0002) represents a moderate learning rate.

The code intelligently chooses between `fp16` and `bf16` (different ways of representing numbers) based on hardware support. This is like choosing the most efficient numerical notation system your calculator supports.

For optimization, it uses `optim="adamw_8bit"`, a memory-efficient version of the AdamW optimizer. `weight_decay=0.01` adds a small penalty to large weights, helping prevent overfitting - like encouraging students to find simple solutions rather than memorizing complex ones.

The `lr_scheduler_type="linear"` setting decreases the learning rate linearly toward zero once the warmup phase is over. This allows for larger learning steps early in training when the model is far from optimal, and smaller, more precise adjustments later.
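The warmup and linear decay can be written out as a small function. This is a sketch of the commonly used linear warmup/decay rule, not a copy of the trainer's scheduler; `linear_warmup_decay` is an illustrative name, and `total_steps=4` is taken from the training log. The exact step at which the trainer applies each value may be offset by one.

```python
def linear_warmup_decay(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at a given optimizer step (illustrative helper)."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from base_lr to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

schedule = [linear_warmup_decay(s, 2e-4, warmup_steps=2, total_steps=4) for s in range(5)]
for step, lr in enumerate(schedule):
    print(f"step {step}: lr = {lr:.1e}")
```

With so few steps, half of the run is warmup and the other half is decay, so the full learning rate of 2e-4 is reached only briefly.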

The code includes several monitoring and saving mechanisms:
- `logging_steps=1` records training progress every step
- `eval_steps=1` tests the model's performance every step
- `save_steps=10` creates checkpoint saves every 10 steps
- `load_best_model_at_end=True` ensures we keep the version that performed best

All results are saved in the "outputs" directory, and `seed=3407` ensures reproducible results by initializing random processes consistently.

This configuration balances training effectiveness with computational efficiency, while maintaining good monitoring and safety measures through regular evaluation and model saving.

In [None]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,          # pass in the training split
    eval_dataset=eval_dataset,            # pass in the eval split
    args=training_args,
    dataset_text_field = "text",
    callbacks=[CustomCallback()]          # our custom callback
)



Map:   0%|          | 0/12 [00:00<?, ? examples/s]

Map:   0%|          | 0/4 [00:00<?, ? examples/s]

This code sets up a Supervised Fine-Tuning (SFT) Trainer, which orchestrates the entire training process for the language model. Let me break down how this trainer works and connects all the pieces we configured earlier.

Think of the SFT Trainer as a conductor coordinating an orchestra. Just as a conductor ensures all musicians play their parts at the right time and volume, the trainer manages how the model learns from data, evaluates its performance, and saves its progress.

The first two parameters connect our main components: the `model` we configured with LoRA settings and the `tokenizer` that converts text into numbers the model can process. These are like the orchestra and the sheet music respectively - they're the essential tools needed to make everything work.

The `train_dataset` and `eval_dataset` provide the actual data the model will learn from and be tested on. If we continue our orchestra metaphor, think of the training dataset as the practice sessions where musicians learn and improve, while the evaluation dataset is like periodic recitals where they demonstrate their progress.

The `args=training_args` parameter links to all those detailed training settings we configured earlier - batch sizes, learning rates, evaluation schedules, and so on. This is similar to having a detailed rehearsal schedule that specifies how long to practice, when to take breaks, and how often to perform.

The `dataset_text_field="text"` parameter tells the trainer which field in our dataset contains the actual text to learn from. This might seem simple, but it's crucial - it's like telling musicians which lines of the sheet music they should be reading.

Finally, the `callbacks=[CustomCallback()]` allows us to hook into the training process and monitor or modify it as it runs. These callbacks can track progress, log information, or even adjust training parameters on the fly. It's similar to having assistants who monitor the orchestra's performance and can signal the conductor if something needs attention.

This trainer setup creates a cohesive system where all our carefully chosen parameters and configurations work together. It will:
1. Load batches of data from our training dataset
2. Feed them through the model
3. Calculate how well the model is doing
4. Update the model's parameters to improve performance
5. Periodically evaluate progress using the evaluation dataset
6. Save checkpoints of the model's state
7. Keep track of all this progress through our custom callback

All of this happens automatically once we start training, with the trainer managing the complexity of coordinating these various components and processes.

In [None]:
# check GPU memory before training
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
5.848 GB of memory reserved.


This code helps us understand how much GPU memory our training process will use. Let me explain what's happening and why this monitoring is important.

First, the code gets information about the GPU (Graphics Processing Unit) at device index 0, the first CUDA device visible to PyTorch. GPUs are specialized processors that excel at the kind of parallel computations needed for machine learning. Think of them like highly efficient assembly lines for mathematical operations.

The line `start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)` records the peak amount of GPU memory that PyTorch's caching allocator has reserved so far in this process, which at this point mostly reflects loading the model. The calculation divides by 1024 three times to convert from bytes to gigabytes (bytes → kilobytes → megabytes → gigabytes), and rounds to 3 decimal places for readability. It's like checking how much space is already occupied in a warehouse before bringing in new inventory.

Similarly, `max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)` determines the total memory capacity of the GPU. This tells us our absolute upper limit - like knowing the maximum capacity of that warehouse.
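The unit conversion used in both lines can be isolated into a tiny helper. This is only a sketch; `bytes_to_gib` is an illustrative name, and the byte counts below are made-up inputs rather than values read from a GPU. Strictly speaking, dividing by 1024 three times yields gibibytes (GiB), which the notebook labels as GB.

```python
def bytes_to_gib(num_bytes: int, ndigits: int = 3) -> float:
    """Convert a byte count to GiB (1024**3 bytes), rounded for display."""
    return round(num_bytes / 1024**3, ndigits)

full_device = bytes_to_gib(16 * 1024**3)   # hypothetical device with exactly 16 GiB
partial = bytes_to_gib(1_500_000_000)      # hypothetical 1.5 billion bytes
print(full_device, partial)
```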

The code then prints two informative messages. The first shows which GPU model we're using and its total memory capacity. The second tells us how much memory is already reserved. This information is crucial because training large language models is very memory-intensive, and running out of GPU memory is a common cause of training failures.

By checking these values before training, we can:
1. Confirm we have enough available memory for our planned training process
2. Establish a baseline to track memory usage during training
3. Identify memory already reserved by earlier steps in this process (memory held by other processes is not included in this figure)
4. Make informed decisions about batch sizes and model configurations

This monitoring step is like doing a pre-flight check before taking off - it helps prevent problems before they occur and gives us important reference points for troubleshooting if issues arise during training.

In [None]:
# start training
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 12 | Num Epochs = 4
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 4
 "-____-"     Number of trainable parameters = 65,011,712


Step,Training Loss,Validation Loss
1,1.5671,1.572786
2,1.4911,1.312598
3,1.2426,1.003361
4,0.8992,0.866799


Unsloth: Not an error, but LlamaForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient


This seemingly simple line of code kicks off the entire training process for our language model. Let me walk you through what happens when this command runs - it's quite fascinating how much complexity unfolds from this single line.

When `trainer.train()` executes, it sets in motion an intricate learning cycle. The trainer first initializes all the necessary components we configured earlier - loading the model onto the GPU, preparing the data batches, and setting up optimization algorithms. Think of it like an automated factory starting up its assembly lines, conveyor belts, and quality control stations.

During training, the process follows a carefully orchestrated sequence for each batch of data:

The trainer loads a small batch of examples (in our case, 2 at a time, as we specified in our training arguments). The text was already converted into token IDs by our tokenizer when the trainer was constructed (the `Map` progress bars above), much like translating a message into a code that the model can understand.

The model then processes this input, making its best attempt to understand and generate appropriate responses based on its current knowledge. Initially, these attempts might be quite poor - like a student making educated guesses on a new topic.

After each attempt, the trainer calculates how far off the model's predictions were from the desired outputs. This generates gradients - mathematical signals that indicate how the model should adjust its internal parameters to improve its performance. However, remember we set gradient_accumulation_steps=4, so these gradients are collected over four batches before being applied.

The optimizer (AdamW 8-bit in our case) uses these accumulated gradients to update the model's parameters. These updates are carefully scaled by our learning rate of 2e-4, ensuring the changes are neither too dramatic nor too subtle.
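The accumulate-then-update pattern described above can be illustrated with a toy example in plain Python. This sketch fits a single weight `w` to the relation `y = 3x` with ordinary gradient descent; it is not how the trainer is implemented, and it simply discards leftover micro-batches at the end of each epoch, which is a simplification.

```python
def train_toy(data, batch_size=2, accum_steps=4, lr=0.1, epochs=4):
    """Toy gradient accumulation: several micro-batches per optimizer update."""
    w = 0.0
    updates = 0
    for _ in range(epochs):
        grad_sum = 0.0
        micro_batches_seen = 0
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error (w*x - y)^2 with respect to w
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            grad_sum += grad / accum_steps      # average across micro-batches
            micro_batches_seen += 1
            if micro_batches_seen % accum_steps == 0:
                w -= lr * grad_sum              # one optimizer step
                grad_sum = 0.0
                updates += 1
    return w, updates

# 12 examples, mirroring the size of the training set in this notebook
toy_data = [(i / 12, 3 * i / 12) for i in range(1, 13)]
w, updates = train_toy(toy_data)
print(f"w = {w:.3f} after {updates} updates")
```

With 12 examples, micro-batches of 2, and accumulation over 4 micro-batches, this loop performs one update per epoch, so four updates in total, the same count as the training log reports.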

Every step, as specified by our logging_steps=1 setting, the trainer records metrics about the model's performance. Every 10 steps (save_steps=10), it saves a checkpoint of the model's state - like taking a snapshot of the model's progress that we can return to if needed. This run has only 4 optimizer steps in total, so that 10-step interval is never reached here.

The trainer also periodically evaluates the model's performance on our evaluation dataset. This testing process helps us track whether the model is truly learning useful patterns or just memorizing the training data.

All of this information gets captured in the `trainer_stats` variable, which will contain detailed metrics about the training process - things like the final loss values, training time, and various performance metrics.

This automated process continues until the model has seen the entire dataset four times (our num_train_epochs=4 setting), gradually refining its understanding with each pass through the data. Thanks to our load_best_model_at_end=True setting, when training finishes the trainer reloads the checkpoint that performed best on our evaluation metric, which requires that such a checkpoint was saved during training.
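The idea of a "best" model can be shown with the validation losses from the table above. This sketch only picks the evaluation step with the lowest loss; it does not reproduce the trainer's checkpoint handling, which can restore only steps that were actually saved.

```python
# (step, validation loss) pairs copied from the training log above
eval_log = [
    (1, 1.572786),
    (2, 1.312598),
    (3, 1.003361),
    (4, 0.866799),
]

# Lower validation loss is better
best_step, best_loss = min(eval_log, key=lambda row: row[1])
print(f"best step = {best_step}, validation loss = {best_loss}")
```

In this run the loss was still falling at the final step, so the last model is also the best one seen.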

In [None]:
# check GPU memory after training
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

44.6089 seconds used for training.
0.74 minutes used for training.
Peak reserved memory = 6.664 GB.
Peak reserved memory for training = 0.816 GB.
Peak reserved memory % of max memory = 45.207 %.
Peak reserved memory for training % of max memory = 5.536 %.


```python
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
```
This line records the peak amount of GPU memory reserved up to the end of training. The `torch.cuda.max_memory_reserved()` function returns the maximum amount of memory reserved by PyTorch's caching allocator on the GPU since the process started (or since the peak statistics were last reset). We're dividing this value by 1024 three times to convert it from bytes to gigabytes (GB).

```python
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
```
This line estimates the additional GPU memory needed for LoRA (Low-Rank Adaptation) training. Subtracting the pre-training peak (`start_gpu_memory`) from the post-training peak gives the extra memory reserved during training, which covers activations, gradients, and optimizer state as well as the LoRA adapters themselves.

```python
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
```
These lines calculate the percentage of maximum GPU memory used for training. We're dividing the total peak memory and the training-only memory by the maximum available memory (`max_memory`) and multiplying by 100 to get the percentages.


```python
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime'] / 60, 2)} minutes used for training.")
```
These lines print out the total time taken for training in both seconds and minutes.

```python
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
```
These lines print out the calculated memory metrics:

* Total peak reserved memory
* Peak reserved memory added by LoRA training
* Percentage of maximum memory used
* Percentage of maximum memory added by LoRA training

By analyzing these metrics, we can understand how efficient our training process is in terms of GPU memory usage and identify potential areas for optimization.
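The arithmetic behind these figures can be replayed without a GPU by plugging in the numbers printed above. This is just a worked example using the logged values, not a measurement.

```python
# Values copied from the cell outputs in this notebook (GB)
max_memory = 14.741        # total memory of the Tesla T4
start_gpu_memory = 5.848   # peak reserved before training
used_memory = 6.664        # peak reserved after training

used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)

print(f"Extra memory reserved during training = {used_memory_for_lora} GB")
print(f"Peak share of total memory = {used_percentage} %")
print(f"Training share of total memory = {lora_percentage} %")
```

Most of the reserved memory is taken up by the model that was already loaded; the training run itself added under 1 GB on top.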

# Compare

In [None]:
# text to summarize
muh_text = """
          The Markov decision process (MDP) is an effective mathematical tool for modeling sequen-tial
          decision-making problems in stochastic  environments (Derman, 1970; Puterman, 1994).Solving  an  MDP problem  entails  finding  an  optimal
            policy  that  maximizes  a  cumulativereward according to a given criterion.  However, due to reasons including non-stationarityof the environment, modeling error,
         exogenous perturbation, partial observability, and ad-versarial attacks, a model mismatch exists between the assumed MDP model and the
          un-derlying environment and can result in solution policies with poor performance.  To solvethis  issue,  a  framework  of  robust
            MDPs,  e.g.,  (Bagnell  et  al.,  2001;  Nilim  &  El  Ghaoui,2004;  Iyengar,  2005),  has  been  proposed.   Rather  than  adopting
            a  fixed  MDP  model,  inrobust MDP, one seeks to optimize the worst-case performance over an uncertainty set ofpossible MDP models.
            The solution provides performance guarantees for all MDP modelsin the uncertainty set, and is thus robust to the model mismatch
            """

In [None]:
# # raw model

# model_raw, tokenizer_raw = FastLanguageModel.from_pretrained(
#     model_name = CFG.model,
#     max_seq_length = CFG.max_seq_length,
#     dtype = CFG.dtype,
#     load_in_4bit = CFG.load_in_4bit,
# )

# FastLanguageModel.for_inference(model_raw)
# inputs = tokenizer_raw(
# [ alpaca_prompt.format( "Summarize the given text",
#         muh_text,  EOS_TOKEN,  "",  )
# ], return_tensors = "pt").to(CFG.device)

# outputs = model_raw.generate(**inputs, max_new_tokens = CFG.max_seq_length, use_cache = True)
# response_raw = extract_response(tokenizer_raw.batch_decode(outputs)[0])

In [None]:
# generated earlier -> on the free tier with a T4 there is not enough memory to run two models at the same time
response_raw =  """
  The Markov decision process (MDP) is a mathematical tool used to model sequential decision-making problems in stochastic environments. Solving an  MDP problem involves finding an optimal policy that maximizes a cumulative reward according to a given criterion. However, due to various reasons such as non-stationarity of the environment, modeling error, exogenous perturbation, partial observability, and adversarial attacks, a model  mismatch often exists between the assumed MDP model and the underlying environment, leading to poor performance of the solution policy. To address this issue, a framework of robust MDPs has been proposed, where instead of adopting a fixed MDP model, one seeks to optimize the worst-case performance  over an uncertainty set of possible MDP models. This approach provides performance guarantees for all MDP models in the uncertainty set, making it robust to the model mismatch.
"""

In [None]:
# tuned model
FastLanguageModel.for_inference(model)
inputs = tokenizer(
[ alpaca_prompt.format( "Summarize the given text",
        muh_text,  EOS_TOKEN,  "",  )
], return_tensors = "pt").to(CFG.device)

outputs = model.generate(**inputs, max_new_tokens = CFG.max_seq_length, use_cache = True)
response_tuned = extract_response(tokenizer.batch_decode(outputs)[0])

In this section, we're using the tuned model to generate a response based on the input prompt. We'll be going through the following steps:

```python
FastLanguageModel.for_inference(model)
```
This line switches the tuned model into Unsloth's inference mode, which enables its faster text-generation path. The notebook calls it once, right after training, before generating any text.


```python
inputs = tokenizer(
    [alpaca_prompt.format("Summarize the given text", muh_text, EOS_TOKEN, "", )],
    return_tensors="pt"
).to(CFG.device)
```
This line tokenizes the input prompt using the `tokenizer`. The input prompt is a formatted string that includes:

* A task description ("Summarize the given text")
* The text to be summarized (`muh_text`)
* An end-of-sequence token (`EOS_TOKEN`)
* Additional context (empty in this case)

The `return_tensors="pt"` argument tells the tokenizer to return PyTorch tensors, which are then moved to the device specified by `CFG.device` (e.g., a GPU).

```python
outputs = model.generate(**inputs, max_new_tokens=CFG.max_seq_length, use_cache=True)
```
This line uses the tuned model to generate output based on the tokenized input prompt. The `generate` method takes in several arguments:

* `**inputs`: the tokenized input prompt
* `max_new_tokens`: the maximum number of new tokens to generate (set to `CFG.max_seq_length`)
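* `use_cache`: whether to reuse the attention key/value cache between decoding steps so earlier tokens are not recomputed (set to `True`)

The model generates output based on the input prompt and returns it as a tensor of token IDs. For a causal language model like this one, that tensor contains the prompt tokens followed by the newly generated ones.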


```python
response_tuned = extract_response(tokenizer.batch_decode(outputs)[0])
```
This line extracts the generated response from the output tensor. The `batch_decode` method is used to decode the output tensor into a string, and then the `extract_response` function is applied to extract the relevant part of the response.

The resulting `response_tuned` variable contains the final response generated by the tuned model based on the input prompt.
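The `extract_response` helper is defined earlier in the notebook and its body is not shown in this section. As a purely hypothetical illustration of what such a step can look like, the sketch below keeps the text after an Alpaca-style `### Response:` marker and drops the end-of-sequence token. The function name `extract_after_marker`, the marker string, and the `</s>` token are all assumptions made for this example.

```python
def extract_after_marker(decoded: str,
                         marker: str = "### Response:",
                         eos_token: str = "</s>") -> str:
    """Hypothetical helper: return the text following the last marker."""
    # rpartition returns the whole string as the tail if the marker is absent
    _, _, tail = decoded.rpartition(marker)
    return tail.replace(eos_token, "").strip()

# Made-up decoded output: the prompt followed by the generated answer
decoded = (
    "### Instruction:\nSummarize the given text\n\n"
    "### Input:\nSome long passage.\n\n"
    "### Response:\nA short summary.</s>"
)
print(extract_after_marker(decoded))
```

A step like this is needed because the decoded string contains the prompt as well as the model's answer.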

In [None]:
pp(response_raw)

('\n'
 '  The Markov decision process (MDP) is a mathematical tool used to model '
 'sequential decision-making problems in stochastic environments. Solving an  '
 'MDP problem involves finding an optimal policy that maximizes a cumulative '
 'reward according to a given criterion. However, due to various reasons such '
 'as non-stationarity of the environment, modeling error, exogenous '
 'perturbation, partial observability, and adversarial attacks, a model  '
 'mismatch often exists between the assumed MDP model and the underlying '
 'environment, leading to poor performance of the solution policy. To address '
 'this issue, a framework of robust MDPs has been proposed, where instead of '
 'adopting a fixed MDP model, one seeks to optimize the worst-case '
 'performance  over an uncertainty set of possible MDP models. This approach '
 'provides performance guarantees for all MDP models in the uncertainty set, '
 'making it robust to the model mismatch.\n')


In [None]:
pp(response_tuned)

('The given text discusses the Markov decision process (MDP) and its '
 'limitations in modeling sequential decision-making problems in stochastic '
 'environments. It highlights the issue of model mismatch due to factors such '
 'as non-stationarity, modeling error, and external perturbations. To address '
 'this, the text introduces the concept of robust MDPs, which aim to optimize '
 'the worst-case performance across an uncertainty set of possible MDP models. '
 'This approach provides performance guarantees for all MDP models within the '
 'uncertainty set, making it robust to model mismatch.')
