<a href="https://colab.research.google.com/github/hussamalafandi/Generative_AI/blob/main/notebooks/08/08_training_and_fine-tuning_LLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# From Pre-training to Fine-tuning: Practical Guide for Transformers

## Learning Objectives:

By the end of this notebook, students will:

* Understand the differences and purposes of pre-training, supervised fine-tuning (SFT), and preference-based training (DPO).
* Be able to fine-tune small, efficient transformer models (SmolLM2-135M) on practical datasets.
* Evaluate fine-tuned models quantitatively (perplexity/loss) and qualitatively (generation quality).
* Get introduced to high-performance libraries (Unsloth) for efficient transformer fine-tuning.

## Introduction and Concepts

<div style="text-align: center;">
    <img src="https://images.ctfassets.net/kftzwdyauwt9/40in10B8KtAGrQvwRv5cop/8241bb17c283dced48ea034a41d7464a/chatgpt_diagram_light.png?w=2048&q=80&fm=webp" alt="ChatGPT training phases" width="1200" />
</div>

Image source: [openai.com](https://openai.com/index/chatgpt/)

### What is Pre-training?

**Pre-training** is a process in which language models are initially trained on large-scale datasets using **self-supervised learning**. In simple terms, these models learn directly from the text itself without explicit labels provided by humans.

Common pre-training tasks include:

* **Causal Language Modeling (CLM):** Predicting the next word/token based on previous context.

* **Masked Language Modeling (MLM):** Predicting hidden (masked) words in a sentence (e.g., used by BERT).

Through pre-training, models capture fundamental language understanding, grammar, reasoning, and context comprehension. This general knowledge becomes the basis for further specialization through fine-tuning.
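To make the two objectives concrete, here is a minimal pure-Python sketch of how training targets are derived from raw text in each case. The helper names `clm_pairs` and `mlm_example`, the word-level tokens, and the `[MASK]` placeholder are illustrative assumptions, not part of any library; real models work on subword token IDs.

```python
def clm_pairs(tokens):
    """Causal LM: each prefix of the sequence is used to predict the next token."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def mlm_example(tokens, masked_positions, mask_token="[MASK]"):
    """Masked LM: hide the chosen positions and predict only those."""
    inputs = [mask_token if i in masked_positions else tok
              for i, tok in enumerate(tokens)]
    # None marks positions that do not contribute to the loss.
    labels = [tok if i in masked_positions else None
              for i, tok in enumerate(tokens)]
    return inputs, labels


tokens = ["The", "cat", "sat", "on", "the", "mat"]

# CLM: the labels come from the text itself, shifted by one position.
for context, target in clm_pairs(tokens):
    print(f"context={context} -> predict {target!r}")

# MLM: the positions to mask are fixed here so the example is deterministic.
mlm_inputs, mlm_labels = mlm_example(tokens, masked_positions={1, 4})
print(mlm_inputs)
print(mlm_labels)
```

In both cases no human annotation is needed: the targets are simply pieces of the original text, which is what "self-supervised" refers to.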

#### SmolLM2 (Small and Efficient Model)

In this course, we'll use SmolLM2-135M, a compact transformer-based language model with roughly 135 million parameters. Its small size lets it train and run on modest hardware, which makes it well suited to teaching, prototyping, and limited environments such as the GPUs available via Colab.

### What is Supervised Fine-Tuning (SFT)?

Although pre-trained models have broad language capabilities, they often lack task-specific accuracy. **Supervised Fine-Tuning (SFT)** bridges this gap, refining the general knowledge learned during pre-training by training the model further on task-specific data with explicit labels or prompts.

Through fine-tuning, models become adept at specialized tasks, such as:

* Conversational assistants
* Text summarization
* Sentiment analysis
* Domain-specific language generation

#### SmolTalk Dataset

In this notebook, we'll perform supervised fine-tuning using **SmolTalk**, a synthetic conversational dataset released alongside SmolLM2 for supervised fine-tuning. It is organized into subsets that target different skills (rewriting is the one used here), and each example is a conversation made of system, user, and assistant messages. Training on it teaches our SmolLM2 base model to follow instructions and respond in a dialogue format.

### Preference-based Fine-tuning (DPO)

Beyond supervised fine-tuning, recent approaches also incorporate preference-based fine-tuning such as **Direct Preference Optimization (DPO)**. Instead of optimizing purely for next-token prediction on reference answers, DPO trains on preference pairs: for each prompt, one response is marked as preferred (chosen) and another as rejected, and the model is updated to make the chosen response more likely relative to the rejected one.

This method ensures models not only become task-specific but also align better with human preferences, improving their real-world usability, helpfulness, and alignment with ethical guidelines.
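As a rough illustration of the idea, the sketch below computes the DPO loss for a single preference pair in plain Python. It assumes we already have the summed log-probabilities of the chosen and rejected responses under both the model being trained (the policy) and a frozen reference model; the function names, the toy log-probabilities, and `beta=0.1` are illustrative assumptions rather than values from this notebook.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    # The loss shrinks as the policy favors the chosen response more strongly.
    return -log_sigmoid(beta * (chosen_shift - rejected_shift))


# Hypothetical log-probabilities, for illustration only.
untrained = dpo_loss(-12.0, -15.0, -12.0, -15.0)   # policy == reference
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)    # policy prefers "chosen"

print(f"loss when policy equals reference: {untrained:.4f}")
print(f"loss after shifting toward chosen: {improved:.4f}")
```

When the policy is identical to the reference, the loss equals `log(2)`; it decreases as the policy assigns relatively more probability to the preferred response.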

#### UltraFeedback Dataset (Introduction)

Later in our course, we will explore DPO practically using the **UltraFeedback dataset**, which contains prompts paired with several model-generated responses and preference ratings for those responses. Note that the ratings in UltraFeedback were produced by GPT-4 rather than by human annotators, so they act as a proxy for human preferences. Training on such data helps models like SmolLM2 adapt to preferred conversational styles and generate more helpful and appropriate responses.



<div style="text-align: center;">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/RvHjdlRT5gGQt5mJuhXH9.png" alt="SmolLM2 Ecosystem" width="1200" />
</div>

Image source: [huggingface.co](https://huggingface.co/HuggingFaceTB)

## Load the [smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) Dataset

In this notebook, we will load a subset of the `smoltalk` dataset. You can browse all available subsets [here](https://huggingface.co/datasets/HuggingFaceTB/smoltalk).  

For our purposes, we will use the **`smol-rewrite`** subset, which is designed for text rewriting tasks. This subset is ideal for training models focused on rewriting capabilities. To keep training time short, the cell below loads only the first 10% of the train and test splits.  

Feel free to experiment with other subsets to explore different tasks supported by the dataset.

In [1]:
from datasets import load_dataset
from pprint import pprint

train_dataset = load_dataset("HuggingFaceTB/smoltalk", "smol-rewrite", split="train[:10%]")
eval_dataset = load_dataset("HuggingFaceTB/smoltalk", "smol-rewrite", split="test[:10%]")

pprint(train_dataset[0]['messages'])

[{'content': "You're an AI assistant for text re-writing. Rewrite the input "
             'text to make it more concise while preserving its core meaning.',
  'role': 'system'},
 {'content': 'Hey Alex,\n'
             '\n'
             "I hope you're doing well! It's been a while since we met at the "
             'film festival last year. I was the one with the short film about '
             "the old abandoned factory. Anyway, I'm reaching out because I'm "
             'currently working on my thesis film project and I could really '
             'use some advice on cinematography. I remember our conversation '
             'about visual storytelling and I was hoping you might have some '
             'tips or insights to share.\n'
             '\n'
             'My film is a drama set in a small town, and I want to capture '
             'the mood and atmosphere of the location through my '
             "cinematography. I'm planning to shoot on location next month, "
             

## Load the SmolLM2 Base Model and Tokenizer

For supervised fine-tuning (SFT), we load the SmolLM2 base model together with the tokenizer it was pre-trained with, so that the token IDs match the model's embedding table.  

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

In the next cell, we attach a chat template to the tokenizer and resize the model's embeddings to accommodate the newly introduced special tokens.

In [4]:
from trl import setup_chat_format

model, tokenizer = setup_chat_format(model, tokenizer)

The loaded dataset contains multi-turn conversations with clearly defined roles (e.g., user, assistant). Before passing this data to the language model, we need to format each conversation into a single coherent text sequence.

This is where a **chat template** comes in: it defines the structure and formatting rules for representing conversations consistently. Different model families use different templates, so the template applied at inference time must match the one used during fine-tuning. In this notebook, we use the ChatML-style template that `setup_chat_format` attached to the tokenizer above, which wraps each turn in `<|im_start|>` and `<|im_end|>` markers.
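To show what such a template does, here is a hand-written approximation of ChatML formatting in plain Python. The function `render_chatml` is a hypothetical helper written for illustration; in practice the tokenizer's `apply_chat_template` method does this work using a Jinja template stored with the tokenizer.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Flatten a list of {'role', 'content'} messages into one ChatML string."""
    text = "".join(
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        text += "<|im_start|>assistant\n"
    return text


# A made-up two-message conversation.
conversation = [
    {"role": "system", "content": "You rewrite text concisely."},
    {"role": "user", "content": "Hello there, how are you doing today?"},
]

print(render_chatml(conversation, add_generation_prompt=True))
```

During training the full conversation, including the assistant turns, is rendered. At inference time `add_generation_prompt=True` leaves an open assistant turn for the model to complete.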


In [7]:
# Using tokenizer's apply_chat_template method (if provided)
# We'll format the example as an instruction-response pair
chat_example = train_dataset[0]['messages']

# Check the default chat template (optional, but useful for inspection)
formatted_input = tokenizer.apply_chat_template(chat_example, tokenize=False)
print("\nFormatted input with chat template:\n", formatted_input)



Formatted input with chat template:
 <|im_start|>system
You're an AI assistant for text re-writing. Rewrite the input text to make it more concise while preserving its core meaning.<|im_end|>
<|im_start|>user
Hey Alex,

I hope you're doing well! It's been a while since we met at the film festival last year. I was the one with the short film about the old abandoned factory. Anyway, I'm reaching out because I'm currently working on my thesis film project and I could really use some advice on cinematography. I remember our conversation about visual storytelling and I was hoping you might have some tips or insights to share.

My film is a drama set in a small town, and I want to capture the mood and atmosphere of the location through my cinematography. I'm planning to shoot on location next month, but I'm still trying to figure out the best way to approach it. If you have any suggestions or resources you could point me to, I would be incredibly grateful.

Also, I heard from a mutual frien

## Apply Chat Template and Tokenize the Dataset

Next, we'll tokenize the dataset using the selected chat template. To do this, we'll define a tokenization function and apply it across the dataset.

In [12]:
def tokenize_with_chat_template(example):
    # Use the tokenizer's apply_chat_template method to format the input
    formatted_input = tokenizer.apply_chat_template(example['messages'], tokenize=False)

    # Tokenize the formatted input
    tokenized_input = tokenizer(formatted_input, truncation=True)

    return tokenized_input

# Tokenize the entire dataset using the custom function
tokenized_dataset = train_dataset.map(
    tokenize_with_chat_template,
    batched=True,
    remove_columns=train_dataset.column_names
)

Map:   0%|          | 0/5334 [00:00<?, ? examples/s]
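Because the call above uses `batched=True`, the tokenization function receives a dictionary of columns in which `example['messages']` is a list of conversations, not a single conversation. The following simplified pure-Python sketch mimics that calling convention; `map_batched` and `count_turns` are hypothetical helpers, and the real `Dataset.map` does considerably more (caching, Arrow storage, multiprocessing).

```python
def to_batch(rows):
    """Turn a list of row dicts into a dict of column lists."""
    return {key: [row[key] for row in rows] for key in rows[0]}


def map_batched(rows, fn, batch_size=1000):
    """Simplified stand-in for a batched map over a list of row dicts."""
    out = []
    for start in range(0, len(rows), batch_size):
        result = fn(to_batch(rows[start:start + batch_size]))
        columns = list(result)
        # Split the returned columns back into individual rows.
        for values in zip(*(result[column] for column in columns)):
            out.append(dict(zip(columns, values)))
    return out


def count_turns(batch):
    # batch["messages"] is a list of conversations, one per example.
    return {"num_turns": [len(conversation) for conversation in batch["messages"]]}


# Made-up rows that imitate the shape of the dataset.
rows = [
    {"messages": [{"role": "user", "content": "a"},
                  {"role": "assistant", "content": "b"}]},
    {"messages": [{"role": "user", "content": "c"}]},
    {"messages": [{"role": "system", "content": "d"},
                  {"role": "user", "content": "e"},
                  {"role": "assistant", "content": "f"}]},
]

mapped = map_batched(rows, count_turns, batch_size=2)
print(mapped)
```

The function passed to a batched map must therefore return lists whose length matches the number of examples in the batch.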

In [13]:
# Decode back for sanity check
decoded_sample = tokenizer.decode(tokenized_dataset[0]['input_ids'])
print("\nDecoded sample:\n", decoded_sample)



Decoded sample:
 <|im_start|>system
You're an AI assistant for text re-writing. Rewrite the input text to make it more concise while preserving its core meaning.<|im_end|>
<|im_start|>user
Hey Alex,

I hope you're doing well! It's been a while since we met at the film festival last year. I was the one with the short film about the old abandoned factory. Anyway, I'm reaching out because I'm currently working on my thesis film project and I could really use some advice on cinematography. I remember our conversation about visual storytelling and I was hoping you might have some tips or insights to share.

My film is a drama set in a small town, and I want to capture the mood and atmosphere of the location through my cinematography. I'm planning to shoot on location next month, but I'm still trying to figure out the best way to approach it. If you have any suggestions or resources you could point me to, I would be incredibly grateful.

Also, I heard from a mutual friend that you're having

## Configure and Initialize the Trainer

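With our model and chat template ready, we can now configure the training setup.  
We'll use `SFTTrainer` from the `trl` library to handle supervised fine-tuning (SFT).  
Below, we define the training arguments and initialize the trainer. Note that the trainer receives the raw `train_dataset` and `eval_dataset`: for conversational data with a `messages` column, `SFTTrainer` applies the chat template and tokenizes internally, so the manual tokenization above mainly serves as an inspection step.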

In [None]:
from trl import SFTTrainer, SFTConfig

training_args = SFTConfig(
    output_dir="./smollm2-sft-results",   # Output directory for model checkpoints
    num_train_epochs=3,                   # Number of epochs
    learning_rate=3e-5,                   # Learning rate (5e-5 is also worth trying)
    logging_steps=10,                     # How often to log training progress
    save_steps=100,                       # How often to save checkpoints
    per_device_train_batch_size=4,        # Set according to your GPU memory capacity
    per_device_eval_batch_size=8,
    eval_strategy="steps",                # Evaluate the model at regular intervals
    eval_steps=50,                        # Frequency of evaluation
)

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
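It can help to estimate how long this configuration will run before starting it. The back-of-the-envelope sketch below derives the number of optimizer steps from the values used in this notebook (5334 training examples as shown by the map progress bar, batch size 4, 3 epochs). It assumes a single GPU and no gradient accumulation, and the exact counts reported by the trainer may differ slightly.

```python
import math

num_examples = 5334            # size of the 10% training subset
per_device_batch_size = 4      # per_device_train_batch_size
num_epochs = 3                 # num_train_epochs
eval_every = 50                # eval_steps
save_every = 100               # save_steps

# Assumption: one device and gradient_accumulation_steps left at 1.
steps_per_epoch = math.ceil(num_examples / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs

print(f"steps per epoch:      {steps_per_epoch}")
print(f"total training steps: {total_steps}")
print(f"evaluation runs:      {total_steps // eval_every}")
print(f"periodic checkpoints: {total_steps // save_every}")
```

If training takes too long on your hardware, reducing `num_train_epochs` or loading a smaller slice of the dataset shrinks these numbers proportionally.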


In [None]:
trainer.train()

Step,Training Loss,Validation Loss
50,0.9788,1.319638
100,1.1973,1.244419
150,1.1806,1.215166
200,1.1742,1.196922
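The validation losses in the log above can be converted to perplexity, one of the quantitative metrics named in the learning objectives. The sketch below assumes the reported loss is the mean per-token cross-entropy in nats, in which case perplexity is simply its exponential.

```python
import math

# Validation losses copied from the training log above (step -> loss).
validation_loss = {50: 1.319638, 100: 1.244419, 150: 1.215166, 200: 1.196922}

# Perplexity is the exponential of the mean per-token cross-entropy.
perplexity = {step: math.exp(loss) for step, loss in validation_loss.items()}

for step, ppl in perplexity.items():
    print(f"step {step:>3}: loss={validation_loss[step]:.4f}  perplexity={ppl:.2f}")
```

A perplexity of `p` can be read as the model being, on average, about as uncertain as if it were choosing uniformly among `p` tokens at each position, so lower is better.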


## Evaluate the Model

After training, we can evaluate our fine-tuned model by providing it with input messages from the evaluation dataset.  
We can then compare its responses to those generated by the base model to assess improvements in performance.

In [19]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("hussamalafandi/smollm2-sft-rewrite", device_map="auto") # change to "HuggingFaceTB/SmolLM2-135M" to compare with the base model
tokenizer = AutoTokenizer.from_pretrained("hussamalafandi/smollm2-sft-rewrite")  # tokenizers run on the CPU, so no device_map is needed

prompt = [{'content': "You're an AI assistant for text re-writing. Rewrite the input "
             'text to make it more concise while preserving its core meaning.',
  'role': 'system'},
 {'content': 'Hey Alex,\n'
             '\n'
             "I hope you're doing well! It's been a while since we met at the "
             'film festival last year. I was the one with the short film about '
             "the old abandoned factory. Anyway, I'm reaching out because I'm "
             'currently working on my thesis film project and I could really '
             'use some advice on cinematography. I remember our conversation '
             'about visual storytelling and I was hoping you might have some '
             'tips or insights to share.\n'
             '\n'
             'My film is a drama set in a small town, and I want to capture '
             'the mood and atmosphere of the location through my '
             "cinematography. I'm planning to shoot on location next month, "
             "but I'm still trying to figure out the best way to approach it. "
             'If you have any suggestions or resources you could point me to, '
             'I would be incredibly grateful.\n'
             '\n'
             "Also, I heard from a mutual friend that you're having a "
             'photography exhibition soon. Congratulations! I would love to '
             "attend if you don't mind sending me the details.\n"
             '\n'
             'Thanks in advance for any help you can provide. I really '
             'appreciate it.\n'
             '\n'
             'Best,\n'
             'Jordan',
  'role': 'user'}]

chat_template_prompt = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)

tokenized_prompt = tokenizer(chat_template_prompt, return_tensors="pt")

output = model.generate(
    input_ids=tokenized_prompt['input_ids'].to(model.device),            # follow wherever device_map placed the model
    attention_mask=tokenized_prompt['attention_mask'].to(model.device),
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    num_return_sequences=1,
)
decoded_output = tokenizer.decode(output[0])

print("\nDecoded output:\n", decoded_output)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.



Decoded output:
 <|im_start|>system
You're an AI assistant for text re-writing. Rewrite the input text to make it more concise while preserving its core meaning.<|im_end|>
<|im_start|>user
Hey Alex,

I hope you're doing well! It's been a while since we met at the film festival last year. I was the one with the short film about the old abandoned factory. Anyway, I'm reaching out because I'm currently working on my thesis film project and I could really use some advice on cinematography. I remember our conversation about visual storytelling and I was hoping you might have some tips or insights to share.

My film is a drama set in a small town, and I want to capture the mood and atmosphere of the location through my cinematography. I'm planning to shoot on location next month, but I'm still trying to figure out the best way to approach it. If you have any suggestions or resources you could point me to, I would be incredibly grateful.

Also, I heard from a mutual friend that you're having
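The `generate` call above samples with `temperature`, `top_k`, and `top_p`. The following pure-Python sketch shows, on a toy four-token vocabulary, how these three settings reshape the next-token distribution before a token is drawn. The function `filter_distribution` and the logits are made up for illustration, and the library's own implementation differs in detail.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_distribution(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k, and top-p."""
    # 1) Temperature below 1 sharpens the distribution, above 1 flattens it.
    probs = softmax([x / temperature for x in logits])

    # 2) Top-k keeps only the k most likely tokens, then renormalizes.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in ranked)
    ranked_probs = [probs[i] / mass for i in ranked]

    # 3) Top-p keeps the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for index, p in zip(ranked, ranked_probs):
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break

    mass = sum(p for _, p in kept)
    return {index: p / mass for index, p in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]   # hypothetical scores for four tokens
dist = filter_distribution(toy_logits, temperature=0.7, top_k=3, top_p=0.9)
print(dist)
```

Lowering `temperature`, `top_k`, or `top_p` makes generation more deterministic, while raising them increases diversity at the cost of more off-topic or repetitive continuations.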

### Using a Pipeline

Instead of loading the model and tokenizer separately, you can streamline the process by using the `pipeline` utility from the Hugging Face Transformers library.

The typical steps—applying a chat template to your prompt, tokenizing the input, passing it through the model, and decoding the output—are handled internally by the pipeline. This simplifies your workflow and reduces the potential for errors.

> **Note:** This approach only works if your trained model is compatible with the Hugging Face Transformers library.


In [24]:
from transformers import pipeline

prompt = [
    {'content': "You're an AI assistant for text re-writing. Rewrite the input "
     'text to make it more concise while preserving its core meaning.',
     'role': 'system'},

    {'content': 'Hey Alex,\n'
     '\n'
     "I hope you're doing well! It's been a while since we met at the "
     'film festival last year. I was the one with the short film about '
     "the old abandoned factory. Anyway, I'm reaching out because I'm "
     'currently working on my thesis film project and I could really '
     'use some advice on cinematography. I remember our conversation '
     'about visual storytelling and I was hoping you might have some '
     'tips or insights to share.\n'
     '\n'
     'My film is a drama set in a small town, and I want to capture '
     'the mood and atmosphere of the location through my '
     "cinematography. I'm planning to shoot on location next month, "
     "but I'm still trying to figure out the best way to approach it. "
     'If you have any suggestions or resources you could point me to, '
     'I would be incredibly grateful.\n'
     '\n'
     "Also, I heard from a mutual friend that you're having a "
     'photography exhibition soon. Congratulations! I would love to '
     "attend if you don't mind sending me the details.\n"
     '\n'
     'Thanks in advance for any help you can provide. I really '
     'appreciate it.\n'
     '\n'
     'Best,\n'
     'Jordan',
     'role': 'user'}]

generator = pipeline("text-generation", model="hussamalafandi/smollm2-sft-rewrite", device="cuda")

output = generator(prompt, max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])

Device set to use cuda


Hey Alex,

Hope you're well! I'm working on a thesis film and need some advice on cinematography. I remember our conversation about visual storytelling and want to shoot on location next month. Any tips or resources would be great.

Also, I heard from a mutual friend that you're having an exhibition next month. Congratulations! I'll be there.

Thanks a lot!

Best,
Jordan
assistant
Hey Alex,

Hope you're well! I'm working on a thesis film and need some advice on cinematography. I remember our conversation about visual storytelling


## Resources

[HuggingFace LLM Course](https://huggingface.co/learn/llm-course)