# 🧠 Week 5: Supervised Finetuning (SFT) - I
**Theme:** Teaching LLMs to Follow Instructions  
**Project:** LoRA vs. Full Finetuning with Hugging Face, DeepSpeed, and TRL


## 📘 1. What is Supervised Finetuning (SFT)?
SFT = Pretrained model + Instruction-following data → Task-specific model

**Supervised Fine-Tuning (SFT)** is the process of further training a pre-trained language model on a labeled dataset to specialize it for specific tasks or domains.

**Key points:**
- Builds on top of a pretrained model like LLaMA, GPT, or Mistral.
- Uses instruction-response pairs (like question-answer).
- Enhances instruction-following ability.
- It's a middle stage between pretraining and alignment (e.g., RLHF).

**SFT Pipeline:**
1. Pretrained Model
2. Supervised Dataset
3. Fine-tuned Instruction Model


**Example:**
| Before SFT                                                  | After SFT                         |
|-------------------------------------------------------------|-----------------------------------|
| Continues the prompt or gives generic, off-target responses | Follows user instructions clearly |




## 📊 2. How to Get SFT Data

**4 Types of Data Sources:**
1. **Manual Curation**: Human-created prompts and responses.
2. **AI-Generated**: Use an LLM (e.g., a GPT model) to generate instruction data.
3. **Open Datasets**: Alpaca, OASST1, Dolly, HH-RLHF, etc.
4. **Data Augmentation**: Rephrasing, adding context, changing perspective.

**Goal**: Create high-quality, diverse, and instruction-aligned examples.

Here we use the second approach and generate data with the OpenAI API.


```python
import json
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # read OPENAI_API_KEY from a local .env file, if present


def get_ai_generated_data():
    """Ask an OpenAI chat model for interview Q&A pairs and return a list of dicts."""
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        print("⚠️ No OpenAI API key - using placeholder data")
        return [{"question": "What are your technical skills?", "answer": "Python, data analysis"}]

    client = OpenAI(api_key=api_key)
    prompt = (
        "Create 2 interview Q&A pairs for a software developer. "
        'Output only a JSON list of objects with "question" and "answer" keys.'
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    content = response.choices[0].message.content or ""

    # Strip a Markdown code fence if the model wrapped its JSON in one
    if "```json" in content:
        content = content.split("```json")[1].split("```")[0]
    elif "```" in content:
        content = content.split("```")[1]

    try:
        data = json.loads(content)
    except json.JSONDecodeError as e:
        print("❌ Failed to parse JSON:", e)
        print("Raw content:", content)
        return []

    if isinstance(data, list):
        return data
    if isinstance(data, dict):
        # Some responses wrap the list in an object; return the first list value found
        return next((v for v in data.values() if isinstance(v, list)), [])
    print("⚠️ Unexpected JSON format:", type(data))
    return []


# Example call
get_ai_generated_data()
```


Output (truncated in the source):
```
[{'question': 'Can you explain the difference between synchronous and asynchronous programming?',
  'answer': 'Synchronous programming executes tasks one after the other, meaning each task must complete before the next one starts. In contrast, asynchronous programming allows tasks to run concurrently, enabling the program to initiate a task and move on to the next one without waiting for the previous task to finish. This is particularly useful in scenarios such as web development where a user interface needs to remain responsive while waiting for data to load.'},
 {'question': 'What are some common design patterns you have used in your projects?',
  'answer': "Some common design patterns I have used include the Singleton pattern, which ensures a class has only one instance and provides a global point of access to it, and the Observer pattern, which defines a one-to-many dependency between objects so that when one object changes state, all its dependents are notified and updated automat
```
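Generated pairs usually need a light cleaning pass before training. The sketch below is one possible way to do that, not the course's own pipeline: it drops incomplete pairs, removes repeated questions, and emits records with a `text` field in the `Human:` / `Assistant:` layout used later in this notebook. The helper name `to_sft_records` and the sample pairs are hypothetical.

```python
def to_sft_records(pairs):
    """Turn question/answer dicts into de-duplicated {"text": ...} records."""
    seen = set()
    records = []
    for pair in pairs:
        question = str(pair.get("question", "")).strip()
        answer = str(pair.get("answer", "")).strip()
        if not question or not answer:
            continue  # skip incomplete pairs
        # Normalize case and whitespace so near-identical questions collapse
        key = " ".join(question.lower().split())
        if key in seen:
            continue  # skip repeated questions
        seen.add(key)
        records.append({"text": f"Human: {question}\nAssistant: {answer}"})
    return records


# Hypothetical sample: one valid pair, one duplicate, one with an empty answer
sample_pairs = [
    {"question": "What is Python?", "answer": "Python is a programming language."},
    {"question": "what is  python?", "answer": "A duplicate question."},
    {"question": "How do I learn coding?", "answer": ""},
]
sft_records = to_sft_records(sample_pairs)
print(sft_records)
```

Only the first pair survives: the second is a case and whitespace variant of the same question, and the third has no answer.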

## 🧩 3. Formatting: ChatML
**ChatML** is a structured dialogue format used to simulate role-based conversations during SFT training.

**Structure:**
```
<|im_start|>user
What's the capital of France?<|im_end|>
<|im_start|>assistant
Paris.<|im_end|>
```

**Why it matters:**
- Improves consistency
- Helps multi-turn dialogue modeling
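- Matches the chat template expected by models trained with ChatML (OpenAI-style chat models and Qwen, for example). Other families such as Llama define their own templates, so check the tokenizer's chat template before training.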
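As a rough illustration of the format, the sketch below renders a list of role/content messages as a ChatML string. The function name `to_chatml` and its `add_generation_prompt` flag are hypothetical helpers written for this example. It places `<|im_end|>` directly after the message content, which is the layout commonly used by ChatML chat templates. In practice the tokenizer's own chat template should be preferred when the model ships one.

```python
def to_chatml(messages, add_generation_prompt=False):
    """Render [{"role": ..., "content": ...}, ...] as a ChatML string."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here at inference time
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


conversation = [
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "Paris."},
]
chatml_text = to_chatml(conversation)
print(chatml_text)
```

For training, the full conversation is rendered. For inference, only the user turns are rendered and `add_generation_prompt=True` leaves the assistant turn open.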



## 🔍 4. Full Finetune vs. LoRA


| Aspect               | Full Fine-Tuning      | LoRA (Low-Rank Adaptation)                            |
|----------------------|-----------------------|-------------------------------------------------------|
| Trainable Params     | 100%                  | Usually ~0.1–2% (depends on rank and target modules)  |
| Memory Usage         | Very High             | Low                                                   |
| Flexibility          | Maximum               | Good for most tasks                                   |
| Training Time        | Longer                | Usually faster                                        |
| Use Case             | Critical domain shift | Resource-efficient tuning                             |

**Recommendation**: Use LoRA for most educational and practical settings unless full retraining is justified.
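To make the "low-rank" idea concrete, here is a minimal pure-Python sketch of a LoRA-style forward pass. It assumes the standard formulation: the frozen weight `W` is left untouched and a scaled low-rank product `(alpha / r) * B @ A` is added to it, with `B` initialized to zero. The tiny dimensions and the helper names `matvec` and `lora_forward` are chosen only for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, low_rank)]


random.seed(0)
d_out, d_in, r, alpha = 6, 8, 2, 4
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, starts at zero
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B = 0 the adapted layer reproduces the base layer exactly
print(lora_forward(W, A, B, x, alpha, r) == matvec(W, x))

full_params = d_out * d_in          # parameters updated by full fine-tuning
lora_params = r * (d_in + d_out)    # parameters updated by LoRA
print(full_params, lora_params)
```

At this toy size the saving is modest, but the LoRA count grows with `d_in + d_out` while the full count grows with `d_in * d_out`, so the gap widens quickly for real layer sizes.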


```python
import torch
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/DialoGPT-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# GPT-2 style tokenizers have no pad token, so reuse the end-of-sequence token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

base_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=16,                # rank of the low-rank update matrices
    lora_alpha=32,       # scaling factor (the update is scaled by lora_alpha / r)
    lora_dropout=0.1,
    target_modules=["c_attn", "c_proj"],  # GPT-2 attention and projection layers
    fan_in_fan_out=True,  # GPT-2 uses Conv1D layers, which store weights as (in, out)
)
lora_model = get_peft_model(base_model, lora_config)
lora_model.print_trainable_parameters()
```


trainable params: 1,622,016 || all params: 126,061,824 || trainable%: 1.2867
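The trainable-parameter count reported above can be checked by hand. The sketch below assumes DialoGPT-small uses the GPT-2 small layer shapes (hidden size 768, 12 blocks) and that the `c_proj` target matches both the attention and the MLP output projections. Each adapted layer adds `r * (in_features + out_features)` parameters. The helper name `lora_param_count` is hypothetical.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters in one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


hidden, layers, rank = 768, 12, 16   # assumed GPT-2 small shapes, r from the config above
per_layer = (
    lora_param_count(hidden, 3 * hidden, rank)      # attention c_attn: 768 -> 2304
    + lora_param_count(hidden, hidden, rank)        # attention c_proj: 768 -> 768
    + lora_param_count(4 * hidden, hidden, rank)    # MLP c_proj: 3072 -> 768
)
trainable = layers * per_layer
total = 126_061_824                  # "all params" from the printed output

print(f"{trainable:,}", f"{100 * trainable / total:.4f}")  # prints 1,622,016 1.2867
```

The result matches the logged `trainable params` and `trainable%`, which supports the assumed layer shapes.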




## ⚡ 5. DeepSpeed

**DeepSpeed** is a library from Microsoft that allows efficient distributed training of large models.

**Modes:**
- **ZeRO-1/2/3**: partition optimizer states (stage 1), then gradients (stage 2), then model parameters (stage 3) across data-parallel GPUs.
- **CPU Offload** to reduce GPU memory usage.
- **Mixed Precision** for speed and efficiency.

**Best for**: Scaling training to large models like 13B+, saving memory, or training on multiple GPUs.
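To get a feel for what the ZeRO stages save, the sketch below estimates per-GPU memory for model states only. It assumes mixed-precision Adam with the commonly cited accounting of 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states (master weights, momentum, variance). Activations, buffers, and fragmentation are ignored, so real usage will be higher. The function name `zero_model_state_gb` and the 13B / 8-GPU setup are illustrative.

```python
def zero_model_state_gb(num_params, num_gpus, stage):
    """Approximate per-GPU model-state memory in GB under a given ZeRO stage."""
    weights = 2 * num_params    # fp16 parameters
    grads = 2 * num_params      # fp16 gradients
    optim = 12 * num_params     # fp32 master weights + Adam momentum + variance
    if stage >= 1:
        optim /= num_gpus       # stage 1 partitions optimizer states
    if stage >= 2:
        grads /= num_gpus       # stage 2 also partitions gradients
    if stage >= 3:
        weights /= num_gpus     # stage 3 also partitions parameters
    return (weights + grads + optim) / 1e9


# Illustrative setup: a 13B-parameter model on 8 GPUs
for stage in range(4):
    print(f"ZeRO stage {stage}: {zero_model_state_gb(13e9, 8, stage):.2f} GB per GPU")
```

Stage 0 here means plain data parallelism, where every GPU holds a full copy of all model states.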

A minimal example config that enables ZeRO stage 2 with fp16 mixed precision:
```json
{
  "zero_optimization": {"stage": 2},
  "fp16": {"enabled": true}
}
```
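One possible way to use such a config with the Hugging Face `Trainer` is to write it to a JSON file and pass the path through the `deepspeed` argument of `TrainingArguments`. The sketch below only builds and saves the file. The file name `ds_config.json` is arbitrary, and the `"auto"` entries rely on the Hugging Face integration filling in batch-size values from the training arguments.

```python
import json

# Minimal ZeRO stage 2 config; "auto" lets the Hugging Face integration
# copy these values from TrainingArguments instead of duplicating them here.
ds_config = {
    "zero_optimization": {"stage": 2},
    "fp16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
}

with open("ds_config.json", "w", encoding="utf-8") as f:
    json.dump(ds_config, f, indent=2)

# The saved file can then be referenced as
# TrainingArguments(..., fp16=True, deepspeed="ds_config.json")
# and the script started with the `deepspeed` launcher.
```

The precision setting in the config and in `TrainingArguments` should agree (here both fp16), since mismatched values are typically rejected at startup.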

## 🛠️ 6. TRL Package (SFTTrainer)

**Transformer Reinforcement Learning (TRL)** by Hugging Face includes:

- `SFTTrainer`: Simplified supervised training loop.
- `PPOTrainer`: RLHF with Proximal Policy Optimization.
- `DPOTrainer`: Direct Preference Optimization.
- `RewardTrainer`: For reward model training.

**Why TRL?**
- Abstracts away complex setup.
- Faster experimentation.
- Covers the common fine-tuning and alignment workflows (SFT, reward modeling, PPO, DPO).


```python
from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Prepare the dataset (must have a 'text' field)
demo_dataset = Dataset.from_list([
    {"text": "Human: What is Python?\nAssistant: Python is a programming language."},
    {"text": "Human: How do I learn coding?\nAssistant: Start with basic concepts and practice regularly."}
])

# Training arguments for a tiny demo run
training_args = TrainingArguments(
    output_dir="./trl_sft_demo",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    learning_rate=5e-4,
    logging_steps=1,
    save_steps=10,
    save_total_limit=1,
    fp16=False,
    report_to="none",  # the string "none" disables external loggers; None does not
)

# ✅ Initialize SFTTrainer with the LoRA-wrapped model from the previous cell
trainer = SFTTrainer(
    model=lora_model,
    args=training_args,
    train_dataset=demo_dataset,
)

# ✅ Train the model
trainer.train()
```


Output:
```
Converting train dataset to ChatML: 100%|██████████| 2/2 [00:00<00:00, 496.66 examples/s]
Adding EOS to train dataset: 100%|██████████| 2/2 [00:00<00:00, 1088.72 examples/s]
Tokenizing train dataset: 100%|██████████| 2/2 [00:00<00:00, 568.80 examples/s]
Truncating train dataset: 100%|██████████| 2/2 [00:00<00:00, 1132.52 examples/s]
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.
```

| Step | Training Loss |
|------|---------------|
| 1    | 9.5891        |
| 2    | 9.2756        |

```
TrainOutput(global_step=2, training_loss=9.432311534881592, metrics={'train_runtime': 2.921, 'train_samples_per_second': 0.685, 'train_steps_per_second': 0.685, 'total_flos': 18722451456.0, 'train_loss': 9.432311534881592})
```

Explanation:
1. global_step=2

    This means the training process completed 2 optimization steps (i.e., two batches were processed and used to update the model's parameters).

2. training_loss=9.432311534881592

    This is the training loss averaged over all training steps (here, the mean of the two logged step losses, 9.5891 and 9.2756). A loss this high means the model has barely adapted to the data yet, likely because:
        * It’s early in training (only 2 steps on 2 examples).
        * The model needs more tuning or a better learning rate.
        * The data might be complex or noisy.

3. train_runtime=2.921

    Total time in seconds the training took — in this case, around 2.9 seconds.

4. train_samples_per_second=0.685

    The average number of training examples processed per second: 2 examples in 2.921 seconds is about 0.685. The figure is low because the dataset has only 2 examples and fixed startup overhead dominates such a short run.

5. train_steps_per_second=0.685

    How many steps (i.e., parameter updates) were completed per second. It matches the sample rate, implying 1 sample per step.

6. total_flos=18722451456.0

    The total number of floating point operations (FLOPs) executed during training. It's a proxy for how computationally intensive the training was.
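The reported numbers can be cross-checked with a few lines of arithmetic. The sketch below recomputes the mean loss and throughput from the logged values and adds two reference points that are not in the trainer output: perplexity (`exp(loss)`) and the loss a model would get by guessing uniformly over the vocabulary. The vocabulary size of 50257 is an assumption based on the GPT-2 tokenizer that DialoGPT builds on.

```python
import math

step_losses = [9.5891, 9.2756]        # per-step losses from the table above
num_samples, runtime_s = 2, 2.921     # dataset size and train_runtime

mean_loss = sum(step_losses) / len(step_losses)
samples_per_second = num_samples / runtime_s
perplexity = math.exp(mean_loss)
uniform_loss = math.log(50257)        # assumed GPT-2 vocabulary size

print(f"mean loss: {mean_loss:.3f}")
print(f"samples/s: {samples_per_second:.3f}")
print(f"perplexity: {perplexity:.0f}")
print(f"uniform-guess loss: {uniform_loss:.2f}")
```

A mean loss of roughly 9.4 sits not far below the uniform-guess level of roughly 10.8, which is another way of seeing that two steps are nowhere near enough training.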




### Explanation of Each Metric during model training:
- ✅ loss
    *What it is*: Measures the model's prediction error — how far off the model is from the target output.

    *What to look for*: We want this to decrease over time.

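    The example values in this section are illustrative and do not come from the two-step demo above. A value of 0.1097 or 0.1366 is relatively low, which is promising, assuming it continues trending downward.

    Temporary small increases (like from 0.1097 to 0.1366) are normal and usually reflect noisy or harder batches rather than a real regression.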

- ✅ grad_norm (Gradient Norm)
    *What it is*: The L2 norm of all gradients combined. It indicates how strong the gradient signal is; the actual size of each weight update also depends on the learning rate and the optimizer.

    *What to look for*:

    If this value is too large, it may indicate exploding gradients.

    If too small (near zero), it may mean vanishing gradients or that training is plateauing.

    For example:

    0.969 → healthy magnitude, meaning the model is still learning.

    0.278 → much smaller, which could mean learning is slowing down — possibly nearing convergence, or may need LR adjustment.

- ✅ learning_rate
    *What it is*: The step size the optimizer uses when updating the weights. It often follows a schedule that decays over time (e.g., a cosine scheduler).

    *What to look for*:

    A decaying learning rate is common and helps fine-tune the model toward convergence.

    0.000126 → slightly higher; 0.000120 → lower. This drop suggests a learning rate schedule is being applied, as expected.

- ✅ epoch
    *What it is*: Indicates how far along you are in training (e.g., 4.67 = 67% through the 5th epoch).

    *What to look for*: Helps track progression. You’d want to compare loss and grad_norm across epochs to evaluate learning trends.
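The three derived quantities above (gradient norm, scheduled learning rate, fractional epoch) are simple to compute. The following sketch shows one common definition of each, assuming a cosine decay with optional linear warmup. The function names and the example numbers are illustrative, and a real trainer may differ in details.

```python
import math


def global_grad_norm(grads):
    """L2 norm over all gradient values, flattened across parameter tensors."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


def cosine_lr(step, total_steps, peak_lr, warmup_steps=0):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


def fractional_epoch(step, steps_per_epoch):
    """Epoch counter as logged during training, e.g. 4.67."""
    return step / steps_per_epoch


# Two toy "parameter tensors" with gradients 3 and 4 give a norm of 5
print(global_grad_norm([[3.0], [4.0]]))

# Learning rate at the start, middle, and end of a 100-step schedule
for step in (0, 50, 100):
    print(step, cosine_lr(step, total_steps=100, peak_lr=5e-4))

# Step 467 with 100 steps per epoch is 67% of the way through the 5th epoch
print(fractional_epoch(467, 100))
```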

## ✅ Summary

You’ve learned:
- What SFT is and why it’s essential
- Where and how to get quality data
- How to use ChatML format
- When to choose LoRA vs full tuning
- How to leverage DeepSpeed and TRL for scale and alignment

For the full Llama 3 SFT code, see `class_5_llama3.py`.