### MHP Applied science group
# RLHF Hackathon: SFT


<div style="text-align: center;">
    <img src="../images/img1.jpg" alt="Supervised Fine-tuning steps" style="display: block; margin-left: auto; margin-right: auto;width:800px">
    <p style="text-align:center">Supervised Fine-tuning steps <a href=https://cameronrwolfe.substack.com/p/understanding-and-using-supervised>(image source)</a></p>
</div>


Supervised fine-tuning is the process of adapting a pre-trained model to a specific task by training it further on labeled data. This involves:


1. **Pre-trained Model**: Starting with a model pre-trained on a large, general dataset.
2. **Labeled Data**: Using a smaller, task-specific dataset with labeled examples.
3. **Fine-tuning**: Updating the model's weights to improve its performance on the specific task using supervised learning methods.

This method enhances the model's ability to perform well on the target task by leveraging the knowledge gained during pre-training.
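
To make the fine-tuning step concrete, here is a minimal sketch of the objective commonly used when fine-tuning language models on labeled text: the average negative log-likelihood of the reference tokens. The probabilities are made-up values for illustration, and `next_token_loss` is a hypothetical helper, not part of any library used in this notebook.

```python
import math

def next_token_loss(token_probs):
    """Average negative log-likelihood of the reference tokens.

    token_probs: probability the model assigned to each correct next token.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical probabilities for a 4-token labeled response (illustration only).
before = [0.10, 0.20, 0.05, 0.25]   # model before task-specific training
after = [0.60, 0.70, 0.40, 0.80]    # same tokens after some fine-tuning

print(f"loss before: {next_token_loss(before):.3f}")
print(f"loss after:  {next_token_loss(after):.3f}")
```

Fine-tuning updates the weights so that the labeled tokens become more likely, which shows up as a lower loss.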


### Load model
Let's choose a pre-trained model that fits on our instance. We use `unsloth/Phi-3-mini-4k-instruct`, loaded in 4-bit precision with bitsandbytes (see `load_in_4bit = True` below) to reduce memory use. You can find more information on [Hugging Face](https://huggingface.co/unsloth/Phi-3-mini-4k-instruct).

Many model families come with a **cookbook** that contains valuable information about the model. For Phi-3, you can find it [here](https://github.com/microsoft/Phi-3CookBook).

In [1]:
# Task 1: Write the name of the model
LLM_MODEL_NAME = "unsloth/Phi-3-mini-4k-instruct" # update here

In [2]:
from unsloth import FastLanguageModel
max_seq_length = 2048
dtype = None
load_in_4bit = True
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = LLM_MODEL_NAME,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


==((====))==  Unsloth: Fast Mistral patching release 2024.6
   \\   /|    GPU: NVIDIA A10G. Max memory: 22.191 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 8.6. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### PEFT
Parameter-Efficient Fine-Tuning (PEFT) refers to techniques used to adapt large pre-trained models to specific tasks with minimal adjustments to the model’s parameters. The key ideas include:

 1. **Efficiency**: Only a small subset of the model’s parameters is fine-tuned, reducing computational cost and memory usage.
 2. **Methods**: Common methods include adapters, low-rank adaptation (LoRA), and prefix-tuning.
 3. **Advantages**: PEFT methods maintain the benefits of large pre-trained models while being resource-efficient, making them suitable for applications with limited computational resources.

Overall, PEFT allows effective task adaptation without the need for extensive re-training of the entire model.

Let's create a PEFT model from the previously loaded model using `get_peft_model` from the `unsloth` library. Here is a quick overview of the parameters:

 - **model**: The pre-trained language model that you want to fine-tune.
 - **r**: The rank of the low-rank adaptation matrices. Suggested values are 8, 16, 32, 64, or 128. This controls the dimension of the adaptation matrices, affecting the trade-off between efficiency and performance.
 - **target_modules**: A list of module names within the model where the low-rank adaptations will be applied. These typically include projection layers like q_proj, k_proj, v_proj, and others.
 - **lora_alpha**: A scaling factor for the low-rank update, which is multiplied by `lora_alpha / r`. Higher values make the adaptation stronger but can make training less stable.
 - **lora_dropout**: Dropout rate for the LoRA layers. A value of 0 disables dropout and is the setting Unsloth is optimized for.
 - **bias**: Defines how biases are handled in the adaptation process. With “none”, bias terms are not trained; this is the setting Unsloth is optimized for.
 - **use_gradient_checkpointing**: When set to “unsloth”, Unsloth's own gradient checkpointing implementation is used. According to Unsloth, it reduces VRAM usage by about 30% and allows for larger batch sizes.
 - **random_state**: Seed for random number generation to ensure reproducibility of the fine-tuning process.
 - **use_rslora**: Boolean indicating whether to use Rank-Stabilized LoRA, which scales the update by `lora_alpha / sqrt(r)` instead of `lora_alpha / r` to keep training stable at higher ranks.
 - **loftq_config**: Configuration for LoftQ (LoRA-Fine-Tuning-aware Quantization), if applicable. Setting it to None means this feature is not used.
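
As a rough illustration of how `r`, `lora_alpha`, and `use_rslora` interact, the sketch below applies a LoRA update to a tiny weight matrix in plain Python. It is a toy model of the idea, not Unsloth's implementation: `matvec` and `lora_forward` are hypothetical helpers, and all matrix values are invented.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, lora_alpha, r, use_rslora=False):
    """y = W x + scale * B (A x); W stays frozen, only A and B are trained."""
    scale = lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

# Toy sizes: a frozen 3x3 weight with rank r = 2 adapters (values invented).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.5, 0.0, 0.5], [0.0, 1.0, 0.0]]             # shape r x d_in
B_zero = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]      # shape d_out x r, zero at init
B_trained = [[0.2, 0.0], [0.0, 0.1], [0.0, 0.0]]   # after some training
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B_zero, x, lora_alpha=16, r=2))
print(lora_forward(W, A, B_trained, x, lora_alpha=16, r=2))
print(lora_forward(W, A, B_trained, x, lora_alpha=16, r=2, use_rslora=True))
```

With `B` initialized to zero, as in the original LoRA formulation, the adapted layer initially behaves exactly like the pre-trained one. The adapters add only `r * (d_in + d_out)` trainable values per matrix.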


In [3]:
#Task: you need to pass the model to create a peft version of the model
model = FastLanguageModel.get_peft_model(
    model, # UPDATE HERE
    r = 16, 
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, 
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

Unsloth 2024.6 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


### Dataset
Now we need to prepare the dataset for our model. Each model has its own chat format, so we need to convert the prompt/response pairs into the text format that the model expects.

In [4]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-3", 
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"},
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }


from datasets import load_dataset
dataset = load_dataset("philschmid/guanaco-sharegpt-style", split = "train")

# Task: Now update the records by using `dataset.map` and the function above
dataset = dataset.map(formatting_prompts_func, batched = True,) # Update here

In [5]:
dataset[5]["conversations"]

[{'from': 'human',
  'value': 'What is the typical wattage of bulb in a lightbox?'},
 {'from': 'gpt',
  'value': 'The typical wattage of a bulb in a lightbox is 60 watts, although domestic LED bulbs are normally much lower than 60 watts, as they produce the same or greater lumens for less wattage than alternatives. A 60-watt Equivalent LED bulb can be calculated using the 7:1 ratio, which divides 60 watts by 7 to get roughly 9 watts.'},
 {'from': 'human',
  'value': 'Rewrite your description of the typical wattage of a bulb in a lightbox to only include the key points in a list format.'}]

In [6]:
print(dataset[5]["text"])

<s><|user|>
What is the typical wattage of bulb in a lightbox?<|end|>
<|assistant|>
The typical wattage of a bulb in a lightbox is 60 watts, although domestic LED bulbs are normally much lower than 60 watts, as they produce the same or greater lumens for less wattage than alternatives. A 60-watt Equivalent LED bulb can be calculated using the 7:1 ratio, which divides 60 watts by 7 to get roughly 9 watts.<|end|>
<|user|>
Rewrite your description of the typical wattage of a bulb in a lightbox to only include the key points in a list format.<|end|>
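
The rendered text above follows a simple pattern: a start token, then each turn wrapped in a role tag and an `<|end|>` marker. The sketch below reproduces that pattern in plain Python to show what the chat template does. `render_phi3` is a hypothetical helper written for illustration; in the notebook, the real work is done by `tokenizer.apply_chat_template`.

```python
def render_phi3(conversation):
    """Illustrative re-creation of the rendered format shown above."""
    # Same role mapping as the `mapping` argument passed to get_chat_template.
    role_tags = {"human": "<|user|>", "gpt": "<|assistant|>"}
    text = "<s>"
    for turn in conversation:
        text += f"{role_tags[turn['from']]}\n{turn['value']}<|end|>\n"
    return text

convo = [
    {"from": "human", "value": "Hi"},
    {"from": "gpt", "value": "Hello!"},
]
print(render_phi3(convo))
```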



### SFT training

Now we have everything we need to fine-tune the model. We use the TRL library from Hugging Face. Let’s connect all the pieces.

In [7]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# TASK 3: create a trainer by using the model, tokenizer and dataset that we have just created earlier
trainer = SFTTrainer(
    model = model, # UPDATE HERE
    tokenizer = tokenizer, # UPDATE HERE
    train_dataset = dataset, # UPDATE HERE
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

max_steps is given, it will override any value given in num_train_epochs
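
To see what `warmup_steps = 5`, `max_steps = 60`, and `lr_scheduler_type = "linear"` imply for the learning rate, here is a small sketch of a linear schedule with warmup. `linear_schedule_lr` is a hypothetical helper that mirrors the usual shape of such a schedule. The exact step at which the trainer applies each value is an implementation detail not covered here.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, max_steps=60):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max_steps - step
    return base_lr * max(0.0, remaining / max(1, max_steps - warmup_steps))

for step in (0, 1, 5, 30, 60):
    print(step, f"{linear_schedule_lr(step):.2e}")
```

The learning rate peaks at the end of warmup and reaches zero at `max_steps`, so changing `max_steps` also changes the slope of the decay.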


In [8]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 9,033 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 29,884,416


| Step | Training Loss |
|------|---------------|
| 1 | 2.0394 |
| 2 | 1.1281 |
| 3 | 1.4812 |
| 4 | 1.4454 |
| 5 | 1.3893 |
| 6 | 1.23 |
| 7 | 0.9151 |
| 8 | 1.616 |
| 9 | 1.5503 |
| 10 | 1.2701 |
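
The numbers in the training banner can be checked with a little arithmetic. The sketch below recomputes the total batch size, the share of the dataset seen in 60 steps, and the number of trainable LoRA parameters. The Phi-3-mini dimensions (hidden size 3072, MLP size 8192, 32 layers) are assumptions taken from the public model configuration and are not stated in this notebook. `lora_params` is a hypothetical helper.

```python
# Values copied from TrainingArguments and the training banner.
per_device_batch = 2
grad_accum_steps = 4
num_gpus = 1
max_steps = 60
num_examples = 9033

effective_batch = per_device_batch * grad_accum_steps * num_gpus
examples_seen = effective_batch * max_steps
epoch_fraction = examples_seen / num_examples

# Assumed Phi-3-mini dimensions (not stated in this notebook).
hidden, intermediate, layers, r = 3072, 8192, 32, 16

def lora_params(d_in, d_out, rank):
    """A LoRA adapter adds an (rank x d_in) and a (d_out x rank) matrix."""
    return rank * (d_in + d_out)

per_layer = (
    4 * lora_params(hidden, hidden, r)            # q_proj, k_proj, v_proj, o_proj
    + 2 * lora_params(hidden, intermediate, r)    # gate_proj, up_proj
    + lora_params(intermediate, hidden, r)        # down_proj
)
trainable = per_layer * layers

print(f"total batch size: {effective_batch}")
print(f"examples seen:    {examples_seen} ({epoch_fraction:.1%} of the dataset)")
print(f"trainable params: {trainable:,}")
```

Under these assumptions the count matches the 29,884,416 trainable parameters in the banner. Note also that 60 steps cover only a small part of the 9,033 examples, even though the banner reports one epoch.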


### Inference

Let's chat with our fine-tuned model. Feel free to ask anything!

In [9]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-3",
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"},
)

FastLanguageModel.for_inference(model)

messages = [
    {"from": "human", "value": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

['<s><|user|> Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,<|end|><|assistant|> 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, ']
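
The decoded string contains the prompt and the special tokens as well as the answer. If only the assistant's reply is needed, one option is to cut it out of the decoded text, as in the sketch below. `extract_reply` is a hypothetical helper, and the sample string is a shortened copy of the output above.

```python
def extract_reply(decoded, assistant_tag="<|assistant|>", end_tag="<|end|>"):
    """Return the text after the last assistant tag, up to the next end tag."""
    reply = decoded.rsplit(assistant_tag, 1)[-1]
    return reply.split(end_tag, 1)[0].strip()

decoded = (
    "<s><|user|> Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"
    "<|end|><|assistant|> 13, 21, 34, "
)
print(extract_reply(decoded))
```

An alternative that avoids string handling is to slice the prompt tokens off `outputs` before decoding.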

Now try a question of your own. This time the answer is streamed token by token with `TextStreamer`.

In [10]:
FastLanguageModel.for_inference(model)

messages = [
    {"from": "human", "value": "what is the x in this equation: 3x-3 = 6"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128, use_cache = True)

<s><|user|> what is the x in this equation: 3x-3 = 6<|end|><|assistant|> To solve for x in the equation 3x - 3 = 6, you can follow these steps:

1. Add 3 to both sides of the equation to isolate the term with x:
  3x - 3 + 3 = 6 + 3
  3x = 9

2. Divide both sides of the equation by 3 to solve for x:
  3x / 3 = 9 / 3
  x = 3

So the value of x in the equation 3x - 3 = 6 is 3.


### Save model

We need to save the model for further steps in RLHF.

In [11]:
model.save_pretrained_merged("phi_3_sft_model", tokenizer, save_method = "merged_16bit",)

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 16.25 out of 30.99 RAM for saving.


100%|██████████| 32/32 [00:00<00:00, 63.61it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.
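
Conceptually, merging folds each LoRA update into its base weight matrix, so the saved model no longer needs separate adapter weights. The sketch below shows that idea on a tiny matrix in plain Python. It illustrates the arithmetic only and is not how Unsloth implements `save_pretrained_merged`. `merge_lora` is a hypothetical helper with invented values.

```python
def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A as a new matrix."""
    scale = lora_alpha / r
    rank = len(A)
    return [
        [
            W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
            for j in range(len(W[0]))
        ]
        for i in range(len(W))
    ]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (invented values)
A = [[1.0, 2.0]]               # shape r x d_in, with r = 1
B = [[0.5], [0.25]]            # shape d_out x r

merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)
```

After merging, a single matrix multiplication with `merged` gives the same result as the base weight plus the scaled adapter path.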
