### LLM Fine-Tuning with Unsloth
#### [Unsloth wiki](https://github.com/unslothai/unsloth/wiki) 
- [Conda installation](https://github.com/unslothai/unsloth?tab=readme-ov-file#conda-installation)
- `pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"`
- `pip install --no-deps "trl<0.9.0" peft accelerate bitsandbytes`
- `pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes`

- training code references
    - [reference Unsloth](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing#scrollTo=kR3gIAX-SM2q)
    - [reference HF FA2](https://colab.research.google.com/drive/1fgTOxpMbVjloQBvZyz4lF4BacKSZOB2A?usp=sharing#scrollTo=-nX3SL7cI2fZ)

In [1]:
import os 
## Optional: restrict the run to a specific GPU
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [2]:
from unsloth import FastLanguageModel
import os
import torch
from datasets import load_dataset
from unsloth.chat_templates import get_chat_template

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [3]:

use_lora = True
model_cache_dir = '/root/data/hf_cache/llama-3-8B-Instruct'
model_output_dir = '/root/data/models/llama3/8b_checkpoints'
final_model_out_dir = '/root/data/models/llama3/llama_8b_current'
max_seq_length = 2048 # Choose any! auto RoPE Scaling internally!
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16 # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.
use_gradient_checkpointing = True
random_state = 3407

model_name = model_cache_dir

#### Load model

In [4]:


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_cache_dir,     # or a pre-quantized 4-bit model such as "unsloth/llama-3-8b-bnb-4bit"; other supported models: https://huggingface.co/unsloth
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
    
### to train with half precision, set tokenizer.padding_side ='right'
# tokenizer.add_special_tokens({"pad_token": "<|PAD|>"})
# model.config.pad_token_id = tokenizer.pad_token_id # updating model config
# tokenizer.padding_side = 'right'

==((====))==  Unsloth: Fast Llama patching release 2024.6
   \\   /|    GPU: NVIDIA A100 80GB PCIe. Max memory: 79.138 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.26.post1. FA = True.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/root/data/hf_cache/llama-3-8B-Instruct does not have a padding token! Will use pad_token = <|reserved_special_token_250|>.


#### Train with a LoRA adapter
-  [LoRA targets explained](https://github.com/unslothai/unsloth/wiki#target-modules)

In [5]:
if use_lora:

    model = FastLanguageModel.get_peft_model(
        model,
        r = 32, # LoRA rank; any integer > 0, suggested 8, 16, 32, 64, 128 (commonly 32 or 64). A higher rank adds more trainable parameters and lets the update track a full fine-tune more closely; a lower rank compresses it more.
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj",],
        lora_alpha = 16,  # Scaling factor for the LoRA update; the effective scale is usually alpha / r (implementation dependent), so a lower alpha leaves the original weights with more influence.
        lora_dropout = 0, # Supports any, but = 0 is optimized
        bias = "none",    # Supports any, but = "none" is optimized
        use_gradient_checkpointing = "unsloth", #"unsloth" uses 30% less VRAM, fits 2x larger batch sizes! # True or "unsloth" for very long context
        random_state = 3407,
        use_rslora = False,  # support rank stabilized LoRA
        loftq_config = None, # And LoftQ
    )

Unsloth 2024.6 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
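
The rank and alpha settings above determine both how many weights are trained and how strongly the adapter update is scaled. The sketch below works through that arithmetic. The layer shapes are assumptions taken from the published Llama-3-8B configuration (hidden size 4096, MLP size 14336, 1024-wide key/value projections, 32 layers), not values read from this notebook, and `lora_trainable_params` and `lora_scale` are hypothetical helper names used only for illustration.

```python
# Assumed Llama-3-8B shapes (not stated in this notebook):
# hidden size 4096, MLP intermediate size 14336,
# key/value projection width 1024 (8 KV heads x 128), 32 decoder layers.
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 4096, 14336, 1024, 32

# (in_features, out_features) of each LoRA target module in one decoder layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(rank, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """LoRA adds A (rank x in) and B (out x rank) per module: rank * (in + out) weights."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


def lora_scale(alpha, rank, use_rslora=False):
    """Standard LoRA scales the update by alpha / rank; rsLoRA uses alpha / sqrt(rank)."""
    return alpha / (rank ** 0.5 if use_rslora else rank)


print(f"{lora_trainable_params(32):,}")  # 83,886,080
print(lora_scale(16, 32))                # 0.5
```

Under these assumptions, `r = 32` gives 83,886,080 trainable weights, which agrees with the count the trainer reports later in this notebook, and `lora_alpha = 16` scales the adapter update by 0.5.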


#### Data Prep
We now use the `Llama-3` format for conversation-style finetunes. We use [Open Assistant conversations](https://huggingface.co/datasets/philschmid/guanaco-sharegpt-style) in ShareGPT style. Llama-3 renders multi-turn conversations as shown below:

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>

I'm great thanks!<|eot_id|>
```

Unsloth's `get_chat_template` function applies the correct chat template. It supports `zephyr`, `chatml`, `mistral`, `llama`, `alpaca`, `vicuna`, `vicuna_old`, and Unsloth's own optimized `unsloth` template, as well as the `llama-3` template used in this notebook.
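
To make the format concrete, here is a minimal pure-Python sketch of the rendering that the Llama-3 template performs on a ShareGPT-style conversation. It is an illustration only, not Unsloth's implementation: `render_llama3` and `ROLE_MAP` are hypothetical names, and the real pipeline goes through the tokenizer's Jinja chat template.

```python
BOS = "<|begin_of_text|>"

# ShareGPT speaker names mapped to Llama-3 role names
ROLE_MAP = {"human": "user", "gpt": "assistant"}


def render_llama3(conversation, add_generation_prompt=False):
    """Render a list of {"from": ..., "value": ...} turns as one Llama-3 prompt string."""
    parts = [BOS]
    for message in conversation:
        # Unknown speakers keep their original name as the role header
        role = ROLE_MAP.get(message["from"], message["from"])
        header = f"<|start_header_id|>{role}<|end_header_id|>\n\n"
        parts.append(header + message["value"].strip() + "<|eot_id|>")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


conversation = [
    {"from": "human", "value": "Hello!"},
    {"from": "gpt", "value": "Hey there! How are you?"},
    {"from": "human", "value": "I'm great thanks!"},
]
print(render_llama3(conversation))
```

For training data the generation prompt is left off, since every assistant turn is already present; for inference it is switched on so that the prompt ends with an open assistant header.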



In [6]:
print(tokenizer.chat_template)

# Note ShareGPT uses `{"from": "human", "value" : "Hi"}` and not `{"role": "user", "content" : "Hi"}`, so we use `mapping` to map it.

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
)

print('After transformation')
print(tokenizer.chat_template)

{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>

' }}{% endif %}
After transformation
{{ bos_token }}{% for message in messages %}{% if message['from'] == 'human' %}{{ '<|start_header_id|>user<|end_header_id|>

' + message['value'] | trim + '<|eot_id|>' }}{% elif message['from'] == 'gpt' %}{{ '<|start_header_id|>assistant<|end_header_id|>

' + message['value'] | trim + '<|eot_id|>' }}{% else %}{{ '<|start_header_id|>' + message['from'] + '<|end_header_id|>

' + message['value'] | trim + '<|eot_id|>' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>

' }}{% endif %}


#### More info on [chat_template](https://github.com/unslothai/unsloth/wiki#chat-templates)

In [7]:
## Format each ShareGPT conversation into a single prompt string using the chat template
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

dataset = load_dataset("philschmid/guanaco-sharegpt-style", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,) ## with batched = True, the function receives a batch of examples per call
print(dataset[0]['text'])

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Escribe un discurso que pueda recitar como padrino de la boda de mi mejor amigo.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Queridos invitados, amigos y familiares,

Me siento muy honrado de estar aquí hoy como padrino de bodas de mi mejor amigo [Nombre del novio].

Recuerdo con cariño los días en los que [Nombre del novio] y yo nos conocimos, cuando éramos solo dos jóvenes llenos de sueños y esperanza. Ahora, aquí estamos, celebrando su boda con la persona que ama, [Nombre de la novia].

[Nombre de la novia], te aseguro que [Nombre del novio] es una persona increíble, llena de amor y lealtad. Juntos, han formado un equipo invencible y estoy seguro de que su amor perdurará por siempre.

[Nombre del novio], mi amigo, te deseo todo lo mejor en esta nueva etapa de tu vida. Espero que tu matrimonio esté lleno de amor, alegría y felicidad, y que [Nombre de la novia] sea siempre tu compañera de vida y tu mejor amiga.

A 

#### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). Unsloth also supports TRL's `DPOTrainer`!

In [8]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 16,
        gradient_accumulation_steps = 2,
        warmup_steps = 20,
        num_train_epochs=1, # often 3
        #max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(), 
        bf16 = is_bfloat16_supported(), # without LoRA, this has failed with "Invalid device string: 'bfloat16'"
        logging_steps = 40,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        save_steps=100,
        save_total_limit=1,
        output_dir = model_output_dir,
    ),
)

In [9]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A100 80GB PCIe. Max memory = 79.138 GB.
15.404 GB of memory reserved.


In [10]:
trainer_stats = trainer.train()
## Known bug when saving a model trained without LoRA: https://github.com/unslothai/unsloth/issues/404

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 9,033 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 16 | Gradient Accumulation steps = 2
\        /    Total batch size = 32 | Total steps = 282
 "-____-"     Number of trainable parameters = 83,886,080


| Step | Training Loss |
|------|---------------|
| 40   | 1.4417        |
| 80   | 1.2517        |
| 120  | 1.2628        |
| 160  | 1.2268        |
| 200  | 1.25          |
| 240  | 1.2324        |
| 280  | 1.202         |
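
The batch size and step count in the banner above follow from the trainer arguments. The sketch below reproduces that arithmetic. It assumes the trainer drops a trailing partial gradient-accumulation group each epoch, which is consistent with the 282 steps reported here but may differ between library versions. `total_optimizer_steps` is a hypothetical helper, not a TRL or Transformers function.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch, grad_accum,
                          num_epochs=1, num_gpus=1):
    """Estimate optimizer steps, assuming partial accumulation groups are dropped."""
    # Batches the dataloader yields per epoch (the last batch may be smaller)
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    # One optimizer step per full group of `grad_accum` batches
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * num_epochs


effective_batch_size = 16 * 2 * 1  # per-device batch x accumulation steps x GPUs
print(effective_batch_size)                 # 32
print(total_optimizer_steps(9033, 16, 2))   # 282
```

With `logging_steps = 40`, 282 steps yield seven logged losses (steps 40 through 280), which matches the number of rows in the loss table.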




In [11]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

2477.4104 seconds used for training.
41.29 minutes used for training.
Peak reserved memory = 45.188 GB.
Peak reserved memory for training = 29.784 GB.
Peak reserved memory % of max memory = 57.1 %.
Peak reserved memory for training % of max memory = 37.636 %.


#### Save model 

In [12]:
model_save_folder = "/root/data/models/llama3"
## save only the LoRA adapter
model.save_pretrained_merged(os.path.join(model_save_folder, "llama3_8b_lora_model"), tokenizer, save_method = "lora")
## save the merged 16-bit model for inference
model.save_pretrained_merged(os.path.join(model_save_folder, "llama3_8b_lora_merged_model"), tokenizer, save_method = "merged_16bit")


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model...



 Done.
Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 290.47 out of 866.1 RAM for saving.


100%|██████████| 32/32 [00:00<00:00, 67.08it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.


#### Simple Inference Test

In [13]:
infer_model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = os.path.join(model_save_folder,"llama3_8b_lora_merged_model"), # the merged model saved above
    #model_name = os.path.join(model_save_folder,"llama3_8b_lora_model"),
    max_seq_length = max_seq_length,
    dtype = dtype,
    #load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(infer_model) # Enable native 2x faster inference

==((====))==  Unsloth: Fast Llama patching release 2024.6
   \\   /|    GPU: NVIDIA A100 80GB PCIe. Max memory: 79.138 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.26.post1. FA = True.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


- Verify the chat template style

In [14]:
print(tokenizer.chat_template)

{{ bos_token }}{% for message in messages %}{% if message['from'] == 'human' %}{{ '<|start_header_id|>user<|end_header_id|>

' + message['value'] | trim + '<|eot_id|>' }}{% elif message['from'] == 'gpt' %}{{ '<|start_header_id|>assistant<|end_header_id|>

' + message['value'] | trim + '<|eot_id|>' }}{% else %}{{ '<|start_header_id|>' + message['from'] + '<|end_header_id|>

' + message['value'] | trim + '<|eot_id|>' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>

' }}{% endif %}


In [15]:
messages = [
    {"from": "human", "value": "What is your name and why?"},
]

# tokenizer = get_chat_template(
#     tokenizer,
#     chat_template = "llama-3", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
#     mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
# )
 
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = infer_model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True)
print(tokenizer.batch_decode(outputs))


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


["<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nWhat is your name and why?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nMy name is LLaMA, and I was named after the Large Language Model Architecture that I'm based on. My purpose is to assist and provide helpful information to users, and I'm trained on a massive dataset of text to make sure I'm knowledgeable and accurate.<|eot_id|>"]
