# Instructions to Run This Colab
To run this Colab notebook, click on "**Runtime**" in the menu and then select "**Run all**". Make sure to use a **Tesla T4 Google Colab instance** for efficient execution.

---

### **Introduction**
In this notebook, we fine-tune **Llama-3.1 (8B)** for a **chat application** using **Unsloth** and **LoRA** adapters. The goal is to show how lightweight fine-tuning adapts an existing language model for conversational use with a simple chat prompt template.

### **Install Dependencies**
First, we need to install **Unsloth** along with the necessary dependencies:

```python
%%capture
# Install Unsloth and get the latest nightly version.
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
```


```python
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048  # Adjust as needed based on task requirements.
dtype = None           # None auto-detects; torch.float16 suits a Tesla T4, which lacks bfloat16 support.
load_in_4bit = True    # Use 4-bit quantization to reduce memory usage.

# Load the pre-trained model and its tokenizer.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
```


```text
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.9.post4: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
```
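
As a rough check on why 4-bit loading matters on this GPU, the sketch below estimates the memory needed for the weights alone. The parameter count of 8.03 billion is an assumption (the approximate size of Llama-3.1 8B), and the estimate ignores quantization metadata, layers that stay in 16-bit, activations, and optimizer state, so real usage will be higher.

```python
# Rough weight-only memory estimate; not a measurement.
NUM_PARAMS = 8.03e9      # assumption: approximate parameter count of Llama-3.1 8B
GPU_MEMORY_GIB = 14.748  # "Max memory" reported for the Tesla T4 in the log (unit assumed to be GiB)


def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"16-bit weights: {fp16_gib:.1f} GiB (fits on GPU: {fp16_gib < GPU_MEMORY_GIB})")
print(f" 4-bit weights: {int4_gib:.1f} GiB (fits on GPU: {int4_gib < GPU_MEMORY_GIB})")
```

Under these assumptions the 16-bit weights alone slightly exceed the T4's memory, while the 4-bit weights take about a quarter of that and leave room for activations and the LoRA adapters.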


```python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # Suggested values: 8, 16, 32
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # Helps reduce memory usage for long sequences.
    random_state=3407,
)
```


```text
Unsloth 2024.9.post4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
```
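
The number of parameters that LoRA trains can be estimated by hand: each adapted linear layer with `d_in` inputs and `d_out` outputs gains two low-rank matrices holding `r * (d_in + d_out)` parameters in total. The sketch below applies that to the seven target modules. The layer dimensions (hidden size 4096, key/value projection size 1024, MLP size 14336) are not stated in this notebook and are assumed from the Llama-3.1 8B architecture.

```python
# Assumed Llama-3.1 8B dimensions (not stated in this notebook).
HIDDEN = 4096         # model hidden size
KV_DIM = 1024         # key/value projection size (grouped-query attention)
INTERMEDIATE = 14336  # MLP inner size
NUM_LAYERS = 32       # matches the "32 layers" in the Unsloth log
R = 16                # LoRA rank passed to get_peft_model

# (input features, output features) for each LoRA target module.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) to each adapted layer."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")
```

Under these assumptions the total is 41,943,040, which agrees with the "Number of trainable parameters" line in this notebook's training log.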


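```python
from datasets import load_dataset

# DailyDialog: each record's "dialog" field is a list of utterance strings.
dataset = load_dataset("daily_dialog")

# Define a conversational template
chat_prompt_template = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

>>> User: {}
>>> Assistant: """

# Append EOS_TOKEN so the model learns where a reply ends; without it, generation may never stop.
EOS_TOKEN = tokenizer.eos_token

# Build one training string per dialogue: the first utterance fills the user slot,
# and the second utterance (if present) is the assistant reply the model should learn.
def formatting_prompts_func(examples):
    texts = []
    for dialogue in examples["dialog"]:
        user_turn = dialogue[0].strip() if dialogue else ""
        reply = dialogue[1].strip() if len(dialogue) > 1 else ""
        texts.append(chat_prompt_template.format(user_turn) + reply + EOS_TOKEN)
    return {"text": texts}

# Adds a "text" column to every split; the text is formatted but not yet tokenized.
tokenized_dataset = dataset.map(formatting_prompts_func, batched=True)
```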
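
To make the target format concrete, the sketch below shows one way a single training string could be laid out from a two-turn exchange. The dialogue and the `<eos>` marker are hypothetical stand-ins (the real marker comes from `tokenizer.eos_token`), and the template is shortened to its two role lines.

```python
# Shortened stand-in for the notebook's chat_prompt_template (role lines only).
template = ">>> User: {}\n>>> Assistant: "
EOS = "<eos>"  # hypothetical placeholder for tokenizer.eos_token

# Hypothetical two-turn dialogue: user utterance, then assistant reply.
dialogue = ["Can you recommend a good book?", "Sure, what genre do you enjoy?"]


def build_example(turns, eos):
    """Prompt (user turn) + target (assistant reply) + end-of-sequence marker."""
    user_turn = turns[0].strip()
    reply = turns[1].strip() if len(turns) > 1 else ""
    return template.format(user_turn) + reply + eos


text = build_example(dialogue, EOS)
print(text)
```

The reply sits between the assistant marker and the end-of-sequence marker, so the model sees both what to answer and where the answer stops.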


```python
from transformers import TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset

# Load a small subset of the training data: the first 100 samples.
# Note: this is the Yelp review dataset, not the DailyDialog data formatted earlier;
# pass tokenized_dataset["train"] instead to train on the chat-formatted text.
train_dataset = load_dataset("yelp_review_full", split="train[:100]")

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=5e-4,  # Relatively high learning rate for a short demo run
    per_device_train_batch_size=1,  # Very small batch size to avoid out-of-memory errors
    gradient_accumulation_steps=1,  # No accumulation: the optimizer updates after every batch
    num_train_epochs=1,  # Overridden by max_steps below
    logging_steps=100,  # Larger than max_steps, so no intermediate loss is logged
    optim="adamw_8bit",
    weight_decay=0.0,  # No weight decay
    fp16=True,  # FP16 mixed precision; the T4 does not support bfloat16
    evaluation_strategy="no",  # Disable evaluation to save time
    max_steps=10,  # Limit training to 10 steps
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    max_seq_length=128,  # Short sequences for faster processing; longer texts are truncated
    dataset_text_field="text",
    args=training_args,
)

# Start training
trainer.train()
```




```text
Map:   0%|          | 0/100 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 100 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 1
\        /    Total batch size = 1 | Total steps = 10
 "-____-"     Number of trainable parameters = 41,943,040

Step,Training Loss

TrainOutput(global_step=10, training_loss=3.1276052474975584, metrics={'train_runtime': 14.0647, 'train_samples_per_second': 0.711, 'train_steps_per_second': 0.711, 'total_flos': 28663003570176.0, 'train_loss': 3.1276052474975584, 'epoch': 0.1})
```
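
The figures in this log follow from simple arithmetic: the effective batch size is the per-device batch size times the gradient accumulation steps times the number of GPUs, and the fraction of an epoch covered is the number of samples seen divided by the dataset size. A small sketch using only values reported in the training arguments and the log:

```python
# Values taken from the training arguments and log in this notebook.
per_device_batch_size = 1
gradient_accumulation_steps = 1
num_gpus = 1
max_steps = 10
num_examples = 100
train_runtime_s = 14.0647

effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus
samples_seen = effective_batch_size * max_steps
epochs_covered = samples_seen / num_examples
samples_per_second = samples_seen / train_runtime_s

print(effective_batch_size, samples_seen, epochs_covered, round(samples_per_second, 3))
```

This reproduces the logged `'epoch': 0.1` and `'train_samples_per_second': 0.711`: the run saw only 10 of the 100 examples, so it is a smoke test and not a full fine-tune.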

```python
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
```


```text
('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')
```

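```python
# Switch the model to Unsloth's inference mode before generating.
FastLanguageModel.for_inference(model)

# Format the question with the same template used for training, then tokenize it.
inputs = tokenizer(
    [
        chat_prompt_template.format(
            "What are the benefits of using LoRA for large language models?"
        )
    ],
    return_tensors="pt",
).to("cuda")

# Generate up to 64 new tokens and decode the full sequence (prompt plus completion).
outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```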


```text
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

>>> User: What are the benefits of using LoRA for large language models?
>>> Assistant: 1. Faster training time for large language models. 2. No need for a huge amount of GPU memory. 3. No need for a huge amount of GPU memory. 4. Can train on multiple GPUs. 5. Can train on multiple GPUs. 6. Can train on multiple GPUs.
```
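
Because `tokenizer.decode` returns the prompt together with the completion, a chat application usually wants only the text after the final assistant marker. A minimal sketch in plain Python, where `extract_reply` is a hypothetical helper (not part of Unsloth or Transformers) and the sample string is abbreviated from the output shown here:

```python
ASSISTANT_MARKER = ">>> Assistant: "  # role marker used by the prompt template


def extract_reply(decoded: str, marker: str = ASSISTANT_MARKER) -> str:
    """Return the text after the last assistant marker, or the whole string if absent."""
    _, found, reply = decoded.rpartition(marker)
    return reply.strip() if found else decoded.strip()


# Abbreviated sample of the decoded output.
decoded = (
    ">>> User: What are the benefits of using LoRA for large language models?\n"
    ">>> Assistant: 1. Faster training time for large language models. "
)
print(extract_reply(decoded))
```

Splitting on the last marker keeps the helper usable if earlier turns of a longer conversation are included in the prompt.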
