## Fine-tune Gemma-2B into a General-Purpose Chatbot Using 🤗 peft, trl, bitsandbytes & transformers

This notebook runs on top of the image built using this Dockerfile:
[GitHub Link](https://github.com/huggingface/Google-Cloud-Containers/blob/main/containers/pytorch/training/gpu/2.1/transformers/4.38.0.dev0/py310/Dockerfile)

With this image you don't need to install anything: all required packages are preinstalled.

### Prerequisites

1. The model weights are still hosted in a private organization on the Hugging Face Hub, so you need to authenticate before you can download them. From the CLI, run:
    ```bash
    huggingface-cli login
    ```
    Other authentication methods are described in the [quick-start guide](https://huggingface.co/docs/huggingface_hub/en/quick-start#authentication).
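For non-interactive environments, the same authentication can be done from Python. The following is a minimal sketch, assuming the token is exported in the `HF_TOKEN` environment variable; the helper names `read_hub_token` and `login_from_env` are hypothetical and only serve this illustration.

```python
import os
from typing import Mapping, Optional


def read_hub_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the token stored in HF_TOKEN, or None if it is unset or blank."""
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def login_from_env() -> bool:
    """Log in to the Hub with the token from the environment.

    Returns False without contacting the Hub when no token is set.
    """
    token = read_hub_token()
    if token is None:
        return False
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import login

    login(token=token)
    return True
```

Calling `login_from_env()` at the top of the notebook would then replace the interactive CLI prompt.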


### Import libraries and specify model to use 

In [1]:
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer, GemmaTokenizer, DataCollatorForLanguageModeling
from transformers import TrainingArguments, Trainer
import torch



In [2]:
# We use the 2b model for demonstration
model_id = "gg-hf/gemma-2b"

### Load the dataset for training

We use the [Guanaco dataset](https://huggingface.co/datasets/timdettmers/openassistant-guanaco), a refined part of the OpenAssistant dataset designed specifically to train versatile chatbots. The dataset contains various questions that require generative outputs.

Each example pairs a question with its answer. The dataset is also multilingual: for example, it includes questions in English and in Spanish. It contains about 9.85K training instances and 518 test instances.
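To make the record layout concrete, here is a small sketch that splits one conversation into turns. It assumes each record stores the whole conversation in a single `text` string with `### Human:` and `### Assistant:` markers; the `sample` string is invented for illustration and is not taken from the dataset.

```python
import re

# Matches the role markers assumed to delimit turns in the "text" field.
TURN_PATTERN = re.compile(r"### (Human|Assistant): ")


def split_turns(text):
    """Split a conversation string into a list of (role, content) tuples."""
    parts = TURN_PATTERN.split(text)
    # With a capturing group, re.split yields [prefix, role, content, role, content, ...].
    roles, contents = parts[1::2], parts[2::2]
    return [(role, content.strip()) for role, content in zip(roles, contents)]


# Hypothetical sample in the assumed format.
sample = "### Human: What is LoRA?### Assistant: A parameter-efficient fine-tuning method."

for role, content in split_turns(sample):
    print(f"{role}: {content}")
```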

In [3]:
# Import the necessary library for loading datasets
from datasets import load_dataset

# Specify the name of the dataset
dataset_name = "timdettmers/openassistant-guanaco"

# Load every available split of the dataset ("train" and "test")
dataset = load_dataset(dataset_name)



### Load Quantized model using bitsandbytes

In [4]:
from transformers import BitsAndBytesConfig

config = BitsAndBytesConfig(
    load_in_4bit=True,  # Quantize the model weights to 4 bits at load time
    bnb_4bit_quant_type="nf4",  # NF4 (4-bit NormalFloat), a data type designed for normally distributed weights
    bnb_4bit_use_double_quant=True,  # Nested quantization: also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.float16,  # Run computations in float16 for speed
)
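A rough back-of-the-envelope estimate shows why this configuration matters for memory. The sketch below assumes about 2.5 billion parameters, a quantization block size of 64, and a second-level block size of 256 (the values described for QLoRA-style double quantization); all three numbers are assumptions rather than values read from the loaded model. In practice only the linear layers are quantized, so the real footprint will be somewhat higher.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to store the weights, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 2.5e9    # assumed parameter count
BLOCK = 64          # assumed first-level quantization block size
SECOND_BLOCK = 256  # assumed second-level block size for double quantization

# Each block stores one 32-bit scaling constant: 32 / 64 = 0.5 extra bits per parameter.
plain_overhead = 32 / BLOCK
# Double quantization stores 8-bit constants plus one 32-bit constant per 256 blocks.
double_overhead = 8 / BLOCK + 32 / (BLOCK * SECOND_BLOCK)

fp16_gib = weight_memory_gib(N_PARAMS, 16)
nf4_gib = weight_memory_gib(N_PARAMS, 4 + plain_overhead)
nf4_double_gib = weight_memory_gib(N_PARAMS, 4 + double_overhead)

print(f"fp16 weights:            {fp16_gib:.2f} GiB")
print(f"NF4 weights:             {nf4_gib:.2f} GiB")
print(f"NF4 + double quant:      {nf4_double_gib:.2f} GiB")
```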

In [5]:
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=config)

Loading checkpoint shards: 100%|██████████| 3/3 [00:03<00:00,  1.26s/it]


In [None]:
print(model)

In [6]:
## Load the tokenizer
# tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer = GemmaTokenizer.from_pretrained(model_id)

### Initialize the configuration used to construct the LoRA model

In [9]:
from peft import LoraConfig

peft_config = LoraConfig(
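    task_type="CAUSAL_LM",  # PEFT task type for causal language modeling
    target_modules=["q_proj", "k_proj", "o_proj", "v_proj"],  # Attention projection names taken from the module list shown by print(model)
    inference_mode=False,  # Keep the adapter weights trainable
    r=16,  # Rank of the LoRA update matrices
    lora_alpha=32,  # Scaling factor for the LoRA update
    lora_dropout=0.1,  # Dropout applied inside the LoRA layers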
)
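To get a feel for how small the adapter is, the number of trainable LoRA parameters can be estimated by hand: a rank-`r` adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The sketch below uses `r=16` and `lora_alpha=32` from the configuration above, but the layer shapes and layer count are assumptions about Gemma-2B, not values read from the model; compare them with the output of `print(model)` before relying on the result.

```python
R = 16           # LoRA rank, as configured above
LORA_ALPHA = 32  # LoRA scaling numerator, as configured above

# Assumed (d_in, d_out) shapes of the targeted projections; verify with print(model).
ASSUMED_SHAPES = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
}
ASSUMED_NUM_LAYERS = 18  # assumed number of decoder layers


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in ASSUMED_SHAPES.values())
total_trainable = per_layer * ASSUMED_NUM_LAYERS
scaling = LORA_ALPHA / R  # factor by which the low-rank update is scaled

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"Total LoRA parameters:     {total_trainable:,}")
print(f"Update scaling factor:     {scaling}")
```

Under these assumptions the adapter holds a few million weights, a tiny fraction of the base model.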

### Train with SFTTrainer

`SFTTrainer` is provided by the [TRL](https://huggingface.co/docs/trl/index) library. It offers a convenient interface around the Transformers `Trainer` and enables straightforward supervised fine-tuning of models on instruction-based datasets using PEFT adapters.

In [16]:
from trl import SFTTrainer

In [14]:
# Define the training arguments, then build the trainer
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=4,  # Batch size per device during training
    gradient_accumulation_steps=4,  # Number of steps to accumulate gradients before updating the model
    max_grad_norm=0.3,  # Maximum gradient norm for gradient clipping
    warmup_ratio=0.03,  # Warmup ratio for learning rate scheduling
    lr_scheduler_type="constant",  # Constant learning rate; this scheduler type does not apply warmup
    max_steps=100,  # Maximum number of training steps
    logging_steps=10,  # Interval to log training metrics
    group_by_length=True,  # Batch samples of similar length together to reduce padding
    eval_steps=20,  # Evaluate every n steps during training
    evaluation_strategy="steps",
    fp16=True,  # Use fp16 mixed-precision training
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    peft_config=peft_config,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    dataset_text_field="text", # Column/field that contains the text in the dataset
    max_seq_length=512, # Set the maximum sequence length
    tokenizer=tokenizer,
)


Map: 100%|██████████| 9846/9846 [00:02<00:00, 3615.16 examples/s]
Map: 100%|██████████| 518/518 [00:00<00:00, 5209.89 examples/s]
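The training arguments above determine how much data a run actually sees. The short calculation below is a sketch that assumes a single GPU (`num_devices = 1` is an assumption; the other numbers come from the configuration and the dataset size shown in the `Map` output).

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 100
num_devices = 1        # assumption: training on a single GPU
train_examples = 9846  # size of the train split, from the Map output

# One optimizer step consumes this many examples.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
# Total examples consumed over the whole run, and the share of one epoch that is.
examples_seen = effective_batch_size * max_steps
fraction_of_epoch = examples_seen / train_examples

print(f"Effective batch size: {effective_batch_size}")
print(f"Examples seen:        {examples_seen}")
print(f"Fraction of an epoch: {fraction_of_epoch:.2%}")
```

With these settings a 100-step run covers only a fraction of one epoch, so raise `max_steps` for a full pass over the data.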


In [15]:
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 0     | 1.6771        | No log          |
| 1     | 1.8198        | No log          |


Checkpoint destination directory ./results/checkpoint-500 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./results/checkpoint-1000 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=1000, training_loss=1.8983008775711059, metrics={'train_runtime': 905.0472, 'train_samples_per_second': 17.679, 'train_steps_per_second': 1.105, 'total_flos': 5.615236195757261e+16, 'train_loss': 1.8983008775711059, 'epoch': 1.62})
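The metrics in the logged `TrainOutput` can be cross-checked with simple arithmetic. The sketch below copies the logged values by hand and derives the throughput per step and the number of epochs covered; the train split size of 9846 comes from the earlier `Map` output.

```python
# Values copied from the logged TrainOutput.
logged = {
    "global_step": 1000,
    "train_runtime": 905.0472,           # seconds
    "train_samples_per_second": 17.679,
    "train_steps_per_second": 1.105,
}
train_examples = 9846  # size of the train split

# Optimizer steps per second, recomputed from step count and runtime.
steps_per_second = logged["global_step"] / logged["train_runtime"]
# Examples consumed per optimizer step, implied by the two throughput figures.
samples_per_step = (
    logged["train_samples_per_second"] / logged["train_steps_per_second"]
)
# Passes over the training data completed by the logged run.
epochs_covered = logged["global_step"] * round(samples_per_step) / train_examples

print(f"Steps per second: {steps_per_second:.3f}")
print(f"Samples per step: {samples_per_step:.1f}")
print(f"Epochs covered:   {epochs_covered:.2f}")
```

The result is about 16 samples per step, which is consistent with a per-device batch size of 4 and 4 gradient accumulation steps on one device, and roughly 1.6 epochs, in line with the logged `'epoch': 1.62`.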