<a href="https://colab.research.google.com/github/mshojaei77/Awesome-Fine-tuning/blob/main/gemma_2_9b_qlora.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip3 install -q -U bitsandbytes peft trl accelerate datasets transformers ipywidgets


In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit= True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
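
To get a feel for what the 4-bit NF4 and double-quantization settings above buy, here is a rough back-of-the-envelope sketch. It assumes a round 9e9 weights, a quantization block size of 64, and a second-level block size of 256 (the values described for QLoRA); `quantized_weight_gib` is a hypothetical helper, and the estimate ignores layers that stay in higher precision.

```python
def quantized_weight_gib(n_params, bits_per_weight=4.0, block_size=64,
                         double_quant=True, second_block_size=256):
    """Estimate memory (GiB) for block-wise quantized weights plus scaling constants."""
    if double_quant:
        # 8-bit first-level constants, plus one fp32 second-level constant
        # for every `second_block_size` first-level constants.
        overhead_bits = 8 / block_size + 32 / (block_size * second_block_size)
    else:
        # One fp32 scaling constant per block of weights.
        overhead_bits = 32 / block_size
    total_bits = n_params * (bits_per_weight + overhead_bits)
    return total_bits / 8 / 2**30


N_PARAMS = 9e9  # assumption: round figure for a "9B" model, not an exact count

plain = quantized_weight_gib(N_PARAMS, double_quant=False)
double = quantized_weight_gib(N_PARAMS, double_quant=True)
print(f"without double quantization: {plain:.2f} GiB")
print(f"with double quantization:    {double:.2f} GiB")
```

Under these assumptions, double quantization trims the per-weight overhead of the scaling constants from 0.5 bits to roughly 0.127 bits.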

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
model_id = "google/gemma-2-9b"

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(model_id, add_eos_token=True)

In [None]:
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("mshojaei77/merged_persian_alpaca", split = "train")

df = dataset.to_pandas()
df.head(10)

# Dataset Formatting
Now, let’s format the dataset according to the specified Gemma instruction format.
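
As a minimal sketch of that turn layout: each turn is wrapped in `<start_of_turn>` and `<end_of_turn>` markers, with the role name on the first line. The helper `format_turn` and the sample record below are hypothetical and only illustrate the shape of the resulting text.

```python
def format_turn(role, content):
    # One conversation turn: start marker, role name, newline, content, end marker.
    return f"<start_of_turn>{role}\n{content}<end_of_turn>\n"


# Hypothetical record shaped like an Alpaca-style row (instruction + output).
record = {"instruction": "Translate 'hello' to Persian.", "output": "سلام"}

text = format_turn("user", record["instruction"]) + format_turn("model", record["output"])
print(text)
```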

In [None]:
def generate_prompt(data_point):
    """Build the training text from a task instruction and its answer.

    :param data_point: dict with "instruction" and "output" keys
    :return: str: prompt in the Gemma turn format (not yet tokenized)
    """
    prefix_text = ('Below is an instruction that describes a task. Write a response that '
                   'appropriately completes the request.\n\n')

    # Gemma turn format: "<start_of_turn>{role}\n{content}<end_of_turn>\n"
    text = (f"<start_of_turn>user\n{prefix_text}{data_point['instruction']}<end_of_turn>\n"
            f"<start_of_turn>model\n{data_point['output']}<end_of_turn>")
    return text

In [None]:
text_column = [generate_prompt(data_point) for data_point in dataset]
dataset = dataset.add_column("prompt", text_column)

In [None]:
dataset = dataset.shuffle(seed=1234)  # Shuffle dataset here
dataset = dataset.map(lambda samples: tokenizer(samples["prompt"]), batched=True)

In [None]:
dataset = dataset.train_test_split(test_size=0.2)
train_data = dataset["train"]
test_data = dataset["test"]
dataset

Here, we enable gradient checkpointing and prepare the model for k-bit (low-bit) training. Gradient checkpointing reduces activation memory by recomputing intermediate activations during the backward pass, at the cost of extra compute, and `prepare_model_for_kbit_training` from PEFT (Parameter-Efficient Fine-Tuning) readies the quantized model for training on resource-constrained hardware.
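
The memory trade-off of gradient checkpointing can be illustrated with a toy count. This is a deliberately simplified sketch, not how any library measures memory: it assumes every layer keeps the same number of intermediate tensors, that checkpoints sit at layer boundaries, and the figures 42 and 10 are hypothetical.

```python
def stored_activations(n_layers, per_layer, checkpointing):
    """Count activation tensors kept alive for the backward pass (toy model)."""
    if not checkpointing:
        # Every intermediate tensor of every layer is kept until backward.
        return n_layers * per_layer
    # Keep only each layer's input, plus the internals of the one layer
    # that is being recomputed at any given moment during backward.
    return n_layers + per_layer


N_LAYERS = 42    # assumption: hypothetical decoder depth
PER_LAYER = 10   # assumption: hypothetical tensors saved per layer

without_ckpt = stored_activations(N_LAYERS, PER_LAYER, checkpointing=False)
with_ckpt = stored_activations(N_LAYERS, PER_LAYER, checkpointing=True)
print(f"stored tensors without checkpointing: {without_ckpt}")
print(f"stored tensors with checkpointing:    {with_ckpt}")
```

The saving comes at the price of roughly one extra forward computation per layer during the backward pass.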


In [None]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [None]:
from peft import LoraConfig, get_peft_model, PeftModel

modules = ['up_proj', 'q_proj', 'down_proj', 'gate_proj', 'o_proj', 'k_proj', 'v_proj']

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=modules,
    lora_dropout=0.06,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

In [None]:
trainable, total = model.get_nb_trainable_parameters()
print(f"trainable: {trainable} | total {total} | Percentage: {trainable/total*100:.4f}%")
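
The printed trainable count can be cross-checked by hand, since LoRA adds `r * (d_in + d_out)` parameters to each adapted linear layer. The sketch below uses layer shapes and a layer count that are assumed for Gemma 2 9B rather than read from the model, so verify them against `model.config` before relying on the result.

```python
R = 16          # LoRA rank, as in lora_config
N_LAYERS = 42   # assumption: number of decoder layers

# Assumed (d_in, d_out) for each targeted projection in one decoder layer.
SHAPES = {
    "q_proj": (3584, 4096),
    "k_proj": (3584, 2048),
    "v_proj": (3584, 2048),
    "o_proj": (4096, 3584),
    "gate_proj": (3584, 14336),
    "up_proj": (3584, 14336),
    "down_proj": (14336, 3584),
}


def lora_params(d_in, d_out, r):
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * d_in + d_out * r


per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in SHAPES.values())
total_lora = per_layer * N_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total_lora:,}")
```

If the total differs from what `get_nb_trainable_parameters()` reports, the assumed shapes are the first thing to check.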

# Fine-Tuning with QLoRA Using SFTTrainer from the trl Library
We configure and instantiate an SFTTrainer to fine-tune the model on our training and evaluation datasets. The setup specifies training arguments such as batch size, gradient accumulation, learning rate, and optimizer, adds a data collator for causal language modeling, and reuses the previously defined LoRA configuration for parameter-efficient fine-tuning.
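
How batch size, gradient accumulation, and step count combine can be worked out with simple arithmetic. The sketch below uses the values from the training cell and assumes a single GPU.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 130
num_devices = 1  # assumption: training runs on a single GPU

# Examples contributing to each optimizer update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Total examples processed before training stops at max_steps.
examples_seen = effective_batch * max_steps

print(f"effective batch size: {effective_batch}")
print(f"examples seen in {max_steps} steps: {examples_seen}")
```

With these settings the run covers only a few hundred examples, so raising `max_steps` is the natural knob if the dataset is much larger.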

In [None]:
import transformers

from trl import SFTTrainer

tokenizer.pad_token = tokenizer.eos_token
torch.cuda.empty_cache()
trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=test_data,
    dataset_text_field="prompt",
    peft_config=lora_config,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        # warmup_ratio=0.03,  # a fraction belongs in warmup_ratio; warmup_steps expects an integer
        max_steps=130,
        learning_rate=2e-4,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
        save_strategy="epoch",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

In [None]:
model.config.use_cache = False
trainer.train()

In [None]:
# Name of the directory for the fine-tuned adapter
new_model = "gemma2-9b-finetuned"

# Save the fine-tuned model
trainer.model.save_pretrained(new_model)

# Merging Fine-Tuned Model
We can load the base model in 16-bit floating point precision, with `low_cpu_mem_usage=True` to limit host memory during loading, and merge the fine-tuned LoRA weights into it to create a single standalone model. The tokenizer's padding token is set to the end-of-sequence token and its padding side to the right, and the merged model and tokenizer are saved to a local directory.
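
Merging folds the low-rank update into the base weight, `W_merged = W + (alpha / r) * B @ A`, so the adapter is no longer needed at inference time. The toy example below, with made-up 2x3 matrices and plain Python lists, is only a sketch of that identity and not the PEFT implementation.

```python
def matmul(a, b):
    # Multiply two matrices given as lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    # Multiply a matrix by a vector.
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def merge_lora(W, A, B, alpha, r):
    # Fold the scaled low-rank update B @ A into the base weight W.
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


# Hypothetical toy values: 3 inputs, 2 outputs, rank 2.
W = [[1, 0, 2], [0, 1, -1]]   # base weight, shape (out, in)
A = [[1, 2, 0], [0, 1, 1]]    # LoRA A, shape (r, in)
B = [[1, 0], [2, 1]]          # LoRA B, shape (out, r)
alpha, r = 32, 2
x = [1, 2, 3]

merged_out = matvec(merge_lora(W, A, B, alpha, r), x)

# The same result from keeping the adapter separate: W x + scale * B (A x).
adapter_out = [(alpha / r) * v for v in matvec(B, matvec(A, x))]
separate_out = [b + a for b, a in zip(matvec(W, x), adapter_out)]

print(merged_out, separate_out)
```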

In [None]:
# Merge the model with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)
merged_model = PeftModel.from_pretrained(base_model, new_model)
merged_model = merged_model.merge_and_unload()

# Configure padding before saving so the saved tokenizer carries these settings
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Save the merged model and tokenizer
merged_model.save_pretrained("merged_model", safe_serialization=True)
tokenizer.save_pretrained("merged_model")

Now, let's give this model a try!


In [None]:
def get_completion(query: str, model, tokenizer) -> str:
    """Generate a response to `query`, wrapped in the Gemma turn format."""
    device = "cuda:0"

    prompt_template = (
        "<start_of_turn>user\n"
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        "{query}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

    prompt = prompt_template.format(query=query)

    encodeds = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)

    # Move the input tensors to the GPU that holds the model
    model_inputs = encodeds.to(device)

    generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
    # Decode the single generated sequence, dropping special tokens
    decoded = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

    return decoded

In [None]:
result = get_completion(query="Create a function to calculate the sum of a sequence of integers.", model=merged_model, tokenizer=tokenizer)
print(result)

In [None]:
merged_model.push_to_hub("mshojaei77/persian_gemma-2-9b-4bit")
tokenizer.push_to_hub("mshojaei77/persian_gemma-2-9b-4bit")
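# Note: push_to_hub_merged(..., save_method="merged_16bit") is an Unsloth method and is not
# available on a plain transformers model; the push_to_hub calls above already upload the
# merged 16-bit weights and the tokenizer.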