# QLoRA with Llama 3.1 8B Instruct on Colab

This tutorial shows how to use quantized low-rank adaptation (QLoRA) to fine-tune a large language model so that it generates stories in a particular style (for example, a style appropriate for a target age group).

- We use [Unsloth](https://github.com/unslothai/unsloth/wiki) because it can save the trained models in the GGUF file format, which is required to run the models with [llama.cpp](https://github.com/ggerganov/llama.cpp) on CPUs. This way, we can use the trained models on our local computers without GPUs.

- For more information, see [Hugging Face transformers](https://huggingface.co/docs/transformers/index).  

### Install Hugging Face's datasets module and Unsloth

In [None]:
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
!pip install datasets

### Import the packages used throughout

In [None]:
import torch
from unsloth import FastLanguageModel, is_bfloat16_supported
from transformers import TextStreamer

### Load a Large Language Model

[Llama 3.1](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1) is an auto-regressive language model that uses an optimized transformer architecture. The instruction-tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.  

Here, we will use a [quantized version](https://huggingface.co/unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit) of Meta Llama 3.1 8B Instruct because this model is about 5.7GB and therefore fits in most of our laptops' RAM. Another important point is that saving the trained model in GGUF format requires an intermediate step that creates a de-quantized model in F16 precision. This de-quantized Llama 3.1 8B model is about 15GB, which is about the largest size that fits in the memory of Google Colab's Tesla T4 GPU.
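
The sizes quoted above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes roughly 8.03 billion parameters for Llama 3.1 8B (a figure taken from the public model card, not from this notebook) and counts the weights only. The real 4-bit checkpoint is larger than the naive estimate, presumably because some tensors stay in higher precision and the quantization constants add overhead.

```python
# Back-of-the-envelope weight sizes; ignores activations, KV cache, and optimizer state.
N_PARAMS = 8.03e9  # assumed parameter count for Llama 3.1 8B


def weight_size_gib(n_params: float, bits_per_param: float) -> float:
    """Return the storage needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


f16_gib = weight_size_gib(N_PARAMS, 16)  # de-quantized F16 copy used for the GGUF export
int4_gib = weight_size_gib(N_PARAMS, 4)  # naive lower bound for a 4-bit checkpoint

print(f"F16 weights:   {f16_gib:.1f} GiB")
print(f"4-bit weights: {int4_gib:.1f} GiB (lower bound)")
```

The F16 estimate comes out at about 15 GiB, which is consistent with the size mentioned above and leaves very little headroom on a T4.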

In [None]:
context_length = 4096 # Maximum sequence length in tokens (2048 tokens is roughly 1500 words).

# Load the model from Hugging Face's repository.
# More info: https://github.com/unslothai/unsloth/wiki
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length = context_length,
    load_in_4bit = True,
    dtype = None,
)

### Define the prompt used for training and testing  

This is the chat format for Llama 3.1 when "ipython" (tool calling) is not used.  
Info: https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023  
Today Date: 12 Aug 2024

{}<|eot_id|><|start_header_id|>user<|end_header_id|>

{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{}


In [6]:
llama_prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 12 Aug 2024

{}<|eot_id|><|start_header_id|>user<|end_header_id|>

{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{}"""

#llama_prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
#
#{}<|eot_id|><|start_header_id|>user<|end_header_id|>
#
#{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
#
#{}"""

We will use this prompt for our LoRA training as well as for testing.

In [7]:
input_prompt = llama_prompt.format(
                  'You are a creative assistant who writes stories for children. Your goal is to write a story having surprising twists and a happy ending. At each of your chat turns, write either a chapter whose length is less than 1000 words or an entire story whose length is less than 1000 words. In addition, generate a title of the chapter or story. Furthermore, at the beginning of your chat turn, please indicate whether your writing is meant to be a story chapter or an entire story by saying either "chapter title:" or "story title:" in front of the title you produce.', # instruction
                  'Write a story for one-year-old children, two-year-old children, or three-year-old children.', # input
                  '' # output - leave this blank for generation!
                  )

In [8]:
print(input_prompt)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 12 Aug 2024

You are a creative assistant who writes stories for children. Your goal is to write a story having surprising twists and a happy ending. At each of your chat turns, write either a chapter whose length is less than 1000 words or an entire story whose length is less than 1000 words. In addition, generate a title of the chapter or story. Furthermore, at the beginning of your chat turn, please indicate whether your writing is meant to be a story chapter or an entire story by saying either "chapter title:" or "story title:" in front of the title you produce.<|eot_id|><|start_header_id|>user<|end_header_id|>

Write a story for one-year-old children, two-year-old children, or three-year-old children.<|eot_id|><|start_header_id|>assistant<|end_header_id|>




### Inference without LoRA  
Let's first generate some stories to see what the model does without our LoRA fine-tuning.

In [None]:
# Generate a story using TextStreamer. 
# More info https://huggingface.co/docs/transformers/en/generation_strategies

FastLanguageModel.for_inference(model)
inputs = tokenizer([input_prompt], return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 4096)

<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 12 Aug 2024

You are a creative assistant who writes stories for children. Your goal is to write a story having surprising twists and a happy ending. At each of your chat turns, write either a chapter whose length is less than 1000 words or an entire story whose length is less than 1000 words. In addition, generate a title of the chapter or story. Furthermore, at the beginning of your chat turn, please indicate whether your writing is meant to be a story chapter or an entire story by saying either "chapter title:" or "story title:" in front of the title you produce.<|eot_id|><|start_header_id|>user<|end_header_id|>

Write a story for one-year-old children, two-year-old children, or three-year-old children.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

story title: Benny's Little Adventure

In a sunny little village, there lived a little rabbit named 

Now, let's run the identical generation again to see how much the story changes. This lets us separate the run-to-run variation of inference from the effect of LoRA fine-tuning.

In [10]:
FastLanguageModel.for_inference(model)
inputs = tokenizer([input_prompt], return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 4096)

<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 12 Aug 2024

You are a creative assistant who writes stories for children. Your goal is to write a story having surprising twists and a happy ending. At each of your chat turns, write either a chapter whose length is less than 1000 words or an entire story whose length is less than 1000 words. In addition, generate a title of the chapter or story. Furthermore, at the beginning of your chat turn, please indicate whether your writing is meant to be a story chapter or an entire story by saying either "chapter title:" or "story title:" in front of the title you produce.<|eot_id|><|start_header_id|>user<|end_header_id|>

Write a story for one-year-old children, two-year-old children, or three-year-old children.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

story title: Benny's Little Friends

Once upon a time, in a bright and sunny garden, there lived a l

Although the generated stories are somewhat different, you can see that the style of the story is more or less the same. 

Although we did not use them in this tutorial, there are many useful parameters and utilities for controlling the output:

- [NoBadWordsLogitsProcessor](https://huggingface.co/docs/transformers/v4.44.0/en/internal/generation_utils#transformers.NoBadWordsLogitsProcessor)
- [Stopping criteria](https://huggingface.co/docs/transformers/v4.44.0/en/internal/generation_utils#transformers.StoppingCriteria)
- [transformers.GenerationConfig](https://huggingface.co/docs/transformers/v4.44.0/en/main_classes/text_generation#transformers.GenerationConfig)
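
As a rough illustration of how such settings are passed, the sketch below collects a few standard `generate` keyword arguments in a dictionary. The values are arbitrary placeholders rather than tuned recommendations, and the commented-out call assumes the `model`, `inputs`, and `text_streamer` objects from the cells above.

```python
# Illustrative sampling settings; the numbers are placeholders, not tuned values.
generation_kwargs = {
    "max_new_tokens": 1024,     # upper bound on the number of generated tokens
    "do_sample": True,          # sample instead of always taking the most likely token
    "temperature": 0.7,         # values below 1.0 sharpen the token distribution
    "top_p": 0.9,               # nucleus sampling: keep the smallest token set with 90% mass
    "repetition_penalty": 1.1,  # values above 1.0 discourage repeated tokens
}

# With the objects defined earlier in the notebook, the call would look like:
# _ = model.generate(**inputs, streamer=text_streamer, **generation_kwargs)
print(generation_kwargs)
```

Passing `do_sample = False` instead would make the output deterministic, which can be useful when comparing the model before and after fine-tuning.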



### Prepare the dataset  
- Prepare your own dataset, using the syntax in gen_age_1_3.py as a guideline. The example dataset is provided for educational purposes only. Do not use it for anything else; replace it with your own dataset.

- As an example, some typical stories for one- to three-year-old children are included in gen_age_1_3.py. These stories share similar writing styles (e.g., lots of repetition with slight modification each time, limited vocabulary), which are absent from the stories generated without LoRA. Therefore, when the training is successful, you should see the corresponding change in the writing style!

- The data has to be provided to the model with the [right chat format](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1). Also, see [here](https://docs.unsloth.ai/basics/chat-templates) for more info about chat templates.
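
Since gen_age_1_3.py is not reproduced here, the following is only a guess at its shape. `Dataset.from_generator` expects a generator function that yields one dictionary per example, and the code in the next cell relies on the keys `system`, `user`, and `assistant`. The function name `gen_example_stories` and the placeholder story texts are made up for illustration.

```python
# Hypothetical stand-in for gen_age_1_3.py; names and texts are placeholders.
SYSTEM_PROMPT = "You are a creative assistant who writes stories for children."
USER_PROMPT = (
    "Write a story for one-year-old children, two-year-old children, "
    "or three-year-old children."
)

STORIES = [
    "story title: Placeholder Story One\n\n(Replace this with your own story text.)",
    "story title: Placeholder Story Two\n\n(Replace this with your own story text.)",
]


def gen_example_stories():
    """Yield one training example per story with the three expected features."""
    for story in STORIES:
        yield {"system": SYSTEM_PROMPT, "user": USER_PROMPT, "assistant": story}


examples = list(gen_example_stories())
print(len(examples), sorted(examples[0]))

# In the notebook, the generator would be handed to the datasets library:
# from datasets import Dataset
# dataset = Dataset.from_generator(gen_example_stories)
```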

In [None]:
from datasets import Dataset

# Create a dataset defined in gen_age_1_3.py
# Info: https://huggingface.co/docs/datasets/v2.20.0/en/create_dataset
# This dataset has three features: system, user, assistant.
from gen_age_1_3 import gen_age_1_3
dataset = Dataset.from_generator(gen_age_1_3)

# For the QLoRA training, build one Llama chat conversation per example.
# Take each example from the dataset and insert it into llama_prompt.
# This function adds another feature called 'text' to the dataset.
EOS_TOKEN = tokenizer.eos_token # End-of-sequence token
def formatting_prompts_func(examples):
    instructions = examples["system"]
    inputs       = examples["user"]
    outputs      = examples["assistant"]
    texts = []
    for instruction, user_input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = llama_prompt.format(instruction, user_input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

dataset = dataset.map(formatting_prompts_func, batched = True,)

In [12]:
print(dataset['text'])



## LoRA Training

### Set the training parameters
Below are the basic parameters of our LoRA fine-tuning. The first three configure LoRA itself and strongly affect the quality of the fine-tuning. The rest control the speed and convergence of the training. Try different combinations to get the best result.
The meanings of these parameters are explained [here](https://github.com/unslothai/unsloth/wiki).  
You can also find more information at:  
[Hugging Face Supervised Fine-Tuning Trainer (SFTTrainer)](https://huggingface.co/docs/trl/v0.7.4/en/sft_trainer#trl.SFTTrainer)  
[Hugging Face Training Arguments](https://huggingface.co/docs/transformers/v4.35.0/en/main_classes/trainer#transformers.TrainingArguments)  

When you over-train, the stories generated with LoRA are copies of the stories in the training dataset. On the other hand, when you under-train, the stories generated with LoRA are similar to the stories generated without LoRA.  
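
One rough way to tell these two failure modes apart is to measure how much of a generated story is copied verbatim from the training stories. The sketch below computes the fraction of word 5-grams in a generated text that also occur in any training text. The function names, the choice of n = 5, and the sample strings are assumptions made for illustration; they are not part of the original notebook.

```python
def word_ngrams(text: str, n: int) -> set:
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def copied_fraction(generated: str, training_texts: list, n: int = 5) -> float:
    """Fraction of n-grams in generated that appear in at least one training text."""
    generated_ngrams = word_ngrams(generated, n)
    if not generated_ngrams:
        return 0.0
    training_ngrams = set()
    for text in training_texts:
        training_ngrams |= word_ngrams(text, n)
    return len(generated_ngrams & training_ngrams) / len(generated_ngrams)


train = ["the little bear went up the hill and the little bear came down again"]
print(copied_fraction("the little bear went up the hill and sat", train))
print(copied_fraction("a cat sat on a warm mat all day long", train))
```

A value close to 1.0 suggests that the model is reciting training stories, while a value close to 0.0 combined with an unchanged style points toward under-training.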

When I tested this notebook, the following parameter values gave a good result: the training loss decreased from 2.5 to 0.8. However, the result may vary from run to run, so experiment and check your own results.

In [13]:
# LoRA parameters
lora_r = 2
lora_alpha = 4
lora_target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"] #["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"]

# Training parameters
num_train_epochs = 20
learning_rate = 8e-4
per_device_train_batch_size = 1
gradient_accumulation_steps = 7
warmup_steps = 5
logging_steps = 1

It is really interesting to see how different values of the LoRA rank (lora_r) or the target modules (lora_target_modules) affect the outcome. So, make sure to try out different values.
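
The training parameters above also determine how many optimizer steps are run. The arithmetic below is a sketch that reproduces the "Total batch size" and "Total steps" values shown in the training log further down. It assumes 7 examples in the dataset (as reported in that log) and a single GPU. The Trainer's exact rounding may differ when the dataset size is not a multiple of the effective batch size.

```python
import math

num_examples = 7  # size of the example dataset, as reported in the training log
per_device_train_batch_size = 1
gradient_accumulation_steps = 7
num_train_epochs = 20
num_gpus = 1

# Number of examples consumed per optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# Number of optimizer updates per pass over the dataset.
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch_size, steps_per_epoch, total_steps)
```

With these values every epoch is a single optimizer step, so `warmup_steps = 5` covers a quarter of the whole run.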

#### Define LoRA configuration  
info:  
[peft.LoraConfig](https://huggingface.co/docs/peft/v0.12.0/en/package_reference/lora#peft.LoraConfig)  

[peft.get_peft_model](https://huggingface.co/docs/peft/v0.12.0/en/package_reference/peft_model#peft.get_peft_model)

In [None]:
merged_model = FastLanguageModel.get_peft_model(
    model,
    r = lora_r,
    lora_alpha = lora_alpha,
    target_modules = lora_target_modules,
    use_rslora=True, # rank stabilized LoRA
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
    loftq_config = None,
)
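
To see where the number of trainable parameters in the training log comes from, the sketch below counts the LoRA weights by hand. The layer dimensions are assumptions taken from the public Llama 3.1 8B configuration (hidden size 4096, 32 layers, 8 key/value heads of size 128), not from this notebook. It also compares the classic LoRA scaling factor with the one used by rank-stabilized LoRA, which is what `use_rslora=True` selects.

```python
import math

# Assumed Llama 3.1 8B dimensions (from the public model config, not from this notebook).
hidden_size = 4096
num_layers = 32
kv_dim = 8 * 128  # 8 key/value heads of size 128 (grouped-query attention)

# (in_features, out_features) of each targeted projection.
target_shapes = {
    "q_proj": (hidden_size, hidden_size),
    "k_proj": (hidden_size, kv_dim),
    "v_proj": (hidden_size, kv_dim),
    "o_proj": (hidden_size, hidden_size),
}

lora_r = 2
lora_alpha = 4

# Each adapted projection adds A (r x in_features) and B (out_features x r).
params_per_layer = sum(
    lora_r * (fan_in + fan_out) for fan_in, fan_out in target_shapes.values()
)
trainable_params = params_per_layer * num_layers

# Scaling applied to the low-rank update B @ A.
standard_scaling = lora_alpha / lora_r           # classic LoRA
rslora_scaling = lora_alpha / math.sqrt(lora_r)  # rank-stabilized LoRA

print(f"{trainable_params:,}")
print(standard_scaling, round(rslora_scaling, 3))
```

Under these assumptions the count is 1,703,936, matching the value Unsloth reports when training starts. Adding the MLP projections from the commented-out list of target modules would increase it.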

#### Train LoRA  
Info:  
https://huggingface.co/docs/trl/v0.7.4/en/sft_trainer#trl.SFTTrainer
https://huggingface.co/docs/transformers/v4.35.0/en/main_classes/trainer#transformers.TrainingArguments

In [15]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = merged_model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = context_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        num_train_epochs = num_train_epochs,
        per_device_train_batch_size = per_device_train_batch_size,
        gradient_accumulation_steps = gradient_accumulation_steps,
        warmup_steps = warmup_steps,
        learning_rate = learning_rate,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = logging_steps,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

trainer_stats = trainer.train()

Map (num_proc=2):   0%|          | 0/7 [00:00<?, ? examples/s]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 7 | Num Epochs = 20
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 7
\        /    Total batch size = 7 | Total steps = 20
 "-____-"     Number of trainable parameters = 1,703,936


Step,Training Loss
1,2.5013
2,2.5013
3,2.4434
4,2.3099
5,2.1549
6,1.9877
7,1.8495
8,1.7258
9,1.5949
10,1.4529


### Inference with LoRA
Now, generate a story using the trained model. We will use the same prompt as before for comparison.

In [16]:
story_char_number = 0
while story_char_number < 500:
  FastLanguageModel.for_inference(merged_model)
  inputs = tokenizer([input_prompt], return_tensors = "pt").to("cuda")

  outputs_tensor = merged_model.generate(**inputs, max_new_tokens = 2048, use_cache = True) # Output tensor
  output_text = tokenizer.batch_decode(outputs_tensor) # Output list of length 1.
  output_text = output_text[0] # Convert from list to str
  output_text = output_text.replace(input_prompt, '') # Get rid of the input prompt
  output_text = output_text.replace('<|begin_of_text|>', '') # Get rid of '<|begin_of_text|>'
  story_char_number = len(output_text) # Count the number of characters in the generated story

print(output_text)

story title: The Rabbit Who Wants to Fall Asleep

In the forest, in a burrow, a little rabbit lived with his mother. One evening, the little rabbit felt tired. He wanted to go to sleep. But he didn't know how to fall asleep. He thought, "If I were a bird, I would fly. If I were a fish, I would swim. If I were a deer, I would run. If I were a squirrel, I would climb a tree." He thought of all the things he could do if he were different animals. But he was a rabbit. He couldn't fly, he couldn't swim, he couldn't run, he couldn't climb a tree. He couldn't even close his eyes. He thought, "If I were a rabbit who could close his eyes, I would close my eyes. I would go to sleep." He closed his eyes. He fell asleep.<|eot_id|>
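
The cell above retries the generation until the story has at least 500 characters, and the same loop is repeated in each of the following runs. A possible refactoring is sketched below. `generate_story` and its `generate_fn` argument are hypothetical: `generate_fn` stands for any zero-argument callable that returns the decoded text, such as a wrapper around `merged_model.generate` and `tokenizer.batch_decode`. The sketch also caps the number of attempts, which the original loop does not do.

```python
from typing import Callable


def generate_story(generate_fn: Callable[[], str], prompt: str,
                   min_chars: int = 500, max_attempts: int = 5) -> str:
    """Call generate_fn until the cleaned output reaches min_chars or attempts run out."""
    story = ""
    for _ in range(max_attempts):
        raw = generate_fn()
        # Same clean-up as in the notebook cell: drop the echoed prompt and the BOS marker.
        story = raw.replace(prompt, "").replace("<|begin_of_text|>", "")
        if len(story) >= min_chars:
            break
    return story


# Stub generator for demonstration: the first output is too short, the second is long enough.
outputs = iter(["PROMPT short", "PROMPT " + "la " * 200])
story = generate_story(lambda: next(outputs), prompt="PROMPT ")
print(len(story))
```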


Second inference:

In [17]:
story_char_number = 0
while story_char_number < 500:
  FastLanguageModel.for_inference(merged_model)
  inputs = tokenizer([input_prompt], return_tensors = "pt").to("cuda")

  outputs_tensor = merged_model.generate(**inputs, max_new_tokens = 2048, use_cache = True) # Output tensor
  output_text = tokenizer.batch_decode(outputs_tensor) # Output list of length 1.
  output_text = output_text[0] # Convert from list to str
  output_text = output_text.replace(input_prompt, '') # Get rid of the input prompt
  output_text = output_text.replace('<|begin_of_text|>', '') # Get rid of '<|begin_of_text|>'
  story_char_number = len(output_text) # Count the number of characters in the generated story

print(output_text)

story title: I Love You to the Moon and Back

I love you to the moon and back. I love you to the sky and back. I love you to the highest high and back again. I love you to the deepest deep and back again. I love you to the beginning of time and back again. I love you to the end of time and back again. I love you to the highest mountain and back again. I love you to the lowest valley and back again. I love you to the highest ocean and back again. I love you to the driest desert and back again. I love you to the most wonderful place that I know and back again. I love you to the moon and back.<|eot_id|>


Third inference:

In [18]:
story_char_number = 0
while story_char_number < 500:
  FastLanguageModel.for_inference(merged_model)
  inputs = tokenizer([input_prompt], return_tensors = "pt").to("cuda")

  outputs_tensor = merged_model.generate(**inputs, max_new_tokens = 2048, use_cache = True) # Output tensor
  output_text = tokenizer.batch_decode(outputs_tensor) # Output list of length 1.
  output_text = output_text[0] # Convert from list to str
  output_text = output_text.replace(input_prompt, '') # Get rid of the input prompt
  output_text = output_text.replace('<|begin_of_text|>', '') # Get rid of '<|begin_of_text|>'
  story_char_number = len(output_text) # Count the number of characters in the generated story

print(output_text)

story title: The Little Rabbit and the Moon

In the forest, there lived a little rabbit who loved to explore. One day, the little rabbit decided to go on a journey to the moon. She packed a small bag with some carrots and a bottle of water and set off.

As she hopped through the forest, she met a wise old owl who said, "Where are you going, little rabbit?" "I'm going to the moon," said the little rabbit. "I've never been there before, and I want to see what it's like." "That's a long way to go," said the owl. "But if you're sure you want to go, I'll give you a map to help you find your way." The little rabbit thanked the owl and took the map.

She hopped and hopped until she came to a river. She looked at the map and saw that she had to cross the river to get to the moon. She found a small boat and paddled across. On the other side, she met a friendly fish who said, "Where are you going, little rabbit?" "I'm going to the moon," said the little rabbit. "I've never been there before, and

Fourth inference:

In [19]:
story_char_number = 0
while story_char_number < 500:
  FastLanguageModel.for_inference(merged_model)
  inputs = tokenizer([input_prompt], return_tensors = "pt").to("cuda")

  outputs_tensor = merged_model.generate(**inputs, max_new_tokens = 2048, use_cache = True) # Output tensor
  output_text = tokenizer.batch_decode(outputs_tensor) # Output list of length 1.
  output_text = output_text[0] # Convert from list to str
  output_text = output_text.replace(input_prompt, '') # Get rid of the input prompt
  output_text = output_text.replace('<|begin_of_text|>', '') # Get rid of '<|begin_of_text|>'
  story_char_number = len(output_text) # Count the number of characters in the generated story

print(output_text)

story title: The Rabbit and the Lovely Moon

In the forest, there lived a little rabbit. He was a curious rabbit. One night, he saw the moon shining brightly in the sky. He wanted to catch the moon. So he set out to catch it. He ran and ran, but the moon was always just out of reach. He climbed up a tree, but the moon was higher. He jumped over a stream, but the moon was farther away. He even dug a tunnel, but the moon was beyond it. At last, he realized that the moon was not a thing to be caught. It was just a beautiful sight to see. The little rabbit returned home, happy to have seen the lovely moon.<|eot_id|>


You can clearly see that the writing style has changed drastically to match the stories in the dataset! For example, there is now a lot of repetition with slight modification each time, and the vocabulary is appropriate for the target age group.
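
The change in style can also be checked numerically. The sketch below computes two crude indicators: the type-token ratio (distinct words divided by total words) and the average sentence length in words. Lower values of both are consistent with a limited vocabulary and short, repetitive sentences. The function name `style_stats`, the regular-expression tokenization, and the choice of metrics are assumptions made for illustration.

```python
import re


def style_stats(text: str) -> dict:
    """Return simple vocabulary and sentence-length statistics for a story."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "words": len(words),
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }


sample = "I love you to the moon and back. I love you to the sky and back."
print(style_stats(sample))
```

Running the same function on stories generated before and after fine-tuning gives a quick, if rough, comparison. Special tokens such as `<|eot_id|>` should be stripped first so that they are not counted as words.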

### Save the trained model

In [None]:
# This saves only the LoRA adapter in a directory named "lora_model".
merged_model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")

# This merges the LoRA adapter into the base model, quantizes the result with q4_k_m, and saves it as a GGUF file in a directory named "model".
merged_model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
#merged_model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")
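
The name of the GGUF file written into the "model" directory may depend on the Unsloth version, so it can help to list what was actually produced before pointing llama.cpp at it. The helper below is a hypothetical convenience function that uses only the standard library.

```python
from pathlib import Path


def find_gguf_files(directory: str = "model") -> list:
    """Return the .gguf files found under directory, sorted by path."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.gguf"))


for path in find_gguf_files("model"):
    size_gib = path.stat().st_size / 2**30
    print(f"{path}  ({size_gib:.2f} GiB)")
```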