<!-- Banner Image -->
<center>
    <img src="https://developer-blogs.nvidia.com/wp-content/uploads/2024/07/rag-representation.jpg" width="75%">
</center>

<!-- Links -->
<center>
  <a href="https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/workbench/" style="color: #76B900;">NVIDIA AI Workbench</a> •
  <a href="https://docs.nvidia.com/ai-workbench/" style="color: #76B900;">User Documentation</a> •
  <a href="https://docs.nvidia.com/ai-workbench/user-guide/latest/quickstart/example-projects.html" style="color: #76B900;">Example Projects Catalog</a> •
  <a href="https://forums.developer.nvidia.com/t/support-workbench-example-project-phi-3-finetune/303412" style="color: #76B900;"> Problem? Submit a ticket here! </a>
</center>

# Fine-tuning Microsoft's Phi-3 Mini using QLoRA

Welcome!

In this notebook and tutorial, we will fine-tune [Microsoft's Phi-3 Mini (128k)](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct), a relatively small 3.8B-parameter model "whose overall performance...rivals that of models such as Mixtral 8x7B and GPT-3.5, despite being small enough to be deployed on a phone."

This tutorial will use QLoRA, a fine-tuning method that combines quantization and LoRA. With 8-bit quantization, the model weights only amount to about 4GB of GPU memory, so no A100 needed! For more information about what these concepts are and how they work, check out this excellent [post](https://brev.dev/blog/how-qlora-works) from the Brev.dev team.
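
As a rough sanity check on the memory figure above, the weight footprint is approximately the parameter count times the bytes per parameter. The sketch below assumes exactly 3.8 billion parameters and ignores activations, optimizer state, and quantization overhead, so treat the numbers as estimates rather than measurements.

```python
# Rough estimate of weight memory at different precisions.
# Assumption: 3.8e9 parameters; activations, optimizer state, and
# quantization constants are ignored.
NUM_PARAMS = 3.8e9


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Return the approximate weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

At 8 bits this lands just under 4 GB, which is why the weights fit on a consumer GPU. Training needs additional memory on top of that.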

In this notebook, we will load the model in 8-bit using ``bitsandbytes`` and train LoRA adapters with the PEFT library from Hugging Face 🤗.

Before starting this project, you should have already configured your project environment with a ``HF_TOKEN`` project secret as your access token to use these Hugging Face models.

### Help us make this tutorial better! Please provide feedback on the [NVIDIA Developer Forum](https://forums.developer.nvidia.com/c/ai-data-science/nvidia-ai-workbench/671).

**(Optional)** We can use Weights & Biases to track our training metrics. You'll need to supply your API key when prompted.

In [1]:
# import wandb, os
# wandb.login()

# wandb_project = "viggo-finetune"
# if len(wandb_project) > 0:
#     os.environ["WANDB_PROJECT"] = wandb_project

### 1. Load Dataset

Let's load a meaning representation dataset and fine-tune Phi-3 Mini on it. This is a great fine-tuning dataset because it teaches the model a unique output format on which the base model performs poorly out of the box, so it is easy and inexpensive to gauge whether the fine-tuned model has learned well. (Sources: [here](https://ragntune.com/blog/gpt3.5-vs-llama2-finetuning) and [here](https://www.anyscale.com/blog/fine-tuning-is-for-form-not-facts))

In contrast, if you fine-tune on a fact-based dataset, the model may already do quite well on that, and gauging learning is less obvious and may be more computationally expensive.

In [3]:
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
!pip install datasets

from datasets import load_dataset

data_dir = "/project/data"

train_dataset = load_dataset('gem/viggo', split='train', cache_dir=data_dir, trust_remote_code=True)
eval_dataset = load_dataset('gem/viggo', split='validation', cache_dir=data_dir, trust_remote_code=True)
test_dataset = load_dataset('gem/viggo', split='test', cache_dir=data_dir, trust_remote_code=True)


In [5]:
print(train_dataset)
print(eval_dataset)
print(test_dataset)

Dataset({
    features: ['gem_id', 'meaning_representation', 'target', 'references'],
    num_rows: 5103
})
Dataset({
    features: ['gem_id', 'meaning_representation', 'target', 'references'],
    num_rows: 714
})
Dataset({
    features: ['gem_id', 'meaning_representation', 'target', 'references'],
    num_rows: 1083
})


### 2. Load Base Model

Next, let's load in Microsoft's Phi-3 Mini model from Hugging Face utilizing the 8-bit quantization configuration.

In [6]:
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForLanguageModeling

# Read the token from the HF_TOKEN project secret; never hard-code credentials in a notebook.
if "HF_TOKEN" not in os.environ:
    raise RuntimeError("Set the HF_TOKEN project secret before running this cell.")

base_model_id = "microsoft/Phi-3-mini-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(base_model_id,
                                             token=os.environ["HF_TOKEN"],
                                             cache_dir="/project/models",
                                             load_in_8bit=True,
                                             torch_dtype=torch.float16,
                                             trust_remote_code=True)



### 3. Tokenization

Set up the tokenizer.

To set `max_length`, which has a direct impact on your compute requirements, it's helpful to get a distribution of your data lengths. Hugging Face shares that data clearly, like so:

![image.png](attachment:77593312-b2b3-4238-891b-417930e2e9b9.png)

However, since we're combining multiple features of this dataset in `generate_and_tokenize_prompt`, let's get our own distribution of the final form of the data. Let's first tokenize without the truncation/padding, so we can get that length distribution.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    token=os.environ["HF_TOKEN"],
    add_eos_token=True,
    add_bos_token=True,
    use_fast=False, # needed for now, should be fixed soon
)

Set up the tokenize function to make `labels` and `input_ids` the same. This is basically what [self-supervised fine-tuning is](https://neptune.ai/blog/self-supervised-learning):

In [None]:
def tokenize(prompt):
    result = tokenizer(prompt)
    result["labels"] = result["input_ids"].copy()
    return result
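
Copying `input_ids` into `labels` works because Hugging Face causal language models shift the labels internally, so each position is scored against the token that follows it. The toy sketch below illustrates that pairing. The token IDs are hypothetical and the helper `next_token_pairs` is not part of any library.

```python
def next_token_pairs(input_ids):
    """Pair each context position with the label it is trained to predict."""
    labels = list(input_ids)  # the same copy the tokenize function makes
    # Position t is scored against labels[t + 1]; the last position has no target.
    return list(zip(input_ids[:-1], labels[1:]))


toy_ids = [1, 101, 102, 103, 2]  # hypothetical IDs: 1 = bos, 2 = eos
for context_token, target_token in next_token_pairs(toy_ids):
    print(f"after token {context_token} -> predict {target_token}")
```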

And convert each sample into a prompt inspired by [this notebook](https://github.com/samlhuillier/viggo-finetune/blob/main/llama/fine-tune-code-llama.ipynb).

In [None]:
def generate_and_tokenize_prompt(data_point):
    full_prompt = f"""Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values.
This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute'].
The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']

### Target sentence:
{data_point["target"]}

### Meaning representation:
{data_point["meaning_representation"]}
"""
    return tokenize(full_prompt)

Reformat the prompt and tokenize each sample:

In [None]:
tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)

You can also untokenize to make sure it was formatted properly.

In [None]:
untokenized_text = tokenizer.decode(tokenized_train_dataset[1]['input_ids'])
print(untokenized_text)

Let's get a distribution of our dataset lengths, so we can determine the appropriate `max_length` for our input tensors.

In [None]:
import matplotlib.pyplot as plt

def plot_data_lengths(tokenized_train_dataset, tokenized_val_dataset):
    lengths = [len(x['input_ids']) for x in tokenized_train_dataset]
    lengths += [len(x['input_ids']) for x in tokenized_val_dataset]
    print(len(lengths))

    # Plotting the histogram
    plt.figure(figsize=(10, 6))
    plt.hist(lengths, bins=20, alpha=0.7, color='blue')
    plt.xlabel('Length of input_ids')
    plt.ylabel('Frequency')
    plt.title('Distribution of Lengths of input_ids')
    plt.show()

plot_data_lengths(tokenized_train_dataset, tokenized_val_dataset)

From here, you can choose a value for `max_length`. Training examples are truncated and padded to fit that size. Be aware that a larger `max_length` increases compute and memory requirements.
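
One way to turn the histogram into a number is to pick a length that covers most samples and then check how many would be truncated. The sketch below uses a nearest-rank percentile on a made-up list of lengths. Both helper functions and the toy data are illustrative assumptions, not part of the notebook's pipeline.

```python
def length_at_percentile(lengths, percentile):
    """Smallest length covering at least `percentile` percent of samples (nearest rank)."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths)
    # Integer ceiling of percentile/100 * n avoids floating-point rounding surprises.
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def truncated_fraction(lengths, max_length):
    """Fraction of samples that are longer than max_length and would be cut."""
    return sum(1 for n in lengths if n > max_length) / len(lengths)


# Made-up token counts for illustration; in the notebook you would use
# [len(x["input_ids"]) for x in tokenized_train_dataset].
toy_lengths = [180, 195, 210, 220, 240, 250, 265, 280, 300, 410]

candidate = length_at_percentile(toy_lengths, 90)
print(f"90th percentile length: {candidate}")
print(f"samples truncated at that length: {truncated_fraction(toy_lengths, candidate):.0%}")
```

Rounding the candidate up slightly leaves headroom, at the cost of more padding per sample.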

Now let's tokenize again with padding and truncation. As before, the tokenize function copies `input_ids` into `labels`.

Add padding on the left as it [makes training use less memory](https://ai.stackexchange.com/questions/41485/while-fine-tuning-a-decoder-only-llm-like-llama-on-chat-dataset-what-kind-of-pa).

In [None]:
max_length = 320 # This was an appropriate max length for my dataset

# redefine the tokenize function and tokenizer

tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    padding_side="left",
    add_eos_token=True,
    add_bos_token=True,
    trust_remote_code=True,
    use_fast=False, # needed for now, should be fixed soon
)
tokenizer.pad_token = tokenizer.eos_token


def tokenize(prompt):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )
    result["labels"] = result["input_ids"].copy()
    return result

In [None]:
tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)

Generally, each `input_ids` should be padded on the left with the `eos_token`, end with an `eos_token`, and start the prompt with a `bos_token`. Print `tokenizer.eos_token_id` and `tokenizer.bos_token_id` to see the IDs to look for.

In [None]:
print(tokenized_train_dataset[4]['input_ids'])

In [None]:
untokenized_text = tokenizer.decode(tokenized_train_dataset[4]['input_ids'])
print(untokenized_text)
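
The layout to look for in the printed IDs can be sketched on toy data. This is a plain-Python illustration of left padding under the assumption that the pad token equals the eos token. It is not the tokenizer's actual implementation, and the IDs are hypothetical.

```python
def pad_left(token_ids, max_length, pad_id):
    """Truncate to max_length, then fill the front with pad_id."""
    truncated = token_ids[:max_length]
    return [pad_id] * (max_length - len(truncated)) + truncated


# Hypothetical IDs for illustration only: 1 = bos, 2 = eos (also used as pad).
BOS, EOS = 1, 2
sample = [BOS, 11, 12, 13, EOS]

padded = pad_left(sample, 8, EOS)
print(padded)  # [2, 2, 2, 1, 11, 12, 13, 2]
```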

Now all the samples should be the same length, `max_length` (320 for me).

In [None]:
plot_data_lengths(tokenized_train_dataset, tokenized_val_dataset)

#### How does the base model do?

Let's grab a test input (`target`) and desired output (`meaning_representation`) pair to see how the base model does on it.

In [None]:
print("Target Sentence: " + test_dataset[1]['target'])
print("Meaning Representation: " + test_dataset[1]['meaning_representation'] + "\n")

In [None]:
eval_prompt = """Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values.
This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute'].
The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']

### Target sentence:
Earlier, you stated that you didn't have strong feelings about PlayStation's Little Big Adventure. Is your opinion true for all games which don't have multiplayer?

### Meaning representation:
"""

In [None]:
# Apply the accelerator. You can comment this out to remove the accelerator.
# model = accelerator.prepare_model(model)

In [None]:
# Re-init the tokenizer so it doesn't add padding or eos token
eval_tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    token=os.environ["HF_TOKEN"],
    add_bos_token=True,
    use_fast=False, # needed for now, should be fixed soon
)

In [None]:
device = "cuda"
model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to(device)

In [None]:
model.eval()
with torch.no_grad():
    print(eval_tokenizer.decode(model.generate(**model_input, max_new_tokens=128)[0], skip_special_tokens=True))

We can see it doesn't do very well out of the box.

### 4. Set Up LoRA

Now, to start our fine-tuning, we have to apply some preprocessing to the model to prepare it for training. Let's set up our LoRA layers.

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

Let's print the model to examine its layers, as we will apply QLoRA to some linear layers of the model. Those layers are `o_proj`, `qkv_proj`, `gate_up_proj`, `down_proj`, `lm_head`.

In [None]:
print(model)

Here we define the LoRA config.

`r` is the rank of the low-rank matrix used in the adapters, which thus controls the number of parameters trained. A higher rank will allow for more expressivity, but there is a compute tradeoff.

`lora_alpha` is the scaling factor for the learned weights. The adapter update is scaled by `lora_alpha/r`, so a higher value for `lora_alpha` assigns more weight to the LoRA activations.

The values used in the QLoRA paper were `r=64` and `lora_alpha=16`, and these are said to generalize well, but we will use `r=8` and `lora_alpha=16` so that we have more emphasis on the new fine-tuned data while also reducing computational complexity.
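
To make the trade-off concrete, the sketch below compares the two settings for a single square linear layer. The layer width of 3072 is an assumption used for illustration, and the helper names are hypothetical. A LoRA adapter adds two matrices, `A` (`r x d_in`) and `B` (`d_out x r`), so it trains `r * (d_in + d_out)` parameters.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters for one adapted linear layer (A is r x d_in, B is d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the adapter update before it is added to the frozen output."""
    return lora_alpha / r


d_in = d_out = 3072  # assumed layer width, for illustration only
full_params = d_in * d_out

for r, lora_alpha in [(64, 16), (8, 16)]:
    trainable = lora_param_count(d_in, d_out, r)
    print(
        f"r={r:>2}, lora_alpha={lora_alpha}: "
        f"{trainable} trainable params "
        f"({100 * trainable / full_params:.2f}% of the full layer), "
        f"scaling={lora_scaling(lora_alpha, r)}"
    )
```

With `r=8` the adapter is eight times smaller than with `r=64`, while the scaling factor rises from 0.25 to 2.0.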

In [None]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "o_proj",
        "qkv_proj",
        "gate_up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

See how the model looks different now, with the LoRA adapters added:

In [None]:
print(model)

### 5. Run Training!

By default, this configuration trains for 500 steps but it will likely not have converged by then. You may consider upping the steps to 1000 in the below configuration. It may even need longer.
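
A quick calculation shows how much of the training set 500 steps actually covers. The sketch uses the batch settings from the training configuration in this section and the 5,103-example training split. The single-GPU count is an assumption.

```python
import math

train_examples = 5103            # size of the training split
per_device_train_batch_size = 1  # from the training configuration
gradient_accumulation_steps = 4  # from the training configuration
max_steps = 500                  # from the training configuration
num_gpus = 1                     # assumption: single-GPU run

# One optimizer step consumes this many examples.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
samples_seen = effective_batch * max_steps
steps_per_epoch = math.ceil(train_examples / effective_batch)

print(f"effective batch size: {effective_batch}")
print(f"samples seen in {max_steps} steps: {samples_seen}")
print(f"fraction of one epoch: {samples_seen / train_examples:.2f}")
print(f"steps for one full epoch: {steps_per_epoch}")
```

Under these assumptions, 500 steps cover well under half an epoch, which is consistent with the model not having converged.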

A note on training: you can set `max_steps` high initially and watch for the step at which your model's performance starts to degrade. That is where you'll find the sweet spot for how many steps to perform.

For example, say you start with 1000 steps and find that at around 500 steps the model starts overfitting: the validation loss goes up (bad) while the training loss keeps going down significantly, meaning the model is learning the training set really well but is unable to generalize to new data points. In that case, 500 steps would be your sweet spot, so you would use the `checkpoint-500` directory in your output dir (`phi3-viggo-finetune`) as your final model in step 6 below.
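
The selection rule can be written down in a few lines. The loss values below are made up purely for illustration and do not come from a real run. `best_checkpoint` is a hypothetical helper, and in practice you would read the evaluation losses from your own training logs.

```python
def best_checkpoint(eval_history):
    """Return the step with the lowest validation loss from (step, eval_loss) pairs."""
    if not eval_history:
        raise ValueError("eval_history must not be empty")
    step, _ = min(eval_history, key=lambda item: item[1])
    return step


# Made-up (step, validation loss) pairs: loss falls until step 500, then rises.
history = [
    (100, 1.20),
    (200, 0.85),
    (300, 0.70),
    (400, 0.64),
    (500, 0.61),
    (600, 0.63),
    (700, 0.68),
]

print(f"checkpoint-{best_checkpoint(history)}")  # checkpoint-500
```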

You can interrupt the process via Kernel -> Interrupt Kernel in the top nav bar once you see that further training is no longer helping.

In [None]:
if torch.cuda.device_count() > 1: # If more than 1 GPU
    model.is_parallelizable = True
    model.model_parallel = True

In [None]:
import transformers
from datetime import datetime

project = "viggo-finetune"
base_model_name = "phi3"
run_name = base_model_name + "-" + project
output_dir = "./" + run_name

tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=transformers.TrainingArguments(
        output_dir=output_dir,
        warmup_steps=5,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        max_steps=500,
        learning_rate=2.5e-5,
        logging_steps=25,
        optim="paged_adamw_8bit",
        logging_dir="./logs",        # Directory for storing logs
        save_strategy="steps",       # Save checkpoints on a step interval
        save_steps=50,               # Save a checkpoint every 50 steps
        evaluation_strategy="steps", # Evaluate on a step interval
        eval_steps=50,               # Evaluate every 50 steps
        do_eval=True,                # Run evaluation on the validation set
        report_to="none",            # Replace with "wandb" if you want to use Weights & Biases
        # run_name=f"{run_name}-{datetime.now().strftime('%Y-%m-%d-%H-%M')}"          # Name of the W&B run (optional)
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

### 6. Drum Roll... Try the Trained Model!

Depending on your GPU, it can be a good idea to kill the current process so that you don't run out of memory loading the base model again on top of the model we just trained. Go to `Kernel > Restart Kernel` or kill the process via the Terminal (`nvidia-smi` > `kill [PID]`).

By default, the PEFT library will only save the QLoRA adapters, so we need to first load the Phi-3 Mini model from the Hugging Face Hub:


In [None]:
import os  # needed for os.environ after a kernel restart
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

base_model_id = "microsoft/Phi-3-mini-128k-instruct"

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    token=os.environ["HF_TOKEN"],
    cache_dir="/project/models",
    load_in_8bit=True,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.float16,
)

eval_tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    token=os.environ["HF_TOKEN"],
    add_bos_token=True,
    trust_remote_code=True,
    use_fast=False,
)

Now load the QLoRA adapter from the appropriate checkpoint directory, i.e. the best performing model checkpoint:

In [None]:
from peft import PeftModel

ft_model = PeftModel.from_pretrained(base_model, "phi3-viggo-finetune/checkpoint-500")

...and now run your inference!

Let's try the same `eval_prompt` and thus `model_input` as above, and see if the new finetuned model performs better.

In [None]:
eval_prompt = """Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values.
This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute'].
The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']

### Target sentence:
Earlier, you stated that you didn't have strong feelings about PlayStation's Little Big Adventure. Is your opinion true for all games which don't have multiplayer?

### Meaning representation:
"""

model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to("cuda")

ft_model.eval()
with torch.no_grad():
    print(eval_tokenizer.decode(ft_model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))

We can now see an output similar to the following:

``Meaning Representation: verify_attribute(name[Little Big Adventure], rating[average], has_multiplayer[no], platforms[PlayStation])``
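
To compare predictions with references programmatically, it can help to split a meaning representation into its function name and attributes. The parser below is a minimal sketch that assumes the `function(attr[value], ...)` shape shown above. The helper name is hypothetical, and repeated attribute names would overwrite each other in the dictionary.

```python
import re


def parse_meaning_representation(text):
    """Split 'function(attr[value], ...)' into (function, {attr: value})."""
    match = re.fullmatch(r"\s*(\w+)\((.*)\)\s*", text, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"not a meaning representation: {text!r}")
    function, body = match.groups()
    # Each attribute looks like name[value]; the value may be empty.
    attributes = dict(re.findall(r"(\w+)\[([^\]]*)\]", body))
    return function, attributes


prediction = (
    "verify_attribute(name[Little Big Adventure], rating[average], "
    "has_multiplayer[no], platforms[PlayStation])"
)
function, attributes = parse_meaning_representation(prediction)
print(function)
print(attributes)
```

Comparing the parsed function and attribute dictionary against the reference from `test_dataset` gives a stricter check than eyeballing the generated string.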

### Sweet... it worked! The fine-tuned model now understands the meaning representation!

It's not excellent, but we only fine-tuned it for 500 steps, and it hadn't yet converged. The longer you fine-tune, the better you can expect it to perform (just watch for overfitting).

When we are ready, let's wrap this up by saving our finetuned model.

In [None]:
ft_model = ft_model.merge_and_unload()

# Save model and tokenizer
tokenizer = eval_tokenizer  # `tokenizer` from the training cells is gone after a kernel restart
tokenizer.save_pretrained("/project/models/NV-phi-3-viggo-finetune")
ft_model.save_pretrained("/project/models/NV-phi-3-viggo-finetune")
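
Conceptually, `merge_and_unload()` folds each adapter into its frozen weight so that no separate LoRA matrices are needed at inference time: the merged weight is `W + (lora_alpha / r) * B @ A`. The toy sketch below checks that identity on tiny hand-written matrices in plain Python. It is an illustration of the arithmetic only, not PEFT's implementation, and merging into a quantized base model involves extra dequantization steps that are not shown here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]


# Toy values for illustration: a 2x2 frozen weight and a rank-1 adapter.
W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight (d_out x d_in)
A = [[1.0, 0.0]]              # adapter down-projection (r x d_in)
B = [[0.5], [1.0]]            # adapter up-projection (d_out x r)
lora_alpha, r = 2, 1
x = [[1.0], [2.0]]            # input as a column vector

# Forward pass with the adapter kept separate.
adapter_out = add(matmul(W, x), scale(matmul(B, matmul(A, x)), lora_alpha / r))

# Forward pass after folding the adapter into the weight.
W_merged = add(W, scale(matmul(B, A), lora_alpha / r))
merged_out = matmul(W_merged, x)

print(adapter_out, merged_out)
```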

I hope you enjoyed this tutorial on fine-tuning Microsoft's Phi-3 Mini Instruct Model! 🤙 🤙 🤙