# Fine-tuning Phi3 Mini 4K Instruct for function calling

## Optional step: mount user storage for storing fine-tuned model weights

First, authenticate so that the Colab notebook is allowed to mount GCP buckets.

In [None]:
from google.colab import auth, userdata
auth.authenticate_user()

Install the **gcsfuse** utility, which mounts a GCS bucket as a local file system.

In [2]:
!echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
!apt -qq update
!apt -qq install gcsfuse

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2659  100  2659    0     0  40114      0 --:--:-- --:--:-- --:--:-- 40287
OK
49 packages can be upgraded. Run 'apt list --upgradable' to see them.
W: http://packages.cloud.google.com/apt/dists/gcsfuse-bionic/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
gcsfuse is already the newest version (2.2.0).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.


Set the **GCP_BUCKET_PATH** [secret](https://medium.com/@parthdasawant/how-to-use-secrets-in-google-colab-450c38e3ec75) to your GCP bucket name, then mount the bucket to a local **data** folder.

In [3]:
bucket = userdata.get("GCP_BUCKET_PATH")
!mkdir -p data
!gcsfuse {bucket} data

{"timestamp":{"seconds":1718690919,"nanos":137726852},"severity":"INFO","message":"Start gcsfuse/2.2.0 (Go version go1.22.3) for app \"\" using mount point: /content/data\n"}
{"timestamp":{"seconds":1718690919,"nanos":137944771},"severity":"INFO","message":"GCSFuse mount command flags: {\"AppName\":\"\",\"Foreground\":false,\"ConfigFile\":\"\",\"MountOptions\":{},\"DirMode\":493,\"FileMode\":420,\"Uid\":-1,\"Gid\":-1,\"ImplicitDirs\":false,\"OnlyDir\":\"\",\"RenameDirLimit\":0,\"IgnoreInterrupts\":false,\"CustomEndpoint\":null,\"BillingProject\":\"\",\"KeyFile\":\"\",\"TokenUrl\":\"\",\"ReuseTokenFromUrl\":true,\"EgressBandwidthLimitBytesPerSecond\":-1,\"OpRateLimitHz\":-1,\"SequentialReadSizeMb\":200,\"AnonymousAccess\":false,\"MaxRetrySleep\":30000000000,\"StatCacheCapacity\":20460,\"StatCacheTTL\":60000000000,\"TypeCacheTTL\":60000000000,\"KernelListCacheTtlSeconds\":0,\"HttpClientTimeout\":0,\"MaxRetryDuration\":-1000000000,\"RetryMultiplier\":2,\"LocalFileCache\":false,\"TempDir\"

## Fine-tuning the Phi3 model

Install the dependencies, including [unsloth](https://github.com/unslothai/unsloth) and [xformers](https://github.com/facebookresearch/xformers).

**unsloth** is a library for fine-tuning large language models that aims to speed up training and reduce memory usage. Setting up the environment requires installing a few libraries and configuring parameters such as the sequence length and data type.

**xformers** is a library of optimized building blocks for transformers, the architecture behind many modern large language models. It aims to improve speed and reduce memory usage during training and inference, mainly through memory-efficient attention implementations. It works with PyTorch, so it fits into existing fine-tuning and deployment code.

In [4]:
!pip install torch==2.3.0 --index-url https://download.pytorch.org/whl/cu121
!pip install -U xformers==0.0.26.post1
!pip install trl peft accelerate bitsandbytes wandb
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

Looking in indexes: https://download.pytorch.org/whl/cu121
Collecting torch
  Downloading https://download.pytorch.org/whl/cu121/torch-2.3.0%2Bcu121-cp310-cp310-linux_x86_64.whl (781.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 781.0/781.0 MB 1.9 MB/s eta 0:00:00
Collecting nvidia-nccl-cu12==2.20.5 (from torch)
  Downloading https://download.pytorch.org/whl/cu121/nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 176.2/176.2 MB 9.3 MB/s eta 0:00:00
Collecting triton==2.3.0 (from torch)
  Downloading https://download.pytorch.org/whl/triton-2.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (168.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 168.1/168.1 MB 10.1 MB/s eta 0:00:00
Installing collected packages: triton, nvidia-nccl-cu12, torch
  Attempting uninstall: triton
    F

Import the dependencies and, optionally, connect to a wandb project for training monitoring.

In [5]:
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
import wandb
import os

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [None]:
from google.colab import userdata  # needed if the optional storage step above was skipped

wandb.login(key=userdata.get("wandb"))  # read the API key from the Colab secret named "wandb"
os.environ["WANDB_PROJECT"] = "func_calling_sft"

wandb: Currently logged in as: lliryc (gpn). Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Load the pretrained Phi3 Mini 4K Instruct base model for training. No access token is needed.

In [6]:
max_seq_length = 4096 # Context window size of the Phi3 Mini 4K Instruct model
dtype = None # None for auto-detection: float16 on Tesla T4/V100, bfloat16 on Ampere+. This notebook was originally run on an A100 with bfloat16 support
load_in_4bit = False # Set to True for 4-bit quantization to reduce memory usage; False avoids the quality loss of quantization


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Phi-3-mini-4k-instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit
)

tokenizer.padding_side = 'right'
EOS_TOKEN = tokenizer.eos_token

config.json:   0%|          | 0.00/627 [00:00<?, ?B/s]

==((====))==  Unsloth: Fast Mistral patching release 2024.6
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.64G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/172 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/3.17k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Create a LoRA adapter for fine-tuning, with rank 128 and scaling factor (`lora_alpha`) 256. Experiments showed this to be the most effective combination for this case.

In [None]:
peft_model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # LoRA rank; suggested values are 8, 16, 32, 64, 128. The largest one is used because it gave the best model quality here
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"], # Phi3 layers that receive LoRA adapters
    lora_alpha = 256, # Scaling factor; a larger value makes the adapted parameters more influential
    lora_dropout = 0.05, # Any value is supported, but only 0 uses Unsloth's fast path

    bias = "none",    # Any value is supported, but "none" is optimized
    random_state = 23, # Any number; fixes the seed for reproducibility
    use_rslora = False,  # Rank-stabilized LoRA is not needed
    loftq_config = None, # Neither is LoftQ initialization
)

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2024.5 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
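
To make the rank and scaling choice concrete, here is a minimal sketch of the arithmetic behind standard LoRA (the case with `use_rslora = False`): the low-rank update is scaled by `lora_alpha / r`, and each adapted weight matrix of shape `d_out x d_in` gains `r * (d_in + d_out)` trainable parameters. The helper names and the 3072 hidden size are assumptions used for illustration only; they are not read from the model config.

```python
def lora_scaling(lora_alpha: float, r: int) -> float:
    """Scaling applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / r


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added to one d_out x d_in weight matrix.

    A has shape (r, d_in) and B has shape (d_out, r).
    """
    return r * (d_in + d_out)


# Rank and alpha taken from the cell above
scaling = lora_scaling(lora_alpha=256, r=128)

# Hypothetical square projection; 3072 is an assumed hidden size
params = lora_param_count(d_in=3072, d_out=3072, r=128)

print(f"scaling = {scaling}, adapter params per matrix = {params:,}")
```

With these settings the update is scaled by a factor of 2, so halving `lora_alpha` or doubling `r` would bring the scaling back to 1.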


Load a dataset of function-calling examples. See this [model card](https://huggingface.co/mzbac/Phi-3-mini-4k-instruct-function-calling/blob/main/README.md) for more information.

In [None]:
from datasets import load_dataset
dataset = load_dataset("mzbac/function-calling-phi-3-format-v1.1", split = "train")

Downloading readme:   0%|          | 0.00/282 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/101M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/112390 [00:00<?, ? examples/s]

Create a trainer for supervised fine-tuning.

In [None]:
trainer = SFTTrainer(
    model = peft_model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4,
        warmup_steps = 100,
        max_steps = 1300, # kept small for demonstration; 111000 is recommended for the best fine-tuning
        learning_rate = 1e-6,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        optim = "paged_adamw_32bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 23,
        output_dir = "data",
        logging_steps = 50,
        save_strategy = "steps",
        save_steps = 1000, # save a model checkpoint every 1000 steps
        report_to = "wandb",
    )
)

  self.pid = os.fork()


Map (num_proc=2):   0%|          | 0/112390 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs


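Start fault-tolerant training with `resume_from_checkpoint=True`. If the runtime shuts down unexpectedly, training continues from the last checkpoint saved in `output_dir`. This setting expects a checkpoint to exist already: on the very first run, when there is none, `transformers` raises a `ValueError`.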

In [None]:
trainer_stats = trainer.train(resume_from_checkpoint = True)

	logging_steps: 50 (from args) != 10 (from trainer_state.json)


Step,Training Loss
1010,0.3452
1020,0.3546
1030,0.3192
1040,0.3549
1050,0.3417
1060,0.3567
1070,0.3601
1080,0.3025
1090,0.3863
1100,0.3533
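
Because `resume_from_checkpoint=True` fails when `output_dir` holds no checkpoint yet, a first run needs a guard. The sketch below is a hypothetical stdlib-only helper, not part of the original notebook, that looks for the `checkpoint-<step>` folders the trainer writes and returns the newest one. `transformers.trainer_utils.get_last_checkpoint` offers similar behaviour if you prefer the library function.

```python
import os
import re
from typing import Optional

# Folder names written by the trainer look like "checkpoint-1000"
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir: str) -> Optional[str]:
    """Return the path of the highest-numbered checkpoint folder, or None."""
    if not os.path.isdir(output_dir):
        return None
    steps = []
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            steps.append(int(match.group(1)))
    if not steps:
        return None
    return os.path.join(output_dir, f"checkpoint-{max(steps)}")


# Usage with the trainer defined earlier (commented out so the sketch runs standalone):
# last = find_last_checkpoint("data")
# trainer_stats = trainer.train(resume_from_checkpoint=last)  # None starts from scratch
```

Passing a path (or `None`) instead of `True` lets the same cell serve both the first run and any later restart.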


In [None]:
trainer_stats.metrics # check status

{'train_runtime': 1064.2312,
 'train_samples_per_second': 19.545,
 'train_steps_per_second': 1.222,
 'total_flos': 4.7450352740010394e+17,
 'train_loss': 0.07908284700833834,
 'epoch': 0.18506655277955728}
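
These numbers can be cross-checked against the training arguments. The following sketch assumes a single GPU, so that the effective batch size is `per_device_train_batch_size * gradient_accumulation_steps`; the variable names are illustrative.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 1300
dataset_size = 112390  # size of the train split reported when the dataset was loaded

# Samples consumed per optimizer step (single-GPU assumption)
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Total samples seen and the approximate fraction of one epoch covered
samples_seen = max_steps * effective_batch_size
epoch_fraction = samples_seen / dataset_size

print(f"effective batch size: {effective_batch_size}")
print(f"samples seen: {samples_seen}")
print(f"approximate epochs: {epoch_fraction:.3f}")
```

The computed fraction is close to the `epoch` value in the metrics, which confirms that 1300 steps cover less than a fifth of the dataset. Small differences can come from how the trainer rounds batches.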

Save **peft_model** along with the **tokenizer** to the local **data** folder (which can be synced with the GCP bucket).

In [None]:
peft_model.save_pretrained("data/peft_model")
tokenizer.save_pretrained("data/peft_model")

('data/peft_model/tokenizer_config.json',
 'data/peft_model/special_tokens_map.json',
 'data/peft_model/tokenizer.model',
 'data/peft_model/added_tokens.json',
 'data/peft_model/tokenizer.json')

## Experiments with a fine-tuned model

Load the saved **model** and **tokenizer** from the local folder.

In [17]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "data/peft_model",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit
)


==((====))==  Unsloth: Fast Mistral patching release 2024.6
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Finally, test the model on a function-calling example using unsloth's fast inference mode. As the output shows, the model translates the user request into a function call.

In [32]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

tool = {
    "name": "calculator",
    "description": "Perform a math operations over x and y",
    "parameter": {
        "type": "json",
        "properties": {
            "x": {
                "type": "number",
                "description": "argument 1 for math operation",
                "required": True,
            },
            "y": {
                "type": "number",
                "description": "argument 2 for math operation",
                "required": True,
            },
            "op": {
                "type": "enum",
                "description": "math operation",
                "enum": ["add", "subtract", "multiply", "divide"],
                "required": True,
            }
        },
    },
}

messages = [
    {
        "role": "user",
        "content": f"You are a helpful assistant aware of the calculator function. Translate user requests to a function calls - {str(tool)}",
    },
    {
        "role": "user", "content": "What's sum of 45646556 and 23423424?"
    },
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|end|>")]

outputs = model.generate(
    input_ids,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.1,
    max_new_tokens = max_seq_length
)
response = outputs[0]
print(tokenizer.decode(response))


<s><|user|> You are a helpful assistant aware of the calculator function. Translate user requests to a function calls - {'name': 'calculator', 'description': 'Perform a math operations over x and y', 'parameter': {'type': 'json', 'properties': {'x': {'type': 'number', 'description': 'argument 1 for math operation', 'required': True}, 'y': {'type': 'number', 'description': 'argument 2 for math operation', 'required': True}, 'op': {'type': 'enum', 'description': 'math operation', 'enum': ['add', 'subtract', 'multiply', 'divide'], 'required': True}}}}<|end|><|assistant|><|user|> What's sum of 45646556 and 23423424?<|end|><|assistant|> <function name="calculator" description="Perform a math operations over x and y" parameter={'x': 45646556, 'y': 23423424, 'op': 'add'}/><|end|>
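
To act on such a response, the application still has to extract the call and run it. Below is a minimal sketch that assumes the model always emits the `<function name=... parameter={...}/>` shape seen in the output above. The `parse_function_call` and `calculator` helpers are hypothetical and not part of the original notebook.

```python
import ast
import re

# Matches the tag shape seen in the sample output; other shapes are not handled.
_CALL_RE = re.compile(
    r'<function\s+name="(?P<name>[^"]+)".*?parameter=(?P<params>\{.*?\})\s*/>',
    re.DOTALL,
)


def parse_function_call(text: str):
    """Return (function_name, params_dict), or None if no call is found."""
    match = _CALL_RE.search(text)
    if match is None:
        return None
    # The parameters are printed as a Python dict literal, so literal_eval suffices
    params = ast.literal_eval(match.group("params"))
    return match.group("name"), params


def calculator(x, y, op):
    """Hypothetical implementation of the tool described in the prompt."""
    operations = {
        "add": lambda a, b: a + b,
        "subtract": lambda a, b: a - b,
        "multiply": lambda a, b: a * b,
        "divide": lambda a, b: a / b,
    }
    return operations[op](x, y)


# Tail of the decoded response shown above
sample = (
    '<|assistant|> <function name="calculator" '
    'description="Perform a math operations over x and y" '
    "parameter={'x': 45646556, 'y': 23423424, 'op': 'add'}/><|end|>"
)

name, params = parse_function_call(sample)
result = calculator(**params)
print(name, params, result)
```

`ast.literal_eval` is used instead of `eval` so that only literals are accepted from model output. A production parser would also need to handle malformed or missing tags.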
