# Finetune Llama-3 with LLaMA Factory

Please use a **free** Tesla T4 Colab GPU to run this!

Project homepage: https://github.com/hiyouga/LLaMA-Factory

## Install Dependencies

In [1]:
%cd /content/
%rm -rf LLaMA-Factory
!git clone https://github.com/hiyouga/LLaMA-Factory.git
%cd LLaMA-Factory
%ls
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers==0.0.25
!pip install .[bitsandbytes]

/content
Cloning into 'LLaMA-Factory'...
remote: Enumerating objects: 10455, done.
remote: Counting objects: 100% (1271/1271), done.
remote: Compressing objects: 100% (222/222), done.
remote: Total 10455 (delta 1116), reused 1130 (delta 1049), pack-reused 9184
Receiving objects: 100% (10455/10455), 214.20 MiB | 11.58 MiB/s, done.
Resolving deltas: 100% (7712/7712), done.
Updating files: 100% (211/211), done.
/content/LLaMA-Factory
assets/       docker-compose.yml  examples/  pyproject.toml  requirements.txt  src/
CITATION.cff  Dockerfile          LICENSE    README.md       scripts/          tests/
data/         evaluation/         Makefile   README_zh.md    setup.py
Collecting unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-hzcwpoy4/unsloth_f9d5b35b33e54775873fa6efc2ee3ba5
  Running command git clone 

### Check GPU environment

In [None]:
import torch

# Warn (rather than fail) when no CUDA device is visible to PyTorch.
if not torch.cuda.is_available():
  print("Please set up a GPU before using LLaMA Factory: https://medium.com/mlearning-ai/training-yolov4-on-google-colab-316f8fff99c6")
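The training cell in this notebook passes `fp16=True` rather than a bfloat16 flag, and the Unsloth banner in the training log reports `CUDA = 7.5` and `Bfloat16 = FALSE` for the T4. A minimal sketch of that choice, assuming the usual rule of thumb that native bfloat16 needs compute capability 8.0 (Ampere) or newer. `pick_mixed_precision` is a hypothetical helper for illustration, not part of LLaMA Factory:

```python
def pick_mixed_precision(compute_capability):
    """Return which mixed-precision flag to enable for a GPU.

    Assumption: bfloat16 is natively supported only from compute
    capability 8.0 (Ampere) upward; older GPUs fall back to float16.
    """
    major, _minor = compute_capability
    return "bf16" if major >= 8 else "fp16"


# The Tesla T4 reports compute capability 7.5 in the Unsloth banner.
print(pick_mixed_precision((7, 5)))
# An Ampere-class GPU would report 8.x instead.
print(pick_mixed_precision((8, 0)))
```

On a live runtime, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple this helper expects.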

## Update Identity Dataset

In [None]:
import json

%cd /content/LLaMA-Factory/

NAME = "Llama-3"
AUTHOR = "LLaMA Factory"

with open("data/identity.json", "r", encoding="utf-8") as f:
  dataset = json.load(f)

for sample in dataset:
  sample["output"] = sample["output"].replace("NAME", NAME).replace("AUTHOR", AUTHOR)

with open("data/identity.json", "w", encoding="utf-8") as f:
  json.dump(dataset, f, indent=2, ensure_ascii=False)


/content/LLaMA-Factory
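To see what the cell above does to a single record, here is a small in-memory sketch. The record is hypothetical: it is reconstructed from the sample printed in the training log (user turn `hi`), and the field names `instruction` and `input` are assumptions about the file layout. Only `output` is touched by the notebook code.

```python
import json

NAME = "Llama-3"
AUTHOR = "LLaMA Factory"

# Hypothetical record reconstructed from the sample in the training log;
# the real data/identity.json holds many more entries.
dataset = [
    {
        "instruction": "hi",
        "input": "",
        "output": "Hello! I am NAME, an AI assistant developed by AUTHOR. How can I assist you today?",
    }
]

# Same substitution as the notebook cell: fill both placeholders in each answer.
for sample in dataset:
    sample["output"] = sample["output"].replace("NAME", NAME).replace("AUTHOR", AUTHOR)

print(json.dumps(dataset, indent=2, ensure_ascii=False))
```

Because the placeholders are gone after the first pass, running the substitution again leaves the data unchanged.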


## Fine-tune model via LLaMA Board

In [None]:
from llmtuner import create_ui

%cd /content/LLaMA-Factory/

create_ui().queue().launch(share=True)

## Fine-tune model via Command Line

Training takes about 30 minutes on a free T4 GPU.

In [None]:
from llmtuner import run_exp
from llmtuner.extras.misc import torch_gc

%cd /content/LLaMA-Factory/

run_exp(dict(
  stage="sft",                        # do supervised fine-tuning
  do_train=True,
  model_name_or_path="unsloth/llama-3-8b-Instruct-bnb-4bit", # use bnb-4bit-quantized Llama-3-8B-Instruct model
  dataset="identity,alpaca_gpt4_en",             # use alpaca and identity datasets
  template="llama3",                     # use llama3 prompt template
  finetuning_type="lora",                   # use LoRA adapters to save memory
  lora_target="all",                     # attach LoRA adapters to all linear layers
  output_dir="llama3_lora",                  # the path to save LoRA adapters
  per_device_train_batch_size=2,               # the batch size
  gradient_accumulation_steps=4,               # the gradient accumulation steps
  lr_scheduler_type="cosine",                 # use cosine learning rate scheduler
  logging_steps=10,                      # log every 10 steps
  warmup_ratio=0.1,                      # use warmup scheduler
  save_steps=1000,                      # save checkpoint every 1000 steps
  learning_rate=5e-5,                     # the learning rate
  num_train_epochs=3.0,                    # the epochs of training
  max_samples=500,                      # use at most 500 examples from each dataset
  max_grad_norm=1.0,                     # clip gradient norm to 1.0
  quantization_bit=4,                     # use 4-bit QLoRA
  loraplus_lr_ratio=16.0,                   # use LoRA+ with lambda=16.0
  use_unsloth=True,                      # use UnslothAI's LoRA optimization for 2x faster training
  fp16=True,                         # use float16 mixed precision training
))

torch_gc()
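The settings `lr_scheduler_type="cosine"`, `warmup_ratio=0.1`, and `learning_rate=5e-5` together describe a learning rate that ramps up linearly and then decays along a half cosine. The sketch below illustrates that shape in plain Python. It assumes the standard linear-warmup plus cosine-decay formula, takes the total of 222 steps from the training log, and assumes the warmup length is rounded up; `lr_at` is a hypothetical helper, not a LLaMA Factory function.

```python
import math


def lr_at(step, total_steps=222, base_lr=5e-5, warmup_ratio=0.1):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    # Assumption: the warmup length is the ratio of total steps, rounded up.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half cosine from base_lr down to 0.
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, 10, 23, 100, 222):
    print(f"step {step:3d}: lr = {lr_at(step):.2e}")
```

With these values the peak of `5e-5` is reached once warmup ends, and the rate falls to zero at the final step.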

[INFO|training_args.py:1997] 2024-04-24 18:26:10,599 >> PyTorch: setting up devices
[INFO|training_args.py:1690] 2024-04-24 18:26:10,626 >> The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


/content/LLaMA-Factory




04/24/2024 18:26:10 - INFO - llmtuner.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: torch.float16


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
[INFO|tokenization_utils_base.py:2087] 2024-04-24 18:26:12,217 >> loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/efa44c86af4fcbbc3d75e6cb1c8bfaf7f5c7cfc1/tokenizer.json
[INFO|tokenization_utils_base.py:2087] 2024-04-24 18:26:12,219 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2087] 2024-04-24 18:26:12,222 >> loading file special_tokens

04/24/2024 18:26:12 - INFO - llmtuner.data.template - Replace eos token: <|eot_id|>




04/24/2024 18:26:12 - INFO - llmtuner.data.loader - Loading dataset identity.json...








04/24/2024 18:26:13 - INFO - llmtuner.data.loader - Loading dataset alpaca_gpt4_data_en.json...




Running tokenizer on dataset:   0%|          | 0/591 [00:00<?, ? examples/s]

[INFO|configuration_utils.py:726] 2024-04-24 18:26:15,050 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/efa44c86af4fcbbc3d75e6cb1c8bfaf7f5c7cfc1/config.json
[INFO|configuration_utils.py:789] 2024-04-24 18:26:15,054 >> Model config LlamaConfig {
  "_name_or_path": "unsloth/llama-3-8b-Instruct-bnb-4bit",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "quantization_config": {
    "_load_in_4bit": true,
    "_load_in_8bit": false,
    "bnb_4bit_compute_dtype": "bfloat16",
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_

input_ids:
[128000, 128006, 9125, 128007, 271, 2675, 527, 264, 11190, 18328, 13, 128009, 128006, 882, 128007, 271, 6151, 128009, 128006, 78191, 128007, 271, 9906, 0, 358, 1097, 445, 81101, 12, 18, 11, 459, 15592, 18328, 8040, 555, 445, 8921, 4940, 17367, 13, 2650, 649, 358, 7945, 499, 3432, 30, 128009]
inputs:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

hi<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hello! I am Llama-3, an AI assistant developed by LLaMA Factory. How can I assist you today?<|eot_id|>
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 9906, 0, 358, 1097, 445, 81101, 12, 18, 11, 459, 15592, 18328, 8040, 555, 445, 8921, 4940, 17367, 13, 2650, 649, 358, 7945, 499, 3432, 30, 128009]
labels:
Hello! I am Llama-3, an AI assistant developed by LLaMA Factory. How can I assist you 
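The sample above shows how supervised fine-tuning restricts the loss to the assistant's reply: every prompt position in `label_ids` is set to `-100`, the value that PyTorch's cross-entropy loss ignores by default, while the response tokens are kept. A minimal sketch of that masking, using the token ids copied from the logged sample; `mask_prompt` is a hypothetical helper, not LLaMA Factory's own code:

```python
IGNORE_INDEX = -100  # label value that the loss skips, as in the logged label_ids


def mask_prompt(prompt_ids, response_ids):
    """Build (input_ids, label_ids) so that only response tokens are trained on."""
    input_ids = prompt_ids + response_ids
    label_ids = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, label_ids


# Token ids copied from the logged sample: system and user turns plus the
# assistant header form the prompt, the assistant reply forms the response.
prompt_ids = [128000, 128006, 9125, 128007, 271, 2675, 527, 264, 11190, 18328,
              13, 128009, 128006, 882, 128007, 271, 6151, 128009, 128006, 78191,
              128007, 271]
response_ids = [9906, 0, 358, 1097, 445, 81101, 12, 18, 11, 459, 15592, 18328,
                8040, 555, 445, 8921, 4940, 17367, 13, 2650, 649, 358, 7945, 499,
                3432, 30, 128009]

input_ids, label_ids = mask_prompt(prompt_ids, response_ids)
print(len(input_ids), label_ids.count(IGNORE_INDEX))
```

The final response token, `128009`, is `<|eot_id|>`, so the model is also trained to end its turn.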

INFO:llmtuner.model.utils.quantization:Loading ?-bit BITSANDBYTES-quantized model.
[INFO|configuration_utils.py:726] 2024-04-24 18:26:15,368 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/efa44c86af4fcbbc3d75e6cb1c8bfaf7f5c7cfc1/config.json
[INFO|configuration_utils.py:789] 2024-04-24 18:26:15,376 >> Model config LlamaConfig {
  "_name_or_path": "unsloth/llama-3-8b-Instruct-bnb-4bit",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "quantization_config": {
    "_load_in_4bit": true,
    "_load_in_8bit": false,
    "bn

==((====))==  Unsloth: Fast Llama patching release 2024.4
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


[INFO|modeling_utils.py:3429] 2024-04-24 18:26:15,707 >> loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/efa44c86af4fcbbc3d75e6cb1c8bfaf7f5c7cfc1/model.safetensors
[INFO|modeling_utils.py:1494] 2024-04-24 18:26:15,786 >> Instantiating LlamaForCausalLM model under default dtype torch.float16.
[INFO|configuration_utils.py:928] 2024-04-24 18:26:15,800 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128001
}

[INFO|modeling_utils.py:4170] 2024-04-24 18:26:36,757 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:4178] 2024-04-24 18:26:36,761 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at unsloth/llama-3-8b-Instruct-bnb-4bit.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training

04/24/2024 18:26:40 - INFO - llmtuner.model.utils.checkpointing - Gradient checkpointing enabled.




04/24/2024 18:26:40 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA




04/24/2024 18:26:40 - INFO - llmtuner.model.utils.misc - Found linear modules: k_proj,o_proj,v_proj,gate_proj,up_proj,q_proj,down_proj




04/24/2024 18:26:41 - INFO - llmtuner.model.loader - trainable params: 20971520 || all params: 8051232768 || trainable%: 0.2605


[INFO|trainer.py:626] 2024-04-24 18:26:41,210 >> Using auto half precision backend


04/24/2024 18:26:41 - INFO - llmtuner.train.utils - Using LoRA+ optimizer with loraplus lr ratio 16.00.


   \\   /|    Num examples = 591 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 222
 "-____-"     Number of trainable parameters = 20,971,520
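The figures in this banner can be reproduced from the training arguments and the logged model config. The sketch below is a sanity check, not LLaMA Factory code. It assumes a single GPU, a per-epoch step count obtained by rounding up, and a LoRA rank of 8; the rank is not stated in this notebook, but it reproduces the logged parameter count.

```python
import math

# Values from the training arguments and the log.
num_examples = 591
per_device_batch = 2
grad_accum = 4
epochs = 3
num_gpus = 1

total_batch = per_device_batch * grad_accum * num_gpus
steps_per_epoch = math.ceil(num_examples / total_batch)
total_steps = steps_per_epoch * epochs
print(f"total batch size = {total_batch}, total steps = {total_steps}")

# Dimensions from the logged LlamaConfig.
hidden = 4096
inter = 14336
layers = 32
kv_dim = 8 * (hidden // 32)  # num_key_value_heads * head_dim
rank = 8                     # assumption: LoRA rank is not shown in this notebook

# (in_features, out_features) of each linear module that receives an adapter.
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, inter),
    "up_proj": (hidden, inter),
    "down_proj": (inter, hidden),
}

# Each adapter adds two low-rank matrices: rank x in and out x rank.
per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
lora_params = layers * per_layer
print(f"trainable LoRA parameters = {lora_params:,}")
print(f"trainable% = {100 * lora_params / 8051232768:.4f}")
```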


| Step | Training Loss |
|------|---------------|
| 10   | 1.3741        |
| 20   | 1.2185        |
| 30   | 1.0293        |
| 40   | 0.9528        |
| 50   | 0.954         |
| 60   | 0.9807        |
| 70   | 1.0341        |
| 80   | 0.7745        |
| 90   | 0.7701        |
| 100  | 0.7841        |


[INFO|<string>:474] 2024-04-24 18:52:42,923 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)


[INFO|trainer.py:3305] 2024-04-24 18:52:42,931 >> Saving model checkpoint to llama3_lora
[INFO|configuration_utils.py:726] 2024-04-24 18:52:43,245 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/efa44c86af4fcbbc3d75e6cb1c8bfaf7f5c7cfc1/config.json
[INFO|configuration_utils.py:789] 2024-04-24 18:52:43,248 >> Model config LlamaConfig {
  "_name_or_path": "meta-llama/Meta-Llama-3-8B-Instruct",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  

***** train metrics *****
  epoch                    =        3.0
  total_flos               = 15546268GF
  train_loss               =     0.7403
  train_runtime            = 0:26:01.10
  train_samples_per_second =      1.136
  train_steps_per_second   =      0.142
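The throughput metrics follow directly from the runtime, the dataset size, and the step count. A short check, using 591 examples, 3 epochs, and 222 steps from the log above; `parse_runtime` is a hypothetical helper written for this example:

```python
def parse_runtime(text):
    """Convert an 'H:MM:SS.ss' runtime string into seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


runtime = parse_runtime("0:26:01.10")  # train_runtime from the metrics
num_examples, epochs, total_steps = 591, 3, 222

samples_per_second = num_examples * epochs / runtime
steps_per_second = total_steps / runtime

print(f"runtime (s)              = {runtime:.1f}")
print(f"train_samples_per_second = {samples_per_second:.3f}")
print(f"train_steps_per_second   = {steps_per_second:.3f}")
```

Both computed values agree with the logged `1.136` and `0.142` after rounding to three decimals.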


## Infer the fine-tuned model

In [None]:
from llmtuner import ChatModel
from llmtuner.extras.misc import torch_gc

%cd /content/LLaMA-Factory/

chat_model = ChatModel(dict(
  model_name_or_path="unsloth/llama-3-8b-Instruct-bnb-4bit", # use bnb-4bit-quantized Llama-3-8B-Instruct model
  adapter_name_or_path="llama3_lora",            # load the saved LoRA adapters
  finetuning_type="lora",                  # same as the one used in training
  template="llama3",                     # same as the one used in training
  quantization_bit=4,                    # load 4-bit quantized model
  use_unsloth=True,                     # use UnslothAI's LoRA optimization for 2x faster generation
))

messages = []
while True:
  query = input("\nUser: ")
  if query.strip() == "exit":
    break

  if query.strip() == "clear":
    messages = []
    torch_gc()
    print("History has been removed.")
    continue

  messages.append({"role": "user", "content": query})     # add query to messages
  print("Assistant: ", end="", flush=True)
  response = ""
  for new_text in chat_model.stream_chat(messages):      # stream generation
    print(new_text, end="", flush=True)
    response += new_text
  print()
  messages.append({"role": "assistant", "content": response}) # add response to messages

torch_gc()

[INFO|tokenization_utils_base.py:2087] 2024-04-24 18:56:03,682 >> loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/efa44c86af4fcbbc3d75e6cb1c8bfaf7f5c7cfc1/tokenizer.json
[INFO|tokenization_utils_base.py:2087] 2024-04-24 18:56:03,683 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2087] 2024-04-24 18:56:03,685 >> loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/efa44c86af4fcbbc3d75e6cb1c8bfaf7f5c7cfc1/special_tokens_map.json
[INFO|tokenization_utils_base.py:2087] 2024-04-24 18:56:03,687 >> loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/efa44c86af4fcbbc3d75e6cb1c8bfaf7f5c7cfc1/tokenizer_config.json


/content/LLaMA-Factory




04/24/2024 18:56:04 - INFO - llmtuner.data.template - Replace eos token: <|eot_id|>


[INFO|configuration_utils.py:726] 2024-04-24 18:56:04,211 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/efa44c86af4fcbbc3d75e6cb1c8bfaf7f5c7cfc1/config.json
[INFO|configuration_utils.py:789] 2024-04-24 18:56:04,214 >> Model config LlamaConfig {
  "_name_or_path": "unsloth/llama-3-8b-Instruct-bnb-4bit",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "quantization_config": {
    "_load_in_4bit": true,
    "_load_in_8bit": false,
    "bnb_4bit_compute_dtype": "b

04/24/2024 18:56:04 - INFO - llmtuner.model.utils.quantization - Loading ?-bit BITSANDBYTES-quantized model.




04/24/2024 18:56:04 - INFO - llmtuner.model.patcher - Using KV cache for faster generation.




04/24/2024 18:56:04 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA


[INFO|configuration_utils.py:726] 2024-04-24 18:56:04,319 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/efa44c86af4fcbbc3d75e6cb1c8bfaf7f5c7cfc1/config.json
[INFO|configuration_utils.py:789] 2024-04-24 18:56:04,323 >> Model config LlamaConfig {
  "_name_or_path": "unsloth/llama-3-8b-Instruct-bnb-4bit",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "quantization_config": {
    "_load_in_4bit": true,
    "_load_in_8bit": false,
    "bnb_4bit_compute_dtype": "bfloat

==((====))==  Unsloth: Fast Llama patching release 2024.4
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


[INFO|configuration_utils.py:726] 2024-04-24 18:56:04,519 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/efa44c86af4fcbbc3d75e6cb1c8bfaf7f5c7cfc1/config.json
[INFO|configuration_utils.py:789] 2024-04-24 18:56:04,522 >> Model config LlamaConfig {
  "_name_or_path": "unsloth/llama-3-8b-Instruct-bnb-4bit",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "quantization_config": {
    "_load_in_4bit": true,
    "_load_in_8bit": false,
    "bnb_4bit_compute_dtype": "bfloat16",
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_

04/24/2024 18:56:25 - INFO - llmtuner.model.adapter - Loaded adapter(s): llama3_lora




04/24/2024 18:56:25 - INFO - llmtuner.model.loader - all params: 8051232768





User: who are you
Assistant: I am Llama-3, an AI assistant developed by LLaMA Factory.

User: give me 3 tips to keep healthy
Assistant: 1. Stay hydrated: Drinking plenty of water throughout the day helps to flush out toxins, maintain energy levels, and keep your skin looking healthy and clear.

2. Eat a balanced diet: A well-rounded diet that includes a variety of fruits, vegetables, whole grains, and lean protein sources can help to support overall health and well-being.

3. Get regular exercise: Engaging in regular physical activity can help to boost mood, improve sleep, and increase energy levels. Aim for at least 30 minutes of moderate exercise per day.

User: thanks
Assistant: You're welcome!

User: exit
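The `messages` list built in the chat loop is a sequence of role/content dictionaries, and the `llama3` template turns such a list into one prompt string. The sketch below reproduces the layout shown under `inputs:` in the training log. It is an illustration only: `render_llama3` is a hypothetical function, and the default system prompt is taken from that logged sample rather than from LLaMA Factory's source.

```python
def render_llama3(messages, system="You are a helpful assistant.", add_generation_prompt=False):
    """Render role/content messages in the Llama 3 chat layout seen in the log."""
    turns = ([{"role": "system", "content": system}] if system else []) + list(messages)
    text = "<|begin_of_text|>"
    for turn in turns:
        # Each turn: role header, blank line, content, end-of-turn marker.
        text += f"<|start_header_id|>{turn['role']}<|end_header_id|>\n\n{turn['content']}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


# The conversation from the logged training sample.
sample = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "Hello! I am Llama-3, an AI assistant developed by LLaMA Factory. How can I assist you today?"},
]
print(render_llama3(sample))

# At inference time the history ends with a user turn, so an assistant header is appended.
print(render_llama3([{"role": "user", "content": "who are you"}], add_generation_prompt=True))
```

Each turn ends with `<|eot_id|>`, which is why the logs report `Replace eos token: <|eot_id|>`: generation stops when the model emits that token.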
