<a href="https://colab.research.google.com/github/morningcafe/Llama-3-PyTorch/blob/main/Copy_of_Finetune_Llama3_with_LLaMA_Factory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetune Llama-3 with LLaMA Factory

Please use a **free** Tesla T4 Colab GPU to run this!

Project homepage: https://github.com/hiyouga/LLaMA-Factory

## Install Dependencies

In [None]:
%cd /content/
%rm -rf LLaMA-Factory
!git clone https://github.com/hiyouga/LLaMA-Factory.git
%cd LLaMA-Factory
%ls
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers==0.0.25
!pip install .[bitsandbytes]

/content
Cloning into 'LLaMA-Factory'...
remote: Enumerating objects: 10749, done.
remote: Counting objects: 100% (1566/1566), done.
remote: Compressing objects: 100% (351/351), done.
remote: Total 10749 (delta 1307), reused 1400 (delta 1210), pack-reused 9183
Receiving objects: 100% (10749/10749), 214.42 MiB | 21.48 MiB/s, done.
Resolving deltas: 100% (7906/7906), done.
Updating files: 100% (209/209), done.
/content/LLaMA-Factory
assets/       docker-compose.yml  examples/  pyproject.toml  requirements.txt  src/
CITATION.cff  Dockerfile          LICENSE    README.md       scripts/          tests/
data/         evaluation/         Makefile   README_zh.md    setup.py
Collecting unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-n5ypu41x/unsloth_333e62a0d033412daf70b766ce1f59d4
  Running command git clone 

### Check GPU environment

In [None]:
import torch

# Print a hint instead of raising, so the notebook keeps running without a GPU.
if not torch.cuda.is_available():
  print("Please set up a GPU before using LLaMA Factory: https://medium.com/mlearning-ai/training-yolov4-on-google-colab-316f8fff99c6")

## Update Identity Dataset

In [None]:
import json

%cd /content/LLaMA-Factory/

NAME = "Phi-3"
AUTHOR = "LLaMA Factory"

with open("data/identity.json", "r", encoding="utf-8") as f:
  dataset = json.load(f)

for sample in dataset:
  sample["output"] = sample["output"].replace("NAME", NAME).replace("AUTHOR", AUTHOR)

with open("data/identity.json", "w", encoding="utf-8") as f:
  json.dump(dataset, f, indent=2, ensure_ascii=False)

/content/LLaMA-Factory
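
To make the substitution in the cell above concrete, here is a minimal in-memory sketch. The two records are hypothetical and only mimic the `instruction`/`input`/`output` layout; the real `data/identity.json` may be worded differently. The helper `fill_identity` is an illustrative name, not part of LLaMA Factory.

```python
NAME = "Phi-3"
AUTHOR = "LLaMA Factory"

# Hypothetical records (assumption): the real file's wording may differ.
dataset = [
    {"instruction": "Who are you?", "input": "",
     "output": "I am NAME, an AI assistant developed by AUTHOR."},
    {"instruction": "Who made you?", "input": "",
     "output": "I was created by AUTHOR."},
]


def fill_identity(samples, name, author):
    """Return copies of the samples with the NAME/AUTHOR placeholders replaced."""
    return [
        {**s, "output": s["output"].replace("NAME", name).replace("AUTHOR", author)}
        for s in samples
    ]


filled = fill_identity(dataset, NAME, AUTHOR)
for sample in filled:
    print(sample["output"])
```

Because the cell above overwrites `data/identity.json` in place, the placeholders are gone after the first run. Changing `NAME` later and re-running has no effect unless the file is restored, for example by cloning the repository again.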


## Fine-tune model via LLaMA Board

In [None]:
%cd /content/LLaMA-Factory/
!GRADIO_SHARE=True llamafactory-cli webui

/content/LLaMA-Factory
2024-05-04 15:29:58.408986: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-04 15:29:58.409054: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-04 15:29:58.410508: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://cbd4ff8ed537d17538.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
Keyboard interruption in main thread... closing server.
Traceback 

## Fine-tune model via Command Line

Training takes about 30 minutes.

In [None]:
import json

args = dict(
  stage="sft",                        # do supervised fine-tuning
  do_train=True,
  model_name_or_path="microsoft/Phi-3-mini-4k-instruct", # use the Phi-3-mini-4k-instruct model
  dataset="identity,alpaca_gpt4_en",             # use alpaca and identity datasets
  template="phi",                     # use the Phi prompt template
  finetuning_type="lora",                   # use LoRA adapters to save memory
  lora_target="all",                     # attach LoRA adapters to all linear layers
  output_dir="phi3_lora",                  # the path to save LoRA adapters
  per_device_train_batch_size=2,               # the batch size
  gradient_accumulation_steps=4,               # the gradient accumulation steps
  lr_scheduler_type="cosine",                 # use cosine learning rate scheduler
  logging_steps=10,                      # log every 10 steps
  warmup_ratio=0.1,                      # use warmup scheduler
  save_steps=1000,                      # save checkpoint every 1000 steps
  learning_rate=5e-5,                     # the learning rate
  num_train_epochs=3.0,                    # the epochs of training
  max_samples=500,                      # use 500 examples in each dataset
  max_grad_norm=1.0,                     # clip gradient norm to 1.0
  quantization_bit=4,                     # use 4-bit QLoRA
  loraplus_lr_ratio=16.0,                   # use LoRA+ algorithm with lambda=16.0
  use_unsloth=True,                      # use UnslothAI's LoRA optimization for 2x faster training
  fp16=True,                         # use float16 mixed precision training
)

with open("train_phi3.json", "w", encoding="utf-8") as f:
  json.dump(args, f, indent=2)

%cd /content/LLaMA-Factory/

!llamafactory-cli train train_phi3.json

/content/LLaMA-Factory
2024-05-04 17:41:14.886178: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-04 17:41:14.886239: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-04 17:41:15.002060: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
05/04/2024 17:41:21 - INFO - llmtuner.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: torch.float16
tokenizer_config.json: 100% 3.17k/3.17k [00:00<00:00, 23.0MB/s]
tokenizer.model: 100% 500k/500k [00:00<00:00, 15.7MB/s]
tokenizer.json: 100% 1.84M/1.84M [00:00<00:00, 4.77MB/s]
added_tok
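
The batch-size and schedule arguments above interact, so a rough calculation helps when changing them. The sketch below assumes both datasets contribute the full `max_samples=500` (an upper bound of 1000 examples) and a single GPU. `estimate_schedule` is a hypothetical helper, and the rounding only approximates what the trainer does.

```python
import math


def estimate_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      num_epochs, warmup_ratio, num_devices=1):
    """Rough step counts for a run; the real trainer may round differently."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    # Micro-batches each device sees per epoch.
    micro_batches = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # One optimizer update happens every `grad_accum_steps` micro-batches.
    steps_per_epoch = max(micro_batches // grad_accum_steps, 1)
    total_steps = math.ceil(steps_per_epoch * num_epochs)
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch_size": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Assumption: 2 datasets x max_samples=500 = 1000 examples at most.
schedule = estimate_schedule(
    num_examples=1000,
    per_device_batch_size=2,
    grad_accum_steps=4,
    num_epochs=3.0,
    warmup_ratio=0.1,
)
print(schedule)
```

Under this upper bound the effective batch size is 8 and the run has 375 optimizer steps, which is below `save_steps=1000`. By this estimate no intermediate checkpoint would be written before training ends.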

## Infer the fine-tuned model

In [None]:
from llmtuner.chat import ChatModel
from llmtuner.extras.misc import torch_gc

%cd /content/LLaMA-Factory/

args = dict(
  model_name_or_path="unsloth/Phi-3-mini-4k-instruct", # use Unsloth's copy of Phi-3-mini-4k-instruct
  adapter_name_or_path="phi3_lora",            # load the saved LoRA adapters
  template="phi",                     # same as the one in training
  finetuning_type="lora",                  # same as the one in training
  quantization_bit=4,                    # load 4-bit quantized model
  use_unsloth=True,                     # use UnslothAI's LoRA optimization for 2x faster generation
)
chat_model = ChatModel(args)

messages = []
print("Welcome to the CLI application, use `clear` to remove the history, use `exit` to exit the application.")
while True:
  query = input("\nUser: ")
  if query.strip() == "exit":
    break
  if query.strip() == "clear":
    messages = []
    torch_gc()
    print("History has been removed.")
    continue

  messages.append({"role": "user", "content": query})
  print("Assistant: ", end="", flush=True)

  response = ""
  for new_text in chat_model.stream_chat(messages):
    print(new_text, end="", flush=True)
    response += new_text
  print()
  messages.append({"role": "assistant", "content": response})

torch_gc()

/content/LLaMA-Factory


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/3.17k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/569 [00:00<?, ?B/s]

[INFO|tokenization_utils_base.py:2087] 2024-05-04 17:59:01,578 >> loading file tokenizer.model from cache at /root/.cache/huggingface/hub/models--unsloth--Phi-3-mini-4k-instruct/snapshots/3b4b2149c2ef6acf53588d34107465b75b8a54d8/tokenizer.model
[INFO|tokenization_utils_base.py:2087] 2024-05-04 17:59:01,582 >> loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--unsloth--Phi-3-mini-4k-instruct/snapshots/3b4b2149c2ef6acf53588d34107465b75b8a54d8/tokenizer.json
[INFO|tokenization_utils_base.py:2087] 2024-05-04 17:59:01,583 >> loading file added_tokens.json from cache at /root/.cache/huggingface/hub/models--unsloth--Phi-3-mini-4k-instruct/snapshots/3b4b2149c2ef6acf53588d34107465b75b8a54d8/added_tokens.json
[INFO|tokenization_utils_base.py:2087] 2024-05-04 17:59:01,584 >> loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--unsloth--Phi-3-mini-4k-instruct/snapshots/3b4b2149c2ef6acf53588d34107465b75b8a54d8/special_tokens_map.json
[I

05/04/2024 17:59:01 - INFO - llmtuner.data.template - Replace eos token: <|end|>




config.json:   0%|          | 0.00/627 [00:00<?, ?B/s]

[INFO|configuration_utils.py:726] 2024-05-04 17:59:02,051 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--unsloth--Phi-3-mini-4k-instruct/snapshots/3b4b2149c2ef6acf53588d34107465b75b8a54d8/config.json
[INFO|configuration_utils.py:789] 2024-05-04 17:59:02,055 >> Model config MistralConfig {
  "_name_or_path": "unsloth/Phi-3-mini-4k-instruct",
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 32000,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 4096,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 32000,
  "rms_norm_eps": 1e-05,
  "rope_theta": 10000.0,
  "sliding_window": 2048,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.40.1",
  "use_cache": true,
  "vocab_size":

05/04/2024 17:59:02 - INFO - llmtuner.model.utils.quantization - Quantizing model to 4 bit.




05/04/2024 17:59:02 - INFO - llmtuner.model.patcher - Using KV cache for faster generation.




05/04/2024 17:59:02 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA


[INFO|configuration_utils.py:726] 2024-05-04 17:59:02,318 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--unsloth--Phi-3-mini-4k-instruct-bnb-4bit/snapshots/c4d5b8819dd175cd622e2ba6b02f3f44e412aa15/config.json
[INFO|configuration_utils.py:789] 2024-05-04 17:59:02,325 >> Model config MistralConfig {
  "_name_or_path": "unsloth/Phi-3-mini-4k-instruct-bnb-4bit",
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 32000,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 4096,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 32000,
  "quantization_config": {
    "_load_in_4bit": true,
    "_load_in_8bit": false,
    "bnb_4bit_compute_dtype": "bfloat16",
    "bnb_4bit_

==((====))==  Unsloth: Fast Mistral patching release 2024.4
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


[INFO|configuration_utils.py:726] 2024-05-04 17:59:02,551 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--unsloth--Phi-3-mini-4k-instruct-bnb-4bit/snapshots/c4d5b8819dd175cd622e2ba6b02f3f44e412aa15/config.json
[INFO|configuration_utils.py:789] 2024-05-04 17:59:02,559 >> Model config MistralConfig {
  "_name_or_path": "unsloth/Phi-3-mini-4k-instruct-bnb-4bit",
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 32000,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 4096,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 32000,
  "quantization_config": {
    "_load_in_4bit": true,
    "_load_in_8bit": false,
    "bnb_4bit_compute_dtype": "bfloat16",
    "bnb_4bit_quant_storage": "uint8",
    "bnb_4bit_quant_type": "

05/04/2024 17:59:09 - INFO - llmtuner.model.adapter - Loaded adapter(s): phi3_lora




05/04/2024 17:59:09 - INFO - llmtuner.model.loader - all params: 3836021760




Welcome to the CLI application, use `clear` to remove the history, use `exit` to exit the application.
Assistant: Hello! I am Llama-3, an AI assistant developed by LLaMA Factory. How can I assist you today?


KeyboardInterrupt: Interrupted by user

## Merge the LoRA adapter and optionally upload model

NOTE: The free Colab tier has only 12GB of RAM, while merging the LoRA adapter into an 8B model needs at least 18GB, so you **cannot** perform this step on the free tier.
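
A back-of-the-envelope estimate shows why an 8B model does not fit. The sketch assumes exactly 8e9 parameters and that the merge holds the full weights in 16-bit precision (2 bytes per parameter). The real parameter count and the extra working memory differ, so treat the result as a lower bound for the weights alone. `weight_memory_gb` is an illustrative helper, not a LLaMA Factory function.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 8e9  # assumption: a round 8B parameters

half_precision_gb = weight_memory_gb(NUM_PARAMS, 2)    # 16-bit weights
four_bit_gb = weight_memory_gb(NUM_PARAMS, 0.5)        # 4-bit weights, for comparison

print(f"16-bit weights: {half_precision_gb:.1f} GB")
print(f"4-bit weights:  {four_bit_gb:.1f} GB")
```

The 16-bit weights alone already exceed the 12GB of RAM on the free tier.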

In [None]:
!huggingface-cli login

In [None]:
import json

args = dict(
  model_name_or_path="meta-llama/Meta-Llama-3-8B-Instruct", # use official non-quantized Llama-3-8B-Instruct model
  adapter_name_or_path="llama3_lora",            # load the saved LoRA adapters
  template="llama3",                     # same as the one in training
  finetuning_type="lora",                  # same as the one in training
  export_dir="llama3_lora_merged",              # the path to save the merged model
  export_size=2,                       # the file shard size (in GB) of the merged model
  export_device="cpu",                    # the device used in export, can be chosen from `cpu` and `cuda`
  #export_hub_model_id="your_id/your_model",         # the Hugging Face hub ID to upload model
)

with open("merge_llama3.json", "w", encoding="utf-8") as f:
  json.dump(args, f, indent=2)

%cd /content/LLaMA-Factory/

!llamafactory-cli export merge_llama3.json