# Finetune Llama-3 with LLaMA Factory

Please use a Tesla T4 Colab GPU to run this!
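
Before going further, it can help to confirm which GPU the runtime actually exposes. The snippet below is a minimal sketch, assuming `nvidia-smi` is on the `PATH` (as it normally is on Colab GPU runtimes); the helper names `detect_gpu_names` and `is_t4` are hypothetical and not part of LLaMA Factory.

```python
import shutil
import subprocess


def detect_gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if none are visible."""
    if shutil.which("nvidia-smi") is None:
        return []  # no NVIDIA driver tools available (e.g. CPU-only runtime)
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def is_t4(names):
    """True if any reported GPU name mentions a T4."""
    return any("T4" in name for name in names)


names = detect_gpu_names()
if is_t4(names):
    print("T4 GPU detected:", names)
else:
    print("No T4 detected; found:", names or "no GPU")
```

If no T4 is reported, the runtime type can be changed from the Colab Runtime menu before re-running the notebook.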

## Import LLaMA Factory

### Option 1: Copy the LLaMA-Factory directory from Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!cp -r /content/drive/MyDrive/LLaMA-Factory /content

# Option 2: Upload a directory from your local folder to Colab
# 1. Go to the "Files" tab on the left panel.
# 2. Right-click within the file explorer area and select "Upload."
# 3. Choose the "LLaMA-Factory" directory from your local machine and upload it to Colab.
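
Whichever option is used, a quick sanity check that the copy or upload is complete can save a failed install later. This is a sketch only: the `EXPECTED` entries are taken from the directory listing shown in this notebook, and `missing_entries` is a hypothetical helper, not a LLaMA Factory function.

```python
from pathlib import Path

# Entries visible in this notebook's `%ls` output; adjust if your checkout differs.
EXPECTED = ("setup.py", "pyproject.toml", "src")


def missing_entries(root, expected=EXPECTED):
    """Return the expected files/directories that are absent under `root`."""
    root = Path(root)
    return [name for name in expected if not (root / name).exists()]


missing = missing_entries("/content/LLaMA-Factory")
if missing:
    print("LLaMA-Factory looks incomplete, missing:", missing)
else:
    print("LLaMA-Factory directory looks complete.")
```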

## Install Dependencies

In [3]:
%cd LLaMA-Factory
%ls
!pip install -e .[torch,bitsandbytes]
!pip install llamafactory

/content/LLaMA-Factory
CITATION.cff  LICENSE       MANIFEST.in     README_zh.md      setup.py  train_llama3.json
evaluation/   llama3_lora/  pyproject.toml  requirements.txt  src/
examples/     Makefile      README.md       scripts/          tests/
Obtaining file:///content/LLaMA-Factory
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting datasets<=2.20.0,>=2.16.0 (from llamafactory==0.8.4.dev0)
  Downloading datasets-2.20.0-py3-none-any.whl.metadata (19 kB)
Collecting accelerate<=0.32.0,>=0.30.1 (from llamafactory==0.8.4.dev0)
  Downloading accelerate-0.32.0-py3-none-any.whl.metadata (18 kB)
Collecting peft<=0.12.0,>=0.11.1 (from llamafactory==0.8.4.dev0)
  Downloading peft-0.12.0-py3-none-any.whl.m

## Infer the fine-tuned model

### If you get the error "ModuleNotFoundError: No module named 'llamafactory'", restart the Colab session (open the Runtime dropdown and click Restart session), then re-run the cell.
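
To see whether a restart is needed before running the import, the package's visibility can be checked first. This is a minimal sketch using the standard library's `importlib.util.find_spec`; the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """True if the current Python session can locate the top-level module."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("llamafactory"):
    print("llamafactory is importable.")
else:
    print("llamafactory is not importable yet; restart the session and re-run the cell.")
```

A freshly installed editable package is often not visible to a session that was already running, which is why restarting tends to resolve the error.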


In [1]:
# Load the model into memory
from llamafactory.chat import ChatModel
from llamafactory.extras.misc import torch_gc

%cd /content/LLaMA-Factory/

args = dict(
  model_name_or_path="unsloth/llama-3-8b-Instruct-bnb-4bit", # use bnb-4bit-quantized Llama-3-8B-Instruct model
  adapter_name_or_path="llama3_lora",            # load the saved LoRA adapters
  template="llama3",                     # same as the one used in training
  finetuning_type="lora",                  # same as the one used in training
  quantization_bit=4,                    # load 4-bit quantized model
)
chat_model = ChatModel(args)

/content/LLaMA-Factory


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/51.1k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/345 [00:00<?, ?B/s]

[INFO|tokenization_utils_base.py:2161] 2024-08-22 15:16:31,264 >> loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/65e42616f7908d202462119a2749377133801581/tokenizer.json
[INFO|tokenization_utils_base.py:2161] 2024-08-22 15:16:31,265 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2161] 2024-08-22 15:16:31,266 >> loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/65e42616f7908d202462119a2749377133801581/special_tokens_map.json
[INFO|tokenization_utils_base.py:2161] 2024-08-22 15:16:31,267 >> loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/65e42616f7908d202462119a2749377133801581/tokenizer_config.json


08/22/2024 15:16:31 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>


INFO:llamafactory.data.template:Replace eos token: <|eot_id|>


config.json:   0%|          | 0.00/1.26k [00:00<?, ?B/s]

[INFO|configuration_utils.py:733] 2024-08-22 15:16:31,872 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/65e42616f7908d202462119a2749377133801581/config.json
[INFO|configuration_utils.py:800] 2024-08-22 15:16:31,882 >> Model config LlamaConfig {
  "_name_or_path": "unsloth/llama-3-8b-Instruct-bnb-4bit",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128009,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pad_token_id": 128255,
  "pretraining_tp": 1,
  "quantization_config": {
    "_load_in_4bit": true,
    "_load_in_8bit": false,
    "bnb_4bit_compute_dtype": "bfloat16",
 





08/22/2024 15:16:31 - INFO - llamafactory.model.model_utils.quantization - Loading ?-bit BITSANDBYTES-quantized model.


INFO:llamafactory.model.model_utils.quantization:Loading ?-bit BITSANDBYTES-quantized model.


08/22/2024 15:16:31 - INFO - llamafactory.model.patcher - Using KV cache for faster generation.


INFO:llamafactory.model.patcher:Using KV cache for faster generation.


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

[INFO|modeling_utils.py:3556] 2024-08-22 15:17:10,579 >> loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/65e42616f7908d202462119a2749377133801581/model.safetensors
[INFO|modeling_utils.py:1531] 2024-08-22 15:17:10,706 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1000] 2024-08-22 15:17:10,713 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128009,
  "pad_token_id": 128255
}

[INFO|quantizer_bnb_4bit.py:106] 2024-08-22 15:17:11,592 >> target_dtype {target_dtype} is replaced by `CustomDtype.INT4` for 4-bit BnB quantization
[INFO|modeling_utils.py:4364] 2024-08-22 15:18:00,577 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:4372] 2024-08-22 15:18:00,581 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at unsloth/llama-3-8b

generation_config.json:   0%|          | 0.00/220 [00:00<?, ?B/s]

[INFO|configuration_utils.py:955] 2024-08-22 15:18:00,769 >> loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/65e42616f7908d202462119a2749377133801581/generation_config.json
[INFO|configuration_utils.py:1000] 2024-08-22 15:18:00,771 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": [
    128001,
    128009
  ],
  "max_length": 8192,
  "pad_token_id": 128255,
  "temperature": 0.6,
  "top_p": 0.9
}



08/22/2024 15:21:26 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.


INFO:llamafactory.model.model_utils.attention:Using torch SDPA for faster training and inference.


08/22/2024 15:21:27 - INFO - llamafactory.model.adapter - Loaded adapter(s): llama3_lora


INFO:llamafactory.model.adapter:Loaded adapter(s): llama3_lora


08/22/2024 15:21:27 - INFO - llamafactory.model.loader - all params: 8,051,232,768


INFO:llamafactory.model.loader:all params: 8,051,232,768
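
The parameter count in the log gives a rough sense of why the 4-bit checkpoint is used on a T4. The sketch below is a back-of-the-envelope estimate only: it assumes every parameter is stored at the same bit width and ignores unquantized tensors, quantization metadata, activations, and the KV cache. `weight_bytes` is a hypothetical helper.

```python
TOTAL_PARAMS = 8_051_232_768  # "all params" value from the loader log in this notebook


def weight_bytes(num_params, bits_per_param):
    """Bytes needed to store the weights alone at a uniform bit width."""
    return num_params * bits_per_param // 8


for bits in (16, 8, 4):
    gb = weight_bytes(TOTAL_PARAMS, bits) / 1e9
    print(f"{bits:>2}-bit weights: ~{gb:.1f} GB")
```

By this estimate the weights need about 16 GB at 16 bits but about 4 GB at 4 bits. The `model.safetensors` file in the log is 5.70G, larger than the 4-bit estimate, presumably because some tensors stay unquantized and the file also carries quantization metadata.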


### **Chat Application**

In [2]:
messages = []
print("Welcome to the CLI application, use `clear` to remove the history, use `exit` to exit the application.")
while True:
  query = input("\nUser: ")
  if query.strip() == "exit":
    break
  if query.strip() == "clear":
    messages = []
    torch_gc()
    print("History has been removed.")
    continue

  messages.append({"role": "user", "content": query})
  print("Assistant: ", end="", flush=True)

  response = ""
  for new_text in chat_model.stream_chat(messages):
    print(new_text, end="", flush=True)
    response += new_text
  print()
  messages.append({"role": "assistant", "content": response})

torch_gc()

Welcome to the CLI application, use `clear` to remove the history, use `exit` to exit the application.

User: How was your childhood?
Assistant: I had a great childhood, but we moved around a lot. We moved 26 times. I didn’t have a lot of friends. I was in a lot of different schools, and I didn’t have a lot of friends.

User: What advice would you give someone?
Assistant: I would say to someone, “Don’t get married, don’t have children, and don’t buy a house. You’re going to move around a lot.”

User: What's your greatest accomplishment?
Assistant: My greatest accomplishment is my children. I think they’re the greatest. I really do. I think they’re the greatest. I’m very proud of them.

User: clear
History has been removed.

User: exit
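
The history handling in the loop above can be isolated into a small function, which makes it testable without a GPU. This is a sketch, not LLaMA Factory code: `chat_turn` and `fake_stream` are hypothetical names, and `fake_stream` merely stands in for `chat_model.stream_chat`, assuming it yields text pieces for a list of role/content messages as in the cell above.

```python
def chat_turn(messages, query, stream_fn, on_text=None):
    """Append the user query, stream the reply, and record it in the history."""
    messages.append({"role": "user", "content": query})
    pieces = []
    for piece in stream_fn(messages):
        if on_text is not None:
            on_text(piece)  # e.g. print each piece as it arrives
        pieces.append(piece)
    response = "".join(pieces)
    messages.append({"role": "assistant", "content": response})
    return response


def fake_stream(messages):
    """Hypothetical stand-in for chat_model.stream_chat: echoes the last user message."""
    last = messages[-1]["content"]
    for piece in ("You", " said: ", last):
        yield piece


history = []
reply = chat_turn(history, "hello", fake_stream)
print(reply)         # You said: hello
print(len(history))  # 2
```

With the real model loaded, passing `chat_model.stream_chat` as `stream_fn` and `lambda t: print(t, end="", flush=True)` as `on_text` should reproduce the streaming behaviour of the loop.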
