<a href="https://colab.research.google.com/github/f901107/Fine_tuning_LLMs/blob/main/Fine_tune_Llama_2_in_Google_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Fine-tuning LLaMA 2 with QLoRA

Use the best GPU available (go to Runtime -> Change runtime type).

To fine-tune a model, load a JSONL file named `train.jsonl` whose records contain `prompt` and `response` keys, and do the same for `test.jsonl`. Then run all cells.
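As a rough illustration of the expected file layout, the sketch below writes two records to a JSONL file and reads them back, checking that every line is a JSON object with `prompt` and `response` keys. The file name `sample.jsonl`, the helper names, and the sample records are made up for this example and are not part of the original notebook.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"prompt", "response"}

def write_jsonl(path, records):
    """Write one JSON object per line; keep non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a JSONL file and check that every record has the required keys."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {line_number} is missing keys: {sorted(missing)}")
            records.append(record)
    return records

# Hypothetical sample records, for illustration only.
sample_records = [
    {"prompt": "What is 2 + 2?", "response": "2 + 2 = 4。"},
    {"prompt": "Name a prime number above 10.", "response": "11 是質數。"},
]

sample_path = os.path.join(tempfile.gettempdir(), "sample.jsonl")
write_jsonl(sample_path, sample_records)
loaded_records = read_jsonl(sample_path)
print(len(loaded_records), "records loaded")
```

Running a check like this before training can catch a malformed line early, rather than partway through dataset loading.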

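You can change which model is fine-tuned by editing `model_name` in the "Set hyperparameters" cell.

Write your prompt here, in as much detail as possible!

Then choose the temperature to use when generating data (between 0 and 1). Lower values suit precise tasks such as writing code, while higher values suit creative tasks such as writing stories.

Finally, choose how many examples to generate. Generating more examples a) takes longer and b) costs more, but it generally produces a higher-quality model. As a rule of thumb, generate at least 100 examples.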
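To make the effect of temperature concrete, here is a minimal sketch of temperature-scaled softmax over a few made-up scores. It is a generic illustration of how temperature reshapes a sampling distribution, not a description of how the OpenAI API works internally; the function name and the logits are assumptions.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("this sketch requires a positive temperature")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens.
logits = [2.0, 1.0, 0.5]

low_temp = softmax_with_temperature(logits, 0.2)
high_temp = softmax_with_temperature(logits, 1.0)

print("T=0.2:", [round(p, 3) for p in low_temp])
print("T=1.0:", [round(p, 3) for p in high_temp])
```

At the lower temperature almost all of the probability mass lands on the top-scoring candidate, which is why low values tend to give more repeatable output and higher values more varied output.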

In [None]:
# Pose puzzle-like reasoning questions in English and answer them with well-reasoned, step-by-step responses in Traditional Chinese
prompt = "A model that takes in a puzzle-like reasoning-heavy question in English, and responds with a well-reasoned, step-by-step thought out response in Traditional Chinese."
temperature = .4
number_of_examples = 4

In [None]:
!pip install -q openai xformers

In [None]:
import os
import openai
import random

openai.api_key = "OpenAI_API_Key"  # placeholder: replace with your own API key

def generate_example(prompt, prev_examples, temperature=.5):
    messages=[
        {
            "role": "system",
            "content": f"You are generating data which will be used to train a machine learning model.\n\nYou will be given a high-level description of the model we want to train, and from that, you will generate data samples, each with a prompt/response pair.\n\nYou will do so in this format:\n```\nprompt\n-----------\n$prompt_goes_here\n-----------\n\nresponse\n-----------\n$response_goes_here\n-----------\n```\n\nOnly one prompt/response pair should be generated per turn.\n\nFor each turn, make the example slightly more complex than the last, while ensuring diversity.\n\nMake sure your samples are unique and diverse, yet high-quality and complex enough to train a well-performing model.\n\nHere is the type of model we want to train:\n`{prompt}`"
        }
    ]

    if len(prev_examples) > 0:
        if len(prev_examples) > 10:
            prev_examples = random.sample(prev_examples, 10)
        for example in prev_examples:
            messages.append({
                "role": "assistant",
                "content": example
            })

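    # Legacy call style from openai<1.0; later releases of the package replaced it with a client object
    response = openai.ChatCompletion.create(
        model="gpt-4", # gpt-3.5-turbo
        messages=messages,
        temperature=temperature,
        max_tokens=1354, # the upper bound depends on the chosen model's context window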
    )

    return response.choices[0].message['content']

# Start generating examples
prev_examples = []
for i in range(number_of_examples):
    print(f'Generating example {i}')
    example = generate_example(prompt, prev_examples, temperature)
    prev_examples.append(example)

print(prev_examples)

Generating example 0
Generating example 1
Generating example 2
Generating example 3
['prompt\n-----------\nIf a train leaves New York for Los Angeles at 100 mph and another train leaves Los Angeles for New York at 200 mph at the same time, which train will be closer to New York when they meet?\n-----------\n\nresponse\n-----------\n首先，我們需要知道紐約和洛杉磯之間的距離。假設這個距離是3000英里。每列火車都會在相同的時間內移動，所以我們可以將這個問題視為一個相對速度問題。\n\n兩列火車的相對速度是100 mph + 200 mph = 300 mph。所以，他們會在3000英里 / 300 mph = 10小時後相遇。\n\n在這10小時內，從紐約出發的火車將會移動100 mph * 10小時 = 1000英里。從洛杉磯出發的火車將會移動200 mph * 10小時 = 2000英里。\n\n所以，當兩列火車相遇時，從紐約出發的火車將會更靠近紐約，因為它只移動了1000英里，而從洛杉磯出發的火車移動了2000英里。\n-----------', 'prompt\n-----------\nA snail climbs up a 12ft wall. Each day it climbs 3ft but at night, it slides down 2ft. How many days will it take for the snail to reach the top of the wall?\n-----------\n\nresponse\n-----------\n蝸牛每天爬升3英尺，但在夜晚下滑2英尺，所以它每天實際上只上升1英尺。但是，我們需要注意到，當蝸牛在第12天爬升3英尺到達牆頂時，它不會在那天晚上下滑。\n\n所以，我們可以將問題分為兩部分來解答。首先，蝸牛需要11天來爬升11英尺。然後，它需要再花一天時間爬

## Collect the examples into a DataFrame and split them into prompt/response pairs

In [None]:
import pandas as pd

prompts = []
responses = []

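for example in prev_examples:
  try:
    split_example = example.split('-----------')
    prompt_text = split_example[1].strip()
    response_text = split_example[3].strip()
  except IndexError:
    # Skip samples that do not follow the expected delimiter layout
    continue
  # Append both fields together so the two lists always stay the same length
  prompts.append(prompt_text)
  responses.append(response_text)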

df = pd.DataFrame({
    'prompt': prompts,
    'response': responses
})

# Drop duplicates
df = df.drop_duplicates()

print('There are ' + str(len(df)) + ' successfully-generated examples. Here are the first few:')

df.head()

There are 4 successfully-generated examples. Here are the first few:


| | prompt | response |
|---|---|---|
| 0 | If a train leaves New York for Los Angeles at ... | 首先，我們需要知道紐約和洛杉磯之間的距離。假設這個距離是3000英里。每列火車都會在相同的時... |
| 1 | A snail climbs up a 12ft wall. Each day it cli... | 蝸牛每天爬升3英尺，但在夜晚下滑2英尺，所以它每天實際上只上升1英尺。但是，我們需要注意到，... |
| 2 | There are 3 boxes. One box contains only apple... | 首先，我們選擇標籤為"混合"的箱子取出一個水果。由於所有箱子的標籤都是錯誤的，所以這個箱子只... |
| 3 | A man has two children. One of them is a boy. ... | 這個問題的答案取決於我們如何理解問題的條件。如果"其中一個是男孩"指的是特定的一個孩子（例如... |
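The splitting step above relies on the delimiter layout requested in the system prompt. The sketch below applies the same idea to a single hypothetical sample string; the `parse_example` helper and the sample text are illustrative and not part of the original notebook.

```python
DELIMITER = "-----------"

def parse_example(example):
    """Return (prompt, response) from one generated sample, or None if malformed."""
    parts = example.split(DELIMITER)
    # Expected layout: ['prompt\n', <prompt>, '\n\nresponse\n', <response>, '']
    if len(parts) < 4:
        return None
    prompt_text = parts[1].strip()
    response_text = parts[3].strip()
    if not prompt_text or not response_text:
        return None
    return prompt_text, response_text

# Hypothetical sample in the format requested from the generator model.
sample = (
    "prompt\n-----------\nWhat is 2 + 2?\n-----------\n\n"
    "response\n-----------\n2 + 2 = 4。\n-----------"
)

parsed = parse_example(sample)
malformed = parse_example("no delimiters here")
print(parsed)
print(malformed)
```

Returning `None` for malformed samples makes it easy to count how many generations were discarded, which can be a useful signal that the system prompt needs tightening.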


In [None]:
# Split into training and test data
train_df = df.sample(frac=0.9, random_state=42)
test_df = df.drop(train_df.index)

# Save as JSONL files
train_df.to_json('train.jsonl', orient='records', lines=True)
test_df.to_json('test.jsonl', orient='records', lines=True)
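With only a handful of examples, a 90% fractional sample may round up to the whole dataset and leave nothing for testing. One possible guard, sketched here in plain Python with a hypothetical `split_records` helper, is to always hold out at least one record. This is an illustrative alternative, not what the notebook does.

```python
import random

def split_records(records, test_fraction=0.1, seed=42):
    """Shuffle and split records, holding out at least one for testing."""
    if len(records) < 2:
        raise ValueError("need at least two records to split")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    test_size = max(1, round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]

# Hypothetical records standing in for the generated prompt/response pairs.
records = [{"prompt": f"question {i}", "response": f"answer {i}"} for i in range(4)]

train_records, test_records = split_records(records)
print(len(train_records), "train /", len(test_records), "test")
```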

# Install the required packages

In [None]:
# Install the required Python packages
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers trl==0.4.7

In [None]:
# Import the required modules and packages
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

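# Set hyperparameters

- model_name : name of the pretrained model to fine-tune.
- dataset_name : path to the training dataset file.
- new_model : name under which the fine-tuned model is saved.
- lora_r, lora_alpha, lora_dropout : LoRA rank, scaling factor, and dropout probability.
- use_4bit : whether to load the base model with 4-bit quantization.
- bnb_4bit_compute_dtype : data type used for computation with the 4-bit weights.
- bnb_4bit_quant_type : 4-bit quantization type (fp4 or nf4).
- use_nested_quant : whether to use nested (double) quantization.
- output_dir : output directory for checkpoints and logs.
- num_train_epochs : number of training epochs.
- fp16 : whether to train with 16-bit floating point.
- bf16 : whether to train with bfloat16.
- per_device_train_batch_size : training batch size per device.
- per_device_eval_batch_size : evaluation batch size per device.
- gradient_accumulation_steps : number of steps over which gradients are accumulated before each optimizer update.
- gradient_checkpointing : whether to use gradient checkpointing.
- max_grad_norm : maximum gradient norm, used for gradient clipping.
- learning_rate : learning rate.
- weight_decay : weight decay.
- optim : optimizer.
- lr_scheduler_type : learning-rate schedule.
- max_steps : maximum number of training steps (-1 leaves the length to num_train_epochs).
- warmup_ratio : fraction of training steps used for learning-rate warmup.
- group_by_length : whether to group sequences of similar length into the same batch.
- save_steps : number of steps between checkpoint saves.
- logging_steps : number of steps between log entries.
- max_seq_length : maximum input sequence length.
- packing : whether to pack several short sequences into one input sequence.
- device_map : which device(s) the model is placed on.
- Full parameter reference: https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments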
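Several of these settings interact. As a back-of-the-envelope sketch, the snippet below estimates the number of optimizer steps per epoch and the warmup length from the batch-size settings. The dataset size of 77 is an assumed figure for illustration, and the ceiling-based rounding is an assumption about how the trainer counts steps; exact behavior can differ between library versions.

```python
import math

def steps_per_epoch(num_examples, per_device_batch_size, grad_accum_steps, num_devices=1):
    """Estimate optimizer steps per epoch for a given effective batch size."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return math.ceil(num_examples / effective_batch)

num_examples = 77  # assumed dataset size, for illustration
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
num_train_epochs = 1
warmup_ratio = 0.03

total_steps = num_train_epochs * steps_per_epoch(
    num_examples, per_device_train_batch_size, gradient_accumulation_steps
)
warmup_steps = math.ceil(total_steps * warmup_ratio)

print("total optimizer steps:", total_steps)
print("warmup steps:", warmup_steps)
```

Under these assumptions a single epoch is only about 20 optimizer steps, so values such as `save_steps` and `logging_steps` need to be small for any checkpoint or log entry to appear at all.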

In [None]:
model_name = "NousResearch/Llama-2-7b-chat-hf"
new_model = "llama-2-7b-custom"
dataset_name = "./train.jsonl"

################################################################################
# Quantized LLMs with Low-Rank Adapters (QLoRA) parameters
################################################################################
lora_r = 64
lora_alpha = 16
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters (a lightweight wrapper around custom CUDA functions, in particular 8-bit optimizers, matrix multiplication, and quantization)
################################################################################
use_4bit = True
bnb_4bit_compute_dtype = "float16" # float16 or bfloat16
bnb_4bit_quant_type = "nf4" # fp4 or nf4
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################
output_dir = "./results"
num_train_epochs = 1
fp16 = False
bf16 = False
per_device_train_batch_size = 4
per_device_eval_batch_size = 4
gradient_accumulation_steps = 1
gradient_checkpointing = True
max_grad_norm = 0.3
learning_rate = 2e-4
weight_decay = 0.001
optim = "paged_adamw_32bit"
lr_scheduler_type = "cosine"
max_steps = -1
warmup_ratio = 0.03
group_by_length = True
save_steps = 5
logging_steps = 5

################################################################################
# Supervised finetuning (SFT) parameters
################################################################################
max_seq_length = None
packing = False
device_map = {"": 0} #{"": 0} or "auto"
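To give a feel for what the LoRA settings imply, the sketch below computes the adapter scaling factor `lora_alpha / lora_r` and the number of trainable parameters LoRA adds to a single weight matrix. The 4096 x 4096 matrix shape is an assumption chosen for illustration, and the 4-bit memory figure is a rough weight-only estimate that ignores quantization constants, activations, and optimizer state.

```python
def lora_scaling(alpha, r):
    """Scaling applied to the low-rank update in standard LoRA."""
    return alpha / r

def lora_param_count(in_features, out_features, r):
    """Parameters in the two low-rank factors A (r x in) and B (out x r)."""
    return r * in_features + out_features * r

lora_r = 64
lora_alpha = 16
hidden_size = 4096  # assumed layer width, for illustration

scaling = lora_scaling(lora_alpha, lora_r)
adapter_params = lora_param_count(hidden_size, hidden_size, lora_r)
full_params = hidden_size * hidden_size

# Rough weight-only memory for a 7-billion-parameter model at 4 bits per weight.
approx_4bit_gib = 7e9 * 0.5 / 1024**3

print("scaling:", scaling)
print("adapter params per matrix:", adapter_params)
print("fraction of full matrix:", adapter_params / full_params)
print(f"approx. 4-bit weights: {approx_4bit_gib:.2f} GiB")
```

The point of the comparison is that the adapter trains only a small fraction of the parameters in each adapted matrix, while the frozen base weights are stored in 4 bits.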

# Load and preprocess the datasets

In [None]:
# Download the training and test datasets
!wget https://github.com/f901107/Fine_tuning_LLMs/releases/download/Dataset/train.jsonl
!wget https://github.com/f901107/Fine_tuning_LLMs/releases/download/Dataset/test.jsonl


In [None]:
# Load the datasets
train_dataset = load_dataset('json', data_files='./train.jsonl', split="train")  # load the training set from a JSONL file
valid_dataset = load_dataset('json', data_files='./test.jsonl', split="train")  # load the validation set from a JSONL file

# Preprocess: concatenate each prompt and response into a single `text` field
train_dataset = train_dataset.map(lambda examples: {'text': [prompt + response for prompt, response in zip(examples['prompt'], examples['response'])]}, batched=True)
valid_dataset = valid_dataset.map(lambda examples: {'text': [prompt + response for prompt, response in zip(examples['prompt'], examples['response'])]}, batched=True)

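The mapping above concatenates prompt and response directly, with no separator between them. An alternative used in some Llama 2 chat fine-tuning recipes is to wrap each pair in `[INST] ... [/INST]` markers, matching the prompt format used at inference time later in this notebook. The helpers below are a sketch of that idea; the function names are hypothetical, and whether to write the `<s>` and `</s>` tokens explicitly depends on how the tokenizer is configured.

```python
def format_llama2_example(prompt, response):
    """Wrap a prompt/response pair in Llama 2 chat-style instruction markers."""
    return f"<s>[INST] {prompt.strip()} [/INST] {response.strip()} </s>"

def add_text_column(batch):
    """Batched mapping function: build a 'text' list from 'prompt' and 'response' lists."""
    return {
        "text": [
            format_llama2_example(p, r)
            for p, r in zip(batch["prompt"], batch["response"])
        ]
    }

# Hypothetical batch in the column-oriented layout that a batched map receives.
batch = {
    "prompt": ["What is 2 + 2?"],
    "response": ["2 + 2 = 4."],
}
formatted = add_text_column(batch)
print(formatted["text"][0])
```

A function shaped like `add_text_column` could be passed to `train_dataset.map(..., batched=True)` in place of the lambda, though switching formats would change the training results reported in this notebook.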

# Download the model and fine-tune it

In [None]:
# Define the bitsandbytes quantization configuration
# dataset_name = "mlabonne/guanaco-llama2-1k"
# dataset = load_dataset(dataset_name, split="train")

compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check whether the GPU supports bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load the pretrained causal language model with the quantization configuration
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load the tokenizer that matches the model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

# Define the Parameter-Efficient Fine-Tuning (PEFT) configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

# Set the training arguments
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard", #"all"
    evaluation_strategy="steps",
    eval_steps=5  # evaluate every 5 steps
)

# Run supervised fine-tuning with SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset, # training dataset
    eval_dataset=valid_dataset, # validation dataset
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

# Start training
trainer.train()

# Save the fine-tuned model (the LoRA adapter weights)
trainer.model.save_pretrained(new_model)


You are using 8-bit optimizers with a version of `bitsandbytes` < 0.41.1. It is recommended to update your version as a major bug has been fixed in 8-bit optimizers.
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss |
|---|---|---|
| 5 | 0.7977 | 0.663812 |
| 10 | 0.5603 | 0.516835 |
| 15 | 0.4631 | 0.459997 |
| 20 | 0.4081 | 0.44956 |
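One way to read the validation loss is to convert it to perplexity. The sketch below assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language model training but is an assumption here.

```python
import math

# (step, validation loss) pairs copied from the training log above.
validation_log = [(5, 0.663812), (10, 0.516835), (15, 0.459997), (20, 0.44956)]

def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(mean_cross_entropy)

perplexities = [(step, perplexity(loss)) for step, loss in validation_log]
for step, ppl in perplexities:
    print(f"step {step:>2}: perplexity ~ {ppl:.3f}")
```

Because the validation set here is very small, differences between steps should be read as a rough trend rather than a precise measurement.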


In [None]:
# %load_ext tensorboard
# %tensorboard --logdir results/runs

In [None]:
# Only show log messages at CRITICAL level
logging.set_verbosity(logging.CRITICAL)

In [None]:
# Run a text-generation pipeline with the fine-tuned model
prompt = "What is a large language model?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text']) # print the generated text



<s>[INST] What is a large language model? [/INST]  A large language model is a type of artificial intelligence (AI) model that is trained on a large dataset of text to generate language outputs that are coherent and natural-sounding. everybody. These models are designed to learn the patterns and structures of language, and can be used for a variety of tasks, such as:

1. Language Translation: Large language models can be trained on multiple languages and can generate translations that are more accurate and natural-sounding than those produced by traditional machine translation systems.
2. Text Summarization: Large language models can be used to summarize long documents, articles, or web pages into shorter summaries that capture the main points.
3. Chatbots: Large language models can be used to build chatbots that can engage in conversation with users, answering questions and providing information on a wide range of topics.


In [None]:
# Empty VRAM
del model
del pipe
del trainer
import gc # garbage collector
gc.collect()
torch.cuda.empty_cache() # release cached GPU memory held by PyTorch
gc.collect()

19965

# Save the model to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

model_path = "/content/drive/MyDrive/GPT_Llama-2_fine-tune"  # change to your own path

# Reload the base model in FP16 and merge it with the LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload the tokenizer so it can be saved alongside the model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Save the merged model
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

Mounted at /content/drive



('/content/drive/MyDrive/GPT_Llama-2_fine-tune/tokenizer_config.json',
 '/content/drive/MyDrive/GPT_Llama-2_fine-tune/special_tokens_map.json',
 '/content/drive/MyDrive/GPT_Llama-2_fine-tune/tokenizer.json')
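Conceptually, merging folds the low-rank update into the base weight so that inference needs no separate adapter computation: the merged weight is W + (alpha / r) * B * A. The toy example below checks this on tiny hand-written matrices using plain Python lists. It illustrates the arithmetic only and is not how `merge_and_unload` is implemented; all values and helper names are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_b, lora_a, alpha, r):
    """Return W + (alpha / r) * B @ A."""
    scale = alpha / r
    update = matmul(lora_b, lora_a)
    return [
        [w + scale * u for w, u in zip(w_row, u_row)]
        for w_row, u_row in zip(weight, update)
    ]

# Toy 2x2 base weight with a rank-1 adapter (all values are made up).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # shape (out_features, r)
A = [[0.5, -0.5]]    # shape (r, in_features)
alpha, r = 2, 1

merged = merge_lora(W, B, A, alpha, r)
x = [[3.0], [1.0]]   # column-vector input

# Adapter path: W x + scale * B (A x); merged path: W' x.
adapter_out = [
    [wx[0] + (alpha / r) * bax[0]]
    for wx, bax in zip(matmul(W, x), matmul(B, matmul(A, x)))
]
merged_out = matmul(merged, x)
print(merged)
print(adapter_out, merged_out)
```

Both paths give the same output for the same input, which is why the merged model can be saved and loaded later as an ordinary checkpoint without the PEFT wrapper.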

# Load the fine-tuned model from Google Drive and run inference

In [None]:
from google.colab import drive
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

drive.mount('/content/drive')

model_path = "/content/drive/MyDrive/GPT_Llama-2_fine-tune"  # change to the path where you saved the model

model = AutoModelForCausalLM.from_pretrained(model_path,
                         device_map="auto",
                         offload_folder="offload",
                         torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_path)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).





In [None]:
prompt = "What is 2 + 2?"  # change to your own prompt
gen = pipeline('text-generation', model=model, max_new_tokens=2048, tokenizer=tokenizer)
result = gen(prompt)
print(result[0]['generated_text'])



What is 2 + 2?

Answer: 2 + 2 = 4.


## References
- [MiuLab] Taiwan-LLaMa2 [https://github.com/MiuLab/Taiwan-LLaMa](https://github.com/MiuLab/Taiwan-LLaMa)

- [Huggingface] Taiwan-LLaMa-v1.0 [https://huggingface.co/yentinglin/Taiwan-LLaMa-v1.0](https://huggingface.co/yentinglin/Taiwan-LLaMa-v1.0)

- Chinese-language datasets on Hugging Face [https://huggingface.co/datasets?language=language:zh&sort=trending](https://huggingface.co/datasets?language=language:zh&sort=trending)

**Code reference:**

- Code based on Younes Belkada's [GitHub Gist](https://gist.github.com/younesbelkada/9f7f75c94bdc1981c8ca5cc937d4a4da).

- Llama2 fine-tune example 1: [https://colab.research.google.com/drive/16SlGXLuBRB30clB0dCYAh3sqk0edKoFC?usp=sharing](https://colab.research.google.com/drive/16SlGXLuBRB30clB0dCYAh3sqk0edKoFC?usp=sharing)
- Llama2 fine-tune example 2: https://mlabonne.github.io/blog/posts/Fine_Tune_Your_Own_Llama_2_Model_in_a_Colab_Notebook.html