In [1]:
# transformers does not support NumPy 2.0 yet, so pin NumPy 1.26.x
!pip install -q numpy~=1.26.4 transformers~=4.46.2
!pip install -q datasets~=3.2.0 pydantic~=2.10.4
!pip install -q peft~=0.14.0 trl~=0.13.0

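# Training a Question-Answering Model

In this notebook, we show how to fine-tune a question-answering model with the Hugging Face libraries. We use the `SFTTrainer` ([Supervised Fine-tuning Trainer](https://huggingface.co/docs/trl/sft_trainer)) class from the `trl` package, which builds on `transformers`, to fine-tune Phi-3.5, a decoder-only model.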

In [2]:
import pandas as pd

from transformers import (
  AutoTokenizer,
  AutoModelForCausalLM,
)
from datasets import load_dataset

from typing import Any
from pydantic import BaseModel
from pprint import pprint

import torch

# PEFT utilities
from peft import LoraConfig, TaskType, PeftModel, get_peft_model
# SFTTrainer utilities
from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM, setup_chat_format

# Pick the device: Apple MPS if available, otherwise CUDA, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = torch.device("mps" if torch.backends.mps.is_available() else device)



## Download the Data

In [3]:
# Load only the first 1% of the `train` split
immutable_dataset = load_dataset("philschmid/dolly-15k-oai-style", split="train[:1%]")

### What Does the Data Contain?

In [4]:
# Show the features and the number of rows in the raw dataset
immutable_dataset

Dataset({
    features: ['messages'],
    num_rows: 150
})

In [5]:
# Inspect the first example in the dataset
pprint(immutable_dataset[0]['messages'])

[{'content': 'When did Virgin Australia start operating?\n'
             'Virgin Australia, the trading name of Virgin Australia Airlines '
             'Pty Ltd, is an Australian-based airline. It is the largest '
             'airline by fleet size to use the Virgin brand. It commenced '
             'services on 31 August 2000 as Virgin Blue, with two aircraft on '
             'a single route. It suddenly found itself as a major airline in '
             "Australia's domestic market after the collapse of Ansett "
             'Australia in September 2001. The airline has since grown to '
             'directly serve 32 cities in Australia, from hubs in Brisbane, '
             'Melbourne and Sydney.',
  'role': 'user'},
 {'content': 'Virgin Australia commenced services on 31 August 2000 as Virgin '
             'Blue, with two aircraft on a single route.',
  'role': 'assistant'}]


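This JSON-like structure is a list of two dictionaries, each representing one message in the conversation. Each dictionary has two keys:

* `role`: the role of the message sender, a string that is either `user` or `assistant`.

* `content`: the text of the message, a string.

In this example:

* First dictionary:
  * `role`: `user`, meaning this is the user's message.
  * `content`: the user's question plus the supporting context.
* Second dictionary:
  * `role`: `assistant`, meaning this is the assistant's message.
  * `content`: the assistant's answer.


### Data Preprocessing

To keep the demonstration simple and speed up training, we preprocess the data as follows:

1. Split the `messages` column into separate `user` and `assistant` columns.
2. Drop examples whose `user` or `assistant` `content` is longer than 512 characters.
3. Drop examples whose `assistant` `content` is shorter than 128 characters.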

In [6]:
# Split the messages column into user and assistant columns for easier demonstration
dataset = immutable_dataset.map(
  lambda x: {
    "user": x["messages"][0],
    "assistant": x["messages"][1],
  }
)
# Drop examples whose user or assistant content is too long
dataset = dataset.filter(
  lambda x: len(x["user"]["content"]) <= 512 and len(x["assistant"]["content"]) <= 512
)
# Drop examples whose assistant content is too short
dataset = dataset.filter(
  lambda x: len(x["assistant"]["content"]) >= 128
)
# Show the processed dataset
dataset

Dataset({
    features: ['messages', 'user', 'assistant'],
    num_rows: 42
})

In [7]:
# Show the first first_n_data examples
first_n_data = 2
pd.set_option('display.max_colwidth', None)
pd.DataFrame(dataset.select(range(first_n_data)))

Unnamed: 0,messages,user,assistant
0,"[{'content': 'What is a polygon?', 'role': 'user'}, {'content': 'A polygon is a form in Geometry. It is a single dimensional plane made of connecting lines and any number of vertices. It is a closed chain of connected line segments or edges. The vertices of the polygon are formed where two edges meet. Examples of polygons are hexagons, pentagons, and octagons. Any plane that does not contain edges or vertices is not a polygon. An example of a non-polygon is a circle.', 'role': 'assistant'}]","{'content': 'What is a polygon?', 'role': 'user'}","{'content': 'A polygon is a form in Geometry. It is a single dimensional plane made of connecting lines and any number of vertices. It is a closed chain of connected line segments or edges. The vertices of the polygon are formed where two edges meet. Examples of polygons are hexagons, pentagons, and octagons. Any plane that does not contain edges or vertices is not a polygon. An example of a non-polygon is a circle.', 'role': 'assistant'}"
1,"[{'content': 'How do I start running?', 'role': 'user'}, {'content': 'Make sure you get comfortable running shoes and attire. Start with achievable goal in mind like a 5K race. If you never ran before, start gradually from a walk, to brisk walk, light jog aiming for 15-30mins initially. Slowly increase your running time and distance as your fitness level improves. One of the most important things is cool down and gentle stretching. Always listen to your body, and take rest days when needed to prevent injury.', 'role': 'assistant'}]","{'content': 'How do I start running?', 'role': 'user'}","{'content': 'Make sure you get comfortable running shoes and attire. Start with achievable goal in mind like a 5K race. If you never ran before, start gradually from a walk, to brisk walk, light jog aiming for 15-30mins initially. Slowly increase your running time and distance as your fitness level improves. One of the most important things is cool down and gentle stretching. Always listen to your body, and take rest days when needed to prevent injury.', 'role': 'assistant'}"


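## Training Parameters

### Batch Size and Gradient Accumulation Steps

The relationship between batch size and gradient accumulation steps can be summarized as follows:

* Batch size: the number of samples used in each training iteration. Larger batches generally require more memory.
* Gradient accumulation steps: the number of iterations over which gradients are accumulated before the model weights are updated. This lets a small batch size emulate a larger one.

When memory limits rule out a large batch, gradient accumulation can stand in for it. For example:

* With a batch size of 8 and 4 gradient accumulation steps, training is equivalent to using a batch size of 32 (8 * 4).

This achieves the effect of a large batch even when memory is limited.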
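
The equivalence can be checked on a toy problem. The following is a minimal sketch, assuming a one-parameter linear model and a mean-squared-error loss (both chosen only for illustration; `grad_mse` is a hypothetical helper, not part of any library). It compares the gradient of one batch of 32 samples with the gradient accumulated over 4 batches of 8.

```python
import math

# Toy model y_hat = w * x with mean-squared-error loss (illustrative only).
def grad_mse(w, batch):
    """Gradient of mean((w * x - y) ** 2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.5
data = [(float(i), 3.0 * i) for i in range(1, 33)]  # 32 samples of y = 3x

# Option 1: one large batch of 32 samples.
large_batch_grad = grad_mse(w, data)

# Option 2: batch size 8 with 4 gradient accumulation steps.
batch_size = 8
accumulation_steps = 4
accumulated_grad = 0.0
for step in range(accumulation_steps):
    batch = data[step * batch_size:(step + 1) * batch_size]
    # Scale each batch gradient so the accumulated sum is an average.
    accumulated_grad += grad_mse(w, batch) / accumulation_steps

effective_batch_size = batch_size * accumulation_steps
print(effective_batch_size)
print(math.isclose(large_batch_grad, accumulated_grad))
```

Only one batch of 8 has to be held in memory at a time, yet the weight update uses the same averaged gradient as the batch of 32.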

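### Half-Precision Floating Point

Half-precision training uses 16-bit floating-point numbers instead of 32-bit floating-point numbers (FP32) to train neural networks. Its main advantages are:

* Lower memory use: a 16-bit value takes half the memory of an FP32 value, so the same hardware can train larger models or use larger batches.
* Faster computation: many modern GPUs are optimized for 16-bit arithmetic, which speeds up training.
* Lower bandwidth: moving half as many bytes improves data-transfer efficiency.

#### BF16 (Brain Floating Point 16)

BF16 (bfloat16) is a 16-bit floating-point format developed by Google and used on its TPUs (Tensor Processing Units). It keeps the same 8-bit exponent range as FP32 but has only 7 mantissa bits, which can give better numerical stability in some cases at the cost of precision.

#### FP16 (Half-Precision Floating Point)

FP16 is the standard IEEE 16-bit floating-point format, widely used for GPU-accelerated deep learning. It uses little memory and computes quickly, but with 5 exponent bits and 10 mantissa bits, both its exponent range and its mantissa precision are smaller than FP32's.

![](https://miro.medium.com/v2/0*HapPSei5sok65wcv)

Overall, half-precision training can improve training efficiency and resource utilization without significantly hurting model quality.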
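
The range and precision trade-off can be seen without a GPU. The sketch below uses only the standard library: `to_fp16` relies on the `struct` module's half-precision format, and `to_bf16` is a hypothetical helper that emulates bfloat16 by truncating a float32 to its top 16 bits (real hardware typically rounds to nearest, so this is a simplification).

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round x to IEEE 754 half precision; values beyond the FP16 range become inf."""
    try:
        return struct.unpack('<e', struct.pack('<e', x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

# 1.001 probes precision near 1.0; 70000.0 probes the representable range.
for value in (1.001, 70000.0):
    print(value, to_fp16(value), to_bf16(value))
```

FP16 keeps `1.001` close to its true value but overflows at `70000.0`, while the emulated BF16 represents `70000.0` coarsely and collapses `1.001` to `1.0`.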

### Training Configuration

In [8]:
# Training settings
class Config(BaseModel):
  model_name: str = 'microsoft/Phi-3.5-mini-instruct'
  torch_dtype: Any = torch.bfloat16 # half-precision floating point
  adam_epsilon: float = 1e-4 # half precision needs a larger Adam epsilon than the 1e-8 default
  saved_model_path: str = 'sample_data/saved_encoder_model' # path to save the trained model
  saved_lora_path: str = 'sample_data/saved_lora_model' # path to save the trained LoRA adapter
  batch_size: int = 2 # size of the input batch in training and evaluation
  gradient_accumulation_steps: int = 2 # number of batches whose gradients are accumulated before each weight update
  epochs: int = 50 # number of times to iterate over the entire training dataset
  lr: float = 2e-4 # learning rate, controls how fast or slow the model learns
  weight_decay: float = 0.01 # weight decay, helps the model stay simple and avoid overfitting by penalizing large weights.

  # Text generation settings
  temperature: float = 0.1 # temperature for sampling
  max_new_tokens: int = 125 # maximum number of newly generated tokens
  repetition_penalty: float = 1.5 # 1.0 means no penalty; larger values penalize repeated tokens more (kept between 1 and 2 here)

  # LoRA settings
  rank: int = 128 # rank of the LoRA layers
  lora_alpha: int = rank * 2 # alpha for LoRA scaling
  lora_dropout: float = 0.05 # dropout probability for LoRA layers

if device.type == 'mps': # convenient for quick tests on Apple Silicon
  config = Config(
    torch_dtype=torch.float16, # use float16 instead of the default bfloat16 on Apple Silicon
  )
else:
  config = Config()

## Performance Before Fine-tuning

### Load the Pretrained Tokenizer

In [9]:
# Get the tokenizer that belongs to the pretrained model
tokenizer = AutoTokenizer.from_pretrained(
  config.model_name,
)
pprint(tokenizer)

LlamaTokenizerFast(name_or_path='microsoft/Phi-3.5-mini-instruct', vocab_size=32000, model_max_length=131072, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '<|endoftext|>', 'unk_token': '<unk>', 'pad_token': '<|endoftext|>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=True, lstrip=False, single_word=False, normalized=False, special=False),
	32000: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	32001: AddedToken("<|assistant|>", rstrip=True, lstrip=False, single_word=False, normalized=False, special=True),
	32002: AddedToken("<|placeholder1|>", rstrip=True, lstrip=False, single_word=False, normalized=False, special

In [10]:
# Inspect the tokenizer's padding token and padding side
pprint(tokenizer.pad_token)

'<|endoftext|>'


In [11]:
pprint(tokenizer.padding_side)

'left'


* If no `pad_token` is defined, define one and add it to the tokenizer.
* If `padding_side` is not `right`, set it to `right`.

In [12]:
# Add pad_token to the tokenizer
if tokenizer.pad_token is None:
  tokenizer.pad_token = tokenizer.eos_token
  print('=== Set Padding Token ===')
  pprint(tokenizer)
# Make sure padding_side is 'right'
if tokenizer.padding_side != 'right':
  tokenizer.padding_side = 'right'
  print('=== Set Padding Side ===')
  pprint(tokenizer)

=== Set Padding Side ===
LlamaTokenizerFast(name_or_path='microsoft/Phi-3.5-mini-instruct', vocab_size=32000, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '<|endoftext|>', 'unk_token': '<unk>', 'pad_token': '<|endoftext|>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=True, lstrip=False, single_word=False, normalized=False, special=False),
	32000: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	32001: AddedToken("<|assistant|>", rstrip=True, lstrip=False, single_word=False, normalized=False, special=True),
	32002: AddedToken("<|placeholder1|>", rstrip=True, lstrip=False, single_word=False, 

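### Load the Pretrained Model

Because GPU memory is limited, we fine-tune the model in half precision. Note that when fine-tuning in half precision, `adam_epsilon` in `TrainingArguments` needs to be set to `1e-4`. The default `adam_epsilon` of `1e-8` causes problems in half-precision training.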
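
One way to see why a tiny epsilon is risky is to round it to FP16. This is a minimal sketch covering only the FP16 case, assuming the optimizer arithmetic is carried out in that dtype; `to_fp16` is a hypothetical helper built on the standard library's `struct` half-precision format.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to IEEE 754 half precision."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

default_eps = to_fp16(1e-8)  # below the smallest positive FP16 value
larger_eps = to_fp16(1e-4)   # representable in FP16

print(default_eps, larger_eps)
```

Adam divides by `sqrt(v) + eps`. If `eps` rounds to zero and the second-moment estimate `v` is also zero, that denominator is zero, which is the kind of instability a larger epsilon avoids.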

`AutoModelForCausalLM` is an auto class for causal language modeling; it can load different pretrained models for text generation.

In [13]:
model = AutoModelForCausalLM.from_pretrained(
  config.model_name,
  torch_dtype=config.torch_dtype,
  # Reduces CPU memory use while loading the model, which helps in memory-constrained environments.
  low_cpu_mem_usage=True,
).to(device)

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.94s/it]


In [14]:
pprint(model)

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3SdpaAttention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3LongRoPEScaledRotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
      )
    )
    (norm): Phi3RMSNorm((3072,)

This is a typical decoder-only Transformer model.

```text
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3SdpaAttention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3LongRoPEScaledRotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
      )
```

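### Configure the Chat Template

Adding special tokens to a language model is essential for training chat models. These tokens are placed between the different roles in a conversation, such as `user`, `assistant`, and `system`, and help the model recognize the structure and flow of the dialogue. This setup is necessary for the model to generate coherent, context-appropriate responses in a chat setting.

The `setup_chat_format()` function in `trl` makes it easy to set up the model and tokenizer for conversational AI tasks.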

In [15]:
# Set up the chat format with the default 'chatml' format
if tokenizer.chat_template is None:
  model, tokenizer = setup_chat_format(model, tokenizer)
  print('=== Set chat format ===')
  pprint(tokenizer)

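Because the model we use is already a chat model and its tokenizer ships with a chat template, we do not need to set up the chat format again.

### Prompt Formatting

To define the prompt format, we create a formatting function.

Note that here we set `add_generation_prompt` to `True`, which appends the tokens that mark the start of the assistant's response. This ensures that when the model generates text, it writes the assistant's reply rather than doing something unexpected, such as continuing the user's message. Remember that chat models are still just language models: they are trained to continue text, and a chat is simply a special kind of text to them. You need to guide them with the appropriate control tokens so they know what to do.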
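
The effect of `add_generation_prompt` can be sketched with a hand-written approximation of the template. The function `render_phi3_style` below is hypothetical and only mimics the rendered strings that appear in this notebook; the real template is stored in the tokenizer and is applied through `tokenizer.apply_chat_template`.

```python
def render_phi3_style(messages, add_generation_prompt, eos_token='<|endoftext|>'):
    """Approximate the Phi-3 style chat layout (illustrative, not the real template)."""
    text = ''
    for message in messages:
        # Each turn is wrapped as <|role|>\ncontent<|end|>\n
        text += f"<|{message['role']}|>\n{message['content']}<|end|>\n"
    # Either open the assistant turn for generation, or close the sequence.
    text += '<|assistant|>\n' if add_generation_prompt else eos_token
    return text

question = {'role': 'user', 'content': 'What is a polygon?'}
answer = {'role': 'assistant', 'content': 'A closed chain of line segments.'}

inference_prompt = render_phi3_style([question], add_generation_prompt=True)
training_text = render_phi3_style([question, answer], add_generation_prompt=False)
print(repr(inference_prompt))
print(repr(training_text))
```

With `add_generation_prompt=True` the text ends with an open `<|assistant|>` turn for the model to complete; with `False` the conversation is closed with the end-of-sequence token, which is the form used for training data.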

In [16]:
def instruction_formatter(x, tokenize):
  if tokenize:
    return tokenizer.apply_chat_template(
      [x['user']],
      tokenize=tokenize,
      add_generation_prompt=True,
      return_tensors='pt',
      return_dict=True,
    ).to(device)
  else:
    return tokenizer.apply_chat_template(
      [x['user']],
      tokenize=tokenize,
      add_generation_prompt=True,
    )

In [17]:
# tokenize=False skips tokenization and returns the raw text
prompt_text = instruction_formatter(dataset[0], tokenize=False)
pprint(prompt_text)

'<|user|>\nWhat is a polygon?<|end|>\n<|assistant|>\n'


In [18]:
# tokenize=True tokenizes the text and returns the token IDs and attention mask as tensors
tokenized_input = instruction_formatter(dataset[0], tokenize=True)
pprint(tokenized_input)

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]], device='mps:0'),
 'input_ids': tensor([[32010,  1724,   338,   263, 29807, 29973, 32007, 32001]],
       device='mps:0')}


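The tokenizer returns two main parts, `input_ids` and `attention_mask`:

* `input_ids`: a tensor containing the token IDs of the input text. The tokenizer produces these IDs by converting the text into its numeric representation.

* `attention_mask`: also a tensor, indicating which positions the model should attend to. A value of 1 means the position should be attended to, and a value of 0 means it should be ignored. In this example, every value in `attention_mask` is 1, so the model attends to all positions.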

In [19]:
# Use the tokenizer's decode method to convert each ID back to text and print them side by side
for token_id in tokenized_input['input_ids'][0].tolist():
  print(f'{token_id} -> {tokenizer.decode([token_id])}')

32010 -> <|user|>
1724 -> What
338 -> is
263 -> a
29807 -> polygon
29973 -> ?
32007 -> <|end|>
32001 -> <|assistant|>


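### Responses Before Fine-tuning

#### Generating a Single Response

In [20]:
# Generate a response with the pretrained model.
# Note: `temperature` only takes effect when sampling is enabled (for example with `do_sample=True`);
# otherwise decoding is greedy and `temperature` is ignored.
output_ids = model.generate(
  **tokenized_input,
  temperature=config.temperature,
  max_new_tokens=config.max_new_tokens,
  repetition_penalty=config.repetition_penalty,
)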



In [21]:
output_ids

tensor([[32010,  1724,   338,   263, 29807, 29973, 32007, 32001,   319,  3579,
          3733, 17125,  1068, 14637,   304,   738,  1023, 29899, 12531, 26224,
          4377,   393, 11624,   310,  7812, 29892,  1661,  1639,  8803,   292,
          1196, 24611,   470,   376, 29879,  2247, 29908,   607,  2094,  2226,
          8162, 29889,  4525, 11192,   526,  6631,   472,  1009,  1095,  9748,
          2000, 13791,   313,  2976,  1070, 29901, 12688,   467,   450, 13290,
          5120,  8429,   491,  1438,  3454,   322,   278,   427, 15603,  2913,
          2768,   372,   508,   367, 10423,   411,  2927,   565,  7429,   363,
          7604,  2133, 11976, 29936,   445,  4038,  2629,   541,   451,  3704,
           967, 10452, 17645,   825,   591,  1246,   525,  1552, 10694, 29915,
           297, 16303,  4958,   746,  5353,  1048,  1248,  4790,   787, 10816,
           373, 12151, 28001,   763,  5650, 11053,  2992,  1696,  2466,  5722,
          1711, 13590,   896,  1863, 16978,   408,  

In [22]:
# Convert output_ids to text
output = tokenizer.decode(
  output_ids[0],
  skip_special_tokens=False, # whether to skip special tokens (for example, start and end markers)
)

In [23]:
pprint(output)

('<|user|> What is a polygon?<|end|><|assistant|> A **polygon** refers to any '
 'two-dimensional geometric figure that consists of straight, nonintersecting '
 'line segments or "sides" which enclose spaces. These sides are connected at '
 'their endpoints called vertices (singular: vertex). The interior region '
 'formed by these lines and the enclosed space inside it can be filled with '
 'color if desired for visualization purposes; this area within but not '
 "including its boundary defines what we call 'the plane' in geometry terms "
 'when discuss about polygons specifically on flat surfaces like paper maps '
 'etc., though technically speaking they exist everywhere as long there’re no '
 'curved')


In [24]:
# Keep only the generated text, i.e. the text after <|assistant|>
pprint(output.split('<|assistant|>')[1].strip())

('A **polygon** refers to any two-dimensional geometric figure that consists '
 'of straight, nonintersecting line segments or "sides" which enclose spaces. '
 'These sides are connected at their endpoints called vertices (singular: '
 'vertex). The interior region formed by these lines and the enclosed space '
 'inside it can be filled with color if desired for visualization purposes; '
 "this area within but not including its boundary defines what we call 'the "
 "plane' in geometry terms when discuss about polygons specifically on flat "
 'surfaces like paper maps etc., though technically speaking they exist '
 'everywhere as long there’re no curved')


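#### Batch Evaluation of the Model

Now that we have seen how to generate a response, we define a `generator()` function that wraps these steps. It takes one example and the model, and returns the model's response. With this function we can apply the model across the dataset.


In [25]:
# Wrap the code above in a function so we can process the data in bulk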
def generator(x, model):
  tokenized_input = instruction_formatter(x, tokenize=True)
  output_ids = model.generate(
    **tokenized_input,
    temperature=config.temperature,
    max_new_tokens=config.max_new_tokens,
    repetition_penalty=config.repetition_penalty,
  )
  output = tokenizer.decode(output_ids[0], skip_special_tokens=False)
  return output.split('<|assistant|>')[1].strip()

In [26]:
# This step can take a while, so we only process the first first_n_data examples
first_n_dataset = dataset.select(range(first_n_data))

# Remove the messages column
first_n_dataset = first_n_dataset.remove_columns('messages')

# Generate responses with the pretrained model and store them in the pt_response column of first_n_dataset
first_n_dataset = first_n_dataset.map(
  lambda x: {
    **x,
    "pt_response": generator(x, model),
  },
  batched=False,
)

Map: 100%|██████████████████████████████████████████████████████████████████████| 2/2 [01:06<00:00, 33.14s/ examples]


In [27]:
# Show the pretrained model's predictions
pd.set_option('display.max_colwidth', None)
pd.DataFrame(first_n_dataset)

Unnamed: 0,user,assistant,pt_response
0,"{'content': 'What is a polygon?', 'role': 'user'}","{'content': 'A polygon is a form in Geometry. It is a single dimensional plane made of connecting lines and any number of vertices. It is a closed chain of connected line segments or edges. The vertices of the polygon are formed where two edges meet. Examples of polygons are hexagons, pentagons, and octagons. Any plane that does not contain edges or vertices is not a polygon. An example of a non-polygon is a circle.', 'role': 'assistant'}","A **polygon** refers to any two-dimensional geometric figure that consists of straight, nonintersecting line segments or ""sides"" which enclose spaces. These sides are connected at their endpoints called vertices (singular: vertex). The interior region formed by these lines and the enclosed space inside it can be filled with color if desired for visualization purposes; this area within but not including its boundary defines what we call 'the plane' in geometry terms when discuss about polygons specifically on flat surfaces like paper maps etc., though technically speaking they exist everywhere as long there’re no curved"
1,"{'content': 'How do I start running?', 'role': 'user'}","{'content': 'Make sure you get comfortable running shoes and attire. Start with achievable goal in mind like a 5K race. If you never ran before, start gradually from a walk, to brisk walk, light jog aiming for 15-30mins initially. Slowly increase your running time and distance as your fitness level improves. One of the most important things is cool down and gentle stretching. Always listen to your body, and take rest days when needed to prevent injury.', 'role': 'assistant'}","Starting to run is a great way for improving your fitness, mental health and overall well-being. Here's how you can begin:\n 1) Set realistic goals - Determine what distance or time frame works best with the current level of physical activity in mind (either beginner mileage guidelines like Couch25K/Coworkout30k if starting from scratch). Remember that progress takes patience! \t * Choose appropriate footwear – Investment into good quality shoes will help prevent injuries downline by providing proper support during"


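## Training the Model

### Training Data Format

Recent versions of `trl` support the popular instruction and conversational dataset formats. This means we only need to convert the dataset to one of the supported formats, and `trl` handles the rest. The formats are:

* Instruction format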

```json
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
```

* Conversational format

```json
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```

Our dataset already follows the conversational format, so we can use it directly.
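
For a dataset that arrives in the instruction format instead, the conversion to the conversational format is mechanical. The helper `to_conversational` below is a hypothetical sketch, not a `trl` function, and the optional `system_prompt` argument is an assumption added for illustration.

```python
def to_conversational(record, system_prompt=None):
    """Convert a {"prompt", "completion"} record into a {"messages": [...]} record."""
    messages = []
    if system_prompt:
        messages.append({'role': 'system', 'content': system_prompt})
    messages.append({'role': 'user', 'content': record['prompt']})
    messages.append({'role': 'assistant', 'content': record['completion']})
    return {'messages': messages}

instruction_record = {
    'prompt': 'What is a polygon?',
    'completion': 'A closed chain of connected line segments.',
}
conversation = to_conversational(instruction_record)
print(conversation)
```

A function like this could be passed to `dataset.map(...)` to convert every row.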

In [28]:
# Show a single example for readability
pprint(dataset[0]['messages'])

[{'content': 'What is a polygon?', 'role': 'user'},
 {'content': 'A polygon is a form in Geometry.  It is a single dimensional '
             'plane made of connecting lines and any number of vertices.  It '
             'is a closed chain of connected line segments or edges.  The '
             'vertices of the polygon are formed where two edges meet.  '
             'Examples of polygons are hexagons, pentagons, and octagons.  Any '
             'plane that does not contain edges or vertices is not a polygon.  '
             'An example of a non-polygon is a circle.',
  'role': 'assistant'}]


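### LoRA Training Strategy: Dimensionality Reduction

LoRA (Low-Rank Adaptation) is a technique for fine-tuning large language models that improves training efficiency and reduces compute requirements. Reasons to train with LoRA include:

* Lower compute cost: LoRA freezes the pretrained weights and learns a low-rank update, expressed as the product of two small matrices, for selected weight matrices. This sharply reduces the number of trainable parameters, and with it the compute and memory cost.

* Faster training: with fewer trainable parameters, training can finish sooner on the same hardware.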

![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/lora_diagram.png)

In [29]:
# Check how many parameters of the pretrained model are trainable; the number is very large, so we use Low-Rank Adaptation (LoRA) to reduce it
print('Parameters: {:,}, Trainable Parameters: {:,}'.format(
  model.num_parameters(),
  model.num_parameters(only_trainable=True)))

Parameters: 3,821,079,552, Trainable Parameters: 3,821,079,552


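#### LoRA Configuration

* `task_type`: `TaskType.CAUSAL_LM` sets the task type to causal language modeling.

* `r` (`rank` in our `Config`): the rank of the low-rank matrices, which determines the number of parameters in the LoRA layers. A lower `r` means fewer parameters and therefore lower compute and storage requirements. Specifically, LoRA represents the update to a linear layer's weight matrix as the product of two low-rank matrices; the smaller `r` is, the smaller these two matrices are. This exercise uses 128.

* `lora_alpha`: a scaling factor applied to the output of the LoRA layers; the LoRA output is multiplied by `lora_alpha / r`. It controls how much influence the low-rank matrices have: a higher `lora_alpha` increases that influence, making it easier to override the base model's existing capabilities. This exercise uses the common choice of twice the `rank`.

* `lora_dropout`: the dropout rate applied on the LoRA path during training, which helps prevent overfitting and improves generalization. For example, a `lora_dropout` of 0.1 means that on each forward pass, 10% of the activations entering the LoRA layers are randomly set to zero.

* `target_modules`: the modules LoRA is applied to. These are usually specific layers or submodules of the model, such as the attention layers in a Transformer; you can list candidates with `model.named_parameters()`. By choosing `target_modules`, you can flexibly decide which layers get LoRA, reducing the trainable parameter count while preserving model quality.

> For well-known models, `target_modules` is set automatically when it is left unspecified and the LoRA adapter model is loaded through `get_peft_model`.
> You can first try leaving it unspecified, and if an error occurs, try setting the attention-related layers.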
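
How `r` and `lora_alpha` enter the computation can be sketched in plain Python. This is a simplified illustration on tiny matrices, not the PEFT implementation: dropout is omitted, and the names `matvec` and `lora_forward` are hypothetical. It assumes the usual LoRA form, output = W x + (`lora_alpha` / `r`) * B (A x), with `B` initialized to zero.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def lora_forward(x, W, A, B, lora_alpha, r):
    """Frozen base projection plus the scaled low-rank update."""
    scaling = lora_alpha / r
    base_out = matvec(W, x)
    update = matvec(B, matvec(A, x))  # d_in -> r -> d_out
    return [b + scaling * u for b, u in zip(base_out, update)]

random.seed(0)
d_in, d_out, r, lora_alpha = 6, 4, 2, 4
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weights
A = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]      # trainable, random init
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero init
x = [1.0] * d_in

# With B at zero the adapter changes nothing, so training starts from the base model.
out_at_init = lora_forward(x, W, A, B, lora_alpha, r)
print(out_at_init == matvec(W, x))

# Once B moves away from zero, the output shifts by scaling * B(Ax).
B[0][0] = 1.0
out_after_update = lora_forward(x, W, A, B, lora_alpha, r)
print(out_after_update[0] - out_at_init[0])
```

Only `A` and `B` are trained, which is `r * (d_in + d_out)` values instead of the `d_in * d_out` values of `W`.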


In [30]:
# LoRA configuration
lora_config = LoraConfig(
  task_type=TaskType.CAUSAL_LM,
  r=config.rank,
  lora_alpha=config.lora_alpha,
  lora_dropout=config.lora_dropout,
  # Phi3ForCausalLM needs target_modules to be specified explicitly
  target_modules=['qkv_proj'],
)

pprint(lora_config)

LoraConfig(task_type=<TaskType.CAUSAL_LM: 'CAUSAL_LM'>,
           peft_type=<PeftType.LORA: 'LORA'>,
           auto_mapping=None,
           base_model_name_or_path=None,
           revision=None,
           inference_mode=False,
           r=128,
           target_modules={'qkv_proj'},
           exclude_modules=None,
           lora_alpha=256,
           lora_dropout=0.05,
           fan_in_fan_out=False,
           bias='none',
           use_rslora=False,
           modules_to_save=None,
           init_lora_weights=True,
           layers_to_transform=None,
           layers_pattern=None,
           rank_pattern={},
           alpha_pattern={},
           megatron_config=None,
           megatron_core='megatron.core',
           loftq_config={},
           eva_config=None,
           use_dora=False,
           layer_replication=None,
           runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=False),
           lora_bias=False)


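#### Load the LoRA Adapter Model

Combining the pretrained model with the LoRA configuration, we can load the LoRA adapter model and observe which layers are affected by the low-rank adaptation.

In [31]:
# Load the LoRA adapter model
peft_model = get_peft_model(
  model, # pretrained model
  lora_config, # LoRA configuration
)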

In [32]:
pprint(lora_config)

LoraConfig(task_type=<TaskType.CAUSAL_LM: 'CAUSAL_LM'>,
           peft_type=<PeftType.LORA: 'LORA'>,
           auto_mapping=None,
           base_model_name_or_path='microsoft/Phi-3.5-mini-instruct',
           revision=None,
           inference_mode=False,
           r=128,
           target_modules={'qkv_proj'},
           exclude_modules=None,
           lora_alpha=256,
           lora_dropout=0.05,
           fan_in_fan_out=False,
           bias='none',
           use_rslora=False,
           modules_to_save=None,
           init_lora_weights=True,
           layers_to_transform=None,
           layers_pattern=None,
           rank_pattern={},
           alpha_pattern={},
           megatron_config=None,
           megatron_core='megatron.core',
           loftq_config={},
           eva_config=None,
           use_dora=False,
           layer_replication=None,
           runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=False),
           lora_bias=False)


#### The LoRA Adapter Model

After loading the LoRA adapter model, inspect the model parameters affected by LoRA.

In [33]:
peft_model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Phi3ForCausalLM(
      (model): Phi3Model(
        (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
        (embed_dropout): Dropout(p=0.0, inplace=False)
        (layers): ModuleList(
          (0-31): 32 x Phi3DecoderLayer(
            (self_attn): Phi3SdpaAttention(
              (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
              (qkv_proj): lora.Linear(
                (base_layer): Linear(in_features=3072, out_features=9216, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=3072, out_features=128, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=128, out_features=9216, bias=False)
                )
                (lora_embedding_A): ParameterDict()


Because we set `target_modules` to `qkv_proj`, the attention layer in every decoder block is affected by LoRA.

```text
  (qkv_proj): lora.Linear(
    (base_layer): Linear(in_features=3072, out_features=9216, bias=False)
    (lora_dropout): ModuleDict(
      (default): Dropout(p=0.05, inplace=False)
    )
    (lora_A): ModuleDict(
      (default): Linear(in_features=3072, out_features=128, bias=False)
    )
    (lora_B): ModuleDict(
      (default): Linear(in_features=128, out_features=9216, bias=False)
    )
    (lora_embedding_A): ParameterDict()
    (lora_embedding_B): ParameterDict()
    (lora_magnitude_vector): ModuleDict()
  )
```

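#### Adjust LoRA Precision

The LoRA adapter weights are created in `torch.float32`; we can cast them to half precision, for example with `model.half()`.

In [34]:
# List the LoRA model's parameter names and dtypes to check whether they are half precision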
for name, param in peft_model.named_parameters():
  print(f'{name}: {param.dtype}')

base_model.model.model.embed_tokens.weight: torch.float16
base_model.model.model.layers.0.self_attn.o_proj.weight: torch.float16
base_model.model.model.layers.0.self_attn.qkv_proj.base_layer.weight: torch.float16
base_model.model.model.layers.0.self_attn.qkv_proj.lora_A.default.weight: torch.float32
base_model.model.model.layers.0.self_attn.qkv_proj.lora_B.default.weight: torch.float32
base_model.model.model.layers.0.mlp.gate_up_proj.weight: torch.float16
base_model.model.model.layers.0.mlp.down_proj.weight: torch.float16
base_model.model.model.layers.0.input_layernorm.weight: torch.float16
base_model.model.model.layers.0.post_attention_layernorm.weight: torch.float16
base_model.model.model.layers.1.self_attn.o_proj.weight: torch.float16
base_model.model.model.layers.1.self_attn.qkv_proj.base_layer.weight: torch.float16
base_model.model.model.layers.1.self_attn.qkv_proj.lora_A.default.weight: torch.float32
base_model.model.model.layers.1.self_attn.qkv_proj.lora_B.default.weight: torch.

Here the pretrained weights are half precision, but the LoRA adapter weights are still full precision.

```shell
base_model.model.model.layers.0.self_attn.qkv_proj.base_layer.weight: torch.float16
base_model.model.model.layers.0.self_attn.qkv_proj.lora_A.default.weight: torch.float32
base_model.model.model.layers.0.self_attn.qkv_proj.lora_B.default.weight: torch.float32
```

In [35]:
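if config.torch_dtype in (torch.float16, torch.bfloat16):
  # Cast to the configured half-precision dtype; `.half()` would always cast to float16,
  # even when `config.torch_dtype` is bfloat16
  peft_model = peft_model.to(config.torch_dtype)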

In [36]:
# List the LoRA model's parameter names and dtypes to check whether they are half precision
for name, param in peft_model.named_parameters():
  print(f'{name}: {param.dtype}')

base_model.model.model.embed_tokens.weight: torch.float16
base_model.model.model.layers.0.self_attn.o_proj.weight: torch.float16
base_model.model.model.layers.0.self_attn.qkv_proj.base_layer.weight: torch.float16
base_model.model.model.layers.0.self_attn.qkv_proj.lora_A.default.weight: torch.float16
base_model.model.model.layers.0.self_attn.qkv_proj.lora_B.default.weight: torch.float16
base_model.model.model.layers.0.mlp.gate_up_proj.weight: torch.float16
base_model.model.model.layers.0.mlp.down_proj.weight: torch.float16
base_model.model.model.layers.0.input_layernorm.weight: torch.float16
base_model.model.model.layers.0.post_attention_layernorm.weight: torch.float16
base_model.model.model.layers.1.self_attn.o_proj.weight: torch.float16
base_model.model.model.layers.1.self_attn.qkv_proj.base_layer.weight: torch.float16
base_model.model.model.layers.1.self_attn.qkv_proj.lora_A.default.weight: torch.float16
base_model.model.model.layers.1.self_attn.qkv_proj.lora_B.default.weight: torch.

After the cast, the LoRA adapter weights are also half precision.

```shell
base_model.model.model.layers.0.self_attn.qkv_proj.base_layer.weight: torch.float16
base_model.model.model.layers.0.self_attn.qkv_proj.lora_A.default.weight: torch.float16
base_model.model.model.layers.0.self_attn.qkv_proj.lora_B.default.weight: torch.float16
```

The number of trainable parameters also drops sharply, from about 3.8B to about 50M.

In [37]:
# Check the number of trainable parameters
peft_model.print_trainable_parameters()

trainable params: 50,331,648 || all params: 3,871,411,200 || trainable%: 1.3001
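
The trainable-parameter count can be reproduced by hand from the layer shapes printed earlier. This is a worked check that uses only numbers shown in this notebook: `qkv_proj` maps 3072 to 9216 features, there are 32 decoder layers, the rank is 128, and the base model has 3,821,079,552 parameters.

```python
hidden_size = 3072            # in_features of qkv_proj
qkv_out_features = 9216       # out_features of qkv_proj
rank = 128                    # LoRA rank r
num_layers = 32               # number of Phi3DecoderLayer blocks
base_params = 3_821_079_552   # parameters of the pretrained model

lora_a_params = hidden_size * rank        # lora_A: 3072 -> 128
lora_b_params = rank * qkv_out_features   # lora_B: 128 -> 9216
trainable = num_layers * (lora_a_params + lora_b_params)
total = base_params + trainable

print(f'{trainable:,} / {total:,} = {100 * trainable / total:.4f}%')
# prints: 50,331,648 / 3,871,411,200 = 1.3001%
```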


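### Prompt Formatting for Training

Unlike the earlier prompt format, this time we include the `assistant` response so that it serves as labeled data for training. Because the `assistant` response is already included, we set `add_generation_prompt` to `False` and omit the response-start marker.

Another difference is that this function does not tokenize by default; it returns the raw text.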

In [38]:
def instruction_completion_formatter(x, tokenize: bool = False):
  return tokenizer.apply_chat_template(
    x['messages'],
    tokenize=tokenize,
    add_generation_prompt=False,
  )

In [39]:
pprint(instruction_completion_formatter(dataset[0]))

('<|user|>\n'
 'What is a polygon?<|end|>\n'
 '<|assistant|>\n'
 'A polygon is a form in Geometry.  It is a single dimensional plane made of '
 'connecting lines and any number of vertices.  It is a closed chain of '
 'connected line segments or edges.  The vertices of the polygon are formed '
 'where two edges meet.  Examples of polygons are hexagons, pentagons, and '
 'octagons.  Any plane that does not contain edges or vertices is not a '
 'polygon.  An example of a non-polygon is a circle.<|end|>\n'
 '<|endoftext|>')
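
To make the effect of `add_generation_prompt` concrete, here is a simplified imitation of the template in plain Python. It is modeled on the output shown above and is not the tokenizer's actual template; `render_chat` is a hypothetical helper, and the assistant answer is shortened for illustration.

```python
def render_chat(messages, add_generation_prompt=False, eos_token="<|endoftext|>"):
    """Simplified imitation of the chat template, based on the output shown above."""
    parts = []
    for message in messages:
        # Each turn: role tag, newline, content, end-of-turn marker.
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    # Either open an assistant turn for generation, or close the sequence.
    parts.append("<|assistant|>\n" if add_generation_prompt else eos_token)
    return "".join(parts)

messages = [
    {"role": "user", "content": "What is a polygon?"},
    {"role": "assistant", "content": "A polygon is a closed chain of line segments."},
]

# Training text: the full conversation, closed with the end-of-sequence token.
training_text = render_chat(messages)
# Inference prompt: the user turn only, ending with an open assistant turn.
inference_prompt = render_chat(messages[:1], add_generation_prompt=True)

print(training_text)
print(inference_prompt)
```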


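### Data Collator

When fine-tuning a language model, a data collator prepares and assembles batches efficiently. The main reasons to use one are:

* Dynamic padding: sequences of different lengths must be padded to the same length before they can be processed in one batch. The data collator computes the maximum length of each batch and pads the sequences accordingly.

* Batch processing: the data collator combines multiple samples into one batch, which makes better use of compute resources, especially on a GPU or TPU.

* Attention masks: when padding sequences, the data collator also produces the matching attention masks so that the model attends only to real data and ignores the padding.

* Code simplification: a data collator removes tedious manual data-handling steps, so developers can focus on model design and training.

In short, a data collator makes fine-tuning more convenient and efficient by ensuring that data is processed consistently.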
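
The dynamic padding and attention mask steps described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation; `pad_batch` is a hypothetical helper and the token IDs are made up.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

padded = pad_batch([[5, 6, 7, 8], [9, 10]], pad_id=0)
print(padded["input_ids"])
print(padded["attention_mask"])
```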

Here we use `DataCollatorForCompletionOnlyLM`, a data collator designed for completion-style language modeling tasks.

In [40]:
# Define the marker that opens the response
response_template = '<|assistant|>'

# Configure DataCollatorForCompletionOnlyLM
data_collator = DataCollatorForCompletionOnlyLM(
  tokenizer=tokenizer,
  response_template=response_template,
)

In [41]:
# Show the output of DataCollatorForCompletionOnlyLM; a label of -100 is ignored by the loss function
batch = data_collator([instruction_completion_formatter(dataset[i], True) for i in range(first_n_data)])
pprint(batch)

{'input_ids': tensor([[32010,  1724,   338,   263, 29807, 29973, 32007, 32001,   319, 29807,
           338,   263,   883,   297,  1879,  7843, 29889, 29871,   739,   338,
           263,  2323, 22112, 10694,  1754,   310, 16791,  3454,   322,   738,
          1353,   310, 13791, 29889, 29871,   739,   338,   263,  5764,  9704,
           310,  6631,  1196, 24611,   470, 12770, 29889, 29871,   450, 13791,
           310,   278, 29807,   526,  8429,   988,  1023, 12770,  5870, 29889,
         29871,  1222,  9422,   310,  1248,  4790,   787,   526, 15090,   351,
           787, 29892, 11137,   351,   787, 29892,   322,  4725,   351,   787,
         29889, 29871,  3139, 10694,   393,   947,   451,  1712, 12770,   470,
         13791,   338,   451,   263, 29807, 29889, 29871,   530,  1342,   310,
           263,  1661, 29899,  3733, 17125,   338,   263,  8607, 29889, 32007,
         32000, 32000, 32000, 32000, 32000, 32000],
        [32010,  1128,   437,   306,  1369,  2734, 29973, 32007, 

In [42]:
# Use the tokenizer's decode method to turn IDs back into text, and show the labels alongside
for idx in range(first_n_data):
  input_ids = batch['input_ids'][idx]
  label_ids = batch['labels'][idx]
  # Use names that do not shadow the built-ins `input` and `id`
  tokens = [tokenizer.decode(token_id) for token_id in input_ids]
  labels = ['-'] * len(input_ids)
  for i, label_id in enumerate(label_ids):
    if label_id != -100:
      labels[i] = tokenizer.decode(label_id)
  pprint(f'input: {tokens}')
  pprint(f'label: {labels}')

("input: ['<|user|>', 'What', 'is', 'a', 'polygon', '?', '<|end|>', "
 "'<|assistant|>', 'A', 'polygon', 'is', 'a', 'form', 'in', 'Ge', 'ometry', "
 "'.', '', 'It', 'is', 'a', 'single', 'dimensional', 'plane', 'made', 'of', "
 "'connecting', 'lines', 'and', 'any', 'number', 'of', 'vertices', '.', '', "
 "'It', 'is', 'a', 'closed', 'chain', 'of', 'connected', 'line', 'segments', "
 "'or', 'edges', '.', '', 'The', 'vertices', 'of', 'the', 'polygon', 'are', "
 "'formed', 'where', 'two', 'edges', 'meet', '.', '', 'Ex', 'amples', 'of', "
 "'pol', 'yg', 'ons', 'are', 'hex', 'ag', 'ons', ',', 'pent', 'ag', 'ons', "
 "',', 'and', 'oct', 'ag', 'ons', '.', '', 'Any', 'plane', 'that', 'does', "
 "'not', 'contain', 'edges', 'or', 'vertices', 'is', 'not', 'a', 'polygon', "
 "'.', '', 'An', 'example', 'of', 'a', 'non', '-', 'pol', 'ygon', 'is', 'a', "
 "'circle', '.', '<|end|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', "
 "'<|endoftext|>', '<|endoftext|>', '<|endoftext|>']")
("label: ['-',

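You can clearly see that the loss function ignores everything up to and including `<|assistant|>`, so the model focuses on generating the response that follows `<|assistant|>`.

Because sequences of different lengths in a batch must be padded to the same length, the data collator pads them automatically, and the padded positions are likewise excluded from the loss.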
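
The masking rule can be sketched in plain Python. This is a simplified illustration rather than the TRL implementation: `completion_only_labels` is a hypothetical helper that uses the first occurrence of the response template, and the example sequence is a shortened version of the first sample, reusing token IDs from the batch output above (32010 `<|user|>`, 32007 `<|end|>`, 32001 `<|assistant|>`, 32000 `<|endoftext|>`).

```python
IGNORE_INDEX = -100  # label value that the loss skips

def completion_only_labels(input_ids, response_ids, pad_id):
    """Copy input_ids into labels, masking the prompt, the response template, and padding."""
    n = len(response_ids)
    # Locate the first occurrence of the response template.
    start = next(
        (i for i in range(len(input_ids) - n + 1) if input_ids[i:i + n] == response_ids),
        None,
    )
    if start is None:
        # No template found: mask the whole example.
        return [IGNORE_INDEX] * len(input_ids)
    labels = list(input_ids)
    for i in range(start + n):
        labels[i] = IGNORE_INDEX
    # Padding positions do not contribute to the loss either.
    return [IGNORE_INDEX if token == pad_id else token for token in labels]

# "<|user|> What is a polygon ? <|end|> <|assistant|> A polygon <|end|>" plus two pad tokens
example_ids = [32010, 1724, 338, 263, 29807, 29973, 32007, 32001, 319, 29807, 32007, 32000, 32000]
example_labels = completion_only_labels(example_ids, response_ids=[32001], pad_id=32000)
print(example_labels)
```

Only the three response tokens keep their IDs as labels; every other position becomes `-100`.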

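### Training arguments

`SFTConfig` sets the parameters of the training run, such as the learning rate, batch size, gradient accumulation steps, number of training epochs, and weight decay.

* `output_dir` specifies the directory for training output.
* `eval_strategy` and `save_strategy` are set to 'epoch', so evaluation and checkpointing happen once per epoch.
* `load_best_model_at_end` is set to `True`, so the best model is loaded when training ends.
* `report_to` is set to 'none', which disables wandb reporting.
* `adam_epsilon` sets the epsilon value of the Adam optimizer.
* `packing` is set to `False`; packing must be disabled when using `DataCollatorForCompletionOnlyLM`.
* `save_total_limit` keeps at most 5 checkpoints.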
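
Batch size and gradient accumulation interact: the number of samples behind each optimizer update is their product, multiplied by the number of devices. The sketch below uses hypothetical values only, since the actual `config` values are defined elsewhere in the notebook, and the per-epoch update count is an approximation.

```python
import math

def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of samples that contribute to one optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices

# Hypothetical values for illustration only.
samples_per_update = effective_batch_size(per_device_batch_size=2, gradient_accumulation_steps=4)
approx_updates_per_epoch = math.ceil(150 / samples_per_update)  # assuming 150 training samples

print(samples_per_update)
print(approx_updates_per_epoch)
```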

In [43]:
training_args = SFTConfig(
  output_dir='sample_data/train_output_qa', # Training output directory
  learning_rate=config.lr, # Learning rate
  per_device_train_batch_size=config.batch_size, # Training batch size per device
  per_device_eval_batch_size=config.batch_size, # Evaluation batch size per device
  gradient_accumulation_steps=config.gradient_accumulation_steps, # Gradient accumulation steps
  num_train_epochs=config.epochs, # Total number of training epochs
  weight_decay=config.weight_decay, # Weight decay
  eval_strategy='epoch', # Evaluate once per epoch
  save_strategy='epoch', # Save once per epoch
  load_best_model_at_end=True, # Load the best model when training ends
  report_to='none', # Disable wandb reporting (the Colab environment expects wandb by default)
  adam_epsilon=config.adam_epsilon, # Half-precision floats need a larger Adam epsilon
  packing=False, # Disable packing when using DataCollatorForCompletionOnlyLM
  save_total_limit=5, # Keep at most 5 checkpoints
)
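
The reason half precision needs a larger Adam epsilon can be checked with the standard library alone. A commonly used default of `1e-8` is below the smallest positive float16 value, so it rounds to zero and no longer protects the division in the update. The value `1e-4` below is a hypothetical example, not necessarily what `config.adam_epsilon` holds.

```python
import struct

def to_float16(value: float) -> float:
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("e", struct.pack("e", value))[0]

default_eps = to_float16(1e-8)  # a common Adam default
larger_eps = to_float16(1e-4)   # hypothetical larger value

print(default_eps)     # 0.0: the epsilon vanishes in float16
print(larger_eps > 0)  # True: still representable
```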

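### Trainer initialization

Initialize the trainer that will train the model.

* `model` is the model to train.
* `tokenizer` is the tokenizer used to process text.
* `train_dataset` and `eval_dataset` are the training and evaluation datasets.
* `formatting_func` is the function that formats the data.
* `data_collator` is the data collator that assembles batches.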

In [44]:
trainer = SFTTrainer(
    model=peft_model, # Model to train
    tokenizer=tokenizer, # Tokenizer to use
    args=training_args, # Training arguments
    train_dataset=dataset, # Training dataset
    eval_dataset=dataset, # Evaluation dataset (the same data as the training set here)
    formatting_func=instruction_completion_formatter, # Formatting function
    data_collator=data_collator, # Data collator
)



### Start training

In [45]:
# Start training; this may take a while
trainer.train()

Epoch,Training Loss,Validation Loss
0,No log,1.436862
2,No log,0.876358
4,No log,0.387336
6,No log,0.156587
8,No log,0.08445
10,No log,0.065174
12,No log,0.035158
14,No log,0.015012
16,No log,0.005875
18,No log,0.006033


TrainOutput(global_step=500, training_loss=0.16114239501953126, metrics={'train_runtime': 2556.785, 'train_samples_per_second': 0.821, 'train_steps_per_second': 0.196, 'total_flos': 5647413307244544.0, 'train_loss': 0.16114239501953126, 'epoch': 47.61904761904762})
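
With `load_best_model_at_end=True` and no `metric_for_best_model` set, the best checkpoint is chosen by the lowest evaluation loss. The sketch below applies that rule to the rows visible in the log above; the log shown is only part of the full run, so this illustrates the selection rule rather than the checkpoint actually restored. Because `eval_dataset` is the training set in this notebook, the loss measures fit to the training data rather than generalization.

```python
# (epoch, validation loss) pairs copied from the rows of the log shown above.
eval_log = [
    (0, 1.436862), (2, 0.876358), (4, 0.387336), (6, 0.156587), (8, 0.08445),
    (10, 0.065174), (12, 0.035158), (14, 0.015012), (16, 0.005875), (18, 0.006033),
]

# Lower loss is better, so take the row with the minimum validation loss.
best_epoch, best_loss = min(eval_log, key=lambda row: row[1])
print(f"best epoch: {best_epoch}, validation loss: {best_loss}")
```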

After training completes, you can run `trainer.evaluate()` to see the final scores:

In [None]:
trainer.evaluate()

#### Save the LoRA model parameters

In [46]:
# Save the LoRA parameters
peft_model.save_pretrained(
  config.saved_lora_path,
  # warnings.warn("Setting `save_embedding_layers` to `True` as embedding layers found in `target_modules`.")
  save_embedding_layers=True,
)

#### Save the tokenizer

In [47]:
# Save the tokenizer
tokenizer.save_pretrained(config.saved_model_path)

('sample_data/saved_encoder_model/tokenizer_config.json',
 'sample_data/saved_encoder_model/special_tokens_map.json',
 'sample_data/saved_encoder_model/tokenizer.json')

### Free resources

In [48]:
# Import the garbage collector
import gc

# Free GPU memory
del trainer
del tokenizer

peft_model.to('cpu')
del peft_model

torch.cuda.empty_cache()

gc.collect()

304

## Evaluate the fine-tuned model

### Load the fine-tuned tokenizer

Load the tokenizer that was saved after training. Note that it retains the settings used during training, including `pad_token` and `padding_side`.

In [49]:
tokenizer = AutoTokenizer.from_pretrained(
  config.saved_model_path
)

In [50]:
# Check that the tokenizer has a padding token and a padding side configured
pprint(tokenizer.pad_token)

'<|endoftext|>'


In [51]:
pprint(tokenizer.padding_side)

'right'


### Load the fine-tuned model

In [52]:
ft_model = PeftModel.from_pretrained(
  model, # Pretrained base model
  config.saved_lora_path, # LoRA adapter
  # Reduces CPU memory usage while loading the model, which is especially useful in memory-constrained environments.
  low_cpu_mem_usage=True,
  torch_dtype=config.torch_dtype,
).to(device)
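
Conceptually, attaching the adapter means each adapted layer computes the frozen base projection plus a scaled low-rank update, `h = W x + (alpha / r) * B (A x)`. The sketch below shows that arithmetic with tiny hand-written matrices in plain Python. All values, including `alpha` and `rank`, are hypothetical, and this is not how PEFT implements the layer internally.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """Base projection plus the scaled low-rank update: W x + (alpha / rank) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (2 x 2)
A = [[1.0, 1.0]]              # rank-1 down-projection (1 x 2)
B = [[0.5], [-0.5]]           # rank-1 up-projection (2 x 1)
x = [2.0, 4.0]

h = lora_forward(W, A, B, x, alpha=2, rank=1)
print(h)
```

With `B` set to all zeros the update vanishes and the layer reproduces the base model's output.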

### Performance after fine-tuning

In [53]:
# Generate responses with the fine-tuned model and add them to first_n_dataset as the ft_response column
first_n_dataset = first_n_dataset.map(
  lambda x: {
    **x,
    "ft_response": generator(x, ft_model),
  },
  batched=False,
)

Map: 100%|██████████████████████████████████████████████████████████████████████| 2/2 [01:02<00:00, 31.17s/ examples]


In [54]:
# Show the fine-tuned model's predictions
pd.set_option('display.max_colwidth', None)
pd.DataFrame(first_n_dataset)

Unnamed: 0,user,assistant,pt_response,ft_response
0,"{'content': 'What is a polygon?', 'role': 'user'}","{'content': 'A polygon is a form in Geometry. It is a single dimensional plane made of connecting lines and any number of vertices. It is a closed chain of connected line segments or edges. The vertices of the polygon are formed where two edges meet. Examples of polygons are hexagons, pentagons, and octagons. Any plane that does not contain edges or vertices is not a polygon. An example of a non-polygon is a circle.', 'role': 'assistant'}","A **polygon** refers to any two-dimensional geometric figure that consists of straight, nonintersecting line segments or ""sides"" which enclose spaces. These sides are connected at their endpoints called vertices (singular: vertex). The interior region formed by these lines and the enclosed space inside it can be filled with color if desired for visualization purposes; this area within but not including its boundary defines what we call 'the plane' in geometry terms when discuss about polygons specifically on flat surfaces like paper maps etc., though technically speaking they exist everywhere as long there’re no curved","A polygon is a form in Geometry. It is a single dimensional plane made of connecting lines and any number of vertices. It has sides which are formed by joining two adjacent vertices together. Examples of polygons are hexagons, pentagon, or octogon. Any closed chain of connected line segments or edges makes up the boundary of a polygon. The interior space within this boundary is called one side of the Polygon. Although it may seem strange, but anti-polyglots exist who dislike Polygon shapes.\nAlthough technically starting from Medieval"
1,"{'content': 'How do I start running?', 'role': 'user'}","{'content': 'Make sure you get comfortable running shoes and attire. Start with achievable goal in mind like a 5K race. If you never ran before, start gradually from a walk, to brisk walk, light jog aiming for 15-30mins initially. Slowly increase your running time and distance as your fitness level improves. One of the most important things is cool down and gentle stretching. Always listen to your body, and take rest days when needed to prevent injury.', 'role': 'assistant'}","Starting to run is a great way for improving your fitness, mental health and overall well-being. Here's how you can begin:\n 1) Set realistic goals - Determine what distance or time frame works best with the current level of physical activity in mind (either beginner mileage guidelines like Couch25K/Coworkout30k if starting from scratch). Remember that progress takes patience! \t * Choose appropriate footwear – Investment into good quality shoes will help prevent injuries downline by providing proper support during","Make sure you get comfortable shoes and attire. Start with achievable goal in mind like a 5K race. If you never ran before, gradually build up your training program starting slow miles such as1-2 easy mileutes until reaching target distancegoal for the day or week. Include rest days to recover\n. One of the most important things is warmup and cool down routines. Always listen to your body's signals and take breaks when needed. Cool Down: Slow jogging walk followed by stretching exercises focusing on legs and lower back. Warm Up: E"


# (Optional) Download files from Colab workspace

In [55]:
![[ ! -z "${COLAB_GPU}" ]] && tar cvzf saved_encoder_model.tgz sample_data/saved_encoder_model/
![[ ! -z "${COLAB_GPU}" ]] && tar cvzf saved_lora_model.tgz sample_data/saved_lora_model/

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [56]:
import os
if 'COLAB_GPU' in os.environ:
  from google.colab import files
  files.download('saved_encoder_model.tgz')
  files.download('saved_lora_model.tgz')