In [1]:
# transformers does not support NumPy 2.0 yet, so pin NumPy 1.26
!pip install -q numpy~=1.26.4 transformers~=4.46.2
!pip install -q datasets~=3.2.0 pydantic~=2.10.4
!pip install -q peft~=0.14.0 trl~=0.13.0

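# Training a Question-Answering Model

In this notebook, we show how to fine-tune a question-answering model. We use the `SFTTrainer` ([Supervised Fine-tuning Trainer](https://huggingface.co/docs/trl/sft_trainer)) class from the `trl` package, together with `transformers` and `peft`, to fine-tune Phi-3.5, a decoder-only model.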

In [2]:
import pandas as pd

from transformers import (
  AutoTokenizer,
  AutoModelForCausalLM,
)
from datasets import load_dataset, DatasetDict

from typing import Any
from pydantic import BaseModel
from pprint import pprint

import torch

# PEFT (LoRA) utilities
from peft import LoraConfig, TaskType, PeftModel, get_peft_model
# SFTTrainer and related utilities
from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM, setup_chat_format

# Select the device: CUDA if available, MPS on Apple Silicon, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = torch.device("mps" if torch.backends.mps.is_available() else device)


## Download the Data

In [4]:
# Load the full `train` split (100 examples)
immutable_dataset = load_dataset("jonascheng/ForbiddenCodeWriting", split="train")

Generating train split: 100%|█████████████████████████████████████████████| 100/100 [00:00<00:00, 1872.80 examples/s]


### What Does the Data Contain?

In [5]:
# Show the features and the number of rows in the raw dataset
immutable_dataset

Dataset({
    features: ['messages'],
    num_rows: 100
})

In [6]:
# Inspect the first example in the dataset
pprint(immutable_dataset[0]['messages'])

[{'content': 'Can you write a C++ function to perform binary search?',
  'role': 'user'},
 {'content': 'Sorry, I cannot provide programming code or code snippets. I can '
             'assist you with the questions.',
  'role': 'assistant'}]


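Each example is a list of two dictionaries, one per message in the conversation. Every dictionary has two keys:

* `role`: a string naming the speaker, either `user` or `assistant`.

* `content`: a string holding the text of the message.

The two messages are laid out as follows:

* First dictionary: `role` is `user`, and `content` holds the user's question.

* Second dictionary: `role` is `assistant`, and `content` holds the assistant's answer.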


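### Data Preprocessing

To simplify the demonstration and speed up training, the data is preprocessed as follows:

1. Split the `messages` column into separate `user` and `assistant` columns.
2. Filter out examples whose `user` or `assistant` `content` is longer than 512.
3. Filter out examples whose `assistant` `content` is shorter than 128.

Only step 1 appears in the cells shown in this section, and the dataset keeps all 100 rows.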
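Steps 2 and 3 of the list above can be sketched as a single predicate. This is only an illustration: the helper name `within_length_limits` is hypothetical, and measuring length in characters is an assumption, since the notebook does not say whether the limits refer to characters or tokens.

```python
def within_length_limits(example, max_len=512, min_assistant_len=128):
    """Return True when an example passes the length filters of steps 2 and 3.

    Assumption: lengths are measured in characters of the `content` field.
    """
    user_len = len(example["user"]["content"])
    assistant_len = len(example["assistant"]["content"])
    # Step 2: drop examples where either side is longer than max_len.
    if user_len > max_len or assistant_len > max_len:
        return False
    # Step 3: drop examples whose assistant answer is shorter than min_assistant_len.
    return assistant_len >= min_assistant_len


# Two toy rows in the same shape as the notebook's `user` / `assistant` columns.
rows = [
    {"user": {"role": "user", "content": "Explain HTTP."},
     "assistant": {"role": "assistant", "content": "x" * 200}},
    {"user": {"role": "user", "content": "Explain HTTPS."},
     "assistant": {"role": "assistant", "content": "Too short."}},
]
kept = [row for row in rows if within_length_limits(row)]
print(len(kept))  # 1
```

With a `datasets.Dataset`, the same predicate could be passed to `dataset.filter(within_length_limits)`. Note that most answers in this dataset are short, so the 128 threshold in step 3 would remove the majority of the rows.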

In [24]:
# Split the messages column into separate user and assistant columns for easier inspection
dataset = immutable_dataset.map(
  lambda x: {
    "user": x["messages"][0],
    "assistant": x["messages"][1],
  }
)
# Show the processed dataset
dataset

Dataset({
    features: ['messages', 'user', 'assistant'],
    num_rows: 100
})

In [25]:
# Reserve 5% of the data for testing
test_dataset = dataset.train_test_split(
  test_size=0.05, # 5% of the data is used for testing
  shuffle=True, # shuffle before splitting; pass `seed=` to get the same split on every run
  )
# Split the remainder into 80% training and 20% validation sets
train_dataset = test_dataset['train'].train_test_split(
  test_size=0.2, # 20% of the remaining data is used for validation
  shuffle=True, # shuffle before splitting; pass `seed=` to get the same split on every run
  )
dataset = DatasetDict({
  'train': train_dataset['train'],
  'validation': train_dataset['test'],
  'test': test_dataset['test'],
  })
# Show the resulting splits
dataset

DatasetDict({
    train: Dataset({
        features: ['messages', 'user', 'assistant'],
        num_rows: 76
    })
    validation: Dataset({
        features: ['messages', 'user', 'assistant'],
        num_rows: 19
    })
    test: Dataset({
        features: ['messages', 'user', 'assistant'],
        num_rows: 5
    })
})
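The split sizes above follow from simple arithmetic. The sketch below reproduces them under the assumption that the held-out part of each split is rounded up; `split_sizes` is a hypothetical helper, not part of `datasets`.

```python
import math


def split_sizes(n_rows, test_fraction):
    """Return (n_train, n_test), rounding the held-out part up (assumed rule)."""
    n_test = math.ceil(n_rows * test_fraction)
    return n_rows - n_test, n_test


# First split: 100 rows, 5% held out for testing.
n_rest, n_test = split_sizes(100, 0.05)
# Second split: 20% of the remaining rows held out for validation.
n_train, n_val = split_sizes(n_rest, 0.2)
print(n_train, n_val, n_test)  # 76 19 5
```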

In [26]:
# Show the first first_n_data rows of the test split
first_n_data = 5
pd.set_option('display.max_colwidth', None)
pd.DataFrame(dataset['test'].select(range(first_n_data)))

Unnamed: 0,messages,user,assistant
0,"[{'content': 'Can you write a function to find the greatest common divisor (GCD) in JavaScript?', 'role': 'user'}, {'content': 'Sorry, I cannot provide programming code or code snippets. I can assist you with the questions.', 'role': 'assistant'}]","{'content': 'Can you write a function to find the greatest common divisor (GCD) in JavaScript?', 'role': 'user'}","{'content': 'Sorry, I cannot provide programming code or code snippets. I can assist you with the questions.', 'role': 'assistant'}"
1,"[{'content': 'What is the purpose of the 'yield' keyword in Python?', 'role': 'user'}, {'content': 'The 'yield' keyword in Python is used to turn a function into a generator, allowing it to return a value and pause its execution.', 'role': 'assistant'}]","{'content': 'What is the purpose of the 'yield' keyword in Python?', 'role': 'user'}","{'content': 'The 'yield' keyword in Python is used to turn a function into a generator, allowing it to return a value and pause its execution.', 'role': 'assistant'}"
2,"[{'content': 'Explain the difference between HTTP and HTTPS.', 'role': 'user'}, {'content': 'HTTP is the protocol used for transferring data over the web, while HTTPS is the secure version of HTTP, using SSL/TLS to encrypt data.', 'role': 'assistant'}]","{'content': 'Explain the difference between HTTP and HTTPS.', 'role': 'user'}","{'content': 'HTTP is the protocol used for transferring data over the web, while HTTPS is the secure version of HTTP, using SSL/TLS to encrypt data.', 'role': 'assistant'}"
3,"[{'content': 'What is the purpose of the 'synchronized' keyword in Java?', 'role': 'user'}, {'content': 'The 'synchronized' keyword in Java is used to control access to a block of code or an object by multiple threads.', 'role': 'assistant'}]","{'content': 'What is the purpose of the 'synchronized' keyword in Java?', 'role': 'user'}","{'content': 'The 'synchronized' keyword in Java is used to control access to a block of code or an object by multiple threads.', 'role': 'assistant'}"
4,"[{'content': 'What is the purpose of the 'foreach' loop in Java?', 'role': 'user'}, {'content': 'The 'foreach' loop in Java is used to iterate over elements in a collection or array.', 'role': 'assistant'}]","{'content': 'What is the purpose of the 'foreach' loop in Java?', 'role': 'user'}","{'content': 'The 'foreach' loop in Java is used to iterate over elements in a collection or array.', 'role': 'assistant'}"


## Training Parameters

### Training Configuration

In [27]:
# Training-related settings
class Config(BaseModel):
  model_name: str = 'microsoft/Phi-3.5-mini-instruct'
  torch_dtype: Any = torch.bfloat16 # half-precision floating point
  adam_epsilon: float = 1e-4 # half-precision training needs a larger Adam epsilon
  saved_model_path: str = 'sample_data/saved_encoder_model' # path to save the trained model
  saved_lora_path: str = 'sample_data/saved_lora_model' # path to save the trained LoRA adapter
  batch_size: int = 2 # per-device batch size for training and evaluation
  gradient_accumulation_steps: int = 2 # number of batches whose gradients are accumulated before each optimizer update
  epochs: int = 25 # number of passes over the entire training dataset
  lr: float = 2e-4 # learning rate, controls how large each update step is
  weight_decay: float = 0.01 # weight decay, regularizes the model by penalizing large weights

  # Text-generation settings
  temperature: float = 0.1 # sampling temperature
  max_new_tokens: int = 125 # maximum number of newly generated tokens
  repetition_penalty: float = 1.5 # repetition penalty; 1.0 means no penalty, larger values penalize repeated tokens more strongly

  # LoRA settings
  rank: int = 128 # rank of the LoRA matrices
  lora_alpha: int = rank * 2 # LoRA scaling factor; evaluated once here as 256, so it does not follow a later override of rank
  lora_dropout: float = 0.05 # dropout probability for the LoRA layers

if device.type == 'mps': # for quick testing on Apple Silicon
  config = Config(
    torch_dtype=torch.float16, # use float16 instead of bfloat16 on Apple Silicon
  )
else:
  config = Config()

## Performance Before Fine-tuning

### Load the Pretrained Tokenizer

In [28]:
# Load the tokenizer that belongs to the pretrained model
tokenizer = AutoTokenizer.from_pretrained(
  config.model_name,
)
pprint(tokenizer)

LlamaTokenizerFast(name_or_path='microsoft/Phi-3.5-mini-instruct', vocab_size=32000, model_max_length=131072, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '<|endoftext|>', 'unk_token': '<unk>', 'pad_token': '<|endoftext|>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=True, lstrip=False, single_word=False, normalized=False, special=False),
	32000: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	32001: AddedToken("<|assistant|>", rstrip=True, lstrip=False, single_word=False, normalized=False, special=True),
	32002: AddedToken("<|placeholder1|>", rstrip=True, lstrip=False, single_word=False, normalized=False, special

* If no `pad_token` is defined, define one and add it to the tokenizer.
* If `padding_side` is not `right`, set it to `right`.

In [31]:
# Add a pad_token to the tokenizer if it has none
if tokenizer.pad_token is None:
  tokenizer.pad_token = tokenizer.eos_token
  print('=== Set Padding Token ===')
  pprint(tokenizer)
# Make sure padding_side is 'right'
if tokenizer.padding_side != 'right':
  tokenizer.padding_side = 'right'
  print('=== Set Padding Side ===')
  pprint(tokenizer)

=== Set Padding Side ===
LlamaTokenizerFast(name_or_path='microsoft/Phi-3.5-mini-instruct', vocab_size=32000, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '<|endoftext|>', 'unk_token': '<unk>', 'pad_token': '<|endoftext|>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=True, lstrip=False, single_word=False, normalized=False, special=False),
	32000: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	32001: AddedToken("<|assistant|>", rstrip=True, lstrip=False, single_word=False, normalized=False, special=True),
	32002: AddedToken("<|placeholder1|>", rstrip=True, lstrip=False, single_word=False, 

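### Load the Pretrained Model

Because GPU memory is limited, we fine-tune the model in half precision. Note that when fine-tuning in half precision, `adam_epsilon` in `TrainingArguments` should be set to `1e-4`. The default `adam_epsilon` of `1e-8` is too small for half precision: in `float16` it rounds to zero, which can make the Adam update numerically unstable.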
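The effect of the small default epsilon can be checked without a GPU. The sketch below uses only the standard library to round a value to IEEE 754 half precision (`float16`); `to_float16` is a hypothetical helper written for this illustration. It covers `float16` only; `bfloat16` has a much wider exponent range, so this particular underflow does not apply to it.

```python
import struct


def to_float16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    # The 'e' format code packs a float as a 16-bit half-precision number.
    return struct.unpack('<e', struct.pack('<e', value))[0]


default_eps = to_float16(1e-8)   # default adam_epsilon
notebook_eps = to_float16(1e-4)  # value used in this notebook

print(default_eps)       # 0.0, the default epsilon underflows in float16
print(notebook_eps > 0)  # True, 1e-4 survives the conversion
```

With an epsilon of zero, the denominator of the Adam update can itself become zero whenever the second-moment estimate underflows, which is why a larger epsilon is used here.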

`AutoModelForCausalLM` is the auto class for causal language modeling; it can load a variety of pretrained models for text generation.

In [32]:
model = AutoModelForCausalLM.from_pretrained(
  config.model_name,
  torch_dtype=config.torch_dtype,
  # Reduces peak CPU memory while the checkpoint is loaded, which helps on machines with limited RAM.
  low_cpu_mem_usage=True,
).to(device)

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 2/2 [00:11<00:00,  5.58s/it]


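### Configure the Chat Template

Adding special tokens to a language model is essential for training a chat model. These tokens are placed between the different roles in a conversation, such as `user`, `assistant`, and `system`, and help the model recognize the structure and flow of the dialogue. This setup is necessary for the model to generate coherent, context-appropriate responses in a chat setting.

The `setup_chat_format()` function in `trl` sets up the model and tokenizer for conversational tasks.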

In [33]:
# Set up the chat format with default 'chatml' format
if tokenizer.chat_template is None:
  model, tokenizer = setup_chat_format(model, tokenizer)
  print('=== Set chat format ===')
  pprint(tokenizer)

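The model we use is already a chat model with its own chat template, so the condition above is false and the chat format is not set up again.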

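### Prompt Formatting

To define the prompt format, we create a formatting function.

Note that we set `add_generation_prompt` to `True`, which appends the token that marks the start of the assistant's response. This ensures that when the model generates text, it writes the assistant's reply instead of doing something unexpected, such as continuing the user's message. Keep in mind that chat models are still just language models trained to predict the next token, and a chat is simply a special kind of text to them. You need to guide them with the appropriate control tokens so they know what to do.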

In [34]:
def instruction_formatter(x, tokenize):
  if tokenize:
    return tokenizer.apply_chat_template(
      [x['user']],
      tokenize=tokenize,
      add_generation_prompt=True,
      return_tensors='pt',
      return_dict=True,
    ).to(device)
  else:
    return tokenizer.apply_chat_template(
      [x['user']],
      tokenize=tokenize,
      add_generation_prompt=True,
    )

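`tokenize=False` skips tokenization and returns the formatted text as a string, with the special tokens kept. Because `add_generation_prompt` is `True`, the special token `<|assistant|>` is appended to mark where the response starts.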
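To make the resulting layout concrete, here is a minimal sketch that rebuilds the string by hand. `render_chat` is a hypothetical function that only mirrors the text printed in this notebook for `user` and `assistant` turns; it is not the tokenizer's actual template and ignores other cases such as `system` messages.

```python
def render_chat(messages, add_generation_prompt=False):
    """Hypothetical re-creation of the chat layout seen in this notebook's outputs."""
    # Each turn is: role marker, newline, content, end marker, newline.
    text = "".join(
        f"<|{message['role']}|>\n{message['content']}<|end|>\n"
        for message in messages
    )
    # A generation prompt opens the assistant turn; otherwise the text is closed off.
    return text + ("<|assistant|>\n" if add_generation_prompt else "<|endoftext|>")


question = {"role": "user", "content": "Explain the difference between HTTP and HTTPS."}
prompt = render_chat([question], add_generation_prompt=True)
print(repr(prompt))
```

The trailing `<|assistant|>` marker is what tells the model that the next tokens belong to the assistant's reply.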

In [35]:
# tokenize=False returns the formatted text without tokenizing it
prompt_text = instruction_formatter(dataset['test'][0], tokenize=False)
pprint(prompt_text)

('<|user|>\n'
 'Can you write a function to find the greatest common divisor (GCD) in '
 'JavaScript?<|end|>\n'
 '<|assistant|>\n')


In [37]:
# tokenize=True tokenizes the prompt and returns the token IDs and attention mask as tensors
tokenized_input = instruction_formatter(dataset['test'][0], tokenize=True)
pprint(tokenized_input)

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
       device='mps:0'),
 'input_ids': tensor([[32010,  1815,   366,  2436,   263,   740,   304,  1284,   278, 14176,
          3619,  8572,   272,   313, 29954,  6530, 29897,   297,  8286, 29973,
         32007, 32001]], device='mps:0')}


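With `tokenize=True`, the tokenizer returns two main parts, `input_ids` and `attention_mask`:

* `input_ids`: a tensor containing the token IDs of the input text, that is, the numeric representation the tokenizer produces from the text.

* `attention_mask`: a tensor of the same shape that tells the model which positions to attend to. A value of 1 means the position should be attended to, and 0 means it should be ignored. In this example every value is 1 because there is no padding, so the model attends to all positions.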
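Zeros in the attention mask only appear once sequences of different lengths are padded into one batch. The following sketch shows the idea with plain Python lists; `pad_batch` is a hypothetical helper, and the token IDs are taken from the outputs in this notebook (32000 is the padding token).

```python
def pad_batch(sequences, pad_id, padding_side="right"):
    """Pad token-ID lists to equal length and build the matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        real = [1] * len(seq)       # real tokens are attended to
        ignored = [0] * len(padding)  # padding positions are ignored
        if padding_side == "right":
            input_ids.append(list(seq) + padding)
            attention_mask.append(real + ignored)
        else:
            input_ids.append(padding + list(seq))
            attention_mask.append(ignored + real)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = pad_batch([[32010, 1815, 32007, 32001], [32010, 32007]], pad_id=32000)
print(padded["input_ids"])       # [[32010, 1815, 32007, 32001], [32010, 32007, 32000, 32000]]
print(padded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```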

In [38]:
# Decode each ID back to text and print the pairs side by side
for token_id in tokenized_input['input_ids'][0]:
  print(f'{token_id} -> {tokenizer.decode([token_id])}')

32010 -> <|user|>
1815 -> Can
366 -> you
2436 -> write
263 -> a
740 -> function
304 -> to
1284 -> find
278 -> the
14176 -> greatest
3619 -> common
8572 -> divis
272 -> or
313 -> (
29954 -> G
6530 -> CD
29897 -> )
297 -> in
8286 -> JavaScript
29973 -> ?
32007 -> <|end|>
32001 -> <|assistant|>


### Performance Before Fine-tuning

#### Generating a Response for a Single Example

In [39]:
# Generate a response with the pretrained model
output_ids = model.generate(
  **tokenized_input,
  temperature=config.temperature, # only takes effect when sampling is enabled (do_sample=True)
  max_new_tokens=config.max_new_tokens,
  repetition_penalty=config.repetition_penalty,
)


In [40]:
output_ids

tensor([[32010,  1815,   366,  2436,   263,   740,   304,  1284,   278, 14176,
          3619,  8572,   272,   313, 29954,  6530, 29897,   297,  8286, 29973,
         32007, 32001,   315, 13946,   368, 29991,  2266, 29915, 29879,   385,
          1342,   310,   920,   591,   508,  2334,   445,   773, 16430,   695,
           333, 30010, 29879,  5687, 29901,    13, 28956,  7729, 29871,    12,
           259,   849,  6680,  5023,   363,  9138,   402,  2252, 29889,   739,
          4893,  1023,  3694,   408,  1881,   322,  3639,  1009,   330,  6854,
           995,  1678,  1040,  8147, 25120,   271,   342, 18877, 12596,   275,
          1611,   353, 29898, 29874, 29892,   289,  3892, 26208,   268,   565,
         11864, 29890,  2597,   418,   736,  5792,   869,  6897, 30419,  4557,
         30409, 29936,   500,  1683,   426,   539,  1235, 21162, 29922, 11309,
         30267,  1545,  7207,  3552,  4537, 29896,   511,  1353, 29906,   416,
          4706,  4949,  3599, 25397,  1246,   411,  

In [41]:
# Convert output_ids back to text
output = tokenizer.decode(
  output_ids[0],
  skip_special_tokens=False, # keep special tokens (such as the role markers) in the decoded text
)

In [42]:
pprint(output)

('<|user|> Can you write a function to find the greatest common divisor (GCD) '
 "in JavaScript?<|end|><|assistant|> Certainly! Here's an example of how we "
 'can implement this using Euclid’s algorithm:\n'
 '```javascript \t   // Function definition for finding Gcd. It takes two '
 'numbers as input and returns their gcf value    const '
 'calculateGreatestCommonDivisior =(a, b)=>{     if(!b){      return Math '
 '.abs（Number）; } else {       let remainder=Math。modulo((number1), '
 'number2);        /* Recursive call with updated parameters */         '
 'console log(`Remainder is ${reminder}` );          Calculate Great Common')


Keep only the generated text, that is, the text after `<|assistant|>`.

In [43]:
# Keep only the generated text, i.e. the text after <|assistant|>
pprint(output.split('<|assistant|>')[1].strip())

("Certainly! Here's an example of how we can implement this using Euclid’s "
 'algorithm:\n'
 '```javascript \t   // Function definition for finding Gcd. It takes two '
 'numbers as input and returns their gcf value    const '
 'calculateGreatestCommonDivisior =(a, b)=>{     if(!b){      return Math '
 '.abs（Number）; } else {       let remainder=Math。modulo((number1), '
 'number2);        /* Recursive call with updated parameters */         '
 'console log(`Remainder is ${reminder}` );          Calculate Great Common')
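Splitting on `<|assistant|>` works here, but it depends on that marker appearing exactly once in the decoded text. An alternative, sketched below with plain lists, relies on the fact that the generated sequence starts with the prompt tokens, as the `output_ids` tensor above shows. `strip_prompt` is a hypothetical helper and the token lists are shortened for illustration.

```python
def strip_prompt(output_ids, prompt_length):
    """Return only the newly generated token IDs."""
    return output_ids[prompt_length:]


prompt_ids = [32010, 1815, 366, 32007, 32001]        # prompt fed to the model (shortened)
generated_ids = prompt_ids + [315, 13946, 368]       # prompt followed by new tokens
new_ids = strip_prompt(generated_ids, len(prompt_ids))
print(new_ids)  # [315, 13946, 368]
```

In the notebook, the same slice could be taken as `output_ids[0][tokenized_input['input_ids'].shape[1]:]` and then decoded with `skip_special_tokens=True`.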


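#### Batch Evaluation of the Model

Now that we have seen how to generate a response, we wrap the steps in a `generator()` function. It takes one dataset row and a model and returns the model's response, which lets us process several examples in a loop.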


In [44]:
# Wrap the code above in a function so we can process examples in a loop
def generator(x, model):
  tokenized_input = instruction_formatter(x, tokenize=True)
  output_ids = model.generate(
    **tokenized_input,
    temperature=config.temperature,
    max_new_tokens=config.max_new_tokens,
    repetition_penalty=config.repetition_penalty,
  )
  output = tokenizer.decode(output_ids[0], skip_special_tokens=False)
  return output.split('<|assistant|>')[1].strip()

In [45]:
# This step can take a while, so we only process the first first_n_data rows
first_n_dataset = dataset['test'].select(range(first_n_data))

# Drop the messages column
first_n_dataset = first_n_dataset.remove_columns('messages')

# Generate responses with the pretrained model and store them in the pt_response column
pt_response = []
for x in first_n_dataset:
  pt_response.append(generator(x, model))

first_n_df = pd.DataFrame(first_n_dataset)
first_n_df['pt_response'] = pt_response


In [46]:
# Show the pretrained model's responses
pd.set_option('display.max_colwidth', None)
first_n_df

Unnamed: 0,user,assistant,pt_response
0,"{'content': 'Can you write a function to find the greatest common divisor (GCD) in JavaScript?', 'role': 'user'}","{'content': 'Sorry, I cannot provide programming code or code snippets. I can assist you with the questions.', 'role': 'assistant'}","Certainly! Here's an example of how we can implement this using Euclid’s algorithm:\n```javascript \t // Function definition for finding Gcd. It takes two numbers as input and returns their gcf value const calculateGreatestCommonDivisior =(a, b)=>{ if(!b){ return Math .abs（Number）; } else { let remainder=Math。modulo((number1), number2); /* Recursive call with updated parameters */ console log(`Remainder is ${reminder}` ); Calculate Great Common"
1,"{'content': 'What is the purpose of the 'yield' keyword in Python?', 'role': 'user'}","{'content': 'The 'yield' keyword in Python is used to turn a function into a generator, allowing it to return a value and pause its execution.', 'role': 'assistant'}","The `YIELD` statement, or more accurately termed as a ""generator"" function when used with it (`def`), plays an essential role within generator functions and coroutines. Here are its primary purposes:\n 1) **Generating Values** - It allows you to generate values on-the-fly without storing them all at once (like lists). This can be particularly useful for large data sets where memory efficiency matters because only one value needs processing/storing per time step rather than holding everything simultaneously like list comprehensions do which could lead into high space complexity scenarios especially if dealing big datasets"
2,"{'content': 'Explain the difference between HTTP and HTTPS.', 'role': 'user'}","{'content': 'HTTP is the protocol used for transferring data over the web, while HTTPS is the secure version of HTTP, using SSL/TLS to encrypt data.', 'role': 'assistant'}","The primary differences lie in security, data integrity during transmission (encryption), privacy protection for users' information when browsing websites using these protocols:\n 1) **Security** - One of main distinctions is that Hypertext Transfer Protocol Secure or https uses SSL/TLS encryption to secure communication over a network which makes it more difficult than its counterpart http without such an added layer where sensitive user details are involved like passwords & credit card numbers etc., while regular unsecured version doesn’t provide this level ensuring lesser safety against eavesdropping attacks from malicious entities on"
3,"{'content': 'What is the purpose of the 'synchronized' keyword in Java?', 'role': 'user'}","{'content': 'The 'synchronized' keyword in Java is used to control access to a block of code or an object by multiple threads.', 'role': 'assistant'}","The `synchornize` (note: it should be spelled ""**thread-safe synchronization mechanism for controlling access to shared resources by multiple threads. It ensures that only one thread can execute a block or method at any given time, preventing race conditions and data inconsistencies when accessing mutable objects concurrently from different parts/threads within an application running on multiples processors simultaneously).\nHere are some key points about its usage with examples illustrating how you might use this feature effectively while maintainable code practices like encapsulation remain intact through proper design patterns such as Singleton where necessary"
4,"{'content': 'What is the purpose of the 'foreach' loop in Java?', 'role': 'user'}","{'content': 'The 'foreach' loop in Java is used to iterate over elements in a collection or array.', 'role': 'assistant'}","The `for-each` (or enhanced for) syntax introduced with JDK 5, also known as ""enumeration,"" serves a specific and powerful role within programming languages like JavaScript/Java. Its primary purposes are:\n1️⃣ **Simplification** - It simplifies code that iterates over elements from an array or collection by removing explicit index management (`i`, etc.). This makes your intentions clearer to readers who might not be familiar deeply into how arrays work underneath but understand iteration concepts well enough through this simplified constructs; thus improving readability significantly compared using traditional"


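## Training the Model

### The LoRA Training Strategy

LoRA (Low-Rank Adaptation) is a technique for fine-tuning large language models that improves training efficiency and reduces the compute required. Some reasons to train with LoRA:

* Lower compute cost: LoRA freezes the pretrained weight matrices and learns a low-rank update for each targeted matrix, expressed as the product of two small matrices. This greatly reduces the number of trainable parameters, and with it the memory needed for gradients and optimizer state.

* Faster training: with far fewer parameters to update, training can finish sooner on the same hardware.

![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/lora_diagram.png)

First, let us look at how many trainable parameters the pretrained model has. The number is very large, which is why we use Low-Rank Adaptation (LoRA) to reduce it.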

In [47]:
# Count the pretrained model's parameters and how many of them are trainable
print('Parameters: {:,}, Trainable Parameters: {:,}'.format(
  model.num_parameters(),
  model.num_parameters(only_trainable=True)))

Parameters: 3,821,079,552, Trainable Parameters: 3,821,079,552


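#### LoRA Configuration

* `task_type`: `TaskType.CAUSAL_LM` sets the task type to causal language modeling.

* `r`: the rank of the low-rank matrices, which determines the number of parameters in the LoRA layers. A lower `r` means fewer parameters and lower compute and storage requirements. Concretely, LoRA represents the update to a linear layer's weight matrix as the product of two low-rank matrices; the smaller `r` is, the smaller those two matrices are. This exercise uses 128.

* `lora_alpha`: a scaling factor applied to the output of the LoRA layers. It controls how much influence the low-rank update has: the higher the value, the stronger the LoRA layers' influence and the more easily they override the base model's existing abilities. This exercise uses the common choice of twice the rank.

* `lora_dropout`: the dropout rate that randomly zeroes part of the activations flowing through the LoRA layers during training. This helps prevent overfitting and improves generalization. For example, a `lora_dropout` of 0.1 zeroes about 10% of those activations on each forward pass.

* `target_modules`: the modules to which LoRA is applied. These are usually specific layers or submodules, such as the attention projections in a Transformer; you can list candidate names with `model.named_parameters()`. Choosing `target_modules` lets you decide which layers receive LoRA, balancing model quality against the number of trainable parameters.

> For well-known model architectures, `get_peft_model` sets `target_modules` automatically when you do not specify it.
> You can try leaving it unset first and, if an error occurs, set it to the attention-related layers.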
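The roles of `r` and `lora_alpha` can be summarized with the standard LoRA formulation. The symbols $W_0$, $A$, $B$, $x$, and $h$ are notation introduced here rather than names from the notebook, and the $\alpha / r$ scaling assumes the default setting without rank-stabilized scaling:

$$
h = W_0 x + \frac{\alpha}{r}\, B A x, \qquad A \in \mathbb{R}^{r \times d_{\text{in}}}, \quad B \in \mathbb{R}^{d_{\text{out}} \times r}
$$

Here $W_0$ is the frozen pretrained weight, and only $A$ and $B$ are trained. With `r = 128` and `lora_alpha = 256`, the low-rank update is scaled by $\alpha / r = 2$.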


In [48]:
# LoRA configuration
lora_config = LoraConfig(
  task_type=TaskType.CAUSAL_LM,
  r=config.rank,
  lora_alpha=config.lora_alpha,
  lora_dropout=config.lora_dropout,
  # Phi3ForCausalLM needs target_modules to be specified explicitly
  target_modules=['qkv_proj'],
)

pprint(lora_config)

LoraConfig(task_type=<TaskType.CAUSAL_LM: 'CAUSAL_LM'>,
           peft_type=<PeftType.LORA: 'LORA'>,
           auto_mapping=None,
           base_model_name_or_path=None,
           revision=None,
           inference_mode=False,
           r=128,
           target_modules={'qkv_proj'},
           exclude_modules=None,
           lora_alpha=256,
           lora_dropout=0.05,
           fan_in_fan_out=False,
           bias='none',
           use_rslora=False,
           modules_to_save=None,
           init_lora_weights=True,
           layers_to_transform=None,
           layers_pattern=None,
           rank_pattern={},
           alpha_pattern={},
           megatron_config=None,
           megatron_core='megatron.core',
           loftq_config={},
           eva_config=None,
           use_dora=False,
           layer_replication=None,
           runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=False),
           lora_bias=False)


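#### Load the LoRA-Adapted Model

Combining the pretrained model with the LoRA configuration gives the LoRA-adapted model, in which we can see which layers receive the low-rank adapters.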

In [49]:
# Wrap the pretrained model with LoRA adapters
peft_model = get_peft_model(
  model, # pretrained model
  lora_config, # LoRA configuration
)

In [50]:
pprint(lora_config)

LoraConfig(task_type=<TaskType.CAUSAL_LM: 'CAUSAL_LM'>,
           peft_type=<PeftType.LORA: 'LORA'>,
           auto_mapping=None,
           base_model_name_or_path='microsoft/Phi-3.5-mini-instruct',
           revision=None,
           inference_mode=False,
           r=128,
           target_modules={'qkv_proj'},
           exclude_modules=None,
           lora_alpha=256,
           lora_dropout=0.05,
           fan_in_fan_out=False,
           bias='none',
           use_rslora=False,
           modules_to_save=None,
           init_lora_weights=True,
           layers_to_transform=None,
           layers_pattern=None,
           rank_pattern={},
           alpha_pattern={},
           megatron_config=None,
           megatron_core='megatron.core',
           loftq_config={},
           eva_config=None,
           use_dora=False,
           layer_replication=None,
           runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=False),
           lora_bias=False)


#### The LoRA-Adapted Model

After wrapping the model, inspect which modules are affected by LoRA.

In [51]:
peft_model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Phi3ForCausalLM(
      (model): Phi3Model(
        (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
        (embed_dropout): Dropout(p=0.0, inplace=False)
        (layers): ModuleList(
          (0-31): 32 x Phi3DecoderLayer(
            (self_attn): Phi3SdpaAttention(
              (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
              (qkv_proj): lora.Linear(
                (base_layer): Linear(in_features=3072, out_features=9216, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=3072, out_features=128, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=128, out_features=9216, bias=False)
                )
                (lora_embedding_A): ParameterDict()


Because `target_modules` is set to `qkv_proj`, the fused query/key/value projection in every attention layer is wrapped with LoRA.

```text
  (qkv_proj): lora.Linear(
    (base_layer): Linear(in_features=3072, out_features=9216, bias=False)
    (lora_dropout): ModuleDict(
      (default): Dropout(p=0.05, inplace=False)
    )
    (lora_A): ModuleDict(
      (default): Linear(in_features=3072, out_features=128, bias=False)
    )
    (lora_B): ModuleDict(
      (default): Linear(in_features=128, out_features=9216, bias=False)
    )
    (lora_embedding_A): ParameterDict()
    (lora_embedding_B): ParameterDict()
    (lora_magnitude_vector): ModuleDict()
  )
```

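#### Adjust the LoRA Precision

The newly added LoRA weights are created in `torch.float32`. We cast the whole model to the configured half-precision dtype so that the adapter weights match the base model.

In [52]:
if config.torch_dtype in (torch.float16, torch.bfloat16):
  # Cast to the configured dtype; .half() would always produce float16, even when bfloat16 is configured
  peft_model = peft_model.to(config.torch_dtype)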

In [53]:
# List the LoRA model's parameter names and dtypes to confirm that half precision is used
for name, param in peft_model.named_parameters():
  print(f'{name}: {param.dtype}')

base_model.model.model.embed_tokens.weight: torch.float16
base_model.model.model.layers.0.self_attn.o_proj.weight: torch.float16
base_model.model.model.layers.0.self_attn.qkv_proj.base_layer.weight: torch.float16
base_model.model.model.layers.0.self_attn.qkv_proj.lora_A.default.weight: torch.float16
base_model.model.model.layers.0.self_attn.qkv_proj.lora_B.default.weight: torch.float16
base_model.model.model.layers.0.mlp.gate_up_proj.weight: torch.float16
base_model.model.model.layers.0.mlp.down_proj.weight: torch.float16
base_model.model.model.layers.0.input_layernorm.weight: torch.float16
base_model.model.model.layers.0.post_attention_layernorm.weight: torch.float16
base_model.model.model.layers.1.self_attn.o_proj.weight: torch.float16
base_model.model.model.layers.1.self_attn.qkv_proj.base_layer.weight: torch.float16
base_model.model.model.layers.1.self_attn.qkv_proj.lora_A.default.weight: torch.float16
base_model.model.model.layers.1.self_attn.qkv_proj.lora_B.default.weight: torch.

After the conversion, the LoRA adapter weights are in half precision as well (`float16` in this run).

```text
base_model.model.model.layers.0.self_attn.qkv_proj.base_layer.weight: torch.float16
base_model.model.model.layers.0.self_attn.qkv_proj.lora_A.default.weight: torch.float16
base_model.model.model.layers.0.self_attn.qkv_proj.lora_B.default.weight: torch.float16
```

The number of trainable parameters also drops sharply, from about 3.8B to about 50M.

In [54]:
# Show the number of trainable parameters
peft_model.print_trainable_parameters()

trainable params: 50,331,648 || all params: 3,871,411,200 || trainable%: 1.3001
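The trainable-parameter count can be reproduced from the layer shapes printed earlier: each `qkv_proj` has 3072 input features and 9216 output features, the rank is 128, and there are 32 decoder layers. `lora_param_count` is a hypothetical helper used only for this check.

```python
def lora_param_count(in_features, out_features, rank, num_layers):
    """Count LoRA parameters when one linear layer per decoder layer is adapted."""
    # lora_A maps in_features -> rank, lora_B maps rank -> out_features (no biases).
    per_layer = rank * in_features + out_features * rank
    return per_layer * num_layers


trainable = lora_param_count(in_features=3072, out_features=9216, rank=128, num_layers=32)
base = 3_821_079_552  # parameter count of the pretrained model reported above

print(f"{trainable:,}")                    # 50,331,648
print(f"{base + trainable:,}")             # 3,871,411,200
print(f"{100 * trainable / (base + trainable):.4f}")  # 1.3001
```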


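### Prompt Formatting

Unlike the earlier prompt format, this one includes the `assistant` response, so that it can serve as the labeled target for training. Because the response is already included, we set `add_generation_prompt` to `False`, which omits the start-of-response marker that would otherwise be appended at the end.

Another difference is that this function does not tokenize by default and returns the formatted text directly.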

In [55]:
def instruction_completion_formatter(x, tokenize: bool = False):
  return tokenizer.apply_chat_template(
    x['messages'],
    tokenize=tokenize,
    add_generation_prompt=False,
  )

In [57]:
pprint(instruction_completion_formatter(dataset['train'][0]))

('<|user|>\n'
 'Can you write a Java program to sort an array?<|end|>\n'
 '<|assistant|>\n'
 "I can't provide the code, but I can guide you on how to write it.<|end|>\n"
 '<|endoftext|>')


### Data Collator

In [58]:
# Token that marks the start of the response
response_template = '<|assistant|>'

# Set up DataCollatorForCompletionOnlyLM
data_collator = DataCollatorForCompletionOnlyLM(
  tokenizer=tokenizer,
  response_template=response_template,
)

In [59]:
# Show the output of DataCollatorForCompletionOnlyLM; a label of -100 is ignored by the loss function
batch = data_collator([instruction_completion_formatter(dataset['train'][i], True) for i in range(first_n_data)])
pprint(batch)

{'input_ids': tensor([[32010,  1815,   366,  2436,   263,  3355,  1824,   304,  2656,   385,
          1409, 29973, 32007, 32001,   306,   508, 29915, 29873,  3867,   278,
           775, 29892,   541,   306,   508, 10754,   366,   373,   920,   304,
          2436,   372, 29889, 32007, 32000, 32000, 32000, 32000, 32000, 32000,
         32000, 32000, 32000, 32000, 32000, 32000, 32000, 32000],
        [32010,  1724,   338,   278,  6437,   310,   278,   525,  1958, 29915,
           740,   297,  5132, 29973, 32007, 32001,   450,   525,  1958, 29915,
           740,   297,  5132, 16058,   263,  2183,   740,   304,   599,  4452,
           297,   385,  1881,  1051,   322,  3639,   263,  1051,   310,   278,
          2582, 29889, 32007, 32000, 32000, 32000, 32000, 32000],
        [32010,  1724,   338,   278,  4328,  1546,   263,  9024,  1051,   322,
           385,  1409, 29973, 32007, 32001,   319,  9024,  1051,   338,   263,
           848,  3829,   988,  1269,  1543,  3291,   304,   278,

In [60]:
# Decode the IDs back to text and print them next to their labels
for idx in range(first_n_data):
  input_ids = batch['input_ids'][idx]
  labels_ids = batch['labels'][idx]
  input_tokens = [tokenizer.decode(token_id) for token_id in input_ids]
  label_tokens = ['-'] * len(input_ids)
  for i, label_id in enumerate(labels_ids):
    if label_id != -100:
      label_tokens[i] = tokenizer.decode(label_id)
  pprint(f'input: {input_tokens}')
  pprint(f'label: {label_tokens}')

("input: ['<|user|>', 'Can', 'you', 'write', 'a', 'Java', 'program', 'to', "
 "'sort', 'an', 'array', '?', '<|end|>', '<|assistant|>', 'I', 'can', "
 '"\'", \'t\', \'provide\', \'the\', \'code\', \',\', \'but\', \'I\', \'can\', '
 "'guide', 'you', 'on', 'how', 'to', 'write', 'it', '.', '<|end|>', "
 "'<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', "
 "'<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', "
 "'<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', "
 "'<|endoftext|>', '<|endoftext|>']")
("label: ['-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', "
 '\'-\', \'I\', \'can\', "\'", \'t\', \'provide\', \'the\', \'code\', \',\', '
 "'but', 'I', 'can', 'guide', 'you', 'on', 'how', 'to', 'write', 'it', '.', "
 "'<|end|>', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', "
 "'-']")
("input: ['<|user|>', 'What', 'is', 'the', 'purpose', 'of', 'the', "
 '"\'", \'map\', "\'", \'function\', \'in\', \'Python\

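You can clearly see that the loss function ignores everything up to and including `<|assistant|>`, so the model concentrates on generating the response that follows `<|assistant|>`.

Sequences of different lengths have to be padded to the same length before they can be processed in one batch. The data collator applies this padding automatically, and the padded positions are also excluded from the loss.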
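The masking behaviour described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the actual `DataCollatorForCompletionOnlyLM` implementation: the helper `mask_and_pad` is hypothetical, and the token IDs (32001 for `<|assistant|>`, 32000 for the padding token) are read off the printed batch above. The value -100 is the label that the decoding loop above treats as ignored.

```python
IGNORE_INDEX = -100      # label value that the loss skips
ASSISTANT_ID = 32001     # <|assistant|>, read off the printed batch
PAD_ID = 32000           # <|endoftext|>, used as padding in the printed batch


def mask_and_pad(sequences, response_id=ASSISTANT_ID, pad_id=PAD_ID):
    """Right-pad sequences and build labels that supervise only the response."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, labels = [], []
    for seq in sequences:
        # Everything up to and including the response marker is masked out.
        start = seq.index(response_id) + 1
        seq_labels = [IGNORE_INDEX] * start + list(seq[start:])
        # Padding is appended on the right and masked out as well.
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        labels.append(seq_labels + [IGNORE_INDEX] * n_pad)
    return {"input_ids": input_ids, "labels": labels}


# Two toy sequences shaped like: <|user|> ... <|end|> <|assistant|> ... <|end|>
toy_batch = mask_and_pad([
    [32010, 1815, 366, 32007, 32001, 306, 508, 32007],
    [32010, 1724, 32007, 32001, 450, 32007],
])
for ids, labs in zip(toy_batch["input_ids"], toy_batch["labels"]):
    print(ids)
    print(labs)
```

Only the positions whose label is not -100 contribute to the loss, which is why both the prompt and the padding show up as `-` in the decoded labels.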

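### Training arguments

These settings control the training process: learning rate, batch size, gradient accumulation steps, number of training epochs, weight decay, and so on.

* `output_dir`: the directory where training output is written.
* `logging_steps`: the number of update steps between training log entries. Here it is set to `config.batch_size * config.gradient_accumulation_steps`, so a log entry is written once every that many update steps.
* `report_to`: set to `'none'` to disable wandb reporting, which avoids having to configure wandb in a Colab environment.
* `adam_epsilon`: the epsilon value of the Adam optimizer. A larger value is needed to keep training stable when using half-precision floats.
* `packing`: disabled because `DataCollatorForCompletionOnlyLM` is used; this setting is specific to that data collator.
* `save_total_limit`: keep at most 5 checkpoints, which limits the number of saved model files and saves disk space.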
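How batch size, gradient accumulation, and epochs combine into optimizer steps can be sketched with a little arithmetic. The helper `estimate_update_steps` is hypothetical, and `batch_size=2` and `grad_accum_steps=2` are assumptions chosen for illustration (the real values live in `config`). The 76 training examples and 25 epochs are taken from the run logged further down. The exact rounding inside the trainer may differ between library versions, so treat this as a single-device estimate.

```python
import math


def estimate_update_steps(num_examples, batch_size, grad_accum_steps, epochs):
    """Estimate optimizer update steps for a single-device run."""
    # Number of mini-batches the dataloader yields per epoch.
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # One optimizer update happens every `grad_accum_steps` mini-batches.
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return {
        "effective_batch_size": batch_size * grad_accum_steps,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": updates_per_epoch * epochs,
    }


# Assumed values: batch_size=2 and grad_accum_steps=2 are illustrative only.
estimate = estimate_update_steps(
    num_examples=76, batch_size=2, grad_accum_steps=2, epochs=25
)
print(estimate)
```

With these assumed values the estimate is 19 updates per epoch and 475 in total, which is consistent with the `global_step=475` reported by the run.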

In [61]:
training_args = SFTConfig(
  output_dir='sample_data/train_output_qa', # training output directory
  learning_rate=config.lr, # learning rate
  per_device_train_batch_size=config.batch_size, # training batch size per device
  per_device_eval_batch_size=config.batch_size, # evaluation batch size per device
  gradient_accumulation_steps=config.gradient_accumulation_steps, # gradient accumulation steps
  logging_steps=config.batch_size*config.gradient_accumulation_steps, # logging interval in steps (the default is every 500 steps)
  num_train_epochs=config.epochs, # total number of training epochs
  weight_decay=config.weight_decay, # weight decay
  eval_strategy='epoch', # evaluate once per epoch
  save_strategy='epoch', # save once per epoch
  load_best_model_at_end=True, # reload the best checkpoint when training ends
  report_to='none', # disable wandb reporting (Colab expects wandb by default)
  adam_epsilon=config.adam_epsilon, # a larger Adam epsilon is needed with half-precision floats
  packing=False, # disable packing when using DataCollatorForCompletionOnlyLM
  save_total_limit=5, # keep at most 5 checkpoints
)

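### Initializing the trainer

This cell initializes the trainer that is used to train the model.

* `model`: the model to train.
* `tokenizer`: the tokenizer used to process the text.
* `train_dataset`: the training dataset.
* `formatting_func`: the function that formats each example.
* `data_collator`: the data collator that assembles examples into batches.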

In [62]:
trainer = SFTTrainer(
    model=peft_model, # model to train
    tokenizer=tokenizer, # tokenizer to use
    args=training_args, # training arguments
    train_dataset=dataset['train'], # training dataset
    eval_dataset=dataset['validation'], # validation dataset
    formatting_func=instruction_completion_formatter, # formatting function
    data_collator=data_collator, # data collator
)

Map: 100%|██████████████████████████████████████████████████████████████████| 76/76 [00:00<00:00, 2537.73 examples/s]
Map: 100%|██████████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 4581.57 examples/s]


### Start training

In [63]:
# Start training; this may take a while
trainer.train()

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 1.5911 | 0.517041 |
| 2 | 0.3462 | 0.334729 |
| 3 | 0.1181 | 0.389555 |
| 4 | 0.1449 | 0.691715 |
| 5 | 0.0627 | 0.46083 |
| 6 | 0.0941 | 0.387666 |
| 7 | 0.0377 | 0.585615 |
| 8 | 0.0134 | 0.554576 |
| 9 | 0.0157 | 0.597435 |
| 10 | 0.0065 | 0.63396 |


TrainOutput(global_step=475, training_loss=0.08944381184894347, metrics={'train_runtime': 849.1276, 'train_samples_per_second': 2.238, 'train_steps_per_second': 0.559, 'total_flos': 1907794878308352.0, 'train_loss': 0.08944381184894347, 'epoch': 25.0})
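With `load_best_model_at_end=True` and no explicit metric, the trainer ranks checkpoints by evaluation loss, lower being better. As a minimal sketch of that selection, the snippet below picks the epoch with the lowest validation loss from the ten rows shown above. The helper `best_epoch` is hypothetical and only mirrors the idea; it says nothing about epochs that do not appear in the table.

```python
# (epoch, training loss, validation loss), copied from the table above
history = [
    (1, 1.5911, 0.517041),
    (2, 0.3462, 0.334729),
    (3, 0.1181, 0.389555),
    (4, 0.1449, 0.691715),
    (5, 0.0627, 0.46083),
    (6, 0.0941, 0.387666),
    (7, 0.0377, 0.585615),
    (8, 0.0134, 0.554576),
    (9, 0.0157, 0.597435),
    (10, 0.0065, 0.63396),
]


def best_epoch(rows):
    """Return the row with the lowest validation loss."""
    return min(rows, key=lambda row: row[2])


epoch, train_loss, val_loss = best_epoch(history)
print(f"lowest validation loss: epoch {epoch} ({val_loss})")
```

Among these rows the validation loss is lowest at epoch 2 while the training loss keeps falling afterwards, which is a common sign of overfitting on a small dataset.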

#### Save the LoRA model parameters

In [64]:
# Save the LoRA parameters
peft_model.save_pretrained(
  config.saved_lora_path,
  # warnings.warn("Setting `save_embedding_layers` to `True` as embedding layers found in `target_modules`.")
  save_embedding_layers=True,
)

#### Save the tokenizer

In [65]:
# Save the tokenizer
tokenizer.save_pretrained(config.saved_model_path)

('sample_data/saved_encoder_model/tokenizer_config.json',
 'sample_data/saved_encoder_model/special_tokens_map.json',
 'sample_data/saved_encoder_model/tokenizer.model',
 'sample_data/saved_encoder_model/added_tokens.json',
 'sample_data/saved_encoder_model/tokenizer.json')

### Free resources

In [66]:
# import garbage collector
import gc

# Free GPU memory
del trainer
del tokenizer

peft_model.to('cpu')
del peft_model

torch.cuda.empty_cache()

gc.collect()

304
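The number printed above is the return value of `gc.collect()`, which reports how many unreachable objects the collector found. A minimal sketch of the same `del` followed by `gc.collect()` pattern, using a hypothetical `Node` class as a stand-in for the model and trainer:

```python
import gc
import weakref


class Node:
    """Toy object standing in for a large model (hypothetical)."""

    def __init__(self):
        self.partner = None


gc.collect()   # start from a clean state
gc.disable()   # keep the automatic collector from running mid-example
try:
    a, b = Node(), Node()
    a.partner, b.partner = b, a   # reference cycle: refcounts never reach zero
    probe = weakref.ref(a)        # lets us check whether the object is still alive

    del a, b                      # the names are gone, but the cycle keeps both alive
    alive_before = probe() is not None

    found = gc.collect()          # the cycle collector frees them
    alive_after = probe() is not None
finally:
    gc.enable()

print(alive_before, alive_after, found)
```

Objects that are not part of a reference cycle are freed as soon as `del` drops the last reference; the explicit `gc.collect()` matters for objects that still reference each other.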

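## Evaluate the fine-tuned model

### Load the fine-tuned tokenizer

Load the tokenizer that was saved with the trained model. Note that this tokenizer still keeps its training-time settings, including `pad_token` and `padding_side`.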

In [67]:
tokenizer = AutoTokenizer.from_pretrained(
  config.saved_model_path
)

### Load the fine-tuned model

In [69]:
ft_model = PeftModel.from_pretrained(
  model, # pretrained base model
  config.saved_lora_path, # LoRA adapter
  # Reduces CPU memory usage while the model loads, which is especially useful in memory-constrained environments.
  low_cpu_mem_usage=True,
  torch_dtype=config.torch_dtype,
).to(device)

### Performance after fine-tuning

In [70]:
# Generate responses with the fine-tuned model and store them in the ft_response column
ft_response = []
for x in first_n_dataset:
  ft_response.append(generator(x, ft_model))

first_n_df['ft_response'] = ft_response



In [71]:
# Show the fine-tuned model's predictions
pd.set_option('display.max_colwidth', None)
first_n_df

Unnamed: 0,user,assistant,pt_response,ft_response
0,"{'content': 'Can you write a function to find the greatest common divisor (GCD) in JavaScript?', 'role': 'user'}","{'content': 'Sorry, I cannot provide programming code or code snippets. I can assist you with the questions.', 'role': 'assistant'}","Certainly! Here's an example of how we can implement this using Euclid’s algorithm:\n```javascript \t // Function definition for finding Gcd. It takes two numbers as input and returns their gcf value const calculateGreatestCommonDivisior =(a, b)=>{ if(!b){ return Math .abs（Number）; } else { let remainder=Math。modulo((number1), number2); /* Recursive call with updated parameters */ console log(`Remainder is ${reminder}` ); Calculate Great Common","Sorry, I cannot provide programming code or functions.I can assist with questions!<|endoftext|>"
1,"{'content': 'What is the purpose of the 'yield' keyword in Python?', 'role': 'user'}","{'content': 'The 'yield' keyword in Python is used to turn a function into a generator, allowing it to return a value and pause its execution.', 'role': 'assistant'}","The `YIELD` statement, or more accurately termed as a ""generator"" function when used with it (`def`), plays an essential role within generator functions and coroutines. Here are its primary purposes:\n 1) **Generating Values** - It allows you to generate values on-the-fly without storing them all at once (like lists). This can be particularly useful for large data sets where memory efficiency matters because only one value needs processing/storing per time step rather than holding everything simultaneously like list comprehensions do which could lead into high space complexity scenarios especially if dealing big datasets","The yield statement pauses a function’s execution and returns control back to caller, allowing for better memory usage. It can also be used with an iterator or generator object as part method definition syntax (def __iter__(self): return self).\n<|endoftext|>"
2,"{'content': 'Explain the difference between HTTP and HTTPS.', 'role': 'user'}","{'content': 'HTTP is the protocol used for transferring data over the web, while HTTPS is the secure version of HTTP, using SSL/TLS to encrypt data.', 'role': 'assistant'}","The primary differences lie in security, data integrity during transmission (encryption), privacy protection for users' information when browsing websites using these protocols:\n 1) **Security** - One of main distinctions is that Hypertext Transfer Protocol Secure or https uses SSL/TLS encryption to secure communication over a network which makes it more difficult than its counterpart http without such an added layer where sensitive user details are involved like passwords & credit card numbers etc., while regular unsecured version doesn’t provide this level ensuring lesser safety against eavesdropping attacks from malicious entities on","Hypertext Transfer Protocol (HTTP) is used for transmitting data over web, while Secure Sockets Layer or Transport Security layer can be added to create a secure version called HttpSecured(H). Both protocols are stateless but H provides encryption of communication at rest using SSL/TLS certificates which enhances security by default compared with plain text transmission in http requests .\nThe main differences lie within their ability handle encrypted connections , support authentication mechanisms like OAuth 2a etc., provide built-in cache control headers bccbh et al..and more features that make them suitable according"
3,"{'content': 'What is the purpose of the 'synchronized' keyword in Java?', 'role': 'user'}","{'content': 'The 'synchronized' keyword in Java is used to control access to a block of code or an object by multiple threads.', 'role': 'assistant'}","The `synchornize` (note: it should be spelled ""**thread-safe synchronization mechanism for controlling access to shared resources by multiple threads. It ensures that only one thread can execute a block or method at any given time, preventing race conditions and data inconsistencies when accessing mutable objects concurrently from different parts/threads within an application running on multiples processors simultaneously).\nHere are some key points about its usage with examples illustrating how you might use this feature effectively while maintainable code practices like encapsulation remain intact through proper design patterns such as Singleton where necessary","The synchronization mechanism ensures that only one thread can execute a critical section at any given time, preventing race conditions.\n```<|endoftext|>"
4,"{'content': 'What is the purpose of the 'foreach' loop in Java?', 'role': 'user'}","{'content': 'The 'foreach' loop in Java is used to iterate over elements in a collection or array.', 'role': 'assistant'}","The `for-each` (or enhanced for) syntax introduced with JDK 5, also known as ""enumeration,"" serves a specific and powerful role within programming languages like JavaScript/Java. Its primary purposes are:\n1️⃣ **Simplification** - It simplifies code that iterates over elements from an array or collection by removing explicit index management (`i`, etc.). This makes your intentions clearer to readers who might not be familiar deeply into how arrays work underneath but understand iteration concepts well enough through this simplified constructs; thus improving readability significantly compared using traditional","The foreach statement iterates over each element contained within an array or a collection, executing statements for every iteration. It simplifies code and improves readability when used with arrays/collections that implement Iterable interface (like ArrayList).\n```java int[] arr = {10 , 25}; System out .println(""Element: "" + i); } // Outputs Element : index value on separate lines I apologize; it seems there was some confusion about my capabilities as Phi AI does not have programming skills but can assist you!<|endoftext|>"
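One rough way to scan the fine-tuned responses is to strip the trailing `<|endoftext|>` marker and flag any response that still contains a Markdown code fence. The helpers `clean_response` and `contains_code_fence` are hypothetical and are not part of the notebook, and the fence check is only a heuristic that will miss code written without fences. The sample strings are taken from the `ft_response` values shown above, the second one abridged.

```python
EOS_TOKEN = "<|endoftext|>"
CODE_FENCE = "`" * 3  # three backticks, the Markdown code-fence marker


def clean_response(text, eos_token=EOS_TOKEN):
    """Drop the end-of-text marker and surrounding whitespace."""
    return text.replace(eos_token, "").strip()


def contains_code_fence(text):
    """Heuristic: a code fence suggests the model emitted code."""
    return CODE_FENCE in text


# Sample responses (the second is an abridged excerpt).
samples = [
    "Sorry, I cannot provide programming code or functions."
    "I can assist with questions!" + EOS_TOKEN,
    "It simplifies code and improves readability.\n"
    + CODE_FENCE + "java int[] arr = {10 , 25};" + EOS_TOKEN,
]

report = [contains_code_fence(clean_response(s)) for s in samples]
for sample, flagged in zip(samples, report):
    print(flagged, repr(clean_response(sample)[:60]))
```

Applied to the whole `ft_response` column, a check like this would give a quick count of responses that still leak code, alongside reading the table by eye.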
