When using the OpenAI API to fine-tune models such as GPT-3.5-Turbo, GPT-4o, and GPT-4o-mini, you cannot select a specific parameter-efficient fine-tuning (PEFT) method such as LoRA (Low-Rank Adaptation).

Fine-tuning through the OpenAI API is a managed service. You supply the training data (in the Chat Completions or DPO format) and a few basic hyperparameters (such as the number of epochs), and OpenAI handles the entire training process behind the scenes. The result is a new model identifier for the fine-tuned model.


OpenAI API: Only the fine-tuning methods described in the docs are supported (mainly SFT and DPO, in the prescribed data formats). Users have no direct control over the architecture or over low-level weight-update methods such as LoRA; OpenAI abstracts this away. Whether OpenAI uses a PEFT technique internally for optimization has not been clearly disclosed, and in any case users cannot request or configure it.

Fine-tuning open-source models: By contrast, with open-source LLMs (such as Llama and Mistral) on platforms such as Hugging Face, you can apply PEFT techniques such as LoRA, QLoRA, and adapter tuning, because you have direct access to the model weights and control the entire training loop yourself.
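
To give a rough sense of why LoRA counts as parameter-efficient, the sketch below compares the number of trainable parameters for a full update of one weight matrix with a rank-`r` LoRA update `W + B @ A`. The layer size and rank are hypothetical values chosen only for illustration, and the helper names are not from any library.

```python
def full_param_count(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA update W + B @ A,
    where B is d_out x rank and A is rank x d_in."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, for illustration only.
d_out, d_in, rank = 4096, 4096, 8
full = full_param_count(d_out, d_in)
lora = lora_param_count(d_out, d_in, rank)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.4%}")
```

With these assumed sizes, the LoRA update trains well under 1% of the parameters of the full matrix.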

Example format

```json
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
```

Multi-turn chat examples

The default behavior during fine-tuning is to train on all assistant messages within a single example. To skip fine-tuning on specific assistant messages, add a `weight` key to those messages; this lets you control which assistant messages are learned. The allowed values for `weight` are currently 0 or 1. An example using `weight` in the chat format is shown below.

```json
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris", "weight": 0}, {"role": "user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already.", "weight": 1}]}

```
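
As a quick check before uploading, a small helper can list which assistant messages in an example would actually be trained on. This is a sketch: `trained_assistant_messages` is a hypothetical helper, and it assumes that a missing `weight` key behaves like `weight: 1`, in line with the default described above.

```python
import json


def trained_assistant_messages(example: dict) -> list[str]:
    """Return the assistant messages that contribute to training.

    Assumption: a missing "weight" key is treated as 1 (train on the message).
    """
    return [
        m["content"]
        for m in example["messages"]
        if m["role"] == "assistant" and m.get("weight", 1) == 1
    ]


example = {
    "messages": [
        {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
        {"role": "user", "content": "What's the capital of France?"},
        {"role": "assistant", "content": "Paris", "weight": 0},
        {"role": "user", "content": "Can you be more sarcastic?"},
        {"role": "assistant", "content": "Paris, as if everyone doesn't know that already.", "weight": 1},
    ]
}

# One training example is one JSON object on a single line of the JSONL file.
jsonl_line = json.dumps(example, ensure_ascii=False)
trained = trained_assistant_messages(json.loads(jsonl_line))
print(trained)
```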

SFT 

If no `method` is specified, the default is Supervised Fine-Tuning (SFT).
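
A minimal sketch of what the `method` argument could look like in each case. It assumes that the SFT payload mirrors the DPO payload used in the job-creation example later in this document, with `type` set to `supervised`; `build_method` is a hypothetical helper, and the exact field names should be checked against the API reference before use.

```python
def build_method(method_type: str = "supervised", **hyperparameters) -> dict:
    """Return a `method` dict such as {"type": "dpo", "dpo": {"hyperparameters": {...}}}.

    Assumption: "supervised" follows the same nesting as "dpo".
    """
    if method_type not in {"supervised", "dpo"}:
        raise ValueError(f"Unsupported method type: {method_type!r}")
    return {"type": method_type, method_type: {"hyperparameters": dict(hyperparameters)}}


sft_method = build_method(n_epochs=3)
dpo_method = build_method("dpo", beta=0.1)
print(sft_method)
print(dpo_method)
```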



[Direct Preference Optimization - DPO](https://arxiv.org/abs/2305.18290)

DPO fine-tunes models on prompts paired with a preferred and a non-preferred response. This approach lets the model learn from human preferences, optimizing for outputs that are more likely to be favored.

For DPO:

- Set the `type` parameter to `dpo`.
- Optionally set the `hyperparameters` property with any options you'd like to configure.


```json
{
  "input": {
    "messages": [
      {
        "role": "user",
        "content": "Hello, can you tell me how cold San Francisco is today?"
      }
    ],
    "tools": [],
    "parallel_tool_calls": true
  },
  "preferred_output": [
    {
      "role": "assistant",
      "content": "Today in San Francisco, it is not quite cold as expected. Morning clouds will give away to sunshine, with a high near 68°F (20°C) and a low around 57°F (14°C)."
    }
  ],
  "non_preferred_output": [
    {
      "role": "assistant",
      "content": "It is not particularly cold in San Francisco today."
    }
  ]
}
```
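
Before uploading a DPO dataset, it can help to check each record for the three top-level keys used in the example above. The checks below are inferred from that example only and are not an official schema; `dpo_record_errors` is a hypothetical helper.

```python
import json

REQUIRED_KEYS = ("input", "preferred_output", "non_preferred_output")


def dpo_record_errors(record: dict) -> list[str]:
    """Return a list of problems found in one DPO record (empty if none)."""
    errors = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in record]
    input_part = record.get("input")
    if isinstance(input_part, dict) and not input_part.get("messages"):
        errors.append("input.messages is missing or empty")
    # Both outputs in the example above consist of assistant messages only.
    for key in ("preferred_output", "non_preferred_output"):
        for message in record.get(key, []):
            if message.get("role") != "assistant":
                errors.append(f"{key} contains a non-assistant message")
    return errors


good_line = json.dumps({
    "input": {"messages": [{"role": "user", "content": "Hello"}]},
    "preferred_output": [{"role": "assistant", "content": "Hi, how can I help?"}],
    "non_preferred_output": [{"role": "assistant", "content": "What?"}],
})
bad_line = json.dumps({
    "input": {"messages": []},
    "preferred_output": [{"role": "user", "content": "Hi"}],
})

good_errors = dpo_record_errors(json.loads(good_line))
bad_errors = dpo_record_errors(json.loads(bad_line))
print(good_errors)
print(bad_errors)
```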


```python
job = client.fine_tuning.jobs.create(
    training_file="file-all-about-the-weather",
    model="gpt-4o-2024-08-06",
    method={
        "type": "dpo",
        "dpo": {
            "hyperparameters": {"beta": 0.1},
        },
    },
)
```

## Data preparation and analysis cost for fine-tuning

This section checks the dataset for format errors, provides basic statistics, and estimates token counts for fine-tuning costs. The method shown here corresponds to the current fine-tuning method for gpt-3.5-turbo.

In [1]:
import json
import tiktoken # for token counting
import numpy as np
from collections import defaultdict

In [2]:
train_tone_file_path = "fine-tuning_data/train_tone_ds.jsonl"
validation_tone_file_path = "fine-tuning_data/validation_tone_ds.jsonl"
def validate_jsonl(file_path):
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            for line in file:
                json.loads(line)
        print("Valid JSONL file")
    except json.JSONDecodeError as e:
        print(f"Invalid JSONL: {e}")
    except Exception as e:
        print(f"Error reading file: {e}")

validate_jsonl(train_tone_file_path)
validate_jsonl(validation_tone_file_path)

# An online alternative for checking the file: https://jsonltools.com/jsonl-validator

Valid JSONL file
Valid JSONL file

Data loading
We first load the chat dataset from an example JSONL file.

In [3]:
import jsonlines

#data_path = "fine-tuning_data/tone1.jsonl"
#data_path = "fine-tuning_data/toy_chat_fine_tuning.jsonl"
# Load the dataset
data_path = train_tone_file_path

with open(data_path, 'r', encoding='utf-8') as f:
    dataset = [json.loads(line) for line in f]
'''
with jsonlines.open(data_path) as reader:
    dataset = [line for line in reader]
'''
# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)

Num examples: 45
First example:
{'role': 'system', 'content': "You are a professional, polite, patient, and respectful sales consultant staff for a phone store, interacting with Vietnamese customers. Always refer to yourself using 'em' pronoun. Address the user based on their provided <User gender>: use 'anh' for male, 'chị' for female. If the gender is unknown or not provided, use the polite neutral term 'anh/chị'."}
{'role': 'system', 'content': '## BASE KNOWLEDGE:\n- User gender: male'}
{'role': 'user', 'content': 'Chào shop, anh muốn hỏi về cái tai nghe XYZ.'}
{'role': 'assistant', 'content': 'Dạ vâng ạ, em chào anh. Anh cần em tư vấn thông tin gì về mẫu tai nghe XYZ ạ?'}


In [4]:
import jsonlines

data_path = "fine-tuning_data/train_tone_ds.jsonl"
#data_path = "fine-tuning_data/toy_chat_fine_tuning.jsonl"
# Load the dataset

'''
with open(data_path, 'r', encoding='utf-8') as f:
    dataset = [json.loads(line) for line in f]
'''
with jsonlines.open(data_path) as reader:
    dataset = [line for line in reader]

# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)

Num examples: 45
First example:
{'role': 'system', 'content': "You are a professional, polite, patient, and respectful sales consultant staff for a phone store, interacting with Vietnamese customers. Always refer to yourself using 'em' pronoun. Address the user based on their provided <User gender>: use 'anh' for male, 'chị' for female. If the gender is unknown or not provided, use the polite neutral term 'anh/chị'."}
{'role': 'system', 'content': '## BASE KNOWLEDGE:\n- User gender: male'}
{'role': 'user', 'content': 'Chào shop, anh muốn hỏi về cái tai nghe XYZ.'}
{'role': 'assistant', 'content': 'Dạ vâng ạ, em chào anh. Anh cần em tư vấn thông tin gì về mẫu tai nghe XYZ ạ?'}


Format validation

We can perform a variety of error checks to validate that each conversation in the dataset adheres to the format expected by the fine-tuning API. Errors are categorized based on their nature for easier debugging.

1. Data Type Check: Checks whether each entry in the dataset is a dictionary (dict). Error type: data_type.
2. Presence of Message List: Checks if a messages list is present in each entry. Error type: missing_messages_list.
3. Message Keys Check: Validates that each message in the messages list contains the keys role and content. Error type: message_missing_key.
4. Unrecognized Keys in Messages: Logs if a message has keys other than role, content, weight, function_call, and name. Error type: message_unrecognized_key.
5. Role Validation: Ensures the role is one of "system", "user", "assistant", or "function". Error type: unrecognized_role.
6. Content Validation: Verifies that content has textual data and is a string. Error type: missing_content.
7. Assistant Message Presence: Checks that each conversation has at least one message from the assistant. Error type: example_missing_assistant_message.

The code below performs these checks and prints a count for each type of error found. This is useful for debugging and for ensuring the dataset is ready for the next steps.

In [5]:
# Format error checks
format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue
        
    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue
        
    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1
        
        if any(k not in ("role", "content", "name", "function_call", "weight") for k in message):
            format_errors["message_unrecognized_key"] += 1
        
        if message.get("role", None) not in ("system", "user", "assistant", "function"):
            format_errors["unrecognized_role"] += 1
            
        content = message.get("content", None)
        function_call = message.get("function_call", None)
        
        if (not content and not function_call) or not isinstance(content, str):
            format_errors["missing_content"] += 1
    
    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")

No errors found


Token Counting Utilities

Let's define a few helpful utilities to be used in the rest of the notebook.

In [6]:
encoding = tiktoken.get_encoding("cl100k_base")

# not exact!
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
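        for key, value in message.items():
            # Only string fields are tokenized; a non-string value such as a numeric "weight" would make encode() raise.
            if isinstance(value, str):
                num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name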
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    # The label reads "p5 / p95", but the quantiles computed here are the 10th and 90th percentiles.
    print(f"p5 / p95: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")

## Data Warnings and Token Counts
With some lightweight analysis we can identify potential issues in the dataset, like missing messages, and provide statistical insights into message and token counts.

1. Missing System/User Messages: Counts the number of conversations missing a "system" or "user" message. Such messages are critical for defining the assistant's behavior and initiating the conversation.
2. Number of Messages Per Example: Summarizes the distribution of the number of messages in each conversation, providing insight into dialogue complexity.
3. Total Tokens Per Example: Calculates and summarizes the distribution of the total number of tokens in each conversation. Important for understanding fine-tuning costs.
4. Tokens in Assistant's Messages: Calculates the number of tokens in the assistant's messages per conversation and summarizes this distribution. Useful for understanding the assistant's verbosity.
5. Token Limit Warnings: Checks if any examples exceed the maximum token limit (16,385 tokens), as such examples will be truncated during fine-tuning, potentially resulting in data loss.

In [7]:
# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))
    
print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 16385 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 16,385 token limit, they will be truncated during fine-tuning")

Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 4, 4
mean / median: 4.0, 4.0
p5 / p95: 4.0, 4.0

#### Distribution of num_total_tokens_per_example:
min / max: 154, 269
mean / median: 217.13333333333333, 215.0
p5 / p95: 196.2, 247.4

#### Distribution of num_assistant_tokens_per_example:
min / max: 28, 143
mean / median: 89.77777777777777, 88.0
p5 / p95: 70.2, 118.0

0 examples may be over the 16,385 token limit, they will be truncated during fine-tuning


## Cost Estimation

Estimate the total number of tokens that will be used for fine-tuning, which allows us to approximate the cost. It is worth noting that the duration of the fine-tuning jobs will also increase with the token count.

In [8]:
# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 16385

TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")

Dataset has ~9771 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~29313 tokens


See https://openai.com/pricing to estimate total costs.
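
Once the billed token count is known, turning it into a rough dollar figure is simple arithmetic. The sketch below uses the token counts printed above (9,771 tokens, 3 epochs); the price per million training tokens is a placeholder, not a real OpenAI price, and should be replaced with the current figure from the pricing page.

```python
def estimate_training_cost(billed_tokens: int, usd_per_million_tokens: float) -> float:
    """Approximate training cost in USD for a given number of billed tokens."""
    return billed_tokens / 1_000_000 * usd_per_million_tokens


# Placeholder price (USD per 1M training tokens); NOT an actual OpenAI price.
HYPOTHETICAL_PRICE = 3.00

tokens_in_dataset = 9771  # from the output above
n_epochs = 3              # default epochs from the output above
billed_tokens = n_epochs * tokens_in_dataset
cost = estimate_training_cost(billed_tokens, HYPOTHETICAL_PRICE)
print(f"{billed_tokens} billed tokens -> ~${cost:.4f} at the assumed price")
```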

# Fine-tuning

In [9]:
import sys
import os
# import fix_path
# Add the parent directory to sys.path
sys.path.append(os.path.abspath('..'))
#sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath('.'))))



In [10]:
# The `service` module lives in the parent directory, so this import fails
# unless that directory has been added to sys.path (done in the previous cell).
from service.openai import _client, _chat_model

References
- [OpenAI Fine-tuning Guide](https://platform.openai.com/docs/guides/fine-tuning)
- [OpenAI API Reference](https://platform.openai.com/docs/api-reference/chat/create)

Upload dataset.jsonl to OpenAI

In [11]:
def upload_file(file_name: str, purpose: str) -> str:
    with open(file_name, "rb") as file_fd:
        response = _client.files.create(file=file_fd, purpose=purpose)
    return response.id


training_file_id = upload_file(train_tone_file_path, "fine-tune") #!openai files create -p fine-tune -f dataset.jsonl

# In addition to training data, we can optionally provide validation data,
# which is used to check that the model does not overfit the training set.
validation_file_id = upload_file(validation_tone_file_path, "fine-tune")

print("Training file ID:", training_file_id)
print("Validation file ID:", validation_file_id)

Training file ID: file-TmKRFe6EHfHJ8XPhm9Szfg
Validation file ID: file-LcN8EfRee2g3mMAqEfkWba


The response contains a file ID; note it down for use in the next step.

Create a fine-tuning job

Now we can create our fine-tuning job with the generated files and an optional suffix to identify the model. The response will contain an id which you can use to retrieve updates on the job.

In [13]:
#!openai fine_tuning.jobs.create -m gpt-4o -t <file-id>
response = _client.fine_tuning.jobs.create(
  training_file=training_file_id,
  validation_file=validation_file_id,
  model=_chat_model,
  #model="gpt-4o-mini-2024-07-18" 
)
job_id = response.id

print("Job ID:", response.id)
print("Status:", response.status)

Job ID: ftjob-kj5eLalZ8zlW33ZruRNeOvOm
Status: validating_files


Check job status

You can make a GET request to the https://api.openai.com/v1/fine_tuning/jobs endpoint to list your fine-tuning jobs, or retrieve a single job by ID as in the cell below. In this instance you'll want to check that the job with the ID you got from the previous step ends up with status: succeeded.

Once it is completed, you can use the result_files to sample the results from the validation set (if you uploaded one), and use the ID from the fine_tuned_model parameter to invoke your trained model.

In [14]:
response = _client.fine_tuning.jobs.retrieve(job_id)

#!openai fine_tuning.jobs.list
print("Job ID:", response.id)
print("Status:", response.status)
print("Trained Tokens:", response.trained_tokens)

Job ID: ftjob-kj5eLalZ8zlW33ZruRNeOvOm
Status: validating_files
Trained Tokens: None
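
Rather than re-running the cell by hand, the status check can be wrapped in a polling loop. This is a sketch under stated assumptions: `wait_for_job` is a hypothetical helper, the set of terminal statuses is an assumption to verify against the API reference, and the demo uses a stand-in for `_client.fine_tuning.jobs.retrieve` so that it runs offline.

```python
import time
from types import SimpleNamespace
from typing import Callable

# Assumed terminal statuses; a job in any other state is still in progress.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve: Callable[[str], object], job_id: str,
                 poll_seconds: float = 30.0, max_polls: int = 120):
    """Call `retrieve(job_id)` repeatedly until the job reaches a terminal status."""
    for _ in range(max_polls):
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")


# Stand-in for `_client.fine_tuning.jobs.retrieve`, used only for this demo.
_fake_statuses = iter(["validating_files", "running", "succeeded"])


def fake_retrieve(job_id: str):
    return SimpleNamespace(id=job_id, status=next(_fake_statuses))


job = wait_for_job(fake_retrieve, "ftjob-example", poll_seconds=0)
print(job.id, job.status)
```

With the real client, the call would be `wait_for_job(_client.fine_tuning.jobs.retrieve, job_id)`.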


Monitor the job

We can track the progress of the fine-tune with the events endpoint. You can rerun the cell below a few times until the fine-tune is ready.



In [15]:
response = _client.fine_tuning.jobs.list_events(job_id)

events = response.data
events.reverse()

for event in events:
    print(event.message)

Created fine-tuning job: ftjob-kj5eLalZ8zlW33ZruRNeOvOm
Validating training file: file-TmKRFe6EHfHJ8XPhm9Szfg and validation file: file-LcN8EfRee2g3mMAqEfkWba


Now that it's done, we can get a fine-tuned model ID from the job:



In [19]:
response = _client.fine_tuning.jobs.retrieve(job_id)
fine_tuned_model_id = response.fine_tuned_model

if fine_tuned_model_id is None:
    raise RuntimeError(
        "Fine-tuned model ID not found. Your job has likely not been completed yet."
    )

print("Fine-tuned model ID:", fine_tuned_model_id)

Fine-tuned model ID: ft:gpt-4o-mini-2024-07-18:personal::BSE9aLry
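
The model ID printed above is a colon-separated string. The sketch below splits it into parts; the field names are an interpretation inferred from this one ID (the empty field is presumably the optional suffix, which was not set when the job was created), and `parse_fine_tuned_model_id` is a hypothetical helper.

```python
from typing import NamedTuple


class FineTunedModelId(NamedTuple):
    base_model: str
    organization: str
    suffix: str
    unique_id: str


def parse_fine_tuned_model_id(model_id: str) -> FineTunedModelId:
    """Split an ID of the assumed form ft:<base_model>:<org>:<suffix>:<id>."""
    parts = model_id.split(":")
    if len(parts) != 5 or parts[0] != "ft":
        raise ValueError(f"Unexpected fine-tuned model ID format: {model_id!r}")
    _, base_model, organization, suffix, unique_id = parts
    return FineTunedModelId(base_model, organization, suffix, unique_id)


parsed = parse_fine_tuned_model_id("ft:gpt-4o-mini-2024-07-18:personal::BSE9aLry")
print(parsed)
```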


In [20]:
    
response = _client.fine_tuning.jobs.list_events(job_id)

events = response.data
events.reverse()

for event in events:
    print(event.message)

Step 120/135: training loss=0.17, validation loss=0.71
Step 121/135: training loss=0.10
Step 122/135: training loss=0.14
Step 123/135: training loss=0.11
Step 124/135: training loss=0.04
Step 125/135: training loss=0.16
Step 126/135: training loss=0.14
Step 127/135: training loss=0.26
Step 128/135: training loss=0.18
Step 129/135: training loss=0.23
Step 130/135: training loss=0.17, validation loss=1.08
Step 131/135: training loss=0.17
Step 132/135: training loss=0.25
Step 133/135: training loss=0.20
Step 134/135: training loss=0.33
Step 135/135: training loss=0.07, validation loss=0.69, full validation loss=0.93
Checkpoint created at step 45
Checkpoint created at step 90
New fine-tuned model created
The job has successfully completed
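
The step messages above follow a regular pattern, so the losses can be pulled out for plotting or for comparing training and validation loss. The regular expression below is written against the messages shown in this output only; the message format is not a documented contract and may change, and `parse_step_event` is a hypothetical helper.

```python
import re

STEP_PATTERN = re.compile(
    r"Step (?P<step>\d+)/(?P<total>\d+): training loss=(?P<train>[\d.]+)"
    r"(?:, validation loss=(?P<valid>[\d.]+))?"
)


def parse_step_event(message: str):
    """Return step and loss values from a step message, or None for other events."""
    match = STEP_PATTERN.match(message)
    if match is None:
        return None
    valid = match.group("valid")
    return {
        "step": int(match.group("step")),
        "total_steps": int(match.group("total")),
        "training_loss": float(match.group("train")),
        "validation_loss": float(valid) if valid is not None else None,
    }


messages = [
    "Step 130/135: training loss=0.17, validation loss=1.08",
    "Step 131/135: training loss=0.17",
    "Step 135/135: training loss=0.07, validation loss=0.69, full validation loss=0.93",
    "Checkpoint created at step 45",
]
parsed_events = [parse_step_event(m) for m in messages]
for event in parsed_events:
    print(event)
```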


Use the fine-tuned model

In [23]:

completion = _client.chat.completions.create(
  model=fine_tuned_model_id,
  messages=[
    {"role": "system", "content": "You are a professional, polite, patient, and respectful sales consultant staff for a phone store, interacting with Vietnamese customers. Always refer to yourself using 'em' pronoun. Address the user based on their provided <User gender>: use 'anh' for male, 'chị' for female. If the gender is unknown or not provided, use the polite neutral term 'anh/chị'."},
    {"role": "system", "content": "## BASE KNOWLEDGE:\n- User gender: female"}, # Provide the gender context
    {"role": "user", "content": "Chào shop, em muốn hỏi về điện thoại Iphone 15"}
  ]
)
print(completion.choices[0].message.content)


Dạ vâng ạ, chị cần em tư vấn thông tin gì về iPhone 15 ạ? Đặc điểm, giá cả hay khuyến mãi ạ?


In [24]:

completion = _client.chat.completions.create(
  model=fine_tuned_model_id,
  messages=[
    {"role": "system", "content": "You are a professional, polite, patient, and respectful sales consultant staff for a phone store, interacting with Vietnamese customers. Always refer to yourself using 'em' pronoun. Address the user based on their provided <User gender>: use 'anh' for male, 'chị' for female. If the gender is unknown or not provided, use the polite neutral term 'anh/chị'."},
    {"role": "system", "content": "## BASE KNOWLEDGE:\n- User gender: unknown"}, # Provide the gender context
    {"role": "user", "content": "Chào shop, em muốn hỏi về điện thoại Iphone 15"}
  ]
)
print(completion.choices[0].message.content)

Dạ vâng ạ, em chào anh/chị ạ. Anh/chị cần em tư vấn thông tin gì về dòng Iphone 15 ạ? Về cấu hình, giá cả, hay chương trình khuyến mãi ạ?
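
Since both calls above differ only in the gender line and the user text, the message list can be built by a small helper so that inference prompts keep the same structure as the training examples. This is a sketch: `build_messages` is a hypothetical helper, and the system prompt is passed in rather than hard-coded (the demo value is a shortened placeholder, not the full prompt).

```python
def build_messages(system_prompt: str, user_text: str, gender: str = "unknown") -> list[dict]:
    """Build the system / base-knowledge / user message triple used in this notebook."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "system", "content": f"## BASE KNOWLEDGE:\n- User gender: {gender}"},
        {"role": "user", "content": user_text},
    ]


# Shortened placeholder; in practice pass the same system prompt used in the training data.
demo_messages = build_messages(
    system_prompt="You are a professional, polite sales consultant for a phone store.",
    user_text="Chào shop, em muốn hỏi về điện thoại Iphone 15",
    gender="female",
)
for message in demo_messages:
    print(message)
```

The resulting list can be passed as the `messages` argument of `_client.chat.completions.create`.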
