 When using the OpenAI API to fine-tune models such as GPT-3.5-Turbo, GPT-4o, and GPT-4o-mini, you cannot choose a specific Parameter-Efficient Fine-Tuning (PEFT) method such as LoRA (Low-Rank Adaptation).

 Fine-tuning through the OpenAI API is a managed service. You provide training data (in Chat Completions or DPO format) and a few basic hyperparameters (such as the number of epochs), and OpenAI handles the entire training process behind the scenes. The result is the identifier of a new, fine-tuned model.


OpenAI API: Supports only the fine-tuning methods described in the docs (mainly SFT and DPO, in the prescribed formats). Users have no direct control over the architecture or over low-level weight-update methods such as LoRA; OpenAI abstracts this away. Whether OpenAI uses some PEFT technique internally for optimization is not clearly disclosed, but either way users cannot request or configure it.

Fine-tuning open-source models: By contrast, with open-source LLMs (such as Llama, Mistral, etc.) on platforms such as Hugging Face, you can freely apply PEFT techniques such as LoRA, QLoRA, and Adapter Tuning, because you have direct access to the model weights and control the entire training loop yourself.
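
To make the parameter-efficiency point concrete, here is a rough sketch (not tied to any particular library) that counts trainable parameters for a single weight matrix under full fine-tuning versus LoRA. The layer size (4096 x 4096) and the rank (8) are illustrative assumptions, not values taken from any specific model.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA update W + B @ A,
    where B is d_out x rank and A is rank x d_in."""
    return rank * (d_out + d_in)


# Illustrative (assumed) sizes: one 4096 x 4096 projection, LoRA rank 8.
d_out, d_in, rank = 4096, 4096, 8
full = full_finetune_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

With these assumed sizes, LoRA trains well under 1% of the parameters of the full matrix, which is why it is practical on modest hardware when you control the training loop.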

Example format

```json
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
```

Multi-turn chat examples

 The default behavior during fine-tuning is to train on all assistant messages within a single example. To skip fine-tuning on specific assistant messages, a `weight` key can be added to disable fine-tuning on that message, allowing you to control which assistant messages are learned. The allowed values for `weight` are currently 0 or 1. An example using `weight` in the chat format is below.

```json
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris", "weight": 0}, {"role": "user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already.", "weight": 1}]}

```
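
The effect of `weight` can be sketched with two small helpers. Both function names are hypothetical and only mirror the rule described above (assistant messages are trained on unless `weight` is 0); they are not part of the OpenAI SDK.

```python
def invalid_weights(example: dict) -> list:
    """Return indexes of messages whose `weight` is not the integer 0 or 1."""
    return [
        i
        for i, message in enumerate(example["messages"])
        if "weight" in message
        and (type(message["weight"]) is not int or message["weight"] not in (0, 1))
    ]


def trained_assistant_messages(example: dict) -> list:
    """Return assistant contents that would be trained on (weight defaults to 1)."""
    return [
        message["content"]
        for message in example["messages"]
        if message["role"] == "assistant" and message.get("weight", 1) == 1
    ]


example = {
    "messages": [
        {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
        {"role": "user", "content": "What's the capital of France?"},
        {"role": "assistant", "content": "Paris", "weight": 0},
        {"role": "user", "content": "Can you be more sarcastic?"},
        {"role": "assistant", "content": "Paris, as if everyone doesn't know that already.", "weight": 1},
    ]
}
print(invalid_weights(example))
print(trained_assistant_messages(example))
```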

SFT 

If `method` is not specified, the default is Supervised Fine-Tuning (SFT).
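
As a minimal sketch, the `method` argument can be spelled out explicitly instead of relying on the default. The helper name `build_method` is hypothetical, and `n_epochs=3` is only an illustrative value; the resulting dictionary is meant to be passed as `method=` when creating a fine-tuning job.

```python
from typing import Optional


def build_method(method_type: str, hyperparameters: Optional[dict] = None) -> dict:
    """Build a `method` dictionary for a fine-tuning job.

    method_type is "supervised" (SFT) or "dpo"; hyperparameters are nested
    under a key with the same name as the type.
    """
    if method_type not in ("supervised", "dpo"):
        raise ValueError(f"unsupported method type: {method_type}")
    return {"type": method_type, method_type: {"hyperparameters": hyperparameters or {}}}


# Spelling out the SFT default explicitly; n_epochs=3 is an illustrative value.
print(build_method("supervised", {"n_epochs": 3}))
```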



[Direct Preference Optimization - DPO](https://arxiv.org/abs/2305.18290)

DPO fine-tunes models on prompts and pairs of responses. This approach lets the model learn from human preferences, optimizing for outputs that are more likely to be favored.

For DPO:

- set the `type` parameter of `method` to `dpo`
- optionally set the `hyperparameters` property with any options you'd like to configure


```JSON
{
  "input": {
    "messages": [
      {
        "role": "user",
        "content": "Hello, can you tell me how cold San Francisco is today?"
      }
    ],
    "tools": [],
    "parallel_tool_calls": true
  },
  "preferred_output": [
    {
      "role": "assistant",
      "content": "Today in San Francisco, it is not quite cold as expected. Morning clouds will give away to sunshine, with a high near 68°F (20°C) and a low around 57°F (14°C)."
    }
  ],
  "non_preferred_output": [
    {
      "role": "assistant",
      "content": "It is not particularly cold in San Francisco today."
    }
  ]
}
```
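
A preference record with the same keys as the sample above can be assembled programmatically. This is a sketch only: `make_dpo_record` is a hypothetical helper, and it simply mirrors the sample's structure (including the empty `tools` list) rather than enforcing the API's full schema.

```python
import json


def make_dpo_record(prompt: str, preferred: str, non_preferred: str) -> dict:
    """Build one preference example with the same keys as the sample above."""
    return {
        "input": {
            "messages": [{"role": "user", "content": prompt}],
            "tools": [],
            "parallel_tool_calls": True,
        },
        "preferred_output": [{"role": "assistant", "content": preferred}],
        "non_preferred_output": [{"role": "assistant", "content": non_preferred}],
    }


record = make_dpo_record(
    "Hello, can you tell me how cold San Francisco is today?",
    "Today in San Francisco, expect a high near 68°F (20°C) and a low around 57°F (14°C).",
    "It is not particularly cold in San Francisco today.",
)
# Each record becomes exactly one line of the JSONL training file.
line = json.dumps(record, ensure_ascii=False)
print(line)
```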


```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

job = client.fine_tuning.jobs.create(
    training_file="file-all-about-the-weather",
    model="gpt-4o-2024-08-06",
    method={
        "type": "dpo",
        "dpo": {
            "hyperparameters": {"beta": 0.1},
        },
    },
)
```

## Data preparation and analysis cost for fine-tuning

This notebook checks a dataset for format errors, provides basic statistics, and estimates token counts for fine-tuning costs. The method shown here corresponds to the current fine-tuning method for gpt-3.5-turbo.

In [2]:
import json
import tiktoken # for token counting
import numpy as np
from collections import defaultdict

In [None]:
def validate_jsonl(file_path):
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            for line in file:
                json.loads(line)
        print("Valid JSONL file")
    except json.JSONDecodeError as e:
        print(f"Invalid JSONL: {e}")
    except Exception as e:
        print(f"Error reading file: {e}")

validate_jsonl('path/to/your/file.jsonl')

# Online alternative: https://jsonltools.com/jsonl-validator
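
When a file fails validation, it helps to know which lines are at fault. The following is a sketch of a stricter check; `find_invalid_lines` is a hypothetical helper, and it reports blank lines as errors on the assumption that the file should contain exactly one JSON object per line.

```python
import json


def find_invalid_lines(file_path: str) -> list:
    """Return (line_number, error_message) for every line that is not valid JSON."""
    bad = []
    with open(file_path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                bad.append((line_number, "blank line"))
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError as e:
                bad.append((line_number, e.msg))
    return bad
```

Calling `find_invalid_lines('path/to/your/file.jsonl')` returns an empty list when every line parses.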

Data loading
We first load the chat dataset from an example JSONL file.

In [3]:
import jsonlines

#data_path = "fine-tuning_data/tone1.jsonl"
data_path = "fine-tuning_data/toy_chat_fine_tuning.jsonl"
# Load the dataset

with open(data_path, 'r', encoding='utf-8') as f:
    dataset = [json.loads(line) for line in f]
'''
with jsonlines.open(data_path) as reader:
    dataset = [line for line in reader]
'''
# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)

Num examples: 5
First example:
{'role': 'system', 'content': 'You are a happy assistant that puts a positive spin on everything.'}
{'role': 'user', 'content': 'I fell off my bike today.'}
{'role': 'assistant', 'content': "It's great that you're getting exercise outdoors!"}


In [6]:
import jsonlines

data_path = "fine-tuning_data/tone1.jsonl"
#data_path = "fine-tuning_data/toy_chat_fine_tuning.jsonl"
# Load the dataset

'''
with open(data_path, 'r', encoding='utf-8') as f:
    dataset = [json.loads(line) for line in f]
'''
with jsonlines.open(data_path) as reader:
    dataset = [line for line in reader]

# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)

Num examples: 158
First example:
{'role': 'system', 'content': "Bạn là trợ lý AI hỗ trợ khách hàng, luôn xưng 'em' và gọi người dùng là 'anh/chị/mình', thể hiện sự lịch sự, chuyên nghiệp, kiên nhẫn và tôn trọng."}
{'role': 'user', 'content': 'Chào shop, em muốn hỏi về cái tai nghe XYZ.'}
{'role': 'assistant', 'content': 'Dạ vâng ạ, em chào anh/chị. Anh/chị cần em tư vấn thông tin gì về mẫu tai nghe XYZ ạ?'}


Format validation

We can perform a variety of error checks to validate that each conversation in the dataset adheres to the format expected by the fine-tuning API. Errors are categorized based on their nature for easier debugging.

1. Data Type Check: Checks whether each entry in the dataset is a dictionary (dict). Error type: data_type.
2. Presence of Message List: Checks if a messages list is present in each entry. Error type: missing_messages_list.
3. Message Keys Check: Validates that each message in the messages list contains the keys role and content. Error type: message_missing_key.
4. Unrecognized Keys in Messages: Logs if a message has keys other than role, content, weight, function_call, and name. Error type: message_unrecognized_key.
5. Role Validation: Ensures the role is one of "system", "user", "assistant", or "function". Error type: unrecognized_role.
6. Content Validation: Verifies that content has textual data and is a string. Error type: missing_content.
7. Assistant Message Presence: Checks that each conversation has at least one message from the assistant. Error type: example_missing_assistant_message.

The code below performs these checks and prints a count for each type of error found. This is useful for debugging and for ensuring the dataset is ready for the next steps.

In [7]:
# Format error checks
format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue
        
    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue
        
    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1
        
        if any(k not in ("role", "content", "name", "function_call", "weight") for k in message):
            format_errors["message_unrecognized_key"] += 1
        
        if message.get("role", None) not in ("system", "user", "assistant", "function"):
            format_errors["unrecognized_role"] += 1
            
        content = message.get("content", None)
        function_call = message.get("function_call", None)
        
        if (not content and not function_call) or not isinstance(content, str):
            format_errors["missing_content"] += 1
    
    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")

No errors found


Token Counting Utilities

Let's define a few helpful utilities to be used in the rest of the notebook.

In [8]:
encoding = tiktoken.get_encoding("cl100k_base")

# not exact!
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            if isinstance(value, str):  # skip non-text values such as an integer "weight"
                num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    # Note: despite the "p5 / p95" label, these are the 10th and 90th percentiles.
    print(f"p5 / p95: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")

## Data Warnings and Token Counts
With some lightweight analysis we can identify potential issues in the dataset, like missing messages, and provide statistical insights into message and token counts.

1. Missing System/User Messages: Counts the number of conversations missing a "system" or "user" message. Such messages are critical for defining the assistant's behavior and initiating the conversation.
2. Number of Messages Per Example: Summarizes the distribution of the number of messages in each conversation, providing insight into dialogue complexity.
3. Total Tokens Per Example: Calculates and summarizes the distribution of the total number of tokens in each conversation. Important for understanding fine-tuning costs.
4. Tokens in Assistant's Messages: Calculates the number of tokens in the assistant's messages per conversation and summarizes this distribution. Useful for understanding the assistant's verbosity.
5. Token Limit Warnings: Checks if any examples exceed the maximum token limit (16,385 tokens), as such examples will be truncated during fine-tuning, potentially resulting in data loss.

In [9]:
# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))
    
print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 16385 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 16,385 token limit, they will be truncated during fine-tuning")

Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 132, 302
mean / median: 204.54430379746836, 205.0
p5 / p95: 161.4, 244.0

#### Distribution of num_assistant_tokens_per_example:
min / max: 36, 197
mean / median: 100.31012658227849, 98.5
p5 / p95: 60.7, 138.60000000000002

0 examples may be over the 16,385 token limit, they will be truncated during fine-tuning


## Cost Estimation

Estimate the total number of tokens that will be used for fine-tuning, which allows us to approximate the cost. It is worth noting that the duration of the fine-tuning jobs will also increase with the token count.

In [10]:
# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 16385

TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")

Dataset has ~32318 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~96954 tokens
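
The default-epoch heuristic from the cell above can be wrapped in a function to see how it behaves for different dataset sizes. This is a sketch of the notebook's own estimate, not a guarantee of what the API will choose; the function name and the guard against an empty dataset are additions for illustration.

```python
def default_n_epochs(
    n_train_examples: int,
    target_epochs: int = 3,
    min_target_examples: int = 100,
    max_target_examples: int = 25000,
    min_default_epochs: int = 1,
    max_default_epochs: int = 25,
) -> int:
    """Estimate the default number of epochs using the notebook's heuristic."""
    if n_train_examples <= 0:
        raise ValueError("dataset must contain at least one example")
    n_epochs = target_epochs
    if n_train_examples * target_epochs < min_target_examples:
        # Small dataset: add epochs so roughly min_target_examples are seen.
        n_epochs = min(max_default_epochs, min_target_examples // n_train_examples)
    elif n_train_examples * target_epochs > max_target_examples:
        # Large dataset: reduce epochs so at most max_target_examples are seen.
        n_epochs = max(min_default_epochs, max_target_examples // n_train_examples)
    return n_epochs


for n in (5, 158, 10000):
    print(n, "examples ->", default_n_epochs(n), "epochs")
```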


See https://openai.com/pricing to estimate total costs.
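
Once the billed token count is known, the cost is simple arithmetic. The sketch below uses the token count and epoch count printed above; the price of 3.00 USD per million training tokens is a hypothetical placeholder, so substitute the current rate for your model from the pricing page.

```python
def estimate_training_cost(
    billing_tokens_per_epoch: int, n_epochs: int, usd_per_million_tokens: float
) -> float:
    """Approximate training cost in USD: tokens x epochs x price per token."""
    return billing_tokens_per_epoch * n_epochs * usd_per_million_tokens / 1_000_000


HYPOTHETICAL_PRICE = 3.00  # USD per 1M training tokens; an assumption, not a quoted price
cost = estimate_training_cost(32318, 3, HYPOTHETICAL_PRICE)
print(f"Estimated training cost: ~${cost:.2f}")
```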

# Fine-tuning

In [None]:
import openai
from env import env
from service.openai import _client , _chat_model
openai.api_key = env.OPENAI_API_KEY

References
- [OpenAI Fine-tuning Guide](https://platform.openai.com/docs/guides/fine-tuning)
- [OpenAI API Reference](https://platform.openai.com/docs/api-reference/chat/create)

Upload dataset.jsonl to OpenAI

In [None]:
#!openai files create -p fine-tune -f dataset.jsonl
file = _client.files.create(
  file=open("tone1.jsonl", "rb"),
  purpose="fine-tune"
)

The response contains a file ID (note it down for use in the next step).

Create a fine-tuning job

In [None]:
#!openai fine_tuning.jobs.create -m gpt-4o -t <file-id>
_client.fine_tuning.jobs.create(
  training_file=file.id,
  model="gpt-4o-mini-2024-07-18"  # or the latest available model
)

Monitor the job

In [None]:
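# List the most recent fine-tuning jobs with their current status
for job in _client.fine_tuning.jobs.list(limit=10).data:
    print(job.id, job.status, job.fine_tuned_model)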
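
Rather than checking the status by hand, a job can be polled until it finishes. This is a sketch under stated assumptions: `wait_for_job` is a hypothetical helper, the polling interval and limit are arbitrary, and `retrieve` is expected to behave like the client's `fine_tuning.jobs.retrieve` method (returning an object with a `status` attribute).

```python
import time

# Statuses after which a job will not change any further.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve, job_id, poll_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Poll `retrieve(job_id)` until the job reaches a terminal status.

    `retrieve` is any callable returning an object with a `.status` attribute.
    `sleep` is injectable so the loop can be exercised without waiting.
    """
    for _ in range(max_polls):
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} still not finished after {max_polls} polls")
```

In this notebook it could be called as `wait_for_job(_client.fine_tuning.jobs.retrieve, job_id)`, where `job_id` is the ID returned when the job was created. On success, the returned job's `fine_tuned_model` field holds the model name to use for inference.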


Use the fine-tuned model

In [None]:
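# Uses the same v1 client as the upload and job-creation cells above.
response = _client.chat.completions.create(
    model="ft:gpt-4o:org-id::model-id",  # placeholder: use your job's fine_tuned_model name
    messages=[
        {"role": "user", "content": "Bạn tên gì?"}
    ]
)
print(response.choices[0].message.content)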