# Fine Tuning in OpenAI
## Introduction
Fine-tuning lets you get more out of the models available through the API by providing:

* Higher quality results than prompting
* Ability to train on more examples than can fit in a prompt
* Token savings due to shorter prompts
* Lower latency requests

OpenAI's text generation models have been pre-trained on a vast amount of text. To use the models effectively, we include instructions and sometimes several examples in a prompt. Using demonstrations to show how to perform a task is often called "few-shot learning."

Fine-tuning improves on few-shot learning by training on many more examples than can fit in the prompt, letting you achieve better results on a wide range of tasks. Once a model has been fine-tuned, you won't need to provide as many examples in the prompt. This saves costs and enables lower-latency requests.

At a high level, fine-tuning involves the following steps:

1. Prepare and upload training data
2. Train a new fine-tuned model
3. Evaluate results and go back to step 1 if needed
4. Use your fine-tuned model

You can also fine-tune a fine-tuned model, which is useful if you acquire additional data and don't want to repeat the previous training steps.
We expect gpt-3.5-turbo to be the right model for most users in terms of results and ease of use.

## Common Use Cases
Some common use cases where fine-tuning can improve results:
* Setting the style, tone, format, or other qualitative aspects
* Improving reliability at producing a desired output
* Correcting failures to follow complex prompts
* Handling many edge cases in specific ways
* Performing a new skill or task that’s hard to articulate in a prompt

One high-level way to think about these cases is when it’s easier to "show, not tell". In the sections to come, we will explore how to set up data for fine-tuning and various examples where fine-tuning improves the performance over the baseline model.

## Preparing Your Dataset
### Example Format
Each example in the dataset should be a conversation in the same format as our Chat Completions API, specifically a list of messages where each message has a role, content, and optional name. At least some of the training examples should directly target cases where the prompted model is not behaving as desired, and the provided assistant messages in the data should be the ideal responses you want the model to provide.

In this example, our goal is to create a chatbot that occasionally gives sarcastic responses. These are three training examples (conversations) we could create for a dataset:
```python
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
```
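
The examples above are in JSON Lines form: one conversation per line, with no enclosing list. A minimal sketch of producing such a file from Python follows. The helper name `make_example` and the file name `marv_training.jsonl` are assumptions made for illustration and are not part of the original notebook.

```python
import json
import os
import tempfile

SYSTEM_PROMPT = "Marv is a factual chatbot that is also sarcastic."

def make_example(question, answer):
    """Build one training conversation in the chat messages format."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }

examples = [
    make_example("What's the capital of France?",
                 "Paris, as if everyone doesn't know that already."),
    make_example("Who wrote 'Romeo and Juliet'?",
                 "Oh, just some guy named William Shakespeare. Ever heard of him?"),
]

# Hypothetical output location; a temp directory keeps the sketch side-effect free.
out_path = os.path.join(tempfile.mkdtemp(), "marv_training.jsonl")

# JSON Lines: write one JSON object per line.
with open(out_path, "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")

# Read the file back to confirm that every line parses on its own.
with open(out_path, "r", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded), "examples written to", out_path)
```

Building the dictionaries in code and serializing them with `json.dumps` avoids hand-escaping quotes inside the message content.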

### Example count
To fine-tune a model, you are required to provide at least 10 examples. We typically see clear improvements from fine-tuning on 50 to 100 training examples with gpt-3.5-turbo but the right number varies greatly based on the exact use case.

### Token limits
Token limits depend on the model you select. For gpt-3.5-turbo-1106, the maximum context length is 16,385 tokens, so each training example is also limited to 16,385 tokens. To be sure that your entire training example fits in context, check that the total token count of the message contents is under the limit.

### Estimate costs
Please refer to the [pricing page](https://openai.com/pricing) for details on cost per 1k input and output tokens (we do not charge for tokens that are part of the validation data). To estimate the costs for a specific fine-tuning job, use the following formula:

> base cost per 1k tokens * (number of tokens in the input file / 1,000) * number of epochs trained

For a training file with 100,000 tokens trained over 3 epochs, the expected cost would be ~$2.40 USD.
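
As a quick check of the arithmetic, the sketch below reproduces the ~$2.40 figure. The price of $0.008 per 1k tokens is an assumption inferred from that figure, not a quoted current rate, and the function name `estimate_training_cost` is introduced here for illustration. Check the pricing page before relying on any number.

```python
def estimate_training_cost(n_tokens, n_epochs, price_per_1k_tokens):
    """Cost = price per 1k tokens * (tokens / 1000) * epochs."""
    return price_per_1k_tokens * (n_tokens / 1000) * n_epochs

# Assumed price, inferred from the ~$2.40 example; not a quoted current rate.
ASSUMED_PRICE_PER_1K = 0.008

cost = estimate_training_cost(
    n_tokens=100_000,
    n_epochs=3,
    price_per_1k_tokens=ASSUMED_PRICE_PER_1K,
)
print(f"Estimated cost: ${cost:.2f}")  # prints "Estimated cost: $2.40"
```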

In [2]:
import json
import numpy as np

data_path = "data/toy_chat_fine_tuning.jsonl"

# Load the dataset
with open(data_path, 'r', encoding='utf-8') as f:
    dataset = [json.loads(line) for line in f]

# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)

Num examples: 10
First example:
{'role': 'system', 'content': 'You are a happy assistant that puts a positive spin on everything.'}
{'role': 'user', 'content': 'I fell off my bike today.'}
{'role': 'assistant', 'content': "It's great that you're getting exercise outdoors!"}


### Format validation
We can perform a variety of error checks to validate that each conversation in the dataset adheres to the format expected by the fine-tuning API. Errors are categorized based on their nature for easier debugging.

1. **Data Type Check:** Checks whether each entry in the dataset is a dictionary (dict). Error type: **data_type**.
2. **Presence of Message List:** Checks if a messages list is present in each entry. Error type: **missing_messages_list**.
3. **Message Keys Check:** Validates that each message in the messages list contains the keys role and content. Error type: **message_missing_key**.
4. **Unrecognized Keys in Messages:** Logs if a message has keys other than role, content, name, and function_call. Error type: **message_unrecognized_key**.
5. **Role Validation:** Ensures the role is one of "system", "user", "assistant", or "function". Error type: **unrecognized_role**.
6. **Content Validation:** Verifies that content has textual data and is a string. Error type: **missing_content**.
7. **Assistant Message Presence:** Checks that each conversation has at least one message from the assistant. Error type: **example_missing_assistant_message**.

In [4]:
# Format error checks

from collections import defaultdict

format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue
        
    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue
        
    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1
        
        if any(k not in ("role", "content", "name", "function_call") for k in message):
            format_errors["message_unrecognized_key"] += 1
        
        if message.get("role", None) not in ("system", "user", "assistant", "function"):
            format_errors["unrecognized_role"] += 1
            
        content = message.get("content", None)
        function_call = message.get("function_call", None)
        
        if (not content and not function_call) or not isinstance(content, str):
            format_errors["missing_content"] += 1
    
    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")

No errors found


### Token Counting Utilities
Let's define a few helpful utilities to be used in the rest of the notebook.

In [5]:
import tiktoken # for token counting

encoding = tiktoken.get_encoding("cl100k_base")

# not exact!
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    # The 0.1 and 0.9 quantiles are the 10th and 90th percentiles
    print(f"p10 / p90: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")

### Data Warnings and Token Count

With some lightweight analysis we can identify potential issues in the dataset, like missing messages, and provide statistical insights into message and token counts.

1. **Missing System/User Messages:** Counts the number of conversations missing a "system" or "user" message. Such messages are critical for defining the assistant's behavior and initiating the conversation.
2. **Number of Messages Per Example:** Summarizes the distribution of the number of messages in each conversation, providing insight into dialogue complexity.
3. **Total Tokens Per Example:** Calculates and summarizes the distribution of the total number of tokens in each conversation. Important for understanding fine-tuning costs.
4. **Tokens in Assistant's Messages:** Calculates the number of tokens in the assistant's messages per conversation and summarizes this distribution. Useful for understanding the assistant's verbosity.
5. **Token Limit Warnings:** Checks if any examples exceed the maximum token limit (4096 tokens), as such examples will be truncated during fine-tuning, potentially resulting in data loss.

Note that the token limit for **gpt-3.5-turbo-1106** is 16,385 tokens, while the token limit for **gpt-3.5-turbo-0613** is 4,096 tokens. In this example we will use gpt-3.5-turbo-0613.

In [6]:
# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))
    
print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 4096 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 4096 token limit, they will be truncated during fine-tuning")

Num examples missing system message: 2
Num examples missing user message: 3

#### Distribution of num_messages_per_example:
min / max: 2, 9
mean / median: 3.7, 2.5
p10 / p90: 2.0, 9.0

#### Distribution of num_total_tokens_per_example:
min / max: 26, 8032
mean / median: 848.0, 36.5
p10 / p90: 26.0, 903.0999999999972

#### Distribution of num_assistant_tokens_per_example:
min / max: 4, 8000
mean / median: 810.6, 9.5
p10 / p90: 4.0, 825.1999999999972

1 examples may be over the 4096 token limit, they will be truncated during fine-tuning
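
The summary above reports only how many examples exceed the limit. To see which ones they are, the per-example token counts can be scanned by index. This is a minimal sketch: the token counts below are hypothetical values chosen for illustration, and `find_over_limit` is a helper name introduced here.

```python
def find_over_limit(token_counts, limit):
    """Return the indices of examples whose token count exceeds the limit."""
    return [i for i, n_tokens in enumerate(token_counts) if n_tokens > limit]

# Hypothetical per-example token counts, for illustration only.
example_token_counts = [26, 36, 903, 8032, 41]

over = find_over_limit(example_token_counts, limit=4096)
print("Examples over the limit:", over)

# One option is to drop (or shorten) these examples before upload
# rather than rely on truncation during fine-tuning.
over_set = set(over)
kept_indices = [i for i in range(len(example_token_counts)) if i not in over_set]
print("Examples kept:", kept_indices)
```

In the notebook itself, `convo_lens` would take the place of the hypothetical list.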


### Cost Estimation
We estimate the total number of tokens that will be used for fine-tuning, which allows us to approximate the cost. It is worth noting that the duration of the fine-tuning jobs will also increase with the token count.

In [7]:
# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 4096

TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")

Dataset has ~4544 tokens that will be charged for during training
By default, you'll train for 10 epochs on this dataset
By default, you'll be charged for ~45440 tokens
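
The default-epoch heuristic in the cell above can be wrapped in a function to see how it behaves for other dataset sizes. This is a sketch of the notebook's own estimate, with the function name `default_n_epochs` introduced here for illustration. It is not a statement of how the API actually chooses the number of epochs.

```python
def default_n_epochs(n_train_examples, target_epochs=3,
                     min_target_examples=100, max_target_examples=25000,
                     min_default_epochs=1, max_default_epochs=25):
    """Estimate the default epoch count using the same rule as the cell above."""
    n_epochs = target_epochs
    if n_train_examples * target_epochs < min_target_examples:
        # Small datasets: train longer, capped at max_default_epochs.
        n_epochs = min(max_default_epochs, min_target_examples // n_train_examples)
    elif n_train_examples * target_epochs > max_target_examples:
        # Large datasets: train for fewer epochs, but at least min_default_epochs.
        n_epochs = max(min_default_epochs, max_target_examples // n_train_examples)
    return n_epochs

for n in (2, 10, 100, 20000):
    print(n, "examples ->", default_n_epochs(n), "epochs")
# 2 examples -> 25 epochs
# 10 examples -> 10 epochs
# 100 examples -> 3 epochs
# 20000 examples -> 1 epochs
```

The 10-example case matches the 10 epochs reported for the toy dataset.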


### Upload the Training File
Once you have validated the data, the file needs to be uploaded using the [Files API](https://platform.openai.com/docs/api-reference/files/create) in order to be used with a fine-tuning job.

After you upload the file, it may take some time to process. While the file is processing, you can still create a fine-tuning job but it will not start until the file processing has completed.

The maximum file upload size is 1 GB, though we do not suggest fine-tuning with that much data, since you are unlikely to need that much to see improvements.

In [8]:
from openai import OpenAI
client = OpenAI()

file_object = client.files.create(
  file=open(data_path, "rb"),
  purpose="fine-tune"
)

## Creating a Fine-Tuned Model
After ensuring you have the right amount and structure for your dataset, and have uploaded the file, the next step is to create a fine-tuning job. We support creating fine-tuning jobs via the [fine-tuning UI](https://platform.openai.com/finetune) or programmatically.

To set additional fine-tuning parameters like the validation_file or hyperparameters, please refer to the [API specification for fine-tuning](https://platform.openai.com/docs/api-reference/fine-tuning/create).
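
A validation_file is a separate JSONL file in the same format as the training file. One way to produce it is to hold out a fraction of the examples, as in the sketch below. The 20% split, the fixed seed, the toy data, and the helper names are all assumptions made for illustration.

```python
import json
import random

def split_train_validation(examples, validation_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and hold out a fraction for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]

def to_jsonl(examples):
    """Serialize examples as JSON Lines text, one conversation per line."""
    return "".join(json.dumps(example) + "\n" for example in examples)

# Hypothetical toy data standing in for a real dataset.
toy_examples = [
    {"messages": [{"role": "user", "content": f"question {i}"},
                  {"role": "assistant", "content": f"answer {i}"}]}
    for i in range(10)
]

train_examples, valid_examples = split_train_validation(toy_examples)
print(len(train_examples), "training /", len(valid_examples), "validation examples")

validation_jsonl = to_jsonl(valid_examples)
```

Each resulting file would then be uploaded with the Files API, and the uploaded validation file's ID passed as validation_file when creating the job.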

After you've started a fine-tuning job, it may take some time to complete. Your job may be queued behind other jobs in our system, and training a model can take minutes or hours depending on the model and dataset size. After the model training is completed, the user who created the fine-tuning job will receive an email confirmation.

In [25]:
fine_tuning_job = client.fine_tuning.jobs.create(
  training_file=file_object.id, 
  model="gpt-3.5-turbo",
  suffix="for_mlteam",
  hyperparameters={
    "n_epochs":"auto",
    "batch_size":"auto",
    "learning_rate_multiplier":"auto"
  }
)

In [19]:
# In addition to creating a fine-tuning job, you can also list existing jobs, retrieve the status of a job, or cancel a job.
import time
import datetime

# List 10 fine-tuning jobs
client.fine_tuning.jobs.list(limit=10)

# Retrieve the state of a fine-tune
# status can be ['validating_files', 'queued', 'running', 'succeeded', 'failed', 'cancelled']
while True:
    time.sleep(5)
    fine_tuning_job = client.fine_tuning.jobs.retrieve(fine_tuning_job.id)
    if fine_tuning_job.status in ['validating_files', 'queued', 'running']:
        continue
    if fine_tuning_job.status == "succeeded":
        created_at = datetime.datetime.fromtimestamp(fine_tuning_job.created_at)
        print(f"New fine-tuned model '{fine_tuning_job.fine_tuned_model}' created at {created_at} from base model '{fine_tuning_job.model}'.")
        break
    if fine_tuning_job.status == "failed":
        print("Fine-tuning job failed:", fine_tuning_job.error)
        break
    else:
        print("Fine-tuning job is cancelled.")
        break

# List up to 10 events from a fine-tuning job
client.fine_tuning.jobs.list_events(fine_tuning_job_id=fine_tuning_job.id, limit=10)

# Cancel a job
#client.fine_tuning.jobs.cancel(fine_tuning_job.id)

# Delete a fine-tuned model (must be an owner of the org the model was created in)
#client.models.delete("ft:gpt-3.5-turbo:acemeco:suffix:abc123")

New fine-tuned model 'ft:gpt-3.5-turbo-0613:personal:for-mlteam:8lCxC8vf' created at 2024-01-26 12:32:00 from base model 'gpt-3.5-turbo-0613'.


SyncCursorPage[FineTuningJobEvent](data=[FineTuningJobEvent(id='ftevent-xXtYbybt4GeiEECdeN17ARny', created_at=1706261906, level='info', message='The job has successfully completed', object='fine_tuning.job.event', data={}, type='message'), FineTuningJobEvent(id='ftevent-I5jlI7wHztCxRVMTAUmRyatH', created_at=1706261903, level='info', message='New fine-tuned model created: ft:gpt-3.5-turbo-0613:personal:for-mlteam:8lCxC8vf', object='fine_tuning.job.event', data={}, type='message'), FineTuningJobEvent(id='ftevent-BW4I5iWpJctub4gZ5hZ82Y2f', created_at=1706261883, level='info', message='Step 91/100: training loss=0.00', object='fine_tuning.job.event', data={'step': 91, 'train_loss': 1.430511474609375e-06, 'train_mean_token_accuracy': 1.0}, type='metrics'), FineTuningJobEvent(id='ftevent-zF39zWAN4EMK7xlDtNA85woK', created_at=1706261865, level='info', message='Step 81/100: training loss=0.00', object='fine_tuning.job.event', data={'step': 81, 'train_loss': 2.4371677227463806e-06, 'train_mean_
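
The polling loop above runs until the job reaches a final state, with no upper bound on the wait. A reusable variant with a timeout might look like the sketch below. The helper name `wait_for_job`, its parameters, and the stand-in client used for the dry run are assumptions for illustration. Only `client.fine_tuning.jobs.retrieve` and the status values come from the notebook.

```python
import time
from types import SimpleNamespace

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(client, job_id, poll_seconds=5, timeout_seconds=3600, sleep=time.sleep):
    """Poll a fine-tuning job until it reaches a terminal status or the timeout passes."""
    waited = 0
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job {job_id} is still '{job.status}' after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds

# Stand-in client for a dry run without an API key (hypothetical, not the real SDK).
_statuses = iter(["validating_files", "running", "succeeded"])
fake_client = SimpleNamespace(
    fine_tuning=SimpleNamespace(
        jobs=SimpleNamespace(
            retrieve=lambda job_id: SimpleNamespace(id=job_id, status=next(_statuses))
        )
    )
)

demo_job = wait_for_job(fake_client, "ftjob-demo", poll_seconds=1, sleep=lambda seconds: None)
print(demo_job.status)
```

With a real client, the call would be `wait_for_job(client, fine_tuning_job.id)`. Passing `sleep` as a parameter is a design choice that lets the loop be exercised without actually waiting.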

## Using your Fine-Tuned Model
When a job has succeeded, you will see the fine_tuned_model field populated with the name of the model when you retrieve the job details. You may now specify this model as a parameter to the Chat Completions API.

After your job is completed, the model should be available right away for inference use. In some cases, it may take several minutes for your model to become ready to handle requests. If requests to your model time out or the model name cannot be found, it is likely because your model is still being loaded. If this happens, try again in a few minutes.

In [21]:
response = client.chat.completions.create(
  model=fine_tuning_job.fine_tuned_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ]
)
print(response.choices[0].message.content)

Hello! How can I help you today?
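
Because a freshly trained model may take a few minutes to become available, a request can be wrapped in a simple retry loop. The sketch below is generic: `call_with_retry`, its parameters, and the simulated `flaky_request` are assumptions for illustration. In real use, `retry_on` would be narrowed to the exception types your client library raises for timeouts and unknown models.

```python
import time

def call_with_retry(make_request, retry_on=(Exception,), attempts=5,
                    delay_seconds=60, sleep=time.sleep):
    """Call make_request(), retrying on the given exceptions up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return make_request()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay_seconds)

# Simulated request that fails twice before succeeding (stand-in for an API call).
calls = {"count": 0}

def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise LookupError("model not ready yet")
    return "Hello! How can I help you today?"

reply = call_with_retry(flaky_request, retry_on=(LookupError,), sleep=lambda seconds: None)
print(reply, f"(after {calls['count']} attempts)")
```

With a real client, `make_request` could be a lambda wrapping the `client.chat.completions.create(...)` call shown above.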


## Analysing your Fine-Tuned Model

The API provides the following training metrics computed over the course of training: training loss, training token accuracy, test loss, and test token accuracy. These statistics are meant to provide a sanity check that training went smoothly (loss should decrease, token accuracy should increase). While a fine-tuning job is running, you can view an event object which contains some useful metrics:
```python
{
    "object": "fine_tuning.job.event",
    "id": "ftevent-abc-123",
    "created_at": 1693582679,
    "level": "info",
    "message": "Step 100/100: training loss=0.00",
    "data": {
        "step": 100,
        "train_loss": 1.805623287509661e-5,
        "train_mean_token_accuracy": 1.0
    },
    "type": "metrics"
}
```
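
Events of type "metrics" can be collected into a simple training curve. The sketch below works on plain dictionaries shaped like the event above. The helper name `extract_training_curve` and the second metrics event (step 90) are hypothetical. Objects returned by the SDK would need attribute access rather than dictionary lookups.

```python
def extract_training_curve(events):
    """Return (step, train_loss, token_accuracy) tuples from metrics events, ordered by step."""
    points = []
    for event in events:
        if event.get("type") != "metrics":
            continue  # skip plain status messages
        data = event.get("data", {})
        points.append((
            data["step"],
            data["train_loss"],
            data.get("train_mean_token_accuracy"),
        ))
    return sorted(points, key=lambda point: point[0])

sample_events = [
    {"type": "message", "message": "The job has successfully completed", "data": {}},
    {"type": "metrics", "message": "Step 100/100: training loss=0.00",
     "data": {"step": 100, "train_loss": 1.805623287509661e-5,
              "train_mean_token_accuracy": 1.0}},
    # Hypothetical earlier event, added for illustration.
    {"type": "metrics", "message": "Step 90/100: training loss=0.00",
     "data": {"step": 90, "train_loss": 2.0e-5, "train_mean_token_accuracy": 1.0}},
]

curve = extract_training_curve(sample_events)
for step, loss, accuracy in curve:
    print(f"step {step}: loss={loss:.2e}, token accuracy={accuracy}")
```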
After a fine-tuning job has finished, you can also see metrics on how the training process went by querying the fine-tuning job, extracting a file ID from result_files, and then retrieving that file's content. Each results CSV file has the following columns: step, train_loss, train_accuracy, valid_loss, and valid_mean_token_accuracy.

In [24]:
import os
import pandas as pd

fine_tuning_job = client.fine_tuning.jobs.retrieve(fine_tuning_job.id)

# Make sure the output directory exists before writing result files
os.makedirs("output", exist_ok=True)

for i, file_id in enumerate(fine_tuning_job.result_files):
    content = client.files.content(file_id)
    # Save the results CSV to disk, then load it for inspection
    result_path = f"output/fine_tuning_result{i}.csv"
    with open(result_path, "wb") as f:
        f.write(content.text.encode("utf-8"))
    df = pd.read_csv(result_path)
    print(df.head(2))

   step  train_loss  train_accuracy  valid_loss  valid_mean_token_accuracy
0     1     5.71738         0.27273         NaN                        NaN
1     2     3.42625         0.33333         NaN                        NaN
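
The same sanity check (loss should go down, accuracy should go up) can be done programmatically. The sketch below uses only the standard library and the two rows printed above. It assumes the NaN cells correspond to empty fields in the CSV, which is how pandas would display missing values.

```python
import csv
import io

# First two rows as printed above; a real results file has one row per step.
# Assumption: missing validation metrics appear as empty fields.
results_csv = """step,train_loss,train_accuracy,valid_loss,valid_mean_token_accuracy
1,5.71738,0.27273,,
2,3.42625,0.33333,,
"""

def parse_optional_float(text):
    """Convert a CSV field to float, treating an empty field as missing."""
    return float(text) if text else None

rows = []
for row in csv.DictReader(io.StringIO(results_csv)):
    rows.append({
        "step": int(row["step"]),
        "train_loss": float(row["train_loss"]),
        "train_accuracy": float(row["train_accuracy"]),
        "valid_loss": parse_optional_float(row["valid_loss"]),
    })

# Compare the first and last recorded steps.
loss_decreased = rows[-1]["train_loss"] < rows[0]["train_loss"]
accuracy_increased = rows[-1]["train_accuracy"] > rows[0]["train_accuracy"]
print("loss decreased:", loss_decreased, "| accuracy increased:", accuracy_increased)
```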


While metrics can be helpful, evaluating samples from the fine-tuned model provides the most relevant sense of model quality. We recommend generating samples from both the base model and the fine-tuned model on a test set, and comparing the samples side by side. The test set should ideally include the full distribution of inputs that you might send to the model in a production use case. If manual evaluation is too time-consuming, consider using our [Evals library](https://github.com/openai/evals) to automate future evaluations.
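
A side-by-side comparison can be as simple as running the same prompts through both models and collecting the outputs in one table. The sketch below takes the generation step as a callable so that it runs without an API key. `compare_models`, `fake_generate`, and the fine-tuned model name are hypothetical and exist only for illustration.

```python
def compare_models(generate, prompts, base_model, fine_tuned_model):
    """Collect base and fine-tuned outputs for each prompt, side by side."""
    rows = []
    for prompt in prompts:
        rows.append({
            "prompt": prompt,
            "base": generate(base_model, prompt),
            "fine_tuned": generate(fine_tuned_model, prompt),
        })
    return rows

# Stand-in for a real generation call (hypothetical; echoes the model name).
def fake_generate(model, prompt):
    return f"[{model}] reply to: {prompt}"

rows = compare_models(
    fake_generate,
    prompts=["I fell off my bike today."],
    base_model="gpt-3.5-turbo",
    fine_tuned_model="ft:hypothetical-model",
)
for row in rows:
    print("PROMPT:     ", row["prompt"])
    print("BASE:       ", row["base"])
    print("FINE-TUNED: ", row["fine_tuned"])
```

In real use, `generate` would wrap a Chat Completions request and return the message content for the given model and prompt.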

In [None]:
# TODO: Go further for Evals library

### Iterate
If the results from a fine-tuning job are not as good as you expected, continue to iterate:
1. Iterate on data quality
    * Check if your model has grammar, logic, or style issues
    * Consider the balance and diversity of data
    * Make sure your training examples contain all of the information needed for the response
    * Look at the consistency in the training examples
    * Make sure all of your training examples are in the same format, as expected for inference
2. Iterate on data quantity
    
    Once you’re satisfied with the quality and distribution of the examples, you can consider scaling up the number of training examples. This tends to help the model learn the task better, especially around possible "edge cases". 
3. Iterate on hyperparameters
    
    We allow you to specify the following hyperparameters:
    * **epochs:** If the model does not follow the training data as much as expected, increase the number of epochs by 1 or 2. If the model becomes less diverse than expected, decrease the number of epochs by 1 or 2.
    * **learning rate multiplier:** If the model does not appear to be converging, increase the learning rate multiplier.
    * **batch size:**
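
When iterating on hyperparameters, it can help to build the job arguments in one place so that only the value under test changes between runs. The sketch below uses the hyperparameter names from the job-creation cell earlier in this notebook. The helper name `build_job_kwargs` and the file ID are hypothetical.

```python
def build_job_kwargs(training_file_id, model, n_epochs="auto",
                     learning_rate_multiplier="auto", batch_size="auto"):
    """Assemble keyword arguments for a fine-tuning job, defaulting to "auto"."""
    return {
        "training_file": training_file_id,
        "model": model,
        "hyperparameters": {
            "n_epochs": n_epochs,
            "batch_size": batch_size,
            "learning_rate_multiplier": learning_rate_multiplier,
        },
    }

# Hypothetical file ID; a real one comes from the Files API upload.
job_kwargs = build_job_kwargs("file-abc123", "gpt-3.5-turbo", n_epochs=5)
print(job_kwargs["hyperparameters"])

# With a configured client, the job would be created with:
# client.fine_tuning.jobs.create(**job_kwargs)
```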

## Fine-Tuning with Function Calls
Including a long list of functions in the completions API can consume a considerable number of prompt tokens and sometimes the model hallucinates or does not provide valid JSON output.

Fine-tuning a model with function calling examples can allow you to:

* Get similarly formatted responses even when the full function definition isn't present
* Get more accurate and consistent outputs

Fine-tuning on function calling can also be used to customize the model's response to function outputs. To do this you can include a function response message and an assistant message interpreting that response.
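
As a rough illustration, a training example with a function call, a function response, and an interpreting assistant message might be assembled as below. This sketch assumes the `function_call` / `functions` layout, which matches the keys and roles accepted by the format checks earlier in this notebook. The exact schema accepted by the fine-tuning endpoint is an assumption here and should be checked against the current API reference.

```python
import json

weather_function = {
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string",
                         "description": "The city and state, e.g. San Francisco, CA"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["location"],
    },
}

training_example = {
    "messages": [
        {"role": "user", "content": "What's the weather like in Paris?"},
        # The assistant requests a function call; arguments are a JSON string.
        {"role": "assistant", "function_call": {
            "name": "get_current_weather",
            "arguments": json.dumps({"location": "Paris", "unit": "celsius"}),
        }},
        # The function response, followed by the assistant's interpretation of it.
        {"role": "function", "name": "get_current_weather",
         "content": json.dumps({"location": "Paris", "temperature": "22", "unit": "celsius"})},
        {"role": "assistant", "content": "It is currently 22 degrees Celsius in Paris."},
    ],
    "functions": [weather_function],
}

# One line of the JSONL training file.
jsonl_line = json.dumps(training_example)
print(jsonl_line[:80] + "...")
```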

NOTE: The remaining part of the tutorial will not run. Uploading the training data ('data/weather_chat_fine_tuning.jsonl') fails the Files API validation rules, even though the file content is valid and compatible with the latest function calling API. As of January 2024, the file validation rules have not been updated to support the latest function calling API standard.

In [None]:
# Upload the training data
from openai import OpenAI
client = OpenAI()

file_object = client.files.create(
  file=open("data/weather_chat_fine_tuning.jsonl", "rb"),
  purpose="fine-tune"
)

In [None]:
# Start a fine tuning job
fine_tuning_job = client.fine_tuning.jobs.create(
  training_file=file_object.id, 
  model="gpt-3.5-turbo",
  suffix="weather"
)

In [None]:
# Wait till the new model is created
while True:
    time.sleep(5)
    fine_tuning_job = client.fine_tuning.jobs.retrieve(fine_tuning_job.id)
    if fine_tuning_job.status in ['validating_files', 'queued', 'running']:
        continue
    if fine_tuning_job.status == "succeeded":
        break
    if fine_tuning_job.status == "failed":
        print("Fine-tuning job failed:", fine_tuning_job.error)
        break
    else:
        print("Fine-tuning job is cancelled.")
        break

In [None]:
# Example dummy function hard coded to return the same weather
# In production, this could be your backend API or an external API
import json

def get_current_weather(location, unit="fahrenheit"):
    """Get the current weather in a given location"""
    if "tokyo" in location.lower():
        return json.dumps({"location": "Tokyo", "temperature": "10", "unit": unit})
    elif "san francisco" in location.lower():
        return json.dumps({"location": "San Francisco", "temperature": "72", "unit": unit})
    elif "paris" in location.lower():
        return json.dumps({"location": "Paris", "temperature": "22", "unit": unit})
    else:
        return json.dumps({"location": location, "temperature": "unknown"})

In [None]:
# Step 1: send the conversation and available functions to the model
messages = [{"role": "user", "content": "What's the weather like in Istanbul?"}]
tools = [
{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}
]

# Use the fine-tuned model
response = client.chat.completions.create(
    model=fine_tuning_job.fine_tuned_model,
    messages=messages,
    tools=tools
)
response_message = response.choices[0].message
tool_calls = response_message.tool_calls
# Step 2: check if the model wanted to call a function
if tool_calls:
    # Step 3: call the function
    # Note: the JSON response may not always be valid; be sure to handle errors
    available_functions = {
        "get_current_weather": get_current_weather,
    }  # only one function in this example, but you can have multiple
    messages.append(response_message)  # extend conversation with assistant's reply
    # Step 4: send the info for each function call and function response to the model
    for tool_call in tool_calls:
        function_name = tool_call.function.name
        function_to_call = available_functions[function_name]
        function_args = json.loads(tool_call.function.arguments)
        function_response = function_to_call(
            location=function_args.get("location"),
            # Fall back to the function's default unit when the model omits it
            unit=function_args.get("unit", "fahrenheit"),
        )
        messages.append(
            {
                "tool_call_id": tool_call.id,
                "role": "tool",
                "name": function_name,
                "content": function_response,
            }
        )  # extend conversation with function response
    # Again use the fine-tuned model
    second_response = client.chat.completions.create(
        model=fine_tuning_job.fine_tuned_model,
        messages=messages,
    )  # get a new response from the model where it can see the function response
    print(second_response.choices[0].message.content)
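
As the comment in the cell above notes, the arguments produced by the model are not guaranteed to be valid JSON. A defensive parsing step might look like the sketch below, where `parse_tool_arguments` is a hypothetical helper that returns either the parsed arguments or an error message instead of raising.

```python
import json

def parse_tool_arguments(raw_arguments):
    """Parse model-generated arguments, returning (args, error)."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(args, dict):
        return None, "arguments must be a JSON object"
    return args, None

# Well-formed arguments parse into a dictionary.
good_args, good_error = parse_tool_arguments('{"location": "Istanbul"}')
print(good_args, good_error)

# Truncated output from the model is reported instead of crashing the loop.
bad_args, bad_error = parse_tool_arguments('{"location": "Istanbul"')
print(bad_args, bad_error)
```

In the tool-call loop, an error could be sent back to the model as the tool message content so that it has a chance to correct the call.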