## DPO Fine-Tuning a GPT-4o Model for Text Q&A - A Python SDK Experience

Learn how to fine-tune the **gpt-4o** model using Direct Preference Optimization (DPO) with the Python SDK.

You can run this notebook either locally or on an **AML CPU Compute Standard_D13_v2** instance with the **Python 3.10 - SDK v2** kernel.

### Prerequisites

* An Azure subscription.
* A [Microsoft Foundry project](https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/create-projects) in the Azure AI Foundry portal.
    * Fine-tuning access requires the **Cognitive Services OpenAI Contributor** role on the Microsoft Foundry resource.
    * A Microsoft Foundry resource created in a region that supports fine-tuning (e.g., East US 2 or Sweden Central).
    * A model deployment of the **gpt-4o** base model, named **gpt-4o**.
* Training and validation datasets that:
    * contain at least 10 high-quality samples.
    * are formatted as JSON Lines (JSONL) with UTF-8 encoding.
* Python **3.10** or later.
* OpenAI Python library version used for this notebook: **1.58.1**.
* [Jupyter Notebooks](https://jupyter.org/) or **Visual Studio Code** with the **Jupyter** extension installed.
* An `.env` file to store Azure credentials as environment variables. **Do not share this file with others or upload it to a public GitHub repository.**

### Step 1: Setup

#### Retrieve resource values

1. Go to your Microsoft Foundry resource in the Azure portal.
1. In the **Overview** section, copy and save the following values:
    * Microsoft Foundry Resource Name
    * Resource group
    * Subscription ID

    <img src="images/screenshot-foundry-overview.png" alt="Screenshot of the Azure OpenAI resource management pane." width="800"/>
1. Under **Resource Management**, select the **Keys and Endpoint** subsection.
1. Select **OpenAI**.
1. Copy and save the following values:
    * **KEY 1**
    * **Endpoint URL**
        * Select one of these endpoint links: **Language APIs**, **Dall-e APIs**, or **Whisper APIs**.

    <img src="images/screenshot-foundry-keys-and-endpoint.png" alt="Screenshot of the Azure OpenAI resource management pane." width="800"/>

#### Add credentials and variables

1. Rename the [.env.example](../../.env.example) file to "**.env**".
1. Open the [.env](../../.env) file located in the repository root folder.
1. Paste the saved values into the following variables in the **.env** file:
    * AZURE_OPENAI_ENDPOINT = "_<Foundry_Endpoint_URL>_"
    * AZURE_OPENAI_API_KEY = "_<Foundry_KEY_1>_"
    * AZURE_SUBSCRIPTION_ID = "_<Foundry_Subscription_ID>_"
    * AZURE_PROJECT_NAME = "_<Foundry_Name>_"
    * AZURE_RESOURCE_GROUP = "_<Foundry_ResourceGroup_Name>_"
1. Save the file and close it. 

> **Note:** **Do not** distribute this file, as it contains credential information!

#### Authentication Setup

Before running the next cell, make sure you're authenticated with the Azure CLI.

* Open a terminal in Visual Studio Code (VS Code) and run the following command:

```bash
az login --use-device-code
```

* The command displays a device code and a URL. Open the URL in your browser and enter the code to authenticate to Azure.
    * Sign in with the Skillable Azure username and Temporary Access Pass (TAP).
* Return to the terminal and select the default subscription.

This lab uses device code authentication because it works well in:

* Remote development environments
* Systems without a default browser
* Corporate environments with strict security policies

#### Jupyter Kernel

In this Jupyter notebook in VS Code, select the following **Python kernel**:

* Python 3.12.10

#### Install required Python libraries

In [None]:
%pip install -q "matplotlib==3.10.8"

#### Import required Python libraries 

In [2]:
import os
import requests
import pandas as pd
import matplotlib.pyplot as plt 
import json

from dotenv import load_dotenv
from openai import AzureOpenAI
from io import BytesIO, StringIO
from azure.identity import DefaultAzureCredential

#### Load environment variables to assign credentials

In [None]:
# Load the .env file
load_dotenv(dotenv_path="../../.env")

# Assign Azure resource values
subscription_id = os.getenv("AZURE_SUBSCRIPTION_ID") # Azure subscription ID
resource_name = os.getenv("AZURE_PROJECT_NAME") # name of the Foundry resource
rg_name = os.getenv("AZURE_RESOURCE_GROUP") # name of the resource group

# Create the Azure OpenAI client with the Foundry credentials
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2025-02-01-preview", # This API version or later is required for DPO fine-tuning.
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)

print("Azure OpenAI client data loaded successfully!")
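A common reason for authentication or deployment errors later in the notebook is a variable that is missing or empty in `.env`. The sketch below checks the five variable names listed in Step 1. `missing_env_vars` is a hypothetical helper written for this notebook, not part of any SDK, and it only checks that values are present, not that they are valid.

```python
import os

# Variable names taken from the .env setup in Step 1
REQUIRED_ENV_VARS = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_PROJECT_NAME",
    "AZURE_RESOURCE_GROUP",
]

def missing_env_vars(names=REQUIRED_ENV_VARS, env=None):
    """Return the names that are unset or blank in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not (env.get(name) or "").strip()]

missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```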

#### Define helper functions

In [22]:
# Reads and displays the first few lines from a .jsonl (JSON Lines) file
from itertools import islice

def read_jsonl(file_path, top_lines=5):
    """Reads and displays the first few lines from a .jsonl (JSON Lines) file."""
    with open(file_path, 'r', encoding='utf-8') as f:
        # islice stops after top_lines lines instead of reading the whole file into memory
        for line in islice(f, top_lines):
            print(line)

In [23]:
# Plot smoothed training and validation loss curves from the fine-tuning results
def show_ft_metrics(results_df, window_size=5):
    """Plot rolling-mean training and validation loss from a fine-tuning results DataFrame."""
    # Convert timestamp column to datetime if it exists
    if 'timestamp' in results_df.columns:
        results_df['timestamp'] = pd.to_datetime(results_df['timestamp'])
    
    # Drop rows where valid_loss is NaN or valid_loss is -1.0
    filtered_df = results_df.dropna(subset=['valid_loss'])
    filtered_df = filtered_df.loc[filtered_df['valid_loss'] != -1.0]
    
    # Compute rolling means (only on numeric columns)
    results_df_smooth = results_df.select_dtypes(include=['number']).rolling(window=window_size).mean()
    filtered_df_smooth = filtered_df.select_dtypes(include=['number']).rolling(window=window_size).mean()
    
    # Plot the curves
    plt.figure(figsize=(16, 6))
    
    plt.subplot(1, 2, 1)
    plt.plot(results_df_smooth['step'], results_df_smooth['train_loss'],  color='blue')
    plt.title('Train Loss')
    plt.xlabel('Step')
    plt.ylabel('Loss')
    
    plt.subplot(1, 2, 2)
    plt.plot(filtered_df_smooth['step'], filtered_df_smooth['valid_loss'], color='red')
    plt.title('Validation Loss')
    plt.xlabel('Step')
    plt.ylabel('Loss')

    plt.tight_layout()
    plt.show()

In [24]:
# Create a pandas DataFrame from a dictionary and sort it by a 'created' or 'created_at' timestamp column for displaying OpenAI API tables
def date_sorted_df(details_dict):
    """Create a pandas DataFrame from a dictionary and sort it by a 'created' or 'created_at' timestamp column for displaying OpenAI API tables."""
    df = pd.DataFrame(details_dict)
    
    if 'created' in df.columns:
        df.rename(columns={'created': 'created_at'}, inplace=True)
    
    # Convert 'created_at' from Unix timestamp to human-readable date/time format
    df['created_at'] = pd.to_datetime(df['created_at'], unit='s').dt.strftime('%Y-%m-%d %H:%M:%S')

    if 'finished_at' in df.columns:
        # Convert 'finished_at' from Unix timestamp to human-readable date/time format, keeping null values as is
        df['finished_at'] = pd.to_datetime(df['finished_at'], unit='s', errors='coerce').dt.strftime('%Y-%m-%d %H:%M:%S')
    
    # Sort DataFrame by 'created_at' in descending order
    df = df.sort_values(by='created_at', ascending=False)

    return df

### Step 2: Prepare Training & Validation Datasets

To fine-tune a base model like GPT-4o using Direct Preference Optimization (DPO), we need a dataset in which each sample includes:

- **User Prompt**: A natural user message or question that initiates the assistant's response.
- **Preferred Output**: An ideal, high-quality assistant response aligned with a specific tone or style (e.g., optimistic).
- **Non-Preferred Output**: A less desirable response, often with contrasting tone or reasoning (e.g., pessimistic).

#### Dataset Output Format (JSONL, One Line Per Sample)

The following example shows the format of each sample in the dataset. The samples are saved as .jsonl files for DPO fine-tuning on Azure:

```json
{"input": {"messages": [{"role": "user", "content": "..."}]}, "preferred_output": [{"role": "assistant", "content": "..."}], "non_preferred_output": [{"role": "assistant", "content": "..."}]}
{"input": {"messages": [{"role": "user", "content": "..."}]}, "preferred_output": [{"role": "assistant", "content": "..."}], "non_preferred_output": [{"role": "assistant", "content": "..."}]}
{"input": {"messages": [{"role": "user", "content": "..."}]}, "preferred_output": [{"role": "assistant", "content": "..."}], "non_preferred_output": [{"role": "assistant", "content": "..."}]}
...
```

> **Note:** For demonstration purposes, we've pre-generated 10 training samples and 10 validation samples to help you save on compute costs and training time.
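If you want to build your own dataset instead of using the pre-generated files, each sample can be assembled as a Python dictionary and serialized one object per line, following the format shown above. This is a minimal sketch: `make_dpo_record` and `to_jsonl` are hypothetical helper names, and the sample text is invented purely for illustration.

```python
import json

def make_dpo_record(prompt, preferred, non_preferred):
    """Build one DPO sample with input, preferred_output and non_preferred_output."""
    return {
        "input": {"messages": [{"role": "user", "content": prompt}]},
        "preferred_output": [{"role": "assistant", "content": preferred}],
        "non_preferred_output": [{"role": "assistant", "content": non_preferred}],
    }

def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line, newline-terminated."""
    # ensure_ascii=False keeps non-ASCII characters as-is; write the result with UTF-8 encoding
    return "".join(json.dumps(record, ensure_ascii=False) + "\n" for record in records)

# Hypothetical sample for illustration only; it is not part of the provided dataset
sample = make_dpo_record(
    "Will electric vehicles become mainstream?",
    "Very likely. Falling battery costs and expanding charging networks keep widening adoption.",
    "Probably not. Charging is inconvenient and batteries are still too expensive for most buyers.",
)
print(to_jsonl([sample]), end="")
```

To save a list of records, open the target file with `encoding="utf-8"` and write `to_jsonl(records)` to it.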

#### Do initial data checks

In [None]:
# Check some data samples 
training_file_path = "./data/gpt4o_generated_qa_dpo_train_10_samples.jsonl"
validation_file_path = "./data/gpt4o_generated_qa_dpo_validation_10_samples.jsonl" 

read_jsonl(training_file_path, top_lines=3)
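Printing a few lines only spot-checks the data. Before uploading, you may want to check every line against the structure described in Step 2. The sketch below is a local sanity check under our own assumptions (every line is a JSON object, `input.messages` is a non-empty list, and both outputs are non-empty lists ending with an `assistant` message); it is not the validation Azure performs, and the helper names are hypothetical.

```python
import json

def validate_dpo_line(line):
    """Return a list of problems found in one JSONL line (an empty list means it looks OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]

    problems = []
    inp = record.get("input")
    if not isinstance(inp, dict) or not isinstance(inp.get("messages"), list) or not inp["messages"]:
        problems.append("'input.messages' must be a non-empty list")
    for key in ("preferred_output", "non_preferred_output"):
        output = record.get(key)
        if not isinstance(output, list) or not output:
            problems.append(f"'{key}' must be a non-empty list")
        elif not isinstance(output[-1], dict) or output[-1].get("role") != "assistant":
            problems.append(f"last message in '{key}' should have role 'assistant'")
    return problems

def validate_dpo_file(path):
    """Check every non-blank line; return (sample_count, {line_number: problems})."""
    count, issues = 0, {}
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            if not line.strip():
                continue
            count += 1
            problems = validate_dpo_line(line)
            if problems:
                issues[number] = problems
    return count, issues

# Usage in this notebook:
# count, issues = validate_dpo_file(training_file_path)
# print(f"{count} samples, {len(issues)} lines with problems")
```

Comparing `count` with the 10-sample minimum from the prerequisites catches undersized files before the upload step.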

### Step 3: Upload Datasets for Fine-Tuning

In [None]:
# Upload training file (the with-block closes the file handle after the upload)
with open(training_file_path, "rb") as f:
    training_response = client.files.create(file=f, purpose="fine-tune")

training_file_id = training_response.id

# Upload validation file
with open(validation_file_path, "rb") as f:
    validation_response = client.files.create(file=f, purpose="fine-tune")

validation_file_id = validation_response.id

print("Training file ID:", training_file_id)
print("Validation file ID:", validation_file_id)

### Step 4: Configure and Start Fine-Tuning Job

Here is some guidance if you want to adjust the hyperparameters of the fine-tuning process. 

> **Note:** We configured these parameters to reduce resource consumption and accelerate the fine-tuning process. 

| Hyperparameter | Description |
|---|---|
| `beta` | `"auto"` or a number. Available only for DPO. A floating-point number between 0 and 2 that controls how strictly the new model adheres to its previous behavior versus aligning with the provided preferences. A higher value is more conservative (favors previous behavior); a lower value is more aggressive (favors the newly provided preferences more often). |
| `batch_size` | The batch size to use for training. When left at the default, the batch size is calculated as 0.2% of the examples in the training set, up to a maximum of 256. |
| `learning_rate_multiplier` | The fine-tuning learning rate is the original learning rate used for pre-training multiplied by this value. We recommend experimenting with values between 0.5 and 2. Empirically, we've found that larger learning rates often perform better with larger batch sizes. Must be between 0.0 and 5.0. |
| `n_epochs` | Number of training epochs. An epoch is one full pass through the training dataset. When left at the default, the number of epochs is determined dynamically from the input data. |
| `seed` | Controls the reproducibility of the job. Passing the same seed and job parameters should produce the same results, but results may differ in rare cases. If no seed is specified, one is generated for you. |
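To build intuition for `beta`, the sketch below implements the per-pair loss from the original DPO paper (Rafailov et al., 2023). Whether Azure's DPO training uses exactly this form is an assumption; the table above only describes `beta` qualitatively. With prompt $x$, preferred output $y_w$, non-preferred output $y_l$, the model being trained $\pi_\theta$, the frozen base model $\pi_{\text{ref}}$, and the logistic sigmoid $\sigma$:

$$
\mathcal{L}_{\text{DPO}} = -\log \sigma\!\left(\beta \left[\log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right]\right)
$$

The function takes summed log-probabilities of each response as plain floats; it is illustrative only and is not called anywhere in the fine-tuning job.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=1.0):
    """Per-pair DPO loss computed from summed log-probabilities of each response."""
    # Reward margin: how much more the trained model favors the preferred response
    # over the non-preferred one, relative to the reference (base) model
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)); the two branches avoid overflow in exp()
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Same margin (2.0 nats) evaluated with different beta values
for beta in (0.1, 1.0, 2.0):
    print(beta, round(dpo_loss(-5.0, -9.0, -6.0, -8.0, beta=beta), 4))
```

When the trained model scores both responses exactly like the base model, the margin is 0 and the loss is $\log 2 \approx 0.693$. A larger `beta` makes a given shift away from the base model reduce the loss faster, so less deviation is needed, which is consistent with the table's description of higher values as more conservative.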

> **Note:** The fine-tuning job can take some time to complete, typically between 30 and 50 minutes.

In [28]:
# Submit fine-tuning training job
project_name = "gpt4_o_dpo_ft"

ft_job = client.fine_tuning.jobs.create(
    suffix=project_name,
    training_file = training_file_id,
    validation_file = validation_file_id,
    model="gpt-4o-2024-08-06", # baseline model name (not the deployment name)
    method={
        "type": "dpo",
        "dpo": {
            "hyperparameters": {
                "beta": 1.0,
                "batch_size": 32,
                "learning_rate_multiplier": 5.0,
                "n_epochs": 1
            },
        },
    },
    seed=3 # seed parameter controls reproducibility of the fine-tuning job. If no seed is specified one will be generated automatically.
)

### Step 5: Track Fine-Tuning Job Status

#### Track the training job status

In [None]:
# Check the fine-tuning job status
client.fine_tuning.jobs.list(limit=1).to_dict()
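Because the job runs for 30 to 50 minutes, you may prefer to poll its status in a loop instead of re-running the cell above by hand. This is a minimal sketch assuming the terminal statuses `succeeded`, `failed`, and `cancelled` reported by the OpenAI Python SDK; `wait_for_fine_tuning` is a hypothetical helper, and the `sleep` parameter exists only so the loop can be tested without waiting.

```python
import time

# Job statuses after which polling can stop
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_fine_tuning(client, job_id, poll_seconds=60, timeout_seconds=4 * 3600, sleep=time.sleep):
    """Poll a fine-tuning job until it reaches a terminal status or the timeout expires."""
    waited = 0
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        print(f"[{waited:>6}s] status: {job.status}")
        if job.status in TERMINAL_STATUSES:
            return job
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job {job_id} still '{job.status}' after {waited} s")
        sleep(poll_seconds)
        waited += poll_seconds

# Usage in this notebook (ft_job comes from the job submission cell in Step 4):
# ft_job = wait_for_fine_tuning(client, ft_job.id)
```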

#### List fine-tuning jobs


> **Note:** API version: 2024-05-01-preview or later is required for this command.

Listing recent fine-tuning jobs is helpful for reviewing their status and details. In this example, it is used to retrieve the ID of the fine-tuning job created in this exercise.

In [None]:
# List 5 recent fine-tuning jobs
ft_jobs = client.fine_tuning.jobs.list(limit=5).to_dict()
jobs_df = date_sorted_df(pd.DataFrame(ft_jobs["data"]))
jobs_df


In [None]:
# Get the latest fine-tuning job ID
latest_ft_job_id = jobs_df.iloc[0]["id"]
latest_ft_job_id
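To inspect the individual events generated during training (file validation, training start, checkpoints, and so on), the OpenAI Python SDK provides `client.fine_tuning.jobs.list_events`. The sketch below only formats the returned events for reading; `format_events` is a hypothetical helper, and it assumes each event exposes `created_at` (Unix seconds), `level`, and `message`.

```python
from datetime import datetime, timezone

def format_events(events):
    """Return 'timestamp [level] message' strings (UTC), oldest event first."""
    lines = []
    for event in sorted(events, key=lambda e: e.created_at):
        ts = datetime.fromtimestamp(event.created_at, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
        lines.append(f"{ts} [{event.level}] {event.message}")
    return lines

# Usage in this notebook (requires `client` and `latest_ft_job_id` from the cells above):
# events = client.fine_tuning.jobs.list_events(fine_tuning_job_id=latest_ft_job_id, limit=20)
# print("\n".join(format_events(events.data)))
```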

> **Note:** 
>
> Run the following snippet ONLY after the fine-tuning job status is **Completed**.
>
> Otherwise, you will receive an error.

In [None]:
# Retrieve the name of your newly DPO fine-tuned model
ft_job = client.fine_tuning.jobs.retrieve(latest_ft_job_id) # Latest FT job will be retrieved. To use another Job ID, replace "latest_ft_job_id" with the actual job-id in your list
fine_tuned_model = ft_job.to_dict()['fine_tuned_model']
fine_tuned_model

#### Retrieve fine-tuning metrics

In [None]:
# Retrieve fine-tuning metrics from result file
result_file_id = ft_job.to_dict()['result_files'][0]
results_content = client.files.content(result_file_id).content.decode()

data_io = StringIO(results_content)
results_df = pd.read_csv(data_io)
display(results_df)

In [None]:
# Plot train and validation metrics
show_ft_metrics(results_df)

### Step 6: Deploy the Fine-Tuned Model

> **Note:** 
>
> Only one deployment is permitted for a customized model. An error occurs if you select an already-deployed customized model.
>
> The deployment process may take 10 to 20 mins.

The code below shows how to deploy the model using the Azure Resource Manager (control plane) REST API.

In [None]:
# Deploy the fine-tuned model through the Azure Resource Manager (control plane) REST API
# fine_tuned_model = name of the fine-tuned model to deploy; project_name = name of the new deployment

credential = DefaultAzureCredential()
token = credential.get_token("https://management.azure.com/.default").token

deploy_params = {'api-version': "2023-05-01"} 
deploy_headers = {'Authorization': 'Bearer {}'.format(token), 'Content-Type': 'application/json'}

deploy_data = {
    "sku": {"name": "standard", "capacity": 50}, 
    "properties": {
        "model": {
            "format": "OpenAI",
            "name": fine_tuned_model, # retrieve this value from the previous calls, it will look like gpt-4o-2024-08-06.ft-b044a9d3cf9c4228b5d393567f693b83
            "version": "1"
        }
    }
}
deploy_data = json.dumps(deploy_data)

request_url = f'https://management.azure.com/subscriptions/{subscription_id}/resourceGroups/{rg_name}/providers/Microsoft.CognitiveServices/accounts/{resource_name}/deployments/{project_name}'

print('Creating a new deployment...')

r = requests.put(request_url, params=deploy_params, headers=deploy_headers, data=deploy_data)

print(r)
print(r.reason)
print(r.json())
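The PUT request returns while the deployment is still being provisioned, which can take 10 to 20 minutes. One option is to poll the same resource URL with GET until `properties.provisioningState` reaches a terminal value. This is a minimal sketch: `wait_for_deployment` is a hypothetical helper, the set of terminal states is our assumption based on common ARM provisioning states, and `get_json` is any function that returns the parsed GET response.

```python
import time

# Provisioning states after which polling can stop (assumed terminal set)
TERMINAL_STATES = {"Succeeded", "Failed", "Canceled"}

def wait_for_deployment(get_json, poll_seconds=30, timeout_seconds=1800, sleep=time.sleep):
    """Poll a deployment resource until properties.provisioningState is terminal."""
    waited = 0
    while True:
        body = get_json() or {}
        state = (body.get("properties") or {}).get("provisioningState")
        print(f"[{waited:>5}s] provisioningState: {state}")
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"Deployment still '{state}' after {waited} s")
        sleep(poll_seconds)
        waited += poll_seconds

# Usage in this notebook (reuses request_url, deploy_params and deploy_headers from the cell above):
# get_json = lambda: requests.get(request_url, params=deploy_params, headers=deploy_headers, timeout=30).json()
# final_state = wait_for_deployment(get_json)
```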

### Step 7: Test the Deployed Fine-Tuned Model

After your fine-tuned model is deployed, you can use it like any other deployed model in either the [Chat Playground in Azure AI Foundry](https://ai.azure.com/), or via the chat completion API. 

For example, you can send a chat completion call to your deployed model, as shown in the following Python code snippet. 

> **Note:** 
>
> Run the following snippet ONLY after the fine-tuned model has been **Deployed**.
>
> Otherwise, you will receive an error.

In [None]:
# Check output from the deployed DPO fine-tuned model via Foundry API
test_messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': 'Will AI improve the logistics of renewable energy?'}
]

response = client.chat.completions.create(
    model=project_name, # deployment name created in Step 6 (not the fine-tuned model name)
    messages=test_messages, 
    temperature=0.7,
    max_tokens=800)

print(response.choices[0].message.content)
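To see the effect of DPO directly, you can send the same messages to the base deployment (named **gpt-4o** in the prerequisites) and to the fine-tuned deployment. This is a minimal sketch: `compare_deployments` is a hypothetical helper that wraps the same `client.chat.completions.create` call used above, and any extra keyword arguments are passed through unchanged.

```python
def compare_deployments(client, deployments, messages, **kwargs):
    """Send the same chat messages to several deployments and return {deployment: answer}."""
    answers = {}
    for name in deployments:
        response = client.chat.completions.create(model=name, messages=messages, **kwargs)
        answers[name] = response.choices[0].message.content
    return answers

# Usage in this notebook (requires `client`, `project_name` and `test_messages` from above):
# answers = compare_deployments(client, ["gpt-4o", project_name], test_messages, temperature=0.7, max_tokens=800)
# for name, text in answers.items():
#     print(f"--- {name} ---\n{text}\n")
```

Keeping `temperature` and `max_tokens` identical for both calls makes the comparison fairer, although sampling means individual answers will still vary between runs.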

### Step 8: Evaluate the Base GPT-4o and the DPO Fine-Tuned GPT-4o Models

To keep the demo lightweight, cost-effective, and within an acceptable timeframe, we previously used both the base GPT-4o model and the DPO fine-tuned GPT-4o model to answer a pre-selected set of 49 test questions. The table below presents the detailed results.

In [None]:
comparison_df = pd.read_csv("./data/test_qa_pairs_base_gpt_4o_versus_dpo_fine_tuned_gpt_4o.csv")
comparison_df.info()
comparison_df.head()

Each row represents a test case, and the columns are defined as follows:

- **question**: The original user prompt or test question.
- **gpt_4o_base_answer**: The response generated by the Base GPT-4o model.
- **gpt_4o_base_answer_label**: A label (e.g., Positive, Neutral, Negative) automatically assigned to the base answer by the Base GPT-4o model.
- **gpt_4o_base_answer_explanation**: A brief explanation, written by the Base GPT-4o model, justifying the assigned label.
- **gpt_4o_dpo_fine_tuned_answer**: The response generated by the DPO Fine-Tuned GPT-4o model.
- **gpt_4o_dpo_fine_tuned_answer_label**: A label (e.g., Positive, Neutral, Negative) automatically assigned to the fine-tuned answer by the Base GPT-4o model.
- **gpt_4o_dpo_fine_tuned_answer_explanation**: A brief explanation, written by the Base GPT-4o model, justifying the assigned label.

In [None]:
# Count answer labels for each model (sorted by label)
base_label_counts = comparison_df['gpt_4o_base_answer_label'].value_counts().sort_index()
dpo_label_counts = comparison_df['gpt_4o_dpo_fine_tuned_answer_label'].value_counts().sort_index()

# Combine into a single DataFrame
label_comparison_df = pd.DataFrame({
    'Base GPT-4o': base_label_counts,
    'DPO Fine-Tuned GPT-4o': dpo_label_counts
}).fillna(0).astype(int)

# Display table
display(label_comparison_df)

# Plot grouped bar chart
label_comparison_df.plot(
    kind='bar',
    figsize=(8, 5),
    color=['#1f77b4', '#2ca02c'],
    edgecolor='black'
)

plt.title('Answer Label Distribution: Base vs. DPO Fine-Tuned GPT-4o')
plt.xlabel('Answer Label')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.legend(title='Model')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

Based on the label distribution table and the bar chart, we observe a clear shift in tone: the DPO fine-tuned GPT-4o model consistently produces more positive responses compared to the base model. This suggests that the fine-tuning process successfully aligned the model toward a more optimistic attitude when answering philosophical and forward-thinking questions.
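The label counts compare the two models in aggregate, but they do not show which individual questions changed tone. Because both answers in a row refer to the same question, counting (base label, fine-tuned label) pairs reveals the per-question shifts, for example how many Neutral answers became Positive. A minimal sketch with the standard library; `label_transitions` is a hypothetical helper, and `pd.crosstab` on the two label columns would produce the same counts as a table.

```python
from collections import Counter

def label_transitions(base_labels, tuned_labels):
    """Count (base label, fine-tuned label) pairs for answers to the same questions."""
    base_labels, tuned_labels = list(base_labels), list(tuned_labels)
    if len(base_labels) != len(tuned_labels):
        raise ValueError("both label sequences must cover the same questions")
    return Counter(zip(base_labels, tuned_labels))

# Usage in this notebook (requires comparison_df from above):
# transitions = label_transitions(comparison_df["gpt_4o_base_answer_label"],
#                                 comparison_df["gpt_4o_dpo_fine_tuned_answer_label"])
# for (before, after), n in transitions.most_common():
#     print(f"{before:>10} -> {after:<10} {n}")
```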

### Step 9: Delete the Deployment

It is **strongly recommended** that, once you're done with this tutorial and have tested a few chat completion calls against your fine-tuned model, you delete the model deployment, since fine-tuned (customized) models incur an [hourly hosting cost](https://azure.microsoft.com/zh-cn/pricing/details/cognitive-services/openai-service/#pricing) once they are deployed.
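Besides the Azure portal, one way to remove the deployment is an HTTP DELETE against the same Azure Resource Manager URL used to create it in Step 6. This is a minimal sketch: `delete_deployment` is a hypothetical helper, it reuses the `2023-05-01` API version from Step 6, and `requests` is imported inside the function only so a stub session can be passed in for testing. ARM DELETE calls typically return 200, 202, or 204, but check the response before assuming the deployment is gone.

```python
def delete_deployment(request_url, token, api_version="2023-05-01", session=None):
    """Send an ARM DELETE for the deployment resource and return the HTTP status code."""
    if session is None:
        import requests  # imported lazily so the function can be exercised with a stub session
        session = requests
    response = session.delete(
        request_url,
        params={"api-version": api_version},
        headers={"Authorization": f"Bearer {token}"},
        timeout=60,
    )
    return response.status_code

# Usage in this notebook (request_url and credential come from the deployment cell in Step 6):
# token = credential.get_token("https://management.azure.com/.default").token
# print(delete_deployment(request_url, token))
```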