# DPO Fine-Tuning with Intel Orca Dataset on Azure AI

This notebook demonstrates how to fine-tune language models using **Direct Preference Optimization (DPO)** with the Intel Orca DPO Pairs dataset.

## What You'll Learn
1. Understand DPO fine-tuning
2. Prepare and format DPO training data  
3. Upload datasets to Azure AI
4. Create and monitor a DPO fine-tuning job
5. Evaluate your fine-tuned model
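
As background for item 1, the per-pair objective that DPO minimizes can be sketched in a few lines. This is an illustrative sketch only: the function name `dpo_loss`, the `beta` value, and the log-probabilities are assumptions made for the example, and the managed fine-tuning service used in this notebook computes the loss internally.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin))."""
    # Log-ratio of the policy to the frozen reference model for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical sequence log-probabilities, chosen only for illustration.
no_preference = dpo_loss(-12.0, -12.0, -12.0, -12.0)
prefers_chosen = dpo_loss(-10.0, -14.0, -12.0, -12.0)
print(f"no preference: {no_preference:.4f}, prefers chosen: {prefers_chosen:.4f}")
```

The loss falls as the policy assigns relatively more probability to the preferred response than the reference model does, which is the behavior the training job rewards.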

Note: Execute each cell in sequence.

## 1. Setup and Installation

Install all required packages from `requirements.txt`.

In [1]:
pip install -r requirements.txt

Collecting azure-ai-projects>=2.0.0b1 (from -r requirements.txt (line 2))
  Using cached azure_ai_projects-2.0.0b2-py3-none-any.whl.metadata (63 kB)
Collecting openai (from -r requirements.txt (line 5))
  Using cached openai-2.14.0-py3-none-any.whl.metadata (29 kB)
Collecting azure-identity (from -r requirements.txt (line 8))
  Using cached azure_identity-1.25.1-py3-none-any.whl.metadata (88 kB)
Collecting azure-mgmt-cognitiveservices (from -r requirements.txt (line 9))
  Using cached azure_mgmt_cognitiveservices-14.1.0-py3-none-any.whl.metadata (32 kB)
Collecting azure-ai-evaluation>=1.13.0 (from -r requirements.txt (line 12))
  Using cached azure_ai_evaluation-1.13.7-py3-none-any.whl.metadata (49 kB)
Collecting python-dotenv (from -r requirements.txt (line 15))
  Using cached python_dotenv-1.2.1-py3-none-any.whl.metadata (25 kB)
Collecting isodate>=0.6.1 (from azure-ai-projects>=2.0.0b1->-r requirements.txt (line 2))
  Using cached isodate-0.7.2-py3-none-any.whl.metadata (11 kB)
Coll


[notice] A new release of pip is available: 24.0 -> 25.3
[notice] To update, run: python.exe -m pip install --upgrade pip


## 2. Import Libraries

In [4]:
import os
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient

print(" All libraries imported successfully")

 All libraries imported successfully


## 3. Define Evaluation Function

Define a helper function that generates responses from a deployment and scores them with the Azure AI Evaluation SDK.

In [5]:
def evaluate_model(deployment_name, num_samples=10):
    """
    Evaluate a model deployment using Azure AI Evaluation SDK.
    
    Args:
        deployment_name: Name of the deployed model to evaluate
        num_samples: Number of samples to evaluate (default: 10)
    
    Returns:
        Dictionary containing evaluation metrics
    """
    import json
    from azure.ai.evaluation import evaluate, CoherenceEvaluator, FluencyEvaluator, RelevanceEvaluator, SimilarityEvaluator, GroundednessEvaluator
    from openai import AzureOpenAI
    
    print(f"Evaluating deployment: {deployment_name}")
    print(f"Using {num_samples} samples from training.jsonl")
    
    azure_openai_client = AzureOpenAI(
        azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
        api_key=os.getenv("AZURE_OPENAI_KEY"),
        api_version="2024-08-01-preview"
    )
    
    print("Generating model responses...")
    eval_data = []
    with open("training.jsonl", 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if i >= num_samples:
                break
            sample = json.loads(line)
            
            messages = sample["input"]["messages"]
            query = next((msg["content"] for msg in messages if msg["role"] == "user"), "")
            
            response = azure_openai_client.chat.completions.create(
                model=deployment_name,
                messages=messages,
                temperature=0.7,
                max_tokens=500
            )
            model_response = response.choices[0].message.content
            
            ground_truth = next((msg["content"] for msg in sample["preferred_output"] if msg["role"] == "assistant"), "")
            
            eval_data.append({
                "query": query,
                "response": model_response,
                "ground_truth": ground_truth
            })
            # Print progress every 10 samples or on the last sample
            if (i + 1) % 10 == 0 or (i + 1) == num_samples:
                print(f"  Processed {i+1}/{num_samples}")
    
    eval_file = f"evaluation_data_{deployment_name.replace('-', '_')}.jsonl"
    with open(eval_file, 'w', encoding='utf-8') as f:
        for item in eval_data:
            f.write(json.dumps(item) + '\n')
    
    model_config = {
        "azure_endpoint": os.getenv("AZURE_OPENAI_ENDPOINT"),
        "api_key": os.getenv("AZURE_OPENAI_KEY"),
        "azure_deployment": deployment_name,
        "api_version": "2024-08-01-preview",
    }
    
    print("Running evaluation with 5 metrics...")
    results = evaluate(
        data=eval_file,
        evaluators={
            "coherence": CoherenceEvaluator(model_config=model_config),
            "fluency": FluencyEvaluator(model_config=model_config),
            "relevance": RelevanceEvaluator(model_config=model_config),
            "groundedness": GroundednessEvaluator(model_config=model_config),
            "similarity": SimilarityEvaluator(model_config=model_config)
        },
        evaluator_config={
            "default": {
                "column_mapping": {
                    "query": "${data.query}",
                    "response": "${data.response}",
                    "ground_truth": "${data.ground_truth}"
                }
            },
            "groundedness": {
                "column_mapping": {
                    "query": "${data.query}",
                    "response": "${data.response}",
                    "context": "${data.ground_truth}"
                }
            }
        },
        output_path=f"./evaluation_results_{deployment_name.replace('-', '_')}"
    )
    
    print(f"\nEVALUATION RESULTS: {deployment_name}\n")
    
    if "metrics" in results:
        metrics = results["metrics"]
        
        coherence = metrics.get('coherence.coherence', metrics.get('coherence'))
        fluency = metrics.get('fluency.fluency', metrics.get('fluency'))
        relevance = metrics.get('relevance.relevance', metrics.get('relevance'))
        groundedness = metrics.get('groundedness.groundedness', metrics.get('groundedness'))
        similarity = metrics.get('similarity.similarity', metrics.get('similarity'))
        
        if coherence is not None:
            print(f"Coherence:      {coherence:.4f} (1-5 scale)")
        if fluency is not None:
            print(f"Fluency:        {fluency:.4f} (1-5 scale)")
        if relevance is not None:
            print(f"Relevance:      {relevance:.4f} (1-5 scale)")
        if groundedness is not None:
            print(f"Groundedness:   {groundedness:.4f} (1-5 scale)")
        if similarity is not None:
            print(f"Similarity:     {similarity:.4f} (1-5 scale)")
    
    print("="*60)
    print(f"Detailed results saved to: ./evaluation_results_{deployment_name.replace('-', '_')}")
    
    return results
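
The record-parsing step inside `evaluate_model` can be tried on its own, without any Azure calls. The sketch below isolates it in a hypothetical helper, `extract_eval_fields`, and runs it on a made-up sample that follows the same `input` / `preferred_output` layout the function reads.

```python
def extract_eval_fields(sample):
    """Return (query, ground_truth) from one preference record."""
    messages = sample["input"]["messages"]
    # First user message is the query; empty string if there is none.
    query = next((m["content"] for m in messages if m["role"] == "user"), "")
    # First assistant message in the preferred output is the reference answer.
    ground_truth = next(
        (m["content"] for m in sample["preferred_output"] if m["role"] == "assistant"),
        "",
    )
    return query, ground_truth


# Made-up record for illustration only.
example_sample = {
    "input": {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is 2 + 2?"},
        ]
    },
    "preferred_output": [{"role": "assistant", "content": "2 + 2 equals 4."}],
    "non_preferred_output": [{"role": "assistant", "content": "5"}],
}

example_query, example_truth = extract_eval_fields(example_sample)
print(example_query, "|", example_truth)
```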


## 4. Configure Azure Environment
Set your Azure AI Project endpoint and model name. This example uses **gpt-4.1-mini**, but you can use other supported GPT models. Create a `.env` file with the following variables:

```
# Required for DPO Fine-Tuning
AZURE_AI_PROJECT_ENDPOINT=<your-endpoint> 
AZURE_SUBSCRIPTION_ID=<your-subscription-id>
AZURE_RESOURCE_GROUP=<your-resource-group>
AZURE_AOAI_ACCOUNT=<your-foundry-account-name>
MODEL_NAME=<your-base-model-name>

# Required for Model Evaluation
AZURE_OPENAI_ENDPOINT=<your-azure-openai-endpoint>
AZURE_OPENAI_KEY=<your-azure-openai-api-key>
DEPLOYMENT_NAME=<your-deployment-name>
```
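
A missing or placeholder value in `.env` tends to surface later as an authentication or deployment error, so a quick check up front can save time. The sketch below is one way to do that, assuming the variable names listed above; `missing_env_vars` is a hypothetical helper, not part of any Azure SDK.

```python
import os

REQUIRED_VARS = [
    "AZURE_AI_PROJECT_ENDPOINT",
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_RESOURCE_GROUP",
    "AZURE_AOAI_ACCOUNT",
    "MODEL_NAME",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_KEY",
    "DEPLOYMENT_NAME",
]


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return names that are unset, empty, or still a <placeholder>."""
    env = os.environ if env is None else env
    missing = []
    for name in required:
        value = (env.get(name) or "").strip()
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(name)
    return missing


# Demonstration with an in-memory mapping instead of the real environment.
example_env = {
    "MODEL_NAME": "gpt-4.1-mini",
    "AZURE_OPENAI_KEY": "<your-azure-openai-api-key>",
}
example_missing = missing_env_vars(
    ["MODEL_NAME", "AZURE_OPENAI_KEY", "DEPLOYMENT_NAME"], example_env
)
print(example_missing)
```

In the notebook, calling `missing_env_vars()` after `load_dotenv()` and stopping if the list is non-empty would apply the same check to the real environment.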

In [6]:
# Load environment variables
load_dotenv()

endpoint = os.environ.get("AZURE_AI_PROJECT_ENDPOINT")
model_name = os.environ.get("MODEL_NAME")

# Define dataset file paths
training_file_path = "training.jsonl"
validation_file_path = "validation.jsonl"

print(f"Base model: {model_name}")

Base model: gpt-4.1-mini
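
The paths above assume `training.jsonl` and `validation.jsonl` already exist in the preference format. As a rough sketch of how one row of the Intel Orca DPO Pairs dataset could be mapped to that format, the example below assumes the dataset's columns are `system`, `question`, `chosen`, and `rejected`, and that the target record uses `input`, `preferred_output`, and `non_preferred_output`. Only `input` and `preferred_output` appear elsewhere in this notebook, so treat the remaining names as assumptions to verify against your data and the service documentation.

```python
import json


def orca_pair_to_dpo_record(row):
    """Map one Orca-style row (assumed columns) to a preference record."""
    messages = []
    # The system prompt is optional; skip it when empty.
    if row.get("system"):
        messages.append({"role": "system", "content": row["system"]})
    messages.append({"role": "user", "content": row["question"]})
    return {
        "input": {"messages": messages},
        "preferred_output": [{"role": "assistant", "content": row["chosen"]}],
        "non_preferred_output": [{"role": "assistant", "content": row["rejected"]}],
    }


# Made-up row for illustration only.
example_row = {
    "system": "You are a helpful assistant.",
    "question": "Name the capital of France.",
    "chosen": "The capital of France is Paris.",
    "rejected": "France is a country in Europe.",
}

example_record = orca_pair_to_dpo_record(example_row)
# One JSON object per line is what a .jsonl file expects.
print(json.dumps(example_record))
```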


## 5. Connect to Azure AI Project

Connect to Azure AI Project using Azure credential authentication. This initializes the project client and OpenAI client needed for fine-tuning workflows. Ensure you have the **Azure AI User** role assigned to your account for the Azure AI Project resource.

In [7]:
credential = DefaultAzureCredential()
project_client = AIProjectClient(endpoint=endpoint, credential=credential)
openai_client = project_client.get_openai_client()

print("Connected to Azure AI Project")

Connected to Azure AI Project


## 6. Upload Training Files

Upload the training and validation JSONL files to Azure AI. Each file is assigned a unique ID that will be referenced when creating the fine-tuning job.
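
Malformed records are usually only reported once the service processes the file, so a local check before uploading can shorten the feedback loop. The sketch below is a minimal validator, assuming each line is a JSON object with `input.messages`, `preferred_output`, and `non_preferred_output`; `find_invalid_records` is a hypothetical helper, and the service may enforce further rules that it does not cover.

```python
import json


def find_invalid_records(lines):
    """Return (line_number, reason) pairs for lines that fail basic checks."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        input_block = record.get("input")
        messages = input_block.get("messages") if isinstance(input_block, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing input.messages"))
        for key in ("preferred_output", "non_preferred_output"):
            output = record.get(key)
            if not isinstance(output, list) or not output:
                problems.append((number, f"missing {key}"))
    return problems


# Made-up lines: one valid record, one incomplete record, one non-JSON line.
sample_lines = [
    json.dumps({
        "input": {"messages": [{"role": "user", "content": "Hi"}]},
        "preferred_output": [{"role": "assistant", "content": "Hello!"}],
        "non_preferred_output": [{"role": "assistant", "content": "What?"}],
    }),
    '{"input": {"messages": []}}',
    "not json",
]
sample_problems = find_invalid_records(sample_lines)
for number, reason in sample_problems:
    print(f"line {number}: {reason}")
```

To check a real file, pass an open file handle, for example `find_invalid_records(open(training_file_path, encoding="utf-8"))`.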

In [9]:
print("Uploading training file...")
with open(training_file_path, "rb") as f:
    train_file = openai_client.files.create(file=f, purpose="fine-tune")
print(f" Training file ID: {train_file.id}")

print("\nUploading validation file...")
with open(validation_file_path, "rb") as f:
    validation_file = openai_client.files.create(file=f, purpose="fine-tune")
print(f" Validation file ID: {validation_file.id}")

Uploading training file...
 Training file ID: file-4121fac45b5144bab840f2a8bea3eb9c

Uploading validation file...
 Validation file ID: file-b3b53f48582b4bf89741bfbd1f6fd7a1


In [10]:
print("Waiting for files to be processed...")
openai_client.files.wait_for_processing(train_file.id)
openai_client.files.wait_for_processing(validation_file.id)
print(" Files ready!")

Waiting for files to be processed...
 Files ready!


## 7. Evaluate Base Model

Establish baseline performance metrics by evaluating the base model before DPO fine-tuning. This provides a comparison point to measure improvements after training.



In [None]:
base_deployment = os.getenv("DEPLOYMENT_NAME")
print(f"Evaluating base model: {base_deployment}\n")

base_results = evaluate_model(base_deployment, num_samples=30)

Evaluating base model: gpt-4.1-mini

Evaluating deployment: gpt-4.1-mini
Using 30 samples from training.jsonl
Generating model responses...
  Processed 10/30
  Processed 20/30
  Processed 30/30
Running evaluation with 5 metrics...
2026-01-02 17:08:05 +0530   26432 execution.bulk     INFO     Finished 1 / 30 lines.
2026-01-02 17:08:05 +0530   26432 execution.bulk     INFO     Average execution time for completed lines: 8.81 seconds. Estimated time for incomplete lines: 255.49 seconds.
2026-01-02 17:08:05 +0530   23304 execution.bulk     INFO     Finished 1 / 30 lines.
2026-01-02 17:08:05 +0530   23304 execution.bulk     INFO     Average execution time for completed lines: 8.99 seconds. Estimated time for incomplete lines: 260.71 seconds.
2026-01-02 17:08:05 +0530   12916 execution.bulk     INFO     Finished 1 / 30 lines.
2026-01-02 17:08:05 +0530   12916 execution.bulk     INFO     Average execution time for completed lines: 9.04 seconds. Estimated time for incomplete lines: 262.16 seco

Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "coherence_20260102_113756_916688"
Run status: "Completed"
Start time: "2026-01-02 11:37:56.916688+00:00"
Duration: "0:00:26.550113"

2026-01-02 17:08:23 +0530   12916 execution.bulk     INFO     Finished 30 / 30 lines.
2026-01-02 17:08:23 +0530   12916 execution.bulk     INFO     Average execution time for completed lines: 0.89 seconds. Estimated time for incomplete lines: 0.0 seconds.

Run name: "groundedness_20260102_113756_932699"
Run status: "Completed"
Start time: "2026-01-02 11:37:56.932699+00:00"
Duration: "0:00:27.429327"

2026-01-02 17:09:07 +0530   19684 execution.bulk     INFO     Finished 21 / 30 lines.
2026-01-02 17:09:07 +0530   19684 execution.bulk     INFO     Average execution time for completed lines: 3.34 seconds. Estimated time for incomplete lines: 30.06 seconds.
2026-01-02 17:09:07 +0530   19684 execution.bulk     INFO     Finished 22 / 30 lines.
2026-01-02 17:09:07 +0530   19684 execution.bulk     INFO     Average execution time for completed lines: 3

Aggregated metrics for evaluator is not a dictionary will not be logged as metrics
Aggregated metrics for evaluator is not a dictionary will not be logged as metrics
Aggregated metrics for evaluator is not a dictionary will not be logged as metrics
Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "fluency_20260102_113756_922698"
Run status: "Completed"
Start time: "2026-01-02 11:37:56.922698+00:00"
Duration: "0:01:11.299292"


{
    "coherence": {
        "status": "Completed",
        "duration": "0:00:26.550113",
        "completed_lines": 30,
        "failed_lines": 0,
        "log_path": null,
        "error_message": null,
        "error_code": null
    },
    "fluency": {
        "status": "Completed",
        "duration": "0:01:11.299292",
        "completed_lines": 30,
        "failed_lines": 0,
        "log_path": null,
        "error_message": null,
        "error_code": null
    },
    "relevance": {
        "status": "Completed",
        "duration": "0:00:25.459820",
        "completed_lines": 30,
        "failed_lines": 0,
        "log_path": null,
        "error_message": null,
        "error_code": null
    },
    "groundedness": {
        "status": "Completed",
        "duration": "0:00:27.429327",
        "completed_lines": 30,
        "failed_lines":



## 8. Create DPO Fine-Tuning Job
Create a DPO fine-tuning job with your uploaded datasets. Configure the following hyperparameters to control the training process:

1. `n_epochs` (3): Number of complete passes through the training dataset. More epochs can improve performance but may lead to overfitting. Typical range: 1-10.
2. `batch_size` (1): Number of training examples processed together in each iteration. Smaller batches (1-2) are common for DPO to maintain training stability.
3. `learning_rate_multiplier` (1.0): Scales the default learning rate. Values below 1.0 make training more conservative, while values above 1.0 speed up learning but may cause instability. Typical range: 0.1-2.0.

Start with these defaults, then adjust them based on your dataset size and desired model behavior.

In [11]:
fine_tuning_job = openai_client.fine_tuning.jobs.create(
    training_file=train_file.id,
    validation_file=validation_file.id,
    model=model_name,
    method={
        "type": "dpo",
        "dpo": {
            "hyperparameters": {
                "n_epochs": 3,
                "batch_size": 1,
                "learning_rate_multiplier": 1.0
            }
        }
    },
    extra_body={"trainingType": "GlobalStandard"}
)

print(f" Job ID: {fine_tuning_job.id}")
print(f"Status: {fine_tuning_job.status}")

 Job ID: ftjob-4cad7de198a34baeb4f0c95ff01ac844
Status: pending
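
To get a feel for how `n_epochs` and `batch_size` interact, the arithmetic below estimates the number of training updates. It is a back-of-the-envelope sketch that assumes one update per batch with no gradient accumulation; the service may schedule training differently, and `estimate_optimizer_steps` and the dataset size are hypothetical.

```python
import math


def estimate_optimizer_steps(num_examples, n_epochs=3, batch_size=1):
    """Estimate updates as ceil(num_examples / batch_size) * n_epochs."""
    if num_examples < 0 or n_epochs < 1 or batch_size < 1:
        raise ValueError("num_examples must be >= 0; n_epochs and batch_size must be >= 1")
    # The final batch of an epoch may be smaller, hence the ceiling.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * n_epochs


# Hypothetical dataset of 1,000 preference pairs with the defaults used above.
example_steps = estimate_optimizer_steps(1000, n_epochs=3, batch_size=1)
print(example_steps)
```

With `batch_size=1`, every example triggers its own update, which is one reason a small batch size pairs naturally with a modest number of epochs.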


## 9. Monitor Training Progress
Check the status of your fine-tuning job and track its progress. You can view the current status and recent training events. Training duration varies with dataset size, model, and hyperparameters, typically ranging from minutes to several hours.

In [12]:
job_status = openai_client.fine_tuning.jobs.retrieve(fine_tuning_job.id)
print(f"Status: {job_status.status}")

Status: pending


In [13]:
# View recent events
events = list(openai_client.fine_tuning.jobs.list_events(fine_tuning_job.id, limit=10))
for event in events:
    print(event.message)

Job enqueued. Waiting for jobs ahead to complete.
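
The cells above check the job once. For an unattended run, a polling loop can wait until the job reaches a terminal state. The sketch below is hedged: `wait_for_job` is a hypothetical helper, the terminal statuses are assumed to be `succeeded`, `failed`, and `cancelled`, and the demonstration uses a fake `retrieve` function so it runs offline.

```python
import time
from types import SimpleNamespace

# Assumed terminal statuses; other values such as "pending" or "running" keep polling.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve, job_id, poll_seconds=60, timeout_seconds=6 * 60 * 60):
    """Call retrieve(job_id) until the job status is terminal or time runs out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = retrieve(job_id)
        print(f"Status: {job.status}")
        if job.status in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {job_id} did not finish within {timeout_seconds} seconds")
        time.sleep(poll_seconds)


# Offline demonstration with a fake retrieve function and made-up statuses.
_fake_statuses = iter(["pending", "running", "succeeded"])


def fake_retrieve(job_id):
    return SimpleNamespace(id=job_id, status=next(_fake_statuses))


final_job = wait_for_job(fake_retrieve, "ftjob-example", poll_seconds=0)
```

In the notebook, the equivalent call would be `wait_for_job(openai_client.fine_tuning.jobs.retrieve, fine_tuning_job.id)`.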


## 10. Retrieve Fine-Tuned Model
After the fine-tuning job succeeds, retrieve the fine-tuned model ID. This ID is required to make inference calls with your customized model.

In [14]:
completed_job = openai_client.fine_tuning.jobs.retrieve(fine_tuning_job.id)

if completed_job.status == "succeeded":
    fine_tuned_model_id = completed_job.fine_tuned_model
    print(f" Fine-tuned Model ID: {fine_tuned_model_id}")
else:
    print(f"Status: {completed_job.status}")

Status: pending


## 11. Deploy the Fine-Tuned Model

Deploy the fine-tuned model to Azure OpenAI as a deployment endpoint. This step is required before making inference calls. The deployment uses the GlobalStandard SKU with a capacity of 50.

In [None]:
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient
from azure.mgmt.cognitiveservices.models import Deployment, DeploymentProperties, DeploymentModel, Sku

subscription_id = os.environ.get("AZURE_SUBSCRIPTION_ID")
resource_group = os.environ.get("AZURE_RESOURCE_GROUP")
account_name = os.environ.get("AZURE_AOAI_ACCOUNT")

deployment_name = "gpt-4.1-mini-dpo-finetuned"

with CognitiveServicesManagementClient(credential=credential, subscription_id=subscription_id) as cogsvc_client:
    deployment_model = DeploymentModel(format="OpenAI", name=fine_tuned_model_id, version="1")
    deployment_properties = DeploymentProperties(model=deployment_model)
    deployment_sku = Sku(name="GlobalStandard", capacity=50)
    deployment_config = Deployment(properties=deployment_properties, sku=deployment_sku)
    
    print(f"Deploying fine-tuned model: {fine_tuned_model_id}")
    deployment = cogsvc_client.deployments.begin_create_or_update(
        resource_group_name=resource_group,
        account_name=account_name,
        deployment_name=deployment_name,
        deployment=deployment_config,
    )
    
    print("Waiting for deployment to complete...")
    deployment.result()

print(f" Model deployment completed: {deployment_name}")

## 12. Test Your Fine-Tuned Model

Validate your fine-tuned model by running test inferences. This helps you assess whether DPO training aligned the model with the preferred response patterns in the training data.

In [None]:
print(f"Testing fine-tuned model via deployment: {deployment_name}")

response = openai_client.responses.create(
    model=deployment_name,
    input=[{"role": "user", "content": "Explain machine learning in simple terms."}]
)

print(f"Model response: {response.output_text}")

## 13. Evaluate Fine-Tuned Model

Evaluate your model using Azure AI Evaluation SDK to measure quality improvements from DPO fine-tuning.

We'll assess 5 key metrics:
- **Coherence**: Logical flow and structure
- **Fluency**: Grammatical correctness and naturalness
- **Relevance**: How well responses address the query
- **Groundedness**: Factual accuracy against context
- **Similarity**: Alignment with preferred outputs

In [None]:
print(f"Evaluating fine-tuned model: {deployment_name}\n")

# Use the same sample count as the base-model evaluation for a like-for-like comparison
finetuned_results = evaluate_model(deployment_name, num_samples=30)

print("\nCompare base model vs fine-tuned model metrics to see DPO improvements!")
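
The comparison suggested by the final message can be sketched as a per-metric difference between the two `metrics` dictionaries. `metric_deltas` is a hypothetical helper, and the numbers below are made up purely to show the output shape; they are not results from this notebook.

```python
def metric_deltas(base_metrics, tuned_metrics):
    """Return tuned - base for every numeric metric present in both dicts."""
    deltas = {}
    for name in sorted(set(base_metrics) & set(tuned_metrics)):
        base, tuned = base_metrics[name], tuned_metrics[name]
        # Skip anything that is not a plain number (for example, nested dicts).
        if isinstance(base, (int, float)) and isinstance(tuned, (int, float)):
            deltas[name] = tuned - base
    return deltas


# Made-up metric values for illustration only.
example_base = {"coherence.coherence": 4.0, "fluency.fluency": 3.5}
example_tuned = {
    "coherence.coherence": 4.25,
    "fluency.fluency": 3.5,
    "similarity.similarity": 3.0,
}

example_deltas = metric_deltas(example_base, example_tuned)
for name, delta in example_deltas.items():
    print(f"{name}: {delta:+.2f}")
```

With real results, the call would be `metric_deltas(base_results["metrics"], finetuned_results["metrics"])`. Small differences on a few dozen samples are within the noise of LLM-judged metrics, so read them as a rough signal.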

## 14. Next Steps

Congratulations! You've successfully fine-tuned a model with DPO.

### What's Next?
- Deploy your model to production
- Evaluate on more test cases
- Experiment with hyperparameters
- Try different datasets