### 3. Monitor the fine tuning job

Training time depends on the number of tokens and the number of epochs. You can typically expect a job of this size to run for a little over an hour and a half. We have already fine-tuned and deployed a model, so you can use it directly without waiting for your own fine-tuning job to complete.

[Fine tuned model](https://oai.azure.com/resource/finetune/ftjob-6d1293138cd844e7bab02a141a60c697/details?wsid=/subscriptions/3c791225-4905-4a40-860b-0a0c9cd2af91/resourceGroups/RG-FineTuning-AIGBBWorkshop/providers/Microsoft.CognitiveServices/accounts/aoai-raft-gbb-workshop&tid=604b58b3-fa4e-4a57-b566-cac3f88a3ae8)

[Fine tuned model deployment](https://oai.azure.com/resource/deployments/%2Fsubscriptions%2F3c791225-4905-4a40-860b-0a0c9cd2af91%2FresourceGroups%2FRG-FineTuning-AIGBBWorkshop%2Fproviders%2FMicrosoft.CognitiveServices%2Faccounts%2Faoai-raft-gbb-workshop%2Fdeployments%2Fgpt-4o-mini-ft-raft-banking?wsid=/subscriptions/3c791225-4905-4a40-860b-0a0c9cd2af91/resourceGroups/RG-FineTuning-AIGBBWorkshop/providers/Microsoft.CognitiveServices/accounts/aoai-raft-gbb-workshop&tid=604b58b3-fa4e-4a57-b566-cac3f88a3ae8)

You can monitor your fine-tuning job from this notebook or in the new Azure OpenAI Studio.

In the Studio, go to Tools > Fine-tuning and click on your job.

![alt text](./static/ft_monitor.png "Azure OpenAI Studio Fine tuning job")

## Overview
![](./doc/raft-process-deploy.png)

**We can also monitor the job from this notebook**

In [1]:
import os
from dotenv import load_dotenv

# Variables passed by previous notebooks
load_dotenv(".env.state")

job_id = os.getenv("STUDENT_OPENAI_JOB_ID")
STUDENT_MODEL_NAME = os.getenv("STUDENT_MODEL_NAME")
ds_name = os.getenv("DATASET_NAME")
print(f"Dataset name {ds_name}")
print(f"Student OpenAI Job ID {job_id}")
print(f"Student model name {STUDENT_MODEL_NAME}")

Dataset name zava-articles
Student OpenAI Job ID ftjob-8972c351eabc479691d407c4d2216381
Student model name gpt-4.1-nano


In [2]:
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential
from azure.identity import get_bearer_token_provider

aoai_endpoint = os.getenv("FINETUNE_AZURE_OPENAI_ENDPOINT")

# Authenticate using the default Azure credential chain
azure_credential = DefaultAzureCredential()

client = AzureOpenAI(
  azure_endpoint = aoai_endpoint,
  api_version = "2024-05-01-preview",  # This API version or later is required to access seed/events/checkpoint features
  azure_ad_token_provider = get_bearer_token_provider(
    azure_credential, "https://cognitiveservices.azure.com/.default"
  )
)

In [3]:
from IPython.display import clear_output
import time

start_time = time.time()

# Get the status of our fine-tuning job.
response = client.fine_tuning.jobs.retrieve(job_id)

status = response.status

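# If the job isn't done yet, poll it every 5 seconds.
# Both spellings of "cancelled" are treated as terminal so a cancelled job does not loop forever.
while status not in ["succeeded", "failed", "cancelled", "canceled"]: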
    response = client.fine_tuning.jobs.retrieve(job_id)
    #print(response.model_dump_json(indent=2))
    print(f"Waiting for job {job_id} to complete")
    print("Elapsed time: {} minutes {} seconds".format(int((time.time() - start_time) // 60), int((time.time() - start_time) % 60)))
    status = response.status
    print(f'Status: {status}')
    clear_output(wait=True)
    time.sleep(5)

print(f'Fine-tuning job {job_id} finished with status: {status}')

# List all fine-tuning jobs for this resource.
print('Checking other fine-tune jobs for this resource.')
response = client.fine_tuning.jobs.list()
print(f'Found {len(response.data)} fine-tune jobs.')

Fine-tuning job ftjob-8972c351eabc479691d407c4d2216381 finished with status: succeeded
Checking other fine-tune jobs for this resource.
Found 2 fine-tune jobs.
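
The polling loop above runs until the job reaches a terminal status. A minimal sketch of the same pattern as a reusable helper with a timeout is shown below. The function name `wait_for_terminal_status`, its parameters, and the default timeout are illustrative assumptions, not part of the workshop code or the OpenAI SDK.

```python
import time
from typing import Callable, Iterable


def wait_for_terminal_status(
    get_status: Callable[[], str],
    terminal: Iterable[str] = ("succeeded", "failed", "cancelled", "canceled"),
    interval_s: float = 10.0,
    timeout_s: float = 3 * 60 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status() repeatedly until it returns a terminal status.

    Hypothetical helper: raises TimeoutError if no terminal status is seen
    within timeout_s seconds. Status comparison is case-insensitive.
    """
    terminal_set = {s.lower() for s in terminal}
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status.lower() in terminal_set:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Status is still '{status}' after {timeout_s} seconds")
        sleep(interval_s)
```

In this notebook, it could be called as `wait_for_terminal_status(lambda: client.fine_tuning.jobs.retrieve(job_id).status)`. Passing the status lookup as a callable keeps the helper independent of the client, so the same function could also wrap the deployment status check later on.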


In [5]:
# Retrieve fine_tuned_model name
response = client.fine_tuning.jobs.retrieve(job_id)
fine_tuned_model = response.fine_tuned_model
print(f"fine_tuned_model = {fine_tuned_model}")

fine_tuned_model = gpt-4.1-nano-2025-04-14.ft-8972c351eabc479691d407c4d2216381


### 4. Analyze the fine tuned model in Azure OpenAI Studio

Head here for a fine tuned model in the shared AI GBB tenant:
[Fine tuning job](https://oai.azure.com/resource/finetune/ftjob-6d1293138cd844e7bab02a141a60c697/details?wsid=/subscriptions/3c791225-4905-4a40-860b-0a0c9cd2af91/resourceGroups/RG-FineTuning-AIGBBWorkshop/providers/Microsoft.CognitiveServices/accounts/aoai-raft-gbb-workshop&tid=604b58b3-fa4e-4a57-b566-cac3f88a3ae8)

#### 4.a Training plots

When the model is done training, head to your Azure OpenAI Studio to analyze your model training metrics.

Two charts are available to analyze your fine tuning job and sanity check that the training went smoothly:
- Loss curve: The value of the loss function (how wrong the model is) over the course of training. This curve should go down over time as the model weights converge towards the optimum.
- Token Accuracy: Shows the accuracy of the model's predictions at the token level (e.g., words or subwords) over time during training. A higher token accuracy suggests that the model is better able to capture the nuances of the language and generate more accurate text.

Each of these charts has the metrics computed both on the training data and on the validation set. 
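
To make the token accuracy metric concrete, here is a minimal sketch that computes the fraction of predicted tokens matching the reference tokens. The function name and the token IDs are hypothetical, and the exact computation used by the service is not described here, so treat this as an illustration of the idea only.

```python
from typing import Sequence


def token_accuracy(predicted: Sequence[int], reference: Sequence[int]) -> float:
    """Fraction of positions where the predicted token equals the reference token."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Hypothetical token IDs: three of the four predictions match the reference.
print(token_accuracy([12, 7, 99, 3], [12, 7, 42, 3]))  # 0.75
```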

To analyze these plots, one should look for the following:

- A smooth curve: A smooth curve indicates that the model is learning consistently. Sharp changes or spikes in the curve could indicate issues with the learning rate or data preprocessing.
- Plateau: A plateau in the curve indicates that the model has stopped improving and further training may not be necessary.
- Overfitting: If the training loss continues to decrease but the validation loss starts to increase, it could be a sign of overfitting. This means that the model is not generalizing well to new data and may perform poorly on unseen data.
- Underfitting: If both the training and validation loss remain high, it could be a sign of underfitting. This means that the model is not learning the patterns in the data well enough and may need a more complex architecture or more training data.
- Optimal stopping point: By analyzing the loss curve and token accuracy plot, one can determine the optimal stopping point for training, where the model has reached its best performance without overfitting.
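
The overfitting and stopping-point checks above can also be applied to exported loss values. The sketch below assumes you have one training loss and one validation loss per evaluation step as plain Python lists; the function names, the `patience` parameter, and the sample numbers are all hypothetical.

```python
from typing import Sequence


def best_stopping_step(valid_loss: Sequence[float]) -> int:
    """Index of the lowest validation loss (first one if there are ties)."""
    return min(range(len(valid_loss)), key=valid_loss.__getitem__)


def looks_overfit(
    train_loss: Sequence[float], valid_loss: Sequence[float], patience: int = 2
) -> bool:
    """True if validation loss rose for `patience` consecutive steps
    while training loss kept falling over those same steps."""
    run = 0
    for i in range(1, min(len(train_loss), len(valid_loss))):
        if valid_loss[i] > valid_loss[i - 1] and train_loss[i] < train_loss[i - 1]:
            run += 1
            if run >= patience:
                return True
        else:
            run = 0
    return False


# Hypothetical curves: validation loss bottoms out at index 3, then climbs.
train = [2.0, 1.5, 1.1, 0.8, 0.6, 0.5]
valid = [2.1, 1.7, 1.4, 1.3, 1.4, 1.6]
print(best_stopping_step(valid))    # 3
print(looks_overfit(train, valid))  # True
```

This is a simple heuristic. Real loss curves are noisy, so smoothing the values or using a larger `patience` may be needed before drawing conclusions.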

Now head to the Studio and check that your curves look roughly like the ones below:



![Alt text](./static/ft_metrics.png "AOAI training plots")

#### 4.b Model Checkpoints

In the Studio, go to the Checkpoints tab. You'll see a model checkpoint corresponding to each completed epoch. A checkpoint is a fully functional version of a model that can both be deployed and used as the target model for subsequent fine-tuning jobs. Checkpoints are particularly useful because they can provide a snapshot of your model before overfitting occurred.

![Alt text](./static/ft_checkpoints.png "AOAI model checkpoints")
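
If you want to choose a checkpoint programmatically, one option is to pick the one with the lowest validation loss. The sketch below works on plain dictionaries. The field names (`fine_tuned_model_checkpoint`, `step_number`, `metrics`, `full_valid_loss`) are assumed to mirror the checkpoint objects returned by the OpenAI SDK, and the sample values are made up for illustration.

```python
from typing import Any, Dict, List


def pick_checkpoint(
    checkpoints: List[Dict[str, Any]], metric: str = "full_valid_loss"
) -> Dict[str, Any]:
    """Return the checkpoint with the lowest value for `metric`.

    Checkpoints that do not report the metric are ignored.
    """
    scored = [c for c in checkpoints if (c.get("metrics") or {}).get(metric) is not None]
    if not scored:
        raise ValueError(f"No checkpoint reports the metric '{metric}'")
    return min(scored, key=lambda c: c["metrics"][metric])


# Hypothetical checkpoints, one per epoch.
sample = [
    {"fine_tuned_model_checkpoint": "ckpt-epoch-1", "step_number": 100,
     "metrics": {"full_valid_loss": 1.4}},
    {"fine_tuned_model_checkpoint": "ckpt-epoch-2", "step_number": 200,
     "metrics": {"full_valid_loss": 1.2}},
    {"fine_tuned_model_checkpoint": "ckpt-epoch-3", "step_number": 300,
     "metrics": {"full_valid_loss": 1.3}},
]
best = pick_checkpoint(sample)
print(best["fine_tuned_model_checkpoint"], best["step_number"])  # ckpt-epoch-2 200
```

With the client from this notebook, the input could likely be built from `client.fine_tuning.jobs.checkpoints.list(job_id)` by converting each item with `model_dump()`. Verify the available metric names against your SDK and API version before relying on them.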

### 5. Create a new deployment with the fine tuned model

When the fine-tuning job succeeds, the value of the `fine_tuned_model` variable in the response body is set to the name of your customized model. Your model is now also available for discovery from the list Models API. However, you can't issue completion calls to your customized model until it is deployed.

#### 5.a From the notebook
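To create a new deployment from a notebook, you need an Azure access token for the management API. The cells below request one through `DefaultAzureCredential`, so you only need to be signed in to Azure. If you are not, open a terminal and run:

`az login`

To inspect a token manually, you can also run:

`az account get-access-token`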

In [6]:
from utils import update_state
STUDENT_DEPLOYMENT_NAME = f"ft-raft-{STUDENT_MODEL_NAME}-{ds_name}"
print(f"Student deployment name {STUDENT_DEPLOYMENT_NAME}")
update_state("STUDENT_DEPLOYMENT_NAME", STUDENT_DEPLOYMENT_NAME)
update_state("STUDENT_AZURE_OPENAI_ENDPOINT", aoai_endpoint)

Student deployment name ft-raft-gpt-4.1-nano-zava-articles
Updating state file with STUDENT_DEPLOYMENT_NAME=ft-raft-gpt-4.1-nano-zava-articles
Updating state file with STUDENT_AZURE_OPENAI_ENDPOINT=https://aoai-rogteyz-cvi-brk443-2.openai.azure.com/


In [7]:
# Deploy fine-tuned model
import requests
import json

access_token = azure_credential.get_token("https://management.azure.com/.default")

token = access_token.token
subscription = os.getenv("AZURE_SUBSCRIPTION_ID")
resource_group = os.getenv("AZURE_RESOURCE_GROUP")
resource_name = aoai_endpoint.split("https://")[1].split(".")[0]

deploy_params = {'api-version': "2023-05-01"}
deploy_headers = {'Authorization': 'Bearer {}'.format(token), 'Content-Type': 'application/json'}

deploy_data = {
    "sku": {"name": "developertier", "capacity": 4},
    "properties": {
        "model": {
            "format": "OpenAI",
            "name": fine_tuned_model, #retrieve this value from the previous call, it will look like gpt-35-turbo-0613.ft-b044a9d3cf9c4228b5d393567f693b83
            "version": "1"
        }
    }
}
deploy_data = json.dumps(deploy_data)

In [8]:
request_url = f'https://management.azure.com/subscriptions/{subscription}/resourceGroups/{resource_group}/providers/Microsoft.CognitiveServices/accounts/{resource_name}/deployments/{STUDENT_DEPLOYMENT_NAME}'

print('Creating a new deployment...')

r = requests.put(request_url, params=deploy_params, headers=deploy_headers, data=deploy_data)

print(r)
print(r.reason)
print(r.json())

Creating a new deployment...
<Response [201]>
Created
{'id': '/subscriptions/7a880728-70d3-49d0-adde-4250716cfd94/resourceGroups/rg-cvi-brk443-2/providers/Microsoft.CognitiveServices/accounts/aoai-rogteyz-cvi-brk443-2/deployments/ft-raft-gpt-4.1-nano-zava-articles', 'type': 'Microsoft.CognitiveServices/accounts/deployments', 'name': 'ft-raft-gpt-4.1-nano-zava-articles', 'sku': {'name': 'developertier', 'capacity': 4}, 'properties': {'model': {'format': 'OpenAI', 'name': 'gpt-4.1-nano-2025-04-14.ft-8972c351eabc479691d407c4d2216381', 'version': '1'}, 'versionUpgradeOption': 'NoAutoUpgrade', 'capabilities': {'chatCompletion': 'true', 'area': 'US', 'responses': 'true', 'assistants': 'true'}, 'provisioningState': 'Creating', 'rateLimits': [{'key': 'request', 'renewalPeriod': 60, 'count': 4}, {'key': 'token', 'renewalPeriod': 60, 'count': 4000}]}, 'systemData': {'createdBy': 'cedricvidal@caai2510.onmicrosoft.com', 'createdByType': 'User', 'createdAt': '2025-08-28T00:42:29.4741964Z', 'lastMod

### Wait for deployment to complete

In [10]:
def retrieve_deployment(url):
    return requests.get(url, params=deploy_params, headers=deploy_headers).json()

start_time = time.time()

# Get the status of our deployment.
response = retrieve_deployment(request_url)
print(response)

status = response['properties']['provisioningState']

# If the deployment isn't done yet, poll it every 5 seconds.
# "canceled" is treated as terminal so a canceled deployment does not loop forever.
while status.lower() not in ["succeeded", "failed", "canceled"]:
    response = retrieve_deployment(request_url)
    #print(json.dumps(response, indent=2))
    print(f"Waiting for model {STUDENT_DEPLOYMENT_NAME} deployment to complete")
    print("Elapsed time: {} minutes {} seconds".format(int((time.time() - start_time) // 60), int((time.time() - start_time) % 60)))
    status = response['properties']['provisioningState']
    print(f'Status: {status}')
    clear_output(wait=True)
    time.sleep(5)

print(f'Deployment {STUDENT_DEPLOYMENT_NAME} finished with status: {status}')

{'id': '/subscriptions/7a880728-70d3-49d0-adde-4250716cfd94/resourceGroups/rg-cvi-brk443-2/providers/Microsoft.CognitiveServices/accounts/aoai-rogteyz-cvi-brk443-2/deployments/ft-raft-gpt-4.1-nano-zava-articles', 'type': 'Microsoft.CognitiveServices/accounts/deployments', 'name': 'ft-raft-gpt-4.1-nano-zava-articles', 'sku': {'name': 'developertier', 'capacity': 4}, 'properties': {'model': {'format': 'OpenAI', 'name': 'gpt-4.1-nano-2025-04-14.ft-8972c351eabc479691d407c4d2216381', 'version': '1'}, 'versionUpgradeOption': 'NoAutoUpgrade', 'capabilities': {'chatCompletion': 'true', 'area': 'US', 'responses': 'true', 'assistants': 'true'}, 'provisioningState': 'Succeeded', 'rateLimits': [{'key': 'request', 'renewalPeriod': 60, 'count': 4}, {'key': 'token', 'renewalPeriod': 60, 'count': 4000}]}, 'systemData': {'createdBy': 'cedricvidal@caai2510.onmicrosoft.com', 'createdByType': 'User', 'createdAt': '2025-08-28T00:42:29.4741964Z', 'lastModifiedBy': 'cedricvidal@caai2510.onmicrosoft.com', 'la

In [11]:
from utils import update_state
update_state("STUDENT_AZURE_OPENAI_DEPLOYMENT", STUDENT_DEPLOYMENT_NAME)

Updating state file with STUDENT_AZURE_OPENAI_DEPLOYMENT=ft-raft-gpt-4.1-nano-zava-articles


## Next step -> Evaluation

Open [./4_eval.ipynb](./4_eval.ipynb) to start evaluating the deployed student model.