## Finetuning using MaaP (Model-as-a-Platform)

This sample shows how to use the `chat-completion` components from Azure Machine Learning's `azureml` system registry to fine-tune a model. We then deploy the fine-tuned model to an online endpoint for real-time inference.

### Model
We will use the `Phi-3.5-mini-instruct` model to show how you can fine-tune a model for the chat-completion task.

### Outline
* Set up prerequisites such as compute.
* Pick a model to fine tune.
* Pick and explore training data.
* Configure the fine tuning job.
* Run the fine tuning job.
* Review training and evaluation metrics. 
* Register the fine tuned model. 
* Deploy the fine tuned model for real time inference.
* (Optional) Download the fine tuned model.

### 1. Set up prerequisites
* Install dependencies
* Connect to AzureML Workspace. Learn more at [set up SDK authentication](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-setup-authentication?tabs=sdk). Replace  `<WORKSPACE_NAME>`, `<RESOURCE_GROUP>` and `<SUBSCRIPTION_ID>` below.
* Connect to `azureml` system registry
* Set an optional experiment name
* Check or create compute. A single GPU node can have multiple GPU cards. For example, in one node of `Standard_NC24rs_v3` there are 4 NVIDIA V100 GPUs while in `Standard_NC12s_v3`, there are 2 NVIDIA V100 GPUs. Refer to the [docs](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu) for this information. The number of GPU cards per node is set in the param `gpus_per_node` below. Setting this value correctly will ensure utilization of all GPUs in the node. The recommended GPU compute SKUs can be found [here](https://learn.microsoft.com/en-us/azure/virtual-machines/ncv3-series) and [here](https://learn.microsoft.com/en-us/azure/virtual-machines/ndv2-series).
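
As a quick illustration of the `gpus_per_node` point above, here is a minimal sketch that checks a requested GPU count against the per-node counts quoted in this section. The `GPUS_PER_NODE` table and the `check_gpus_per_node` helper are hypothetical names introduced for illustration only; the notebook itself reads the real value from the workspace in a later cell.

```python
# GPU counts per node, taken from the two examples quoted in the text above.
# Illustrative only: extend this table from the Azure VM size documentation.
GPUS_PER_NODE = {
    "standard_nc24rs_v3": 4,
    "standard_nc12s_v3": 2,
}


def check_gpus_per_node(vm_size: str, requested: int) -> str:
    """Classify a requested GPU count for a VM size listed in GPUS_PER_NODE."""
    available = GPUS_PER_NODE[vm_size.lower()]
    if requested > available:
        return "error: more GPUs requested than the node has"
    if requested < available:
        return "warning: some GPUs on the node will sit idle"
    return "ok"


print(check_gpus_per_node("Standard_NC24rs_v3", 4))
print(check_gpus_per_node("Standard_NC12s_v3", 1))
```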

Install the dependencies by running the cell below. This step is required when running in a new environment.

In [1]:
%pip install azure-ai-ml
%pip install azure-identity
%pip install datasets==2.9.0
%pip install mlflow
%pip install azureml-mlflow

Collecting azure-storage-blob>=12.10.0 (from azure-ai-ml)
  Using cached azure_storage_blob-12.23.1-py3-none-any.whl.metadata (26 kB)
Using cached azure_storage_blob-12.23.1-py3-none-any.whl (405 kB)
Installing collected packages: azure-storage-blob
  Attempting uninstall: azure-storage-blob
    Found existing installation: azure-storage-blob 12.19.0
    Uninstalling azure-storage-blob-12.19.0:
      Successfully uninstalled azure-storage-blob-12.19.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
azureml-mlflow 1.58.0 requires azure-storage-blob<=12.19.0,>=12.5.0, but you have azure-storage-blob 12.23.1 which is incompatible.
Successfully installed azure-storage-blob-12.23.1
Note: you may need to restart the kernel to use updated packages.

In [2]:
from azure.ai.ml import MLClient
from azure.identity import (
    DefaultAzureCredential,
    InteractiveBrowserCredential,
)
from azure.ai.ml.entities import AmlCompute
import time

try:
    credential = DefaultAzureCredential()
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    credential = InteractiveBrowserCredential()

try:
    workspace_ml_client = MLClient.from_config(credential=credential)
except Exception:
    # fall back to explicit workspace details when no config.json is found
    workspace_ml_client = MLClient(
        credential,
        subscription_id="<SUBSCRIPTION_ID>",
        resource_group_name="<RESOURCE_GROUP>",
        workspace_name="<WORKSPACE_NAME>",
    )

# the models, fine tuning pipelines and environments are available in the AzureML registry, "azureml"
registry_ml_client = MLClient(credential, registry_name="azureml")
experiment_name = "phi35-mini-ft-sentiment"

# generating a unique timestamp that can be used for names and versions that need to be unique
timestamp = str(int(time.time()))

### 2. Pick a foundation model to fine tune

`Phi-3.5-mini-instruct` is a lightweight, state-of-the-art open model built upon the datasets used for the `Phi-3` models, with a focus on very high-quality, reasoning-dense data. The model belongs to the `Phi-3` model family and supports a `128k` token context length. It underwent a rigorous enhancement process that combined supervised fine-tuning, proximal policy optimization, and direct preference optimization to ensure precise instruction adherence and robust safety measures.

You can browse these models in the Model Catalog in the AzureML Studio, filtering by the `chat-completion` task. In this example, we use the `Phi-3.5-mini-instruct` model. If you have opened this notebook for a different model, replace the model name and version accordingly. 

Note the model `id` property. It is passed as input to the fine tuning job. It is also available as the `Asset ID` field on the model details page in the AzureML Studio Model Catalog.

In [3]:
model_name = "Phi-3.5-mini-instruct"
foundation_model = registry_ml_client.models.get(model_name, label="latest")
print(
    "\n\nUsing model name: {0}, version: {1}, id: {2} for fine tuning".format(
        foundation_model.name, foundation_model.version, foundation_model.id
    )
)



Using model name: Phi-3.5-mini-instruct, version: 4, id: azureml://registries/azureml/models/Phi-3.5-mini-instruct/versions/4 for fine tuning
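
The asset ID printed above appears to follow an `azureml://registries/<registry>/models/<name>/versions/<version>` layout. The following is a minimal sketch for splitting such an ID into its parts, assuming that layout holds. The `parse_model_asset_id` helper is hypothetical and is not part of the Azure ML SDK.

```python
import re

# Pattern inferred from the ID printed by the cell above (an assumption,
# not an official specification of the asset ID format).
ASSET_ID_PATTERN = re.compile(
    r"^azureml://registries/(?P<registry>[^/]+)"
    r"/models/(?P<name>[^/]+)"
    r"/versions/(?P<version>[^/]+)$"
)


def parse_model_asset_id(asset_id: str) -> dict:
    """Split a registry model asset ID into registry, name, and version."""
    match = ASSET_ID_PATTERN.match(asset_id)
    if match is None:
        raise ValueError(f"Unrecognized model asset ID: {asset_id}")
    return match.groupdict()


parts = parse_model_asset_id(
    "azureml://registries/azureml/models/Phi-3.5-mini-instruct/versions/4"
)
print(parts)
```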


### 3. Create a compute to be used with the job

The fine-tuning job works `ONLY` with `GPU` compute. The right compute size depends on how big the model is, and it is often tricky to pick one. The cells below help you select a suitable compute for the job.
* The computes listed below work with the most optimized configuration. Changing the configuration might lead to a `CUDA out of memory` error. In that case, try upgrading to a larger compute size.
* When setting `compute_cluster_size` below, make sure the compute is available in your resource group. If a particular compute is not available, you can request access to it.

In [4]:
import ast

if "finetune_compute_allow_list" in foundation_model.tags:
    computes_allow_list = ast.literal_eval(
        foundation_model.tags["finetune_compute_allow_list"]
    )  # convert string to python list
    print(f"Please create a compute from the above list - {computes_allow_list}")
else:
    computes_allow_list = None
    print("`finetune_compute_allow_list` is not part of model tags")

Please create a compute from the above list - ['Standard_NC24ads_A100_v4', 'Standard_NC48ads_A100_v4', 'Standard_NC96ads_A100_v4', 'Standard_ND96amsr_A100_v4']


In [5]:
# If you have a specific compute size to work with, change it here. By default we use the 4 x A100 compute from the above list
compute_cluster_size = "Standard_NC96ads_A100_v4"

# If you already have a GPU cluster, set its name here. Otherwise a new cluster with this name will be created
compute_cluster = "ignite2024demo"

try:
    compute = workspace_ml_client.compute.get(compute_cluster)
    print("The compute cluster already exists! Reusing it for the current run")
except Exception as ex:
    print(
        f"Looks like the compute cluster doesn't exist. Creating a new one with compute size {compute_cluster_size}!"
    )
    try:
        print("Attempt #1 - Trying to create a dedicated compute")
        compute = AmlCompute(
            name=compute_cluster,
            size=compute_cluster_size,
            tier="Dedicated",
            max_instances=2,  # For multi node training set this to an integer value more than 1
        )
        workspace_ml_client.compute.begin_create_or_update(compute).wait()
    except Exception as e:
        try:
            print(
                "Attempt #2 - Trying to create a low priority compute. Since this is a low priority compute, the job could get pre-empted before completion."
            )
            compute = AmlCompute(
                name=compute_cluster,
                size=compute_cluster_size,
                tier="LowPriority",
                max_instances=2,  # For multi node training set this to an integer value more than 1
            )
            workspace_ml_client.compute.begin_create_or_update(compute).wait()
        except Exception as e:
            print(e)
            raise ValueError(
                f"WARNING! Compute size {compute_cluster_size} not available in workspace"
            )


# Sanity check on the created compute
compute = workspace_ml_client.compute.get(compute_cluster)
if compute.provisioning_state.lower() == "failed":
    raise ValueError(
        f"Provisioning failed, Compute '{compute_cluster}' is in failed state. "
        f"please try creating a different compute"
    )

if computes_allow_list is not None:
    computes_allow_list_lower_case = [x.lower() for x in computes_allow_list]
    if compute.size.lower() not in computes_allow_list_lower_case:
        raise ValueError(
            f"VM size {compute.size} is not in the allow-listed computes for finetuning"
        )
else:
    # Computes with K80 GPUs are not supported
    unsupported_gpu_vm_list = [
        "standard_nc6",
        "standard_nc12",
        "standard_nc24",
        "standard_nc24r",
    ]
    if compute.size.lower() in unsupported_gpu_vm_list:
        raise ValueError(
            f"VM size {compute.size} is currently not supported for finetuning"
        )


# This is the number of GPUs in a single node of the selected 'vm_size' compute.
# Setting this to less than the number of GPUs will result in underutilized GPUs, taking longer to train.
# Setting this to more than the number of GPUs will result in an error.
gpu_count_found = False
workspace_compute_sku_list = workspace_ml_client.compute.list_sizes()
available_sku_sizes = []
for compute_sku in workspace_compute_sku_list:
    available_sku_sizes.append(compute_sku.name)
    if compute_sku.name.lower() == compute.size.lower():
        gpus_per_node = compute_sku.gpus
        gpu_count_found = True
# raise an error if the GPU count for the selected compute size was not found
if gpu_count_found:
    print(f"Number of GPU's in compute {compute.size}: {gpus_per_node}")
else:
    raise ValueError(
        f"Number of GPU's in compute {compute.size} not found. Available skus are: {available_sku_sizes}. "
        f"This should not happen. Please check the selected compute cluster: {compute_cluster} and try again."
    )

The compute cluster already exists! Reusing it for the current run
Number of GPU's in compute Standard_NC96ads_A100_v4: 4


### 4. Prepare the dataset for fine-tuning the model

For this demo, we fine-tune the `Phi-3.5-mini-instruct` model on sentiment classification data. Below is a snippet of our training data, `train.jsonl`, which contains 50k+ prompt-response pairs that teach the model to classify the sentiment of content in a particular way.


``` json
{
    "prompt_id": 100,
    "prompt": "XXX Recovers XXX Website After Ransomware Attack.",
    "messages": [
        {
            "content": "XXX Recovers XXX Website After Ransomware Attack. What is the sentiment of this news? Please choose an answer from {negative/neutral/positive}.",
            "role": "user"
        },
        {
            "content": "negative",
            "role": "assistant"
        }
    ]
}
```
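
Before submitting a job, it can help to check that every line of the JSONL file has the shape shown above. The following is a minimal sketch of such a check. The `validate_chat_record` helper, the set of allowed roles, and the rule that the last message comes from the assistant are assumptions made for illustration, not requirements documented by the fine-tuning component.

```python
import json

# Assumed set of roles; the sample record above only uses "user" and "assistant".
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_chat_record(line: str) -> list:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {i} is not an object")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has an unexpected role")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {i} has no string 'content'")
    # Assumption: a training example should end with the assistant's answer.
    if not problems and messages[-1]["role"] != "assistant":
        problems.append("last message should come from the assistant")
    return problems


sample = json.dumps(
    {
        "prompt_id": 100,
        "messages": [
            {"content": "Some headline. What is the sentiment of this news?", "role": "user"},
            {"content": "negative", "role": "assistant"},
        ],
    }
)
print(validate_chat_record(sample))
```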

In addition to the `training` dataset, we also specify a `validation` dataset, which has the same format as the training dataset but contains different samples. The training dataset is used to fit the model, that is, to 'teach' or 'train' it. The validation dataset is used to *evaluate* the model during training and to guide tuning of its hyperparameters. This makes training an iterative process: the model learns from the training data and is then checked against the validation set. The validation results tell us how well the model is learning and generalizing, so that adjustments can be made before the model is finally put to the test.
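
If you start from a single file of records rather than separate files, one simple way to obtain a validation set is a seeded random split. The sketch below is an illustration under that assumption; `split_records` and the 10% validation fraction are hypothetical choices, not something this notebook prescribes.

```python
import random


def split_records(records, validation_fraction=0.1, seed=0):
    """Shuffle records with a fixed seed and split off a validation subset."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(round(len(shuffled) * validation_fraction)))
    # The first n_validation shuffled records form the validation set.
    return shuffled[n_validation:], shuffled[:n_validation]


records = [{"prompt_id": i} for i in range(100)]
train, validation = split_records(records)
print(len(train), len(validation))
```

Using a fixed seed keeps the split reproducible across runs, so metrics from different fine-tuning jobs remain comparable.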

### 5. Submit the fine tuning job using the model and data as inputs
 
Create the job that uses the `chat-completion` pipeline component. [Learn more](https://github.com/Azure/azureml-assets/blob/main/assets/training/finetune_acft_hf_nlp/components/pipeline_components/chat_completion/README.md) about all the parameters supported for fine tuning.

Training parameters define the training aspects. Below are a few of the parameters that belong to this category.
* learning rate
* number of training steps
* batch size

Optimization parameters help optimize GPU memory and make effective use of the compute resources. Below are a few of the parameters that belong to this category. _The optimization parameters differ for each model and are packaged with the model to handle these variations._
* deepspeed and LoRA
* mixed precision training
* multi-node training 
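
To give a feel for why LoRA saves memory, here is a minimal sketch based on the standard LoRA formulation, in which a weight matrix of shape `d_out x d_in` gets a low-rank update built from two matrices of rank `r`, scaled by `alpha / r`. The values `lora_r=8` and `lora_alpha=32` match the defaults used later in this notebook; the layer size of 3072 is an illustrative assumption, and the helper names are hypothetical.

```python
def lora_scaling(lora_alpha: int, lora_r: int) -> float:
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / lora_r


def lora_trainable_params(d_in: int, d_out: int, lora_r: int) -> int:
    """Parameters in the two low-rank matrices: (r x d_in) and (d_out x r)."""
    return lora_r * (d_in + d_out)


d_in = d_out = 3072  # illustrative layer size, chosen for this example only
full_params = d_in * d_out
adapter_params = lora_trainable_params(d_in, d_out, lora_r=8)

print("scaling:", lora_scaling(32, 8))
print("full layer params:", full_params)
print("LoRA adapter params:", adapter_params)
print(f"fraction trained: {adapter_params / full_params:.4%}")
```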

Note: Supervised fine-tuning may result in losing alignment or in catastrophic forgetting. We recommend checking for this issue and running an alignment stage after you fine-tune.

In [6]:
# Default training parameters
training_parameters = dict(
    num_train_epochs=5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
)
# Default optimization parameters
optimization_parameters = dict(
    apply_lora="true",
    merge_lora_weights="true",
    lora_alpha=32,
    lora_r=8,
    lora_dropout=0,
    apply_ort="false",
    apply_deepspeed="true",
    deepspeed_stage=2,
    ignore_mismatched_sizes="false",
    precision=16,
    evaluation_strategy="steps",
    eval_steps=50,  # make this 5 for a smaller dataset
    logging_strategy="steps",
    logging_steps=10,
    save_total_limit=1,
    apply_early_stopping="true",
    early_stopping_patience=3,
    batch_size=1,
    max_seq_length=4096,
)
# Let's construct finetuning parameters using training and optimization parameters.
finetune_parameters = {**training_parameters, **optimization_parameters}

# Each model finetuning works best with certain finetuning parameters which are packed with model as `model_specific_defaults`.
# Let's override the finetune_parameters in case the model has some custom defaults.
# if "model_specific_defaults" in foundation_model.tags:
#     print("Warning! Model specific defaults exist. The defaults could be overridden.")
#     finetune_parameters.update(
#         ast.literal_eval(  # convert string to python dict
#             foundation_model.tags["model_specific_defaults"]
#         )
#     )
print(
    f"The following finetune parameters are going to be set for the run: {finetune_parameters}"
)

The following finetune parameters are going to be set for the run: {'num_train_epochs': 5, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 8, 'learning_rate': 5e-06, 'lr_scheduler_type': 'cosine', 'apply_lora': 'true', 'merge_lora_weights': 'true', 'lora_alpha': 32, 'lora_r': 8, 'lora_dropout': 0, 'apply_ort': 'false', 'apply_deepspeed': 'true', 'deepspeed_stage': 2, 'ignore_mismatched_sizes': 'false', 'precision': 16, 'evaluation_strategy': 'steps', 'eval_steps': 50, 'logging_strategy': 'steps', 'logging_steps': 10, 'save_total_limit': 1, 'apply_early_stopping': 'true', 'early_stopping_patience': 3, 'batch_size': 1, 'max_seq_length': 4096}


In [7]:
# Set the pipeline display name for distinguishing different runs from the name
def get_pipeline_display_name():
    batch_size = (
        int(finetune_parameters.get("per_device_train_batch_size", 1))
        * int(finetune_parameters.get("gradient_accumulation_steps", 8))
        * int(gpus_per_node)
        * int(finetune_parameters.get("num_nodes_finetune", 1))
    )
    scheduler = finetune_parameters.get("lr_scheduler_type", "cosine")
    deepspeed = finetune_parameters.get("apply_deepspeed", "true")
    ds_stage = finetune_parameters.get("deepspeed_stage", "2")
    if deepspeed == "true":
        ds_string = f"ds{ds_stage}"
    else:
        ds_string = "nods"
    lora = finetune_parameters.get("apply_lora", "true")
    if lora == "true":
        lora_string = "lora"
    else:
        lora_string = "nolora"
    save_limit = finetune_parameters.get("save_total_limit", -1)
    seq_len = finetune_parameters.get("max_seq_length", -1)
    return (
        model_name
        + "-"
        + "phi35mini-sentiment"
        + "-"
        + f"bs{batch_size}"
        + "-"
        + f"{scheduler}"
        + "-"
        + ds_string
        + "-"
        + lora_string
        + f"-save_limit{save_limit}"
        + f"-seqlen{seq_len}"
    )


pipeline_display_name = get_pipeline_display_name()
print(f"Display name used for the run: {pipeline_display_name}")

Display name used for the run: Phi-3.5-mini-instruct-phi35mini-sentiment-bs32-cosine-ds2-lora-save_limit1-seqlen4096
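
The `bs32` part of the display name is the effective batch size computed by `get_pipeline_display_name`. A minimal sketch of that arithmetic follows, using the values from this run. Note that the gradient accumulation value of 8 is only the fallback used by the naming function, since `finetune_parameters` does not set it; whether the training component uses the same value is an assumption.

```python
def effective_batch_size(
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
    gpus_per_node: int,
    num_nodes: int,
) -> int:
    """Samples contributing to one optimizer step across all GPUs and nodes."""
    return (
        per_device_train_batch_size
        * gradient_accumulation_steps
        * gpus_per_node
        * num_nodes
    )


# 1 sample per device, 8 accumulation steps (fallback), 4 GPUs, 1 node
print(effective_batch_size(1, 8, 4, 1))
```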


In [8]:
from azure.ai.ml.dsl import pipeline
from azure.ai.ml import Input

# fetch the pipeline component
pipeline_component_func = registry_ml_client.components.get(
    name="chat_completion_pipeline", label="latest"
)


# define the pipeline job
@pipeline(name=pipeline_display_name)
def create_pipeline():
    chat_completion_pipeline = pipeline_component_func(
        mlflow_model_path=foundation_model.id,
        compute_model_import=compute_cluster,
        compute_preprocess=compute_cluster,
        compute_finetune=compute_cluster,
        compute_model_evaluation=compute_cluster,
        # map the dataset splits to parameters
        train_file_path=Input(
            type="uri_file", path="./finance_data/fingpt-sentiment-train/train.jsonl"
        ),
        test_file_path=Input(
            type="uri_file", path="./finance_benchmarks/fiqa_sa/fiqa_sa.jsonl"
        ),
        # Training settings
        number_of_gpu_to_use_finetuning=gpus_per_node,  # set to the number of GPUs available in the compute
        **finetune_parameters
    )
    return {
        # map the output of the fine tuning job to the output of pipeline job so that we can easily register the fine tuned model
        # registering the model is required to deploy the model to an online or batch endpoint
        "trained_model": chat_completion_pipeline.outputs.mlflow_model_folder
    }


pipeline_object = create_pipeline()

# don't use cached results from previous jobs
pipeline_object.settings.force_rerun = True

# set continue on step failure to False
pipeline_object.settings.continue_on_step_failure = False

Submit the job

In [9]:
# submit the pipeline job
pipeline_job = workspace_ml_client.jobs.create_or_update(
    pipeline_object, experiment_name=experiment_name
)
# wait for the pipeline job to complete
workspace_ml_client.jobs.stream(pipeline_job.name)

Class AutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class AutoDeleteConditionSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseAutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class IntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class ProtectionLevelSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseIntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Uploading train.jsonl (< 1 M

RunId: red_shark_xwwvvcl0z3
Web View: https://ml.azure.com/runs/red_shark_xwwvvcl0z3?wsid=/subscriptions/d676d072-7cbf-4197-b2d6-17ecf38370d0/resourcegroups/Ignite2024-Demo/workspaces/Ignite2024-Demo

Streaming logs/azureml/executionlogs.txt

[2024-11-11 22:28:37Z] Submitting 1 runs, first five are: 6ce7a840:0b0ed5e9-8199-4885-85aa-b7e1dc29a017
[2024-11-11 22:59:59Z] Completing processing run id 0b0ed5e9-8199-4885-85aa-b7e1dc29a017.

Execution Summary
RunId: red_shark_xwwvvcl0z3
Web View: https://ml.azure.com/runs/red_shark_xwwvvcl0z3?wsid=/subscriptions/d676d072-7cbf-4197-b2d6-17ecf38370d0/resourcegroups/Ignite2024-Demo/workspaces/Ignite2024-Demo



### 6. Register the fine tuned model with the workspace

We will register the model from the output of the fine tuning job. This will track lineage between the fine tuned model and the fine tuning job. The fine tuning job, further, tracks lineage to the foundation model, data and training code.

In [10]:
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

# check if the `trained_model` output is available
print("pipeline job outputs: ", workspace_ml_client.jobs.get(pipeline_job.name).outputs)

# build the path to the fine tuned model from the `trained_model` output of the pipeline job
model_path_from_job = "azureml://jobs/{0}/outputs/{1}".format(
    pipeline_job.name, "trained_model"
)

finetuned_model_name = "phi35mini-finetuned-sentiment"
print(finetuned_model_name)
print("path to register model: ", model_path_from_job)
prepare_to_register_model = Model(
    path=model_path_from_job,
    type=AssetTypes.MLFLOW_MODEL,
    name=finetuned_model_name,
    version=timestamp,  # use timestamp as version to avoid version conflict
    description="phi35mini-finetuned-sentiment",
)
print("prepare to register model: \n", prepare_to_register_model)
# register the model from pipeline job output
registered_model = workspace_ml_client.models.create_or_update(
    prepare_to_register_model
)
print("registered model: \n", registered_model)

pipeline job outputs:  {'trained_model': <azure.ai.ml.entities._job.pipeline._io.base.PipelineOutput object at 0x114771eb0>}
phi35mini-finetuned-sentiment
path to register model:  azureml://jobs/red_shark_xwwvvcl0z3/outputs/trained_model
prepare to register model: 
 description: phi35mini-finetuned-sentiment
name: phi35mini-finetuned-sentiment
path: azureml://jobs/red_shark_xwwvvcl0z3/outputs/trained_model
properties: {}
tags: {}
type: mlflow_model
version: '1731362985'

registered model: 
 creation_context:
  created_at: '2024-11-11T23:19:16.504930+00:00'
  created_by: Gina Lee
  created_by_type: User
  last_modified_at: '2024-11-11T23:19:16.504930+00:00'
  last_modified_by: Gina Lee
  last_modified_by_type: User
description: phi35mini-finetuned-sentiment
flavors:
  hftransformersv2:
    code: code
    hf_config_class: AutoConfig
    hf_pretrained_class: AutoModelForCausalLM
    hf_tokenizer_class: AutoTokenizer
    model_data: data
    pytorch_version: 2.2.2
    task_type: chat-compl

### 7. Deploy the fine tuned model to an online endpoint
Online endpoints give a durable REST API that can be used to integrate with applications that need to use the model.

In [11]:
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    ProbeSettings,
    OnlineRequestSettings,
)

# Create online endpoint - endpoint names need to be unique in a region, hence using timestamp to create unique endpoint name

online_endpoint_name = "phi35miniftignite2024-" + timestamp
# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="Online endpoint for "
    + registered_model.name
    + ", phi35mini sentimentdata",
    auth_mode="key",
)
workspace_ml_client.begin_create_or_update(endpoint).wait()

The SKUs supported for deployment are listed in the [Managed online endpoints SKU list](https://learn.microsoft.com/en-us/azure/machine-learning/reference-managed-online-endpoints-vm-sku-list).

In [None]:
import ast

instance_type = "Standard_NC24ads_A100_v4"

# Inference compute allow list that supports deployment
if "inference_compute_allow_list" in foundation_model.tags:
    inference_computes_allow_list = ast.literal_eval(
        foundation_model.tags["inference_compute_allow_list"]
    )  # convert string to python list
    print(
        f"Please use an instance type from this list - {inference_computes_allow_list}"
    )
else:
    inference_computes_allow_list = None
    print("`inference_compute_allow_list` is not part of model tags")

# Check if the compute is in the allow listed computes
if (
    inference_computes_allow_list is not None
    and instance_type not in inference_computes_allow_list
):
    print(
        f"`instance_type` is not in the allow listed compute. Please select a value from {inference_computes_allow_list}"
    )


# create a deployment
demo_deployment = ManagedOnlineDeployment(
    name="phi35miniftignite24",
    endpoint_name=online_endpoint_name,
    model=registered_model.id,
    instance_type=instance_type,
    instance_count=1,
    liveness_probe=ProbeSettings(initial_delay=600),
    request_settings=OnlineRequestSettings(request_timeout_ms=90000),
)
workspace_ml_client.online_deployments.begin_create_or_update(demo_deployment).wait()
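# route all traffic to the deployment created above; the key must match the deployment name
endpoint.traffic = {demo_deployment.name: 100}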
workspace_ml_client.begin_create_or_update(endpoint).result()

### 8. Test the endpoint with sample data

We will take some sample prompts from the test dataset and submit them to the online endpoint to test the fine-tuned model.
Below is a simple Python script for testing the fine-tuned model from a terminal. After filling in your API key and API endpoint, run the script to try different prompts.

In [None]:
# pip install azure-ai-inference
import os
from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential
from colorama import init, Fore, Style

# Initialize colorama
init()


api_key = '<YOUR API KEY>'
if not api_key:
  raise Exception("A key should be provided to invoke the endpoint")

client = ChatCompletionsClient(
    endpoint='<YOUR API ENDPOINT>',
    credential=AzureKeyCredential(api_key)
)

model_info = client.get_model_info()
print("Model name:", model_info.model_name)
print("Model type:", model_info.model_type)
print("Model provider name:", model_info.model_provider_name)

user_message = input("\nUser: ")

payload = {
  "messages": [
    {
      "role": "user",
      "content": user_message
    }
  ],
  "max_tokens": 2048,
  "temperature": 0.8,
  "top_p": 0.1,
  "presence_penalty": 0,
  "frequency_penalty": 0
}
response = client.complete(payload)

print("\nResponse:")
print(Fore.YELLOW + response.choices[0].message.content + Style.RESET_ALL)
#print("\nModel:", response.model)
print("\nUsage:")
print("	Prompt tokens:", response.usage.prompt_tokens)
print("	Total tokens:", response.usage.total_tokens)
print("	Completion tokens:", response.usage.completion_tokens)
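
Because the model was fine-tuned on prompts with a fixed question appended to each headline, it may help to send test prompts in the same shape. The sketch below rebuilds that prompt from the training sample shown earlier and maps a reply back to one of the three labels. `build_messages` and `extract_label` are hypothetical helpers, and the prefix-matching rule is an assumption about how replies look.

```python
# Prompt wording copied from the training sample shown earlier in this notebook.
PROMPT_TEMPLATE = (
    "{headline} What is the sentiment of this news? "
    "Please choose an answer from {{negative/neutral/positive}}."
)
LABELS = ("negative", "neutral", "positive")


def build_messages(headline: str) -> list:
    """Build a chat `messages` list in the same shape as the training data."""
    return [{"role": "user", "content": PROMPT_TEMPLATE.format(headline=headline)}]


def extract_label(reply: str):
    """Return the sentiment label the reply starts with, or None if unclear."""
    text = reply.strip().lower()
    for label in LABELS:
        if text.startswith(label):
            return label
    return None


messages = build_messages("XXX Recovers XXX Website After Ransomware Attack.")
print(messages[0]["content"])
print(extract_label("Negative."))
```

The resulting `messages` list can be placed in the `messages` field of the payload used in the script above.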

### 9. (Optional) Download the finetuned model

If you would like a local copy of the fine-tuned model's tensors, open Azure Machine Learning Studio, go to `Models`, select your fine-tuned model, and use the `Artifacts` tab to download individual files.
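
As an alternative to the Studio UI, the model files can also be fetched with the SDK's `models.download` operation. The following is a minimal sketch, assuming the `workspace_ml_client` and `registered_model` objects from the earlier cells are still available. The `download_finetuned_model` wrapper and the default target directory are hypothetical.

```python
from pathlib import Path


def download_finetuned_model(ml_client, name, version, target_dir="./finetuned_model"):
    """Download a registered model's files through the client's models.download call."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    ml_client.models.download(name=name, version=version, download_path=str(target))
    return target


# Usage inside this notebook (needs the workspace client created earlier):
# download_finetuned_model(
#     workspace_ml_client, registered_model.name, registered_model.version
# )
```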